Monday, March 28, 2005

Mining CVS Data

I've been reading a book on data mining and wondering how I might actually apply what I've read. Fixing some bugs today got me thinking about mining CVS data. If your CVS commit comments can be linked to defects (i.e. each comment contains a defect number), you might be able to make some useful observations.
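
As a rough sketch of that linking step, here is one way to pull defect numbers out of commit comments in Python. The comment convention ("bug 123", "defect #123", "fixes 123") and the names DEFECT_PATTERN and defect_ids are my own assumptions for illustration; your team's comments may well look different.

```python
import re

# Assumed convention (hypothetical): defects are cited as
# "bug 123", "defect #123", "fix 123", "fixes 123" or "fixed 123".
DEFECT_PATTERN = re.compile(
    r"\b(?:bug|defect|fix(?:es|ed)?)\s*#?\s*(\d+)", re.IGNORECASE
)

def defect_ids(comment):
    """Return the defect numbers mentioned in one commit comment."""
    return [int(n) for n in DEFECT_PATTERN.findall(comment)]
```

Comments that mention no defect simply yield an empty list, so they can be kept in the data set as "non-defect" commits.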

Code Complexity
If a "chunk" of code has lots of defects assigned to it, that might indicate a high level of complexity. A "chunk" could be lots of things: a function, a class, a module/package, anything. My intuition is that the more bugs, the more difficult the code is to understand, maintain and test.
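
To make the idea concrete, here is a minimal sketch that treats a file as the "chunk" and counts the distinct defects fixed in each one. The input shape (a list of files touched plus a list of defect numbers per commit) and the sample data are made up for illustration; getting that data out of a real repository is a separate step.

```python
from collections import defaultdict

def defects_per_chunk(commits):
    """Map each chunk (here, a file path) to the number of distinct defects fixed in it."""
    seen = defaultdict(set)
    for files, ids in commits:
        for path in files:
            seen[path].update(ids)
    return {path: len(ids) for path, ids in seen.items()}

# Made-up sample data: (files touched, defect numbers cited in the comment)
sample_commits = [
    (["src/Parser.java", "src/Lexer.java"], [101]),
    (["src/Parser.java"], [102, 103]),
    (["src/Util.java"], []),
]

# Chunks with the most defects come first.
ranked = sorted(
    defects_per_chunk(sample_commits).items(),
    key=lambda kv: kv[1],
    reverse=True,
)
```

A raw count favors large or frequently changed files, so dividing by lines of code or by number of commits might give a fairer comparison.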

Probabilistic Reasoning
Given a "commit" into CVS, what is the probability that it contains a bug? What factors might play into the estimate?
  • Number of lines changed.
  • Number of previous bugs in this chunk.
  • Did the current author make the previous change, or is she working on this code for the first time?
  • Number of dependencies introduced (e.g. a new import statement in a Java class).
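
One very simple way to start on this is to estimate the bug rate for each value of a single factor from past commits. The sketch below does that with add-one smoothing. It assumes each past commit has already been labeled with whether it introduced a bug (the "had_bug" field), which is the hard part and is not shown here; the field names and the sample history are invented for illustration.

```python
from collections import Counter

def bug_rate_by_feature(history, feature):
    """Estimate P(bug | feature value) from labeled past commits, with add-one smoothing."""
    total, buggy = Counter(), Counter()
    for commit in history:
        value = commit[feature]
        total[value] += 1
        buggy[value] += commit["had_bug"]  # True counts as 1, False as 0
    # (buggy + 1) / (total + 2) keeps estimates away from 0 and 1 on small samples.
    return {v: (buggy[v] + 1) / (total[v] + 2) for v in total}

# Made-up history: 3 commits by first-time authors, 4 by returning authors.
history = [
    {"first_time_author": True, "had_bug": True},
    {"first_time_author": True, "had_bug": True},
    {"first_time_author": True, "had_bug": False},
    {"first_time_author": False, "had_bug": False},
    {"first_time_author": False, "had_bug": False},
    {"first_time_author": False, "had_bug": True},
    {"first_time_author": False, "had_bug": False},
]

rates = bug_rate_by_feature(history, "first_time_author")
```

This looks at one factor at a time. Combining several factors (lines changed, previous bugs, new dependencies) would need a real model such as naive Bayes or logistic regression, and numeric factors would have to be bucketed first.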

1 comment:

Anonymous said...

Many people are looking at ways to increase productivity through data mining techniques.

I wrote a paper on this a while ago that you may find interesting. The focus was manufacturing, but you may find ideas you wish to pursue.