Bug Prediction

06 Oct 2013

One of the benefits of having a REPL inside your CI server is that you get programmatic access to the history of your repositories. There is an entire conference dedicated to mining this history. Whether you use Perforce, Subversion, Mercurial or, God forbid, ClearCase, you can use the same API to analyze your commits.

Let's use this API to implement Google's bug prediction algorithm.


First we need a history of commits. We can get it from a build configuration in which the commits were detected, and a build configuration can be found by its external id:

Build configuration externalId

(def idea (.findBuildTypeByExternalId tc/project-manager "BugPrediction_Idea"))

Once we have a build configuration, we can get all its commits:

(def changes (.getAllModifications tc/vcs-history idea))

From each commit we can get a comment, the date when the commit was made and the changed files. That's all we need to implement the algorithm.


Let's define a predicate to distinguish bug-fixing commits from regular ones:

(defn bug-fix? [m]
  (-> m
      (.getDescription)
      (.contains "fix")))

The predicate assumes that a commit fixes a bug if its commit message contains the string fix.
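A plain substring check is case-sensitive and also matches words such as prefix. As a rough sketch of a stricter check, the function below tests a plain message string with a regular expression; fix-message? is a hypothetical helper made up for this illustration and is not part of the TeamCity API.

```clojure
;; Hypothetical helper: works on a plain commit message string rather
;; than on a TeamCity modification object.
(defn fix-message?
  "Returns true if msg contains fix, fixes, fixed or fixing as a whole
  word, ignoring case."
  [msg]
  (boolean (re-find #"(?i)\bfix(es|ed|ing)?\b" msg)))

(fix-message? "Fix NPE in parser")        ; true
(fix-message? "Add prefix to log lines")  ; false
(fix-message? "fixed flaky test")         ; true
```

It could be plugged into the predicate by passing it the commit comment, at the cost of missing fixes that are described in other words.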

We will also need two helper functions:

(defn vcs-time
  "Returns the time the commit was made, in milliseconds since the epoch"
  [m]
  (.. m getVcsDate getTime))

(defn files
  "Returns the names of the files changed by the commit"
  [m]
  (map (memfn getFileName) (.getChanges m)))


Now we can define a score function for a bug-fixing commit:

(defn score-modification
  [m min-vcs-time max-vcs-time]
  (let [t (vcs-time m)
        normalized-time (/ (* (- t min-vcs-time) 1.0)
                           (- max-vcs-time min-vcs-time))]
    (/ 1 (+ 1 (Math/exp (+ (* -12 normalized-time) 12))))))

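It takes a modification and the times of the earliest and the latest commits in the history, and returns a bug-fixing score for the modification. The commit time is normalized to the range from 0 (the earliest commit) to 1 (the latest one), so the score grows from almost 0 for the oldest commits to 0.5 for the most recent one. The greater the score, the more likely the commit's files are to have bugs in the future.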
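To get a feel for how quickly old fixes stop mattering, here is a small sketch that evaluates the same formula directly on a normalized time. The name score-at is made up for this illustration.

```clojure
;; Same formula as in score-modification, but taking the normalized
;; time t directly (0 = oldest commit, 1 = newest commit).
(defn score-at [t]
  (/ 1 (+ 1 (Math/exp (+ (* -12 t) 12)))))

(score-at 1.0)  ; newest commit: 0.5
(score-at 0.5)  ; halfway through the history: about 0.0025
(score-at 0.0)  ; oldest commit: about 6.1e-6
```

A fix made halfway through the history weighs roughly 200 times less than a fix made today, so the ranking is dominated by recent bug fixes.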

Having a score function for a single commit, we can write a function that takes a sequence of commits and calculates a total score for each file:

(defn score-files
  "Returns a sequence of pairs [file name, its bug-fix score] for a
  given sequence of modifications. Result is sorted in descending
  order and includes only files matched by pred, or all the files if
  pred is not specified."
  ([ms] (score-files ms (constantly true)))
  ([ms pred]
     (let [vcs-times (map vcs-time ms)
           min-vcs-time (apply min vcs-times)
           max-vcs-time (apply max vcs-times)
           ;; maps each matching file of a commit to that commit's score
           score-mod-files (fn [m]
                             (let [fs (filter pred (files m))
                                   score (score-modification m min-vcs-time max-vcs-time)]
                               (zipmap fs (repeat score))))]
       ;; sum the scores per file; {} keeps reduce safe when no commit is a bug fix
       (sort-by second >
                (reduce #(merge-with + %1 %2)
                        {}
                        (map score-mod-files (filter bug-fix? ms)))))))
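To see how the per-file totals accumulate without a TeamCity server at hand, the sketch below runs the same steps on plain maps. The commit data and all the demo names are made up for this illustration; real modifications come from the TeamCity API as shown above.

```clojure
;; Stand-in commits as plain maps; :time plays the role of the commit time.
(def demo-commits
  [{:msg "initial import"      :time 0    :files ["src/A.java" "src/B.java"]}
   {:msg "fix NPE in A"        :time 500  :files ["src/A.java"]}
   {:msg "add feature"         :time 800  :files ["src/B.java"]}
   {:msg "fix off-by-one in B" :time 1000 :files ["src/B.java" "src/A.java"]}])

;; Score of a commit made at time t, given the oldest and newest times.
(defn score-demo [t min-t max-t]
  (let [nt (/ (* (- t min-t) 1.0) (- max-t min-t))]
    (/ 1 (+ 1 (Math/exp (+ (* -12 nt) 12))))))

;; Sums the scores of bug-fixing commits per file, highest total first.
(defn score-files-demo [commits]
  (let [times (map :time commits)
        min-t (apply min times)
        max-t (apply max times)]
    (->> commits
         (filter #(.contains ^String (:msg %) "fix"))
         (map (fn [c]
                (zipmap (:files c)
                        (repeat (score-demo (:time c) min-t max-t)))))
         (apply merge-with +)
         (sort-by second >))))

;; src/A.java comes first with about 0.5025 (two fixes),
;; src/B.java second with 0.5 (one recent fix).
(score-files-demo demo-commits)
```

Both files were touched by the latest fix, but src/A.java also collects a small contribution from the older fix, which is enough to rank it first.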

Now we can get the top 20 files with the highest bug score, ignoring tests and non-Java files:

(take 20 (score-files changes #(and (.endsWith % ".java") (not (.contains % "/test/")))))

Cool, eh? Just 5 functions and you can predict bugs!

Further improvements

We can implement a more precise bug-fix? predicate by integrating with issue trackers. It would also be nice to mark bug-prone files in the UI and to recalculate the scores on a schedule or every time a new commit is detected.

clojure TeamCity