Archive for September, 2008

Developers performance and random variation

Monday, September 8th, 2008

Is it possible to rank the relative performance of developers in a team without falling foul of random variation?
Ben Goldacre wrote a very interesting piece (as usual) in The Guardian this week-end (Sat 6 September edition) about the silliness of national “studies” which failed to take a vetted statistical sample before coming to any conclusion. You can read the piece on Ben Goldacre’s blog here.
However, this is most interesting when we discover random variation at work in other domains than research and it stroke me as particularly applicable to evaluating developers performance in a team.

Individual performance

In order to evaluate performance, we must define measure points in all the work produced by the individual. These metrics will include quantity-based indicators like the number of lines of code written, the number of bugs solved, or the number of bugs introduced (for negative performance) but they also would need to account for quality-based indicators as percentage of code written that is covered by tests, package and class cohesion or number of dependencies introduced.
The problem with the latter set of metrics is that it is very difficult to predict what impact on the overall system quality they will have. Nonetheless, the practice is to keep them as sweet as possible.

Individual performance, therefore can be measured absolutely from the code produced and can be compared to previous performance for the same individual. However, is it enough to use these metrics? How has the developer coped with the work that he had been assigned? Has he been able to complete the task in the most efficient way?

Feature-relative performance

Given 2 new features A and B to develop (or say bugs A and B to fix) with respective complexity Ca and Cb, how does an individual developer implement the code required to support them?
In an ideal world, every developer would code the same way on Monday at 9am, that he is on Wednesday at 12.30pm or Friday at 5pm… but there is not a single way to solve a problem, there is not a single algorithm that sorts a table, nor is there a single spelling convention for naming methods, and developers will choose the one that suits their state of mind at the very moment they need to implement it; and that doesn’t even encompass the boredom parameter which will see developers implement the famous Hello World! in every fashion possible just to make sure they keep themselves entertained!

With each feature having a different impact on the system and the developer being equally likely to implement the same feature in two different ways at different times, what is the value of those metrics when comparing them feature by feature?
And what is the value of comparing them between developers?

Team-relative performance

And here we introduce random variation: when you are trying to evaluate the relative performance of members of a team, how can you make sure you have comparable metrics to compare them with?
Let’s see with an example where we consider 2 developers of same competence that we want to evaluate against each other:

Iteration #1
Developer A has developed features F1 and F2 of complexity C1 and C2 and has achieved metrics M1 and M2. Developer B has developed features F3 and F4 of respective complexities C3 and C4. Because they are of same competence, we can assume that they have been allocated equivalent tasks and that C1+C2 ≈ C3+C4.

Iteration #2
Developer A is given to develop features F5 and F6 (complexities C5 and C6) and to fix bug B1 (Cb1) found on F3; he completes F5 and fixes B1, but doesn’t complete F6 by himself. Developer B completes features F7 and F8 (C7 and C8), fixes bugs B2 and B3 found in F4 (Cb2 and Cb3) and helps developer A complete F6.
Again, the tasks have been allocated on the assumption that C5+C5+Cb1 ≈ C7+C8+Cb2+Cb3.

Notice how I didn’t mention metrics in Iteration #2; that is where the problems actually begin. While we can easily measure our metrics for F5, F7 and F8, and allocate them to the performance measurement of each developer, there is an issue with F6 to determine how to share the performance between the developers.
Moreover the bugs solves incur new metrics being calculated on F3 and F4 as composite metrics of the pre-existing code with the fix code; therefore M3′=M3+ΔMb1 and M4′=M4+ΔMb2+ΔMb3.
It is likely at this point that we want to allocate only the delta performance to each developers for bugs. We probably want to remove this delta to the previously allocated metric for the feature.

We can also note that at the end of the Iteration #2, we could either conclude that the complexity estimation for the tasks was wrong or that developer B is more skilled than developer A.
By experience, I can safely say that neither conclusion can be drawn automatically.

As a wrap-up of this small theoretical example, we try to determine the composite performance of each developer:

  • Developer A: Mx = M1+M2+M5+∂M6+ΔMb1
  • Developer B: My = M3+M4+∂M6+ΔMb1-ΔMb1+ΔMb2-ΔMb2
    My = M3+M4+∂M6

With the performance adjustments (negative deltas due to bugs), it becomes clear that the conclusions that we could possibly have drawn after Iteration #2 cannot be drawn with any level of confidence without looking at the bigger picture.

Normalisation of the random variation

The previous example shows that if we want to be able to measure relative performance accurately, we need to find ways to normalise our input performance data. In the example, we used negative deltas of performance to impact an individual’s performance over subsequent iterations. We could also sample our metrics on features with a given complexity. Evidently, the best possible measurement would be to allocate the same set of features to each developer, but that would be counter-productive and difficult to implement in a real-world project.

Finally, we need to acknowledge that the complexity estimation for a given feature might be substantially wrong and that solving a bug for that feature could uncover a whole new range of complexity, rendering ineffective our negative delta adjustment; we therefore would need to mitigate errors and changes in complexity in our performance evaluation.

There is probably more to it than that, and I happily welcome comments and discussion. How do you measure your team’s performance?

Sphere: Related Content