2014-01-01

My policy on open source surveys: ask the infrastructure, not the people

An email trickling into my inbox reminds me to repeat my existing stance on requests to complete surveys about open source software development: I don't do them.

chairlift

The availability of the email address of developers in OSS projects may make people think  that they could gain some insight by asking those developers questions as part of some research project, but consider this
  1. You won't be the first person to have thought of this -and tried to conduct a survey.
  2. The only people answering your survey will be people who either enjoy filling in surveys, or who haven't been approached, repeatedly before.
  3. Therefore your sample set will be utterly unrealistic, consisting of people new to open source (and not yet bored of completing surveys), or who like filling in surveys.
  4. Accordingly any conclusions you come to could be discounted based on the unrepresentative, self-selecting sample set.
The way to innovate in understanding open source projects -and so to generate defensible results-  is to ask the infrastructure: the SCM tools, the mailing list logs, the JIRA/bugzilla issue trackers. There are APIs for all of this.

Here then are some better ideas than yet-another-surveymonkey email to get answers whose significance can be disputed:
  1. Look at the patch history for a project and identify the bodies of code with the highest rate of change -and the lowest. Why the differences? Is the code with the highest velocity the most unreliable, or merely the most important?
  2. Look at the stack traces in the bug reports. Do they correlate with the modules in (1)?
  3. Does the frequency of stack traces against a source module increase after the patch to that area ships? or does it decrease? That is, do patches actually reduce the #of defects, or as Brooks said in the Mythical Man Month, simply move around. 
  4. Perform automated complexity analysis  on source. Are the most complex bits the least reliable? What is their code velocity?
  5. Is the amount of a discussion on a patch related to the complexity of the destination or the code in the patch?
  6. Does that complexity of a project increase of decrease over time?
  7. Does the code coverage of a project increase or decrease over time?
See? Lots of things you could do -by asking the machines. This is the data-science way, not asking surveys against a partially-self-selecting set of subjects and hoping that it is in some way representative of the majority of open source software projects and developers.

[photo: ski lifts in the cloud, Austria, december 2013]

1 comment:

Comments are usually moderated -sorry.