When you install a Debian system and agree to share data about package usage, the popularity-contest package is installed and sets up a cron job that periodically submits the list of installed packages and the access times of relevant files. This information is anonymized and is accessible at http://popcon.debian.org, where anybody can query the database to obtain data and graphs about the packages they like, statistics per architecture, and much more.
I knew about the popularity-contest package, but I never cared about retrieving information from that data. This semester, however, I enrolled in the MSWL-Communities subject (about data mining) at the Master of Libre Software, and I remembered that a few months ago I saw a nice graph comparing vim, emacs and nano on Luis Cañas’ blog. So I went to the popcon site to see whether I could extract something interesting with my (still) rudimentary data mining knowledge and skills.
First I tried to compare OpenOffice and LibreOffice installations, and I got this graph (the quality is not good, but you can click on it to get the popcon graph and the search query):
But we should take into account that LibreOffice packages are available only in wheezy and sid, not in Debian stable (squeeze), where OpenOffice is the default suite.
In fact, we should take many more things into account. Via the Debian Project News newsletter I found a blog post by Joey Hess about several problems with popcon. (Joey Hess is a long-term Debian Developer; if you want to learn more about him, you can visit his blog or read this interview from the “People behind Debian” series by Raphael Hertzog.) Joey Hess argues that we cannot compare the numbers for packages installed by default on Debian with those for packages installed only at the user's request. He makes other interesting points about popcon and its use in deciding whether to include a certain package in future Debian releases, and about the consequences and risks of making Debian too homogeneous, losing one of its most valuable features: you can find very different software packages, for any purpose you can imagine (even for the rare systems that contribute only small numbers to the statistics).
With these caveats in mind, I went back to popcon and plotted other graphs about applications that are not installed by default on Debian systems.
For example, what about version control systems? Here are the statistics of “regular usage” for some well-known version control systems, from January 2010 until now:
Below you can find another example; this time the contest is between the desktop recorders “RecordMyDesktop” and “Istanbul”. I have plotted installations and regular usage (votes) for both packages since January 2010.
It is interesting to see that although there was a notable increase in Istanbul installations in February 2011, usage did not grow to the same extent, and it seems that some of those installations were removed later. Maybe many people wanted to try the new version of the package released with squeeze but did not end up using it regularly.
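To make the comparison between installations and regular usage a bit more concrete, here is a minimal sketch in Python. The monthly figures are invented for illustration (they are not real popcon numbers), and the names `usage_ratio` and `growth`, as well as the 20-point threshold, are my own assumptions. The idea is simply to divide votes by installations and to flag the months where installations grew much faster than usage.

```python
def usage_ratio(installs, votes):
    """Return votes/installs for each period (0.0 when nothing is installed)."""
    return [v / i if i else 0.0 for i, v in zip(installs, votes)]


def growth(series):
    """Relative change between consecutive periods."""
    return [(b - a) / a if a else 0.0 for a, b in zip(series, series[1:])]


# Hypothetical monthly counts, NOT real popcon data.
installs = [1000, 1050, 1600, 1400]
votes = [200, 210, 220, 215]

ratios = usage_ratio(installs, votes)
inst_growth = growth(installs)
vote_growth = growth(votes)

for month, ratio in enumerate(ratios):
    print(f"month {month}: {ratio:.1%} of installations are in regular use")

# Months where installations jumped much more than usage did.
# The 0.20 threshold is an arbitrary assumption, not a popcon concept.
suspicious = [
    n + 1
    for n, (gi, gv) in enumerate(zip(inst_growth, vote_growth))
    if gi - gv > 0.20
]
print("installations outpaced usage in months:", suspicious)
```

A month flagged this way would match the pattern described above: many new installations, but no comparable growth in regular use.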
In conclusion, here are some things I learned from reading about and playing around with popcon:
- You should be careful when analysing the data provided by popularity-contest and, at the very least, take into account other information about the packages you are looking at (for example, which releases they are present in, when new versions were released, and whether they are installed by default with the Debian system).
- I agree with Joey Hess that “popularity contests” are not a pleasant thing (“the tyranny of the majority”). But plotting graphs with just one web query and playing around with popcon stats is fun!
- Like everything that comes from the Debian community, popularity-contest is free software, designed by experts, under constant review, and open to democratic reform. So in the future we may find even better stats tools from the Debian community.
- You can always find up-to-date numbers, graphs and downloadable data for your do-it-yourself research about package usage, since new data is received every week.
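For the do-it-yourself research mentioned in the last point, a small parser is enough to get started. The sketch below is only an illustration under assumptions: it assumes the downloaded data is plain text with comment lines starting with `#` and whitespace-separated columns for rank, package name, installations and votes (followed by other columns that are ignored here). Check the real files before relying on it. The sample rows and package names are invented, and `PopconRow` and `parse_popcon` are hypothetical names of my own.

```python
from typing import NamedTuple


class PopconRow(NamedTuple):
    rank: int
    name: str
    inst: int
    vote: int


# Invented sample in the ASSUMED layout; these are NOT real popcon figures.
SAMPLE = """\
#rank name inst vote old recent no-files (maintainer)
1 alpha-tool 900 300 500 90 10 (Some Maintainer)
2 beta-tool 400 320 60 15 5 (Another Maintainer)
"""


def parse_popcon(text):
    """Parse data rows, skipping comments and lines that do not fit the layout."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # A data row must start with a numeric rank and have at least 4 columns.
        if len(fields) < 4 or not fields[0].isdigit():
            continue
        rank, name, inst, vote = fields[:4]
        if not (inst.isdigit() and vote.isdigit()):
            continue
        rows.append(PopconRow(int(rank), name, int(inst), int(vote)))
    return rows


rows = parse_popcon(SAMPLE)

# Sort by the share of installations that are in regular use.
by_usage = sorted(
    rows,
    key=lambda r: r.vote / r.inst if r.inst else 0.0,
    reverse=True,
)
for row in by_usage:
    print(f"{row.name}: {row.inst} installed, {row.vote} in regular use")
```

Sorting by the share of installations in regular use, rather than by raw installation counts, is one way to soften the bias towards packages installed by default that Joey Hess warns about.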