I have never liked statistics, but by now you may already know that I am a curious person, and the “Communities” course of the Master of Libre Software that I am attending at Universidad Rey Juan Carlos has infected me with the bug. I learned a few things:
- One is that, since libre software projects publish their source code and, in many cases, many other sources of information about the project itself, we can find reliable data to measure several aspects of a project and try to draw conclusions, find patterns, test theories…
- Another is that there are several tools that somebody has already written to help with this kind of analysis, and these tools are libre software themselves (kudos to all the developers).
- In conclusion, anybody with the appropriate knowledge and time can do research on libre software projects, or replicate other people's results in order to confirm (or not) their consistency.
I feel that I lack both the knowledge and the time, but I did not want to move on to other things without playing a little with the data and the tools.
So I decided to count the source lines of code of the Linux kernel version that I am using on my Debian Wheezy laptop (yes, it used to run Ubuntu, but I changed it later; maybe I will explain that in another post). In this post I will talk about the tools for counting lines of code and about the results.
About counting tools
(From the manual page): sloccount counts the physical source lines of code (SLOC) contained in descendants of the specified set of directories. It automatically determines which files are source code files, and it automatically determines the computer language used in each file. By default it summarizes the SLOC results and presents various estimates (such as effort and cost to develop), but its output can be controlled by various options.
Sloccount is written by David A. Wheeler and you can find more information about it at http://www.dwheeler.com/sloccount.
The version that I have installed is 2.26-4, and the results for the 2.6.38-2 Linux kernel are below:
(From the manual page): cloc counts physical lines of source code in the given files (may be archives such as compressed tarballs or zip files) and/or recursively below the given directories. Counts blank lines, comment lines, and physical lines of source code in many programming languages. It is written entirely in Perl, using only modules from the standard distribution.
Cloc is written by Al Danial and is Copyright (C) 2006-2010 Northrop Grumman Corporation, released under the GNU GPL version 2 or (at your option) any later version. You can find more information about it at http://cloc.sourceforge.net/
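To make the difference between blank lines, comment lines and physical lines of source code more concrete, here is a minimal sketch in Python of how such a classification could work for C-style files. It is only an illustration and not how cloc is actually implemented: the function name count_lines and the sample text are made up for this post, string literals containing comment markers are not handled, and an empty line inside a /* ... */ block is counted as a comment line (an assumption of this sketch).

```python
def count_lines(text):
    """Classify each physical line as blank, comment or code (simplified C-style rules)."""
    counts = {"blank": 0, "comment": 0, "code": 0}
    in_block = False  # True while we are inside a /* ... */ comment
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped and not in_block:
            counts["blank"] += 1
            continue
        has_code = False
        i = 0
        while i < len(stripped):
            if in_block:
                end = stripped.find("*/", i)
                if end == -1:
                    i = len(stripped)  # the block comment continues on the next line
                else:
                    in_block = False
                    i = end + 2
            elif stripped.startswith("//", i):
                break  # the rest of the line is a comment
            elif stripped.startswith("/*", i):
                in_block = True
                i += 2
            else:
                if not stripped[i].isspace():
                    has_code = True
                i += 1
        # A line with any code on it counts as code, even if it also has a comment.
        counts["code" if has_code else "comment"] += 1
    return counts


# Hypothetical sample, not taken from the kernel sources.
sample = """/* header
   comment */

int main(void) {
    // say nothing
    return 0; /* done */
}
"""

print(count_lines(sample))  # {'blank': 1, 'comment': 3, 'code': 3}
```

Even in this toy version there are decisions to make (is a line with code and a trailing comment "code" or "comment"?), so it is easy to see how different tools can report different totals for the same files.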
Here are the results for Linux kernel version 2.6.38- (I found cloc much slower than sloccount):
(From the Ohcount Sourceforge page): Ohcount is the source code line counter that powers Ohloh. Ohcount supports over 70 programming languages. Ohcount can also detect popular open source licenses such as GPL and determine if code targets a particular programming API, such as Win32 or KDE. You can find more information at http://sourceforge.net/projects/ohcount/
Here are my results with ohcount:
Since the results differ quite a lot depending on the tool used to count source lines of code, it would be nice if, whenever we see a SLOC estimate for a program, the author explained (maybe in a footnote) how he or she got that result.
What I liked about sloccount is that it automatically gives the percentage of use of each language (BTW, the use of C in the Linux kernel is overwhelming), and that it also shows the number of lines per folder (so we can see that the “drivers” folder is the biggest one in the kernel).
Both cloc and ohcount give stats about the number of files, and also break the count down into code lines, comment lines and blank lines (I liked this too!). However, the SLOC total from ohcount was very different from the others. Maybe this is due to the number of files considered: cloc ignores 3612 files (duplicates?) while ohcount examines all the files.
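If the ignored files really are duplicates, one way to check would be to look for files with identical contents. The following Python sketch does that by hashing every file under a directory. It is only my guess at how a duplicate check could look, not a description of what cloc does internally, and the function name split_duplicates is hypothetical.

```python
import hashlib
import os


def split_duplicates(root):
    """Walk root and return (unique_paths, duplicate_paths), comparing files by content."""
    seen = {}  # content digest -> first path found with that content
    unique, duplicates = [], []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            if not os.path.isfile(path):
                continue  # skip broken symlinks and other special entries
            with open(path, "rb") as fh:
                digest = hashlib.sha256(fh.read()).hexdigest()
            if digest in seen:
                duplicates.append(path)
            else:
                seen[digest] = path
                unique.append(path)
    return unique, duplicates
```

Calling split_duplicates on the kernel source directory and comparing the length of the second list with the number of files that cloc reports as ignored would show whether duplicates explain the difference.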