Matt is at it again: in his recent comment on why IBM doesn’t like the GPL (which is wrong in many ways, but that’s food for another long post and it’s late now), he restates his comment about 72% of sf.net projects using GPL. This prompted me to run some statistics myself, which showed once again why sf.net numbers are fundamentally worthless.
Even though download statistics are almost junk, they are still the most prominent number you can get by crawling Sourceforge, so I’ve been using http://sourceforge.net/top/topalltime.php as a starting point, with some URL hacking and a small script to turn the 66,771 projects currently listed into a CSV file. My spreadsheet truncates the list to the top 65,535 rows (leaving out the bottom 1,200 or so projects), but the initial findings are still somewhat surprising. Consider this:
- Despite there being 144,990 registered projects, only 66,771 or so (46%, roughly) have actual download statistics. Granted, not every project uses the Sourceforge.net distribution platform, but this still means that roughly 50% of the registered projects are irrelevant;
- Download statistics are impressive, to say the least. Nearly 2/3 of the projects with downloads (62%, to be more precise) account for a mere 0.5% of the total downloads (how’s that for a long tail?). That 62% of the remaining half is about 31% of all projects; add it to the 50% we discarded before and you come up with a whopping 81% of the total number of projects hosted on sf.net being nearly pointless. For the record, 20.06% of the projects with download statistics have fewer than 100 downloads;
- The top five projects alone (eMule, Azureus, Ares Galaxy, BitTorrent and DC++) account for roughly 30% of the overall downloads. Within the top 100 projects, 36% of downloads are P2P related, and the percentage rises to 41% if you take into account MP3 tools, encoders, and other music
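For anyone who wants to check the arithmetic behind these percentages, here is a minimal sketch in Python. It only uses the figures quoted above; the variable names are mine, and the 50% figure is the rounded share of discarded projects used in the list, not the exact one.

```python
# Figures quoted in the list above.
registered = 144_990      # projects registered on sf.net
with_downloads = 66_771   # projects showing up in the download ranking

# Share of registered projects that have download statistics at all.
share_with_downloads = with_downloads / registered
print(f"projects with download stats: {share_with_downloads:.0%}")

# The list rounds the share of projects without statistics to 50%.
discarded = 0.50
# 62% of the remaining projects account for only 0.5% of downloads.
tail_of_remaining = 0.62

# Discarded half, plus the long tail of the other half.
pointless = discarded + (1 - discarded) * tail_of_remaining
print(f"nearly pointless projects: {pointless:.0%}")

# Projects dropped by the 65,535-row spreadsheet cut.
dropped_rows = with_downloads - 65_535
print(f"rows dropped by the spreadsheet: {dropped_rows}")
```

Using the unrounded share of projects without statistics (about 54%) instead of 50% would push the final figure slightly higher, so 81% is, if anything, on the conservative side.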
The above scenario depicts sf.net as a huge code dump mostly relevant for P2P youngsters, which neatly supports my point that any Sourceforge-based statistics on real business matters, such as license adoption, are flawed and nearly worthless. I must confess, though, that I was surprised by the numbers, and I’m sure there is much more to mine in this impressive amount of data, so much that I’m seriously considering abusing the sf.net servers, writing a DOAPizer and starting some serious number crunching on project categories and community dynamics. Fun stuff indeed, if only days were 36 hours long…
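As a starting point for that kind of number crunching, here is a rough sketch of how the head and tail shares could be computed straight from the CSV rather than in a spreadsheet (which also sidesteps the row limit). The column names `project` and `downloads` are assumptions about the file layout, the function names are hypothetical, and the sample data is made up purely to show the calls.

```python
import csv
import io


def load_downloads(csv_file):
    """Read download counts from a CSV with assumed 'project' and 'downloads' columns."""
    return [int(row["downloads"]) for row in csv.DictReader(csv_file)]


def head_share(downloads, top_n):
    """Fraction of all downloads that goes to the top_n most downloaded projects."""
    ranked = sorted(downloads, reverse=True)
    total = sum(ranked)
    return sum(ranked[:top_n]) / total if total else 0.0


def tail_share(downloads, project_fraction):
    """Fraction of all downloads that goes to the least downloaded share of projects."""
    ranked = sorted(downloads)  # ascending: least downloaded first
    cutoff = int(len(ranked) * project_fraction)
    total = sum(ranked)
    return sum(ranked[:cutoff]) / total if total else 0.0


# Made-up sample standing in for the real CSV file.
sample = io.StringIO("project,downloads\na,1000\nb,500\nc,20\nd,5\ne,5\n")
counts = load_downloads(sample)

print(f"top project share: {head_share(counts, 1):.1%}")
print(f"bottom 60% share:  {tail_share(counts, 0.6):.1%}")
```

With the real file, `head_share(counts, 5)` and `tail_share(counts, 0.62)` would be the calls corresponding to the top-five and long-tail figures quoted above.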