Lies, damn lies and Sourceforge statistics

Matt is at it again: in his recent comment on why IBM doesn’t like the GPL (which is wrong in many ways, but that’s food for another long post and it’s late now), he restates his claim that 72% of projects use the GPL. This prompted me to run some statistics myself, which proved once again why numbers are fundamentally worthless.

Even though download statistics are almost junk, they are still the most prominent number you can get by crawling Sourceforge, so I’ve been using them as a starting point, with some URL hacking and a small script to turn the 66,771 projects currently listed into a CSV file. Even though my spreadsheet cut the data to the top 65,535 rows (leaving the bottom 1,200 or so projects out), the initial findings are somewhat surprising. Consider this:

  1. Despite having 144,990 projects registered, only 66,771 or so (46%, roughly) have actual download statistics. Even though not every project uses Sourceforge’s distribution platform, this roughly means that 50% of the registered projects are irrelevant;
  2. Download statistics are impressive to say the least. Suffice it to say that nearly 2/3 of the projects (62%, to be more precise) account for a mere 0.5% of the total downloads (how’s that for a long tail?). Add that 62% of the remaining half to the 50% we discarded before (0.50 + 0.62 × 0.50) and you come up with a whopping 81% of the total number of projects hosted on Sourceforge being nearly pointless. For the record, 20.06% of the accounted projects have fewer than 100 downloads;
  3. The top five projects alone (eMule, Azureus, Ares Galaxy, BitTorrent and DC++) account for roughly 30% of the overall downloads. 36% of downloads, considering the top 100 projects, are P2P related, and the percentage bumps up to 41% if you take into account MP3 tools, encoders, and other music stealing management tools.
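The headline percentages in the list can be re-derived from the counts quoted above. What follows is only a back-of-the-envelope check in Python: the variable names are made up for illustration, and the 50% and 62% inputs are the rounded figures from the list, not fresh measurements.

```python
# Counts quoted in the post.
registered = 144_990        # projects registered on Sourceforge
with_downloads = 66_771     # projects that have download statistics
spreadsheet_rows = 65_535   # rows the spreadsheet kept

# Share of registered projects with any download statistics.
share_with_downloads = with_downloads / registered

# Projects dropped by the spreadsheet row limit.
dropped = with_downloads - spreadsheet_rows

# Rounded inputs from the list (assumptions, not new data): half of all
# projects have no downloads, and 62% of the other half form the long tail.
no_downloads = 0.50
tail_of_rest = 0.62
nearly_pointless = no_downloads + tail_of_rest * (1 - no_downloads)

print(f"{share_with_downloads:.1%} of registered projects have download stats")
print(f"{dropped} projects fell off the spreadsheet")
print(f"{nearly_pointless:.0%} of all projects are nearly pointless")
```

Using the unrounded 46% instead of 50% would push the last figure a bit higher, so the 81% is best read as a rough estimate.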

The above scenario depicts Sourceforge as a huge code dump mostly relevant for P2P youngsters, which nicely proves my point: any Sourceforge-based statistics on real business stuff such as license adoption are flawed and nearly worthless. I must confess, though, that I’ve been surprised by the numbers, and I’m sure there is much more to mine in this impressive amount of data, so much that I’m seriously considering abusing the servers, writing a DOAPizer and starting some serious number crunching on project categories and community dynamics. Fun stuff indeed, if only days were 36 hours long…
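For anyone who wants to repeat the long-tail measurement without a spreadsheet row limit, here is a minimal sketch in Python. It is not the script used for the numbers above: the `project` and `downloads` column names, the `download_shares` helper and the sample data are all hypothetical.

```python
import csv
import io


def download_shares(rows, tail_fraction=0.62, top_n=5):
    """Return (tail_share, top_share) of total downloads.

    tail_share: share of downloads from the least-downloaded
                `tail_fraction` of projects.
    top_share:  share of downloads from the `top_n` most-downloaded projects.
    """
    # Sort download counts from most to least downloaded.
    counts = sorted((int(r["downloads"]) for r in rows), reverse=True)
    total = sum(counts)
    if total == 0:
        return 0.0, 0.0
    tail_size = int(len(counts) * tail_fraction)
    tail = counts[len(counts) - tail_size:]
    return sum(tail) / total, sum(counts[:top_n]) / total


# Made-up sample data, only to show the shape of the input.
sample = io.StringIO(
    "project,downloads\n"
    "alpha,900\n"
    "beta,60\n"
    "gamma,30\n"
    "delta,10\n"
)
rows = list(csv.DictReader(sample))
tail_share, top_share = download_shares(rows, tail_fraction=0.5, top_n=1)
print(f"bottom half: {tail_share:.1%} of downloads, top project: {top_share:.1%}")
```

Reading the CSV directly keeps all 66,771 rows in play, so the bottom of the tail is not silently cut off.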



4 thoughts on “Lies, damn lies and Sourceforge statistics”

  1. It’s his next paragraph that I find more interesting:

    “So IBM hasn’t figured out what the rest of us know with ever-increasing certitude: it’s possible to monetize open source directly. Ironically, it becomes easier the more freedom that imbues the software. Even more ironically, this is so because companies like IBM don’t want to touch software that is free – it threatens their proprietary software.”

    So here we are, arguing for the GPL, and the argument is on how it lets you directly monetize open source. Basically – “Hey IBM, open source your stuff under GPL and you can make more money off of it!”. I wonder how they would directly monetize another GPL product. How would IBM directly monetize the kernel?

    He describes the GPL as having more freedom – which is odd if it’s all about directly monetizing and protecting your own product from being abused; and then he says something that is completely true. People don’t want to touch the GPL, it threatens their own proprietary work.

    GPL is increasingly about business strategy and defending your product – not about open source. It’s a great license for that – if I want to defend my product I’m going to be GPL’ing it. If I want to provide something to the community for maximum use, I’m definitely not. Then I’d have the choice of either BSD’ing it, or using AL and losing out on any GPL zealots (but choosing BSD would probably have lost them anyway).

  2. Your point about active vs. inactive projects on SourceForge is well taken. We are currently tracking over 40 projects related to data management, data analytics and business intelligence, though not all are available through SourceForge. Of these, approximately half were announced in 2005; in the six years prior in which we were looking at open source data management and analytics projects, there were only a handful.

    Out of these projects…

    – four have merged into another project and are still quite active, and still available as stand-alone products
    – one has been purchased by a proprietary vendor, though is still somewhat active
    – 10 projects haven’t released anything new in over a year, five of these only had one release of beta files
    – two never released any files, and have been removed from our list after stagnating in “planning” for two years
    – one of the first ETL projects was very active for two years, but has languished since 2003
    – one project has been removed from SourceForge
    – one appears to have gone proprietary
