Over this past month, I've managed to get a package, with little to no users, to accumulate over one million downloads 🚀.
It didn't cost any money, no laws were broken (I think) and it took little to no effort.
Here's what you need to know about the downloads statistic on NPM.
🔮 The illusion of downloads
If you've ever looked at using a new package from NPM, the chances are you've considered the "Weekly Downloads" statistic.
It's the first metric displayed on the page - so it must be useful information for the user... right?
A third of respondents to this poll seemed to think so, going as far as to say it has a large influence on their decision to adopt a new package.
But here's the thing, it isn't a useful metric for the following two reasons:
- there is a loose (at best) relationship between users and download counts
- the system is easily exploitable
What is a download
This was pretty well discussed on the NPM Blog but, to summarise, it's any successful download of a package tarball from NPM's registry.
NPM have openly stated that this statistic has no consideration for the source (IP, user agent, etc). This means all downloads are equal, whether they come from:
- A user adding a new package to their project
- A CI run installing dependencies
- A bot downloading the package repeatedly to create the illusion of popularity (there's some foreshadowing for you)
As you can imagine, this means a project with frequent CI runs is likely to have more of an influence on download statistics than any set of individuals (especially when taking npm client caching into consideration).
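If you want to inspect the numbers yourself, npm exposes them through a public downloads API. Here's a minimal sketch of querying it (assuming Node 18+ for the built-in fetch):
// Fetch a package's download count for the past week from npm's
// public downloads API (api.npmjs.org)
const getWeeklyDownloads = async (packageName) => {
  const response = await fetch(
    `https://api.npmjs.org/downloads/point/last-week/${packageName}`
  );
  const { downloads } = await response.json();
  return downloads;
};

// Usage: getWeeklyDownloads("urql").then(console.log);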
Registries
The abundance of registries is another reason why download counts aren't an accurate reflection of usage. NPM's download counts only include downloads from the official NPM registry, not from registries such as unpkg and GitHub.
🧑‍💻 Exploiting the system
Disclaimer: I've documented this to bring light to how easily exploitable download statistics are. However, I strongly advise that you don't do this, as it is both dishonest and an unnecessary drain on NPM Inc's resources.
If you've read everything up until this point, you'll know that there isn't a need for any kind of "genius hacker exploit".
Instead, all we need is some way of downloading a package many times.
Running a script locally with some kind of cron job should do just fine - but that isn't too exciting... let's use serverless!
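For illustration, the unexciting local version could be little more than a loop. A rough sketch, assuming the same downloadPackage helper used by the Lambda below:
// Naive local alternative: attempt a download once a minute, forever
// (downloadPackage is sketched later in the article)
setInterval(() => {
  downloadPackage({ package: "is-introspection-query" }).catch(console.error);
}, 60 * 1000);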
You can check out the full repo here.
Creating a script
For the Lambda, I created a function which takes the following arguments:
- package - the package to download
- probability - the likelihood of a download for a given run
The latter argument is intended to add noise - simulating the variable nature of downloads over time.
A "coin flip" takes place each run, with the probability
argument being used to weight the chance of success. If the flip is successful, the package is downloaded.
// Note: "package" is a reserved word in ES modules (strict mode),
// so it's aliased to packageName when destructuring
export const handler = async ({ package: packageName, probability }) => {
  // Simulate coin flip - succeeds roughly (probability * 100)% of the time
  if (Math.random() > probability) {
    // Flip fail - skip this run
    return;
  }
  // Flip success - download the package
  await downloadPackage({ package: packageName });
};
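The downloadPackage helper lives in the repo linked above. A hypothetical version might resolve the latest tarball URL from the registry's package metadata and fetch it, since it's the tarball fetch that registers as a download:
// Hypothetical sketch of downloadPackage - the real implementation is in
// the linked repo. Fetching the tarball is what counts as a "download".
const downloadPackage = async ({ package: packageName }) => {
  // The registry's package metadata includes dist-tags and per-version info
  const meta = await fetch(
    `https://registry.npmjs.org/${packageName}`
  ).then((res) => res.json());

  const latest = meta["dist-tags"].latest;
  const tarballUrl = meta.versions[latest].dist.tarball;

  // Download the tarball itself
  await fetch(tarballUrl).then((res) => res.arrayBuffer());
};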
Triggering the Lambda
To get this script running routinely, a CloudWatch event rule was set up to trigger once a minute.
// Terraform example
resource "aws_cloudwatch_event_rule" "lambda_trigger_rule" {
  name                = "trigger-npm-install"
  description         = "Trigger an NPM install"
  schedule_expression = "rate(1 minute)"
}
Example CloudWatch Event Rule in Terraform.
In order to do something when this event is triggered, an event target is set up, pointing to the Lambda with our required arguments.
resource "aws_cloudwatch_event_target" "lambda" {
arn = aws_lambda_function.install_package_lambda.arn
rule = aws_cloudwatch_event_rule.lambda_trigger_rule.name
input = jsonencode({
package = "is-introspection-query"
probability = 0.8
})
}
Example CloudWatch Event Target in Terraform.
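One detail not shown above: CloudWatch Events also needs permission to invoke the function. Assuming the resource names from the earlier snippets, that would look something like this:
resource "aws_lambda_permission" "allow_cloudwatch" {
  statement_id  = "AllowExecutionFromCloudWatch"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.install_package_lambda.function_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.lambda_trigger_rule.arn
}
Example Lambda permission in Terraform.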
🚀 The result
After deploying this for the duration of a week, the result is... well, actually not that impressive; it turns out there aren't as many minutes in a week as I had expected 🤔 (a once-a-minute trigger tops out at 10,080 invocations, which is roughly 8,000 downloads at a probability of 0.8).
After some tweaks though, we hit just under 1 million downloads per week!
Yes, that's right - a package with literally 0 users has more downloads than the likes of urql and mobx.
Are you seeing the problem now?
Download stats don't work
Here's the thing, naive download statistics are useless at best and misleading at worst.
The large graph on NPM's site, the culture of celebrating download counts online, and the third-party sites showing package download "trends" all contribute to the narrative that NPM download counts provide some kind of insight into a package's popularity - and they just don't.
Even ignoring the potential for malicious actors (like myself), the abundance of registries and caching implementations makes these statistics less than useful.
"Popularity"
Fortunately, NPM has a saving grace - the popularity statistic! Let's just replace the download count with this more useful statistic... right?
Well, no - it turns out the popularity statistic seems to be the downloads statistic in disguise. As you can see below, my package managed to surpass @prisma/engines in terms of popularity.
Here's a quick comparison of the two packages side-by-side.
| | @prisma/engines | is-introspection-query |
| --- | --- | --- |
| weekly downloads | ~100,000 | ~800,000 |
| stars | 264 | 0 |
| forks | 35 | 0 |
| contributors | 26 | 1 |
| users | probably not 0 | definitely 0 |
Conclusion
If there's one thing you take away from this discussion, it's that downloads alone aren't a useful metric.
I've no doubt that NPM could create a popularity metric that aggregates a number of different attributes of a package - npms.io has already done it. From now on, though, I'm going to do a little more background research before trusting the downloads and popularity metrics on NPM 🕵️.
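If you're curious, npms.io exposes that aggregate score through its public API. A quick sketch of querying it (again assuming Node 18+'s fetch):
// Fetch npms.io's aggregate score for a package - popularity is only
// one component alongside quality and maintenance
const getScore = async (packageName) => {
  const response = await fetch(
    `https://api.npms.io/v2/package/${encodeURIComponent(packageName)}`
  );
  const { score } = await response.json();
  return score; // { final, detail: { quality, popularity, maintenance } }
};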
Hopefully, you found this interesting! If you have any thoughts or comments, feel free to drop them below or hit me up on Twitter - @andyrichardsonn
Disclaimer: All thoughts and opinions expressed in this article are my own.
Top comments (12)
I don't get the logic here
Your argument is based on your exploit, which makes an illogical jump
I can spoof downloads -> Everyone spoofs -> the metric cannot be trusted
Unless you can prove that a significant portion of users spoof (like in app store reviews), the argument is moot.
You don't need a significant portion to spoof for the metric to be useless.
Let's say you need a package, so you go and look at exxpress. It has 30M downloads, therefore it's probably the popular package you wanted. So you're good to go with npm install exxpress, right?
Sorry if I've made this unclear - there are two main points I'm trying to emphasise. Just in case you missed it 👇
In terms of the exploit side of things, my personal opinion is that data which can be manipulated to this degree shouldn't be given any weight.
Edit: I've updated the conclusion to remove the emphasis on the latter point - hope that clears up any confusion!
I agree... You could argue the same with security:
I can create an exploit in an npm package -> everyone does -> npm is fundamentally unsafe.
That is kind of true, but no one stops using npm for this reason. I guess the same goes for every metric - who guarantees you that a project's GitHub stars don't come from a click farm?
When using an npm package, you're trusting its author, to some extent.
(It is fun to see that npm doesn't even try to protect itself against this, though.)
That's not what the post is getting at (to my reading). The equivalent would be, I think:
There happen to be wild fluctuations in the number of exploits accidentally appearing in npm packages due to cosmic rays -> I can demonstrate getting an exploit into a hyperbolic number of npm packages to prove a point -> npm is fundamentally unsafe.
It's talking about how the metric is useless even for telling you how many unique users downloaded a package, versus how often something cached it or ran a build job.
AFAIK the popularity metric that npm uses is from npms.io 😅
Do you have a source for this?
Looking at the package on npms.io it has a much lower popularity rating (3% compared to 14% on NPM).
Edit: just to clarify, I also had this assumption but assumed I was misremembering after seeing the difference.
The source I have is that I worked at a competitor when that change happened and was friends with folks at npm at the time. You're correct that there's apparently now deviation, and I'm not sure what that is - if npm continued using the original scoring and npms moved on, if npm moved on, or something else.
Slight correction - in 2017 we switched from Cloudflare caching to a simple CNAME to the npm registry. So downloads via Yarn are counted just like any download from npm.
Thanks for this! Fixing this now 👍
Edit: Removed mention of yarn in registries section.
The same thing is happening with my package:
npmjs.com/package/viwerjs-ang
Its downloads are rapidly increasing, but I can't believe those statistics.
Very interesting article!
I would've thought NPM would at least check the IPs 🤷‍♀️