Have I Been Pwned goes open source

The question isn’t “Does someone have your user IDs and passwords?” I guarantee you someone has. Don’t believe me? Check for yourself at Have I Been Pwned (HIBP). I’ll wait. Now, do you believe me?
People check the free HIBP site at a rate of almost 1 billion requests per month. It collects data from the many security breaches that expose personal information every week or two. Last year alone saw dozens of data breaches. Going forward, HIBP will also receive compromised passwords discovered in the course of FBI investigations.
Why is the FBI getting involved? As Bryan A. Vorndran, Assistant Director of the FBI’s Cyber Division, put it, “We are excited to be partnering with HIBP on this important project to protect victims of online credential theft. It is another example of how important public/private partnerships are in the fight against cybercrime.”
The FBI passwords will be provided as SHA-1 and NTLM hash pairs; HIBP doesn’t need them in plain text. They’ll be fed into the system as the Bureau makes them available. To do that, HIBP is open-sourcing Pwned Passwords, the service through which the data will flow into HIBP.
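To make "SHA-1 and NTLM hash pairs" concrete, here is a minimal Python sketch of how such a pair could be derived from a plain-text password. It is an illustration only, not HIBP's or the FBI's code: the `HashPair` type and `hash_pair` function are hypothetical names, and the UTF-8 input encoding for SHA-1 and the upper-case hex output are assumptions. NTLM is MD4 over the UTF-16LE bytes of the password; because MD4 is a legacy algorithm that some OpenSSL builds disable, the sketch returns `None` for the NTLM half when it is unavailable.

```python
import hashlib
from typing import NamedTuple, Optional


class HashPair(NamedTuple):
    """Hypothetical container for one SHA-1/NTLM pair."""
    sha1: str
    ntlm: Optional[str]  # None when this Python build lacks MD4


def hash_pair(password: str) -> HashPair:
    """Return upper-case hex SHA-1 and NTLM digests for a password."""
    # Assumption: the password is encoded as UTF-8 before SHA-1 hashing.
    sha1 = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    try:
        # NTLM is MD4 over the UTF-16LE encoding of the password.
        md4 = hashlib.new("md4", password.encode("utf-16-le"))
        ntlm = md4.hexdigest().upper()
    except ValueError:
        # hashlib raises ValueError when the digest is not supported.
        ntlm = None
    return HashPair(sha1, ntlm)


if __name__ == "__main__":
    pair = hash_pair("password")
    print(pair.sha1)
    print(pair.ntlm)
```

Either hash is enough to check whether a password has been seen before, which is why the plain text never needs to change hands.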
HIBP founder Troy Hunt, security expert and Microsoft Regional Director, explained he’s open-sourcing the code because “The philosophy of HIBP has always been to support the community, now I want the community to help support HIBP.” HIBP is written in .NET and runs on Azure.
With a billion searches a month, I’m sure Hunt can use all the help he can get. He started planning to open-source HIBP in August 2020. Hunt quickly discovered this wasn’t easy. He wrote:
“I knew it wouldn’t be easy, but I also knew it was the right thing to do for the longevity of the project. What I didn’t know is how non-trivial it would be for all sorts of reasons you can imagine and a whole heap of others that aren’t immediately obvious. One of the key reasons is that there’s a heap of effort involved in picking something up that’s run as a one-person pet project for years and moving it into the public domain. I had no idea how to manage an open-source project, establish the licensing model, coordinate where the community invests effort, take contributions, redesign the release process, and all sorts of other things I’m sure I haven’t even thought of yet. This is where the .NET Foundation comes in.”
The .NET Foundation isn’t part of Microsoft. It’s an independent 501(c) non-profit organization that supports open-source .NET projects.
Hunt’s starting with the Pwned Passwords code because it’s relatively easy. The reasons for this include:
- It’s a very simple codebase consisting of Azure Storage, a single Azure Function, and a Cloudflare worker.
- It has its own domain, Cloudflare account, and Azure services so it can easily be picked up and open-sourced independently to the rest of HIBP.
- It’s entirely non-commercial without any API costs or Enterprise services like other parts of HIBP (I want community efforts to remain in the community).
- The data that drives Pwned Passwords is already freely available in the public domain via the downloadable hash sets.
Thus, Hunt could “proverbially ‘lift and shift’ Pwned Passwords into open-source land in a pretty straightforward fashion which makes it the obvious place to start. It’s also great timing because as I said earlier, it’s now an important part of many online services and this move ensures that anybody can run their own Pwned Passwords instance if they so choose.”
Hunt hopes “that this encourages greater adoption of the service both due to the transparency that opening the code base brings with it and the confidence that people can always ‘roll their own’ if they choose. Maybe they don’t want the hosted API dependency, maybe they just want a fallback position should I ever meet an early demise in an unfortunate jet ski accident. This gives people choices.”
At one time, Hunt had considered selling HIBP. With this open-source move, a sale no longer appears to be on the table.
The HIBP code is being kept on GitHub. It’s licensed under the BSD 3-Clause license.
Hunt’s overall plan is:
- There’s an authenticated endpoint that’ll receive SHA-1 and NTLM hash pairs of passwords. The hash pair will also be accompanied by a prevalence indicating how many times it has been seen in the corpus that led to its disclosure.
- Upon receipt of the passwords, the SHA-1 hashes need to be extracted into the existing Azure Blob Storage construct. This is nothing more than 16^5 different text files (because each SHA-1 hash is queried by a 5 character prefix), each containing the 35-byte SHA-1 hash suffix of each password previously seen and the number of times it’s been seen.
- “Extracted into” means either adding a new SHA-1 hash and its prevalence or updating the prevalence where the hash has been seen before.
- Both the SHA-1 and NTLM hashes must be added to a downloadable corpus of data for use offline and as per the previous point, this will mean creating some new entries and updating the counts on existing entries. Due to the potential frequency of new passwords and the size of the downloadable corpora (up to 12.5GB zipped at present), my thinking is to make this a monthly process.
- After either the file in blob storage or the entire downloadable corpus is modified, the corresponding Cloudflare cache item must be invalidated. This is going to impact the cache hit ratio which then impacts performance and the cost of the services on the origin at Azure. We may need to limit the impact of this by defining a rate at which cache invalidation can occur (i.e. not more than once per day for any given cache item).
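The storage and cache steps in that plan can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the project's actual .NET implementation: `PrefixStore` is a hypothetical in-memory stand-in for the 16^5 (1,048,576) prefix files in blob storage, the `SUFFIX:COUNT` line format is assumed, "updating the prevalence" is assumed to mean adding the incoming count to the stored one, and the once-per-day purge limit is taken from the example figure in the last bullet.

```python
from collections import defaultdict

PREFIX_LEN = 5               # buckets are keyed by the first 5 hex characters
MIN_PURGE_INTERVAL = 86_400  # seconds; at most one cache purge per day per item


class PrefixStore:
    """In-memory stand-in for the 16**5 prefix files (hypothetical)."""

    def __init__(self):
        self.buckets = defaultdict(dict)  # prefix -> {suffix: prevalence}
        self.last_purge = {}              # prefix -> time of last cache purge

    def upsert(self, sha1_hex, prevalence):
        """Add a new hash, or raise the count of one already seen."""
        sha1_hex = sha1_hex.upper()
        if len(sha1_hex) != 40:
            raise ValueError("expected a 40-character hex SHA-1 digest")
        prefix, suffix = sha1_hex[:PREFIX_LEN], sha1_hex[PREFIX_LEN:]
        bucket = self.buckets[prefix]
        # Assumption: an update adds the new prevalence to the stored count.
        bucket[suffix] = bucket.get(suffix, 0) + prevalence
        return prefix

    def render(self, prefix):
        """Serialize one bucket as 'SUFFIX:COUNT' lines (format assumed)."""
        bucket = self.buckets.get(prefix, {})
        return "\n".join(f"{s}:{c}" for s, c in sorted(bucket.items()))

    def should_purge(self, prefix, now):
        """Allow a cache purge for this prefix at most once per interval."""
        last = self.last_purge.get(prefix)
        if last is not None and now - last < MIN_PURGE_INTERVAL:
            return False
        self.last_purge[prefix] = now
        return True


if __name__ == "__main__":
    store = PrefixStore()
    touched = store.upsert("5BAA61E4C9B93F3F0682250B6CF8331B7EE68FD8", 3)
    print(touched)
    print(store.render(touched))
    print(store.should_purge(touched, now=0))
```

Keying the purge timestamp by prefix means a burst of new hashes landing in the same file triggers only one invalidation per day, which is the trade-off between freshness and cache hit ratio that the plan describes.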
That said, as Hunt admits, this is very much a work in progress: “I don’t have all the answers on how things will proceed from here.” But, with the help of you, the FBI, and the .NET Foundation, HIBP promises to be more useful than ever.