Have I got your attention? It’s a sensationalist title, but this is important and developers/administrators still get it wrong.
Both online and professionally, I encounter technical people still turning to traditional hashing algorithms like SHA or, Schneier forbid, MD5 when making decisions about scrambling user credentials. Even this recent question on Stack Overflow Exchange has yielded inaccurate answers. While choosing something like SHA-256 with salt isn’t necessarily a bad decision, it’s not the right decision – which, when it comes to cryptography, is critical to maintain the integrity of the system as a whole.
Attention-grabbing title aside, I hope that I can explain why people need to carefully choose their cryptographic primitives when making decisions about securing data at rest.
Dislaimer: I am not a cryptographer or mathematician, just a concerned internet citizen. If you have feedback, please feel free to contribute and I’ll amend my commentary as needed.
Obfuscation and Hashing
It should be self-evident to see why storing the user password
monkeys123
as
monkeys123
in an application database is a bad idea. The aforementioned Stack Overflow question actually answers this general problem pretty well, read it if you don’t immediately see why this is a bad idea.
A long time ago the best practice to mitigate this risk was to cryptographically hash passwords with an algorithm like MD5 such that the password
monkeys123
becomes
5a44cf4723d8fab4394952e331d6efad
inside of an application’s database. Essentially what we’re trying to achieve is a fairly specific set of requirements for our password-scrambling scheme:
- it is infeasible to obtain the original value of the hash
- it is infeasible to find a message that hashes to a certain value
- it takes a long time to calculate (why?)
- somewhat less important but still a requirement, hashed values should be as unique as possible to avoid collisions
As many of these requirements overlap with the strengths of some hashing algorithms, schemes like SHA have become popular and commonplace in web frameworks.
However, a hashing algorithm is an interesting beast. Although the layman’s “information security” synapses will fire when cryptographic hashes are mentioned, hashing algorithms have strengths (and weaknesses) very specific to the use case.
It’s worthwhile now to briefly dip into the technicalities of cryptographic hashing.
Aside: Why Prefer CPU-Intensive Operations?
This is sort of a weird requirement until you grok the difference between hashing passwords and large files.
When you hash a file, you’re looking at possibly megabytes of data to crunch through bit-by-bit in order to come up with a unique fingerprint. That’s a lot of entropy that the hashing algorithm has to work with to create pseudorandom values.
When you hash a password, you’re creating a unique fingerprint for a string that is going to top out for the average person at about 10 or 20 characters. Thus, while extremely large, the domain of all possible values for the average password is significantly smaller than the entropy swirling about in a 10-megabyte file.
Furthermore, attackers can usually rely on a wordlist to serve as a base for password guesses, thereby reducing the unique values that they must brute force even further (it’s easier to start guessing at ‘password123’ rather than ‘aaaaaaaaa’.)
What this means for your hashed password is that you don’t want to enable a bad actor to be able to rapidly traverse millions of unique values quickly to try and pin down your pre-hashed credential because of this key reason:
- The original hash values come from an extremely small domain of possible combinations, which makes guessing at input values a viable strategy.
Note that because passwords will tend to always have this attribute (for the foreseeable future), computationally expensive is a requirement we want no matter what cryptographic primitive we choose. As the domain of our unique value is drawn from human-retainable set of characters, we must ensure that computing values is prohibitively expensive for the lengths and complexities that a person can remember.
Hashing and You
According to the internet hive mind, the ideal cryptographic hashing function has the following properties: 1
- it is easy to compute the hash value for any given message
- it is infeasible to generate a message for has a given hash
- it is infeasible to modify a message without changing the hash
- it is infeasible to find two different messages with the same hash.
In short, we want to throw a hunk of information into a woodchipper and watch a unique, unguessable fingerprint exit the other end. There are some use cases for which these attributes are exceptionally well-suited:
- Checksumming - want to confirm a file’s integrity after you’ve downloaded it? Hash it, and you’ll quickly get a bit sequence that will get thrown wildly off if any part of your file differs from the original.
- Uniqueness - as evidenced by the success of
git
’s internal storage mechanisms, storing commit blobs as values of a hash key is effective because commits aren’t going to be the same, and hashing those commits creates unique hashes. - Bucketing - possibly using a poor term here, but the idea is to easily create ‘databases’ for unwieldy, large files by identifying them by unique keys (for example, you can hash all executables in a *nix system’s
/bin/
and come up with a few-kilobyte text file that can serve as a known good state for pristine executables)
However, you’ll note that these strengths are somewhat at odds with some of the desirable traits for a password scrambling scheme:
- Speed - hashing functions should be fast. Heck, the successor to SHA-2 (keccak) explicitly touts the fact that is hashes quickly and is designed to outperform its predecessors.
- Integrity - while convenient (and necessary) for password hashing to be deterministic, one hugely important trait of a good hashing algorithm is that the output should vary wildly when one bit is off. Although good for storing passwords, this requirement is low on our list, but high on the requirements for a good hashing algorithm.
You’ll also note that because our domain of possible values is limited (necessitating the aforementioned high cost of calculating a hash), rainbow tables have to combated with salt. The fact we’re already compensating for the attributes of a hash is a bad sign.
Hopefully it’s evident that the intended use case for a hash does not perfectly fit our ideal algorithm for cryptographically scrambling passwords.
Enter The Key Derivation Function
Built upon other cryptographic primitives (including hashing), a key derivation function (KDF) is meant to combine a non-secret value (read: salt) and a secret value (your password) and derive a value from calculations upon these values. Their intended use case is to take a pass(word/phrase) and derive a set-length key for later use in encryption (for example, handing the password “monkeys123” to a KDF and getting back a 256-bit long value to use in AES.)
Interestingly, this also means that a KDF isn’t desgined from the get-go to be a password-scrambling algorithm (in the same way that hashing algorithms aren’t), but they are much better suited for the purpose.
We still meet some crucial requirements for our use case with a KDF, in that:
- return values are fixed length
- original values are unrecoverable
We also gain this significant attribute from a KDF:
- The resultant hash is very expensive to calculate
In fact, some KDFs like scrypt are specifically designed to be prohibitively expensive to calculate.
To give you an idea of just how different these operations can be, I wrote a small ruby program to perform MD5, SHA1, SHA2, SHA3, scrypt, and Bcrypt password digests with increasing iterations. Take a look at the y-axis on the chart below, then enable the bcrypt and scrypt plots and take a look at the scale again..
md5 | |
sha1 | |
sha2 | |
sha3 | |
pbkdf2 |
Note #1: These numbers may be inaccurate, but the order of magnitude differences are what is important here. If you want to validate my claims, I’ve posted a gist just for you I’ve been using to generate the flot data.
Note #2: You’ll note that SHA-3 (or keccak) is slower than its predecessors despite my previous mention of the algorithm. There’s a couple reasons this could be. My guess is that the gem I’m using is probably an early, unoptimized implementation, or my script is skewed somehow. If anyone has any thoughts, I’d love to hear them.
Note #3: The surprisingly fast performance of pbkdf2 may be due to the fact that I implemented my data-generation script with dumb iterations rather than the parameterized iteration option for the rubygem function, but I didn’t delve too far into this odd result.
This script ran over the course of three days generating data on a 2GB Linode instance. The data isn’t terribly scientific, but it gets the point across: KDF functions are dramatically slower than hashing functions, which isn’t a nicety of our desired traits, but a requirement.
That’s a pretty drastic difference.
The tl;dr
- Stop associating “hashed” with “secure” when it comes to passwords. If you’re storing user credentials MD5-ed without salt, you’ve put an thumb tack in front of a steamroller - just a minor annoyance that won’t offer you any real security. You have to do it right or hashing means nothing.
- Start relying on key derivation functions for password storage.2 Are they perfect? Will they be the best practice in five years? Probably not, but they’re your best option when rolling your own user credential storage.
- Don’t sacrifice security for performance. A slightly slower
malloc
in OpenSSL would have been much preferred, in my opinion, than the disaster that is Heartbleed.