Tyblog

Technology, open source, unsolicited opinions & digital sovereignty
blog.tjll.net

« Please stop hashing passwords »

  • 20 April, 2014
  • 1,921 words
  • 10 minutes read time

Have I got your attention? It’s a sensationalist title, but this is important and developers/administrators still get it wrong.

Both online and professionally, I encounter technical people still turning to traditional hashing algorithms like SHA or, Schneier forbid, MD5 when making decisions about scrambling user credentials. Even this recent question on Stack Overflow Exchange has yielded inaccurate answers. While choosing something like SHA-256 with salt isn’t necessarily a bad decision, it’s not the right decision – which, when it comes to cryptography, is critical to maintain the integrity of the system as a whole.

Attention-grabbing title aside, I hope that I can explain why people need to carefully choose their cryptographic primitives when making decisions about securing data at rest.

Dislaimer: I am not a cryptographer or mathematician, just a concerned internet citizen. If you have feedback, please feel free to contribute and I’ll amend my commentary as needed.


Obfuscation and Hashing

It should be self-evident to see why storing the user password

monkeys123

as

monkeys123

in an application database is a bad idea. The aforementioned Stack Overflow question actually answers this general problem pretty well, read it if you don’t immediately see why this is a bad idea.

A long time ago the best practice to mitigate this risk was to cryptographically hash passwords with an algorithm like MD5 such that the password

monkeys123

becomes

5a44cf4723d8fab4394952e331d6efad

inside of an application’s database. Essentially what we’re trying to achieve is a fairly specific set of requirements for our password-scrambling scheme:

As many of these requirements overlap with the strengths of some hashing algorithms, schemes like SHA have become popular and commonplace in web frameworks.

However, a hashing algorithm is an interesting beast. Although the layman’s “information security” synapses will fire when cryptographic hashes are mentioned, hashing algorithms have strengths (and weaknesses) very specific to the use case.

It’s worthwhile now to briefly dip into the technicalities of cryptographic hashing.

Aside: Why Prefer CPU-Intensive Operations?

This is sort of a weird requirement until you grok the difference between hashing passwords and large files.

When you hash a file, you’re looking at possibly megabytes of data to crunch through bit-by-bit in order to come up with a unique fingerprint. That’s a lot of entropy that the hashing algorithm has to work with to create pseudorandom values.

When you hash a password, you’re creating a unique fingerprint for a string that is going to top out for the average person at about 10 or 20 characters. Thus, while extremely large, the domain of all possible values for the average password is significantly smaller than the entropy swirling about in a 10-megabyte file.

Furthermore, attackers can usually rely on a wordlist to serve as a base for password guesses, thereby reducing the unique values that they must brute force even further (it’s easier to start guessing at ‘password123’ rather than ‘aaaaaaaaa’.)

What this means for your hashed password is that you don’t want to enable a bad actor to be able to rapidly traverse millions of unique values quickly to try and pin down your pre-hashed credential because of this key reason:

Note that because passwords will tend to always have this attribute (for the foreseeable future), computationally expensive is a requirement we want no matter what cryptographic primitive we choose. As the domain of our unique value is drawn from human-retainable set of characters, we must ensure that computing values is prohibitively expensive for the lengths and complexities that a person can remember.

Hashing and You

According to the internet hive mind, the ideal cryptographic hashing function has the following properties: 1

  1. it is easy to compute the hash value for any given message
  2. it is infeasible to generate a message for has a given hash
  3. it is infeasible to modify a message without changing the hash
  4. it is infeasible to find two different messages with the same hash.

In short, we want to throw a hunk of information into a woodchipper and watch a unique, unguessable fingerprint exit the other end. There are some use cases for which these attributes are exceptionally well-suited:

However, you’ll note that these strengths are somewhat at odds with some of the desirable traits for a password scrambling scheme:

You’ll also note that because our domain of possible values is limited (necessitating the aforementioned high cost of calculating a hash), rainbow tables have to combated with salt. The fact we’re already compensating for the attributes of a hash is a bad sign.

Hopefully it’s evident that the intended use case for a hash does not perfectly fit our ideal algorithm for cryptographically scrambling passwords.

Enter The Key Derivation Function

Built upon other cryptographic primitives (including hashing), a key derivation function (KDF) is meant to combine a non-secret value (read: salt) and a secret value (your password) and derive a value from calculations upon these values. Their intended use case is to take a pass(word/phrase) and derive a set-length key for later use in encryption (for example, handing the password “monkeys123” to a KDF and getting back a 256-bit long value to use in AES.)

Interestingly, this also means that a KDF isn’t desgined from the get-go to be a password-scrambling algorithm (in the same way that hashing algorithms aren’t), but they are much better suited for the purpose.

We still meet some crucial requirements for our use case with a KDF, in that:

We also gain this significant attribute from a KDF:

In fact, some KDFs like scrypt are specifically designed to be prohibitively expensive to calculate.

To give you an idea of just how different these operations can be, I wrote a small ruby program to perform MD5, SHA1, SHA2, SHA3, scrypt, and Bcrypt password digests with increasing iterations. Take a look at the y-axis on the chart below, then enable the bcrypt and scrypt plots and take a look at the scale again..

500
1000
1500
2000
2500
3000
3500
0.00
0.05
0.10
0.15
0.20
0.25
md5
sha1
sha2
sha3
pbkdf2
Iterations
Seconds

Note #1: These numbers may be inaccurate, but the order of magnitude differences are what is important here. If you want to validate my claims, I’ve posted a gist just for you I’ve been using to generate the flot data.

Note #2: You’ll note that SHA-3 (or keccak) is slower than its predecessors despite my previous mention of the algorithm. There’s a couple reasons this could be. My guess is that the gem I’m using is probably an early, unoptimized implementation, or my script is skewed somehow. If anyone has any thoughts, I’d love to hear them.

Note #3: The surprisingly fast performance of pbkdf2 may be due to the fact that I implemented my data-generation script with dumb iterations rather than the parameterized iteration option for the rubygem function, but I didn’t delve too far into this odd result.

This script ran over the course of three days generating data on a 2GB Linode instance. The data isn’t terribly scientific, but it gets the point across: KDF functions are dramatically slower than hashing functions, which isn’t a nicety of our desired traits, but a requirement.

That’s a pretty drastic difference.

The tl;dr

  1. https://en.wikipedia.org/wiki/Cryptographic_hash_function 

  2. There’s a fair bit of dispute over the ideal KDF for storing passwords, but generally speaking, any of the three covered in this topic are popularly accepted options.