How user passwords should be stored in a database? Some systems keep them in cleartext, some encrypt them, some use hash functions. What must be done to ensure proper security for our users? Which hashing algorithms are good and which are bad?
Passwords in cleartext
Why to even bother doing anything with the passwords? Why not leave them in cleartext? There are a few good reasons:
- password confirms identity
If someone logs in using John's password, it can be assumed it was John as John should be the only person that knows his password. From some perspective using a password in IT system is like a signature on a document. It should really be kept safe. IT system developers and administrators should do the same - protect users passwords. Otherwise, they would be meaningless. - database administrators would know them
As John must be the only person knowing his password, the database administrators cannot know it either so it cannot be stored in cleartext.
- Why do we care about the administrators? They can see John's data in the database anyway.
- Yeah, they can but without John's password they cannot log into the system as John and transfer John's money pretending to be John.
Without knowing his password, if the administrators do something bad, it will be possible to distinguish it between John's action and inside job. - hackers would know them after a successful attack
But let's not make administrators bad persons. Although, they should be trustworthy, there is a risk of a hacker attack and a data leak. If a hacker gets access to the database, he/she could find John's password in it. If it was stored in cleartext, the hacker could read it, use it to log into the system as John and do awful things to John's money.
So now we are sure, hopefully, that user passwords should not be stores in cleartext. How about encryption?
Encrypted passwords
Encryption is a process of changing original data to different data using an encryption key in such a way that restoring the original data from the encrypted data is practically possible only with the encryption key.
If you know the encryption key, you can decrypt the data.
Even though it seems like the way to go, it is not.
Let's assume, we will encrypt passwords in our system for all users. We have to create an encryption key. That is not a problem, we can randomize it, encrypt user's passwords in the database and ... we have to store the encryption key somewhere so we could encrypt passwords for the users they will create their accounts tomorrow.
Storing the key in the database violates the whole purpose of encryption - if database administrators or hackers get access to the database, they can read both: encrypted John's password and the encryption key so they will be able to decrypt the password.
Other option is that we will put it in the application code but then the developers will know it. If we use OS environment variable, system administrators will know it.
We can use a key store but it complicates the architecture and it has it's own disadvantages.
Actually, we unnecessarily focused on encryption. Something as complex as encryption is not needed for that purpose. A few minutes ago I told you that the encryption key allows decrypting data. We do not need that, really. John knows his password, we do not have to and we cannot.
We need something that will transform the John's password in a way that restoring the original one will be impossible. Let's meet hashing.
Hashed passwords
Hashing is a process of changing original data to different data in such a way that restoring the original data is impossible.
Sometimes people are surprised that it is possible to create an algorithm that transforms data without a possibility to restore the original data. But actually it is very easy to find one.
For example if original data is a number, we can sum up all digits. If I tell you that the result is 15, you have no possibility to calculate that the original number was 12345.
It is one of hash function requirements - must be irreversible.
Probably you feel that something is not right here. Why does anybody care whether the original data was 12345 or 35412, or 91221 if all of them result in 15? You are correct, that is not good. This is why there is an additional requirement if the hash function is used for hashing passwords - it should not allow hash collisions.
It means that there should not be two different passwords which transform to the same hash. Obviously, summing up all digits is a bad candidate for hashing passwords.
A hash function must also be deterministic. It does not matter how many times and when the hash is computed, it must always be the same for the same input data.
And the forth requirement that is important for us - hash function should have a high entropy.
It means that if we compute hashes from two very similar passwords, the hashes cannot be similar. In other words - a small change in the original data, cause a big change in a hash. Without this property, a hacker could check how close he/she is from guessing the original data.
There is one more. Not all hash functions must have this property, but it is important for hashing passwords.
The function should be slow. Yes, slow. The slower, the better. It substantially blocks brute force attack.
Validating hashed passwords
Wait a minute. If the password is hashed and hashing is irreversible, how can the system validate whether the password entered by a user is correct?
That is easy, there is no magic. The system has to hash whatever the user provided and compare it to the hashed password in the database for that user. If both matches, the password is correct. Otherwise it is incorrect.
Validating passwords is not a big deal but there is a feature that cannot work the same as with cleartext passwords. It is a reminding password functionality. If a user forgets his/her password, there is no way to restore it. But again, it is not a problem because the user can set a new password, the system does not have to send the current one.
Hash function examples
We have quickly discussed five properties of a good function for hashing passwords. They don't seem easy to meet but we do not have devise them.
Some smart people have already invented them. There are a number o hash functions but not all of them fulfill requirements for hashing passwords. Some of them were considered strong but at some point in time security issues were found.
Currently recommended by experts are: PBKDF2, bcrypt and scrypt. Some other functions have been considered unsafe for hashing passwords over time like: famous MD5 and SHA-1, SHA-2, SHA-3. They do not meet the fifth property - they are incredibly fast which makes them vulnerable to brute force attacks.
MS LAN Manager hash was widely used in old versions of Windows operating system to store passwords and unfortunately it is still used by some SMB 3-rd party implementations. This is a big no-no because of it's weak implementation. It limits the password length to 14 characters which should be enough to announce a crime but it has a bigger problem. The password is split to two 7 characters blocks and hashed separately. So cracking the password is much easier - it is like cracking two passwords with 7 characters.
Attacks
We are almost done with the concept of hashing except one thing. Before I tell you what it is about, let's dive a little into more technicalities.
As most of the systems, you probably have USERS table in the database. Among other columns like username, there is a password column. As we already know that passwords cannot be stored as cleartext, we hash them.
Do you see anything strange in the table? A hacker could tell which users have the same passwords by looking for the same hashes. As the hash function is deterministic the following statement is true - if the original passwords are the same, the hashed passwords are also the same. It gives an unwanted opportunity to the hacker. He/she can find the most common hashed password in the database, find all users that have the same one and try to steal the original passwords from these users using sociotechnology. If any of this attempt succeeds, the hacker would have a password for all these users that used the same one.
There is also another hacking technique that is possible to apply. Imagine that a number of users is fairly big. Let's say a few millions. The hacker use the hash function to hash popular passwords like abc123, Password, admin, 123456. Then he/she could check whether the hash from USERS table match any of the hashes of popular passwords. If there are millions of users, there is a high enough probability that some users used the most common passwords in the world.
Can users that never set popular passwords sleep well? Hmm ... not necessarily. Why not to generate hashes for all the passwords up to 6 characters, or 8 or 10? Yeah, I know, it would require billions of years on an old laptop, or even more. But wait, what if the hacker rent a data center or use hacked computers for distributed computations and yeah ... he/she can even wait for the result a year.
Of course the cost of such operation would be enormous and cracking one account does justify the cost and the effort. On the other hand, the result would be a list of trillions of passwords mapped to their hashes so after all that effort the hacker could crack passwords in any system that uses the same hash function. Do you feel the power the hacker would have?
You are not alone, such concept of pre-generating mappings with some ingenious enhancements are widely known as Rainbow Tables. They are traded, shared, exchanged all over the world.
You may find them for some hash functions very easily like LM, NTLM, MD5, SHA1 ( https://project-rainbowcrack.com/table.htm). Take a closer look at them, they can crack passwords up to 9 or 10 characters. These are quite long passwords and still can be cracked. There are even online tools to do that without downloading whole Rainbow Tables (https://crackstation.net/). This one supports really big number of hash functions.
Are you scared as a user that you password can be easily cracked? Please don't be, just be aware of that and choose strong long passwords. As a software developer, take it really seriously. The system you build should resist Rainbow Attacks even if users use shorter passwords. How? Let's meet salt.
Salt
Do you remember the USERS table we talked about a few minutes ago? I did not design that beautiful table for one image, no no no. This time instead of just hashing the passwords, I create a new column with a random text. Random for each row. And I call it "salt". Now, I concatenate the salt and the password and then I apply the hash function on the result.
Thanks to that trick, users that initially had the same passwords, have different hashes now so the hacker cannot say which users chose the same passwords.
Moreover, Rainbow Attacks are no longer practical because if we generate long salts, longer than 10 characters, even though the user password is only 6 characters long, the hashed text is over 16 characters long but Rainbow Tables I showed you support up to 10 characters passwords. So the hacker will not be able to easily use Rainbow Attack.
I will just mention that some hash functions do not require storing salts separately as they include them in the output hash which does not change the security but makes programming a little easier.
Practical Advice
When developing a software system or just maintaining one, make sure you do not store user passwords in cleartext nor encrypted. They must be hashed.
Use secure hashing functions that are currently widely accepted for hashing passwords like bcrypt, scrypt or PBKDF2. If you use Java, I would consider the last one as necessary libraries are shipped with JDK. At the moment of creating this material, bcrypt and scrypt require using external libraries which is of course not a blocker - if you want you can still use them.
Never hash passwords with functions which are considered insecure like MD5, SHA-1, SHA-2, SHA-3, LM, NTLM, CRC32. They might be fine for validating correctness of data transfer but not for hashing passwords. They were compromised by finding methods to generate hash collisions or they internally use too short keys.
Take advantage of salts generated separately for each user. Use long salts - scrypt and bcrypt used in Linux have 128 bit salts and they seem strong enough.