Tip: Storing MD5 Values (and other string/binary representations)

A common occurrence I have noticed in MySQL apps is that MD5 values are stored as 32-byte values rather than 16. Just to ‘rehash’, an MD5 value is a 128-bit (16-byte) value, typically used as a fixed-length signature of a string, useful for identifying unique strings or for one-way hashing of passwords. The binary representation takes 16 bytes, while the human-readable hexadecimal version takes twice as many: 32 characters.
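To make the size difference concrete, here is a minimal Python sketch that produces both forms of the same digest. The input string is made up for illustration. On the MySQL side the equivalent conversion would typically be UNHEX(MD5(str)) when writing and HEX(col) when reading.

```python
import hashlib

# Hypothetical input string, chosen only for illustration.
text = "hello world"

digest = hashlib.md5(text.encode("utf-8"))

raw = digest.digest()          # 16 raw bytes, the size of a BINARY(16) column
hex_form = digest.hexdigest()  # 32 hex characters, the size of a CHAR(32) column

print(len(raw), len(hex_form))

# Both forms carry the same information and convert losslessly either way.
assert bytes.fromhex(hex_form) == raw
assert raw.hex() == hex_form
```

Since the conversion is lossless in both directions, the hex form can be produced on demand for display or export while only the 16-byte form is stored.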

The same goes for any of the other hashing techniques. They tend to output a friendly hex format, which is useful in a number of cases, such as in JavaScript or within a delimited format like CSV or TSV (where random binary bytes would mess up the delimiting of data). When you’re looking to store these values, though, most of the time it makes sense to keep them in their shorter binary representation.

Another common example is IP addresses. I often see VARCHAR(16) for IPv4 addresses. Perhaps when IPv6 is more commonplace we will see VARCHAR(64) instead. IPv4 addresses are 32-bit values and can be stored as an INT UNSIGNED (4 bytes), while IPv6 addresses are 128-bit. There isn’t a native 16-byte integer type in MySQL, so a BINARY(16) or two BIGINT UNSIGNED fields would do, though perhaps software will address this as IPv6 gains adoption.
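As a rough sketch of the conversions involved, the Python below packs an IPv4 address into a 32-bit integer and an IPv6 address into either 16 raw bytes or two 64-bit halves. The addresses are documentation examples and the variable names are made up for illustration. In MySQL itself, INET_ATON() and INET_NTOA() do the IPv4 conversion, and newer versions also provide INET6_ATON() for the 16-byte form.

```python
import ipaddress

# Example addresses taken from the ranges reserved for documentation.
v4 = ipaddress.IPv4Address("192.0.2.1")
v6 = ipaddress.IPv6Address("2001:db8::1")

# IPv4: a 32-bit integer that fits in a 4-byte INT UNSIGNED column.
v4_int = int(v4)

# IPv6, option 1: 16 raw bytes for a BINARY(16) column.
v6_bytes = v6.packed

# IPv6, option 2: two 64-bit halves for two BIGINT UNSIGNED columns.
v6_int = int(v6)
high = v6_int >> 64
low = v6_int & 0xFFFFFFFFFFFFFFFF

print(v4_int, len(v6_bytes), hex(high), hex(low))

# Both representations convert back to the original address.
assert ipaddress.IPv4Address(v4_int) == v4
assert ipaddress.IPv6Address(v6_bytes) == v6
assert ipaddress.IPv6Address((high << 64) | low) == v6
```

BINARY(16) keeps the address in a single column and sorts in address order, whereas the two-integer layout needs both halves in every comparison, so the single binary column is probably the simpler choice in most schemas.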

When doing lookups on these kinds of fields, you want them as small as possible so that they can fit neatly into indexes and less processing time is spent evaluating them.

The following is a simple test to compare the speed of a CHAR(32) MD5 column versus a BINARY(16) column.

The MD5 values that are inserted are deliberately left-padded with zeros to emphasise the fact that field lengths do make a difference when searching on a field, regardless of whether the field is indexed or not. This is because we’re only populating the table with ~2^20 rows, whereas random MD5s have 2^128 possible values. If we just used random MD5s then MySQL would only have to examine the first byte or two because of our small dataset, and the difference would be negligible in such a small sample. Over millions of runs, or with a larger dataset, the difference grows.
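The prefix effect can be illustrated outside MySQL with a short Python sketch. It is only an approximation of the idea, not the test itself: the row count is a smaller, made-up sample, sequential integers stand in for the padded values, and shared_prefixes is a hypothetical helper that measures how many leading bytes neighbouring values share once sorted.

```python
import hashlib

def common_prefix_len(a: bytes, b: bytes) -> int:
    """Count the leading bytes two values share before they first differ."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def shared_prefixes(values):
    """Shared-prefix length for each pair of neighbours in sorted order."""
    ordered = sorted(values)
    return [common_prefix_len(a, b) for a, b in zip(ordered, ordered[1:])]

ROWS = 1 << 12  # hypothetical small sample; the post uses about 2^20 rows

# Random-looking 16-byte values: MD5 digests of sequential integers.
random_like = shared_prefixes(
    hashlib.md5(str(i).encode()).digest() for i in range(ROWS)
)

# Zero-padded 16-byte values: sequential integers left-padded with zero bytes.
zero_padded = shared_prefixes(i.to_bytes(16, "big") for i in range(ROWS))

print("random-like: longest shared prefix =", max(random_like), "bytes")
print("zero-padded: shortest shared prefix =", min(zero_padded), "bytes")
```

With zero padding, neighbouring values agree on almost their entire length, so a comparison has to walk nearly the whole field before it can decide. With random-looking digests, neighbours usually differ within the first few bytes, which is why field length matters far less on a small random sample.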

Output may be similar to