The power of base convertion & short MD5 hashes
  1. 1. The story
  2. 2. The practice
    1. 2.1. Base convertion
  3. 3. Getting short MD5 hashes in PHP
  4. 4. Can we do more?
    1. 4.1. Encryption
    2. 4.2. A second of Base64
      1. 4.2.1. Compression?
      2. 4.2.2. The code

The story

A few days ago I was writing a plugin for my blog engine that was adding a nice feature of subscribing to new comments in some post or thread. It also included a few scripts to allow users manage their subscriptions (to very least to let them unsubscribe from threads).
Obviously, I needed some means to authenticate users so only the real owner of an e-mailbox can manage his subscriptions. For this I chose to use my favourite method which is easy to handle:

  1. Have a secret phrase known to server-side only.
  2. Other info that we want to link together (for example, e-mail address).
  3. Now simply hash all this in one string using MD5 and get an irreversible secret phrase that, however, can be checked for validity – and only we can generate valid phrases because nobody but the server knows the secret ingredient – item #1.

In fact, this is very straightforward:

PHP
$secret 'blowhe@d';
$email 'proger@localhost';
$ticket md5("$secret:$email");  // => 791e79bcf93cbd108a4e18090e28f3c1
echo "The link to manage your mailbox is: subscriptions.php?email=$email&ticket=$ticket";

This method is so beautiful that it even doesn’t require a database – in fact, it doesn’t require anything! Of course it has certain downsides such as these:

  1. Once secret phrase is revealed anybody can start generating tickets even without having access to the server (provided they also know the exact structure of the hashing phrase).
  2. Once secret is changed it will invalidate all previously given tickets.
  3. MD5 hash is quite long (32 chars) when it appears in plain/text e-mail bodies.

However, first two items are fine if this approach is used for short-living targets (such as login session) – even more, the system can automatically change the secret phrase, say, each midnight and thus even if the old one was somehow revealed it’ll only be valid for a day at most.

But the third point differs. Actually, IMO, it’s pretty lame to see links which query part is longer than the location itself.
And, besides my present problem, generating a shorter MD5 hash seemed like a goodtarget to release my brain at :D

Here the story ends and practice starts.

The practice

But a bit of a theory first. Each MD5 digest is a 16-byte long binary value. When converting it to string a commonly used algorithm is simple: represent each byte in hexadecimal form (char case-insensitive of course) and go:

PHP
function RawToMD5($raw) {
  
$result '';
  foreach (
str_split($raw) as $i => $byte) {
    
$result .= str_pad(dechex(ord($byte)), 2'0'STR_PAD_LEFT);
  }
  return 
$result;
}

This results in 16 * 2 = 32 characters of hash. We can say that by converting raw hash into text we’re performing a base convertion between 10 and 16 notations.
But the alphabet of hex notation only consists of 16 chars. Can we make it longer so thatthe resulting hash string will be shorter?

Yes, we can!

The key point here is base convertion: we can treat raw 16-byte MD5 hash as individual integers (for example, pairs of bytes, double words or more) and re-encode them using a custom alphabet.

Base convertion

Now we’ve got to the main part. We need to determine the length of individual components and how long our alphabet has to be to provide the shortest string combination.
Let’s skip right to the point and split 16-byte hash into double words. How many letters we need?The formula is: root of index LENGTH (256 ^ 4). Its components are:

LENGTH
Desired length of string that can represent full range of a byte (256 values).
256^4
Max byte value raised to the power of 4 where 4 represents integer’s length (in our case it’s double word).

Now a few quick calculations («#r» means «root of index #»):

  1. 5r(256^4) ≈84.4 – in other words, to represent a double word as a string of 5 characters we need at most 85 characters.
  2. 6r(256^4) ≈40.3 – if it’s 6 chars long we can manage with just 41 chars in our alphabet.

We will use the above two because they look most appropriate for our needs. However, for demonstration purposes here are more calculations:

  1. 9r(256^8) ≈138.2 – an MD5 hash can be represented as just 18 chars long if we can allow 139 different symbols in our alphabet.
  2. 10r(256^8) ≈84.4 – the power of math! It’s the same as item #1.

Okay, let us proceed to the alphabets then. The first calculation shows us that if we get 85 symbols we will have MD5 hash 20 characters long. The following symbols can apply:

0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ~!@#$%^&*()"№;:?/'[]<>

Most of them are not safe to use in URLs and will require encoding as %XX which will make hashes even longer than they currently are.

So we’ll carry on the next one. 41 chars? That’s much easier! 0123456789abcdefghijklmnopqrstuvwxyzABCDE – just Latin letters and digits. Neat! This is what we will use further to get MD5 hashes just 24 characters long (of course, case-sensitive for «ABCDE»).

Getting short MD5 hashes in PHP

For info on how to convert numbers between notations see this article on Wikipedia.

General idea:

  1. Since PHP 5.0 it’s possible to get raw version of MD5 hashes by passing PHPtrue as second parameter to md5 or md5_file functions. So we first retrieve this raw hash (in form of 16-chars long string).
  2. Then we convert each octet there to a string using the alphabet we’ve decided on above.
  3. While doing that, don’t forget to pad resulting string with a symbol standing for zero in that alphabet (in our case it’s character «0»).

Fixed 2 April 2013 – now compatible with PHP 5.4 (removed «String offset cast occurred» notice).

PHP
/* Functions for converting between notations and short MD5 generation.
 * No license (public domain) but backlink is always welcome :)
 * By Proger_XP. http://proger.i-forge.net/Short_MD5/OMF
 */
define('MD5_24_ALPHABET''0123456789abcdefghijklmnopqrstuvwxyzABCDE');

function 
MD5_24($str) {
  return 
RawToShortMD5(MD5_24_ALPHABETmd5($strtrue));
}

function 
MD5File_24($file) {
  return 
RawToShortMD5(MD5_24_ALPHABETmd5_file($filetrue));
}

function 
MD5_RawTo24($str) {
  return 
RawToShortMD5(MD5_24_ALPHABET$str);
}

function 
MD5_32to24($hash32) {
  return 
RawToShortMD5(MD5_24_ALPHABETMD5ToRaw($hash32));
}

function 
MD5_24toRaw($hash24) {
  return 
ShortToRawMD5(MD5_24_ALPHABET$hash24);
}

function 
MD5_24to32($hash24) {
  return 
RawToMD5(ShortToRawMD5(MD5_24_ALPHABET$hash24));
}

function 
RawToShortMD5($alphabet$raw) {
  
$result '';
  
$length strlen(DecToBase($alphabet2147483647));

  foreach (
str_split($raw4) as $dword) {
    
$dword ord($dword[0]) + ord($dword[1]) * 256 ord($dword[2]) * 65536 ord($dword[3]) * 16777216;
    
$result .= str_pad(DecToBase($alphabet$dword), $length$alphabet[0], STR_PAD_LEFT);
  }

  return 
$result;
}

  function 
DecToBase($alphabet$dword) {
    
$rem = (int) fmod($dwordstrlen($alphabet));
    if (
$dword strlen($alphabet)) {
      return 
$alphabet[$rem];
    } else {
      return 
DecToBase($alphabet, ($dword $rem) / strlen($alphabet)).$alphabet[$rem];
    }
  }

function 
ShortToRawMD5($alphabet$short) {
  
$result '';
  
$length strlen(DecToBase($alphabet2147483647));

  foreach (
str_split($short$length) as $chunk) {
    
$dword BaseToDec($alphabet$chunk);
    
$result .= chr($dword 0xFF) . chr($dword >> 0xFF) . chr($dword >> 16 0xFF) . chr($dword >> 24);
  }

  return 
$result;
}

  function 
BaseToDec($alphabet$str) {
    
$result 0;
    
$prod 1;

    for (
$i strlen($str) - 1$i >= 0; --$i) {
      
$weight strpos($alphabet$str[$i]);
      if (
$weight === false) {
        throw new 
Exception('BaseToDec failed - encountered a character outside of given alphabet.');
      }

      
$result += $weight $prod;
      
$prod *= strlen($alphabet);
    }

    return 
$result;
  }

function 
MD5ToRaw($str) { return pack('H*'$str); }
function 
RawToMD5($raw) { return bin2hex($raw); }

Pretty much that’s all, now you can generate a short (24-chars long) MD5 hash like this:

PHP
$hash_24  MD5_24('hello!');     // => p71Eru8C4256573dg1hemv9A
$standard md5('hello!');        // => 5a8dd3ad0756a93ded72b823b19dd877
$ours     MD5_24to32($hash24);  // => 5a8dd3ad0756a93ded72b823b19dd877

As you can see we’re saving 8 bytes when compared to standard hash.

Can we do more?

Of course! There is always room for improvements.

For example, if we use an alphabet containing 85 chars we can squeeze the hash further and get it 20 chars long. Like this:

PHP
$md5_20 '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ~!@#$%^&*()"№;:?\/\'[]<>';

$hash_20  RawToShortMD5($md5_20md5('hello!'true));  // => T"!nkj&ID^bE(%YCI5E№
$standard RawToMD5(ShortToRawMD5($md5_20$hash_20));   // => 5a8dd3ad0756a93ded72b823b19dd877

Encryption

We can utilize the ability to use different alphabets to disgize some information. We can also use it to «encrypt» MD5 hashes making it even more harder to bruteforce.
For example, we can shuffle symbols used for 24-long MD5:

PHP
$alphabet 'D1yraq9s4wkhx0Alzdi6pfgeB5mn3t7jubcvoCE82';
MD5_24('hello!');                            // => p71Eru8C4256573dg1hemv9A
RawToShortMD5($alphabetmd5($strtrue));   // => 5s12n74Eayq9qsr0z1dAgjwo

So if someone wants to revert the hash in addition to figuring out the secret phrase he’ll need to determine the order of symbols in the alphabet.

A second of Base64

Everyone knows about base64 – a widely-used method to encode binary (but technically any) data into sequence of Latin letters, digits and sometimes a few other symbols.
It uses different principle from the one being described in this article but the purpose isnearly the same: to represent data in different form to ensure all symbols belong to a particular alphabet so data won’t get corrupted during transmission or storing.

In a nutshell, base64 divides all input stream into groups of 3 bytes and takes 4 numbers from them together, each number 6 bits long. 6 bits can have a maximum of 64 different values (that’s where base64’s name originales from) – which is exactly the total count of digits, lower-case, upper-case Latin letters (10 + 26 * 2 = 62) and 2 other symbols (in standard base64 they are «+» and «/» for 62nd and 63rd values correspondingly).
This way each 3 bytes are represented as 4 base64 symbols, that’s why after encodinga stream it becomes 33% longer (4 / 3 = 1.33).

With base ceonvertion we can look at this from different angle. Let’s say that we want to encode each 4 bytes as 5 symbols (20% overhead, 5 / 4 = 1.2). For this we’ll need an alphabet 85 chars long (as described above).
Then we read input stream in chunks by 4 bytes each, treat it as UINT32 and base-convertit using our alphabet for 5-chars long string. Now any input stream will be 20% longer.

Of course, such encoding scheme will contain «unsafe» characters that base64 avoids (such as symbols: % $ # * and so on). And, then, 13% overhead difference isn’t much.

Compression?

But we can go further and even create some sort of compression. Let’s think about it: what we do when we encounter an integer that, when represented in our custom notation, is shorter than expected length (for example, if each 4 bytes are represented at most with 6 symbols but this particular number fit into 5 chars)? We pad it with character which stands for «zero» in our alphabet. We do so with normal numbers too: 020, 00FF, etc.
When we encode a stream we add such padding symbols so we can determine the borders of numbergroups when decoding it. However, we don’t have to align each group!

This means that when we encounter a «number» in our 5-chars long notation, say, «Qx1» then instead of padding it with two «0»s on the left («00Qx1») we can add just one symbol, some character outside of our alphabet that we consider a pad character – like underscore («_»). This way we’re getting «Qx1_», saving 1 byte and, in fact, avoiding encoding overhead for this octet!

In other words, if input contains a lot of small values it can be encoded with less or none overhead thanks to the padding trick. If we’re using a 85-symbols alphabet for each 4 bytes of input then last 5th byte for the encoded string will be used if those 4 bytes are greater than 0x03197500 – well, not 100%-precise calculation but it must be somewhere close.
And we will save 1 byte if input integer is smaller than 0x00095200. And if it containsof just zeros we’ll save lot! :)

The code

For experemental purposes I’ve written a de/coder that works on chunks 4 bytes long and converts them using custom alphabets (of any length). The resulting stream will contain only symbols from that alphabet plus 2 extra symbols: pad character («_») and end-of-stream pad character («=»). EOS character can be one of the alphabet symbols.

The decoder reverts the convertion made. Both functions are written in PHP and use base convertion routines explained above.
Of course performance of encoding & decoding is very poor in PHP when compared to built-inBase64 functions (it took my server 2 seconds to encode a 300 KiB file). But it should be better when written in a compilable language.

PHP
function EncodeBinary($alphabet$str$padChar '_'$endChar '=') {
  
$length strlen(DecToBase($alphabet2147483647));

  
$lastChunkLen fmod(strlen($str), 4);
  
$lastPad str_repeat($endChar$lastChunkLen);
  
$lastChunkLen and $str .= str_repeat("\0"$lastChunkLen);

  
$result '';
  for (
$pos 0$pos strlen($str); $pos += 4) {
    
$dword substr($str$pos4);
    
$dword ord($dword[0]) + ord($dword[1]) * 256 ord($dword[2]) * 65536 ord($dword[3]) * 16777216;
    
$chunk $dword === '' DecToBase($alphabet$dword);
    
$result .= $chunk;

    
strlen($chunk) < $length and $result .= $padChar;
  }

  return 
$result.$lastPad;
}

function 
DecodeBinary($alphabet$str$padChar '_'$endChar '=') {
  
$length strlen(DecToBase($alphabet2147483647));

  
$result '';
  
$prev 0;
  do {
    
$pos strpos($str$padChar$prev);

    if (
$pos === false) {   // '' or 'abcde' or 'abcde==='
      
if (!isset($str[$prev])) {
        break;
      } elseif (
$str[$prev] === $endChar) {
        
$result substr($result0, -* (strlen($str) + $prev));
        break;
      } else {
        
$fullChunk true;
      }
    } else {
      
$fullChunk $pos >= $prev $length;
    }

    
$fullChunk and $pos $prev $length;
    
$chunk substr($str$prev$pos $prev);
    
$prev $pos + ((int) !$fullChunk);

    
$dword BaseToDec($alphabet$chunk);
    
$chunk chr($dword 0xFF) . chr($dword >> 0xFF) . chr($dword >> 16 0xFF) . chr($dword >> 24);
    
$result .= $chunk;
  } while (
$pos !== false);

  return 
$result;
}

Usage examples:

PHP
$alphabet '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ~!@#$%^&*()"№;:?\/\'[]<>';
$str EncodeBinary($alphabet"Hello \0 World!");   // => y)0VNaoiK5y)TC'1f_=
$dec DecodeBinary($alphabet$str);                // => Hello \0 World!

Hope you had good time reading my post! Feel free to drop a comment below :)