A few days ago I was writing a plugin for my blog engine that was adding a nice feature of subscribing to new comments in some post or thread. It also included a few scripts to allow users manage their subscriptions (to very least to let them unsubscribe from threads).
Obviously, I needed some means to authenticate users so only the real owner of an e-mailbox can manage his subscriptions. For this I chose to use my favourite method which is easy to handle:

PHP

$secret = 'blowhe@d';
$email = 'proger@localhost';
$ticket = md5("$secret:$email");  // => 791e79bcf93cbd108a4e18090e28f3c1
echo "The link to manage your mailbox is: subscriptions.php?email=$email&ticket=$ticket";

This method is so beautiful that it even doesn’t require a database – in fact, it doesn’t require anything! Of course it has certain downsides such as these:

However, first two items are fine if this approach is used for short-living targets (such as login session) – even more, the system can automatically change the secret phrase, say, each midnight and thus even if the old one was somehow revealed it’ll only be valid for a day at most.

But the third point differs. Actually, IMO, it’s pretty lame to see links which query part is longer than the location itself.
And, besides my present problem, generating a shorter MD5 hash seemed like a goodtarget to release my brain at :D

The practice

But a bit of a theory first. Each MD5 digest is a 16-byte long binary value. When converting it to string a commonly used algorithm is simple: represent each byte in hexadecimal form (char case-insensitive of course) and go:

PHP

function RawToMD5($raw) {
  $result = '';
  foreach (str_split($raw) as $i => $byte) {
    $result .= str_pad(dechex(ord($byte)), 2, '0', STR_PAD_LEFT);
  }
  return $result;
}

This results in 16 * 2 = 32 characters of hash. We can say that by converting raw hash into text we’re performing a base convertion between 10 and 16 notations.
But the alphabet of hex notation only consists of 16 chars. Can we make it longer so thatthe resulting hash string will be shorter?

The key point here is base convertion: we can treat raw 16-byte MD5 hash as individual integers (for example, pairs of bytes, double words or more) and re-encode them using a custom alphabet.

Base convertion

Now we’ve got to the main part. We need to determine the length of individual components and how long our alphabet has to be to provide the shortest string combination.
Let’s skip right to the point and split 16-byte hash into double words. How many letters we need?The formula is: root of index LENGTH (256 ^ 4). Its components are:

We will use the above two because they look most appropriate for our needs. However, for demonstration purposes here are more calculations:

Okay, let us proceed to the alphabets then. The first calculation shows us that if we get 85 symbols we will have MD5 hash 20 characters long. The following symbols can apply:

Most of them are not safe to use in URLs and will require encoding as %XX which will make hashes even longer than they currently are.

So we’ll carry on the next one. 41 chars? That’s much easier! 0123456789abcdefghijklmnopqrstuvwxyzABCDE – just Latin letters and digits. Neat! This is what we will use further to get MD5 hashes just 24 characters long (of course, case-sensitive for «ABCDE»).

Getting short MD5 hashes in PHP

Fixed 2 April 2013 – now compatible with PHP 5.4 (removed «String offset cast occurred» notice).

PHP

/* Functions for converting between notations and short MD5 generation.
 * No license (public domain) but backlink is always welcome :)
 * By Proger_XP. http://proger.i-forge.net/Short_MD5/OMF
 */
define('MD5_24_ALPHABET', '0123456789abcdefghijklmnopqrstuvwxyzABCDE');

function MD5_24($str) {
  return RawToShortMD5(MD5_24_ALPHABET, md5($str, true));
}

function MD5File_24($file) {
  return RawToShortMD5(MD5_24_ALPHABET, md5_file($file, true));
}

function MD5_RawTo24($str) {
  return RawToShortMD5(MD5_24_ALPHABET, $str);
}

function MD5_32to24($hash32) {
  return RawToShortMD5(MD5_24_ALPHABET, MD5ToRaw($hash32));
}

function MD5_24toRaw($hash24) {
  return ShortToRawMD5(MD5_24_ALPHABET, $hash24);
}

function MD5_24to32($hash24) {
  return RawToMD5(ShortToRawMD5(MD5_24_ALPHABET, $hash24));
}

function RawToShortMD5($alphabet, $raw) {
  $result = '';
  $length = strlen(DecToBase($alphabet, 2147483647));

  foreach (str_split($raw, 4) as $dword) {
    $dword = ord($dword[0]) + ord($dword[1]) * 256 + ord($dword[2]) * 65536 + ord($dword[3]) * 16777216;
    $result .= str_pad(DecToBase($alphabet, $dword), $length, $alphabet[0], STR_PAD_LEFT);
  }

  return $result;
}

  function DecToBase($alphabet, $dword) {
    $rem = (int) fmod($dword, strlen($alphabet));
    if ($dword < strlen($alphabet)) {
      return $alphabet[$rem];
    } else {
      return DecToBase($alphabet, ($dword - $rem) / strlen($alphabet)).$alphabet[$rem];
    }
  }

function ShortToRawMD5($alphabet, $short) {
  $result = '';
  $length = strlen(DecToBase($alphabet, 2147483647));

  foreach (str_split($short, $length) as $chunk) {
    $dword = BaseToDec($alphabet, $chunk);
    $result .= chr($dword & 0xFF) . chr($dword >> 8 & 0xFF) . chr($dword >> 16 & 0xFF) . chr($dword >> 24);
  }

  return $result;
}

  function BaseToDec($alphabet, $str) {
    $result = 0;
    $prod = 1;

    for ($i = strlen($str) - 1; $i >= 0; --$i) {
      $weight = strpos($alphabet, $str[$i]);
      if ($weight === false) {
        throw new Exception('BaseToDec failed - encountered a character outside of given alphabet.');
      }

      $result += $weight * $prod;
      $prod *= strlen($alphabet);
    }

    return $result;
  }

function MD5ToRaw($str) { return pack('H*', $str); }
function RawToMD5($raw) { return bin2hex($raw); }

Pretty much that’s all, now you can generate a short (24-chars long) MD5 hash like this:

PHP

$hash_24  = MD5_24('hello!');     // => p71Eru8C4256573dg1hemv9A
$standard = md5('hello!');        // => 5a8dd3ad0756a93ded72b823b19dd877
$ours     = MD5_24to32($hash24);  // => 5a8dd3ad0756a93ded72b823b19dd877

Can we do more?

For example, if we use an alphabet containing 85 chars we can squeeze the hash further and get it 20 chars long. Like this:

PHP

$md5_20 = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ~!@#$%^&*()"№;:?\/\'[]<>';

$hash_20  = RawToShortMD5($md5_20, md5('hello!', true));  // => T"!nkj&ID^bE(%YCI5E№
$standard = RawToMD5(ShortToRawMD5($md5_20, $hash_20));   // => 5a8dd3ad0756a93ded72b823b19dd877

Encryption

We can utilize the ability to use different alphabets to disgize some information. We can also use it to «encrypt» MD5 hashes making it even more harder to bruteforce.
For example, we can shuffle symbols used for 24-long MD5:

PHP

$alphabet = 'D1yraq9s4wkhx0Alzdi6pfgeB5mn3t7jubcvoCE82';
MD5_24('hello!');                            // => p71Eru8C4256573dg1hemv9A
RawToShortMD5($alphabet, md5($str, true));   // => 5s12n74Eayq9qsr0z1dAgjwo

So if someone wants to revert the hash in addition to figuring out the secret phrase he’ll need to determine the order of symbols in the alphabet.

A second of Base64

Everyone knows about base64 – a widely-used method to encode binary (but technically any) data into sequence of Latin letters, digits and sometimes a few other symbols.
It uses different principle from the one being described in this article but the purpose isnearly the same: to represent data in different form to ensure all symbols belong to a particular alphabet so data won’t get corrupted during transmission or storing.

In a nutshell, base64 divides all input stream into groups of 3 bytes and takes 4 numbers from them together, each number 6 bits long. 6 bits can have a maximum of 64 different values (that’s where base64’s name originales from) – which is exactly the total count of digits, lower-case, upper-case Latin letters (10 + 26 * 2 = 62) and 2 other symbols (in standard base64 they are «+» and «/» for 62^nd and 63^rd values correspondingly).
This way each 3 bytes are represented as 4 base64 symbols, that’s why after encodinga stream it becomes 33% longer (4 / 3 = 1.33).

With base ceonvertion we can look at this from different angle. Let’s say that we want to encode each 4 bytes as 5 symbols (20% overhead, 5 / 4 = 1.2). For this we’ll need an alphabet 85 chars long (as described above).
Then we read input stream in chunks by 4 bytes each, treat it as UINT32 and base-convertit using our alphabet for 5-chars long string. Now any input stream will be 20% longer.

Of course, such encoding scheme will contain «unsafe» characters that base64 avoids (such as symbols: % $ # * and so on). And, then, 13% overhead difference isn’t much.

Compression?

But we can go further and even create some sort of compression. Let’s think about it: what we do when we encounter an integer that, when represented in our custom notation, is shorter than expected length (for example, if each 4 bytes are represented at most with 6 symbols but this particular number fit into 5 chars)? We pad it with character which stands for «zero» in our alphabet. We do so with normal numbers too: 020, 00FF, etc.
When we encode a stream we add such padding symbols so we can determine the borders of numbergroups when decoding it. However, we don’t have to align each group!

This means that when we encounter a «number» in our 5-chars long notation, say, «Qx1» then instead of padding it with two «0»s on the left («00Qx1») we can add just one symbol, some character outside of our alphabet that we consider a pad character – like underscore («_»). This way we’re getting «Qx1_», saving 1 byte and, in fact, avoiding encoding overhead for this octet!

In other words, if input contains a lot of small values it can be encoded with less or none overhead thanks to the padding trick. If we’re using a 85-symbols alphabet for each 4 bytes of input then last 5^th byte for the encoded string will be used if those 4 bytes are greater than 0x03197500 – well, not 100%-precise calculation but it must be somewhere close.
And we will save 1 byte if input integer is smaller than 0x00095200. And if it containsof just zeros we’ll save lot! :)

The code

For experemental purposes I’ve written a de/coder that works on chunks 4 bytes long and converts them using custom alphabets (of any length). The resulting stream will contain only symbols from that alphabet plus 2 extra symbols: pad character («_») and end-of-stream pad character («=»). EOS character can be one of the alphabet symbols.

The decoder reverts the convertion made. Both functions are written in PHP and use base convertion routines explained above.
Of course performance of encoding & decoding is very poor in PHP when compared to built-inBase64 functions (it took my server 2 seconds to encode a 300 KiB file). But it should be better when written in a compilable language.

PHP

function EncodeBinary($alphabet, $str, $padChar = '_', $endChar = '=') {
  $length = strlen(DecToBase($alphabet, 2147483647));

  $lastChunkLen = fmod(strlen($str), 4);
  $lastPad = str_repeat($endChar, $lastChunkLen);
  $lastChunkLen and $str .= str_repeat("\0", 4 - $lastChunkLen);

  $result = '';
  for ($pos = 0; $pos < strlen($str); $pos += 4) {
    $dword = substr($str, $pos, 4);
    $dword = ord($dword[0]) + ord($dword[1]) * 256 + ord($dword[2]) * 65536 + ord($dword[3]) * 16777216;
    $chunk = $dword === 0 ? '' : DecToBase($alphabet, $dword);
    $result .= $chunk;

    strlen($chunk) < $length and $result .= $padChar;
  }

  return $result.$lastPad;
}

function DecodeBinary($alphabet, $str, $padChar = '_', $endChar = '=') {
  $length = strlen(DecToBase($alphabet, 2147483647));

  $result = '';
  $prev = 0;
  do {
    $pos = strpos($str, $padChar, $prev);

    if ($pos === false) {   // '' or 'abcde' or 'abcde==='
      if (!isset($str[$prev])) {
        break;
      } elseif ($str[$prev] === $endChar) {
        $result = substr($result, 0, -1 * (4 - strlen($str) + $prev));
        break;
      } else {
        $fullChunk = true;
      }
    } else {
      $fullChunk = $pos >= $prev + $length;
    }

    $fullChunk and $pos = $prev + $length;
    $chunk = substr($str, $prev, $pos - $prev);
    $prev = $pos + ((int) !$fullChunk);

    $dword = BaseToDec($alphabet, $chunk);
    $chunk = chr($dword & 0xFF) . chr($dword >> 8 & 0xFF) . chr($dword >> 16 & 0xFF) . chr($dword >> 24);
    $result .= $chunk;
  } while ($pos !== false);

  return $result;
}

The story