18 Sep 2019

The other day I came across a problem in system integration with an interesting math twist: one-way multi-thread or multi-process synchronize records from one system to another. At the source, records are identified by sequential Globally Unique Identifiers (GUID) that destination doesn't support. Instead destination identifies records by up to 24 alphanumeric characters (0-9, a-z, 36 symbols) of which at least the first character is reserved for a preamble.

The challenge is then how to map a sequential GUID, such 63b0e1db-cd2f-4265-b4f1-eb4b436b6adf, onto an alphanumeric numeric string, while supporting multi-process and multi-threading.

A GUID in its standard base 16 representation, even without dashes, is 32 characters, also referred to as digits, or 128 bits in length. Mapping it as-is to destination thus isn't an option. But taking advantage of the larger base of the alphanumeric alphabet, if we encode it not in base 16 but base 36 we can shorten its length.

The relationship between base \(b\), number of digits \(d\), and number of states \(s\) is defined by the following equation: $$\begin{eqnarray*} b^d = s & & \textrm{\{states from base and digits\}}\\ \log{(b^d)} = \log(s) & & \textrm{\{apply \(\log\) to both sides}\}\\ d\log(b) = \log(s) & & \textrm{\{apply rule of \(\log(x^y) = y \log(x)\)}\}\\ d = \dfrac{\log(s)}{\log(b)} & & \textrm{\{divide both sides by \(\log(b)\)}\} \end{eqnarray*}$$ With \(l\) bits available, one may represent up to \(2^l\) states \(s\). Reserving one state for zero, the \(l\) bits may represent numbers 0 through \(2^{l} - 1\) (like in base 10 where two digits represent \(10^2\) states or numbers 0 through 99). Since a bit is an indivisible unit, the result must be rounded upwards: $$d = \left\lceil\dfrac{\log{(2^{l})}}{\log{(b)}}\right\rceil = \left\lceil\dfrac{\log{(2^{128})}}{\log(36)}\right\rceil = \lceil{24.75...}\rceil = 25$$ As we see, base 36 encoding alone doesn't solve our problem.

Say we reserve three digits for a preamble, how many bits can we fit into the remaining 21 digits when base 36 encoded? To find the answer, we must solve for \(l\), and since the bit is an indivisible unit, the result must be rounded down: $$\begin{eqnarray*} \left\lfloor\dfrac{\log(2^l)}{\log(36)}\right\rfloor = 21 & &\\ \left\lfloor\dfrac{l\log(2)}{\log(36)}\right\rfloor = 21 & & \textrm{\{apply rule of \(\log(x^y) = y \log(x)\)}\}\\ l = \left\lfloor\dfrac{21\log(36)}{\log(2)}\right\rfloor & & \textrm{\{multiply and divide}\}\\ l = \left\lfloor108.56...\right\rfloor = 108 & & \end{eqnarray*}$$ In other words, 21 digits, base 36 encoded correspond to 108 bits. That's 20 bits less than the original GUID.

The original GUID is a sequential GUID whose implementation details we shouldn't rely on. Worse still, the algorithm for generating sequential GUIDs may have changed over time and may change again in the future. So which 20 bits to drop without disproportionately increasing the risk of a collision?

One way to resolve this issue is compute a hash of the original 128 bits. With SHA1 for instance, regardless of input size, output is always 160 bits. Now instead of dropping 20 bits from the original GUID, we drop 52 bits from the hash. We can drop any 52 bits as long as we're consistent with which to drop; the simplest approach is a bit-wise right-shift by 52 bits.

Downside is that because SHA1 is a one-way hash function, and because we're dropping bits, mapping becomes a one-way function also. It doesn't matter in this case because we can store the original GUID in a separate field inside the destination record.

It seems we could've settled for a simpler scheme and just generated the 108 bits at random. Unfortunately, this would introduce a race condition into the record creation process. Imagine two threads or two processes about to create a record at destination. With random bits, each may query for the original GUID in a record and get a empty result back. Each then creates the record and succeeds. But now destination has two records matching one at source. With deterministic IDs only one record is created. The second creation attempt will fail, by design.

Because of hashing and subsequent bit shifting, there's a possibility that two source GUIDs could map to the same destination. The more records we create, the larger the chance of a collision. From a generalization of the Birthday problem, we get that if we have \(n\) records and \(s\) states, the approximate probability of a collision becomes (multiply by 100 for percentage): $$\begin{eqnarray*} p(n,s) = 1 - e^{-n^2 / 2^s}\\ p(10^6,2^{32}) = 1\\ p(10^6,2^{48}) \sim 3.55\textrm{e-03}\\ p(10^6,2^{64}) \sim 5.42\textrm{e-08}\\ p(10^6,2^{108}) \sim 3.08\textrm{e-21} \end{eqnarray*}$$ Assuming SHA1 hashes are uniformly distributed, even after dropping bits, the probability of a collision is infinitesimal. If such black swan event does occur, we should at least be able to detect it. Detection is easy as the record returned would hold an original GUID different from the one used to generate its ID. Depending on the domain, then either the original ID must the changed or the collision may be ignored.

We end up with the following equation to describe the mapping. The \(\textrm{lpad\(_{21}\)}\) addition ensures the outcome is always 21 digits. Otherwise, at the extreme, outcome from inner functions may result in zero which encoded would become a single zero instead of 21 zeroes: $$\begin{eqnarray*} \textrm{map}(\textrm{guid}) = \textrm{lpad\(_{21}\)}(\textrm{encode\(_{36}\)}(\textrm{rshift\(_{52}\)}(\textrm{sha\(_1\)}(\textrm{guid}))))\\ \textrm{map}(\textrm{63b0e1db-cd2f-4265-b4f1-eb4b436b6adf}) = \textrm{6yucu7i9kj49yjrwu252e} \end{eqnarray*}$$ If we so desire, we can switch mapping strategy later on. Perhaps multi-process or multi-threading is no longer a hard requirement and we can switch to the simpler random generated IDs.