The Burrows-Wheeler Transform via Permutation Groups

Comments
Last updated 2015-07-16 22:17:51 SGT

The Burrows–Wheeler transform works as if by magic, but, surprisingly, it's actually bijective1 and not all that hard to understand. Let's work through this slowly.

Preliminary Notions

Let [S], our string, denote an ordered tuple of [N] letters [(a_0, a_1, \ldots a_{N-1})] for [N \in \mathbb{N}], and let [S_N] be the symmetric group on [N] symbols; we define group action from the left such that [S = (1)S] and [(g)S = (a_{g(0)}, a_{g(1)}, \ldots ,a_{g(N-1)}) \forall g \in S_N]. Let [r \in S_N] be such that [r(i) = i + 1 \mod N]. We see that [(r)S = (a_1, a_2 \ldots ,a_{N-1}, a_0)]; that is to say, the group action of [r] on the string shifts all letters one slot to the left (whence r for “rotation”, as in a necklace of letters).

Claim: [\{1, r, r^2, \ldots r^{N-1}\}] is a cyclic subgroup of [S_N] of order [N]. (this should be fairly obvious).

Note that I'm using the opposite definition of [r] compared to the Wikipedia article, for reasons that will soon become obvious.

Suppose that we wanted to sort the letters [\{a_0, a_1 \ldots a_{N-1}\}] lexicographically (intuitively, in alphabetical order). This can be performed as a permutation of [N] symbols, and can therefore be represented as the group action of some element [s \in S_N]. Ideally, we would like to be able to represent [S] as a list of its letters in lexicographical order, and one element [s \in S_N]. This allows us to do nice things like run-length encoding and so on.

Unfortunately, nobody likes keeping lookup tables of [S_N], because group actions are not the kind of data structures computers are good at playing with very quickly. Also the utility of this approach in terms of compression is predicated on being able to compress a list of letters into tables of runlength; however if S contains repeated letters, then there may be more than one element of [S_N] that sorts [S], meaning that [s] as defined above is not unique. This is bad for equality testing and other practical things that people seem to need to worry about on an everyday basis.

Transform

Consider the ordered set of strings [\{S, (r)S, (r^2)S \ldots (r^{N-1})S\}]. We make the fairly banal observation that if we put the letters on a grid, each row representing one such string, we get a symmetric arrangement such that the [n]th column and [n]th row contain the same string (whence our choice of [r]). Obviously,

Claim: Taking the first letter of each of the strings in this set and putting them into an ordered tuple (i.e. turning the first column into a string) gives us [S] again.

Somewhat more subtle (but still pretty easy to see) is that

Claim: Taking the last letter of each of the strings in this set and putting them into an ordered tuple gives us [(r^{N-1})S = (r^{-1})S].

Equivalently, [S] is equal to the concatenation of the first slots of the orbit of [r] acting on [S]; so too with [(r^{-1})S]. Again we can permute elements of this set using the group action of [S_N], since it has [N] elements.

Claim: This set can be sorted into lexicographical order uniquely, corresponding to the action of a unique group element [s \in S_N].

In our grid, this permutes the rows. The first row now contains the string [(r^{s(0)})S], the second row [(r^{s(1)})S], and so on.

As above, taking the first letter from each string in the sorted set and putting them into an ordered tuple gives us [(s)S]. What happens when we do this taking the last character from each string?

Definition: The BW transform of a string [S] of [N] letters, [BWT(S)], is given by the group action of [(s \cdot r^{-1}) S].

Claim: [BWT(S_1) = BWT(S_2) \iff S_1 = S_2] up to rotation.

Proof: Suppose [BWT(S_1) = BWT(S_2)]; then [S_1] and [S_2] must contain the same letters, under some permutation [p \in S_N], such that [S_1 = (p)S_2]. Then if [(s)S_1] is sorted, so too is [(sp)S_2]. Then [BWT(S_1) = (sr^{-1})S_1 = (spr^{-1})S_2 = BWT(S_2) \implies rp = pr], but the only elements of [S_N] that commute with [r] are all elements of the cyclic group generated by [r]. Therefore [BWT(S_1) = BWT(S_2) \implies S_1 = S_2] up to rotation. Trivially, then, [S_1 = (r^a) S_2 \implies BWT(S_1) = BWT(S_2) \forall a] because [r] commutes with itself. We are done.

Inverse

Now suppose we are given [B = BWT(S) = (sr^{-1})S] and are asked to find the inverse transform [S]; this amounts to solving for the action of the group element [1] where [s] is unknown. We make the observation, however, that there must exist at least one way to put the transformed string into lexicographical order.

Claim: This corresponds to the group action of some [a \in S_N] (but [a] need not be uniquely defined).

Claim: For any such [a], we have [(a)B = (a s r^{-1})S = (s) S \implies a = s r s^{-1}]

It is easy to verify that [a] also generates a cyclic subgroup of order [N]. Moreover, if [R] is the cyclic subgroup generated by [r] and [A] that generated by [a], a simple calculation tells us that [As = sR].

We can then repeat the previous trick we did to generate strings from the orbit of group elements: we construct an [N \times N] grid, placing the letters of the string [(a)B = (s)S] in the first row, [(a^2)B] in the second, [(a^3)B] in the third and so on. Once again, the letters of the first column represent the orbit of [a] acting on B.

But since [(a^{n+1})B = (s r^{n})S], this grid is equivalent to the first unsorted grid with its columns permuted by [s]. The [n]th column is now the concatenation of the first slots of [(r^{s(n)})S] under the action of the orbit of [r], which is equal to [(r^{s(n)})S]. While [s] (and so [s(n)]) are not known, this suffices to recover [S] up to rotation by a finite number of applications of [r].

Both the forward and inverse transforms are well-defined up to rotation. If necessary, this ambiguity can easily be avoided (e.g. by using a special start- or end-of-string character)2.

Conclusion

Apparently the BWT is used in things like compression (given that it can reversibly convert strings into output in which repeated-letter sequences recur more frequently than plaintext) and to reduce the memory usage of sequence alignment algorithms in bioinformatics3. This is all fairly new to me, though; this blog post is basically me getting to know it better.

TL;DR: TIL BWT


  1. With some modifications 

  2. Actually this sort of reminds me of fields being defined only up to global phase/gauge transformations, but that might just be pareidolia here 

  3. This was today's lunch conversation 


comments powered by Disqus