Bit-by-bit LZW compressor with redundant-code-elimination
The aim is to explore the effectiveness of LZW compression when applied to 1-bit symbols rather than the 1-byte symbols it is normally used with. A further aim is to explore the possibility of removing additional redundancy from the LZW algorithm by removing codes from the codetable once enough newer codes have been added that the older code will never be used again.
An iterator wrapper was produced which wraps the file stream iterators (such as std::istreambuf_iterator) and allows iteration bit-by-bit, translating each bit-level step back into byte-by-byte calls on the wrapped iterator.
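A minimal sketch of the idea is shown below. This is not the project's actual wrapper; the class name, the bit order (most-significant bit first here) and the end-of-stream handling are assumptions made purely for illustration.

```cpp
#include <cstdint>
#include <fstream>
#include <iostream>
#include <iterator>

// Sketch of a bit-level input iterator layered over a byte iterator such as
// std::istreambuf_iterator<char>. Bits are produced most-significant first;
// the real implementation may differ in naming, bit order and end handling.
template <typename ByteIterator>
class BitInputIterator {
public:
    BitInputIterator() : at_end_(true), bit_pos_(0), current_(0) {}
    BitInputIterator(ByteIterator it, ByteIterator end)
        : it_(it), end_(end), at_end_(it == end), bit_pos_(0), current_(0) {
        if (!at_end_) current_ = static_cast<std::uint8_t>(*it_);
    }

    bool operator*() const {               // extract the current bit
        return (current_ >> (7 - bit_pos_)) & 1u;
    }

    BitInputIterator& operator++() {       // advance one bit; fetch a new byte every 8 bits
        if (++bit_pos_ == 8) {
            bit_pos_ = 0;
            if (++it_ == end_) at_end_ = true;
            else current_ = static_cast<std::uint8_t>(*it_);
        }
        return *this;
    }

    bool operator==(const BitInputIterator& other) const {
        return at_end_ && other.at_end_;   // only end-of-stream comparison is needed here
    }
    bool operator!=(const BitInputIterator& other) const { return !(*this == other); }

private:
    ByteIterator it_, end_;
    bool at_end_;
    int bit_pos_;
    std::uint8_t current_;
};

int main() {
    std::ifstream in("input.bin", std::ios::binary);   // hypothetical input file
    BitInputIterator<std::istreambuf_iterator<char>> bits{
        std::istreambuf_iterator<char>(in), std::istreambuf_iterator<char>()};
    BitInputIterator<std::istreambuf_iterator<char>> end;
    for (; bits != end; ++bits) std::cout << (*bits ? '1' : '0');
    std::cout << '\n';
}
```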
A small redundancy in the original LZW algorithm was discovered: because the LZW code-matching algorithm is greedy, once a prefix code has had every possible one-symbol extension added to the codetable (with 1-bit symbols, both its 0-extension and its 1-extension), that prefix can never again be the longest match, so its code will never be emitted again (except possibly at the very end of the input; see below). Such a code is "shadowed" by its extensions and can be removed from the code space. For example:
| Code | Bitstring |
|------|-----------|
| 0    | 0         |
| 1    | 1         |
| 2    | 01        |
| 3    | 10        |
| 4    | 11        |

At this point we can remove code 1 (bitstring `1`), because `10` and `11` shadow it.
New table after RCE:
| Code     | Bitstring |
|----------|-----------|
| 0        | 0         |
| shadowed | 1         |
| 1        | 01        |
| 2        | 10        |
| 3        | 11        |

Note that the shadowed entry is kept in the table structure, but it is no longer part of the code space, so the remaining codes are renumbered.
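With 1-bit symbols the shadow test itself is very small. The sketch below (the `CodeTable` layout and function name are assumptions for illustration, not the project's data structures) reproduces the example above: a code becomes removable once both of its one-bit extensions are present.

```cpp
#include <cstddef>
#include <iostream>
#include <string>
#include <unordered_map>

// Hypothetical codetable layout: maps a bitstring (e.g. "01") to its code number.
using CodeTable = std::unordered_map<std::string, std::size_t>;

// With 1-bit symbols, a code for bitstring `s` is shadowed once both of its
// one-bit extensions are in the table: the greedy matcher will then always
// prefer the longer match, so `s` can never be the longest match again
// (except possibly at the very end of the input).
bool isShadowed(const CodeTable& table, const std::string& s) {
    return table.count(s + "0") > 0 && table.count(s + "1") > 0;
}

int main() {
    // Reproducing the table above: after code 4 ("11") is added,
    // code 1 ("1") is shadowed by "10" and "11", but "0" is not ("00" is missing).
    CodeTable table = {{"0", 0}, {"1", 1}, {"01", 2}, {"10", 3}, {"11", 4}};
    std::cout << std::boolalpha
              << "\"1\" shadowed: " << isShadowed(table, "1") << '\n'
              << "\"0\" shadowed: " << isShadowed(table, "0") << '\n';
    return 0;
}
```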
There is a special case at the end of the file, where we might indeed need a code that has already been shadowed: the final pending match cannot be extended by any further input, so the greedy argument above no longer guarantees that a longer code will be used instead, and the compressor must still have some way of emitting that string.
The first thing to note is that this algorithm is slow. Really slow! It is not very practical for real-world usage. Two factors contribute to this: firstly, the fact that we compress a single bit at a time, and secondly, the processing overhead introduced by RCE. Before RCE was introduced, compression was already slow, but not as slow as it currently is.
It may be possible to make this slightly more efficient by only running RCE when the codetable size is about to grow to the next whole number of bits per code, rather than every time a new code is inserted. This could be done efficiently by remembering the index of the first new code that has not yet been checked for RCE, then checking that code and all those after it when the time comes. Working this way is not expected to change the compressed file size at all (it should remain exactly the same), although it will change the exact codes that are output. The performance increase, however, is only expected to be marginal, if there is any at all.
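A sketch of how that deferred checking could be organised is shown below; the data layout, names and exact bookkeeping are illustrative assumptions rather than a description of the current implementation.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Illustrative sketch of deferring RCE until the code width is about to grow.
class LazyRceTable {
public:
    // Insert a newly learnt bitstring; run the deferred RCE pass only when the
    // number of live codes would need one more bit each.
    void add(const std::string& s) {
        std::size_t widthBefore = bitsNeeded();
        lookup_.emplace(s, bitstrings_.size());
        bitstrings_.push_back(s);
        shadowed_.push_back(false);
        ++live_;
        if (bitsNeeded() > widthBefore) runDeferredRce();
    }

    std::size_t bitsNeeded() const {          // bits required to number all live codes
        std::size_t bits = 1;
        while ((std::size_t{1} << bits) < live_) ++bits;
        return bits;
    }

private:
    void runDeferredRce() {
        // Every code added since the last pass may have completed the pair of
        // extensions that shadows its one-bit-shorter prefix.
        for (std::size_t c = firstUnchecked_; c < bitstrings_.size(); ++c) {
            const std::string& s = bitstrings_[c];
            if (s.size() < 2) continue;
            std::string prefix = s.substr(0, s.size() - 1);
            auto it = lookup_.find(prefix);
            if (it == lookup_.end() || shadowed_[it->second]) continue;
            if (lookup_.count(prefix + "0") && lookup_.count(prefix + "1")) {
                shadowed_[it->second] = true;   // drop the prefix from the code space
                --live_;
            }
        }
        firstUnchecked_ = bitstrings_.size();
    }

    std::vector<std::string> bitstrings_;                   // code number -> bitstring
    std::unordered_map<std::string, std::size_t> lookup_;   // bitstring -> code number
    std::vector<bool> shadowed_;                            // removed from the code space
    std::size_t firstUnchecked_ = 0;                        // first code not yet examined
    std::size_t live_ = 0;                                  // codes still in the code space
};

int main() {
    LazyRceTable table;
    const char* codes[] = {"0", "1", "01", "10", "11"};
    for (const char* s : codes) table.add(s);
    // Adding "11" would push the table to 3-bit codes, which triggers the
    // deferred pass; "1" is then shadowed and 2 bits per code still suffice.
    return table.bitsNeeded() == 2 ? 0 : 1;
}
```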
Another improvement that could be made is to move away from processing one bit at a time towards some larger unit, such as a byte, or something in between such as a nibble (4 bits) or a crumb (2 bits). There is no reason why RCE cannot also be applied to these larger symbol sizes, but the larger the symbol, the more computationally expensive RCE may become (more one-symbol extensions must be checked before a code can be proven shadowed), and codes will also become removable less frequently (more codes must be added before any given code is shadowed).
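As a rough illustration of that cost, here is the shadow test generalised to k-bit symbols, using the same hypothetical bitstring-keyed table as the earlier sketches (names and layout are again assumptions, not the current implementation): a code can only be dropped once all 2^k of its one-symbol extensions are present.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

using CodeTable = std::unordered_map<std::string, std::size_t>;

// A code for bitstring `s` can only be dropped once *all* 2^symbolBits
// one-symbol extensions are present in the table, so with larger symbols
// removals are both rarer and more expensive to detect.
bool isShadowedGeneral(const CodeTable& table, const std::string& s, unsigned symbolBits) {
    const std::size_t extensions = std::size_t{1} << symbolBits;
    for (std::size_t e = 0; e < extensions; ++e) {
        std::string candidate = s;
        for (unsigned b = symbolBits; b-- > 0;)               // append the symbol bit by bit
            candidate += ((e >> b) & 1u) ? '1' : '0';
        if (table.find(candidate) == table.end()) return false;  // one missing extension keeps s live
    }
    return true;
}

int main() {
    // With 2-bit symbols, "00" is only shadowed because all four of its
    // 2-bit extensions ("0000", "0001", "0010", "0011") are in the table.
    CodeTable table = {{"00", 0}, {"0000", 1}, {"0001", 2}, {"0010", 3}, {"0011", 4}};
    return isShadowedGeneral(table, "00", 2) ? 0 : 1;
}
```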