Original Link: https://www.anandtech.com/show/218
Introduction
If you have ever unzipped a download, viewed a GIF or JPEG
file, watched an MPEG video or played an MP3 audio file, you have experienced data
compression. Data compression is arguably one of the most important aspects of the
internet and multimedia. Without data compression, 5 minute MPEG clips would take up tens or
even hundreds of megabytes, 1 minute CD-quality audio files would take up around 10
megabytes of hard drive space, game demos would take twice as long to download, and
web page images would take up to ten times longer to load! This article isn't going to
focus on all the wonderful things possible with data compression, or all the future
applications of data compression. Instead, this article will focus on explaining data
compression so that you, the reader, will know HOW it works, not just that "it
works". Also, after reading this article you can show off to your friends with
phrases like "Lempel-Ziv Compression", and astonishing trivia questions like
"Did you know that Huffman coding binary trees are 'full'?".
Lossless Compression Techniques
The first types of compression I am going to discuss are lossless compression techniques. These techniques allow the decompressor to recreate the original data byte for byte. They are used predominantly in file compression (PKZip, RAR, ARJ, LHA, etc.), though some of them also appear inside other compression schemes because they are relatively easy to implement and very effective on certain data sets. Perhaps the most commonly used data compression algorithm is Huffman coding.
Huffman Coding
Huffman coding is based upon a simple philosophy: the most common data should be represented with the fewest bits, while the least common data should be represented with the most bits. After a statistical analysis of the file (or block) is done, the data is stored in a binary tree with the most common elements as close to the root as possible. Then, to find the code for a character, we traverse the tree: each time we go to the left, we add a 0; each time we go to the right, we add a 1. Below is an example of Huffman compression at work.
String to Compress: "aabacbacdb"
Statistics:
a - 4
b - 3
c - 2
d - 1
Tree generation:
Start with 4 trees of 1 node each (one per letter), then merge the two smallest trees, insert the result back into the list, and repeat:

4  3  2  1     (a, b, c, d)
4  3  3        (merged 2 + 1 = 3)
4  6           (merged 3 + 3 = 6)
10             (merged 6 + 4 = 10)
The resulting probability tree would look like the one below: (Note that this tree is always "full" (each node has 0 or 2 children))
         10
        /  \
       6    4 (a)
      / \
     3   3 (b)
    / \
  2(c) 1(d)
Now, to come up with the code for each character, traverse the tree until you hit the character you are looking for. While traversing, every time you go to the left node, mark a '0'; otherwise, mark a '1'.
Using this, finding the coding pattern is trivial:
a = 1
b = 01
c = 000
d = 001
Our string above, "aabacbacdb", would convert into '1101100001100000101'. Here we have 19 bits, instead of the 8 * 10 = 80 bits in the original message. Of course, we also have to store the tree for decompression (or the code for each character), but as soon as the compressed block reaches a size of 32K or so, this extra space is negligible.
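To make the procedure concrete, here is a minimal sketch of Huffman code generation in Python. It is my illustration, not the exact implementation behind any particular compressor, and the heap tie-breaking convention I chose may assign the bit-flipped mirror of the codes in the worked example (e.g. a = 0 instead of a = 1), but the code lengths, and therefore the compressed size, come out the same.

```python
# Illustrative Huffman code generation; tie-breaking is arbitrary, so the
# exact bit patterns may differ from the worked example (lengths won't).
import heapq
from collections import Counter

def huffman_codes(data):
    counts = Counter(data)
    # One single-symbol "tree" per character, weighted by frequency.
    heap = [(freq, [sym]) for sym, freq in counts.items()]
    heapq.heapify(heap)
    codes = {sym: "" for sym in counts}
    # Repeatedly merge the two smallest trees; every symbol in the first
    # (left) tree gets a '0' prepended, every symbol in the other a '1'.
    while len(heap) > 1:
        f1, left = heapq.heappop(heap)
        f2, right = heapq.heappop(heap)
        for sym in left:
            codes[sym] = "0" + codes[sym]
        for sym in right:
            codes[sym] = "1" + codes[sym]
        heapq.heappush(heap, (f1 + f2, left + right))
    return codes

codes = huffman_codes("aabacbacdb")
encoded = "".join(codes[c] for c in "aabacbacdb")
print(codes)                   # code lengths: a=1, b=2, c=3, d=3 bits
print(encoded, len(encoded))   # 19 bits, versus 8 * 10 = 80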
Optimizing Huffman Compression
The main method used to optimize Huffman compression is to combine it with LZ compression (which will be discussed next). There are, however, ways to optimize Huffman coding itself. The tree algorithm shown here (the one for generating the code for each letter) is already optimal; however, for Huffman coding to be most efficient, there must be large gaps between how often one letter appears as opposed to another. The ideal statistical distribution for Huffman compression, with a significant number of letters present, is a division by powers of 2 (i.e. 1/2 of one letter, 1/4 of another, etc.). Since we know what kind of statistics we want, we can choose blocks which best reflect this distribution. Let's take the data below.
aaaabbbbccccdddd
If we were to compress this with a block size of 16, we would not be able to compress the data much, since all the letters are distributed equally. If we instead make our block size 4, and assume that storing the tree (or the equivalent Huffman code for each character) is negligible (which it is in most real-world cases), then each block contains only one distinct letter, and each character can be coded in 1 bit.
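As a sanity check, here is a toy tally of the two block sizes, reusing the huffman_codes sketch from above and taking the free-tree-storage assumption at face value. The 1-bit floor for a single-letter block is my convention, matching the text:

```python
data = "aaaabbbbccccdddd"

def block_cost(block):
    codes = huffman_codes(block)   # from the sketch above
    # A block with a single distinct letter still costs 1 bit per character.
    return sum(max(len(codes[c]), 1) for c in block)

print(block_cost(data))                                             # 32 bits
print(sum(block_cost(data[i:i+4]) for i in range(0, len(data), 4))) # 16 bits
```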
Now, what if a static block size won't work? Well, if we are doing Huffman compression alone, it is very feasible (albeit slow, or fast and difficult) to dynamically calculate the most efficient block size. This means that every block will have a different size, whichever is most efficient. According to Tim Sweeney, lead programmer of Epic's kick a-- Unreal, in response to an e-mail I fired off to him yesterday:
"A few simple attempts gained an additional
20% reduction in compressed bits,
but that was offset by having to store multiple Huffman tables -- so the net
result was about 0% reduction."
Tim Sweeney goes on to mention that:
"I think this could potentially be better
than Huffman alone, but I think the gain will be limited to maybe 20-30%.
Most kinds of data (such as text, graphics, program code) don't seem to have
much more local coherence than global coherence. LZH schemes will probably
beat this approach consistently."
This brings us to the next point I will discuss: possible ways to work with dynamic block sizes when doing Huffman + LZ77 (LZH) compression, along with an explanation of what LZ(H) compression is.
Lempel-Ziv (LZ) Compression
Lempel-Ziv compression takes advantage of the large amounts of repetitive data in a file. Take, for example, this article. Chances are that there are many occurrences of words such as "the", "compression", "LZ", etc. Wouldn't it be great if, each time we hit one of these common words, we could just put in a shorter code for that word? This is, in essence, what LZ compression does. Let's "fake LZ" compress the (perhaps, err, definitely nonsensical) text below:
The blah blah the blah that these blah blah the blah.
Now in our dictionary (which appears at the beginning of a LZ file) we would store the following:
The = T
blah = B
that = C
these = D
Our message above could now be compressed into something like this:
T B B T B C D B B T B
LZ compression explained in more detail
The above example is not "real" LZ compression. Real LZ compression has a dictionary (whose size varies depending on the maximum size of an entry and the number of entries). The typical example used is a 4096-entry dictionary (so everything in it can be coded in 12 bits) which reserves the first 256 entries for single bytes, and the remaining entries for repetitive strings.
How does LZ compression find repetitive strings? Well, the simple form of LZW compression, which is the one I will discuss now, does not know in advance which strings are common and which are not. Here is how it works:
Read the first character into the "previous" string w.
While there are characters left:
    Read in a character c.
    If w + c is in the dictionary:
        set w = w + c (keep extending the match).
    Otherwise:
        output the dictionary code for w,
        add w + c to the dictionary,
        and set w = c.
Output the code for the final w.
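Here is that loop as a minimal runnable sketch in Python. It is my illustration of the textbook LZW scheme described above, not code from any particular compressor; a real implementation would also cap or reset the dictionary (e.g. at 4096 entries) and pack the codes into 12-bit fields rather than keeping them as Python integers.

```python
def lzw_compress(data: bytes) -> list:
    # Entries 0-255 are reserved for single bytes, as described above.
    dictionary = {bytes([i]): i for i in range(256)}
    w = data[:1]                                # "previous" string
    out = []
    for c in data[1:]:
        wc = w + bytes([c])
        if wc in dictionary:
            w = wc                              # keep extending the match
        else:
            out.append(dictionary[w])           # emit code for longest match
            dictionary[wc] = len(dictionary)    # learn the new string
            w = bytes([c])
    out.append(dictionary[w])                   # flush the final match
    return out

print(lzw_compress(b"abababababab"))
# [97, 98, 256, 258, 257, 260] -- 6 codes for 12 bytes
```

Notice how the repeated "ab" pairs collapse: the longer the repetition runs, the longer the dictionary strings grow, and the more bytes each emitted code replaces.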
You can probably see where the compression comes into play. If a long run of characters is still sitting in the "previous" buffer when a code is finally emitted, all of those characters are replaced by a single code, which is generally around 12 bits in size. Looking at this algorithm, you might also notice that for small amounts of data (i.e. less than 150 bytes or so) LZ compression is useless. On large files with lots of repetitive data, however, LZ compression is very efficient. It is also exceptionally efficient at compressing graphics, since there are generally lots of repeating patterns which are easily compressed.
LZ77 Compression scheme
LZ77, another type of Lempel-Ziv compression algorithm, works by looking ahead into the file: when the upcoming data matches something it has already seen, it writes the earlier position (and length) of that match instead of the actual data. The output is then generally compressed using Huffman coding (hence the name LZH, Lempel-Ziv Huffman), which deals with the cases where LZ77 compression was not very successful. Here the variable-length Huffman encoding mentioned earlier can potentially be very helpful. LZH, or LZA (LZ + arithmetic coding, another compression algorithm which will be discussed in part 2), compression is the main method used in all the popular compressors, such as ZIP, ARJ, LHA, etc.
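To make the back-reference idea concrete, here is a naive LZ77-style sketch in Python. The window size and minimum match length are arbitrary assumptions on my part, and the brute-force window scan is far slower than the hash-chain searches real compressors use; the mixed literal/reference output is what an LZH scheme would then feed to a Huffman coder.

```python
def lz77_compress(data: bytes, window: int = 4096, max_len: int = 255):
    out = []        # mix of literal bytes and (distance, length) references
    i = 0
    while i < len(data):
        best_len, best_dist = 0, 0
        # Brute-force search of the sliding window for the longest match.
        for j in range(max(0, i - window), i):
            length = 0
            while (i + length < len(data) and length < max_len
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= 3:                       # short matches aren't worth it
            out.append((best_dist, best_len))   # back-reference
            i += best_len
        else:
            out.append(data[i])                 # literal byte
            i += 1
    return out

print(lz77_compress(b"blah blah blah blah"))
# [98, 108, 97, 104, 32, (5, 14)] -- one reference covers three "blah"s
```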
In part 2, decompression will be discussed in detail, and then, in part 3 (or 4), what you are all probably waiting for: I will talk about how these compression algorithms are applied (or not applied!) in various image formats (GIF, JPG, etc.), compressed audio, and compressed MPEG.
Acknowledgements:
comp.compression FAQ, http://www.cis.ohio-state.edu/hypertext/faq/usenet/compression-faq/part2/faq-doc-1.html
Data Compression, http://www.ics.uci.edu/~dan/pubs/DC-Sec1.html
Mark Nelson's Homepage, http://dogma.net/markn/
Tim Sweeney, Lead Programmer, Epic MegaGames.