CS 1501
Data Structures and Algorithms
Programming Project 5
Purpose: The purpose of this assignment is for you to fully
understand both the Huffman and LZW compression algorithms and to compare
their performance on test files of different types.
Procedure:
Thoroughly read the description/explanation of the LZW compression algorithm
as discussed in http://www.cs.sfu.ca/cs/CC/365/li/squeeze/.
Be sure to read not only the top-level LZW document, but also the more detailed
paper at http://www.dogma.net/markn/articles/lzw/lzw.htm.
Also run the applet to see how the algorithm works.
- Download the implementation
of the algorithm provided and get it to work. The code is old-style C
code, so you should save it as a .c file rather than a .cpp file. However,
the code should work as is (possibly with some warnings). If you prefer
working in C++, you may modify it, but be very careful if you choose to do
this, as your changes may introduce some errors. Examine the C code very
carefully, convincing yourself exactly what is accomplished by each
function and by each statement within each function. It is important that
you understand the algorithm and its implementation thoroughly.
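As a reference point while you trace the provided C code, the core of LZW can be sketched as follows. This is a minimal illustration with function names of my own choosing, not the code you are downloading: it uses growable std::map dictionaries and never caps the code width or packs bits into the output. The decoder does, however, handle the special case of a code that is not yet in its table, which is the situation discussed in handout lzw3.txt.

```cpp
#include <map>
#include <string>
#include <vector>

// Compress a byte string into a sequence of integer codes.
// Codes 0..255 are the single bytes; new phrases get 256, 257, ...
std::vector<int> lzwCompress(const std::string& in) {
    std::map<std::string, int> dict;
    for (int i = 0; i < 256; ++i) dict[std::string(1, char(i))] = i;
    int next = 256;
    std::vector<int> out;
    std::string w;                         // longest match so far
    for (char c : in) {
        std::string wc = w + c;
        if (dict.count(wc)) {
            w = wc;                        // keep extending the match
        } else {
            out.push_back(dict[w]);        // emit code for the match
            dict[wc] = next++;             // add the new phrase
            w = std::string(1, c);
        }
    }
    if (!w.empty()) out.push_back(dict[w]);
    return out;
}

// Rebuild the original string from the codes, reconstructing the
// same dictionary the compressor built.
std::string lzwDecompress(const std::vector<int>& codes) {
    std::map<int, std::string> dict;
    for (int i = 0; i < 256; ++i) dict[i] = std::string(1, char(i));
    int next = 256;
    std::string out, w;
    for (int code : codes) {
        std::string entry;
        if (dict.count(code))
            entry = dict[code];
        else
            entry = w + w[0];              // special case: code not yet in table
        out += entry;
        if (!w.empty()) dict[next++] = w + entry[0];
        w = entry;
    }
    return out;
}
```

The `entry = w + w[0]` line is the decoder-side special case: the compressor can emit a code one step before the decoder has added it to its table, and that can only happen for a phrase of the form cScSc, so the decoder reconstructs it from the previous phrase plus its own first character.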
- Complete the code given in
the Sedgewick text in Chapter 22 for the Huffman compression algorithm so
that it works (both encoding and decoding). The text explains what needs
to be added in some places, but in others you may have to figure this out
for yourselves. The code in the text is not easy to understand and you may
have to read it over a few times before you figure it out completely --
this is expected. Furthermore, the text code assumes that you only have 27
characters (letters plus a blank) when in a real file you would have ALL
of the ASCII characters (or, perhaps better described as all of the 8-bit
data combinations). Thus you need to modify the code so that it works on
any arbitrary file. You will also need to add all of the file handling to
the code (FOR COMPRESSING: input the original file, output the compressed
file with the code array; FOR DECOMPRESSING: input the compressed file,
output the original file). To handle files of arbitrary data you should
probably access the files as BINARY files and process them a byte at a
time. More notes/hints about the
Huffman algorithm are provided below.
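To picture what the completed Huffman code must accomplish, here is a sketch of building codewords over all 256 byte values with a priority queue. The node structure and names are mine, not Sedgewick's; his code[] and dad[] arrays represent the same tree implicitly rather than with explicit pointers, and this sketch leaks its nodes for brevity.

```cpp
#include <queue>
#include <string>
#include <vector>

// One tree node: a leaf for a byte value, or an internal merge node.
struct Node {
    long freq;
    int byte;                              // -1 for internal nodes
    Node *left = nullptr, *right = nullptr;
};
struct Cmp {                               // min-heap on frequency
    bool operator()(Node* a, Node* b) const { return a->freq > b->freq; }
};

// Walk the tree, assigning "0" for left and "1" for right.
void assignCodes(Node* n, const std::string& prefix,
                 std::vector<std::string>& codes) {
    if (n->byte >= 0) {
        codes[n->byte] = prefix.empty() ? "0" : prefix;  // single-symbol file
        return;
    }
    assignCodes(n->left,  prefix + "0", codes);
    assignCodes(n->right, prefix + "1", codes);
}

// freq[b] = count of byte b in the file; returns a codeword per byte.
std::vector<std::string> buildCodes(const std::vector<long>& freq) {
    std::priority_queue<Node*, std::vector<Node*>, Cmp> pq;
    for (int b = 0; b < 256; ++b)
        if (freq[b] > 0) pq.push(new Node{freq[b], b});
    while (pq.size() > 1) {                // merge the two smallest repeatedly
        Node* a = pq.top(); pq.pop();
        Node* b = pq.top(); pq.pop();
        pq.push(new Node{a->freq + b->freq, -1, a, b});
    }
    std::vector<std::string> codes(256);
    if (!pq.empty()) assignCodes(pq.top(), "", codes);
    return codes;
}
```

Note that more frequent bytes end up nearer the root and so get shorter codewords, which is the entire point of the algorithm.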
Once you have your programs working, test them with some example files. See
how well your programs compress the various files by calculating the
COMPRESSION RATIO. For each file you compress, record the original file size
(in bytes), the compressed file size (in bytes), and the ratio of the
(compressed file size)/(original file size). Clearly, the smaller the ratio,
the better the compression. Also try
the two algorithms in tandem: compress using LZW then Huffman, and vice
versa. Record the various sizes and
compression ratios for these combinations as well. We will provide you with a
number of files to use for testing. See the announcements for when/where they
are available.
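The ratio described above is just a division, but a trivial helper (the name is mine) makes the definition concrete for your record-keeping:

```cpp
// Compression ratio as defined above: (compressed size) / (original size).
// Smaller is better; a ratio above 1.0 means the "compressed" file grew.
double compressionRatio(long compressedBytes, long originalBytes) {
    return static_cast<double>(compressedBytes)
         / static_cast<double>(originalBytes);
}
```

For example, a 1000-byte file that compresses to 600 bytes has a ratio of 0.6.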
Write a short paper (~2 pages) that discusses each of the following:
- Point out and explain the
section of code in the LZW program that handles the special case that we
discussed in handout lzw3.txt. Be
specific and detailed in your explanation.
- Explain how the Huffman
algorithm is implemented. In particular, explain what the code[] and dad[]
arrays are used for in the algorithm. Be SPECIFIC.
- Compare the compression
ratios of the two algorithms for the files tested. Explain why, if at all,
one algorithm is better than the other in a given situation.
- Explain the results of the
algorithms when done in tandem.
Was any improvement achieved?
Did the order of the compressions affect the results? Postulate explanations for all of your
answers.
Notes
and Hints for Huffman Compression:
- The text says that you should
store your code[] array together with your compressed file so that you can
decompress. It may be easier for you to just store the frequency table and
rebuild the (identical) tree when you decompress. Either way is ok as long as your
algorithm is correct.
- If you have trouble figuring
out how to use the PQ from the text, just use ANY priority queue
implementation in your building of the tree (even going as far as sorting
them if you want). We will talk more about PQs later.
- The file that you compress TO
and decompress FROM MUST be a binary file, since you need to write/read
individual bytes. If you have never used binary files in C++, you may not
be familiar with the procedures for getting them to work. I have put a
simple example program onto the Web site to help you (a bit) with this.
Look on the Assignments link for it. Note that this program is NOT doing any compression and does
NOT resemble what you will do in your assignment. I have simply provided it to show you
how binary files work in C++.
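For reference, a byte-at-a-time binary round trip can look like the sketch below. This is my own minimal illustration, not the example program posted on the course site, and the file name is arbitrary:

```cpp
#include <cstdio>
#include <fstream>
#include <vector>

// Write bytes to a binary file one at a time, read them back, and
// report whether the round trip preserved them exactly.
bool roundTrip(const char* path, const std::vector<unsigned char>& data) {
    std::ofstream out(path, std::ios::binary);
    for (unsigned char b : data)
        out.put(static_cast<char>(b));     // raw byte, no text translation
    out.close();

    std::ifstream in(path, std::ios::binary);
    std::vector<unsigned char> back;
    int c;                                 // int, so EOF (-1) stays distinguishable
    while ((c = in.get()) != EOF)
        back.push_back(static_cast<unsigned char>(c));
    in.close();

    std::remove(path);                     // clean up the temporary file
    return back == data;
}
```

Opening with std::ios::binary matters on some platforms: without it, byte values like 0x0A or 0x1A may be translated or treated as end-of-file.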
- Regarding the binary file
above, you want to save your data as UNSIGNED (either unsigned char or
unsigned int should work), since you don't want the machine to use the two's
complement coding that is used for signed integers.
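The signedness issue is easy to see with two tiny illustrative helpers (mine, not course code): a byte like 0xFF read through a signed char comes out as -1 under two's complement, which can collide with EOF and break frequency-table indexing.

```cpp
// The same byte, interpreted two ways.
int viaSignedChar(unsigned char b)   { return static_cast<signed char>(b); }
int viaUnsignedChar(unsigned char b) { return b; }
```

Indexing an array with the signed interpretation of 0xFF would mean indexing with -1, so keep everything unsigned.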
- Since you are reading in
irregular-length codewords during decompression, it is not trivial to
detect the end of the file and handle it correctly. To do
this you must add an EOF character in your compression table and put that
as the last codeword in the file. Note that this should not be one of the
regular ASCII characters since a binary file could contain arbitrary byte
codes, including the EOT (ASCII 4) character. If you can't get this to work correctly, you should still
run your tests, since your program should still decompress correctly up to
the last character.
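The pseudo-EOF idea amounts to this: reserve the value 256 as a 257th "character", give it frequency 1 when building the tree, write its codeword last, and stop decoding the moment it appears, so that any padding bits filling out the final byte are ignored. In the sketch below (mine, and simplified so that already-decoded symbol values stand in for codewords), only the stopping logic is shown:

```cpp
#include <vector>

const int PSEUDO_EOF = 256;   // outside the 0..255 byte range, never ambiguous

// Keep decoded symbols up to (but not including) the sentinel;
// anything after it is padding and gets discarded.
std::vector<int> takeUntilEof(const std::vector<int>& decodedSymbols) {
    std::vector<int> out;
    for (int s : decodedSymbols) {
        if (s == PSEUDO_EOF) break;        // stop: the rest is padding bits
        out.push_back(s);
    }
    return out;
}
```

The key point is that the sentinel lives outside the byte range, so no byte in the input file can ever be mistaken for it.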