CS 1501

Data Structures and Algorithms

Programming Project 5

Purpose: The purpose of this assignment is for you to fully understand both the Huffman and LZW compression algorithms, and to compare their performance on test files of different types.

Procedure:

Thoroughly read the description and explanation of the LZW compression algorithm at http://www.cs.sfu.ca/cs/CC/365/li/squeeze/. Be sure to read not only the top-level LZW document, but also the more detailed paper at http://www.dogma.net/markn/articles/lzw/lzw.htm. Also run the applet to see how the algorithm works.

  1. Download the provided implementation of the algorithm and get it to work. The code is old-style C, so you should save it as a .c file rather than a .cpp file. The code should work as is (possibly with some warnings). If you prefer working in C++, you may modify it, but be very careful if you choose to do this, as your changes may introduce errors. Examine the C code very carefully, convincing yourself of exactly what each function, and each statement within each function, accomplishes. It is important that you understand the algorithm and its implementation thoroughly.
  2. Complete the code given in Chapter 22 of the Sedgewick text for the Huffman compression algorithm so that it works (both encoding and decoding). The text explains what needs to be added in some places, but in others you may have to figure this out for yourself. The code in the text is not easy to understand and you may have to read it over a few times before you figure it out completely; this is expected. Furthermore, the text code assumes that you have only 27 characters (the letters plus a blank), whereas a real file can contain ALL of the ASCII characters (or, more accurately, any of the 256 possible 8-bit values). Thus you need to modify the code so that it works on any arbitrary file. You will also need to add all of the file handling to the code (FOR COMPRESSING: input the original file, output the compressed file with the code array; FOR DECOMPRESSING: input the compressed file, output the original file). To handle files of arbitrary data you should probably open the files as BINARY files and process them a byte at a time. More notes/hints about the Huffman algorithm are provided below.
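As an illustration of the byte-at-a-time binary access suggested in step 2, the following is a minimal sketch in C of tallying how often each of the 256 possible byte values occurs in a file. The function name count_bytes and the array name freq are hypothetical and do not come from the Sedgewick text or the provided LZW code; treat this as one possible starting point, not as the required design.

```c
#include <stdio.h>

/* Hypothetical helper (not from the course materials): tallies how often
   each of the 256 possible byte values occurs in the file at `path`.
   Returns 0 on success, -1 if the file cannot be opened or a read fails. */
int count_bytes(const char *path, unsigned long freq[256])
{
    FILE *in;
    int c, i, status;

    for (i = 0; i < 256; i++)
        freq[i] = 0;

    in = fopen(path, "rb");   /* "rb": binary mode, no newline translation */
    if (in == NULL)
        return -1;

    /* getc returns an int so that EOF can be told apart from the byte 0xFF */
    while ((c = getc(in)) != EOF)
        freq[c]++;

    status = ferror(in) ? -1 : 0;
    fclose(in);
    return status;
}
```

Storing the result of getc in an int rather than a char matters here: with a char, a data byte of 0xFF could be mistaken for end-of-file and the count would stop early.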

Once you have your programs working, test them with some example files. See how well your programs compress the various files by calculating the COMPRESSION RATIO. For each file you compress, record the original file size (in bytes), the compressed file size (in bytes), and the ratio (compressed file size)/(original file size). Clearly, the smaller the ratio, the better the compression. Also try the two algorithms in tandem: compress using LZW and then Huffman, and vice versa. Record the sizes and compression ratios for these combinations as well. We will provide you with a number of files to use for testing. See the announcements for when/where they are available.
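The ratio calculation can be sketched in C as follows. The helper names file_size and compression_ratio are hypothetical, chosen only for this illustration. The size is found by counting bytes read in binary mode, which is slow for large files but avoids relying on fseek/ftell behavior that is not guaranteed for binary streams on every platform.

```c
#include <stdio.h>

/* Hypothetical helper: returns the size of the file at `path` in bytes,
   or -1 if the file cannot be opened. */
long file_size(const char *path)
{
    FILE *f = fopen(path, "rb");
    long n = 0;

    if (f == NULL)
        return -1;
    while (getc(f) != EOF)
        n++;
    fclose(f);
    return n;
}

/* Hypothetical helper: (compressed size)/(original size).
   Returns -1.0 for invalid sizes, including an empty original file,
   for which the ratio is undefined. */
double compression_ratio(long original_bytes, long compressed_bytes)
{
    if (original_bytes <= 0 || compressed_bytes < 0)
        return -1.0;
    return (double)compressed_bytes / (double)original_bytes;
}
```

For example, a 1000-byte file that compresses to 250 bytes has a ratio of 0.25, while a ratio above 1.0 means the "compressed" file is larger than the original.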

Write a short paper (~2 pages) that discusses each of the following:

  1. Point out and explain the section of code in the LZW program that handles the special case that we discussed in handout lzw3.txt. Be specific and detailed in your explanation.
  2. Explain how the Huffman algorithm is implemented. In particular, explain what the code[] and dad[] arrays are used for in the algorithm. Be SPECIFIC.
  3. Compare the compression ratios of the two algorithms for the files tested. Explain why, if at all, one algorithm is better than the other in a given situation.
  4. Explain the results of the algorithms when run in tandem. Was any improvement achieved? Did the order of the compressions affect the results? Postulate explanations for all of your answers.

 

Notes and Hints for Huffman Compression: