CS 1501 Programming Project Lempel-Ziv-Welch Compression

Purpose: The purpose of this assignment is for you to fully understand the LZW compression algorithm, its performance and its implementation.

Procedure:

1)      Thoroughly read the description/explanation of the LZW compression algorithm as discussed in http://www.cs.sfu.ca/cs/CC/365/li/squeeze/. Be sure to read not only the top level LZW document, but also the more detailed paper in http://www.dogma.net/markn/articles/lzw/lzw.htm . Also run the applet to see how the algorithm works.

2)      Download the implementation of the algorithm provided and get it to work. The code is old-style C code, so you should save it as a .c file rather than a .cpp file. However, the code should work as is (possibly with some warnings). If you prefer, you may use the code converted by Mark into C++ style.

3)      Examine the C code very carefully, convincing yourself exactly what is accomplished by each function and by each statement within each function. Scrutinize in particular the functions that input and output the codes. You may have to look up some bit operators in C++ help to understand these functions.
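To give a feel for the kind of bit manipulation involved, here is a minimal sketch of packing codes into bytes. It is not the provided code: the BitWriter type, the put_code function and the in-memory output array are all hypothetical names chosen for illustration, and the sketch assumes codes are written most significant bit first.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit packer (not the provided code): collects codes
   most-significant-bit first and emits whole bytes into an array. */
typedef struct {
    uint32_t bit_buffer;  /* pending bits, left-aligned at bit 31 */
    int      bit_count;   /* number of pending bits (0..7 between calls) */
    uint8_t *out;         /* destination byte array */
    size_t   out_len;     /* number of bytes written so far */
} BitWriter;

/* Append the low `width` bits of `code`; assumes 1 <= width <= 16. */
void put_code(BitWriter *w, unsigned code, int width)
{
    /* Keep only `width` bits, then place them just below the pending bits. */
    uint32_t value = code & ((1u << width) - 1u);
    w->bit_buffer |= value << (32 - width - w->bit_count);
    w->bit_count += width;

    /* Emit every complete byte from the top of the buffer. */
    while (w->bit_count >= 8) {
        w->out[w->out_len++] = (uint8_t)(w->bit_buffer >> 24);
        w->bit_buffer <<= 8;
        w->bit_count -= 8;
    }
}
```

Tracing two or three calls by hand, writing out the buffer in binary after each shift, is a good way to check that you understand what |, << and >> are doing before you read the real input and output functions.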

4)      Copy and modify the C code so that 1) the LZW algorithm has a varying number of bits, as mentioned in the paper http://www.dogma.net/markn/articles/lzw/lzw.htm and 2) the hash table uses a different hash function with linear probing.

Your codewords should vary from 9 bits to 14 bits, and should increment the bitcount when all codes for the previous size have been used. This does not require a lot of modification to the program, but you must REALLY understand exactly what the program is doing at each step in order to do this successfully. This step will be much easier if you first convert the C code into C++ syntax, so I suggest doing that first. You may use the code converted by Mark as a starting point if you prefer. A good debugging technique for this program is to write into a text file the output codes during compression and the input codes during expansion. You can then compare the two files to see if the codes are being output/input correctly. Once you get the program to work, thoroughly test it to make sure it is correct. A good way to do this is to compress and then expand a .exe file. If the decoded result of your algorithm still executes, you can be reasonably sure that your algorithm is correct. Also look at the byte counts of the original file and the expanded file: they should match exactly.
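The rule for growing the code size can be sketched as a small function. This is only an illustration under stated assumptions: the names MIN_BITS, MAX_BITS and width_for are hypothetical, and the sketch assumes the width grows as soon as the next unassigned code no longer fits in the current number of bits. Whether your program must switch at exactly this point depends on how the provided code reserves its special values and on keeping compression and expansion in step, so check it against the source.

```c
#define MIN_BITS 9    /* smallest codeword size in the assignment */
#define MAX_BITS 14   /* largest codeword size in the assignment */

/* Hypothetical helper: returns the codeword width to use when
   `next_code` is the next code that will be added to the table.
   Assumption: a width of b bits can hold values 0 .. 2^b - 1. */
int width_for(unsigned next_code)
{
    int bits = MIN_BITS;
    /* Grow while the next code does not fit, but never past MAX_BITS. */
    while (bits < MAX_BITS && next_code > (1u << bits) - 1u) {
        bits++;
    }
    return bits;
}
```

In a real program you would more likely keep the current width in a variable and increment it once when the threshold is crossed. Whichever form you use, the compression and expansion sides must change width at the same point in the code stream.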

The hash function that you should use is as follows: pick two long integers, A and B (both > TABLE_SIZE), and let

h(hash_prefix, hash_character) = (A * hash_prefix + B * hash_character) mod TABLE_SIZE

Also remember that you should use LINEAR PROBING in the event of a collision. Thus it is a good idea to make your TABLE_SIZE about twice the maximum number of codes you may have in it. Since your maximum code will be 2^14 - 2 (see source code to see why) you can make your table around 2^15 and it should work fine.
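The hash function and the probing rule described above can be sketched as follows. This is a minimal illustration, not the required implementation: the values chosen for HASH_A and HASH_B, the array names, and the functions init_table and find_slot are all assumptions made for the example. The product is computed in a 64-bit type so that it cannot overflow before the mod is taken.

```c
#define TABLE_SIZE 32768u   /* 2^15, about twice the maximum number of codes */
#define HASH_A 40503ull     /* assumed constant, > TABLE_SIZE */
#define HASH_B 65599ull     /* assumed constant, > TABLE_SIZE */

/* Hypothetical table layout: code_value[i] == -1 marks an empty slot. */
int           code_value[TABLE_SIZE];
unsigned      prefix_code[TABLE_SIZE];
unsigned char append_character[TABLE_SIZE];

/* Mark every slot as empty before compression starts. */
void init_table(void)
{
    unsigned i;
    for (i = 0; i < TABLE_SIZE; i++) {
        code_value[i] = -1;
    }
}

/* Returns the slot holding (hash_prefix, hash_character), or the empty
   slot where that pair should be stored. The loop ends as long as the
   table is never completely full. */
unsigned find_slot(unsigned hash_prefix, unsigned hash_character)
{
    unsigned index = (unsigned)((HASH_A * hash_prefix +
                                 HASH_B * hash_character) % TABLE_SIZE);

    /* Linear probing: step to the next slot, wrapping at the end. */
    while (code_value[index] != -1 &&
           (prefix_code[index] != hash_prefix ||
            append_character[index] != hash_character)) {
        index = (index + 1u) % TABLE_SIZE;
    }
    return index;
}
```

Because the table is kept at most about half full, probe sequences should stay short. If you pick your own A and B, it is worth checking how evenly they spread the pairs across the table.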

5)      Once you have your variable code length program working, you should compare its performance with that of the original. We will provide you with a number of files to use for testing. See the announcements for when/where they are available. Run both programs on all of the files and for each file record the original size, compressed size, and compression ratio (compressed size / original size).
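The ratio defined above is simple arithmetic, but it is easy to get wrong with integer division. A minimal sketch, assuming you have both sizes in bytes (the function name compression_ratio is hypothetical):

```c
/* Hypothetical helper: compressed size / original size, as in step 5.
   A smaller ratio means better compression; a ratio above 1.0 means
   the "compressed" file is larger than the original. */
double compression_ratio(unsigned long original_size,
                         unsigned long compressed_size)
{
    if (original_size == 0) {
        return 0.0;  /* assumption: report 0 for an empty input file */
    }
    /* Convert to double first so the division is not truncated. */
    return (double)compressed_size / (double)original_size;
}
```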

TEST FILES TO USE (also test on your .cpp file and .exe file for the assignment)

6)      Write a short paper that discusses each of the following:

a)      What your modifications were to the program and how you got the variable length codes to work. Explain in detail everything you did to the program and why.

b)      How the original compression program compared to your modified program (via their compression ratios) for each of the different files. Where there was a difference between the two, be sure to explain why. Also explain why different types of files gave different compression ratios.

7)      Hand in a printout of your modified source code, your paper, and a disk containing all of your source and executable code, along with your Assignment Information Sheet.