CS 1501 Programming Project 4

Lempel-Ziv-Welch Compression

Purpose: The purpose of this assignment is for you to fully understand the LZW compression algorithm, its performance and its implementation.

Background: In lecture we discussed the situation of running out of codewords during LZW compression. We discussed two possible options: 1) using the codewords we have for the remainder of the compression, adding no new ones, and 2) reinitializing the dictionary of (string, codeword) pairs back to empty and regenerating codewords as we encounter new patterns. In this assignment you will implement a compromise between options 1) and 2). As you compress, you will monitor the compression ratio achieved so far, defined as (compressed file size) / (original file size). It can easily be calculated while the compression is being done, since we can count how many bytes have been read and how many codewords have been written. Clearly, as longer patterns are matched within your file, the compression ratio should improve (i.e. get smaller). However, after all codewords have been used up, the compression ratio will stop improving and may start to degrade (i.e. get larger) if you encounter a lot of new patterns. At this point it may be a good idea to reset your dictionary and start generating new codewords. However, if the compression ratio stays fairly constant, it may be a better idea to stick with the current dictionary. In this assignment you will implement this hybrid dictionary maintenance technique.
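As a rough illustration of the running calculation described above, the sketch below computes the ratio from two counters. The function name compression_ratio and its parameters are hypothetical and are not part of the provided code. It assumes every codeword is written with a fixed width of BITS bits and every input byte counts as 8 bits.

```c
#include <stdio.h>

#define BITS 14  /* codeword width used in this assignment */

/* Hypothetical helper: ratio = compressed bits / original bits.
   Assumes fixed-width codewords of BITS bits each.
   Returns 1.0 before any input has been read, to avoid dividing by zero. */
double compression_ratio(unsigned long bytes_read, unsigned long codes_written)
{
    if (bytes_read == 0)
        return 1.0;
    return ((double)codes_written * BITS) / ((double)bytes_read * 8.0);
}
```

For example, 2000 codewords written for 7000 bytes read gives (2000 * 14) / (7000 * 8) = 0.5.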

Procedure:

1)      Thoroughly read the description/explanation of the LZW compression algorithm as discussed in http://www.cs.sfu.ca/cs/CC/365/li/squeeze/. Be sure to read not only the top level LZW document, but also the more detailed paper in http://www.dogma.net/markn/articles/lzw/lzw.htm .  Also run the applet to see how the algorithm works.

2)      Download the implementation of the algorithm provided and get it to work. The code is old-style C code, so you should save it as a .c file rather than a .cpp file. The code should work as is; the compiler may report some warnings, but the build should still succeed in Borland C++, Visual C++ or gcc. If you are not familiar with C programs, or with compiling and running them, ask your TA for help. If you have not used either C or C++ before, I STRONGLY recommend using gcc on Unix to compile and run your program. It is simply a command-line compiler that is easy to use and does not have a GUI that you must learn. The BITS identifier in this program can be set to 12, 13 or 14. SET IT TO 14 for this assignment.

3)      Examine the C code very carefully until you understand exactly what each function, and each statement within it, accomplishes. Pay particular attention to the functions that input and output the codes and to those that update the dictionary. You may have to look up some bit operators in help files or textbooks to understand these functions (your notes from lecture will also help).

4)      Copy the C code and modify it so as to maintain the dictionary as specified above.  Call this program lzwmod.c.  In particular:

1.      Keep track of the compression ratio as you perform your compression.  At any point in the compression, the compression ratio is equal to the bits required for the generated codewords divided by the bits required for the original data (from the beginning of the file on).

2.      When all 14-bit codewords have been used up, begin monitoring the compression ratio. If it degrades by more than X% from its value at the point when the last codeword was added (where X is a value input by the user), reset the dictionary to empty (just the original ASCII set) and reset the next_code value to 256. Doing this does not require a lot of code and does not require you to allocate or deallocate any dynamic memory. However, you will need to understand the C code so you know where to put your statements and why. Be careful to coordinate the decompress algorithm with the compress algorithm so that they work correctly together.
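The reset test in the two steps above could be sketched as follows. This is only one possible reading, in which "degrades by more than X%" is taken as a relative increase over the ratio recorded when the dictionary filled up. The name should_reset and its parameters are hypothetical, not taken from the provided code.

```c
/* Hypothetical helper, assuming a relative interpretation of X%.
   Returns 1 when the current ratio is more than x_percent larger (worse)
   than the baseline ratio recorded when the last codeword was added,
   and 0 otherwise. */
int should_reset(double baseline_ratio, double current_ratio, double x_percent)
{
    return current_ratio > baseline_ratio * (1.0 + x_percent / 100.0);
}
```

With a baseline of 0.50 and X = 5, a current ratio of 0.52 would not trigger a reset, while 0.53 would. Whatever test you choose, the expansion side presumably has to evaluate the same test on the same counts at the same points, or the two dictionaries will fall out of step.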

5)      Thoroughly test your modified program to ensure that it is correct.  One good way to do this is to run it on a large .exe file.  Make sure that it is large enough to use up the 14-bit codewords.   Also set X to be small so that you test your dictionary reset code for correctness.    If the compressed then expanded file still runs, it is likely that your program is correct.

6)      Once you have your modified program working, you should do some experiments to compare performances.  I will provide you with a number of files to use for testing – see the Assignments page for the link (should be up by July 3).  Specifically, you will compare the performance of 5 different executions:

1.      The original lzw.c program using BITS set at 14

2.      Your lzwmod.c program using BITS set at 14 and X set at 1%

3.      Your lzwmod.c program using BITS set at 14 and X set at 5%

4.      Your lzwmod.c program using BITS set at 14 and X set at 15%

5.      Your lzwmod.c program using BITS set at 14 and X set at 25%

Run all five executions on all of the files and, for each file and each execution, record the original size, compressed size, and compression ratio (compressed size / original size). In addition to the files in the directory, also test copies of your source code and of your .exe file for this project.

7)      Write a short (~2 pages) paper that discusses each of the following:

a)      What your modifications were to the program and how you got the dictionary maintenance to work.  Explain in detail everything you did to the program and why.

b)      How the compression ratios compare for the various test files.  Where there were differences between them, be sure to explain why.  Speculate as to which version (if any) was the best overall.

c)      For all algorithms, indicate which of the test files gave the best and worst compression ratios, and speculate as to why this was the case.  If any files did not compress at all or compressed very poorly, speculate as to why.

d)      Include in your paper all of the compression ratio results that you recorded in 6) above.

8)      Submit your lzwmod.c code, your paper, and both of your executable code files (.exe's of the original and of your modified program), along with your Assignment Information Sheet to the appropriate submission directory.  DO NOT SUBMIT THE TEST FILES – they will use up too much space on the submission site.  As always, make sure that any executables will be able to be executed by the TAs without any modification.

9)      Extra Credit: Modify your program so that it can actually be used in a practical way (the way compress is used):

1.      Rather than prompting for the file, the name of the file should be read in from the command line

2.      Flags on the command line will indicate whether to compress or decompress (ex:  -c means compress and -d means decompress)

3.      Rather than making a copy file that is compressed, you must replace the original file by the compressed one (which can be done by simply deleting the old file and creating the new one).  Be VERY careful when doing this – test it only on COPIES of files, especially important ones, since an error could cause the original file to be corrupted and/or deleted.
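The command-line handling described in the extra-credit items could be sketched like this. The names parse_args and enum mode are hypothetical, and the sketch assumes exactly one flag followed by one file name (for example: lzwmod -c myfile).

```c
#include <string.h>

enum mode { MODE_ERROR = 0, MODE_COMPRESS, MODE_EXPAND };

/* Hypothetical helper: expects "prog -c file" or "prog -d file".
   On success, stores the file name in *filename and returns the mode.
   On any other input, returns MODE_ERROR and leaves *filename untouched. */
enum mode parse_args(int argc, char *argv[], const char **filename)
{
    if (argc != 3)
        return MODE_ERROR;
    if (strcmp(argv[1], "-c") == 0) {
        *filename = argv[2];
        return MODE_COMPRESS;
    }
    if (strcmp(argv[1], "-d") == 0) {
        *filename = argv[2];
        return MODE_EXPAND;
    }
    return MODE_ERROR;
}
```

For the file replacement, one cautious approach is to write the output to a temporary file first and only then call the standard remove() and rename() functions from <stdio.h>, so that a failure partway through does not destroy the original.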