CS 1501
Data Structures and Algorithms
Programming Project 5
Purpose: The purpose of this assignment is for you to fully
understand both the Huffman and LZW compression algorithms and to compare
their performance on test files of different types.
Procedure:
Thoroughly read the description/explanation of the LZW compression algorithm
as discussed in http://www.cs.sfu.ca/cs/CC/365/li/squeeze/.
Be sure to read not only the top-level LZW document, but also the more detailed
paper at http://www.dogma.net/markn/articles/lzw/lzw.htm.
Also run the applet to see how the algorithm works.
- Download the implementation
of the algorithm provided and get it to work. The code is old-style C
code, so you should save it as a .c file rather than a .cpp file. However,
the code should work as is (possibly with some warnings). If you prefer
working in C++, you may modify it, but be very careful if you choose to do
this, as your changes may introduce some errors. Examine the C code very
carefully, convincing yourself exactly what is accomplished by each
function and by each statement within each function. It is important that
you understand the algorithm and its implementation thoroughly.
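As a reference point while you trace the provided C code, the core of LZW can be sketched as follows. This is a minimal illustration with function names of my own choosing, not the code you are downloading: it uses growable std::map dictionaries and never caps the code width or packs bits into the output. The decoder does, however, handle the special case of a code that is not yet in its table, which is the situation discussed in handout lzw3.txt.

```cpp
#include <map>
#include <string>
#include <vector>

// Compress a byte string into a sequence of integer codes.
// Codes 0..255 are the single bytes; new phrases get 256, 257, ...
std::vector<int> lzwCompress(const std::string& in) {
    std::map<std::string, int> dict;
    for (int i = 0; i < 256; ++i) dict[std::string(1, char(i))] = i;
    int next = 256;
    std::vector<int> out;
    std::string w;                         // longest match so far
    for (char c : in) {
        std::string wc = w + c;
        if (dict.count(wc)) {
            w = wc;                        // keep extending the match
        } else {
            out.push_back(dict[w]);        // emit code for the match
            dict[wc] = next++;             // add the new phrase
            w = std::string(1, c);
        }
    }
    if (!w.empty()) out.push_back(dict[w]);
    return out;
}

// Rebuild the original string from the codes, reconstructing the
// same dictionary the compressor built.
std::string lzwDecompress(const std::vector<int>& codes) {
    std::map<int, std::string> dict;
    for (int i = 0; i < 256; ++i) dict[i] = std::string(1, char(i));
    int next = 256;
    std::string out, w;
    for (int code : codes) {
        std::string entry;
        if (dict.count(code))
            entry = dict[code];
        else
            entry = w + w[0];              // special case: code not yet in table
        out += entry;
        if (!w.empty()) dict[next++] = w + entry[0];
        w = entry;
    }
    return out;
}
```

The `entry = w + w[0]` line is the decoder-side special case: the compressor can emit a code one step before the decoder has added it to its table, and that can only happen for a phrase of the form cScSc, so the decoder reconstructs it from the previous phrase plus its own first character.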
- Complete the code given in
the Sedgewick text in Chapter 22 for the Huffman compression algorithm so
that it works (both encoding and decoding). The text explains what needs
to be added in some places, but in others you may have to figure this out
for yourselves. The code in the text is not easy to understand and you may
have to read it over a few times before you figure it out completely --
this is expected. Furthermore, the text code assumes that you only have 27
characters (letters plus a blank) when in a real file you would have ALL
of the ASCII characters (or, perhaps better described as all of the 8-bit
data combinations). Thus you need to modify the code so that it works on
any arbitrary file. You will also need to add all of the file handling to
the code (FOR COMPRESSING: input the original file, output the compressed
file with the code array; FOR DECOMPRESSING: input the compressed file,
output the original file). To handle files of arbitrary data you should
probably access the files as BINARY files and process them a byte at a
time. More notes/hints about the
Huffman algorithm are provided below.
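To picture what the completed Huffman code must accomplish, here is a sketch of building codewords over all 256 byte values with a priority queue. The node structure and names are mine, not Sedgewick's; his code[] and dad[] arrays represent the same tree implicitly rather than with explicit pointers, and this sketch leaks its nodes for brevity.

```cpp
#include <queue>
#include <string>
#include <vector>

// One tree node: a leaf for a byte value, or an internal merge node.
struct Node {
    long freq;
    int byte;                              // -1 for internal nodes
    Node *left = nullptr, *right = nullptr;
};
struct Cmp {                               // min-heap on frequency
    bool operator()(Node* a, Node* b) const { return a->freq > b->freq; }
};

// Walk the tree, assigning "0" for left and "1" for right.
void assignCodes(Node* n, const std::string& prefix,
                 std::vector<std::string>& codes) {
    if (n->byte >= 0) {
        codes[n->byte] = prefix.empty() ? "0" : prefix;  // single-symbol file
        return;
    }
    assignCodes(n->left,  prefix + "0", codes);
    assignCodes(n->right, prefix + "1", codes);
}

// freq[b] = count of byte b in the file; returns a codeword per byte.
std::vector<std::string> buildCodes(const std::vector<long>& freq) {
    std::priority_queue<Node*, std::vector<Node*>, Cmp> pq;
    for (int b = 0; b < 256; ++b)
        if (freq[b] > 0) pq.push(new Node{freq[b], b});
    while (pq.size() > 1) {                // merge the two smallest repeatedly
        Node* a = pq.top(); pq.pop();
        Node* b = pq.top(); pq.pop();
        pq.push(new Node{a->freq + b->freq, -1, a, b});
    }
    std::vector<std::string> codes(256);
    if (!pq.empty()) assignCodes(pq.top(), "", codes);
    return codes;
}
```

Note that more frequent bytes end up nearer the root and so get shorter codewords, which is the entire point of the algorithm.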
Once you have your programs working, test them with some example files. See
how well your programs compress the various files by calculating the
COMPRESSION RATIO. For each file you compress, record the original file size
(in bytes), the compressed file size (in bytes), and the ratio of the
(compressed file size)/(original file size). Clearly, the smaller the ratio,
the better the compression. Also try
the two algorithms in tandem: compress using LZW then Huffman, and vice
versa. Record the various sizes and
compression ratios for these combinations as well. We will provide you with a
number of files to use for testing. See the announcements for when/where they
are available.
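The ratio described above is just a division, but a trivial helper (the name is mine) makes the definition concrete for your record-keeping:

```cpp
// Compression ratio as defined above: (compressed size) / (original size).
// Smaller is better; a ratio above 1.0 means the "compressed" file grew.
double compressionRatio(long compressedBytes, long originalBytes) {
    return static_cast<double>(compressedBytes)
         / static_cast<double>(originalBytes);
}
```

For example, a 1000-byte file that compresses to 600 bytes has a ratio of 0.6.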
Write a short paper (~2 pages) that discusses each of the following:
- Point out and explain the
section of code in the LZW program that handles the special case that we
discussed in handout lzw3.txt. Be
specific and detailed in your explanation.
- Explain how the Huffman
algorithm is implemented. In particular, explain what the code[] and dad[]
arrays are used for in the algorithm. Be SPECIFIC.
- Compare the compression
ratios of the two algorithms for the files tested. Explain why, if at all,
one algorithm is better than the other in a given situation.
- Explain the results of the
algorithms when done in tandem.
Was any improvement achieved?
Did the order of the compressions affect the results? Postulate explanations for all of your
answers.
Notes
and Hints for Huffman Compression:
- The text says that you should
store your code[] array together with your compressed file so that you can
decompress. It may be easier for you to just store the frequency table and
rebuild the (identical) tree when you decompress. Either way is ok as long as your
algorithm is correct.
- If you have trouble figuring
out how to use the PQ from the text, just use ANY priority queue
implementation in your building of the tree (even going as far as sorting
them if you want). We will talk more about PQs later.
- The file that you compress TO
and decompress FROM MUST be a binary file, since you need to write/read
individual bytes. If you have never used binary files in C++, you may not
be familiar with the procedures for getting them to work. I have put a
simple example program onto the Web site to help you (a bit) with this.
Look on the Assignments link for it. Note that this program is NOT doing any compression and does
NOT resemble what you will do in your assignment. I have simply provided it to show you
how binary files work in C++.
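For reference, a byte-at-a-time binary round trip can look like the sketch below. This is my own minimal illustration, not the example program posted on the course site, and the file name is arbitrary:

```cpp
#include <cstdio>
#include <fstream>
#include <vector>

// Write bytes to a binary file one at a time, read them back, and
// report whether the round trip preserved them exactly.
bool roundTrip(const char* path, const std::vector<unsigned char>& data) {
    std::ofstream out(path, std::ios::binary);
    for (unsigned char b : data)
        out.put(static_cast<char>(b));     // raw byte, no text translation
    out.close();

    std::ifstream in(path, std::ios::binary);
    std::vector<unsigned char> back;
    int c;                                 // int, so EOF (-1) stays distinguishable
    while ((c = in.get()) != EOF)
        back.push_back(static_cast<unsigned char>(c));
    in.close();

    std::remove(path);                     // clean up the temporary file
    return back == data;
}
```

Opening with std::ios::binary matters on some platforms: without it, byte values like 0x0A or 0x1A may be translated or treated as end-of-file.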
- Regarding the binary file
above, you want to save your data as UNSIGNED (either unsigned char or
unsigned int should work), since you don't want the machine to use the two's
complement coding that is used for signed integers.
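The signedness issue is easy to see with two tiny illustrative helpers (mine, not course code): a byte like 0xFF read through a signed char comes out as -1 under two's complement, which can collide with EOF and break frequency-table indexing.

```cpp
// The same byte, interpreted two ways.
int viaSignedChar(unsigned char b)   { return static_cast<signed char>(b); }
int viaUnsignedChar(unsigned char b) { return b; }
```

Indexing an array with the signed interpretation of 0xFF would mean indexing with -1, so keep everything unsigned.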
- Since you are reading in
irregular-length codewords during decompression, it is not trivial to
detect the end of the file and handle it correctly. To do
this you must add an EOF character in your compression table and put that
as the last codeword in the file. Note that this should not be one of the
regular ASCII characters since a binary file could contain arbitrary byte
codes, including the EOT (ASCII 4) character. If you can't get this to work correctly, you should still
run your tests, since your program should still decompress correctly up to
the last character.
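The pseudo-EOF idea amounts to this: reserve the value 256 as a 257th "character", give it frequency 1 when building the tree, write its codeword last, and stop decoding the moment it appears, so that any padding bits filling out the final byte are ignored. In the sketch below (mine, and simplified so that already-decoded symbol values stand in for codewords), only the stopping logic is shown:

```cpp
#include <vector>

const int PSEUDO_EOF = 256;   // outside the 0..255 byte range, never ambiguous

// Keep decoded symbols up to (but not including) the sentinel;
// anything after it is padding and gets discarded.
std::vector<int> takeUntilEof(const std::vector<int>& decodedSymbols) {
    std::vector<int> out;
    for (int s : decodedSymbols) {
        if (s == PSEUDO_EOF) break;        // stop: the rest is padding bits
        out.push_back(s);
    }
    return out;
}
```

The key point is that the sentinel lives outside the byte range, so no byte in the input file can ever be mistaken for it.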