CS 1501
Algorithm Implementation
Programming Project 1
Online:
Due: All assignment materials:
1) All source files of the program
2) All .class files (or a .jar file containing them)
3) All output files resulting from execution of the test files
4) A well-written, well-formatted paper explaining your anagram algorithm and trie implementation
5) The Assignment Information Sheet
must be in the appropriate directory of the submission site by 11:59 PM on Monday, May 30, 2005.
Note: Do NOT submit the dictionary file or the input files.
Background:
Word scrambles (or anagrams) are fun puzzles that challenge our minds and our patience. The general idea of an anagram is that a string of letters is rearranged to produce a valid word or phrase. Originally the game was to unscramble a meaningless string of letters to generate a meaningful word or phrase. More recently (thanks to computers) it is also used to "create" phrases from other words or phrases. For input consider a string of letters which may or may not contain blanks (the blanks will be ignored in the anagram game, but are allowed to make the input easier for the user). The output will be a collection of valid resulting words or phrases, sorted from the phrase with the fewest words to the one with the most. Within groups with the same number of words, the listings should be alphabetical. For example, given the letters:
raziber
your anagram program should generate the following list of words:
bizarre
brazier
biz rare
biz rear
rare biz
raze rib
rear biz
rib raze
a biz err
a err biz
biz a err
biz err a
err a biz
err biz a
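The ordering rule above (fewest words first, then alphabetical within the same word count) can be sketched as a comparator. This is only an illustration: the class name `PhraseOrder` and its methods are hypothetical, and it assumes the words of a phrase are separated by single spaces.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper (not part of the assignment): orders anagram phrases
// by number of words, then alphabetically.
class PhraseOrder {
    // Count the words in a phrase; assumes words are separated by whitespace.
    static int wordCount(String phrase) {
        return phrase.trim().split("\\s+").length;
    }

    // Fewest words first; ties are broken by ordinary String ordering.
    static final Comparator<String> BY_WORDS_THEN_ALPHA = (a, b) -> {
        int byCount = Integer.compare(wordCount(a), wordCount(b));
        return byCount != 0 ? byCount : a.compareTo(b);
    };

    // Return a sorted copy, leaving the caller's list untouched.
    static List<String> sorted(List<String> phrases) {
        List<String> copy = new ArrayList<>(phrases);
        copy.sort(BY_WORDS_THEN_ALPHA);
        return copy;
    }
}
```

Calling `PhraseOrder.sorted(...)` on the solutions in any order should reproduce the listing above, provided the phrases contain only lowercase letters and single spaces.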
Your assignment is to implement this anagram program in the following way:
1) Read a dictionary of words in from a file and form a dictionary trie of these words. The file will contain ASCII strings, one word per line. Use the file dict10.txt on the CS1501 Web page. Details on the required trie implementation are specified below.
2) Read input strings from an input file (one string per line), calculate the anagrams of those strings, and output them as described above. Your output must go to an output file. The user should be able to enter the file name, and your program should read the strings one at a time until it reaches the end of the file. For a given input file, your output file name should be the same as the input file name, but with the extension .out appended to it. For example, if the original input file is "data1.dat", your output file name should be "data1.dat.out".
For more specifics on the requirements for generating the anagrams, see the comments below.
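The file handling in step 2 might look roughly like the sketch below. The class `AnagramFiles`, its method names, and the `solver` parameter (a stand-in for whatever anagram routine you write) are all assumptions made for illustration; the sketch also assumes a Java version that provides `java.nio.file.Files`.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.function.Function;

// Hypothetical driver for the input/output handling only.
class AnagramFiles {
    // "data1.dat" becomes "data1.dat.out": the extension is appended, not replaced.
    static String outputNameFor(String inputName) {
        return inputName + ".out";
    }

    // Blanks are allowed in the input but ignored by the anagram game.
    static String stripBlanks(String line) {
        return line.replace(" ", "");
    }

    // Reads one input string per line until end of file and writes each
    // solution on its own line. "solver" is an assumed placeholder for the
    // anagram algorithm; it is expected to return already-sorted phrases.
    static void processFile(String inputName, Function<String, List<String>> solver)
            throws IOException {
        Path in = Paths.get(inputName);
        Path out = Paths.get(outputNameFor(inputName));
        try (BufferedReader reader = Files.newBufferedReader(in);
             PrintWriter writer = new PrintWriter(Files.newBufferedWriter(out))) {
            String line;
            while ((line = reader.readLine()) != null) {
                for (String phrase : solver.apply(stripBlanks(line))) {
                    writer.println(phrase);
                }
            }
        }
    }
}
```

How the solutions for different input strings are separated in the output file is not specified here, so the sketch simply writes them one after another.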
For an example of an anagram-finding program, see the following link:
http://www.ssynth.co.uk/~gay/anagram.html
Use this program for reference and to help you in developing your solution, but note a few important things:
- The dictionary used by that program is different from the one you will be using, so in many cases the solutions will not be the same.
- Its solutions are listed in alphabetical order, regardless of the number of words in the phrase.
In addition to your program, you are required to write a short (~1.5 page) paper explaining how you developed your anagram algorithm. Include in this paper your goals for the algorithm, your basic approach for the recursion / backtracking, how the trie dictionary implementation was utilized by your algorithm to improve the runtime, and any problems you encountered and how you solved them. Be specific to your actual implementation in your explanation.
Important Notes:
- There are two principal parts to this project: implementing the anagram algorithm and implementing the dictionary (such that prefixes as well as whole words can be searched). It is strongly recommended that you proceed with the project in the following manner:
1) Implement the anagram algorithm using a small words file and a simple dictionary implementation (e.g., a sorted array). This will enable you to concentrate on the anagram algorithm itself without worrying about a sophisticated dictionary implementation. However, make sure you set up the algorithm as explained below, so that you can easily switch it to a trie-implemented dictionary.
2) Implement your dictionary as a DLB trie as explained below, and use the trie instead of the simple dictionary in your anagram algorithm. This should require little or no change to your anagram algorithm itself; all that changes is the dictionary implementation.
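One possible way to set this up is to have the anagram code depend only on a small interface, with the sorted array as the first implementation behind it. The names `WordDictionary` and `SortedArrayDictionary` below are hypothetical, and the sketch assumes that a complete word also counts as a valid prefix.

```java
import java.util.Arrays;

// Hypothetical interface: the anagram algorithm would call only these two
// methods, so a DLB trie can later replace the sorted array behind it.
interface WordDictionary {
    boolean isWord(String s);
    boolean isPrefix(String s);
}

// Simple first implementation backed by a sorted array and binary search.
class SortedArrayDictionary implements WordDictionary {
    private final String[] words;

    SortedArrayDictionary(String[] input) {
        words = input.clone();
        Arrays.sort(words);
    }

    public boolean isWord(String s) {
        return Arrays.binarySearch(words, s) >= 0;
    }

    public boolean isPrefix(String s) {
        int pos = Arrays.binarySearch(words, s);
        if (pos >= 0) {
            return true;                 // assumption: a whole word counts as a prefix
        }
        int insertion = -(pos + 1);      // index of the first word greater than s
        // In sorted order, any word that starts with s sits at that index.
        return insertion < words.length && words[insertion].startsWith(s);
    }
}
```

With this arrangement, switching to the trie later should only mean writing another class that implements the same interface.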
- Anagram algorithm details: Your anagram algorithm must be a recursive backtracking algorithm (see the class notes for the idea behind recursive backtracking algorithms). To enable you to get more partial credit for the anagram algorithm, you should consider the algorithm in two parts:
1) Determine the valid anagrams (if they exist) using ALL of the letters in the input (if there were spaces in the input, remove them first). For the example above, these are the solutions bizarre and brazier. Thus, your anagrams in this part will always be single words. This process can be done with a fairly straightforward recursive / backtracking algorithm that considers various permutations of the input characters. However, you MUST prune your execution tree such that impossible permutations are not tried beyond the point where they are known to be impossible. This can be done by testing prefixes in the dictionary. For example, using the input above ("raziber"), if we consider the letters in the order given, we can test in the following fashion:
r | ra | raz | razi -> backtrack at this point because "razi" is not a valid prefix in our word dictionary
Note that in order to implement this correctly, you will need a function in your dictionary to test if a string is a valid prefix in the dictionary. Think about how to do this efficiently in your trie implementation. If you complete this part correctly (assuming the rest of your project is also correct), you can receive up to 85 total points for the assignment.
2) Once you can find the anagrams using all of the letters, add code that allows you to generate multiple-word anagrams as well (for the example above, all of the solutions AFTER bizarre and brazier). This is tricky, since there are different ways of approaching the algorithm and the backtracking is more complicated. Think carefully about how you would approach this before coding it (try it with pencil and paper). Don't forget to sort the solutions from fewest words to most words (and alphabetically for solutions with the same number of words). If you complete this part correctly (assuming the rest of your project is also correct), you can receive up to 100 total points for the assignment.
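The prefix pruning described in part 1 can be sketched as follows. This is one possible shape for the single-word case only, not the required solution: the class name is hypothetical, and a `java.util.TreeSet` stands in for the dictionary so the example is self-contained (the assignment itself requires a DLB trie).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical sketch of part 1: single-word anagrams with prefix pruning.
class SingleWordAnagrams {
    private final TreeSet<String> words;   // stand-in dictionary, NOT the required DLB trie

    SingleWordAnagrams(List<String> dictionary) {
        words = new TreeSet<>(dictionary);
    }

    // True if some dictionary word starts with s (a whole word counts too).
    private boolean isPrefix(String s) {
        String next = words.ceiling(s);    // smallest word >= s
        return next != null && next.startsWith(s);
    }

    List<String> solve(String input) {
        char[] letters = input.replace(" ", "").toCharArray();
        Arrays.sort(letters);              // sorted letters make duplicates easy to skip
        List<String> found = new ArrayList<>();
        extend(letters, new boolean[letters.length], new StringBuilder(), found);
        return found;
    }

    private void extend(char[] letters, boolean[] used, StringBuilder current,
                        List<String> found) {
        if (current.length() == letters.length) {
            if (words.contains(current.toString())) {
                found.add(current.toString());
            }
            return;
        }
        for (int i = 0; i < letters.length; i++) {
            if (used[i]) continue;
            // Skip a repeated letter so the same permutation is not tried twice.
            if (i > 0 && letters[i] == letters[i - 1] && !used[i - 1]) continue;
            current.append(letters[i]);
            if (isPrefix(current.toString())) {   // prune: recurse only on valid prefixes
                used[i] = true;
                extend(letters, used, current, found);
                used[i] = false;
            }
            current.deleteCharAt(current.length() - 1);   // backtrack
        }
    }
}
```

Because the letters are sorted before the search, this sketch happens to find its results in alphabetical order. Extending it to multiple-word anagrams (part 2) needs additional design and is deliberately left out.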
- If you search the Web you will find the anagram program indicated in the link above and others as well. If you search hard enough you can probably find source code to one or more of these programs. I strongly urge you to resist trying to find this code. If you use code found on the Web for this project and you are caught, you will receive a 0 for the project (following the cheating guidelines as stated in the Course Policies).
- Dictionary trie details: Tries must be implemented as de la Briandais (DLB) trees (discussed in lecture). Recall that the DLB trie implementation uses linked lists for "trie nodes", as we discussed in lecture. See the online and handwritten lecture notes for more details on the DLB implementation. You must implement your DLB as a class. Think carefully about how your data will be structured and about the operations that your class will need. At a minimum, you will need a method to insert a new string into the trie (used when you create your dictionary), a method to search for a prefix in the trie (as explained above), and a method to search for a word in the trie. It is actually fairly simple (and more efficient from a run-time point of view) to combine the prefix and word search operations into a single method. Building your DLB class is a significant part of the overall project, so don't be discouraged if you have some difficulty with it.
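As a rough illustration of the structure described above, the sketch below stores each level of the trie as a linked list of sibling nodes and combines the prefix and word tests in one `search` method. Everything here is an assumption made for the example: the class name, the integer return codes, and the `'$'` end-of-word marker (which must not occur in any word). The lecture notes remain the reference for the expected design.

```java
// Hypothetical minimal DLB trie; names and return codes are illustrative only.
class DlbTrie {
    static final int NOT_FOUND = 0;
    static final int PREFIX = 1;            // proper prefix of at least one longer word
    static final int WORD = 2;              // a complete word
    static final int WORD_AND_PREFIX = 3;   // both of the above

    private static final char END = '$';    // assumed end-of-word marker

    private static final class Node {
        final char letter;
        Node sibling;   // next alternative letter at the same depth
        Node child;     // first letter at the next depth
        Node(char letter) { this.letter = letter; }
    }

    private final Node root = new Node('\0');   // dummy node above the first level

    // Walk the child list of parent looking for c; null if absent.
    private static Node findChild(Node parent, char c) {
        for (Node n = parent.child; n != null; n = n.sibling) {
            if (n.letter == c) return n;
        }
        return null;
    }

    void insert(String word) {
        Node current = root;
        String key = word + END;
        for (int i = 0; i < key.length(); i++) {
            Node next = findChild(current, key.charAt(i));
            if (next == null) {
                next = new Node(key.charAt(i));
                next.sibling = current.child;   // push onto the front of the sibling list
                current.child = next;
            }
            current = next;
        }
    }

    // One traversal answers both questions: is s a word, and is s a prefix?
    int search(String s) {
        Node current = root;
        for (int i = 0; i < s.length(); i++) {
            current = findChild(current, s.charAt(i));
            if (current == null) return NOT_FOUND;
        }
        boolean isWord = false;
        boolean isPrefix = false;
        for (Node n = current.child; n != null; n = n.sibling) {
            if (n.letter == END) isWord = true; else isPrefix = true;
        }
        return (isPrefix ? PREFIX : 0) | (isWord ? WORD : 0);
    }
}
```

For example, after inserting "biz" and "bizarre", `search("biz")` returns `WORD_AND_PREFIX`, while `search("razi")` returns `NOT_FOUND`.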
- Note: You can improve the efficiency of your searches even more if you maintain the "state" of the search in the trie and incrementally continue the search for each new letter. For example, rather than starting the prefix searches above for r | ra | raz | razi at the root of the trie each time, you can instead proceed one character down the tree for each search. In this way, rather than comparing 1 (r) + 2 (ra) + 3 (raz) + 4 (razi) = 10 characters for the 4 prefix searches, you instead compare only 4 (not counting sibling comparisons). Once the prefix test fails you backtrack in the trie just as you backtrack in your anagram algorithm. Note that this will require your search algorithm to be integrated into your trie data structure, which violates the concept of "data hiding" with ADTs. However, sometimes good object-oriented programming must be sacrificed for improved efficiency. If you complete this run-time improvement to the trie search algorithm, and explain it thoroughly in your write-up, you can receive up to 10 extra credit points for the assignment.
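One way to picture the incremental search is a trie that remembers the nodes matched so far, so that each new letter costs a single step down and each backtrack a single step up. The sketch below is only an illustration under assumptions: the class `SteppingTrie`, the methods `advance`, `retreat`, and `atWord`, and the `'$'` end marker are hypothetical names, and other designs (for example, passing the current node through the recursion) would work as well.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch: a DLB trie that keeps the state of the current search.
class SteppingTrie {
    private static final char END = '$';   // assumed end-of-word marker, never searched for

    private static final class Node {
        final char letter;
        Node sibling;   // next alternative letter at this depth
        Node child;     // first letter of the next depth
        Node(char letter) { this.letter = letter; }
    }

    private final Node root = new Node('\0');              // dummy node above the first level
    private final Deque<Node> path = new ArrayDeque<>();   // one node per matched letter

    private static Node findChild(Node parent, char c) {
        for (Node n = parent.child; n != null; n = n.sibling) {
            if (n.letter == c) return n;
        }
        return null;
    }

    void insert(String word) {
        Node current = root;
        String key = word + END;
        for (int i = 0; i < key.length(); i++) {
            Node next = findChild(current, key.charAt(i));
            if (next == null) {
                next = new Node(key.charAt(i));
                next.sibling = current.child;
                current.child = next;
            }
            current = next;
        }
    }

    private Node current() {
        return path.isEmpty() ? root : path.peek();
    }

    // Try to extend the current prefix by one letter. On failure the state is unchanged.
    boolean advance(char c) {
        Node next = findChild(current(), c);
        if (next == null) return false;
        path.push(next);
        return true;
    }

    // Undo the most recent successful advance (call only after one has succeeded).
    void retreat() {
        path.pop();
    }

    // True if the letters matched so far spell a complete word.
    boolean atWord() {
        return findChild(current(), END) != null;
    }
}
```

In a backtracking search, each recursive call would pair one successful `advance` with one `retreat` on the way back, mirroring the append and remove steps on the candidate string.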
- W section details: Students in the W section must make their papers 3 pages in length and must put more emphasis on writing style. Papers for those in the W section will be weighted more heavily than papers for those in the non-W section. Be sure your paper is written in a technical manner (i.e., using a somewhat formal tone). Even if your project was not completed or your algorithms are incorrect, your paper should still be 3 pages in length.
- Be sure to thoroughly document your code and to follow all of the submission guidelines.