CS 1501

Algorithm Implementation

Programming Project 1

Online: Sunday, May 15, 2005

Due: 11:59 PM on Monday, May 30, 2005. Submit all assignment materials to the appropriate directory of the submission site: 1) all source files of your program, 2) all .class files (or a .jar file containing them), 3) all output files resulting from execution of the test files, 4) a well-written, well-formatted paper explaining your anagram algorithm and tree implementation, and 5) the Assignment Information Sheet. Note: Do NOT submit the dictionary file or the input files.

Background:

Word scrambles (or anagrams) are fun puzzles that challenge our minds and our patience. The general idea of an anagram is that a string of letters is rearranged to produce a valid word or phrase. Originally the game was to unscramble a meaningless string of letters to generate a meaningful word or phrase. More recently (thanks to computers) it is also used to "create" phrases from other words or phrases. For input consider a string of letters which may or may not contain blanks (the blanks will be ignored in the anagram game, but are allowed to make the input easier for the user). The output will be a collection of valid resulting words or phrases, sorted from the phrase with the fewest words to the one with the most. Within groups with the same number of words, the listings should be alphabetical. For example, given the letters:

raziber

your anagram program should generate the following list of words:

bizarre
brazier
biz rare
biz rear
rare biz
raze rib
rear biz
rib raze
a biz err
a err biz
biz a err
biz err a
err a biz
err biz a

Your assignment is to implement this anagram program in the following way:

1) Read a dictionary of words in from a file and form a dictionary trie of these words. The file will contain ASCII strings, one word per line. Use the file dict10.txt on the CS1501 Web page. Details on the required trie implementation are specified below.

2) Read input strings from an input file (one string per line) and calculate the anagrams of those strings and output them as described above. Your output must go to an output file. The user should be able to enter the file name, and your program should read the strings one at a time until it reaches the end of the file. For a given input file, your output file name should be the same as the input file name, but with the extension .out appended to it. For example, if the original input file is "data1.dat" your output file name should be "data1.dat.out". For more specifics on the requirements for generating the anagrams, see the comments below.
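The file handling described in step 2 can be sketched roughly as follows. This is only a minimal illustration under assumptions: the class name AnagramIO and its method names are hypothetical, and the line that writes to the output file is a placeholder for the real anagram search.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Paths;

// Illustrative helper; the class and method names are assumptions, not requirements.
class AnagramIO {
    // Appends ".out" to the full input name: "data1.dat" becomes "data1.dat.out".
    static String outputNameFor(String inputName) {
        return inputName + ".out";
    }

    // Blanks are ignored by the anagram game, so drop them before searching.
    static String stripBlanks(String line) {
        return line.replace(" ", "");
    }

    // Reads one string per line until end of file and writes to the .out file.
    static void processFile(String inputName) throws IOException {
        try (BufferedReader in = Files.newBufferedReader(Paths.get(inputName));
             PrintWriter out = new PrintWriter(
                     Files.newBufferedWriter(Paths.get(outputNameFor(inputName))))) {
            String line;
            while ((line = in.readLine()) != null) {
                String letters = stripBlanks(line);
                // Placeholder: a real solution would compute and print the anagrams here.
                out.println(letters);
            }
        }
    }
}
```

Calling AnagramIO.processFile("data1.dat") would create "data1.dat.out" next to the input file.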

For an example of an anagram-finding program, see the following link:

http://www.ssynth.co.uk/~gay/anagram.html

Use this program for reference and to help you in developing your solution, but note a few important things:

· The dictionary used for his program is different from the one you will be using, so in many cases the solutions will not be the same

· His solutions are listed in alphabetical order, regardless of the number of words in the phrase

In addition to your program, you are required to write a short (~1.5 page) paper explaining how you developed your anagram algorithm. Include in this paper your goals for the algorithm, your basic approach for the recursion / backtracking, how the trie dictionary implementation was utilized by your algorithm to improve the runtime, and any problems you encountered and how you solved them. Be specific to your actual implementation in your explanation.

Important Notes:

· There are two principal parts to this project: implementing the anagram algorithm and implementing the dictionary (such that prefixes as well as whole words can be searched). It is strongly recommended that you proceed with the project in the following manner:

1) Implement the anagram algorithm using a small words file and a simple dictionary implementation (ex: sorted array). This will enable you to concentrate on the anagram algorithm itself without worrying about a sophisticated dictionary implementation. However, make sure you set up the algorithm as explained below, so that you can easily switch it to a trie-implemented dictionary.

2) Implement your dictionary as a DLB trie as explained below, and use the trie instead of the simple dictionary in your anagram algorithm. This should require little or no change to your anagram algorithm itself – all that is changing is the dictionary implementation.
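One way to make the later switch to a trie painless is to hide the dictionary behind a small interface. The sketch below is an assumption about how that could look, not a required design: the names WordDictionary and SortedArrayDictionary are hypothetical, and the simple implementation uses a sorted array with binary search for both the word test and the prefix test.

```java
import java.util.Arrays;

// Hypothetical interface; the anagram algorithm would depend only on this.
interface WordDictionary {
    boolean isWord(String s);
    boolean isPrefix(String s);
}

// Simple stand-in dictionary backed by a sorted array.
class SortedArrayDictionary implements WordDictionary {
    private final String[] words;

    SortedArrayDictionary(String[] input) {
        words = input.clone();
        Arrays.sort(words);
    }

    public boolean isWord(String s) {
        return Arrays.binarySearch(words, s) >= 0;
    }

    // In sorted order, the first word >= s starts with s whenever any word does.
    public boolean isPrefix(String s) {
        int pos = Arrays.binarySearch(words, s);
        if (pos < 0) {
            pos = -pos - 1; // insertion point: index of the first word greater than s
        }
        return pos < words.length && words[pos].startsWith(s);
    }
}
```

A DLB-based class could later implement the same interface, so the anagram code would not need to change. Note that in this sketch a complete word also counts as a prefix of itself.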

· Anagram algorithm details: Your anagram algorithm must be a recursive backtracking algorithm (see the class notes for the idea behind recursive backtracking algorithms). To enable you to get more partial credit for the anagram algorithm, you should consider the algorithm in two parts:

1) Determine the valid anagrams (if they exist) using ALL of the letters in the input (if there were spaces in the input, remove them first). For the example above, these are the solutions bizarre and brazier. In this part, your anagrams will therefore always be single words. This process can be done with a fairly straightforward recursive / backtracking algorithm that considers various permutations of the input characters. However, you MUST prune your execution tree such that impossible permutations are not tried beyond the point where they are known to be impossible. This can be done by testing prefixes in the dictionary. For example, using the input above ("raziber"), if we consider the letters in the order given, we can test in the following fashion:

r | ra | raz | razi -> backtrack at this point because "razi" is not a valid prefix in our word dictionary

Note that in order to implement this correctly, you will need a function in your dictionary to test if a string is a valid prefix in the dictionary. Think about how to do this efficiently in your trie implementation. If you complete this part correctly (assuming the rest of your project is also correct) you can receive up to 85 total points for the assignment.

2) Once you can find the anagrams using all of the letters, add code that allows you to generate multiple-word anagrams as well (for the example above, these are all of the solutions AFTER bizarre and brazier). This is tricky since there are different ways of approaching the algorithm and the backtracking is more complicated. Think carefully about how you would approach this before coding it (try it with a pencil and paper). Don't forget to sort the solutions from fewest words to most words (and alphabetically for solutions with the same number of words). If you complete this part correctly (assuming the rest of your project is also correct) you can receive up to 100 total points for the assignment.
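The two parts above can be illustrated with one possible backtracking structure. This is a sketch under stated assumptions, not the expected solution: the class MultiWordAnagrams is hypothetical, a java.util.TreeSet stands in for the required DLB trie, and the input is assumed to contain only lowercase letters and blanks. Each recursive call either extends the current word by one unused letter (pruning when the result is not a valid prefix) or, when the current word is complete, commits it and starts a new word with the remaining letters.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

class MultiWordAnagrams {
    private final TreeSet<String> words; // stand-in dictionary (assumption)

    MultiWordAnagrams(List<String> dictionary) {
        words = new TreeSet<>(dictionary);
    }

    // True if some dictionary word starts with s.
    private boolean isPrefix(String s) {
        String next = words.ceiling(s);
        return next != null && next.startsWith(s);
    }

    List<String> find(String input) {
        int[] counts = new int[26];
        int total = 0;
        for (char c : input.toCharArray()) {
            if (c >= 'a' && c <= 'z') { counts[c - 'a']++; total++; } // blanks are skipped
        }
        List<List<String>> found = new ArrayList<>();
        extend(new StringBuilder(), new ArrayList<>(), counts, total, found);
        // Fewest words first, then alphabetical within the same word count.
        found.sort(Comparator.comparingInt((List<String> p) -> p.size())
                .thenComparing(p -> String.join(" ", p)));
        List<String> lines = new ArrayList<>();
        for (List<String> phrase : found) lines.add(String.join(" ", phrase));
        return lines;
    }

    private void extend(StringBuilder word, List<String> phrase, int[] counts,
                        int remaining, List<List<String>> found) {
        if (word.length() > 0 && words.contains(word.toString())) {
            phrase.add(word.toString());              // commit the current word
            if (remaining == 0) {
                found.add(new ArrayList<>(phrase));   // every letter used: a solution
            } else {
                extend(new StringBuilder(), phrase, counts, remaining, found);
            }
            phrase.remove(phrase.size() - 1);         // backtrack: uncommit the word
        }
        for (int i = 0; i < 26 && remaining > 0; i++) {
            if (counts[i] == 0) continue;
            word.append((char) ('a' + i));
            counts[i]--;
            if (isPrefix(word.toString())) {          // prune impossible prefixes
                extend(word, phrase, counts, remaining - 1, found);
            }
            counts[i]++;                              // backtrack: return the letter
            word.deleteCharAt(word.length() - 1);
        }
    }
}
```

Iterating over letter counts rather than over input positions avoids generating the same phrase twice when a letter repeats (such as the two r's in "raziber"). Part 1 corresponds to keeping only the solutions that contain a single word.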

· If you search the Web you will find the anagram program indicated in the link above and others as well. If you search hard enough you can probably find source code to one or more of these programs. I strongly urge you to resist trying to find this code. If you use code found on the Web for this project and are caught, you will receive a 0 for the project (following the cheating guidelines as stated in the Course Policies).

· Dictionary trie details: Tries must be implemented as de la Briandais (DLB) trees (discussed in lecture). Recall that the DLB trie implementation uses linked lists for "trie nodes", as we discussed in lecture. See online and handwritten lecture notes for more details on the DLB implementation. You must implement your DLB as a class. Think carefully about how your data will be structured and about the operations that your class will need. Minimally you will need a method to insert a new string into the trie (used when you create your dictionary), a method to search for a prefix in the trie (as explained above) and a method to search for a word in the trie. It is actually fairly simple (and more efficient from a run-time point of view) to combine the prefix and search operations into a single method. Building your DLB class is a significant part of the overall project, so don't be discouraged if you have some difficulty with it.
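To make the shape of such a class concrete, here is a minimal sketch of a DLB with an insert method and a combined prefix/word search. Several details are assumptions rather than requirements from lecture: the class name DLB, the use of a boolean flag (instead of a terminator character) to mark the end of a word, and the integer return codes of search.

```java
// Minimal DLB sketch: each node holds one letter, a link to the next sibling
// (an alternative letter at the same position) and a link to its first child.
class DLB {
    private static final class Node {
        final char letter;
        boolean endsWord; // assumption: a flag marks word ends
        Node sibling;
        Node child;
        Node(char letter) { this.letter = letter; }
    }

    // Hypothetical result codes for the combined search.
    static final int NOT_FOUND = 0, PREFIX = 1, WORD = 2, WORD_AND_PREFIX = 3;

    private Node root; // head of the sibling list for first letters

    void insert(String word) {
        if (word.isEmpty()) return;
        Node level = root;   // head of the sibling list at the current depth
        Node parent = null;  // node matched at the previous depth
        Node node = null;
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            Node last = null;
            node = level;
            while (node != null && node.letter != c) { // walk the sibling list
                last = node;
                node = node.sibling;
            }
            if (node == null) {                        // letter missing: add it
                node = new Node(c);
                if (last != null) last.sibling = node;
                else if (parent != null) parent.child = node;
                else root = node;
            }
            parent = node;
            level = node.child;
        }
        node.endsWord = true;
    }

    // Returns NOT_FOUND, PREFIX (of a longer word), WORD, or WORD_AND_PREFIX.
    int search(String s) {
        Node level = root;
        Node node = null;
        for (int i = 0; i < s.length(); i++) {
            node = level;
            while (node != null && node.letter != s.charAt(i)) node = node.sibling;
            if (node == null) return NOT_FOUND;
            level = node.child;
        }
        if (node == null) return NOT_FOUND;            // empty query
        int result = NOT_FOUND;
        if (node.child != null) result |= PREFIX;      // some longer word continues
        if (node.endsWord) result |= WORD;
        return result;
    }
}
```

With "biz", "bizarre" and "rare" inserted, search("bi") reports a prefix only, search("biz") reports both a word and a prefix, and search("razi") reports not found, so a single call answers both questions the anagram algorithm needs to ask.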

· Note: You can improve the efficiency of your searches even more if you maintain the "state" of the search in the trie and incrementally continue the search for each new letter. For example, rather than starting the prefix searches above for r | ra | raz | razi at the root of the trie each time, you can instead proceed one character down the tree for each search. In this way, rather than comparing 1 (r) + 2 (ra) + 3 (raz) + 4 (razi) characters for the 4 prefix searches, you instead only compare 4 (not counting sibling comparisons). Once the prefix test fails you backtrack in the trie just as you backtrack in your anagram algorithm. Note that this will require your search algorithm to be integrated into your trie data structure, which violates the concept of "data hiding" with ADTs. However, sometimes good object-oriented programming must be sacrificed for improved efficiency. If you complete this run-time improvement to the trie search algorithm, and explain it thoroughly in your write-up, you can receive up to 10 extra credit points for the assignment.
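The incremental idea can be sketched as follows for the single-word case. This is an illustration under assumptions, not the required design: the class IncrementalDLB, its sentinel root node and the step method are all hypothetical, and the input is assumed to be lowercase letters and blanks. The recursion carries the current trie node along with the partial word, so each new letter costs one step down the trie instead of a fresh search from the root.

```java
import java.util.ArrayList;
import java.util.List;

class IncrementalDLB {
    static final class Node {
        final char letter;
        boolean endsWord;
        Node sibling, child;
        Node(char letter) { this.letter = letter; }
    }

    // Assumption: a sentinel root whose child list holds the first letters.
    final Node root = new Node('\0');

    void insert(String word) {
        if (word.isEmpty()) return;
        Node node = root;
        for (char c : word.toCharArray()) {
            Node next = step(node, c);
            if (next == null) {
                next = new Node(c);
                next.sibling = node.child; // push onto the front of the child list
                node.child = next;
            }
            node = next;
        }
        node.endsWord = true;
    }

    // Moves one letter down from 'from'; null means no word continues with c.
    static Node step(Node from, char c) {
        for (Node n = from.child; n != null; n = n.sibling) {
            if (n.letter == c) return n;
        }
        return null;
    }

    // Single-word anagrams of the letters, found without restarting at the root.
    List<String> anagrams(String letters) {
        int[] counts = new int[26];
        int total = 0;
        for (char c : letters.toCharArray()) {
            if (c >= 'a' && c <= 'z') { counts[c - 'a']++; total++; }
        }
        List<String> results = new ArrayList<>();
        extend(root, new StringBuilder(), counts, total, results);
        return results;
    }

    private void extend(Node at, StringBuilder current, int[] counts,
                        int remaining, List<String> results) {
        if (remaining == 0) {
            if (at.endsWord) results.add(current.toString());
            return;
        }
        for (int i = 0; i < 26; i++) {
            if (counts[i] == 0) continue;
            Node next = step(at, (char) ('a' + i));
            if (next == null) continue;     // prune: this prefix is not in the trie
            counts[i]--;
            current.append((char) ('a' + i));
            extend(next, current, counts, remaining - 1, results);
            current.deleteCharAt(current.length() - 1);
            counts[i]++;                    // returning also "backtracks" in the trie
        }
    }
}
```

Because the trie position is a parameter of the recursive call, returning from a call restores the previous position automatically, which is the trie backtracking described above.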

· W section details: Students in the W section must make their papers 3 pages in length and must put more emphasis on writing style. Papers for those in the W section will be weighted more heavily than papers for those in the non-W section. Be sure your paper is written in a technical manner (i.e. using a somewhat formal tone). Even if your project was not completed or your algorithms are incorrect, your paper should still be 3 pages in length.

· Be sure to thoroughly document your code, and to follow all of the submission guidelines.