A partial Viterbi calculation is pictured here. This calculation takes us up through t=2 where v2(1) and v2(2) are computed. In the picture, the index 1 is used for the state labeled C and the index 2 is used for the state labeled H. Compute v3(1) and v3(2). You will need the transition and observation probabilities given here.
Think of this as filling in a table where the columns are moments in time and the rows are states in the HMM. With the numbers from the diagram above filled in, a column added for time t = 0, and every probability cell shown, the table looks like this:
t = | 0 | 1 | 2 | 3 |
---|---|---|---|---|
end | 0 | 0 | 0 | |
H | 0 | .32 | .0448 | |
C | 0 | .02 | .048 | |
start | 1.0 | 0 | 0 | |
Each cell in the Viterbi table is filled with one of the Viterbi values computed in the diagram. Like the diagram, the table is complete through t=2. The values in the cells are Viterbi probabilities. For example, v2(2) represents the probability of the highest-probability path that ends at state 2 (H) at time 2.
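The update that fills one new column of the table from the previous column can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the worked answer: the function name `viterbi_step`, the dictionary layout, and the idea of keying states by their labels are all illustrative choices, and the function must be given the transition and observation probabilities from the figure.

```python
def viterbi_step(prev, trans, emit, obs):
    """Compute one column of a Viterbi table from the previous column.

    prev[i]     : Viterbi value v_{t-1}(i) for state i
    trans[i][j] : P(state j at time t | state i at time t-1)
    emit[j][o]  : P(observation o | state j)
    Returns (values, backpointers) for time t.
    """
    values, backptrs = {}, {}
    for j in emit:
        # Pick the predecessor state that maximizes v_{t-1}(i) * a_ij.
        best_i = max(prev, key=lambda i: prev[i] * trans[i][j])
        values[j] = prev[best_i] * trans[best_i][j] * emit[j][obs]
        backptrs[j] = best_i
    return values, backptrs
```

Calling it with `prev` set to the t=2 column and `obs` set to the third observation would yield the t=3 column; the backpointers record which predecessor each best path came from.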
Implement a probabilistic CKY parser.
The sample grammar file pcfg.txt (click here to download) is exactly the same as the probabilistic grammar in Figure 13.1 of the textbook (at least, that was the figure number before the textbook was updated). Your program should read in any grammar file in this .txt format, because we will test it with other grammars besides this one. You are allowed to use NLTK and Stanford's NLP packages that support PCFG parsing, so that you don't have to code up the classes and data structures needed to maintain the grammar. The rest of the programming, however, must be done from scratch (e.g., the binarization and the CKY dynamic programming algorithm).
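For orientation, the dynamic program at the core of probabilistic CKY can be sketched as below. This is only a sketch under assumptions that are not part of the assignment: the grammar is already in strict CNF (no unary non-terminal rules), the function name `pcky` and the `lexical` and `binary` data layouts are hypothetical, and tree recovery from the backpointers is left out.

```python
from collections import defaultdict

def pcky(words, lexical, binary):
    """Fill a probabilistic CKY chart for a grammar in strict CNF.

    lexical : dict mapping a word to a list of (A, prob) for rules A -> word
    binary  : list of (A, B, C, prob) for rules A -> B C
    Returns (best, back) where best[(i, j)][A] is the highest probability
    of A spanning words[i:j] and back holds the matching backpointers.
    """
    n = len(words)
    best = defaultdict(dict)
    back = {}
    # Spans of length 1 come from lexical rules.
    for i, w in enumerate(words):
        for A, p in lexical.get(w, []):
            best[(i, i + 1)][A] = p
            back[(i, i + 1, A)] = w
    # Longer spans combine two adjacent, already-filled spans.
    for span in range(2, n + 1):
        for i in range(0, n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for A, B, C, p in binary:
                    if B in best[(i, k)] and C in best[(k, j)]:
                        score = p * best[(i, k)][B] * best[(k, j)][C]
                        if score > best[(i, j)].get(A, 0.0):
                            best[(i, j)][A] = score
                            back[(i, j, A)] = (k, B, C)
    return best, back
```

The sentence is parsed if the start symbol appears in `best[(0, len(words))]`. In practice, multiplying many probabilities can underflow, so summing log probabilities is a common alternative.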
The grammar provided has rules such as VP -> Verb NP PP, which has more than two non-terminals on the right-hand side. However, the CKY algorithm can only handle grammars in a binarized format such as CNF. Therefore, your program will need to binarize the grammar before CKY decoding can be executed. Because your program is tested blind on other grammars, you cannot binarize the grammar by hand. You should use the CNF conversion introduced in class. Be very careful to binarize in a way that keeps the probability of equivalent rules the same before and after binarization.
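One way to keep probabilities unchanged when splitting a long rule is to leave the original probability on the first new rule and give every introduced rule probability 1.0, so the product along the chain equals the original. The sketch below illustrates only that idea; it is not necessarily the CNF conversion introduced in class, it does not handle unary rules, and the function name `binarize`, the tuple layout, and the `@` naming scheme for introduced symbols are assumptions.

```python
def binarize(rules):
    """Split rules with more than two right-hand-side symbols into binary rules.

    rules: list of (lhs, rhs_tuple, prob).
    The first rule of each chain keeps the original probability and every
    introduced rule gets 1.0, so the product over the chain is unchanged.
    """
    out = []
    for lhs, rhs, prob in rules:
        head, p = lhs, prob
        rhs = list(rhs)
        prefix = []
        while len(rhs) > 2:
            prefix.append(rhs[0])
            # Hypothetical naming scheme for introduced symbols.
            new_sym = "@" + lhs + "." + ".".join(prefix)
            out.append((head, (rhs[0], new_sym), p))
            head, p, rhs = new_sym, 1.0, rhs[1:]
        out.append((head, tuple(rhs), p))
    return out
```

With this naming, two rules that share a left-hand side and the same leading symbols would map to the same introduced symbol, which would break the equivalence; adding a unique suffix per original rule avoids that.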
Note that when using the binarized grammar, the parse tree generated by CKY will not be in its most natural form, because intermediate non-terminal symbols introduced by the binarization process may be present in the tree. Such parses are not accepted as the final result. You should "debinarize" the tree (the reverse of binarization) to convert it back to its most natural form: no non-terminal that is absent from the original grammar is permitted in the final output.
Example
Suppose that in your binarization process, the grammar rule VP -> Verb NP PP is broken into
VP -> Verb @VP.Verb
@VP.Verb -> NP PP
Using this grammar, your CKY produces the following parse tree for some sentence.
This tree is not in its most natural form because it contains the nonterminal symbol @VP.Verb, which is not in the original grammar.
You need to post-process this tree to convert it to the following tree
Showing a tree that contains non-terminals absent from the original grammar as the answer, without debinarization, will result in a point deduction!
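The post-processing step can be sketched as a recursive splice: wherever a child node carries an introduced symbol, its children are lifted into the parent. This is a minimal sketch that assumes trees are nested Python lists of the form `[label, child, ...]` with words as plain strings, and that introduced symbols are recognizable by a leading `@` as in the example above; both are illustrative assumptions, as is the function name `debinarize`.

```python
def debinarize(tree):
    """Remove introduced '@' nodes by splicing their children into the parent.

    tree: a word (str) or a list [label, child1, child2, ...].
    """
    if isinstance(tree, str):
        return tree
    label, children = tree[0], tree[1:]
    new_children = []
    for child in children:
        child = debinarize(child)  # clean the subtree first
        if not isinstance(child, str) and child[0].startswith("@"):
            new_children.extend(child[1:])  # lift grandchildren up one level
        else:
            new_children.append(child)
    return [label] + new_children
```

Because each subtree is cleaned before it is spliced, chains of several introduced nodes collapse in a single pass.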
⚠️ Warning
Failing to conform to the input/output requirements will result in a 5-point deduction.
Your script (for Python users) or executable jar (for Java users) must take three parameters:
If you use Python, your code will be tested as:
python prob-cky.py pcfg.txt "A test sentence ." "the gold-standard s-expression"
If you use Java, your code will be tested as:
java -cp yourname.jar cs2731.hw2.ProbCKY pcfg.txt "A test sentence ." "the gold standard s-expression"
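For Python users, reading the three command-line parameters shown in the test command might look like the sketch below. The function name `parse_args`, the usage message, and the choice to tokenize the sentence on whitespace are assumptions made for illustration.

```python
import sys

def parse_args(argv):
    """Return (grammar_path, tokens, gold) from an argv-style list."""
    if len(argv) != 4:
        raise SystemExit(
            "usage: python prob-cky.py <grammar-file> <sentence> <gold-s-expression>"
        )
    grammar_path, sentence, gold = argv[1:4]
    # Assumption: the quoted sentence is already tokenized by spaces.
    return grammar_path, sentence.split(), gold

if __name__ == "__main__" and len(sys.argv) == 4:
    print(parse_args(sys.argv))
```

Quoting the sentence and the s-expression on the command line, as in the test commands above, is what keeps each of them a single argument.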
The output should be printed to the standard output stream. Print the following information
Example
[S [NP [Pronoun I]] [VP [Verb book] [NP [Det a] [Nominal [Noun flight]]] [PP [Preposition to] [NP [Proper-Noun houston]]]]]
(Copy and paste this string into mshang.ca/syntree to visualize it. You will find this tool very useful throughout this homework.)
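Producing a bracketed string in the style of the example from an in-memory tree can be sketched as follows. The nested-list tree representation (`[label, child, ...]` with words as strings) and the function name `to_sexpr` are assumptions for illustration, not a required design.

```python
def to_sexpr(tree):
    """Render a nested-list tree as a bracketed string such as '[NP [Det a]]'."""
    if isinstance(tree, str):
        return tree
    # Join the label and the rendered children with single spaces.
    return "[" + " ".join([tree[0]] + [to_sexpr(c) for c in tree[1:]]) + "]"
```

Applied to a tree for "I book a flight to houston" built in this representation, the function reproduces the example string character for character.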
Submit a writeup for the HMM portion of the homework.
For the CKY portion, Python users should include:
prob-cky.py
readme.txt, which includes:
For Java users, include:
yourname.zip, an archive which includes the Java source.
yourname.jar, which is compiled from your source code. The jar should have a main class named cs2731.hw2.ProbCKY.
readme.txt, which includes: