Homework 2 (CS 2731)

Assigned: September 26, 2019

Due: October 10, 2019 (midnight)

2.1 HMM Decoding (Viterbi) (20 points)

A partial Viterbi calculation is pictured here. This calculation takes us up through t=2 where v2(1) and v2(2) are computed. In the picture, the index 1 is used for the state labeled C and the index 2 is used for the state labeled H. Compute v3(1) and v3(2). You will need the transition and observation probabilities given here.

Think of this as filling in a table where the columns are moments in time and the rows are states in the HMM. Filling in the table with the numbers computed in the diagram above, and adding a column for time t = 0, and showing all the probability cells, it looks like this:

end	0	0	0
H	0	.32	.0448
C	0	.02	.048
start	1.0	0	0
t =	0	1	2	3

Each cell in the Viterbi table is filled with one of the Viterbi values computed in the diagram. Like the diagram, the table is complete through t=2. The values in the cells represent Viterbi probabilities. The Viterbi probability written as v2(2) repesents the probability of the highest probability path that ends at state 2 at time 2.

(10 points) Submit a completed version of the table above, together with the calculations you used to compute the Viterbi probabilities v3(1) and v3(2).
- The calculations should show the products producing the path probabilities and the maximization that gives the final Viterbi value.
- In addition, show how you would do all calculations in log (ln) space as well as directly as products of probabilities (recall Chapter 3).
(10 points)
- Report the best path through the HMM that fits the data.
- Justify your answer by adding backtraces to your table, including the backtraces for column 3. This figure illustrates the idea. The dashed lines represent the best path associated with each Viterbi value. In your submission you can just explain textually how you would modify the figure, e.g., "Add a backtrace link (dashed line) to the backtrace figure going from STATE? at time t = 3 to STATE? at time t = 2."

2.2 Probabilistic CKY Parsing (80 Points)

Implement a probabilistic CKY parser.

(40 points) Demonstrate the correctness of your implementation by running it with the grammar below and the inputs "The flight includes a meal" and "Book the flight through Houston", returning all parses and their probabilities. Also demonstrate the correctness of your parser using the grammar and sentence shown in Figure 13.2. (UPDATE: The textbook was updated October 2 so this is now 14.2). For every returned parse, compute its labeled recall and precison scores when compared with a gold standard (reference) parse tree in s-expression form (see below).
(40 points) We will also test your implementation on other blind tests (both grammars and sentences).

Grammar File

The sample grammar file pcfg.txt (click here to download) provided is exactly the same as the probabilistic grammar in Figure 13.1 from the textbook (at least that was the Figure before updating). Your program should read in any grammar file in this .txt format, as we will test it with other grammars besides this one. You are allowed to use NLTK and Stanford's NLP packages supporting PCFG parsing so that you don't have to code up the classes and data structures needed to maintain the grammar. But the rest of the programming must be done from scratch (e.g., the binarization and the CKY dynamic programming algorithm).

Binarization

The grammar provided has rules such as VP -> Verb NP PP, which has more than two non-terminals on the right hand side. However, the CKY algorithm can only handle grammars in a binarized format such as CNF. Therefore, your program will need to binarize the grammar before CKY decoding can be executed. You can't do this manually due to the blind testing of your program. You should use the CNF conversion introduced in class. You should be very careful that the binarization is done in a way that the probability stays the same for equivalent rules before and after binarization.

Note that when using the binarized grammar, the parse tree generated by CKY will not be in its most natural form (because intermediate non-terminal symbols introduced by the binarization process may be present in the tree). Such parses are not considered as the final result. You should "debinarize" (the reverse of binarization) the tree to convert it back to its most natural form (no non-terminal that is not in the original grammar is permitted in the final output).

Example

Suppose that in your binarization process, the grammar rule VP -> Verb NP PP is broken into
VP -> Verb @VP.Verb
@VP.Verb -> NP PP.
Using this grammar, your CKY produces the following parse tree for some sentence.

This tree is not in its most natural form because it contains the nonterminal symbol, @VP.Verb, which is not in the original grammar. You need to post-process this tree to convert it to the following tree

Directly showing the tree containing non-terminals that are not in the original grammar as the answer without debinarization will result in point deduction!

Input/Output Requirements

⚠️ Warning
Failing to conform to the the input/output requirement will result in a 5-point deduction.

Your script (for Python users) or executable jar (for Java users) must take three parameters:

The grammar file
The sentence to parse (which will be surrounded by double quotes)
The gold-standard parse of the sentence (which will be an s-expression (see below) surrounded by double quotes)

If you use Python, your code will be tested as:

python prob-cky.py pcfg.txt "A test sentence ."   "the gold-standard s-expression"

If you use Java, your code will be tested as:

java -cp yourname.jar cs2731.hw2.ProbCKY pcfg.txt "A test sentence ." "the gold standard s-expression"

The output should be printed to the standard output stream. Print the following information

Does the grammar accept the sentence? For this question, you should print a line saying either "Sentence accepted" or "Sentence rejected". If accepted, you in addition need to print the following.
What is the probability of the sentence? Just print the number.
What are the parse trees for the sentence? Print each tree in the s-expression format. Also print the probability of each tree immediately below the tree, as well as its recall and precision scores compared to the gold-standard.
The s-expression is a bracket-based format for parse trees. An example:

Example
[S [NP [Pronoun I]] [VP [Verb book] [NP [Det a] [Nominal [Noun flight]]] [PP [Preposition to] [NP [Proper-Noun houston]]]]]

(Copy and paste this string into mshang.ca/syntree to visualize it. You will find this tool very useful throughout this homework. )

What to Include in Submission?

Submit a writeup for the HMM portion of the homework.

For the CKY portion for Python users, include:

The python source file: prob-cky.py.
A readme.txt which includes:
- Python version (2 or 3)
- How did you do binarization?
- Any known issues that prevent your script from running.

For Java users, include:

A yourname.zip archive which includes the Java source.
A yourname.jar which is compiled from your source code. The jar should have a main class named cs2731.hw2.ProbCKY.
A readme.txt which includes:
- How did you do binarization?
- Any known issues that prevent your code from running.