Author: Yuhuan Jiang
Last updated: Feb 18, 2017
Implement a CKY parser which can parse any given sentence using the provided grammar.
The provided grammar file cfg.txt (click here to download) is exactly the same as the grammar from the textbook.
S -> NP VP
S -> Aux NP VP
S -> VP
NP -> Pronoun
NP -> Proper-Noun
NP -> Det Nominal
Nominal -> Noun
Nominal -> Nominal Noun
Nominal -> Nominal PP
VP -> Verb
VP -> Verb NP
VP -> Verb NP PP
VP -> Verb PP
VP -> VP PP
PP -> Preposition NP
Det -> that | this | a
Noun -> book | flight | meal | money
Verb -> book | include | prefer
Pronoun -> i | she | me
Proper-Noun -> houston | twa
Aux -> does
Preposition -> from | to | on | near | through
The provided grammar has rules such as VP -> Verb NP PP, which have more than two non-terminals on the right-hand side. However, the CKY algorithm can only handle grammars in a binarized format (such as CNF). Therefore, you need to binarize the grammar before CKY decoding can be executed. You may use the CNF conversion introduced in class, or any other binarization technique.
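One possible way to binarize long rules is sketched below. It is only an illustration, not the required method: the function name binarize, the (lhs, rhs) tuple representation, and the @-prefixed naming of intermediate symbols are all assumptions. It splits only rules with more than two right-hand-side symbols; unit rules such as S -> VP are left untouched and may still need separate treatment, depending on the technique you choose.

```python
def binarize(rules):
    """Split every rule with more than two right-hand-side symbols.

    `rules` is a list of (lhs, rhs) pairs, where rhs is a tuple of symbols.
    Intermediate symbols are named '@<lhs>.<symbols consumed so far>';
    this naming scheme is an assumption made for illustration.
    """
    result = []
    for lhs, rhs in rules:
        current_lhs = lhs
        consumed = []
        while len(rhs) > 2:
            # Peel off the first symbol and delegate the rest to a new symbol.
            first, rhs = rhs[0], rhs[1:]
            consumed.append(first)
            new_symbol = "@" + lhs + "." + "_".join(consumed)
            result.append((current_lhs, (first, new_symbol)))
            current_lhs = new_symbol
        # What remains has at most two symbols and is kept as-is.
        result.append((current_lhs, tuple(rhs)))
    return result


binarized = binarize([("VP", ("Verb", "NP", "PP")), ("S", ("NP", "VP"))])
```

Here VP -> Verb NP PP becomes VP -> Verb @VP.Verb and @VP.Verb -> NP PP, while S -> NP VP passes through unchanged.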
Using the binarized grammar, the parse tree generated by CKY will not be in its most natural form, because intermediate non-terminal symbols introduced by the binarization process may be present in the tree. Such parses are not considered the final result. You should "debinarize" the tree (the reverse of binarization) to convert it back to its most natural form.
Example
Suppose that in your binarization process, the grammar rule
VP -> Verb NP PP
is broken into
VP -> Verb @VP.Verb
@VP.Verb -> NP PP
Using this grammar, your CKY produces the following parse tree for some sentence.
This tree is not in its most natural form because it contains the nonterminal symbol @VP.Verb, which is not in the original grammar.
You need to post-process this tree to convert it to the following tree.
Showing a tree that contains non-terminals not in the original grammar as the answer, without debinarization, will result in a point deduction!
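The post-processing step can be sketched as follows. This is a minimal illustration under stated assumptions, not a prescribed implementation: trees are represented as nested lists of the form [label, child, ...] with words as plain strings, intermediate symbols are assumed to start with "@", and the function name debinarize is hypothetical. Unit rules collapsed by a full CNF conversion, if any, would need to be restored separately.

```python
def debinarize(tree):
    """Return a copy of `tree` with every '@'-prefixed node spliced out.

    A tree is a nested list [label, child, ...]; a leaf word is a string.
    Both the representation and the '@' prefix are assumptions.
    """
    if isinstance(tree, str):  # a word at the leaf
        return tree
    label, children = tree[0], tree[1:]
    new_children = []
    for child in children:
        child = debinarize(child)
        if not isinstance(child, str) and child[0].startswith("@"):
            # Promote the grandchildren in place of the intermediate node.
            new_children.extend(child[1:])
        else:
            new_children.append(child)
    return [label] + new_children


binarized_vp = ["VP", ["Verb", "book"],
                ["@VP.Verb", ["NP", ["Det", "a"], ["Nominal", ["Noun", "flight"]]],
                 ["PP", ["Preposition", "to"], ["NP", ["Proper-Noun", "houston"]]]]]
natural_vp = debinarize(binarized_vp)
```

After the call, natural_vp has Verb, NP, and PP as direct children of VP, matching the original rule VP -> Verb NP PP.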
⚠️ Warning
Failing to conform to the input/output requirements will result in a 5-point deduction.
Your script (for Python users) or executable jar (for Java users) must take two parameters:
For example, when your script cky.py is run as follows,
python cky.py cfg.txt "i book flight"
it should read the grammar in the file cfg.txt and parse the sentence i book flight using that grammar.
For Java users, the command will be
java -cp yourname.jar cs2731.hw2.CKY cfg.txt "i book flight"
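For Python users, reading the two command-line parameters might look like the sketch below. The function name read_arguments and the usage message are hypothetical, and lowercasing the sentence is an assumption made here because the grammar's terminals are all lowercase.

```python
def read_arguments(argv):
    """Return (grammar_path, words) from a command line such as
    ['cky.py', 'cfg.txt', 'i book flight']."""
    if len(argv) != 3:
        raise SystemExit("usage: python cky.py <grammar file> <sentence>")
    grammar_path, sentence = argv[1], argv[2]
    # Assumption: lowercase the input so it matches the lowercase terminals.
    words = sentence.lower().split()
    return grammar_path, words


grammar_path, words = read_arguments(["cky.py", "cfg.txt", "i book flight"])
```

In a real script you would pass sys.argv (after import sys) instead of the hard-coded list.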
The output should be printed to the standard output stream, and it should answer the following two questions:
Is the sentence accepted by the grammar? Print Sentence accepted! or Sentence rejected!
If accepted, what are the parse trees? For this question, print each tree as an s-expression, a bracket-based format for parse trees. An example:
Example
The tree in the second figure above in s-expression would be
[S [NP [Pronoun I]] [VP [Verb book] [NP [Det a] [Nominal [Noun flight]]] [PP [Preposition to] [NP [Proper-Noun houston]]]]]
(Copy and paste this string into mshang.ca/syntree to visualize it. You will find this tool very useful throughout this homework.)
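Producing this bracket format from a tree can be done with a short recursive function. The sketch below assumes trees are stored as nested lists of the form [label, child, ...] with words as plain strings; that representation and the name to_sexpr are illustrative choices, not requirements.

```python
def to_sexpr(tree):
    """Render a nested-list tree as a bracketed s-expression string."""
    if isinstance(tree, str):  # a word at the leaf
        return tree
    label, children = tree[0], tree[1:]
    parts = [label] + [to_sexpr(child) for child in children]
    return "[" + " ".join(parts) + "]"


example = ["S", ["NP", ["Pronoun", "i"]], ["VP", ["Verb", "book"]]]
print(to_sexpr(example))  # prints [S [NP [Pronoun i]] [VP [Verb book]]]
```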
Implement a probabilistic CKY parser.
The provided grammar file pcfg.txt (click here to download) is exactly the same as the probabilistic grammar from the textbook.
0.80 S -> NP VP
0.15 S -> Aux NP VP
0.05 S -> VP
0.35 NP -> Pronoun
0.30 NP -> Proper-Noun
0.20 NP -> Det Nominal
0.15 NP -> Nominal
0.75 Nominal -> Noun
0.20 Nominal -> Nominal Noun
0.05 Nominal -> Nominal PP
0.35 VP -> Verb
0.20 VP -> Verb NP
0.10 VP -> Verb NP PP
0.15 VP -> Verb PP
0.05 VP -> Verb NP NP
0.15 VP -> VP PP
1.0 PP -> Preposition NP
Det -> that [0.10] | a [0.30] | the [0.60]
Noun -> book [0.10] | flight [0.30] | meal [0.15] | money [0.05] | flights [0.40] | dinner [0.10]
Verb -> book [0.30] | include [0.30] | prefer [0.40]
Pronoun -> i [0.40] | she [0.05] | me [0.15] | you [0.40]
Proper-Noun -> houston [0.60] | twa [0.40]
Aux -> does [0.60] | can [0.40]
Preposition -> from [0.30] | to [0.30] | on [0.20] | near [0.15] | through [0.05]
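The listing above uses two line shapes: a leading probability for phrase rules, and bracketed probabilities after each word for lexical rules. A reader for both shapes might look like the sketch below. It assumes pcfg.txt is laid out exactly as shown here, which you should verify against the actual file; the function name parse_pcfg_line and the (lhs, rhs, probability) triple are hypothetical.

```python
import re


def parse_pcfg_line(line):
    """Return a list of (lhs, rhs, probability) triples for one grammar line.

    Assumes the two line shapes shown in the listing:
      '0.80 S -> NP VP'
      'Det -> that [0.10] | a [0.30] | the [0.60]'
    """
    left, right = line.split("->")
    left_tokens = left.split()
    if len(left_tokens) == 2:  # probability comes before the left-hand side
        probability, lhs = float(left_tokens[0]), left_tokens[1]
        return [(lhs, tuple(right.split()), probability)]
    lhs = left_tokens[0]  # lexical rule: one 'word [p]' per alternative
    rules = []
    for alternative in right.split("|"):
        match = re.fullmatch(r"\s*(\S+)\s*\[([0-9.]+)\]\s*", alternative)
        if match is None:
            raise ValueError("unrecognized alternative: " + alternative)
        rules.append((lhs, (match.group(1),), float(match.group(2))))
    return rules


phrase_rules = parse_pcfg_line("0.80 S -> NP VP")
lexical_rules = parse_pcfg_line("Aux -> does [0.60] | can [0.40]")
```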
This is the same as the non-probabilistic case. One extra thing to be very careful about is that binarization must preserve probabilities: the rules produced by binarizing an original rule should together yield the same probability as that original rule. This will ensure that, for example, the two trees in the figures above have the same probability.
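One way to keep probabilities unchanged is to give the first binary rule the original probability and every intermediate rule probability 1.0, so that the product along the chain equals the original value. The sketch below illustrates this idea; the function name binarize_weighted, the (lhs, rhs, probability) triples, and the naming of intermediate symbols are assumptions. The intermediate names encode the whole original right-hand side so that rules such as VP -> Verb NP PP and VP -> Verb NP NP do not share an intermediate symbol. If they did, both expansions would become reachable through either rule's probability.

```python
def binarize_weighted(rules):
    """Binarize (lhs, rhs, probability) rules without changing probabilities.

    The first rule in each chain keeps the original probability; every
    intermediate rule gets 1.0. Intermediate names are an assumption and
    are unique per original rule.
    """
    result = []
    for lhs, rhs, probability in rules:
        current_lhs, current_probability = lhs, probability
        for i in range(len(rhs) - 2):
            new_symbol = "@{}.{}|{}".format(
                lhs, "_".join(rhs[:i + 1]), "_".join(rhs[i + 1:]))
            result.append((current_lhs, (rhs[i], new_symbol), current_probability))
            current_lhs, current_probability = new_symbol, 1.0
        # The last (or only) rule has at most two right-hand-side symbols.
        result.append((current_lhs, tuple(rhs[-2:]), current_probability))
    return result


weighted = binarize_weighted([
    ("VP", ("Verb", "NP", "PP"), 0.10),
    ("VP", ("Verb", "NP", "NP"), 0.05),
])
```

For the first rule this yields VP -> Verb @VP.Verb|NP_PP with probability 0.10 and @VP.Verb|NP_PP -> NP PP with probability 1.0, whose product is the original 0.10.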
Again, remember to convert the trees generated by your probabilistic CKY algorithm back to their most natural form (no non-terminal that is not in the original grammar is permitted in the final output).
If you use Python, your code will be tested as:
python prob-cky.py pcfg.txt "A test sentence ."
If you use Java, your code will be tested as:
java -cp yourname.jar cs2731.hw2.ProbCKY pcfg.txt "A test sentence ."
The output should be printed to the standard output stream. Print the following information
For Python users, include:
cky.py and prob-cky.py.
readme.txt, which includes:
For Java users, include:
yourname.zip archive, which includes the Java source.
yourname.jar, which is compiled from your source code. The jar should have two main classes, cs2731.hw2.CKY and cs2731.hw2.ProbCKY.
readme.txt, which includes: