Please do this on the board.
There are a few lecture slides (see the schedule), but mainly this is
to do on the board. Please do not use ppt slides.
This follows the chapter very closely.
N: lines are notes to say aloud, not necessarily things to put on the board
Chapter 12
Probabilistic and Lexicalized Parsing

Handling Ambiguities

The Earley algorithm: represents all parses of a sentence
does not resolve ambiguity
N: that is, it does not choose which of the possibilities is the
correct one
Disambiguation methods include:
Semantics (choose parse that makes sense)
Statistics (choose parse that is most likely)
Probabilistic context-free grammars (PCFGs) offer a solution
Issues
The probabilistic model: assign probs to the parse trees
Getting the probabilities
Parsing with probabilities: find max probability tree for an
input sentence
============================
A PCFG is a 5-tuple (N, Sigma, P, S, D)
N: nonterminals
Sigma: terminals
P: productions, each of the form A > Beta, where A is in N
and Beta is a string of symbols from (N U Sigma)*
S: start symbol
D: function that assigns a probability to each p in P
============================
So, we attach probabilities to grammar rules
VP > Verb .55
VP > Verb NP .40
VP > Verb NP NP .05
N: ask them: what do you think these numbers are, exactly?
A conditional probability: P(specific rule | LHS)
E.g.: P(Verb NP | VP) is .4
N: ask them:
What needs to sum to 1?
The expansions for a given nonterminal must sum to 1
Remember: sum over a: P(A=a | B=b) = 1
E.g. (assuming the above are the only VP rules):
P(Verb NP | VP) + P(Verb | VP) + P(Verb NP NP | VP) = 1
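A quick sketch of this sum-to-1 check over the VP numbers above; the dictionary encoding of rules is just one convenient choice, not from the text:

```python
from collections import defaultdict

# A toy PCFG fragment as (LHS, RHS) -> probability, using the VP
# rules and numbers from above.
rules = {
    ("VP", ("Verb",)):            0.55,
    ("VP", ("Verb", "NP")):       0.40,
    ("VP", ("Verb", "NP", "NP")): 0.05,
}

def expansion_totals(rules):
    """Sum the probabilities of the expansions of each nonterminal."""
    totals = defaultdict(float)
    for (lhs, _rhs), p in rules.items():
        totals[lhs] += p
    return dict(totals)

# The expansions for each nonterminal must sum to 1:
print(round(expansion_totals(rules)["VP"], 6))  # 1.0
```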
N: show them the example PCFG in the lecture notes; figure 12.1 from the text.
A PCFG can be used to estimate the probability of each parse tree for a
sentence S.
===========================
The Probability Model

A derivation (tree) consists of the set of grammar rules that are in
the tree
The probability of a tree
N: is just the product of the probabilities of the rules in the derivation.
For sentence S, the probability assigned by a PCFG to a parse tree T
is given by: P(S,T) = product over n in T P(r(n))
i.e. the product of the probabilities of all the rules r used to
expand each node n in T
Note the independence assumption
N: Implicit in context-free grammars - all rules are applied
N: independently of each other
N: show them the ambiguous sentence in the lecture notes; figure 12.2
from the text
N: go over the ambiguity carefully
Leading up to the numbers given on that slide:
We saw:
P(S,T) = product over n in T P(r(n))
N: The joint probability of a sentence and its parse
Actually, P(S,T) = P(T). Why?
P(S,T) = P(T) P(S|T)
The words ARE INCLUDED in the parse tree! They are the terminals of
the tree. Thus, P(S|T) = 1
N: the words are fixed in the parse tree
So, P(S,T) = P(T)
P(T_L) = 1.5 x 10^-6
= .15 * .4 * .05 * ...
P(T_R) = 1.7 x 10^-6
= .15 * .4 * .4 * ...
So, pick the one on the right.
tau(S): all the possible parses for S according to the grammar
Want: T-hat(S) = argmax for T in tau(S) P(T|S)
= argmax for T in tau(S) P(T,S) / P(S)
Eliminate the denominator, since it is the same for all T in tau(S)
= argmax for T in tau(S) P(T,S)
We already know from above that:
= argmax for T in tau(S) P(T)
So, we can just choose the tree with the max prob
A PCFG also assigns probabilities to the strings of the language
Important in language modeling (speech recognition, spell-correction,
etc.)
P(S) = sum over T in tau(S) P(S,T)
[axiom of probability: sum over a: P(A=a,B=b) = P(B=b)]
= sum over T in tau(S) P(T)
N: so, sum of the probs of all possible parse trees for a sentence is
the prob of the sentence itself. (remember that T contains the
words + the structure)
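Both uses of the model (pick the max-probability parse; sum over parses for P(S)) can be sketched with made-up rule probabilities in the same .15 * .4 * ... style as above:

```python
from math import prod

# Hypothetical rule probabilities for two competing parses of one
# sentence; the individual factors are invented for illustration.
left_rules  = [0.15, 0.40, 0.05]
right_rules = [0.15, 0.40, 0.40]

p_left = prod(left_rules)    # P(T_L): product of its rule probabilities
p_right = prod(right_rules)  # P(T_R)

best = max(p_left, p_right)    # disambiguation: keep the likelier tree
p_sentence = p_left + p_right  # language modeling: P(S) = sum over trees
```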
=================================
N: ok, above was the probability model.
Getting the probabilities:
Supervised: From an annotated database (a treebank)
So for example, to get the probability for a particular VP rule
just count all the times the rule is used and divide by the number
of VPs overall
If no labeled data: something we won't cover, called the
inside-outside algorithm. There is a reference in the text.
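The supervised estimate can be sketched as follows; the list of observed rule uses is invented, not real treebank counts:

```python
# Each entry records one time a rule expanded a node in a treebank tree.
# These observations are made up for illustration.
observed = [
    ("VP", ("V", "NP")),
    ("VP", ("V", "NP")),
    ("VP", ("V",)),
    ("NP", ("Det", "N")),
]

def mle_rule_prob(observed, lhs, rhs):
    """P(rhs | lhs) = count(lhs > rhs) / count(all expansions of lhs)."""
    rule_count = sum(1 for r in observed if r == (lhs, rhs))
    lhs_count = sum(1 for l, _ in observed if l == lhs)
    return rule_count / lhs_count

# count(VP > V NP) / count(VP) = 2/3
print(round(mle_rule_prob(observed, "VP", ("V", "NP")), 3))  # 0.667
```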
=================================
Parsing: choose the most probable parse
N: you can augment the Earley parser to choose the most likely parse;
see ref in text.
CYK (Cocke-Younger-Kasami): Bottom-up, no filtering, dynamic
programming
**You need the updated version from the Errata webpage.
**I point to it on the schedule
N: like the Viterbi and Earley algorithms
Input:
Grammar:
grammars must be in Chomsky Normal Form (CNF):
epsilon-free
a rule is either A > B C (two nonterminals)
or A > a (nonterminal to terminal)
nonterminals are numbered 1,2,...,N
Start symbol (S for sentences) is 1
n words w1,..,wn
pi[i,j,a] holds the max prob for constituent a from i to j
b[i,j,a] holds backpointers, to tie the pieces of the parse tree together
Output: pi[1,n,1] is the probability of the maximum probability
parse (remember: S is 1)
Use backpointers to recover the most probable parse.
Dynamic programming approach.
Assign probabilities to constituents as they are completed and added
to the chart.
Use the max probability for each constituent as we work our way "up"
- we are doing bottom-up parsing.
Suppose we want the probability for a sentence:
S > NP VP
The probability of the S is
P(S > NP VP) * P(NP) * P(VP)
We already know everything after the arrow, since
we are going bottom up.
N: P(NP) e.g. is the product of the probabilities of the rules
N: used to build up to the NP
Let's show the algorithm with an example:
S > NP VP
VP > V NP
NP > NP PP
VP > VP PP
PP > P NP
NP > John, Mary, Denver
V > called
P > from
John called Mary from Denver.
parse tree:
S
  NP - John
  VP
    VP
      V - called
      NP - Mary
    PP
      P - from
      NP - Denver
base case: (bottom of bottom-up)
John called Mary from Denver
pi[1,1,NP] = pi[2,2,V] = pi[3,3,NP] = etc.
p(NP > John) p(V > called) p(NP > Mary)
If, e.g., "called" could also be an NP, then pi[2,2,NP] = p(NP > called)
recursive case:
For strings longer than 1, there must be a rule A > BC and some m
such that B covers the first m words in the string, and C covers the
rest of the string. Since the substrings are shorter than the main
string, their probabilities will already be stored in the table.
For all possibilities (see below):
prob = pi[begin,m,B] * pi[m + 1,end,C] * p(A > B C)
if prob is larger than pi[begin,end,A] (we are finding the maximum)
update pi[begin,end,A] and save backpointer[begin,end,A]: {m,B,C}
N: does backpointer have enough info? Yes. The indices: begin, end tell us where
N: the constituent starts and ends. A is the LHS of the rule.
N: The contents: m tells us where B ends and C starts. B, C are the
N: RHS of the rule.
All possibilities (see the forloop indices in Figure 12.3):
any constituent over any span followed by any
constituent over the following span
For example:
Base case:
Here are the 1word assignments that are made (the rest are set to 0)
pi[ 1 1 NP ] = P( NP > John )
pi[ 2 2 V ] = P( V > called )
pi[ 3 3 NP ] = P( NP > Mary )
pi[ 4 4 P ] = P( P > from )
Here is a sample from the recursive cases:
**Strings of len 2:
prob = pi[1,1,NP] * pi[2,2,VP] * P(S > NP VP)
if pi[1,2,S] < prob then update it to prob and save back[1,2,S] = <1,NP,VP>
...
prob = pi[3,3,V] * pi[4,4,NP] * P(VP > V NP)
if pi[3,4,VP] < prob then update it to prob and save back[3,4,VP] = <3,V,NP>
**Strings of len 3:
prob = pi[1,1,NP] * pi[2,3,VP] * P(S > NP VP)
if pi[1,3,S] < prob then update it to prob and save back[1,3,S] = <1,NP,VP>
...
prob = pi[2,2,V] * pi[3,4,NP] * P(VP > V NP)
if pi[2,4,VP] < prob then update it to prob and save back[2,4,VP] = <2,V,NP>
snip
** Strings of len 5:
prob = pi[1,1,NP] * pi[2,5,VP] * P(S > NP VP)
if pi[1,5,S] < prob then update it to prob and save back[1,5,S] = <1,NP,VP>
...
prob = pi[1,1,P] * pi[2,5,NP] * P(PP > P NP)
if pi[1,5,PP] < prob then update it to prob and save back[1,5,PP] = <1,P,NP>
...
prob = pi[1,3,P] * pi[4,5,NP] * P(PP > P NP)
if pi[1,5,PP] < prob then update it to prob and save back[1,5,PP] = <3,P,NP>
========At the end:
pi[1,5,S] : back[1,5,S] = <1,NP,VP>
pi[1,1,NP] - terminal
pi[2,5,VP]
prob = pi[2,3,VP] * pi[4,5,PP] * P(VP > VP PP)
if pi[2,5,VP] < prob then update it to prob and save back[2,5,VP] = <3,VP,PP>
Above is the correct one (hopefully it has the maximum probability).
Here were the other possibilities:
prob = pi[2,2,VP] * pi[3,5,PP] * P(VP > VP PP)
if pi[2,5,VP] < prob then update it to prob and save back[2,5,VP] = <2,VP,PP>
prob = pi[2,2,V] * pi[3,5,NP] * P(VP > V NP)
if pi[2,5,VP] < prob then update it to prob and save back[2,5,VP] = <2,V,NP>
prob = pi[2,3,V] * pi[4,5,NP] * P(VP > V NP)
if pi[2,5,VP] < prob then update it to prob and save back[2,5,VP] = <3,V,NP>
prob = pi[2,4,VP] * pi[5,5,PP] * P(VP > VP PP)
if pi[2,5,VP] < prob then update it to prob and save back[2,5,VP] = <4,VP,PP>
prob = pi[2,4,V] * pi[5,5,NP] * P(VP > V NP)
if pi[2,5,VP] < prob then update it to prob and save back[2,5,VP] = <4,V,NP>
Continuing with the backpointers back[2,5,VP] = <3,VP,PP>:
pi[2,3,VP] ...
pi[4,5,PP] ...
And so on.
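The whole walkthrough above can be sketched as a small probabilistic CKY parser over the example grammar. The rule probabilities are invented for illustration (the text's grammar gives none), chosen so that verb attachment beats noun attachment; pi and back are dictionaries keyed by (begin, end, nonterminal) with 1-based word positions:

```python
# Invented probabilities; the grammar is the CNF example above.
binary = {  # (B, C) -> [(A, P(A > B C)), ...]
    ("NP", "VP"): [("S", 1.0)],
    ("V",  "NP"): [("VP", 0.7)],
    ("VP", "PP"): [("VP", 0.3)],
    ("NP", "PP"): [("NP", 0.2)],
    ("P",  "NP"): [("PP", 1.0)],
}
lexical = {  # word -> [(A, P(A > word)), ...]
    "John":   [("NP", 0.3)],
    "Mary":   [("NP", 0.3)],
    "Denver": [("NP", 0.2)],
    "called": [("V", 1.0)],
    "from":   [("P", 1.0)],
}

def cky(words):
    n = len(words)
    pi = {}    # (begin, end, A) -> max prob of A over words begin..end
    back = {}  # (begin, end, A) -> (m, B, C)
    for i, w in enumerate(words, start=1):      # base case: length-1 spans
        for a, p in lexical.get(w, []):
            pi[i, i, a] = p
    for length in range(2, n + 1):              # recursive case
        for begin in range(1, n - length + 2):
            end = begin + length - 1
            for m in range(begin, end):         # split: B covers begin..m
                for (b, c), parents in binary.items():
                    left = pi.get((begin, m, b), 0.0)
                    right = pi.get((m + 1, end, c), 0.0)
                    for a, rule_p in parents:
                        prob = left * right * rule_p
                        if prob > pi.get((begin, end, a), 0.0):
                            pi[begin, end, a] = prob      # found a better A
                            back[begin, end, a] = (m, b, c)
    return pi, back

pi, back = cky("John called Mary from Denver".split())
print(back[1, 5, "S"])   # (1, 'NP', 'VP')
print(back[2, 5, "VP"])  # (3, 'VP', 'PP'): the verb-attachment analysis
```

With these numbers, back[2,5,VP] reproduces the <3,VP,PP> backpointer derived in the walkthrough.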
=====================================
Problems with PCFGs
expansions of nonterminals are all independent of each other.
Hence, the probabilities are simply multiplied together.
doesn't use the words in any real way.
E.g., pp attachment often depends on the verb, its object,
and the preposition:
I ate macaroni with a fork
I ate macaroni with cheese
Doesn't take into account where in the derivation a rule
is used. E.g. pronouns are more often subjects than objects
(She hates Mary. Mary hates her)
We would like to make the probabilities assigned to the rules:
NP > pronoun
NP > Det noun
depend on whether the NP is subject or object. But you can't do that
in a standard PCFG
====
See slides 4 and 5.
Slide 5:
N: Such a preference can be only structural,
N: and is the same for all verbs
But in this example, the PP attaches to the verb.
"send" subcategorizes for a destination which can be
expressed with "into":
A basic pattern of "send":
NP send NP (PP_into)
Lexicalized PCFGs
=================
Add lexical dependencies to the scheme
Add the predilections of particular words
into the probabilities in the derivation
I.e. Condition the rule probabilities on the
actual words
We need the idea of "heads" of constituents
Phrasal Heads
============
The head of an NP is its noun
The head of a VP is its verb
The head of a PP is its preposition
"most important" word of the phrase
Phrases are generally identified by their heads
(i.e., NP = "noun phrase")
This is easy for cases such as those above.
But what the head is can be complicated and
controversial.
Linguistic theories specify what the heads are.
Collins 1999 has a practical set defined for the
Penn Treebank tags.
Basic idea: a head is identified for each PCFG rule
The headword of a node in a parse tree is set to the
headword of its head daughter.
N: See the example lexicalized trees in the lecture notes.
N: These show two different parses of the same sentence.
Let's treat the probabilistic lexicalized CFG
like a normal but huge PCFG.
Remember, our PCFGs have rule probabilities:
VP > V NP PP    P(rule | VP)
Estimated from a treebank by count(this rule) / count(VP)
Now we want lexicalized probabilities
We have one rule for each rule/head combination:
VP(dumped) > V(dumped) NP(sacks) PP(into)
The probability assigned:
P(rule | VP ^ dumped is the verb ^
sacks is the head of the NP ^
into is the head of the PP)
Need to cover all combinations!
VP(dumped) > V(dumped) NP(cats) PP(into)
VP(dumped) > V(dumped) NP(hats) PP(into)
...
Not likely to have significant counts in any treebank
Most probabilities would be 0.
So, let's make some simplifying independence assumptions
and calculate what we can.
Statistical parsers differ in which ones they make.
We'll look at (a simplified version of)
Charniak's 1997 parser.
N: Rather than starting with the full version and simplifying,
N: we'll start with PCFGs and add information.
p(r(n) | n) is what a plain PCFG uses
(notation: r is the rule, n is a syntactic category)
Add: the headword
p(r(n) | n, h(n))
E.g., for:
VP > VBD NP PP
p(r | VP, dumped): what is the probability that a VP headed by dumped
will be expanded as VBD NP PP?
This will capture subcategorization information:
A VP whose head is dumped is more likely
to have an NP and a PP (dumped him into the river)
than some other VP structures.
Probability of a node having a given head: we will have this
depend on the syntactic category of the node n and the head of the
node's mother, h(m(n))
p(h(n) = word_i | n, h(m(n)))
E.g.: prob that the NP that is the second daughter of the
VP in our example tree has head "sacks".
p(head(n) = sacks | n = NP, h(m(n)) = dumped)
What is the prob that an NP whose mother's head is "dumped"
has the head "sacks"?
Capturing dependency info between "dumped" and "sacks"
X(dumped)
|
NP("sacks")
Probability of a complete parse:
P(T) = product over n in T:
p(r(n) | n, h(n)) * p(h(n) | n, h(m(n)))
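With made-up per-node factors, this is just a product; the pairing of numbers to nodes below is illustrative only (.67 and .22 are the Brown-corpus figures for the dumped/into example, the rest are invented):

```python
from math import prod

# Each node n contributes p(r(n) | n, h(n)) * p(h(n) | n, h(m(n))).
# These (rule factor, head factor) pairs are hypothetical.
nodes = [
    (0.67, 0.22),  # e.g. a rule factor and a head factor from the text
    (0.50, 0.10),  # invented values standing in for the other nodes
]

p_tree = prod(r * h for r, h in nodes)
print(round(p_tree, 6))  # 0.67 * 0.22 * 0.50 * 0.10 = 0.00737
```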
Look at our sample parse trees:
Question is where does the PP attach?
To the verb or the noun?
We care about affinities between
'dumped' and 'into' vs. 'sacks' and 'into'
One difference between the trees:
VP(dumped) > VBD NP PP in the correct tree
VP(dumped) > VBD NP in the incorrect tree
In the Brown corpus, the first rule is likely:
p(VP > VBD NP PP | VP, dumped) =
count(VP(dumped) > VBD NP PP)
---------------------------------
sum over B count(VP(dumped) > B)
= 6/9 = .67
The second never happens in the Brown corpus:
count(VP(dumped) > VBD NP)
---------------------------------
sum over B count(VP(dumped) > B)
= 0/9 = 0
***The usage "She dumped her boyfriend" isn't in this corpus!
Now consider the head probabilities:
***In the correct parse:
X(dumped)
|
PP("into")
***In the incorrect parse:
X(sacks)
|
PP("into")
From the Brown corpus:
p(into | PP, dumped) =
count(X(dumped) > ... PP(into) ...)
--------------------------------------------
sum over B: count(X(dumped) > ... PP(B) ...)
= 2/9 = .22
p(into | PP, sacks) =
count(X(sacks) > ... PP(into) ...)
--------------------------------------------
sum over B: count(X(sacks) > ... PP(B) ...)
= 0/0 (undefined)
There weren't any instances of sacks with PP children.
e.g. you might find in another or larger corpus:
Go get the sacks on the shelf
The sacks on the shelf are heavy
"into" is less likely:
The sacks into the shelf are heavy...
So, both types of probabilities help in our example.
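Putting the two Brown-corpus estimates together; as a simplification, all other factors in the two parses are assumed equal and dropped:

```python
# Scores for the two PP attachments, using only the factors that differ.
p_rule_correct   = 6 / 9  # p(VP > VBD NP PP | VP, dumped)
p_rule_incorrect = 0 / 9  # p(VP > VBD NP | VP, dumped)
p_head_correct   = 2 / 9  # p(into | PP, dumped)
p_head_incorrect = 0.0    # p(into | PP, sacks) was 0/0; treated as 0 here

score_correct = p_rule_correct * p_head_correct
score_incorrect = p_rule_incorrect * p_head_incorrect

print(score_correct > score_incorrect)  # True: verb attachment wins
```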
====================================================
By the way, is our grammar still context free?
Can we use our methods for parsing CFGs?
Yes  just view it as one large grammar:
NP(dumped) > ...
NP(sacks) > ...
Don't treat the heads as features in any way. Just
treat this as if you have syntactic categories:
NPdumped, NPsacks,... and so on.
====================================================
Dependency Grammars
Grammars based on lexical dependencies.
Constituents and phrase structure rules are not central
root > main: 'Gave'
'Gave' > subj: 'I'
'Gave' > dat: 'Him'
'Gave' > obj: 'address'
'address' > attr: 'my'
Each link holds between two lexical nodes.
Fixed inventory of 35 relations.
A few different sets have been proposed.
Good for languages with relatively free word order.
Czech, Korean, ...
=========
Read the rest of the chapter for your own information.
N: Good to have
N: some idea. We won't cover it explicitly or test it. The human parsing
N: section is interesting.