Please do this on the board. There are a few lecture slides (see the schedule), but mainly this is to do on the board. Please do not use ppt slides. This follows the chapter very closely.
N: lines are notes to say, not necessarily things to put on the board.

Chapter 12 Probabilistic and Lexicalized Parsing
-------------------------------------

Handling Ambiguities
--------------------

The Earley algorithm:
  represents all parses of a sentence
  does not resolve ambiguity
  N: that is, it does not choose which of the possibilities is the correct one

Disambiguation methods include:
  Semantics (choose the parse that makes sense)
  Statistics (choose the parse that is most likely)

Probabilistic context-free grammars (PCFGs) offer a solution.

Issues:
  The probabilistic model: assign probabilities to the parse trees
  Getting the probabilities
  Parsing with probabilities: find the max-probability tree for an input sentence

============================

A PCFG is a 5-tuple (N, Sigma, P, S, D):
  N: nonterminals
  Sigma: terminals
  P: productions, each of the form A --> Beta, where A is in N and Beta is a string of symbols from (N U Sigma)*
  S: start symbol
  D: a function that assigns a probability to each p in P

============================

So, we attach probabilities to grammar rules:

  VP -> Verb           .55
  VP -> Verb NP        .40
  VP -> Verb NP NP     .05

N: ask them: what do you think these numbers are, exactly?

A conditional probability: P(specific rule | LHS)
  E.g.: P(Verb NP | VP) is .40

N: ask them: what needs to sum to 1?

The expansions for a given non-terminal must sum to 1.
  Remember: sum over a: P(A=a | B=b) = 1
  E.g. (assuming the above are the only VP rules):
    P(Verb | VP) + P(Verb NP | VP) + P(Verb NP NP | VP) = 1

N: show them the example PCFG in the lecture notes; figure 12.1 from the text.

A PCFG can be used to estimate the probability of each parse tree for a sentence S.

===========================

The Probability Model
------------------------

A derivation (tree) consists of the set of grammar rules that are in the tree.

The probability of a tree
  N: is just the product of the probabilities of the rules in the derivation.

For sentence S, the probability assigned by a PCFG to a parse tree T is given by:

  P(S,T) = product over n in T of P(r(n))

i.e., the product of the probabilities of all the rules r used to expand each node n in T.

Note the independence assumption
  N: Implicit in context-free grammars -- all rules are applied
  N: independently of each other

N: show them the ambiguous sentence in the lecture notes; figure 12.2 from the text
N: go over the ambiguity carefully

Leading up to the numbers given on that slide:

We saw: P(S,T) = product over n in T of P(r(n))
  N: The joint probability of a sentence and its parse

Actually, P(S,T) = P(T). Why?
  P(S,T) = P(T) P(S|T)
  The words ARE INCLUDED in the parse tree! They are the terminals of the tree.
  Thus, P(S|T) = 1
  N: the words are fixed in the parse tree
  So, P(S,T) = P(T)

  P(T_L) = 1.5 x 10^-6   (= .15 * .40 * .05 * ...)
  P(T_R) = 1.7 x 10^-6   (= .15 * .40 * .40 * ...)

So, pick the one on the right.

tau(S): all the possible parses for S according to the grammar

Want:
  T-hat(S) = argmax over T in tau(S) of P(T|S)
           = argmax over T in tau(S) of P(T,S) / P(S)
  Eliminate the denominator, since it is the same for all T in tau(S):
           = argmax over T in tau(S) of P(T,S)
  We already know from above that this
           = argmax over T in tau(S) of P(T)

So, we can just choose the tree with the max probability.
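A minimal Python sketch of this computation (not from the text; the rule probabilities are invented, and a tree is encoded simply as the list of rules used in its derivation):

  # Hypothetical rule probabilities: one entry P(RHS | LHS) per rule.
  rule_prob = {
      ("S",  ("NP", "VP")):         0.80,
      ("VP", ("Verb", "NP")):       0.40,
      ("VP", ("Verb", "NP", "NP")): 0.05,
      ("NP", ("Det", "Noun")):      0.30,
  }

  def tree_prob(rules_used):
      """P(T): the product of the probabilities of the rules in the derivation."""
      p = 1.0
      for rule in rules_used:
          p *= rule_prob[rule]
      return p

  def best_parse(candidate_trees):
      """T-hat(S): among the candidate derivations for S, the one with maximum P(T)."""
      return max(candidate_trees, key=tree_prob)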
A PCFG also assigns probabilities to the strings of the language.
  Important in language modeling (speech recognition, spell-correction, etc.)

  P(S) = sum over T in tau(S) of P(S,T)
         [axiom of probability: sum over a: P(A=a, B=b) = P(B=b)]
       = sum over T in tau(S) of P(T)

N: so, the sum of the probabilities of all possible parse trees for a sentence is the probability of the sentence itself. (Remember that T contains the words + the structure.)

=================================

N: ok, above was the probability model.

Getting the probabilities:

Supervised: from an annotated database (a treebank)
  So, for example, to get the probability for a particular VP rule, just count all the times the rule is used and divide by the number of VPs overall.

If no labeled data: something we won't cover, called the inside-outside algorithm. There is a reference in the text.

=================================

Parsing: choose the most probable parse
N: you can augment the Earley parser to choose the most likely parse; see ref in text.

CYK (Cocke-Younger-Kasami):
  Bottom-up, no filtering, dynamic programming
  **You need the updated version from the Errata webpage.
  **I point to it on the schedule.
  N: like the Viterbi and Earley parsers

Input:
  Grammar: grammars must be in Chomsky Normal Form (CNF):
    epsilon-free
    a rule is either
      A --> B C   (two non-terminals on the RHS)
      A --> a     (a single terminal on the RHS)
    non-terminals are numbered 1, 2, ..., N
    the start symbol (S for sentences) is 1
  n words w1, ..., wn

pi[i,j,a] holds the max probability for constituent a spanning words i to j
b[i,j,a] holds backpointers, to tie the pieces of the parse tree together

Output:
  pi[1,n,1] is the probability of the maximum-probability parse (remember: S is 1)
  Use backpointers to recover the most probable parse.

Dynamic programming approach.
Assign probabilities to constituents as they are completed and added to the chart.
Use the max probability for each constituent as we work our way "up" -- we are doing bottom-up parsing.

Suppose we want the probability for a sentence: S --> NP VP, where the NP covers the first words and the VP covers the rest.
The probability of the S is P(S --> NP VP) * P(NP) * P(VP)
  We already know everything after the arrow, since we are going bottom-up.
  N: P(NP), e.g., is the product of the probabilities of the rules
  N: used to build up to the NP

Let's show the algorithm with an example:

  S  -> NP VP
  VP -> V NP
  NP -> NP PP
  VP -> VP PP
  PP -> P NP
  NP -> John, Mary, Denver
  V  -> called
  P  -> from

  John called Mary from Denver.

  parse tree:
    S
      NP - John
      VP
        VP
          V - called
          NP - Mary
        PP
          P - from
          NP - Denver

base case: (bottom of bottom-up)

  John:    pi[1,1,NP] = p(NP --> John)
  called:  pi[2,2,V]  = p(V --> called)
  Mary:    pi[3,3,NP] = p(NP --> Mary)
  from, Denver: etc.

  If, e.g., "called" could also be an NP, then pi[2,2,NP] = p(NP --> called)

recursive case:

  For strings longer than 1, there must be a rule A --> B C and some m such that B covers the first m words in the string, and C covers the rest of the string.
  Since the substrings are shorter than the main string, their probabilities will already be stored in the table.

  For all possibilities (see below):
    prob = pi[begin,m,B] * pi[m+1,end,C] * p(A --> B C)
    if prob is larger than pi[begin,end,A]   (we are finding the maximum)
      update pi[begin,end,A] and save backpointer[begin,end,A] = {m,B,C}

  N: does the backpointer have enough info? Yes. The indices begin, end tell us where
  N: the constituent starts and ends. A is the LHS of the rule.
  N: The contents: m tells us where B ends and C starts. B, C are the
  N: RHS of the rule.

All possibilities (see the for-loop indices in Figure 12.3): any constituent over any span followed by any constituent over the following span.
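A minimal Python sketch of the tables and the loop structure (an illustration in the spirit of Figure 12.3, not the book's pseudocode verbatim; the grammar encoding -- lexical rules keyed by (A, word) and binary rules keyed by (A, B, C) -- is an assumption):

  from collections import defaultdict

  def prob_cky(words, lexical_rules, binary_rules):
      n = len(words)
      pi = defaultdict(float)   # pi[(i, j, A)]: max prob of an A spanning words i..j (0.0 if unset)
      back = {}                 # back[(i, j, A)] = (m, B, C)

      # base case: spans of length 1, filled from the lexical rules A --> w
      for i, w in enumerate(words, start=1):
          for (A, word), p in lexical_rules.items():
              if word == w:
                  pi[(i, i, A)] = p

      # recursive case: longer spans, built from shorter spans already in the table
      for length in range(2, n + 1):                # length of the span
          for begin in range(1, n - length + 2):    # where the span starts
              end = begin + length - 1
              for m in range(begin, end):           # split point: B covers begin..m, C covers m+1..end
                  for (A, B, C), p in binary_rules.items():
                      prob = pi[(begin, m, B)] * pi[(m + 1, end, C)] * p
                      if prob > pi[(begin, end, A)]:
                          pi[(begin, end, A)] = prob
                          back[(begin, end, A)] = (m, B, C)

      return pi, back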
For example:

Base case: here are the 1-word assignments that are made (the rest are set to 0):

  pi[1,1,NP] = P(NP --> John)
  pi[2,2,V]  = P(V --> called)
  pi[3,3,NP] = P(NP --> Mary)
  pi[4,4,P]  = P(P --> from)
  pi[5,5,NP] = P(NP --> Denver)

Here is a sample from the recursive cases:

**Strings of len 2:

  prob = pi[1,1,NP] * pi[2,2,VP] * P(S --> NP VP)
  if pi[1,2,S] < prob then update it to prob and save back[1,2,S] = <1,NP,VP>
  ...
  prob = pi[3,3,V] * pi[4,4,NP] * P(VP --> V NP)
  if pi[3,4,VP] < prob then update it to prob and save back[3,4,VP] = <3,V,NP>

**Strings of len 3:

  prob = pi[1,1,NP] * pi[2,3,VP] * P(S --> NP VP)
  if pi[1,3,S] < prob then update it to prob and save back[1,3,S] = <1,NP,VP>
  ...
  prob = pi[2,2,V] * pi[3,4,NP] * P(VP --> V NP)
  if pi[2,4,VP] < prob then update it to prob and save back[2,4,VP] = <2,V,NP>

--snip--

**Strings of len 5:

  prob = pi[1,1,NP] * pi[2,5,VP] * P(S --> NP VP)
  if pi[1,5,S] < prob then update it to prob and save back[1,5,S] = <1,NP,VP>
  ...
  prob = pi[1,1,P] * pi[2,5,NP] * P(PP --> P NP)
  if pi[1,5,PP] < prob then update it to prob and save back[1,5,PP] = <1,P,NP>
  ...
  prob = pi[1,3,P] * pi[4,5,NP] * P(PP --> P NP)
  if pi[1,5,PP] < prob then update it to prob and save back[1,5,PP] = <3,P,NP>

======== At the end:

  pi[1,5,S] : back[1,5,S] = <1,NP,VP>, pointing to
    pi[1,1,NP] -- terminal
    pi[2,5,VP]

  For pi[2,5,VP], the winning update was:
    prob = pi[2,3,VP] * pi[4,5,PP] * P(VP --> VP PP)
    if pi[2,5,VP] < prob then update it to prob and save back[2,5,VP] = <3,VP,PP>

  Above is the correct one (hopefully, it has maximum probability). Here were the other possibilities:

    prob = pi[2,2,VP] * pi[3,5,PP] * P(VP --> VP PP)
    if pi[2,5,VP] < prob then update it to prob and save back[2,5,VP] = <2,VP,PP>

    prob = pi[2,2,V] * pi[3,5,NP] * P(VP --> V NP)
    if pi[2,5,VP] < prob then update it to prob and save back[2,5,VP] = <2,V,NP>

    prob = pi[2,3,V] * pi[4,5,NP] * P(VP --> V NP)
    if pi[2,5,VP] < prob then update it to prob and save back[2,5,VP] = <3,V,NP>

    prob = pi[2,4,VP] * pi[5,5,PP] * P(VP --> VP PP)
    if pi[2,5,VP] < prob then update it to prob and save back[2,5,VP] = <4,VP,PP>

    prob = pi[2,4,V] * pi[5,5,NP] * P(VP --> V NP)
    if pi[2,5,VP] < prob then update it to prob and save back[2,5,VP] = <4,V,NP>

  Continuing with the backpointers from back[2,5,VP] = <3,VP,PP>:
    pi[2,3,VP] ...
    pi[4,5,PP] ...
  And so on.
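A minimal Python sketch of recovering the tree from the back table (an assumption about representation, not from the text; it just follows back[(i,j,A)] = (m,B,C) down to the length-1 spans):

  def build_tree(back, words, i, j, A):
      """Rebuild the parse tree for constituent A spanning words i..j (1-indexed)."""
      if i == j:                    # length-1 span: a lexical rule A --> w_i
          return (A, words[i - 1])
      m, B, C = back[(i, j, A)]     # A --> B C, with B over i..m and C over m+1..j
      return (A,
              build_tree(back, words, i, m, B),
              build_tree(back, words, m + 1, j, C))

  # Starting from the whole sentence, build_tree(back, words, 1, len(words), "S")
  # would give a nested tuple like ('S', ('NP', 'John'), ('VP', ('VP', ...), ('PP', ...))).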
=====================================

Problems with PCFGs

Expansions of non-terminals are all independent of each other. Hence, the probabilities are simply multiplied together.

A PCFG doesn't use the words in any real way. E.g., PP attachment often depends on the verb, its object, and the preposition:
  I ate macaroni with a fork
  I ate macaroni with cheese

It doesn't take into account where in the derivation a rule is used. E.g., pronouns are more often subjects than objects (She hates Mary. Mary hates her.)
  We would like to make the probabilities assigned to the rules:
    NP --> pronoun
    NP --> Det noun
  depend on whether the NP is subject or object. But you can't do that in a standard PCFG.

====

See slides 4 and 5.

Slide 5:
N: Such a preference can only be structural,
N: and is the same for all verbs
But in this example, the PP attaches to the verb. "send" subcategorizes for a destination, which can be expressed with "into":
  A basic pattern of "send": NP send NP (PP_into)

Lexicalized PCFGs
=================

Add lexical dependencies to the scheme.
Add the predilections of particular words into the probabilities in the derivation.
I.e., condition the rule probabilities on the actual words.

We need the idea of "heads" of constituents.

Phrasal Heads
============

The head of an NP is its noun
The head of a VP is its verb
The head of a PP is its preposition

The "most important" word of the phrase.
Phrases are generally identified by their heads (i.e., NP = "noun phrase").

This is easy for cases such as those above. But what the head is can be complicated and controversial. Linguistic theories specify what the heads are. Collins 1999 has a practical set defined for the Penn Treebank tags.

Basic idea: a head is identified for each PCFG rule. The headword of a node in a parse tree is set to the headword of its head daughter.

N: See the example lexicalized trees in the lecture notes.
N: These show two different parses of the same sentence.

Let's treat the probabilistic lexicalized CFG like a normal but huge PCFG.

Remember, our PCFGs have rule probabilities:
  VP -> V NP PP    P(rule | VP)
  Estimated from a treebank by count(this rule) / count(VP)

Now we want lexicalized probabilities. We have one rule for each rule/head combination:
  VP(dumped) -> V(dumped) NP(sacks) PP(into)
The probability assigned:
  P(rule | VP ^ dumped is the verb ^ sacks is the head of the NP ^ into is the head of the PP)

Need to cover all combinations!
  VP(dumped) -> V(dumped) NP(cats) PP(into)
  VP(dumped) -> V(dumped) NP(hats) PP(into)
  ...
Not likely to have significant counts in any treebank. Most probabilities would be 0.

So, let's make some simplifying independence assumptions and calculate what we can. Statistical parsers differ in which ones they make. We'll look at (a simplified version of) Charniak's 1997 parser.
N: Rather than starting with the full version and simplifying,
N: we'll start with PCFGs and add information.

p(r(n) | n) is what we already have in a PCFG (notation: r(n) is the rule used to expand node n; the conditioning is on n's syntactic category).

Add: the headword
  p(r(n) | n, h(n))
E.g., for VP -> VBD NP PP:
  p(r | VP, dumped): what is the probability that a VP headed by "dumped" will be expanded as VBD NP PP?
This will capture subcategorization information: a VP whose head is "dumped" is more likely to have an NP and a PP (dumped him into the river) than some other VP structures.

Probability of a node having a particular head: we will have this depend on the syntactic category of the node n and the head of the node's mother, h(m(n)):
  p(h(n) = word_i | n, h(m(n)))
E.g.: the probability that the NP that is the second daughter of the VP in our example tree has head "sacks":
  p(head(n) = sacks | n = NP, h(m(n)) = dumped)
What is the probability that an NP whose mother's head is "dumped" has the head "sacks"?
Capturing dependency info between "dumped" and "sacks":

  X(dumped)
      |
  NP(sacks)

Probability of a complete parse:
  P(T) = product over n in T of: p(r(n) | n, h(n)) * p(h(n) | n, h(m(n)))

Look at our sample parse trees. The question is where does the PP attach? To the verb or the noun? We care about affinities between "dumped" and "into" vs. "sacks" and "into".

One difference between the trees:
  VP(dumped) --> VBD NP PP   in the correct tree
  VP(dumped) --> VBD NP      in the incorrect tree

In the Brown corpus, the first rule is likely:

  p(VP --> VBD NP PP | VP, dumped)
      = count(VP(dumped) --> VBD NP PP) / [sum over B of count(VP(dumped) --> B)]
      = 6/9 = .67

The second never happens in the Brown corpus:

  p(VP --> VBD NP | VP, dumped)
      = count(VP(dumped) --> VBD NP) / [sum over B of count(VP(dumped) --> B)]
      = 0/9 = 0

***The usage "She dumped her boyfriend" isn't in this corpus!
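A toy Python sketch of this relative-frequency estimate (not from the text; the observation format is an assumption, and the counts only echo the 6-out-of-9 figure for VP(dumped) --> VBD NP PP, with the remaining three expansions invented as filler so the total is 9):

  from collections import Counter

  # one (head, expansion) pair per observed VP in a hypothetical treebank sample
  vp_observations = (
      [("dumped", ("VBD", "NP", "PP"))] * 6 +   # 6 of 9 VP(dumped) expand as VBD NP PP
      [("dumped", ("VBD", "PP"))] * 3           # invented filler for the other 3 expansions
  )

  def p_rule_given_head(expansion, head, observations):
      """Estimate p(rule | VP, head) = count(VP(head) --> expansion) / count(VP(head) --> anything)."""
      counts = Counter(obs for obs in observations if obs[0] == head)
      total = sum(counts.values())
      return counts[(head, expansion)] / total if total else 0.0

  # p_rule_given_head(("VBD", "NP", "PP"), "dumped", vp_observations)  -> 0.67  (6/9)
  # p_rule_given_head(("VBD", "NP"), "dumped", vp_observations)        -> 0.0   (0/9)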
Now consider the head probabilities.

***In the correct parse, the PP is a daughter of the node headed by "dumped":

  X(dumped)
      |
  PP(into)

***In the incorrect parse, it is a daughter of the node headed by "sacks":

  X(sacks)
      |
  PP(into)

From the Brown corpus:

  p(into | PP, dumped)
      = count(X(dumped) --> ... PP(into) ...) / [sum over B of count(X(dumped) --> ... PP(B) ...)]
      = 2/9 = .22

  p(into | PP, sacks)
      = count(X(sacks) --> ... PP(into) ...) / [sum over B of count(X(sacks) --> ... PP(B) ...)]
      = 0/0

There weren't any instances of "sacks" with PP children.
E.g., you might find in another or larger corpus:
  Go get the sacks on the shelf
  The sacks on the shelf are heavy
"into" is less likely:
  The sacks into the shelf are heavy...

So, both types of probabilities help in our example.

====================================================

By the way, is our grammar still context free? Can we use our methods for parsing CFGs?

Yes -- just view it as one large grammar:
  NP(dumped) --> ...
  NP(sacks) --> ...
Don't treat the heads as features in any way. Just treat this as if you have syntactic categories NP-dumped, NP-sacks, ... and so on.

====================================================

Dependency Grammars

Grammars based on lexical dependencies. Constituents and phrase structure rules are not central.

E.g., "I gave him my address":

  root
    --main--> 'gave'
        --subj--> 'I'
        --dat--> 'him'
        --obj--> 'address'
            --attr--> 'my'

Each link holds between two lexical nodes.
Fixed inventory of 35 relations. A few different sets have been proposed.
Good for languages with relatively free word order: Czech, Korean, ...

=========

Read the rest of the chapter for your own information.
N: Good to have some idea. We won't cover it explicitly or test it.
N: The human parsing section is interesting.