CS 4344/7344 - Lecture 3 - January 13, 1998


Beginning Parsing Techniques


What we get from syntax

Syntax provides constraints that can be used in extracting the meaning
of a sentence.  At this point in the course, syntax appears to tell us
only that specific groups of words constitute various components of the
sentence.  It may not seem like this tells us much about meaning.
Later on we'll see that there are techniques for using this
information to determine what action or event is described by the sentence,
who or what caused the action, who or what was affected by the action,
what are the attributes of the various actors, actions, and objects,
and so on.  It is not the case, however, that there is a one-to-one
mapping between the syntactic components and the semantic components.
For example, in the sentence

  I saw the man on the hill with a telescope.

it is easy to pick out the two prepositional phrases, but it is unclear
whether the man on the hill had a telescope or whether a telescope was
used to see him.  Thus, while structure constrains meaning, it alone
does not determine meaning.  This is just one example of the quality
that makes natural language understanding by computer a difficult problem--
that quality called AMBIGUITY.


Parsing...

...(or syntactic analysis) is the process of decomposing a sentence
into its components, and verifying that the syntactic structure is correct.

Parsing needs two things:

A grammar or other formal specification of allowable structures
in the language (i.e., the structural rules of the language)

A parsing technique or procedural method for analyzing the
sentence (i.e., a means of applying or using the rules mentioned above)


What's a grammar?

A grammar contains the knowledge about "legal" syntactic structures, 
represented as rewrite rules.  The grammar defines the language.  
Here's a very simple example:

     S <- NP VP
    VP <- VERB NP
    NP <- ART NOUN
    NP <- NAME
    NP <- POSS NOUN
  NAME <- Jack
  NOUN <- frog | dog
   ART <- a
  VERB <- ate
  POSS <- my
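
If you want to poke at this grammar with a program, one easy way is to
store the rules as a table that maps each left-hand side to its possible
right-hand sides.  Here's a rough Python sketch, purely for illustration
(not part of the assigned material; the variable name is made up):

  # Each nonterminal maps to a list of its possible right-hand sides.
  grammar = {
      "S":    [["NP", "VP"]],
      "VP":   [["VERB", "NP"]],
      "NP":   [["ART", "NOUN"], ["NAME"], ["POSS", "NOUN"]],
      "NAME": [["Jack"]],
      "NOUN": [["frog"], ["dog"]],
      "ART":  [["a"]],
      "VERB": [["ate"]],
      "POSS": [["my"]],
  }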


What's a parse tree?

A parse tree is the classic representation of a sentence's syntactic 
structure, and usually the desired output of a parser:

                                 S
                                / \
                               /   \
                              /     \
                             /       \
                            /         \
                           NP         VP
                          /  \       /  \
                         /    \     /    \
                       POSS  NOUN VERB   NP
                        |      |   |    /  \
                        |      |   |   /    \
                        |      |   |  ART  NOUN
                        |      |   |   |     |
                       my     dog ate  a   frog


The tree notation is difficult to compute with directly, so we like
to convert the representation above into something more useful:

    (S (NP (POSS my)
           (NOUN dog))
       (VP (VERB ate)
           (NP (ART a)
               (NOUN frog))))
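
If you happen to be building this in a program, the parenthesized form
above maps directly onto nested lists.  Here's a rough Python version,
again just for illustration (the variable names are invented):

  parse = ["S", ["NP", ["POSS", "my"],
                       ["NOUN", "dog"]],
                ["VP", ["VERB", "ate"],
                       ["NP", ["ART", "a"],
                              ["NOUN", "frog"]]]]

  # Pulling out pieces is now ordinary list indexing:
  subject = parse[1]        # ["NP", ["POSS", "my"], ["NOUN", "dog"]]
  verb    = parse[2][1][1]  # "ate"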


From grammars to transition nets

The grammar given above is an example of a context-free grammar (CFG).
Every left-hand side consists of exactly one non-terminal symbol and no
terminals.  You can rewrite a left-hand side with its corresponding
right-hand side without considering its context (i.e., the symbols
surrounding it).  That makes our parsing job computationally simpler than
it would be if we had to worry about context-sensitive grammars.

English is most likely not a context-free language, but we can define
a large subset of English by a CFG and that's good enough for our
purposes in this class.  

As grammars get more rules, they become difficult to understand and
computationally more demanding.  We can make our jobs easier (at least
in some ways) by converting the grammar to a more convenient representation
known as a finite state machine (FSM) or transition network (TN).

One of the primary advantages of TNs is that while a grammar is a
static set of rules, a TN describes a process, and a process for
parsing is what we're trying to get at here.

Let's look at that grammar again:

     S <- NP VP
    VP <- VERB NP
    NP <- ART NOUN
    NP <- NAME
    NP <- POSS NOUN
  NAME <- Jack
  NOUN <- frog | dog
   ART <- a
  VERB <- ate
  POSS <- my

That first rule can be viewed as a description of a process:  "You can
recognize that you have a valid sentence if you see that you have a
noun phrase followed by a verb phrase."  In computer science world, we
can translate that into a collection of states and state transitions:
"Start at the state where you've recognized nothing.  Then test the
input.  If the input is a noun phrase, then go to the state that says
you've seen a noun phrase.  Test the rest of the input.  If that
input is a verb phrase, then go to the state that says you've seen a
verb phrase.  If there's no more input left, you've just recognized
a sentence."

Of course, in computer science world, we'd sooner cut off body parts than
write all that text, so we have this nice symbolic formalism for 
representing this sort of thing.  It's called a finite state machine,
or more informally, a transition network (or just transition net, or just
TN).  If you've been in the compiler course or the automata theory course,
you've seen this stuff before.  It's just states, and the tests that get
you from one state to another.  Here's what that first grammar rule looks
like as a TN:

                                  pop
                                 /
       NP            VP         /
  S0 ---------> S1 ---------> S2

(This notation is a little bit different from the one we used in class, because
it's just too darn hard to draw circles in ASCII.  For homework and exams,
use the notation in the book or what we use in class.  Don't use this
notation.)
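
If it helps to see the same idea as code, here's a rough Python sketch of
that S net: a table of states and transitions, plus a loop over the input.
This isn't something we did in class, and it cheats a little by assuming
the input has already been reduced to phrase labels like NP and VP; all the
names here are made up for illustration:

  transitions = {("S0", "NP"): "S1",
                 ("S1", "VP"): "S2"}
  final_states = {"S2"}

  def recognize(labels):
      state = "S0"
      for label in labels:
          if (state, label) not in transitions:
              return False              # no legal move out of this state
          state = transitions[(state, label)]
      return state in final_states      # the "pop": accept only in a final state

  print(recognize(["NP", "VP"]))        # True  -- looks like a sentence
  print(recognize(["NP", "NOUN"]))      # False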

The remainder of the rules would look like this as TNs:


                                    pop
                                   /
        VERB            NP        /
  VP0 ---------> VP1 ---------> VP2
  
                                    pop
                                   /
        ART            NOUN       /
  NP0 ---------> NP1 ---------> NP2
  
                     pop
                    /
        NAME       /
  NP3 ---------> NP4  

                                    pop
                                   /
        POSS           NOUN       /
  NP5 ---------> NP6 ---------> NP7

We can consolidate those noun phrase TNs into a single TN that might
be a little bit easier to deal with:

            ART           NOUN   
         ---------> NP1 ---------
        /           ^            \
       /           /              \     pop
      /           /                \   /
     /   POSS    /                 _\|/
  NP0 -----------                   NP2
     \                              ^
      \                            /
       \                          /
        \          NAME          /
         ------------------------

(OK, so it's ugly here, but on paper or on the whiteboard, it looks a
lot better.)

So now our full set of TNs looks like this:

                                  pop
                                 /
       NP            VP         /
  S0 ---------> S1 ---------> S2

                                    pop
                                   /
        VERB            NP        /
  VP0 ---------> VP1 ---------> VP2

            ART           NOUN   
         ---------> NP1 ---------
        /           ^            \
       /           /              \     pop
      /           /                \   /
     /   POSS    /                 _\|/
  NP0 -----------                   NP2
     \                              ^
      \                            /
       \                          /
        \          NAME          /
         ------------------------


This gives us a nice modular set of transition nets.  We start with the
S transition net at state S0, and test the input to see if we have a noun
phrase.  But to do that test, we need to jump to the NP net at state NP0,
and then test for the various possibilities.  If we find that we recognize
a noun phrase (i.e., we've made it all the way to state NP2), then we jump
back to the S net at state S1 and continue from there.  In order to do this
jumping and returning, we need to store return points on a stack, so you
can just pretend that there's an implicit stack hanging around somewhere
that allows you to do this.  (A finite state machine with a stack is called
a pushdown automaton, or a PDA.)
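
If you'd like to see how that implicit stack plays out in practice, here's
a rough Python sketch of the three nets above, with the nets written as
data and the jump to a sub-net handled by a recursive call (so the call
stack plays the role of the pushdown stack).  None of this is required for
the course, and all the function and variable names are invented:

  # Lexical categories and the arcs of the S, VP, and consolidated NP nets.
  CATEGORIES = {"NAME": {"Jack"}, "NOUN": {"frog", "dog"},
                "ART": {"a"}, "VERB": {"ate"}, "POSS": {"my"}}

  NETS = {
      "S":  {"start": "S0",  "final": {"S2"},
             "arcs": {"S0": [("NP", "S1")], "S1": [("VP", "S2")]}},
      "VP": {"start": "VP0", "final": {"VP2"},
             "arcs": {"VP0": [("VERB", "VP1")], "VP1": [("NP", "VP2")]}},
      "NP": {"start": "NP0", "final": {"NP2"},
             "arcs": {"NP0": [("ART", "NP1"), ("POSS", "NP1"), ("NAME", "NP2")],
                      "NP1": [("NOUN", "NP2")]}},
  }

  def walk(net_name, words, i):
      # Try to traverse one net starting at word position i.  Returns the
      # position just past whatever the net recognized, or None on failure.
      net = NETS[net_name]

      def from_state(state, i):
          if state in net["final"]:
              return i                              # "pop" back to the caller
          for label, next_state in net["arcs"].get(state, []):
              if label in CATEGORIES:               # arc tests the next word
                  if i < len(words) and words[i] in CATEGORIES[label]:
                      result = from_state(next_state, i + 1)
                      if result is not None:
                          return result
              else:                                 # arc names a sub-net: jump to it
                  j = walk(label, words, i)
                  if j is not None:
                      result = from_state(next_state, j)
                      if result is not None:
                          return result
          return None

      return from_state(net["start"], i)

  def sentence_ok(words):
      return walk("S", words, 0) == len(words)

  print(sentence_ok("my dog ate a frog".split()))   # True
  print(sentence_ok("Jack ate my frog".split()))    # True
  print(sentence_ok("a dog my".split()))            # False

(This sketch just takes the first path it finds through each net, which is
fine for this tiny grammar.)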

We could avoid the use of a stack by making one big transition net:


           ART       NOUN                          ART       NOUN
         ------> S1 -------                      ------> S4 -------
        /       ^          \                    /       /          \
       /       /            \                  /       /            \    pop
      /       /              \                /       /              \    /
     /  POSS /               _\|   VERB      /  POSS /               _\| /
   S0 -------                  S2 -------> S3 -------                  S5
     \                        ^              \                         ^
      \                      /                \                       /
       \                    /                  \                     /
        \       NAME       /                    \        NAME       /
         ------------------                      -------------------


But now we've duplicated the NP net, and that's completely undesirable
for the same reason that duplicating chunks of code in real live 
programs is undesirable.  So take advantage of the ability to organize
your nets in cohesive modules and rely on the stack to allow you to jump
from module to module, and you'll be doing fine.  (In other words, the
net immediately above sucks.  The group of three nets for S, NP, and VP
is much better.)

What happened to these rules?

  NAME <- Jack
  NOUN <- frog | dog
   ART <- a
  VERB <- ate
  POSS <- my


The rules with only terminal symbols on the right-hand side merely define
the acceptable lexicon in this language.  These nets always look the same,
so it's more or less a waste of time to bother to draw them.  Also, when
the lexicon gets really big, drawing them takes forever, so don't bother.
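
If you were writing a parser, the lexical rules would simply become a
lookup table from words to their categories.  A quick Python sketch, just
for illustration (names invented):

  lexicon = {"Jack": "NAME",
             "frog": "NOUN",
             "dog":  "NOUN",
             "a":    "ART",
             "ate":  "VERB",
             "my":   "POSS"}

  def category(word):
      return lexicon.get(word)      # None if the word isn't in the lexicon

  print(category("dog"))            # NOUN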



Copyright (c) 1998 by Kurt Eiselt and Jennifer Holbrook.  All rights 
reserved, except as previously noted.

Last revised: January 15, 1998