N-grams
quotes at beginning of the chapter - Jelinek and Chomsky
POS, WSD, spelling correction, hand-writing recognition, speech
recognition, augmentative communication.
For all of them, cut down on the possibilities for the
next word:
tag
wsd
what the word IS:
hand-writing rec
speech recognition
augmentative communication
people who can't speak
choose words from menus, or with simple
hand movements
include the most likely words in the menu.
spelling correction
They are leaving in fifteen minuets to go to her house
Spelling mistakes that are real words!
Spell checkers can flag low probability sequences
Hand-writing recognition error example:
I have a gub.
Woody Allen movie, Take the Money and Run:
He robs a bank but misspells the note he gives to the teller and she
makes fun of him.
gun and gull are both words, but gun has a higher probability in the
context of a bank
This chapter: N-gram model of word prediction.
Use the previous N-1 words to predict the next one.
Speech recognition:
"language model" for such statistical models of word sequences.
===========================
An aside:
    P(X,Y) = #X,Y / N
    (the number of things that are both X and Y in the population,
     over the total number of things, N)

    P(X|Y) = P(X,Y) / P(Y)

             #X,Y / N
           = --------
              #Y / N

             #X,Y
           = ----        (the N's cancel)
              #Y
That's why we used those counts to estimate our parameters from the
training data
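This identity can be sanity-checked in a few lines of code; the little population below is made up purely for illustration:

```python
# Hypothetical population: each item is tagged with whether it has
# property X and/or property Y.
population = [
    ("X", "Y"), ("X", "Y"), ("X", ""), ("", "Y"),
    ("", ""), ("X", "Y"), ("", "Y"), ("", ""),
]
N = len(population)

count_xy = sum(1 for x, y in population if x == "X" and y == "Y")  # #X,Y
count_y = sum(1 for x, y in population if y == "Y")                # #Y

p_xy = count_xy / N   # P(X,Y)
p_y = count_y / N     # P(Y)

# P(X|Y) = P(X,Y) / P(Y); the N's cancel, leaving #X,Y / #Y
assert abs(p_xy / p_y - count_xy / count_y) < 1e-12
```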
============================
Berkeley restaurant project: speech-based restaurant consultant.
People ask it for advice about where to eat.
Bigram counts: from sentences spoken by users.
row followed by column entry (I want: 1087 times)
Lots of 0s.
These are the counts in the corpus we referred to before. The corpus
here is made up of spoken sentences.
This is just a few of the words; the table is actually larger.
Going to probabilities
P(I | I) = count(I I) / count(I)
         = 8 / (8 + 1087 + 13 + ...)
         = 8 / 3437
         = .0023
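The same estimate can be computed from raw counts; a minimal sketch with a toy corpus (not the actual Berkeley data):

```python
from collections import Counter

# Toy corpus of spoken-style sentences; <s> marks sentence start.
corpus = [
    "<s> i want to eat chinese food".split(),
    "<s> i want chinese food".split(),
    "<s> i eat lunch".split(),
]

unigram_counts = Counter(w for sent in corpus for w in sent)
bigram_counts = Counter(
    (sent[i], sent[i + 1]) for sent in corpus for i in range(len(sent) - 1)
)

def p(w, prev):
    """MLE bigram estimate: count(prev w) / count(prev)."""
    return bigram_counts[(prev, w)] / unigram_counts[prev]

print(p("want", "i"))  # count(i want) = 2, count(i) = 3
```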
remember, we will compute this:
p(I|start)P(want|I)P(to|want)...P(food|chinese)
.0004 (not in table) * .32 * .65 * .26 * .02 * .56
the product gets smaller and smaller as you multiply by
additional terms, since they are all smaller than 1
multiplying lots of small numbers
Can result in underflow (depends on the numbers, the number of terms,
and the variable sizes you are using)
remember that adding in log space is equivalent to multiplying in
linear space
We can take the logs of the probabilities and add them.
If we are just looking for the most probable thing, we
never have to take the anti-log.
(x > y in log space <--> x > y in linear space)
We'll use log to the base 2 if we list logs.
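A quick check of the log-space trick, using the six probabilities from the product above:

```python
import math

# The bigram probabilities listed for the Berkeley example.
probs = [.0004, .32, .65, .26, .02, .56]

product = math.prod(probs)                  # multiply in linear space
log_sum = sum(math.log2(p) for p in probs)  # add in log space (base 2)

# Adding logs is equivalent to multiplying probabilities:
assert abs(2 ** log_sum - product) < 1e-15

# To compare two candidate sentences, compare their log sums directly;
# no anti-log is needed, since log is monotonic.
```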
==============================
first-order model: unigram
second-order model: bigram
these are sentences randomly generated.
i.e., randomly choose among things where the estimate
is not 0.
e.g., for unigrams, just randomly choose single words that are
possible.
bigrams:
randomly choose a possible "start w" bigram
then randomly choose a possible w w1 bigram
then randomly choose a possible w1 w2 bigram
and so on
trigrams:
randomly choose a possible "start w1 w2" trigram
then randomly choose a possible w1 w2 w3 trigram
then randomly choose a possible w2 w3 w4 trigram
and so on
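The bigram generation loop above can be sketched like this (the bigram table here is hypothetical, standing in for counts from a real corpus):

```python
import random
from collections import defaultdict

# Hypothetical table of observed bigrams: successors seen for each word.
successors = defaultdict(list)
for prev, w in [("<s>", "i"), ("i", "want"), ("i", "eat"),
                ("want", "to"), ("to", "eat"), ("eat", "lunch"),
                ("eat", "chinese"), ("chinese", "food"),
                ("food", "</s>"), ("lunch", "</s>")]:
    successors[prev].append(w)

def generate(seed=0):
    random.seed(seed)
    word, out = "<s>", []
    while True:
        # randomly choose among the possible "word w" bigrams
        word = random.choice(successors[word])
        if word == "</s>":
            return " ".join(out)
        out.append(word)

print(generate())
```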
=========
they trained the models on the collected works of Shakespeare
actually, this is much too little data for the complex models
29,066 word types in Shakespeare
that many squared -- 844 million -- possible bigrams
only 884,647 words (tokens) in the collected works
so we can't estimate the parameters of the rarer possibilities
of course we can't do the quadrigrams -- for many contexts there is
only one continuation! we're basically just storing pieces of
sentences that appear in Shakespeare
It's interesting -- they did the same exercise on some WSJ data.
See the text for examples. The sentences look very very different.
There is little overlap between the n-gram models for Shakespeare and
for WSJ.
if you want n-gram models to be good descriptions, you need lots of
data, and you need the data to be mixed in genre.
Or, you need to use different models for different genres.
Very sensitive to the training data you use.
==========
Another example:
man pages for unix:
1626 unique types out of 11,000 tokens
top seven in frequency:
709 the
304 is
275 to
250 of
192 and
188 command
184 csh
But 740 of the 1626 types only occur once!
46%
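The same tally is easy to reproduce on any text; a sketch with a made-up token stream (not the real man-page data):

```python
from collections import Counter

# Hypothetical token stream standing in for the man-page text.
tokens = "the command is the csh command to the shell and of the".split()

counts = Counter(tokens)
n_types = len(counts)
hapaxes = [w for w, c in counts.items() if c == 1]  # types occurring once

print(n_types, "types,", len(tokens), "tokens")
print("top:", counts.most_common(3))
print(f"hapaxes: {len(hapaxes)}/{n_types} = {len(hapaxes)/n_types:.0%}")
```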
========
As you add man pages:
very few or no additional new high frequency types
lots of additions to the counts for the existing
high frequency types
more and more single instance types
Sad but true!
====
training portion of the corpus used to develop the system
overly narrow: probs don't generalize
overly general: probs don't reflect the task or the domain
a separate test set used to evaluate the model:
held out test set
cross validation
evaluation differences should be statistically significant
=========
Add-one smoothing (for bigrams):

    p(wn | wn-1) = count(wn-1 wn) + 1
                   ------------------
                    count(wn-1) + V
Where V is the size of the vocabulary...i.e., the number
of unique words -- the number of word types.
====== Original counts
(remember: the column word follows the row word)
We would just add one to each entry; e.g., the I row becomes:
    9 4 4 1 3 20 5
Probabilities:
I unigram count: 3437
V = 1616
              8 + 1
P(I | I) = ----------- = .0018
           3437 + 1616
Other sample probs:
p(want | eat) = .00039    (eat row, want column)
all the original 0 entries in a row will have the same
probabilities as each other
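The add-one estimate as a small function, checked against the P(I|I) numbers above (count 8, count(I) = 3437, V = 1616):

```python
def add_one(bigram_count, prev_count, V):
    """Laplace (add-one) bigram estimate: (count + 1) / (count(prev) + V)."""
    return (bigram_count + 1) / (prev_count + V)

p_i_i = add_one(8, 3437, 1616)     # P(I | I)
p_unseen = add_one(0, 3437, 1616)  # any zero-count bigram in the I row

print(round(p_i_i, 4))
# Every original-zero entry in the row gets this same value:
print(p_unseen == add_one(0, 3437, 1616))
```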
===
Too much probability mass is moved to the zeros -- the
events we haven't seen are given too high a probability
the count augmentation of 1 was arbitrary
====
Witten-Bell
Consider unigrams to start.
The probability of seeing a new n-gram.
The ones you saw are the word types (each was new
the first time it appeared).
So, we'll take the total probability of all unigrams which have
not yet been seen to be T/(N + T), where T is the number of word
types seen so far and N is the number of tokens.
This is the probability of seeing a new word.
But we want to keep a valid probability distribution, so that all
sum to 1. So, we need to take some probability mass away from the
ones you have seen.
By adding to the denominator, you are making the probability
smaller.
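A sketch of the unigram case with toy counts; T/(N + T) here is the total mass reserved for unseen words in the standard Witten-Bell formulation:

```python
from collections import Counter

# Toy unigram counts; T = observed types, N = tokens.
counts = Counter("a b a c a b d".split())
T = len(counts)            # each type was "new" the first time it appeared
N = sum(counts.values())

unseen_mass = T / (N + T)  # total probability reserved for unseen words

def p_seen(w):
    # Seen words are discounted by the enlarged denominator.
    return counts[w] / (N + T)

# The distribution still sums to 1:
assert abs(sum(p_seen(w) for w in counts) + unseen_mass - 1) < 1e-12
```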