From steffi@pitt.edu Mon Oct 15 13:33:05 2001
Return-Path: <steffi@pitt.edu>
Received: from sancho.lrdc.pitt.edu (sancho.lrdc.pitt.edu [136.142.147.68])
	by gomez.cs.pitt.edu (8.11.5/8.11.5) with ESMTP id f9FHX4E28057
	for <litman@cs.pitt.edu>; Mon, 15 Oct 2001 13:33:05 -0400 (EDT)
	(envelope-from steffi@pitt.edu)
Received: from pitt.edu (localhost [127.0.0.1])
	by sancho.lrdc.pitt.edu (8.9.1/8.9.1) with ESMTP id NAA05216
	for <litman@cs.pitt.edu>; Mon, 15 Oct 2001 13:31:22 -0400 (EDT)
Sender: steffi@sancho.lrdc.pitt.edu
Message-ID: <3BCB1D66.76990230@pitt.edu>
Date: Mon, 15 Oct 2001 13:31:18 -0400
From: Stefanie Bruninghaus <steffi@pitt.edu>
Organization: University of Pittsburgh
X-Mailer: Mozilla 4.76 [en] (X11; U; SunOS 5.6 sun4u)
X-Accept-Language: en
MIME-Version: 1.0
To: litman@cs.pitt.edu
Subject: My fixes 
Content-Type: multipart/mixed;
 boundary="------------B3ACCB822C7CFE1692D482F3"
Status: RO

This is a multi-part message in MIME format.
--------------B3ACCB822C7CFE1692D482F3
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

Diane - attached are my two files. The naming is not exactly appropriate
(but I am sure you know how these things evolve ...). 

You probably want to have a quick look at the code - just so that you know
what it looks like. I am confident that it works as expected, but I have
found count discrepancies of +/- 1 in a few cases for my lowercase version
of the bigrams (the overall behavior is OK!). The code is not what I
would submit to a Perl contest, either. 

There is a switch for lower case/regular case, and I am ordering things
by counts (and not alphabetically, as in the original code). If you
think that this choice is bad, it's easy to locate and change; I have
added comments to the code. 
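
A quick way to cross-check the counts (including the +/- 1 bigram
cases) is to recount with standard Unix tools. The following is only a
sketch and not part of the attached code: words.txt and the sample
words are made up, and it assumes one word per line, which is the
format count-unigrams.pl writes to the .words file.

```shell
#!/bin/sh
# Cross-check sketch: recount unigrams and bigrams from a one-word-per-line
# file using standard Unix tools. words.txt and its contents are made up
# for illustration only.
LC_ALL=C; export LC_ALL

printf '%s\n' the cat saw the dog and the cat > words.txt

# Unigrams ordered by count (highest first), written as "count word".
sort words.txt | uniq -c | sort -k1,1nr -k2 | awk '{print $1, $2}' > unigrams.by_count

# Unigrams ordered alphabetically by word.
sort words.txt | uniq -c | awk '{print $1, $2}' > unigrams.by_word

# Bigrams: pair each word with its successor. The last pasted line pairs
# the final word with nothing, so it is dropped.
tail -n +2 words.txt > next.txt
paste words.txt next.txt | sed '$d' | sort | uniq -c | sort -k1,1nr -k2 \
  | awk '{print $1, $2 "\t" $3}' > bigrams.by_count

# N words yield N-1 bigram tokens; a mismatch here points at an edge case.
echo "words:   $(wc -l < words.txt | tr -d ' ')"
echo "bigrams: $(awk '{n += $1} END {print n}' bigrams.by_count)"
```

On a real run, pointing the pipelines at the .words file instead of
words.txt gives numbers to compare against the .unigrams and .bigrams
files. Entries with equal counts may come out in a different order
than in the Perl output.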

Here is what to do: 

- type: telnet unixs.cis.pitt.edu and log in 

Start downloading the files, unpacking, compiling - up to the point where
things don't work (which is the call of "Stats"). From there, 

- copy the attached two files into the ngrams directory (Stats will be
replaced by my version)
- as in the instructions on the web, change the first line of
count-unigrams.pl to the local Perl path (on unixs.cis, this is
/usr/pitt/bin/perl)
- type: chmod +x count-unigrams.pl
- type: ./Stats corpora/GEN.EN genesis/out

... and then proceed as explained on your webpage. 
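
The first-line change and the chmod can also be done from the shell.
A minimal sketch, run here on a stand-in file (demo-count-unigrams.pl
is a made-up name) rather than on the real script:

```shell
#!/bin/sh
# Sketch of the "change the first line" step on a stand-in copy of the script.
# demo-count-unigrams.pl is a made-up file name used only for illustration.
printf '%s\n' '#!/usr/local/bin/perl' 'print "hello\n";' > demo-count-unigrams.pl

# Path to perl on the target machine (/usr/pitt/bin/perl on unixs.cis).
PERL_PATH=/usr/pitt/bin/perl

# Rewrite line 1 only; write to a temporary file, then move it into place.
sed "1s|^#!.*|#!$PERL_PATH|" demo-count-unigrams.pl > demo-count-unigrams.pl.new
mv demo-count-unigrams.pl.new demo-count-unigrams.pl

# Make the script executable and show the new first line.
chmod +x demo-count-unigrams.pl
head -n 1 demo-count-unigrams.pl
```

If the Perl location on a machine is unknown, command -v perl (or
which perl) should print it.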

I hope that works! I have tested this on Solaris 2.6, and it works
fine. Ilya tried it on the unixs machines, where it worked, and on his
PC version, where something went wrong. 

Best - Steffi. 
-- 
----------------------------------------------------------------------
Stefanie Bruninghaus
Learning Research and Development Center     Mail:    steffi+@pitt.edu
University of Pittsburgh                     Web: www.pitt.edu/~steffi
3939 O'Hara Street                           Phone:   (412) 624 - 6748
Pittsburgh, PA 15260-5159 -- USA             Fax:     (412) 624 - 9149
----------------------------------------------------------------------
--------------B3ACCB822C7CFE1692D482F3
Content-Type: application/x-perl;
 name="count-unigrams.pl"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline;
 filename="count-unigrams.pl"

#!/usr/local/bin/perl


# Stefanie Bruninghaus 10/2001 

# usage: see Stats program, follows the convention of the embedding program 
# first parameter: infile, second parameter: outfile prefix string

$infile = $ARGV[0];
$outstring = $ARGV[1];

$unigram_outfile = ">>$outstring.unigrams";
$bigram_outfile = ">>$outstring.bigrams";
$words_outfile = ">>$outstring.words";

open BIGRAMOUTFILE, $bigram_outfile 
  or die "Can't open bigrams $bigram_outfile: $!\n";
open UNIGRAMOUTFILE, $unigram_outfile 
  or die "Can't open unigrams $unigram_outfile: $!\n";
open WORDSOUTFILE, $words_outfile 
  or die "Can't open words $words_outfile: $!\n";
open INFILE, $infile 
  or die "Can't open infile $infile: $!\n";

undef $/; # slurp in file at once - not pretty, but makes life easier
          # ($/ is the input record separator; $\ is the output one)

$text = <INFILE>;
close INFILE; 


$text =~ s/[()!<>=:;.,?0123456789]/ /g; # remove all non-letter characters, 
                                        # the next few lines are Emacs-perl-mode
                                        # artifacts
                                        # Notice that \w and \W aren't perfect,
                                        # since we also want to delete numbers, 
                                        # but keep whitespaces 
                       
$text =~ s/\-/ /g;                  
$text =~ s/\'/ /g;                      # replace the ' on a separate line 
                                        # because otherwise, formatting 
                                        # in Emacs is confused 
$text =~ s/\"//g;                       # ditto (deleted, not blanked) 
$text =~ s:/: :g;                       # use : as the delimiter so that the
                                        # slash / needs no escaping

$text =~ s/\s/ /g;                      # turn tabs and newlines into blanks 
$text =~ s/ +/ /g;                      # collapse runs of blanks into one 
$text =~ s/^ +//;                       # remove any leading whitespace 
$text =~ s/ +$//;                       # remove any trailing whitespace 

#$text = lc($text);                     # uncomment this line to get everything
                                        # in lower case if wanted 

@words = split / /, $text;              # finally - split text into words! 

foreach $word (@words) {                # double function loop - get words file
                                        # and do unigram count in one go - not
                                        # good programming, but efficient. 
    unless ($word =~ m/^ +$/) {         # good habit - ditch blanks (not 
                                        # really needed, I guess)
	print WORDSOUTFILE "$word\n";
	$count{$word}++;
    }
}

@keys = sort { $count{$b} <=> $count{$a} } (keys %count);     
                                        # sort unigrams by count
# @keys = sort { $a cmp $b } (keys %count);   
                                        # sort unigrams alphabetically 

foreach $word (@keys) {
    print UNIGRAMOUTFILE "$count{$word} $word\n";
}


# Less elegant solution start: with first word alone as bigram 
#$prev = "";
#foreach $word (@words) {
#    $bigram = "$prev\t$word"; 
#    $bcount{$bigram}++;
#    $prev = $word; 
#}

# Much more beautiful
for ($i = 0; $i < $#words; $i++) {
    $bigram = "$words[$i]\t$words[$i+1]";   # following the original code, we 
                                            # define a bigram as the two words, 
                                            # glued together with a tab. 
    $bcount{$bigram}++; 
}
 

@bkeys = sort { $bcount{$b} <=> $bcount{$a} } (keys %bcount); 
                                            # sort bigrams by count
# @bkeys = sort { $a cmp $b } (keys %bcount); 
                                            # sort bigrams alphabetically

foreach $bigram (@bkeys) {
    print BIGRAMOUTFILE "$bcount{$bigram} $bigram\n";
}
    
# good housekeeping ;-) 
close BIGRAMOUTFILE; 
close UNIGRAMOUTFILE; 
close WORDSOUTFILE; 


--------------B3ACCB822C7CFE1692D482F3
Content-Type: text/plain; charset=us-ascii;
 name="Stats"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline;
 filename="Stats"

#!/bin/csh
#
#
#  Usage: Stats input-file output-file-prefix
#
#  Code for identifying word-word collocations
#  based on Ken Church's 1995 'NGRAMS' tutorial
#  and extended to use the likelihood ratio as
#  an alternative (and generally better) measure
#  of association than mutual information.
#
#  Author:  Philip Resnik
#
# changes made by Steffi Bruninghaus, 10/2001 to make this code 
# run under Solaris 2.6 



if ("$2" == "" || "$3" != "") then
  echo "Usage: $0 input-file output-file-prefix"
else

  set INFILE=$1
  set OUTPREFIX=$2

  echo "Processing file $INFILE"

#  echo "Getting unigram counts: see $OUTPREFIX.unigrams"
#  /bin/rm -f $OUTPREFIX.unigrams
#  ./count_words < $INFILE > $OUTPREFIX.unigrams

#  echo "Getting bigram counts: see $OUTPREFIX.bigrams"
#  /bin/rm -f $OUTPREFIX.bigrams
#  ./count_bigrams $INFILE $OUTPREFIX > $OUTPREFIX.bigrams
    
  echo "Steffi is computing bigram and unigram counts"
  /bin/rm -f $OUTPREFIX.words
  /bin/rm -f $OUTPREFIX.bigrams
  /bin/rm -f $OUTPREFIX.unigrams
  ./count-unigrams.pl $INFILE $OUTPREFIX
  echo "Returning to the original code"

  echo "Computing bigram mutual information: see $OUTPREFIX.mi"
  /bin/rm -f $OUTPREFIX.mi
  ./mutual_info $OUTPREFIX > $OUTPREFIX.mi

  echo "Computing likelihood ratio for bigrams: see $OUTPREFIX.lr"
  /bin/rm -f $OUTPREFIX.lr.values $OUTPREFIX.lr 
  cat $OUTPREFIX.mi | ./lr_filter.pl `cat $OUTPREFIX.words | wc -l` \
     | xargs -n 6 ./lr_simple > $OUTPREFIX.lr.values
  paste $OUTPREFIX.lr.values $OUTPREFIX.mi | sort -nr > $OUTPREFIX.lr



endif

--------------B3ACCB822C7CFE1692D482F3--