Assignment 1, Q1:

The course project will be to develop a system to perform semantic
role labeling as defined in the CoNLL 2005 shared task
(http://www.lsi.upc.edu/~srlconll).  Your system will take as input a
file that combines the output of a number of other NLP processors
(such as a parser).  Since you will need to process this data later,
we'll start using it now.

Also, it's good to see "real" stuff at the start of a class, and then
figure out what it all really is as the class goes on.
 
Below is an excerpt from a README file provided for the CoNLL05 shared
task.  It describes some preprocessing systems and the format of the
data.

What to do for Q1 is described below the excerpt from the README file.

====From the CoNLL 2005 Shared Task README

PREPROCESSING SYSTEMS

The input annotations we provide have been computed with the following
state-of-the-art systems:

*/ UPC processors : 
   
   - Part-of-Speech (PoS) tagger of (Gimenez and Marquez, 2003) 
   - Chunker and clauser of (Carreras and Marquez 2003), both
     developed within the CoNLL-2000 and CoNLL-2001 Shared Task
     settings, respectively. The clause boundaries predicted by this
     partial parser respect the boundaries of the chunks. Hence, this
     processor outputs a well-formed structure of chunks and clauses. 

*/ Collins parser: 

   The full parser of (Collins 99), with "model 2". Predicts WSJ full
   parses, with information of the lexical head for each syntactic
   constituent. The PoS tags (required by the parser) have been
   computed with (Gimenez and Marquez 2003).
   
*/ Charniak parser: 

   The full parser of (Charniak 00). Predicts PoS tags and WSJ full
   parses.

*/ Named Entity Extractor:  

   Of (Chieu and Ng 2003), developed within the 2003 Shared Task on
   Named Entity Extraction for English. 

   NOTE: this processor has not been developed with WSJ training data,
   and is the only exception allowed for the closed challenge.


FORMAT

Here is an example of a fully-annotated sentence: 


   WORDS---->  NE--->  POS   PARTIAL_SYNT   FULL_SYNT------>   VS   TARGETS  PROPS------->
								          
   The             *   DT    (NP*   (S*        (S(NP*          -    -        (A0*    (A0*       
   $               *   $        *     *     (ADJP(QP*          -    -           *       *       
   1.4             *   CD       *     *             *          -    -           *       *       
   billion         *   CD       *     *             *))        -    -           *       *       
   robot           *   NN       *     *             *          -    -           *       *       
   spacecraft      *   NN       *)    *             *)         -    -           *)      *)    
   faces           *   VBZ   (VP*)    *          (VP*          01   face      (V*)      *       
   a               *   DT    (NP*     *          (NP*          -    -        (A1*       *       
   six-year        *   JJ       *     *             *          -    -           *       *       
   journey         *   NN       *)    *             *          -    -           *       *       
   to              *   TO    (VP*   (S*        (S(VP*          -    -           *       *       
   explore         *   VB       *)    *          (VP*          01   explore     *     (V*)     
   Jupiter     (ORG*)  NNP   (NP*)    *       (NP(NP*)         -    -           *    (A1*       
   and             *   CC       *     *             *          -    -           *       *       
   its             *   PRP$  (NP*     *          (NP*          -    -           *       *       
   16              *   CD       *     *             *          -    -           *       *       
   known           *   JJ       *     *             *          -    -           *       *       
   moons           *   NNS      *)    *)            *)))))))   -    -           *)      *)    
   .               *   .        *     *)            *)         -    -           *       *    


There is one line for each token, and a blank line after the last
token. The columns, separated by spaces, represent different
annotations of the sentence with a tagging along words. For structured
annotations (named entities, chunks, clauses, parse trees, arguments),
we use the Start-End format.

The Start-End format represents phrases (chunks, arguments, and
syntactic constituents) that constitute a well-formed bracketing in a
sentence (that is, phrases do not overlap, though they admit
embedding). Each tag is of the form STARTS*ENDS, and represents
phrases that start and end at the corresponding word. A phrase of type
$k places a "($k" parenthesis at the STARTS part of the first word,
and a ")" parenthesis at the ENDS part of the last word.  Scripts will
be provided to transform a column in Start-End format into other
standard formats (IOB1, IOB2, WSJ trees). The Start-End format used
last year (that considered the phrase type in the start and end parts)
is compatible with the current software and scripts.

The different annotations in a sentence are grouped in the following
blocks:

- WORDS        : The words of the sentence.
- NE           : Named entities.
- POS          : PoS tags.
- PARTIAL SYNT : Partial syntax, namely chunks (1st column) and
                 clauses (2nd column)
- FULL SYNT    : Full syntactic tree. Note that this column
                 represents the following WSJ tree:

    (S 
       (NP (DT The) 
         (ADJP 
           (QP ($ $) (CD 1.4) (CD billion) ))
         (NN robot) (NN spacecraft) )
       (VP (VBZ faces) 
         (NP (DT a) (JJ six-year) (NN journey) 
           (S 
             (VP (TO to) 
               (VP (VB explore) 
                 (NP 
                   (NP (NNP Jupiter) )
                   (CC and) 
                   (NP (PRP$ its) (CD 16) (JJ known) (NNS moons) )))))))
       (. .) )


- VS           : VerbNet sense of target verbs. These are hand-crafted
                 annotations that will be available only in training
                 and development sets (not for the test set).
- TARGETS      : The target verbs of the sentence, in infinitive form. 
- PROPS        : For each target verb, a column representing the arguments
                 of the target verb.

====End of README excerpt
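
As an illustration of the Start-End format described in the excerpt,
here is a minimal Python sketch that turns one column of tags into
labeled spans.  The function name start_end_spans and the
(label, first, last) tuple layout are made up for this example; they
are not part of the CoNLL scripts.  The tags are the first PROPS
column of the README sentence.

```python
import re


def start_end_spans(tags):
    """Convert one Start-End column into (label, first, last) spans.

    Token indices are 0-based and inclusive.  Spans are reported in the
    order in which they close, so embedded phrases come before the
    phrases that contain them.
    """
    spans = []
    open_phrases = []  # stack of (label, start index) not yet closed
    for i, tag in enumerate(tags):
        starts, star, ends = tag.partition("*")
        if not star:
            raise ValueError(f"token {i}: tag {tag!r} has no '*'")
        # Each "(LABEL" in the STARTS part opens a phrase at this token.
        for label in re.findall(r"\(([^()]+)", starts):
            open_phrases.append((label, i))
        # Each ")" in the ENDS part closes the most recently opened phrase.
        for _ in range(ends.count(")")):
            if not open_phrases:
                raise ValueError(f"token {i}: ')' with no open phrase")
            label, first = open_phrases.pop()
            spans.append((label, first, i))
    if open_phrases:
        raise ValueError(f"unclosed phrases: {open_phrases}")
    return spans


# First PROPS column of the README example (19 tokens, "The" to ".").
props = ["(A0*"] + ["*"] * 4 + ["*)", "(V*)", "(A1*"] + ["*"] * 9 + ["*)", "*"]
print(start_end_spans(props))
# [('A0', 0, 5), ('V', 6, 6), ('A1', 7, 17)]
```

Because only the ")" characters in the ENDS part are counted, the
sketch also accepts the older style in which the phrase type is
repeated in the end part.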

Note: the actual data is formatted as shown in the sample below.
Conceptually, it is the same.  The differences are that there is an
extra POS column, and the NE column is in a different place (the
utility they provide produces this output).
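
Since the syntax column is conceptually the same in both layouts, the
README's remark that it represents a WSJ tree can be checked with a
short Python sketch: replacing the "*" in each tag with a
"(POS word)" leaf and concatenating the results gives the bracketed
tree.  The function name bracketed_tree is hypothetical, and the rows
are the WORDS, POS and FULL_SYNT columns of the README example.

```python
def bracketed_tree(rows):
    """Build a one-line bracketed tree from (word, pos, syntax_tag) rows."""
    pieces = []
    for word, pos, tag in rows:
        if tag.count("*") != 1:
            raise ValueError(f"expected exactly one '*' in {tag!r}")
        # The '*' stands for the token itself, written as a (POS word) leaf.
        pieces.append(tag.replace("*", f"({pos} {word})"))
    return "".join(pieces)


# WORDS, POS and FULL_SYNT columns of the README example sentence.
README_ROWS = [line.split() for line in """
The DT (S(NP*
$ $ (ADJP(QP*
1.4 CD *
billion CD *))
robot NN *
spacecraft NN *)
faces VBZ (VP*
a DT (NP*
six-year JJ *
journey NN *
to TO (S(VP*
explore VB (VP*
Jupiter NNP (NP(NP*)
and CC *
its PRP$ (NP*
16 CD *
known JJ *
moons NNS *)))))))
. . *)
""".strip().splitlines()]

tree = bracketed_tree(README_ROWS)
print(tree)
```

The printed tree has the same bracketing as the tree shown in the
README, apart from whitespace.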

*FOR THIS ASSIGNMENT* we'll use the actual data format shown below.
You can test your program on this data:

  www.cs.pitt.edu/~wiebe/courses/CS1671/Sp2012/Assign1Data/conll05.sampledata
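
One possible way to hold a single token line of the actual format is
sketched below in Python.  The class name Token, its field names, and
parse_token_line are assumptions made for illustration; the column
order follows the header line of the sample further down.  Which
processor produced each of the two POS columns is not stated here, so
the second one is given the neutral name pos2.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    word: str
    pos: str
    chunk: str
    clause: str
    pos2: str      # the second POS column
    syntax: str
    ne: str
    sense: str
    target: str
    args: list = field(default_factory=list)  # one tag per target verb

    def columns(self):
        """Return the values in the same order as the input line."""
        return [self.word, self.pos, self.chunk, self.clause, self.pos2,
                self.syntax, self.ne, self.sense, self.target, *self.args]


def parse_token_line(line):
    """Split one whitespace-separated token line into a Token."""
    parts = line.split()
    if len(parts) < 9:
        raise ValueError(f"expected at least 9 columns, got {len(parts)}")
    # The first nine columns are fixed; the rest are argument columns.
    return Token(*parts[:9], args=parts[9:])


line = "expected VBN * * VBN (VP* * 01 expect (V*) *"
token = parse_token_line(line)
print(token.word, token.target, token.args)
```

Keeping the argument columns in a list lets the number of target
verbs vary from sentence to sentence.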

****What to do for Q1:****

Write a program that takes as input a file in the actual data format
shown below (though without the first line), stores the information in
an internal data structure that will be useful for working with the
information later on, and then outputs the same information as in the
input file by accessing your internal data structure (obviously, not
by simply printing each input line to output).  The input file
should be a command line argument.  The output should be printed to
standard output.  For example,

  java smithQ1HW1 conll05.sampledata 

The program then prints the results to the screen.

WORDS   POS  CHUNKS  CLAUSES POS    SYNTAX   NE SENSES TARGET-VERBS An argument column for each target verb

The            DT   (NP*   (S*    DT   (S1(S(NP*                 *   -   -           (A1*   (A1*
trade          NN      *     *    NN           *                 *   -   -              *      *
gap            NN      *)    *    NN           *)                *   -   -              *)     *)
is             VBZ  (VP*     *    AUX       (VP*                 *   -   -              *      *
expected       VBN     *     *    VBN       (VP*                 *   01  expect       (V*)     *
to             TO      *     *    TO      (S(VP*                 *   -   -         (C-A1*      *
widen          VB      *)    *    VB        (VP*                 *   01  widen          *    (V*)
to             TO   (PP*)    *    TO        (PP*                 *   -   -              *   (A4*
about          RB   (NP*     *    RB     (NP(QP*                 *   -   -              *      *
$              $       *     *    $            *                 *   -   -              *      *
9              CD      *     *    CD           *                 *   -   -              *      *
billion        CD      *)    *    CD           *)))              *   -   -              *      *)
from           IN   (PP*)    *    IN        (PP*                 *   -   -              *   (A3*
July           NNP  (NP*)    *    NNP    (NP(NP*                 *   -   -              *      *
's             POS  (NP*     *    POS          *)                *   -   -              *      *
$              $       *     *    $         (QP*                 *   -   -              *      *
7.6            CD      *     *    CD           *                 *   -   -              *      *
billion        CD      *)    *    CD           *))))))           *   -   -              *)     *)
,              ,       *     *    ,            *                 *   -   -              *      *
according      VBG  (PP*)    *    VBG       (PP*                 *   -   -       (AM-ADV*      *
to             TO   (PP*)    *    TO        (PP*                 *   -   -              *      *
a              DT   (NP*     *    DT     (NP(NP*                 *   -   -              *      *
survey         NN      *)    *    NN           *)                *   -   -              *      *
by             IN   (PP*)    *    IN        (PP*                 *   -   -              *      *
MMS            NNP  (NP*     *    NNP    (NP(NP*             (ORG*   -   -              *      *
International  NNP     *)    *    NNP          *)                *)  -   -              *      *
,              ,       *     *    ,            *                 *   -   -              *      *
a              DT   (NP*     *    DT     (NP(NP*                 *   -   -              *      *
unit           NN      *)    *    NN           *)                *   -   -              *      *
of             IN   (PP*)    *    IN        (PP*                 *   -   -              *      *
McGraw-Hill    NNP  (NP*     *    NNP    (NP(NP*             (ORG*   -   -              *      *
Inc.           NNP     *)    *    NNP          *)                *)  -   -              *      *
,              ,       *     *    ,            *                 *   -   -              *      *
New            NNP  (NP*     *    NNP       (NP*             (LOC*   -   -              *      *
York           NNP     *)    *    NNP          *)))))))))))      *)  -   -              *)     *
.              .       *     *S)  .            *))               *   -   -              *      *
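
For orientation only, the read/store/print cycle that Q1 asks for
could be outlined as in the Python sketch below.  It assumes that
columns are separated by whitespace and that a blank line ends each
sentence, as the README describes.  The names read_sentences,
format_sentence and main are made up for this example, and a list of
rows per sentence is just one of many possible internal data
structures.

```python
import sys


def read_sentences(lines):
    """Group whitespace-separated token lines into sentences.

    Returns a list of sentences; each sentence is a list of rows and
    each row is the list of column values for one token.
    """
    sentences, current = [], []
    for line in lines:
        row = line.split()
        if row:
            current.append(row)
        elif current:          # a blank line ends the current sentence
            sentences.append(current)
            current = []
    if current:                # input did not end with a blank line
        sentences.append(current)
    return sentences


def format_sentence(rows):
    """Render stored rows as text, padding each column to its widest value."""
    n_cols = max(len(row) for row in rows)
    widths = [max(len(row[c]) for row in rows if c < len(row))
              for c in range(n_cols)]
    lines = []
    for row in rows:
        cells = [value.ljust(widths[c]) for c, value in enumerate(row)]
        lines.append("  ".join(cells).rstrip())
    return "\n".join(lines)


def main(argv):
    """Read the file named on the command line and print it back."""
    if len(argv) != 2:
        print(f"usage: {argv[0]} INPUT_FILE", file=sys.stderr)
        return 2
    with open(argv[1], encoding="utf-8") as handle:
        sentences = read_sentences(handle)
    for rows in sentences:
        print(format_sentence(rows))
        print()  # blank line after the last token of each sentence
    return 0
```

Calling sys.exit(main(sys.argv)) at the bottom of the file would make
the sketch usable from the command line.  The output keeps every
token and column value, but not the exact spacing of the input.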