Assignment 1, Q1:

The course project will be to develop a system to perform semantic role labeling as defined in the CoNLL 2005 shared task (http://www.lsi.upc.edu/~srlconll). Your system will take as input a file that combines the output of a number of other NLP processors (such as a parser). Since you will need to process this data later, we'll start using it now. Also, it's good to see "real" stuff at the start of a class, and then figure out what it all really is as the class goes on.

Below is an excerpt from a README file provided for the CoNLL05 shared task. It describes some preprocessing systems and the format of the data. What to do for Q1 is described below the excerpt from the README file.

====From the CoNLL 2005 Shared Task README

PREPROCESSING SYSTEMS

The input annotations we provide have been computed with the following state-of-the-art systems:

*/ UPC processors :
   - Part-of-Speech (PoS) tagger of (Gimenez and Marquez, 2003)
   - Chunker and clauser of (Carreras and Marquez 2003), both developed within the CoNLL-2000 and CoNLL-2001 Shared Task settings, respectively.

   The clause boundaries predicted by this partial parser respect the boundaries of the chunks. Hence, this processor outputs a well-formed structure of chunks and clauses.

*/ Collins parser: The full parser of (Collins 99), with "model 2". Predicts WSJ full parses, with information of the lexical head for each syntactic constituent. The PoS tags (required by the parser) have been computed with (Gimenez and Marquez 2003).

*/ Charniak parser: The full parser of (Charniak 00). Predicts PoS tags and WSJ full parses.

*/ Named Entity Extractor: Of (Chieu and Ng 2003), developed within the 2003 Shared Task on Named Entity Extraction for English. NOTE: this processor has not been developed with WSJ training data, and is the only exception allowed for the closed challenge.

FORMAT

Here is an example of a fully-annotated sentence:

WORDS---->  NE--->  POS   PARTIAL_SYNT  FULL_SYNT------>  VS  TARGETS  PROPS------->
The         *       DT    (NP*   (S*    (S(NP*            -   -        (A0*   (A0*
$           *       $     *      *      (ADJP(QP*         -   -        *      *
1.4         *       CD    *      *      *                 -   -        *      *
billion     *       CD    *      *      *))               -   -        *      *
robot       *       NN    *      *      *                 -   -        *      *
spacecraft  *       NN    *)     *      *)                -   -        *)     *)
faces       *       VBZ   (VP*)  *      (VP*              01  face     (V*)   *
a           *       DT    (NP*   *      (NP*              -   -        (A1*   *
six-year    *       JJ    *      *      *                 -   -        *      *
journey     *       NN    *)     *      *                 -   -        *      *
to          *       TO    (VP*   (S*    (S(VP*            -   -        *      *
explore     *       VB    *)     *      (VP*              01  explore  *      (V*)
Jupiter     (ORG*)  NNP   (NP*)  *      (NP(NP*)          -   -        *      (A1*
and         *       CC    *      *      *                 -   -        *      *
its         *       PRP$  (NP*   *      (NP*              -   -        *      *
16          *       CD    *      *      *                 -   -        *      *
known       *       JJ    *      *      *                 -   -        *      *
moons       *       NNS   *)     *)     *)))))))          -   -        *)     *)
.           *       .     *      *)     *)                -   -        *      *

There is one line for each token, and a blank line after the last token. The columns, separated by spaces, represent different annotations of the sentence with a tagging along words. For structured annotations (named entities, chunks, clauses, parse trees, arguments), we use the Start-End format.

The Start-End format represents phrases (chunks, arguments, and syntactic constituents) that constitute a well-formed bracketing in a sentence (that is, phrases do not overlap, though they admit embedding). Each tag is of the form STARTS*ENDS, and represents phrases that start and end at the corresponding word. A phrase of type $k places a "($k" parenthesis at the STARTS part of the first word, and a ")" parenthesis at the END part of the last word.

Scripts will be provided to transform a column in Start-End format into other standard formats (IOB1, IOB2, WSJ trees). The Start-End format used last year (that considered the phrase type in the start and end parts) is compatible with the current software and scripts.

The different annotations in a sentence are grouped in the following blocks:

- WORDS : The words of the sentence.
- NE : Named entities.
- POS : PoS tags.
- PARTIAL SYNT : Partial syntax, namely chunks (1st column) and clauses (2nd column)
- FULL SYNT : Full syntactic tree. Note that this column represents the following WSJ tree:

    (S
      (NP (DT The)
          (ADJP (QP ($ $) (CD 1.4) (CD billion)))
          (NN robot) (NN spacecraft))
      (VP (VBZ faces)
          (NP (DT a) (JJ six-year) (NN journey)
              (S (VP (TO to)
                     (VP (VB explore)
                         (NP (NP (NNP Jupiter))
                             (CC and)
                             (NP (PRP$ its) (CD 16) (JJ known) (NNS moons))))))))
      (. .))

- VS : VerbNet sense of target verbs. These are hand-crafted annotations that will be available only in training and development sets (not for the test set).
- TARGETS : The target verbs of the sentence, in infinitive form.
- PROPS : For each target verb, a column representing the arguments of the target verb.

====End of README excerpt

Note: the actual data is formatted as shown in the sample below. Conceptually, it is the same. The differences are that there is an extra POS column, and the NE column is in a different place (the utility they provide produces this output). *FOR THIS ASSIGNMENT* we'll use the actual data format shown below. You can test your program on this data:

www.cs.pitt.edu/~wiebe/courses/CS1671/Sp2012/Assign1Data/conll05.sampledata

****What to do for Q1:****

Write a program that takes as input a file in the actual data format shown below (though without the first line, which is the column header), stores the information in an internal data structure that will be useful for working with the information later on, and then outputs the same information as in the input file by accessing your internal data structure (obviously, not by simply printing each input line to output). The input file should be a command line argument. The output should be printed to standard output. For example:

java smithQ1HW1 conll05.sampledata

The program then prints the results to the screen.
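
To make the read, store, and print round trip concrete, here is a minimal sketch of one possible shape for such a program. It is only an illustration under stated assumptions, not a reference solution: the class name `ConllRoundTrip`, the `Token` type, and the column-index constant names are all invented for this example, and the column order is taken from the header line of the sample data. The sketch also assumes that a blank line separates sentences, as the README excerpt describes.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical class name; a sketch, not a reference solution.
final class ConllRoundTrip {

    // Column positions, following the header line of the sample data.
    // The constant names are illustrative assumptions.
    static final int WORD = 0, POS = 1, CHUNK = 2, CLAUSE = 3, POS2 = 4,
                     SYNTAX = 5, NE = 6, SENSE = 7, TARGET = 8, FIRST_ARG = 9;

    /** One token line, split into its whitespace-separated columns. */
    static final class Token {
        final String[] fields;

        Token(String[] fields) {
            this.fields = fields;
        }

        String get(int column) {
            return fields[column];
        }

        /** Number of argument columns, one per target verb in the sentence. */
        int argumentCount() {
            return Math.max(0, fields.length - FIRST_ARG);
        }
    }

    /** Groups token lines into sentences; a blank line ends a sentence. */
    static List<List<Token>> read(List<String> lines) {
        List<List<Token>> sentences = new ArrayList<>();
        List<Token> current = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                if (!current.isEmpty()) {
                    sentences.add(current);
                    current = new ArrayList<>();
                }
            } else {
                current.add(new Token(trimmed.split("\\s+")));
            }
        }
        if (!current.isEmpty()) {
            sentences.add(current); // last sentence may lack a trailing blank line
        }
        return sentences;
    }

    /** Rebuilds the text from the stored fields, one token per line. */
    static String write(List<List<Token>> sentences) {
        StringBuilder out = new StringBuilder();
        for (List<Token> sentence : sentences) {
            for (Token token : sentence) {
                out.append(String.join(" ", token.fields)).append('\n');
            }
            out.append('\n'); // blank line after the last token
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        List<String> lines =
            Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        System.out.print(write(read(lines)));
    }
}
```

Note that this sketch separates columns with a single space on output, so any alignment padding in the input is not reproduced. Whether that counts as "the same information" is a judgment call; if the spacing matters, the column widths would need to be stored as well.
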

WORDS          POS  CHUNKS  CLAUSES  POS  SYNTAX        NE     SENSES  TARGET-VERBS  An argument column for each target verb
The            DT   (NP*    (S*      DT   (S1(S(NP*     *      -       -             (A1*      (A1*
trade          NN   *       *        NN   *             *      -       -             *         *
gap            NN   *)      *        NN   *)            *      -       -             *)        *)
is             VBZ  (VP*    *        AUX  (VP*          *      -       -             *         *
expected       VBN  *       *        VBN  (VP*          *      01      expect        (V*)      *
to             TO   *       *        TO   (S(VP*        *      -       -             (C-A1*    *
widen          VB   *)      *        VB   (VP*          *      01      widen         *         (V*)
to             TO   (PP*)   *        TO   (PP*          *      -       -             *         (A4*
about          RB   (NP*    *        RB   (NP(QP*       *      -       -             *         *
$              $    *       *        $    *             *      -       -             *         *
9              CD   *       *        CD   *             *      -       -             *         *
billion        CD   *)      *        CD   *)))          *      -       -             *         *)
from           IN   (PP*)   *        IN   (PP*          *      -       -             *         (A3*
July           NNP  (NP*)   *        NNP  (NP(NP*       *      -       -             *         *
's             POS  (NP*    *        POS  *)            *      -       -             *         *
$              $    *       *        $    (QP*          *      -       -             *         *
7.6            CD   *       *        CD   *             *      -       -             *         *
billion        CD   *)      *        CD   *))))))       *      -       -             *)        *)
,              ,    *       *        ,    *             *      -       -             *         *
according      VBG  (PP*)   *        VBG  (PP*          *      -       -             (AM-ADV*  *
to             TO   (PP*)   *        TO   (PP*          *      -       -             *         *
a              DT   (NP*    *        DT   (NP(NP*       *      -       -             *         *
survey         NN   *)      *        NN   *)            *      -       -             *         *
by             IN   (PP*)   *        IN   (PP*          *      -       -             *         *
MMS            NNP  (NP*    *        NNP  (NP(NP*       (ORG*  -       -             *         *
International  NNP  *)      *        NNP  *)            *)     -       -             *         *
,              ,    *       *        ,    *             *      -       -             *         *
a              DT   (NP*    *        DT   (NP(NP*       *      -       -             *         *
unit           NN   *)      *        NN   *)            *      -       -             *         *
of             IN   (PP*)   *        IN   (PP*          *      -       -             *         *
McGraw-Hill    NNP  (NP*    *        NNP  (NP(NP*       (ORG*  -       -             *         *
Inc.           NNP  *)      *        NNP  *)            *)     -       -             *         *
,              ,    *       *        ,    *             *      -       -             *         *
New            NNP  (NP*    *        NNP  (NP*          (LOC*  -       -             *         *
York           NNP  *)      *        NNP  *)))))))))))  *)     -       -             *)        *
.              .    *       *S)      .    *))           *      -       -             *         *
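
The bracketed columns in the sample (chunks, clauses, syntax, named entities, arguments) use the Start-End format from the README excerpt. As a rough illustration of how one such column could be decoded into labeled spans, here is a minimal stack-based sketch in Java. The class name `StartEndDecoder`, the `Span` type, and the choice to report zero-based, inclusive token indices are assumptions made for this example; they are not part of the CoNLL scripts. The sketch counts only the ")" characters in the ENDS part, so a tag such as `*S)` from the older format is treated the same as `*)`.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical helper; a sketch of decoding one Start-End column.
final class StartEndDecoder {

    /** A labeled phrase covering tokens start..end (zero-based, inclusive). */
    static final class Span {
        final String label;
        final int start;
        final int end;

        Span(String label, int start, int end) {
            this.label = label;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return label + "[" + start + "," + end + "]";
        }
    }

    /** Decodes one column of STARTS*ENDS tags, one tag per token. */
    static List<Span> decode(List<String> tags) {
        List<Span> spans = new ArrayList<>();
        Deque<String> labels = new ArrayDeque<>();  // labels of open phrases
        Deque<Integer> starts = new ArrayDeque<>(); // their start positions
        for (int i = 0; i < tags.size(); i++) {
            String tag = tags.get(i);
            int star = tag.indexOf('*');
            if (star < 0) {
                throw new IllegalArgumentException("no '*' in tag: " + tag);
            }
            // STARTS part: "(S(NP" opens S, then NP, at this token.
            for (String label : tag.substring(0, star).split("\\(")) {
                if (!label.isEmpty()) {
                    labels.push(label);
                    starts.push(i);
                }
            }
            // ENDS part: each ')' closes the innermost open phrase.
            String ends = tag.substring(star + 1);
            for (int k = 0; k < ends.length(); k++) {
                if (ends.charAt(k) == ')') {
                    if (labels.isEmpty()) {
                        throw new IllegalArgumentException(
                            "unbalanced ')' at token " + i);
                    }
                    spans.add(new Span(labels.pop(), starts.pop(), i));
                }
            }
        }
        if (!labels.isEmpty()) {
            throw new IllegalArgumentException(
                "unclosed phrase: " + labels.peek());
        }
        return spans;
    }

    public static void main(String[] args) {
        // Chunk tags for "The $ 1.4 billion robot spacecraft faces"
        // from the README example.
        List<String> chunks =
            Arrays.asList("(NP*", "*", "*", "*", "*", "*)", "(VP*)");
        System.out.println(decode(chunks));
    }
}
```

Run on the chunk tags above, the sketch prints `[NP[0,5], VP[6,6]]`: one NP covering the first six tokens and a single-token VP. Spans are reported in the order they close, so in nested columns such as SYNTAX an inner phrase appears before the phrase that contains it.
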