Put the solution to lab 7 in the "hidden Place" - H on the schedule.

MUST DO SYS - though not right away.  I need to know what the command
line is on Macs if they don't access the OS directly.


File processing.  Great examples in chapter 8.

The code for the book is on my webpage.  You can download it too -
just google the book, and look for "Code".  Easy to download.

http://www.cs.pitt.edu/~wiebe/courses/CS0007/Lectures/code

Data (Scientific, Financial, Linguistic, etc.) is often stored in many
different ways.  

Often, data is more complex (as in assignment 3!)  A study might keep
track of the heights, weights, and ages of its participants.

Each RECORD may appear on a line by itself, with pieces of data in
each record separated by delimiters.  Or, a RECORD might be spread out
over more than one line.
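
As a small illustration of a single-line record, the sketch below splits
one line into its fields.  The sample line, the comma delimiter, and the
name parse_record are assumptions made up for this example; they are not
taken from the book or from assignment 3.

```python
def parse_record(line, delimiter=','):
    '''Split one single-line record into a list of stripped fields.'''
    fields = []
    for piece in line.strip().split(delimiter):
        fields.append(piece.strip())
    return fields

# Hypothetical record: height (cm), weight (kg), age (years)
record = parse_record('172, 68, 34\n')
height, weight, age = int(record[0]), int(record[1]), int(record[2])
```

A different delimiter (a tab, say) can be passed as the second argument.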

NOTE:  PAGE 156:  in Wing 101, there is supposedly a way to enter file
   arguments to the program:  right click the editing pane; properties;
   debug tab; enter the run arguments.  But this only works with
   files.

http://www.robjhyndman.com/TSDL/ecology1/hopedale.dat

following files are now in Lectures - sp16*py

import urllib

if __name__ == '__main__':
    url = "http://www.robjhyndman.com/TSDL/ecology1/hopedale.dat"

    web_page = urllib.urlopen(url)

    for line in web_page:
        line = line.strip()
        print line
    web_page.close()

Or, we can download the file separately, and then just read the file:

import urllib

if __name__ == '__main__':
    f = open("hopedale.dat",'r')
    for line in f:
        line = line.strip()
        print line
    f.close()

    
Let's make a printing function that works whether the data comes from a file or a URL:

import urllib

def process_file(reader):
    ''' Read and print the contents of reader '''

    for line in reader:
        line = line.strip()
        print line

if __name__ == '__main__':
    f = open("hopedale.dat",'r')
    process_file(f)
    f.close()

    url = "http://www.robjhyndman.com/TSDL/ecology1/hopedale.dat"
    web_page = urllib.urlopen(url)
    process_file(web_page)
    web_page.close()
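
The reason one function can serve both sources is that it only loops over
its argument, so anything that yields lines will do.  Here is a minimal
sketch of that idea.  It is written for Python 3, and stripped_lines is a
hypothetical variant that returns the lines instead of printing them so
the result can be checked; io.StringIO is a stand-in for an open file.

```python
import io

def stripped_lines(reader):
    '''Return the lines of reader as a list, with surrounding
    whitespace removed.'''
    result = []
    for line in reader:
        result.append(line.strip())
    return result

# An in-memory "file" and a plain list of strings both work as readers.
fake_file = io.StringIO('first line\nsecond line\n')
from_reader = stripped_lines(fake_file)
from_list = stripped_lines(['first line\n', 'second line\n'])
```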


Suppose we want to create a list of just the data.

We know there is one header line, and any number of comments
(prefaced by #).

We do not know how many comment lines there are!

Skip the first line, throw it away
Skip the comment lines, throw them away
Process the real data

literal translation:

def get_data(f):
   f.readline() 
   line = f.readline()
   while line.startswith('#'):
      line = f.readline()

   data = []
   # now, process the data

   line = f.readline()
   while line != '':
      data.append(int(line.strip()))
      line = f.readline()
   return data

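Nope!  You would lose the first line of data:  when the comment loop
ends, line already holds the first data line, and the extra
f.readline() throws it away.
In hopedale.dat: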

[trace on board]


def get_data(f):
   f.readline() 
   line = f.readline()
   while line.startswith('#'):
      line = f.readline()

   data = []
   # now, process the data

   while line != '':
      data.append(int(line.strip()))
      line = f.readline()
   return data

trace that.
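
One way to check a trace is to run both versions on a tiny in-memory
file.  The sketch below is for Python 3; the sample text is made up (it
is not the real contents of hopedale.dat), and skip_header,
get_data_buggy, and get_data_fixed are names chosen for this example.

```python
import io

# Made-up data in the same shape: one header, some comments, then numbers.
SAMPLE = 'Header line\n# comment one\n# comment two\n22\n29\n2\n'

def skip_header(f):
    '''Skip the header and comment lines; return the first data line.'''
    f.readline()
    line = f.readline()
    while line.startswith('#'):
        line = f.readline()
    return line

def get_data_buggy(f):
    skip_header(f)           # first data line is read here, then ignored
    data = []
    line = f.readline()      # the extra read moves on to the second data line
    while line != '':
        data.append(int(line.strip()))
        line = f.readline()
    return data

def get_data_fixed(f):
    line = skip_header(f)    # keep the first data line
    data = []
    while line != '':
        data.append(int(line.strip()))
        line = f.readline()
    return data

buggy = get_data_buggy(io.StringIO(SAMPLE))   # [29, 2]
fixed = get_data_fixed(io.StringIO(SAMPLE))   # [22, 29, 2]
```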

data files with missing values:  fileproc/hebron.txt

We would get an error with our code above, because int('-') raises a
ValueError.

def get_data(f):
   f.readline() 
   line = f.readline()
   while line.startswith('#'):
      line = f.readline()

   data = []
   # now, process the data

   while line != '':
      line = line.strip()
      if line != '-':
          data.append(int(line))
      line = f.readline()
   return data
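
To see the missing-value handling at work, the same logic can be run on a
short in-memory file.  This is a Python 3 sketch; the sample values are
invented and are not the contents of hebron.txt.  Note that the read at
the bottom of the loop has to happen on every pass, including the passes
that skip a '-'.

```python
import io

def get_data(f):
    '''Return the integers in f, skipping the header, the comments,
    and any '-' entries.'''
    f.readline()
    line = f.readline()
    while line.startswith('#'):
        line = f.readline()

    data = []
    while line != '':
        line = line.strip()
        if line != '-':
            data.append(int(line))
        line = f.readline()   # advance on every pass, even for '-' lines
    return data

# Made-up sample with two missing values.
sample = io.StringIO('Header\n# comment\n55\n-\n60\n-\n')
values = get_data(sample)
```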

fileproc/lynx.txt

Now each line holds several numbers, and each number is followed by a period.

def input_data(filename):
    # fileproc/lynx.txt

    f = open(filename)
    f.readline()
    line = f.readline()
    while line.startswith('#'):
        line = f.readline()
    data = []
    while line != "":
        # Note:  split() also gets rid of the \n at the end of the
        # line!
        nums = line.split()
        int_nums = []
        for n in nums:
            int_nums.append(int(n[:-1]))
        data.append(int_nums)
        line = f.readline()
    f.close()

    return data

print input_data('lynx.txt')
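
The inner loop is the part that deals with the periods, so it is worth
looking at on its own.  In this sketch parse_line is a hypothetical
helper, and the sample line is made up rather than copied from lynx.txt.

```python
def parse_line(line):
    '''Convert a line such as '269. 321. 585.' to a list of ints.'''
    int_nums = []
    for n in line.split():            # split() also discards the newline
        int_nums.append(int(n[:-1]))  # n[:-1] drops the final period
    return int_nums

row = parse_line('269. 321. 585.\n')
```

The slice n[:-1] assumes every value ends in a period; n.rstrip('.')
would also accept a value without one.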

====Lab 8 files - only some people got to this.

Multiline Records data in section 8.4 of the text.  

Assignment 3 involves reading from data stored as multiline records,
so this will be good practice.  

The code for the book is on my webpage:  

    http://www.cs.pitt.edu/~wiebe/courses/CS0007/Lectures/code

You can download it too - just google the book, and look for "Code".

==== (See Section 8.4, p. 170)

Not every data record will fit onto a single line. Here is a file in
simplified Protein Data Bank (PDB) format that describes the arrangements of
atoms in ammonia:

COMPND AMMONIA
ATOM 1 N 0.257 -0.363 0.000
ATOM 2 H 0.257 0.727 0.000
ATOM 3 H 0.771 -0.727 0.890
ATOM 4 H 0.771 -0.727 -0.890
END

The first line is the name of the molecule. All subsequent lines down
to the one containing END specify the ID, type, and XYZ coordinates of
one of the atoms in the molecule.  The file may contain two or more
molecules, like this:

[file multimol.pdb]

COMPND AMMONIA
ATOM 1 N 0.257 -0.363 0.000
ATOM 2 H 0.257 0.727 0.000
ATOM 3 H 0.771 -0.727 0.890
ATOM 4 H 0.771 -0.727 -0.890
END
COMPND METHANOL
ATOM 1 C -0.748 -0.015 0.024
ATOM 2 O 0.558 0.420 -0.278
ATOM 3 H -1.293 -0.202 -0.901
ATOM 4 H -1.263 0.754 0.600
ATOM 5 H -0.699 -0.934 0.609
ATOM 6 H 0.716 1.404 0.137
END
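
Before reading whole molecules, it may help to see how one ATOM line
breaks apart.  This is only a sketch:  parse_atom is a hypothetical
helper, and converting the coordinates to float is a choice made for
this example (the molecule-reading code in these notes keeps them as
strings).

```python
def parse_atom(line):
    '''Parse "ATOM num type x y z" into (type, x, y, z), with the
    coordinates converted to floats.'''
    key, num, atom_type, x, y, z = line.split()
    return (atom_type, float(x), float(y), float(z))

atom = parse_atom('ATOM 1 N 0.257 -0.363 0.000\n')
```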

The basic idea of how to read molecules is this:

while there are more molecules in the file: 
   read a molecule from the file 
   append it to the list of molecules read so far 

Let's refine this further:

reading = True 
while reading: 
    try to read a molecule from the file 
    if there is one:
       append it to the list of molecules read so far
    else:
       reading = False

Assume that the following function has been defined:

def read_molecule(r):
    '''Read a single molecule from reader r and return it,
    or return None to signal end of file.'''

TODO! Write the following function, which will call read_molecule:

When you are done, check your answer - fileproc/multimol.py

def read_all_molecules(r):
    '''Read zero or more molecules from reader r,
    returning a list of the molecules read.'''

    result = []
    reading = True
    while reading:
        molecule = read_molecule(r)
        if molecule:
            result.append(molecule)
        else:
            reading = False
    return result

[trace it]


TODO! Now, write the read_molecule(r) function:

When you are done, check your answer - fileproc/multimol_2.py

[Trace this on the board, perhaps]

def read_molecule(r):
    '''Read a single molecule from reader r and return it,
    or return None to signal end of file.'''

    # If there isn't another line, we're at the end of the file.
    line = r.readline()
    if not line:
        return None

    # Name of the molecule: "COMPND   name"
    key, name = line.split()
    
    # Other lines are either "END" or "ATOM num type x y z"
    molecule = [name]
    reading = True

    while reading:
        line = r.readline()
        if line.startswith('END'):
            reading = False
        else:
            key, num, type, x, y, z = line.split()
            molecule.append((type, x, y, z))

    return molecule


In a main program, open file fileproc/multimol.pdb,
call read_all_molecules to read them in, close the file, 
and print the resulting list.  

Trace through this program until both you and your partner understand
it.
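
For comparison, here is one possible shape for such a main program.  It
is a Python 3 sketch under assumptions:  io.StringIO stands in for
open('fileproc/multimol.pdb') so the example runs without the file, and
the two functions are repeated only to keep the example self-contained.

```python
import io

PDB_TEXT = '''COMPND AMMONIA
ATOM 1 N 0.257 -0.363 0.000
ATOM 2 H 0.257 0.727 0.000
ATOM 3 H 0.771 -0.727 0.890
ATOM 4 H 0.771 -0.727 -0.890
END
COMPND METHANOL
ATOM 1 C -0.748 -0.015 0.024
ATOM 2 O 0.558 0.420 -0.278
ATOM 3 H -1.293 -0.202 -0.901
ATOM 4 H -1.263 0.754 0.600
ATOM 5 H -0.699 -0.934 0.609
ATOM 6 H 0.716 1.404 0.137
END
'''

def read_molecule(r):
    '''Read one molecule from reader r, or return None at end of file.'''
    line = r.readline()
    if not line:
        return None
    key, name = line.split()
    molecule = [name]
    reading = True
    while reading:
        line = r.readline()
        if line.startswith('END'):
            reading = False
        else:
            key, num, atom_type, x, y, z = line.split()
            molecule.append((atom_type, x, y, z))
    return molecule

def read_all_molecules(r):
    '''Read zero or more molecules from reader r into a list.'''
    result = []
    reading = True
    while reading:
        molecule = read_molecule(r)
        if molecule:
            result.append(molecule)
        else:
            reading = False
    return result

if __name__ == '__main__':
    f = io.StringIO(PDB_TEXT)   # stand-in for the real file
    molecules = read_all_molecules(f)
    f.close()
    print(molecules)
```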

8.5 Looking Ahead:

What if there are no END markers?

see multimol_no_ends.pdb for an example datafile.

Read through, run, and trace lookahead.py, lookahead_2.py, on the

[trace this on the board, i think]

def read_all_molecules(r):
    '''Read zero or more molecules from reader r,
    returning a list of the molecules read.'''

    result = []
    line = r.readline()
    while line:
        molecule, line = read_molecule(r, line)
        result.append(molecule)
    return result

def read_molecule(r, line):
    '''Read a molecule from reader r.  The variable 'line'
    is the first line of the molecule to be read; the result is
    the molecule, and the first line after it (or the empty string
    if the end of file has been reached).'''

    fields = line.split()
    molecule = [fields[1]]

    line = r.readline()    
    while line and not line.startswith('COMPND'):
        fields = line.split()
        key, num, type, x, y, z = fields
        molecule.append((type, x, y, z))
        line = r.readline()

    return molecule, line
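
A quick run of the look-ahead version can be used to check a trace.  The
sketch below is for Python 3.  The sample text is an assumption:  it is
the ammonia data plus the first two methanol atoms from earlier in these
notes with the END lines removed, not the actual contents of
multimol_no_ends.pdb.  The two functions are repeated so that the
example is self-contained.

```python
import io

NO_ENDS_TEXT = '''COMPND AMMONIA
ATOM 1 N 0.257 -0.363 0.000
ATOM 2 H 0.257 0.727 0.000
ATOM 3 H 0.771 -0.727 0.890
ATOM 4 H 0.771 -0.727 -0.890
COMPND METHANOL
ATOM 1 C -0.748 -0.015 0.024
ATOM 2 O 0.558 0.420 -0.278
'''

def read_molecule(r, line):
    '''Read a molecule whose first line is 'line'; return the molecule
    and the first line after it ('' at end of file).'''
    fields = line.split()
    molecule = [fields[1]]
    line = r.readline()
    # Stop at end of file or when the next molecule's COMPND line shows up.
    while line and not line.startswith('COMPND'):
        key, num, atom_type, x, y, z = line.split()
        molecule.append((atom_type, x, y, z))
        line = r.readline()
    return molecule, line

def read_all_molecules(r):
    '''Read zero or more molecules from reader r into a list.'''
    result = []
    line = r.readline()
    while line:
        molecule, line = read_molecule(r, line)
        result.append(molecule)
    return result

molecules = read_all_molecules(io.StringIO(NO_ENDS_TEXT))
names = [m[0] for m in molecules]
```

The line that ends one molecule is the COMPND line of the next, which is
why read_molecule hands it back to the caller instead of dropping it.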


In read_all_molecules (show the updates to these variables):

result:  

line:  


On separate paper, for each call to read_molecule:

line:

molecule:

On separate paper, keep track of where you are in the file as lines
are read.