Put the solution to lab 7 in the "hidden Place" - H on the schedule.

MUST DO SYS - though not right away. I need to know what the command line is on Macs, if they don't access the OS directly.

File processing. Great examples in chapter 8. The code for the book is on my webpage. You can download it too - just google the book, and look for "Code". Easy to download.
http://www.cs.pitt.edu/~wiebe/courses/CS0007/Lectures/code

Data (scientific, financial, linguistic, etc.) is stored in many different ways. Often, data is more complex (as in assignment 3!).

A study might keep track of the heights, weights, and ages of its participants. Each RECORD may appear on a line by itself, with the pieces of data in each record separated by delimiters. Or, a RECORD might be spread out over more than one line.

NOTE: PAGE 156: In Wing 101, there is supposedly a way to enter file arguments to the program: right-click the editing pane; Properties; Debug tab; enter the run arguments. But this only works with files.

http://www.robjhyndman.com/TSDL/ecology1/hopedale.dat

The following files are now in Lectures - sp16*py

We can read the file directly from the web:

    import urllib

    if __name__ == '__main__':
        url = "http://www.robjhyndman.com/TSDL/ecology1/hopedale.dat"
        web_page = urllib.urlopen(url)
        for line in web_page:
            line = line.strip()
            print line
        web_page.close()

Or, we can download the file separately, and then just read the file:

    if __name__ == '__main__':
        f = open("hopedale.dat", 'r')
        for line in f:
            line = line.strip()
            print line
        f.close()

Let's make a printing function that works whether the data comes from a file or from a URL:

    import urllib

    def process_file(reader):
        ''' Read and print the contents of reader '''
        for line in reader:
            line = line.strip()
            print line

    if __name__ == '__main__':
        # read from a local file
        f = open("hopedale.dat", 'r')
        process_file(f)
        f.close()

        # read the same data from the web
        url = "http://www.robjhyndman.com/TSDL/ecology1/hopedale.dat"
        web_page = urllib.urlopen(url)
        process_file(web_page)
        web_page.close()

Suppose we want to create a list of just the data.
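Side note on process_file above: it only loops over its argument, so anything that yields lines will do, not just an open file or a web page. Here is a minimal sketch of that idea, assuming io.StringIO as a stand-in reader. The sample text is made up for illustration (it is not the real contents of hopedale.dat), collect_lines is a hypothetical helper added only so the result can be checked, and print is written with parentheses so the sketch also runs under Python 3.

```python
import io


def process_file(reader):
    '''Print each line of reader with surrounding whitespace removed.'''
    for line in reader:
        print(line.strip())


def collect_lines(reader):
    '''Hypothetical helper: return the stripped lines as a list instead of
    printing them, which makes the behavior easy to check.'''
    return [line.strip() for line in reader]


# Made-up sample text, NOT the real contents of hopedale.dat.
SAMPLE = u"Header line\n# a comment\n22\n 29 \n"

if __name__ == '__main__':
    # An in-memory reader: no file and no network connection needed.
    process_file(io.StringIO(SAMPLE))
```

This is handy for trying out file-processing functions in the shell before pointing them at a real file.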
We know there is one header line, followed by any number of comment lines (prefaced by #). We do not know how many comment lines there are!

Skip the first line, throw it away
Skip the comment lines, throw them away
Process the real data

Literal translation:

    def get_data(f):
        f.readline()
        line = f.readline()
        while line.startswith('#'):
            line = f.readline()
        data = []
        # now, process the data
        line = f.readline()
        while line != '':
            data.append(int(line.strip()))
            line = f.readline()
        return data

Nope! You would lose the first line of data: when the comment-skipping loop ends, line already holds the first data line, and the extra readline() throws it away. In hopedale.dat: [trace on board]

    def get_data(f):
        f.readline()
        line = f.readline()
        while line.startswith('#'):
            line = f.readline()
        data = []
        # now, process the data (line already holds the first data line)
        while line != '':
            data.append(int(line.strip()))
            line = f.readline()
        return data

Trace that.

Data files with missing values: fileproc/hebron.txt

We would get an error with the code above, because int('-') raises a ValueError.

    def get_data(f):
        f.readline()
        line = f.readline()
        while line.startswith('#'):
            line = f.readline()
        data = []
        # now, process the data
        while line != '':
            line = line.strip()
            # skip missing values, which are marked with '-'
            if line != '-':
                data.append(int(line))
            # read the next line; without this the loop never ends
            line = f.readline()
        return data

fileproc/lynx.txt

Now, we have periods: each value is followed by a period, and there are several values on a line.

    def input_data(filename):
        # fileproc/lynx.txt
        f = open(filename)
        f.readline()
        line = f.readline()
        while line.startswith('#'):
            line = f.readline()
        data = []
        while line != "":
            # Note: split() also gets rid of the \n at the end of the line!
            nums = line.split()
            int_nums = []
            for n in nums:
                # n[:-1] drops the trailing period
                int_nums.append(int(n[:-1]))
            data.append(int_nums)
            line = f.readline()
        f.close()
        return data

    print input_data('lynx.txt')

==== Lab 8 files - only some people got to this.

Multiline records: data in section 8.4 of the text.

Assignment 3 involves reading from data stored as multiline records, so this will be good practice.

The code for the book is on my webpage:
http://www.cs.pitt.edu/~wiebe/courses/CS0007/Lectures/code
You can download it too - just google the book, and look for "Code".
====

(See Section 8.4, p.
170) Not every data record will fit onto a single line. Here is a file in simplified Protein Data Bank (PDB) format that describes the arrangement of atoms in ammonia:

    COMPND      AMMONIA
    ATOM      1  N   0.257  -0.363   0.000
    ATOM      2  H   0.257   0.727   0.000
    ATOM      3  H   0.771  -0.727   0.890
    ATOM      4  H   0.771  -0.727  -0.890
    END

The first line is the name of the molecule. Each subsequent line, down to the one containing END, specifies the ID, type, and XYZ coordinates of one of the atoms in the molecule.

The file may contain two or more molecules, like this: [file multimol.pdb]

    COMPND      AMMONIA
    ATOM      1  N   0.257  -0.363   0.000
    ATOM      2  H   0.257   0.727   0.000
    ATOM      3  H   0.771  -0.727   0.890
    ATOM      4  H   0.771  -0.727  -0.890
    END
    COMPND      METHANOL
    ATOM      1  C  -0.748  -0.015   0.024
    ATOM      2  O   0.558   0.420  -0.278
    ATOM      3  H  -1.293  -0.202  -0.901
    ATOM      4  H  -1.263   0.754   0.600
    ATOM      5  H  -0.699  -0.934   0.609
    ATOM      6  H   0.716   1.404   0.137
    END

The basic idea of how to read molecules is this:

    while there are more molecules in the file:
        read a molecule from the file
        append it to the list of molecules read so far

Let's refine this further:

    reading = True
    while reading:
        try to read a molecule from the file
        if there is one:
            append it to the list of molecules read so far
        else:
            reading = False

Assume that the following function has been defined:

    def read_molecule(r):
        '''Read a single molecule from reader r and return it, or return
        None to signal end of file.'''

TODO! Write the following function, which will call read_molecule. When you are done, check your answer - fileproc/multimol.py

    def read_all_molecules(r):
        '''Read zero or more molecules from reader r, returning a list of
        the molecules read.'''

        result = []
        reading = True
        while reading:
            molecule = read_molecule(r)
            if molecule:
                result.append(molecule)
            else:
                reading = False
        return result

[trace it]

TODO!
Now, write the read_molecule(r) function. When you are done, check your answer - fileproc/multimol_2.py [Trace this on the board, perhaps]

    def read_molecule(r):
        '''Read a single molecule from reader r and return it, or return
        None to signal end of file.'''

        # If there isn't another line, we're at the end of the file.
        line = r.readline()
        if not line:
            return None

        # Name of the molecule: "COMPND name"
        key, name = line.split()

        # Other lines are either "END" or "ATOM num type x y z"
        molecule = [name]
        reading = True
        while reading:
            line = r.readline()
            if line.startswith('END'):
                reading = False
            else:
                key, num, type, x, y, z = line.split()
                molecule.append((type, x, y, z))

        return molecule

In a main program, open the file fileproc/multimol.pdb, call read_all_molecules to read the molecules in, close the file, and print the resulting list. Trace through this program until both you and your partner understand it.

8.5 Looking Ahead: What if there are no END markers? See multimol_no_ends.pdb for an example data file. Read through, run, and trace lookahead.py, lookahead_2.py, on the [trace this on the board, i think]

    def read_all_molecules(r):
        '''Read zero or more molecules from reader r, returning a list of
        the molecules read.'''

        result = []
        line = r.readline()
        while line:
            molecule, line = read_molecule(r, line)
            result.append(molecule)
        return result

    def read_molecule(r, line):
        '''Read a molecule from reader r.
        The variable 'line' is the first line of the molecule to be read;
        the result is the molecule, and the first line after it (or the
        empty string if the end of file has been reached).'''

        fields = line.split()
        molecule = [fields[1]]

        line = r.readline()
        while line and not line.startswith('COMPND'):
            fields = line.split()
            key, num, type, x, y, z = fields
            molecule.append((type, x, y, z))
            line = r.readline()

        return molecule, line

In read_all_molecules (show the updates to these variables):

    result:
    line:

On separate paper, for each call to read_molecule:

    line:
    molecule:

On separate paper, keep track of where you are in the file as lines are read.
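To check a hand trace, the look-ahead version can be run with print statements added. The following is only a sketch under some assumptions: the contents of multimol_no_ends.pdb are not shown in these notes, so NO_ENDS is an assumed stand-in (the two molecules from multimol.pdb with the END lines removed); io.StringIO replaces the open file; the print calls are added for tracing; and the variable type is renamed kind so it does not hide the built-in. The print calls use parentheses so the sketch also runs under Python 3.

```python
import io

# Assumed stand-in for multimol_no_ends.pdb: the two molecules from
# multimol.pdb with the END lines removed.
NO_ENDS = u"""COMPND AMMONIA
ATOM 1 N 0.257 -0.363 0.000
ATOM 2 H 0.257 0.727 0.000
ATOM 3 H 0.771 -0.727 0.890
ATOM 4 H 0.771 -0.727 -0.890
COMPND METHANOL
ATOM 1 C -0.748 -0.015 0.024
ATOM 2 O 0.558 0.420 -0.278
ATOM 3 H -1.293 -0.202 -0.901
ATOM 4 H -1.263 0.754 0.600
ATOM 5 H -0.699 -0.934 0.609
ATOM 6 H 0.716 1.404 0.137
"""


def read_molecule(r, line):
    '''Read one molecule whose first line is line; return the molecule and
    the first line after it ('' at end of file).'''
    molecule = [line.split()[1]]
    line = r.readline()
    # Stop at end of file or at the first line of the next molecule.
    while line and not line.startswith('COMPND'):
        key, num, kind, x, y, z = line.split()
        molecule.append((kind, x, y, z))
        line = r.readline()
    return molecule, line


def read_all_molecules(r):
    '''Read zero or more molecules from reader r, printing a trace.'''
    result = []
    line = r.readline()
    while line:
        print('calling read_molecule with line = %r' % line)
        molecule, line = read_molecule(r, line)
        print('  got %s with %d atoms; look-ahead line = %r'
              % (molecule[0], len(molecule) - 1, line))
        result.append(molecule)
    return result


molecules = read_all_molecules(io.StringIO(NO_ENDS))
```

The thing to watch in the trace is that the line which ends one molecule is the same line that starts the next call to read_molecule, so nothing is lost between calls.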