YUP Tutorial Module 0: Preparations

4. Exercises

The following programming exercises (6 hours) should be completed if you plan to study Module 3 and beyond.

These exercises are to give you some experience in writing Python programs. They also direct you to consider an object-oriented approach to programming. These exercises do not make use of YUP.

Sections

1. Introduction
2. Python
3. Objects
4. Exercises
We will develop a browser for Protein Data Bank (PDB) files. The browser should read in a PDB file and then the user can query the browser on such things as the number and names of chains, then for a given chain, the number of and names of residues and so on to the atoms. The user can also obtain the coordinates of atoms from a residue of a chain.

Download several PDB files from the Protein Data Bank. Make sure one of them is 1BNA which we will use for illustration.

For this exercise, we will process only the ATOM and HETATM records and only the following fields (extracted from the PDB format documentation and rearranged):

 1 -  6        Record name     "ATOM  "      or "HETATM"
22             Character       chainID       Chain identifier.
18 - 20        Residue name    resName       Residue name.
23 - 26        Integer         resSeq        Residue sequence number.
13 - 16        Atom            name          Atom name.
 7 - 11        Integer         serial        Atom serial number.
31 - 38        Real(8.3)       x             Orthogonal coordinates for X in Angstroms.
39 - 46        Real(8.3)       y             Orthogonal coordinates for Y in Angstroms.
47 - 54        Real(8.3)       z             Orthogonal coordinates for Z in Angstroms.

In the following, if you place your pointer over this symbol: Hint for a moment, a hint will appear. You can download an archive containing all the worked examples from here. You can look at the answers any time you like but try to write your own code first. We will also make references to the Python documentation. Once again, the entire Python Documentation set is conveniently accessible from one location (Mac Help). If you are working on another platform, you can download the PDF documentations from the Python.org site and open the appropriate document.

A: Read a PDB file

Let us start by writing a simple function that will open a PDB file and then extract only the ATOM and HETATM lines. The function will return a list containing only those lines. Following is a sketch of the function in pseudo-code:

given a string containing the name of the PDB file:
open the file and save a reference to the file object
start a new (empty) list to hold the PDB records
scan each line of the PDB file:
if the first six columns of a line is either "ATOM  " or "HETATM":
attach that line to the list of PDB records
close the file
return the list of ATOM and HETATM lines

To open a file, use the Python built-in function file (also known as open) documented in section 2.1 of the Library Reference. This function returns a file object that you should assign to a variable. Start an empty list to hold the records that we want to capture Hint, then scan the PDB file line by line. You have several options here. You can use the readline or readlines methods: (Library Reference 2.2.8). Or you can use the iterator protocol Hint, which reads more naturally. Use whichever method you like.

Try out the following exercise first, to make sure you understand lists:

List = []
List.append( 1 )
List.append( 'a' )
List.append( ( 'X', 5 ) )
List

Let us now scan one line at a time. It is quite easy to pick up the lines that we want. Those are the lines where columns one to six are either "ATOM  " (two trailing spaces) or "HETATM". You can do an equality test on the text in the first six columns of each line. You can apply a slicing operation (Library Reference 2.2.6) on each line but watch out for proper indexing. Indices start from zero and a slice is indicated by at least two numbers, the start and one over the closing index of the slice: Hint. We are almost done. Each line that matches the two keys is added to the list that we started at the top of the loop. It is a good idea to close the file explicitly. Finally, return the list.

Try out the following to understand slicing:

line = "123456789"
i = 4
line[i]
line[0:i]
line[:i]
line[i:]
line[:i] + line[i:]
line[i:] + line[:i]

The file Y2T0C3a.py contains the definition of a function readpdb that takes a string argument containing the name of a PDB file. Start the Python interpreter and read the 1BNA file:

from Y2T0C3a import readpdb
PDB = readpdb( 'pdb1bna.ent' )

The variable PDB should now reference a list containing only the ATOM and HETATM lines of the PDB file pdb1bna.ent. Let us look at a few things about the file:

len( PDB )
for p in PDB: print p,

The first command prints the length of the list. The second prints the list one line at a time. Note the trailing comma. Try leaving out the comma.

B: Separate the Fields

We will now break up the contents of the ATOM and HETATM lines. The various fields are listed in the table at the start of this page. What we want to do now is to collect the information from each line into logical groupings, convert the sub-strings to the proper data type and clean up the rest of the data. The function will then return a list of these data (again, in pseudo-code):

CHAIN = chain_name
RESIDUE = ( residue_sequence_number, residue_name )

ATOM = ( atom_serial_number, atom_name )
COORDS = ( x, y, z )
PDB_list = list of ( CHAIN, RESIDUE, ATOM, COORDS )

Each entry in the list will be a tuple containing four items. The first item is the chain identifier in column 22. The second item will be a tuple containing two items: the residue sequence number, and the residue name. To get a whole number, you would have to apply the conversion function int Hint The residue name may contain leading and trailing blanks. Apply the strip method to get rid of those. We will see later why we put the sequence number ahead of the name. The third item is a tuple of the atom serial number (converted to integer) and the atom name (stripped). The last item is a tuple of the three coordinates. Convert them to floating point numbers using float.

Try out the following to understand conversion and the strip method:

a = '3.14'
b = float( a )
a
b
"  first last  ".strip()

The file Y2T0C3b.py contains the new definition of readpdb. Let us try it out:

from Y2T0C3b import readpdb
PDB = readpdb( 'pdb1bna.ent' )

And again,

len( PDB )
for p in PDB: print p

Note we no longer have to put a trailing comma in the second command.

C: Split the List

We have been storing the data in a flat list so far, pretty much as they had been organized in the PDB file. This is not very convenient. We have to scan the list if we want to find the data for a particular atom. Almost lost in the punch card format of the PDB file is a hierarchy: chains contain residues and residues contain atoms. Let us now put the data into a more structured form, by storing the data in dictionaries. Let us start with the topmost layer of the hierarchy: split the single list we have been building into one list per chain. Each list is a dictionary entry keyed by the chain name (pseudo-code):

PDB_dict[ CHAIN ] = list of ( RESIDUE, ATOM, COORDS )

To begin, replace the empty list before the start of the loop with an empty dictionary Hint. You can use the has_key method to test if an entry exists for a chain. If not, as is the case at the first line for the chain, make a new entry in the dictionary with the chain name as key and a new and empty list as the value Hint. Extract the remaining data as before and then append the (now smaller) tuple to the list for the chain Hint.

The file Y2T0C3c.py contains the new definition of readpdb. To try it out:

from Y2T0C3c import readpdb
PDB = readpdb( 'pdb1bna.ent' )
len( PDB )

as before, except now the last line yields a much smaller value than before. It is the number of chain lists. 1BNA contains three chains. Let us try something different:

PDB.keys()

One chain has a blank name, or rather, there are atoms that are not associated with any chain. Let us start working with the dictionary and pick out the entry for the chain named "A".

chainA = PDB['A']
len( chainA )
for a in chainA: print a

Thus, chain "A" has 243 atoms. To find the total length:

len( chainA ) + len( PDB[' '] ) + len( PDB['B'] )

which should be 566 as before.

D: Structure the Data

The next step is to replace the chain lists with dictionaries. In fact, we will go all the way:

PDB_dict[ CHAIN ][ RESIDUE ][ ATOM ] = COORDS

This is a dictionary of dictionaries of dictionaries. First, to get you familiar with nested dictionaries, try out the following:

CHAIN = "A"
RESIDUE = ( 1, "C" )
ATOM = ( 14, "C2" )
COORDS = ( 18.224, 28.454, 25.015 )
PDB = { CHAIN : { RESIDUE : { ATOM : COORDS } } }
PDB[CHAIN][RESIDUE][ATOM] == COORDS

Construct the second to last line gradually (and comprehend each step as you type) to get a sense of the hierarchy. First type "PDB = {}". Then, move the insertion point between the braces and type "CHAIN:{}". Move again inside the empty set of braces and type: "RESIDUE:{}" and so on. The last line should evaluate to True. Notice that we can use a tuple as a key. Experiment with these dictionaries until you are comfortable with them. Yes, trying to work out a hierarchy can make your head reel.

The program needs only a few revisions. In the previous program, we had used the has_key method to determine if a dictionary entry exists and if not we create a new list as the entry for the new key. In this program we will create an empty dictionary instead Hint. We then check the residue tuple key Hint. Then, we can assign the coordinates to the proper dictionary Hint.

The file Y2T0C3d.py contains the new definition of readpdb. To try it out:

from Y2T0C3d import readpdb
PDB = readpdb( 'pdb1bna.ent' )
PDB.keys()

It is now a lot harder to determine if we have collected all the data. The following program adds together the lengths of all the residue dictionaries. Take a moment to understand the program.

suml = 0
for chn in PDB.keys():
for rsd in PDB[chn].keys():
suml += len( PDB[chn][rsd] )
suml

You should get a length of 566. Let us now examine the dictionary.

chainB = PDB['B']
Bresidues = chainB.keys()
Bresidues

The list of residue keys for chain "B" is not ordered by the residue sequence number. We can easily correct that.

Bresidues.sort()
Bresidues

The list of residues is now ordered by the sequence number. This is why we put the sequence number ahead of the residue name. If we had used the reverse order, we can still sort by sequence number by providing a comparison function to the sort method. Note also we could have used only the residue sequence number as a key since it is unique. What we have done is to use the key to store the residue name.

To continue, pick out a specific residue:

residueG10 = chainB[(10,"G")]
G10atoms = residueG10.keys()
G10atoms.sort()
G10atoms

The atoms are sorted by their serial number.

To get the coordinates of the atom with the serial number 448:

atom448 = residueG10[(448,'C6')]
atom448

We can also retrieve the coordinates directly from the PDB dictionary:

PDB[ "B" ][ (10,"G") ][ (448,"C6") ]

The answers from the two queries may not be exactly the same (because of rounding error) but should be close.

E: Create a Browser

We have now reached a point where we are able to access chains, residues and atoms in a more natural manner. Still, we have to type small programs to look up such things as the total number of atoms and the names of residues and atoms. There has got to be a better way.

One route to a better browser is to rewrite the function we have been developing so far as a class. We can then add the small programs we have been writing to query the PDB file as methods of the class.

Let us write a class named Brookhaven Hint. We will adapt the readpdb function as the initialization or constructor function for the class. You will have to indent the function definition one more level to the right. Do not try to do that line by line. Use your text editor's indent command (select the whole definition then Text > Shift Right in TextWrangler). The function name has to be changed to __init__ and the first argument is always taken to refer to an instantiation of the class Hint. It is usually named self but any name can be used. The dictionary has to be attached to each instantiation. We can attach it to self after the dictionary has been created and populated or at all stages of the process Hint. Otherwise, nothing else in the original function needs to be changed.

Let us add a few conveniences. When the initialization routine has finished scanning the PDB file, collect the chain names, i.e. the keys in the PDB dictionary, sort them, and link the list to a data attribute, say Chains Hint. Then determine the number of chains and link that number to a data attribute named, say NumberOfChains Hint. These are all commands we have typed in the Python interpreter in the previous section.

Now, add a method, say showchains that will print the name of the chains, one per line. Note that a method definition looks just like a function definition with the first argument referring to the instantiation of the class Hint. The initialization function has already compiled a list of chain names Hint so this is an easy function to write.

The file Y2T0C3e.py defines a class Brookhaven that has one method showchains and data attributes NumberOfChains, Chains and DataBase. To try it out:

from Y2T0C3e import Brookhaven
PDB = Brookhaven( 'pdb1bna.ent' )
PDB.NumberOfChains
PDB.showchains()

The dictionary is still accessible and can be manipulated as before:

PDB.DataBase[ "B" ][ (10,"G") ][ (448,"C6") ]

which is not always convenient.

F: Going all the Way

We can now define any number of methods to make the Brookhaven class as useful and easy to use as possible.

The file Y2T0C3f.py adds new methods. Given a chain name, getchain returns the dictionary for the chain. The method getresiduekeys returns a sorted list of residue names for a given chain name. The method showresidues prints the residue names, one per line. The other methods are getresidue (similar to getchain), getatomkeys, showatoms and getatom. Finally, totalatoms returns the total number of atoms in the PDB file. As before:

from Y2T0C3f import Brookhaven
PDB = Brookhaven( 'pdb1bna.ent' )

Then:

PDB.NumberOfChains
PDB.totalatoms()
PDB.showchains()
PDB.showresidues( 'B' )
PDB.showatoms( 'B', (7,'T') )
PDB.getatom( atom = (385,'N1'), residue = (7,'T'), chain = 'B' )

The last command shows how you can use keyword arguments to specify them in any order you prefer.

G: One Last Thing

Let us say that a full implementation of the Brookhaven class is installed in a public location. We find it useful but we need a special feature that nobody else needs. Can we add our method to the Brookhaven class?

The answer is yes of course. We can always derive a class from the Brookhaven class. We then inherit all the attributes and methods from the parent class. To those, we can add our own attributes and methods. We can even override the methods and attributes of the Brookhaven class.

Derive a class from Brookhaven, say DeKalb Hint and add a method getPatoms, that given a chain name, returns the coordinates of only the phosphorus atoms.

The file Y2T0C3f.py defines a class DeKalb that has one additional method getPatoms. To use it:

from Y2T0C3g import DeKalb
PDB = DeKalb( 'pdb1bna.ent' )

And we still have all the methods and attributes of the Brookhaven class:

PDB.NumberOfChains
PDB.totalatoms()
PDB.showchains()

And of course our own method:

PDB.getPatoms( "B" )

For a nicer listing:

for a in PDB.getPatoms( "B" ): print "%g\t%g\t%g" % a

Note that getPatoms calls the getresiduekeys and getatomkeys methods. The latter methods return lists that are sorted by serial or sequence number. This ensure that the list that we get in return is in proper order.

Discussion

Notes about this code development:
  • The development of the class is very close to the formulation of the problem. This is perhaps more obvious if you use an object-oriented approach from the very beginning.
  • We do not have to change existing code to add new functionality to the class object. This is true both when we added new methods in Section F and when we derive a new class in Section G.
What is wrong with this approach?
  • The dictionary is exposed to the user. (This is easily corrected.)
  • The dictionary hierarchy is not natural. A more natural hierarchy is to treat residues and atoms as objects, i.e. classes. We can then store names within the object and use only the serial or sequence numbers to reference the objects.
Other than doing a proper implementation, what more can we do with this code?
  • A graphical user interface is a natural fit with a class implementation.
Modules

0. Preparations
1. Model Construction and Simulations

2. Programming with YUP Part 1
3. Programming with YUP Part 2
4. Force field Assembly
5. Developing Potential Energy Terms

Programming with YUP is not any different. You will be using the same syntax. YUP introduces new data types.
Resources

Yammp Under Python
The Harvey Laboratory

Python Home

Python Tutorial

(new browsers)
Prev | Top | Next