| YUP Tutorial Module 0: Preparations | ||||||||||||||||||||||||||||||||
4. ExercisesThe following programming exercises (6 hours) should be completed if you plan to study Module 3 and beyond.These exercises are to give you some experience in writing Python programs. They also direct you to consider an object-oriented approach to programming. These exercises do not make use of YUP. |
||||||||||||||||||||||||||||||||
| We will develop a browser for Protein
Data Bank (PDB) files. The browser should read in a PDB file and then the user can
query the browser on such things as the number and names of chains,
then for a given chain, the number of and names of residues and so on
to the atoms. The user can also obtain the coordinates of atoms from
a residue of a chain. Download several PDB files from the Protein Data Bank. Make sure one of them is 1BNA which we will use for illustration. For this exercise, we will process only the ATOM and HETATM records and only the following fields (extracted from the PDB format documentation and rearranged):
In the following, if you place your pointer over this symbol:
for a moment, a
hint will appear. You can download an archive containing all the worked
examples from here. You can look at the
answers any time you like but try to
write your own code first. We will
also make references to the Python documentation. Once again, the
entire Python Documentation set is conveniently accessible from one
location
(Mac Help). If you are working on another platform, you can
download the PDF
documentations from
the Python.org site and open the appropriate document.A: Read a PDB fileLet us start by writing a simple function that will open a PDB file and then extract only the ATOM and HETATM lines. The function will return a list containing only those lines. Following is a sketch of the function in pseudo-code:
To open a file, use the Python built-in function file (also known as open) documented in section 2.1 of the Library Reference. This function returns a file object that you should assign to a variable. Start an empty list to hold the records that we want to capture , then scan the PDB file line by
line. You have several options here. You can use the readline or readlines
methods: (Library Reference
2.2.8). Or you can use the iterator protocol , which reads more
naturally. Use whichever method you like.Try out the following exercise first, to make sure you understand lists:
Let us now scan one line at a time. It is quite easy to pick up the lines that we want. Those are the lines where columns one to six are either "ATOM " (two trailing spaces) or "HETATM". You can do an equality test on the text in the first six columns of each line. You can apply a slicing operation (Library Reference 2.2.6) on each line but watch out for proper indexing. Indices start from zero and a slice is indicated by at least two numbers, the start and one over the closing index of the slice: . We are almost
done. Each line that matches the two keys is added to the list that we
started at the top of the loop. It is a good idea to close the file
explicitly. Finally, return the list.Try out the following to understand slicing:
The file Y2T0C3a.py contains the definition of a function readpdb that takes a string argument containing the name of a PDB file. Start the Python interpreter and read the 1BNA file:
The variable PDB should now
reference a list containing only the ATOM
and HETATM lines of the PDB file pdb1bna.ent. Let us look at a few
things about the file:
The first command prints the length of the list. The second prints the list one line at a time. Note the trailing comma. Try leaving out the comma. B: Separate the FieldsWe will now break up the contents of the ATOM and HETATM lines. The various fields are listed in the table at the start of this page. What we want to do now is to collect the information from each line into logical groupings, convert the sub-strings to the proper data type and clean up the rest of the data. The function will then return a list of these data (again, in pseudo-code):
Each entry in the list will be a tuple containing four items. The first item is the chain identifier in column 22. The second item will be a tuple containing two items: the residue sequence number, and the residue name. To get a whole number, you would have to apply the conversion function int The residue name may
contain leading and trailing blanks. Apply the strip method to
get rid of those. We will see later why we put the sequence number
ahead of the name. The third item is a tuple of the atom serial number
(converted
to integer) and the atom name (stripped). The last item is a tuple of
the three
coordinates. Convert them to floating point numbers using float.Try out the following to understand conversion and the strip method:
The file Y2T0C3b.py contains the new definition of readpdb. Let us try it out:
And again,
Note we no longer have to put a trailing comma in the second command. C: Split the ListWe have been storing the data in a flat list so far, pretty much as they had been organized in the PDB file. This is not very convenient. We have to scan the list if we want to find the data for a particular atom. Almost lost in the punch card format of the PDB file is a hierarchy: chains contain residues and residues contain atoms. Let us now put the data into a more structured form, by storing the data in dictionaries. Let us start with the topmost layer of the hierarchy: split the single list we have been building into one list per chain. Each list is a dictionary entry keyed by the chain name (pseudo-code):
To begin, replace the empty list before the start of the loop with an empty dictionary . You can use the has_key method to test if an entry
exists for a chain. If not, as is the case at the first line for the
chain, make a new entry in the dictionary with the chain name as key
and a new and empty list as the value . Extract the remaining data as before and then append
the (now smaller) tuple to the list for the chain .The file Y2T0C3c.py contains the new definition of readpdb. To try it out:
as before, except now the last line yields a much smaller value than before. It is the number of chain lists. 1BNA contains three chains. Let us try something different:
One chain has a blank name, or rather, there are atoms that are not associated with any chain. Let us start working with the dictionary and pick out the entry for the chain named "A".
Thus, chain "A" has 243 atoms. To find the total length:
which should be 566 as before. D: Structure the DataThe next step is to replace the chain lists with dictionaries. In fact, we will go all the way:
This is a dictionary of dictionaries of dictionaries. First, to get you familiar with nested dictionaries, try out the following:
Construct the second to last line gradually (and comprehend each step as you type) to get a sense of the hierarchy. First type "PDB = {}". Then, move the insertion point between the braces and type "CHAIN:{}". Move again inside the empty set of braces and type: "RESIDUE:{}" and so on. The last line should evaluate to True. Notice that we can use a tuple as a key. Experiment with these dictionaries until you are comfortable with them. Yes, trying to work out a hierarchy can make your head reel. The program needs only a few revisions. In the previous program, we had used the has_key method to determine if a dictionary entry exists and if not we create a new list as the entry for the new key. In this program we will create an empty dictionary instead . We then check the residue tuple key . Then, we can assign the coordinates to the proper
dictionary .The file Y2T0C3d.py contains the new definition of readpdb. To try it out:
It is now a lot harder to determine if we have collected all the data. The following program adds together the lengths of all the residue dictionaries. Take a moment to understand the program.
You should get a length of 566. Let us now examine the dictionary.
The list of residue keys for chain "B" is not ordered by the residue sequence number. We can easily correct that.
The list of residues is now ordered by the sequence number. This is why we put the sequence number ahead of the residue name. If we had used the reverse order, we can still sort by sequence number by providing a comparison function to the sort method. Note also we could have used only the residue sequence number as a key since it is unique. What we have done is to use the key to store the residue name. To continue, pick out a specific residue:
The atoms are sorted by their serial number. To get the coordinates of the atom with the serial number 448:
We can also retrieve the coordinates directly from the PDB dictionary:
The answers from the two queries may not be exactly the same (because of rounding error) but should be close. E: Create a BrowserWe have now reached a point where we are able to access chains, residues and atoms in a more natural manner. Still, we have to type small programs to look up such things as the total number of atoms and the names of residues and atoms. There has got to be a better way.One route to a better browser is to rewrite the function we have been developing so far as a class. We can then add the small programs we have been writing to query the PDB file as methods of the class. Let us write a class named Brookhaven . We will adapt the readpdb
function as the
initialization or constructor function for the class. You will have to
indent the function definition one more level to the right. Do not try
to do that line by line. Use your text editor's indent command (select
the whole definition then Text > Shift Right in TextWrangler). The function name
has to be changed to __init__
and the first argument is always taken to
refer to an instantiation of the class . It is usually named self
but any name can be used. The dictionary has to be attached to each
instantiation. We can attach it to self
after the dictionary has been
created and populated or at all stages of the process . Otherwise, nothing else in the original function
needs to be changed.Let us add a few conveniences. When the initialization routine has finished scanning the PDB file, collect the chain names, i.e. the keys in the PDB dictionary, sort them, and link the list to a data attribute, say Chains . Then determine the number of chains and link that
number to a data attribute named, say NumberOfChains .
These are all commands we have typed in the Python interpreter in the
previous section.Now, add a method, say showchains that will print the name of the chains, one per line. Note that a method definition looks just like a function definition with the first argument referring to the instantiation of the class . The initialization
function has already compiled a list of chain names so this is an easy function to write.The file Y2T0C3e.py defines a class Brookhaven that has one method showchains and data attributes NumberOfChains, Chains and DataBase. To try it out:
The dictionary is still accessible and can be manipulated as before:
which is not always convenient. F: Going all the WayWe can now define any number of methods to make the Brookhaven class as useful and easy to use as possible.The file Y2T0C3f.py adds new methods. Given a chain name, getchain returns the dictionary for the chain. The method getresiduekeys returns a sorted list of residue names for a given chain name. The method showresidues prints the residue names, one per line. The other methods are getresidue (similar to getchain), getatomkeys, showatoms and getatom. Finally, totalatoms returns the total number of atoms in the PDB file. As before:
Then:
The last command shows how you can use keyword arguments to specify them in any order you prefer. G: One Last ThingLet us say that a full implementation of the Brookhaven class is installed in a public location. We find it useful but we need a special feature that nobody else needs. Can we add our method to the Brookhaven class?The answer is yes of course. We can always derive a class from the Brookhaven class. We then inherit all the attributes and methods from the parent class. To those, we can add our own attributes and methods. We can even override the methods and attributes of the Brookhaven class. Derive a class from Brookhaven, say DeKalb and add a method getPatoms,
that given a chain name, returns the coordinates of only the phosphorus
atoms.The file Y2T0C3f.py defines a class DeKalb that has one additional method getPatoms. To use it:
And we still have all the methods and attributes of the Brookhaven class:
And of course our own method:
For a nicer listing:
Note that getPatoms calls the getresiduekeys and getatomkeys methods. The latter methods return lists that are sorted by serial or sequence number. This ensure that the list that we get in return is in proper order. DiscussionNotes about this code development:
|
||||||||||||||||||||||||||||||||
|
Modules
0. Preparations
1. Model Construction and Simulations 2. Programming with YUP Part 1 3. Programming with YUP Part 2 4. Force field Assembly 5. Developing Potential Energy Terms |
||||||||||||||||||||||||||||||||
Programming with YUP is not any different. You will be using the same syntax. YUP introduces new data types. |
Resources Yammp Under Python The Harvey Laboratory Python Home Python Tutorial (new browsers) |
|||||||||||||||||||||||||||||||
| Prev | Top | Next | ||||||||||||||||||||||||||||||||