Pymol and very large PDB files. The Zika Cryo-EM structure as a case study

One of my major research interests is the Flavivirus group of viruses. In our work at the University of Queensland we’ve been involved in developing inhibitors of viral proteins associated with Dengue and West Nile viruses, particularly inhibitors of the NS2B/NS3 protease and the surface E-protein [1,2,3]. A key way to target these proteins is to examine their X-ray crystal structures. A newer technique for determining protein structures is cryo-electron microscopy (Cryo-EM).

I found myself on Thursday looking at the Cryo-EM structure of the entire Zika virus, published recently by the Kuhn group [4]. Zika is in that same class of Flaviviruses and is therefore of great interest to me, and not only because of the recent flurry of media commentary and public health concern surrounding current outbreaks. The Kuhn group had previously published Cryo-EM structures of the Dengue and West Nile viruses as well.

I wanted to take a closer look at the structure, so I opened it in my favourite visualisation program, Pymol. After a bit of fiddling, which I’ll go into below, I got to this lovely picture of all the surface proteins of a Zika virus particle, with all the chains rendered as cartoon helices and sheets etc.



Zika virus surface proteins via Cryo-EM


I was quite pleased with the result and posted it to Twitter. I next made a Quicktime movie direct from Pymol of the whole picture spinning 360° over 4 seconds, and posted it to YouTube. The result is a bit pixelated because I just used the default YouTube compression settings, which reduced the 28MB Quicktime file to an 808KB YouTube video.

A few kind people commented on the structure, which has a lovely symmetry to it, but my interest was piqued by one Twitter user, Jonas Boström (@DrBostrom), who asked whether I’d be happy to share the viral assembly as a single PDB file so he could look into making a VR version. Sure, I said, and went back to Pymol and tried to save it myself as both a PDB file and a VRML2 file, the latter being one of Pymol’s export features. Some time later I had a 54MB PDB file and a, wait for it, 2.93GB .wrl file. Not really the size of files you want to pop into a tweet! Even when I gzipped the PDB file it was 12MB. But there was a problem. Before I go into the problems and the techniques required in Pymol to get the assembly just so, I need to fill you in on a little of the background to these surface proteins.

The surface proteins of the viral particles are a mixture of the E and M proteins arranged in a regular pattern: 360 proteins form an icosahedral shell. If you go to the PDB page for this structure and download the PDB file 5ire.pdb, what you are getting is the asymmetric unit, which consists of just 6 chains, A-F: three copies of the E-protein (A-C) and three of the M-protein (D-F). There are 60 of these subunits in the biological assembly, giving a total of 360 viral proteins at the surface: 180 E and 180 M.
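
The chain bookkeeping is easy to check in a few lines of Python. This is only a sketch using the numbers quoted above; the function and variable names are my own.

```python
# Chain bookkeeping for the 5ire biological assembly, using the numbers in the text.
E_CHAINS = ["A", "B", "C"]  # E-protein chains in the asymmetric unit
M_CHAINS = ["D", "E", "F"]  # M-protein chains in the asymmetric unit
COPIES = 60                 # copies of the asymmetric unit in the assembly


def assembly_counts(e_chains, m_chains, copies):
    """Return (total, n_E, n_M) protein chains in the full assembly."""
    n_e = len(e_chains) * copies
    n_m = len(m_chains) * copies
    return n_e + n_m, n_e, n_m


total, n_e, n_m = assembly_counts(E_CHAINS, M_CHAINS, COPIES)
print(total, n_e, n_m)  # 360 180 180
```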

So let’s just take a look at this subunit for a bit. This smaller, more manageable chunk of the overall virus surface is the best one to use if all you are interested in is the way these proteins interact with each other, or if you’re interested in the important Asn154 glycosylation site (boxed in the figure, one glycan for each of the E-protein chains).


The 5ire monomer contains 3x E- and 3x M-proteins


But if you want to visualise all of the viral surface proteins you need to download the “biological assembly” file, which when unzipped runs to 54MB.


RCSB download dialog gives both monomer and biological assembly options



5ire assembly as it opens in Pymol initially as 60 states


If you decompress and then open this file in Pymol, you will see at first just the one subunit, as above, but note that it has been loaded as 60 states (boxed in the figure). You can cycle through these with the play button but we want to visualise them all at once. To do this we use the split_states command at the Pymol command line. This splits the multi-state file into 60 new objects. In this case I have given each new object the Zika prefix. You can now delete the original multistate object for neatness. You will probably need to click the zoom button to get them all into the window (or type zoom at the command line).

split_states 5ire_assembly, prefix=zika

delete 5ire_assembly
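
If you want to see what that split amounts to outside Pymol, here is a rough plain-Python sketch. It is not Pymol's own code: it assumes the assembly file stores each copy of the subunit as a MODEL/ENDMDL block (which is what loading as 60 states suggests), and the zika0001-style names are my assumption about how the new objects get labelled. The function names are hypothetical.

```python
def split_models(pdb_lines):
    """Group coordinate lines by MODEL/ENDMDL block.

    Returns one list of ATOM/HETATM lines per model found in the input.
    """
    models = []
    current = None
    for line in pdb_lines:
        record = line[:6].strip()
        if record == "MODEL":
            current = []
        elif record == "ENDMDL":
            if current is not None:
                models.append(current)
            current = None
        elif record in ("ATOM", "HETATM") and current is not None:
            current.append(line)
    return models


def name_states(models, prefix="zika"):
    """Label each model prefix0001, prefix0002, ... (assumed naming pattern)."""
    return {f"{prefix}{i:04d}": atoms for i, atoms in enumerate(models, start=1)}


# Tiny made-up two-model example, just to show the shape of the result.
sample = [
    "MODEL        1",
    "ATOM      1  N   ILE A   1      10.000  20.000  30.000  1.00  0.00           N",
    "ENDMDL",
    "MODEL        2",
    "ATOM      1  N   ILE A   1      40.000  50.000  60.000  1.00  0.00           N",
    "ENDMDL",
]
states = name_states(split_models(sample))
print(sorted(states))  # ['zika0001', 'zika0002']
```

Run on the real assembly file, this should give 60 entries, one per copy of the subunit.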




after split_states you can visualise the whole assembly at once


Now we can see all the subunits all at once and things start to get a bit more tricky. If you have a reasonably modern Mac or PC you should be fine unless you try and make some fancy surfaces, in which case you might find your machine chugging a bit. But for now the first thing you should do is save the session as whatever file name you want in case the next few steps cause Pymol to crash. Pymol adds a fair bit of overhead to these session files so you’ll end up with something about 210MB in size.

You might want to experiment with turning the cartoon representation on too at this point. This is also the point at which I made that short spinning video I mentioned earlier. The next thing you will probably want to do is save the resulting assembly as a single PDB file so you don’t have to repeat this process, or perhaps you want to offload the file to a different modelling package, like Jonas Boström suggested via Twitter.

Fortunately Pymol lets you export/save multiple objects into a single PDB file. The simplest way to do this is via the “select” function. At the Pymol command line type:

select *

Everything should now be highlighted. Now choose File -> Save Molecule, and in the dialog box scroll down and choose “sele” as the object to be saved, then give it a file name, and wait…

In our case the resulting file is about 54 MB, but there is a big problem. To demonstrate, open the PDB file in a text editor and scroll down. I used Smultron, and you have to wait a fair bit as it gets very laggy. If you scroll down far enough you will discover a problem I had not encountered before: Pymol can’t properly export a PDB file of more than 99999 atoms, because the atom serial number field in the PDB format is only five characters wide. This file contains over 660000 atoms, and every one past 99999 is numbered 99999.
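
To put a number on the damage without scrolling through a laggy text editor, a short Python sketch can count how many records share the serial 99999. It relies on the atom serial number living in columns 7-11 of ATOM and HETATM records. The clamping in format_serial only mimics the behaviour I saw in the exported file; it is not taken from Pymol's source, and the function names are my own.

```python
def atom_serial(line):
    """Read the atom serial number from columns 7-11 of an ATOM/HETATM line."""
    return int(line[6:11])


def count_serial(pdb_lines, serial=99999):
    """Count ATOM/HETATM records that carry the given serial number."""
    return sum(
        1
        for line in pdb_lines
        if line.startswith(("ATOM", "HETATM")) and atom_serial(line) == serial
    )


def format_serial(n):
    """Render a serial into the 5-character field, clamping any overflow."""
    return f"{min(n, 99999):5d}"


# Four made-up records straddling the limit: two fit, two overflow.
lines = [
    "ATOM  " + format_serial(n) + "  CA  ALA A   1"
    for n in (99998, 99999, 100000, 100001)
]
print(count_serial(lines))  # 3
```

In a healthy file the count is at most 1; anything larger means serial numbers have collided.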


Pymol gets stuck at numbering atoms after No 99999



Even the heteroatoms (Glycans in this case) have atomID 99999



And the CONECT records are horrid


This as you would expect is going to cause problems, not least when we get to the CONECT records part of the PDB file. Behold the ugliness. However, all the xyz coordinates are legit. Let’s see what happens when we load this big PDB file back into a new Pymol session. Whoops, see all those extra long bonds? That’s trouble.


Some long wonky bonds there courtesy of those CONECT records


Fortunately CONECT records are not completely necessary in a PDB file if you just want to do simple visualisations. So in a text editor it is a simple matter to remove them all. The resulting edited PDB file can then be read back into Pymol without all those nasty extra lines, but…
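
With a file this size a text editor is painful, so the same clean-up can be sketched in Python. This is just one way to do it; the function names and the file names in the usage comment are hypothetical.

```python
def strip_conect(pdb_lines):
    """Return the lines of a PDB file with every CONECT record removed."""
    return [line for line in pdb_lines if not line.startswith("CONECT")]


def strip_conect_file(src_path, dst_path):
    """Copy src_path to dst_path line by line, skipping CONECT records.

    Returns the number of CONECT records that were dropped.
    """
    dropped = 0
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            if line.startswith("CONECT"):
                dropped += 1
            else:
                dst.write(line)
    return dropped


# Usage (hypothetical file names):
# strip_conect_file("zika_assembly.pdb", "zika_assembly_noconect.pdb")
```

Streaming the file a line at a time keeps memory use low, which matters a little when the input is tens of megabytes.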


Wonky CONECT records removed

…there’s a new problem. This export-import routine introduces some weird forgetfulness about what secondary structural elements the protein contains, so Pymol does nothing when you try to show a cartoon representation of the proteins. The normal thing to do in these circumstances is to run the secondary structure assignment from the Pymol command line. I did this and got the famous Apple spinning beachball of death. I stuck it out, however (put the cursor in the “do not sleep” corner and went and made coffee), and eventually Pymol came back to life, but it was a long wait (20 minutes).

This is a pretty good place to point out that this was all done in Pymol 1.6 on a 2011 iMac (2.7GHz Intel Core i5, 8GB RAM) running OS X 10.10.5.

But the process eventually ended up a failure. Despite showing all the right messages in the log box, no secondary structural features could be obtained:

initiating secondary structure assignment on 103680 residues.
extracting sequence and relationships…
analyzing phi/psi angles (slow)…
finding hydrogen bonds…
verifying beta sheets…
assignment complete.
Save: Please wait — writing session file…

So the process is not yet complete. But the good news is that I have a PDB file that doesn’t make wonky bonds. The bad news is that I still have >500000 atoms with atomID 99999. Clearly this job is a bit beyond Pymol’s current abilities. I shall keep you posted. Once I have some more functional files I may put them in a public Dropbox if anyone wants them, as they’re still a bit too large to email.


[1] “Potent Cationic Inhibitors of West Nile Virus NS2B/NS3 Protease With Serum Stability, Cell Permeability and Antiviral Activity.” Martin J. Stoermer, Keith J. Chappell, Susann Liebscher, Christina M. Jensen, Chun H. Gan, Praveer K. Gupta, Wei-Jun Xu, Paul R. Young, and David P. Fairlie, J. Med. Chem. 2008, 51(18), 5714-5721. Full text via ACS publications

[2] “Structure of West Nile Virus NS3 Protease: Ligand Stabilization of Catalytic Conformation.” Gautier Robin, Keith Chappell, Martin J. Stoermer, Shu-Hong Hu, Paul R. Young, David P. Fairlie, Jennifer L Martin J. Mol. Biol. 2009, 385(5), 1568-1577. Full text via ScienceDirect.

[3] “In silico screening of small molecule libraries using the dengue virus envelope E protein has identified compounds with antiviral activity against multiple flaviviruses” Thorsten Kampmann, Ragothaman Yennamalli, Phillipa Campbell, Martin J. Stoermer, David P. Fairlie, Bostjan Kobe, Paul R. Young Antiviral Research, 2009, 84(3), 234-41. Full text via ScienceDirect.

[4] “The 3.8 Å resolution cryo-EM structure of Zika virus.” Sirohi D, Chen Z, Sun L, Klose T, Pierson TC, Rossmann MG, Kuhn RJ. Science, 2016, 352(6284), 467-70. Pubmed Link.



2 Responses to Pymol and very large PDB files. The Zika Cryo-EM structure as a case study

  1. Paul Emsley says:

    For the record, it’s not PyMOL that has the problem with more than 99999 atoms, it’s the PDB format itself. The time is well past for the PDB format. Say hello to pdbx (sometimes known as mmCIF).

    • martin says:

      You’re quite right. It is inherent in the file format. Some PDB file viewers just ignore the atomID field in such cases (VMD does I think), and others just display the coordinates and report errors (Chimera).
