Structural Bioinformatics

DEFINING BIOINFORMATICS AND STRUCTURAL BIOINFORMATICS

Russ B. Altman and Jonathan M. Dugan

WHAT IS BIOINFORMATICS?

The precise definition of bioinformatics is a matter of debate. Some define it narrowly as the development of databases to store and manipulate genomic information. Others define it broadly as encompassing all of computational biology. Based on its current use in the scientific literature, bioinformatics can be defined as the study of two information flows in molecular biology (Altman, 1998). The first information flow is based on the central dogma of molecular biology: DNA sequences are transcribed into mRNA sequences; mRNA sequences are translated into protein sequences; and protein sequences fold into three-dimensional structures that have functions. These functions are selected, in a Darwinian sense, by the environment of the organism, which drives the evolution of the DNA sequence within a population.
The first class of bioinformatics applications, then, can address the transfer of information at any stage in the central dogma, including the organization and control of genes in the DNA sequence, the identification of transcriptional units in DNA, the prediction of protein structure from sequence, and the analysis of molecular function. These applications include the emergence of system-wide analyses of biological phenomena, now called systems biology. Systems biology aims to achieve quantitative understanding not only of the individual players in a biological system but also of the properties of the system itself that emerge from the interaction of all its parts. This field also includes the new field of metagenomics, where we study entire ecosystems of interacting organisms.
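
The first two steps of the central dogma can be sketched in a few lines of Python. This is a minimal illustration, not a production tool: the codon table is a deliberately small subset of the standard genetic code, the input is assumed to be a DNA coding strand read in frame from its first base, and the function names `transcribe` and `translate` are chosen here for illustration.

```python
# Illustrative subset of the standard genetic code (not the full 64-codon table).
CODON_TABLE = {
    "AUG": "M",  # methionine, also the usual start codon
    "UUU": "F", "UUC": "F",
    "GGU": "G", "GGC": "G", "GGA": "G", "GGG": "G",
    "GCU": "A", "GCC": "A", "GCA": "A", "GCG": "A",
    "AAA": "K", "AAG": "K",
    "UAA": "*", "UAG": "*", "UGA": "*",  # stop codons
}


def transcribe(coding_strand):
    """Return the mRNA matching a DNA coding strand (T replaced by U)."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna):
    """Translate an mRNA in frame from position 0, stopping at a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        # "X" marks a codon that is outside the illustrative subset above.
        residue = CODON_TABLE.get(mrna[i:i + 3], "X")
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)


mrna = transcribe("ATGTTTGGCGCAAAATAA")
protein = translate(mrna)
print(mrna, protein)
```

The third step, folding the protein sequence into a three-dimensional structure, has no comparably simple lookup, which is one reason structure prediction remains a central problem in the field.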

The second information flow is based on the scientific method: we create hypotheses regarding biological activity, design experiments to test these hypotheses, evaluate the resulting data for compatibility with the hypotheses, and extend or modify the hypotheses in response to the data. The second class of bioinformatics applications addresses the transfer of information within this protocol, including systems that generate hypotheses, design experiments, store and organize the data from these experiments in databases, test the compatibility of the data with models, and modify hypotheses. The emergence of, and emphasis on, systems-level modeling and interactions in both systems biology and metagenomics create major new challenges for our field.

The explosion of interest in bioinformatics has been driven by the emergence of experimental techniques that generate data in a high-throughput fashion, such as high-throughput DNA sequencing, mass spectrometry, or microarray expression analysis. Bioinformatics depends on the availability of large data sets that are too complex to allow manual analysis. The rapid increase in the number of three-dimensional macromolecular structures available in databases such as the Protein Data Bank (PDB) has driven the emergence of a subdiscipline of bioinformatics: structural bioinformatics. Structural bioinformatics is the subdiscipline of bioinformatics that focuses on the representation, storage, retrieval, analysis, and display of structural information at the atomic and subcellular spatial scales.

Structural bioinformatics, like many other subdisciplines within bioinformatics,² is characterized by two goals: the creation of general purpose methods for manipulating information about biological macromolecules and the application of these methods to solve problems in biology and create new knowledge. These two goals are intricately linked because part of the validation of new methods involves their successful use in solving real problems. At the same time, the current challenges in biology demand the development of new methods that can handle the volume of data now available and the complexity of models that scientists must create to explain these data.

Structural Bioinformatics Has Been Catalyzed by Large Amounts of Data

Biology has attracted computational scientists over the past 30 years in two distinct ways. First, the increasing availability of sequence data has been a magnet for those with an interest in string analysis, algorithms, and probabilistic models (Gusfield, 1997; Durbin et al., 1998). The major accomplishments have been the development of algorithms for pair-wise sequence alignment, multiple alignment, the definition and discovery of sequence motifs, and the use of probabilistic models, such as hidden Markov models to find genes, align sequences, and summarize protein families. Second, the increasing availability of structural data has been a magnet for those with an interest in computational geometry, computer graphics, and algorithms for analyzing crystallographic data (Chapter 4) and NMR data (Chapter 5) to create credible molecular models. Structural bioinformatics has its roots in this second group. The development of molecular graphics was one of the first applications of computer graphics. The elucidation of the structure of DNA in the mid-1950s and the publication of the first protein crystal structures in the early 1960s created a demand for computerized methods for examining these complex molecules. At the same time, the need for computational algorithms to deconvolute X-ray crystallographic data and fit the resulting electron densities to the more manageable ball-and-stick models created a cadre of structural biologists who were very well versed in computational technologies. The challenges of interpreting NMR-derived distance constraints into three-dimensional structures further introduced computational technologies to biological structure. 
As the number of three-dimensional structures increased, the need to create methods for storing and disseminating this data led to the creation of the PDB, one of the earliest scientific databases.¹ In the past 10 years, we have seen a third wave of interest in biological problems from a group that was not engaged by the availability of 1D sequence data or 3D structural data. This third wave has arisen in response to the increased availability of RNA expression data and has captured the interest of computational scientists with an interest in statistical analysis and machine learning, particularly in clustering methodologies and classification techniques. The problems posed by these data are different from those seen in both sequence and structural analysis data. The recent introduction of high-throughput DNA sequencing technologies that produce short-length (25-50) snippets of DNA sequence is re-energizing the sequence analysis community with new challenges.

Structural bioinformatics is now in a renaissance with the success of the genome sequencing projects, the emergence of high-throughput methods for expression analysis, and identification of compounds via mass spectrometry. There are now organized efforts in structural genomics (Chapter 40) to collect and analyze macromolecular structures in a high-throughput manner. These efforts include challenges in the selection of molecules to study, the robotic preparation and manipulation of samples to find crystallization conditions, the analysis of X-ray diffraction data, and the annotation of these structures as they are stored in databases (Section II). In addition, there have been advancements in the capabilities of NMR structure determination, which previously could only study proteins in a limited range of sizes. The solution of the malate synthase G complex from E. coli with 731 residues has pushed the frontier for NMR spectroscopy and suggests that NMR is having its own renaissance (Tugarinov et al., 2005). The PDB now has a critical mass of structures that allows (indeed requires!) statistical analysis to learn the rules of how active and binding sites are constructed, which in turn allows us to develop knowledge-based methods for the prediction of structure and function. Finally, the emergence of this structural information, when linked to the increasing amount of genomic information and expression data, provides opportunities for linking structural information to other data sources to understand how cellular pathways and processes work at a molecular level.

Toward a High-Resolution Understanding of Biology

The great promise of structural bioinformatics is predicated on the belief that the availability of high-resolution structural information about biological systems will allow us to precisely reason about the function of these systems and the effects of modifications or perturbations. The genetic analyses can only associate genetic sequences with their functional consequences, whereas the structural biological analyses offer the additional promise of ultimate insight into the mechanisms of these consequences, and therefore a more profound understanding of how biological function follows from the structure. The promise for structural bioinformatics lies in four areas: (1) creating an infrastructure for building up structural models from component parts, (2) gaining the ability to understand the design principles of proteins, so that new functionalities can be created, (3) learning how to design drugs efficiently based on structural knowledge of their target, and (4) catalyzing the development of simulation models that can give insight into function based on structural simulations. Each of these areas has already seen success, and the structural genomics projects promise to create data sets sufficient to catalyze accelerated progress in all these areas.

With respect to creating an infrastructure for modeling larger structural ensembles, we are already seeing the emergence of a new generation of structures larger by an order of magnitude than the structures submitted to the PDB a few years ago. Some achievements in recent years include (1) the elucidation of the structure of the bacterial ribosome (with more than 250,000 atoms) (Ban et al., 2000; Clemons Jr et al., 2001; Yusupov et al., 2001), (2) the publication of the RNA polymerase structure (with about 500,000 atoms) (Cramer et al., 2000), and (3) the increased ability to solve the structure of membrane proteins (transporters and receptors, in particular) that have proven technically difficult in the past. Each of these allows us to examine the principles of how a large number of component protein and nucleic acid structures can assemble to create macromolecular machines. With these successes, we can now target numerous other cellular ensembles for structural studies.

The design principles of proteins are now in reach both because we have a large "training set" of example proteins to study and because methods for structure prediction are beginning to allow us to identify structures that are unlikely to be stable. There have been preliminary successes in the design of four-helix bundle proteins (DeGrado, Regan, and Ho, 1987) and in the engineering of TIM barrels (Silverman, Balakrishnan, and Harbury, 2001). There has been interesting work in "reverse folding" in which a set of amino acid side chains is collected to stabilize a desired protein backbone conformation (Koehl and Levitt, 1999).

Rational drug design has not been the primary way for discovering major therapeutics (Chapters 27, 34 and 35). However, recent successes in this area give reason to expect that drug discovery projects will increasingly be structure based. One of the most famous examples of rational drug design was the creation of HIV protease inhibitors based on the known three-dimensional crystal structure (Kempf, 1994; Vacca, 1994). Methods for matching combinatorial libraries of chemicals against protein binding sites have matured and are in routine use at most pharmaceutical companies.

The simulation of biological macromolecular dynamics dates almost as far back as the elucidation of the first protein structure (Doniach and Eastman, 1999). These simulations are based on the integration of classical equations of motion and computation of electrostatic forces between atoms in a molecule. Methods for simulation now routinely include water molecules and are able to remain stable (the molecule does not fall apart) and reproduce experimental measurements with some fidelity. The simulation of larger ensembles and structural variants (such as based on known genetic variations in sequence) should lead to a more profound understanding of how structural properties produce functional behavior. The NIH has recognized the importance of simulation and created a national center devoted to physics-based simulation of biological structure (SIMBIOS, http://simbios.stanford.edu/).
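
To make "integration of classical equations of motion" concrete, the following sketch integrates Newton's equations for a single particle on a one-dimensional harmonic spring using the velocity Verlet scheme, a widely used integrator in molecular dynamics. The system is a toy stand-in chosen here for illustration: it has no electrostatics, no solvent, and reduced units (mass, spring constant, and time are all dimensionless), so it shows only the shape of the calculation, not a real force field.

```python
import math


def harmonic_force(x, k=1.0):
    """Restoring force of a 1D harmonic spring (toy stand-in for a bond)."""
    return -k * x


def velocity_verlet(x, v, dt, n_steps, mass=1.0, force=harmonic_force):
    """Integrate Newton's equations of motion with the velocity Verlet scheme."""
    f = force(x)
    trajectory = [(x, v)]
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass   # half-step velocity update
        x = x + dt * v_half                # full-step position update
        f = force(x)                       # force at the new position
        v = v_half + 0.5 * dt * f / mass   # complete the velocity update
        trajectory.append((x, v))
    return trajectory


def total_energy(x, v, k=1.0, mass=1.0):
    """Kinetic plus potential energy of the toy oscillator."""
    return 0.5 * mass * v * v + 0.5 * k * x * x


traj = velocity_verlet(x=1.0, v=0.0, dt=0.01, n_steps=10000)
e_start = total_energy(*traj[0])
e_end = total_energy(*traj[-1])
print(f"energy drift over {len(traj) - 1} steps: {e_end - e_start:.2e}")
```

The near-constant total energy is the one-dimensional analogue of a simulation that "remains stable": an integrator that leaked energy would eventually let the molecule fall apart.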

Special Challenges in Computing with Structural Data

Structural bioinformatics must overcome some special challenges that are either not present or not dominant in other types of bioinformatics domains (such as the analysis of sequence or microarray data). It is important to remember these challenges when assessing the opportunities in the field. They include the following:

Structural data are not linear and therefore not easily amenable to algorithms based on strings. In addition to this obvious nonlinearity, there are nonlinear relationships between atoms (the forces are not linear).
The search space for most structural problems is continuous. Structures are represented generally by atomic Cartesian coordinates (or internal angular coordinates) that are continuous variables. Thus, there are infinite search spaces for algorithms attempting to assign atomic coordinate values. Many simplifications can be applied (such as lattice models for 3D structure; Hinds and Levitt, 1994), but these are attempts to manage the inherent continuous nature of these problems.
There is a fundamental connection between molecular structure and physics. While this statement seems obvious and trivial, it means that when reduced representations, such as pseudoatoms or lattice models, are applied, they become more difficult to relate to the underlying physics that governs the interactions.
Reasoning about structure requires visualization. As mentioned above, the creation of computer graphics was driven, in part, by the need of structural biologists to look at molecules (Chapter 9). This is both a benefit and a detriment; structure is well defined, and well-designed visualizations can provide insight into structural problems. However, graphical displays have a human user as a target and are not easily parsed or understood by computers, and thus represent something of a computational "dead end." Having expressive data structures underlying these visualizations allows the information to be understood and analyzed by computer programs and thus opens the possibility of further downstream analysis.
Structural data, like all biological data, can be noisy and imperfect. Despite some amazing successes in the elucidation of very high-resolution structures, the precision of our knowledge about many structures is likely to be limited by their flexibility, dynamics, or experimental noise.
Protein and nucleic acid structures are generally conserved more than their associated sequences.
Structural genomics will likely produce a large number of structures at the level of the domain—relatively well-defined modules that associate to form larger ensembles. The principles by which these domains associate and cooperatively function pose a major challenge to structural biology (Chapters 17, 18, 20, and 26).
Finally, we must recognize that there is a major gap in our knowledge of a large fraction of proteins that are not globular and water soluble. In particular, membrane-bound and fibrous proteins are simply not well understood and structures have not been available in the numbers required to allow routine statistical and informatics approaches to their study. The importance of this shortcoming cannot be overemphasized, since these classes of proteins are among the most important ones for understanding a large number of cellular processes of great interest, including signal transduction, cytoskeletal dynamics, and cellular localization and compartmentalization. Recently, some fascinating structures of membrane-bound transporter proteins, such as a zinc transporter (Lu and Fu, 2007), have improved our understanding of membrane protein structure (Chapter 36).
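
The relationship between Cartesian coordinates and internal angular coordinates mentioned in the list above can be illustrated by computing a dihedral (torsion) angle from four atomic positions. The sketch below uses a common atan2-based formula and hypothetical coordinates; the helper names are chosen here for illustration. Because the angle is a continuous function of twelve real-valued coordinates, any search over conformations is a search over a continuous space unless it is deliberately discretized.

```python
import math


def sub(a, b):
    """Component-wise difference of two 3D points."""
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a, b):
    """Dot product of two 3D vectors."""
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a, b):
    """Cross product of two 3D vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees about the p1-p2 axis."""
    b1, b2, b3 = sub(p1, p0), sub(p2, p1), sub(p3, p2)
    n1, n2 = cross(b1, b2), cross(b2, b3)  # normals of the two planes
    y = math.sqrt(dot(b2, b2)) * dot(b1, n2)
    x = dot(n1, n2)
    return math.degrees(math.atan2(y, x))


# Hypothetical coordinates: three fixed atoms and a fourth that is rotated.
p0, p1, p2 = (1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0)
cis = dihedral(p0, p1, p2, (1.0, 0.0, 1.0))
right_angle = dihedral(p0, p1, p2, (0.0, 1.0, 1.0))
trans = dihedral(p0, p1, p2, (-1.0, 0.0, 1.0))
print(cis, right_angle, trans)
```

A lattice model, by contrast, would restrict each atom to a finite set of grid positions, trading this continuity for an enumerable search space.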

TECHNICAL CHALLENGES WITHIN STRUCTURAL BIOINFORMATICS

The scientific challenges within structural bioinformatics fall into two rough categories: the creation of methods to support structural biology and structural genomics and the creation of methods to elucidate new biological knowledge. This distinction is not absolute, but is useful for dividing much structural bioinformatics work. The support of experimental structural biology is currently an area of particular interest with the emergence of efforts in high-throughput structural genomics. Informatics approaches are required for many aspects of this enterprise, and can be briefly reviewed here:

Target Selection: Structural genomics efforts with finite resources must select proteins to study carefully. Informatics methods are used to compare the database of existing structures and known sequences with potential targets to identify those that are most likely to add to our structural knowledge base. This selection can be informed by the expected novelty of the structure, and even its importance as reflected in the published literature (Linial and Yona, 2000). A critical part of target selection is the identification of domains within large proteins. Domains are often easier to study initially in isolation, and then in complexes. The definition of domains from sequence data alone is a challenging problem (Chapter 20).
Tracking Experimental Crystallization Trials: One of the major bottlenecks in structural genomics is the discovery of crystallization conditions that work for proteins of interest. In addition to the obvious need of storing and tracking information on proteins, the conditions attempted, and the results, there is also an opportunity to apply machine learning methods to these data to extract rules that may help increase the yield of crystals based on previous experience (Hennessy et al., 2000). Until recently, the results of failed crystallization experiments were not generally available, thus making it difficult to apply automated machine learning methods to these data sets.
Analysis of Crystallographic Data: A long-standing area of computation within structural biology is the algorithm for deconvoluting the X-ray diffraction pattern, which involves computing an inverse Fourier transform with partial information (i.e., with missing phase information). There is interest in ab initio methods for automating these computations, and success in this area reduces the number of heavy atom derivatives that must be created for structures of interest (Gilmore, Dong, and Bricogne, 1998). Multiwavelength anomalous diffraction (MAD) (Hendrickson, 1991) is now the preferred method for solving the crystallographic phase problem. Over one-half of all structures are determined by MAD, a development in keeping with the availability of tunable synchrotron sources. Similarly, once the electron density is computed, there is a challenge in fitting the density to a standard ball-and-stick model of the atoms. While this has been done manually (with graphical computer assistance), there is interest in finding methods for using image processing techniques to automatically identify connected densities and match them to the known shape of protein backbone and side chain elements (Barr and Feigenbaum, 1982). Recent progress has been made on automated electron density map fitting and refinement (Chapter 4).
The Analysis of NMR Data: NMR experiments provide complementary data to the crystallographic analyses. NMR experiments produce two (or higher) dimensional spectra for which each individual peak must be assigned to an atomic interaction. The automated analysis and assignment of atoms in these spectra is a difficult search problem, but one in which progress has been made to accelerate the analysis of structure (Zimmerman and Montelione, 1995). Given a set of atomic proximities from NMR, we need methods to "embed" these distance measures into three-dimensional structures that satisfy these constraints. Distance geometry (Moré and Wu, 1999), restrained molecular dynamics (Bassolino-Klimas et al., 1996), and other nonlinear optimization methods have been developed for this purpose (Altman, 1993; Williams, Dugan, and Altman, 2001).
Assessment and Evaluation of Structures: Given the results of a crystallographic or NMR structure determination effort, we must check the structures to be sure that they meet certain quality standards. Algorithms have been developed for assessing the basic chemistry of structural models and also for identifying active and binding sites in these structures (Laskowski et al., 1993; Feng, Westbrook, and Berman, 1998; Vaguine, Richelle, and Wodak, 1999). Computational methods are still needed for automatically annotating 3D structures with functional information, based on an understanding of how molecular properties aggregate in three dimensions to produce function (such as binding, catalysis, motion, and signal transduction) (Wei, Huang, and Altman, 1999; Chapter 5).
Storing Molecular Structures in Databases: The storage of the results of structural genomics efforts is an important task, requiring data structures and organizations that facilitate the most common queries. Ideally, databases of structure will store not only the resulting model but also the raw data upon which it is based. The PDB (Chapter 11) is the major repository for three-dimensional structural information of proteins; the Nucleic Acids Database (NDB, Chapter 12) serves this function for nucleic acids. There is also an effort to store the raw data associated with crystallography in the PDB/NDB and the raw data associated with NMR in the BioMagResBank (BMRB).³
Correlating Molecular Structural Information with Structural and Functional Information Gained from Other Types of Experimentation: In the end, we perform structural studies in order to get an insight into how the molecules work. Structural studies with crystallography and NMR are two methods that can be used to probe structure-function relationships. The integration of the results of these methods with other structural and functional data allows us to build comprehensive models of mechanism, specificity, and dynamics. A major bottleneck for using informatics methods for this integration is the lack of repositories of structural and functional data that can be accessed by computer programs doing systematic analyses. One exception is the noncrystallographic structural data on the 30S and 50S ribosomal subunits stored in RiboWEB (http://riboweb.stanford.edu/), a knowledge base of ribosomal structural components that stores more than 8000 noncrystallographic structural and functional observations about the bacterial ribosome. It stores its information in structured "information templates" that are easily parsed by computer programs, thus making possible automated comparison and evaluation of structural models. For example, RiboWEB has been used to assess the compatibility of the published ribosomal crystal structures with over 1000 proximity measurements from cross-linking, chemical protection, and labeling experiments (collected during the past 25 years). Incompatibilities between these data and the crystal structures may suggest artifactual data or (more usefully) may suggest areas of important dynamic motion for the ribosome (Whirl-Carrillo et al., 2002).
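
Returning to the crystallographic phase problem described under Analysis of Crystallographic Data in the list above, a one-dimensional toy calculation shows why missing phases matter. The sketch below treats a short list of numbers as a hypothetical "electron density", computes its discrete Fourier transform, and then attempts the inverse transform twice: once with amplitudes and phases, and once with amplitudes only. It is an illustration under simplifying assumptions (one dimension, eight grid points, amplitudes taken directly rather than derived from measured intensities) and does not represent any real phasing method.

```python
import cmath


def dft(values):
    """Discrete Fourier transform of a real-valued sequence."""
    n = len(values)
    return [sum(values[x] * cmath.exp(-2j * cmath.pi * h * x / n)
                for x in range(n))
            for h in range(n)]


def inverse_dft(coeffs):
    """Real part of the inverse discrete Fourier transform."""
    n = len(coeffs)
    return [sum(coeffs[h] * cmath.exp(2j * cmath.pi * h * x / n)
                for h in range(n)).real / n
            for x in range(n)]


# Toy 1D "electron density": a heavy atom at position 1, a light one at 4.
density = [0.0, 3.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]

structure_factors = dft(density)
amplitudes = [abs(f) for f in structure_factors]       # experimentally accessible
phases = [cmath.phase(f) for f in structure_factors]   # lost in the experiment

# With both amplitudes and phases the density is recovered.
with_phases = inverse_dft([cmath.rect(a, p) for a, p in zip(amplitudes, phases)])

# With amplitudes alone (all phases set to zero) it is not.
without_phases = inverse_dft(amplitudes)

print([round(v, 3) for v in with_phases])
print([round(v, 3) for v in without_phases])
```

Experimental approaches such as heavy atom derivatives and MAD, and the ab initio methods mentioned above, are all ways of recovering the phase information that this toy example simply discards.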

Understanding the Structural Basis for Biological Phenomena

Given the structural information created by efforts in X-ray crystallography and NMR, there is a wide range of analytic and scientific challenges to informatics. It is not possible to cover the full scope of activities, but they can be reviewed briefly to show the richness of opportunities in the analysis of structural data.

Visualization: The creation of images of molecular structure remains a primary activity within structural biology (Chapter 9). The complexity of these molecules seems to demand novel display methods that are able to combine structural information with other information sources (such as electrostatic fields, the location of functional sites, and areas of structural or genetic variability). The issues for informatics include the creation of flexible software infrastructures for extending display capabilities and the use of novel methods for rapidly rendering complex molecular structures (Huang et al., 1996; Sanner et al., 1999).
Classification: The database of known structures is already large enough that it is necessary to cluster similar structures together to form families of proteins. These families are often aggregated into superfamilies, and indeed entire structural hierarchies have been created. The Structural Classification of Proteins (SCOP; Chapter 17) is an example of a semiautomated classification of all protein structures (Murzin et al., 1995), and there have been numerous efforts to create automated classifications, usually based on the pairwise comparison of all structures to create a matrix of distances (Chapter 18; Holm and Sander, 1996; Orengo et al., 1997).
Prediction: Despite the growth of the structural databases, the number of known three-dimensional structures has lagged far behind the availability of sequence information. Thus, the prediction of three-dimensional structure remains an area of keen interest. The Critical Assessment of Structure Prediction (CASP; Chapter 28) meetings have provided a biennial forum for the comparison of methods for structure prediction. The main categories of prediction have been homology modeling (based on high sequence homology to a known structure; Chapter 30; Sanchez and Sali, 1997), threading (based on homology; Chapter 31; Bryant and Altschul, 1995), and ab initio prediction (based on no detectable homology; Chapter 32; Osguthorpe, 2000). The diversity of methods invented and evaluated is quite inspiring, and the resulting lessons about how proteins are put together have been significant.
Simulation: The results of crystallographic studies (and to some extent, NMR studies) are primarily static structural models. However, the properties of these molecules that are of the greatest interest are often the results of their dynamic motions. The definition of energy functions that govern the folding of proteins and their subsequent stable dynamics has been an area of great interest since the first structure was determined. Unfortunately, the timescales on which macromolecular dynamics must be sampled (fractions of picoseconds) are much shorter than the timescale on which biologically important phenomena occur (from microseconds to seconds). Nevertheless, the availability of increasingly powerful computers and clever approximation and search methods is enabling molecular simulations of sufficient length and accuracy to emerge, making contributions to our understanding of protein function. The associated computation of electrostatic fields of macromolecular structures (Chapter 24) has emerged as an important component of understanding molecular function (Sheinerman, Norel, and Honig, 1992).
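
The timescale gap described under Simulation above can be made concrete with simple arithmetic. The sketch below counts how many integration steps are needed to reach biologically relevant durations. The 2 femtosecond timestep is an assumption made here for illustration (the text says only "fractions of picoseconds"), and the function name is hypothetical.

```python
FEMTOSECOND = 1e-15  # seconds


def steps_required(simulated_seconds, timestep_seconds=2 * FEMTOSECOND):
    """Number of integration steps needed to cover a simulated duration."""
    return round(simulated_seconds / timestep_seconds)


# Durations spanning the biologically important range named in the text.
targets = [("1 microsecond", 1e-6), ("1 millisecond", 1e-3), ("1 second", 1.0)]
for label, duration in targets:
    print(f"{label}: {steps_required(duration):.1e} integration steps")
```

Under this assumption, each thousandfold increase in simulated time costs a thousandfold more force evaluations, which is why approximation and search methods matter as much as raw computing power.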

It should be emphasized that although there has been primary focus on protein structures, with respect to the challenges outlined above, there is increasing interest in the same issues for RNA structure. The last decade has shown that the role of RNA molecules in the cell goes far beyond being a passive information carrier as messenger RNA. (Think of RNA as more than just a temporary copy of code, which is the role of mRNA. RNA can act like a program itself: it can regulate other genes, like a control system; catalyze chemical reactions, like an enzyme; and form complex structures, like a protein. This is similar to how scripting languages, originally designed for simple tasks, turned out to be usable for complex applications.) A large number of structured RNA molecules are involved in gene regulation (through RNA inhibition and other mechanisms), and their 3D structure is critical for understanding their function. The overall challenges for RNA structure are similar to those for proteins, but the details differ: RNA structure is dominated by electrostatics and not hydrophobic interactions, the secondary structure is easier to predict but offers a more limited repertoire for structural uses, and the molecules are more prone to finding stable misfolded states. Nonetheless, our understanding of structural biology will necessarily include the structure of RNA and RNA-protein complexes (Chapters 3, 12, and 33).
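One reason RNA secondary structure is comparatively tractable is that base pairing follows simple rules and can be optimized by dynamic programming. The sketch below uses Nussinov-style base-pair maximization, a classic textbook simplification that is swapped in here purely for illustration; it is not the method of the chapters cited above, and practical predictors score thermodynamic energies rather than counting pairs. The pairing set (Watson-Crick plus G-U wobble), the minimum loop length of 3, and the name `max_base_pairs` are assumptions of this example.

```python
# Nussinov-style base-pair maximization (illustrative simplification).
# Assumptions: Watson-Crick pairs plus G-U wobble are allowed, and a
# hairpin loop must contain at least `min_loop` unpaired bases.

PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def max_base_pairs(seq, min_loop=3):
    """Return the maximum number of nested base pairs in `seq`."""
    n = len(seq)
    if n == 0:
        return 0
    # best[i][j] = maximum number of pairs within seq[i..j], inclusive.
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            # Case 1: base j is left unpaired.
            score = best[i][j - 1]
            # Case 2: base j pairs with some base k, leaving a long enough loop.
            for k in range(i, j - min_loop):
                if (seq[k], seq[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    inner = best[k + 1][j - 1]
                    score = max(score, left + 1 + inner)
            best[i][j] = score
    return best[0][n - 1]

print(max_base_pairs("GGGAAAUCC"))
```

The table is filled from short subsequences to long ones, so every entry a cell depends on is already computed. The triple loop gives cubic running time in the sequence length, which is manageable for the short structured RNAs this kind of sketch is suited to.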

INTEGRATING STRUCTURAL DATA WITH OTHER DATA SOURCES

Structural bioinformatics has existed, in one form or another, since the determination of the first myoglobin structure. One could argue that the roots go back to the time when small-molecule structure determination was introduced. In any case, the challenges for the field are clearly abundant and significant. As we look into the coming decade, it appears that a primary challenge in structural bioinformatics will be the integration of structural information with other biological information, to yield a higher-resolution understanding of biological function. The success of genome sequencing projects has created information about all the structures that are present in individual organisms, as well as both shared and unique features of these organisms. Even with the success of structural genomics projects, bioinformatics techniques will most likely be used to create homology models of most of these genomic components. The resulting structures will be studied with respect to how they interact and perform their functions. Similarly, the emergence of high-throughput expression measurements provides an opportunity to understand how the assembly of macromolecular structures is regulated (including the key structural machinery associated with transcription, translation, and degradation). Mass spectrometric methods that allow the identification of structural modifications and variations (such as genetic mutation or post-translational modifications) will need to be integrated with structural models to understand how they alter functional characteristics. Cross-linking data, particularly in vivo, will provide valuable information about the physical association between macromolecules and ligands and the dynamics of molecular ensembles, thus helping us to create a structural portrait of a cell in three dimensions at near-atomic resolution (Tsutsui and Wintrode, 2007).
Finally, cellular localization data will allow us to place three-dimensional molecular structures into compartments within the cell, as we build more complex models of how cells are organized structurally to optimize their function. This exciting activity will mark the next phase of structural bioinformatics—when the organization and physical structure of entire cells are understood and represented in computational models that provide insight into how thousands of structures within a cell work together to create the functions associated with life.