1

DEFINING BIOINFORMATICS AND STRUCTURAL BIOINFORMATICS

Russ B. Altman and Jonathan M. Dugan

The precise definition of bioinformatics is a matter of debate. Some define it narrowly as the development of databases to store and manipulate genomic information. Others define it broadly as encompassing all of computational biology. Based on its current use in the scientific literature, bioinformatics can be defined as the study of two information flows in molecular biology (Altman, 1998). The first information flow is based on the central dogma of molecular biology: DNA sequences are transcribed into mRNA sequences; mRNA sequences are translated into protein sequences; and protein sequences fold into three-dimensional structures that have functions. These functions are selected, in a Darwinian sense, by the environment of the organism, which drives the evolution of the DNA sequence within a population.
The first class of bioinformatics applications, then, can address the transfer of information at any stage in the central dogma, including the organization and control of genes in the DNA sequence, the identification of transcriptional units in DNA, the prediction of protein structure from sequence, and the analysis of molecular function. These applications include the emergence of system-wide analyses of biological phenomena, now called systems biology. Systems biology aims to achieve quantitative understanding not only of the individual players in a biological system but also of the properties of the system itself that emerge from the interaction of all its parts. This class also includes the new field of metagenomics, where we study entire ecosystems of interacting organisms.
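
The sequence-level steps of the first information flow (transcription and translation) can be illustrated with a minimal Python sketch. The function names `transcribe` and `translate` are hypothetical, the input is assumed to be the coding strand of the DNA, and translation is assumed to start at the first base and stop at the first stop codon. The sketch uses the standard genetic code and deliberately ignores folding, splicing, and regulation.

```python
from itertools import product

# Standard genetic code, with codons enumerated in U, C, A, G order.
BASES = "UCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_dna: str) -> str:
    """Return the mRNA for a coding-strand DNA sequence (T becomes U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate mRNA codon by codon, stopping at the first stop codon."""
    residues = []
    for i in range(0, len(mrna) - 2, 3):
        aa = CODON_TABLE.get(mrna[i:i + 3], "X")  # "X" marks an unknown codon
        if aa == "*":  # stop codon
            break
        residues.append(aa)
    return "".join(residues)


dna = "ATGTTTGGAAAATGGTAA"  # illustrative sequence, not taken from the text
mrna = transcribe(dna)
protein = translate(mrna)
print(mrna)     # AUGUUUGGAAAAUGGUAA
print(protein)  # MFGKW
```

The final step of the flow, folding the protein sequence into a three-dimensional structure, has no comparably simple mapping, which is part of what motivates structural bioinformatics.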

The second information flow is based on the scientific method: we create hypotheses regarding biological activity, design experiments to test these hypotheses, evaluate the resulting data for compatibility with the hypotheses, and extend or modify the hypotheses in response to the data. The second class of bioinformatics applications addresses the transfer of information within this protocol, including systems that generate hypotheses, design experiments, store and organize the data from these experiments in databases, test the compatibility of the data with models, and modify hypotheses. The emergence of, and emphasis on, systems-level modeling and interactions in both systems biology and metagenomics create major new challenges for our field.

The explosion of interest in bioinformatics has been driven by the emergence of experimental techniques that generate data in a high-throughput fashion, such as high-throughput DNA sequencing, mass spectrometry, or microarray expression analysis. Bioinformatics depends on the availability of large data sets that are too complex to allow manual analysis. The rapid increase in the number of three-dimensional macromolecular structures available in databases such as the Protein Data Bank (PDB) has driven the emergence of a subdiscipline of bioinformatics: structural bioinformatics. Structural bioinformatics is the subdiscipline of bioinformatics that focuses on the representation, storage, retrieval, analysis, and display of structural information at the atomic and subcellular spatial scales.

Structural bioinformatics, like many other subdisciplines within bioinformatics, is characterized by two goals: the creation of general purpose methods for manipulating information about biological macromolecules and the application of these methods to solve problems in biology and create new knowledge. These two goals are intricately linked because part of the validation of new methods involves their successful use in solving real problems. At the same time, the current challenges in biology demand the development of new methods that can handle the volume of data now available and the complexity of models that scientists must create to explain these data.

Structural Bioinformatics Has Been Catalyzed by Large Amounts of Data

Biology has attracted computational scientists over the past 30 years in two distinct ways. First, the increasing availability of sequence data has been a magnet for those with an interest in string analysis, algorithms, and probabilistic models (Gusfield, 1997; Durbin et al., 1998). The major accomplishments have been the development of algorithms for pairwise sequence alignment and multiple alignment, the definition and discovery of sequence motifs, and the use of probabilistic models, such as hidden Markov models, to find genes, align sequences, and summarize protein families. Second, the increasing availability of structural data has been a magnet for those with an interest in computational geometry, computer graphics, and algorithms for analyzing crystallographic data (Chapter 4) and NMR data (Chapter 5) to create credible molecular models. Structural bioinformatics has its roots in this second group. The development of molecular graphics was one of the first applications of computer graphics. The elucidation of the structure of DNA in the mid-1950s and the publication of the first protein crystal structures in the early 1960s created a demand for computerized methods for examining these complex molecules. At the same time, the need for computational algorithms to deconvolute X-ray crystallographic data and fit the resulting electron densities to more manageable ball-and-stick models created a cadre of structural biologists who were very well versed in computational technologies. The challenge of converting NMR-derived distance constraints into three-dimensional structures further introduced computational technologies to biological structure.
As the number of three-dimensional structures increased, the need to create methods for storing and disseminating these data led to the creation of the PDB, one of the earliest scientific databases. In the past 10 years, we have seen a third wave of interest in biological problems from a group that was not engaged by the availability of 1D sequence data or 3D structural data. This third wave has arisen in response to the increased availability of RNA expression data and has captured the interest of computational scientists with an interest in statistical analysis and machine learning, particularly in clustering methodologies and classification techniques. The problems posed by these data are different from those seen in both sequence and structural analysis data. The recent introduction of high-throughput DNA sequencing technologies that produce short (25-50 base) snippets of DNA sequence is re-energizing the sequence analysis community with new challenges.

Structural bioinformatics is now in a renaissance with the success of the genome sequencing projects, the emergence of high-throughput methods for expression analysis, and the identification of compounds via mass spectrometry. There are now organized efforts in structural genomics (Chapter 40) to collect and analyze macromolecular structures in a high-throughput manner. These efforts include challenges in the selection of molecules to study, the robotic preparation and manipulation of samples to find crystallization conditions, the analysis of X-ray diffraction data, and the annotation of these structures as they are stored in databases (Section II). In addition, there have been advancements in the capabilities of NMR structure determination, which previously could only study proteins in a limited range of sizes. The solution of the malate synthase G complex from E. coli with 731 residues has pushed the frontier for NMR spectroscopy and suggests that NMR is having its own renaissance (Tugarinov et al., 2005). The PDB now has a critical mass of structures that allows (indeed requires!) statistical analysis to learn the rules of how active and binding sites are constructed, which in turn allows us to develop knowledge-based methods for the prediction of structure and function. Finally, this structural information, when linked to the increasing amount of genomic information and expression data, provides opportunities to understand how cellular pathways and processes work at a molecular level.

Toward a High-Resolution Understanding of Biology

The great promise of structural bioinformatics is predicated on the belief that the availability of high-resolution structural information about biological systems will allow us to reason precisely about the function of these systems and the effects of modifications or perturbations. Genetic analyses can only associate genetic sequences with their functional consequences, whereas structural analyses offer the additional promise of insight into the mechanisms of these consequences, and therefore a more profound understanding of how biological function follows from structure. The promise for structural bioinformatics lies in four areas: (1) creating an infrastructure for building up structural models from component parts, (2) gaining the ability to understand the design principles of proteins, so that new functionalities can be created, (3) learning how to design drugs efficiently based on structural knowledge of their target, and (4) catalyzing the development of simulation models that can give insight into function based on structural simulations. Each of these areas has already seen success, and the structural genomics projects promise to create data sets sufficient to catalyze accelerated progress in all these areas.

With respect to creating an infrastructure for modeling larger structural ensembles, we are already seeing the emergence of a new generation of structures larger by an order of magnitude than the structures submitted to the PDB a few years ago. Some achievements in recent years include (1) the elucidation of the structure of the bacterial ribosome (with more than 250,000 atoms) (Ban et al., 2000; Clemons Jr et al., 2001; Yusupov et al., 2001), (2) the publication of the RNA polymerase structure (with about 500,000 atoms) (Cramer et al., 2000), and (3) the increased ability to solve the structure of membrane proteins (transporters and receptors, in particular) that have proven technically difficult in the past. Each of these allows us to examine the principles of how a large number of component protein and nucleic acid structures can assemble to create macromolecular machines. With these successes, we can now target numerous other cellular ensembles for structural studies.

The design principles of proteins are now within reach, both because we have a large "training set" of example proteins to study and because methods for structure prediction are beginning to allow us to identify structures that are unlikely to be stable. There have been preliminary successes in the design of four-helix bundle proteins (DeGrado, Regan, and Ho, 1987) and in the engineering of TIM barrels (Silverman, Balakrishnan, and Harbury, 2001). There has also been interesting work in "reverse folding," in which a set of amino acid side chains is selected to stabilize a desired protein backbone conformation (Koehl and Levitt, 1999).

Rational drug design has not been the primary way for discovering major therapeutics (Chapters 27, 34 and 35). However, recent successes in this area give reason to expect that drug discovery projects will increasingly be structure based. One of the most famous examples of rational drug design was the creation of HIV protease inhibitors based on the known three-dimensional crystal structure (Kempf, 1994; Vacca, 1994). Methods for matching combinatorial libraries of chemicals against protein binding sites have matured and are in routine use at most pharmaceutical companies.

The simulation of biological macromolecular dynamics dates almost as far back as the elucidation of the first protein structure (Doniach and Eastman, 1999). These simulations are based on the integration of classical equations of motion and computation of electrostatic forces between atoms in a molecule. Methods for simulation now routinely include water molecules and are able to remain stable (the molecule does not fall apart) and reproduce experimental measurements with some fidelity. The simulation of larger ensembles and structural variants (such as those based on known genetic variations in sequence) should lead to a more profound understanding of how structural properties produce functional behavior. The NIH has recognized the importance of simulation and created a national center devoted to physics-based simulation of biological structure (SIMBIOS, http://simbios.stanford.edu/).
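
The core idea of such simulations, integrating classical equations of motion under interatomic forces, can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not a description of any production simulation package. It uses reduced (unitless) constants, a two-atom "molecule" in which every pair of atoms is treated as bonded, a Coulomb term plus a harmonic bond as the only forces, and the velocity Verlet integrator, a common choice that the text itself does not name. All function and constant names are hypothetical.

```python
import math

# Reduced-unit constants chosen for illustration only (assumptions).
COULOMB_K = 1.0   # electrostatic constant
BOND_K = 10.0     # harmonic bond stiffness
BOND_R0 = 1.0     # bond rest length


def forces_and_energy(pos, charges):
    """Return per-atom forces and the potential energy.

    Each pair contributes a Coulomb term, k*qi*qj/r, and a harmonic
    bond term, 0.5*K*(r - r0)**2 (every pair is treated as bonded).
    """
    n = len(pos)
    forces = [[0.0, 0.0, 0.0] for _ in range(n)]
    energy = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            d = [pos[i][k] - pos[j][k] for k in range(3)]  # vector j -> i
            r = math.sqrt(sum(c * c for c in d))
            qq = COULOMB_K * charges[i] * charges[j]
            energy += qq / r + 0.5 * BOND_K * (r - BOND_R0) ** 2
            # -dU/dr: a positive value pushes the two atoms apart.
            f_mag = qq / r ** 2 - BOND_K * (r - BOND_R0)
            for k in range(3):
                forces[i][k] += f_mag * d[k] / r
                forces[j][k] -= f_mag * d[k] / r
    return forces, energy


def velocity_verlet(pos, vel, masses, charges, dt, n_steps):
    """Integrate Newton's equations of motion with velocity Verlet."""
    pos = [list(p) for p in pos]
    vel = [list(v) for v in vel]
    forces, _ = forces_and_energy(pos, charges)
    for _ in range(n_steps):
        for i in range(len(pos)):
            for k in range(3):
                vel[i][k] += 0.5 * dt * forces[i][k] / masses[i]  # half kick
                pos[i][k] += dt * vel[i][k]                       # drift
        forces, _ = forces_and_energy(pos, charges)
        for i in range(len(pos)):
            for k in range(3):
                vel[i][k] += 0.5 * dt * forces[i][k] / masses[i]  # half kick
    return pos, vel


def total_energy(pos, vel, masses, charges):
    """Kinetic plus potential energy of the system."""
    kinetic = sum(0.5 * m * sum(c * c for c in v) for m, v in zip(masses, vel))
    return kinetic + forces_and_energy(pos, charges)[1]


masses = [1.0, 1.0]
charges = [1.0, 1.0]  # like charges repel; the bond holds the pair together
start_pos = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
start_vel = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

e_start = total_energy(start_pos, start_vel, masses, charges)
end_pos, end_vel = velocity_verlet(start_pos, start_vel, masses, charges,
                                   dt=0.001, n_steps=5000)
e_end = total_energy(end_pos, end_vel, masses, charges)
print(f"energy drift: {abs(e_end - e_start):.2e}")
```

A small energy drift over the run is one simple check that the integration is stable in the sense used above. Real force fields add many more terms (angles, torsions, van der Waals interactions, explicit solvent) and use far more efficient force evaluation.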

Special Challenges in Computing with Structural Data

Structural bioinformatics must overcome some special challenges that are either not present or not dominant in other types of bioinformatics domains (such as the analysis of sequence or microarray data). It is important to remember these challenges when assessing the opportunities in the field. They include the following:

The scientific challenges within structural bioinformatics fall into two rough categories: the creation of methods to support structural biology and structural genomics and the creation of methods to elucidate new biological knowledge. This distinction is not absolute, but is useful for dividing much structural bioinformatics work. The support of experimental structural biology is currently an area of particular interest with the emergence of efforts in high-throughput structural genomics. Informatics approaches are required for many aspects of this enterprise, and can be briefly reviewed here:

Understanding the Structural Basis for Biological Phenomena

Given the structural information created by efforts in X-ray crystallography and NMR, there is a wide range of analytic and scientific challenges to informatics. It is not possible to cover the full scope of activities, but they can be reviewed briefly to show the richness of opportunities in the analysis of structural data.

It should be emphasized that although the primary focus has been on protein structures with respect to the challenges outlined above, there is increasing interest in the same issues for RNA structure. The last decade has shown that the role of RNA molecules in the cell goes far beyond being a passive information carrier as messenger RNA. A large number of structured RNA molecules, whose 3D structure is critical for understanding their function, are involved in gene regulation (through RNA inhibition and other mechanisms). The overall challenges for RNA structure are similar to those for proteins, but the details differ: RNA structure is dominated by electrostatics and not hydrophobic interactions, the secondary structure is easier to predict but offers a more limited repertoire for structural uses, and the molecules are more prone to finding stable misfolded states. Nonetheless, our understanding of structural biology will necessarily include the structure of RNA and RNA-protein complexes (Chapters 3, 12, and 33).

Structural bioinformatics has existed, in one form or another, since the determination of the first myoglobin structure. One could argue that the roots go back to the time when small molecular structure determination was introduced. In any case, the challenges for the field are clearly abundant and significant. As we look into the coming decade, it appears that a primary challenge in structural bioinformatics will be the integration of structural information with other biological information, to yield a higher resolution understanding of biological function. The success of genome sequencing projects has created information about all the structures that are present in individual organisms, as well as both shared and unique features of these organisms. Even with the success of structural genomics projects, bioinformatics techniques will most likely be used to create homology models of most of these genomic components. The resulting structures will be studied with respect to how they interact and perform their functions. Similarly, the emergence of high-throughput expression measurements provides an opportunity to understand how the assembly of macromolecular structures is regulated (including the key structural machinery associated with transcription, translation, and degradation). Mass spectroscopic methods that allow the identification of structural modifications and variations (such as genetic mutation or post-translational modifications) will need to be integrated with structural models to understand how they alter functional characteristics. Cross-linking data, particularly in vivo, will provide valuable information about the physical association between macromolecules and ligands and the dynamics of molecular ensembles, thus helping us to create a structural portrait of a cell in three dimensions at near-atomic resolution (Tsutsui and Wintrode, 2007). 
Finally, cellular localization data will allow us to place three-dimensional molecular structures into compartments within the cell, as we build more complex models of how cells are organized structurally to optimize their function. This exciting activity will mark the next phase of structural bioinformatics—when the organization and physical structure of entire cells are understood and represented in computational models that provide insight into how thousands of structures within a cell work together to create the functions associated with life.