• Volume 32, Issue 5

August 2007, pages 807-1039

• Foreword

• Theoretical analysis of noncanonical base pairing interactions in RNA molecules

Noncanonical base pairs in RNA have strong structural and functional implications but are currently not considered in secondary structure prediction. We present results of comparative ab initio studies of stabilities and interaction energies for the three standard and 24 selected unusual RNA base pairs reported in the literature. Hydrogen-added models of isolated base pairs, with heavy atoms frozen in their ‘away from equilibrium’ geometries and built from coordinates extracted from the NDB, were geometry optimized at the HF/6-31G** level, both before and after unfreezing the heavy atoms. Interaction energies, including BSSE and deformation energy corrections, were calculated, compared with the respective single-point MP2 energies, and correlated with occurrence frequencies and with the types and geometries of hydrogen bonding interactions. Systems having two or more N-H…O/N hydrogen bonds had reasonable interaction energies, which correlated well with the respective occurrence frequencies and highlighted the possibility that some of them could play important roles in improved secondary structure prediction methods. Several of the remaining base pairs, with one N-H…O/N and/or one C-H…O/N interaction, had poor interaction energies and negligible occurrences. Large geometry variations on optimization of some of these suggested conformational-switch-like characteristics.

• Mechanism of DNA-binding loss upon single-point mutation in p53

Over 50% of all human cancers involve p53 mutations, which occur mostly in the sequence-specific DNA-binding central domain (p53c), yielding little or no detectable affinity to the DNA consensus site. Despite our current understanding of protein-DNA recognition, the mechanism(s) underlying the loss in protein-DNA binding affinity/specificity upon single-point mutation are not well understood. Our goal is to identify the common factors governing the DNA-binding loss of p53c upon substitution of Arg 273 to His or Cys, which are abundant in human tumours. By computing the free energies of wild-type and mutant p53c binding to DNA and decomposing them into contributions from individual residues, the DNA-binding loss upon charge/noncharge-conserving mutation of Arg 273 was attributed not only to the loss of DNA phosphate contacts, but also to longer-range structural changes caused by the loss of the Asp 281 salt-bridge. The results herein and in previous works suggest that Asp 281 plays a critical role in the sequence-specific DNA-binding function of p53c by

1. orienting Arg 273 and Arg 280 in an optimal position to interact with the phosphate and base groups of the consensus DNA, respectively, and

2. helping to maintain the proper DNA-binding protein conformation.

• Incorporating evolution of transcription factor binding sites into annotated alignments

Identifying transcription factor binding sites (TFBSs) is essential to elucidate putative regulatory mechanisms. A common strategy is to combine cross-species conservation with single sequence TFBS annotation to yield “conserved TFBSs”. Most current methods in this field adopt a multi-step approach that segregates the two aspects. Moreover, it is widely accepted that the evolutionary dynamics of binding sites differ from those of the surrounding sequence. Hence, it is desirable to have an approach that explicitly takes this factor into account. Although a plethora of approaches have been proposed for the prediction of conserved TFBSs, very few explicitly model TFBS evolutionary properties, and those that do are still multi-step. Recently, we introduced a novel approach, SimAnn, to simultaneously align and annotate conserved TFBSs in a pair of sequences. Building upon the standard Smith-Waterman algorithm for local alignments, SimAnn introduces additional states for profiles to output extended alignments or annotated alignments. That is, alignments with parts annotated as gaplessly aligned TFBSs (pair-profile hits) are generated. Moreover, the pair-profile related parameters are derived in a sound statistical framework.

In this article, we extend this approach to explicitly incorporate evolution of binding sites in the SimAnn framework. We demonstrate the extension in the theoretical derivations through two position-specific evolutionary models, previously used for modelling TFBS evolution. In a simulated setting, we provide a proof of concept that the approach works given the underlying assumptions, as compared to the original work. Finally, using a real dataset of experimentally verified binding sites in human-mouse sequence pairs, we compare the new approach (eSimAnn) to an existing multi-step tool that also considers TFBS evolution.

Although it is widely accepted that binding sites evolve differently from the surrounding sequences, most comparative TFBS identification methods do not explicitly consider this. Additionally, prediction of conserved binding sites is carried out in a multi-step approach that segregates alignment from TFBS annotation. In this paper, we demonstrate how the simultaneous alignment and annotation approach of SimAnn can be further extended to incorporate TFBS evolutionary relationships. We study how alignments and binding site predictions interplay at varying evolutionary distances and for various profile qualities.
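The abstract builds on the standard Smith-Waterman local alignment algorithm. As a point of reference, here is a minimal sketch of plain Smith-Waterman only. It does not include the profile states that SimAnn adds, and the scoring values (`match`, `mismatch`, `gap`) and the function name are illustrative assumptions, not parameters from the paper.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Plain Smith-Waterman with a linear gap penalty (illustrative scores).

    Returns (best_score, aligned_a, aligned_b) for one best local alignment.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the dynamic programming matrix; cells never drop below zero.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(smith_waterman("TTTACGTAAA", "GGGACGTCCC"))
```

An annotated-alignment method in the spirit of the abstract would extend this recursion with extra states that score gapless profile hits; that extension is not attempted here.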

• Identification and annotation of promoter regions in microbial genome sequences on the basis of DNA stability

Analysis of various predicted structural properties of promoter regions in prokaryotic as well as eukaryotic genomes had earlier indicated that they have several common features, such as lower stability, higher curvature and less bendability, when compared with their neighboring regions. Based on the difference in stability between neighboring upstream and downstream regions in the vicinity of experimentally determined transcription start sites, a promoter prediction algorithm has been developed to identify prokaryotic promoter sequences in whole genomes. The average free energy (E) over known promoter sequences and the difference (D) between E and the average free energy over the entire genome (G) are used to search for promoters in the genomic sequences. Using these cutoff values to predict promoter regions across the entire Escherichia coli genome, we achieved a reliability of 70% when the predicted promoters were cross-verified against the 960 transcription start sites (TSSs) listed in the Ecocyc database. Annotation of the whole E. coli genome for promoter regions could be carried out with 49% accuracy. The method is quite general and can be used to annotate the promoter regions of other prokaryotic genomes.

• Parsing regulatory DNA: General tasks, techniques, and the PhyloGibbs approach

In this review, we discuss the general problem of understanding transcriptional regulation from DNA sequence and prior information. The main tasks we discuss are predicting local regions of DNA, cis-regulatory modules (CRMs) that contain binding sites for transcription factors (TFs), and predicting individual binding sites. We review various existing methods, and then describe the approach taken by PhyloGibbs, a recent motif-finding algorithm that we developed to predict TF binding sites, and PhyloGibbs-MP, an extension to PhyloGibbs that tackles other tasks in regulatory genomics, particularly prediction of CRMs.

• Evolutionary insights from suffix array-based genome sequence analysis

Gene and protein sequence analyses, central components of studies in modern biology, are easily amenable to string matching and pattern recognition algorithms. The growing need to analyse whole genome sequences more efficiently and thoroughly has led to the emergence of new computational methods. Suffix trees and suffix arrays are data structures that are well known in many other areas and are highly suited to sequence analysis too. Here we report an improvement to the design of suffix array construction. The enhancement in versatility and scalability enabled by this approach is demonstrated through the use of real-life examples.

The scalability of the algorithm to whole genomes renders it suitable to address many biologically interesting problems. One example is the evolutionary insight gained by analysing unigrams, bi-grams and higher n-grams, indicating that the genetic code has a direct influence on the overall composition of the genome. Further, different proteomes have been analysed for their coverage of the possible peptide space, indicating that as much as a quarter of the total space at the tetra-peptide level is left un-sampled in prokaryotic organisms, although almost all tri-peptides can be seen in one protein or another in a proteome. Besides, distinct patterns begin to emerge for the counts of particular tetra and higher peptides, indicative of a ‘meaning’ for tetra and higher n-grams.

The toolkit has also been used to demonstrate the usefulness of identifying repeats in whole proteomes efficiently. As an example, 16 members of one COG, coded by the genome of Mycobacterium tuberculosis H37Rv, have been found to contain a repeating sequence of 300 amino acids.
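To make the data structure concrete, the following is a minimal sketch of a suffix array and two of the queries mentioned above: counting occurrences of a pattern and counting distinct n-grams (for example, to measure coverage of peptide space). The construction here is the naive sort-based one, chosen for brevity; it is not the improved construction reported in the paper, and all function names are hypothetical.

```python
def build_suffix_array(text):
    """Naive construction: sort suffix start positions by the suffix text."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def count_occurrences(text, sa, pattern):
    """Count occurrences of pattern by binary search over the suffix array."""
    m = len(pattern)

    def first_index(after_equal):
        # First position whose m-character prefix is >= pattern
        # (or > pattern when after_equal is True).
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            prefix = text[sa[mid]:sa[mid] + m]
            if prefix < pattern or (after_equal and prefix == pattern):
                lo = mid + 1
            else:
                hi = mid
        return lo

    return first_index(True) - first_index(False)


def distinct_ngrams(text, sa, n):
    """Count distinct n-grams; equal n-grams are adjacent in suffix order."""
    count, previous = 0, None
    for start in sa:
        ngram = text[start:start + n]
        if len(ngram) == n and ngram != previous:
            count += 1
            previous = ngram
    return count


if __name__ == "__main__":
    text = "banana"
    sa = build_suffix_array(text)
    print(sa)
    print(count_occurrences(text, sa, "ana"))
    print(distinct_ngrams(text, sa, 2))
```

Dividing `distinct_ngrams(proteome, sa, 4)` by 20**4 would give the fraction of tetra-peptide space sampled, under the assumption that the proteome is supplied as one string with separators handled appropriately.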

• A method for computing the inter-residue interaction potentials for reduced amino acid alphabet

Inter-residue potentials are extensively used in the design and evaluation of protein structures. However, dealing with all (20×20) interactions becomes computationally difficult in extensive investigations. Hence, it is desirable to reduce the alphabet of 20 amino acids to a smaller number. Currently, several methods of reducing the residue types exist; however, a critical assessment of these methods is not available. Towards this goal, here we review and evaluate different methods by comparing them with the complete (20×20) matrix of the Miyazawa-Jernigan potential, including a method of grouping adopted by us, based on multidimensional scaling (MDS). The second goal of this paper is the computation of inter-residue interaction energies for the reduced amino acid alphabet, which has not been explicitly addressed in the literature until now. By using a least squares technique, we present a systematic method of obtaining the interaction energy values for any type of grouping scheme that reduces the amino acid alphabet. This can be valuable in designing protein structures.
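A small sketch may help show what a least-squares reduction looks like in its simplest form. For an unweighted fit, the reduced energy that minimizes the squared error for a pair of groups is the mean of the full-matrix entries it replaces. The sketch below assumes exactly that unweighted form, which may differ from the weighting used in the paper, and the four-residue energy table is made-up toy data, not Miyazawa-Jernigan values.

```python
# Toy symmetric pair energies (hypothetical values for illustration only).
TOY_ENERGIES = {
    ("A", "A"): -2.0, ("A", "V"): -3.0, ("V", "V"): -4.0,
    ("A", "D"): -1.0, ("A", "E"): -1.0, ("V", "D"): -1.0, ("V", "E"): -1.0,
    ("D", "D"): 0.0, ("D", "E"): 0.2, ("E", "E"): 0.4,
}


def pair_energy(energies, a, b):
    """Look up a symmetric pair energy stored under either key order."""
    return energies[(a, b)] if (a, b) in energies else energies[(b, a)]


def reduce_potential(energies, groups):
    """Unweighted least-squares fit: average each block of the full matrix."""
    reduced = {}
    for index, g in enumerate(groups):
        for h in groups[index:]:
            block = [pair_energy(energies, a, b) for a in g for b in h]
            reduced[(g, h)] = sum(block) / len(block)
    return reduced


def squared_error(energies, groups, reduced):
    """Sum of squared deviations between full and reduced energies."""
    total = 0.0
    for index, g in enumerate(groups):
        for h in groups[index:]:
            for a in g:
                for b in h:
                    diff = pair_energy(energies, a, b) - reduced[(g, h)]
                    total += diff ** 2
    return total


if __name__ == "__main__":
    groups = ["AV", "DE"]  # each string is one letter of the reduced alphabet
    reduced = reduce_potential(TOY_ENERGIES, groups)
    print(reduced)
    print(squared_error(TOY_ENERGIES, groups, reduced))
```

The residual from `squared_error` gives one way to compare grouping schemes against the full matrix: a smaller residual means less information is lost by the reduction.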

• Protein mechanics: a route from structure to function

In order to better understand the mechanical properties of proteins, we have developed simulation tools which enable these properties to be analysed on a residue-by-residue basis. Although these calculations are relatively expensive with all-atom protein models, good results can be obtained much faster using coarse-grained approaches. The results show that proteins are surprisingly heterogeneous from a mechanical point of view and that functionally important residues often exhibit unusual mechanical behaviour. This finding offers a novel means for detecting functional sites and also potentially provides a route for understanding the links between structure and function in more general terms.

• Protein local conformations arise from a mixture of Gaussian distributions

The classical approaches for protein structure prediction rely either on homology of the protein sequence with a template structure or on ab initio calculations for energy minimization. These methods suffer from disadvantages such as the lack of availability of homologous template structures or an intractably large conformational search space, respectively. The recently proposed fragment library based approaches first predict the local structures, which can be used in conjunction with the classical approaches of protein structure prediction. The accuracy of the predictions is dependent on the quality of the fragment library. In this work, we have constructed a library of local conformation classes purely based on geometric similarity. The local conformations are represented using Geometric Invariants, properties that remain unchanged under transformations such as translation and rotation, followed by dimension reduction via principal component analysis. The local conformations are then modeled as a mixture of Gaussian probability distribution functions (PDFs). Each of the Gaussian PDFs corresponds to a conformational class, with the centroid representing the average structure of that class. We find 46 classes when we use an octapeptide as a unit of local conformation. The protein 3-D structure can now be described as a sequence of local conformational classes. Further, it was of interest to see whether the local conformations can be predicted from the amino acid sequences. To that end, we have analyzed the correlation between sequence features and the conformational classes.
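In symbols, a mixture of Gaussians has the standard form below. The notation ($x$ for the reduced vector of geometric invariants, $\pi_k$, $\mu_k$, $\Sigma_k$ for the weight, centroid and covariance of class $k$) is assumed here for illustration and is not taken from the paper; $K = 46$ corresponds to the octapeptide result quoted above.

$$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k), \qquad \sum_{k=1}^{K} \pi_k = 1, \quad \pi_k \ge 0$$

Here $\mu_k$ plays the role of the class centroid, that is, the average structure of class $k$. One natural way to assign a fragment to a class, again an assumption rather than the authors' stated rule, is to choose the $k$ that maximizes $\pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$.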

• Exploring conformational space using a mean field technique with MOLS sampling

The computational identification of all the low energy structures of a peptide given only its sequence is not an easy task even for small peptides, due to the multiple-minima problem and combinatorial explosion. We have developed an algorithm, called the MOLS technique, that addresses this problem, and have applied it to a number of different aspects of the study of peptide and protein structure. Conformational studies of oligopeptides, including loop sequences in proteins, have been carried out using this technique. In general, the calculations identified all the folds determined by previous studies, and in addition picked up other energetically favorable structures. The method was also used to map the energy surface of the peptides. In another application, we have used the MOLS technique to generate a library of low energy structures of an oligopeptide and combined it with a genetic algorithm to predict protein structures. The method has also been applied to explore the conformational space of loops in protein structures. Further, it has been applied to the problem of docking a ligand in its receptor site, with encouraging results.

• Analysis on sliding helices and strands in protein structural comparisons: A case study with protein kinases

Protein structural alignments are generally considered the ‘gold standard’ for alignment at the level of amino acid residues. In this study we have compared the quality of pairwise and multiple structural alignments of about 5900 homologous proteins from 718 families of known 3-D structures. We observe shifts in the alignment of regular secondary structural elements (helices and strands) between pairwise and multiple structural alignments. The differences between pairwise and multiple structural alignments within helical and β-strand regions often correspond to 4 and 2 residue positions respectively. Such shifts correspond approximately to “one turn” of these regular secondary structures. We have performed manual analysis explicitly on the family of protein kinases. We note shifts of one or two turns in helix-helix alignments obtained using pairwise and multiple structural alignments. The quality of the equivalent helix-helix and strand-strand pairs has been investigated in terms of their residue side-chain accessibilities. Our results indicate that the quality of the pairwise alignments is comparable to that of the multiple structural alignments and, in fact, is often better. We propose that pairwise alignment of protein structures should also be used in the formulation of methods for structure prediction and evolutionary analysis.

• Use of secondary structural information and Cα-Cα distance restraints to model protein structures with MODELLER

Protein secondary structure predictions and amino acid long range contact map predictions from the primary sequence of proteins have been explored to aid in modelling protein tertiary structures. In order to evaluate the usefulness of secondary structure and 3D residue contact prediction methods for modelling protein structures, we have used the known Q3 (alpha-helix, beta-strands and irregular turns/loops) secondary structure information, along with residue-residue contact information, as restraints for MODELLER. We present here results of our modelling studies on the 30 best resolved single domain protein structures of varied lengths. The results show that it is very difficult to obtain useful models even with 100% accurate secondary structure predictions and accurate residue contact predictions for up to 30% of residues in a sequence. The best models that we obtained for proteins of lengths 37, 70, 118, 136 and 193 amino acid residues have RMSDs of 4.17, 5.27, 9.12, 7.89 and 9.69, respectively. The results show that one can obtain better models for proteins which have a high percentage of alpha-helix content. This analysis further shows that the MODELLER restraint optimization program can be useful only if we have truly homologous structure(s) as a template, from which it derives numerous restraints, almost identical to the templates used. This analysis also clearly indicates that even if we satisfy several true residue-residue contact distances, for up to 30% of the sequence length, with fully known secondary structural information, we end up predicting model structures far from their corresponding native structures.

• ARC: Automated Resource Classifier for agglomerative functional classification of prokaryotic proteins using annotation texts

Functional classification of proteins is central to comparative genomics. The need for algorithms tuned to enable integrative interpretation of analytical data is felt globally. The availability of general, automated software with built-in flexibility will significantly aid this activity. We have prepared ARC (Automated Resource Classifier), which is open source software meeting the user requirements of flexibility. The default classification scheme based on keyword match is agglomerative and directs entries into any of the 7 basic non-overlapping functional classes: Cell wall, Cell membrane and Transporters ($\mathcal{C}$); Cell division ($\mathcal{D}$); Information ($\mathcal{I}$); Translocation ($\mathcal{L}$); Metabolism ($\mathcal{M}$); Stress ($\mathcal{R}$); Signal and communication ($\mathcal{S}$); and 2 ancillary classes: Others ($\mathcal{O}$) and Hypothetical ($\mathcal{H}$). The keyword library of ARC was built serially by first drawing keywords from Bacillus subtilis and Escherichia coli K12. In subsequent steps, this library was further enriched by collecting terms from the archaeal representative Archaeoglobus fulgidus, Gene Ontology, and Gene Symbols. ARC is 94.04% successful on 675,663 annotated proteins from 348 prokaryotes. Three examples are provided to illuminate the current perspectives on mycobacterial physiology and costs of proteins in 333 prokaryotes. ARC is available at http://arc.igib.res.in.

• Synonymous codon usage in different protein secondary structural classes of human genes: Implication for increased non-randomness of GC3 rich genes towards protein stability

The relationship between synonymous codon usage and different protein secondary structural classes was investigated using 401 Homo sapiens proteins extracted from the Protein Data Bank (PDB). A simple Chi-square test was used to assess the significance of the deviation between the observed and expected frequencies of 59 codons at the level of individual synonymous families in the four different protein secondary structural classes. It was observed that synonymous codon families show non-randomness in codon usage in the four different secondary structural classes. However, when the genes were classified according to their GC3 levels, there was an increase in non-randomness in the high GC3 group of genes. The non-randomness in codon usage was further tested among the same protein secondary structures belonging to four different protein folding classes of the high GC3 group of genes. The results show that in each protein secondary structural unit there exist some synonymous families that show class-specific codon usage patterns. Moreover, there is increased non-random behaviour of synonymous codons in the sheet structure of all secondary structural classes in the high GC3 group of genes. Biological implications of these results have been discussed.
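As an illustration of the kind of test described, here is a minimal sketch of a Chi-square statistic for one synonymous codon family within one structural class. How the expected counts were derived in the paper is not stated in the abstract; the sketch assumes they come from the family's overall codon proportions, and the counts in the example are made up.

```python
def chi_square(observed, expected):
    """Pearson Chi-square statistic for paired observed and expected counts."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def family_chi_square(counts_in_class, counts_overall):
    """Test one synonymous family in one structural class.

    Expected counts assume the class uses codons in the same proportions as
    the family does overall (an assumption made for this sketch).
    Returns (statistic, degrees_of_freedom).
    """
    total_class = sum(counts_in_class.values())
    total_all = sum(counts_overall.values())
    observed, expected = [], []
    for codon, n_all in counts_overall.items():
        observed.append(counts_in_class.get(codon, 0))
        expected.append(total_class * n_all / total_all)
    return chi_square(observed, expected), len(counts_overall) - 1


if __name__ == "__main__":
    # Hypothetical counts for the two lysine codons.
    overall = {"AAA": 40, "AAG": 60}
    in_helix = {"AAA": 30, "AAG": 20}
    stat, dof = family_chi_square(in_helix, overall)
    print(stat, dof)
```

With one degree of freedom, a statistic above 3.841 is significant at the 5% level, so the toy family above would be flagged as non-random in that class.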

• Cytoview: Development of a cell modelling framework

The biological cell, a natural self-contained unit of prime biological importance, is an enormously complex machine that can be understood at many levels. A higher-level perspective of the entire cell requires integration of various features into coherent, biologically meaningful descriptions. There are some efforts to model cells based on their genome, proteome or metabolome descriptions. However, there are no established methods as yet to describe cell morphologies, capture similarities and differences between different cells or between healthy and disease states. Here we report a framework to model various aspects of a cell and integrate knowledge encoded at different levels of abstraction, with cell morphologies at one end to atomic structures at the other. The different issues that have been addressed are ontologies, feature description and model building. The framework describes dotted representations and tree data structures to integrate diverse pieces of data and parametric models enabling size, shape and location descriptions. The framework serves as a first step in integrating different levels of data available for a biological cell and has the potential to lead to development of computational models in our pursuit to model cell structure and function, from which several applications can flow out.

• Sub classification and targeted characterization of prophage-encoded two-component cell lysis cassette

Bacteriophage induced lysis of host bacterial cell is mediated by a two component cell lysis cassette comprised of holin and lysozyme. Prophages are integrated forms of bacteriophages in bacterial genomes providing a repertoire for bacterial evolution. Analysis using the prophage database (http://bicmku.in:8082) constructed by us showed 47 prophages were associated with putative two component cell lysis genes. These proteins cluster into four different subgroups. In this process, a putative holin (essd) and endolysin (ybcS), encoded by the defective lambdoid prophage DLP12 was found to be similar to two component cell lysis genes in functional bacteriophages like p21 and P1. The holin essd was found to have a characteristic dual start motif with two transmembrane regions and C-terminal charged residues as in class II holins. Expression of a fusion construct of essd in Escherichia coli showed slow growth. However, under appropriate conditions, this protein could be over expressed and purified for structure function studies. The second component of the cell lysis cassette, ybcS, was found to have an N-terminal SAR (Signal Arrest Release) transmembrane domain. The construct of ybcS has been over expressed in E. coli and the purified protein was functional, exhibiting lytic activity against E. coli and Salmonella typhi cell wall substrate. Such targeted sequence-structure-function characterization of proteins encoded by cryptic prophages will help understand the contribution of prophage proteins to bacterial evolution.

• The p53-MDM2 network: from oscillations to apoptosis

The p53 protein is well-known for its tumour suppressor function. The p53-MDM2 negative feedback loop constitutes the core module of a network of regulatory interactions activated under cellular stress. In normal cells, the level of p53 proteins is kept low by MDM2, i.e. MDM2 negatively regulates the activity of p53. In the case of DNA damage, the p53-mediated pathways are activated leading to cell cycle arrest and repair of the DNA. If repair is not possible due to excessive damage, the p53-mediated apoptotic pathway is activated bringing about cell death. In this paper, we give an overview of our studies on the p53-MDM2 module and the associated pathways from a systems biology perspective. We discuss a number of key predictions, related to some specific aspects of cell cycle arrest and cell death, which could be tested in experiments.

• Type 2 diabetes mellitus: phylogenetic motifs for predicting protein functional sites

Diabetes mellitus, commonly referred to as diabetes, is a medical condition associated with abnormally high levels of glucose (or sugar) in the blood. With this in view, we demonstrate the identification of phylogenetic motifs (PMs) in type 2 diabetes mellitus, which very likely correspond to protein functional sites. In this article, we have identified PMs for all the candidate genes for type 2 diabetes mellitus. Glycine 310 remains conserved for glucokinase and potassium channel KCNJ11. Isoleucine 137 was conserved for insulin receptor and regulatory subunit of a phosphorylating enzyme. The residues valine, leucine and methionine were highly conserved for insulin receptor. Occurrence of proline was very high for calpain 10 gene and glucose transporter

• The next step in biology: A periodic table?

Systems biology is an approach to explain the behaviour of a system in relation to its individual components. Synthetic biology uses key hierarchical and modular concepts of systems biology to engineer novel biological systems. In my opinion, the next step in biology is to use molecule-to-phenotype data with these approaches and integrate them in the form of a periodic table. A periodic table in biology would provide a chassis to classify, systematize and compare the diversity of component properties vis-a-vis system behaviour. Using a periodic table, it could be possible to compute higher-level interactions from component properties. This paper examines the concept of building a bio-periodic table using the protein fold as the fundamental unit.

• Modularized study of human calcium signalling pathway

Signalling pathways are complex biochemical networks responsible for regulation of numerous cellular functions. These networks function by serial and successive interactions among a large number of vital biomolecules and chemical compounds. For deciphering and analysing the underlying mechanism of such networks, a modularized study is quite helpful. Here we propose an algorithm for modularization of the calcium signalling pathway of H. sapiens. The idea that “a node whose function is dependent on the maximum number of other nodes tends to be the center of a sub network” is used to divide a large signalling network into smaller sub networks. Inclusion of node(s) into sub network(s) is dependent on the outdegree of the node(s). Here the outdegree of a node refers to the number of relations of the considered node lying outside the constructed sub network. Node(s) having more than c relations lying outside the expanding sub network have to be excluded from it. Here c is a specified variable based on user preference, which is finally fixed during adjustments of the created sub networks, so that certain biological significance can be conferred on them.
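The procedure described above can be sketched roughly as follows. This is one possible reading of the abstract, not the authors' implementation: the network is treated as an undirected graph, the center is taken to be the unassigned node with the most relations, and a neighbouring node joins the growing sub network only if at most `c` of its relations lie outside it. The function name and the toy network are hypothetical.

```python
def modularize(graph, c):
    """Split a network into sub networks grown around high-degree centers.

    graph maps each node to the set of nodes it is related to (undirected).
    A candidate joins a sub network only if at most c of its relations
    lie outside the sub network at the moment it is considered.
    """
    unassigned = set(graph)
    modules = []
    while unassigned:
        # Center: the unassigned node with the most unassigned relations
        # (ties broken alphabetically via the sorted order).
        center = max(sorted(unassigned),
                     key=lambda n: len(graph[n] & unassigned))
        module = {center}
        grew = True
        while grew:
            grew = False
            neighbours = {m for n in module for m in graph[n]}
            frontier = sorted(neighbours & (unassigned - module))
            for node in frontier:
                outside = len(graph[node] - module)  # the node's "outdegree"
                if outside <= c:
                    module.add(node)
                    grew = True
        modules.append(module)
        unassigned -= module
    return modules


if __name__ == "__main__":
    # Toy network: two hubs (A and E) joined through node D.
    toy = {
        "A": {"B", "C", "D"}, "B": {"A"}, "C": {"A"},
        "D": {"A", "E"},
        "E": {"D", "F", "G"}, "F": {"E"}, "G": {"E"},
    }
    print(modularize(toy, c=1))
    print(modularize(toy, c=0))
```

Raising `c` lets bridging nodes such as `D` be absorbed into a neighbouring sub network, while lowering it leaves them as separate units, which mirrors the role of c as a user-tunable parameter.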

• Gene ordering in partitive clustering using microarray expressions

A central step in the analysis of gene expression data is the identification of groups of genes that exhibit similar expression patterns. Clustering and ordering the genes using gene expression data into homogeneous groups was shown to be useful in functional annotation, tissue classification, regulatory motif identification, and other applications. Although there is a rich literature on gene ordering in the hierarchical clustering framework for gene expression analysis, there is, to the best knowledge of the authors, no work addressing and evaluating the importance of gene ordering in the partitive clustering framework. Outside the framework of hierarchical clustering, different gene ordering algorithms are applied on the whole data set, and the domain of partitive clustering is still unexplored with gene ordering approaches. A new hybrid method is proposed for ordering genes in each of the clusters obtained from a partitive clustering solution, using microarray gene expressions. Two existing algorithms for optimally ordering cities in the travelling salesman problem (TSP), namely FRAG_GALK and Concorde, are hybridized individually with the self-organizing map to show the importance of gene ordering in the partitive clustering framework. We validated our hybrid approach using yeast and fibroblast data and showed that our approach improves the result quality of the partitive clustering solution by identifying subclusters within big clusters, grouping functionally correlated genes within clusters, minimizing the summation of gene expression distances, and maximizing biological gene ordering using MIPS categorization. Moreover, the new hybrid approach finds comparable or sometimes superior biological gene order in less computation time than that obtained by optimal leaf ordering in a hierarchical clustering solution.
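To illustrate the idea of ordering genes within one cluster as a travelling-salesman-style path, here is a minimal sketch. It substitutes a simple greedy nearest-neighbour heuristic for the FRAG_GALK and Concorde solvers named above, assumes the cluster has already been produced by some partitive method, and uses made-up expression profiles; the function names are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def order_genes(profiles):
    """Greedy nearest-neighbour ordering of the genes in one cluster.

    profiles maps gene name to a list of expression values. The path starts
    at the alphabetically first gene and repeatedly moves to the closest
    gene not yet placed.
    """
    genes = sorted(profiles)
    order = [genes[0]]
    remaining = set(genes[1:])
    while remaining:
        last = profiles[order[-1]]
        nearest = min(sorted(remaining),
                      key=lambda g: euclidean(last, profiles[g]))
        order.append(nearest)
        remaining.remove(nearest)
    return order


def path_length(order, profiles):
    """Sum of distances between consecutive genes in the ordering."""
    return sum(euclidean(profiles[a], profiles[b])
               for a, b in zip(order, order[1:]))


if __name__ == "__main__":
    cluster = {"g1": [0, 0], "g2": [5, 5], "g3": [1, 0], "g4": [4, 5]}
    ordering = order_genes(cluster)
    print(ordering, path_length(ordering, cluster))
```

The `path_length` value corresponds to the summed gene expression distance that the abstract seeks to minimize; an exact TSP solver would search for the ordering with the smallest such sum rather than building it greedily.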

• Analysis of breast cancer progression using principal component analysis and clustering

We develop a new technique to analyse microarray data which uses a combination of principal components analysis and consensus ensemble k-clustering to find robust clusters and gene markers in the data. We apply our method to a public microarray breast cancer dataset which has expression levels of genes in normal samples as well as in three pathological stages of disease; namely, atypical ductal hyperplasia or ADH, ductal carcinoma in situ or DCIS and invasive ductal carcinoma or IDC. Our method averages over clustering techniques and data perturbation to find stable, robust clusters and gene markers. We identify the clusters and their pathways with distinct subtypes of breast cancer (Luminal, Basal and Her2+). We confirm that the cancer phenotype develops early (in early hyperplasia or ADH stage) and find from our analysis that each subtype progresses from ADH to DCIS to IDC along its own specific pathway, as if each was a distinct disease.

• # Journal of Biosciences
