pp 671-682 Articles
The PubMed literature database is a valuable source of information for scientific research. It is rich in biomedical literature, with more than 24 million citations. Data-mining of such voluminous literature is a challenging task. Although several text-mining algorithms with a focus on data visualization have been developed in recent years, they are limited in speed, are rigid, and are not available as open source. We have developed an R package, pubmed.mineR, in which we have combined the advantages of existing algorithms, overcome their limitations, and offer user flexibility and links with other packages in Bioconductor and the Comprehensive R Archive Network (CRAN) in order to expand the user's capabilities for executing multifaceted approaches. Three case studies are presented, namely, `Evolving role of diabetes educators', `Cancer risk assessment' and `Dynamic concepts on disease and comorbidity', to illustrate the use of pubmed.mineR. The package generally runs fast, with small elapsed times on regular workstations, even on large corpus sizes and with compute-intensive functions. pubmed.mineR is available at http://cran.r-project.org/web/packages/pubmed.mineR.
pp 683-699 Articles
The representation of proteins as networks of interacting amino acids, referred to as protein contact networks (PCN), and their subsequent analysis using graph theoretic tools, can provide novel insights into the key functional roles of specific groups of residues. We have characterized the networks corresponding to the native states of 66 proteins (belonging to different families) in terms of their core–periphery organization. The resulting hierarchical classification of the amino acid constituents of a protein arranges the residues into successive layers – having higher core order – with increasing connection density, ranging from a sparsely linked periphery to a densely intra-connected core (distinct from the earlier concept of protein core defined in terms of the three-dimensional geometry of the native state, which has least solvent accessibility). Our results show that residues in the inner cores are more conserved than those at the periphery. Underlining the functional importance of the network core, we see that the receptor sites for known ligand molecules of most proteins occur in the innermost core. Furthermore, the association of residues with structural pockets and cavities in binding or active sites increases with the core order. From mutation sensitivity analysis, we show that the probability of deleterious or intolerant mutations also increases with the core order. We also show that stabilization centre residues are in the innermost cores, suggesting that the network core is critically important in maintaining the structural stability of the protein. A publicly available Web resource for performing core–periphery analysis of any protein whose native state is known is provided at http://www.imsc.res.in/~sitabhra/proteinKcore/index.html.
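The layering described above can be illustrated with a small sketch. It assumes that "core order" corresponds to the standard k-core index of a graph, which the abstract does not state explicitly; the residue labels and contact list are hypothetical.

```python
from collections import defaultdict


def core_numbers(edges):
    """Return {node: core order} for an undirected graph given as (u, v) pairs."""
    adjacency = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-contacts
            adjacency[u].add(v)
            adjacency[v].add(u)

    degree = {node: len(nbrs) for node, nbrs in adjacency.items()}
    remaining = set(adjacency)
    core = {}
    k = 0
    while remaining:
        # Peel off the node with the smallest remaining degree.
        node = min(remaining, key=lambda n: degree[n])
        k = max(k, degree[node])  # core order never decreases while peeling
        core[node] = k
        remaining.remove(node)
        for nbr in adjacency[node]:
            if nbr in remaining:
                degree[nbr] -= 1
    return core


# Hypothetical contact network: a tightly linked triangle plus one peripheral residue.
contacts = [("R1", "R2"), ("R2", "R3"), ("R1", "R3"), ("R3", "R4")]
orders = core_numbers(contacts)
```

Here the peripheral residue R4 receives core order 1, while R1, R2 and R3 form the innermost core with order 2.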
pp 701-708 Articles
Protein–protein interaction (PPI) networks are believed to be important sources of information related to biological processes and complex metabolic functions of the cell. Identifying protein complexes is of great importance for understanding cellular organization and functions of organisms. In this work, a method referred to as MIPCE is proposed to find protein complexes in a PPI network based on mutual information. MIPCE has been biologically validated using a GO-based score, and satisfactory results have been obtained. We have also compared our method with some well-known methods and obtained better results in terms of various parameters such as precision, recall and F-measure.
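As an illustration of the evaluation measures named above, the following sketch computes precision, recall and F-measure for predicted complexes against a reference set. The matching rule (an overlap score with a 0.2 threshold) is an assumption chosen for illustration, not necessarily the criterion used for MIPCE, and the protein identifiers are hypothetical.

```python
def overlap_score(predicted, reference):
    """Squared shared size over the product of sizes (assumed matching score)."""
    shared = len(predicted & reference)
    return shared * shared / (len(predicted) * len(reference))


def evaluate(predicted_complexes, reference_complexes, threshold=0.2):
    """Return (precision, recall, F-measure); inputs are lists of non-empty sets."""
    matched_predicted = sum(
        any(overlap_score(p, r) >= threshold for r in reference_complexes)
        for p in predicted_complexes
    )
    matched_reference = sum(
        any(overlap_score(p, r) >= threshold for p in predicted_complexes)
        for r in reference_complexes
    )
    precision = matched_predicted / len(predicted_complexes)
    recall = matched_reference / len(reference_complexes)
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical predicted and reference complexes.
predicted = [{"P1", "P2", "P3"}, {"P7", "P8"}]
reference = [{"P1", "P2", "P3", "P4"}, {"P5", "P6"}]
precision, recall, f_measure = evaluate(predicted, reference)
```

In this toy case one of two predicted complexes matches a reference complex and one of two reference complexes is recovered, so all three measures equal 0.5.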
pp 709-719 Articles
We used canonical correlation analysis (CCA) as an unsupervised statistical tool to describe related views of the same semantic object for identifying patterns. A pattern recognition technique based on CCA is proposed for finding a required genetic code in a DNA sequence. Two related but different objects were considered: one was a particular pattern, and the other was the test DNA sequence. CCA found correlations between two observations of the same semantic pattern and the test sequence. We conclude that the correlation is maximal at the position where the pattern occurs. As a case study, the potential of CCA was demonstrated on sequences from HIV-1 preferred integration sites. The subsequences flanking the integration site on the left and right were considered as the two views, and statistically significant relationships were established between these two views to elucidate the viral preference as an important factor for the correlation.
pp 721-730 Articles
Reduction of dimensionality has emerged as a routine process in modelling complex biological systems. A large number of feature selection techniques have been reported in the literature to improve model performance in terms of accuracy and speed. In the present article an unsupervised feature selection technique is proposed, using maximum information compression index as the dissimilarity measure and the well-known density-based cluster identification technique DBSCAN for identifying the largest natural group of dissimilar features. The algorithm is fast and less sensitive to the user-supplied parameters. Moreover, the method automatically determines the required number of features and identifies them. We used the proposed method for reducing dimensionality of a number of benchmark data sets of varying sizes. Its performance was also extensively compared with some other well-known feature selection methods.
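The dissimilarity measure named above can be sketched as follows. The sketch assumes the usual definition of the maximal information compression index, namely the smallest eigenvalue of the 2×2 covariance matrix of a feature pair; the abstract does not spell out the formula, and the DBSCAN grouping step is not shown.

```python
import math


def information_compression_index(x, y):
    """Smallest eigenvalue of the covariance matrix of features x and y.

    The value is zero when the two features are linearly dependent and grows
    as they become less redundant.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n

    trace = var_x + var_y
    determinant = var_x * var_y - cov_xy ** 2
    discriminant = max(trace ** 2 - 4 * determinant, 0.0)  # guard against rounding
    return (trace - math.sqrt(discriminant)) / 2


# Hypothetical feature vectors: one redundant pair and one uncorrelated pair.
redundant = information_compression_index([1, 2, 3, 4], [2, 4, 6, 8])
independent = information_compression_index([1, -1, 1, -1], [1, 1, -1, -1])
```

The redundant pair gives an index of 0, whereas the uncorrelated pair with unit variances gives 1, so larger values mark more dissimilar features.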
pp 731-740 Articles
Use of computational methods to predict gene regulatory networks (GRNs) from gene expression data is a challenging task. Many studies have been conducted using unsupervised methods to fulfill the task; however, such methods usually yield low prediction accuracies due to the lack of training data. In this article, we propose semi-supervised methods for GRN prediction by utilizing two machine learning algorithms, namely, support vector machines (SVM) and random forests (RF). The semi-supervised methods make use of unlabelled data for training. We investigated inductive and transductive learning approaches, both of which adopt an iterative procedure to obtain reliable negative training data from the unlabelled data. We then applied our semi-supervised methods to gene expression data of Escherichia coli and Saccharomyces cerevisiae, and evaluated the performance of our methods using the expression data. Our analysis indicated that the transductive learning approach outperformed the inductive learning approach for both organisms. However, there was no conclusive difference identified in the performance of SVM and RF. Experimental results also showed that the proposed semi-supervised methods performed better than existing supervised methods for both organisms.
pp 741-754 Articles
In this article, we have used an index called the Gaussian fuzzy index (GFI), recently developed by the authors and based on the notion of fuzzy set theory, for validating the clusters obtained by a clustering algorithm applied to cancer gene expression data. GFI is then used to identify genes whose mRNA expression patterns have altered significantly from the normal state to the carcinogenic state. The effectiveness of the methodology has been demonstrated on three cancer gene expression datasets dealing with human lung cancer, colon cancer and leukemia. The performance of GFI is compared with 19 existing cluster validity indices. The results are validated biologically and statistically. In this context, we have used biochemical pathways, 𝑝-value statistics of GO attributes, 𝑡-test and 𝑧-score for the validation of the results. We find that GFI is capable of identifying high-quality enriched clusters of genes, and thereby is able to select more cancer-mediating genes.
pp 755-767 Articles
A challenge in bioinformatics is to analyse the volumes of gene expression data generated through microarray experiments and obtain useful information. Consequently, most microarray studies demand complex data analysis to infer biologically meaningful information from such high-throughput data. Selection of informative genes is an important data analysis step to identify a set of genes which can further help in finding the biological information embedded in microarray data, and thus assists in diagnosis, prognosis and treatment of the disease. In this article we present an unsupervised feature selection technique which attempts to address the goal of explorative data analysis, unfolding the multi-faceted nature of data. It focuses on extracting multiple clustering views from high-dimensional data, considering the diversity of each view. We evaluated our technique on benchmark data sets, and the experimental results indicate the potential and effectiveness of the proposed model in comparison to the traditional single-view clustering models, as well as other existing methods used in the literature for the studied datasets.
pp 769-789 Articles
Various T-cell co-receptor molecules and the calcium channel CRAC play a pivotal role in maintaining the cell’s functional responses by regulating the production of effector molecules (mostly cytokines) that aid in immune clearance and by maintaining the cell in a functionally active state. Any defect in these co-receptor signalling pathways may lead to an altered expression pattern of the effector molecules. To study the propagation of such defects with time and their effect on intracellular protein expression patterns, a comprehensive, and the largest, pathway map of the T-cell activation network is reconstructed manually. The pathway reactions are then translated into logical equations and simulated using published time series microarray expression data as inputs. After validating the model, we study the effect of in silico knock-down of co-receptor molecules on the expression patterns of their downstream proteins and simultaneously predict the changes in the phenotypic behaviour of the T-cell population; the results show significant variations in protein expression and in the signalling routes through which the response is propagated in the cytoplasm. This integrative computational approach serves as a valuable technique to study the changes in protein expression patterns and helps to predict variations in cellular behaviour.
pp 791-798 Articles
MicroRNAs are a class of important post-transcriptional regulators. Genetic and somatic mutations in miRNAs, especially those in the seed regions, have profound and broad impacts on gene expression and on physiological and pathological processes. Over 500 SNPs were mapped to the miRNA seeds, which are located at positions 2–8 of the mature miRNA sequences. We found that the central positions of the miRNA seeds contain fewer genetic variants and therefore are more evolutionarily conserved than the peripheral positions in the seeds. We developed a knowledge-based method to analyse the functional impacts of mutations in miRNA seed regions. We computed the gene ontology-based similarity score GOSS and the GOSS percentile score for all 517 SNPs in miRNA seeds. In addition to the annotation of SNPs for their functional effects, in the present article we also present a detailed analysis pipeline for finding the key functional changes for seed SNPs. We performed a detailed gene ontology graph-based analysis of enriched functional categories for miRNA target gene sets. In the analysis of a SNP in the seed region of hsa-miR-96, we found that two key biological processes for progressive hearing loss, `Neurotrophin TRK receptor signaling pathway' and `Epidermal growth factor receptor signaling pathway', were significantly and differentially enriched by the two sets of allele-specific target genes of miRNA hsa-miR-96.
pp 799-808 Articles
Many methods have been developed for finding the commonalities between different organisms in order to study their phylogeny. The structure of metabolic networks also reveals valuable insights into the metabolic capacity of species as well as into the habitats where they have evolved. We constructed metabolic networks of 79 fully sequenced organisms and compared their architectures. We used the spectral density of the normalized Laplacian matrix for comparing the structure of networks. The eigenvalues of this matrix reflect not only the global architecture of a network but also the local topologies that are produced by different graph evolutionary processes such as motif duplication or joining. A divergence measure on spectral densities is used to quantify the distances between various metabolic networks, and a split network is constructed to analyse the phylogeny from these distances. In our analysis, we focused on species that belong to different classes but appear more closely related to each other in the phylogeny. We explored whether they have evolved under similar environmental conditions or have similar life histories. With this focus, we have obtained interesting insights into the phylogenetic commonality between different organisms.
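The spectral comparison described above starts from the eigenvalues of the normalized Laplacian. The following is a minimal sketch using NumPy; the Gaussian smoothing width `sigma`, the evaluation grid and the toy adjacency matrix are assumptions for illustration, and the divergence measure and split-network construction are not shown.

```python
import numpy as np


def normalized_laplacian_spectrum(adjacency):
    """Eigenvalues (ascending) of L = I - D^(-1/2) A D^(-1/2) for an undirected graph."""
    a = np.asarray(adjacency, dtype=float)
    degree = a.sum(axis=1)
    connected = degree > 0
    inv_sqrt = np.zeros_like(degree)
    inv_sqrt[connected] = 1.0 / np.sqrt(degree[connected])
    # Isolated nodes get a zero row and column, following the usual convention.
    laplacian = np.diag(connected.astype(float)) - inv_sqrt[:, None] * a * inv_sqrt[None, :]
    return np.linalg.eigvalsh(laplacian)


def spectral_density(eigenvalues, grid, sigma=0.05):
    """Gaussian-smoothed eigenvalue density on a grid, normalized to sum to 1."""
    diffs = grid[:, None] - eigenvalues[None, :]
    density = np.exp(-0.5 * (diffs / sigma) ** 2).sum(axis=1)
    return density / density.sum()


# Toy network: a triangle, whose normalized Laplacian eigenvalues are 0, 1.5, 1.5.
triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
eigenvalues = normalized_laplacian_spectrum(triangle)
grid = np.linspace(0.0, 2.0, 201)  # normalized Laplacian eigenvalues lie in [0, 2]
density = spectral_density(eigenvalues, grid)
```

Because every such spectrum lies in the interval [0, 2], densities of networks with different sizes can be evaluated on the same grid and compared directly.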
pp 809-818 Articles
Protein–protein interaction (PPI) site prediction helps to ascertain the interface residues that participate in interaction processes. A fuzzy support vector machine (F-SVM) is proposed as an effective method to solve this problem, and we show that the performance of the classical SVM can be enhanced with the help of an interaction-affinity based fuzzy membership function. The performances of both SVM and F-SVM on the PPI databases of Homo sapiens and E. coli are evaluated, and the statistical significance of the improvement of the developed method over the classical SVM and other fuzzy membership-based SVM methods available in the literature is estimated. Our membership function uses the residue-level interaction affinity scores for each pair of positive and negative sequence fragments. The average AUC scores in the 10-fold cross-validation experiments are 79.94% and 80.48% for Homo sapiens and E. coli, respectively. On the independent test datasets, the AUC scores are 76.59% and 80.17%, respectively, for the two organisms. In almost all cases, the developed F-SVM method improves on the performance of the corresponding classical SVM and of the other classifiers available in the literature.
pp 819-828 Articles
For socio-economic reasons, it is essential to design efficient, stress-tolerant, more nutritious, high-yielding rice varieties. A systematic understanding of rice cellular metabolism is essential for this purpose. Here, we analyse a genome-scale metabolic model of the rice leaf using Flux Balance Analysis to investigate whether it has the potential metabolic flexibility to increase the biosynthesis of any of the biomass components. We initially simulate the metabolic responses under an objective to maximize the biomass components. Using the estimated maximum value of biomass synthesis as a constraint, we further simulate the metabolic responses optimizing the cellular economy. Depending on the physiological conditions of a cell, the transport capacities of intracellular transporters (ICTs) can vary. To mimic this physiological state, we randomly vary the ICTs’ transport capacities and investigate their effects. The results show that the rice leaf has the potential to increase glycine and starch over a wide range, depending on the ICTs’ transport capacities. The predicted biosynthesis pathways vary slightly under the two different optimization conditions. With the constraint of biomass composition, the cell also has the metabolic plasticity to fix a wide range of carbon-nitrogen ratios.
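The two-stage optimization described above can be sketched on a toy network with SciPy's linear programming solver. Everything in the example is hypothetical: the three-metabolite network, the flux bounds, and in particular the reading of "cellular economy" as minimization of total flux, which is a common choice but is not stated in the abstract.

```python
import numpy as np
from scipy.optimize import linprog

# Toy stoichiometric matrix S (rows: metabolites A, B, C; columns: reactions).
# Reactions: uptake (-> A), direct (A -> B), long1 (A -> C), long2 (C -> B), biomass (B ->).
S = np.array([
    [1, -1, -1,  0,  0],   # A
    [0,  1,  0,  1, -1],   # B
    [0,  0,  1, -1,  0],   # C
], dtype=float)
steady_state = np.zeros(S.shape[0])          # S v = 0
bounds = [(0, 10)] + [(0, None)] * 4         # uptake capped at 10 (assumed)
biomass_index = 4

# Stage 1: maximize biomass flux (linprog minimizes, so negate the objective).
objective = np.zeros(S.shape[1])
objective[biomass_index] = -1.0
stage1 = linprog(objective, A_eq=S, b_eq=steady_state, bounds=bounds, method="highs")
max_biomass = -stage1.fun

# Stage 2: keep biomass at (nearly) its maximum and minimize total flux,
# an assumed stand-in for optimizing the cellular economy.
stage2_bounds = list(bounds)
stage2_bounds[biomass_index] = (0.999999 * max_biomass, None)
stage2 = linprog(np.ones(S.shape[1]), A_eq=S, b_eq=steady_state,
                 bounds=stage2_bounds, method="highs")
total_flux = stage2.fun
fluxes = stage2.x
```

In this toy network the first stage is limited by the uptake bound, and the second stage routes all flux through the direct reaction rather than the longer two-step route, since that lowers the total flux at the same biomass output.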
Volume 42 | Issue 4