In this article, we present some simple yet effective statistical techniques for analysing and comparing large DNA sequences. These techniques are based on frequency distributions of DNA words in a large sequence, and have been implemented in a software package called SWORDS. Using sequences available in public-domain databases on the Internet, we demonstrate how SWORDS can be conveniently used by molecular biologists and geneticists to unmask biologically important features hidden in large sequences and to assess their statistical significance.
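As a rough illustration of what a frequency distribution of DNA words looks like, the following sketch counts overlapping words of a fixed length in a sequence. It is not the SWORDS implementation; the function name `word_frequencies`, the choice of overlapping windows, and the toy sequence are assumptions made only for this example.

```python
from collections import Counter

VALID_BASES = set("ACGT")


def word_frequencies(sequence, k):
    """Return relative frequencies of overlapping DNA words of length k.

    Hypothetical helper for illustration; words containing characters
    other than A, C, G or T (e.g. N) are skipped.
    """
    seq = sequence.upper()
    counts = Counter(
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= VALID_BASES
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: n / total for word, n in counts.items()}


# Toy sequence (made up for the example, not real data).
freqs = word_frequencies("ATGCGATATGC", 3)
for word, freq in sorted(freqs.items()):
    print(word, round(freq, 3))
```

Comparing two sequences then amounts to comparing two such distributions, for instance word by word; which statistic is appropriate for that comparison is left open here.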
We compare the annotation of three complete genomes using the ab initio gene-identification methods GeneScan and GLIMMER. The annotation given in GenBank, the standard against which these are compared, was made using GeneMark. We find a number of novel genes that are predicted by both methods used here, as well as a number of genes that are predicted by GeneMark but are not identified by either of the non-consensus methods that we have used. The three organisms studied here are all prokaryotic species with fairly compact genomes. The Fourier measure forms the basis for an efficient non-consensus method for gene prediction, and the algorithm GeneScan exploits this measure. We have benchmarked this program as well as GLIMMER using three complete prokaryotic genomes. An effort has also been made to study the limitations of these techniques for complete genome analysis. GeneScan and GLIMMER are of comparable accuracy insofar as gene identification is concerned, with sensitivities and specificities typically greater than 0.9. The number of false predictions (both positive and negative) is higher for GeneScan than for GLIMMER, but in a significant number of cases the two techniques give similar results. This suggests that there could be some as-yet unidentified additional genes in these three genomes, and also that some of the putative identifications made hitherto might require re-evaluation. All these cases are discussed in detail.
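The idea behind a Fourier measure can be sketched as follows: protein-coding DNA tends to show a periodicity of three bases, so the spectral power at frequency 1/3 stands out against the average power. The code below is a minimal sketch of that idea under stated assumptions, not the GeneScan algorithm; the function name `period3_signal_to_noise`, the normalisation by the Parseval average, and the toy sequences are all choices made for this example.

```python
import cmath


def period3_signal_to_noise(sequence):
    """Ratio of spectral power at frequency 1/3 to the mean spectral power.

    Hypothetical helper for illustration. Each base (A, C, G, T) is
    turned into a 0/1 indicator sequence; the powers of the four
    indicator sequences are summed.
    """
    seq = sequence.upper()
    peak = 0.0
    counted = 0
    for base in "ACGT":
        positions = [i for i, b in enumerate(seq) if b == base]
        counted += len(positions)
        # Fourier sum of the indicator sequence at frequency 1/3.
        s = sum(cmath.exp(-2j * cmath.pi * p / 3) for p in positions)
        peak += abs(s) ** 2
    if counted == 0:
        return 0.0
    # By Parseval's theorem, the total power averaged over all N DFT
    # frequencies (including zero) equals the number of counted bases.
    return peak / counted


# Toy inputs (made up): a strictly 3-periodic sequence and a homopolymer.
periodic = period3_signal_to_noise("ATG" * 30)
flat = period3_signal_to_noise("A" * 90)
print(periodic, flat)
```

In practice such a ratio would be evaluated in a sliding window along the genome and compared with a threshold; the window length and threshold are not specified here.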
We have analysed the genomes of representatives of three kingdoms of life, namely archaea, eubacteria and eukaryota, using data mining tools based on compositional analyses of the protein sequences. The representatives chosen in this analysis were Methanococcus jannaschii, Haemophilus influenzae and Saccharomyces cerevisiae. We have identified the features of the protein evolution patterns that are common to the three genomes and those that differ between them. M. jannaschii was seen to have a greater number of proteins with more charged amino acids, whereas S. cerevisiae was observed to have a greater number of hydrophilic proteins. Despite the differences in intrinsic compositional characteristics between the proteins from the different genomes, we have also identified certain common characteristics. We have carried out exploratory Principal Component Analysis of the multivariate data on the proteins of each organism in an effort to classify the proteins into clusters. Interestingly, we found that most of the proteins in each organism cluster closely together, but there are a few ‘outliers’. We focus on the outliers for functional investigation, which may aid in revealing any unique features of the biology of the respective organisms.
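A compositional analysis of this kind starts from per-protein amino acid frequencies. The sketch below computes a 20-component composition vector and the fraction of charged residues for one protein. It is an illustration only: the function names, the definition of "charged" as D, E, K and R, and the toy sequence are assumptions, and the original study's exact variables are not reproduced.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
CHARGED = set("DEKR")  # assumption: histidine is not counted as charged


def composition(protein):
    """Return the relative frequency of each of the 20 standard amino acids."""
    residues = [aa for aa in protein.upper() if aa in AMINO_ACIDS]
    n = len(residues)
    if n == 0:
        return {aa: 0.0 for aa in AMINO_ACIDS}
    return {aa: residues.count(aa) / n for aa in AMINO_ACIDS}


def charged_fraction(protein):
    """Return the fraction of residues that are D, E, K or R."""
    comp = composition(protein)
    return sum(comp[aa] for aa in CHARGED)


# Toy peptide (made up for the example, not a real protein).
toy = "DEKRAAAA"
print(composition(toy)["A"], charged_fraction(toy))
```

Stacking such vectors for all proteins of an organism gives the kind of multivariate table on which a Principal Component Analysis could then be run.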
Bacterial genomes are extremely dynamic and mosaic in nature. A substantial amount of genetic information is inserted into or deleted from such genomes through the process of horizontal transfer. Through the introduction of novel physiological traits from distantly related organisms, horizontal gene transfer often causes drastic changes in the ecological and pathogenic character of bacterial species and thereby promotes microbial diversification and speciation. This review discusses how the recent influx of complete chromosomal sequences of various microorganisms has allowed for a quantitative assessment of the scope, rate and impact of horizontally transmitted information on microbial evolution.
Schizophrenia is a severe neuropsychiatric disorder with a polygenic mode of inheritance, which is also governed by non-genetic factors. Candidate genes identified on the basis of biochemical and pharmacological evidence are being tested in linkage and association studies. Neurotransmitters, especially dopamine and serotonin, have been widely implicated in its etiology. Genome scans of all human chromosomes with closely spaced polymorphic markers are being used for linkage studies. The completion and availability of the first draft of the human genome sequence has provided a treasure trove that can be used to gain insight into hitherto inaccessible regions of the human genome. Significant technological advances in the identification of single nucleotide polymorphisms (SNPs) and the use of microarrays have further strengthened research methodologies for the genetic analysis of complex traits. In this review, we summarize the evolution of schizophrenia genetics from the past to the present, current trends and the future direction of research.
Fourteen genetic neurodegenerative diseases and three fragile sites have been associated with the expansion of (CTG)n•(CAG)n, (CGG)n•(CCG)n, or (GAA)n•(TTC)n repeat tracts. Different models have been proposed for the expansion of triplet repeats, most of which presume the formation of alternative DNA structures in repeat tracts. One of the most likely structures, slipped strand DNA, may stably and reproducibly form within triplet repeat sequences. The propensity to form slipped strand DNA is proportional to the length and homogeneity of the repeat tract. The remarkable stability of slipped strand DNA may, in part, be due to loop-loop interactions facilitated by the sequence complementarity of the loops and the dynamic structure of three-way junctions formed at the loop-outs.
The pattern of angiotensin-converting enzyme (ACE) gene insertion/deletion (I/D) polymorphism in the Indian population is poorly characterized. In order to determine the status of the polymorphism, young unrelated male army recruits were screened. The population had cultural and linguistic differences and lived in an environment that varied significantly from one region to another. Genotype analysis showed a higher frequency of the insertion (I) allele in four of the five groups; the I allele frequency was significantly higher (P < 0.05) in Dogras, Assamese and Kumaonese. The deletion (D) allele frequency was comparatively higher in the fifth group, which belonged to Punjab. A correlation was observed between the genotype and enzyme activity. Involvement of a single D allele in the genotype enhanced the activity up to 37.56 ± 3.13%. The results suggested ethnic heterogeneity with a significant gene cline with higher insertion allele frequency. Such population-based data on various polymorphisms can ultimately be exploited in pharmacogenomics.
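For readers unfamiliar with how allele frequencies follow from genotype counts, a minimal sketch is given here. Each individual carries two alleles, so the I allele frequency is (2 × II + ID) / (2 × N). The function name `allele_frequencies` and the genotype counts are hypothetical and are not taken from the study.

```python
def allele_frequencies(n_ii, n_id, n_dd):
    """Return (I, D) allele frequencies from II, ID and DD genotype counts.

    Hypothetical helper for illustration: each II individual contributes
    two I alleles, each ID individual one I and one D allele.
    """
    total_alleles = 2 * (n_ii + n_id + n_dd)
    if total_alleles == 0:
        raise ValueError("at least one genotyped individual is required")
    freq_i = (2 * n_ii + n_id) / total_alleles
    return freq_i, 1.0 - freq_i


# Made-up counts for one group of 100 individuals (not data from the study).
freq_i, freq_d = allele_frequencies(n_ii=30, n_id=50, n_dd=20)
print(f"I allele: {freq_i:.2f}, D allele: {freq_d:.2f}")
```

Whether a difference in such frequencies between groups is significant would need a separate statistical test, which is not shown here.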
Volume 42 | Issue 4