Open access toolkit for nonparametric explorative pattern mining to detect genomic events relating to disease in large scale genome sequences
Winner - Bioinformatics
Madhavi Ganapathiraju; Assistant Professor
Content:
Events such as gene duplication, variations in tandem repeats, and abnormal methylation of CpG islands are often markers of disease (Alzheimer's disease, coronary heart disease, and cancer, respectively). Locating the regions of an individual genome that contain these events can support diagnosis of these disorders, including assessment of severity and age of onset. Currently there are no tools for nonparametric pattern mining on large-scale genome sequence data. We developed a suite of nonparametric tools for large-scale explorative pattern mining on genome sequences and applied them to human chromosomes X and 19. The tools allow the biomedical and genomics community to study genomes efficiently at many resolutions and in flexible ways, and to draw inferences quickly.
Technology:
The algorithms are implemented in C and can be compiled and run on any Unix platform, or on Windows with Cygwin.
Design:
The genome sequence is preprocessed into an efficient data structure called a suffix array, together with its well-known augmentations, the longest common prefix (LCP) array and the rank array. The language modeling toolkit computes perplexity, a measure of how predictable the nth nucleotide is given the (n-1) preceding nucleotides; lower perplexity means a more predictable sequence. Repeat-rich regions are indicated by a drop in perplexity.
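As an illustration of the three arrays named above, the following is a minimal sketch in Python. It is not the toolkit's C implementation. The function names are hypothetical, the suffix array is built by naive sorting (far too slow for a chromosome, but easy to follow), and the LCP array uses the standard linear-time method of Kasai et al., which relies on the rank array.

```python
def build_suffix_array(text):
    """Start positions of all suffixes of text, in lexicographic order.
    Naive sort-based construction, for illustration only."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def build_rank_array(sa):
    """Inverse of the suffix array: rank[i] is the position of suffix i in sa."""
    rank = [0] * len(sa)
    for pos, start in enumerate(sa):
        rank[start] = pos
    return rank


def build_lcp_array(text, sa, rank):
    """lcp[k] is the length of the longest common prefix of the suffixes
    at sa[k] and sa[k - 1]; lcp[0] is 0 (Kasai et al. method)."""
    n = len(text)
    lcp = [0] * n
    h = 0  # match length carried over from the previous suffix
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]  # suffix just before suffix i in sorted order
            while i + h < n and j + h < n and text[i + h] == text[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1
        else:
            h = 0
    return lcp


def longest_repeat(text):
    """Longest substring occurring at least twice, read off the LCP maximum."""
    sa = build_suffix_array(text)
    rank = build_rank_array(sa)
    lcp = build_lcp_array(text, sa, rank)
    if not lcp:
        return ""
    best = max(range(len(lcp)), key=lambda k: lcp[k])
    return text[sa[best]:sa[best] + lcp[best]]


if __name__ == "__main__":
    sa = build_suffix_array("banana")
    rank = build_rank_array(sa)
    print(sa)                                 # [5, 3, 1, 0, 4, 2]
    print(rank)                               # [3, 2, 5, 1, 4, 0]
    print(build_lcp_array("banana", sa, rank))  # [0, 1, 3, 0, 0, 2]
    print(longest_repeat("banana"))           # ana
```

Large LCP values mark suffixes that share long prefixes, which is why these arrays are a natural starting point for locating repeats.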
Results:
The tools can perform three types of analyses: (1) compute how complex a genome sequence is, (2) find where repetitive sequences are, and (3) compare how alike or different two sequences are. When applied to the human X chromosome, they revealed the repeat-rich p-arm and centromere (indicated by dips in perplexity in the Figure). Closer analysis of these regions, specifically around windows 47-50, revealed highly abundant n-grams (see the high peaks of n-gram counts in the Figure) that are rare in the rest of the chromosome. Analysis of chromosome 19 showed a number of repeat elements, a possible location of the centromere, and at least one reverse complement region. The tools are parameter-free and scalable, and can aid discovery of patterns that have biomedical significance.
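To make the link between perplexity dips and repeats concrete, here is a small Python sketch, not the toolkit's own code. It assumes a maximum-likelihood n-gram model estimated separately within each non-overlapping window; the abstract does not say how the toolkit estimates its model, and the function names, window size, and toy sequences are hypothetical.

```python
import math
from collections import Counter


def window_perplexity(window, n=2):
    """Perplexity of predicting each nucleotide from the (n - 1) before it,
    using n-gram counts taken from the window itself (an assumption)."""
    ngrams = Counter(window[i:i + n] for i in range(len(window) - n + 1))
    total = sum(ngrams.values())
    if total == 0:
        raise ValueError("window is shorter than n")
    # Count how often each (n - 1)-gram context is followed by a nucleotide.
    contexts = Counter()
    for gram, count in ngrams.items():
        contexts[gram[:-1]] += count
    log_prob = sum(
        count * math.log2(count / contexts[gram[:-1]])
        for gram, count in ngrams.items()
    )
    return 2 ** (-log_prob / total)


def perplexity_profile(sequence, window_size, n=2):
    """Perplexity of each consecutive, non-overlapping window."""
    return [
        window_perplexity(sequence[start:start + window_size], n)
        for start in range(0, len(sequence) - window_size + 1, window_size)
    ]


if __name__ == "__main__":
    varied = "AACAGATCCGCTGGTTA"    # every dinucleotide occurs exactly once
    repeat = "ACACACACACACACACA"    # a simple tandem repeat
    print(perplexity_profile(varied + repeat, window_size=17))  # [4.0, 1.0]
```

In this toy profile the first window is as unpredictable as a four-letter alphabet allows (perplexity 4), while the tandem repeat is fully predictable (perplexity 1), which is the kind of dip described above.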
Conclusion:
The toolkit, which will be released as open access with complete functionality, can be applied to discover a number of genomic events.
