Bioinformatics isn’t just for research anymore; it’s being used for treatments that can save lives. Rapid and high-accuracy analysis at lower cost is needed. Next-generation genetic sequencing systems are creating a huge amount of fragmented data that have to be analyzed and re-assembled. Current methods aren’t ideal—they’re either fast but lack accuracy, or require impractical long runtimes for the highest accuracy algorithms. Other follow-on analytics are equally tough compute challenges. AP technology can speed up solutions manyfold.
Next Gen Sequencing Analysis
Next-generation sequencing (NGS) of genetic information with its associated massive data output is at the heart of many of the most exciting scientific and medical advances—especially for cancers and hereditary diseases. These discoveries depend on high-performance computing to enable these insights, but the bioinformatics compute load is growing faster than Moore’s law for conventional processors. Modern NGS systems are many orders of magnitude faster than “first generation” Sanger-type systems.
This chart shows the enormous NGS-driven growth rate for the number of bases stored just in the National Institute of Health’s GenBank genetic database. There is no slowdown in sight, as NGS system companies drive down the cost and increase throughput, to the point where in the not-so-distant future, individual human sequencing for more routine clinical use will be quite affordable. For example, there is an ongoing revolution in “targeted meds” designed to work precisely against cancer tumors with very specific mutations—making treatment more effective and less toxic than traditional “chemo.”
However, as genomic data accummulates, motif search and RNA sequencing are just a few examples among many for follow-on analytics of NGS re-assembled genetic data.
NGS Sequence Re-Assembly
Why does the NGS approach create such an enormous compute task in the first place? The answer is that NGS systems must heavily fragment the whole-gene samples in order to achieve massively parallel throughput in the instrument. This “shotgun sequencing” technique does not produce entire genomes. Instead, it generates many thousands of small DNA fragments. The ends of these fragments overlap and, when aligned properly by sequence re-assembly algorithms, can be used to reconstruct the complete digital version of the genome. For a genome as large as the human genome (with over 3 billion base pairs), it can take many days of CPU time on very large server clusters to assemble the fragments. Fast and accurate assembly algorithms are a critical area of bioinformatics research.
There are two very different computational and algorithmic approaches to assembling the fragments into a complete genome. Dynamic programming algorithms are highly accurate but terribly compute-intense. Heuristic algorithms are orders of magnitude faster but significantly less accurate.
Dynamic algorithms for sequence alignment, such as Smith-Waterman or Burrows-Wheeler, give the highest quality results. These methods involve constructing the n-dimensional equivalent of the sequence matrix formed from two sequences, where n is the number of sequences in the query. Dynamic programming computations are first used on all pairs of query sequences and then the “alignment space” is filled in by considering possible matches or gaps at intermediate positions. With AP hardware acceleration, there is potential to use these much more accurate analysis algorithms on much bigger datasets, or perhaps even entire genomes.
Even more heavily used for reassembly are heuristic algorithms such as FASTA or the popular Basic Local Alignment Search Tool (BLAST). These probabilistic, less accurate methods were designed for orders of magnitude faster runtimes for large-scale genetic database analysis. However, they do not guarantee finding optimal matches. BLAST finds similar sequences, not by comparing either sequence in its entirety, but rather by heuristically locating short matches between the two sequences (“seeding”). Then BLAST begins to make local DNA alignments. These results will then be used to build an alignment.
After making words for the sequence of interest, neighboring words are also assembled using thresholds and a scoring matrix. They can still take significant time to finish, from days to weeks, so there can be a large runtime advantage for AP hardware acceleration for heuristic approaches as well.
Searching and analyzing symbols of the genetic alphabet is precisely the type of task that Micron’s Automata Processing is so well suited for. It is massively parallel, cost effective, and straightforward to program for accelerating host bioinformatics software.
Finding approximately conserved sequences, called “motifs,” across multiple DNA or protein sequences is an important problem in biology. Discovery of these patterns or motifs helps in many ways:
- Identification of transcription factor binding sites and their functional consequences
- Identification of transcriptional regulatory elements and their functional consequences
- Finding variants that cause genetic diseases, including cancers
- Suggesting therapeutic drug targets, such as for specific genetic or tumor mutations
Motif search has three types:
- PMP - Planted Motif search Problem
- SMP - Simple Motif search Problem
- EMP - Edit-distance-based e-based Motif search Problem
Much like NGS re-assembly, the same algorithmic quandary has been the case for motif search as well─you can either have rapid, but often “false-negative” outputs with the heuristic class of algorithms, or terribly slow but thorough results with exact algorithms.
AP technology can break this logjam.
The Georgia Institute of Techndology designed an exact algorithm for solving PMP and qPMP using non-deterministic finite automata programmed into the Micron Automata Processor. The algorithm, termed MOTOMATA (for MOTif automata), not only has lower execution times but can also handle more challenging instances of the problem which have yet to be solved.
# of (l,d) Pairs (for 20 cliques using an exact algorithm)
Micron Automata Board + 1 CPU (est)
Preprocessing Sequence Data
Preprocessing can improve the quality of sequence assembly. Sometimes, the raw read fragments produced by the sequencer are correct and precise only in a fraction of their length. Using the entire read fragment may introduce artifacts in downstream analyses such as NGS re-assembly, SNP calling, or gene expression estimation. Read trimming aims at removing low-quality portions while preserving the longest high-quality part of NGS reads. Trimming has also been used in transcriptome assembly, metagenome reconstruction, RNA-Seq, epigenetic studies, and comparative genomics. Although not as big a task as NGS genome sequence re-assembly, pre-processing is compute-intense and may benefit from AP hardware acceleration.
Evolutionary biology is the study of the origin and descent of species, as well as their change over time. Bionformatics is enabling researchers to trace the evolution of a large number of organisms by measuring changes in their DNA, rather than through physical taxonomy or physiological observations alone. Comparing entire genomes permits the study of more complex evolutionary events, such as gene duplication, horizontal gene transfer, and the prediction of factors important in bacterial speciation. These analyses can be accelerated with AP technology.
Oncogenomics uses high-throughput genomic analysis to characterize genes associated with cancer and to identify specific tumor mutations amenable to targeted drugs. The goal of oncogenomics is to identify new oncogenes or tumor suppressor genes that may provide new insights into cancer diagnosis, predicting clinical outcome of cancers, and new targets for cancer therapies. The successes of targeted cancer drugs such as Gleevec, Herceptin, and Avastin have proven the approach.
Cancer mutation detection methods simultaneously measure several hundred thousand sites throughout the genome, and when used in high-throughput to measure thousands of samples, generate terabytes of data per session. Many of the targeted drugs are only effective for patients with very specific tumor mutations, and time is critical for late-stage cancer patients. AP technology can help speed analysis of results for both cancer researchers and clinicians.
Genomic literature Search
The growth in the number of published literature makes it virtually impossible to read every paper, resulting in disjointed sub-fields of research. Literature analysis aims to employ computational and statistical linguistics to mine this growing library of text resources. For example, searches of scholarly literature that mention certain DNA sequences can be mapped to genetic databases and vice versa. With Micron AP acceleration, there is potential for more rapidly doing these massive symbol sequence searches to link to massive amounts of scholarly documents.
Bioinformatics Research at Universities - Sponsored by Micron
Georgia Institute of Technology
Micron is working closely with ecosystem partners and research institutions to grow awareness and engagement for this new technology. Srinivas Aluru, Professor of Computational Science and Engineering at G.I.T., is a leader in the field of high-performance computational biology. He has been deeply involved in early research efforts using Micron’s Automata Processors to solve bioinformatics problems. “Micron’s Automata Processor offers a refreshingly new way of solving problems that is very different from all other accelerator technologies,” said Aluru. “By deploying this in interesting ways, we have been able to solve a much larger instance of the NP-hard biological motif-finding problem than was previously reported, using the resources within a single Automata Processor board.”
Iowa State University
A graduate student of Professor Aluru’s wrote in his thesis about the impressive acceleration achieved. He also described his experience with Micron’s AP platform from both a computer engineering and a bioinformatics points of view. Christopher Sabotta, at Iowa State University, stated, “The Automata Processor is the first semiconductor device where the execution of an NFA (nondeterministic finite automata) can be directly emulated on the device. Besides, the large capacity of the processor allows the execution of many NFA in parallel, leading to large-scale speedup. Thus the Automata Processor can work as an effective accelerator for many applications in bioinformatics. An intuitive use of the Automata Processor is to aid in the characterization of protein sequences through the identification of motif patterns… at many orders of magnitude faster than state-of-the-art CPU based algorithms. For example, all the proteins present in the proteome of Escherichia coli (E.coli ) can be scanned for all the motif patterns present in PROSITE in less than 100 milliseconds, in contrast to the several minutes reported by state-of-art CPU based solutions… with board capacity left over for 47 other simultaneous jobs."
Regarding other acceleration hardware, Sabotta added that, “While NFA are difficult to emulate with a sequential processor, Field Programmable Gate Arrays (FPGAs) are capable of NFA modeling with the use of lookup tables. Similarly, GPU-based solutions have been devised. Due to its specialization, the Automata Processor is able to show a decisive advantage over [competing] solutions. Furthermore, the Automata Processor remains the only such specialized non-FPGA, non-GPU hardware.”