Fragment Assembly in DNA Sequencing
with Dumitru Brinza and Mark Chaisson

Imagine several copies of a book cut by scissors into 10 million small pieces. Each copy is cut in an individual way so that a piece from one copy may overlap a piece from another copy. Assuming that 1 million pieces are lost and the remaining 9 million are splashed with ink, try to recover the original text. After doing this you'll get a feeling of what a DNA sequencing problem is like. Classical sequencing technology allows a biologist to read short (300- to 500-letter) fragments per experiment (each of these fragments corresponds to one of the 10 million pieces). Computational biologists have to assemble the entire genome from these short fragments, a task not unlike assembling the book from millions of slips of paper. The problem is complicated by unavoidable experimental errors(ink splashes).

For the last twenty years, fragment assembly in DNA sequencing followed the "overlap - layout - consensus" paradigm that is used in all currently available assembly tools. Although this approach proved to be useful.The existing algorithms make assembly errors and are often unable to resolve repeats even in prokaryotic genomes. Biologists are well-aware of these errors and are forced to carry additional experiments to verify the assembled contigs.

 We abandon the classical "overlap - layout - consensus" approach in favor of a new Eulerian approach. Our main result is the reduction of the fragment assembly to a variation of the classical Eulerian path problem. This reduction opens new possibilities for repeat resolution and allows one to generate solutions of the fragment assembly problems. We are currently exploring applications of the Eulerian approach to the emerging short read technologies developed by 454 Life Sciences, Illumina and ABI.