Genomes are like the biological owner’s manual for all living things. Cells read DNA instantaneously, getting instructions necessary for an organism to grow, function and reproduce. But for humans, deciphering this “book of life” is significantly more difficult.
Nowadays, researchers typically rely on next-generation sequencers to translate the unique sequences of DNA bases (there are only four) into letters: A, G, C and T. While DNA strands can be billions of bases long, these machines produce very short reads, about 50 to 300 characters at a time. To extract meaning from these letters, scientists need to reconstruct portions of the genome — a process akin to rebuilding the sentences and paragraphs of a book from snippets of text.
But this process can quickly become complicated and time-consuming, especially because some genomes are enormous. For example, while the human genome contains about 3 billion bases, the wheat genome contains nearly 17 billion bases and the pine genome contains about 23 billion bases. Sometimes the sequencers will also introduce errors into the dataset, which need to be filtered out. And most of the time, the genomes need to be assembled de novo, or from scratch. Think of it like putting together a ten-billion-piece jigsaw puzzle without a complete picture to reference.
By applying novel algorithms, computational techniques and the innovative programming language Unified Parallel C (UPC) to the cutting-edge de novo genome assembly tool Meraculous, a team of scientists from Lawrence Berkeley National Laboratory’s (Berkeley Lab’s) Computational Research Division (CRD), the Joint Genome Institute (JGI) and UC Berkeley simplified and sped up genome assembly, reducing a months-long process to mere minutes. This was primarily achieved by “parallelizing” the code to harness the processing power of supercomputers, such as the National Energy Research Scientific Computing Center’s (NERSC’s) Edison system. Put simply, parallelizing code means taking tasks that were once executed one by one and modifying or rewriting the code so that they run on the many nodes (processor clusters) of a supercomputer all at once.
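The idea of splitting one-by-one work into pieces that run at once can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the team's UPC code: the function names (count_bases, split, parallel_count) are hypothetical, and a thread pool on one machine stands in for the nodes of a supercomputer. Python threads will not actually speed up this kind of CPU-bound loop; the point is the decomposition into independent pieces whose partial results are merged.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_bases(reads):
    """Serial task: tally how often each base letter appears in a list of reads."""
    counts = Counter()
    for read in reads:
        counts.update(read)
    return counts


def split(items, parts):
    """Deal items round-robin into `parts` roughly equal lists."""
    return [items[i::parts] for i in range(parts)]


def parallel_count(reads, workers=4):
    """Run count_bases on each piece concurrently, then merge the partial tallies."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(count_bases, split(reads, workers))
        total = Counter()
        for partial in partials:
            total.update(partial)
    return total


# Toy reads (made up for illustration)
reads = ["ACGT", "AACC", "GGTT", "ACAC", "TTTT"]
serial = count_bases(reads)
parallel = parallel_count(reads, workers=3)
```

Both paths give the same tally because each piece can be counted without knowing anything about the others, which is the property that lets this kind of work spread across many processors.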
“Using the parallelized version of Meraculous, we can now assemble the entire human genome in about eight minutes using 15,360 computer processor cores. With this tool, we estimate that the output from the world’s biomedical sequencing capacity could be assembled using just a portion of NERSC’s Edison supercomputer,” says Evangelos Georganas, a UC Berkeley graduate student who led the effort to parallelize Meraculous. He is also the lead author of a paper published and presented at the SC Conference in November 2014.
“This work has dramatically improved the speed of genome assembly,” says Leonid Oliker, a computer scientist in CRD. “The new parallel algorithms enable assembly calculations to be performed rapidly, with near linear scaling over thousands of cores. Now genomics researchers can assemble large genomes like wheat and pine in minutes instead of months using several hundred nodes on NERSC’s Edison.”
Supercomputers: A Game Changer for Assembly
High-throughput and relatively low-cost next-generation DNA sequencers have allowed researchers to look for biological solutions to everything from generating clean energy and environmental cleanup to identifying connections between genetic mutations and cancer. For the most part, these machines are very accurate at recording the sequence of DNA bases. But sometimes errors such as substitutions, repetitions, transpositions and omissions do occur, akin to “typos” in a book. These errors complicate analysis by making it harder to assemble genomes and identify genetic mutations. They can also lead researchers to misinterpret the function of a gene.
One technique that researchers often use to identify errors is called shotgun sequencing. This involves taking numerous copies of a DNA strand, breaking them up randomly into many smaller pieces and then sequencing each piece separately. This produces a large number of overlapping short reads that allow scientists to eventually reassemble the whole DNA strand. Because the same stretch of DNA is sequenced many times, a true sequence shows up repeatedly while a sequencing error tends to show up only once or twice, which helps identify errors. But for a particularly complex genome, this process also generates a tremendous amount of data, sometimes several terabytes.
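One common way to exploit that redundancy is to count short fixed-length substrings of the reads, often called k-mers, and treat the ones seen only rarely as suspect. The sketch below assumes this k-mer approach for illustration; the article does not spell out the exact procedure, and the names count_kmers and rare_kmers, the threshold and the toy reads are all hypothetical.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring (k-mer) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def rare_kmers(counts, min_count=2):
    """K-mers seen fewer than min_count times are candidates for sequencing errors."""
    return {kmer for kmer, n in counts.items() if n < min_count}


# Toy data: four consistent overlapping reads, plus one read ("ACGAAC")
# that carries a single substitution relative to "ACGTAC".
reads = ["ACGTAC", "CGTACG", "ACGTAC", "CGTACG", "ACGAAC"]
counts = count_kmers(reads, k=4)
suspects = rare_kmers(counts)
```

Here the only k-mers that fall below the threshold are the three that overlap the substituted base. In real data, low coverage at the ends of a region can also produce rare k-mers, so a low count marks a candidate error rather than a confirmed one.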
To identify errors in this data quickly and effectively, the Berkeley Lab and UC Berkeley team relied on “Bloom filters” and massively parallel supercomputers. Conceived by Burton H. Bloom in 1970, Bloom filters are very efficient at recognizing whether or not an element is a member of a set. Thus, researchers can rely on this tool to tell them if a base is out of place and is likely a mistake. Because a Bloom filter is built on a bit array, it also requires relatively little memory, making it ideal for querying massive datasets.
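A minimal Bloom filter can be sketched as follows. It is an illustration only, not the team's distributed implementation: the class layout, the default sizes and the use of SHA-256 to derive bit positions are assumptions made for the example. One property worth keeping in mind is that a Bloom filter can occasionally report an item as present when it was never added (a false positive), but it never misses an item that was added.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter backed by a bit array (sizes chosen for illustration)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # one bit per slot, packed into bytes

    def _positions(self, item):
        # Derive num_hashes bit positions by hashing the item with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # Present only if every derived bit is set; may be a false positive.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


seen = BloomFilter()
for kmer in ["ACGT", "CGTA", "GTAC"]:
    seen.add(kmer)

already_seen = "CGTA" in seen  # always True for an item that was added
```

The whole structure above occupies 128 bytes regardless of how long the stored strings are, which is why the approach scales to very large datasets.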
“Applying Bloom filters to this part of the genome assembly problem is not new; it has been done before. What we have done differently is to get Bloom filters to work with distributed memory systems,” says Aydin Buluç, a research scientist in CRD. “This task was not trivial; it required some computing expertise to accomplish.”
The team also developed solutions for parallelizing data input and output (I/O). “When you have several terabytes of data, just getting the computer to read your data and output results can be a huge bottleneck,” says Steven Hofmeyr, a research scientist in CRD who developed these solutions. “By allowing the computer to download the data in multiple threads, we were able to speed up the I/O process from hours to minutes.”
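A single-machine analogue of reading input in parallel might look like the sketch below. It is an assumption-laden illustration rather than the team's solution: the helper names (read_chunk, parallel_read) are hypothetical, and it simply divides a file into byte ranges and lets several threads read them at once. Real sequencing files are made of records, so a practical version would also have to align each range to a record boundary.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def read_chunk(path, start, length):
    """Read `length` bytes of the file beginning at byte offset `start`."""
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(length)


def parallel_read(path, num_workers=4):
    """Split the file into byte ranges, read them concurrently, and rejoin in order."""
    size = os.path.getsize(path)
    chunk = max(1, -(-size // num_workers))  # ceiling division, at least 1 byte
    ranges = [(start, min(chunk, size - start)) for start in range(0, size, chunk)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        parts = pool.map(lambda r: read_chunk(path, *r), ranges)
        return b"".join(parts)


# Demo on a small temporary file (toy content, made up for illustration)
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"ACGT" * 1000)
    demo_path = tmp.name

data = parallel_read(demo_path, num_workers=3)
os.remove(demo_path)
```

Because each thread opens its own file handle and seeks to its own offset, the readers do not interfere with one another, and the results are joined in the original order.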