Key Takeaways
- ETH Zurich has developed “MetaGraph,” a digital tool that rapidly searches through millions of DNA records.
- This tool simplifies and accelerates genetic research, aiding in the study of antibiotic resistance and unknown pathogens.
- MetaGraph is available as open source and currently indexes nearly half of the world’s sequence data.
Revolutionizing DNA Data Search
Computer scientists at ETH Zurich have introduced an innovative digital tool, “MetaGraph,” that can sift through vast amounts of DNA data—millions of published records—within seconds. This advancement promises to enhance research concerning antibiotic resistance and unidentified pathogens significantly.
The progression of DNA sequencing has transformed biomedical research since its inception, particularly through next-generation sequencing techniques that made rapid decoding of viral genomes, such as SARS-CoV-2, possible during the COVID-19 pandemic. Increasingly, researchers are making sequenced DNA publicly accessible, leading to the accumulation of extensive data, particularly in significant repositories like the American Sequence Read Archive (SRA) and the European Nucleotide Archive (ENA). These databases collectively hold approximately 100 petabytes of data, equivalent to the total text available online.
However, conventional methods for accessing this wealth of information required extensive computational power. Researchers faced challenges in efficiently searching through extensive DNA sequences and comparing them with their sequences. ETH Zurich’s team has addressed this obstacle with their MetaGraph tool, which allows users to conduct full-text searches similarly to a standard Internet search engine. By simply entering a desired DNA sequence into a search interface, researchers can swiftly discover its prior occurrences in the databases.
Professor Gunnar Rätsch of ETH Zurich likened the MetaGraph to “Google for DNA.” Previously, researchers sought descriptive metadata for their searches but had to download complete datasets to access raw data, a process that was time-consuming, costly, and often incomplete.
MetaGraph streamlines the search process while being cost-effective, with the potential to handle larger datasets on just a few computer hard drives. The researchers estimate that larger queries would cost approximately $0.74 per megabase, making it an economical choice for scientific investigations.
Enhancements in the tool include data compression techniques that index information in a more structured format, akin to organizing data in spreadsheet programs. MetaGraph can compress data by a factor of 300, thereby maintaining essential information yet reducing redundancy, similar to summarizing a lengthy book without losing main storylines. As Dr. André Kahles from the Biomedical Informatics Group at ETH Zurich describes, the researchers aim to maximize data compactness without sacrificing critical information.
The scalability of MetaGraph differentiates it from other tools currently in development, as it requires less additional computing power with larger data queries. Since its initial introduction in 2020, the ETH team has continuously refined the tool, which now facilitates queries across millions of DNA, RNA, and protein sequences from various organisms. At present, nearly half of the global sequence data sets are indexed, with plans to encompass the remainder by the year’s end.
MetaGraph’s open-source nature enhances its appeal to pharmaceutical companies possessing vast internal research data. The tool’s simplification of DNA searches might also become accessible to the general public, with potential applications in personal genomics, such as accurately identifying plant species at home.
As advancements in DNA sequencing technology continue, MetaGraph stands poised to become a crucial resource in the future of genetic research.
The content above is a summary. For more details, see the source article.