MinHash for Rapid Pathogen Identification

Brian Ondov from the National Biodefense Analysis and Countermeasures Center (NBACC) spoke at the Nanopore Community Meeting 2015 on “Leveraging MinHash for rapid identification of nanopore data on mobile hardware.” Ondov explained how the portability of the MinION device can be paired with portable computing hardware, though expanding memory on a small device can be challenging. K-mer-based distance estimates can rapidly determine the composition of reads, Ondov said, but k-mer-based methods require large tables and a lot of memory.

MinHash was created in 1997 by Andrei Broder, as part of the AltaVista search engine, to subset and organize web pages. This approach broke documents into “shingles” (overlapping runs of words, analogous to k-mers in a sequence) and measured the distance between them.

Ondov shared data comparing the MinHash distance and Average Nucleotide Identity (ANI) of 500 E. coli genomes while varying the k-mer size. As the k-mer size increases, you obtain “diminishing gains,” and 21-mers seem to be the best balance. The team also took the RefSeq microbial genome database, with 55,000 entries, and “compressed” it 6,000x with an optimized k-mer table and MinHash. With MinHash, the RefSeq microbial database is only 100 Mb. The system processes about 1,000 reads per second. Ondov explained how much coverage is needed to identify an organism and shared a video indicating that 30x coverage is needed to detect B. anthracis.

The software Ondov and the team developed is called Mash. In the question and answer session, Ondov said that the system is being developed to analyze base-calling results as they stream in. I wondered how Mash worked and appreciated the history and clear explanation Ondov provided.
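To make the idea concrete for myself, here is a minimal Python sketch of how a MinHash comparison between two sequences might work. It is not the Mash source code, and several details are my assumptions: the function names are made up, the hash is BLAKE2b from Python's standard library rather than whatever Mash uses internally, and reverse-complement handling is left out. The distance formula, D = -(1/k) ln(2j / (1 + j)), is the one given in the Mash paper for converting a Jaccard estimate j into a mutation-rate-like distance.

```python
import hashlib
import math


def kmers(seq, k):
    """Yield every overlapping k-mer (the "shingles") of a sequence."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer. The hash choice is an assumption."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(seq, k=21, size=1000):
    """Bottom-s sketch: keep only the `size` smallest distinct k-mer hashes."""
    hashes = {hash_kmer(km) for km in kmers(seq, k)}
    return sorted(hashes)[:size]


def jaccard_estimate(sketch_a, sketch_b, size=1000):
    """Estimate the Jaccard index of two k-mer sets from their sketches."""
    # Take the `size` smallest hashes of the merged sketches, then count
    # how many of them are present in both sketches.
    merged = sorted(set(sketch_a) | set(sketch_b))[:size]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)


def mash_distance(j, k=21):
    """Convert a Jaccard estimate into the distance from the Mash paper."""
    if j <= 0:
        return 1.0  # no shared hashes: treat as maximally distant
    if j >= 1:
        return 0.0
    return -math.log(2 * j / (1 + j)) / k
```

Sketching a reference genome once with `sketch(genome)` and each incoming read set the same way means only the small lists of hashes need to be kept in memory, which is the point of the "compression" Ondov described. The sketch size of 1,000 is an illustrative default of mine; a larger sketch gives a tighter Jaccard estimate at the cost of more memory.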

What do k-mer-based methods and MinHash have to do with shingles? Photo by Catherine Leclert on Pexels.com

Credits

Website images were purchased from and edited in Canva.com. Blog post images are from the WordPress free image library powered by Pexels. Gallery images used were taken or created by Carlos C. Goller or otherwise attribution is stated. Blog posts represent my reflections and reference relevant sources of information, including conferences, podcasts, books, and workshops when applicable. I strive for proper attribution of sources and accessibility of content. I am still early in the journey. I appreciate feedback!
