Brian Ondov from the National Biodefense Analysis and Countermeasures Center (NBACC) spoke at the Nanopore Community Meeting 2015 on “Leveraging MinHash for rapid identification of nanopore data on mobile hardware.” Ondov explained how the portability of the MinION device can be paired with equally portable computing hardware. They noted that expanding memory on a small device can be challenging. Ondov said that k-mer-based distance estimates can be used to rapidly determine the composition of reads. However, k-mer-based methods require large tables and a lot of memory. MinHash was created in 1997 by Andrei Broder, as part of the AltaVista search engine, to subset and organize web pages. This approach broke pages into “shingles” and measured the distance between them. Ondov shared data comparing the MinHash distance and Average Nucleotide Identity (ANI) of 500 E. coli genomes while varying the k-mer size. As the k-mer size increases, you obtain “diminishing gains,” and 21-mers seem to be the best balance. They also took the RefSeq microbial genome database, with 55,000 entries, and “compressed” it 6,000x with an optimized k-mer table and MinHash. The system processes about 1,000 reads per second. Ondov explained how much coverage is needed to identify an organism. They shared a video indicating that 30x coverage is needed to detect B. anthracis. With MinHash, the RefSeq microbe database is only 100 Mb. The software Ondov and the team developed is called Mash. In the question and answer session, Ondov said that the system is being developed to analyze base-calling results as they stream in. I had wondered how Mash worked, and I appreciated the history and clear explanation Ondov provided.
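The MinHash idea described in the talk can be illustrated with a small Python sketch. This is a toy illustration and not Mash's actual implementation: the function names, the choice of BLAKE2b as the hash, and the sketch size of 1,000 are all assumptions made for the example, and reverse-complement k-mers are ignored. The distance formula at the end is the one I understand Mash to use for converting a Jaccard estimate into a mutation-rate-like distance; treat it as an assumption too.

```python
import hashlib
import math


def kmers(seq, k):
    """Yield every overlapping k-mer (the "shingles") of a sequence."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (BLAKE2b is an illustrative choice)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(seq, k=21, size=1000):
    """Bottom-s sketch: the `size` smallest distinct k-mer hashes, sorted."""
    hashes = {hash_kmer(km) for km in kmers(seq, k)}
    return sorted(hashes)[:size]


def jaccard_estimate(a, b, size=1000):
    """Estimate Jaccard similarity of two k-mer sets from their sketches."""
    # Take the `size` smallest hashes of the union, then count how many
    # of those appear in both sketches.
    merged = sorted(set(a) | set(b))[:size]
    if not merged:
        return 0.0
    shared = set(a) & set(b)
    return sum(1 for h in merged if h in shared) / len(merged)


def minhash_distance(j, k=21):
    """Convert a Jaccard estimate into a distance (assumed Mash-style formula)."""
    if j <= 0:
        return 1.0  # nothing shared: report the maximum distance
    return -math.log(2 * j / (1 + j)) / k
```

A comparison would then look like `minhash_distance(jaccard_estimate(sketch(genome_a), sketch(genome_b)))`, where `genome_a` and `genome_b` are hypothetical sequence strings. The point of the design is that only the small sketches need to be stored and compared, never the full k-mer tables, which is what makes a compact reference database feasible on memory-limited hardware.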
