Tonight I watched the introduction to day 3 of the LISA workshop. Lauren Liu from Lawrence Berkeley National Laboratory spoke about how genomics research can be limited by incomplete genomes. Liu noted that “genomes are hypotheses about what microbes are doing… but with environmental sequencing we often don’t have complete genomes.” Liu explained that assembly graphs represent the possible structures of a genome. Combining long and short reads (“hybrid assemblies”) helps resolve assembly issues, and long reads can also help capture plasmid sequences. Liu discussed what a “complete prokaryotic genome assembly” is: prokaryotic genomes are circular, but assembling them from metagenomic data will likely not produce circularized genomes. CheckM and machine learning algorithms are used to evaluate completeness. Liu explained that second-generation Illumina sequencing has an error rate of about one error per prokaryotic gene, and that these errors are often resolved through coverage and error correction. Oxford Nanopore Technologies (ONT) sequencing has higher error rates, but they are decreasing with updates to basecalling algorithms, training datasets, and pore improvements. Liu then explained polishing assemblies with short reads. There are two approaches to hybrid assembly: assemble the long reads first and then polish with short reads, or assemble the short reads first and then use long reads to find paths through the assembly graph. Liu said that now that long-read sequencing costs have decreased, more people are sequencing and assembling long reads first and then polishing with short reads. Liu explained that the typical workflow for prokaryotic genome assembly is sequencing, basecalling, assembly, and error correction/polishing.
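
To make the point about coverage resolving sequencing errors more concrete for myself, here is a toy sketch in Python. It is not something shown in the talk and not how real polishing tools work: it assumes the reads are already perfectly aligned to the same region, allows only substitution errors, and uses an arbitrary 5% per-base error rate chosen for illustration. The function names (`simulate_read`, `consensus`, `mismatches`) are my own.

```python
import random
from collections import Counter

BASES = "ACGT"

def simulate_read(reference, error_rate, rng):
    """Copy the reference, swapping in a different base with probability error_rate."""
    out = []
    for base in reference:
        if rng.random() < error_rate:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)

def consensus(reads):
    """Majority vote per position over equal-length, already aligned reads."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*reads))

def mismatches(a, b):
    """Count positions where two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

rng = random.Random(0)  # fixed seed so the run is repeatable
reference = "".join(rng.choice(BASES) for _ in range(1000))  # made-up sequence
reads = [simulate_read(reference, 0.05, rng) for _ in range(30)]  # 30x coverage
polished = consensus(reads)

print("errors in one raw read:", mismatches(reads[0], reference))
print("errors in the consensus:", mismatches(polished, reference))
```

Each individual read carries its own random errors, but because the errors fall at different positions in different reads, the per-position majority vote recovers far more of the original sequence than any single read does. As I understand it, this is the basic intuition behind polishing a noisier long-read assembly with accurate short reads, although real tools also have to handle alignment, insertions, and deletions.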
