Tonight I watched Lauren Lui from Lawrence Berkeley National Laboratory present a “Long Read Isolate Sequencing and Assembly Workshop.” This session was recorded April 2, 2024 and was part of the ENIGMA program. Lui provided an overview of the apps they use: Filtlong, Unicycler, Polypolish, and Flye. Evolving long-read technologies motivated the team to implement these apps in KBase. ENIGMA, which stands for Ecosystems & Networks Integrated with Genes and Molecular Assemblies, focuses on characterizing field sites. Lui shared a graph comparing ENIGMA isolate assemblies made from Illumina reads alone with assemblies finished with long reads. Those finished with long reads could overcome repetitive region limitations and reduce the number of contigs. Lui shared assembly graphs that looked like figure eights because of repetitive regions. The two possible paths through such a graph could be resolved with long reads. Lui shared data from Jennifer Goff for the assembly of a Bacillus cereus strain from a low-pH, metal-contaminated site. The strain had a 5.6 Mb chromosome and eight plasmids that accounted for ~13% of the genome length and included metal resistance genes. Lui explained that her background was in mathematics and that she then did a Ph.D. in bioinformatics with some wet lab work. Lui held the LISA Workshop on December 6-8, 2023. Participants came with samples and extracted high-molecular-weight DNA. Slides from the workshop are available at bit.ly/LISAWorkshop and I will check them out for BIT 295. During the workshop, they sequenced on a Nanopore MinION. They then used KBase for making Genome Announcement Resources. Lui emphasized that sample preparation does impact assembly. The workshop was supported by several assistants and people from KBase. After 1 hour they had an N50 of 11 kb! Read length, error rate, and depth of sequencing will affect assembly. Lui spoke about long-read error rates in comparison to Illumina.
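To make the N50 number concrete for myself, here is a minimal Python sketch of how an N50 can be computed from a list of read (or contig) lengths. The function name and the example lengths are my own made-up illustration, not anything shown in the workshop.

```python
def n50(lengths):
    """Return the N50: the length L such that reads/contigs of length >= L
    together contain at least half of all bases."""
    if not lengths:
        raise ValueError("need at least one length")
    ordered = sorted(lengths, reverse=True)  # longest first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical read lengths in bp (illustrative only).
example_lengths = [2_000, 5_000, 8_000, 11_000, 15_000, 20_000]
print(n50(example_lengths))  # prints 15000
```

In this toy set the two longest reads (20 kb and 15 kb) already hold more than half of the 61 kb total, so the N50 is 15 kb.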
Oxford Nanopore Technologies (ONT) and PacBio have higher error rates than Illumina, but Lui noted that new chemistry and methods do better. Coverage is critical because it helps you remove errors: by “stacking up” the reads with coverage, you “gain” confidence in the assembly. Lui shared that Illumina-only genomes need 50x-200x coverage and long-read-only genomes need more than 20-40x coverage (50-100x is ideal). Lui then started using a KBase narrative. She noted that there is a bug in the upload of FASTQ files with reads shorter than 10 bp. These short reads should be filtered out before upload. Lui uses Trimmomatic for removing adapters from Illumina reads and Dorado/MinKNOW for removing adapters from nanopore reads. For long reads, Lui uses Filtlong to select for longer reads. The demo included a short-read assembly with SPAdes. Long-read assembly with Flye has options for different error rates from ONT and an option for PacBio. If the coverage is not high enough, the assembled genome won’t be circularized. Lui then ran Unicycler with trimmed Illumina reads and filtered long reads as input. Interestingly, Flye filters out long reads that are shorter than the overlap you set. Unicycler generated one contig. Using the single-contig Unicycler output as the “Gold Standard,” Lui noted that the number of predicted genes is higher with the Flye assembly alone, with more hypothetical genes. This emphasizes the importance of polishing steps. Lui also did an assembly with a subsampled long-read set that also produced a “good” assembly. Ben Allen from Oak Ridge National Lab came next to talk about publishing with KBase. Allen shared that KBase has over 35k users and thousands of narratives. Allen shared the KBase Genome Announcement/MRA template. Allen demonstrated both apps from narratives and data from static narratives. The new collections, including ENIGMA collections, allow sharing and analyses of genomes. I wonder if I can create a Delftia collection?!
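Two of the practical points above, estimating coverage and dropping reads shorter than 10 bp before upload, are easy to sketch in Python. This is only a rough illustration under my own assumptions: the function names are hypothetical, the FASTQ is assumed to use simple 4-line records (no wrapped sequence lines), and the read counts are made up. It is not how KBase, Filtlong, or the workshop pipeline does it.

```python
def estimated_coverage(read_lengths, genome_size_bp):
    """Approximate depth of coverage: total sequenced bases / genome size."""
    if genome_size_bp <= 0:
        raise ValueError("genome size must be positive")
    return sum(read_lengths) / genome_size_bp


def drop_short_reads(fastq_lines, min_length=10):
    """Yield 4-line FASTQ records whose sequence has at least min_length bases.

    Assumes each record is exactly 4 lines: header, sequence, '+', qualities.
    """
    record = []
    for line in fastq_lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            if len(record[1]) >= min_length:
                yield tuple(record)
            record = []


# Made-up example: 28,000 reads of 10 kb against a 5.6 Mb genome.
print(estimated_coverage([10_000] * 28_000, 5_600_000))  # prints 50.0

# Tiny in-memory FASTQ with one 4 bp read and one 12 bp read.
fastq = [
    "@r1", "ACGT", "+", "IIII",
    "@r2", "ACGTACGTACGT", "+", "IIIIIIIIIIII",
]
kept = list(drop_short_reads(fastq))
print([rec[0] for rec in kept])  # prints ['@r2']
```

With numbers like these, the made-up long-read set would land at 50x, the low end of the ideal range for long-read-only genomes mentioned in the talk.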
