I continued watching the ISME19 Workshop “From Reads to Function” day 1 sessions. Next, Ben Allen from KBase spoke about “Metagenomics in KBase” and went over the metagenomics workflows available there. They explained that the workflow is based on the Chivian et al. (2023) Nature Protocols publication. Allen shared a graph of the growth of DNA sequencing data in the NCBI Sequence Read Archive (SRA) in recent years, along with several graphs describing the amount of genomic data generated in the last decade. Allen then shared a table comparing first, second, and third generation sequencing. First generation sequencing refers to Sanger sequencing, with 0.001-0.01% error in base calling. Second generation sequencing refers to short-read technologies (Illumina, 454, and now Element) with base calling error rates of about 0.10% (1 error in 1000 bp). Third generation sequencing refers to Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT), which are capable of ultra long reads (1000-60,000 bases) with base calling error rates of 10-15% for PacBio and 1-15% for ONT. Second generation Illumina short reads were used for the workshop. This technology is “sequencing by synthesis,” and every sequenced base generates a “light” signature. Nanopore sequencing relies on DNA passing through pores and causing changes in current.
Read quality assessment and control are needed before using reads for assembly. The two tools for this in KBase are FastQC and Trimmomatic. Allen noted that FastQC is intended for Illumina and is not very useful for Nanopore reads. Allen explained that the per tile sequence quality plot is a way of identifying potential issues with the run and/or chemistry. Adapter contamination can also be identified with this interactive tool. Trimmomatic can then trim reads to remove low-quality bases and adapter sequence.
For the basics of genome assembly section, Allen noted that coverage is the number of reads overlapping a given position. Allen noted that coverage helps remove errors: a random base calling error in one read is outvoted by the other reads covering the same position. Allen spoke about metaSPAdes and how it builds out consensus sequences.
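To convince myself of the coverage point, here is a minimal Python sketch. It is my own toy example, not something shown in the talk, and it is not how metaSPAdes works internally (metaSPAdes assembles with de Bruijn graphs). It assumes the reads are already lined up on the same coordinates, with `-` marking positions a read does not cover. The function name `coverage_and_consensus` and the reads themselves are made up for illustration.

```python
from collections import Counter


def coverage_and_consensus(aligned_reads):
    """Return per-position coverage and a majority-vote consensus.

    aligned_reads: equal-length strings on shared coordinates,
    with '-' where a read does not cover the position (toy assumption).
    """
    length = len(aligned_reads[0])
    coverage = []
    consensus = []
    for i in range(length):
        # Collect the bases from every read that covers position i.
        bases = [read[i] for read in aligned_reads if read[i] != "-"]
        coverage.append(len(bases))
        # The most common base wins; 'N' if nothing covers the position.
        consensus.append(Counter(bases).most_common(1)[0][0] if bases else "N")
    return coverage, "".join(consensus)


# Hypothetical reads drawn from the sequence ACGTACGTAC.
reads = [
    "ACGTAC----",
    "ACGAACGT--",  # base calling error at index 3 (T read as A)
    "--GTACGTAC",
    "----ACGTAC",
    "ACGTACCTAC",  # base calling error at index 6 (G read as C)
]

coverage, consensus = coverage_and_consensus(reads)
print(coverage)
print(consensus)
```

Each of the two errors sits at a position covered by four reads, so the three correct reads outvote it and the consensus comes back as the original sequence. With a coverage of one at those positions the errors would have gone straight into the assembly.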
Post assembly, binning can take place with MaxBin2 or CONCOCT (Clustering cONtigs with COverage and ComposiTion). These two algorithms are similar, with one key difference: in addition to its clustering algorithm, MaxBin2 checks against reference phylogenetic marker genes, whereas CONCOCT is unsupervised. Bins can be optimized with DAS Tool, which uses dereplication, aggregation, and scoring strategies to compare bins from multiple binning tools. CheckM is then used to estimate completeness and contamination using single-copy gene sets: the assessment is based on which marker genes are expected to be present, and present only once, in a genome from a given lineage. These estimates are then used to filter metagenome-assembled genomes (MAGs) on completeness and contamination. I will continue watching the next steps tomorrow!
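Looking back at the CheckM step, here is a rough Python sketch of the idea for my own notes. It is a simplification and not CheckM's actual algorithm or API: completeness is taken as the percentage of expected single-copy markers found at least once, and contamination as the percentage of extra copies. The marker names, the bins, and the 90% / 5% cutoffs are all assumptions of mine (no thresholds were given in the talk), and `estimate_quality` and `filter_mags` are hypothetical helper names.

```python
# Hypothetical set of single-copy marker genes expected in the lineage.
EXPECTED_MARKERS = [f"marker_{i}" for i in range(1, 11)]


def estimate_quality(marker_counts, expected_markers):
    """Simplified completeness/contamination estimate, in percent."""
    n = len(expected_markers)
    # Completeness: expected markers seen at least once.
    found = sum(1 for m in expected_markers if marker_counts.get(m, 0) >= 1)
    # Contamination: copies beyond the single expected copy.
    extra = sum(max(marker_counts.get(m, 0) - 1, 0) for m in expected_markers)
    return 100.0 * found / n, 100.0 * extra / n


def filter_mags(bins, expected_markers, min_completeness=90.0, max_contamination=5.0):
    """Keep bin ids passing both cutoffs (the cutoffs are assumed values)."""
    kept = []
    for bin_id, marker_counts in bins.items():
        completeness, contamination = estimate_quality(marker_counts, expected_markers)
        if completeness >= min_completeness and contamination <= max_contamination:
            kept.append(bin_id)
    return kept


# Made-up marker counts per bin.
bins = {
    # Every marker present exactly once.
    "bin_a": {m: 1 for m in EXPECTED_MARKERS},
    # Only six markers found, two of them duplicated.
    "bin_b": {**{m: 1 for m in EXPECTED_MARKERS[:4]},
              **{m: 2 for m in EXPECTED_MARKERS[4:6]}},
    # Every marker present, one duplicated.
    "bin_c": {**{m: 1 for m in EXPECTED_MARKERS[:9]},
              EXPECTED_MARKERS[9]: 2},
}

for bin_id, counts in bins.items():
    print(bin_id, estimate_quality(counts, EXPECTED_MARKERS))
print(filter_mags(bins, EXPECTED_MARKERS))
```

Only `bin_a` passes with these cutoffs: `bin_b` is missing too many expected markers, and `bin_c` is complete but has a duplicated marker, which counts as contamination.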
