Insights from KBase LISA: Base Calling in Sequencing

Tonight I started watching the recordings from the KBase LISA workshop, beginning with the base calling session by Torben Nielsen from Lawrence Berkeley National Laboratory. Nielsen compared PacBio and Nanopore sequencing. PacBio is sequencing by synthesis with limited read length, and base calling is performed by image processing. The native error rate for PacBio is about 15%, according to Nielsen’s slide. Nanopore sequencing is a direct readout of the nucleic acid, and base calling is performed using machine learning algorithms. The native error rate is variable, according to Nielsen. One interesting comparison on Nielsen’s slide was that “Nanopore now does duplex processing which is sort of like HiFi” from PacBio. I had not considered this comparison!

As basecallers evolved, the FAST5 format proved too slow. The POD5 data format was motivated by the SLOW5 format. Nielsen explained that “all Nanopore basecallers use machine learning. They are all trained on known genomes.” Questions from online participants included why basecallers use machine learning and whether eukaryotic genomes can be basecalled accurately.

Nielsen spoke about error correction and quality thresholds for PacBio. The discussion covered error rates and why homopolymer sequencing improved with the longer pores in R10.4.1 flow cells. Participants also noted that training basecallers on different genomes will be critical. If you are doing long-read-only assemblies, coverage and error correction with medaka are important considerations. Services that offer reasonable long- and short-read sequencing packages, Nielsen explained, may be cheaper than your time.
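To put the error rates and quality thresholds mentioned above on a common scale, here is a minimal sketch of the standard Phred conversion, Q = -10 · log10(p), where p is the per-base error probability. This was not shown in the session; the function names are my own and purely illustrative.

```python
import math


def error_rate_to_phred(p: float) -> float:
    """Convert a per-base error probability (0 < p <= 1) to a Phred quality score."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p)


def phred_to_error_rate(q: float) -> float:
    """Convert a Phred quality score back to a per-base error probability."""
    return 10 ** (-q / 10)


if __name__ == "__main__":
    # The ~15% native error rate from the slide corresponds to roughly Q8.
    print(f"15% error -> Q{error_rate_to_phred(0.15):.1f}")
    # A Q20 threshold means about 1 error per 100 bases.
    print(f"Q20 -> {phred_to_error_rate(20):.2%} error")
```

Seen this way, a 15% error rate is about Q8, while a Q20 threshold corresponds to a 1% error rate, which helps explain why error correction matters so much for long reads.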

What do we know about error rates and long-read sequencing? AI-generated image.