Exploring N50 and Genome Assembly Techniques in KBase

Tonight I continued watching the LISA workshop videos. Lauren Lui from Lawrence Berkeley National Laboratory explained the apps available on KBase for long-read-only and hybrid assemblies. Using a KBase narrative, Lui shared a SPAdes short-read-only assembly with about 30 contigs. They described the N50 as a metric for comparing genome assemblies. To calculate N50, contigs are sorted from longest to shortest and their lengths are added up; the N50 is the length of the contig at which the running total reaches half of the total assembly length.

Flye is a long-read assembler. Lui noted that the reads do not need to be filtered beforehand, since Flye filters out shorter reads and considers the overlap between reads. This is good to know! Flye, in Lui’s example, assembled the complete genome. Polypolish can be used with trimmed Illumina reads to correct errors. In Lui’s example, 1% of positions changed! In Lui’s example of a hybrid assembly with Unicycler, trimmed Illumina reads and long reads are used together to improve the assembly. Lui did note that plasmids can have a copy number of less than one because not all cells may have the plasmid. Unicycler rotates the genome so that it starts at the origin, but Flye doesn’t.

Prokka can be used for annotation in KBase. Lui compared the number of annotated genes in polished and unpolished genomes. There were more genes in the assemblies that likely contained more errors. Some services polish with long reads. Medaka can be used for polishing. There was discussion about Medaka (and Dorado) not being allowed for commercial use. I didn’t know that!

Lui calculated coverage by dividing the number of bases in the Illumina reads by the genome size. The genome was 6.2 Mbases and coverage was 120x. Using 10% of the data, the coverage was only 12x. Lui recommends at least 20x for short-read-only assembly of bacterial genomes. One comment was that too much data can cause problems and demand too many computational resources.
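
The coverage figures above follow from a single division, which can be sketched in a few lines of Python. This is only an illustration: the total base count is not from the talk but is back-calculated from the 6.2 Mbase genome and the 120x coverage, and the function and variable names are made up for this example.

```python
def coverage(total_bases, genome_size):
    """Average depth of coverage: sequenced bases divided by genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


GENOME_SIZE = 6_200_000    # 6.2 Mbases, as stated in the talk
TOTAL_BASES = 744_000_000  # assumption: back-calculated as 120 * 6.2 Mbases

full = coverage(TOTAL_BASES, GENOME_SIZE)         # all of the Illumina data
tenth = coverage(TOTAL_BASES // 10, GENOME_SIZE)  # 10% subsample

# Bases needed to reach the recommended minimum of 20x for this genome
bases_for_20x = 20 * GENOME_SIZE

print(f"full data: {full:.0f}x")
print(f"10% of data: {tenth:.0f}x")
print(f"bases needed for 20x: {bases_for_20x:,}")
```

With these assumed inputs, the 10% subsample falls below the 20x minimum Lui recommended for short-read-only bacterial assemblies.
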
When you have a lot of data, Trycycler was recommended, since Ryan Wick wrote this series of scripts to address this issue and obtain the “best genome.” Lui noted that beyond about 200x coverage, contiguity doesn’t improve much. Lui explained that for Illumina-only assemblies, fewer than fifty contigs is “good,” and that the complexity of the genome and its repeats will factor into assembly. To calculate depth, Lui used reads trimmed with Trimmomatic. Lui shared Bandage graphs: the Unicycler assembly was a single circle. Lui concluded that she uses Flye for her genomes. For variant analysis, they recommended Clair3. One question was whether apps could be run automatically one after the next. Apps must be launched one by one, although each KBase user can run up to ten apps simultaneously.
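
N50, the contiguity metric from earlier in the talk, is easy to confuse with the median contig length. Here is a minimal Python sketch, assuming the standard definition: sort contigs from longest to shortest and report the length of the contig at which the running total first reaches half of the assembly size. The contig lengths are invented for illustration and do not come from Lui’s assemblies.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths."""
    if not contig_lengths:
        raise ValueError("need at least one contig")
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig down, accumulating length
    for length in sorted(contig_lengths, reverse=True):
        running += length
        # Stop at the first contig that brings us to half the assembly
        if running * 2 >= total:
            return length


# Hypothetical contig lengths (total = 300)
lengths = [80, 70, 50, 40, 30, 20, 10]

result = n50(lengths)
median = sorted(lengths)[len(lengths) // 2]

print("N50:", result)     # 80 + 70 = 150, half of 300, so N50 is 70
print("median:", median)  # the middle contig is 40
```

In this toy example the N50 (70) is well above the median contig length (40), because N50 weights contigs by how much of the assembly they cover.
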

What assembly approaches are available for bacterial genomes using KBase? AI-generated image.