Starting to Assemble Genomes

Eoghan Harrington and David Stoddard from Oxford Nanopore Technologies gave a talk on assembly on December 18, 2018, as part of the Nanopore Learning Knowledge Exchange series. Both are in the Applications group. Harrington started by sharing how they approach new assembly projects, from planning to sequencing. They listed the steps: planning, sequencing, assembly, and quality control. Throughout the talk they used the chicken genome they sequenced as an example. Harrington defined an assembly as a model of the genome sequence of your organism of interest. The assembly can be used for functional studies of gene models, evolutionary analysis (variation in genome sequence and structure across species), or population genetics. The assembly graph can contain ambiguity, for example because of repeats. Sources of repeats include satellites (VNTRs), transposable elements, segmental duplications, and homologous chromosomes. Higher throughput increases the chances that all of the genome, including its repeats, will be represented in the assembly graph. Longer reads help span repetitive elements. Harrington noted that, before starting, it is important to estimate the genome size and to aim for 40X coverage. The chicken genome they sequenced is about 1 Gigabase, with 12% repetitive elements based on RepBase. There are 33 chromosomes + W and Z… and microchromosomes, the smallest being 0.73 Mb. Harrington noted that they wanted to test the nuclease wash and to assemble the genome with one flow cell of data. Stoddard explained that they wanted to start with a fresh prep containing enough DNA for a PCR-free library prep. They noted that, to keep the pores full, loading 5-50 fmol of “good quality” library is desired. Pore blocking can lead to “unavailable” pores. A nuclease wash removes all DNA. Size selection could help enrich for larger fragments. However, longer strands seem to be harder for MinKNOW to clear.
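The 40X coverage target and the 5-50 fmol loading range come down to quick arithmetic. The sketch below is my own illustration, not something shown in the talk: the function names are hypothetical, the 660 g/mol average mass per base pair of double-stranded DNA is a common approximation, and the 400 ng / 20 kb library in the usage lines is a made-up example.

```python
def required_yield_gb(genome_size_gb, target_coverage):
    """Total bases to sequence (in Gb) for a given genome size and coverage."""
    return genome_size_gb * target_coverage


def ng_to_fmol(mass_ng, mean_fragment_bp, g_per_mol_per_bp=660.0):
    """Convert a mass of double-stranded DNA to femtomoles of fragments.

    Assumes an average of ~660 g/mol per base pair (an approximation).
    """
    grams = mass_ng * 1e-9
    moles = grams / (mean_fragment_bp * g_per_mol_per_bp)
    return moles * 1e15  # mol -> fmol


if __name__ == "__main__":
    # ~1 Gb chicken genome at the 40X target mentioned in the talk
    print(required_yield_gb(1.0, 40), "Gb of sequence needed")
    # Hypothetical library: 400 ng with a mean fragment length of 20 kb
    print(round(ng_to_fmol(400, 20_000), 1), "fmol")
```

The second calculation shows why fragment length matters for loading: the same mass of DNA contains fewer molecules as fragments get longer.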
Quality control for DNA requires mass quantification and an assessment of contamination. Stoddard warned against using RNase A that may have nuclease activity. Stoddard explained that NanoDrop can help identify potential contaminants, and recommended avoiding vortexing and freeze/thaw cycles. For the chicken genome pilot studies, the team used commercially available chicken blood, which contains nucleated red blood cells. They tested the QIAGEN Genomic Tip and QIAamp kits. The team used the Genomic Tip for the pilot, with one nuclease wash and reload, to obtain ~99 Gbases on a PromethION flow cell. Harrington explained that, for assembly, reads are first filtered by length and mean Q-score. Adapters and barcodes are trimmed, and read correction is attempted. Assembly follows an overlap, layout, consensus approach, and polishing then boosts consensus accuracy by using raw signal data or an algorithm to correct errors. With the Miniasm workflow, the team started with Filtlong, followed by Porechop and Racon. Miniasm and Minimap2 were used for assembly. Racon and Medaka were used for base-space polishing, and Nanopolish for polishing with the raw signal. The ONT team put together a package called Pomoxis. For quality control, contig length N50 and gene content are often used. BUSCO is a tool that uses clade-specific gene lists. QUAST bundles several useful assembly quality metrics, Harrington noted. They ended by sharing several references about genome assembly. I shared this video with the wax worm team.
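Since contig N50 came up as a quality metric, here is a minimal sketch of how it is calculated. This is my own illustration, not code from the talk or from QUAST. The function name and the contig lengths in the usage lines are made up. N50 is the length of the contig at which the running total, counted from the longest contig down, first reaches half of the total assembly length.

```python
def n50(lengths):
    """Return the N50 of a list of contig (or read) lengths."""
    if not lengths:
        raise ValueError("need at least one length")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down until half the total is covered
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


if __name__ == "__main__":
    # Made-up contig lengths, for illustration only
    contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]
    print(n50(contigs))
```

A tool such as QUAST reports this alongside other metrics, so the sketch is only meant to make the definition concrete.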

puzzle pieces
How did the ONT team improve chicken genome assemblies? Photo by Pixabay on Pexels.com