Alaina Shumate from Johns Hopkins University presented at the Nanopore Community Meeting 2021 on “The annotation of novel genes in a complete human genome.” They began by describing how, in 2003, scientists “finished” the Human Genome Project, yet sequence was still missing! In 2021, the Telomere-to-Telomere (T2T) Consortium actually completed the sequence of a human genome. Long-read sequencing made the T2T completion possible, with 126X coverage, a 58 Kbp N50, and a 1.3 Mbp maximum read length! The new assembly added 238 Mbp of sequence that were connected… and 182 Mbp of entirely novel sequence! Shumate focused on the nearly 2,000 new genes that were discovered. They developed a tool called Liftoff that can help find paralogs of currently annotated genes. Liftoff was designed “to address challenges specific to lifting over gene annotations.” It uses Minimap2 to align complete gene sequences, introns included. Alignment blocks that contain no exons are discarded, gaps are penalized, and a graph is used to represent the alignments. Liftoff cannot find entirely novel genes, so the Comparative Annotation Toolkit (CAT) was used for that purpose and identified 8 entirely novel genes among the nearly 2,000. Medically relevant paralogs were also identified, including one that affects HIV susceptibility. Shumate summarized that long-read sequencing established the complete human genome, and that Liftoff and CAT helped identify novel genes and paralogs.
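
To make the alignment steps from the talk a little more concrete, here is a minimal Python sketch of the idea: drop alignment blocks that cover no exon, then link the remaining blocks in a graph and pick the path that scores best once gaps are penalized. This is not Liftoff's actual code. The `Block` class, the `overlaps_exon` and `best_chain` functions, the scoring scheme, and the `gap_penalty` value are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """One aligned block: gene interval [q_start, q_end) placed at target [t_start, t_end)."""
    q_start: int
    q_end: int
    t_start: int
    t_end: int


def overlaps_exon(block, exons):
    """Return True if the block's gene interval overlaps any (start, end) exon."""
    return any(block.q_start < end and start < block.q_end for start, end in exons)


def best_chain(blocks, exons, gap_penalty=0.1):
    """Pick the highest-scoring colinear chain of exon-containing blocks.

    Hypothetical scoring: each block earns its aligned length, and each link
    between consecutive blocks loses gap_penalty times the difference between
    the gap in the gene and the gap in the target.
    """
    # Step 1: discard blocks that do not contain any exon sequence.
    kept = sorted(
        (b for b in blocks if overlaps_exon(b, exons)),
        key=lambda b: (b.q_start, b.t_start),
    )
    if not kept:
        return []

    # Step 2: treat blocks as graph nodes. An edge i -> j exists when block j
    # starts after block i ends in both the gene and the target.
    score = [b.q_end - b.q_start for b in kept]
    prev = [None] * len(kept)
    for j, bj in enumerate(kept):
        for i in range(j):
            bi = kept[i]
            if bi.q_end <= bj.q_start and bi.t_end <= bj.t_start:
                gap = abs((bj.t_start - bi.t_end) - (bj.q_start - bi.q_end))
                candidate = score[i] + (bj.q_end - bj.q_start) - gap_penalty * gap
                if candidate > score[j]:
                    score[j] = candidate
                    prev[j] = i

    # Step 3: walk back from the best-scoring node to recover the chain.
    j = max(range(len(kept)), key=lambda k: score[k])
    chain = []
    while j is not None:
        chain.append(kept[j])
        j = prev[j]
    return chain[::-1]


# Toy gene with two exons separated by an intron (made-up coordinates).
exons = [(0, 100), (300, 400)]
blocks = [
    Block(0, 100, 1000, 1100),    # exon 1
    Block(150, 250, 1150, 1250),  # intron only, so it is discarded
    Block(300, 400, 1300, 1400),  # exon 2, colinear with exon 1
    Block(300, 400, 9000, 9100),  # exon 2 placed far away, heavily penalized
]
chain = best_chain(blocks, exons)
print(chain)
```

In this toy input the intron-only block is filtered out, and the distant copy of the second exon loses to the nearby one because the mismatch between gene gap and target gap is penalized. A real lift-over tool has to deal with strand, sequence identity, and overlapping genes, none of which this sketch attempts.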
