Kiran V. Garimella from the Broad Institute spoke at the Nanopore Community Meeting 2019 about “Long-read genomes and transcriptomes on the cloud.” Garimella shared photos of the PromethION 48 at the Broad and a graphic showing the amount of data produced at the institute over the last decade. At one point, the amount of data generated by sequencing at the Broad outpaced the ability to provide storage space. The on-premises HPC cluster at the Broad is robust, but it is limited by the volume of work and the number of people sharing it. The goal with long-read data has therefore been to migrate the tools to the cloud.

Garimella explained that the group built their pipelines in the Workflow Description Language (WDL) and packaged their applications into Docker images and containers. In the new setup, some development remains local, while most production work runs in the cloud. The ability to spin compute resources up on demand and shut them down afterwards reduces costs. One project supported by this infrastructure is the Rare Genomes Project, where long-read approaches are used when short-read data do not yield plausible variants. For example, Garimella’s group has adapted assemblers so that they can be launched on Google Cloud.

Transcriptome sequencing with long reads is particularly powerful. Direct RNA sequencing proved too challenging because of its low throughput, whereas PCR-amplified cDNA produced sufficient yield. The team selected samples with high RNA quality and leveraged cloud-based GPUs; after re-basecalling the data, they used FLAIR for isoform discovery. This process resulted in the discovery of a large number of candidate isoforms. With this infrastructure, the group has generated transcriptome data and workflows that benefit the wider community.
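To give a sense of what such a cloud-ready pipeline looks like, below is a minimal WDL sketch of a single containerized task wrapping FLAIR isoform collapse, of the kind a workflow engine such as Cromwell can dispatch to Google Cloud. The Docker image name, resource requests, exact FLAIR flags, and output file names are illustrative assumptions, not details from the talk.

```wdl
version 1.0

task FlairCollapse {
  input {
    File genome_fasta
    File corrected_bed
    File reads_fastq
  }

  command <<<
    set -euo pipefail
    # Hypothetical invocation; exact FLAIR subcommands and flags vary by version.
    flair collapse \
      -g ~{genome_fasta} \
      -q ~{corrected_bed} \
      -r ~{reads_fastq} \
      -o isoforms
  >>>

  runtime {
    # Placeholder image; in practice this would point at the group's own registry.
    docker: "us.gcr.io/example-project/flair:1.5"
    cpu: 4
    memory: "16 GB"
    disks: "local-disk 100 SSD"
    # Preemptible VMs are one common way short-lived cloud jobs keep costs down (assumption).
    preemptible: 2
  }

  output {
    # Output names assume FLAIR's "<prefix>.isoforms.*" naming convention.
    File isoform_bed   = "isoforms.isoforms.bed"
    File isoform_fasta = "isoforms.isoforms.fa"
  }
}

workflow IsoformDiscovery {
  input {
    File genome_fasta
    File corrected_bed
    File reads_fastq
  }

  call FlairCollapse {
    input:
      genome_fasta = genome_fasta,
      corrected_bed = corrected_bed,
      reads_fastq = reads_fastq
  }

  output {
    File isoform_bed   = FlairCollapse.isoform_bed
    File isoform_fasta = FlairCollapse.isoform_fasta
  }
}
```

When a workflow like this runs on a Google Cloud backend, each task is executed on a virtual machine that is created for the job and deleted when it finishes, which is the spin-up/spin-down behaviour Garimella described as the main source of cost savings.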
