The Advantages of POD5

I have been thinking and learning about POD5 files. Tonight, I watched a relevant Nanopore Community Meeting 2022 session by Alex Merry, Instrument Software Fellow at Oxford Nanopore Technologies. The session was entitled “Arrow: pointing the way forward for high-performance nanopore signal handling with POD5.” Merry began by talking about the FAST5 files used to store signal data and then introduced POD5. They defined signal as the measurement of the actual electrical current across the nanopore over time, recorded for a single strand of RNA or DNA. Basecalling uses models of the signal data to turn signal into sequences.

Merry asked: why store signal if MinKNOW can generate FASTQ files? They mentioned that you may want to analyze the signal on an HPC system or basecall the data again in the future. Currently, FAST5 uses the HDF5 format to store a variety of different data. However, HDF5 files are slow to read and write. I didn’t know that it is also hard to recover partially written files. Merry mentioned that the motivation for changing this convention was to increase performance and allow basecallers to read signal quickly.

POD5, Merry explained, consists of three Apache Arrow tables packed together: run info, read metadata, and signal data. Arrow is fast and efficient to use, Merry noted. To keep the format simple, they wanted POD5 files to contain just the information that is needed. The core table is the reads table, and the signal data is kept in a separate table. There is a GitHub repository with POD5 information, and Nanopore offers a converter tool through a website.

MinKNOW will now be able to work with and generate POD5 files. I want to start generating POD5 files and using them. I think this will be great for the processing and storage setup we have.
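To make the three-table idea concrete for myself, I wrote a small Python sketch. It is a toy in-memory model and not the real POD5 layout or the pod5 library: the class names, field names, and sample values are all hypothetical, and I am assuming that a read points at its run info and its signal rows by index. It only illustrates why keeping small per-read metadata apart from the bulky signal could make lookups cheap.

```python
from dataclasses import dataclass
from typing import List

# Toy model only: names and fields are hypothetical, not the POD5 schema.

@dataclass
class RunInfo:
    acquisition_id: str
    sample_rate_hz: int

@dataclass
class SignalRow:
    read_id: str
    samples: List[int]  # one chunk of raw current measurements

@dataclass
class ReadRow:
    read_id: str
    run_info_index: int     # which row of the run info table applies
    signal_rows: List[int]  # which rows of the signal table hold this read

run_info_table = [RunInfo("run-A", 4000)]

signal_table = [
    SignalRow("read-1", [512, 498, 505]),
    SignalRow("read-2", [601, 588]),
    SignalRow("read-1", [470, 481]),
]

reads_table = [
    ReadRow("read-1", 0, [0, 2]),
    ReadRow("read-2", 0, [1]),
]

def full_signal(read: ReadRow) -> List[int]:
    """Concatenate the signal chunks that a read points to, in order."""
    samples: List[int] = []
    for row_index in read.signal_rows:
        samples.extend(signal_table[row_index].samples)
    return samples

def duration_seconds(read: ReadRow) -> float:
    """Number of samples divided by the sample rate of the read's run."""
    rate = run_info_table[read.run_info_index].sample_rate_hz
    return len(full_signal(read)) / rate

if __name__ == "__main__":
    for read in reads_table:
        print(read.read_id, full_signal(read), duration_seconds(read))
```

In this sketch, scanning the reads table never touches the signal samples, so the large signal chunks are only loaded for the reads I actually ask about.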

Bean pod with green beans on a table
Why use POD5 files instead of FAST5? Pod image from Openverse.