Enhancing Genome Predictions through Data Integration


I started another KBase session because I want to continue learning and be prepared for courses. This semester I want to share genomes and narratives. Tonight, I started watching the KBase Science Session: Data integration to support (or refute) predictions. Elisha Wood-Charlson from Lawrence Berkeley National Laboratory was the first speaker, presenting a session titled “Why bother with data integration, and why does my data matter?” There were several speakers, and Wood-Charlson opened the talks and wrapped up the session.

She introduced the FAIR principles of findable, accessible, interoperable, and reusable data, along with the idea of moving to COPE-ing with big, complex biological data! I had not heard of this! COPE stands for Comparable, Organized, Predictive, and Engaged. Data quality and value are important for comparisons. Organized means creating shareable data that other users can navigate and query. Predictive results should be propagated across domains to become more generalizable. Finally, Engaged means engaging the community “to advance a new science culture” and to understand “where there are gaps in the models.”

Dale Pelletier from Oak Ridge National Laboratory presented next on “Measuring microbial phenotypes for improving genome-based predictions.” This work was a collaboration among KBase, ENIGMA, and Plant-Microbe Interfaces (PMI). PMI studies plant-microbe interactions in Populus trees, while ENIGMA focuses on the subsurface microbiome. Both groups have taken “reductionist approaches,” isolating strains and studying their physiology in the lab. PMI maintains a culture collection and has generated genome sequences for these strains, with the goal of performing functional predictions. Pelletier explained that they have run thousands of pairwise interactions to learn about community assembly, niche preference, inhibition, metabolic interactions, and growth phenotypes. The team has also developed synthetic communities to study.
The goal was to develop a “robust pipeline for predicting phenotype from genotype.” Some of the challenges Pelletier noted include incomplete or inaccurate gene annotations, poor accuracy of model predictions, and a lack of standardized large training data sets. PMI’s approach was to develop a standardized framework for feature representation and phenotype training sets. This all started with a MAG that ENIGMA identified and that PMI had in its strain repository. The two groups wanted to develop standardized phenotyping datasets to improve model predictions.

The project started with 12 strains (6 PMI isolates and 6 ENIGMA isolates) and was replicated at two locations: ORNL and LBNL. They used standardized media, R2A and NLDM (defined vs. rich), and ran growth curves on an Epoch 2 plate reader at OD600 and 25 °C, obtaining lag time, maximum growth rate, diauxic growth, carrying capacity, and death rate. The groups standardized initial ODs and conditions. There was still variability between labs, even though replicates within a lab were almost identical. They then used a machine learning approach to develop a standardized classifier for phenotypic prediction and to learn which features were most informative. They were able to train the classifier on 30+ carbon sources. Next, they want to define protocols and expand tests with additional growth conditions. I was intrigued and visited the PMI site to learn about their strains and resources.
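The growth-curve parameters Pelletier mentioned (lag time, maximum growth rate, carrying capacity) can be estimated directly from an OD600 time series. Here is a minimal sketch of that idea using plain NumPy on a synthetic logistic curve; the function name and all numbers are hypothetical illustrations, not values from the talk:

```python
import numpy as np

def growth_features(t, od):
    """Estimate simple growth-curve features from OD600 readings.

    Returns lag time (h), maximum specific growth rate (1/h), and
    carrying capacity (max OD), read directly off the curve.
    """
    log_od = np.log(np.clip(od, 1e-6, None))
    rates = np.gradient(log_od, t)   # specific growth rate over time
    i = int(rates.argmax())
    mu_max = rates[i]                # maximum specific growth rate
    k = float(od.max())              # carrying capacity ~ plateau OD
    # Lag time: where the steepest tangent to log(OD) meets the initial OD
    lag = t[i] - (log_od[i] - log_od[0]) / mu_max
    return {"lag_time": float(lag), "mu_max": float(mu_max),
            "carrying_capacity": k}

# Synthetic logistic growth curve with a visible lag phase (hours)
t = np.linspace(0, 24, 97)
od = 0.05 + 0.9 / (1 + np.exp(-0.8 * (t - 12)))
feats = growth_features(t, od)
print(feats)
```

A real pipeline would fit a parametric model (e.g., logistic or Gompertz) rather than use finite differences, and features like these across 30+ carbon sources would then form the training matrix for a phenotype classifier.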

What can we learn from sharing data and resources with KBase and partners? AI-generated image.