A large number of students from the Bioinformatics and Computational Biology (BCB) Graduate Program at Iowa State University will be presenting at an international conference to be held in Chicago, IL, July 6 through 10. The conference, Intelligent Systems for Molecular Biology (ISMB) 2018, is in its 26th year.
From their website: "ISMB brings together scientists from computer science, molecular biology, mathematics, statistics and related fields. Its principal focus is on the development and application of advanced computational methods for biological problems.
ISMB 2018 provides an intense multidisciplinary forum for disseminating the latest developments in bioinformatics/computational biology, fostering fresh dialogues and perspectives to learn about and shape the future of the field. With intensified community involvement from the ISCB Communities of Special Interest (COSIs), ISMB is the leading conference in the field, with the strongest scientific and technical program that showcases the best international developments in bioinformatics and computational biology."
Here is the information on BCB student presentations:
Major Professors: Carolyn Lawrence-Dill, GDCB, and Marna Yandeau Nelson, BBMB
Short Description: Conversion of text-based phenotypic data to EQ statements composed of ontology terms can drive biological discovery by enabling large-scale diversity analyses. We are developing computational methods to automate this conversion process.
Abstract: Phenotypic diversity analyses are the basis for research discoveries that span the spectrum from basic biology (e.g., gene function and pathway membership) to applied research (e.g., plant breeding). In cases where equivalent phenotypes across individuals or groups are not anatomically similar, high-throughput, computational classification is possible if the traits and phenotypes are documented using standardized, language-based descriptions. In the case of text phenotype data, conversion to computer-readable “EQ” statements enables such large-scale analyses. EQ statements are composed of entities (e.g., leaf) and qualities (e.g., length) drawn from terms in ontologies. We present a method for automatically converting free-text descriptions of plant phenotypes to EQ statements using a machine learning approach. Random forest classifiers identify potential matches between phenotype descriptions and terms from a set of ontologies including GO, PO, and PATO. Features used include semantic, syntactic, and context similarity measures. The classifiers are trained and tested using a dataset of text descriptions and EQ statements from the Plant PhenomeNET project (Oellrich, Walls et al., 2015). The most likely candidate terms are used to compose EQ statements with confidence scores. Results of evaluating the accuracy of this approach are presented, and its potential use to enable automated phenolog discovery is discussed.
Major Professors: Carolyn Lawrence-Dill, GDCB, and Matthew Hufford, EEOB
Short Description: We built a Shiny web application that helps researchers compare RNA-seq data, mainly through hypothesis-based tests between predefined groups of genes.
Abstract: Determining gene function remains a fundamental problem in biology. Measuring gene expression levels via transcript analysis across various treatments and developmental stages from many tissues greatly facilitates gene, pathway, and genomic functional annotation and interpretation. Here we present a method that involves transforming expression data to approximate a normal distribution followed by dividing the genes into groups, then applying Gaussian parametric methods to assess the significance of observed differences. This method enables the assessment of differences in gene expression distributions within and across samples, enabling hypothesis-based comparison among groups of genes. We have implemented our method on a Shiny server with a user-friendly interface. It has the following features: 1) visualization of gene expression value distributions; 2) visualization of the data normality assumption; and 3) performance of Student’s t-test or Wilcoxon signed-rank test between two groups of genes. This application enables biologists to access a de novo statistics pipeline and to conduct an entirely new kind of RNA-seq-based, hypothesis-driven research.
Major Professors: Geetu Tuteja, GDCB, and Karin Dorman, Statistics
Track: HitSeq: High-throughput Sequencing
Short Description: We develop and use bioinformatics tools to study the regulatory networks behind early placental development using next-generation sequencing (NGS) datasets.
Title: TissueEnrich: A tool to calculate tissue-specific gene enrichment
Abstract: The development of RNA-Seq technology has enabled large-scale comparison of gene expression in a multitude of developmental stages, cell-types, and conditions. RNA-Seq data analysis typically results in the identification of genes that likely have a shared function. Although gene ontology enrichment analysis is widely used to identify enriched pathways in gene groups, it does not determine enrichment of tissue-specific genes. Understanding which groups of genes are tissue-specific is valuable, as tissue-specific genes are more likely to be associated with human disease. Therefore, we developed “TissueEnrich”, a tool to carry out tissue-specific gene enrichment. TissueEnrich uses RNA-Seq data sets from mouse and human to define tissue-specific genes, using a method developed by the Human Protein Atlas. It uses the hypergeometric test to calculate the enrichment of tissue-specific genes in an input gene list. We tested TissueEnrich using genes that are highly expressed in cardiomyocytes and trophoblast-like cells, each differentiated from embryonic stem cells. We found tissue-specific enrichment for heart and placenta, respectively, validating the robustness of our tool. TissueEnrich is available as a web application, allowing the user to visualize tissue-specific gene enrichment as a barplot, as well as visualize expression of tissue-specific genes.
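The hypergeometric enrichment test mentioned in this abstract can be illustrated with a minimal Python sketch. This is not TissueEnrich's code: the function name, argument names, and the gene counts in the usage lines are hypothetical and chosen only for illustration. It computes the probability of seeing at least the observed number of tissue-specific genes in an input list drawn at random from the full gene set.

```python
from math import comb


def hypergeom_enrichment_pvalue(n_genes, n_tissue_specific, n_input, n_overlap):
    """Upper-tail hypergeometric probability P(X >= n_overlap).

    n_genes           -- size of the background gene set (hypothetical name)
    n_tissue_specific -- background genes labeled tissue-specific
    n_input           -- size of the input gene list
    n_overlap         -- input genes that are tissue-specific
    """
    upper = min(n_tissue_specific, n_input)
    total = comb(n_genes, n_input)
    # Sum the probability mass for every overlap at least as large as observed.
    tail = sum(
        comb(n_tissue_specific, i) * comb(n_genes - n_tissue_specific, n_input - i)
        for i in range(n_overlap, upper + 1)
    )
    return tail / total


# Illustrative numbers only: 20,000 background genes, 500 tissue-specific,
# an input list of 100 genes, 15 of which are tissue-specific.
p_value = hypergeom_enrichment_pvalue(20000, 500, 100, 15)
print(f"enrichment p-value: {p_value:.3g}")
```

A small p-value here would suggest that the input list contains more tissue-specific genes than expected by chance; in practice the test would be repeated per tissue and corrected for multiple testing.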
Major Professors: Julie Dickerson, E CPE, and Carolyn Lawrence-Dill, GDCB
Track: Machine Learning in Computational and Systems Biology
Short Description: We present tissue-specific mRNA isoform functional networks in addition to an organism-level reference functional network for mouse. Our workflow integrates heterogeneous sequence and expression-level features using a random forest trained with single-isoform genes and "NOT"-tagged Gene Ontology annotations.
Abstract: Alternative splicing produces multiple mRNA isoforms of a gene, which play diverse and important roles in processes such as regulation of gene expression, human heritable diseases, and response to environmental stresses. However, very little has been done to assign functions at the mRNA isoform level. Functional networks, in which interactions are quantified by the probability that two members are involved in the same biological process, have also usually been generated at the gene level. We have developed 17 tissue-specific mRNA isoform functional networks in addition to an organism-level reference functional network for mouse. Using the leave-one-out strategy with a diverse array of tissue-specific RNA-Seq datasets and sequence information, we trained a random forest model to predict the functional networks. Because there is no mRNA isoform-level gold standard, we used single-isoform genes co-annotated to Gene Ontology (GO) biological processes, KEGG/BioCyc pathways, and protein-protein interactions as functionally related pairs (positive pairs). The non-functional pairs (negative pairs) were generated using GO annotations tagged with the “NOT” qualifier. We validated our networks by comparing their performance with previous methods and with randomized positive and negative class labels, and by literature evidence. These networks will be made available to the mouse genetics community.
Major Professors: Matthew Hufford, EEOB, and Carolyn Lawrence-Dill, GDCB
Track: Evolution and Comparative Genomics
Short Description: Creation of a quality assessment pipeline for maize genome assemblies and annotations and development of a system for their comparison
Abstract: The emergence of new gene editing tools such as CRISPR-Cas systems and TALENs has heightened the need to develop new and efficient transformation technologies in plants, including maize. However, maize transformation is not straightforward. Not all lines are easily transformed, so not all can be directly subjected to genome editing. The maize genome reference sequence, for example, is currently based on the B73 inbred line, which is not readily transformed. Thus, the motivation for this project is to assemble a maize inbred line which is highly similar to the B73 reference genome but transformable. B104 meets these criteria. In fact, in recent years, transformation facilities have heavily relied on B104 for successful maize transformation. The availability of a B104 genome sequence would allow mapping of genetic variants between B73 and B104 and help illuminate the genetic architecture of the transformability trait. Here we report progress toward a de novo hybrid assembly of the B104 genome using PacBio and Illumina mate pair and paired end data. Additionally, we describe our strategy for annotation of B104 gene models using a customized version of the Maker-P pipeline that is informed by RNA-Seq data collected across ten distinct plant tissues. This new assembly of B104 will represent an improvement in both sequence contiguity and completeness of gene annotations relative to the existing draft assembly and will provide basic insights into the genetic basis of plant transformation.
Major Professors: Drena Dobbs, GDCB, and Robert Jernigan, BBMB
Short Description: Machine learning-based prediction of RNA-protein interaction partners is enhanced over utilizing primary sequence information alone by incorporating information concerning intrinsically disordered regions in proteins. We have created a method, RPIDisorder, to leverage this information.
Abstract: RNA-protein interactions are implicated in a wide range of critical regulatory and structural roles whose disruption can lead to numerous diseases. Computational methods for predicting RNA-protein interaction partners (RPIPs) are valuable because experimentally characterizing these interactions is time-consuming and expensive. Published prediction methods utilize various sequence and structural features, but are generally limited by high false positive rates (FPRs) and/or query sequence length. Because intrinsically disordered regions (IDRs) are abundant in RNA-binding sites of proteins, we hypothesized that incorporating IDR information with sequence features could improve prediction of RPIPs. We developed a new random forest machine learning classifier, RPIDisorder, which requires only primary sequences of potential RNA and protein interaction partners as input. RPIDisorder outperformed our published classifier, RPISeq, on an independent test set of 11,281 RPIPs and 971 non-interacting pairs, with MCC 0.68 (vs 0.47) and FPR 21% (vs 55%). In a case study, RPIDisorder was used to identify RNAs bound to the Fragile-X Mental Retardation Protein (FMRP). On a test set of 30 RNAs (14 binding and 16 non-binding ncRNAs), RPIDisorder achieved an MCC of 0.73 and FPR 6.3%.
These results indicate that incorporating IDR information can improve the reliability of RNA-protein partner prediction over sequence composition alone.
Major Professors: Karin Dorman, Statistics
Short Description: We propose Ampliclust, a new reference-free model-based method to resolve amplicon sequence variants from Illumina Amplicon data. This fully probabilistic model enables us to classify amplicon sequences through maximizing the likelihood based on the whole dataset.
Abstract: Next-generation amplicon sequencing is a powerful tool for understanding microbial communities.
Downstream analysis is often based on the construction of Operational Taxonomic Units (OTUs) with a 3% dissimilarity threshold.
The arbitrary threshold and reliance on OTU references can lead to low resolution, false positives, and misestimation of alpha and beta microbial diversity.
We introduce Ampliclust, a reference-free method to resolve the number, abundance and identity of error-free sequences in Illumina Amplicon data.
Unlike existing methods, Ampliclust is a fully probabilistic model, allowing the data, rather than an algorithm or an external database, to drive the conclusions.
We use a modified Bayesian information criterion to estimate the number of sequence variants and obtain maximum likelihood estimates of the abundance and identity of error-free sequences.
Our model is able to match the performance of DADA2 on well-separated mock communities, but in simulated communities with more similar real sequences, Ampliclust can achieve better accuracy.
The major challenge is the computational scalability, which we begin to address through principled iterative schemes and improved initialization methods.
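The idea of choosing the number of sequence variants with an information criterion can be sketched as follows. The abstract does not specify how Ampliclust modifies the Bayesian information criterion, so this Python sketch uses the standard BIC as a stand-in; the log-likelihoods, parameter counts, and read count in the usage lines are hypothetical.

```python
from math import log


def bic(log_likelihood, n_params, n_obs):
    """Standard BIC: parameter penalty minus twice the log-likelihood.

    Lower is better. Ampliclust's modified criterion is not described in
    the abstract, so this is the textbook form only.
    """
    return n_params * log(n_obs) - 2.0 * log_likelihood


def select_k(candidates, n_obs):
    """Return the K whose fitted model has the lowest BIC.

    candidates maps K -> (maximized log-likelihood, number of free parameters).
    """
    return min(candidates, key=lambda k: bic(*candidates[k], n_obs))


# Hypothetical fits for K = 1, 2, 3 clusters on 1,000 reads.
fits = {1: (-1000.0, 10), 2: (-900.0, 20), 3: (-895.0, 30)}
best_k = select_k(fits, n_obs=1000)
print(f"selected K = {best_k}")
```

In this toy example, moving from two to three clusters improves the likelihood only slightly, so the penalty for the extra parameters outweighs the gain and the smaller model is preferred.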
Major Professors: Dianne Cook, Statistics and Amy Toth, EEOB
Abstract: Next-generation amplicon sequencing is a powerful tool for understanding microbial communities. Downstream analysis is often based on the construction of Operational Taxonomic Units (OTUs) with a 3% dissimilarity threshold. Although many methods for OTU construction have been proposed, they are limited by the arbitrary threshold and by incomplete reference databases, which lead to low resolution and a high false positive rate and may bias estimates of the alpha and beta diversity of microbial communities. Here we propose Ampliclust, a new reference-free, model-based method to resolve amplicon sequence variants from Illumina amplicon data. This fully probabilistic model enables us to classify amplicon sequences by maximizing the likelihood based on the whole dataset. A modified Bayesian information criterion is proposed to better estimate the number of clusters K.
Major Professors: Eve Wurtele, GDCB, and Karin Dorman, Statistics
Abstract: The primary analysis on simulated communities with similar true sequence variants suggests that Ampliclust could achieve better accuracy than DADA2 under some conditions. The current challenge for Ampliclust is its computational cost, which we are addressing with an iterative version under development and a better initialization method.
Major Professors: Geetu Tuteja, GDCB, and Heike Hofmann, Statistics
Short Description: I use next-generation sequencing techniques to understand how the placenta develops and functions on a molecular level. Specifically, I am interested in how the placenta changes to regulate the influx of nutrients and transfer them to the fetus.
Abstract: The placenta is crucial during pregnancy, regulating proper fetal growth and development. However, many aspects of placental function and development are not yet fully understood. We therefore aimed to identify active and repressed gene networks in mouse placenta at e9.5. We generated open chromatin data using ATAC-seq, and integrated it with previously published transcriptomic data. RNA-seq reads were quantified using transcripts per million (TPM), and ATAC-seq reads were quantified at gene promoters using the maximum read pileup (coverage). We then grouped genes based on their TPM and promoter coverage values. Genes with high expression and high coverage were enriched for house-keeping functions. Surprisingly, we identified genes with high expression and medium-low coverage that were enriched for placenta-related terms, including vasculogenesis and endothelial cell migration. We also identified genes that have low expression and high promoter coverage and, within this group, we extracted a protein-protein interaction network enriched for neuronal functions. Finally, we generalized these findings by running our analysis pipeline on eight other tissues/cell lines. We found that the genes with medium-low coverage and high expression are consistently enriched for tissue-specific terms and genes. We also identified potentially repressed neuronal networks in placental cells and embryonic stem cells.
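The transcripts per million (TPM) quantification used in this abstract follows a standard definition, sketched below in Python. The function name and the read counts and transcript lengths in the usage lines are hypothetical; the abstract does not describe the software actually used.

```python
def transcripts_per_million(read_counts, lengths_bp):
    """Convert per-gene read counts to TPM.

    Each count is first divided by transcript length to get a per-base
    rate; the rates are then rescaled so they sum to one million.
    """
    rates = [count / length for count, length in zip(read_counts, lengths_bp)]
    total_rate = sum(rates)
    return [rate / total_rate * 1_000_000 for rate in rates]


# Hypothetical counts and transcript lengths for three genes.
counts = [100, 300, 600]
lengths = [1000, 3000, 2000]
tpm = transcripts_per_million(counts, lengths)
for gene_index, value in enumerate(tpm):
    print(f"gene {gene_index}: {value:.1f} TPM")
```

Because TPM values always sum to one million within a sample, they are convenient for binning genes into high, medium, and low expression groups, as done in the grouping step described above.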
Valeria Velasquez Zapata
Major Professors: Roger Wise, Plant Path and Microbiology
Short Description: We outlined computational steps to discover transcriptional regulators that govern the temporal dynamics of plant immunity. We paired the barley genome assembly with extensive barley-Bgh expression data to predict defense gene modules, immune-active cis-regulatory elements (CRE) and their cognate transcription factors (TFs).
Abstract: Obligate fungal pathogens are a major threat to cereal grain production worldwide, and represent ideal tools for exploring interdependent signaling between disease agents and their hosts. We performed an expression Quantitative Trait Locus (eQTL) analysis to interrogate the temporal control of immunity-associated gene expression in barley (Hordeum vulgare L.) challenged with the powdery mildew fungus, Blumeria graminis f. sp. hordei (Bgh), identifying two highly significant clusters of trans eQTL. Using these data, we outlined computational steps to discover transcriptional regulators that govern the temporal dynamics of plant immunity. We paired the barley genome assembly with extensive barley-Bgh expression data using two complementary approaches to predict defense gene modules, immune-active cis-regulatory elements (CRE) and their cognate transcription factors (TFs): First, we compared experimentally validated TF-CRE pairs with barley promoter sequences and calculated an enrichment score and FDR-adjusted p-value using Fisher's exact test. Second, we performed de novo CRE discovery. Consistent with our hypothesis, we identified overrepresented CREs in promoters of the trans eQTL-associated gene sets. Over 70% of the recovered motifs were consistent between the analyses, some of them novel. These results were represented with unrooted phylogenetic trees of each barley TF family, and used as a selection tool for experimental validation.
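The FDR adjustment of enrichment p-values mentioned in this abstract can be sketched in Python. The abstract does not name the correction procedure, so the Benjamini-Hochberg method is assumed here as a common choice; the raw p-values in the usage lines are hypothetical stand-ins for per-motif Fisher's exact test results.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical raw p-values, one per candidate cis-regulatory element.
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)
for p, q in zip(raw, adjusted):
    print(f"p = {p:.3f} -> FDR-adjusted = {q:.3f}")
```

Adjusting in this way controls the expected proportion of false discoveries across all motifs tested, which matters when many TF-CRE pairs are compared against many promoter sets.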
Major Professors: Carolyn Lawrence-Dill, GDCB, and Erik Vollbrecht, GDCB
Track: Function COSI
Abstract: Maize is both a crop species and a model for genetics and genomics research. Maize GO annotations from Gramene and Phytozome are widely used to derive hypotheses for crop improvement and basic science. The maize-GAMER project is an effort to assess existing maize GO annotations and to improve the quality and quantity of annotations. We designed and implemented a plant-specific reproducible meta-annotator (GO-MAP) that uses diverse component methods, including sequence similarity, domain presence, and three CAFA tools (Argot2, FANN-GO, and Pannzer), to predict GO terms for maize genes and combines the predicted annotations into an aggregate dataset. Annotations from Gramene, Phytozome, and maize-GAMER were assessed and compared. Compared to Gramene and Phytozome, the maize-GAMER dataset annotates more genes and assigns more GO terms per gene. The quality of annotations was evaluated using an independent gold-standard dataset (2,002 GO annotations for 1,619 genes) from MaizeGDB. In the CC category, maize-GAMER was the top performer, but it ranked slightly behind Gramene in both MF and BP categories. The maize-GAMER GO annotations have been released publicly, and the containerized GO-MAP tool will soon be released to facilitate annotation of other plant proteomes.
Naihui (Ashley) Zhou
Major Professors: Iddo Friedberg, VMPM, and Mark Kaiser, Statistics
Short Description: The updates to the Critical Assessment of Functional Annotation (CAFA) Challenge
Abstract: The third CAFA challenge (CAFA3) released its prediction targets in September 2016, and preliminary results were announced in July 2017. CAFA3 featured a term-centric track where predictors were asked to associate a large set of genes (the complete genomes of Candida albicans and Pseudomonas aeruginosa) with a limited set of functions. By collaborating with experimental biologists, we were able to use unpublished whole-genome screen results to evaluate these predictions. To specifically address this question, we hosted an additional challenge, CAFA 3.14 (CAFA-Pi), that is dedicated to evaluating term-centric predictions. The final CAFA3 results as well as preliminary CAFA-Pi results will be released and discussed, in addition to highlights of the term-centric evaluations and benchmark proteins.
Short Description: We used a Markov Random Field as a statistical tool to study the three-dimensional correlation of gene expression.
Abstract: Chromatin and its 3D organization play important roles in cellular function in eukaryotic cells. With advances in 3C (Hi-C) technology, more long-range intra-chromosomal and inter-chromosomal interactions between genomic loci have come to light. Specifically, the 3D organization of the genome may play important roles in transcription regulation. The “transcription factory” theory is one such hypothesis. These nuclear subcompartments are dynamically organized so that the genes in these compartments have coordinated transcription. This study is an attempt to further consolidate the “transcription factory” theory using a spatial Markov Random Field (MRF) model. By directly modelling gene expression values on a spatial neighborhood network inferred from Hi-C data, we were able to estimate the level of spatial dependency among protein-coding genes in human IMR90 cells. We overcame the computational challenges of large matrices by using the double Metropolis algorithm to carry out the Markov Chain Monte Carlo (MCMC) simulation for this Bayesian model. Our study confirms the spatial dependency of transcription among neighboring genes in the 3D genome organization on a global scale. Further insights can be gained into the mechanism of differential expression as a response to stimuli involving the chromatin compartments.