Bioinformatics & Computational Biology

BCB 690 Spring 2008

Iowa State University
BCB 690. Student Seminar in Computational Biology
Friday 2:10-3:00 PM
0296 Town Engineering

Date Presenters Title
January 18, 2008

No class

 
January 25

Chris Tuggle

How to Make Scientific Seminars
Feb. 1 No Class  
Feb. 8

Saras Saraswathi
Matthew Moscou

Saras - An Improved Gene Selection Method for Accurate Classification of Microarray Gene Expression Data
Matt - Gene-For-Gene-Mediated Transcriptome Reprogramming In Barley-Powdery Mildew Interactions
Feb. 15

No Class

Feb. 22

Fadi Towfic
Xiaoyong Sun

Fadi - Exploring Gene Expression Networks With RetinaWorkbench
Xiaoyong - BioIDMapper: a R package for mapping biological IDs

Feb. 29

Scott Boyken
Yuanyuan Huang

Scott - Itk SH3/SH2: Non-Canonical Interactions Regulate Kinase Activity
Yuanyuan - Apply Dominations to Prediction of RNA Secondary Structure
March 7 Ben Lewis
Long Qu
Ben - Combining Structural Modeling with Machine Learning Approaches to Improve Prediction of Nucleic Acid Binding Sites in Telomerase
Long - Subsampling based bias reduction in estimating the number of differentially expressed genes from microarray data

March 14

Deepak Reyon
Haining Lin

Haining - Characterization of paralogous protein families in rice
Deepak - Determination of Protein Structure using X-Ray Crystallography CusB

March 17-March 21 Spring Break  
March 28

Jon Hurst
Li Xue

Li - PCA on microsphere of biodegradable polymer adjuvant data
April 4

John Van Hemert
Shreyartha Mukherjee

Shreyartha -  Improving Secondary Protein Structure Prediction by generating decoys

John - TurtleBase: a facilitative ecoinformatics system

April 11

Kyoungmin Roh
Ataur Katebi

Kyoungmin - Analysis of gene network using simulation
April 18 Tian Xia
Mike Zimmermann
Mike - Normal Mode Analysis for Protein Dynamics
Tian - Omics Viz
April 25

Ankit Agrawal
Bob Farnham

Ankit - PairwiseStatSig: Pairwise Statistical Significance Estimation for Local Sequence Alignment

Bob - An Algorithm for Finding Optimal Gene Network Models from Microarray Data

May 2

Fengli Fu
Wengang Zhou

Fengli - Improve the Engineering of Zinc Finger Proteins (ZFPs) by Modular Design

Wengang - A Predicted Interactome for Vitis Vinifera

May 9 Finals Week  


BCB 690 Student Seminar

Friday, Jan. 18 at 2:10 p.m.

No Class


Friday, Jan. 25 at 2:10 p.m.

Speaker: Chris Tuggle, Chair, BCB

Title: How to Make Scientific Seminars

Good seminars tell a story and engage the listener in your journey

INTRODUCTION:
Lay the ground work for YOUR seminar
-Relevant background information
-What are the key questions remaining
-What is the problem you are addressing
-What is the approach you are taking

ORGANIZATION:
Does the seminar flow logically
-WHY Background/Rationale
-WHAT Hypothesis and/or Specific Aims
-HOW Experimental Approach/Methods
-WHAT Experimental Results/Significance
-WHERE Future experiments/Applications

CLARITY:
Can the audience follow you?
How clear are the answers to why/what/how/where?
Seminar should be understandable to those outside immediate field.

DATA EVALUATION:
Audience should understand how results are used to answer questions posed. Statistical analyses should be included where appropriate.
Not applicable to all seminar presentations.

SEMINAR SUMMARY:
The “take home” message
Recap major findings and indicate how they answer the initial questions asked. Indicate significance/future directions/applications.

SCIENTIFIC EVALUATION:
Is the research scientifically sound?
Is the hypothesis reasonable?
Will the experimental design address the hypothesis?
Are there appropriate controls?
Are the conclusions justified by the data?

VISUAL AIDS
DO: Simplify graphics; one concept per slide; use cartoons; have large fonts; label gels; include titles on each slide; avoid complex illustrations

DO NOT: Use dark colors on a dark background (e.g. red on a blue background); use yellow on a white background; use small or elaborate fonts; use complex tables; let visual aids detract from your data (i.e. colors/graphics/transitions/animations).

EYE CONTACT
Talk to several audience members, not the screen.

SPEAKING VOICE
Clear, loud and with enthusiasm

HANDLING QUESTIONS
Repeat questions asked from the front of the room
Keep answers brief and to the point - answer the question asked.
Don't be afraid to ask for clarification of question if you don't understand.
Don't be afraid to say "I don't know"

BEWARE OF TIME
Stay within time limits allotted for seminar
Keep answers to questions brief and on point

Other suggestions for presentations

1. Structure the talk in a logical way

Introduction
-Let the audience know what you are going to tell them in advance
-give Background; Rationale; Hypothesis/Question; Methodology for testing the hypothesis (describe your system)
Results
-Keep data slides as simple as possible
-Describe in detail how you tested your hypothesis; present the results; INTERPRET the results (why are your results significant?)
Conclusions
-Summarize the results
-Describe how these results support or did not support the hypothesis
-Provide model or summary showing significance of your results. Discuss
the results in terms of background, significance, and rationale presented in the introduction.
-Discuss future work. What is the next step(s) or discuss utility of the work.
Acknowledgements
-Acknowledge your colleagues, collaborators and funding sources
Questions
Whether you take questions during the talk, after it, or both is up to you.
Each speaker will let the class know if they would prefer that questions
be held until the end.

2. Other things to consider

-The audience. The audience is comprised of students who have a good understanding of the principles of computation and some aspects of biology and statistics, but not necessarily your specific field of study.

-Do not use “lab lingo”. Be precise and define the terms.

-If you get really nervous about speaking, try memorizing an opening statement to get you started. Practice the talk several times.

-Use transitions to move between sections of your talk to establish a logical flow.

3. For more stuff to think about when preparing a presentation, check out these websites.

http://www.lions.odu.edu/~kkilburn/semhome.htm;
http://www.kumc.edu/SAH/OTEd/jradel/effective.html;
http://www.swarthmore.edu/NatSci/cpurrin1/powerpointadvice.htm


Friday, Feb. 1

No class.


Friday, February 8

Presenter: Saras Saraswathi
Rotating First-Year Student
BCB Graduate Program

Title: An Improved Gene Selection Method for Accurate Classification of Microarray Gene Expression Data

Abstract:

We present an improved gene selection method for accurate classification of cancers based on micro-array gene expression data (MGED data set). The cancer classification problem has a small number of samples with large input features, which makes it difficult to classify the data using machine learning techniques. Hence, reduction of input features, in addition to finding the right combination of genes for maximizing the classification accuracy is an important problem in bioinformatics.

A two-step ‘Integer Coded Genetic Algorithm using Extreme Learning Machine’ (IGA-ELM) is given for selecting the most relevant genes for maximizing the classification accuracy. For selecting the optimal parameters, the Particle Swarm Optimization algorithm is employed. For classification, the recently developed fast learning neural algorithm called ‘Extreme Learning Machine’ (ELM) is used. For the GA, the genes are used as decision variables and fitness is determined by the classification accuracy obtained using ELM. Genetic operators have been suitably defined to generate valid solutions for this problem. Performance comparison of the GA-based gene selection scheme with existing methods on the GCM dataset indicates superior performance of the proposed approach.

====================

Also Friday, Feb. 8

Presenter: Matthew Moscou
Major Professor: Roger Wise

Department of Plant Pathology

Title: Gene-For-Gene-Mediated Transcriptome Reprogramming In Barley-Powdery Mildew Interactions

Abstract: Matthew Moscou (a, b), Nick Lauter (b,c), Rico Caldo (d), and Roger Wise (a,b,c).

(a) Interdepartmental Bioinformatics and Computational Biology Graduate Program, Iowa State University, Ames, IA 50011-1020
(b) Department of Plant Pathology and Center for Plant Responses to Environmental Stresses, Iowa State University, Ames, IA 50011-1020
(c) Corn Insects and Crop Genetics Research, USDA-ARS, Iowa State University, Ames, IA 50011-1020
(d) Monsanto, St. Louis, MO, 63167, USA

Barley has a complex interaction with powdery mildew [Blumeria graminis f. sp. hordei (Bgh)] that begins with early recognition of microbe-associated molecular patterns (MAMPs) from the pathogen. During Bgh invasion of the epidermis, the fate of cells is decided by the presence of resistance (R) genes that mediate an immediate response, which halts the progress of the pathogen. To understand the regulatory role and response associated with R-gene-mediated defense, we surveyed the transcriptional response of barley upon pathogen inoculation using three Manchuria NILs carrying allelic variants at the Mla locus, null mutations mla1-m508 and mla6-m9472, as well as Sultan 5 (Mla12) and mutants derived therefrom, mla12-m66, rar1-1, rar1-2, and rom1. Each experiment consisted of sampling at 0, 8, 16, 20, 24, and 32 hours after inoculation with three replications, varying only in the inclusion of non-inoculated material. We found that the resistance response manifests itself via dynamic reprogramming of the transcriptome, which includes over 5,000 genes and may exceed 10,000. The quantitative nature of the Mla control becomes apparent when we observe null mutations, where early signaling effects are compromised by this gene loss. Collectively, our results confirm the high-level regulatory control of Mla in gene-for-gene-mediated resistance and point to an essential early regulatory role as observed via massive transcriptome reprogramming.

Funding by NSF-Plant Genome Award #0500461.


February 15 - No Class


Friday, February 22

Presenter: Xiaoyong Sun
Major Professor: Dianne Cook
Home Department: Statistics

Title: BioIDMapper: a R package for mapping biological IDs

Abstract: Many new databases aimed at genes and proteins are being developed as more and more species are sequenced. Navigating among different data resources, mapping various IDs, and collecting and analyzing separate pieces of biological knowledge has become a tedious job. Currently popular databases include Entrez Gene, UniProt, Gene Ontology, EMBL, OMIM, PubMed, KEGG, etc. Based on NCBI and UniProt, BioIDMapper can facilitate mapping between different databases, integrate various ID systems and provide a full practical view at the gene level, mRNA level and functional level for one specific ID. This R package is based on the RCurl and XML packages.


Also, February 22:

Presenter: Fadi Towfic
Major Professor: Vasant Honavar
Home Department: Computer Science

Title: Exploring Gene Expression Networks With RetinaWorkbench

Abstract: Many cellular processes often involve the interaction of multiple gene-products that are produced at various time points in a pathway. The discovery of interacting genes/gene products is usually a central aim for hypotheses that strive to explain the dynamics of biological pathways. RetinaWorkbench is a Cytoscape plugin that aims to integrate easy access to public (as well as private) gene-expression datasets, gene ontologies and user-defined annotations with a user-friendly querying/visualization mechanism. RetinaWorkbench was used to reconstruct interactions between genes involved in photoreceptor differentiation in the mouse retina using publicly-available gene-expression datasets. The results showed that RetinaWorkbench can be used as a hypothesis-building tool for exploring relationships between genes of interest.


Friday, February 29

Presenter: Yuanyuan Huang
Rotating First-Year Graduate Student
BCB Graduate Program

Title: Apply Dominations to Prediction of RNA Secondary Structure

Abstract: Understanding RNA molecules is important to genomics research. Recently researchers at the Courant Institute of Mathematical Sciences used graph theory to model RNA molecules and provided a database of trees representing possible secondary RNA molecules. In this research I want to use domination parameters to predict which 9-10 degree trees are more likely to exist in nature as RNA structures. This approach appears to have promise in graph theory applications in genomics research.

Because the functional repertoire of RNA molecules, like proteins, is closely linked to the diversity of their shapes, uncovering RNA's structural repertoire is vital for identifying novel RNAs, especially in genomic sequences. To help expand the limited number of known RNA families, we can use graphical representation and clustering analysis of RNA secondary structures to predict novel RNA topologies and their abundance as a function of size. Representing the essential topological properties of RNA secondary structures as graphs enables enumeration, generation, and prediction of novel RNA motifs.

I will apply a graph parameter and logistic regression methods to construct the 1-10 degree RNA structure space. Significantly, nearly all existing RNAs fall into one group, which I refer to as "RNA-like"; I consider the other group "non-RNA-like". My method will predict many 9-10 degree candidates for novel RNA secondary topologies, some of which are remarkably similar to existing structures.
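
As a minimal illustration of a domination parameter, the sketch below computes the classic domination number of a small tree by brute force: the size of the smallest vertex set such that every vertex is either in the set or adjacent to a member of it. The abstract does not say which domination parameters are used, so the choice of this one is an assumption, and the function name and example trees are hypothetical.

```python
from itertools import combinations


def domination_number(n, edges):
    """Return the size of a smallest dominating set of a graph.

    Vertices are 0..n-1; edges is a list of (u, v) pairs.
    Brute force, so only suitable for small graphs such as the
    small trees discussed above.
    """
    # Closed neighborhood of each vertex: the vertex plus its neighbors.
    closed = [{v} for v in range(n)]
    for u, v in edges:
        closed[u].add(v)
        closed[v].add(u)

    everything = set(range(n))
    for size in range(1, n + 1):
        for subset in combinations(range(n), size):
            covered = set().union(*(closed[v] for v in subset))
            if covered == everything:
                return size
    return 0  # only reached for the empty graph


# Hypothetical examples: a 5-vertex star and a 6-vertex path.
star = [(0, 1), (0, 2), (0, 3), (0, 4)]
path = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
print(domination_number(5, star))  # the center dominates every leaf
print(domination_number(6, path))
```

A value like this could serve as one input feature to a classifier such as logistic regression, alongside other graph parameters.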

Also Friday, February 29

Presenter: Scott Boyken
Rotating First-Year Graduate Student
BCB Graduate Program

Title: Itk SH3/SH2: Non-Canonical Interactions Regulate Kinase Activity

Abstract: In protein-tyrosine kinases, the SH3 domain usually binds proline-rich regions, and the SH2 domain usually binds phosphotyrosines; however, in the TEC family non-receptor kinase Itk, SH3 binds SH2 through non-canonical, intermolecular interactions, regulating oligomerization and kinase activity. The structure of the SH3/SH2 heterodimer has recently been solved, revealing the molecular details of this interaction. To verify the structure, several mutants have been designed, and the binding affinity of these mutants as compared to wild-type has been measured via NMR. Our NMR experiments reveal a key interaction between Glu189 on the SH3 domain and Arg332 on the SH2 domain.


Friday, March 7

Presenter: Long Qu
Major Professor: Jack Dekkers
Co-Major Professor: Dan Nettleton
Home Department: Animal Science

Title: Subsampling based bias reduction in estimating the number of differentially expressed genes from microarray data

Abstract: In microarray experiments, the proportion of genes that change their expression levels in response to different treatment conditions is both a global measure of the strength of biological responses and a critical quantity for false discovery rate (FDR) control. However, current statistical procedures for estimating its complement, the proportion of nondifferentially expressed genes (π0), often suffer from high biases and low statistical power. In this study, we will develop a bias reduction procedure through a novel use of data subsampling, with analogy to but extending the jackknife. Based on the fact that increasing sample sizes almost always increases power and the power reaches 1 for infinite sample size, our procedure repeatedly deletes some biological samples randomly to produce many subsamples of various sample sizes. For each subsample, the same set of hypotheses is tested and a histogram estimator is used to estimate the p-value density at 1. Unlike most existing methods that use the p-value density at 1 from the full sample as an estimate of π0, our procedure takes a further step by robustly regressing the p-value density estimates at 1 over the sub-sample sizes and then extrapolating the regression curve to infinity to get the final estimate of π0. This corresponds to estimating the p-value density at 1 with an infinite sample size, exactly the π0 in theory. We derived the exact functional form between p-value density at 1 and the sample size based on the assumption that p-values are from t-tests and the standardized effect sizes for differentially expressed genes follow a normal distribution with 0 mean and common unknown variance. Motivated by this heuristic, we proposed a flexible regression function, which includes the above exact form as a special case, to increase robustness to parametric assumptions.
Simulations showed that the new estimator has smaller mean squared error (MSE) compared with the currently most widely used q-value smoother method by greatly reducing the bias but mildly increasing the variance. For FDR control purposes, averaging the q-value smoother estimate with the new estimate can both provide a conservative safety margin and achieve smaller MSE. In conclusion, our new procedure leads to bias-reduced estimation of π0 and improved statistical power in FDR control and has a smaller MSE. (Supported by USDA-NRI-2005-3560415618)
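
The histogram estimator of the p-value density at 1 mentioned above can be sketched as follows. This is only the single-sample building block, not the subsampling-and-extrapolation procedure itself; the function name, the cutoff `lam`, and the simulated data are assumptions made for illustration.

```python
import random


def pvalue_density_at_one(pvalues, lam=0.5):
    """Histogram estimate of the p-value density near 1.

    The fraction of p-values above `lam`, divided by the width of
    the interval (lam, 1]. Null p-values are uniform on [0, 1], so
    this quantity is commonly used as an estimate of pi0.
    """
    above = sum(1 for p in pvalues if p > lam)
    return above / (len(pvalues) * (1.0 - lam))


# Hypothetical simulated data: 80% null genes (uniform p-values)
# and 20% differentially expressed genes (p-values pushed toward 0).
rng = random.Random(0)
null_p = [rng.random() for _ in range(8000)]
alt_p = [rng.random() ** 4 for _ in range(2000)]
estimate = pvalue_density_at_one(null_p + alt_p, lam=0.5)
print(round(estimate, 3))
```

In this toy setup the true π0 is 0.8, while the estimator's expected value is roughly 0.86, because some non-null p-values still fall above the cutoff. That upward bias is the kind of error the abstract's procedure aims to reduce.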

Also Friday, March 7

Presenter: Ben Lewis
Rotating First-Year Graduate Student
BCB Graduate Program

Title: Combining Structural Modeling with Machine Learning Approaches to Improve Prediction of Nucleic Acid Binding Sites in Telomerase

Abstract: Telomerase is a ribonucleoprotein enzyme responsible for adding telomeric repeats to the ends of linear chromosomes and is overexpressed in ~90% of cancers as a method of circumventing the natural cell division limit. Unfortunately, the three-dimensional structure and RNA- and DNA-binding residues of human telomerase have not been determined. By combining structural models based on solved portions of the structure of Tetrahymena telomerase with machine learning approaches, it was possible to predict nucleic acid binding residues in human telomerase which correspond closely to residues experimentally shown to affect binding. These results indicate that this method may be able to provide a starting point on which biochemical experiments may be based.


Friday, March 14

Presenter: Haining Lin
Home Department: Genetics, Development and Cell Biology
Major Professors: Xun Gu and Robin Buell

Title: Characterization of paralogous protein families in rice

Abstract: Background - High gene numbers in plant genomes reflect polyploidy and major gene duplication events. Oryza sativa, cultivated rice, is a diploid monocotyledonous species with a ~390 Mb genome that has undergone segmental duplication of a substantial portion of its genome. This, coupled with other genetic events such as tandem duplications, has resulted in a substantial number of its genes, and resulting proteins, occurring in paralogous families.

Results
Using a computational pipeline that utilizes Pfam and novel protein domains, we characterized paralogous families in rice and compared these with paralogous families in the model dicotyledonous diploid species, Arabidopsis thaliana. Arabidopsis, which has undergone genome duplication as well, has a substantially smaller genome (~120 Mb) and gene complement compared to rice. Overall, 53% and 68% of the non-transposable element-related rice and Arabidopsis proteins could be classified into paralogous protein families, respectively. Singleton and paralogous family genes differed substantially in their likelihood of encoding a protein of known or putative function; 26% and 66% of singleton genes compared to 73% and 96% of the paralogous family genes encode a known or putative protein in rice and Arabidopsis, respectively. Furthermore, a major skew in the distribution of specific gene function was observed; a total of 17 Gene Ontology categories in both rice and Arabidopsis were statistically significant in their differential distribution between paralogous family and singleton proteins. In contrast to mammalian organisms, we found that duplicated genes in rice and Arabidopsis tend to have more alternative splice forms. Using data from Massively Parallel Signature Sequencing, we show that a significant portion
of the duplicated genes in rice show divergent expression although a correlation between sequence divergence and correlation of expression could be seen in very
young genes.

Conclusions
Collectively, these data suggest that while co-regulation and conserved function are present in some paralogous protein family members, evolutionary pressures have resulted in functional divergence with differential expression patterns.

Also Friday, March 14

Presenter: Deepak Reyon
First-Year Graduate Student
BCB Graduate Program

Title: Determination of Protein Structure using X-Ray Crystallography CusB

Abstract: The Cus complex is a trans-membrane system in gram-negative bacteria that mediates resistance to copper and silver by cation efflux. Copper is essential
to the cell but also toxic, so a homeostatic environment must be maintained. The Cus complex consists of 3 core proteins: CusA (inner membrane), CusC (outer
membrane) and CusB (periplasmic). In Dr. Edward Yu's lab we are working on solving the structure of this complex. I will present my progress in determining the
structure of CusB using X-Ray Crystallography.


Friday, March 21 - Spring Break


Friday, March 28

Presenter: Li Xue
BCB program, Home Department: MSE
Major professor: Krishna Rajan

Title: PCA on microsphere of biodegradable polymer adjuvant data

Abstract: In immunology, an adjuvant is an agent that may stimulate the immune system and increase the response to a vaccine, without having any specific antigenic effect in itself (Wikipedia). With multiple controlling factors, deciding on the optimum polymer chemistry that gives a predictable immune response and enhanced stability of protein immunogens is a challenge. In this study, microspheres fabricated from 1,6-bis(p-carboxyphenoxy)hexane (CPH), sebacic acid (SA), and 1,8-bis(p-carboxyphenoxy)-3,6-dioxaoctane (CPTEG) are used as adjuvants and added to dendritic cells (DCs) to study cell marker expression behavior, which is one of the characteristics of DC activation. Here PCA (Principal Component Analysis), a dimension reduction method, is applied to the multivariable polymer adjuvant - cell marker expression data. Some trends are detected showing that more hydrophilic microspheres (CPTEG:CPH system) cause more cell marker expression than more hydrophobic microspheres (CPH:SA system). Also, an optimum chemistry seems to sit between 50:50 CPTEG:CPH and 100% CPTEG, which is consistent with polymer film experiments.

Also Friday, March 28

Presenter: Jon Hurst
First-Year BCB Graduate Student
BCB Program

TITLE:  Markov Model Selection and Parameterization Using A Genetic Algorithm

ABSTRACT: Modeling ion channel function is problematic because transitions between many conformational states cannot be directly observed. One solution to this problem is to use a genetic algorithm to create models that fit all desired data. I have developed software that not only parameterizes but also creates the structure of Markov (state-based) models using this method. To determine whether this method can structure and parameterize mechanistically accurate models, test cases were conducted by using the genetic algorithm with known models. Preliminary results suggest that this method is well suited not only for parameterization of Markov models, but also for model selection. This method could also be applied to many disciplines besides neurophysiology that use hidden Markov models.


Friday, April 4

Presenter: Shreyartha Mukherjee
First Year BCB Rotating Student
BCB Graduate Program

Title:  Improving Secondary Protein Structure Prediction by generating decoys

Abstract: Predicting secondary protein structure using amino acid sequence information alone is one of the fundamental unsolved problems in computational molecular biology. Any algorithm that attempts to predict protein structure requires a scoring or discriminatory function that can distinguish between correct and incorrect conformations. If we can generate high-quality decoys with the aim of fooling scoring functions, we can take a further step toward improving the existing scoring functions, leading to more accurate structure prediction.

Also Friday, April 4

Presenter: John Van Hemert
Home Department: Computer Engineering
Major Professor: Julie Dickerson

Title: TurtleBase: a facilitative ecoinformatics system

Abstract: The observation that many ecological survey-projects have led to massive collections of static data suggests the construction of a centralized platform for eco-informatics. Much ecological and environmental research is conducted by accumulating observational data across long timeframes. Dr. Fred Janzen's lab has been conducting, and continues to conduct, just such a project observing nesting Chrysemys picta (painted turtles) on a campground island in the Mississippi River near Clinton, Iowa. Observations and measurements are taken of the turtle mothers, hatchlings, and nests over a four to six week period each summer. Since 1989 this system has entailed handwriting notes on paper and manually transcribing data to digital tables and spreadsheets for small scale analysis. I will present a project where a relational database was designed, all turtle data since 1989 has been imported to the database, and a web portal was created for access to the data. Benefits were instantly available in several areas: consistency, automation, data input, data analysis tools, and data retrieval via download. Results from preliminary manual and automatic data mining will be presented.


Friday, April 11


Presenter: Kyoungmin Roh
Second-Year Graduate Student
BCB Graduate Program
Home Department: EEOB
Major professor: Professor Stephen Proulx

Title: Analysis of gene network using simulation

Abstract:
The traditional approach of molecular biology research has been an inherently local one. The focus was on examining and collecting data on a single gene or a single reaction. Recently, however, there has been much interest in the dynamics of gene regulatory networks (GRNs), which are collections of DNA segments in a cell that interact with each other and with other substances in the cell. I applied a mathematical approach to modeling GRNs. The model describes the reaction kinetics of the constituent parts, and the functions are ultimately derived from simple expressions based on Michaelis-Menten enzyme kinetics. The functional forms are usually chosen as Hill functions, which serve as an approximation of the real molecular dynamics. These dynamics depend on several parameters, and I used a simulated annealing algorithm to calculate the optimal fitness and the optimal parameters of the gene network. I made a model that has two genes and experiences two different environments. From simulation, I can obtain the optimal gene interaction network, and I will try more complicated evolutionary networks in the future.
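
To make the role of Hill functions concrete, here is a minimal sketch of a toy two-gene network integrated with Euler steps. It is not the presenter's model: the network wiring, the rate constants, and the function names are all hypothetical, and no simulated annealing is performed.

```python
def hill_activation(x, k, n):
    """Hill function for activation: rises from 0 toward 1 as x grows.

    x: regulator concentration, k: half-saturation constant,
    n: Hill coefficient (n = 1 gives the Michaelis-Menten form).
    """
    return x ** n / (k ** n + x ** n)


def simulate_two_genes(steps=20000, dt=0.01):
    """Euler integration of a toy two-gene network.

    Gene A is produced at a constant rate; gene B is activated by A.
    Both decay linearly. All rate constants are hypothetical.
    """
    a, b = 0.0, 0.0
    for _ in range(steps):
        da = 1.0 - 0.5 * a                                   # production - decay
        db = 2.0 * hill_activation(a, k=1.0, n=2) - 1.0 * b  # activation - decay
        a += dt * da
        b += dt * db
    return a, b


a, b = simulate_two_genes()
print(round(a, 3), round(b, 3))
```

With these constants the system settles at a = 2 and b = 2 * 4 / (1 + 4) = 1.6. A parameter search such as simulated annealing would repeatedly run a simulation like this while perturbing constants such as `k` and `n` and scoring the outcome with a fitness function.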



Friday, April 18

Presenter: Mike Zimmermann
First Year Graduate Student

BCB Graduate Program

Title: Normal Mode Analysis for Protein Dynamics

Abstract: In recent years it has become clear that methods need to be developed to quickly calculate the molecular motions of proteins. Molecular Dynamics (MD) simulations are too computationally costly and require too much user time to be completed by most researchers. The amount of NMR and crystallographic structure data available to researchers is constantly growing while 3D structure prediction algorithms are also improving. This information usually only provides a static image of the protein structure rather than a dynamic depiction of the malleable entities that exist in cells. To analyze protein dynamics in a computationally inexpensive manner the coarse grained Elastic Network Model (ENM) was developed. Time independent variants of ENM have been developed and studied by various groups. Recently, a time dependent model was derived which allows for interesting analysis of protein motions from a sound physical and intuitive perspective. It is hoped that with further experimentation more accurate and meaningful protein dynamics may be derived through use of this model.

Also Friday, April 18

Presenter: Tian Xia
Home Department: Computer Engineering
Major Professor: Julie Dickerson

Title: Omics Viz

Abstract: OmicsViz is a Cytoscape plugin (Cytoscape 2.4, 2.5, 2.6) dedicated to providing useful visualization and an integrated analysis tool for large-scale omics data. OmicsViz imports omics data into Cytoscape and visualizes it on a graph according to the change of gene experimental values. OmicsViz also provides a mapping function between two different species or between probe set and experimental names and node names in a network.



Friday, April 25

Presenter: Ankit Agrawal
BCB Minor
Major Professor: Xiaoqiu Huang
Home Department: Computer Science

Title: PairwiseStatSig: Pairwise Statistical Significance Estimation for Local Sequence Alignment

Abstract: Estimation of statistical significance of a pairwise alignment is of wide interest in sequence comparison. Currently, most of the popular alignment programs report the statistical significance of a pairwise alignment in the context of a database search, which is dependent on the database. This work explores the use of pairwise statistical significance, which depends only on the pair of sequences being aligned and the alignment parameters, and can be useful in assessing the relatedness of two sequences (or a small number of sequences) in less time, independent of any database. We experimented with different methods to determine that censored maximum likelihood fitting of the score distribution (with censoring the distribution right of the peak) gives the most accurate estimates of pairwise statistical significance. Further, we evaluated this method in a homology detection experiment with a subset of the CATH 2.3 database, which has been previously used by researchers as a benchmark data set for protein comparison. Comparison of results with popular database search programs like SSEARCH and PSI-BLAST on the same database indicates that the results of pairwise statistical significance are comparable to, and sometimes better than, those of database statistical significance (with SSEARCH). However, PSI-BLAST performs the best, presumably due to its use of query-specific substitution matrices. Pairwise statistical significance can be extremely useful in evaluating different parameter combinations for pairwise alignment - like alignment program, substitution matrices and gap penalties. As an application of pairwise statistical significance, we also conducted a series of homology detection experiments to empirically determine the effective gap opening penalties for pairwise protein alignment with the widely used BLOSUM substitution matrices - BLOSUM45, BLOSUM50, BLOSUM62 and BLOSUM80, on the same benchmark database.
The proposed method is implemented in the C language in a program, PairwiseStatSig, and is expected to be a useful tool for computational biologists for pairwise statistical significance estimation purposes, especially for smaller sets of sequences without having to perform time-consuming database searches. The program PairwiseStatSig is available for free academic use at
www.cs.iastate.edu/~ankitag/PairwiseStatSig.html.

Also Friday, April 25

Presenter: Bob Farnham
Major Professor: Srinivas Aluru
Home Department: Electrical and Computer Engineering

Title: An Algorithm for Finding Optimal Gene Network Models from Microarray Data

Abstract: Finding gene networks from microarray data is computationally NP-hard.  In this presentation, an O(n * 2^n) (time and space) sequential algorithm will be described.  On single-processor systems, such algorithms are limited to inferring gene networks of up to 32 or so genes.  Thus, gene network problems are a fruitful area of research for parallel systems.  Some thoughts on how this problem may be mitigated through parallel approaches will be offered.
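
A quick back-of-the-envelope calculation shows how fast an O(n * 2^n) time-and-space bound grows with the number of genes. The sketch below only evaluates the formula; the helper name is hypothetical, and the 8 bytes per stored entry is an assumed constant, not a figure from the talk.

```python
def subset_dp_cost(n, bytes_per_entry=8):
    """Rough cost of an O(n * 2^n) time-and-space algorithm.

    bytes_per_entry is an assumed size for one stored value per
    (gene, subset) pair; the real constant depends on the implementation.
    """
    operations = n * 2 ** n
    memory_gib = operations * bytes_per_entry / 2 ** 30
    return operations, memory_gib


for n in (16, 24, 32):
    ops, gib = subset_dp_cost(n)
    print(n, ops, gib)
```

Each additional gene more than doubles both quantities, which is why the feasible network size on one processor is capped at a few dozen genes and why distributing the 2^n subsets across processors is attractive.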


Friday, May 2

Presenter: Wengang Zhou
Major Professor: Julie Dickerson
Home Department: Electrical and Computer Engineering

Title: A Predicted Interactome for Vitis Vinifera

Abstract: High throughput technology such as yeast two-hybrid has produced a huge amount of interaction data. One of the important goals of functional genomics is to identify the complete protein interaction network, or interactome. In this study, we collected 55146 available interactions covering seven species from the DIP database. By applying best reciprocal BLAST analysis, we found 3082 grape orthologs for 19665 unique DIP proteins. The latest grape protein sequences published by the Italian-French group were used. After removing redundant interactions, the predicted grape interactome contains 2380 interactions involving 1555 unique grape proteins. By mapping all involved grape proteins to their Arabidopsis orthologs, we further used BiNGO to find the overrepresented biological processes for three big subnetworks. Then, we presented a structure-based feature to predict subcellular locations for the entire grape proteome based on multi-class classifiers. Protein secondary structures were predicted using the PSIPRED system, and the training data contains 7579 proteins within 12 locations. About 26% of all predicted interacting pairs come from the same subcellular location. Surprisingly, even though we have few mitochondrion and chloroplast proteins, most of them interact with each other. We also obtained more evidence from TAIR, which had 822 non-redundant protein interactions for Arabidopsis. Eleven of the 822 interactions match our predicted interactions.
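
The reciprocal-best-hit idea behind the ortholog mapping can be sketched in a few lines. This is an illustration under assumptions, not the pipeline used in the study: it takes the best hits as already computed (for example, parsed from BLAST output), and the function name and protein IDs are hypothetical.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Return pairs (a, b) where a's best hit is b and b's best hit is a.

    Each argument maps a query protein ID to the ID of its top-scoring
    hit in the other species.
    """
    return sorted(
        (a, b)
        for a, b in best_a_to_b.items()
        if best_b_to_a.get(b) == a
    )


# Hypothetical IDs for illustration only.
dip_to_grape = {"DIP1": "GRAPE7", "DIP2": "GRAPE3", "DIP3": "GRAPE3"}
grape_to_dip = {"GRAPE7": "DIP1", "GRAPE3": "DIP3"}
print(reciprocal_best_hits(dip_to_grape, grape_to_dip))
```

Here DIP2 is dropped because its best grape hit, GRAPE3, points back to DIP3 instead. Requiring agreement in both directions is what makes the ortholog assignment conservative.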

Also Friday, May 2

Presenter: Fengli Fu
Major Professor: Dan Voytas
Home Department: Genetics, Development and Cell Biology

Title:  Improve the Engineering of Zinc Finger Proteins (ZFPs) by Modular Design


Abstract: The zinc finger motif is one of the best understood DNA-binding domains. Because it is typically modular both in structure and in DNA binding activity, it is the most suitable scaffold for constructing engineered DNA binding proteins. By fusing with various functional effector domains to create artificial DNA modifiers, engineered ZFPs have many potential uses in both basic science and medical therapy. The dominant methodology currently available to academic laboratories for engineering zinc finger proteins is modular design, or modular assembly. But at present, modular assembly has a high failure rate. We hypothesize that there are some rules governing the construction of ZFPs by modular design and that understanding these rules will improve the engineering of ZFPs. In order to facilitate and improve the engineering of ZFPs for academic researchers, we developed a web database, into which the zinc fingers and engineered ZFPs were collected. Using ZFPs generated by selection methods, we learned the frequency of each amino acid at the 9 key positions of a 3-finger ZFP contacting a 9 bp binding site. Lab experiments showed that we can modify the zinc fingers according to the frequencies we found to improve the success rate of engineering ZFPs.
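
Counting amino acid frequencies at fixed positions, as described above, can be sketched as follows. The function name and the short residue strings are hypothetical stand-ins; real input would be the residues at the 9 key positions of each collected ZFP.

```python
from collections import Counter


def position_frequencies(sequences):
    """Per-position residue frequencies for a non-empty list of
    equal-length sequences."""
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all sequences must have the same length")
    total = len(sequences)
    return [
        {res: count / total for res, count in Counter(column).items()}
        for column in zip(*sequences)
    ]


# Hypothetical three-residue examples, not real zinc finger data.
toy_sequences = ["RSD", "RSH", "QSD", "RSD"]
freqs = position_frequencies(toy_sequences)
print(freqs[0])
```

A table like `freqs` makes it easy to spot positions where one residue dominates, which is the kind of pattern that could guide modifying a designed finger.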



Friday, May 9

Finals Week - Presentations if needed.


Copyright© 2008, Iowa State University, all rights reserved.
Please direct corrections, suggestions, and comments to bcb@iastate.edu.