By Year

| 2001 | 2002 | 2003 | 2004 | 2005 |
|---|---|---|---|---|
| Xiang Gao | Jianmin Feng | Aspen Garry | Junli Ji | Pan Du |
| Yuan Lin | Xiaosi Zhang | Yingchun Wang | Shiquan Wu | Changhui Yan |
| Thomas Vigdal | Xiangyun Wang | Hua Zhou | Nicole Leahy | Feng Cui |
| Yufeng Wang | Lixia Diao | Huaijun Zhou | Bradley Powers | |
| | Hailong Zhang | Wei Zhu | Zhongqi Zhang | |
| | | Zhong Gao | | |
| | | Jianying Gu | | |
| | | Alain Laederach | | |
| | | Wu Xu | | |

| 2006 | 2007 | 2008 | 2009 | 2010 |
|---|---|---|---|---|
| Justin Recknor | Jae-Hyung Lee | Jie Li | Robert Farnham | |
| Shannon Schlueter | Michael Sparks | Jeffry Sander | Haining Lin | |
| Jing Ding | Matthew Wilkerson | Michael Terribilini | Kent Vander Velden | |
| Joset Etzel | Lindsay Kasaga (Carroll) | LaRon Hughes | Feihong Wu | |
| Fang Fang | Peter Vedell | Michael Lawrence | Haitao Cheng | |
| Darrin Lemmer | Wuyan Zhang | Erin Myers | | |
| Wieslawa Mentzen | Lisa Borsuk | Kyoungmin Roh | | |
| Justin Schonfeld | Tyra Dunn | Erin Boote Jones | | |
| Julie Hoy | Scott Emrich | Garrett Dancik | | |
| Jianmin Wang | Ling Guo | Brent Kronmiller | | |
| Di Wu | Myron Peto | Aimin Yan | | |
| Liang Ye | | Lei Yang | | |

Graduated: Spring, 2000 with a Ph.D. in Applied Math and Immunobiology
Current Position: Lead Scientist at Emergency Mgmt. Inc. in Bel Air, MD
Past Position: Postdoctoral Research Associate at Iowa State University http://www.t10.lanl.gov/profiles/baccam.html
http://www.t10.lanl.gov/pbaccam/
Major and Co-Major Professors: Dr. Jim Cornette and Dr. Susan Carpenter
Dissertation Title: Genetic variation and evolution of equine infectious anemia virus rev quasispecies during long term persistent infection
Abstract: Genetic variation has been observed in many viruses. Viruses that carry their genetic information in the form of RNA exhibit high mutation rates because the viral polymerase lacks proof-reading mechanisms commonly found in DNA polymerase complexes. The combination of high mutation rates, small genome size, and high replication rates results in a population of closely related viral genotypes, which are commonly referred to as a quasispecies. A consequence of the genetic variation in viruses is possible variation in viral phenotype of the quasispecies population. Furthermore, changes in viral phenotype may be a biologically important factor in progression of disease. Here, we undertook a longitudinal study to describe the quasispecies nature and genetic variation in a lentivirus regulatory protein, Rev, during the course of disease in a pony experimentally infected with equine infectious anemia virus (EIAV). This study examined rev variants that comprised the quasispecies population in sequential sera samples. Over the course of disease, there was continual appearance of novel rev variants, with some variants growing in frequency to predominate at certain time points. Phylogenetic and cluster analyses suggested that the Rev quasispecies was comprised of two distinct populations that co-existed during infection. These two quasispecies populations differed in their pattern of evolution, with one population accumulating changes in a linear, time-dependent manner, while the other population evolved radially from a common variant. Changes in the population size of the two Rev quasispecies coincided with changes in the clinical stages of disease. Rev variants from each population were biologically tested, and significant differences in Rev activity were detected between the two populations. Together, these results suggested that the distinct Rev populations differed in selective advantage.
Rev quasispecies activity differed significantly between different stages of clinical disease. This study suggests that distinct quasispecies populations, which differed in pattern of evolution and niche advantage, co-existed during long term persistent infection by EIAV. A multi-population quasispecies model challenges our current thinking of viral populations and may have significant biological implications.
Graduated: Fall, 2003 with an M.S. in Botany
Current Position: Middle school science teacher, Apache Junction, AZ
Major and Co-Major Professors: Dr. Jonathan Wendel and Dr. Dean Adams
Thesis Title: Initiation and early development of fibers in wild and cultivated cotton
Abstract: Gossypium (Malvaceae) is a diverse genus best known for cultivated cotton. It includes about 50 species, 45 diploid and 5 allopolyploid, which occur in arid and semi-arid regions throughout the world (Vollesen, 1987; Fryxell, 1992). The diploids are divided into eight genome groups based on chromosome pairing and size, and fertility between species (Endrizzi, Turcotte, and Kohel, 1985). These groups comprise natural lineages within the genus and correspond to geographic locations: A, B, E, F: Africa and Arabia; C, G, K: Australia; and D: New World. Allopolyploid members are found in the New World and contain the A and D genomes (Wendel, 1995; Wendel et al., 1998; Brubaker, Bourland, and Wendel, 1999; Percival, Wendel, and Stewart, 1999; Cronn et al., 2002). This understanding of the evolutionary history of the genus allows many aspects of evolutionary differences in development and morphology to be studied in a phylogenetic context.
Graduated: Spring 2009 with a PhD in BCB
Current Position: To be determined
Major and Co-Major Professors: Dr. Robert Jernigan and Dr. Zhijun Wu
Dissertation Title: Pathways of conformational transition originating in phosphorylation: A study of CDK2 using targeted molecular dynamics and coarse grained models
Abstract: There is a critical need for protein structure and function prediction. Accurate protein secondary structure prediction is essential for many bioinformatics applications, including protein tertiary structure prediction. We developed an algorithm (Fragment Data Mining, FDM) for protein secondary structure prediction using fragments of known structures obtained by multiple sequence alignment (MSA). Its performance is excellent where high-score MSA matches are available. By combining it with GOR V, a new Consensus Database Mining (CDM) method was developed, which surpasses the performance of both FDM and GOR V. For each residue, it uses the result of either FDM or GOR V, depending on the availability of high-score MSA matches. A server has been set up to make CDM publicly accessible. It has become more popular owing to the reliability and efficiency of its performance, the simplicity of its use, and its potential for improvement with the rapidly growing number of determined structures.
Phosphorylation is one of the most important post-translational modifications for cellular regulation and signal transduction. Upon phosphorylation, proteins can undergo obvious conformational changes. It is challenging to characterize these changes because of the high flexibility of phosphorylation regions and the difficulties in obtaining diffraction quality crystals. In the current study, we focused on the conformational changes of CDK2 due to phosphorylation at Thr160. We use Cα-Cβ-side chain (CABS) modeling, Targeted Molecular Dynamics (TMD) and conventional molecular dynamics (MD) to simulate the structural transition and create transition pathways. Principal component analysis (PCA) of the trajectories and normal mode analysis (NMA) with anisotropic network model (ANM – an elastic network model) were used for trajectory analysis and performance comparisons. The CABS with appropriate constraint weights and TMD with proper force constants successfully simulated the conformational changes of CDK2 phosphorylation, including the formation of the arginine cluster, maintaining the geometrical relationships of the conserved residues, and the characteristic movement of the active loop (T loop). For conventional MD, we use the CABS modeling and energy optimization to construct the missing segments in the structure. CABS is, for the first time, used also to create transition pathways as well as to patch in the segment with no determined coordinates. It proved especially valuable in the study of small localized conformational changes. The results show that CABS and TMD are both effective approaches for creating pathways of transitions due to phosphorylation. PCA showed significant overlaps with the set of low-frequency ANM normal modes. It is possible to explore the mechanisms of phosphorylation-induced conformational changes with these simulation methods and analysis methods.
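The per-residue selection rule behind CDM, described in the first paragraph of this abstract, can be sketched roughly as follows. This is only an illustration under stated assumptions, not the published implementation: the function name `consensus_prediction`, the `score_threshold` cutoff, and the way predictions and match scores are passed in are all hypothetical.

```python
from typing import List, Optional


def consensus_prediction(
    fdm_states: List[str],
    fdm_scores: List[Optional[float]],
    gor_states: List[str],
    score_threshold: float = 0.5,  # hypothetical cutoff, not from the dissertation
) -> str:
    """Pick the FDM state where a high-score alignment match exists, else GOR V.

    fdm_states / gor_states hold one secondary-structure letter per residue
    (for example H, E, C). fdm_scores holds the alignment match score for each
    residue, or None when no match was found.
    """
    if not (len(fdm_states) == len(fdm_scores) == len(gor_states)):
        raise ValueError("all inputs must cover the same residues")

    chosen = []
    for fdm, score, gor in zip(fdm_states, fdm_scores, gor_states):
        has_good_match = score is not None and score >= score_threshold
        chosen.append(fdm if has_good_match else gor)
    return "".join(chosen)


if __name__ == "__main__":
    # Residues 1 and 4 have a good match, so FDM is used there.
    print(consensus_prediction(list("HHEE"), [0.9, None, 0.2, 0.5], list("CCCC")))
```

The point of the design is that the fallback predictor covers every residue, so the consensus never has gaps even when no good alignment match is available.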
Graduated: Summer, 2005 with a Ph.D. Co-Major in BCB and Physical Chemistry
Current Position: Visiting Fellow, National Cancer Institute, Center for Cancer Research Nanobiology Program (CCRNP), Frederick, MD http://ccr.cancer.gov/Labs/staff.asp?labid=91
Major and Co-Major Professors: Dr. Zhijun Wu and Dr. Robert Jernigan
Dissertation Title: Distance-based NMR Structure determination and refinement
Abstract: X-ray crystallography and nuclear magnetic resonance (NMR) spectroscopy are two widely used experimental techniques for protein structure determination. In the Protein Data Bank (PDB), about 85% of deposited protein structures are determined by X-ray crystallography. The rest of the structures are determined by NMR spectroscopy. The main difference between these two approaches lies in the state of protein samples to which they are applied: for X-ray crystallography, a protein has to be in the crystalline state while in NMR, it may be in the solution state. Both approaches have their own pros and cons. For example, X-ray crystallography is a mature technique capable of providing more objective interpretation of data. This approach has various quality indicators such as resolution and R-factor to assess the structures. It can be applied to large molecules, e.g., virus particles, and produce a single model that is easy to visualize and interpret. Raw data processing is highly automatic. In contrast, NMR is a relatively new technique and provides more subjective interpretation of the data. It lacks established quality indicators of data and models. In addition, it is limited to determination of relatively small proteins (<20 kDa) and produces an ensemble of possible structures rather than one model. Data sometimes have to be manually processed. On the other hand, a protein has to form stable crystals for X-ray analysis, which could be time-consuming and often impossible. The crystalline state is not a natural and physiological environment for the protein either. In addition, X-ray crystallography is less useful for large flexible modular proteins. In contrast, the solution state of a protein is closer to biological conditions and relatively easy to prepare. NMR can provide information on dynamics and identify individual side-chain motion, often used to monitor conformational change on ligand binding.
With these pros and cons, both approaches have undergone dramatic development during the past five years, especially NMR. Advances in data collection, spectra assignment and analysis, structure calculation and computer graphics have removed the barriers between the NMR spectra assignment process, NMR structure assessment and visualization. Many quality indicators such as bond length, angle and NOE violations (inter-atomic distances that lie outside of NOE ranges) have been developed and used for quality assessment of NMR structures. Novel refinement schemes aimed at increasing the accuracy of the resulting structures have been proposed and tested. As a result, nowadays, proteins in size up to 30 kDa (about 260 residues) are routinely accessible by NMR spectroscopy with increased resolution, equivalent to approximately 2.5-Å resolution crystal structures.
Graduated: Summer, 2008 with a PhD in BCB
Current Position: Assistant Professor, Northwestern State, Departments of Biology and Math Sciences, Louisiana; Will begin bioinformatics concentration there.
Major and Co-Major Professors: Dr. Karin Dorman and Dr. Doug Jones
Dissertation Title: Exploring host-pathogen relationships through computer simulations of intracellular infection
Abstract: Computer simulations of infectious disease allow for the identification and estimation of important pathogen and immune parameters, the validation of theoretical biological models with experimental data, and the characterization of the host-pathogen interactions that lead to emergent and sometimes counterintuitive behavior. This dissertation describes the development, analysis, and calibration of a computer model of Leishmania major infection, the identification of correlates of escape mutant success and optimal escape strategies in a computer model of a viral infection, and statistical software to aid in computer model analysis and calibration.
In an agent-based model of L. major infection, sensitivity analysis reveals that increasing growth rates can favor or suppress parasite load, depending on the stage of the infection and the ability of the pathogen to avoid detection. Calibration of the computer model suggests that the pathogen has a relatively slow growth rate and can grow for an extended time before damaging the host cell.
In a computer model of viral infection, we find that the relative overall importance of the cellular (or humoral) response consistently correlates with both the success of immune escape and the optimal escape strategy, and that correlation is relatively robust to the time the escape mutant arises. Mutants that simultaneously escape both responses perform substantially better than humoral or cellular escape mutants alone, highlighting the importance of both responses in controlling infection. Interestingly, loss of infectiousness of humoral escape mutants favors the virus, likely because decreasing infectivity weakens the cellular response.
Finally, Gaussian processes (GP) are commonly used as fast predictors of computer model output and are essential tools for computer model calibration and analysis. We describe the R package mlegp, which fits GPs to scalar or multivariate computer model output and performs sensitivity analysis to identify and characterize the effects of important model parameters.
Graduated: Summer, 2002 with an M.S. in BCB.
Current Position: Pursuing a PhD in Statistics at ISU
Major and Co-Major Professors: David Fernandez-Baca and Dr. Xun Gu
Thesis Title: Consensus properties of supertree construction methods
Abstract: The combination of a set of rooted perfect phylogenetic trees on overlapping leaf sets into one supertree is important and fundamental for evolutionary biology. In this thesis, we will present three supertree techniques – MRP, MRF, MinCutSupertree – and compare the consensus properties of MRP and MRF with some consensus tree criteria.
Jing Ding
Graduated: Spring 2006 with a PhD Co-Major in BCB and ComE
Current Position: Staff specialist, Ohio State University, Columbus
Major and Co-Major Professors: Dr. Dan Berleant and Dr. Eve Wurtele
Dissertation Title: BOW-Based vs. Concept-Based Text Clustering for Functional Analysis of Genes
Abstract: The rapid development in genomic technologies (e.g. microarray) has enabled biologists to simultaneously monitor expression of hundreds or even thousands of genes in a single experiment. Interpreting the biological meaning of the expression patterns still relies largely on biologists' domain knowledge, as well as information collected from the literature and/or various public databases. An individual expert's domain knowledge is insufficient for large datasets, and manually collecting and analyzing information from the literature and/or public databases is tedious and time-consuming. Computer-aided functional analysis tools are highly desirable. We developed GeneNarrator, a text-mining system for functional analysis of microarray data. Given a list of genes, GeneNarrator collects functional information (MEDLINE citations) from PubMed, and clusters the citations into functional topics. The genes are then mapped to the topics and clustered into groups based on their similarities in topic distribution.
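The dissertation title contrasts bag-of-words (BOW) and concept-based representations of text. As a rough illustration of the BOW side only, the sketch below turns two citation texts into word-count vectors and compares them with cosine similarity, a common ingredient of text clustering. It is not GeneNarrator's code: the tokenizer, the function names, and the choice of cosine similarity are assumptions made for this example.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Count lowercase alphanumeric tokens; word order is discarded."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Toy citation snippets, invented for illustration.
    first = bag_of_words("Heat shock protein expression under oxidative stress")
    second = bag_of_words("Oxidative stress induces heat shock protein genes")
    print(round(cosine_similarity(first, second), 3))
```

A clustering step would then group citations whose pairwise similarity is high; a concept-based variant would replace the raw tokens with identifiers from a controlled vocabulary before counting.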
Graduated: Fall, 2005 with a PhD Co-Major in BCB and EE
Current Position: Research Associate / Senior Bioinformatics Analyst position, Robert H. Lurie Comprehensive Cancer Center, Northwestern University, Chicago, IL
Major and Co-Major Professors: Dr. Julie Dickerson and Dr. Eve Wurtele
Dissertation Title: Multi-scale Genetic Network Inference based on Time Series Gene Expression Profiles
Abstract: This work integrates multi-scale clustering and short-time correlation to estimate genetic regulatory networks with different time resolutions and detail levels. Gene expression data are noisy and large scale. Clustering is widely used to group genes with similar patterns. The cluster centers can be used to infer the genetic networks among these clusters. This work introduces the Multi-scale Fuzzy K-means clustering algorithm to uncover groups of coregulated genes and capture the networks in different levels of detail.
Time series expression profiles provide dynamic information for inferring gene regulatory relationships. Large scale network inference, identifying the transient interactions and feedback loops as well as differentiating direct and indirect interactions are among the major challenges of genetic network inference. Pairwise time correlation can detect linear interactions between genes. Estimates of the time delay and direction of causality in the inferred network can also be made. Partial correlation and d-separation theory are combined to differentiate the direct and indirect interactions and identify feedback loops. Gene expression regulation can happen in specific time periods and conditions instead of across the whole expression profile. Short-time correlation can capture transient interactions.
The network discovery algorithm was validated using yeast cell cycle data. The algorithm successfully identified the yeast cell cycle development stages, cell cycle and negative feedback loops, and indicated how the networks change dynamically over time. The inferred network reflects most interactions previously identified by genome-wide location analysis and matches extant literature results. The inferred network provides more detailed information about genes (or clusters) and the interactions among them. Interesting genes, clusters and interactions were identified, which match the literature and the gene ontology information and provide hypotheses for further studies.
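The idea of using pairwise time correlation to estimate a time delay between two expression profiles can be sketched as below. This is a simplified illustration, not the dissertation's algorithm: it scans a range of lags with plain Pearson correlation, and the function names, the `max_lag` parameter, and the minimum overlap of three samples are assumptions.

```python
import math
from typing import Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def best_lag(x: Sequence[float], y: Sequence[float], max_lag: int) -> Tuple[int, float]:
    """Return (lag, correlation) with the largest absolute correlation.

    A positive lag means x at time t is compared with y at time t + lag,
    i.e. changes in x precede changes in y.
    """
    best = (0, 0.0)
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            xs, ys = x[: len(x) - lag], y[lag:]
        else:
            xs, ys = x[-lag:], y[: len(y) + lag]
        if len(xs) < 3:  # too little overlap to say anything
            continue
        r = pearson(xs, ys)
        if abs(r) > abs(best[1]):
            best = (lag, r)
    return best


if __name__ == "__main__":
    gene_a = [1, 3, 2, 5, 4, 7, 3, 1]
    gene_b = [0, 0, 1, 3, 2, 5, 4, 7]  # gene_a delayed by two time points
    print(best_lag(gene_a, gene_b, max_lag=3))
```

The sign of the best lag gives a candidate direction of influence and its magnitude a candidate delay; restricting the series to a window before calling `best_lag` gives a short-time variant for transient interactions.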
Graduated: Fall, 2004 - MS in BCB; Summer 2007 - PhD in BCB
Current Position: Postdoctoral Fellow, Univ. of North Carolina
Major and Co-Major Professors: Dr. Xun Gu and Dr. Daniel Voytas
Master's Thesis Title: Genomic differences between humans and primates
Abstract: Scientists around the world have wondered for many years what distinguishes speciation. Of particular interest is the genetic basis for human/primate (chimpanzee or gorilla) separation. Humans and chimpanzees are 99% identical in their genomic DNA sequence, thus making them very closely related. Despite this high degree of sequence similarity, humans and primates have a number of striking phenotypic differences. We hypothesize that sequence changes that have occurred between humans and primates have altered developmental programs. Because transcription factors alter the expression of numerous genes, we also hypothesize that changes in the expression or activity of transcription factors are responsible for the different phenotypic traits among humans and primates.
Using human chromosome 22 as a model for comparison between human and primate DNA, a random selection of noncoding genes approximately 1-2 kilobases (kb) long upstream was sequenced. Focusing on promoter regions in the sequence data, significant differences were detected when comparing humans and gorillas (p-value < 0.01) and gorillas and chimpanzees (p-value < 0.01), suggesting that limited similarities existed between the species. When comparing humans and chimpanzees (p-value > 0.1), no significant difference was detected. Using this information, transcription factors were analyzed between the human and chimpanzee data to determine if transcription regulation was different between the species. The results indicated no significant difference between humans and chimpanzees at the single-nucleotide level even though the species differ at the genetic and phenotypic levels. The results also indicated that changes in transcription regulation have played a major role in determining speciation. This research opens new avenues in investigating how many of the differences have functional consequences and the relative contributions of these transcription factors to expression differences.
Major and Co-Major Professors: Dr. Heather Greenlee and Dr. Vasant Honavar
Dissertation Title: Characterizing and Influencing Differentiation Of Retinal Progenitor Cells
Abstract: The vertebrate neural retina is a complex organ that is well suited for studying development of the central nervous system. Blinding degenerative retinal diseases including retinitis pigmentosa, macular degeneration and glaucoma are characterized by loss of retinal neurons. At this time there is no way to replace retinal cell loss due to disease or injury since differentiated retinal cells are unable to regenerate. As a potential approach for treating retinal injury, neural progenitor cells have been proposed as a unique source of transplantable cells to replace lost cells in the damaged retina.
Previous studies have transplanted a variety of neural stem cells to the eye in hopes of developing a therapy to replace retinal neurons lost to disease. Integration, survival and differentiation of the cell types have been variably successful. At the moment little is known about the fundamental biological differences between stem cell or progenitor cell types.
We have used proteomic profiling to begin to identify unique characteristics of retinal progenitor cells. Our results demonstrate that expanded retinal progenitor cells express higher levels of stress-response proteins compared to their brain-derived counterparts. Further, we have described the dynamic expression of stress-response proteins during in vivo retinal development. Finally, we have demonstrated that changing the oxidative environment by addition of the antioxidant vitamin E to retinal progenitor cells differentiated in vitro decreases expression of stress-response proteins and alters their differentiation. These studies are the first to describe the expression of stress-response proteins during in vitro and in vivo retinal cellular development. Our results demonstrate the importance of understanding the oxidative nature of a host environment and how differentiation of transplanted cells might be affected.
Scott Emrich
Graduated: Summer, 2007 with a PhD in BCB
Current Position: Assistant Professor, University of Notre Dame, Notre Dame, IN
Major and Co-Major Professors: Dr. Srinivas Aluru and Dr. Patrick Schnable
Dissertation Title: Assembly and Analysis of Complex Plant Genomes
Presentation: June 8, 2007
Abstract: Concurrent advances in high-throughput sequencing and assembly have led to the completion of many complex genomes. Even so, these assemblies require substantial computational resources. In this dissertation, we present a massively parallel approach that scales to thousands of processors without duplicating the biological expertise present in conventional assembly software. Additional bioinformatics techniques were required to accurately assemble the maize genome, including novel repeat detection, and the resulting framework has been strongly supported by maize experimental data. More recently, this framework has been generalized for fruit fly, sorghum, soybean and environmental sequence assemblies. Questions in plant genome analysis were also addressed. For example, we have discovered an estimated 350 “orphan” maize genes and have shown that approximately 1% of all maize genes were recently duplicated, many of them into at least two functional copies. LCM-454 sequencing is introduced, and analyses are presented that indicate this approach can discover rare, potentially tissue-specific transcripts and thousands of SNPs. This dissertation combines high performance computing, computational biology and high-throughput sequencing for our ongoing work on the maize genome project. We conclude by describing how these contributions can be useful for any species, including non-model organisms that are unlikely to be fully sequenced.
Joset Etzel
Graduated: Spring 2006 with a PhD in BCB
Current Position: Research Postdoctoral Fellow, University of Groningen, Netherlands
Major and Co-Major Professors: Dr. Julie Dickerson and Dr. Ralph Adolphs
Dissertation Title: Algorithms and Procedures to Analyze Physiological Signals in Psychophysiological Research
Abstract: This dissertation presents analytical techniques which allow more information to be derived from psychophysiological data than otherwise possible. The techniques include an implemented algorithm for chest strain-gauge respiration signal analysis and a permutation testing method for evaluating changes over time in physiological signals. These methods are applied to three data sets, each examining physiological correlates of emotional experience. In the first study physiological correlates of moods induced using music were identified, although respiration entrainment confounds the issue of whether mood or the music caused the observed patterns. The second study examined physiological responses while subjects watched an emotional movie under three conditions; changes relating both to the movie scenes and condition were identified. Finally, the third study evaluates short term changes in heart rate while viewing words in terms of the type of word viewed and later word recall.
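Permutation testing, mentioned in the abstract above, can be illustrated with a generic two-sample test on the difference in means between a baseline window and a stimulus window. This is a textbook sketch rather than the dissertation's procedure: the function name, the test statistic, the default number of permutations, and the toy heart-rate values are all assumptions.

```python
import random
from typing import Sequence


def permutation_p_value(
    baseline: Sequence[float],
    stimulus: Sequence[float],
    n_permutations: int = 10000,
    seed: int = 0,
) -> float:
    """Two-sided permutation p-value for a difference in means.

    Window labels are shuffled repeatedly; the p-value is the share of
    shuffles whose absolute mean difference is at least the observed one.
    """
    rng = random.Random(seed)

    def mean(values: Sequence[float]) -> float:
        return sum(values) / len(values)

    observed = abs(mean(stimulus) - mean(baseline))
    pooled = list(baseline) + list(stimulus)
    n_base = len(baseline)

    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[n_base:]) - mean(pooled[:n_base]))
        if diff >= observed:
            extreme += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)


if __name__ == "__main__":
    # Toy heart-rate samples (beats per minute), invented for illustration.
    rest = [60, 62, 61, 59, 63, 60]
    task = [80, 82, 79, 81, 83, 80]
    print(permutation_p_value(rest, task, n_permutations=2000))
```

Because the null distribution is built from the data themselves, the test makes no normality assumption, which is one reason such tests suit short physiological recordings.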
Fang Fang
Graduated: Spring 2006 with a PhD in BCB
Current Position: Postdoctoral Fellow, Dr. Arlene Auerbach, Lab of Human Genetics & Hematology, The Rockefeller University, New York City
Major and Co-Major Professors: Dr. Karin Dorman and Dr. Drena Dobbs
Dissertation Title: Virus Recombination: Modeling and Data Analysis
Abstract: As a key evolutionary process, recombination shapes the genetic structure of virus populations. The dramatic increase of virus full-length sequences provides a chance to study virus recombination through molecular data. Many statistical methods have been developed, and many of them are phylogenetic-based. My research focuses on recombination modeling and data analysis. I first apply an existing phylogenetic-based method, the Bayesian dual multiple change-point (DMCP) model, to investigate the role of representative data types for recombination study. We conclude that consensus data is overall the best data type to represent virus genotypes. Using consensus data we studied recombination on all full-length hepatitis B virus (HBV) sequences, and set up a system for using the DMCP model for large scale sequence analysis. We discovered that HBV has an extremely high recombination rate. For the first time we reported circulating recombination forms of hepatitis B virus, and identified one potential recombination hotspot. One important goal of studying recombination is to find potential recombination hotspots, and to reveal the molecular mechanism of recombination. This goal requires identification of all recombinants generated by different recombination events, which is not trivial when recombination sequences have similar mosaic structures. Extending the DMCP model, I developed a method to identify the number of recombination events producing multiple recombinants. I apply this method to several HBV recombinants that have identical mosaic structure and find at least two recombination events.
Graduated: Fall, 2002 with a M.S. in BCB
Current Position: Research Scientist, Dr. Ed Yeung, Iowa State University, Ames, IA
Major and Co-Major Professors: Dr. Volker Brendel and Dr. Zhijun Wu
Thesis Title: A new approach for discovering protein motifs
Abstract: Motif recognition is a powerful homology based sequence analysis tool for clustering new protein sequences into different families based on characteristic motifs. Compared to BLAST, these approaches typically have lower false positive rates and can reveal more remotely related family members. However, the current motif databases do not cover all the sequences in protein sequence databases. One of the major reasons for the low coverage of motif databases is that there is only a small set of known member sequences available for constructing protein motifs for many gene families. I have designed a new algorithm, “mFISHER”, to detect protein motifs from only 2-5 known member sequences by artificial evolution of given sequences based on a position specific PAM evolution model. Based on my test results on 160 motif families, the overall average recall rate or sensitivity (true/(true + false negative)) and specificity (true/(true + false positive)) are 88% and 95%, respectively. Compared with MEME (Multiple EM for Motif Extraction), mFISHER is better based on the recall rate, especially when only 2 or 3 members are available. Both approaches have similar sensitivity. mFISHER is promising for constructing protein motifs when only a few known members are available.
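The two evaluation measures defined in the abstract above can be written out directly. Note that the quantity the abstract calls specificity, true/(true + false positive), is more commonly called precision or positive predictive value. The sketch below is illustrative only; the function names and the counts in the usage example are made up and are not results from the thesis.

```python
def recall(true_positives: int, false_negatives: int) -> float:
    """Recall (sensitivity): true / (true + false negative)."""
    total = true_positives + false_negatives
    if total == 0:
        raise ValueError("no actual family members to recall")
    return true_positives / total


def specificity_as_defined(true_positives: int, false_positives: int) -> float:
    """The abstract's 'specificity': true / (true + false positive).

    This ratio is usually called precision.
    """
    total = true_positives + false_positives
    if total == 0:
        raise ValueError("no predicted members to evaluate")
    return true_positives / total


if __name__ == "__main__":
    # Hypothetical counts for one motif family, invented for illustration.
    tp, fn, fp = 90, 10, 5
    print(f"recall = {recall(tp, fn):.2f}")
    print(f"specificity (as defined) = {specificity_as_defined(tp, fp):.2f}")
```

Averaging these two ratios over all test families gives overall figures of the kind reported in the abstract.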
Graduated: Fall, 2001 with a PhD in MCDB and BCB
Current Position: Postdoctoral Fellow with Dr. Michael Lynch, Biology Department, Indiana University, Bloomington, IN
Major and Co-Major Professors: Dr. Daniel Voytas and Dr. Les Miller
Dissertation Title: Studying the replication mechanism of the yeast retrotransposon Ty5 by molecular and computational approaches
Abstract: The yeast retrotransposon Ty5 is a Ty1/copia element. Officially, it is in the Hemivirus genus of the Pseudoviridae family. The ability to genetically manipulate retrotransposons and the yeast host cell was taken advantage of to explore replication mechanisms unique to Ty5 and common to most retrotransposons. Because of the abundance and diversity of retroelement sequences, along with the fact that many retroelement enzymes have evolved unique functional specificities, computational approaches were also developed to study functional divergence in replication. By screening a randomly mutagenized Ty5 library, two mutations (Y68C, D252N) that caused higher transposition frequencies were identified. Both mutations increased Ty5 cDNA levels, but did not have dramatic effects on the steps after cDNA synthesis (i.e. integration and recombination), or protein synthesis, processing, or solubility. The D252N mutation increased the hydrogen bonding potential of the CCHC zinc finger of nucleocapsid protein (NCp), making the Ty5 NCp zinc finger more like Ty1/copia consensus zinc fingers in terms of hydrogen bonding potential. Other mutations that increased the hydrogen bonding potential (D252R, D252K) provided the same fold increase in Ty5 reverse transcription, and naturally occurring mutations in the Ty5 zinc finger repress this function. Hydrogen bonding is suggested to be a universal requirement for the function of retroviral type zinc fingers and cellular zinc fingers. A half-tRNA priming mechanism for Ty5 reverse transcription was also demonstrated. Mutations in the anticodon of tRNA (IMT) and the putative PBS of Ty5 decreased transposition, but transposition was restored when complementarity between the IMT and PBS was restored. A tree-based method and supplemental Split Tester software were developed to study the functional divergence of reverse transcriptase (RT) with respect to half-tRNA and full-tRNA priming mechanisms.
The domains identified by this computational approach were previously experimentally demonstrated to bind with the tRNA primer/template in HIV RT. Using this software, another domain related to integrase functional specificity, namely whether or not integrase carries out 3’-end processing during integration, was also consistently identified in different integrase datasets. A model describing this functional divergence is proposed.
Graduated: Summer 2003 with an M.S. in BCB
Current Position: Postdoctoral Fellow, The Center for Cardiovascular Bioinformatics and Modeling, Johns Hopkins University, Baltimore, MD
Major and Co-Major Professors: Dr. Vasant Honavar and Dr. Kai-Ming Ho
Thesis Title: Genome wide recognition of Tumor Necrosis Factor (TNF) related ligands in human and Arabidopsis genomes: A structural genomics approach
Abstract: Tumor necrosis factors (TNFs) play a crucial role in mammalian signal transduction pathways for cell proliferation, survival, and differentiation. Human and other species (such as Arabidopsis) genome sequencing projects provide a unique opportunity for genome-wide recognition of TNF related ligand proteins and discovery of potential TNF-TNFR signal transduction mechanism in plants. Genome-wide recognition of TNF related proteins in human and Arabidopsis was carried out using secondary structure prediction and protein fold recognition. In the protein fold recognition scheme, sequence-structure models are evaluated using contact energy score based on Miyazawa-Jernigan and Li-Tan-Wingreen models. Secondary structure composition based initial screening not only reduces search space of protein fold recognition but also shifts the score distribution of the selected candidates to a higher score region. In order to investigate influence of sequence length on threading results, protein fold recognition was conducted on human and Arabidopsis genome sequences of different length. The test on known TNFs from diverse species indicates that about 83% of TNFs are able to be identified; the test on human genome sequences shows that about 80% of known TNFs can be recognized. Integration of secondary structure profiling into the scheme can improve performance by adjusting local sequence-structure relationship. However, this improvement largely depends on accuracy of secondary structure prediction. Average scoring performs better than maximal scoring in model evaluation and selection. Pattern classification algorithms such as decision tree, neural network, Naïve Bayes classifier, and support vector machine are applied to discriminate TNF related proteins from the competitive false positives which have similar secondary structure composition to known TNFs and also have high fold recognition scores. 
Both known TNF and false positive sequences are represented with the twenty q values corresponding to the twenty amino acids in the Li-Tan-Wingreen model. Cross-validation results show that the Naïve Bayes classifier performs better than SVM, neural network, and decision tree, and that it is suitable for stringent control of false positives. This genome-wide search scheme was used to search for potential TNF-like signal proteins in the Arabidopsis genome. The possible roles of candidates in the human and Arabidopsis genomes are discussed. These results demonstrate that structure-based methods can facilitate functional prediction on a genome scale.
Graduated: Fall, 2003 with an M.S. in EEB
Current Position:
Major and Co-Major Professors: Dr. Dean Adams and Dr. Gavin Naylor
Thesis Title: Geometric Morphometric analysis of shark teeth of the genus Rhizoprionodon: The modern, the ancient, and the hypothetical. Modern tooth shape analysis and test of ancestry prediction methods by comparison to fossil shapes
Abstract: Shark teeth are extremely common in the fossil record, and they can potentially provide insight into the evolutionary history of sharks. However, isolated fossil teeth are difficult to assign to the correct jaw, position, and taxon without organismal context because individual sharks exhibit a variety of tooth shapes. Tooth shape varies across jaws, positions within each jaw, and taxa.
Fortunately, tooth shape is quantifiable, and shapes can be compared using the techniques of geometric morphometrics, which measure shape and its covariation with other variables. Analysis of modern tooth shapes was performed in order to gain understanding of patterns of modern tooth shape variation. These results could then be applied to fossils to provide better identification of fossils in order to make use of sharks’ extensive fossil record.
To quantify modern patterns of tooth shape variation, teeth of five Rhizoprionodon species and representatives of three closely related genera (Loxodon, Eusphyra, and Sphyrna) were quantified and analyzed using geometric morphometric methods. Ancestral tooth shapes were estimated using the modern shape data mapped onto a phylogeny created using molecular data, and a Brownian motion model of evolution. These shapes were compared to fossil teeth from Rhizoprionodon sp. and Sphyrna spp. to evaluate the accuracy of the estimated ancestral shapes.
Modern teeth at the front of the jaw displayed the most dramatic shape differences between jaws and positions. Teeth from each genus could be distinguished, but species within Rhizoprionodon could not. Fossil tooth shapes most closely resembled those of modern teeth, indicating that tooth shape did not change according to the Brownian motion model used to predict ancestral shapes.
Graduated: Summer, 2003 with a PhD in BCB
Current Position: Assistant Professor, City University of New York
Major and Co-Major Professors: Dr. Xun Gu and Dr. Dan Nettleton
Dissertation Title: Functional divergence and genome evolution of vertebrate protein kinases
Abstract: The emerging complete and nearly complete genome sequences have provided a significant amount of material for large-scale comparative genomic analysis. Novel methods have been developed to elucidate the function of gene products and functional interacting networks. Many of these post-genomic attempts have focused on unveiling the evolutionary forces that have shaped the network organization. Among various evolutionary forces, duplication of a functional domain, individual gene, chromosomal segment, or entire genome has long been thought to be a primary source of functional novelty in a vast number of gene families. It is therefore intriguing to quantitatively trace the changes of evolutionary constraints after a duplication event.
This study is focused on the exploitation of the functional divergence and evolutionary patterns in vertebrate kinase complements (denoted as kinomes) and kinase-regulated signal transduction pathways, using a combinatorial statistical and evolutionary approach. The analysis of an individual kinase gene family (Jak), protein tyrosine kinase superfamily, and a kinase mediated signal transduction pathway (TGF-β) showed that functional divergence (altered functional constraint) after (domain or gene) duplication is a general pattern. Moreover, the age distribution of the vertebrate kinomes showed that (1) The major kinase-related animal specific signal-transduction pathways have been generated through an ancient continuous domain shuffling (or duplications) during the time period from early stage of eukaryotes to metazoan evolution; (2) Vertebrate tissue-specificity of signal-transduction is facilitated by large-scale duplication event(s) in the early stage of vertebrates; and (3) The kinase pseudogenes are generated through either segmental duplication or retrotransposition very recently.
Graduated: Summer 2007 with a PhD in BCB
Current Position: To be determined
Major and Co-Major Professors: Dr. Patrick Schnable and Dr. Dan Ashlock
Dissertation Title: Adaption of Multiclustering to the Analysis of Microarray Data
Abstract: Clustering has become an integral part of microarray data analysis and interpretation. It helps reduce the scale of information generated by a microarray experiment to a level at which biologists can generate hypotheses. There is a danger that artifacts induced by clustering methods can cause misinterpretation of the data. A clustering method that can accurately capture the natural structure of the data would be a useful tool for biologists to discover the biological meaning buried in the data. To this end, a new clustering algorithm, called K-means multiclustering, is introduced. The method can avoid the artifacts induced by distance or similarity metrics by amalgamating the results of many K-means clusterings.
Results: The multiclustering algorithm is a model-free clustering method. It is found to be reliable and consistent in capturing the underlying data structure, with high accuracy that is competitive with model-based clustering and superior to other methods on synthetic microarray data generated in a manner consistent with the hypothesis of model-based clustering. The algorithm has a high level of immunity to artifacts introduced by the metric used to measure the distance between data points. It can successfully cluster data sets that are designed to have different shapes and variation and cannot be correctly clustered by traditional clustering methods. The cut plot computed by this method is a very simple and useful summary of the data structure. A detailed view of the formation of clusters can also be generated by the method to reveal the underlying hierarchical structure of the data set.
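The idea of amalgamating many K-means runs can be illustrated with a small sketch. This is not the dissertation's algorithm: it assumes one simple way of combining runs, namely repeating plain K-means with random starting centers and a random number of clusters, then recording how often each pair of points is split into different clusters. The function names, the range of k, the number of runs, and the toy data are all illustrative assumptions.

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, rng, iters=20):
    """Plain Lloyd's K-means; returns one cluster label per point."""
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest center.
        for i, p in enumerate(points):
            labels[i] = min(range(k), key=lambda c: dist2(p, centers[c]))
        # Move each center to the mean of its members; an empty cluster keeps its center.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels

def split_frequency(points, runs=50, k_range=(2, 4), seed=0):
    """Fraction of runs in which each pair (i, j), i < j, lands in different clusters."""
    rng = random.Random(seed)
    n = len(points)
    splits = {(i, j): 0 for i in range(n) for j in range(i + 1, n)}
    for _ in range(runs):
        labels = kmeans(points, rng.randint(*k_range), rng)
        for i, j in splits:
            if labels[i] != labels[j]:
                splits[(i, j)] += 1
    return {pair: count / runs for pair, count in splits.items()}

# Toy data (assumed): two well-separated groups of three points each.
data = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 5.2), (5.2, 5.1)]
freq = split_frequency(data)
within = max(freq[(0, 1)], freq[(0, 2)], freq[(1, 2)])
between = min(freq[(i, j)] for i in range(3) for j in range(3, 6))
print("max within-group split frequency:", within)
print("min between-group split frequency:", between)
```

Pairs that are split in almost every run are candidates for belonging to different natural clusters, regardless of the k used in any single run. How the dissertation turns such counts into its cut plot is not described in the abstract, so that step is left out here.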
Graduated: Spring 2005 with a PhD in IG
Current Position: Postdoctoral Research Associate, David Baulcombe, Sainsbury Lab, Norwich, England
Major and Co-Major Professors: Dr. Daniel Voytas and Dr. Mei Hong
Dissertation Title: Characterization of the Sireviruses: A unique group of Ty1/copia LTR retrotransposons in plants
Abstract: Plant genomes have allowed the expansion of many types of mobile genetic elements. LTR retrotransposons are a subclass of mobile genetic elements that replicate using an RNA intermediate. The Pseudoviridae (Ty1/copia) are a family of LTR retrotransposons, and the Sireviruses are one of three genera in the Pseudoviridae. The Sireviruses have features that set them apart from classical retrotransposons. Different members of the Sireviruses show great variability in their genomic structures and the translational tricks they use to express their encoded proteins. For example, we have shown that the SIRE1 elements of soybean use stop codon suppression to express their Env-like protein. Secondly, some monocot members of the Sireviruses may use a bypass mechanism to translate Pol.
Another notable feature of the Sireviruses is that most carry additional coding information in the form of an open reading frame (ORF) referred to as an env-like ORF, and all have encoded extra coding information in their gag gene. The env-like ORF has caused speculation that these elements are plant retroviruses, although no experimental evidence has determined this to be true. However, using a yeast two-hybrid screen, we have discovered an interaction between multiple Sirevirus Gags and a family of related host cell proteins referred to as dynein light chain LC8 and LC6. The LC8 and LC6 proteins are highly conserved in eukaryotes and are components of the dynein and myosin-V motors. LC8 can bind cargo (cell proteins or virus particles) to allow movement along the cytoskeleton. Thus, one hypothesis is that the interaction of the Sirevirus Gags with LC8 or LC6 may allow for movement of the Sirevirus virus-like particles or transposition intermediates within a cell (for example, from cytoplasmic to nuclear compartments). If true, this would not only represent the first example of a movement mechanism for any retrotransposon, but it also illustrates how plant retrotransposons and plant viruses use similar mechanisms to achieve a common goal. In addition, an initial characterization of the expression and localization of the Arabidopsis thaliana LC8/LC6 gene family was completed.
Graduated: Summer, 2006 with a PhD in Biophysics
Current Position: Manager, X-Ray Crystallography Center, Iowa State University
Major Professor: Dr. Mark Hargrove
Dissertation Title: Structural Characterization of Ligand Binding in Hexacoordinate Hemoglobins
Abstract: The goal of biophysics is to study the structures of the components of living organisms and to understand the mechanics of the processes of life. Hemoglobin is a well-suited model for this study. As an essential component of the life blood of mammals, and easy to obtain in large quantities, hemoglobin and its monomeric partner myoglobin are two of the most well studied and characterized components of life. Yet hemoglobin studies continue to reveal new forms of hemoglobin, raising new questions, functional possibilities, and research opportunities. My research focuses on hemoglobins classified as hexacoordinate. I have focused particularly on the structural characterization of these proteins upon ligand binding. Included below for your benefit is a list of abbreviations and terms used in my talk along with their definitions.
Hbs -- hemoglobins
hxHbs -- hexacoordinate hemoglobins
trHbs -- truncated hemoglobins
nsHbs -- nonsymbiotic hemoglobins
sHbs -- symbiotic hemoglobins
SynHb -- Hb from Synechocystis
ferric -- oxidized (3+ iron)
ferrous -- reduced (2+ iron)
ligand -- small binding molecule like oxygen
k' -- rate of ligand binding
K -- equilibrium binding association constant
soret -- optical peak around 390-440nm
List of publications:
Hoy, J. A., Kundu, S., Trent, J. T., 3rd, Ramaswamy, S., and Hargrove, M. S. (2004). The crystal structure of Synechocystis hemoglobin with a covalent heme linkage. J Biol Chem. 279, 16535-16542.
Trent, J. T., 3rd, Kundu, S., Hoy, J. A., and Hargrove, M. S. (2004). Crystallographic analysis of Synechocystis cyanoglobin reveals the structural changes accompanying ligand binding in a hexacoordinate hemoglobin. J Mol Biol. 341, 1097-1108.
Smagghe, B. J., Kundu, S., Hoy, J. A., Halder, P., Weiland, T. R., Savage, A., Venugopal, A., Goodman, M., Premer, S., Hargrove, M. S. (2006). Role of Phenylalanine B10 in Plant Nonsymbiotic Hemoglobins. Biochemistry Aug 15;45(32):9735-9745.
Hoy, J. A., Smagghe, B. J., Halder, P., Hargrove, M. S. (2006). Covalent heme attachment in Synechocystis hemoglobin is required to prevent ferrous heme dissociation. Manuscript in preparation.
Hoy, J. A., Robinson, H., Trent, J. T., Kakar, S., Smagghe, B. J., Hargrove, M. S. (2006). Crystal structure of a nonsymbiotic plant hemoglobin; implications for the evolution of oxygen transport. Manuscript in preparation.
Bio: BA in Physics and BA in Humanities from Wartburg College, Waverly, Iowa, 1996; MS in Physics from Iowa State University, 1999; Temporary Instructor of Physics, ISU, 1999 - 2000; PhD studies in Biophysics, ISU, 2000 - 2006; Postdoc in Hargrove Lab
Graduated: Summer 2004 with an M.S.; Summer 2007 with a PhD
Current Position: GenomeQuest, Field Application Scientist, Westborough, MA
Major and Co-Major Professors: Dr. Karin Dorman and Dr. Susan Carpenter
Thesis Title: EIAV DB: A comprehensive Equine Infectious Anemia (EIAV) Virus database
Abstract: A major problem in biology is the storage and retrieval of biological data in a meaningful and efficient manner. With the advent of mass sequencing projects, such as the human genome project, the need to store, retrieve, and analyze sequence data is stronger than ever before. The following thesis tackles a small part of this problem by presenting techniques, models, and applications for productively storing and retrieving a set of related viral sequences in a central data bank. The thesis begins by providing an overview of the relational database and its role in storing biological data. The main chapter of the thesis is a description of a novel relational database application (EIAV DB). EIAV DB is a central repository of Equine Infectious Anemia Virus sequence and feature information. The models and application provide insight into technologies that help alleviate the storage and retrieval problem.
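As an illustration of the relational approach described above, the sketch below stores sequences and their features in two linked tables and retrieves a feature's subsequence with a join. It is not the EIAV DB schema, which the abstract does not give: the table names, column names, isolate label, and nucleotide string are all hypothetical placeholders, and SQLite is used only because it ships with Python.

```python
import sqlite3

# In-memory database so the example leaves no files behind.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sequence (
    seq_id     INTEGER PRIMARY KEY,
    isolate    TEXT NOT NULL,      -- hypothetical isolate label
    residues   TEXT NOT NULL       -- nucleotide string
);
CREATE TABLE feature (
    feature_id INTEGER PRIMARY KEY,
    seq_id     INTEGER NOT NULL REFERENCES sequence(seq_id),
    name       TEXT NOT NULL,      -- e.g. a gene or regulatory element
    start_pos  INTEGER NOT NULL,   -- 1-based, inclusive
    end_pos    INTEGER NOT NULL    -- 1-based, inclusive
);
""")

# Placeholder rows; the residues are made up and are not real EIAV sequence.
conn.execute("INSERT INTO sequence VALUES (1, 'example-isolate', 'ATGGCAGAATCGAAGGAAGC')")
conn.execute("INSERT INTO feature VALUES (1, 1, 'rev', 1, 12)")

# Retrieve the subsequence covered by every feature with a given name.
rows = conn.execute(
    """
    SELECT s.isolate, f.name,
           substr(s.residues, f.start_pos, f.end_pos - f.start_pos + 1)
    FROM feature AS f JOIN sequence AS s USING (seq_id)
    WHERE f.name = ?
    """,
    ("rev",),
).fetchall()
print(rows)
```

Keeping features in their own table means one sequence can carry any number of annotations, and a query by feature name works across all stored isolates without touching the sequence table's layout.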
Major and Co-Major Professors: Dr. Jim Reecy and Dr. Vasant Honavar
Dissertation Title: Hypothesis building using the Animal Trait Ontology
Abstract: With the advent of sequencing projects in model organisms, humans, and domesticated livestock species, the need for storage, retrieval, and analysis of genomics information for these animals has become important. The Animal Trait Ontology (ATO) is an ontology that has been created to store the relationships between farm animal traits for several domesticated farm animals. The Collaborative Ontology Building (COB) editor was used to create and edit the ATO. An online ontology browser has been developed to search and browse the ontology and to view the relationships between the terms. Some of the traits in the ontology are linked to associated quantitative trait loci (QTL) information for each species through a tool called the Comparative Animal QTL (CAQ) tool which allows users to compare QTL experiments in livestock species. The tool allows QTL experiments to be compared based on 1) one trait given one species, and 2) two traits given one species. The effectiveness of the tool is recorded in the form of a data and statistical analysis which demonstrates its use in examining pleiotropic effects for traits in the pig. In addition, the Human and Animal Trait Ontology is discussed and it will form an agglomeration of several different species ontologies, including the ATO, that will form a consensus for describing phenotypes and traits across different disease models.
Graduated: Summer, 2004 with a PhD in IG and with a BCB minor
Current Position: Postdoctoral Research Associate, VCU (Virginia Commonwealth University), Richmond, VA; Project: SNPs in mammals
Major and Co-Major Professors: Dr. Thomas Peterson and Dr. Xun Gu
Dissertation Title: Computational and molecular analysis of Myb gene family
Abstract: Myb proteins are defined by a highly conserved DNA-specific binding domain termed Myb, which is composed of approximately 50 amino acids with constantly spaced tryptophan residues. Multiple copies of Myb domains often exist as tandem repeats within a single protein. There are up to four tandem Myb repeats present in Myb proteins identified to date (termed R0R1R2R3 hereafter). In our study, we collected additional Myb genes, and performed a series of phylogenetic analyses to explore the evolutionary origin of Myb genes. The results suggest that the Myb gene family originated from an ancient one Myb-box gene. One and two intragenic duplications produced R2R3 and R1R2R3 Myb genes, respectively, which then co-existed in the primitive eukaryotes and gave rise to the currently extant Myb genes. Based on our results, we proposed that plant R1R2R3 Myb genes were derived from R2R3 Myb genes by gain of the R1 repeat through an ancient intragenic duplication; this gain model is more parsimonious than the previous proposal that plant R2R3 Myb genes were derived from R1R2R3 Myb genes by loss of the R1 repeat. The phylogenetic analysis of isolated individual Myb repeats indicates that R2 repeat has evolved more slowly than the R1 and R3 repeats. However, it is not clear which repeat is the most ancient one.
Another goal of our project is to classify and predict functions of Myb genes. We clustered the closely-related Myb genes into subgroups from Arabidopsis and rice on a basis of sequence similarity and phylogeny. The gene structure analysis revealed that both the positions and phases of introns are conserved in the same subgroup, although these differ between subgroups. Conserved motifs were detected in C-terminal coding regions within subgroups, and these motifs exist specifically in Myb genes. We also found that Myb genes with similar functions are clustered together. In contrast, no conserved regulatory elements were identified in the divergent non-coding regions. Additionally, the distribution pattern of introns in the phylogenetic tree indicates that Myb domains originally had a compact size without introns. Non-coding sequences were inserted and the splicing sites were conserved during evolution.
Graduated: Summer, 2008 with a PhD in BCB
Current Position: Industry position
Major and Co-Major Professors: Dr. Roger Wise and Dr. Xun Gu
Dissertation Title: Assembly And Annotation Tools For Analysis Of Large Contiguous Regions Of The Maize Genome
Abstract: LTR retrotransposons make up significant portions of many of the longer grass genomes. Their repeat sequences across the genome, their terminal repeats, and their nested cluster configuration make assembly of sequence clones challenging and identification of gene regions difficult. In this thesis I provide tools necessary for both assembly and annotation of highly repetitive genomes and use these tools to construct what are currently the two longest maize sequence contigs.
In the first part of the thesis I present TEnest, annotation and visualization software for transposable elements in grass genomes. TEnest identifies all fragmented transposable elements within the input sequence and reconstructs each to the original insertion state. This provides a chronological display of the nesting pattern of clustered transposable elements. For LTR retrotransposons TEnest calculates an estimated age since insertion based on the divergence of its paired LTRs. I also provide a case study of TEnest on the available maize genome sequence. TEnest shows the distribution of transposon families, ages of insertion, and frequencies of solo LTRs. In addition I provide a phylogenetic analysis of retrotransposon families showing the estimated ages since insertion of LTR retrotransposons cluster with their sequence identity, showing that LTR retrotransposons experience specific intervals of extreme proliferation to expand across the genome.
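The age estimate from paired LTR divergence can be sketched as follows. The two LTRs of an element are identical at insertion and then accumulate substitutions independently, so a common approach divides the corrected divergence K by twice the substitution rate r, giving T = K / (2r). The abstract does not say which distance correction or rate TEnest uses, so the Jukes-Cantor correction, the rate value, the function names, and the toy sequences below are all assumptions for illustration.

```python
import math

def ltr_divergence(ltr5, ltr3):
    """Proportion of mismatched sites between two aligned LTRs, skipping gap columns."""
    pairs = [(a, b) for a, b in zip(ltr5.upper(), ltr3.upper())
             if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)

def jukes_cantor(p):
    """Jukes-Cantor corrected distance in substitutions per site (valid for p < 0.75)."""
    if p == 0:
        return 0.0
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def insertion_age(ltr5, ltr3, rate_per_site_per_year):
    """Estimated years since insertion: T = K / (2r), since both LTRs diverge."""
    k = jukes_cantor(ltr_divergence(ltr5, ltr3))
    return k / (2.0 * rate_per_site_per_year)

# Assumed substitution rate; substitute a value appropriate to the genome studied.
ASSUMED_RATE = 1.3e-8

identical = insertion_age("ACGTACGTAC", "ACGTACGTAC", ASSUMED_RATE)
one_diff = insertion_age("ACGTACGTAC", "ACGTACGTAT", ASSUMED_RATE)
print(identical, one_diff)
```

Because the estimate scales inversely with the assumed rate, ages computed this way are best read as relative orderings of insertions within a nest rather than as precise dates.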
In the second part of this thesis I introduce our two contiguous maize sequences, rf1-associated contigs rf1-C1 and rf1-C2 sequenced from maize B73. These are the two longest contiguous maize sequences and provide previously unmatched sequence quality for answering many questions surrounding the makeup of the maize genome. Here, using TEnest, we propose two maize assembly techniques for highly repetitive regions. The use of these processes has allowed us to provide the high quality contiguous sequences of the rf1-associated region and will assist researchers with assembly of difficult sequence clones. We show definite separation between gene and repeat regions. The rf1-associated contigs, when compared to the rice and sorghum genomes, show conserved macro-colinearity between genes across the long sequences. However, a closer look at individual gene islands shows micro-non-colinearity across the analyzed grass species.
The third section of this thesis compares the B73 rf1-associated sequence contigs with two BACs sequenced from Wf9-BG, an Rf1 containing maize line. Here we identify four genes in an island corresponding to a similar gene island in B73; however, a fifth gene is missing from Wf9-BG. Two repeat clusters surround the gene island: one matches its counterpart in B73, while the second does not align to B73. Leading up to this area of recombination we observe a drastically increased frequency of polymorphisms.
Graduated: Summer, 2003 with a PhD Co-Major in BCB and Chemical Engineering
Current Position: Postdoctoral Fellow, Dr. Russ Altman, MD, PhD, Helix Bioinformatics Group, Department of Genetics, Stanford School of Medicine, CA; http://helix-web.stanford.edu/people/alain/
Major and Co-Major Professors: Dr. Peter Reilly and Dr. Amy Andreotti
Dissertation Title: Protein-Carbohydrate and Protein-Protein interactions: Using models to better understand and predict specific molecular recognition
Abstract: Any molecular recognition event results in a change in the free energy of the system. The extent of this change is related to the association constant, such that the more negative the free energy change is, the tighter the interaction between receptor and ligand. Protein-carbohydrate interactions play a critical role in signal transduction, innate immunity and metabolism. Modeling these interactions is somewhat complicated by the inherent flexibility of carbohydrates as well as their relatively large number of functional groups. An empirical scoring function for docking carbohydrates to proteins will be presented specifically tailored to predict both the correct binding orientation and free energy of binding of the carbohydrate-ligand/protein-receptor complex. This new scoring function can predict free energies of binding to within 1.1 kcal/mol residual standard error, a definite improvement over existing scoring functions which result in standard errors well over 2 kcal/mol. Application of automated docking methodology to determine carbohydrate recognition specificity of the C-type lectin, human Surfactant Protein D will also be presented. In the second part of the thesis, the role of π-stacking interactions (e.g. between Tyr side chains) in stabilizing protein folds will be discussed. A 17-residue peptide derived from the naturally occurring anti-microbial peptide Tachyplesin I is investigated using NMR spectroscopy. NOE cross peaks were observed confirming the existence of this interaction in solution. In the final part of the thesis, a quantitative NMR investigation into the self-association behavior of the regulatory domains of several Tec family member kinases will be presented. Of particular interest, self-association within Bruton's Tyrosine Kinase (Btk) regulatory domains occurs through the formation of an asymmetric homodimer.
Together this work demonstrates the importance of rigorous biophysical characterization of bio-molecular recognition events and how interdependent computational modeling and experimentation are.
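The relation between binding free energy and the association constant mentioned in the abstract is the standard thermodynamic one, ΔG = -RT ln K_a. The sketch below applies it in both directions to show what a given error in predicted free energy means as a multiplicative error in K_a. It is not the dissertation's scoring function; the temperature of 298.15 K and the function names are assumptions chosen for illustration.

```python
import math

R_KCAL = 1.987204e-3   # gas constant in kcal/(mol*K)
T_ASSUMED = 298.15     # assumed temperature in kelvin

def delta_g(k_assoc, temperature=T_ASSUMED):
    """Standard binding free energy (kcal/mol) from an association constant."""
    return -R_KCAL * temperature * math.log(k_assoc)

def fold_error(dg_error, temperature=T_ASSUMED):
    """Multiplicative uncertainty in K_a implied by an error in free energy (kcal/mol)."""
    return math.exp(abs(dg_error) / (R_KCAL * temperature))

dg_example = delta_g(1e6)        # a tighter complex gives a more negative value
fold_at_1_1 = fold_error(1.1)    # the standard error quoted for the new function
fold_at_2_0 = fold_error(2.0)    # the level quoted for existing functions
print(dg_example, fold_at_1_1, fold_at_2_0)
```

Because K_a depends exponentially on ΔG, a modest reduction in the free energy error shrinks the uncertainty in the predicted binding constant by a much larger factor.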
Graduated: Spring, 2008 with a PhD in BCB
Current Position: Postdoctoral Fellow, Fred Hutchinson Cancer Research Center, 1100 Fairview Ave. N. PO Box 19024, Seattle, WA 98109; http://www.fhcrc.org/
Major and Co-Major Professors: Dr. Dianne Cook and Dr. Eve Wurtele
Dissertation Title: Interactive graphics, graphical user interfaces and software interfaces for the analysis of biological experimental data and networks
Abstract: Biologists need to analyze and comprehend increasingly large and more complex experimental data. These experimental data are multivariate, where each row corresponds to a biological entity, and each column corresponds to the level of an experimental treatment. Biological experiments often produce multiple data sets, each describing one aspect of the system, such as the transcriptome recorded by a microarray or metabolome recorded using gas chromatography mass spectrometry (GC-MS). A biochemical network model provides a conceptual system-level framework for integrating data from different sources. Effective use of graphics enhances the comprehension of data, and interactive graphics permit the analyst to actively explore data, check its integrity, satiate curiosities and reveal the unexpected. Interactive graphics have not been widely applied as a means for understanding data from biological experiments. This thesis addresses these needs by providing new methods and software that apply interactive graphics in coordination with numerical methods to the analysis of biological data, in a manner that is accessible to biologists.
Graduated: Spring, 2004 with a PhD in BCB
Current Position: Postdoctoral Fellow, Genetics Department, University of GA, Athens, GA
Major and Co-Major Professors: Dr. Daniel Ashlock and Dr. John Mayfield
Dissertation Title: Pseudophyte evolutionary algorithm: A simple computational model of parapatric speciation
Abstract: The Pseudophyte Evolutionary Algorithm (PEA) is an individual-based computer model of a population of haploid, annual plants used to examine the process of speciation in a patchy environment. The model incorporated both pre-mating and post-zygotic mechanisms for the evolution of reproductive isolation via pollen selection and offspring inviability, respectively. The PEA allows speciation as an emergent property rather than an explicit feature of the model to understand how environmental patchiness, number and arrangement of loci, and reproductive output of individuals affected the strength of isolating mechanisms as well as the rate at which these evolve. The effect of how genotypes were mapped to phenotypes was also explored to examine the sensitivity of the PEA to alternate representations.
Jae-Hyung Lee
Graduated: Fall, 2007 with a PhD in BCB
Current Position: Postdoctoral Fellow, Drena Dobbs Lab, Iowa State University
Major and Co-Major Professors: Dr. Drena Dobbs and Dr. Kai-Ming Ho
Dissertation Title: Analysis of protein-RNA and protein-peptide interactions in Equine Infectious Anemia Virus (EIAV) infection
Abstract: Macromolecular interactions are essential for virtually all cellular functions including signal transduction processes, metabolic processes, regulation of gene expression and immune responses. This dissertation focuses on the characterization of two important macromolecular interactions involved in the relationship between Equine Infectious Anemia Virus (EIAV) and its host cell in horse: i) the interaction between the EIAV Rev protein and its binding site, the Rev-responsive element (RRE) and ii) interactions between equine MHC class I molecules and epitope peptides derived from EIAV proteins. EIAV, one of the most divergent members of the lentivirus family, has a single-stranded RNA genome and carries several regulatory and structural proteins within its viral particle. Rev is an essential EIAV-encoded regulatory protein that interacts with the viral RRE, a specific binding site in the viral mRNA. Using a combination of experimental and computational methods, the interactions between EIAV Rev and RRE were characterized in detail. EIAV Rev was shown to have a bipartite RNA binding domain containing two arginine rich motifs (ARMs). The RRE secondary structure was determined and specific structural motifs that act as cis-regulatory elements for EIAV Rev-RRE interaction were identified. Interestingly, a structural motif located in the high affinity Rev binding site is well conserved in several diverse lentiviral genomes, including HIV-1. Macromolecular interactions involved in the immune response of the horse to EIAV infection were investigated by analyzing complexes between MHC class I proteins and epitope peptides derived from EIAV Rev, Env and Gag proteins. Computational modeling results provided a mechanistic explanation for the experimental finding that a single amino acid change in the peptide binding domain of the equine MHC class I molecule differentially affects the recognition of specific epitopes by EIAV-specific CTL.
Together, the findings in this dissertation provide novel insights into the strategy used by EIAV to replicate itself, and provide new details about how the host cell responds to and defends against EIAV upon infection. Moreover, they have contributed to our understanding of the macromolecular recognition events that regulate these processes.
Darrin Lemmer
Graduated: Spring, 2006 with an M.S. in BCB
Current Position: IBM
Major and Co-Major Professors: Dr. Gloria Culver and Dr. Drena Dobbs
Thesis Title: CAVEMol: an immersive 3D molecule viewer
Abstract: As the number of solved molecular structures deposited with the Protein Data Bank (PDB) increases, so too does the desire for more advanced ways of using this data. Traditional applications for viewing and manipulating molecular structures create a computer-generated model on a standard desktop computer screen. The display may employ some method of stereography to create the illusion of depth, but generally the user just sees a flat image. The user is able to interact with the molecule by magnifying it to get a closer look at a particular area of interest, or by rotating it along an arbitrary axis, thus allowing all sides of the molecule to be seen, though only one side is in view at any given time. The user may also be able to see changes in the molecule over time whereby each conformation of the molecule is a separate frame of an animation, or they may even be able to make modifications to the structure in real time. Regardless of the amount of control the user has over the molecule, however, one thing remains the same: the user experiences the molecule as though it were an object floating behind the monitor screen which they can indirectly control using a mouse or other pointing device.
An immersive environment, on the other hand, provides a new paradigm for molecular visualization, allowing the user a much more realistic interaction with the molecule. The user becomes part of the viewing experience, traversing a molecule as though walking or flying within it. The molecule can completely surround them on all sides, giving them a true sense of the size and shape of the molecule in three dimensions. The user may also interact with the object directly, moving and rotating it with their hands rather than a mouse.
This approach should prove particularly valuable for operations such as “interactive docking,” which allows a user to manipulate the interface between two molecules to identify favorable interaction sites. While this can be done to a degree in today’s desktop molecule viewers, the operation is difficult and time consuming. Because today’s viewers are limited to a flat screen display, a user can only attempt to dock two molecules in two dimensions at a time. When the structure is rotated, more often than not the third dimension is not properly aligned. Realigning the third dimension invariably breaks one or both of the first two. The result is a long and frustrating cycle of alignment, rotation, and realignment. By allowing direct manipulation in all three dimensions simultaneously, the immersive perspective eliminates this cycle.
This thesis presents the design and implementation of CAVEMol, a molecular visualization application for immersive environments. I will also give an overview of molecular visualization and immersive environments, and then discuss future work that can be done in this area as well as applications where molecular visualization in an immersive environment can be particularly valuable.
Graduated: M.S. in BCB, Fall 2004 and PhD in BCB in Spring 2009
Current Position: Past: The Institute for Genomic Research (TIGR) with Dr. Robin Buell
Thesis Major and Co-Major Professors: Dr. Xiaoqiu Huang and Dr. Daniel Voytas
Thesis Title: BACAP: An assembly program for hierarchical shotgun sequencing
Abstract: We propose a sequence-based algorithm BACAP to assemble BAC sequences generated from hierarchical shotgun sequencing. Our approach relies on sequence similarity rather than physical mapping. It follows the “overlap-layout-consensus” framework used for shotgun sequencing data. BACAP uses heuristic methods to achieve efficiency and accuracy. It was tested on four simulated data sets of 200 BAC-size sequences each and one real data set of 228 rice BACs from TIGR. The average running time was 25 minutes on one 900 MHz IA-64 GenuineIntel Itanium machine. Our results show that BACAP can quickly and accurately accomplish some BAC assembly tasks without physical mapping information.
PhD Major Professors: Dr. C. Robin Buell and Dr. Xun Gu
Dissertation Title: Insights into the rice and Arabidopsis genomes: intron fates, paralogs, and lineage-specific genes
Abstract: With the availability of near-complete rice genome sequence, high-quality annotation data, and large expression profile datasets, we examined segmental duplication, intron turnover, and paralogous protein family composition in rice. These data suggest a large percentage of the rice genome was involved in segmental duplication creating a large number of paralogous families. We found that singleton and paralogous family genes differed substantially not only in their likelihood of encoding a protein of known or putative function but also in the distribution of specific gene function. We showed that a significant portion of the duplicated genes in rice show divergent expression although a correlation between sequence divergence and correlation of expression could be seen in very young genes. We observed that intron evolution within the rice genome following segmental duplication is dominated by intron loss rather than intron gain. In addition, with the availability of more complete or near-complete plant genomes and transcriptomes across a wide range of species, we identified and characterized conserved Brassicaceae-specific genes and Arabidopsis lineage-specific genes. Lineage specific genes in the Brassicaceae and within Arabidopsis were enriched in genes of no known function and appear to be fast evolving at the protein sequence level.
Graduated: Fall 2001 with an M.S. in BCB
Current Position: Staff, J. Craig Venter Institute, 9704 Medical Center Drive, Rockville, MD 20850; phone: 240-268-2767; email: press@venterinstitute.org; http://www.venterinstitute.org/
Major and Co-Major Professors: Dr. Xun Gu and Dr. Xiaoqiu Huang
Thesis Title: The Relationship of Sequence Similarity and Expression Pattern Similarity between Yeast Genes within Gene Families
Abstract: After gene duplication, the sequence and expression patterns of duplicated genes diverge. It is known that the functional divergence of duplicated genes could be related to the divergence of both their coding sequence and their expression profile, the latter mainly caused by sequence change in the regulatory region. But it is not known whether sequence divergence and expression pattern divergence are correlated. Earlier research by Andreas Wagner showed there is at most a very weak correlation between them. In contrast, our research shows there is a strong correlation between sequence similarity and expression profile similarity if the sequences are quite conserved; the degree of coexpression of duplicated genes is consistent with their duplication order.
Graduated: Fall, 2002 with an M.S. in Genetics
Current Position: Scientist, Nanosphere, Inc., Northbrook, IL
Major and Co-Major Professors: Dr. Steve Rodermel and Dr. Vasant Honavar
Thesis Title: Proteomics of the developing chloroplast in maize
Abstract: Chloroplast protein expression profiles during the light-induced biogenesis of the maize plastid were determined from 2D gel analysis. At five time points of this ‘greening’ process (0, 2, 4, 12, and 48 hours post-illumination), maize plant tissue was collected, plastids isolated, and protein precipitated and separated in two dimensions using 2D protein gels. From these proteome maps, quantities of spots were analyzed by Principal Components Analysis, hierarchical pairwise average linkage cluster analysis, Adaptive Resonance Theory 2 (ART2) cluster analysis, and Self Organizing Map cluster analysis to determine chloroplast protein expression profiles. 54 spots representing 26 proteins were identified by MALDI-TOF mass spectrometry and used to verify the protein expression profiles. Two main conclusions were drawn from this data: 1) ART2 may be a useful clustering tool for expression data, and 2) different forms or modifications of the same protein show different expression patterns.
Wiesia Mentzen
Graduated: Summer, 2006 with a PhD in BCB
Current Position: Senior Scientist with Alberto de la Fuente at CRS4 Bioinformatica, Pula, Italy
Major and Co-Major Professors: Dr. Eve Wurtele and Dr. Xun Gu
Dissertation Title: From Pathway to Regulon in Arabidopsis
Abstract: I apply combined bioinformatic approaches using genomic and transcriptomic data to investigate the fatty acid biosynthesis pathway, at the molecular level, and in the context of the systems biology of Arabidopsis. Fatty acids are essential components of all known bacterial and eukaryotic cells, with critical roles as energy reserves and as the metabolic precursors for biological membranes. The pathway for fatty acid synthesis seems to be conserved across all living systems. Acetyl-CoA carboxylase, a member of a superfamily of biotin-dependent enzymes, catalyzes the first committed step of the fatty acid biosynthesis pathway. Phylogenetic study exposed complex and intertwined evolutionary histories of this family, with multiple domain fusions and rearrangements. As revealed by meta-analysis of a wide array of Arabidopsis transcriptomic data, fatty acid biosynthesis is transcriptionally regulated, and this regulation extends not only across all pathway reactions, but also to some substrate- and cofactor-producing reactions, thus defining a major transcriptionally co-regulated pathway. I extend the meta-analysis of the transcriptome to find groups of coexpressed genes (also called modules, or regulons) in the Arabidopsis genome. Major functionally-coherent gene groups were identified. These comprise development, information processing, defense, and metabolism, as well as tissue- and organelle-specific processes.
Graduated: Summer 2008 with a PhD in EEB
Current Position: Post-doctoral Fellow, Dr. Tony Frankino's lab, Dept. of Biology and Biochemistry, University of Houston
Major and Co-Major Professors: Dr. Fred Janzen and Dr. Dean Adams
Dissertation Title: Post-orbital color pattern variation and the evolution of a radiation of turtles (Graptemys)
Abstract: One of the most deeply studied areas in the field of evolutionary biology is the formation and maintenance of new species, as well as the variation in the rate and extent to which taxa radiate. A range of evolutionary processes, from ecological adaptation to sexual selection and reinforcement, can lead to the formation of new species. However, the formation of new species likely results from several isolating mechanisms acting in concert. The map turtle complex (genus: Graptemys) is an excellent model system for exploring the nature of speciation given its exceptional species richness and high levels of morphological diversity, particularly in facial coloration patterns. This research utilizes an integrative approach to establish the role of post-orbital color patterns in species diversification and maintenance. This multi-faceted approach incorporates aspects of phylogenetics, population and quantitative genetics, morphometrics, and behavior to assess morphological evolution within species and across the genus. The phylogeny of map turtles was characterized by a hard polytomy indicating rapid speciation. Across the genus, morphological evolution occurred in a parsimonious manner. Within species, both morphology and genetics exhibited a pattern of isolation by distance. Temperature significantly influenced coloration patterns, and multivariate heritability was generally low. Finally, in behavior trials, neither males nor females spent significantly more time with members of their own species. In all projects, the signatures of sexual selection or reinforcement were absent or equivocal where they would be expected if they were the main forces continuing to shape interactions among map turtle species. The results of this research indicate that the role of past and on-going selection on coloration pattern within the map turtle clade has been limited, indicating that post-orbital coloration was not the driving factor in the radiation of this turtle clade.
Alternative explanations for map turtle species richness are explored.
Graduated: Summer 2007 with a PhD in BCB
Current Position: Postdoctoral Associate, Crop Genome Informatics Laboratory, USDA Agricultural Research Service, On the Campus of Iowa State University
Major and Co-Major Professors: Dr. Robert Jernigan and Dr. Drena Dobbs
Dissertation Title: Studies of Protein Designability using Reduced Models
Presentation: July 9, 2007
Abstract: One of the most important problems in computational structural biology is protein designability, that is, why protein sequences are not random strings of amino acids but instead show regular patterns that encode protein structures. Many previous studies that have attempted to solve the problem have relied upon reduced models of proteins. In particular, the 2D square and the 3D cubic lattices together with reduced amino acid alphabets have been examined extensively and have led to interesting results that shed some light on evolutionary relationships among proteins. Here, in addition to the 2D square lattice, we study the 2D triangular and 3D face centered cubic (fcc) lattices, we perform designability studies using different shapes embedded in the 2D square lattice, and we use machine learning algorithms to classify binary sequences folding to highly- or poorly-designable conformations. In the first part of the thesis we extend the transfer matrix method to the 2D triangular lattice. The transfer matrix method is a highly efficient method of enumerating all conformations within a compact lattice area that has earlier been developed for the 2D square and 3D cubic lattices. In addition, we enumerated all compact conformations within simple geometries on the 2D triangular and 3D face centered cubic (fcc) lattices using a standard backtracking algorithm. In the second part of the thesis we describe protein designability studies on various shapes in the 2D square lattice using a reduced hydrophobic-polar (HP) amino acid alphabet. We used a simple energy function that counted the number of H-H, H-P and P-P interactions within a restricted set of protein shapes that have the same number of residues and non-bonded contacts. We found a difference in the designabilities of different protein shapes. Finally, in the third part of the thesis we used standard machine learning algorithms to classify two classes of protein sequences.
We first performed a designability study for two shapes, using a binary HP alphabet, on the 2D triangular lattice and separated highly- and poorly-designable conformations. Highly-designable conformations had many sequences folding to them with the lowest energy and poorly-designable conformations had few or no sequences folding to them. Sequences were classified as highly- or poorly-designable depending on whether they folded to highly- or poorly-designable structures. Using several machine learning algorithms such as Decision Tree, Naïve Bayes, and Support Vector Machine, we were able to classify highly- and poorly-designable sequences with high accuracy.
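As an illustration of the kind of contact-counting energy function the abstract above describes, the following is a minimal sketch for the 2D square lattice only. The function names, the coordinate representation, and the default weights (the conventional HP choice of -1 per H-H contact and 0 otherwise) are assumptions made for this example; they are not taken from the dissertation.

```python
from collections import Counter


def contact_counts(sequence, coords):
    """Count non-bonded nearest-neighbour contacts by pair type (HH, HP, PP).

    sequence: string over {"H", "P"}; coords: list of (x, y) lattice sites,
    one per residue, in chain order (a self-avoiding walk).
    """
    if len(sequence) != len(coords):
        raise ValueError("sequence and coords must have the same length")
    if len(set(coords)) != len(coords):
        raise ValueError("conformation is not self-avoiding")
    index_at = {site: i for i, site in enumerate(coords)}
    counts = Counter()
    for i, (x, y) in enumerate(coords):
        # Look only right and up so that every lattice edge is visited once.
        for neighbour in ((x + 1, y), (x, y + 1)):
            j = index_at.get(neighbour)
            if j is None or abs(i - j) == 1:
                continue  # empty site, or a bonded (chain) neighbour
            counts["".join(sorted(sequence[i] + sequence[j]))] += 1
    return counts


def energy(sequence, coords, weights=None):
    """Weighted sum of contact counts; the default weights are an assumption."""
    if weights is None:
        weights = {"HH": -1.0, "HP": 0.0, "PP": 0.0}
    counts = contact_counts(sequence, coords)
    return sum(weights[pair] * n for pair, n in counts.items())


# A four-residue chain folded into a unit square: residues 0 and 3 touch.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
print(contact_counts("HPPH", square), energy("HPPH", square))
```

In a designability study, a function like this would be evaluated for every sequence over every enumerated conformation, and a conformation's designability would be the number of sequences for which it is the unique lowest-energy structure.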
Graduated: Spring 2004 with an M.S. in BCB
Current Position: Bioinformatics Scientist, NewLink Genetics, Ames, IA
Major and Co-Major Professors: Dr. Dan Ashlock and Dr. Kirk Moloney
Dissertation Title: The Effect of Tags on Non-Local Adaptation
Abstract: This project investigates in greater depth the phenomenon of non-local adaptation previously observed in an evolutionary model based on the game iterated Prisoner’s Dilemma. Non-local adaptation is the ability of an agent or population of agents to perform well against other agents that share no common history or ancestry with them. Populations of agents both with and without identifying tags are evolved to perform noisy iterated prisoner’s dilemma on a toroidal grid. The agents consist of a finite state machine specialized for playing iterated prisoner’s dilemma and simple tag recognition capability. The populations are allowed to evolve for 10,000 generations and the state of the world is stored every 500 generations. Populations from these samples are placed in competition with populations from generation 10,000. This procedure is repeated for varying levels of overall mutation rate, with and without tags, and varying frequencies of tag related mutations. Non-local adaptation is seen in these populations; however, tags seem to slow the acquisition of non-local adaptation. Although non-local adaptation is not a widely accepted phenomenon in biology, these results suggest that it may happen and that the effect is persistent in the face of changes in mutation rate and in the face of increased task complexity. Further analysis shows that the populations tend to have a predominant tag most of the time, with punctuated periods of increased tag space usage that most likely correspond to invasion of the population by an opportunistic agent with a new tag identifier.
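To make the agent representation in the abstract above concrete, here is a minimal sketch of two finite state machine players in a noisy iterated prisoner's dilemma. The payoff values (5, 3, 1, 0), the transition-table encoding, and all names are assumptions for illustration; tags, the toroidal grid, and the evolutionary loop are omitted.

```python
import random

# Conventional payoff values (an assumption; the abstract does not list them).
PAYOFF = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}


class Agent:
    """Finite state machine: transitions[(state, opponent_move)] = (next_state, next_move)."""

    def __init__(self, first_move, transitions, start_state=0):
        self.first_move = first_move
        self.transitions = transitions
        self.start_state = start_state


def flip(move):
    return "D" if move == "C" else "C"


def play(agent_a, agent_b, rounds, noise=0.0, rng=None):
    """Play the two agents against each other and return their total scores.

    With probability `noise`, an intended move is flipped before it is played.
    """
    rng = rng or random.Random(0)
    state_a, state_b = agent_a.start_state, agent_b.start_state
    move_a, move_b = agent_a.first_move, agent_b.first_move
    score_a = score_b = 0
    for _ in range(rounds):
        played_a = flip(move_a) if rng.random() < noise else move_a
        played_b = flip(move_b) if rng.random() < noise else move_b
        pay_a, pay_b = PAYOFF[(played_a, played_b)]
        score_a += pay_a
        score_b += pay_b
        # Each machine reacts to the move its opponent actually played.
        state_a, move_a = agent_a.transitions[(state_a, played_b)]
        state_b, move_b = agent_b.transitions[(state_b, played_a)]
    return score_a, score_b


# Two classic one-state strategies written as transition tables.
TIT_FOR_TAT = Agent("C", {(0, "C"): (0, "C"), (0, "D"): (0, "D")})
ALWAYS_DEFECT = Agent("D", {(0, "C"): (0, "D"), (0, "D"): (0, "D")})

print(play(TIT_FOR_TAT, ALWAYS_DEFECT, rounds=5))
```

A tag-aware variant could, for example, key the transition table on the opponent's tag as well as its move; how the thesis implements tag recognition is not stated in the abstract.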
Graduated: Fall 2006 with a PhD in BCB and Statistics
Current Position: Associate Statistician, Eli Lilly, Toxicology Department, Working with Microarray Analysis, Indianapolis, IN
Major and Co-Major Professors: Dr. Dan Nettleton and Dr. Jim Reecy
Dissertation Title: Identification of Differentially Expressed Functional Categories in Microarray Studies Using Nonparametric Multivariate Analyses
Abstract: Tests of differential expression across groups of genes, within a functional category, are performed using a method motivated by Barry, Nobel, and Wright (2005). Rather than comparing marginal distributions on a gene-by-gene basis across treatment groups, we use a test statistic that can detect general changes in multivariate distributions across treatment groups. Resampling-based methods and multiple-testing adjustments are used to obtain simultaneous inference for multiple groups of genes. Results are visualized on a directed acyclical graph, and new methods for pinpointing genes of greatest interest are provided.
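The general idea of a resampling-based test on a whole gene category, rather than gene by gene, can be sketched as follows. The statistic used here (mean between-group distance minus mean within-group distance) is a simple stand-in chosen for illustration, not the statistic developed in the dissertation, and the function names and data are hypothetical. Multiple-testing adjustment across categories is not shown.

```python
import math
import random


def mean_distance(rows_a, rows_b, same=False):
    """Mean Euclidean distance between rows; if same=True, use unordered pairs only."""
    total, count = 0.0, 0
    for i, a in enumerate(rows_a):
        for j, b in enumerate(rows_b):
            if same and j <= i:
                continue
            total += math.dist(a, b)
            count += 1
    return total / count


def separation(group_a, group_b):
    """Stand-in statistic: between-group distance minus average within-group distance."""
    within = (mean_distance(group_a, group_a, same=True)
              + mean_distance(group_b, group_b, same=True)) / 2
    return mean_distance(group_a, group_b) - within


def permutation_p_value(group_a, group_b, n_perm=199, seed=0):
    """Permutation p-value: shuffle sample labels and recompute the statistic.

    Each row is one sample's expression vector over the genes in a category.
    Each group needs at least two samples.
    """
    rng = random.Random(seed)
    observed = separation(group_a, group_b)
    pooled = list(group_a) + list(group_b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        a, b = pooled[:len(group_a)], pooled[len(group_a):]
        if separation(a, b) >= observed:
            hits += 1
    return (1 + hits) / (1 + n_perm)


# Hypothetical two-gene category with five samples per treatment group.
control = [(0.0, 0.1), (0.1, 0.0), (0.0, 0.0), (0.1, 0.1), (0.05, 0.05)]
treated = [(x + 5.0, y + 5.0) for x, y in control]
print(permutation_p_value(control, treated))
```

Because the statistic is computed on whole expression vectors, it can respond to changes in the joint distribution that a gene-by-gene comparison of marginal distributions would miss.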
Graduated: Summer 2008 with an M.S. in BCB
Current Position: PhD Student, Univ. of CA
Major Professor: Dr. Steve Proulx
Thesis Title: Evolutionary variance of gene network via simulated annealing algorithm
Abstract: The traditional approach of molecular biology research has been to examine and collect data on a single gene or a single reaction. Recently, however, there has been much interest in the dynamics of gene regulatory networks (E. Klipp et al., 2005). We applied a mathematical approach to modeling gene networks. The models depict the reaction kinetics of the constituent parts; the functions are ultimately built from simple expressions derived from Michaelis-Menten enzymatic kinetics, and the functional forms are usually chosen as Hill functions that serve as an approximation of the real molecular dynamics (E. Klipp et al., 2005). These dynamics depend on many parameters, and the parameters strongly influence the behavior of the resulting gene network. Thus, we used a simulated annealing algorithm to find high-fitness, optimal parameters of the gene network. The simulated annealing algorithm is suitable for problems with many degrees of freedom (Jonathan Tomshine and Yiannis N. Kaznessis, 2006). We developed three different models that have two genes and experience two different environments, and ran simulations to describe the behavior of evolving gene networks. From the simulations, we could find how genes interact with each other over evolutionary time, obtain a high fitness for each gene network model, and show how a gene network evolves from the trajectories of its parameters and fitness. We also analyzed the relationship between high fitness and the parameters. We think this approach can be applied to design and optimize other gene networks, and that these findings are useful for the analysis of evolutionary gene networks.
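The combination described in the abstract above, Hill-function kinetics with parameters tuned by simulated annealing, can be sketched in a few lines. Everything specific here is an assumption for illustration: the toy fitness (one gene that should be off at a low input and on at a high input), the cooling schedule, the step size, and the parameter bounds. The thesis's actual two-gene models are not reproduced.

```python
import math
import random


def hill(x, k, n):
    """Hill activation function: x^n / (k^n + x^n)."""
    return x ** n / (k ** n + x ** n)


def toy_fitness(params):
    """Hypothetical fitness: expression should be low at input 0.1 and high at input 10."""
    k, n = params
    low_error = (hill(0.1, k, n) - 0.0) ** 2
    high_error = (hill(10.0, k, n) - 1.0) ** 2
    return -(low_error + high_error)


def anneal(fitness, start, steps=2000, t0=1.0, cooling=0.995, step_size=0.1, seed=0):
    """Maximise `fitness` by simulated annealing; returns (best_params, best_fitness)."""
    rng = random.Random(seed)
    current = list(start)
    f_current = fitness(current)
    best, f_best = list(current), f_current
    temperature = t0
    for _ in range(steps):
        # Gaussian perturbation, clamped to an assumed valid range.
        candidate = [min(20.0, max(1e-3, p + rng.gauss(0.0, step_size))) for p in current]
        f_candidate = fitness(candidate)
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if f_candidate >= f_current or rng.random() < math.exp((f_candidate - f_current) / temperature):
            current, f_current = candidate, f_candidate
            if f_current > f_best:
                best, f_best = list(current), f_current
        temperature *= cooling
    return best, f_best


start = [5.0, 1.0]  # initial half-saturation constant k and Hill coefficient n
best_params, best_fitness = anneal(toy_fitness, start)
print(best_params, best_fitness)
```

Accepting some fitness-lowering moves early, while the temperature is high, is what lets the search escape local optima; as the temperature falls the procedure behaves more and more like plain hill climbing.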
Jeffry D. Sander
Graduated: Fall, 2008 with a PhD in BCB
Current Position: Postdoctoral Fellow, Harvard Medical Center
Major and Co-Major Professors: Dr. Drena Dobbs and Dr. Daniel Voytas
Dissertation Title: Characterization and design of C2H2 zinc finger proteins as custom DNA binding domains
Abstract: As the storage medium for the source code of life, DNA is fundamentally linked to all cellular processes. Nature employs hundreds of sequence-specific DNA binding proteins as transcription factors and repressors to regulate the flow of genetic expression and replication. By adapting these DNA-binding domains to target desired genome locations, they can be harnessed to treat diseases by regulating genes and repairing diseased gene sequences. The C2H2 zinc finger motif is perhaps the most promising and versatile DNA binding framework. Each C2H2 zinc finger domain (module) is capable of recognizing approximately three adjacent nucleotide bases in standard B form DNA. Through directed mutagenesis, novel zinc finger modules (ZFMs) can be selected for most of the 64 possible DNA triplets. By assembling multiple ZFMs with the appropriate linkers, zinc finger proteins (ZFPs) can be generated to specifically bind extended sequence motifs. Several methods of varying complexity are currently available for ZFP engineering. ZFPs generated from the relatively simple modular design method often fail to function in vivo. Those generated using the most reliable module subsets, those recognizing triplets with a 5' guanine (GNN), only function an estimated 50% of the time, while modularly assembled ZFPs comprised primarily of non-GNN modules rarely function in vivo. These low success rates are extremely problematic for applications requiring multiple ZFPs targeting adjacent sequence motifs. More complex approaches provide enhanced success rates as compared to modular design, with the drawback that they are also more labor intensive and require additional biological expertise. 
In this work we engineered ZFPs, analyzed characteristics of functional engineered zinc finger proteins and their targets, formulated algorithms predictive of ZFP success for both modular assembly and OPEN (Oligomerized Pool Engineering) selection methods, and generated online software tools to aid others in the successful application of this technology.
Shannon D. Schlueter
Graduated: Fall 2006 with a PhD in BCB
Current Position: Assistant Professor, Department of Bioinformatics and Genomics, University of North Carolina, Charlotte, NC
Major and Co-Major Professors: Dr. Volker Brendel and Dr. Randy Shoemaker
Dissertation Title: Plant genome informatics: Evaluation and Analysis of genomic DNA features involved in transcriptional processing of protein coding genes
Abstract: As biological data collection methods have become more cost effective and less time consuming, the necessity of computational tools to store, manage, and analyze such data has led to the creation of a broad field of research. With the vast majority of effort in bioinformatics being applied to research on vertebrate species, researchers in the plant sciences have often been left with less than satisfactory tools to fill this need. In the course of this study, I have developed xGDB, an extensible infrastructure for integrating biological data resources and applying them to hypothesis driven research. Eleven plant species xGDB databases have been made publicly available at http://www.plantgdb.org. Using the infrastructure provided by xGDB, a sophisticated system was developed to investigate the reliability of protein coding gene structure annotations on a per gene basis. With this, I generated the necessary dataset to develop and test a plant specific probabilistic model of RNA polymerase II transcription start sites and promoters. Through application of this model, a look at individual plant protein coding gene promoters has shown unique structure and organization. Together, this work demonstrates the importance of integrated computational infrastructure and genomic domain knowledge.
Justin Schonfeld
Graduated: Spring 2006 with a PhD in BCB
Current Position: Postdoctoral Fellow, Cognitive Information Processing group, Computer Science and Engineering Department, University of Nevada, Reno, NV
Major and Co-Major Professors: Dr. Dan Ashlock and Dr. Dan Voytas
Dissertation Title: A modular data analysis pipeline for the discovery of novel RNA motifs
Abstract: This dissertation presents a modular software pipeline that searches collections of RNA sequences for novel RNA motifs. In this case the motifs incorporate elements of primary and secondary structure. The motif search pipeline breaks up sets of RNA sequences into shorter segments of RNA primary sequence called bricks. The bricks are then folded to obtain low energy secondary structures. The distance estimation module of the pipeline then calculates distances between the folded bricks, and then analyzes the resulting distance matrices for patterns. An initial implementation of the pipeline is applied to synthetic and biological data sets. This implementation introduces a new distance measure for comparing RNA sequences based on structural annotation of the folded sequence as well as a new data analysis technique called nonlinear projection. The modular nature of the pipeline is then used to explore the relationships between several different distance measures on random data, synthetic data, and a biological data set consisting of iron response elements. It is shown that the different distance measures capture different relationships between the RNA sequences. The nonlinear projection algorithm is used to produce two-dimensional projections of the distance matrices, which are examined via inspection and k-means multiclustering. The pipeline is able to successfully cluster synthetic RNA sequences based only on primary sequence data as well as the iron response elements data set. The dissertation also presents a preliminary analysis of a large biological data set of HIV sequences in which crossover points were localized.
Graduated: Summer 2003 with an M.S. in BCB
Current Position: Bioinformatics Scientist, NewLink Genetics, Ames, IA
Major and Co-Major Professors: Dr. Srinivas Aluru and Dr. Charles Link
Dissertation Title: Region-specific implication of 5'UTR motifs in translational control mechanisms
Abstract: This study uses a novel approach based on the RESCUE technique (Relative Enhancer and Silencer Classification by Unanimous Enrichment) (Fairbrother et al., 2002) to identify region-specific motifs in the 5'UTR. A highly selective screening procedure is described and implemented, which drastically reduces the false positive rate of identified motifs by the original technique. For increased accuracy, we present the results only for species that have well-curated mRNA data as maintained in the RefSeq curated database. The results of these computations suggest that there are motifs in the 5'UTR that act in conjunction with the Kozak consensus sequence in the process of translation initiation. Specifically, motifs have been identified in the inter-ATG regions of 5'UTRs with multiple uATGs (upstream ATGs) that may have an effect on translation initiation. Strong and weak Kozak sequences have also been associated with mutually exclusive motif sets both upstream and downstream of the true start codon. Finally, a number of motifs were identified as being preferentially present in the uORF (upstream Open Reading Frame) regions, which argues against the theory that uORF sequences are random. In general, uORF regions are also found to be strongly selective against motifs associated with strong Kozak sequences.
In addition to the above-stated results, which are applicable across species, motif overlap analysis (e.g., motifs that are associated with both strong Kozak sequences and the inter-ATG region upstream of the true start codon) also suggests some species-specific translational control mechanisms. The region-specific identification of motifs itself is probably indicative of higher-order secondary and tertiary structures and interactions. The experimental validation of these results could lead to the discovery of novel primary/secondary motifs and translational control mechanisms encoded in the 5' untranslated regions of different species.
Graduated: Fall, 2007 with a PhD in BCB
Current Position: USDA, Bethesda, Maryland; Previous: Postdoctoral Fellow, Volker Brendel's Lab, Iowa State University
Major and Co-Major Professors: Dr. Volker Brendel and Dr. Jonathan Wendel
Dissertation Title: Computational annotation of eukaryotic gene structures: algorithms development and software systems
Abstract: An important foundation for the advancement of both basic and applied biological science is correct annotation of protein-coding gene repertoires in model organisms. Accurate automated annotation of eukaryotic gene structures remains a challenging, open-ended and critical problem for modern computational biology.
The use of extrinsic (homology) information has been shown to be a quite successful strategy for this task, though it is not a perfect solution. Therefore, the continued development of methods not explicitly reliant on homology information--the so-called ab initio gene prediction methods--should help to more rapidly achieve a comprehensive understanding of gene content in model eukaryotes.
This thesis explores the development of novel algorithms in an attempt to advance the current state-of-the-art in ab initio gene prediction. The work has been conducted with an eye towards contributing open source, well-documented, and extensible software systems implementing the methods, and to generate novel biological knowledge with respect to plant taxa, in particular. Splice site prediction, coding fragment recognition, translation initiation site prediction and overall gene structure prediction will be discussed.
Graduated: Fall 2008 with a PhD in BCB
Current Position: Instructor at West Virginia College
Major and Co-Major Professors: Dr. Drena Dobbs and Dr. Vasant Honavar
Thesis Title: Computational analysis and prediction of protein-RNA interactions
Abstract: Protein-RNA interactions are essential for many important processes including all phases of protein production, regulation of gene expression, and replication and assembly of many viruses. This dissertation has two related goals: 1) predicting RNA-binding sites in proteins from protein sequence, structure, and conservation information, and 2) characterizing protein-RNA interactions.
We present several machine learning classifiers for predicting RNA-binding sites in proteins based on the protein sequence, protein structure, and conservation information. Our first classifier uses only amino acid sequence information as input and predicts RNA-binding sites with an AUC of 0.74. Using the neighboring amino acids in the protein structure improves prediction performance over using sequence alone. We show that using evolutionary information in the form of position specific scoring matrices provides a further significant improvement in predictions. Finally, we create an ensemble classifier that combines the predictions of the sequence, structure, and PSSM based classifiers and gives the best prediction performance, with an AUC of 0.81.
We construct the Protein-RNA Interaction Database, PRIDB, a comprehensive collection of all protein-RNA complexes in the PDB. PRIDB focuses on characterizing the molecular interaction at the protein-RNA interface in terms of van der Waals contacts, direct hydrogen bonds, and water-mediated hydrogen bonds. We perform an extensive analysis of the RNA-binding characteristics of a non-redundant dataset of 181 proteins to determine general characteristics of protein-RNA binding sites. We find that the overall interaction propensities for Watson-Crick paired nucleotides and non Watson-Crick paired nucleotides are very similar, with the propensities for amino acids binding to single stranded nucleotides being more divergent. We find that van der Waals contacts are more numerous than hydrogen bonds and amino acids interact with RNA through their side chain atoms more frequently than their main chain atoms. We also find that contacts to the RNA base are not as frequent as contacts to the RNA backbone.
Together, the prediction and characterization presented in this dissertation have increased our understanding of how proteins and RNA interact.
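The AUC values quoted in the abstract above (0.74 for sequence alone, 0.81 for the ensemble) summarize how well a classifier ranks binding residues above non-binding ones. The sketch below shows one standard way to compute AUC from scores, together with a simple score-averaging ensemble. The averaging rule, the function names, and the toy numbers are assumptions for illustration; the abstract does not say how the dissertation's ensemble combines its component classifiers.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney form).

    labels: 1 for RNA-binding residues, 0 otherwise; scores: classifier outputs.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))


def ensemble(*score_lists):
    """Combine classifiers by averaging their per-residue scores (an assumed rule)."""
    return [sum(column) / len(column) for column in zip(*score_lists)]


# Toy per-residue scores from three hypothetical classifiers.
labels = [1, 0, 1, 0]
sequence_scores = [0.9, 0.8, 0.3, 0.1]
structure_scores = [0.7, 0.2, 0.6, 0.4]
pssm_scores = [0.8, 0.3, 0.7, 0.2]

print(auc(labels, sequence_scores))
print(auc(labels, ensemble(sequence_scores, structure_scores, pssm_scores)))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which gives a scale for reading the 0.74 and 0.81 figures reported in the abstract.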
Graduated: Spring 2001 with an M.S. in Genetics
Current Position:
Major and Co-Major Professors: Dr. Susan Carpenter and Dr. Dan Ashlock
Thesis Title: Application of computational tools to analyze evolution of equine infectious anemia virus
Abstract: Evolution is the study of how variation alters the phenotype and population dynamics over time. Population genetics theories fit viral evolution well because of the properties of a viral population. Retroviruses are characterized by a high mutation and replication rate, which produces a heterogeneous mixture of viral variants commonly referred to as a quasispecies. Equine infectious anemia virus (EIAV) infection is a well-studied model for retrovirus variation and evolution (32, 33, 34). EIAV infection is characterized by a rapid, variable, dynamic disease course. Dynamic features of clinical disease, as well as the ability of the horse to control the infection, make EIAV an excellent system to study evolution of viral quasispecies during progression of clinical disease. Here, we describe analyses of genetic data from longitudinal studies of genetic variation in a horse experimentally infected with equine infectious anemia virus. These studies include the genes encoding the regulatory protein Rev and the surface envelope glycoprotein, SU. Phylogenetic and cluster analyses suggested that the population of Rev variants was comprised of two distinct quasispecies that co-existed during infection. The populations shifted rapidly during febrile and afebrile periods, with as little as 10 days between changes in population dominance. In this study, we also examined evolution of the EIAV envelope: quasispecies in the chronic period evolve by random processes, while quasispecies in the inapparent period evolve by a combination of Darwinian selection and random processes. These results suggest that the envelope evolves by different processes during different stages of disease. Different evolutionary mechanisms during different stages of disease require unique approaches to anti-retroviral therapy during different stages of disease.
Together, these results suggest there are unique host environments and viral population interactions during different stages of disease. Multiple quasispecies and varying processes of evolution during persistent retrovirus infection challenge the current thinking and have important biological implications for control of viral infections.
Graduated: Spring 2007 with a PhD in BCB and Co-Major in Mathematics
Current Position: Postdoctoral Fellow, The Jackson Laboratory, Bar Harbor, Maine
Major and Co-Major Professors: Dr. Zhijun Wu and Dr. Robert Jernigan
Dissertation Title: Boundary Value Approaches To Molecular Dynamics Simulation
Abstract: Conformational transitions of biomolecules like proteins play an important role in many cellular processes, most often in a positive way, but sometimes in a detrimental way, perhaps causing diseases. Knowledge about conformational transitions of proteins and other biomolecules has the potential to be important in many areas of biological research. Simulation is an important means of studying these transitions. When a molecule has more than one known stable conformation, one can consider study of conformational transitions by a boundary value approach to molecular dynamics simulation. Application of multiple-shooting methods – an iterative numerical method for solving boundary value problems for ordinary differential equations – is proposed to find Newtonian molecular dynamics trajectories for a system subject to an all-atom molecular mechanics force field. These trajectories correspond to conformational transitions of proteins. Important aspects of this work include assessment of potential biological significance and computational challenges. The many computational issues include feasibility of the approach for larger systems, convergence properties, global optimization algorithms, efficient methods for finding initial trajectories, choice of boundary conditions, methods for parameter reduction, algorithms for handling the initial-value sub-problems, for computing Jacobian matrices, and for solving resulting nonlinear systems of equations. Distance matrix interpolation methods, which are particularly useful for constructing approximate trajectories for application in situations where all-atom Newtonian trajectories are not feasible, have previously been described ([Kim2002]). We introduce different distance matrix interpolation approaches that hold some promise for useful application for the purpose of efficiently constructing initial trajectories as well as for possible progress in construction of approximate trajectories. 
The results from simulating conformational transitions of alanine dipeptide are presented.
REFERENCES
[Kim2002] Kim M, Jernigan R, Chirikjian G. Efficient generation of feasible pathways for protein conformational transitions. Biophysical Journal, 83: 1620 (2002).
[Elb1999] Elber R, Meller J, Olender R. Stochastic path approach to compute atomically detailed trajectories: application to the folding of C peptide. Journal of Physical Chemistry B, 103: 6 (1999).
[Sch1997] Schlick T, Barth E, Mandziuk M. Biomolecular dynamics at long time steps: Bridging the timescale gap between simulation and experimentation. Annual Review of Biophysics and Biomolecular Structure, 26: 181 (1997).
[Ved2006a] Vedell P, Wu Z. Multiple Shooting Methods for Boundary Value Approaches to Biomolecular Dynamics Simulation. (submitted, 2006).
[Ved2006b] Vedell P, Wu Z. Shooting methods with inexact boundary conditions and parameter reduction for protein dynamics simulation (in preparation).
[Ved2006c] Vedell P, Jernigan R, Wu Z. Distance matrix interpolation methods for boundary value approaches to biomolecular dynamics simulation (in preparation).
Graduated: Spring, 2009 with a PhD in BCB
Current Position: Research Scientist, Pioneer Hi-Bred, Des Moines, IA
M.S. Major and Co-Major Professors: Dr. Gavin Naylor and Dr. Vasant Honavar
Thesis Title: Spatial Clustering of Differences in Measured Homoplasy with Respect to Protein Structure
Abstract: The identification of residues that hold misleading phylogenetic signals and the identification of residues that are of functional significance are intertwined. Advances in one area can support the other, mainly because misleading phylogenetic signals come from residues that are not evolving as a random process. This paper is a case study comparing a widely accepted phylogenetic tree to trees that have been inferred from sequence data of five proteins. A new metric, RI Difference, based on the Retention Index, is suggested for measuring the relative support that individual sites provide for two trees. Through the identification of sites harboring disproportionately large misleading phylogenetic signal, we attempt to identify residues that are cooperating to define the function of the protein. This information is presented in the context of the structure of the protein, where clustering patterns (or the lack of them) are observed in the implicated residues. A new bioinformatic software tool, RI Compare, is presented; it implements the metric and blends heterogeneous information from alignments, phylogenetic trees, and structure to support this research. The results are presented, followed by some speculation as to what might be causing erroneous trees to be inferred. The relationship of the implicated residues to those of known importance is also discussed. While, regrettably, the results of this paper do not seem to suggest that the RI Difference measure is a general measure for the identification of functionally important residues in all proteins, there is evidence to suggest it may be applicable to the large transmembrane class of proteins. Unfortunately, no experimental tests of the implicated residues have been performed at this time, and judgment of the correctness of the results has been based solely on the proximity of the implicated residues to ligands, other chains, and residues of known importance. 
However, even if the RI Difference measure is identifying residues other than the functionally significant ones, the fact that the cluster patterns are unlikely to occur at random is intriguing.
Graduated: Summer, 2001 with an MS in BCB
Current Position: Law Student, UC, Davis; received an MS at Stanford
Major and Co-Major Professors: Dr. Daniel Voytas and Dr. Volker Brendel
Title: Insertion site similarities in the Tc1/mariner element family
Abstract: In this study we report the first insertion site profile for the Tc1-like transposon Sleeping Beauty. We found that Sleeping Beauty prefers a consensus sequence of ATA TATAT, where the underlined TA represents the canonical target site. We also performed computational structural prediction analyses on Sleeping Beauty insertion sites and found that they were significantly different from random DNA. We then compared Sleeping Beauty’s insertion site structural profile with the insertion site profiles generated from three other studies of Tc1/mariner elements: Tc1 (genomic copies and experimentally induced insertions (van Luenen and Plasterk, 1994)), Tc3 (van Luenen and Plasterk, 1994) and Himar1 (Lampe et al., 1998). Through this comparison we found that bendability, protein-induced deformability and A-philicity are the most significant factors in determining insertion site preference. We further examined Tn5 and Ty1 insertion sites and found that Tn5 insertions share similarities with those of the Tc1/mariner elements whereas Ty1 insertions do not. Therefore, we predict that much of the target site specificity of the Tc1/mariner elements, and possibly of other DNA transposons, relies on unusual DNA structure in the area of insertion.
Graduated: Summer, 2006 with a PhD in BCB
Current Position: Staff, Roswell Park Cancer Institute, Buffalo, NY, Cancer Call Center at 1-877-ASK-RPCI (1-877-275-7724)
Major and Co-Major Professors: Dr. Xiaoqiu Huang and Dr. Xun Gu
Title: COMPUTATIONAL STUDIES OF ESTS: ASSEMBLY, SNP DETECTION, AND APPLICATIONS IN ALTERNATIVE SPLICING
Abstract: EST sequences are important in functional genomics studies. To better use available EST resources, clustering and assembling are crucial techniques. For EST sequences with deep coverage, no current assembly program can handle them well. We describe a deep assembly program named DA. The program keeps the number of differences in each contig alignment under control by making corrections to differences that are likely due to sequencing errors. Experimental results on the 115 clusters from the UniGene database show that DA can handle data sets of deep coverage efficiently. A comparison of the DA consensus sequences with the finished human and mouse genomes indicates that the consensus sequences are of acceptable quality. EST sequences can be used in SNP discovery. We describe a computational method for finding common SNPs with allele frequencies in single-pass sequences of deep coverage. The method enhances a widely used program named PolyBayes in several aspects. We present results from our method and PolyBayes on eighteen data sets of human expressed sequence tags (ESTs) with deep coverage. The results indicate that our method used almost all single-pass sequences in computation of the allele frequencies of SNPs. EST sequences can also be used to study alternative splicing (AS), which is the most common post transcription event in metazoans. We first developed a pipeline to identify AS forms by comparing alignments between expressed sequences and genomic sequences. Then we studied the relationship between AS and gene duplication. We observed that duplicate genes have fewer AS forms than single-copy genes; we also found that the loss of alternative splicing in duplicate genes may occur shortly after the gene duplication. Further analysis of the alternative splicing distribution in human duplicate pairs showed the asymmetric evolution of alternative splicing after gene duplications. We also compared AS among six species. 
Through detailed, categorized analyses, we found significant differences in both AS rates and splice forms per gene among the studied species. The difference in AS rate between rice and Arabidopsis is significant enough to lead to a difference in protein diversity between those two species.
References:
Jianmin Wang, Xiaoqiu Huang. A method for finding single-nucleotide polymorphisms with allele frequency in sequences of deep coverage. BMC Bioinformatics. 2005 6:220
Zhixi Su, Jianmin Wang (co-authors), Jun Yu, Xiaoqiu Huang, and Xun Gu. Evolution of alternative splicing after gene duplication. Genome Res. 2006
Graduated: Spring, 2002 with an M.S. in BCB
Current Position: Postdoctoral Research Associate, AstraZeneca Pharmaceutical, Wilmington, DE
Major and Co-Major Professors: Dr. Vasant Honavar and Dr. Drena Dobbs
Title: Protein Function Classification: A Data-Driven Approach
Abstract: Machine learning offers one of the most effective and practical approaches to data-driven knowledge acquisition. Decision tree learning is one of the simplest and most commonly used machine-learning algorithms for data-driven induction of classifiers. My work describes an approach to data-driven discovery of sequence motif-based models in the form of decision trees for assigning protein sequences to functional families. Unlike approaches that try to classify protein sequences based on the presence of a single motif, this method is able to capture regularities that can be described in terms of the presence or absence of combinations of motifs. A training set of peptidase sequences with known functions is used to automatically construct decision trees that capture regularities sufficient to assign the sequences to their respective functional families.
Graduated: Fall, 2003 with a PhD in Genetics and BCB
Current Position: Research Associate, Klemke Laboratory, Scripps Research Institute, La Jolla, CA, http://www.scripps.edu/imm/klemke/barry.htm
The role of the SDF-1/CXCR-4 receptor system in breast cancer metastasis.
In May 2005, he received a three-year fellowship from the Susan Komen Breast Cancer Foundation to continue his research in proteomics and cancer metastasis.
Major and Co-Major Professors: Dr. Parag Chitnis and Dr. Suresh Kothari
Title: Identification and functional analysis of thylakoid membrane proteome
Abstract: Membrane proteins play crucial roles in many metabolic pathways. The functions of most membrane proteins remain to be revealed because of their insolubility. New technological breakthroughs in proteomics, together with more available genomic sequence information, make it possible to study functions of membrane proteins on a genome-wide scale. We are using methods from biochemistry, genetics, proteomics and bioinformatics to study the functions of the thylakoid proteome of Synechocystis sp. PCC6803. The thylakoid membrane proteins were separated into peripheral and integral fractions and resolved on 2-D gels with different pH ranges. The protein spots in the 2-D gels were subjected to peptide mass fingerprinting analysis, and in total 390 of 558 analyzed spots were identified as protein products of 128 individual genes, of which 38 genes encode hypothetical proteins with unknown function. To study the function of the hypothetical proteins, we knocked out the DNA sequence of the corresponding ORF, and 10 knockout mutants were obtained. Growth analysis of the mutant cells revealed that only one mutant (H1), which has a deletion in the ORF slr0110, showed a conditional growth phenotype. Detailed analysis indicated that the H1 mutant is sensitive to both glucose and light, which is caused by over-reduction of the PQ pool in the thylakoid membrane. The identities and the structural and functional information of the identified proteins, as well as the 2-D reference maps, were included in a web-based relational database for thylakoid membrane proteins. The database was constructed with MySQL, and the application programs were developed with SQL, Perl, JavaScript and HTML. Users can search the information on identified proteins and compare their own identified proteins with those in the database. A manager interface is also provided for routine maintenance of the database.
Graduated: Summer, 2001 with a PhD in BCB
Current Position: Assistant Professor, Bioinformatics and Computational Biology, Department of Biology, University of Texas, San Antonio, TX; (210) 458-6492; http://www.bio.utsa.edu/faculty/wang.html; Research in my laboratory focuses on the comparative genomics, molecular evolution, and population genetics of gene families.
Major and Co-Major Professors: Dr. Zhijun Wu and Dr. Dan Ashlock
Title: Functional divergence and age distribution of vertebrate gene families
Abstract: Biology is undergoing a revolution based on the accelerating determination of DNA sequences, including the complete genomes of a growing number of organisms (Adams et al., 2000; International Human Genome Sequencing Consortium, 2001; Venter et al., 2001). During this post-genomic era, functional genomics seeks to devise and apply technologies that take advantage of the flood of sequence information to analyze and predict in vivo functions of proteins (Doolittle 1996; McKusick 1997; Durbin et al. 1998). One of the missions of protein genomics is to make direct predictions of function(s) prior to biological experimentation.
The objective of this study is to develop and apply statistical methods to predict functional content from primary sequence and to explore the pattern of gene family evolution. This dissertation is composed of a general introduction, four chapters, each of which is in journal manuscript format, and a general conclusions section. The four chapters that detail the core of the research work are outlined below.
Chapter 1 introduces a new statistical model for testing functional divergence and predicting critical residues (Gu 1999) through a case study of the caspase gene family. By taking advantage of substantial experimental data on caspases, the functional/structural basis of our predictions is extensively studied. The objective of this study is to show the potential of combining new methodology with the classical phylogenetic approach in functional genomics.
Chapter 2 extends the study to a comprehensive survey of functional divergence among a large number of gene families by using the Gu (1999) method (PHYBA, phylogeny-based analysis). The technical issues, biological implications and potential applications are addressed in detail in this chapter.
Chapter 3 investigates the evolutionary patterns of 49 gene families that are generated in the early stage of vertebrates. The times of gene duplications are estimated to test the hypothesis of two-rounds (2R) of genome duplication. Complicated evolutionary patterns (2R/3R) are surveyed.
Chapter 4 examines the impacts of gene duplications on the functional divergence in vertebrate gene families. Two patterns of functional divergence after gene duplication(s) are illustrated.
Graduated: Fall, 2007 with a PhD in BCB
Current Position: Postdoctoral Research Associate, D. Neil Hayes Laboratory, Lineberger Comprehensive Cancer Center, The University of North Carolina at Chapel Hill, Chapel Hill, North Carolina
Major and Co-Major Professors: Dr. Volker Brendel and Dr. Thomas Peterson
Title: Genesis of gene structures and computational analysis of U12-type introns
Abstract: Completely sequenced genomes provide a wealth of information that has allowed the exploration of large-scale biological questions and continues to provide a critical resource for the advancement of biological research. Previously, the number of completely sequenced genomes was small and was generally limited to the model organisms. Currently, the number of genomes completely or partially sequenced is rapidly increasing, with 338 different eukaryotic genomes available as of October 2007. With a genome sequence in hand, the typical first step is gene structure annotation, or identifying the location and structural features of genes within the genome sequence, after which functional descriptions of the genes and relationships to homologous genes can be made, and other higher-level research questions can be investigated. Annotation, then, imparts the biology of the organism onto the genome sequence. The goal of this thesis is to provide useful computational tools for gene annotation in emerging and mature genomes, and to analyze a particularly difficult-to-annotate gene feature. The process of gene structure annotation requires a genomic sequence of sufficient size so that it can contain a full gene, which in eukaryotes can be thousands of nucleotides. The popular method of whole genome shotgun sequencing produces small sequence fragments of hundreds of nucleotides, which are eventually assembled into chromosome sequences, a process that can take several years from start to finish. In the interim, these small sequence fragments are deposited into repositories for historical reference and dissemination purposes, but since they are too small to contain a gene, these fragments are not particularly useful for gene structure annotation purposes. 
I have developed a web-based tool, Tracembler, which facilitates dynamic gene annotation of these fragments through on the fly sequence similarity searching and assembly. Hence, Tracembler allows biologists and interested scientists to immediately create gene annotations upon the latest sequences from emerging genomes without having to wait for the completion of the genome sequencing project. On the other end of the genome maturity spectrum, accurate gene structure annotation, which includes the biologically-correct specification of exons, introns, untranslated regions, protein coding regions, and alternatively spliced variants of a gene, remains a challenge for completely sequenced genomes. Pure computational approaches are excellent for providing an approximate initial summary of an organism’s gene space, but they are not completely accurate or comprehensive. Manual annotation by a human curator, who inspects and reviews the available evidence to make decisions in constructing a gene structure annotation, is considered the highest quality method. Hindrances to manual annotation are that it is time consuming, has restricted participation, and is not easy to conduct. Removing these limitations of manual annotation, I have developed the yrGATE (“your Gene structure Annotation Tool for Eukaryotes”) software, which enables individuals to create gene structure annotations using high quality evidence through an easy-to-use dynamic web browser interface and submit their annotations to a community database. A particular category of often mis-annotated genes is those containing U12-type introns. U12-type introns are a class of introns that have highly conserved sequence features, have a specific spliceosome that processes their removal from pre-mRNA transcripts, and comprise less than 1% of the introns in any studied eukaryotic organism. 
One reason for their mis-annotation is most gene prediction programs are not designed to specifically recognize them, which is likely caused by U12-type introns’ unique sequence features and rare occurrence. Apart from their mis-annotation, U12-type introns are intriguing due to their unique proposed evolutionary history and due to their maintenance in organisms at very low frequencies in a seemingly functional redundancy with the major splicing system. In order to further the understanding of this intriguing gene feature, a large-scale annotation and computational investigation of U12-type introns in the context of their host genes and evolution was completed, which yielded several new discoveries.
Graduated: Summer, 2006 with a PhD Comajor in BCB and Math
Current Position: Assistant Professor, Department of Mathematics, Western Kentucky University, Bowling Green, KY
Major and Co-Major Professors: Dr. Zhijun Wu and Dr. Robert Jernigan
Title: Distance-based Protein Structure Modeling
Abstract: Protein structures can be modeled from knowledge of the interactions or distances between pairs of atoms. This approach, known as distance-based protein structure modeling, includes problems of structure determination and refinement as well as analysis of protein dynamics. The distances for certain pairs of atoms in a protein can often be obtained from our knowledge of various types of bond lengths and bond angles or from physical experiments such as nuclear magnetic resonance (NMR). The coordinates of the atoms, and hence the protein structure, can then be determined from the known distances. However, this requires the solution of a mathematical problem called the distance geometry problem, which has been proved to be computationally intractable in general. In addition, because distance data such as nuclear Overhauser effect (NOE) data in NMR are often insufficient, the protein structures determined by conventional techniques usually are not as accurate as desired. Therefore, the use of such protein structures in important applications, including homology modeling and rational drug design, has been severely limited. In this work, we have developed several efficient algorithms, with supporting theory, for the solution of the distance geometry problem using a geometric build-up algorithm. We also introduced a knowledge-based method for protein structure refinement, in which we constructed a dedicated structural database for protein inter-atomic distance distributions and derived so-called mean force potentials to refine NMR-determined protein structures. We participated in the CASPR competition on comparative models and reported substantial improvement using mean force potentials. Finally, an efficient and simple method called Local-DME calculation has been developed specifically to study the protein dynamics of NMR ensembles.
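The core step of a build-up approach, placing one new atom from its distances to atoms whose coordinates are already known, can be illustrated with a minimal sketch. This is not the algorithm from the dissertation: it assumes a simplified 2D setting with three non-collinear, already-placed points and exact distances, and the function name `place_point` is hypothetical. Subtracting the squared-distance equations pairwise removes the quadratic terms and leaves a small linear system.

```python
def place_point(base, dists):
    """Locate a 2D point from exact distances to three known, non-collinear points.

    base  -- three (x, y) tuples with known coordinates (assumed already placed)
    dists -- distances from the unknown point to each base point, in the same order
    """
    (x1, y1), (x2, y2), (x3, y3) = base
    d1, d2, d3 = dists
    # Subtracting |p - b1|^2 = d1^2 from the other two distance equations
    # cancels the quadratic terms in p, leaving a 2x2 linear system A p = b.
    a11, a12 = 2 * (x2 - x1), 2 * (y2 - y1)
    a21, a22 = 2 * (x3 - x1), 2 * (y3 - y1)
    b1 = d1**2 - d2**2 + x2**2 + y2**2 - x1**2 - y1**2
    b2 = d1**2 - d3**2 + x3**2 + y3**2 - x1**2 - y1**2
    det = a11 * a22 - a12 * a21
    if abs(det) < 1e-12:
        raise ValueError("base points are collinear; the position is not unique")
    # Solve the 2x2 system with Cramer's rule.
    x = (b1 * a22 - a12 * b2) / det
    y = (a11 * b2 - a21 * b1) / det
    return x, y
```

In three dimensions the same idea needs four non-coplanar base atoms and a 3x3 system. With sparse or noisy distance data, which the abstract identifies as the practical difficulty, an exact solve like this is not sufficient on its own.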
References:
Wu, D., and Wu, Z. An Updated Geometric Build-Up Algorithm for Solving the Molecular Distance Geometry Problem with Sparse Distance Data. Journal of Global Optimization, 2006 (accepted).
Wu, D., Cui, F., Jernigan, R., and Wu, Z., PIDD: Database for Protein Inter-atomic Distance Distributions, submitted to NAR, 2006.
Wu, D., Jernigan, R., and Wu, Z., Refinement of NMR-Determined Protein Structures with Database Derived Mean Force Potentials, to be submitted, 2006.
Graduated: Fall, 2004 with a PhD in BCB
Current Position: Postdoctoral Research Associate, Virtual Reality Application Center, with Dr. Zhijun Wu, Iowa State University, Ames, IA
Major and Co-Major Professors: Dr. Xun Gu and Dr. Zhijun Wu
Title: Comparative genomics: Multiple genome rearrangement and efficient algorithm development
Abstract: Multiple genome rearrangement by signed reversal is discussed: For a collection of genomes represented by signed permutations, reconstruct their evolutionary history by using signed reversals, i.e., find a tree where the given genomes are assigned to leaf nodes and ancestral genomes (i.e. signed permutations) are hypothesized at internal nodes such that the total reversal distance summed over all edges of the tree is minimized. It is equivalent to finding an optimal Steiner tree that connects the given genomes by signed reversal paths. The key for the problem is to reconstruct all optimal ancestral genomes or Steiner nodes.
The problem is NP-hard and in practice can be solved only by efficient approximation algorithms. Various algorithms and programs have been designed to solve the problem, such as BPAnalysis, GRAPPA, the grid search algorithm, and the MGR greedy split algorithm (Chapter 1). However, they may have expensive computational costs or low inference accuracy. In this thesis, several new algorithms are developed, including the nearest path search algorithm (Chapter 2), the neighbor-perturbing algorithm (Chapter 3), the branch and bound algorithm (Chapter 3), the perturbing-improving algorithm (Chapter 4), and the partitioning algorithm (Chapter 5). With theoretical proofs, computer simulations, and biological applications, these algorithms are shown to be 2-approximation algorithms and more efficient than the existing algorithms.
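The signed reversal operation and the reversal distance that the tree score sums over its edges can be illustrated with a minimal sketch. It is not one of the thesis's algorithms: the function names are hypothetical, and the distance is found by exhaustive breadth-first search, which is only feasible for very small permutations.

```python
from collections import deque


def signed_reversal(perm, i, j):
    """Reverse the segment perm[i..j] (inclusive) and flip the sign of each gene in it."""
    segment = tuple(-g for g in reversed(perm[i:j + 1]))
    return tuple(perm[:i]) + segment + tuple(perm[j + 1:])


def reversal_distance(source, target):
    """Minimum number of signed reversals turning source into target.

    Brute-force breadth-first search over all reachable signed permutations;
    the state space grows as 2^n * n!, so this is for toy inputs only.
    """
    source, target = tuple(source), tuple(target)
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        perm, dist = queue.popleft()
        if perm == target:
            return dist
        n = len(perm)
        for i in range(n):
            for j in range(i, n):
                nxt = signed_reversal(perm, i, j)
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, dist + 1))
    raise ValueError("target is not a signed rearrangement of source")
```

For example, `signed_reversal((1, 2, 3, 4), 1, 2)` gives `(1, -3, -2, 4)`. Scoring a candidate tree then amounts to summing `reversal_distance` over its edges, and the hard part addressed in the abstract is choosing the ancestral permutations at the internal nodes.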
Graduated: Summer 2008 with an M.S. in BCB
Current Position: Postdoctoral Fellow, Biochemistry Department, St. Jude Hospital, Memphis, TN
Major and Co-Major Professors: Dr. Parag Chitnis and Dr. Suresh Kothari
Title: Is there a code for transcription factor-DNA recognition?
Abstract: Whole genome sequences are available from a wide variety of species, including 599 viruses and viroids, 205 naturally occurring plasmids, 185 organelles, 31 eubacteria, seven archaea, one fungus, two animals and one plant. About 2000-3000 of the approximately 30,000-40,000 genes in the human genome encode transcription factors, which play a central role in controlling cell development, cell growth and differentiation. Abnormal activity of transcription factors often leads to disease. Elucidating the transcriptional regulatory network will be the next challenge of the post-genomic era. Gene regulation begins with the selective binding of transcription factors to a particular DNA site out of a vast number of potential sites in the genome. It is unclear how transcription factors specifically recognize the correct sites out of hundreds or thousands of potential sites in the genome. We investigated DNA recognition sites functionally mapped by biochemical and biophysical approaches, as well as transcription factor-DNA complexes solved by X-ray or NMR from the Protein Data Bank. The purpose of this study is to find whether there is a simple code for transcription factor-DNA recognition. Our analyses show that (i) the length of DNA recognition sequences is typically 4-10 bases; (ii) there is no GC or AT preference in the sequences we studied; (iii) the positively charged amino acids Arg and Lys make the majority of contacts with bases and phosphates; (iv) some favored interaction pairs, Arg-G, Lys-G and Glu-C, are observed. However, no simple code for transcription factor-DNA recognition is obtained from our study. A relational database for storing and retrieving the collected data was built as an example to demonstrate the importance of databases in computational biology.
Graduated: Summer, 2008 with a PhD in BCB
Current Position: Postdoctoral Associate, Dr. Jack Dekkers, Department of Animal Science, Iowa State University
Major and Co-Major Professors: Dr. Robert Jernigan and Dr. Zhijun Wu
Title: Analysis on protein structures using statistical and computational methods
Abstract: The orientation of side chains relative to the radial vector from the center of the protein to an amino acid is studied. We find that the average angles for different residue types are highly correlated with their hydrophobicities, and the average side chain orientations in different parts of structures exhibit characteristically different features. The application of our findings on side chain orientation to protein tertiary structure prediction has also been considered. Several statistical machine learning methods are used to check the predictability of side chain orientation.
One method to validate the computed motions generated from elastic network models (ENM) is to compare them with the principal components (PCs) of multiple structures. The multiple structures of the same protein are superimposed first, and the correspondence between the experimental conformational changes represented by the PCs and the normal modes from the ENM is calculated. Here we use two superposition methods (least-squares fitting and maximum likelihood), and find that the extent of the correspondence between the two conformational spaces depends on the superposition method.
The effects on motion of removing some protein subunits from partial 30S ribosome structures are studied. Our results show that some of the larger changes caused by removing a single protein subunit can be restored by the removal of another subunit, which indicates their interdependence. We further find that the subunits showing such interdependence have strongly positive motion correlations and interact with each other, which is consistent with previous computational studies and experimental results.
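The abstract does not say how the correspondence between a principal component and a normal mode was measured. One commonly used measure is the overlap, the absolute cosine of the angle between the two displacement vectors, sketched below. The choice of measure and the function name `overlap` are assumptions made for illustration, not details taken from the dissertation.

```python
import math


def overlap(pc, mode):
    """Absolute cosine similarity between two displacement vectors of equal length.

    Returns a value between 0 (orthogonal directions) and 1 (same or opposite
    direction); the sign is dropped because a mode and its negative describe
    the same motion.
    """
    if len(pc) != len(mode):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(pc, mode))
    norm = math.sqrt(sum(a * a for a in pc)) * math.sqrt(sum(b * b for b in mode))
    if norm == 0:
        raise ValueError("zero-length vector has no direction")
    return abs(dot) / norm
```

Because both vectors are expressed in the coordinate frame of the superimposed structures, a different superposition changes the PCs and therefore the overlap values. That dependence on the superposition method is the effect the abstract reports.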
Graduated: Summer, 2008 with a PhD in BCB
Current Position:
Major and Co-Major Professors: Dr. Robert Jernigan and Dr. Zhijun Wu
Title: Understanding protein motions by computational modeling and statistical approaches
Abstract: Because of its appealing simplicity, the elastic network model (ENM) has been widely accepted and applied to study many molecular motion problems, such as the molecular mechanisms of chaperonin GroEL-GroES function, allosteric changes in hemoglobin, ribosome motions, motor-protein motions, and conformational changes in general. In this dissertation, the ENM is employed to study various protein dynamics problems, and its validity is also examined by comparing with experimental data.
First, we apply principal component analysis (PCA) to identify the essential protein motions from multiple structures (X-ray, NMR and MD) of the HIV-1 protease. We find significant similarities between the first few of these key motions and the first few low-frequency normal modes from the ENM, suggesting that the ENM provides a coarse-grained and structurally-based explanation for the experimentally observed conformational changes.
Second, we extend these approaches from a single protein (HIV-1 protease) to thousands of proteins whose multiple NMR structures are available. We also find close correspondence between the experimentally observed dynamics and the ENM predicted ones, indicating the validity of using the ENM to computationally predict protein dynamics.
Third, we develop a regression model for the isotropic B-factor predictions by combining the protein rigid body motions with the ENM. The new model shows significant improvements in B-factor predictions. Fourth, we further examine the validity of using the ENM to study protein motions. We use the anisotropic form of ENM to predict the anisotropic temperature factors of proteins. It presents a timely and important evaluation of the model, shows the extent of its accuracy in reproducing experimental anisotropic temperature factors, and suggests ways to improve the model.
Finally, we apply the ENM to study a dataset of 170 protein pairs having "open" and "closed" structures, and try to address how well a conformational change can be predicted by the ENM and how to improve the model. The results indicate that the applicability of ENM for explaining conformational changes is not limited by either the size of the studied protein or even the scale of the conformational change. Instead, it depends strongly on how collective the transition is.
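The abstract does not say how "collective" is measured. As an illustration only, the sketch below uses one common entropy-based definition of the degree of collectivity, which is an assumption and not necessarily the measure used in the dissertation. It assumes the open and closed structures are already superimposed, so each residue has a displacement vector. The function name `collectivity` is hypothetical.

```python
import math

def collectivity(displacements):
    """Degree of collectivity of per-residue displacements, between 1/N and 1.

    `displacements` is a list of (dx, dy, dz) tuples, one per residue,
    taken between two superimposed structures (assumed input).
    """
    squared = [dx * dx + dy * dy + dz * dz for dx, dy, dz in displacements]
    total = sum(squared)
    if total == 0.0:
        raise ValueError("no displacement between the two structures")
    # Normalize squared displacements so they sum to 1.
    weights = [s / total for s in squared]
    # Shannon entropy of the weights; zero-weight residues contribute nothing.
    entropy = -sum(w * math.log(w) for w in weights if w > 0.0)
    return math.exp(entropy) / len(weights)
```

Under this definition, a value near 1 means that most residues move by a similar amount, and a value near 1/N means that the change is confined to a single residue.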
Graduated: Fall, 2005
Current Position: Assistant Professor, Computer Science Department, Utah State University, Logan, UT
Major and Co-Major Professors: Dr. Vasant Honavar and Dr. Drena Dobbs
Title: Identification of interface residues involved in protein-protein and protein-DNA interactions from sequence using machine learning approaches
Abstract: In this study, we develop machine-learning methods to identify amino acid residues involved in protein-protein interactions and protein-DNA interactions. We focus on methods that use sequence information alone and build classifiers that label each residue as interface or non-interface based on local sequence information. To support both the development of machine-learning algorithms for identifying interface residues and the search for characteristics that distinguish interfaces from the rest of the protein, we also develop a database of protein-protein interfaces and systematically analyze the characteristics of the interfaces.
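One common way to turn "local sequence information" into classifier input is a fixed-length window centered on each residue. The sketch below illustrates that idea only. The window size, the padding character, and the function name `sequence_windows` are assumptions, not details taken from the thesis.

```python
def sequence_windows(sequence, half_width=4, pad="-"):
    """Return one fixed-length window per residue, centered on that residue.

    Positions that fall off either end of the sequence are filled with `pad`.
    """
    padded = pad * half_width + sequence + pad * half_width
    size = 2 * half_width + 1
    return [padded[i:i + size] for i in range(len(sequence))]
```

For example, `sequence_windows("ACDEF", half_width=1)` yields one three-letter window per residue. Each window would then be encoded as a feature vector and labeled as interface or non-interface for training.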
Graduated: Summer, 2006
Current Position: Senior Scientist, Genome Sequencing Center, School of Medicine, Washington University, St. Louis, MO
Major and Co-Major Professors: Dr. Xiaoqiu Huang and Dr. Dan Voytas
Title: Sequence comparison methods, statistics, and applications
Abstract: With more genomes being sequenced, understanding biological signals encoded in a genome has become a key challenge in modern biology. Cross-species comparison is a powerful approach in revealing those functional elements. In this thesis we first address some basic issues in sequence comparison, including optimization of sequence alignment parameters and statistical significance assessment of similarity scores. We present a method for assessing the effects of parameters on the sensitivity and specificity of an alignment algorithm on real coding DNA sequences. We then describe a computational and statistical method for assessing the statistical significance of the best alignment between two protein sequences. Multiple alignment of genomic sequences is a powerful approach for genome data analysis and annotation. We develop a sensitive multiple alignment program named MAP2 based on the generalized pairwise global alignment algorithm evaluated and tested above for handling long, different intergenic and intragenic regions in genomic sequences. We propose two similarity measures for evaluation of the performance of MAP2 and existing multiple alignment programs. We also present experimental results by MAP2 on six simulated data sets to show its strength in detecting the boundaries between similar and different regions. Finally, we apply different alignment algorithms to various sequence data, including genomic sequences, EST sequences, and cDNA sequences in the grass family, to explore gene conservation among the grass family and examine the usage of the rice genome as a reference to study other grass genomes.
References:
Ye L., Wang J., Huang X. Selection of effective parameter values for alignment of DNA sequences. (Submitted to BMC Bioinformatics)
Ye L., Huang X. (2005) MAP2: multiple alignment of syntenic genomic sequences. Nucleic Acids Research, 33(1):162-170
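To make the role of alignment parameters concrete, here is a minimal sketch of a textbook global alignment score (Needleman-Wunsch with a linear gap penalty). It is not the generalized algorithm behind MAP2. The scoring values and the function name `global_alignment_score` are assumptions chosen for illustration.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score of strings a and b (linear gap penalty)."""
    # prev[j] holds the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Best of: align the two characters, gap in b, gap in a.
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]
```

Rerunning the same pair of sequences with different `match`, `mismatch`, and `gap` values shows how parameter choices change the reported score, which is the kind of effect the parameter study in the abstract addresses.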
Graduated: Summer 2002 with an M.S. in BCB
Current Position: Bioinformatics Research Scientist/PhD Student, Chemistry Department, University of New Hampshire, Durham, NH
Major and Co-Major Professors: Dr. Eve Wurtele and Dr. Julie Dickerson
Title: MetNet DB: A Comprehensive Metabolic and Regulatory Network Database
Abstract: One of the major challenges in the post-genome era is to determine the cellular functions of genes and their products and to understand how the interactions among these entities in cellular contexts yield a living cell. To attack this problem, the Gene Expression Toolkit (GET) project was launched at Iowa State University. This thesis describes a general data model for representing metabolic and regulatory biological networks. The model is implemented in a relational database, MetNet DB, which serves as an information hub in the GET software package. The thesis also presents one practical application of MetNet DB: the Probe database, which is based on information derived from MetNet DB. The Probe database provides integrated functional annotations for Arabidopsis microarray probes. Currently, two large datasets are supported: the Affymetrix Arabidopsis GeneChip and the AFGC EST microarray. The Probe database can be seamlessly integrated with other microarray data analysis tools, such as GeneSpring, providing efficient annotation for mining Arabidopsis RNA profiling data.
Graduated: Spring, 2007 with a PhD co-major in Stat and BCB
Current Position: Research Statistician, Abbott Laboratory, Chicago, IL
Major and Co-Major Professors: Dr. Alicia Carriquiry and Dr. Jack Dekkers
Title: The design and analysis of microarray experiments using pooled samples for the study of quantitative traits
Abstract: Microarrays can simultaneously measure the mRNA expression levels of thousands of genes. In such experiments, mRNA samples are sometimes pooled across individuals to reduce cost or to increase mRNA volume in the sample. Our main objective is to investigate the effect of pooling mRNA on different types of inferences drawn from three important types of genomic experimentation. First, we investigate the effect of pooling mRNA on the power with which we can identify differentially expressed genes. We propose a statistical model for gene expression in a pool that mimics the process of mRNA pooling and develop the appropriate F statistics to test for differentially expressed genes. We show our power estimation is more conservative and less biased. Second, we investigate the effect of two different mRNA pooling strategies on the estimate of the correlation between phenotype and gene expression. We propose a maximum likelihood method to estimate the correlation between phenotype and expression. The MLE outperforms the standard Pearson correlation estimate in terms of bias and precision when individuals are stratified by phenotype prior to pooling. Finally, we evaluate the efficiency of a recently proposed QTL mapping approach which combines the idea of mRNA pooling with expression QTL transcriptome mapping. We argue that by pooling mRNA we can reduce the number of microarrays required by 2-fold or more and directly target the generation of expression data that is relevant to the phenotypic traits of interest. The reduction in cost can be achieved with negligible loss in power when QTL mapping is done via the standard regression approach. However, when mapping is carried out via composite interval mapping which takes into account linkage disequilibrium effects, the loss in power can be significant.
Graduated: Fall, 2002 with an M.S. in BCB
Current Position: System Engineer, Meredith Corporation, Des Moines, IA
Major and Co-Major Professors: Dr. Vasant Honavar and Dr. Xun Gu
Title: Gene Expression Analysis
Abstract: Microarray technology provides an approach to measure the expression levels of a large number of genes simultaneously and an insight into the transcriptional state of the cell. It can be used for searching for co-expressed genes under certain conditions. As such, it has become a powerful tool in genetic network research and functional genomics. Meanwhile, the technology produces large amounts of data and the data interpretation becomes a major bottleneck.
In this study, public yeast gene expression data are analyzed by Principal Components Analysis (PCA), Hierarchical Clustering, Self-Organizing Maps (SOM) and Adaptive Resonance Theory 2 (ART-2). The four statistical methods are also applied to maize chloroplast protein expression data from the greening process. PCA can reduce the dimensionality of the data set. The first few components contain most of the variance in the data and represent meaningful expression patterns. ART-2 is a neural network method, which is applied to gene expression analysis for the first time in our study. It provides very good clustering quality. Unlike Hierarchical Clustering, ART-2 is not limited by a rigid tree structure, and unlike SOM, it does not require the number of clusters to be fixed in advance. ART-2 can deal with noise in the data, is easy to implement, and gives results that are easy to interpret. The algorithm is also fast and scalable.
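The statement that the first few components carry most of the variance can be illustrated with a dependency-free sketch of PCA. It finds only the first principal component, using power iteration on the sample covariance matrix. This is a generic illustration, not the analysis pipeline used in the thesis. The function name and the seeded random start vector are assumptions, and the input is assumed to have at least two rows and non-zero variance.

```python
import math
import random

def first_principal_component(rows, iterations=200, seed=0):
    """Return (unit direction, fraction of variance) of the first principal component.

    `rows` is a list of equal-length lists: one row per gene, one column per condition.
    """
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    centered = [[r[j] - means[j] for j in range(m)] for r in rows]
    # Sample covariance matrix between conditions (m x m).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(m)]
           for i in range(m)]
    # Power iteration from a seeded random start converges to the eigenvector
    # with the largest eigenvalue.
    rng = random.Random(seed)
    v = [rng.random() for _ in range(m)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(m)) for i in range(m))
    total_variance = sum(cov[i][i] for i in range(m))
    return v, eigenvalue / total_variance
```

A returned fraction close to 1 means that a single expression pattern accounts for nearly all of the variation across conditions.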
Graduated: Summer, 2004 with a PhD Co-major in Statistics and BCB
Current Position: Assistant Professor, Tsinghua University, Tsinghua, PR China
Major and Co-Major Professors: Dr. Kenneth Koehler and Dr. Xun Gu
Title: Application of computational tools to analyze evolution of equine infectious anemia virus
Abstract: My Ph.D. is mainly about applying statistical methods to the analyses of gene expression data, i.e. microarray data, putting the gene expression process into an evolution framework, and characterizing the expression evolution procedure. Such expression divergence analysis can deepen our understanding of the phenotypic evolution at the transcriptional level.
Molecular phylogeny currently plays a major role in analyzing genomic data, trying to understand the relationship between genes, chromosomes and species. However, for another major source of genomic information, large-scale gene expression analysis, little research has been done from an evolutionary point of view. In chapter 1, we reviewed a preliminary phylogenetic expression analysis developed by Gu (2000) that used a Brownian motion process to represent expression variation among duplicate genes in a gene family. The general Brownian-based model can be transformed and restricted to obtain several derived models or sub-models. Each sub-model can be applied to deal with specific biological questions, depending on the imposed restrictions.
Based on the E0 model described by Gu (2004), we develop a fast algorithm to predict expression profiles at the ancestral nodes (genes). By comparing ancestral expression profiles with progeny expression profiles, expression divergence, that is, the change in expression profile along the duplication lineage, can be revealed and quantified. Such expression divergence can be used as an indicator of functional divergence, showing whether gene activity is under selection pressure along that specific lineage and inferring the potential functional difference between progeny genes. Details about ancestral expression inference can be found in chapter II.
The phylogenetic expression analysis proposed by Gu (2004) is rather complicated, especially as it requires maximum likelihood estimation, which is sensitive to model assumptions. In chapter III, we transformed the key idea of the Brownian-based E0 model into the form of an expression distance structure, and used a modified molecular phylogenetic approach to reconstruct an expression tree. Such an expression phylogeny has the same convenience and flexibility as molecular phylogeny in molecular evolutionary studies. However, we did notice some differences between expression phylogeny and molecular phylogeny, and such differences reveal the decoupling between expression profile evolution and sequence evolution.
In chapter IV, we used yeast expression data and motif data to study the relationship between expression divergence and motif divergence. Although it has long been believed that motif structure is the key factor in shaping expression profiles, our analysis reveals only a weak coupling between the two. Many studies have shown that transcription regulation is a very complicated, dynamic process. It involves interactions between DNA and proteins, such as motifs and transcription factors; interactions between proteins, such as transcription factors and their cofactors; the modification and degradation of proteins; the structure of the DNA sequence, such as the condensation of local chromatin; etc. Our results simply indicate that motif structure is only part of the story and that people should be very cautious when making assumptions about the relationship between motif structures and expression profiles.
The research summarized in this dissertation is still in its theoretical stage. My next and immediate task is to apply those ideas to the analysis of real data. In order to accomplish this goal, statistical modeling and analysis in chapter II and III will first be incorporated into a program package which allows convenient and fast analysis, especially when dealing with massive datasets at the genomic level. In this research report, we showed some examples. In the future, we will extend these analyses to the whole genome of some organism, in particular, all the yeast gene families, and conduct the expression divergence analysis at the genome level.
During the study of motif and expression relationships, we realized the importance of gene networks in all aspects of an organism's activities. In my next research project, I plan to combine information from all the available components of gene networks, such as gene expression, gene duplication, metabolic pathways, motif structure, null mutants, etc., and see if I can identify some relationships among those components.
My long-term research goal is to integrate genomic data resources with evolutionary concepts and further investigate relationships among sequence divergence, expression divergence and function divergence. A key interest is to understand how those divergence processes can be related to or shaped by the structure and development of the gene networks.
Graduated: Fall 2003 with an M.S. in BCB
Current Position: PhD Student in Statistics, Stanford University, CA
Major and Co-Major Professors: Dr. Karin Dorman and Dr. Susan Carpenter
Title: Branching process models for HIV-1 drug resistant mutants
Abstract: HIV drug therapy often fails because of the appearance of resistant viral mutants. Thus, knowledge of the abundance of resistant mutants prior to treatment is essential for optimizing drug therapy to avoid resurgence of resistant mutants. A simple multitype continuous-time branching process model is developed and investigated for the generation of resistant viral mutants during HIV-1 infection. The growth of mutant populations is characterized by their means, variances and distributions from the start of acute infection to the equilibrium state in the chronic stage. Expressions for the equilibrium frequencies of mutants are derived, and their dependence on mutation rates and mutant fitness is explored. The model suggests that mutants with three or more point mutations are unlikely to occur prior to treatment. A similar branching process model is also used to compute the number of resistant mutants that are generated de novo during treatment. The two possible causes of resistance-related treatment failure are then discriminated by characterizing the ratio of the number of resistant mutants produced de novo to the number of preexisting resistant mutants.
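As a rough illustration of how mutant frequency depends on mutation rate and fitness, the sketch below computes the expected counts in a simplified two-type continuous-time branching process during exponential growth. The model structure (one mutant type, no back mutation, no equilibrium phase), all parameter names, and the function name `mean_counts` are assumptions made for this example and are not the model from the thesis.

```python
import math

def mean_counts(t, n0, birth, death, mu, mutant_birth, mutant_death):
    """Expected (wild-type, mutant) counts at time t in a two-type branching process.

    Each wild-type virus replicates at rate `birth` and dies at rate `death`;
    each new copy is a mutant with probability `mu`. Mutants replicate and die
    at their own rates and never mutate back. Starts with n0 wild type, 0 mutants.
    """
    r_wild = birth * (1.0 - mu) - death   # net growth rate of the wild type
    r_mut = mutant_birth - mutant_death   # net growth rate of the mutants
    wild = n0 * math.exp(r_wild * t)
    if r_wild == r_mut:
        # Degenerate case of the linear mean equations (equal growth rates).
        mutant = birth * mu * n0 * t * math.exp(r_wild * t)
    else:
        mutant = (birth * mu * n0
                  * (math.exp(r_wild * t) - math.exp(r_mut * t))
                  / (r_wild - r_mut))
    return wild, mutant
```

In this simplified setting, when the wild type grows faster than the mutant, the ratio of expected mutants to expected wild type approaches `birth * mu / (r_wild - r_mut)`. The expected mutant frequency therefore rises with the mutation rate and falls as the mutant's fitness cost increases.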
Graduated: Fall, 2003 with an M.S. in BCB
Current Position: Assistant Professor, Department of Poultry Science, Texas A&M University, College Station, TX
Major and Co-Major Professors: Dr. Xun Gu and Dr. Susan Lamont
Title: Statistical analysis of functional divergence in gene families
Abstract: Gene duplication is thought to be a major source of functional innovation in a large number of gene families. The prediction of critical residues for functional divergence between homologous genes is important for functional genomics. The Toll-like receptor (TLR) gene family plays an important role in innate immunity and the adaptive immune response. All TLR protein sequences from vertebrate animals were collected to investigate functional divergence and evolutionary patterns between TLR gene clusters. Four independent clusters were identified. Functional divergence of the domains in the TLR family was characterized by a site-specific posterior profile analysis, and critical residues for altered selective constraints at amino acid sites after gene duplication were predicted. The extracellular domain of the TLR genes showed higher functional divergence than the cytoplasmic domain. Further analysis indicated that the region between leucine-rich repeats (LRR) 10 and 14 of TLR4 is a potential target for future functional genomics studies. For a study of a large set of gene families, we collected the cDNA sequences of all orthologous genes of human and mouse or rat in two-gene, three-gene, and four-gene clusters from all available gene families in the database. The nonsynonymous and synonymous substitution rates for all orthologs between human and mouse or rat were estimated, and the ratio of nonsynonymous to synonymous substitution rates was calculated. The nonsynonymous substitution rate was positively correlated with both the synonymous substitution rate and the ratio of nonsynonymous to synonymous substitution rates, which suggested that the nonsynonymous substitution rate is a major contributor to the ratio.
The significant differences in nonsynonymous substitution rate among most paralogous genes suggested that nonsynonymous substitution plays an important role in creating novel function following gene duplication.
Graduated: Spring 2003 with a PhD in BCB
Current Position: MedImmune, Gaithersburg Headquarters, One MedImmune Way, Gaithersburg, MD 20878; (301) 398-0000
Major and Co-Major Professors: Dr. Volker Brendel and Dr. Srinivas Aluru
Title: Spliced alignment and its application in Arabidopsis thaliana
Abstract: The goal of my project has been to develop and apply methods for gene identification in genome sequences based on evidence from expressed sequence tags (ESTs) or homologous protein sequences. For this purpose, we developed an efficient spliced alignment program, GeneSeqer (available at http://bioinformatics.iastate.edu/cgi-bin/gs.cgi), which is capable of aligning ESTs with a large genomic sequence. Another program, MyGV (available at http://bioinformatics.iastate.edu/bioinformatics2go/MyGV/), written in Java as a browser to visualize the output of GeneSeqer, has also been distributed recently. As a practical test and demonstration, GeneSeqer was applied to map 174,628 Arabidopsis EST sequences onto the whole genome of Arabidopsis thaliana (5 chromosomes, about 117 Mbp in total), and all results were parsed and imported into a MySQL database. Much useful information was inferred from the Arabidopsis spliced alignment results, which could serve as a valuable resource for a number of projects of special scientific interest, such as alternative splicing, non-canonical splice sites, mini-exons, etc. We developed an elaborate web interface that allows visual and interactive querying and browsing of EST spliced alignments and GenBank annotation, accessible at http://zmdb.iastate.edu/PlantGDB/AtGDB.html.