Prasith Baccam
Home Departments: Math and Immunobiology
Major Professor: Dr. Cornette
Title: Genetic Variation and evolution of equine infectious anemia virus rev quasispecies during long term persistent infection
Abstract: Genetic variation has been observed in many viruses. Viruses that carry their genetic information in the form of RNA exhibit high mutation rates because the viral polymerase lacks proof-reading mechanisms commonly found in DNA polymerase complexes. The combination of high mutation rates, small genome size, and high replication rates results in a population of closely related viral genotypes, which are commonly referred to as a quasispecies. A consequence of the genetic variation in viruses is possible variation in viral phenotype of the quasispecies population. Furthermore, changes in viral phenotype may be a biologically important factor in progression of disease. Here, we undertook a longitudinal study to describe the quasispecies nature and genetic variation in a lentivirus regulatory protein, Rev, during the course of disease in a pony experimentally infected with equine infectious anemia virus (EIAV). This study examined rev variants that comprised the quasispecies population in sequential sera samples. Over the course of disease, there was continual appearance of novel rev variants, with some variants growing in frequency to predominate at certain time points. Phylogenetic and cluster analyses suggested that the Rev quasispecies consisted of two distinct populations that co-existed during infection. These two quasispecies populations differed in their pattern of evolution, with one population accumulating changes in a linear, time-dependent manner, while the other population evolved radially from a common variant. Changes in the population size of the two Rev quasispecies coincided with changes in the clinical stages of disease.
Rev variants from each population were biologically tested, and significant differences in Rev activity were detected between the two populations. Together, these results suggested that the distinct Rev populations differed in selective advantage. Rev quasispecies activity was statistically correlated with the clinical stage of disease, differing significantly between stages. This study suggests that distinct quasispecies populations, which differed in pattern of evolution and niche advantage, co-existed during long term persistent infection by EIAV. A multi-population quasispecies model challenges our current thinking of viral populations and may have significant biological implications.

Kara Butterworth
Home Department: Botany
Major Professor: Dr. Jonathan Wendel
Co-Major Professor: Dr. Dean Adams
Title: Initiation and early development of fibers in wild and cultivated cotton
Abstract: Gossypium (Malvaceae) is a diverse genus best known for cultivated cotton. It includes about 50 species, 45 diploid and 5 allopolyploid, which occur in arid and semi-arid regions throughout the world (Vollesen, 1987; Fryxell, 1992). The diploids are divided into eight genome groups based on chromosome pairing and size, and fertility between species (Endrizzi, Turcotte, and Kohel, 1985). These groups comprise natural lineages within the genus and correspond to geographic locations: A, B, E, F: Africa and Arabia; C, G, K: Australia; and D: New World. Allopolyploid members are found in the New World and contain the A and D genomes (Wendel, 1995; Wendel et al., 1998; Brubaker, Bourland, and Wendel, 1999; Percival, Wendel, and Stewart, 1999; Cronn et al., 2002). This understanding of the evolutionary history of the genus allows many aspects of evolutionary differences in development and morphology to be studied in a phylogenetic context.

Feng Cui
Home Department: Mathematics
Major Professor: Dr. Zhijun Wu Title: Distance-based NMR Structure determination and refinement Abstract: X-ray crystallography and nuclear magnetic resonance (NMR) spectroscopy are two widely used experimental techniques for protein structure determination. In protein Data Bank (PDB), about 85% of deposited protein structures are determined by X-ray crystallography. The rest of the structures are determined by NMR spectroscopy. The main difference between these two approaches lies in the state of protein samples to which they are applied: for X-ray crystallography, a protein has to be in the crystalline state while in NMR, it may be in the solution state. Both approaches have their own pros and cons. For example, X-ray crystallography is a mature technique capable of providing more objective interpretation of data. This approach has various quality indicators such as resolution and R-factor to assess the structures. It can be applied to large molecules, e.g., virus particles, and produce a single model that is easy to visualize and interpret. Raw data processing is highly automatic. In contrast, NMR is a relatively new technique and provides more subjective interpretation of the data. It lacks established quality indicators of data and models. In addition, it is limited to determination of relatively small proteins (<20kDa) and produces an ensemble of possible structures rather than one model. Data sometimes have to be manually processed. On the other hand, a protein has to form stable crystals for X-ray analysis, which could be time-consuming and often impossible. The crystalline state is not a natural and physiological environment for the protein either. In addition, X-ray crystallography is less useful for large flexible modular proteins. In contrast, the solution state of a protein is closer to biological conditions and relatively easy to prepare. 
NMR can provide information on dynamics and identify individual side-chain motion, and is often used to monitor conformational change on ligand binding. Given these pros and cons, both approaches have undergone dramatic development during the past five years, especially NMR. Advances in data collection, spectral assignment and analysis, structure calculation, and computer graphics have removed the barriers between NMR spectral assignment, structure assessment, and visualization. Many quality indicators such as bond length, angle and NOE violations (inter-atomic distances that lie outside of NOE ranges) have been developed and used for quality assessment of NMR structures. Novel refinement schemes aimed at increasing the accuracy of the resulting structures have been proposed and tested. As a result, proteins of up to 30 kDa (about 260 residues) are now routinely accessible by NMR spectroscopy with increased resolution, equivalent to approximately 2.5-Å resolution crystal structures.
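As a rough illustration of the NOE-violation quality indicator mentioned above, the following Python sketch reports restraints whose inter-atomic distance falls outside the allowed range. The atom labels, coordinates, distance bounds, and the function name `noe_violations` are hypothetical choices made for this example; they are not taken from the thesis.

```python
import math

def noe_violations(coords, restraints, tolerance=0.0):
    """Return restraints whose distance lies outside [lower, upper].

    coords:     dict mapping an atom label to (x, y, z) in angstroms
    restraints: list of (atom_i, atom_j, lower, upper) distance bounds
    tolerance:  slack (angstroms) allowed before a violation is reported
    Each result is (atom_i, atom_j, distance, excess), where excess is
    negative below the lower bound and positive above the upper bound.
    """
    violations = []
    for atom_i, atom_j, lower, upper in restraints:
        d = math.dist(coords[atom_i], coords[atom_j])
        if d < lower - tolerance:
            violations.append((atom_i, atom_j, d, d - lower))
        elif d > upper + tolerance:
            violations.append((atom_i, atom_j, d, d - upper))
    return violations

# Hypothetical toy structure: three protons on simple coordinates.
coords = {
    "HA_10": (0.0, 0.0, 0.0),
    "HN_11": (3.0, 0.0, 0.0),
    "HB_42": (0.0, 6.0, 0.0),
}
# Hypothetical NOE-derived bounds (lower, upper) in angstroms.
restraints = [
    ("HA_10", "HN_11", 1.8, 3.5),  # distance 3.0: satisfied
    ("HA_10", "HB_42", 1.8, 5.0),  # distance 6.0: exceeds upper bound
]

violations = noe_violations(coords, restraints)
for atom_i, atom_j, d, excess in violations:
    print(f"{atom_i}-{atom_j}: d={d:.2f}, excess={excess:+.2f}")
```

For an NMR ensemble, the same check could be run on every model and the violation counts summarized per model.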
Lixia Diao
Home Department: Computer Science
Major Professor: Dr. David Fernandez-Baca
Title: Consensus properties of supertree construction methods
Abstract: The combination of a set of rooted perfect phylogenetic trees on overlapping leaf sets into one supertree is important and fundamental for evolutionary biology. In this thesis, we will present three supertree techniques – MRP, MRF, MinCutSupertree – and compare the consensus properties of MRP and MRF with some consensus tree criteria.
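To make the first step of MRP (matrix representation with parsimony) concrete, the sketch below builds the taxon-by-cluster matrix using the standard Baum-Ragan coding: for each internal, non-root node of each source tree, taxa inside that node's cluster get `1`, other taxa in the same tree get `0`, and taxa missing from the tree get `?`. Trees are written as nested tuples of leaf names; that representation, the function names, and the toy trees are assumptions made for this example, and the parsimony search that would turn the matrix into a supertree is not shown.

```python
def leaves(tree):
    """Return the set of leaf names under a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))

def clusters(tree, is_root=True):
    """Yield the leaf set of every internal node except the root."""
    if isinstance(tree, str):
        return
    if not is_root:
        yield leaves(tree)
    for child in tree:
        yield from clusters(child, is_root=False)

def mrp_matrix(trees):
    """Encode source trees as a dict: taxon -> string over '0', '1', '?'."""
    taxa = sorted(set().union(*(leaves(t) for t in trees)))
    columns = []
    for t in trees:
        present = leaves(t)
        for cluster in clusters(t):
            columns.append({
                taxon: "1" if taxon in cluster
                else "0" if taxon in present
                else "?"
                for taxon in taxa
            })
    return {taxon: "".join(col[taxon] for col in columns) for taxon in taxa}

# Two hypothetical rooted source trees on overlapping leaf sets.
trees = [
    (("a", "b"), "c"),
    (("b", "c"), "d"),
]
matrix = mrp_matrix(trees)
for taxon, row in matrix.items():
    print(taxon, row)
```

Each column corresponds to one cluster of one source tree, so conflicting source trees show up as incompatible columns that the parsimony step must reconcile.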
Jing Ding
Home Department: Electrical and Computer Engineering
Major Professor: Dr. Dan Berleant
Title: BOW-Based vs. Concept-Based Text Clustering for Functional Analysis of Genes
Abstract: The rapid development in genomic technologies (e.g. microarray) has enabled biologists to simultaneously monitor expression of hundreds or even thousands of genes in a single experiment. Interpreting the biological meaning of the expression patterns still relies largely on biologists' domain knowledge, as well as information collected from the literature and/or various public databases. Individual experts' domain knowledge is insufficient for large datasets, and manually collecting and analyzing information from the literature and/or public databases is tedious and time-consuming. Computer-aided functional analysis tools are highly desirable. We developed GeneNarrator, a text-mining system for functional analysis of microarray data. Given a list of genes, GeneNarrator collects functional information (MEDLINE citations) from PubMed, and clusters the citations into functional topics. The genes are then mapped to the topics and clustered into groups based on their similarities in topic distribution.

Pan Du
Home Department: Electrical and Computer Engineering
Major Professor: Dr. Julie Dickerson
Title: Multi-scale Genetic Network Inference based on Time Series Gene Expression Profiles
Abstract: This work integrates multi-scale clustering and short-time correlation to estimate genetic regulatory networks with different time resolutions and detail levels. Gene expression data are noisy and large scale. Clustering is widely used to group genes with similar patterns. The cluster centers can be used to infer the genetic networks among these clusters.
This work introduces the Multi-scale Fuzzy K-means clustering algorithm to uncover groups of coregulated genes and capture the networks in different levels of detail. Time series expression profiles provide dynamic information for inferring gene regulatory relationships. Large scale network inference, identifying the transient interactions and feedback loops, and differentiating direct and indirect interactions are among the major challenges of genetic network inference. Pairwise time correlation can detect linear interactions between genes. Estimates of the time delay and direction of causality in the inferred network can also be made. Partial correlation and d-separation theory are combined to differentiate the direct and indirect interactions and identify feedback loops. Gene expression regulation can happen in specific time periods and conditions instead of across the whole expression profile. Short-time correlation can capture transient interactions. The network discovery algorithm was validated using yeast cell cycle data. The algorithm successfully identified the yeast cell cycle development stages, cell cycle and negative feedback loops, and indicated how the network changes dynamically over time. The inferred network reflects most interactions previously identified by genome-wide location analysis and matches extant literature results. The inferred network provides more detailed information about genes (or clusters) and the interactions among them. Interesting genes, clusters and interactions were identified, which match the literature and the gene ontology information and provide hypotheses for further studies.

Tyra Dunn
Home Department: Genetics, Development & Cell Biology
Major Professor: Dr. Xun Gu
Title: Genomic differences between humans and primates
Abstract: Scientists around the world have wondered for many years what distinguishes speciation. Of particular interest is the genetic basis for human/primate (chimpanzee or gorilla) separation.
Humans and chimpanzees are 99% identical in their genomic DNA sequence, thus making them very closely related. Despite this high degree of sequence similarity, humans and primates have a number of striking phenotypic differences. We hypothesize that sequence changes that have occurred between humans and primates have altered developmental programs. Because transcription factors alter the expression of numerous genes, we also hypothesize that changes in the expression or activity of transcription factors are responsible for the different phenotypic traits among humans and primates. Using human chromosome 22 as a model for comparison between human and primate DNA, a random selection of noncoding genes approximately 1-2 kilobases (kb) long upstream was sequenced. Focusing on promoter regions in the sequence data, we detected significant differences between humans and gorillas (p < 0.01) and between gorillas and chimpanzees (p < 0.01), suggesting that limited similarities existed between the species. Between humans and chimpanzees (p > 0.1), no significant difference was detected. Using this information, transcription factors were analyzed between the human and chimpanzee data to determine if transcription regulation was different between the species. The results indicated no significant difference between humans and chimpanzees at the single-nucleotide level even though the species differ at the genetic and phenotypic levels. The results also indicated that changes in transcription regulation have played a major role in determining speciation. This research opens new avenues in investigating how many of the differences have functional consequences and the relative contributions of these transcription factors to expression differences.

Tyra Dunn
Home Department: Genetics, Development & Cell Biology
Major Professor: Dr. 
Heather Greenlee
Thesis Presentation: June 12, 2007
Title: Characterizing and Influencing Differentiation Of Retinal Progenitor Cells

Scott Emrich
Home Department: Electrical and Computer Engineering
Major Professor: Dr. Srinivas Aluru
Title: Assembly and Analysis of Complex Plant Genomes
Presentation: June 8, 2007
Abstract: Concurrent advances in high-throughput sequencing and assembly have led to the completion of many complex genomes. Even so, these assemblies require substantial computational resources. In this dissertation, we present a massively parallel approach that scales to thousands of processors without duplicating the biological expertise present in conventional assembly software. Additional bioinformatics techniques, including novel repeat detection, were required to accurately assemble the maize genome, and the resulting framework has been strongly supported by maize experimental data. More recently, this framework has been generalized for fruit fly, sorghum, soybean and environmental sequence assemblies. Questions in plant genome analysis were also addressed. For example, we have discovered an estimated 350 “orphan” maize genes and have shown that approximately 1% of all maize genes were recently duplicated, many into at least two functional copies. LCM-454 sequencing is introduced, and analyses indicating that this approach can discover rare, potentially tissue-specific transcripts and thousands of SNPs will be presented. This dissertation combines high performance computing, computational biology and high-throughput sequencing for our ongoing work on the maize genome project. We conclude by describing how these contributions can be useful for any species, including non-model organisms that are unlikely to be fully sequenced.

Joset Etzel
Home Department: Electrical and Computer Engineering
Major Professor: Dr. 
Julie Dickerson
Title: Algorithms and Procedures to Analyze Physiological Signals in Psychophysiological Research
Abstract: This dissertation presents analytical techniques which allow more information to be derived from psychophysiological data than otherwise possible. The techniques include an implemented algorithm for chest strain-gauge respiration signal analysis and a permutation testing method for evaluating changes over time in physiological signals. These methods are applied to three data sets, each examining physiological correlates of emotional experience. In the first study, physiological correlates of moods induced using music were identified, although respiration entrainment confounds the issue of whether mood or the music caused the observed patterns. The second study examined physiological responses while subjects watched an emotional movie under three conditions; changes relating both to the movie scenes and condition were identified. Finally, the third study evaluates short term changes in heart rate while viewing words in terms of the type of word viewed and later word recall.

Fang Fang
Home Department: Statistics
Major Professor: Dr. Karin Dorman
Title: Virus Recombination: Modeling and Data Analysis
Abstract: As a key evolutionary process, recombination shapes the genetic structure of virus populations. The dramatic increase in full-length virus sequences provides a chance to study virus recombination through molecular data. Many statistical methods have been developed, and many of them are phylogenetic-based. My research focuses on recombination modeling and data analysis. I first apply an existing phylogenetic-based method, the Bayesian dual change-point model (DMCP), to investigate the role of representative data types for recombination study. We conclude that consensus data is overall the best data type to represent virus genotypes.
Using consensus data, we studied recombination on all full-length hepatitis B virus (HBV) sequences, and set up a system for using the DMCP model for large scale sequence analysis. We discovered that HBV has an extremely high recombination rate. For the first time, we reported circulating recombinant forms of hepatitis B virus, and identified one potential recombination hotspot. One important goal of studying recombination is to find potential recombination hotspots and to reveal the molecular mechanism of recombination. This goal requires identification of all recombinants generated by different recombination events, which is not trivial when recombinant sequences have similar mosaic structures. Extending the DMCP model, I developed a method to identify the number of recombination events producing multiple recombinants. I apply this method to several HBV recombinants that have identical mosaic structure and find at least two recombination events.

Jianmin Feng
Home Department: Genetics, Development & Cell Biology
Major Professor: Dr. Volker Brendel
Title: A new approach for discovering protein motifs
Abstract: Motif recognition is a powerful homology based sequence analysis tool for clustering new protein sequences into different families based on characteristic motifs. Compared to BLAST, these approaches typically have lower false positive rates and can reveal more remotely related family members. However, the current motif databases do not cover all the sequences in protein sequence databases. One of the major reasons for the low coverage of motif databases is that there is only a small set of known member sequences available for constructing protein motifs for many gene families. I have designed a new algorithm, “mFISHER”, to detect protein motifs from only 2-5 known member sequences by artificial evolution of given sequences based on a position specific PAM evolution model.
Based on my test results on 160 motif families, the overall average recall rate or sensitivity (true positives/(true positives + false negatives)) and specificity (true positives/(true positives + false positives)) are 88% and 95%, respectively. Compared with MEME (Multiple EM for Motif Extraction), mFISHER achieves a better recall rate, especially when only 2 or 3 members are available. Both approaches have similar sensitivity. mFISHER is promising for constructing protein motifs when only a few known members are available.

Xiang Gao
Home Department: Genetics, Development & Cell Biology
Major Professor: Dr. Daniel Voytas
Title: Studying the replication mechanism of the yeast retrotransposon Ty5 by molecular and computational approaches
Abstract: The yeast retrotransposon Ty5 is a Ty1/copia element. Officially, it is in the Hemivirus genus of the Pseudoviridae family. The ability to genetically manipulate retrotransposons and the yeast host cell was exploited to explore replication mechanisms unique to Ty5 and common to most retrotransposons. Because of the abundance and diversity of retroelement sequences, along with the fact that many retroelement enzymes have evolved unique functional specificities, computational approaches were also developed to study functional divergence in replication. By screening a randomly mutagenized Ty5 library, two mutations (Y68C, D252N) that caused higher transposition frequencies were identified. Both mutations increased Ty5 cDNA levels, but did not have dramatic effects on the steps after cDNA synthesis (i.e. integration and recombination), or protein synthesis, processing, or solubility. The D252N mutation increased the hydrogen bonding potential of the CCHC zinc finger of nucleocapsid protein (NCp), making the Ty5 NCp zinc finger more like Ty1/copia consensus zinc fingers in terms of hydrogen bonding potential.
Other mutations that increased the hydrogen bonding potential (D252R, D252K) provided the same fold increase in Ty5 reverse transcription, and naturally occurring mutations in the Ty5 zinc finger repress this function. Hydrogen bonding is suggested to be a universal requirement for the function of retroviral type zinc fingers and cellular zinc fingers. A half-tRNA priming mechanism for Ty5 reverse transcription was also demonstrated. Mutations in the anticodon of tRNA (IMT) and the putative PBS of Ty5 decreased transposition, but transposition was restored when complementarity between the IMT and PBS was restored. A tree-based method and supplemental Split Tester software were developed to study the functional divergence of reverse transcriptase (RT) with respect to half-tRNA and full-tRNA priming mechanisms. The domains identified by this computational approach were previously experimentally demonstrated to bind with the tRNA primer/template in HIV RT. Using this software, another domain related to integrase functional specificity, namely whether or not integrase carries out 3’-end processing during integration, was also consistently identified in different integrase datasets. A model describing this functional divergence is proposed.

Zhong Gao
Home Department: Computer Science
Major Professor: Dr. Vasant Honavar
Title: Genome wide recognition of Tumor Necrosis Factor (TNF) related ligands in human and Arabidopsis genomes: A structural genomics approach
Abstract: Tumor necrosis factors (TNFs) play a crucial role in mammalian signal transduction pathways for cell proliferation, survival, and differentiation. Genome sequencing projects for human and other species (such as Arabidopsis) provide a unique opportunity for genome-wide recognition of TNF related ligand proteins and discovery of potential TNF-TNFR signal transduction mechanisms in plants.
Genome-wide recognition of TNF related proteins in human and Arabidopsis was carried out using secondary structure prediction and protein fold recognition. In the protein fold recognition scheme, sequence-structure models are evaluated using contact energy score based on Miyazawa-Jernigan and Li-Tan-Wingreen models. Secondary structure composition based initial screening not only reduces search space of protein fold recognition but also shifts the score distribution of the selected candidates to a higher score region. In order to investigate influence of sequence length on threading results, protein fold recognition was conducted on human and Arabidopsis genome sequences of different length. The test on known TNFs from diverse species indicates that about 83% of TNFs are able to be identified; the test on human genome sequences shows that about 80% of known TNFs can be recognized. Integration of secondary structure profiling into the scheme can improve performance by adjusting local sequence-structure relationship. However, this improvement largely depends on accuracy of secondary structure prediction. Average scoring performs better than maximal scoring in model evaluation and selection. Pattern classification algorithms such as decision tree, neural network, Naïve Bayes classifier, and support vector machine are applied to discriminate TNF related proteins from the competitive false positives which have similar secondary structure composition to known TNFs and also have high fold recognition scores. Both known TNF and false positive sequences are represented with the twenty q values corresponding to twenty amino acids in Li-Tan-Wingreen model. Cross-validation results show that Naïve Bayes classifier performs better than SVM, neural network, and decision tree, and Naïve Bayes classifier is suitable for stringent control of false positive. This genome-wide search scheme was used to search potential TNF-like signal proteins in Arabidopsis genome. 
The possible role of candidates in human and Arabidopsis genomes is discussed. These results demonstrate that structure based methods can facilitate functional prediction on a genome scale.

Aspen Garry
Home Department: Ecology, Evolution, & Organismal Biology
Major Professor: Dr. Dean Adams
Title: Geometric Morphometric analysis of shark teeth of the genus Rhizoprionodon: The modern, the ancient, and the hypothetical. Modern tooth shape analysis and test of ancestry prediction methods by comparison to fossil shapes
Abstract: Shark teeth are extremely common in the fossil record, and they can potentially provide insight into the evolutionary history of sharks. However, isolated fossil teeth are difficult to assign to the correct jaw, position, and taxon without organismal context because individual sharks exhibit a variety of tooth shapes. Tooth shape varies across jaws, positions within each jaw, and taxa. Fortunately, tooth shape is quantifiable, and shapes can be compared using the techniques of geometric morphometrics, which measure shape and its covariation with other variables. Analysis of modern tooth shapes was performed in order to gain understanding of patterns of modern tooth shape variation. These results could then be applied to fossils to provide better identification of fossils in order to make use of sharks’ extensive fossil record. To quantify modern patterns of tooth shape variation, teeth of five Rhizoprionodon species and representatives of three closely related genera (Loxodon, Eusphyra, and Sphyrna) were quantified and analyzed using geometric morphometric methods. Ancestral tooth shapes were estimated using the modern shape data mapped onto a phylogeny created using molecular data, and a Brownian motion model of evolution. These shapes were compared to fossil teeth from Rhizoprionodon sp. and Sphyrna spp. to evaluate the accuracy of the estimated ancestral shapes.
Modern teeth at the front of the jaw displayed the most dramatic shape differences between jaws and positions. Teeth from each genus could be distinguished, but species within Rhizoprionodon could not. Fossil tooth shapes most closely resembled those of modern teeth, indicating that tooth shape did not change according to the Brownian motion model used to predict ancestral shapes.

Jianying Gu
Home Department: Genetics, Development & Cell Biology
Major Professor: Dr. Xun Gu
Title: Functional divergence and genome evolution of vertebrate protein kinases
Abstract: The emerging complete and nearly complete genome sequences have provided a significant amount of material for large-scale comparative genomic analysis. Novel methods have been developed to elucidate the function of gene products and functional interaction networks. Many of these post-genomic attempts have focused on unveiling the evolutionary forces that have shaped the network organization. Among various evolutionary forces, duplication of a functional domain, individual gene, chromosomal segment, or entire genome has long been thought of as the primary source of functional novelty in a vast number of gene families. It is therefore intriguing to quantitatively trace the changes of evolutionary constraints after a duplication event. This study is focused on the exploitation of the functional divergence and evolutionary patterns in vertebrate kinase complements (denoted as kinomes) and kinase-regulated signal transduction pathways, using a combinatorial statistical and evolutionary approach. The analysis of an individual kinase gene family (Jak), the protein tyrosine kinase superfamily, and a kinase mediated signal transduction pathway (TGF-β) showed that functional divergence (altered functional constraint) after (domain or gene) duplication is a general pattern.
Moreover, the age distribution of the vertebrate kinomes showed that (1) the major kinase-related animal specific signal-transduction pathways have been generated through an ancient continuous domain shuffling (or duplications) during the time period from the early stage of eukaryotes to metazoan evolution; (2) vertebrate tissue-specificity of signal-transduction is facilitated by large-scale duplication event(s) in the early stage of vertebrates; and (3) the kinase pseudogenes were generated very recently through either segmental duplication or retrotransposition.

Home Department: Genetics, Development & Cell Biology
Major Professor: Dr. Patrick Schnable
Title: Adaption of Multiclustering to the Analysis of Microarray Data
Presentation Date: Thursday, May 10, 2007
Abstract: Clustering has become an integral part of microarray data analysis and interpretation. It helps reduce the scale of information generated by a microarray experiment to a level at which biologists can generate hypotheses. There is a danger that artifacts induced by clustering methods can cause misinterpretation of the data. A clustering method that can accurately capture the natural structure of the data would be a useful tool for biologists to discover the biological meaning buried in the data. To this end, a new clustering algorithm, called K-means multiclustering, is introduced. The method can avoid the artifacts induced by distance or similarity metrics by amalgamating the results of many K-means clusterings. Results: The multiclustering algorithm is a model-free clustering method. It is found to be reliable and consistent in capturing the underlying data structure, with high accuracy that is competitive with model based clustering and superior to other methods on synthetic microarray data generated in a manner consistent with the hypothesis of model based clustering. The algorithm has a high level of immunity to artifacts introduced by the metric used to measure the distance between data points.
It can successfully cluster data sets which are designed to have different shapes and variation and cannot be correctly clustered by traditional clustering methods. The cut plot computed by this method is a very simple and useful summary of the data structure. A detailed view of the formation of clustering can also be generated by the method to reveal the underlying hierarchical structure of the data set.

Home Department: Genetics, Development & Cell Biology
Major Professor: Dr. Daniel Voytas
Title: Characterization of the Sireviruses: A unique group of Ty1/copia LTR retrotransposons in plants
Abstract: Plant genomes have allowed the expansion of many types of mobile genetic elements. LTR retrotransposons are a subclass of mobile genetic elements that replicate using an RNA intermediate. The Pseudoviridae (Ty1/copia) are a family of LTR retrotransposons, and the Sireviruses are one of three genera in the Pseudoviridae. The Sireviruses have features that set them apart from classical retrotransposons. Different members of the Sireviruses show great variability in their genomic structures and the translational tricks they use to express their encoded proteins. For example, we have shown that the SIRE1 elements of soybean use stop codon suppression to express their Env-like protein. Secondly, some monocot members of the Sireviruses may use a bypass mechanism to translate Pol.

Home Department: Biochemistry, Biophysics and Molecular Biology
Major Professor: Dr. Mark Hargrove
Title: Structural Characterization of Ligand Binding in Hexacoordinate Hemoglobins
Presentation: Thursday, August 17, 2006
Abstract: The goal of biophysics is to study the structures of the components of living organisms and to understand the mechanics of the processes of life. Hemoglobin is a well suited model for this study.
As an essential component of the life blood of mammals, and easy to obtain in large quantities, hemoglobin and its monomeric partner myoglobin are two of the most well studied and characterized components of life. Yet hemoglobin studies continue to reveal new forms of hemoglobin, raising new questions, functional possibilities, and research opportunities. My research focuses on hemoglobins classified as hexacoordinate. I have focused particularly on the structural characterization of these proteins upon ligand binding. Included below for your benefit are a list of abbreviations and terms used in my talk along with their definitions. Hbs -- hemoglobins
List of publications: Hoy, J. A., Kundu, S., Trent, J. T., 3rd, Ramaswamy, S., and Hargrove, M. S. (2004). The crystal structure of Synechocystis hemoglobin with a covalent heme linkage. J Biol Chem. 279, 16535-16542. Trent, J. T., 3rd, Kundu, S., Hoy, J. A., and Hargrove, M. S. (2004). Crystallographic analysis of synechocystis cyanoglobin reveals the structural changes accompanying ligand binding in a hexacoordinate hemoglobin. J Mol Biol. 341, 1097-1108. Smagghe, B. J., Kundu, S., Hoy, J. A., Halder, P., Weiland, T. R., Savage, A., Venugopal, A., Goodman, M., Premer, S., Hargrove, M. S. (2006). Role of Phenylalanine B10 in Plant Nonsymbiotic Hemoglobins. Biochemistry Aug 15;45(32):9735-9745. Hoy, J. A., Smagghe, B. J., Halder, P., Hargrove, M. S. (2006). Covalent heme attachment in Synechocystis hemoglobin is required to prevent ferrous heme dissociation. Manuscript in preparation. Hoy, J. A., Robinson, H., Trent, J. T., Kakar, S., Smagghe, B. J., Hargrove, M. S. (2006). Crystal structure of a nonsymbiotic plant hemoglobin; implications for the evolution of oxygen transport. Manuscript in preparation. Bio: BA in Physics and BA in Humanities from Wartburg College, Waverly, Iowa, 1996; MS in Physics from Iowa State University, 1999; Temporary Instructor of Physics, ISU, 1999 - 2000; PhD studies in Biophysics, ISU, 2000 - 2006; Postdoc in Hargrove Lab. LaRon Hughes Home Department: Genetics, Development & Cell Biology Major Professor: Dr. Karin Dorman Title: EIAV DB: A comprehensive Equine Infectious Anemia (EIAV) Virus database Abstract: A major problem in biology is the storage and retrieval of biological data in a meaningful and efficient manner. With the advent of mass sequencing projects, such as the human genome project, the need to store, retrieve, and analyze sequence data is stronger than ever before. 
This thesis tackles a small part of this problem by presenting techniques, models, and applications for productively storing and retrieving a set of related viral sequences in a central data bank. The thesis begins by providing an overview of the relational database and its role in storing biological data. The main chapter of the thesis is a description of a novel relational database application (EIAV DB). EIAV DB is a central repository of Equine Infectious Anemia Virus sequence and feature information. The models and application provide insight into technologies that help alleviate the storage and retrieval problem. Cizhong Jiang Home Department: Genetics, Development & Cell Biology Major Professors: Dr. Thomas Peterson Title: Computational and molecular analysis of Myb gene family Abstract: Myb proteins are defined by a highly conserved DNA-specific binding domain termed Myb, which is composed of approximately 50 amino acids with constantly spaced tryptophan residues. Multiple copies of Myb domains often exist as tandem repeats within a single protein. There are up to four tandem Myb repeats present in Myb proteins identified to date (termed R0R1R2R3 hereafter). In our study, we collected additional Myb genes and performed a series of phylogenetic analyses to explore the evolutionary origin of Myb genes. The results suggest that the Myb gene family originated from an ancient gene with a single Myb box. One and two intragenic duplications produced R2R3 and R1R2R3 Myb genes, respectively, which then co-existed in the primitive eukaryotes and gave rise to the currently extant Myb genes. Based on our results, we proposed that plant R1R2R3 Myb genes were derived from R2R3 Myb genes by gain of the R1 repeat through an ancient intragenic duplication; this gain model is more parsimonious than the previous proposal that plant R2R3 Myb genes were derived from R1R2R3 Myb genes by loss of the R1 repeat. 
The phylogenetic analysis of isolated individual Myb repeats indicates that the R2 repeat has evolved more slowly than the R1 and R3 repeats. However, it is not clear which repeat is the most ancient one. Another goal of our project is to classify and predict functions of Myb genes. We clustered closely related Myb genes from Arabidopsis and rice into subgroups on the basis of sequence similarity and phylogeny. The gene structure analysis revealed that both the positions and phases of introns are conserved within the same subgroup, although these differ between subgroups. Conserved motifs were detected in C-terminal coding regions within subgroups, and these motifs exist specifically in Myb genes. We also found that Myb genes with similar functions are clustered together. In contrast, no conserved regulatory elements were identified in the divergent non-coding regions. Additionally, the distribution pattern of introns in the phylogenetic tree indicates that Myb domains originally had a compact size without introns. Non-coding sequences were inserted and the splicing sites were conserved during evolution. Alain Laederach Home Department: Chemical and Biological Engineering Major Professor: Dr. Peter Reilly Title: Protein-Carbohydrate and Protein-Protein interactions: Using models to better understand and predict specific molecular recognition Abstract: Any molecular recognition event results in a change in the free energy of the system. The extent of this change is related to the association constant, such that the more negative the free energy change is, the tighter the interaction between receptor and ligand. Protein-carbohydrate interactions play a critical role in signal transduction, innate immunity and metabolism. Modeling these interactions is somewhat complicated by the inherent flexibility of carbohydrates as well as their relatively large number of functional groups. 
An empirical scoring function for docking carbohydrates to proteins will be presented, specifically tailored to predict both the correct binding orientation and the free energy of binding of the carbohydrate-ligand/protein-receptor complex. This new scoring function can predict free energies of binding to within 1.1 kcal/mol residual standard error, a definite improvement over existing scoring functions, which result in standard errors well over 2 kcal/mol. Application of automated docking methodology to determine the carbohydrate recognition specificity of the C-type lectin human surfactant protein D will also be presented. In the second part of the thesis, the role of π-stacking interactions (e.g. between Tyr side chains) in stabilizing protein folds will be discussed. A 17-residue peptide derived from the naturally occurring anti-microbial peptide Tachyplesin I is investigated using NMR spectroscopy. NOE cross peaks were observed, confirming the existence of this interaction in solution. In the final part of the thesis, a quantitative NMR investigation into the self-association behavior of the regulatory domains of several Tec family member kinases will be presented. Of particular interest, self-association within Bruton's Tyrosine Kinase (Btk) regulatory domains occurs through the formation of an asymmetric homodimer. Together this work demonstrates the importance of rigorous biophysical characterization of bio-molecular recognition events and how interdependent computational modeling and experimentation are. Darrin Lemmer Home Department: Biochemistry, Biophysics & Molecular Biology Major Professor: Dr. Gloria Culver Title: CAVEMol: an immersive 3D molecule viewer Abstract: As the number of solved molecular structures deposited with the Protein Data Bank (PDB) increases, so too does the desire for more advanced ways of using this data. 
Traditional applications for viewing and manipulating molecular structures create a computer-generated model on a standard desktop computer screen. The display may employ some method of stereography to create the illusion of depth, but generally the user just sees a flat image. The user is able to interact with the molecule by magnifying it to get a closer look at a particular area of interest, or by rotating it about an arbitrary axis, thus allowing all sides of the molecule to be seen, though only one side is in view at any given time. The user may also be able to see changes in the molecule over time, whereby each conformation of the molecule is a separate frame of an animation, or they may even be able to make modifications to the structure in real time. Regardless of the amount of control the user has over the molecule, however, one thing remains the same: the user experiences the molecule as though it were an object floating behind the monitor screen which they can indirectly control using a mouse or other pointing device. This thesis presents the design and implementation of CAVEMol, a molecular visualization application for immersive environments. I will also give an overview of molecular visualization and immersive environments, and then discuss future work that can be done in this area as well as applications where molecular visualization in an immersive environment can be particularly valuable. Haining Lin Home Department: Computer Science Major Professor: Dr. Xiaoqiu Huang Title: BACAP: An assembly program for hierarchical shotgun sequencing Abstract: We propose a sequence-based algorithm, BACAP, to assemble BAC sequences generated from hierarchical shotgun sequencing. Our approach relies on sequence similarity rather than physical mapping. It follows the “overlap-layout-consensus” framework used for shotgun sequencing data. BACAP uses heuristic methods to achieve efficiency and accuracy. 
It was tested on four simulated data sets of 200 BAC-size sequences each and one real data set of 228 rice BACs from TIGR. The average running time was 25 minutes on one 900 MHz IA-64 GenuineIntel Itanium machine. Our results show that BACAP can quickly and accurately accomplish some BAC assembly tasks without physical mapping information. Nicole Leahy Home Department: Genetics, Development & Cell Biology Major Professor: Dr. Daniel Ashlock Title: Pseudophyte evolutionary algorithm: A simple computational model of parapatric speciation Abstract: The Pseudophyte Evolutionary Algorithm (PEA) is an individual-based computer model of a population of haploid, annual plants used to examine the process of speciation in a patchy environment. The model incorporated both pre-mating and post-zygotic mechanisms for the evolution of reproductive isolation via pollen selection and offspring inviability, respectively. The PEA allows speciation as an emergent property rather than an explicit feature of the model, in order to understand how environmental patchiness, number and arrangement of loci, and reproductive output of individuals affected the strength of isolating mechanisms as well as the rate at which these evolve. The effect of how genotypes were mapped to phenotypes was also explored to examine the sensitivity of the PEA to alternate representations. Yuan Lin Home Department: Genetics, Development & Cell Biology Major Professor: Dr. Xun Gu Title: The Relationship of Sequence Similarity and Expression Pattern Similarity between Yeast Genes within Gene Families Abstract: After gene duplication, the sequence and expression patterns of duplicated genes diverge. It is known that the functional divergence of duplicated genes could be related to the divergence of both their coding sequence and their expression profile, the latter mainly caused by sequence change in the regulatory region. But it is not known if the sequence divergence and expression pattern divergence are correlated. 
Earlier research by Andreas Wagner showed there is at most a very weak correlation between them. In contrast, our research shows there is a strong correlation between sequence similarity and expression profile similarity if the sequences are quite conserved; the degree of coexpression of duplicated genes is consistent with their duplication order. Patricia Lonosky Home Department: Botany Major Professor: Dr. Steve Rodermel Title: Proteomics of the developing chloroplast in maize Abstract: Chloroplast protein expression profiles during the light-induced biogenesis of the maize plastid were determined from 2D gel analysis. At five time points of this ‘greening’ process (0, 2, 4, 12, and 48 hours post-illumination), maize plant tissue was collected, plastids isolated, and protein precipitated and separated in two dimensions using 2D protein gels. From these proteome maps, quantities of spots were analyzed by Principal Components Analysis, hierarchical pairwise average linkage cluster analysis, Adaptive Resonance Theory 2 (ART2) cluster analysis, and Self Organizing Map cluster analysis to determine chloroplast protein expression profiles. 54 spots representing 26 proteins were identified by MALDI-TOF mass spectrometry and used to verify the protein expression profiles. Two main conclusions were drawn from these data: 1) ART2 may be a useful clustering tool for expression data, and 2) different forms or modifications of the same protein show different expression patterns. Wiesia Mentzen Home Department: Genetics, Development & Cell Biology Major Professor: Dr. Eve Wurtele Abstract: I apply combined bioinformatic approaches using genomic and transcriptomic data to investigate the fatty acid biosynthesis pathway at the molecular level and in the context of the systems biology of Arabidopsis. Fatty acids are essential components of all known bacterial and eukaryotic cells, with a critical role as energy reserves and as the metabolic precursors for biological membranes. 
The pathway for fatty acid synthesis seems to be conserved across all living systems. Acetyl-CoA carboxylase, a member of a superfamily of biotin-dependent enzymes, catalyzes the first committed step of the fatty acid biosynthesis pathway. Phylogenetic study exposed complex and intertwined evolutionary histories of this family, with multiple domain fusions and rearrangements. As revealed by meta-analysis of a wide array of Arabidopsis transcriptomic data, fatty acid biosynthesis is transcriptionally regulated, and this regulation extends not only across all pathway reactions but also to some substrate- and cofactor-producing reactions, thus defining a major transcriptionally co-regulated pathway. I extend the meta-analysis of the transcriptome to find groups of coexpressed genes (also called modules, or regulons) in the Arabidopsis genome. Major functionally coherent gene groups were identified. These comprise development, information processing, defense, and metabolism, as well as tissue- and organelle-specific processes. Myron Peto Home Department: Biochemistry, Biophysics and Molecular Biology Major Professor: Dr. Robert Jernigan Title: Studies of Protein Designability using Reduced Models Presentation: July 9, 2007 Abstract: One of the most important problems in computational structural biology is protein designability, that is, why protein sequences are not random strings of amino acids but instead show regular patterns that encode protein structures. Many previous studies that have attempted to solve the problem have relied upon reduced models of proteins. In particular, the 2D square and the 3D cubic lattices together with reduced amino acid alphabets have been examined extensively and have led to interesting results that shed some light on evolutionary relationships among proteins. 
Here, in addition to the 2D square lattice, we study the 2D triangular and 3D face-centered cubic (fcc) lattices; we perform designability studies using different shapes embedded in the 2D square lattice; and we use machine learning algorithms to classify binary sequences folding to highly- or poorly-designable conformations. In the first part of the thesis we extend the transfer matrix method to the 2D triangular lattice. The transfer matrix method is a highly efficient method of enumerating all conformations within a compact lattice area that was developed earlier for the 2D square and 3D cubic lattices. In addition, we enumerated all compact conformations within simple geometries on the 2D triangular and 3D fcc lattices using a standard backtracking algorithm. In the second part of the thesis we describe protein designability studies on various shapes in the 2D square lattice using a reduced hydrophobic-polar (HP) amino acid alphabet. We used a simple energy function that counted the number of H-H, H-P and P-P interactions within a restricted set of protein shapes that have the same number of residues and non-bonded contacts. We found a difference in the designabilities of different protein shapes. Finally, in the third part of the thesis we used standard machine learning algorithms to classify two classes of protein sequences. We first performed a designability study for two shapes, using a binary HP alphabet, on the 2D triangular lattice and separated highly- and poorly-designable conformations. Highly-designable conformations had many sequences folding to them with the lowest energy, and poorly-designable conformations had few or no sequences folding to them. Sequences were classified as highly- or poorly-designable depending on whether they folded to highly- or poorly-designable structures. 
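As an illustration of the contact-counting energy function mentioned in this abstract, here is a minimal Python sketch for a chain on the 2D square lattice. The function names (`count_contacts`, `hp_energy`) and the default weights (H-H = -1, H-P = P-P = 0, the common textbook HP choice) are assumptions made for this example; the abstract does not state which weights the thesis used.

```python
def count_contacts(sequence, coords):
    """Count non-bonded nearest-neighbour contacts by type (HH, HP, PP).

    sequence: string of 'H' and 'P', one letter per residue.
    coords:   list of (x, y) integer lattice sites, in chain order.
    """
    if len(sequence) != len(coords):
        raise ValueError("sequence and coords must have the same length")
    if len(set(coords)) != len(coords):
        raise ValueError("conformation is not self-avoiding")

    index_at = {pos: i for i, pos in enumerate(coords)}
    contacts = {"HH": 0, "HP": 0, "PP": 0}
    for i, (x, y) in enumerate(coords):
        # Look only right and up so that every lattice edge is visited once.
        for neighbour in ((x + 1, y), (x, y + 1)):
            j = index_at.get(neighbour)
            if j is None or abs(i - j) == 1:
                continue  # empty site, or residues bonded along the chain
            pair = "".join(sorted(sequence[i] + sequence[j]))
            contacts[pair] += 1
    return contacts


def hp_energy(sequence, coords, weights=None):
    """Energy as a weighted sum of contact counts (weights are assumed)."""
    if weights is None:
        weights = {"HH": -1.0, "HP": 0.0, "PP": 0.0}
    contacts = count_contacts(sequence, coords)
    return sum(weights[kind] * n for kind, n in contacts.items())


if __name__ == "__main__":
    # A 4-residue chain folded into a 2x2 square: residues 0 and 3 touch.
    square = [(0, 0), (1, 0), (1, 1), (0, 1)]
    print(count_contacts("HPPH", square))
    print(hp_energy("HPPH", square))
```

In a designability study of the kind described, an energy like this would be evaluated for every sequence on every enumerated compact conformation, and a conformation's designability would be the number of sequences for which it is the unique lowest-energy state.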
Using several machine learning algorithms such as Decision Tree, Naïve Bayes, and Support Vector Machine, we were able to classify highly- and poorly-designable sequences with high accuracy. Bradley Powers Home Department: Mathematics Major Professor: Dr. Dan Ashlock Title: The Effect of Tags on Non-Local Adaptation Abstract: This project investigates in greater depth the phenomenon of non-local adaptation previously observed in an evolutionary model based on the game iterated Prisoner’s Dilemma. Non-local adaptation is the ability of an agent or population of agents to perform well against other agents that share no common history or ancestry with them. Populations of agents both with and without identifying tags are evolved to play noisy iterated prisoner’s dilemma on a toroidal grid. The agents consist of a finite state machine specialized for playing iterated prisoner’s dilemma and a simple tag recognition capability. The populations are allowed to evolve for 10,000 generations and the state of the world is stored every 500 generations. Populations from these samples are placed in competition with populations from generation 10,000. This procedure is repeated for varying levels of overall mutation rate, with and without tags, and for varying frequencies of tag-related mutations. Non-local adaptation is seen in these populations; however, tags seem to slow the acquisition of non-local adaptation. Although non-local adaptation is not a widely accepted phenomenon in biology, these results suggest that it may happen and that the effect is persistent in the face of changes in mutation rate and in the face of increased task complexity. Further analysis shows that the populations tend to have a predominant tag most of the time, with punctuated periods of increased tag space usage that most likely correspond to invasion of the population by an opportunistic agent with a new tag identifier. Justin Recknor Home Department: Statistics Major Professor: Dr. 
Dan Nettleton Title: Identification of Differentially Expressed Functional Categories in Microarray Studies Using Nonparametric Multivariate Analyses Abstract: Tests of differential expression across groups of genes within a functional category are performed using a method motivated by Barry, Nobel, and Wright (2005). Rather than comparing marginal distributions on a gene-by-gene basis across treatment groups, we use a test statistic that can detect general changes in multivariate distributions across treatment groups. Resampling-based methods and multiple-testing adjustments are used to obtain simultaneous inference for multiple groups of genes. Results are visualized on a directed acyclic graph, and new methods for pinpointing genes of greatest interest are provided. Shannon D. Schlueter Home Department: Genetics, Development and Cell Biology Major Professor: Dr. Volker Brendel Title: Plant genome informatics: Evaluation and Analysis of genomic DNA features involved in transcriptional processing of protein coding genes Abstract: As biological data collection methods have become more cost effective and less time consuming, the necessity of computational tools to store, manage, and analyze such data has led to the creation of a broad field of research. With the vast majority of effort in bioinformatics being applied to research on vertebrate species, researchers in the plant sciences have often been left with less than satisfactory tools to fill this need. In the course of this study, I have developed xGDB, an extensible infrastructure for integrating biological data resources and applying them to hypothesis-driven research. xGDB databases for eleven plant species have been made publicly available at http://www.plantgdb.org. Using the infrastructure provided by xGDB, a sophisticated system was developed to investigate the reliability of protein coding gene structure annotations on a per-gene basis. 
With this, I generated the necessary dataset to develop and test a plant specific probabilistic model of RNA polymerase II transcription start sites and promoters. Through application of this model, a look at individual plant protein coding gene promoters has shown unique structure and organization. Together, this work demonstrates the importance of integrated computational infrastructure and genomic domain knowledge.
Justin Schonfeld Home Department: Mathematics Major Professor: Dr. Dan Ashlock Title: A modular data analysis pipeline for the discovery of novel RNA motifs Abstract: This dissertation presents a modular software pipeline that searches collections of RNA sequences for novel RNA motifs. In this case the motifs incorporate elements of primary and secondary structure. The motif search pipeline breaks up sets of RNA sequences into shorter segments of RNA primary sequence called bricks. The bricks are then folded to obtain low energy secondary structures. The distance estimation module of the pipeline then calculates distances between the folded bricks and analyzes the resulting distance matrices for patterns. An initial implementation of the pipeline is applied to synthetic and biological data sets. This implementation introduces a new distance measure for comparing RNA sequences based on structural annotation of the folded sequence, as well as a new data analysis technique called nonlinear projection. The modular nature of the pipeline is then used to explore the relationships between several different distance measures on random data, synthetic data, and a biological data set consisting of iron response elements. It is shown that the different distance measures capture different relationships between the RNA sequences. The nonlinear projection algorithm is used to produce 2-dimensional projections of the distance matrices, which are examined via inspection and k-means multiclustering. The pipeline is able to successfully cluster synthetic RNA sequences based only on primary sequence data, as well as the iron response elements data set. The dissertation also presents a preliminary analysis of a large biological data set of HIV sequences in which crossover points were localized. Sachet Shukla Home Department: Electrical and Computer Engineering Major Professor: Dr. 
Srinivas Aluru Title: Region-specific implication of 5'UTR motifs in translational control mechanisms Abstract: This study uses a novel approach based on the RESCUE technique (Relative Enhancer and Silencer Classification by Unanimous Enrichment) (Fairbrother et al., 2002) to identify region-specific motifs in the 5'UTR. A highly selective screening procedure is described and implemented, which drastically reduces the false positive rate of motifs identified by the original technique. For increased accuracy, we present the results only for species that have well-curated mRNA data as maintained in the RefSeq curated database. The results of these computations suggest that there are motifs in the 5'UTR that act in conjunction with the Kozak consensus sequence in the process of translation initiation. Specifically, motifs have been identified in the inter-ATG regions of 5'UTRs with multiple uATGs (upstream ATGs) that may have an effect on translation initiation. Strong and weak Kozak sequences have also been associated with mutually exclusive motif sets both upstream and downstream of the true start codon. Finally, a number of motifs were identified as being preferentially present in the uORF (upstream Open Reading Frame) regions, which argues against the theory that uORF sequences are random. In general, uORF regions are also found to be strongly selective against motifs associated with strong Kozak sequences. In addition to the above-stated results, which are applicable across species, motif overlap analysis (e.g., of motifs that are associated with both strong Kozak sequences and the inter-ATG region upstream of the true start codon) also suggests some species-specific translational control mechanisms. The region-specific identification of motifs itself is probably indicative of higher-order secondary and tertiary structures and interactions. 
The experimental validation of these results could lead to the discovery of novel primary/secondary motifs and translational control mechanisms encoded in the 5' untranslated regions of different species. Michael Sparks Home Department: Genetics, Development and Cell Biology Major Professor: Dr. Volker Brendel Title: Computational annotation of eukaryotic gene structures: algorithms development and software systems Abstract: An important foundation for the advancement of both basic and applied biological science is correct annotation of protein-coding gene repertoires in model organisms. Accurate automated annotation of eukaryotic gene structures remains a challenging, open-ended and critical problem for modern computational biology. The use of extrinsic (homology) information has been shown to be a quite successful strategy for this task, though it is not a perfect solution. Therefore, the continued development of methods not explicitly reliant on homology information (the so-called ab initio gene prediction methods) should help to more rapidly achieve a comprehensive understanding of gene content in model eukaryotes. This thesis explores the development of novel algorithms in an attempt to advance the current state of the art in ab initio gene prediction. The work has been conducted with an eye towards contributing open source, well-documented, and extensible software systems implementing the methods, and towards generating novel biological knowledge with respect to plant taxa in particular. Splice site prediction, coding fragment recognition, translation initiation site prediction and overall gene structure prediction will be discussed. Robert Thompson Home Department: Veterinary Microbiology and Preventive Medicine Major Professor: Dr. Susan Carpenter Title: Application of computational tools to analyze evolution of equine infectious anemia virus Abstract: Evolution is the study of how variation alters the phenotype and population dynamics over time. 
Population genetics theories fit viral evolution well because of the properties of a viral population. Retroviruses are characterized by high mutation and replication rates, which produce a heterogeneous mixture of viral variants commonly referred to as a quasispecies. Equine infectious anemia virus (EIAV) infection is a well-studied model for retrovirus variation and evolution (32, 33, 34). EIAV infection is characterized by a rapid, variable, dynamic disease course. Dynamic features of clinical disease, as well as the ability of the horse to control the infection, make EIAV an excellent system to study evolution of viral quasispecies during progression of clinical disease. Here, we describe analyses of genetic data from longitudinal studies of genetic variation in a horse experimentally infected with equine infectious anemia virus. These studies include the genes encoding the regulatory protein Rev and the surface envelope glycoprotein, SU. Phylogenetic and cluster analyses suggested that the population of Rev variants was comprised of two distinct quasispecies that co-existed during infection. The populations shifted rapidly during febrile and afebrile periods, with as little as 10 days between changes in population dominance. In this study, we also examined evolution of the EIAV envelope: quasispecies in the chronic period evolve by random processes, while quasispecies in the inapparent period evolve by a combination of Darwinian selection and random processes. These results suggest that the envelope evolves by different processes during different stages of disease. Different evolutionary mechanisms during different stages of disease require unique approaches to anti-retroviral therapy during those stages. Together, these results suggest there are unique host environments and viral population interactions during different stages of disease. 
Multiple quasispecies and varying processes of evolution during persistent retrovirus infection challenge the current thinking and have important biological implications for control of viral infections. Peter Vedell Home Department: Mathematics Major Professor: Dr. Zhijun Wu Title: Boundary Value Approaches To Molecular Dynamics Simulation Abstract: Conformational transitions of biomolecules like proteins play an important role in many cellular processes, most often in a positive way, but sometimes in a detrimental way, perhaps causing diseases. Knowledge about conformational transitions of proteins and other biomolecules has the potential to be important in many areas of biological research. Simulation is an important means of studying these transitions. When a molecule has more than one known stable conformation, one can study conformational transitions by a boundary value approach to molecular dynamics simulation. Application of multiple-shooting methods (iterative numerical methods for solving boundary value problems for ordinary differential equations) is proposed to find Newtonian molecular dynamics trajectories for a system subject to an all-atom molecular mechanics force field. These trajectories correspond to conformational transitions of proteins. Important aspects of this work include assessment of potential biological significance and computational challenges. The many computational issues include feasibility of the approach for larger systems, convergence properties, global optimization algorithms, efficient methods for finding initial trajectories, choice of boundary conditions, methods for parameter reduction, and algorithms for handling the initial-value sub-problems, for computing Jacobian matrices, and for solving the resulting nonlinear systems of equations. 
Distance matrix interpolation methods, which are particularly useful for constructing approximate trajectories in situations where all-atom Newtonian trajectories are not feasible, have previously been described ([Kim2002]). We introduce different distance matrix interpolation approaches that hold some promise for efficiently constructing initial trajectories as well as for possible progress in construction of approximate trajectories. The results from simulating conformational transitions of alanine dipeptide are presented. REFERENCES [Kim2002] Kim M, Jernigan R, Chirikjian G. Efficient generation of feasible pathways for protein conformational transitions. Biophysical Journal, 83: 1620 (2002). Kent Vander Velden Home Department: Genetics, Development & Cell Biology Major Professor: Dr. Gavin Naylor Title: Spatial Clustering of Differences in Measured Homoplasy with Respect to Protein Structure Abstract: The identification of residues that hold misleading phylogenetic signals and the identification of those that are of functional significance are intertwined. Advances in the one area can support the other, mainly because misleading phylogenetic signals come from residues that are not evolving as a random process. This paper is a case study comparing a widely accepted phylogenetic tree to trees that have been inferred from sequence data of five proteins. A new metric, RI Difference, based on the Retention Index, is proposed for measuring the relative support that individual sites provide for two trees. Through the identification of sites harboring disproportionately large misleading phylogenetic signal, we attempt to identify residues that are cooperating to define the function of the protein. This information is presented in the context of the structure of the protein, where clustering patterns (or the lack thereof) are observed in the implicated residues. 
A new bioinformatics software tool, RI Compare, is presented; it implements the metric and blends heterogeneous information from alignments, phylogenetic trees, and structure to support this research. The results are offered, followed by some speculation as to what might be causing erroneous trees to be inferred. The relationship of the implicated residues to those of known importance is also discussed. While, regrettably, the results of this paper do |