Bioinformatics & Computational Biology
Iowa State University

Professor Mike Sanderson

Computer Science Colloquium

Exploiting large sequence databases to build the phylogenetic tree of life: problems, strategies and algorithms

Professor Mike Sanderson
University of California, Davis
Tuesday September 24, 2002
3:30 p.m.
1414 Molecular Biology Building

Abstract
Biological species are related to one another by a pattern of ancestry and descent generally represented as a rooted phylogenetic tree. This tree includes upwards of 5 million species, but only a small fraction of this diversity has been captured in phylogenetic analysis to date. Large molecular sequence databases are a rapidly growing source of data for these studies; GenBank, for example, archives sequence information on over 100,000 species. Exploitation of these databases for phylogenetic purposes will require solving significant methodological and computational problems that are only now being identified. Given that it is neither practical nor informative to analyze the entire database at once, most of these problems relate to how the database should be partitioned, analyzed and then recombined to provide a synthesis. This talk discusses three of these problems. First, single-copy genes must be separated from those in gene families, because the former are much more readily useful for species tree reconstruction than the latter. Second, optimal concatenations of genes must be found to maximize the sequence data per species used in any one analysis. Third, trees resulting from partially overlapping sets of species must be combined into phylogenetic supertrees that are larger than any of the input trees. The first problem is addressed using phylogenetic randomization tests. The second is addressed by a novel application of existing algorithms for finding maximal bicliques in bipartite graphs. The third is addressed by a relatively new suite of consensus-like algorithms on trees with partially overlapping label sets. All three problems are illustrated by reference to a set of 100,000 proteins from green plants, extracted from GenBank. From these, we characterize the information content of the database with respect to building comprehensive phylogenetic trees, and provide upper bounds on the size of concatenated data sets and lower bounds on the number of phylogenetic supertrees needed.
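
The biclique formulation of the second problem can be illustrated with a small sketch. Treat species and genes as the two sides of a bipartite graph, with an edge wherever a gene has been sequenced for a species. A biclique is then a set of species that all have every gene in a set of genes, that is, a concatenated data set with no missing entries. The Python below is a brute-force illustration only, not the algorithm presented in the talk. The species and gene names are hypothetical toy data, and scoring a block by species count times gene count is an assumed objective chosen for simplicity.

```python
from itertools import combinations

# Hypothetical toy data: which genes have been sequenced for which species.
# These names are illustrative only; they are not taken from the talk.
SAMPLED = {
    "sp_A": {"rbcL", "matK", "atpB"},
    "sp_B": {"rbcL", "matK"},
    "sp_C": {"rbcL"},
    "sp_D": {"matK"},
}


def maximal_bicliques(sampled):
    """Enumerate maximal bicliques (species set, gene set) by brute force.

    For every non-empty subset of species, take the genes they all share,
    then add every other species that also has all of those genes.
    Each resulting pair cannot be extended on either side, so it is maximal.
    Runtime is exponential in the number of species: toy inputs only.
    """
    species = sorted(sampled)
    found = set()
    for k in range(1, len(species) + 1):
        for subset in combinations(species, k):
            shared = set.intersection(*(sampled[s] for s in subset))
            if not shared:
                continue  # no gene common to all: no complete block
            closed = frozenset(s for s in species if shared <= sampled[s])
            found.add((closed, frozenset(shared)))
    return found


def largest_block(sampled):
    """Return the maximal biclique with the most cells (assumed objective)."""
    return max(maximal_bicliques(sampled),
               key=lambda b: len(b[0]) * len(b[1]))


if __name__ == "__main__":
    block_species, block_genes = largest_block(SAMPLED)
    print(sorted(block_species), sorted(block_genes))
```

On this toy input the largest complete block pairs sp_A and sp_B with rbcL and matK. Real databases would need one of the dedicated maximal-biclique enumeration algorithms the abstract refers to, since trying every subset of species is infeasible beyond a handful of taxa.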

Bio
Mike Sanderson, Professor, UC Davis (1995-present)
B.S. in Physics, University of Arizona, 1982
Ph.D. in Ecology and Evolutionary Biology, University of Arizona, 1989
Postdoc (Sloan Fellowship in Molecular Evolution), Cornell University
Assistant Professor, University of Nevada, 1992-1995


Copyright © 2002, Iowa State University, all rights reserved.