Title: The Evolutionary Origins of Orphan Genes
With every newly sequenced species, we discover hundreds of novel protein-coding genes. Many of these "orphans" have been shown experimentally to have dramatic functions in development, sexual dimorphism, pathogen resistance, and social traits such as symbiosis. Whereas researchers once viewed genes as the product of continuous variation acting on ancient material, we now know that novel genes may arise de novo from non-genic sequence. Thus, evolutionary experimentation is not limited to tweaking existing genes or their regulatory patterns. Any orphan genes that arose in the distant past should appear today as lineage-specific genes (or gene families). The classification of genes by their relative time of origin is called "phylostratigraphy". Phylostratigraphy has proven challenging, however, with different methodologies often yielding contradictory conclusions. Standard phylostratigraphy infers the age of a gene by finding the most distant species that has an inferred homolog. This approach is highly sensitive to annotation quality and cannot easily distinguish rapidly evolving genes from genes of de novo origin.
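The standard inference described above can be illustrated with a minimal Python sketch. It is not the interface of any tool from this dissertation: the function name `assign_phylostrata`, its arguments, and the gene names are hypothetical, and the homolog calls are assumed to have been produced beforehand (for example, by a sequence similarity search). Each gene is assigned to the oldest clade on the focal species' lineage that contains a species with an inferred homolog.

```python
from typing import Dict, List, Set


def assign_phylostrata(
    lineage: List[str],
    species_stratum: Dict[str, str],
    homolog_hits: Dict[str, Set[str]],
) -> Dict[str, str]:
    """Assign each focal gene to its oldest supported clade (hypothetical helper).

    lineage         -- clades on the focal lineage, ordered oldest to youngest;
                       the last entry is the focal species itself
    species_stratum -- for each target species, the clade it shares with the
                       focal species (their most recent common ancestor)
    homolog_hits    -- for each focal gene, the target species in which a
                       homolog was inferred
    """
    rank = {clade: i for i, clade in enumerate(lineage)}
    youngest = lineage[-1]
    result = {}
    for gene, species_with_hit in homolog_hits.items():
        # Every gene is, by definition, present in the focal species.
        strata = [youngest]
        strata += [species_stratum[s] for s in species_with_hit]
        # The most distant species with a homolog determines the stratum.
        result[gene] = min(strata, key=lambda clade: rank[clade])
    return result


# Illustrative data only; the gene names and hits are invented.
lineage = ["cellular organisms", "Eukaryota", "Brassicaceae", "Arabidopsis thaliana"]
species_stratum = {
    "Escherichia coli": "cellular organisms",
    "Saccharomyces cerevisiae": "Eukaryota",
    "Brassica rapa": "Brassicaceae",
}
homolog_hits = {
    "geneA": {"Escherichia coli", "Saccharomyces cerevisiae"},
    "geneB": {"Brassica rapa"},
    "geneC": set(),
}
strata = assign_phylostrata(lineage, species_stratum, homolog_hits)
```

In this sketch, `geneC` has no inferred homolog and so falls into the youngest stratum. The assignment alone cannot say whether that reflects de novo origin, rapid divergence, or a homolog missed because of poor annotation, which is the limitation noted above.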
This dissertation contributes a suite of tools for determining the phylostratigraphic age of genes more accurately and for assessing the level of support for each classification. First, we developed phylostratr to automate standard phylostratigraphy. Second, we developed a program, synder, to infer syntenic homologs of query features using a synteny map. Third, we developed fagin, a package that builds on synder to search, for each query gene, related species for traces of genic or non-genic orthology. The pipeline can distinguish orphans with high-confidence data support from orphans identified because of poor assembly or missing data. We traced many orphans to their non-genic cousins, identifying the non-genic footprint from which they arose. We linked others to putative genes in related species from which they have diverged beyond recognition. Knowing the approximate location of each gene across species, together with the amount of data support, provides a launching point for future orphan studies.