CS

Perfect Phylogeny and Haplotype Assignment

Report ID: TR-678-03

Author: Karp, Richard M. / Halperin, Eran

Date: 2003-10-00

Pages: 11

Download Formats: |PDF| |Postscript|

Abstract:

This paper is concerned with the reconstruction of perfect phylogenies from binary character data with missing values, and related problems of inferring complete haplotypes from haplotypes or genotypes with missing data. In cases where the problems considered are NP-hard we assume a rich data hypothesis under which they become tractable. Natural probabilistic models are introduced for the generation of character vectors, haplotypes or genotypes with missing data, and it is shown that these models support the rich data hypothesis. The principal results include:

- A near-linear time algorithm for inferring a perfect phylogeny from binary character data (or haplotype data) with missing values, under the rich data hypothesis;

- A quadratic-time algorithm for inferring a perfect phylogeny from genotype data with missing values with high probability, under certain distributional assumptions;

- Demonstration that the problems of maximum-likelihood inference of complete haplotypes from partial haplotypes or partial genotypes can be cast as minimum-entropy disjoint set cover problems;

- In the case where the haplotypes come from a perfect phylogeny, a representation of the set cover problem as minimum-entropy covering of subtrees of a tree by nodes;

- An exact algorithm for minimum-entropy subtree covering, and demonstration that it runs in polynomial time when the subtrees have small diameter;

- Demonstration that a simple greedy approximation algorithm solves the minimum-entropy subtree covering problem with relative error tending to zero when the number of partial haplotypes per complete haplotype is large;

- An asymptotically consistent method of estimating the frequencies of the complete haplotypes in a perfect phylogeny, under an iid model for the distribution of missing data;

- Computational results on real data demonstrating the effectiveness of a the greedy algorithm for inferring haplotypes from genotypes with missing data, even in the absence of a perfect phylogeny.