I will begin with a gentle introduction to the diversity and scale of ENCODE data and a brief overview of the robust statistical methods that we developed for automated detection of the DNA binding sites of hundreds of regulatory proteins from noisy experimental data. Regulatory proteins can perform multiple functions by interacting with, and co-binding DNA alongside, different combinations of other regulatory proteins. I developed a novel discriminative machine learning formulation based on regularized rule-based ensembles that was able to sort through the combinatorial complexity of possible regulatory interactions and learn statistically significant item-sets of co-binding events at an unprecedented level of detail. I found extensive evidence that regulatory proteins could switch partners at different sets of genomic domains, both within a single cell type and across different cell types, affecting structural and chemical properties of DNA and regulating different functional categories of target genes. Using regulatory elements discovered from ENCODE data, we were also able to provide putative functional interpretations for up to 81% of all publicly available sequence variants (mutations) identified in large-scale disease studies and to generate new hypotheses by integrating multiple sources of data.
Finally, I will present a brief overview of my recent efforts to use multivariate hidden Markov models to analyze the dynamics of various chemical modifications to DNA along three key axes of variation: across multiple species, across different cell types in a single species (human), and across multiple human individuals for the same cell type. Our results indicate a remarkable universality of the chemical modifications that define hidden regulatory states across the animal kingdom, with dramatic differences in the variation and functional impact of these regulatory elements between cell types and individuals.
Together, these efforts take us one step closer to learning comprehensive models of gene regulation in humans, with the goal of improving our systems-level understanding of cellular processes and complex diseases.