May 5, 2020
Originally published by the Association for Computing Machinery as part of the "People of ACM" series.
What initially drew you to the field of computational molecular biology?
Computational molecular biology as a field did not really exist when I was in college, and I entered graduate school at MIT thinking I would work in algorithms. But while I was in graduate school, the Human Genome Project was ramping up, and it was clear that there were many interesting computational problems that needed to be tackled in order to successfully decipher the human genome. I had always been fascinated with the questions asked in biology and medicine, but never enjoyed doing experimental lab work, so it was exciting to think about these questions computationally. I was also extremely fortunate to be mentored by Bonnie Berger in the MIT Math department and Peter Kim, who was then in the MIT Biology department. Their vision of this new field was instrumental in my becoming a computational biologist.
A major focus of your work is using computing to understand the interactions among proteins, and between proteins and other molecules. Why is understanding interactions among proteins so important? What is one crucial way in which machine learning and algorithms are advancing research in this area?
Proteins participate in nearly all biological processes, and they accomplish their functions not in isolation, but by interacting with other molecules (including other proteins, DNA, etc.). Altogether, these networks of molecular interactions comprise the “wiring diagram” of a cell, and knowing this network is a key step toward understanding how cells function (or malfunction, in the case of disease). While many protein interactions have been determined experimentally, our knowledge is incomplete; machine learning is used to fill this gap. My group has developed machine learning techniques to predict, for example, interactions between proteins and DNA that regulate which proteins are expressed. On the flip side, once large-scale cellular networks are known, significant additional work is needed to uncover how they actually function. My group has developed algorithms to uncover the structure and organization of these networks and to annotate proteins that work together to accomplish some function.
You were a co-author of the recent paper "An integrative approach uncovers genes with perturbed interactions in cancers." Will you tell us a little about how the PertInInt software, which you and your co-authors introduced in this paper, can help detect cancer genes?
Cancer is a disease where our own cells acquire mutations in their DNA that lead them to divide in an uncontrolled fashion. Thus, the idea of cancer genomics is that if you sequence the genomes of cancer cells, then you can uncover which genes (and their encoded proteins) are mutated, and this may lead to a better understanding of cancer biology and may also identify putative drug targets. Unfortunately, it turns out that typically numerous mutations are found in any individual’s cancer, and yet only a handful of them are relevant for cancer initiation or progression.
The overall idea behind PertInInt is that, because the interactions that proteins make are so critical for their functioning, mutations that affect interactions are more likely to be the ones that are causal for cancer. Thus, PertInInt uncovers cancer genes by identifying those that, when considering cancer genomes across individuals, have an enrichment of mutations in positions that are involved in interactions. PertInInt’s framework is actually generalized so that it integrates multiple sources of information about which positions within a gene are important for functionality, and discovers genes whose functionalities tend to be disrupted in cancers. Crucially, PertInInt is based on analytical calculations that obviate the need to perform time-prohibitive permutation-based significance tests.
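To illustrate why an analytical calculation can replace a permutation test, here is a minimal sketch. It is not the PertInInt implementation, and the null model is an assumption made for illustration: each of a gene's observed mutations is taken to land independently and uniformly at random on one of its positions. Under that assumption, the mean and variance of the total weight of hit positions have closed forms, so a z-score can be computed directly instead of shuffling mutations thousands of times. The function name, the per-position weights, and the toy data are all hypothetical.

```python
import math


def enrichment_z_score(gene_length, mutated_positions, weights):
    """Z-score for enrichment of mutations in high-weight positions.

    Assumed null model (an illustration, not PertInInt's): each mutation
    lands independently and uniformly on one of `gene_length` positions.
    `mutated_positions` holds 0-based positions, one entry per mutation.
    `weights[i]` says how strongly position i is involved in interactions
    (for example, 1 for an interaction site and 0 otherwise).
    """
    if len(weights) != gene_length:
        raise ValueError("weights must have one entry per position")

    n = len(mutated_positions)
    # Observed score: total weight of the positions hit by mutations.
    observed = sum(weights[p] for p in mutated_positions)

    # Closed-form moments of the score under the uniform null model.
    mean_w = sum(weights) / gene_length
    mean_w2 = sum(w * w for w in weights) / gene_length
    expected = n * mean_w
    variance = n * (mean_w2 - mean_w ** 2)

    if variance <= 0:
        # No mutations, or every position has the same weight:
        # the score carries no information about enrichment.
        return 0.0
    return (observed - expected) / math.sqrt(variance)


# Toy gene with 10 positions, of which the first two are interaction sites.
toy_weights = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
# All four mutations fall on interaction sites, so z is about 4.0 here.
z = enrichment_z_score(10, [0, 1, 0, 1], toy_weights)
print(z)
```

A permutation test would estimate the same null distribution by repeatedly re-placing the mutations at random, which becomes costly across many genes and many sources of position weights; the closed-form moments give the same kind of answer in a single pass over the weights.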
A great deal of media attention is currently focused on efforts to develop a treatment for the novel coronavirus (COVID-19). Broadly speaking, how might computational biology help in this effort?
Computation can help address many of the mysteries of COVID-19 by analyzing viral and patient genomes along with patient data. For example, why do some individuals have mild symptoms and others more serious ones? How infectious is COVID-19 and how has it spread across the world? What changed in the genome of the virus, as compared to other closely related viruses, that has made it so dangerous for humans? There are already thousands of sequenced genomes for SARS-CoV-2, the virus that causes COVID-19, from patient samples; phylogenetic analysis of this data is already revealing how the virus has been spreading across the world and how long it has been circulating in certain communities.
Network modeling of how the disease has spread will uncover important aspects of infectivity. The structure of the SARS-CoV-2 “spike” protein has been determined, and computational structural modeling of it may give insight into how to block the critical interaction it makes with a human protein that enables the virus to enter human cells. As genomic data is collected for patients, statistical modeling can be used to try to uncover variants in individuals that affect how they respond to the virus, and this can be used to guide treatment.