CSML, PICSciE and DataX help researchers launch new cancer analysis software

September 22, 2021

From the Center for Statistics and Machine Learning

To probe the origin and spread of cancers in the human body more effectively, Ben Raphael, professor of computer science at Princeton University, and his research lab created HATCHet (Holistic Allele-specific Tumor Copy-number Heterogeneity), an algorithm that finds and analyzes genes that have been duplicated or deleted in multiple tumor samples from a single cancer patient.

Raphael’s team first released the open-source algorithm to the public in 2018. But to make the HATCHet software more widely accessible and help advance cancer research in the larger biomedical community, Raphael turned to Vineet Bansal, senior research software engineer jointly appointed to the Center for Statistics and Machine Learning (CSML) and Princeton Institute for Computational Science and Engineering (PICSciE).

The collaboration, which also pulled in expertise from data scientist Brian Arnold, who’s part of Princeton's Schmidt DataX Initiative, has resulted in software that is more efficient and robust, and that can operate in the cloud. HATCHet’s various components are currently accessible on GitHub.

“It’s a really nice, fruitful partnership. Vineet, who's a software engineer, has great software engineering skills, but hadn't worked too much with biological data. Brian comes from a biology background and he understands what genomes are and has extensive experience with biological data,” said Raphael, whose research focuses on computational biology and bioinformatics.

Bansal and Arnold said they felt gratified that their contributions were impactful.

“I learn something new in each project I tackle and HATCHet was no exception,” said Bansal. “I am pleased that my work on HATCHet will contribute to its wider use.”

“It’s been a great collaboration because we bring different expertise to the project. And I think this work has made HATCHet much better than it would have been otherwise,” said Arnold.

Raphael and Simone Zaccaria, a former postdoctoral research associate at Princeton, first released HATCHet to the public in the research paper, “Accurate quantification of copy-number aberrations and whole-genome duplications in multi-sample tumor sequencing data,” which was published in the journal Nature Communications in 2018.

In cancer research, it’s been standard practice to process and analyze the DNA sequences of tumor samples individually. In recent years, researchers have begun to sequence multiple samples from a cancer patient. HATCHet analyzes the DNA sequencing data from these multiple samples simultaneously and, according to the research paper, outperforms current state-of-the-art methods.

When they published HATCHet, Raphael said he recognized that the software still needed work.

“In many research groups that develop new computational methods, there's a gap between the software that a student or postdoc develops to analyze data for a research paper and a more polished piece of software that another researcher can easily download, install and use,” said Raphael. “This gap exists in most academic research groups because there is a substantial difference between a software prototype and a usable application.”

Bansal added, “And most of the time, the interest in these software projects is a function of how easy or difficult it is to install the package and use it for your own institution.”

The push to make HATCHet more accessible and practical to use began last summer, when Bansal was assigned to the project. Bansal's primary job is to collaborate with CSML-affiliated faculty members to develop computational tools that enhance their research. More on the process of choosing projects for Bansal can be read here.

“Vineet took the existing HATCHet tool and transformed it into a more productive piece of software that is easier for researchers to install and use on their own computers,” said Raphael.

The primary aim of Bansal’s work was to restructure performance-critical components of the HATCHet package, said Raphael. That work included developing and implementing a 'cloud mode' in which the entire computational pipeline can be deployed and run on a commercial cloud service. Researchers then don’t have to worry about software installation or data transfer issues, and can tap other benefits of cloud infrastructure such as accessibility and data security.

Another objective was to make the software more streamlined and frictionless to use, said Raphael. The previous iteration had “quirks” that researchers had to work around.

“If you want to analyze one hundred cancer genomes or a thousand cancer genomes, you want something that's much more robust and efficient,” said Raphael.

Arnold, a data scientist who’s part of Princeton's Schmidt DataX Initiative, joined the HATCHet project earlier this year. He contributed his expertise in biology, genomic sequencing, and bioinformatics, a field concerned with using computational techniques to analyze and process biological datasets.

“Brian was a great collaborator, and I learned quite a bit from him,” said Bansal. “Brian filled in the gaps of my knowledge since my background is in software engineering.”

Arnold and Bansal also collaborated on enhancing HATCHet's documentation, which is critical for outside adoption, said Arnold.

Next month, Bansal and Arnold will team up for a special event that draws on their HATCHet experience. On October 1, they plan to run a short DataX workshop, “Best Practices on Python Packaging,” on how to package and document code for public release. More on that event can be found here.