According to a paper in Nature Communications, a team of Rice University researchers led by Luay Nakhleh has developed a platform for integrating DNA and RNA data from single-cell sequencing with greater speed and precision than more recent, state-of-the-art technologies. The method, mapping cross-domain nucleic acid, or MaCroDNA, relies on a classical algorithm to identify matching pairs of data from DNA (the genetic blueprint of a cell) and RNA (a cell’s instruction manual for protein assembly).
“Imagine you are given two large sets of photos of cars with the license plates and other identifying features blurred,” said Mohammadamin Edrisi, a Rice Ph.D. student in computer science and lead author on the study. “One set contains photos of the cars taken from the front, while the other set has photos of the back of the cars, and someone asks you to find the pairs of photos that belong to the same car. This is a metaphor for the problem we have tried to solve. The cars are cancer cells, and the two sets of photos are DNA and RNA data measurements.”
In fact, the scenario that MaCroDNA is designed to address is more complex than that.
“In a typical cancer single-cell sequencing experiment, the DNA and RNA data sets are obtained from different cells in the tumor sample,” said Nakhleh, the senior author on the study. “So the matching in such a scenario happens between cells that we know are not the same cells.
“To continue the analogy, think of each photo as being taken of the front or back of a different Toyota car, and we want to match pairs of photos that belong to a car of the same model: the front and back of a Toyota Camry, of a Toyota Corolla, etc. Different car models here are analogous to different clones within a heterogeneous tumor, where each clone is expected to have very similar, yet not completely identical, DNA and RNA signatures across all cells within the clone.”
Single-cell sequencing has developed significantly over the past decade, driving discovery across various fields of biology. This sequencing technique is an effective tool for studying how changes at the level of the genetic code impact cells’ makeup or functioning, making it easier to track the types of transformations that turn a population of healthy cells into malignant tissue.
“Cancer cells demonstrate abnormal RNA patterns, and one of the reasons for that is DNA mutations,” Edrisi said.
In their quest to identify the best tool for the task, the researchers tested a variety of methods against a real biological dataset with known matching DNA-RNA pairs.
“We tested the state-of-the-art method, named clonealign, and the other widely used methods using a real dataset with ground truth information for accuracy measurement,” Edrisi said. “Interestingly, using this dataset was one of the novelties in our work. Previous studies relied on simulated data for accuracy measurements, even though there is no scientific consensus as to how to go about simulating such data.”
Of the different machine learning technologies they tested, the researchers found that pairing a classical correlation coefficient with the maximum weighted bipartite matching algorithm yielded the most accurate results. MaCroDNA, which is built on this combination, outperformed clonealign by a significant margin.
“The surprising part of our work was that using the classical correlation instead of clonealign’s complicated formula and incorporating it in an algorithm from the 1950s led to the best accuracy we have ever witnessed,” Edrisi said. “The lesson is that we should never judge an algorithm based on its complexity. Give it a shot, and make sure it is compared to the others in a fair setting.”
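To make the idea concrete, here is a minimal sketch of how a correlation coefficient and maximum weighted bipartite matching can be combined. It is an illustration only, not the MaCroDNA implementation: the function names, the choice of Pearson correlation, the toy numbers, and the brute-force search are all assumptions made for this example.

```python
from itertools import permutations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: treat a constant profile as uncorrelated
    return cov / sqrt(var_x * var_y)


def match_cells(dna_profiles, rna_profiles):
    """Pair each RNA cell with a distinct DNA cell, maximizing total correlation.

    Hypothetical helper: it tries every possible assignment, which is exact
    but only feasible for tiny inputs. Assumes there are at least as many
    DNA cells as RNA cells.
    """
    # Edge weights of the bipartite graph: rows are RNA cells, columns DNA cells.
    weights = [[pearson(r, d) for d in dna_profiles] for r in rna_profiles]
    best_score, best_assignment = float("-inf"), None
    for perm in permutations(range(len(dna_profiles)), len(rna_profiles)):
        score = sum(weights[i][j] for i, j in enumerate(perm))
        if score > best_score:
            best_score, best_assignment = score, perm
    # Each pair is (RNA cell index, DNA cell index).
    return list(enumerate(best_assignment)), best_score


# Made-up toy data: each row is a cell, each column a gene.
dna = [[2, 2, 4, 4], [4, 2, 2, 4], [2, 4, 4, 2]]  # copy-number-like values
rna = [[9, 1, 2, 8], [1, 2, 9, 8], [2, 9, 8, 1]]  # expression-like values

pairs, total = match_cells(dna, rna)
print(pairs)  # [(0, 1), (1, 0), (2, 2)]
```

For realistically sized datasets, trying every assignment is not practical. A polynomial-time solver for the assignment problem, such as `scipy.optimize.linear_sum_assignment` called with `maximize=True`, would be one way to fill the same role.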
The method is available for use in cancer research on the role of DNA-RNA dynamics in the emergence of cancer.
Nakhleh is the William and Stephanie Sick Dean of Rice’s George R. Brown School of Engineering and a professor of computer science and biosciences.
The research was supported in part by the National Science Foundation (1812822, 2106837).