Olgica Milenkovic’s group has been developing machine learning approaches that can tell revealing new stories about biological phenomena—but her work has very old roots.
Around 300 BC, the Greek mathematician Euclid wrote Elements, a still-famous compilation of foundational mathematical principles. Among other things, it articulates the relationships among lines and shapes as they appear on a flat plane—the basis of what today is called “Euclidean” geometry.
It took over two thousand years for mathematicians to realize they could develop other, “non-Euclidean” geometries, such as spherical geometry—which, instead of flat planes, considers the convex surfaces of three-dimensional spheres—and hyperbolic geometry, which considers concave surfaces such as saddle shapes.
Non-Euclidean geometries have long been used, for example, in applications that must account for the curvature of the Earth’s surface, such as planning of airplane routes. But recent research has revealed that analyses based on non-Euclidean spaces can also serve as tools for understanding biological mechanisms.
“The really interesting thing is that your learning methods can do much better on biological data if you recognize and use their true geometry. If you learn in Euclidean spaces, you may be missing out on significant performance improvements or learning the wrong phenomenon,” says Milenkovic, who is a Donald Biggar Willett Scholar and Franklin W. Woeltge Professor in ECE and CSL.
“One Fields Medalist, Bourgain, basically established that if you work with tree-structured data, you shouldn’t use Euclidean spaces, because the tree distances cannot be preserved without significant distortion,” she explains. “But if you embed such data in negatively curved hyperbolic spaces, then you can get arbitrarily small errors. The distances can be nicely preserved.”
What does that mean? Imagine you have four children, and each of your children has four children, and each of your grandchildren has four children, and so on. Now imagine drawing that family tree on a piece of paper. It doesn’t take many generations before the exponential increase in descendants makes it difficult to represent family members by points in Euclidean spaces such that their phylogenetic distances are preserved. But if you embed the data into hyperbolic space, the expanding path-data about your progeny will fit well across a surface that continually broadens as it curves outwards.
Thus, arrangement of various types of data in non-Euclidean spaces can support easy-to-interpret visual
representations—and that can be valuable in itself. But it also enables far deeper analysis of the data. That insight is at the core of Milenkovic’s work in this area.
Spherical geometry turns out to be well-suited for representing data about cyclical processes, such as the functions of a cell cycle. Hyperbolic geometry is ideal for situations in which there are generations of progeny, as in your family tree. Such hierarchical descendant trees are “very prevalent in biological data,” notes Milenkovic, and hyperbolic geometry can be used to represent them without distortion and with low dimensionality.
The approaches she is developing for analysis of non-Euclidean arrangements of data could have important real-world significance, notably for medical applications. For example, her machine-learning algorithms for classifying data could be used for cell typing, cell trajectory modeling, studies of RNA expression dynamics, and cancer phylogeny.
“Cell typing is important for absolutely everything,” she says. “If you want to figure out what kind of cancer a person has… and how to treat the person, you need to know the trajectories of the abnormalities accumulated in different cells.”
Right now, under her second Chan Zuckerberg Initiative grant, she’s also doing work that includes use of hyperbolic geometry to study “nucleosome organization”: the way DNA is arranged within a nucleus. “Everyone thinks DNA is a long, long string, which is not true at all, because your DNA has to fit in a tiny, tiny nucleus!… It’s really a humongous coil.” It’s another area in which her work has potential to advance medical science. “A lot of illnesses arise because the organization, the folding, of this thing is messed up,” she says.
The analysis tools she’s developing should lead to better understanding of those phenomena.
For more information, see the 2023 work by Milenkovic’s group (including Eli Chien and Chao Pan, as well as former student Puoya Tabaghi, now at UCSD) on classifiers for hyperbolic spaces in the Knowledge and Information Systems journal. It is an invited, extended version of an earlier paper from the 2021 IEEE International Conference on Data Mining. Her paper on denoising of tree-like data in hyperbolic space appeared in the 2022 ACM SIGKDD Conference on Knowledge Discovery and Data Mining. A paper on nucleosome organization is in submission.