Replacing hype about artificial intelligence with accurate measurements of success

The hype surrounding machine learning, a form of artificial intelligence, can make it seem like it is only a matter of time before such techniques are used to solve all scientific problems. While impressive claims are often made, those claims do not always hold up under scrutiny. Machine learning may be useful for solving some problems but falls short for others.

In a new paper in Nature Machine Intelligence, researchers at the U.S. Department of Energy’s Princeton Plasma Physics Laboratory (PPPL) and Princeton University performed a systematic review of research comparing machine learning to traditional methods for solving fluid-related partial differential equations (PDEs). Such equations are important in many scientific fields, including the plasma research that supports the development of fusion power for the electricity grid. 

The researchers found that comparisons between machine learning methods for solving fluid-related PDEs and traditional methods are often biased in favor of machine learning methods. They also found that negative results were consistently underreported. They suggest rules for performing fair comparisons but argue that cultural changes are also needed to fix what appear to be systemic problems.

“Our research suggests that, though machine learning has great potential, the present literature paints an overly optimistic picture of how machine learning works to solve these particular types of equations,” said Ammar Hakim, PPPL’s deputy head of computational science and the principal investigator on the research. 

Comparing results to weak baselines

PDEs are ubiquitous in physics and are particularly useful for explaining natural phenomena, such as heat, fluid flow and waves. For example, these kinds of equations can be used to figure out the temperatures along the length of a spoon placed in hot soup. Knowing the initial temperature of the soup and the spoon, as well as the type of metal in the spoon, a PDE could be used to determine the temperature at any point along the utensil at a given time after it was placed in the soup. Such equations are used in plasma physics, as many of the equations that govern plasmas are mathematically similar to those of fluids.
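To make the spoon example concrete, here is a minimal Python sketch that treats the spoon as a one-dimensional rod governed by the heat equation and steps it forward in time with a simple explicit finite-difference scheme. It is an illustration only and is not taken from the paper. The function name, the spoon length, the temperatures and the thermal diffusivity are all assumed placeholder values, and the model assumes one end is held at the soup's temperature while the other end is insulated.

```python
import math


def spoon_temperatures(length_m=0.15, alpha=1.0e-5, soup_temp=90.0,
                       room_temp=20.0, n_points=31, t_end=60.0):
    """Approximate the temperature along a spoon after t_end seconds.

    Solves u_t = alpha * u_xx with an explicit finite-difference scheme.
    All default values are illustrative assumptions, not measured data.
    """
    dx = length_m / (n_points - 1)
    # The explicit scheme is only stable if alpha * dt / dx^2 <= 0.5.
    n_steps = math.ceil(t_end / (0.4 * dx * dx / alpha))
    dt = t_end / n_steps
    coef = alpha * dt / (dx * dx)

    # Initial state: spoon at room temperature, tip dipped in the soup.
    u = [room_temp] * n_points
    u[0] = soup_temp

    for _ in range(n_steps):
        new_u = u[:]
        for i in range(1, n_points - 1):
            new_u[i] = u[i] + coef * (u[i - 1] - 2.0 * u[i] + u[i + 1])
        new_u[-1] = new_u[-2]  # insulated handle end (zero temperature gradient)
        u = new_u
    return u


profile = spoon_temperatures()
for index in (0, 5, 10, 30):
    print(f"point {index:2d}: {profile[index]:.1f} degrees C")
```

The tip stays at the soup's temperature, and the temperature falls off toward the handle. Changing `alpha` stands in for changing the type of metal in the spoon.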

Scientists and engineers have developed various mathematical approaches to solving PDEs. One family of approaches, known as numerical methods, computes approximate solutions to problems that are difficult or impossible to solve exactly, rather than solving them analytically or symbolically. Recently, researchers have explored whether machine learning can be used to solve these PDEs, with the goal of solving them faster than other methods can.

The systematic review found that in most journal articles, machine learning hasn’t been as successful as advertised. “Our research indicates that there might be some cases where machine learning can be slightly faster for solving fluid-related PDEs, but in most cases, numerical methods are faster,” said Nick McGreivy. McGreivy is the lead author of the paper and recently completed his doctorate at the Princeton Program in Plasma Physics.

Numerical methods have a fundamental trade-off between accuracy and runtime. “If you spend more time to solve the problem, you’ll get a more accurate answer,” McGreivy said. “Many papers didn’t take that into account in their comparisons.”
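The trade-off McGreivy describes can be seen in a small experiment. The Python sketch below is an illustration, not code from the paper. It solves a heat equation whose exact solution is known, using an explicit finite-difference scheme at three grid resolutions, and reports both the error and the amount of work. The function name is hypothetical, and the number of cell updates is used as a stand-in for runtime.

```python
import math


def heat_error_and_work(n_cells, t_end=0.1, r=0.4):
    """Solve u_t = u_xx on [0, 1] with u(x, 0) = sin(pi * x), u = 0 at both ends.

    Returns (max error against the exact solution, number of cell updates).
    The exact solution is exp(-pi^2 * t) * sin(pi * x).
    """
    dx = 1.0 / n_cells
    # Keep dt / dx^2 at or below r (and below the stability limit of 0.5).
    n_steps = math.ceil(t_end / (r * dx * dx))
    dt = t_end / n_steps
    coef = dt / (dx * dx)

    u = [math.sin(math.pi * i * dx) for i in range(n_cells + 1)]
    for _ in range(n_steps):
        interior = [u[i] + coef * (u[i - 1] - 2.0 * u[i] + u[i + 1])
                    for i in range(1, n_cells)]
        u = [0.0] + interior + [0.0]

    decay = math.exp(-math.pi ** 2 * t_end)
    error = max(abs(u[i] - decay * math.sin(math.pi * i * dx))
                for i in range(n_cells + 1))
    work = n_steps * (n_cells - 1)
    return error, work


for n in (10, 20, 40):
    error, work = heat_error_and_work(n)
    print(f"cells={n:3d}  max error={error:.2e}  cell updates={work}")
```

Each refinement lowers the error but requires many more cell updates. A speed comparison that ignores accuracy can therefore be made to favor either side simply by choosing the resolution.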

Furthermore, there can be a dramatic difference in speed between numerical methods. In order to be useful, machine learning methods need to outperform the best numerical methods, McGreivy said. Yet his research found that comparisons were often being made to numerical methods that were much slower than the fastest methods.

Two rules for making fair comparisons

Consequently, the paper proposes two rules to try to overcome these problems. The first rule is to only compare machine learning methods against numerical methods of either equal accuracy or equal runtime. The second is to compare machine learning methods to an efficient numerical method. 
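The first rule can be expressed as a simple check. The Python sketch below is one possible reading of the rule, not the authors' procedure. The `Run` class, the `fair_speedup` function and every number in the example are invented for illustration. The second rule cannot be enforced in code; it depends on the baseline runs coming from an efficient numerical method in the first place.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One solver run: its error and its runtime in seconds (hypothetical record)."""
    error: float
    runtime_s: float


def fair_speedup(ml_run, baseline_runs):
    """Speedup of the machine learning run at equal-or-better baseline accuracy.

    Only baseline runs at least as accurate as the machine learning run are
    considered, and the fastest of those is used. Returns None when no
    baseline run is accurate enough to compare against.
    """
    eligible = [b for b in baseline_runs if b.error <= ml_run.error]
    if not eligible:
        return None
    fastest = min(eligible, key=lambda b: b.runtime_s)
    return fastest.runtime_s / ml_run.runtime_s


# Made-up numbers, for illustration only.
ml = Run(error=1e-2, runtime_s=1.0)
baseline = [
    Run(error=1e-4, runtime_s=100.0),  # very fine grid
    Run(error=1e-3, runtime_s=12.0),   # medium grid
    Run(error=1e-2, runtime_s=0.8),    # coarse grid, same accuracy as the ML run
]

naive = baseline[0].runtime_s / ml.runtime_s
print(f"speedup against the most accurate baseline: {naive:.1f}x")
print(f"speedup at equal accuracy: {fair_speedup(ml, baseline):.1f}x")
```

In this invented example, comparing against the most accurate baseline suggests a large speedup, while the equal-accuracy comparison shows the machine learning run is slightly slower.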

Of the 82 journal articles studied, 76 claimed that the machine learning method outperformed a numerical method. The researchers found that 79% of those articles touting a machine learning method as superior actually had a weak baseline, breaking at least one of those rules. Four of the articles reported that the machine learning method underperformed a numerical method, and two reported similar or varied performance.
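The figures in this breakdown can be checked with a few lines of arithmetic. The short Python sketch below uses only the numbers quoted in this article. The count of weak-baseline articles it prints is an estimate, since the 79% figure is itself rounded.

```python
total_articles = 82
claimed_better = 76
claimed_worse = 4
similar_or_varied = 2

# The three categories should account for every article reviewed.
categories_sum = claimed_better + claimed_worse + similar_or_varied
print(f"categories add up to total: {categories_sum == total_articles}")

# Share of all articles claiming that machine learning did better.
share_better = claimed_better / total_articles
print(f"share claiming better performance: {share_better:.1%}")

# Approximate number of those articles with a weak baseline (79% of 76).
weak_baseline_estimate = round(0.79 * claimed_better)
print(f"estimated articles with a weak baseline: {weak_baseline_estimate}")
```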

“Very few articles reported worse performance with machine learning, not because machine learning almost always does better, but because researchers almost never publish articles where machine learning does worse,” McGreivy said.

The researchers created the image above to convey the cumulative effects of weak baselines and reporting biases on samples. The circles or hexagons represent articles. Green indicates a positive result, for example, that the machine learning method was faster than the numerical method, while red represents a negative result. Column (a) shows what the results would likely look like if strong baselines were used and reporting bias was not an issue. Column (b) depicts the likely results with weak baselines but without reporting bias. Column (c) shows the actual results seen in the published literature. (Image credit: Nick McGreivy)

McGreivy thinks low-bar comparisons are often driven by perverse incentives in academic publishing. “In order to get a paper accepted, it helps to have some impressive results. This incentivizes you to make your machine learning model work as well as possible, which is good. However, you can also get impressive results if the baseline method you’re comparing to doesn’t work very well. As a result, you aren’t incentivized to improve your baseline, which is bad,” he said. The net result is that researchers end up working hard on their models but not on finding the best possible numerical method as a baseline for comparison.

The researchers also found evidence of reporting biases, including publication bias and outcome reporting bias. Publication bias occurs when a researcher chooses not to publish their results after realizing that their machine learning model doesn’t perform better than a numerical method, while outcome reporting bias can involve discarding negative results from the analyses or using nonstandard measures of success that make machine learning models appear more successful. Collectively, reporting biases tend to suppress negative results and create an overall impression that machine learning is better at solving fluid-related PDEs than it actually is.

“There’s a lot of hype in the field. Hopefully, our work lays guidelines for principled approaches to use machine learning to improve the state-of-art,” Hakim said.

To overcome these systemic, cultural issues, Hakim argues that agencies funding research and large conferences should adopt policies to prevent the use of weak baselines or require a more detailed description of the baseline used and the reasons it was selected. “They need to encourage their researchers to be skeptical of their own results,” Hakim said. “If I find results that seem too good to be true, they probably are.”
