Smart Again
A tournament tried to test how well experts could forecast AI progress. They were all wrong.

September 5, 2025


Two of the smartest people I follow in the AI world recently sat down to check in on how the field is going.

One was François Chollet, creator of the widely used Keras library and author of the ARC-AGI benchmark, which tests if AI has reached “general” or broadly human-level intelligence. Chollet has a reputation as a bit of an AI bear, eager to deflate the most boosterish and over-optimistic predictions of where the technology is going. But in the discussion, Chollet said his timelines have gotten shorter recently. Researchers had made big progress on what he saw as the major obstacles to achieving artificial general intelligence, like models’ weakness at recalling and applying things they learned before.

Chollet’s interlocutor — Dwarkesh Patel, whose podcast has become the single most important place for tracking what top AI scientists are thinking — had moved in the opposite direction in reaction to his own reporting. While humans are great at learning continuously or “on the job,” Patel has grown more pessimistic that AI models will gain this skill any time soon.

“[Humans are] learning from their failures. They’re picking up small improvements and efficiencies as they work,” Patel noted. “It doesn’t seem like there’s an easy way to slot this key capability into these models.”

All of which is to say, two very plugged-in, smart people who know the field as well as anyone else can come to perfectly reasonable yet contradictory conclusions about the pace of AI progress.

In that case, how is someone like me, who’s certainly less knowledgeable than Chollet or Patel, supposed to figure out who’s right?

The forecaster wars, three years in

One of the most promising approaches I’ve seen to resolving — or at least adjudicating — these disagreements comes from a small group called the Forecasting Research Institute.

In the summer of 2022, the institute began what it calls the Existential Risk Persuasion Tournament (XPT for short). XPT was intended to “produce high-quality forecasts of the risks facing humanity over the next century.” To do this, the researchers (including Penn psychologist and forecasting pioneer Philip Tetlock and FRI head Josh Rosenberg) surveyed subject matter experts who study threats that could at least conceivably jeopardize humanity’s survival (like AI).

But they also put the same questions to “superforecasters,” a group of people identified by Tetlock and others who have proven unusually accurate at predicting events in the past. The superforecaster group was not made up of experts on existential threats to humanity but rather of generalists from a variety of occupations with solid predictive track records.

On each risk, including AI, there were big gaps between the area-specific experts and the generalist forecasters. The experts were much more likely than the generalists to say that the risk they study could lead to either human extinction or mass deaths. This gap persisted even after the researchers had the two groups engage in structured discussions meant to identify why they disagreed.

The two just had fundamentally different worldviews. In the case of AI, subject matter experts thought the burden of proof should be on skeptics to show why a hyper-intelligent digital species wouldn’t be dangerous. The generalists thought the burden of proof should be on the experts to explain why a technology that doesn’t even exist yet could kill us all.

So far, so intractable. Luckily for us observers, each group was asked not only to estimate long-term risks over the next century, which can’t be confirmed any time soon, but also to predict events in the nearer future. Specifically, they were tasked with forecasting the pace of AI progress in the short, medium, and long run.

In a new paper, the authors — Tetlock, Rosenberg, Simas Kučinskas, Rebecca Ceppas de Castro, Zach Jacobs, Jordan Canedy, and Ezra Karger — go back and evaluate how well the two groups fared at predicting the three years of AI progress since summer 2022.

In theory, this could tell us which group to believe. If the concerned AI experts proved much better at predicting what would happen between 2022 and 2025, perhaps that’s an indication that they have a better read on the longer-run future of the technology, and therefore that we should give their warnings greater credence.

Alas, in the words of Ralph Fiennes, “Would that it were so simple!” It turns out the three-year results leave us without much more sense of who to believe.

Both the AI experts and the superforecasters systematically underestimated the pace of AI progress. Across four benchmarks, the actual performance of state-of-the-art models in summer 2025 was better than either superforecasters or AI experts predicted (though the latter were closer). For instance, superforecasters thought an AI wouldn’t win gold at the International Mathematical Olympiad until 2035; experts thought 2030. It happened this summer.

“Overall, superforecasters assigned an average probability of just 9.7 percent to the observed outcomes across these four AI benchmarks,” the report concluded, “compared to 24.6 percent from domain experts.”

That makes the domain experts look better: they assigned higher odds to what actually happened. But when the authors crunched the numbers across all questions, they concluded that there was no statistically significant difference in aggregate accuracy between the domain experts and the superforecasters. What’s more, there was no correlation between how accurate someone was in projecting the year 2025 and how dangerous they thought AI or other risks were. Prediction remains hard, especially about the future, and especially about the future of AI.

The only trick that reliably worked was aggregating everyone’s forecasts — lumping all the predictions together and taking the median produced substantially more accurate forecasts than any one individual or group. We may not know which of these soothsayers are smart, but the crowds remain wise.
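The mechanics of that aggregation are simple. Here is a minimal sketch in Python; the forecasters, probabilities, and outcomes are invented for illustration, and the Brier score used to grade accuracy is a standard measure for probability forecasts, though not necessarily the exact metric FRI used:

```python
import statistics

# Hypothetical probability forecasts for four yes/no questions
# (rows: forecasters, columns: questions).
forecasts = [
    [0.90, 0.80, 0.95, 0.70],  # a bullish forecaster
    [0.20, 0.10, 0.30, 0.05],  # a bearish forecaster
    [0.70, 0.30, 0.50, 0.50],
    [0.40, 0.50, 0.80, 0.20],
]
outcomes = [1, 0, 1, 0]  # what actually happened (1 = event occurred)

def brier(probs, outcomes):
    """Mean squared error between probabilities and outcomes; lower is better."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(outcomes)

individual_scores = [brier(f, outcomes) for f in forecasts]

# Aggregate by taking the median forecast on each question.
aggregate = [statistics.median(q) for q in zip(*forecasts)]
aggregate_score = brier(aggregate, outcomes)

print("individuals:", [round(s, 3) for s in individual_scores])
print("median aggregate:", round(aggregate_score, 3))
```

With these made-up numbers, the median aggregate scores better than every individual forecaster, because the bullish and bearish errors partially cancel out.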

Perhaps I should have seen this outcome coming. Ezra Karger, an economist and co-author on both the initial XPT paper and this new one, told me upon the first paper’s release in 2023 that, “over the next 10 years, there really wasn’t that much disagreement between groups of people who disagreed about those longer run questions.” That is, they already knew that the predictions of people worried about AI and people less worried were pretty similar.

So, it shouldn’t surprise us too much that one group wasn’t dramatically better than the other at predicting the years 2022–2025. The real disagreement wasn’t about the near-term future of AI but about the danger it poses in the medium and long run, which is inherently harder to judge and more speculative.

There is, perhaps, some valuable information in the fact that both groups underestimated the rate of AI progress: perhaps that’s a sign that we have all underestimated the technology, and it’ll keep improving faster than anticipated. Then again, the predictions were all made in 2022, before the release of ChatGPT that November. Who do you remember, before that app’s rollout, predicting that AI chatbots would become ubiquitous in work and school? Didn’t we already know that AI made big leaps in capabilities between 2022 and 2025? And does that tell us anything about whether the technology will keep up this pace rather than slow down, which, in turn, would be key to forecasting its long-term threat?

Reading the latest FRI report, I wound up in a similar place to my former colleague Kelsey Piper last year. Piper noted that failing to extrapolate trends, especially exponential trends, out into the future has led people badly astray in the past. The fact that relatively few Americans had Covid in January 2020 did not mean Covid wasn’t a threat; it meant that the country was at the start of an exponential growth curve. A similar kind of failure would lead one to underestimate AI progress and, with it, any potential existential risk.

At the same time, in most contexts, exponential growth can’t go on forever; it maxes out at some point. It’s remarkable that, say, Moore’s law has broadly predicted the growth in microprocessor density accurately for decades — but Moore’s law is famous in part because it’s unusual for trends about human-created technologies to follow so clean a pattern.
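To see why naive extrapolation is so treacherous: an exponential curve and a saturating, S-shaped (logistic) curve can be nearly indistinguishable early on, then tell opposite stories once extrapolated. A toy sketch, with all parameters invented for illustration:

```python
import math

def exponential(t):
    """Unbounded growth: doubles roughly every 1.4 time steps."""
    return math.exp(0.5 * t)

def logistic(t, ceiling=100.0):
    """Same early growth rate, but saturates at a ceiling."""
    return ceiling / (1.0 + math.exp(-0.5 * (t - 9.21)))

# Early on, the two curves are nearly indistinguishable...
for t in (0, 2, 4):
    print(t, round(exponential(t), 2), round(logistic(t), 2))

# ...but extrapolated forward, one explodes while the other flattens.
for t in (12, 20):
    print(t, round(exponential(t), 1), round(logistic(t), 1))
```

Fitting only the early data points gives you almost no power to distinguish the two regimes, which is exactly the forecaster's dilemma.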

“I’ve increasingly come to believe that there is no substitute for digging deep into the weeds when you’re considering these questions,” Piper concluded. “While there are questions we can answer from first principles, [AI progress] isn’t one of them.”

I fear she’s right — and that, worse, mere deference to experts doesn’t suffice either, not when experts disagree with each other on both specifics and broad trajectories. We don’t really have a good alternative to trying to learn as much as we can as individuals and, failing that, waiting and seeing. That’s not a satisfying conclusion to a newsletter — or a comforting answer to one of the most important questions facing humanity — but it’s the best I can do.
