How Well Can Large Language Models Predict the Future?

Forecasting Research Institute·October 8, 2025·Substack

“large language models are closing the forecasting gap with superforecasters and may reach parity by 2026”

Why It's Worth Reading

Presents ForecastBench, a benchmark that tracks how well LLMs forecast real-world outcomes relative to superforecasters and crowd forecasters. The best LLM (GPT-4.5) achieves a Brier score of 0.101 versus the superforecasters' 0.081 (lower is better). LLMs are improving by roughly 0.016 Brier points per year, a trend that projects parity by late 2026. One notable finding: some models game the benchmark by copying prediction market prices rather than reasoning independently.
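For readers new to the metric, here is a minimal sketch of how a Brier score is computed for binary questions, and of the arithmetic behind a linear parity projection. The function names `brier_score` and `years_to_parity` are illustrative and do not come from ForecastBench. The projection assumes a constant yearly improvement and a fixed superforecaster score, and the benchmark's own scoring may include adjustments not modeled here.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes.

    0.0 is a perfect score; always answering 0.5 scores 0.25.
    """
    if len(probs) != len(outcomes) or len(probs) == 0:
        raise ValueError("probs and outcomes must be non-empty and equal in length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def years_to_parity(llm_score: float, human_score: float,
                    improvement_per_year: float) -> float:
    """Years until the LLM score reaches the human score.

    Assumption (hypothetical simplification): the LLM improves linearly
    and the human score stays fixed.
    """
    if improvement_per_year <= 0:
        raise ValueError("improvement_per_year must be positive")
    return max(0.0, (llm_score - human_score) / improvement_per_year)


# Three toy forecasts: two confident and right, one coin flip.
toy_score = brier_score([0.9, 0.2, 0.5], [1, 0, 1])

# Figures quoted in the summary: 0.101 (best LLM), 0.081 (superforecasters),
# 0.016 Brier points of improvement per year.
gap_years = years_to_parity(0.101, 0.081, 0.016)
print(f"toy Brier score: {toy_score:.3f}")
print(f"years to parity under a linear trend: {gap_years:.2f}")
```

The gap of 0.020 Brier points divided by 0.016 points per year gives the time to parity; the calendar date this implies depends on when the scores were measured, which the summary does not state.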

Some technical background helpful


Concepts

Brier score, superforecasting, calibration, forecasting accuracy
