“Multiple AI agents voting together beat single models, but letting them deliberate makes predictions worse”
Evaluates whether multi-agent LLM architectures can resolve prediction market outcomes more accurately than single-model baselines. Tests independent aggregation and deliberative consensus against GPT-5 Nano, DeepSeek V3, and Llama-3.3-70B on 1,189 resolved questions from KalshiBench. Finds that confidence-weighted voting across agents edges past single models, while deliberation degrades accuracy. Proposes a hybrid system that auto-resolves unanimous high-confidence questions and flags disagreements for human review.
Some technical background helpful
Platforms mentioned: Kalshi
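
The confidence-weighted voting and hybrid routing described in the summary can be illustrated with a minimal sketch. This is not the paper's implementation: the `Vote` type, the function names, the YES/NO outcome labels, and the 0.8 confidence threshold are all assumptions chosen for illustration, and the summary does not say how the authors define "high-confidence".

```python
from collections import defaultdict
from typing import NamedTuple


class Vote(NamedTuple):
    agent: str         # hypothetical agent label
    outcome: str       # e.g. "YES" or "NO" (assumed binary market)
    confidence: float  # self-reported confidence in [0, 1]


def weighted_vote(votes):
    """Return (winning outcome, its share of the total confidence mass)."""
    if not votes:
        raise ValueError("need at least one vote")
    totals = defaultdict(float)
    for v in votes:
        totals[v.outcome] += v.confidence
    # On an exact tie, max() keeps the outcome that appeared first.
    winner = max(totals, key=totals.get)
    mass = sum(totals.values())
    share = totals[winner] / mass if mass > 0 else 0.0
    return winner, share


def route(votes, min_confidence=0.8):
    """Auto-resolve only when every agent agrees and each is confident.

    min_confidence is an assumed threshold, not a value from the paper.
    """
    outcomes = {v.outcome for v in votes}
    unanimous = len(outcomes) == 1
    confident = all(v.confidence >= min_confidence for v in votes)
    if unanimous and confident:
        return ("auto_resolve", votes[0].outcome)
    return ("human_review", None)


split = [Vote("a", "YES", 0.9), Vote("b", "YES", 0.85), Vote("c", "NO", 0.6)]
agreed = [Vote("a", "YES", 0.9), Vote("b", "YES", 0.85)]
```

Here `weighted_vote(split)` picks "YES" because its summed confidence (1.75) exceeds that of "NO" (0.6), yet `route(split)` still sends the question to human review because the agents disagree. Only `agreed`, where both votes match and clear the threshold, is auto-resolved.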