Friday, October 3, 2025

We’re introducing GDPval, a new evaluation that measures model performance on economically valuable, real-world tasks across 44 occupations. - OpenAI

We found that today’s best frontier models are already approaching the quality of work produced by industry experts. To test this, we ran blind evaluations where industry experts compared deliverables from several leading models—GPT‑4o, o4-mini, OpenAI o3, GPT‑5, Claude Opus 4.1, Gemini 2.5 Pro, and Grok 4—against human-produced work. Across 220 tasks in the GDPval gold set, we recorded when model outputs were rated as better than (“wins”) or on par with (“ties”) the deliverables from industry experts, as shown in a bar chart in the original post. ... We also see clear progress over time on these tasks. Performance has more than doubled from GPT‑4o (released spring 2024) to GPT‑5 (released summer 2025), following a clear linear trend. In addition, we found that frontier models can complete GDPval tasks roughly 100x faster and 100x cheaper than industry experts.
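
The headline metric is simple bookkeeping over blind pairwise grades. The sketch below shows how a win-or-tie rate of this kind could be computed; the record format, field names, and sample data are hypothetical illustrations, not OpenAI's actual schema or results.

```python
from collections import Counter

# Hypothetical blind pairwise grades: for each GDPval task, an expert grader
# compares a model deliverable against the human expert's deliverable without
# knowing which is which. Verdicts: "win" (model better), "tie" (on par),
# "loss" (human better). The data below is made up for illustration.
grades = [
    {"task_id": "t001", "model": "GPT-5", "verdict": "win"},
    {"task_id": "t002", "model": "GPT-5", "verdict": "tie"},
    {"task_id": "t003", "model": "GPT-5", "verdict": "loss"},
    {"task_id": "t001", "model": "GPT-4o", "verdict": "loss"},
    {"task_id": "t002", "model": "GPT-4o", "verdict": "tie"},
    {"task_id": "t003", "model": "GPT-4o", "verdict": "loss"},
]

def win_or_tie_rate(grades, model):
    """Fraction of graded tasks where the model's deliverable was rated
    better than or on par with the industry expert's deliverable."""
    verdicts = Counter(g["verdict"] for g in grades if g["model"] == model)
    total = sum(verdicts.values())
    return (verdicts["win"] + verdicts["tie"]) / total if total else 0.0

for model in ("GPT-4o", "GPT-5"):
    print(f"{model}: {win_or_tie_rate(grades, model):.0%} wins or ties")
```

The "more than doubled from GPT‑4o to GPT‑5" claim in the excerpt is a comparison of rates of exactly this kind across model generations.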

Thursday, October 2, 2025

We urgently call for international red lines to prevent unacceptable AI risks. - AI Red Lines

Some advanced AI systems have already exhibited deceptive and harmful behavior, and yet these systems are being given more autonomy to take actions and make decisions in the world. Many experts, including those at the forefront of development, warn that if these systems are left unchecked, it will become increasingly difficult to exert meaningful human control in the coming years. Governments must act decisively before the window for meaningful intervention closes. An international agreement on clear and verifiable red lines is necessary for preventing universally unacceptable risks. These red lines should build upon and enforce existing global frameworks and voluntary corporate commitments, ensuring that all advanced AI providers are accountable to shared thresholds. We urge governments to reach an international agreement on red lines for AI — ensuring they are operational, with robust enforcement mechanisms — by the end of 2026.


Wednesday, October 1, 2025

AI Hallucinations May Soon Be History - Ray Schroeder, Inside Higher Ed

On Sept. 14, OpenAI researchers published a not-yet-peer-reviewed paper, “Why Language Models Hallucinate,” on arXiv. Gemini 2.5 Flash summarized the findings of the paper:

"Systemic Problem: Hallucinations are not simply bugs but a systemic consequence of how AI models are trained and evaluated.

Evaluation Incentives: Standard evaluation methods, particularly binary grading systems, reward models for generating an answer, even if it’s incorrect, and punish them for admitting uncertainty.

Pressure to Guess: This creates a statistical pressure for large language models (LLMs) to guess rather than say “I don’t know,” as guessing can improve test scores even with the risk of being wrong."
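
The incentive argument summarized above reduces to an expected-value comparison. The toy sketch below illustrates it; the confidence level and penalty value are made up for illustration and are not taken from the paper.

```python
def expected_score(p_correct: float, wrong_penalty: float = 0.0) -> float:
    """Expected grade from guessing: a correct answer earns 1 point, a wrong
    answer loses `wrong_penalty` points, and abstaining always scores 0."""
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty

p = 0.2  # model is only 20% confident its guess is right

# Binary grading (no penalty for wrong answers): guessing scores 0.2 on
# average versus 0.0 for abstaining, so even a low-confidence guess is
# rewarded. This is the statistical pressure the paper describes.
print("binary grading:    guess =", expected_score(p), " abstain = 0.0")

# A scoring rule that penalizes wrong answers can flip the incentive,
# making "I don't know" the better-scoring choice for a low-confidence guess.
print("penalized grading: guess =", expected_score(p, wrong_penalty=0.5), " abstain = 0.0")
```

Under the binary rule the guess yields an expected 0.2 versus 0.0 for abstaining; with a 0.5-point penalty for wrong answers the same guess yields -0.2, so abstaining wins.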