On Sept. 14, OpenAI researchers published a not-yet-peer-reviewed paper, “Why Language Models Hallucinate,” on arXiv. Gemini 2.5 Flash summarized the paper’s findings: “Systemic Problem: Hallucinations are not simply bugs but a systemic consequence of how AI models are trained and evaluated. Evaluation Incentives: Standard evaluation methods, particularly binary grading systems, reward models for generating an answer, even if it’s incorrect, and punish them for admitting uncertainty. Pressure to Guess: This creates a statistical pressure for large language models (LLMs) to guess rather than say ‘I don’t know,’ as guessing can improve test scores even with the risk of being wrong.”
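
The incentive described in the summary can be illustrated with a little arithmetic. The sketch below is only an illustration, not the paper's method: the function names, the 25% guess accuracy, and the penalty value are all assumptions made for the example. It compares the expected score of guessing against the score for abstaining (assumed here to be zero), first under binary grading, where a wrong answer costs nothing, and then under a hypothetical scheme that subtracts points for wrong answers.

```python
# Illustrative sketch only: names and numbers are assumptions, not from the paper.

ABSTAIN_SCORE = 0.0  # assumed score for answering "I don't know"


def expected_score(p_correct: float, wrong_penalty: float = 0.0) -> float:
    """Expected score of answering when the answer is right with probability p_correct.

    A correct answer earns 1 point; a wrong answer loses `wrong_penalty` points.
    With wrong_penalty=0.0 this is plain binary (right/wrong) grading.
    """
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty


def should_guess(p_correct: float, wrong_penalty: float = 0.0) -> bool:
    """True when guessing has a higher expected score than abstaining."""
    return expected_score(p_correct, wrong_penalty) > ABSTAIN_SCORE


if __name__ == "__main__":
    p = 0.25  # hypothetical: the model's guess is right one time in four
    print("binary grading, guess pays off:", should_guess(p))
    print("penalty of 1 for wrong answers, guess pays off:", should_guess(p, wrong_penalty=1.0))
```

Under binary grading, any nonzero chance of being right makes guessing score better on average than abstaining, which is the pressure the summary describes. Once wrong answers carry a cost, a low-confidence guess can have a negative expected score, so abstaining becomes the better choice.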
