By OpenAI's own testing, its newest reasoning models, o3 and o4-mini, hallucinate at significantly higher rates than o1.
First reported by TechCrunch, OpenAI's system card detailed the results of PersonQA, an evaluation designed to test for hallucinations. On this evaluation, o3's hallucination rate is 33 percent and o4-mini's is 48 percent, meaning o4-mini hallucinated almost half of the time. By comparison, o1's hallucination rate is 16 percent, so o3 hallucinated about twice as often.
The system card noted how o3 "tends to make more claims overall, leading to more accurate claims as well as more inaccurate/hallucinated claims." But OpenAI doesn't know the underlying cause, simply saying, "More research is needed to understand the cause of this result."
OpenAI's reasoning models are billed as more accurate than its non-reasoning models like GPT-4o and GPT-4.5 because they use more computation to "spend more time thinking before they respond," as described in the o1 announcement. Rather than largely relying on stochastic methods to provide an answer, the o-series models are trained to "refine their thinking process, try different strategies, and recognize their mistakes."
However, the system card for GPT-4.5, which was released in February, shows a 19 percent hallucination rate on the PersonQA evaluation. The same card also compares it to GPT-4o, which had a 30 percent hallucination rate.
In a statement to Mashable, an OpenAI spokesperson said, “Addressing hallucinations across all our models is an ongoing area of research, and we’re continually working to improve their accuracy and reliability.”
Evaluation benchmarks are tricky. They can be subjective, especially if developed in-house, and research has found flaws in their datasets and even in how they evaluate models.
Plus, different evaluations rely on different benchmarks and methods to test accuracy and hallucinations. Hugging Face's hallucination benchmark evaluates models on the "occurrence of hallucinations in generated summaries" from around 1,000 public documents, and it found much lower hallucination rates across the board for major models on the market than OpenAI's evaluations did. GPT-4o scored 1.5 percent, GPT-4.5 preview 1.2 percent, and o3-mini-high with reasoning scored 0.8 percent. It's worth noting o3 and o4-mini weren't included in the current leaderboard.
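To see why these numbers are so sensitive to methodology, here is a minimal sketch of how a hallucination rate might be computed. The data and the grading function are hypothetical stand-ins, not OpenAI's PersonQA or Hugging Face's actual evaluation code; real benchmarks use their own datasets and far more careful grading.

```python
# Hypothetical sketch: computing a hallucination rate over evaluation items.
# Real benchmarks (PersonQA, summary-based leaderboards) use their own data
# and LLM or human graders rather than this toy substring check.

from dataclasses import dataclass

@dataclass
class EvalItem:
    question: str
    reference: str      # ground-truth answer
    model_answer: str   # what the model actually said

def is_hallucinated(item: EvalItem) -> bool:
    # Toy judge: flag the answer if the reference fact never appears in it.
    return item.reference.lower() not in item.model_answer.lower()

def hallucination_rate(items: list[EvalItem]) -> float:
    flagged = sum(is_hallucinated(item) for item in items)
    return flagged / len(items)

items = [
    EvalItem("Who co-founded OpenAI?", "Sam Altman",
             "It was co-founded by Sam Altman."),
    EvalItem("When was GPT-4 released?", "2023",
             "GPT-4 came out in 2021."),
]
print(f"Hallucination rate: {hallucination_rate(items):.0%}")  # 50%
```

Change the judge (substring match, LLM grader, human review) or the dataset (biographical questions versus document summaries), and the same model can land at very different rates, which is why the figures above are hard to compare directly.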
That's all to say, even industry-standard benchmarks make it difficult to assess hallucination rates.
Then there's the added complexity that models tend to be more accurate when tapping into web search to source their answers. But using ChatGPT search means OpenAI shares data with third-party search providers, and enterprise customers using OpenAI models internally might not be willing to expose their prompts to that.
Regardless, if OpenAI is saying its brand-new o3 and o4-mini models hallucinate more than its non-reasoning models, that could be a problem for its users.
UPDATE: Apr. 21, 2025, 1:16 p.m. EDT This story has been updated with a statement from OpenAI.
Topics: ChatGPT, OpenAI