
Apple has made a calculated move to enhance Siri with generative AI by integrating Google Gemini. On paper, this sounds like a practical shortcut: Google’s model ranks highly in benchmarks, can scale to millions of users, and offers Apple a fast way to compete in the AI assistant race.
But benchmarks don’t tell the whole story. Siri’s reliance on Google Gemini isn’t just a curious partnership between tech rivals. By delegating a core user experience to an error-prone black box, Apple may be exposing itself to the very risks it’s spent two decades avoiding.
Behind the Benchmarks: Gemini’s Wild Hallucinations
All LLMs hallucinate. That’s well known. But users and researchers alike have noticed something peculiar about how Gemini fails. It doesn’t just make small mistakes. It often produces answers that are:
- Factually incorrect but expressed with absolute confidence
- Internally inconsistent
- Completely detached from the conversational context
- Reflective of flawed assumptions about the real world
One technical discussion on LessWrong even observed Gemini 3 behaving as though it’s trapped in a simulation, answering questions as if it’s being “evaluated” rather than interacting in a grounded, real-world way. This isn’t just weird; it signals that something deep in Gemini’s architecture or training process is misaligned with human expectations.
Users have echoed similar concerns on Reddit and other forums. One semi-recent thread details how Gemini 2.5 Pro began spitting out wildly incorrect science facts and logical errors, problems that felt more like a system glitch than a simple misstep. Gemini has since advanced to version 3 and beyond, but its core shortcomings don’t appear to have been left behind.
Academic Evaluations Echo the Worry
This isn’t just anecdotal. Formal studies are beginning to quantify the issue. One paper comparing hallucination rates across models on document-grounded question answering found that Gemini hallucinated significantly more than other models, including GPT-4, when answering questions anchored in real documents. That suggests a structural problem in how Gemini synthesizes and verifies information internally.
Even independent technical reviewers have flagged how Gemini’s multimodal capabilities, once seen as a major selling point, often result in contradictory outputs when switching between image and text inputs. That’s not a feature. That’s a fundamental coherence issue.
A Flawed Foundation and Whac-a-Mole Fixes
To be clear, Gemini’s issues aren’t unique in kind, but they may be unique in degree and source. As one analysis from TwentyFour IT explains, hallucinations are inherent to how LLMs work: they optimize for probable completions, not for truth.
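To make that concrete, here’s a minimal sketch of the idea. The prompt, the toy probabilities, and the wrong-but-popular answer are all invented for illustration; this is not Gemini’s actual decoding code.

```python
# A toy next-token distribution for the prompt below. A language model
# scores completions by probability; nothing in that objective checks truth.
next_token_probs = {
    "Sydney": 0.62,    # overrepresented in casual text, factually wrong
    "Canberra": 0.31,  # the correct answer
    "Melbourne": 0.07,
}

prompt = "The capital of Australia is"

# Greedy decoding: emit the single most probable token, confidence included.
best_token = max(next_token_probs, key=next_token_probs.get)
print(f"{prompt} {best_token}")  # -> The capital of Australia is Sydney
```

The model isn’t lying; it’s doing exactly what it was trained to do: produce the most probable continuation, whether or not that continuation happens to be true.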
But Gemini’s quirks suggest something deeper: a design or training misstep that produces errors not seen as often in competing models. And instead of rethinking the model’s foundations, Google’s updates have largely felt like whac-a-mole patches: applying guardrails, tweaking filters, layering in heuristics.
These are important stopgaps, but they’re not a cure. They don’t address how the model builds knowledge internally or how it evaluates truth. The result? Gemini often seems like it’s solving the wrong problem with the wrong mental model. It’s an issue that can’t be fixed with surface-level tweaks.
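To see why such patches are stopgaps, consider a hypothetical guardrail of the kind described above: a post-hoc filter sitting between the model and the user. The pattern list and function below are invented for illustration, not anything Google has published.

```python
import re

# A blocklist of failure modes someone has already observed and hard-coded.
KNOWN_BAD_PATTERNS = [
    re.compile(r"the sun orbits the earth", re.IGNORECASE),
    re.compile(r"humans have 12 fingers", re.IGNORECASE),
]

def guardrail(model_output: str) -> str:
    """Suppress outputs matching known-bad patterns; pass everything else."""
    for pattern in KNOWN_BAD_PATTERNS:
        if pattern.search(model_output):
            return "I'm not confident enough to answer that."
    return model_output  # any novel hallucination sails straight through
```

A filter like this catches only the mistakes someone has already seen and written down. The next novel hallucination passes untouched, which is precisely the whac-a-mole dynamic: each patch addresses a symptom while the underlying model keeps generating new ones.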
Why This Matters Deeply for Siri and Apple
Here’s where it gets serious. Gemini’s weirdness becomes Apple’s risk.
Siri is a frontline user experience. It’s how millions of people set reminders, look up facts, ask for directions, and, increasingly, get help making decisions. Plugging in a system like Gemini, which still suffers from unpredictable error modes, makes Siri a conduit for unreliable or even dangerous information.
That’s more than a UX issue. It’s a brand liability.
Apple’s entire reputation hinges on trust, privacy, and polish. Offloading the most personal layer of the iPhone to a still-developing rival’s LLM introduces performance uncertainty and philosophical conflict. It risks turning the iPhone into a glossy shell delivering someone else’s tech, and someone else’s values.
What’s most concerning isn’t that Gemini is imperfect (no model is) but that it appears to be imperfect in ways Apple can’t fully control. If the hallucinations stem from something deeply embedded in how Gemini was trained or structured, Apple may find itself trying to patch a product it didn’t build, and doesn’t truly own.
That’s not a comfortable place to be.
As I concluded in my recent post “Apple’s Siri Runs on Google Gemini, But for How Long?”, this arrangement may only be temporary. But temporary or not, it still leaves Apple exposed, precisely at the moment it’s betting its future on AI.
Jayson L. Adams is a technology entrepreneur, artist, and the award-winning and best-selling author of two science fiction thrillers, Ares and Infernum.
Jayson writes sci-fi thrillers that explore what extreme situations reveal about who we really are. His novels combine high-stakes science fiction with deeper questions about identity, courage, and human nature. You can see more at www.jaysonadams.com.