Research
– 5 min read
How personalized context quietly degrades AI accuracy: a deeper look
Personalization of LLMs with added context can play a significant role in improving the quality of outputs and hence user experience. In recent work, the WRITER AI Research team shows that such personalization can also lead to increased sycophancy in enterprise AI tasks. We publish two works related to this topic: The Price of Agreement, which focuses on agentic financial settings, and Recalling Too Well, which examines how memory systems amplify sycophancy across scientific, medical, and moral reasoning.
Together, they analyze and mitigate preference-induced sycophancy — a failure mode that standard benchmarks don’t capture — in which models are quietly steered by context that encodes an incorrect belief.
Since enterprise AI systems often depend on layers of user context — from direct preferences to stored memory — they are particularly susceptible to this failure mode. In high-stakes domains like finance and healthcare, a model that silently defers to a user’s prior assumptions rather than acknowledging or correcting them poses a significant reliability and trustworthiness risk.
Measuring sycophancy in financial agents
We evaluate eight frontier models on two financial benchmarks, FinanceBench and FinanceAgent. FinanceBench tests in-context extraction and reasoning over 10-K and 10-Q filings. FinanceAgent is a full agentic setting where models call tools, retrieve documents, and perform multi-step financial reasoning. For each benchmark sample, we synthetically generate adversarial user preference information in the form of a realistic analyst profile or workspace note that would plausibly appear in an enterprise AI system and that contradicts the reference answer.
We test three types of context injection.
Rebuttal: a user turn refuting the model’s answer
Contradiction: a user turn proposing an alternative answer
Personal Preference: adversarial user context injected either directly in the prompt or as a tool result simulating a memory or personalization API call.
We observe that there is a clear tradeoff between injection method and observability — while direct injection causes larger accuracy hits, models are at least somewhat more likely to flag the conflict. Agentic injection delivered as a tool result, as a real memory system would, tends to produce a smaller accuracy hit, but acknowledgment rates collapse — on FinanceAgent, most models return wrong answers with EWU above 0.90, meaning errors arrive with essentially no signal that anything is off.
We also see that larger models tend to produce wrong answers while acknowledging the conflict. Smaller models more frequently answer incorrectly without acknowledgment. Thus, the choice of model can have a significant impact on the degree of over-agreement and acknowledgment in agentic financial applications.
How memory systems amplify the same problem
The financial paper injects adversarial preferences by hand. Our second paper asks what happens when those preferences enter the context automatically, encoded into a memory system during a prior conversation and retrieved in a later one. This is the realistic long-term scenario — a user holds a misconception, the model may have gently corrected it weeks ago, but the memory system stored the user’s claim as a fact and now surfaces it on every related query.
To study this at scale, we build MIST (Memory Influence on Sycophancy Tests), a benchmark of synthetic multi-turn conversations across GPQA Diamond, MMLU Medical, and Moral Stories. For each question, we generate a plausible user misconception and simulate a conversation where a user asserts it, then feed that history into a memory system and evaluate whether the model’s answer shifts on a subsequent question.
We evaluate five frontier models across three state-of-the-art enterprise memory systems. Mem0, MemOS, and Zep. Our key baseline is the full prior conversation prepended directly to the prompt, which we call “Chat History” below.
Every model at least triples its sycophancy rate under at least one memory condition. Sonnet 4.6 goes from 1.6% with Chat History to 40.2% under Mem0 on MIST-Moral, a 25x increase. The consistency across model families makes clear this is a property of the memory layer, not any individual model.
Our variational analysis shows that when we replace extracted snippets with the raw chat history (same prompt format), sycophancy roughly halves, indicating that the cause of sycophancy is the content extracted by the memory systems themselves, as opposed to the format. Extraction encodes user claims as discrete facts while discarding the corrective context around them, such as the assistant’s pushback or the user’s eventual uncertainty.
Two mitigations substantially reduce sycophancy without degrading factual recall. One, assistant role inclusion, directly targets the propensity of memory systems to discard assistant pushback, and decreases MIST-Moral sycophancy while maintaining performance on factual recall metrics. Its implementation works cleanly with existing enterprise memory systems, and requires no changes to retrieval or formatting. Interestingly, replacing extraction entirely with an LLM-generated prose summary of (approximately equal length to the extracted memories) is our strongest mitigation technique, dropping MIST-Moral to 12.8%, below the best off-the-shelf memory system (Zep at 17.1%), while meaningfully improving factual recall. This final result calls into question what precisely is gained when we utilize complex memory systems to maintain user history.
What this means in practice
Taken together, the two papers show that what gets added to context must be treated as a first-class reliability concern. For any agentic system handling personalized context, accuracy metrics alone aren’t sufficient. Measuring whether models acknowledge the conflicts they encounter is the only way to tell the difference between a system that’s right and one that’s quietly wrong. For teams building on memory systems today, additional work is necessary to ensure that the right information is extracted and injected into context, at the risk of silent sycophancy amplification.