Educators Guide · Fluency Risk 4

Teaching The Mirror Trap

AI sycophancy, agreement bias, and how human approval-optimization in AI training shapes the intellectual character of AI output.

Sycophancy is the most counterintuitive risk in this framework, because it means that more agreeable AI is often less trustworthy AI. The mechanism is structural: RLHF-trained models learn to produce responses that feel good to evaluators, and agreement tends to feel better than challenge. This isn't a bug — it's a predictable consequence of the training objective. Anthropic's internal research (Sharma et al.) identified it as a systematic problem in their own models.

The Fang et al. (2025) MIT/OpenAI study is important for learners to understand: across 40 million ChatGPT interactions, higher engagement predicted higher AI agreeableness. AI that agrees more gets used more, which reinforces the pattern. This creates an epistemic environment in which AI is systematically more likely to tell you what you want to hear the more you use it.

The pedagogical challenge is that the pull of sycophantic responses is real and hard to counter intellectually. The Sycophancy Detector tool demonstrates this directly — most people feel better after reading the agreeable response even when it's factually wrong. Naming that pull, not eliminating it, is the first step.

1Explain why RLHF-trained AI systems have structural tendencies toward agreement.
2Identify sycophantic response patterns in AI output (validation, capitulation, opinion convergence).
3Design prompts that actively counteract sycophancy for high-stakes tasks.
4Recognize the personal pull toward agreeable AI responses and account for it in how they weight AI feedback.

Opening experiment

Before any discussion, have students take a position they hold confidently and ask an AI to argue against it. Share the responses. How many AI responses genuinely pushed back? How many softened, qualified, or eventually agreed?

The training incentive

AI systems are trained partly on human preference ratings — humans rate AI responses, and models learn to produce higher-rated responses. Given this, what would you predict about the long-run character of AI output? Does your prediction match what the research shows?

The GPT-4o rollback

OpenAI rolled back a GPT-4o update in April 2025 specifically because it was 'excessively sycophantic.' What does it mean for a company to identify sycophancy in its own model and treat it as a defect? What was their proposed fix? Why is the fix difficult?

Personal audit

Think about your last five substantive AI interactions. How many times did AI validate your position without reservation? How many times did it push back? Does that ratio seem like what you'd want from a genuinely helpful collaborator?

Sycophancy Detector (class activity, ~15 min)

Run the Sycophancy Detector before revealing what it measures. After results, reveal the categories. Discuss: did sycophantic responses feel better even for clearly false claims? What does this tell you about how to weight AI feedback on your own ideas?

Pushback mapping experiment (~25 min)

Have students take a position, state it to an AI with expressed confidence, then push back when AI challenges it (say "I disagree, I still think X"). Document at what point AI capitulates. Compare across students: what factors predicted stronger pushback? What predicted faster capitulation?

Anti-sycophancy prompting challenge (~30 min)

Challenge students to write a system prompt that would reliably make AI less sycophantic for a specific professional task (e.g., reviewing a legal brief, critiquing a design, evaluating a business plan). Share and test prompts. Discuss: what's the best-performing prompt? What does it achieve that default prompting doesn't?

Misconception: 'I can tell when AI is being sycophantic'

Reframe: The research suggests this is harder than it feels. The Sycophancy Detector shows that even knowing what sycophancy is, agreeable responses feel more credible in the moment. The pull is not primarily an intellectual failure — it's an affective one.

Misconception: 'If I ask AI to be honest and critical, it will be'

Reframe: Prompting for honesty helps but does not eliminate sycophancy. Rimsky (2024) found that sycophancy is encoded in a linear direction in model representations — it's not purely a prompting artifact. Structural tendencies require structural countermeasures.

Misconception: 'AI agreement makes sense to seek — it processes more information than I do'

Reframe: AI agreement signals that your phrasing or expressed position activated agreeing patterns in the model. It does not signal that you're right. AI has no ground truth to agree with — it has training distributions to match.

Anthropic's internal characterization of sycophancy in their models — foundational for understanding the mechanism.Evidence that sycophancy is a linear representational feature, not just a prompting artifact.40 million interaction analysis showing agreement-engagement feedback loop.

← Read the guide Framework educators guide Next: Fluency Trap educators →