TL;DR

LLMs are prediction engines that generate responses from the full context they’re given. In older models (everything from GPT-1 up through the 2025 generation), you could improve performance by using prompting tricks to push them toward usable output (e.g., personas, examples, “don’t do X”). Frontier models are capable enough that those same tricks can push them too far, activating patterns you don’t want.

Lean, intentionally distilled context now beats elaborate prompts, especially when prompts include extraneous details that may invoke unwanted behavior.

Here’s a prompt pattern I see constantly — a detailed persona:

> You are Jenna, a senior customer support engineer at a B2B SaaS
company with fifteen years of experience handling enterprise
accounts. You are thoughtful, empathetic, and detail-oriented,
with deep knowledge of billing systems, invoicing workflows, and
customer retention best practices. You have a warm, professional
communication style and always confirm you understand the
customer's concern before responding. You sign every message
with your name.

Respond to this customer message: 'My invoice this month shows
$499 but I thought I was on the $299 plan. Can you fix this?'

What comes back:

> Hi there,

Thank you so much for reaching out, and I completely understand
the concern — seeing a different amount than you expected on an
invoice is always stressful, and I want to make sure we get to
the bottom of this right away.

Just to confirm I have the picture right: you're expecting to be
billed at the $299/month plan rate, but this month's invoice
came through at $499. I want to make sure we resolve the
discrepancy accurately, so here's what I'll do on my end and
what would help me from yours:

**What I'll check right now:**
1. Your current subscription plan on file and the effective date
   of that plan.
2. Whether any add-ons, seat increases, or usage-based charges
   were applied this cycle (those can sometimes bump the total
   above the base plan rate).
3. Whether a recent plan change, proration, or promotional credit
   expiration might explain the difference.
4. Any account notes from your Customer Success Manager that
   might be relevant.

[...four more paragraphs...]

Warm regards,
Jenna
Senior Customer Support Engineer

Now the same task with the persona stripped out:

> Reply to this support message: 'My invoice this month shows
$499 but I thought I was on the $299 plan. Can you fix this?'
> Hi there,

Thanks for flagging this — I want to get it sorted for you
quickly.

Before I adjust anything, can you confirm a couple of details so
I can pull up the right account?

1. The email address on your account
2. The invoice number (or date) showing the $499 charge

A few things that commonly explain a jump from $299 → $499:
- A recent plan change or add-on was activated
- Extra usage/overage beyond your plan's included limits
- Proration from a mid-cycle upgrade
- An annual vs. monthly billing switch

Once I see your invoice, I'll break down exactly what you were
charged for. If it turns out you should still be on the $299
plan, I'll correct the billing and refund the difference.

Talk soon,
Support

We have the same model and task, but the persona prompt yields a long, meandering response for a short query while the stripped version is polite and to the point. No fifteen-year veteran needed. Why?

Xenopsychology#

AI development is more art than science right now. Best practices are constantly evolving and behavior varies significantly from model to model. If you want to get results out of a specific model, you have to get a sense of how the individual model “thinks”.

LLMs speak like humans, but have behavioral quirks that can be deeply counterintuitive. When you write instructions, the model doesn’t try to follow them - it attends to the entirety of the context it’s provided at inference time and tries to predict the next token. In a chat platform or agentic system, a prompt like the above might be 2% of what the model has in its context.

Every token in the context shifts the probability distribution for the response. Personas pull the completion toward persona-shaped output. Examples pull it toward example-shaped output. Negations pull it toward the negated concept.

Models also try very hard to fulfill the letter of your request. If you tell the model it is “thoughtful, empathetic, and detail-oriented, with deep knowledge of billing systems, invoicing workflows, and customer retention best practices”, it’s going to try to tick every single one of those boxes. If you add “but be brief”, you’re going to encourage weird behavior like dropped details, skipped instructions, or contradictory output that follows the average of all instructions (that is: only some instructions, randomly).

The persona didn’t just fail to help: it activated every negative association Claude has with “long-tenured enterprise support” (the hedging, the reassurance rituals, the padding) and poured them into the response. None of this is a bug or a hallucination; it only looks like one if your mental model is “the LLM reads my instructions and follows them.”

Let’s look at a couple other counterintuitive examples of xenopsychology: example pollution and negation leaks.

{{callout type="info"}} For context: all of these examples came from Opus 4.7 with an empty system prompt:

claude -p --system-prompt "" --tools "" --model claude-opus-4-7 "<prompt>"

{{/callout}}
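If you’d rather reproduce the persona comparison programmatically than one prompt at a time through the CLI, here’s a minimal sketch. It assumes the `anthropic` Python SDK (`messages.create`) and the same model id as the command above; the word counts are a crude proxy, but they make the padding obvious without reading either reply in full.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

TASK = ("Reply to this support message: 'My invoice this month shows $499 "
        "but I thought I was on the $299 plan. Can you fix this?'")

# Trimmed for brevity; the full persona text is in the example above.
PERSONA = ("You are Jenna, a senior customer support engineer at a B2B SaaS company "
           "with fifteen years of experience handling enterprise accounts. You are "
           "thoughtful, empathetic, and detail-oriented, with deep knowledge of billing "
           "systems, invoicing workflows, and customer retention best practices.")

def run(prompt: str) -> str:
    """Send a single user message with no system prompt and return the reply text."""
    response = client.messages.create(
        model="claude-opus-4-7",  # the model id from the command above
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text

print(len(run(PERSONA + "\n\n" + TASK).split()), "words with the persona")
print(len(run(TASK).split()), "words without it")
```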

Example pollution#

Ever wonder why models respond with so many bulleted lists? They were trained on tons of them.

> Q: What is the capital of France?
A:
- Paris is the capital of France
- It is located in the north-central part of the country
- The city has served as capital since the 10th century

Q: What is the capital of Japan?
A:
> - Tokyo is the capital of Japan
- It is located on the eastern coast of Honshu, the largest of Japan's main islands
- Tokyo became the capital in 1868, replacing Kyoto

The question has a one-word answer but the model gave three bullets instead.

Same question without the example:

> What is the capital of Japan?
> Tokyo.

The example in the original prompt attempted to illustrate the task, but accidentally defined the output template instead. The model matched the template whether it fit or not, because nothing else in the prompt told it what shape the answer should take.
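The fix is to make the example the same shape as the answer you actually want, and to say what that shape is. Here’s a minimal sketch of the reshaped prompt; the “Answer with just the city name” line and the one-word example answer are my additions, not part of the original prompt:

```python
# The example defines the template, so give it the template you want back:
# a bare city name, not a three-bullet mini-essay.
FEW_SHOT_PROMPT = """\
Answer with just the city name.

Q: What is the capital of France?
A: Paris

Q: What is the capital of Japan?
A:"""
```

Run through the same empty-system-prompt setup, the example now pulls toward a one-word answer instead of away from it.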

Negation leaks#

If I tell you not to think of pink elephants, you can’t help but do it. Models have the same issue.

> Write a 3-sentence product description for a mechanical keyboard
with tactile brown switches, aimed at software engineers. Don't
mention that it's clicky.
> Built for engineers who live in their editor, this mechanical
keyboard pairs tactile brown switches with a precise, low-fatigue
actuation that keeps your hands comfortable through marathon
debugging sessions. Every keypress delivers a subtle bump you can
feel but barely hear — ideal for open offices, pair programming,
or late-night commits next to a sleeping partner. Durable
keycaps, a clean layout, and solid build quality make it the
kind of tool you'll still be coding on years from now.

The model obeyed: no “clicky” appears. But look where its energy went: a subtle bump you can feel *but barely hear*, next to a sleeping partner. Half the response’s tokens were spent negotiating around a word it knew it couldn’t say.

Using positive framing instead:

> Write a 3-sentence product description for a mechanical keyboard
with tactile brown switches, aimed at software engineers.
Emphasize the smooth tactile feel and quiet operation.
> Built for engineers who live in their editors, this mechanical
keyboard pairs tactile brown switches with a refined typing
feel — a crisp bump on every keystroke without the clack that
announces it to the whole office. The dampened housing and
pre-lubed stabilizers keep operation whisper-quiet, so late-night
debugging sessions stay focused and the open-office next to you
stays undisturbed. Smooth, responsive, and deliberately
understated, it's the kind of keyboard that disappears under
your fingers and lets the code take center stage.

In the first example, the model was trying not to say something; in the second, it stayed on task and described a product.
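One cheap guardrail here is mechanical: lint prompts for negations before they ship, then rewrite the hits as positive framing. A minimal sketch; the pattern list is an illustrative assumption, not a complete taxonomy.

```python
import re

# Phrases that tend to smuggle the forbidden concept into the context.
NEGATION_PATTERNS = [r"\bdon'?t\b", r"\bdo not\b", r"\bavoid\b", r"\bnever\b"]

def find_negations(prompt: str) -> list[str]:
    """Return a snippet around each negation so it can be reframed positively."""
    hits = []
    for pattern in NEGATION_PATTERNS:
        for match in re.finditer(pattern, prompt, flags=re.IGNORECASE):
            start = max(match.start() - 20, 0)
            hits.append(prompt[start:match.end() + 40].strip())
    return hits

prompt = ("Write a 3-sentence product description for a mechanical keyboard "
          "with tactile brown switches, aimed at software engineers. "
          "Don't mention that it's clicky.")
print(find_negations(prompt))  # one hit, flagging "Don't mention that it's clicky."
```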

Context is the cure#

There’s been a recent shift from talking about prompt engineering to context engineering. When models needed to be nudged toward providing sane output in a chat-only context, prompt engineering made sense. It’s been two years and several step functions in capability, and we’re operating in a totally different space. Chat may be the surface of an agentic system, but at query time the model is attending to much, much more than the contents of your last message. If the model is provided with your message history at query time, it doesn’t need to be told how to respond: it can infer that it needs to be brief or polite from the situation at hand.

Once you see context in its entirety as the unit of control, you stop trying to write the perfect instructions and start cutting away what might pull the model in the wrong direction. A few suggestions:

  • Prune obsessively. Every token competes for attention and potentially creates inconsistencies - I’ve been getting better results by moving more and more content to skills that can be progressively loaded as needed.
  • Don’t do don’ts. “Avoid X” makes X more likely and does nothing to activate the behavior you want. State what you want instead.
  • Lead with what matters. The beginning and end of context get disproportionate weight from transformers; details in the middle disappear. If you must include long context, lead with the task and put reference material later (or, again, move to skills where they can be progressively disclosed).
  • Clear context frequently. Conversations meander and agents can’t forget. If you asked the model to be playful two turns ago and now want a legal review, the playful tokens are still impacting inference. For long-running agents, periodic context resets matter more than prompt polish (a sketch of one approach follows this list).
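Here’s what a periodic reset can look like in practice: once the transcript passes a length threshold, distill the old turns into a short factual summary and carry only that plus the latest reply forward. This is a minimal sketch assuming the `anthropic` Python SDK and the model id from the callout above; the threshold and the summarization prompt are illustrative, not a recipe.

```python
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-opus-4-7"  # the model id from the callout above

def distill(messages: list[dict]) -> list[dict]:
    """Collapse older turns into one factual paragraph; keep only the latest reply verbatim."""
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in messages[:-1])
    summary = client.messages.create(
        model=MODEL,
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": "Summarize the facts and decisions in this conversation in one "
                       "short paragraph. Drop tone and style instructions:\n\n" + transcript,
        }],
    ).content[0].text
    # Stale instructions ("be playful") are gone; only the distilled facts remain.
    return [{"role": "user", "content": "Context so far: " + summary}, messages[-1]]

def send(messages: list[dict], user_msg: str, max_messages: int = 12) -> tuple[str, list[dict]]:
    """Append a user turn, resetting the context first if it has grown too long."""
    if len(messages) >= max_messages:
        messages = distill(messages)
    messages = messages + [{"role": "user", "content": user_msg}]
    reply = client.messages.create(model=MODEL, max_tokens=1024, messages=messages)
    text = reply.content[0].text
    return text, messages + [{"role": "assistant", "content": text}]
```

A real agent would also distill tool results and pin anything the task still depends on, but the shape is the same: the context the model sees is rebuilt on purpose rather than accumulated by accident.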

Prompts still matter. But the prompt is a subset of the context, and the complete context is what the model uses for inference. Optimizing the two percent without looking at the other ninety-eight is how you spend weeks tweaking wording and seeing no change.

It’s clear that Opus 4.7 is much better than its predecessor at following instructions. I suspect much of the wailing and gnashing of teeth about regressions is due to a lack of context management. With 4.6, you could get away with having lots of conflicting information in context. With 4.7, you have to avoid accidentally providing conflicting instructions. If you’re doing a certain type of agentic development, this is a feature, not a bug, but it does require more careful curation than most casual users are interested in. In a perfect universe, we’d have LLMs that were able to context switch and know when to strictly follow instructions and when to “infer” intent - in 2026, these objectives are in conflict.