LLMs don't reliably produce text matching a requested word count, with deviation increasing for longer targets and varying by phrasing.
I made 120 API calls to gpt-oss-120b via the Cerebras API (free tier), varying three parameters:
- Target word counts: 100, 500, 1000, 2500
- Topics:
  - Factual: "Explain how solar panels work"
  - Creative: "Write a short story about a lighthouse keeper"
  - Argumentative: "Argue for or against remote work"
- Phrasings: "Write exactly X words about: {topic}" vs. "Write approximately X words about: {topic}"
5 runs per combination (4 lengths × 3 topics × 2 phrasings × 5 runs = 120 total).
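The full grid can be enumerated in a few lines (a sketch using the prompt templates above; the structure of the tuples is my own bookkeeping, not from the original harness):

```python
from itertools import product

TARGETS = [100, 500, 1000, 2500]
TOPICS = [
    "Explain how solar panels work",                   # factual
    "Write a short story about a lighthouse keeper",   # creative
    "Argue for or against remote work",                # argumentative
]
PHRASINGS = [
    "Write exactly {n} words about: {topic}",
    "Write approximately {n} words about: {topic}",
]
RUNS = 5

# One prompt per (target, topic, phrasing, run) combination
grid = [
    (template.format(n=n, topic=topic), run)
    for n, topic, template, run in product(TARGETS, TOPICS, PHRASINGS, range(RUNS))
]
print(len(grid))  # 4 × 3 × 2 × 5 = 120
```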
- Model: gpt-oss-120b via Cerebras API (free tier)
- Temperature: 1.0, top_p: 0.95
- Word count method: whitespace split (`len(text.split())`), so punctuation is not stripped and hyphenated words count once
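The metric throughout is absolute percent deviation from the requested count. A minimal sketch consistent with the whitespace-split method above (function names are mine):

```python
def word_count(text: str) -> int:
    # Whitespace split: splits on any run of whitespace, ignores punctuation
    return len(text.split())

def deviation_pct(text: str, target: int) -> float:
    # Absolute deviation from the requested word count, as a percentage
    return abs(word_count(text) - target) / target * 100

sample = "one two three four five"
print(word_count(sample))        # 5
print(deviation_pct(sample, 4))  # 25.0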
CONFIRMED (with a big caveat)
The hypothesis holds, but phrasing dominates everything else.
By phrasing (the main finding):
| Phrasing | Mean deviation | Median deviation |
|---|---|---|
| "Exactly" | 2.1% | 0.2% |
| "Approximately" | 42.3% | 38.7% |
The model genuinely counts words when you say "exactly" and nails the target. "Approximately" gets treated as a soft lower bound, consistently overshooting.
By target length × phrasing:
| Target | "Exactly" mean dev | "Approximately" mean dev |
|---|---|---|
| 100 | 0.1% | 5.9% |
| 500 | 0.4% | 38.3% |
| 1000 | 0.5% | 75.9% |
| 2500 | 7.5% | 49.3% |
With "exactly," the model stays below 1% deviation up to 1000 words and only drifts to 7.5% at 2500. With "approximately," 1000-word requests averaged 1759 actual words (75.9% overshoot).
By topic:
| Topic | Mean deviation |
|---|---|
| Argumentative | 13.0% |
| Factual | 26.3% |
| Creative | 27.4% |
Argumentative prompts had the lowest deviation across all conditions, possibly because pro/con structure gives the model natural length cues.
Overshoot bias: the model almost always goes long. 70-77% of responses for targets ≥500 words exceeded the target. The worst single response was 120.1% over target, more than double the requested length.
At 100 words with "exactly" phrasing, 56.7% of completions hit the target dead-on (0% deviation). The model can count at short lengths when told to.
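The overshoot share reported above is just the fraction of responses whose actual length exceeds the target. A sketch, where `results` is a hypothetical list of (target, actual) word-count pairs:

```python
def overshoot_share(results: list[tuple[int, int]]) -> float:
    # Fraction of responses that came in longer than requested
    over = sum(1 for target, actual in results if actual > target)
    return over / len(results)

# Illustrative data, not from the actual runs
demo = [(500, 610), (500, 480), (500, 900), (500, 505)]
print(overshoot_share(demo))  # 0.75
```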
```
📊 Mean deviation by phrasing:
  "Exactly":       ██░░░░░░░░░░░░░░░░░░   2.1%
  "Approximately": ████████████████████  42.3%

📊 Mean deviation by target length × phrasing:
  100w  exact:  ░░░░░░░░░░         0.1%
  100w  approx: █░░░░░░░░░         5.9%
  500w  exact:  ░░░░░░░░░░         0.4%
  500w  approx: ████████░░        38.3%
  1000w exact:  ░░░░░░░░░░         0.5%
  1000w approx: ████████████████  75.9%
  2500w exact:  █░░░░░░░░░         7.5%
  2500w approx: ██████████        49.3%
```
- Test the same conditions on other models (Claude, GPT-4o, Llama) to see whether the "exactly" compliance is specific to gpt-oss-120b or a general pattern
- Test intermediate phrasings ("aim for X words", "target X words", "no more than X words") to find where the compliance cliff is
- Check whether the overshoot bias holds for very short targets (25, 50 words) or flips to undershoot
- Measure whether word count accuracy degrades across a conversation (does the model lose count in multi-turn?)