SCRATCHPADS-Experiment

Do LLMs Actually Hit Requested Word Counts? 2026-02-20
Hypothesis

LLMs don't reliably produce text matching a requested word count, with deviation increasing for longer targets and varying by phrasing.

Test

120 API calls to Cerebras (gpt-oss-120b, free tier), varying three parameters:

Target word counts: 100, 500, 1000, 2500

Topics:

  • Factual: "Explain how solar panels work"
  • Creative: "Write a short story about a lighthouse keeper"
  • Argumentative: "Argue for or against remote work"

Phrasing: "Write exactly X words about: {topic}" vs "Write approximately X words about: {topic}"

5 runs per combination (4 lengths × 3 topics × 2 phrasings × 5 runs = 120 total).

  • Model: gpt-oss-120b via Cerebras API (free tier)
  • Temperature: 1.0, top_p: 0.95
  • Word count method: whitespace split
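The setup above can be sketched as a small harness. The model name, sampling parameters, prompts, and counting method are from the notes; the client wiring and call shape are assumptions based on Cerebras's OpenAI-compatible chat-completions API:

```python
import itertools

TARGETS = [100, 500, 1000, 2500]
TOPICS = [
    "Explain how solar panels work",               # factual
    "Write a short story about a lighthouse keeper",  # creative
    "Argue for or against remote work",            # argumentative
]
PHRASINGS = ["exactly", "approximately"]
RUNS = 5

def count_words(text: str) -> int:
    """Word count method used throughout: plain whitespace split."""
    return len(text.split())

def build_prompt(phrasing: str, target: int, topic: str) -> str:
    return f"Write {phrasing} {target} words about: {topic}"

conditions = list(itertools.product(TARGETS, TOPICS, PHRASINGS, range(RUNS)))
assert len(conditions) == 120  # 4 lengths x 3 topics x 2 phrasings x 5 runs

def run_one(client, target: int, topic: str, phrasing: str) -> int:
    # `client` is assumed to be an openai.OpenAI instance pointed at
    # Cerebras (base_url="https://api.cerebras.ai/v1"); the call shape
    # below follows their OpenAI-compatible endpoint.
    resp = client.chat.completions.create(
        model="gpt-oss-120b",
        messages=[{"role": "user",
                   "content": build_prompt(phrasing, target, topic)}],
        temperature=1.0,
        top_p=0.95,
    )
    return count_words(resp.choices[0].message.content)
```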
Result

CONFIRMED (with a big caveat)

The hypothesis holds, but phrasing dominates everything else.

By phrasing (the main finding):

Phrasing          Mean deviation   Median deviation
"Exactly"              2.1%              0.2%
"Approximately"       42.3%             38.7%

The model genuinely counts words when you say "exactly" and nails the target; "approximately" gets treated as a soft lower bound that the model consistently overshoots.
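"Deviation" throughout appears to be relative error against the target, |actual − target| / target, expressed as a percentage. A minimal sketch of the metric as I read it from the numbers (e.g. a 1000-word request that came back at 1759 words):

```python
import statistics

def deviation(actual: int, target: int) -> float:
    """Absolute relative deviation as a percentage of the target."""
    return abs(actual - target) / target * 100

# The 1000-word "approximately" case: 1759 actual words.
assert round(deviation(1759, 1000), 1) == 75.9

def summarize(pairs):
    """Mean and median deviation over (actual, target) pairs."""
    devs = [deviation(a, t) for a, t in pairs]
    return statistics.mean(devs), statistics.median(devs)
```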

By target length × phrasing:

Target   "Exactly" mean dev   "Approximately" mean dev
100             0.1%                   5.9%
500             0.4%                  38.3%
1000            0.5%                  75.9%
2500            7.5%                  49.3%

With "exactly," the model stays below 1% deviation up to 1000 words and only drifts to 7.5% at 2500. With "approximately," 1000-word requests averaged 1759 actual words (75.9% overshoot).

By topic:

Topic           Mean deviation
Argumentative       13.0%
Factual             26.3%
Creative            27.4%

Argumentative prompts had the lowest deviation across all conditions, possibly because pro/con structure gives the model natural length cues.

Overshoot bias: the model almost always goes long. 70-77% of responses for targets ≥500 words exceeded the target. The worst single response was 120.1% over target, more than double the requested length.

At 100 words with "exactly" phrasing, 56.7% of completions were exact matches (0% deviation). The model can count at short lengths when you tell it to.

Visual Results
📊 Mean deviation by phrasing:

"Exactly":        ██░░░░░░░░░░░░░░░░░░   2.1%
"Approximately":  ████████████████████  42.3%

📊 Mean deviation by target length × phrasing:

100w  exact:  ░░░░░░░░░░  0.1%
100w  approx: █░░░░░░░░░  5.9%
500w  exact:  ░░░░░░░░░░  0.4%
500w  approx: ████████░░  38.3%
1000w exact:  ░░░░░░░░░░  0.5%
1000w approx: ████████████████  75.9%
2500w exact:  █░░░░░░░░░  7.5%
2500w approx: ██████████  49.3%
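Bars like the ones above can be regenerated with a tiny renderer. The `scale` here (percentage points per block cell) is my guess from the second chart, where ~5% per cell fits most rows; the hand-drawn bars don't follow one rounding rule exactly, so this only approximates them:

```python
def bar(pct: float, scale: float = 5.0, width: int = 10) -> str:
    """Render a percentage as block/shade cells: one full block per
    `scale` percentage points; bars may overflow `width` (cf. 75.9%)."""
    filled = round(pct / scale)
    return "█" * filled + "░" * max(0, width - filled)

for label, pct in [("1000w exact:", 0.5), ("1000w approx:", 75.9)]:
    print(f"{label:14}{bar(pct)}  {pct}%")
```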
Next
  1. Test the same conditions on other models (Claude, GPT-4o, Llama) to see whether the "exactly" compliance is specific to gpt-oss-120b or a general pattern
  2. Test intermediate phrasings ("aim for X words", "target X words", "no more than X words") to find where the compliance cliff is
  3. Check whether the overshoot bias holds for very short targets (25, 50 words) or flips to undershoot
  4. Measure whether word count accuracy degrades across a conversation (does the model lose count in multi-turn?)