LLMs don't reliably produce text matching a requested word count, with deviation increasing for longer targets and varying by phrasing.
I made 120 API calls to gpt-oss-120b via the Cerebras API (free tier), varying three parameters:
- Target word counts: 100, 500, 1000, 2500
- Topics:
  - Factual: "Explain how solar panels work"
  - Creative: "Write a short story about a lighthouse keeper"
  - Argumentative: "Argue for or against remote work"
- Phrasings: "Write exactly X words about: {topic}" vs. "Write approximately X words about: {topic}"
5 runs per combination (4 lengths × 3 topics × 2 phrasings × 5 runs = 120 total).
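The full grid can be enumerated in a few lines (a sketch using the prompt templates above; the structure of the tuples is my own bookkeeping, not from the original harness):

```python
from itertools import product

TARGETS = [100, 500, 1000, 2500]
TOPICS = [
    "Explain how solar panels work",                   # factual
    "Write a short story about a lighthouse keeper",   # creative
    "Argue for or against remote work",                # argumentative
]
PHRASINGS = [
    "Write exactly {n} words about: {topic}",
    "Write approximately {n} words about: {topic}",
]
RUNS = 5

# One prompt per (target, topic, phrasing, run) combination
grid = [
    (template.format(n=n, topic=topic), run)
    for n, topic, template, run in product(TARGETS, TOPICS, PHRASINGS, range(RUNS))
]
print(len(grid))  # 4 × 3 × 2 × 5 = 120
```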
- Model: gpt-oss-120b via Cerebras API (free tier)
- Temperature: 1.0, top_p: 0.95
- Word count method: whitespace split (`len(text.split())`), so punctuation is not stripped and hyphenated words count once
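The metric throughout is absolute percent deviation from the requested count. A minimal sketch consistent with the whitespace-split method above (function names are mine):

```python
def word_count(text: str) -> int:
    # Whitespace split: splits on any run of whitespace, ignores punctuation
    return len(text.split())

def deviation_pct(text: str, target: int) -> float:
    # Absolute deviation from the requested word count, as a percentage
    return abs(word_count(text) - target) / target * 100

sample = "one two three four five"
print(word_count(sample))        # 5
print(deviation_pct(sample, 4))  # 25.0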
CONFIRMED (with a big caveat)
The hypothesis holds, but phrasing dominates everything else.
By phrasing (the main finding):
| Phrasing | Mean deviation | Median deviation |
|---|---|---|
| "Exactly" | 2.1% | 0.2% |
| "Approximately" | 42.3% | 38.7% |
The model genuinely counts words when you say "exactly" and nails the target. "Approximately" gets treated as a soft lower bound, consistently overshooting.
By target length × phrasing:
| Target | "Exactly" mean dev | "Approximately" mean dev |
|---|---|---|
| 100 | 0.1% | 5.9% |
| 500 | 0.4% | 38.3% |
| 1000 | 0.5% | 75.9% |
| 2500 | 7.5% | 49.3% |
With "exactly," the model stays below 1% deviation up to 1000 words and only drifts to 7.5% at 2500. With "approximately," 1000-word requests averaged 1759 actual words (75.9% overshoot).
By topic:
| Topic | Mean deviation |
|---|---|
| Argumentative | 13.0% |
| Factual | 26.3% |
| Creative | 27.4% |
Argumentative prompts had the lowest deviation across all conditions, possibly because pro/con structure gives the model natural length cues.
Overshoot bias: the model almost always goes long. 70-77% of responses for targets ≥500 words exceeded the target. The worst single response was 120.1% over target, more than double the requested length.
At 100 words with "exactly" phrasing, 56.7% of completions hit the target dead-on (0% deviation). The model can count at short lengths when told to.
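The overshoot share reported above is just the fraction of responses whose actual length exceeds the target. A sketch, where `results` is a hypothetical list of (target, actual) word-count pairs:

```python
def overshoot_share(results: list[tuple[int, int]]) -> float:
    # Fraction of responses that came in longer than requested
    over = sum(1 for target, actual in results if actual > target)
    return over / len(results)

# Illustrative data, not from the actual runs
demo = [(500, 610), (500, 480), (500, 900), (500, 505)]
print(overshoot_share(demo))  # 0.75
```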
```
📊 Mean deviation by phrasing:
  "Exactly":       ██░░░░░░░░░░░░░░░░░░   2.1%
  "Approximately": ████████████████████  42.3%

📊 Mean deviation by target length × phrasing:
  100w  exact:  ░░░░░░░░░░         0.1%
  100w  approx: █░░░░░░░░░         5.9%
  500w  exact:  ░░░░░░░░░░         0.4%
  500w  approx: ████████░░        38.3%
  1000w exact:  ░░░░░░░░░░         0.5%
  1000w approx: ████████████████  75.9%
  2500w exact:  █░░░░░░░░░         7.5%
  2500w approx: ██████████        49.3%
```
- Test the same conditions on other models (Claude, GPT-4o, Llama) to see whether the "exactly" compliance is specific to gpt-oss-120b or a general pattern
- Test intermediate phrasings ("aim for X words", "target X words", "no more than X words") to find where the compliance cliff is
- Check whether the overshoot bias holds for very short targets (25, 50 words) or flips to undershoot
- Measure whether word count accuracy degrades across a conversation (does the model lose count in multi-turn?)