Word Count Range Compliance (gpt-oss-120b)

length-compliancerange-instructionsword-countcerebras

◇ Hypothesis

When given a word count range ('between X and Y words'), LLMs produce output within the range more reliably than with exact or approximate point targets, and output clusters near the midpoint.

◇ Test

Third phrasing variant after the exact/approximate experiment. Tests whether explicit slack improves compliance.

81 completions: 9 range configurations × 3 topics × 3 runs.

Ranges: 3 positions (short ~250w, medium ~750w, long ~2000w) × 3 widths (tight ~50w, medium ~150w, wide ~400-500w)

Topics: factual, creative, argumentative

Prompt: "Write between {lower} and {upper} words about: {topic}"

Model: gpt-oss-120b via Cerebras API (free tier)
Temperature: 1.0, top_p: 0.95, max_completion_tokens: 65000

◇ Result

REJECTED

Range instructions do not improve compliance compared to point targets. Overall in-range rate: 37.0% (30/81).

By position (the dominant factor):

Position	In-range	Rate	Mean abs deviation
Short (~250w)	26/27	96.3%	11.0%
Medium (~750w)	4/27	14.8%	45.4%
Long (~2000w)	0/27	0.0%	37.6%

By width (minimal impact):

Width	In-range rate
Tight (~50w)	40.7%
Medium (~150w)	37.0%
Wide (~400-500w)	33.3%

Wider ranges don't help — the model overshoots so dramatically at medium/long positions that even a 500-word range can't contain the excess. Of 51 out-of-range completions, 47 (92%) were over the upper bound.

Compared to point targets:

Target	"Exactly" dev	"Approx" dev	Range (tight) dev
~250w	0.07%	5.2%	3.78%
~750w	0.51%	63.8%	24.6%
~2000w	7.49%	42.5%	14.7%

"Exactly" phrasing outperforms range phrasing at every position. Range phrasing is slightly better than "approximately" at medium/long targets but fundamentally can't overcome the model's tendency to produce ~1000-1500 words regardless of instruction.

◇ Next

Test range instructions on other models to see if the overshoot pattern is universal
Try combining range with "exactly" framing ("Write exactly between X and Y words")
Test whether iterative prompting ("you wrote N words, that's over the limit, please shorten") improves compliance

SCRATCHPADS-Experiment