SCRATCHPADS-Experiment

Character Count Range Compliance (gpt-oss-120b) 2026-02-25

Hypothesis

Character count ranges ('between X and Y characters') produce more reliable compliance than exact or approximate point targets, particularly at 5000 characters where exact targeting broke down catastrophically.

Test

Follow-up to the character count compliance experiment, where "exactly 5000 characters" produced 625% mean deviation with outputs ranging from 0 to 211K characters. Tests whether range instructions prevent this catastrophic failure.

108 completions: 12 range configurations × 3 topics × 3 runs.

Ranges: 4 positions (SMS ~160, tweet ~280, medium ~1500, long ~5000) × 3 widths (tight, medium, wide)

Prompt: "Write between {lower} and {upper} characters about: {topic}"
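A minimal sketch of how the 12 configurations and their prompts could be enumerated. The position targets and the prompt template come from these notes. The width percentages in `WIDTH_PCT` and the topic strings are placeholders, since the actual bounds and topics are not recorded here.

```python
from itertools import product

# Nominal position targets, taken from the notes.
POSITIONS = {"sms": 160, "tweet": 280, "medium": 1500, "long": 5000}

# ASSUMPTION: half-width of each range as a percentage of the target.
# The real tight/medium/wide bounds are not recorded in these notes.
WIDTH_PCT = {"tight": 5, "medium": 10, "wide": 20}

# ASSUMPTION: placeholder topics; the real topics are not recorded here.
TOPICS = ["topic A", "topic B", "topic C"]
RUNS = 3

PROMPT = "Write between {lower} and {upper} characters about: {topic}"


def build_jobs():
    """Return one job dict per (position, width, topic, run) combination."""
    jobs = []
    for (position, target), (width, pct), topic, run in product(
        POSITIONS.items(), WIDTH_PCT.items(), TOPICS, range(RUNS)
    ):
        # Integer arithmetic keeps the bounds exact.
        lower = target * (100 - pct) // 100
        upper = target * (100 + pct) // 100
        jobs.append({
            "position": position,
            "width": width,
            "topic": topic,
            "run": run,
            "lower": lower,
            "upper": upper,
            "prompt": PROMPT.format(lower=lower, upper=upper, topic=topic),
        })
    return jobs


if __name__ == "__main__":
    jobs = build_jobs()
    print(len(jobs))          # 4 positions x 3 widths x 3 topics x 3 runs
    print(jobs[0]["prompt"])
```

Carrying `lower` and `upper` on each job means the scoring step can reuse the bounds that were actually sent, rather than recomputing them.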

  • Model: gpt-oss-120b via Cerebras API (free tier)
  • Temperature: 1.0, top_p: 0.95, max_completion_tokens: 65000

Result

REJECTED

Ranges do not produce more reliable in-range compliance than exact point targets at the scales where it matters (~1500 and ~5000 characters), where in-range rates fall to 29.6% and 0%.

By position:

Position         In-range   Rate     Mean abs deviation
SMS (~160)       27/27      100.0%   7.94%
Tweet (~280)     27/27      100.0%   7.93%
Medium (~1500)   8/27       29.6%    15.9%
Long (~5000)     0/27       0.0%     59.6%

At small targets, every completion landed in range, but exact phrasing already achieved 0.04-0.07% deviation at these scales, which is far more precise. At 5000 characters, range compliance is 0% with 59.6% mean deviation, comparable to "approximately" phrasing (~59%).
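The two metrics can be sketched as below. This assumes deviation is measured relative to the range midpoint, which the notes do not state; under that assumption a completion can be in range and still show a non-zero deviation, which would explain 100% in-range alongside ~8% deviation. The names `score` and `summarize` are illustrative.

```python
def score(text, lower, upper):
    """Score one completion against its character range."""
    n = len(text)  # counts Unicode code points, not bytes
    # ASSUMPTION: deviation is taken from the range midpoint.
    midpoint = (lower + upper) / 2
    return {
        "length": n,
        "in_range": lower <= n <= upper,
        "abs_dev_pct": abs(n - midpoint) / midpoint * 100,
    }


def summarize(scores):
    """Return (in-range count, total, mean absolute deviation in percent)."""
    hits = sum(s["in_range"] for s in scores)
    mean_dev = sum(s["abs_dev_pct"] for s in scores) / len(scores)
    return hits, len(scores), mean_dev


if __name__ == "__main__":
    scores = [score("a" * 160, 152, 168), score("a" * 200, 152, 168)]
    print(summarize(scores))
```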

The key improvement over exact phrasing at 5000 chars: zero finish_reason=length events (vs 3/15 with exact phrasing), no zero-length outputs, no 200K+ runaway outputs. The longest was 10,372 characters. Range instructions eliminate the catastrophic tail behavior but don't achieve compliance.
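A rough sketch of how the tail events could be counted. It assumes each completion is stored as a dict with `text` and `finish_reason` keys, which is a hypothetical record shape, and the `runaway_chars` threshold is an arbitrary illustrative cutoff rather than a value from the experiment.

```python
def tail_events(records, runaway_chars=50_000):
    """Count catastrophic-tail events in a list of completion records.

    ASSUMPTION: each record looks like {"text": str, "finish_reason": str}.
    ASSUMPTION: runaway_chars is an illustrative threshold.
    """
    lengths = [len(r["text"]) for r in records]
    return {
        # Output was cut off by the token limit.
        "truncated": sum(r["finish_reason"] == "length" for r in records),
        # Model returned nothing at all.
        "empty": sum(n == 0 for n in lengths),
        # Output far beyond any requested range.
        "runaway": sum(n > runaway_chars for n in lengths),
        "longest": max(lengths, default=0),
    }


if __name__ == "__main__":
    demo = [
        {"text": "x" * 10_372, "finish_reason": "stop"},
        {"text": "x" * 6_000, "finish_reason": "stop"},
    ]
    print(tail_events(demo))
```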

Width had minimal impact on in-range rate: tight 55.6% (20/36), medium 55.6% (20/36), wide 61.1% (22/36).

Systematic overshoot: 45 completions over the upper bound vs 1 under the lower bound. This suggests the model treats the lower bound as more salient than the upper.
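The over/under split can be tallied with a small helper like the one below. The function names and the `(length, lower, upper)` row layout are illustrative.

```python
from collections import Counter


def direction(length, lower, upper):
    """Classify a completion length as 'under', 'in', or 'over' its range."""
    if length < lower:
        return "under"
    if length > upper:
        return "over"
    return "in"


def tally(rows):
    """Count directions over an iterable of (length, lower, upper) rows."""
    return Counter(direction(*row) for row in rows)


if __name__ == "__main__":
    rows = [(150, 152, 168), (160, 152, 168), (170, 152, 168), (171, 152, 168)]
    print(tally(rows))
```

As a consistency check, the figures reported in this section add up: 62 in range (27 + 27 + 8 + 0), 45 over, and 1 under, for 108 completions in total.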

Next

  1. Compare against word count range compliance to see if character-based and word-based ranges show the same position-dependent pattern
  2. Test whether combining range with a hard stop instruction ("stop immediately at Y characters") improves upper-bound compliance
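One possible way to set up the two follow-ups is sketched below. `CHAR_RANGE` is the prompt used in this experiment. `WORD_RANGE` and the exact wording of `HARD_STOP` are hypothetical variants, and `word_count` uses whitespace splitting as a simple stand-in for whatever word definition the follow-up adopts.

```python
# Prompt used in this experiment.
CHAR_RANGE = "Write between {lower} and {upper} characters about: {topic}"

# ASSUMPTION: hypothetical word-based variant for follow-up 1.
WORD_RANGE = "Write between {lower} and {upper} words about: {topic}"

# ASSUMPTION: hypothetical wording for the hard-stop variant in follow-up 2.
HARD_STOP = CHAR_RANGE + " Stop immediately at {upper} characters."


def word_count(text):
    """Count whitespace-separated tokens."""
    return len(text.split())


if __name__ == "__main__":
    print(HARD_STOP.format(lower=4500, upper=5500, topic="topic A"))
    print(word_count("a short  example\n"))
```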