I ran six experiments on gpt-oss-120b to figure out how well it follows length instructions. The short answer: if you say "exactly," it's shockingly accurate at short-to-medium targets. If you say anything else, good luck.
This started with the simplest version of the question: does the model actually count words? 120 API calls later, the answer turned out to be more interesting than expected, so I kept going. Six experiments and 573 calls total, testing word counts, character counts, ranges, and what happens when you throw both constraints at once.
All experiments used gpt-oss-120b via Cerebras' free tier at temperature 1.0. The individual scratchpads have full methodology, but the story that emerges from putting them together is what I want to focus on here.
The first experiment tested "write exactly X words" vs "write approximately X words" across four target lengths (100, 500, 1000, 2500 words). The gap between the two phrasings is enormous:
| Phrasing | Mean deviation |
|---|---|
| "Exactly" | 2.1% |
| "Approximately" | 42.3% |
With "exactly" phrasing, the model stays below 1% deviation for targets up to 1000 words. At 100 words, 57% of completions were exact matches, zero deviation. The model is genuinely counting.
"Approximately" gets treated as a soft lower bound. The model consistently overshoots, and at the 1000-word target it averaged 1759 actual words. It's not that the model can't count; it's that "approximately" seems to turn off the counting mechanism entirely.
The character count experiment tested the same idea with character targets. At small scales, the results are even more impressive: "exactly 160 characters" (SMS length) yields 0.04% mean deviation, with 14 out of 15 completions hitting the target perfectly. The model can count characters at tweet and SMS lengths with near-perfect accuracy.
But at 5000 characters, "exactly" phrasing completely breaks down. The mean deviation is 625%, driven by catastrophic outliers: outputs of 211K, 137K, and 88K characters that hit the token limit, plus a couple of near-empty responses (0 and 86 characters). The model appears to enter a mode where it tries to construct the output character by character, and either runs away or produces nothing.
"Approximately 5000 characters" (59.9% mean deviation, consistent ~60% overshoot) is actually the safer option at this scale, because at least it doesn't produce 200K-character catastrophes.
Before testing more phrasing variants, I wanted to know what the model produces when you don't constrain it at all. The default generation length experiment measured this across 60 completions with no length instruction.
The median is 1359 words, with topic being the only meaningful variable:
| Topic | Mean words |
|---|---|
| Factual | 1586 |
| Creative | 1399 |
| Argumentative | 1015 |
Factual prompts ran almost 60% longer than argumentative ones, a ~570-word gap from the same model with the same parameters. The model seems to have picked up on genre conventions: factual topics get the encyclopedic treatment while argumentative prompts get tighter essay structures. The spread is telling too: argumentative outputs barely varied, while creative writing spread three times wider, which makes sense if the model has a firm sense of how long an argument should be but treats fiction as more open-ended.
Even so, the overall range across all 60 completions was 770 to 2061 words. Nothing tiny, nothing enormous. Prompt framing ("Write about the following topic: X" vs just "X") made no difference (2.1% gap, well within variance).
This baseline reframes what the compliance experiments are actually measuring. Asking for "exactly 1000 words" on an argumentative topic is asking for something close to what the model would produce anyway, so high compliance there isn't that impressive. Asking for 2500 words means fighting the model's natural length, and as the range experiments will show, the natural length usually wins.
A common suggestion is to give the model a range instead of a point target. "Write between 225 and 275 words" sounds like it should be easier to hit than "write exactly 250 words," since you're giving 50 words of slack.
The word count range experiment tested nine range configurations across three target positions (short ~250w, medium ~750w, long ~2000w) and three widths. The results make the natural length finding click into place:
| Position | In-range rate |
|---|---|
| Short (~250w) | 96.3% |
| Medium (~750w) | 14.8% |
| Long (~2000w) | 0.0% |
At ~250 words the model complies easily, since the target is well below its natural length and it can just stop early. Once you get to ~750 words, though, it's approaching the natural length and starts overshooting. By ~2000 words, not a single completion out of 27 lands in range, with the model averaging 2700+ words regardless of what you asked for.
Range width doesn't help either. Tight, medium, and wide ranges all land around 35-41% compliance, because once the model overshoots by 700+ words, a 500-word range can't contain it.
The character range experiment tells the same story: 100% compliance at SMS and tweet scales, 0% at 5000 characters. The one upside of ranges at large scales is that they prevent the catastrophic failure mode from the exact-phrasing experiment. No 200K-character outputs, no zero-length outputs. The worst was 10,372 characters. Still wrong, but at least predictably wrong.
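Checking range compliance is equally mechanical. A sketch of the in-range test, under the same whitespace-split word-count assumption as before:

```python
def in_range(text: str, lo: int, hi: int, unit: str = "words") -> bool:
    """True if the completion's length falls inside [lo, hi].
    unit is "words" (whitespace-split) or "chars" (raw length)."""
    n = len(text.split()) if unit == "words" else len(text)
    return lo <= n <= hi

# A 960-word completion against a "between 225 and 275 words" instruction:
in_range("word " * 960, 225, 275)  # → False
```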
The compound constraints experiment tested what happens when you specify both word count and character count simultaneously. Since the single-constraint experiments showed near-perfect compliance at certain targets (100 words with "exactly" phrasing: 0.07% deviation), I expected the model might use both constraints to triangulate.
It doesn't. The second constraint makes the first one worse:
| Target words | Single-constraint dev | Compound dev |
|---|---|---|
| 100 | 0.07% | 1.11% |
| 500 | 0.44% | 14.98% |
| 1000 | 0.51% | 69.20% |
At 100 words the degradation is small, but at 1000 words the model goes from 0.51% deviation (nearly perfect) to 69.2% deviation (producing roughly 1700 words). Adding the character constraint didn't help; it actively confused the model.
The incompatible pairs were the most interesting part. Given 100 words / 5000 characters (physically impossible at natural word lengths), the model writes ~100 words and ignores the character target. But flip it to 1000 words / 2000 characters (also impossible) and it writes ~200 words near the character target instead, abandoning the word count. In both cases it gravitates toward whichever constraint produces the more natural output length.
Two catastrophic outliers stood out: one response produced 112 words but 2,040,101 characters (the model generated normal text then padded with massive repetition), and another hit 27,803 words at exactly 2.0 characters per word, as if the model was trying to satisfy both constraints by using impossibly short words.
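Outliers like these are easy to catch before they reach downstream code. A rough sanity filter on characters-per-word (the 3–10 thresholds are my guesses for normal English prose, not derived from the experiment data):

```python
def looks_degenerate(text: str,
                     min_cpw: float = 3.0, max_cpw: float = 10.0) -> bool:
    """Flag outputs whose average characters-per-word falls outside a
    plausible English range, catching both repetition-padded and
    impossibly-short-word responses."""
    words = text.split()
    if not words:
        return True  # empty output is its own failure mode
    chars_per_word = len(text) / len(words)
    return not (min_cpw <= chars_per_word <= max_cpw)
```

Both catastrophes above trip this check: 2,040,101 characters over 112 words is roughly 18,000 chars/word, and the 2.0 chars/word response falls below the floor.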
If you need a specific length from gpt-oss-120b: use "exactly" phrasing, and keep your target under ~1000 words for word counts or ~1500 characters for character counts. In that range the model will nail it. Median word deviation is 0.2%, and at 100 words you'll often get an exact match. Don't bother with "approximately," "about," or "around."
Ranges sound like they should help, but they don't outperform "exactly" at any scale I tested. Where ranges work (short targets), "exactly" already works better. Where "exactly" breaks down (long targets), ranges break down the same way. And combining word and character constraints just makes both worse, so pick one.
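If "exactly" alone isn't reliable enough at your target, the obvious fallback is generate-and-verify. A minimal retry loop, written against a generic `generate(prompt) -> text` callable so you can wrap whatever client you use; the prompt template, tolerance, and attempt count are illustrative placeholders, not something the experiments tested:

```python
def generate_exact_length(generate, topic: str, target_words: int,
                          tolerance: float = 0.02,
                          max_attempts: int = 3) -> str:
    """Retry until the word count is within tolerance of the target;
    otherwise return the closest attempt seen."""
    prompt = f"Write exactly {target_words} words about: {topic}"
    best, best_dev = "", float("inf")
    for _ in range(max_attempts):
        text = generate(prompt)
        dev = abs(len(text.split()) - target_words) / target_words
        if dev <= tolerance:
            return text
        if dev < best_dev:
            best, best_dev = text, dev
    return best
```

Given the word-count results above, a single attempt with "exactly" phrasing should usually succeed under ~1000 words, so the retries mostly matter near or above the model's natural length.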
The hard ceiling is the model's natural length. Above ~1300 words, all instruction formats lose precision because you're asking the model to exceed its own prior about how long a response should be. "Exactly" is the only phrasing that works above it, and even that degrades at 2500 words.
The finding that ties all six experiments together is that "exactly" seems to activate a different mode in the model, something closer to actual counting rather than vibes-based length matching. It's an order of magnitude more accurate than any other phrasing I tested, until you hit the scale where the counting mechanism itself breaks down. Everything else ("approximately," ranges, compound constraints) gets interpreted loosely, and the model just writes however much it was going to write anyway, adjusted slightly in whatever direction you pointed.
The scratchpads with full data: word count, character count, default length, word ranges, character ranges, compound constraints.