SCRATCHPADS-Experiment

Addition Accuracy on Positive Integers (gpt-oss-120b) 2026-02-21
Hypothesis

LLMs can reliably perform simple addition of two positive integers below 100.

Test

100 random pairs of positive integers in the range 1-99 (seed=42); observed sums ranged from 9 to 186.

  • Model: gpt-oss-120b via Cerebras API (free tier)
  • Temperature: 0, top_p: 1, max_completion_tokens: 1024
  • Prompt: "What is {a} + {b}? Reply with only the number."
  • 1 run per pair, 2-second delay between calls
  • Responses parsed as integers and compared to ground truth
  • Results bucketed by max operand size: 1-9, 10-49, 50-99
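The pair-generation, bucketing, and scoring steps above can be sketched as follows. This is a minimal reconstruction, not the actual experiment script: the exact RNG usage behind seed=42 is an assumption, and the API call itself is omitted.

```python
import random

def make_pairs(n=100, seed=42):
    """Draw n random pairs of positive integers in 1-99 (assumed uniform)."""
    rng = random.Random(seed)
    return [(rng.randint(1, 99), rng.randint(1, 99)) for _ in range(n)]

def bucket(a, b):
    """Bucket a pair by its larger operand, as in the results table."""
    m = max(a, b)
    if m <= 9:
        return "1-9"
    if m <= 49:
        return "10-49"
    return "50-99"

def score(pairs, answers):
    """Compare parsed integer answers to ground truth, grouped by bucket.

    Returns {bucket: (correct, total)}.
    """
    totals, correct = {}, {}
    for (a, b), ans in zip(pairs, answers):
        k = bucket(a, b)
        totals[k] = totals.get(k, 0) + 1
        if ans == a + b:
            correct[k] = correct.get(k, 0) + 1
    return {k: (correct.get(k, 0), totals[k]) for k in totals}
```

In the real run, `answers` would come from parsing each model response with `int()`; a parse failure would count as incorrect.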

Note on reasoning tokens: gpt-oss-120b is a reasoning model that produces internal chain-of-thought before answering. The initial max_completion_tokens of 32 from the experiment spec was far too low, because reasoning tokens count against this limit; every response truncated before producing any visible output. Increasing it to 1024 resolved this.
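For reference, the request parameters as configured after the fix. This is a sketch assuming the Cerebras endpoint follows the OpenAI chat-completions schema; field names beyond those listed above are assumptions.

```python
# Assumed chat-completions request parameters (per the settings above).
# max_completion_tokens must cover hidden reasoning tokens plus the answer,
# so 32 truncated every response; 1024 leaves ample headroom.
params = {
    "model": "gpt-oss-120b",
    "temperature": 0,
    "top_p": 1,
    "max_completion_tokens": 1024,
}
```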

Result

CONFIRMED

100% accuracy across all 100 pairs. No errors, no parse failures, no API failures.

Bucket   Total   Correct   Accuracy
1-9          1         1     100.0%
10-49       25        25     100.0%
50-99       74        74     100.0%

Every response was the correct integer with no extraneous text. The reasoning traces showed the model explicitly restating the problem and verifying its answer before responding.

The 1-9 bucket had only 1 sample: buckets are keyed on the larger operand, so both operands must fall in 1-9, and with uniform draws from 1-99 at least one operand is almost always ≥ 10.
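Under the uniform-draw assumption, the expected count for that bucket is easy to check: both operands land in 1-9 with probability (9/99)^2, so roughly 0.8 of the 100 pairs should fall there, making the observed single sample unsurprising.

```python
# Probability that max(a, b) <= 9 when both operands are drawn
# uniformly from 1-99: both must independently land in 1-9.
p_small = (9 / 99) ** 2        # = 1/121, about 0.83%
expected = 100 * p_small       # expected 1-9 bucket count out of 100 pairs
print(round(expected, 2))      # roughly 0.83
```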

Worth noting: this is a reasoning model on trivial arithmetic. Results may differ significantly for non-reasoning models, larger operands, or more complex operations.

Next
  1. Test the same task on a non-reasoning model to see whether the internal chain-of-thought is what makes the difference
  2. Scale up operand size (3-digit, 4-digit, larger) to find where accuracy starts dropping
  3. Test other operations (subtraction, multiplication, division) at the same operand range
  4. Try adversarial prompts that might trip up the reasoning step (e.g. embedding the addition in a word problem with distractors)