SCRATCHPADS-Experiment

Addition Accuracy on Negative Integers (gpt-oss-120b) 2026-02-21
Hypothesis

LLMs can reliably perform simple addition involving negative integers with absolute value below 100.

Test

Follow-up to the positive integers scratchpad, which showed 100% accuracy. Negative numbers introduce sign handling that may cause errors.

100 addition problems involving negative integers, stratified into 4 cases:

  • Case 1 (30 pairs): one negative operand, positive result
  • Case 2 (30 pairs): one negative operand, negative result
  • Case 3 (10 pairs): one negative operand, zero result (additive inverses)
  • Case 4 (30 pairs): both operands negative

Operands drawn from integers with absolute value in 1-99 (seed=42). Sums ranged from -186 to 82.
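
A minimal sketch of how the stratified problem set could have been generated, assuming rejection sampling with Python's standard random module; the helper name sample_case and the exact sampling scheme are illustrative, not the actual script.

```python
import random

random.seed(42)  # seed from the setup; the sampling scheme below is an assumption

def sample_case(predicate, n):
    """Rejection-sample n (a, b) pairs with |a|, |b| in 1..99 satisfying predicate."""
    pairs = []
    while len(pairs) < n:
        a = random.choice([-1, 1]) * random.randint(1, 99)
        b = random.choice([-1, 1]) * random.randint(1, 99)
        if predicate(a, b):
            pairs.append((a, b))
    return pairs

problems = (
    sample_case(lambda a, b: (a < 0) != (b < 0) and a + b > 0, 30)    # Case 1
    + sample_case(lambda a, b: (a < 0) != (b < 0) and a + b < 0, 30)  # Case 2
    + [(x, -x) for x in random.sample(range(1, 100), 10)]             # Case 3: additive inverses
    + sample_case(lambda a, b: a < 0 and b < 0, 30)                   # Case 4
)
assert len(problems) == 100
```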

  • Model: gpt-oss-120b via Cerebras API (free tier)
  • Temperature: 0, top_p: 1, max_completion_tokens: 1024
  • Prompt: "What is {a} + {b}? Reply with only the number."
  • 1 run per pair, 2-second delay between calls (see the call sketch below)
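
A sketch of a single call under the settings above, assuming the Cerebras endpoint is reached through the OpenAI-compatible Python client; the base URL, environment-variable name, and ask helper are assumptions, not the harness actually used.

```python
import os
import time
from openai import OpenAI  # assuming the OpenAI-compatible client; the real harness may differ

client = OpenAI(
    base_url="https://api.cerebras.ai/v1",   # assumed OpenAI-compatible Cerebras endpoint
    api_key=os.environ["CEREBRAS_API_KEY"],  # hypothetical environment variable
)

def ask(a: int, b: int) -> str:
    """Send one addition problem with the settings listed above and return the raw reply text."""
    resp = client.chat.completions.create(
        model="gpt-oss-120b",
        temperature=0,
        top_p=1,
        max_completion_tokens=1024,
        messages=[{"role": "user", "content": f"What is {a} + {b}? Reply with only the number."}],
    )
    time.sleep(2)  # 2-second delay between calls, per the setup
    return resp.choices[0].message.content.strip()
```
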
Result

CONFIRMED

100% accuracy across all 100 problems. No errors, no parse failures.

Case                             Total  Correct  Accuracy
One negative, positive result       30       30    100.0%
One negative, negative result       30       30    100.0%
Additive inverses (zero result)     10       10    100.0%
Both negative                       30       30    100.0%
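
A scoring sketch consistent with "no parse failures": the reply is stripped, parsed as an integer, and compared against the exact sum. The parsing details (e.g. tolerating a trailing period) are assumptions.

```python
def is_correct(reply: str, a: int, b: int) -> bool:
    """Return True if the model's reply parses to exactly a + b."""
    try:
        return int(reply.strip().rstrip(".")) == a + b
    except ValueError:
        return False  # would be logged as a parse failure
```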

Sign handling introduced no errors. The both-negative case produced sums with magnitudes up to 186, beyond the 1-99 operand range, without issue.

Combined with the positive-integer experiment, this brings gpt-oss-120b to 200/200 on simple two-operand addition across positive, negative, and mixed-sign integers.

Next
  1. Scale up operand magnitude (3-digit, 4-digit) to find where accuracy drops
  2. Test subtraction and multiplication over the same range to see whether other operations introduce errors that addition does not
  3. Run the same problems on a non-reasoning model to check whether the chain-of-thought is carrying the result