LLMs can reliably perform simple two-operand addition involving negative integers with absolute value below 100.
Follow-up to the positive integers scratchpad, which showed 100% accuracy. Negative numbers introduce sign handling that may cause errors.
100 addition problems involving negative integers, stratified into 4 cases:
- Case 1 (30 pairs): one negative operand, positive result
- Case 2 (30 pairs): one negative operand, negative result
- Case 3 (10 pairs): one negative operand, zero result (additive inverses)
- Case 4 (30 pairs): both operands negative
Operands were drawn from integers with absolute value 1-99 (seed=42). Observed sums ranged from -186 to 82.
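The generation code isn't recorded here, so this is a minimal sketch of how the stratified pairs could be produced with a seeded RNG. The function name `generate_pairs`, the rejection-sampling approach, and the sign ordering within each pair (positive operand first in the mixed cases) are assumptions, not the actual harness.

```python
import random

def generate_pairs(seed=42):
    """Sketch of stratified pair generation (structure assumed, not the
    original code). Operands have absolute value 1-99; four cases."""
    rng = random.Random(seed)

    # Case 1 (30 pairs): one negative operand, positive sum
    case1 = []
    while len(case1) < 30:
        a, b = rng.randint(1, 99), -rng.randint(1, 99)
        if a + b > 0:
            case1.append((a, b))

    # Case 2 (30 pairs): one negative operand, negative sum
    case2 = []
    while len(case2) < 30:
        a, b = rng.randint(1, 99), -rng.randint(1, 99)
        if a + b < 0:
            case2.append((a, b))

    # Case 3 (10 pairs): additive inverses, zero sum
    case3 = [(n, -n) for n in rng.sample(range(1, 100), 10)]

    # Case 4 (30 pairs): both operands negative
    case4 = [(-rng.randint(1, 99), -rng.randint(1, 99)) for _ in range(30)]

    return case1 + case2 + case3 + case4
```

With a fixed seed this is deterministic, so the exact 100 problems can be regenerated for the follow-up runs listed below.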
- Model: gpt-oss-120b via Cerebras API (free tier)
- Temperature: 0, top_p: 1, max_completion_tokens: 1024
- Prompt: "What is {a} + {b}? Reply with only the number."
- 1 run per pair, 2-second delay between calls
CONFIRMED
100% accuracy across all 100 problems. No errors, no parse failures.
| Case | Total | Correct | Accuracy |
|---|---|---|---|
| One negative, positive result | 30 | 30 | 100.0% |
| One negative, negative result | 30 | 30 | 100.0% |
| Additive inverses (zero result) | 10 | 10 | 100.0% |
| Both negative | 30 | 30 | 100.0% |
Sign handling introduced no errors. The both-negative case produced sums with magnitudes up to 186, beyond the 1-99 operand range, without issue.
Combined with the positive integer experiment, gpt-oss-120b is 200/200 on simple two-operand addition across positive, negative, and mixed-sign integers.
- Scale up operand magnitude (3-digit, 4-digit) to find where accuracy drops
- Test subtraction and multiplication at the same range to see if other operations introduce errors that addition doesn't
- Run the same problems on a non-reasoning model to check whether the chain-of-thought is carrying the result