LLMs can reliably perform simple two-operand addition involving negative integers with absolute value below 100.
Follow-up to the positive integers scratchpad, which showed 100% accuracy. Negative numbers introduce sign handling that may cause errors.
100 addition problems involving negative integers, stratified into 4 cases:
- Case 1 (30 pairs): one negative operand, positive result
- Case 2 (30 pairs): one negative operand, negative result
- Case 3 (10 pairs): one negative operand, zero result (additive inverses)
- Case 4 (30 pairs): both operands negative
Operands were drawn from integers with absolute value 1-99 (seed=42). Observed sums ranged from -186 to 82.
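The generation code isn't recorded here, so this is a minimal sketch of how the stratified pairs could be produced with a seeded RNG. The function name `generate_pairs`, the rejection-sampling approach, and the sign ordering within each pair (positive operand first in the mixed cases) are assumptions, not the actual harness.

```python
import random

def generate_pairs(seed=42):
    """Sketch of stratified pair generation (structure assumed, not the
    original code). Operands have absolute value 1-99; four cases."""
    rng = random.Random(seed)

    # Case 1 (30 pairs): one negative operand, positive sum
    case1 = []
    while len(case1) < 30:
        a, b = rng.randint(1, 99), -rng.randint(1, 99)
        if a + b > 0:
            case1.append((a, b))

    # Case 2 (30 pairs): one negative operand, negative sum
    case2 = []
    while len(case2) < 30:
        a, b = rng.randint(1, 99), -rng.randint(1, 99)
        if a + b < 0:
            case2.append((a, b))

    # Case 3 (10 pairs): additive inverses, zero sum
    case3 = [(n, -n) for n in rng.sample(range(1, 100), 10)]

    # Case 4 (30 pairs): both operands negative
    case4 = [(-rng.randint(1, 99), -rng.randint(1, 99)) for _ in range(30)]

    return case1 + case2 + case3 + case4
```

With a fixed seed this is deterministic, so the exact 100 problems can be regenerated for the follow-up runs listed below.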
- Model: gpt-oss-120b via Cerebras API (free tier)
- Temperature: 0, top_p: 1, max_completion_tokens: 1024
- Prompt: "What is {a} + {b}? Reply with only the number."
- 1 run per pair, 2-second delay between calls
CONFIRMED
100% accuracy across all 100 problems. No errors, no parse failures.
| Case | Total | Correct | Accuracy |
|---|---|---|---|
| One negative, positive result | 30 | 30 | 100.0% |
| One negative, negative result | 30 | 30 | 100.0% |
| Additive inverses (zero result) | 10 | 10 | 100.0% |
| Both negative | 30 | 30 | 100.0% |
Sign handling introduced no errors. The both-negative case produced sums with magnitudes up to 186, beyond the 1-99 operand range, without issue.
Combined with the positive integer experiment, gpt-oss-120b is 200/200 on simple two-operand addition across positive, negative, and mixed-sign integers.
- Scale up operand magnitude (3-digit, 4-digit) to find where accuracy drops
- Test subtraction and multiplication at the same range to see if other operations introduce errors that addition doesn't
- Run the same problems on a non-reasoning model to check whether the chain-of-thought is carrying the result