LLMs will show lower accuracy on multiplication of positive integers below 100 compared to addition, especially for larger operands.
Follow-up to the addition experiments (positive and negative operands), both of which showed 100% accuracy. Multiplication is harder: the result space is larger and the digit-by-digit work involves more carry-and-add steps, so accuracy might degrade over the same operand range.
100 random multiplication pairs (seed=42), stratified into 5 buckets of 20 pairs each:
- Small × Small (1-9 × 1-9)
- Small × Medium (1-9 × 10-49)
- Small × Large (1-9 × 50-99)
- Medium × Medium (10-49 × 10-49)
- Large × Large (50-99 × 50-99)
Products ranged from 1 (1×1) to 9024 (96×94).
- Model: gpt-oss-120b via Cerebras API (free tier)
- Temperature: 0, top_p: 1, max_completion_tokens: 1024
- Prompt: "What is {a} * {b}? Reply with only the number."
- 1 run per pair, 2-second delay between calls
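A sketch of the query and grading step, assuming Cerebras' OpenAI-compatible chat-completions endpoint; the URL and the comma/whitespace tolerance in `grade` are my assumptions, not details from the write-up.

```python
import json
import urllib.request

API_URL = "https://api.cerebras.ai/v1/chat/completions"  # assumed endpoint

def ask_product(a, b, api_key):
    """One deterministic query for a*b with the settings above."""
    payload = {
        "model": "gpt-oss-120b",
        "messages": [{"role": "user",
                      "content": f"What is {a} * {b}? Reply with only the number."}],
        "temperature": 0,
        "top_p": 1,
        "max_completion_tokens": 1024,
    }
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode(),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"].strip()

def grade(a, b, reply):
    """Exact-match grading; tolerates whitespace, a trailing period, and commas."""
    try:
        return int(reply.strip().rstrip(".").replace(",", "")) == a * b
    except ValueError:
        return False

# Driver loop would call ask_product per pair with time.sleep(2) between
# calls to respect the free tier, then tally grade() results per bucket.
```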
REJECTED
The hypothesis is wrong — multiplication accuracy matches addition at 100%.
| Bucket | Total | Correct | Accuracy |
|---|---|---|---|
| Small × Small | 20 | 20 | 100.0% |
| Small × Medium | 20 | 20 | 100.0% |
| Small × Large | 20 | 20 | 100.0% |
| Medium × Medium | 20 | 20 | 100.0% |
| Large × Large | 20 | 20 | 100.0% |
No errors even in the Large × Large bucket, where products reach four digits and the long multiplication requires multiple carry operations. The model's internal chain-of-thought handles the multi-step arithmetic correctly throughout.
Combined with the addition results, gpt-oss-120b is now 300/300 on basic two-operand arithmetic with operands below 100.
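One caveat worth keeping in mind (a statistical note I'm adding, not part of the original analysis): a perfect 300/300 does not mean the true error rate is zero. By the rule of three, an error-free run of n trials only bounds the true per-problem error rate below roughly 3/n at 95% confidence:

```python
def rule_of_three_upper(n):
    """Approximate 95% upper confidence bound on the error rate
    after n trials with zero observed errors (rule of three)."""
    return 3 / n

# 300 error-free trials -> error rate below about 1%
print(f"{rule_of_three_upper(300):.3f}")  # → 0.010
```

So the cleanest reading is "error rate below ~1% in this range", which the scale-up experiments below can tighten or break.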
- Scale up to 3-digit and 4-digit operands to find where multiplication accuracy actually breaks down
- Test division at the same operand range, since it introduces remainders and potential floating-point ambiguity
- Run the same problems on a non-reasoning model to isolate the effect of chain-of-thought on arithmetic accuracy