SCRATCHPADS-Experiment

Multiplication Accuracy on Positive Integers (gpt-oss-120b) 2026-02-21
Hypothesis

LLMs will show lower accuracy on multiplication of positive integers below 100 than on addition over the same range, especially for larger operands.

Test

Follow-up to the addition experiments (positive, negative), both of which showed 100% accuracy. Multiplication is harder: a larger result space and more carry operations, so accuracy might degrade over the same operand range.

100 random multiplication pairs (seed=42), stratified into 5 buckets of 20:

  • Small × Small (1-9 × 1-9)
  • Small × Medium (1-9 × 10-49)
  • Small × Large (1-9 × 50-99)
  • Medium × Medium (10-49 × 10-49)
  • Large × Large (50-99 × 50-99)

Products ranged from 1 (1×1) to 9024 (96×94).
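The stratified draw can be sketched as follows. Bucket bounds are taken from the list above; the exact order and form of the RNG calls in the original run is an assumption, so a re-run of this sketch may draw different pairs than the experiment did.

```python
import random

# Operand ranges per bucket (inclusive), as listed above.
BUCKETS = {
    "Small x Small":   ((1, 9),   (1, 9)),
    "Small x Medium":  ((1, 9),   (10, 49)),
    "Small x Large":   ((1, 9),   (50, 99)),
    "Medium x Medium": ((10, 49), (10, 49)),
    "Large x Large":   ((50, 99), (50, 99)),
}

def sample_pairs(seed=42, per_bucket=20):
    """Draw per_bucket (a, b) pairs from each bucket with a fixed seed."""
    rng = random.Random(seed)
    pairs = []
    for name, ((a_lo, a_hi), (b_lo, b_hi)) in BUCKETS.items():
        for _ in range(per_bucket):
            pairs.append((name, rng.randint(a_lo, a_hi), rng.randint(b_lo, b_hi)))
    return pairs
```

Fixing the seed keeps the 100 pairs reproducible across runs, so any future model comparison sees the identical problem set.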

  • Model: gpt-oss-120b via Cerebras API (free tier)
  • Temperature: 0, top_p: 1, max_completion_tokens: 1024
  • Prompt: "What is {a} * {b}? Reply with only the number."
  • 1 run per pair, 2-second delay between calls
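A minimal driver for the setup above. `call_model(prompt) -> str` is a hypothetical wrapper around the Cerebras chat endpoint (not shown; it would carry the temperature, top_p, and token settings listed), and the grading rule assumed here is "first integer in the reply equals a × b":

```python
import re
import time

def grade(reply: str, a: int, b: int) -> bool:
    """Correct iff the first integer in the reply equals a*b
    (tolerates surrounding whitespace and thousands separators)."""
    m = re.search(r"-?\d+", reply.replace(",", ""))
    return m is not None and int(m.group()) == a * b

def run(pairs, call_model, delay=2.0):
    """pairs: (bucket, a, b) triples. One run per pair, with a delay
    between calls to stay inside the free-tier rate limit."""
    results = []
    for bucket, a, b in pairs:
        prompt = f"What is {a} * {b}? Reply with only the number."
        results.append((bucket, a, b, grade(call_model(prompt), a, b)))
        time.sleep(delay)
    return results
```

The lenient integer extraction matters because "Reply with only the number" is not always obeyed literally; a strict string equality check would mis-score replies like "9,024" or trailing-newline outputs.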
Result

REJECTED

The hypothesis is wrong — multiplication accuracy matches addition at 100%.

Bucket             Total   Correct   Accuracy
Small × Small         20        20     100.0%
Small × Medium        20        20     100.0%
Small × Large         20        20     100.0%
Medium × Medium       20        20     100.0%
Large × Large         20        20     100.0%
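The per-bucket rows above can be aggregated with a small helper; the graded-result format assumed here is a list of (bucket, a, b, correct) tuples:

```python
from collections import Counter

def tabulate(results):
    """Aggregate (bucket, a, b, correct) tuples into
    {bucket: (total, correct, accuracy_percent)}."""
    total, correct = Counter(), Counter()
    for bucket, _, _, ok in results:
        total[bucket] += 1
        correct[bucket] += bool(ok)
    return {b: (total[b], correct[b], 100.0 * correct[b] / total[b])
            for b in total}
```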

No errors even in the Large × Large bucket, where products reach four digits and require multiple carry operations. The model's internal chain-of-thought handles the multi-step arithmetic correctly throughout.

Combined with the addition results, gpt-oss-120b is now 300/300 on basic two-operand arithmetic with operands below 100.

Next
  1. Scale up to 3-digit and 4-digit operands to find where multiplication accuracy actually breaks down
  2. Test division at the same operand range, since it introduces remainders and potential floating-point ambiguity
  3. Run the same problems on a non-reasoning model to isolate the effect of chain-of-thought on arithmetic accuracy