SCRATCHPADS-Experiment

Multiplication Accuracy on Positive Integers (gpt-oss-120b) 2026-02-21
Hypothesis

LLMs will show lower accuracy on multiplication of positive integers below 100 than on addition over the same range, especially for larger operands.

Test

Follow-up to the addition experiments (positive, negative), both of which showed 100% accuracy. Multiplication is harder: a larger result space and more carry operations, so accuracy might degrade over the same operand range.

100 random multiplication pairs (seed=42), stratified into 5 buckets of 20:

  • Small × Small (1-9 × 1-9)
  • Small × Medium (1-9 × 10-49)
  • Small × Large (1-9 × 50-99)
  • Medium × Medium (10-49 × 10-49)
  • Large × Large (50-99 × 50-99)

Products ranged from 1 (1×1) to 9024 (96×94).
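The stratified draw can be sketched as follows. Bucket bounds are taken from the list above; the exact order and form of the RNG calls in the original run is an assumption, so a re-run of this sketch may draw different pairs than the experiment did.

```python
import random

# Operand ranges per bucket (inclusive), as listed above.
BUCKETS = {
    "Small x Small":   ((1, 9),   (1, 9)),
    "Small x Medium":  ((1, 9),   (10, 49)),
    "Small x Large":   ((1, 9),   (50, 99)),
    "Medium x Medium": ((10, 49), (10, 49)),
    "Large x Large":   ((50, 99), (50, 99)),
}

def sample_pairs(seed=42, per_bucket=20):
    """Draw per_bucket (a, b) pairs from each bucket with a fixed seed."""
    rng = random.Random(seed)
    pairs = []
    for name, ((a_lo, a_hi), (b_lo, b_hi)) in BUCKETS.items():
        for _ in range(per_bucket):
            pairs.append((name, rng.randint(a_lo, a_hi), rng.randint(b_lo, b_hi)))
    return pairs
```

Fixing the seed keeps the 100 pairs reproducible across runs, so any future model comparison sees the identical problem set.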

  • Model: gpt-oss-120b via Cerebras API (free tier)
  • Temperature: 0, top_p: 1, max_completion_tokens: 1024
  • Prompt: "What is {a} * {b}? Reply with only the number."
  • 1 run per pair, 2-second delay between calls
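A minimal driver for the setup above. `call_model(prompt) -> str` is a hypothetical wrapper around the Cerebras chat endpoint (not shown; it would carry the temperature, top_p, and token settings listed), and the grading rule assumed here is "first integer in the reply equals a × b":

```python
import re
import time

def grade(reply: str, a: int, b: int) -> bool:
    """Correct iff the first integer in the reply equals a*b
    (tolerates surrounding whitespace and thousands separators)."""
    m = re.search(r"-?\d+", reply.replace(",", ""))
    return m is not None and int(m.group()) == a * b

def run(pairs, call_model, delay=2.0):
    """pairs: (bucket, a, b) triples. One run per pair, with a delay
    between calls to stay inside the free-tier rate limit."""
    results = []
    for bucket, a, b in pairs:
        prompt = f"What is {a} * {b}? Reply with only the number."
        results.append((bucket, a, b, grade(call_model(prompt), a, b)))
        time.sleep(delay)
    return results
```

The lenient integer extraction matters because "Reply with only the number" is not always obeyed literally; a strict string equality check would mis-score replies like "9,024" or trailing-newline outputs.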
Result

REJECTED

The hypothesis is wrong — multiplication accuracy matches addition at 100%.

Bucket             Total   Correct   Accuracy
Small × Small         20        20     100.0%
Small × Medium        20        20     100.0%
Small × Large         20        20     100.0%
Medium × Medium       20        20     100.0%
Large × Large         20        20     100.0%
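The per-bucket rows above can be aggregated with a small helper; the graded-result format assumed here is a list of (bucket, a, b, correct) tuples:

```python
from collections import Counter

def tabulate(results):
    """Aggregate (bucket, a, b, correct) tuples into
    {bucket: (total, correct, accuracy_percent)}."""
    total, correct = Counter(), Counter()
    for bucket, _, _, ok in results:
        total[bucket] += 1
        correct[bucket] += bool(ok)
    return {b: (total[b], correct[b], 100.0 * correct[b] / total[b])
            for b in total}
```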

No errors even in the Large × Large bucket, where products reach four digits and require multiple carry operations. The model's internal chain-of-thought handles the multi-step arithmetic correctly throughout.

Combined with the addition results, gpt-oss-120b is now 300/300 on basic two-operand arithmetic with operands below 100.

Next
  1. Scale up to 3-digit and 4-digit operands to find where multiplication accuracy actually breaks down
  2. Test division at the same operand range, since it introduces remainders and potential floating-point ambiguity
  3. Run the same problems on a non-reasoning model to isolate the effect of chain-of-thought on arithmetic accuracy