SCRATCHPADS-Experiment

Output Diversity vs Generation Parameters (gpt-oss-120b) 2026-02-27
Hypothesis

Higher temperature and higher top_p values produce more diverse outputs when repeating the same creative prompt, as measured by vocabulary richness, n-gram overlap, semantic similarity, and token entropy.

Test

Establishes a quantitative baseline for how temperature and top_p affect output diversity. Future passphrase seeding experiments compare against this.

70 completions: 7 configurations × 10 runs.

Prompt: "Write a short story about a traveler arriving in a strange city."

Temperature sweep (top_p=0.95 fixed): 0.3, 0.7, 1.0, 1.5

Top-p sweep (temp=1.0 fixed): 0.5, 0.8, 0.95, 1.0

The temp=1.0/top_p=0.95 config is shared between both series.

  • Model: gpt-oss-120b via Cerebras API (free tier)
  • logprobs=True, top_logprobs=5 for entropy computation
  • Metrics: unique word ratio, mean token entropy, pairwise bigram Jaccard, pairwise cosine similarity (all-MiniLM-L6-v2), unique first sentences
Result

CONFIRMED

Both parameters increase diversity, but temperature dominates.

Temperature sweep (top_p=0.95):

Temp Cosine similarity Bigram Jaccard Token entropy
0.3 0.741 0.064 0.191
0.7 0.718 0.064 0.454
1.0 0.652 0.055 0.622
1.5 0.592 0.036 1.000

Cosine similarity drops 20% from lowest to highest temperature (CIs non-overlapping). Token entropy scales near-linearly with temperature, increasing 425%. Bigram overlap drops 43% — the jump from 1.0 to 1.5 is where most of the gain happens.

Top-p sweep (temp=1.0):

Top-p Cosine similarity Bigram Jaccard Token entropy
0.5 0.714 0.065 0.609
0.8 0.649 0.057 0.626
0.95 0.652 0.055 0.622
1.0 0.635 0.049 0.639

Top-p effects are much weaker. The big jump is 0.5→0.8 for cosine similarity; beyond that, diminishing returns. Token entropy barely changes across the full top-p range (5% increase vs 425% for temperature).

Unique word ratio changes are not statistically significant for either series at n=10 — CIs overlap substantially. Cosine similarity is the most informative single metric with the clearest monotonic trend and cleanly separated CIs.

Next
  1. Test passphrase seeding as a complementary diversity technique on top of these parameter settings
  2. Measure whether higher diversity correlates with lower coherence or quality
  3. Test at temp > 1.5 to find where diversity gains plateau or quality degrades