Output Diversity vs Generation Parameters (gpt-oss-120b)

diversity, temperature, top-p, generation-parameters, baseline, cerebras

◇ Hypothesis

Higher temperature and higher top_p values produce more diverse outputs when repeating the same creative prompt, as measured by vocabulary richness, n-gram overlap, semantic similarity, and token entropy.

◇ Test

This experiment establishes a quantitative baseline for how temperature and top_p affect output diversity. Future passphrase-seeding experiments will be compared against it.

70 completions: 7 configurations × 10 runs.

Prompt: "Write a short story about a traveler arriving in a strange city."

Temperature sweep (top_p=0.95 fixed): 0.3, 0.7, 1.0, 1.5

Top-p sweep (temp=1.0 fixed): 0.5, 0.8, 0.95, 1.0

The temp=1.0/top_p=0.95 config is shared between both series.
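The count of 7 configurations follows from the two sweeps sharing one setting. A minimal sketch of the grid, assuming the sweeps are simply merged and de-duplicated (the names `build_configs` and `RUNS_PER_CONFIG` are illustrative, not taken from the experiment code):

```python
from itertools import chain

TEMPERATURES = [0.3, 0.7, 1.0, 1.5]  # swept with top_p fixed at 0.95
TOP_PS = [0.5, 0.8, 0.95, 1.0]       # swept with temperature fixed at 1.0
RUNS_PER_CONFIG = 10


def build_configs():
    """Return the unique (temperature, top_p) pairs across both sweeps."""
    temp_sweep = [(t, 0.95) for t in TEMPERATURES]
    top_p_sweep = [(1.0, p) for p in TOP_PS]
    # dict.fromkeys keeps insertion order and drops the shared (1.0, 0.95) pair.
    return list(dict.fromkeys(chain(temp_sweep, top_p_sweep)))


configs = build_configs()
total_completions = len(configs) * RUNS_PER_CONFIG
print(len(configs), total_completions)
```

Four temperature settings plus four top-p settings, minus the one shared pair, gives 7 configurations and 70 completions.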

Model: gpt-oss-120b via Cerebras API (free tier)
logprobs=True, top_logprobs=5 for entropy computation
Metrics: unique word ratio, mean token entropy, pairwise bigram Jaccard, pairwise cosine similarity (all-MiniLM-L6-v2), unique first sentences
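The lexical metrics and the entropy estimate can be sketched with the standard library alone. This is an illustration under stated assumptions, not the experiment's actual code: the regex tokenization, the use of nats, and the renormalization of the top-5 probability mass are all assumptions, and the function names are hypothetical. The embedding-based cosine similarity is omitted because it needs the all-MiniLM-L6-v2 model.

```python
import math
import re
from itertools import combinations


def words(text):
    """Lowercased word tokens (assumed tokenization: letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def unique_word_ratio(text):
    """Distinct words divided by total words in one completion."""
    toks = words(text)
    return len(set(toks)) / len(toks) if toks else 0.0


def bigrams(text):
    """Set of adjacent word pairs in one completion."""
    toks = words(text)
    return set(zip(toks, toks[1:]))


def mean_pairwise_bigram_jaccard(texts):
    """Average Jaccard overlap of bigram sets over all pairs of completions."""
    scores = []
    for a, b in combinations(texts, 2):
        ba, bb = bigrams(a), bigrams(b)
        union = ba | bb
        scores.append(len(ba & bb) / len(union) if union else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


def token_entropy(top_logprobs):
    """Shannon entropy (nats, assumed) of one position from its top-k logprobs.

    The top-k mass is renormalized to sum to 1 (an assumption), so this is
    a truncated estimate of the full-vocabulary entropy.
    """
    probs = [math.exp(lp) for lp in top_logprobs]
    total = sum(probs)
    return -sum((p / total) * math.log(p / total) for p in probs if p > 0)


def mean_token_entropy(positions):
    """Average per-position entropy over one completion."""
    return sum(token_entropy(p) for p in positions) / len(positions)
```

With only the top 5 logprobs per position, the estimate is bounded above by ln 5 ≈ 1.61 nats, so values are comparable across configurations but understate the true entropy wherever probability mass falls outside the top 5.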

◇ Result

CONFIRMED

Both parameters increase diversity, but temperature dominates.

Temperature sweep (top_p=0.95):

Temp	Cosine similarity	Bigram Jaccard	Token entropy
0.3	0.741	0.064	0.191
0.7	0.718	0.064	0.454
1.0	0.652	0.055	0.622
1.5	0.592	0.036	1.000

Cosine similarity drops 20% from the lowest to the highest temperature (CIs non-overlapping). Token entropy scales near-linearly with temperature, increasing 425%. Bigram overlap drops 43%, with most of that drop occurring between temp 1.0 and 1.5.
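The percentages are relative changes between the temp=0.3 and temp=1.5 rows. A small sketch that recomputes them from the table (the helper name `pct_change` is illustrative). Because the table is rounded to three decimals, the results can differ by a point or so from the quoted figures, which were presumably computed on unrounded values.

```python
def pct_change(start, end):
    """Relative change from start to end, in percent."""
    return (end - start) / start * 100.0


# (value at temp=0.3, value at temp=1.5), copied from the table above
temperature_sweep = {
    "cosine_similarity": (0.741, 0.592),
    "bigram_jaccard": (0.064, 0.036),
    "token_entropy": (0.191, 1.000),
}

for metric, (low_t, high_t) in temperature_sweep.items():
    print(f"{metric}: {pct_change(low_t, high_t):+.1f}%")
```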

Top-p sweep (temp=1.0):

Top-p	Cosine similarity	Bigram Jaccard	Token entropy
0.5	0.714	0.065	0.609
0.8	0.649	0.057	0.626
0.95	0.652	0.055	0.622
1.0	0.635	0.049	0.639

Top-p effects are much weaker. For cosine similarity, the largest change is the drop between top_p 0.5 and 0.8 (0.714 to 0.649); beyond that, returns diminish. Token entropy barely changes across the full top-p range (5% increase vs 425% for temperature).
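For reference, the two parameters act on the next-token distribution in different ways: temperature rescales every logit, while top-p only truncates the low-probability tail. The toy sketch below shows the mechanics on made-up logits. It assumes temperature is applied before top-p truncation (the order can vary by inference stack) and does not reproduce or explain the measured numbers.

```python
import math

TOY_LOGITS = [4.0, 3.0, 2.0, 1.0, 0.0]  # made-up values, not model output


def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature, shifted by the max for stability."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest high-probability set reaching top_p, renormalized."""
    kept, cumulative = [], 0.0
    for p in sorted(probs, reverse=True):
        kept.append(p)
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept)
    return [p / total for p in kept]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


for temperature in (0.3, 0.7, 1.0, 1.5):
    probs = top_p_filter(softmax_with_temperature(TOY_LOGITS, temperature), 0.95)
    print(f"temp={temperature} top_p=0.95 entropy={entropy(probs):.3f}")

for top_p in (0.5, 0.8, 0.95, 1.0):
    probs = top_p_filter(softmax_with_temperature(TOY_LOGITS, 1.0), top_p)
    print(f"temp=1.0 top_p={top_p} entropy={entropy(probs):.3f}")
```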

Unique word ratio changes are not statistically significant for either series at n=10; the CIs overlap substantially. Cosine similarity is the most informative single metric, with the clearest monotonic trend across the temperature sweep and cleanly separated CIs between its lowest and highest settings.
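The write-up does not say how the CIs were computed. One common choice for a per-completion metric at n=10 is a percentile bootstrap on the mean, sketched here as an assumption rather than a description of the method actually used (`bootstrap_ci` and `intervals_overlap` are illustrative names):

```python
import random


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of values."""
    rng = random.Random(seed)  # seeded so the interval is reproducible
    n = len(values)
    means = sorted(
        sum(rng.choices(values, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def intervals_overlap(a, b):
    """True if closed intervals a=(lo, hi) and b=(lo, hi) share any point."""
    return a[0] <= b[1] and b[0] <= a[1]
```

Calling `bootstrap_ci` on the 10 per-run values of each configuration and passing two of the resulting intervals to `intervals_overlap` gives the kind of overlap check described above. Overlapping intervals are only a rough screen, not a formal significance test.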

◇ Next

Test passphrase seeding as a complementary diversity technique on top of these parameter settings
Measure whether higher diversity correlates with lower coherence or quality
Test at temp > 1.5 to find where diversity gains plateau or quality degrades
