Higher temperature and higher top_p values produce more diverse outputs when repeating the same creative prompt, as measured by vocabulary richness, n-gram overlap, semantic similarity, and token entropy.
Establishes a quantitative baseline for how temperature and top_p affect output diversity. Future passphrase seeding experiments compare against this.
70 completions: 7 configurations × 10 runs.
Prompt: "Write a short story about a traveler arriving in a strange city."
Temperature sweep (top_p=0.95 fixed): 0.3, 0.7, 1.0, 1.5
Top-p sweep (temp=1.0 fixed): 0.5, 0.8, 0.95, 1.0
The temp=1.0/top_p=0.95 config is shared between both series.
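A minimal sketch of how the configuration grid above could be enumerated. The variable names and the job dictionary layout are illustrative assumptions, not taken from the original experiment code; only the parameter values, the prompt, and the run count come from the notes.

```python
from itertools import product

PROMPT = "Write a short story about a traveler arriving in a strange city."
RUNS_PER_CONFIG = 10

# Temperature sweep holds top_p at 0.95; top-p sweep holds temperature at 1.0.
temperature_sweep = [(t, 0.95) for t in (0.3, 0.7, 1.0, 1.5)]
top_p_sweep = [(1.0, p) for p in (0.5, 0.8, 0.95, 1.0)]

# dict.fromkeys drops the duplicated (1.0, 0.95) pair while preserving order.
configs = list(dict.fromkeys(temperature_sweep + top_p_sweep))

# One job per (config, run index) pair; the dict layout is hypothetical.
jobs = [
    {"prompt": PROMPT, "temperature": t, "top_p": p, "run": r}
    for (t, p), r in product(configs, range(RUNS_PER_CONFIG))
]

print(len(configs), len(jobs))
```

Deduplicating the shared configuration is what brings eight sweep points down to 7 configurations and 70 completions.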
- Model: gpt-oss-120b via Cerebras API (free tier)
- logprobs=True, top_logprobs=5 for entropy computation
- Metrics: unique word ratio, mean token entropy, pairwise bigram Jaccard, pairwise cosine similarity (all-MiniLM-L6-v2), unique first sentences
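The metrics listed above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the original analysis code: the regex tokenizer, the function names, the use of nats, and the renormalization of the top-5 probabilities are all assumptions, and the notes do not say how the reported entropy values were scaled.

```python
import math
import re
from itertools import combinations


def words(text):
    """Lowercase word tokens; a simple regex stands in for the real tokenizer."""
    return re.findall(r"[a-z0-9']+", text.lower())


def unique_word_ratio(text):
    """Distinct words divided by total words."""
    toks = words(text)
    return len(set(toks)) / len(toks) if toks else 0.0


def bigram_jaccard(a, b):
    """Jaccard overlap between the word-bigram sets of two texts."""
    wa, wb = words(a), words(b)
    ba, bb = set(zip(wa, wa[1:])), set(zip(wb, wb[1:]))
    union = ba | bb
    return len(ba & bb) / len(union) if union else 0.0


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm


def mean_pairwise(items, metric):
    """Average a symmetric metric over all unordered pairs of items."""
    pairs = list(combinations(items, 2))
    if not pairs:
        raise ValueError("need at least two items")
    return sum(metric(a, b) for a, b in pairs) / len(pairs)


def token_entropy(top_logprobs):
    """Shannon entropy in nats of one position's top-k logprobs.

    Assumption: the top-k probabilities are rescaled to sum to 1, so this
    underestimates the true entropy when mass lies outside the top k.
    """
    probs = [math.exp(lp) for lp in top_logprobs]
    total = sum(probs)
    return -sum((p / total) * math.log(p / total) for p in probs if p > 0)


def mean_token_entropy(positions):
    """Mean per-position entropy over one completion."""
    return sum(token_entropy(p) for p in positions) / len(positions)
```

With 10 completions per configuration, `mean_pairwise(texts, bigram_jaccard)` averages over 45 pairs. The same helper would apply to cosine similarity if `items` were the all-MiniLM-L6-v2 embedding vectors of the completions, computed separately.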
CONFIRMED
Both parameters increase diversity, but temperature dominates.
Temperature sweep (top_p=0.95):
| Temp | Cosine similarity | Bigram Jaccard | Token entropy |
|---|---|---|---|
| 0.3 | 0.741 | 0.064 | 0.191 |
| 0.7 | 0.718 | 0.064 | 0.454 |
| 1.0 | 0.652 | 0.055 | 0.622 |
| 1.5 | 0.592 | 0.036 | 1.000 |
Cosine similarity drops 20% from the lowest to the highest temperature (0.741 to 0.592; CIs non-overlapping). Token entropy scales near-linearly with temperature, increasing about 425% (0.191 to 1.000). Bigram overlap drops about 43% (0.064 to 0.036); it is flat from 0.3 to 0.7, and most of the drop occurs between 1.0 and 1.5.
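The relative changes and the near-linear entropy trend can be rechecked from the rounded table values. The sketch below is only a sanity check; the helper names are illustrative. Because it starts from three-decimal table entries, the percentages it prints land close to, but not exactly on, the quoted figures.

```python
import math

temps = [0.3, 0.7, 1.0, 1.5]
cosine_sim = [0.741, 0.718, 0.652, 0.592]
jaccard = [0.064, 0.064, 0.055, 0.036]
entropy = [0.191, 0.454, 0.622, 1.000]


def pct_change(series):
    """Relative change from the first to the last value, in percent."""
    return 100.0 * (series[-1] - series[0]) / series[0]


def pearson_r(xs, ys):
    """Pearson correlation; values near 1 indicate a near-linear relation."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


for name, series in [
    ("cosine similarity", cosine_sim),
    ("bigram Jaccard", jaccard),
    ("token entropy", entropy),
]:
    print(f"{name}: {pct_change(series):+.1f}%")

print(f"entropy vs temperature: r = {pearson_r(temps, entropy):.3f}")
```

With only four points the correlation is a rough indicator rather than a test of linearity.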
Top-p sweep (temp=1.0):
| Top-p | Cosine similarity | Bigram Jaccard | Token entropy |
|---|---|---|---|
| 0.5 | 0.714 | 0.065 | 0.609 |
| 0.8 | 0.649 | 0.057 | 0.626 |
| 0.95 | 0.652 | 0.055 | 0.622 |
| 1.0 | 0.635 | 0.049 | 0.639 |
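Top-p effects are much weaker: cosine similarity falls 11% across the full range (0.714 to 0.635), versus 20% across the temperature sweep. Most of that drop occurs between 0.5 and 0.8; beyond 0.8 the returns diminish, and the values at 0.8 and 0.95 (0.649 and 0.652) are effectively tied. Token entropy barely changes across the full top-p range (5% increase vs 425% for temperature).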
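To see where in the top-p range the cosine similarity change is concentrated, the step-by-step differences can be computed from the table. This is a small illustrative calculation on the rounded values; the variable names are assumptions.

```python
top_p = [0.5, 0.8, 0.95, 1.0]
cosine_sim = [0.714, 0.649, 0.652, 0.635]

# Total change across the sweep, and each adjacent step's share of it.
total = cosine_sim[-1] - cosine_sim[0]
steps = [
    (top_p[i], top_p[i + 1], cosine_sim[i + 1] - cosine_sim[i])
    for i in range(len(top_p) - 1)
]
shares = [delta / total for _, _, delta in steps]

for (lo, hi, delta), share in zip(steps, shares):
    print(f"{lo} -> {hi}: {delta:+.3f} ({share:.0%} of total change)")
```

A negative share marks a step that moves against the overall trend, as the small uptick between 0.8 and 0.95 does.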
Unique word ratio changes are not statistically significant for either series at n=10; the CIs overlap substantially. Cosine similarity is the most informative single metric: it falls monotonically across the temperature sweep and near-monotonically across the top-p sweep, with cleanly separated CIs between the lowest and highest temperature.
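The notes do not say how the confidence intervals were computed. One common choice at small n is a percentile bootstrap, sketched below as an assumption rather than as the method actually used; the function names and defaults are hypothetical.

```python
import random
from statistics import mean


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of a small sample."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def intervals_overlap(a, b):
    """True if two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]
```

For a per-completion metric such as unique word ratio, `values` would be the 10 scores of one configuration. Pairwise metrics are not independent across pairs, so resampling completions and recomputing the pairwise mean would be a more defensible variant than resampling the pairs themselves.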
- Test passphrase seeding as a complementary diversity technique on top of these parameter settings
- Measure whether higher diversity correlates with lower coherence or quality
- Test at temp > 1.5 to find where diversity gains plateau or quality degrades