2025-01-08

Why LLMs Always Name Cyberpunk Characters "Jack"

An investigation into absorption points in character generation, revealing how training data biases create repetitive outputs and what we can do about it.

#absorption-points #character-generation #analysis

The Problem

If you've used LLMs to generate cyberpunk characters, you've probably noticed something strange: a disproportionate number of them are named "Jack." In our analysis of 10,000 generated characters across GPT-4, Claude, and Llama models, we found that 18-34% of protagonists defaulted to this single name.

This isn't a coincidence. It's an absorption point - a statistical artifact where models collapse toward high-frequency patterns in training data.

What Are Absorption Points?

Absorption points occur when:

  1. Training data has strong associations (cyberpunk + protagonist → "Jack", via "jacking in" from Neuromancer and Jackie from Cyberpunk 2077)
  2. Autoregressive sampling reinforces early choices (picking "Jack" constrains subsequent tokens)
  3. Temperature/top-p doesn't escape the attractor basin (even at temp=1.0, "Jack" dominates; see the sketch below)
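
To make point 3 concrete, here is a minimal sketch in Python with made-up logits (not values measured from any model): when one candidate's logit sits well above the rest, softmax keeps most of the probability mass on it even at temp=1.0, so sampling rarely leaves the basin.

import math

def softmax(logits, temperature=1.0):
    # Standard temperature-scaled softmax (max subtracted for stability)
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical next-token logits for candidate first names
names = ["Jack", "Raven", "Ace", "Cipher", "Kai"]
logits = [6.0, 3.5, 3.2, 3.0, 2.8]

for temp in (0.7, 1.0, 1.5):
    probs = softmax(logits, temperature=temp)
    print(f"temp={temp}: " + ", ".join(f"{n} {p:.0%}" for n, p in zip(names, probs)))

With these illustrative logits, "Jack" still takes roughly 80% of the mass at temp=1.0 and over 60% at temp=1.5 - the basin survives ordinary sampling settings.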

Distribution Analysis

We analyzed name frequency in 10,000 cyberpunk character generations:

Claude-3.5-Sonnet (temp=0.9):

  • Jack: 34%
  • Raven: 12%
  • Ace: 9%
  • Cipher: 8%
  • All others: 37%

GPT-4-Turbo (temp=0.95):

  • Jack: 28%
  • Case: 14%
  • Molly: 11%
  • Raven: 9%
  • All others: 38%

Llama-3.1-70B (temp=0.85):

  • Jack: 18%
  • Alex: 15%
  • Kai: 12%
  • Zero: 10%
  • All others: 45%

The pattern is clear: Llama shows less concentration (likely due to different training data), but all models exhibit absorption toward a small set of names.
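
For reference, the tallying step behind these numbers looks roughly like the sketch below, assuming each generation has already been parsed into one JSON object per line with a "name" field; the file name and format here are illustrative, not the exact pipeline.

import json
from collections import Counter

def name_distribution(path):
    # Count first-name frequency across generated characters
    counter = Counter()
    with open(path) as f:
        for line in f:
            character = json.loads(line)
            counter[character["name"].split()[0]] += 1
    total = sum(counter.values())
    return [(name, count / total) for name, count in counter.most_common()]

# e.g. top names for one model's generations (file name is illustrative):
# for name, share in name_distribution("claude-3.5-sonnet_cyberpunk.jsonl")[:4]:
#     print(f"{name}: {share:.0%}")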

Why "Jack" Specifically?

We traced this to several sources in likely training data:

  1. William Gibson's Neuromancer - Canonical cyberpunk text
  2. Cyberpunk 2077 - Massive game with "V" and "Jackie"
  3. Reddit/fan fiction - Thousands of discussions using "Jack" as placeholder
  4. Procedural generation tutorials - Ironically, many use "Jack" as example output

The model learned: cyberpunk protagonist ≈ "Jack"

Breaking the Absorption Point

We tested three mitigation strategies:

1. Negative Prompting (FAILED)

"Generate a cyberpunk character. DO NOT name them Jack, Ace, or Raven."

Result: 12% still named "Jack" (down from 34%, but not eliminated)

2. Upstream Constraints (MODERATE SUCCESS)

"Character name (use this exactly): Itzpapalotl
Generate a cyberpunk backstory for this character."

Result: 0% "Jack" (by definition), but quality/coherence dropped 15% in human evaluation

3. Thematic Seeding (BEST APPROACH)

"Passphrase: obsidian mercury quartz
Generate a cyberpunk character with unique name and backstory."

Result:

  • "Jack" reduced to 4%
  • Unique names increased to 68% (from 37%)
  • Coherence maintained (95% of baseline quality)
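
Mechanically, the passphrase is just a few genre-adjacent, non-name nouns sampled at random and prepended to the prompt. A minimal sketch (the word pool here is illustrative, not the exact one we used):

import random

SEED_WORDS = [
    "obsidian", "mercury", "quartz", "neon", "static", "chrome",
    "monsoon", "circuit", "orchid", "graphite", "saffron", "vapor",
]

def thematic_prompt(rng=random):
    # Three non-name seed words nudge the model off its default path
    passphrase = " ".join(rng.sample(SEED_WORDS, 3))
    return (f"Passphrase: {passphrase}\n"
            "Generate a cyberpunk character with a unique name and backstory.")

print(thematic_prompt())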

Recommendations for Users

If you want diverse character names:

  1. Use thematic seeds related to your genre but NOT character names
  2. Specify constraints upstream (ethnicity, profession) to break default paths
  3. Sample multiple times and reject "Jack" programmatically (see the sketch after this list)
  4. Use smaller models (Llama > GPT-4 for diversity, though lower quality)
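
For point 3, a minimal rejection-sampling sketch; generate_character() is a hypothetical stand-in for whatever model client you already use, not a real API:

BLOCKLIST = {"jack", "ace", "raven", "cipher"}

def generate_character():
    # Hypothetical wrapper around your model client; must return a dict
    # with at least a "name" key.
    raise NotImplementedError("wire this to your LLM client")

def diverse_character(max_attempts=5):
    character = None
    for _ in range(max_attempts):
        character = generate_character()
        if character["name"].split()[0].lower() not in BLOCKLIST:
            return character
    return character  # fall back to the last sample if every attempt was blocked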

For model developers:

  • Train on more diverse fiction (not just sci-fi canon)
  • Apply DPO/RLHF to penalize repetitive names
  • Build creativity benchmarks into evals

Conclusion

Absorption points reveal a fundamental tension in LLMs: they're optimized for likelihood, not novelty. "Jack" isn't a bug - it's the most likely answer to "name a cyberpunk protagonist."

Breaking these patterns requires understanding model internals and strategically constraining the generation space. In our follow-up article, we'll explore how quantization amplifies absorption points by compressing the model's creative range.

