CLIR: A Fixed-Vocabulary Intermediate Representation for LLM Code Generation

Instead of a smaller generalist, a 103-token fixed vocabulary that eliminates naming entirely. The model only handles computation structure while a co-designed harness owns everything else: workspace assembly, validation, expansion to target code.


Large models are generalists. Their intelligence capacity is bounded and diluted across every domain they've been trained on, so while they can write code in a dozen languages, the distance from "can write code" to "expert-level code generation" is vast, and the capacity that could close that gap is spent on things like natural language fluency, spreadsheet knowledge, and regulatory expertise that have nothing to do with producing correct functions. A much smaller model could potentially close that gap by not being a generalist at all, by concentrating its entire capacity on the one thing that matters: computation structure.

The other half of the idea comes from how current models relate to their tooling. Large models are trained with SFT and RL, but the harness around them (tool use, retrieval, validation) is usually an afterthought bolted on after training, so the two evolve on different timelines and the harness's capabilities expand faster than the model can adapt to use them. If you co-design the harness and the model from the start, the harness can own everything deterministic (workspace assembly, retrieval, validation, expansion to target code) while the model's entire capacity goes to the part that actually requires learned judgment.

CLIR (Compact LLM Intermediate Representation) is an attempt at exactly this. It's a 103-token fixed vocabulary where an encoder-decoder model receives a structured workspace on the encoder side and emits computation structure on the decoder side, while the harness owns both ends of the pipeline. The workspace format defines what the model sees, the fixed vocabulary defines what it can emit, the validator defines what gets through, and neither side makes sense without the other.

◇ The encoder-decoder split

The architecture is an encoder-decoder model with different vocabularies on each side, which is what makes the specialization possible. The encoder receives a structured workspace describing the function to generate: a natural language task description, the function signature with types, declared constants, and available helper functions. This side can use whatever vocabulary it needs to understand the problem. The decoder emits the function body from a fixed vocabulary of 103 tokens: structural markers, slot references, arithmetic and comparison operators, builtins. It never sees a variable name, a string literal, or a language keyword.

This asymmetry is the point. The encoder handles understanding (what does this function need to do, what types are involved, what helpers are available) while the decoder handles generation from a vocabulary so small that every output can be validated against a finite grammar. All of the decoder's learned capacity goes to choosing which operations to compose and in what order, because that's all it can do.

◇ From names to slots

With 103 tokens there's no room for arbitrary identifiers. A conventional tokenizer handles total, target, balance as separate tokens or subword pieces, but a fixed vocabulary can't since every possible variable name would need its own entry. The same goes for function names, string literals, and numeric constants: they're all open-ended and can't fit in a finite set.

CLIR solves this the same way assembly solves it: positional slots. Instead of naming variables you number them. Arguments are arg0 through arg3, local variables are var0 through var15, constants are c0 through c7, and available functions are func0 through func7. The actual names, values, and signatures live in the workspace that the encoder sees, and the decoder only ever references slots by position.

The slot limits are deliberate, and the register analogy holds: just as register pressure in assembly encourages cleaner function boundaries, the bounds here force decomposition when a function gets too large, which is better structure anyway. Function arguments rarely exceed four, sixteen locals cover most function bodies, and a function that calls more than eight helpers is probably doing too much. Builtins beyond the core operations in the vocabulary don't need special tokens either, since they're just entries in the workspace's FUNCTIONS section, surfaced by the harness via RAG or a purpose-built model that determines what the function needs.

The core instruction set fits in a table:

Category	Tokens
Control flow	`func_def` `func_end` `cond_block` `cond_else` `cond_end` `loop_block` `loop_eternal` `loop_end` `break` `return`
Bindings	`var_def` `call1`..`call4` `self`
Slots	`arg0`..`arg3` `var0`..`var15` `c0`..`c7` `func0`..`func7`
Literals	`zero` `one` `two` `true` `false`
Arithmetic	`add` `sub` `mult` `div` `mod`
Comparison	`eq` `neq` `gt` `lt` `gte` `lte`
Logic	`is_true` `not` `and` `or`
Collections	`list_new` `list_append` `index` `index_set` `len`
Option/Result	`some` `none` `ok` `err` `is_some` `is_ok` `unwrap`
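
As a rough sketch, the table rows can be expanded into an explicit token-to-id mapping. The grouping mirrors the table; the helper name `numbered` and the id ordering are illustrative assumptions, not part of CLIR. Expanded this way, the rows as listed come to 84 tokens, so the rest of the 103-token vocabulary (presumably including struct tokens such as field_get, mentioned later) is not shown in the table.

```python
# Sketch only: enumerates the tokens listed in the table above.
# `numbered` and the id assignment are hypothetical, chosen for illustration.

def numbered(prefix: str, count: int, start: int = 0) -> list[str]:
    """Expand a range such as arg0..arg3 into explicit token names."""
    return [f"{prefix}{i}" for i in range(start, start + count)]


VOCAB_BY_CATEGORY: dict[str, list[str]] = {
    "control_flow": ["func_def", "func_end", "cond_block", "cond_else", "cond_end",
                     "loop_block", "loop_eternal", "loop_end", "break", "return"],
    "bindings": ["var_def", *numbered("call", 4, start=1), "self"],
    "slots": [*numbered("arg", 4), *numbered("var", 16),
              *numbered("c", 8), *numbered("func", 8)],
    "literals": ["zero", "one", "two", "true", "false"],
    "arithmetic": ["add", "sub", "mult", "div", "mod"],
    "comparison": ["eq", "neq", "gt", "lt", "gte", "lte"],
    "logic": ["is_true", "not", "and", "or"],
    "collections": ["list_new", "list_append", "index", "index_set", "len"],
    "option_result": ["some", "none", "ok", "err", "is_some", "is_ok", "unwrap"],
}

# Flatten in table order and assign consecutive ids.
TOKENS = [tok for group in VOCAB_BY_CATEGORY.values() for tok in group]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(TOKENS)}
```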

Every body line is one of three forms: a binding (var_def varN <op> <operands>), a mutation statement (list_append, index_set), or a structural token. Loops come in two forms: loop_block for foreach-style iteration over a collection, and loop_eternal with explicit cond_block / break for everything else. There's no while condition or for-range because making termination conditions explicit as tokens in specific structural positions is what lets the static checks reason about them.

◇ Structured input and expansion

The harness assembles a workspace for every generation call, which the encoder receives as structured input. Here's a simple example, counting occurrences of a value in a list:

TASK: Count how many times a given value appears in the provided list

SIGNATURE: [arg0:list[int], arg1:int] -> int
CONSTANTS: []
FUNCTIONS: []
NAMES: [arg0=lst, arg1=value]

EMIT BODY:

Everything is positional on the decoder side, but the workspace carries the full context: arg0 is called lst, arg1 is value, the signature specifies types, and the NAMES section gives the expander what it needs to produce readable output. Helper functions reach the FUNCTIONS section in one of two ways: the harness retrieves them via embedding similarity against the codebase, or a planner provides them directly. Because CLIR bodies are compact, the workspace can include full CLIR implementations of retrieved helpers, which means dozens of them fit in a context window that would struggle to hold a handful of Python functions.
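
To make the workspace format concrete, here is a minimal sketch of how a harness might render it. The `Workspace` class and its field names are hypothetical, and the entry format for CONSTANTS and FUNCTIONS is an assumption (the example above only shows them empty); only the section labels and slot limits come from the text.

```python
from dataclasses import dataclass, field


@dataclass
class Workspace:
    """Hypothetical container for one generation call's encoder input."""
    task: str
    arg_types: list[str]
    return_type: str
    arg_names: list[str]
    constants: list[str] = field(default_factory=list)   # entry format assumed
    functions: list[str] = field(default_factory=list)   # entry format assumed

    def __post_init__(self) -> None:
        # Slot limits from the text: arg0..arg3, c0..c7, func0..func7.
        if (len(self.arg_types) > 4 or len(self.constants) > 8
                or len(self.functions) > 8):
            raise ValueError("workspace exceeds CLIR slot limits")

    def render(self) -> str:
        """Produce the text layout shown in the example above."""
        args = ", ".join(f"arg{i}:{t}" for i, t in enumerate(self.arg_types))
        names = ", ".join(f"arg{i}={n}" for i, n in enumerate(self.arg_names))
        return "\n".join([
            f"TASK: {self.task}",
            "",
            f"SIGNATURE: [{args}] -> {self.return_type}",
            f"CONSTANTS: [{', '.join(self.constants)}]",
            f"FUNCTIONS: [{', '.join(self.functions)}]",
            f"NAMES: [{names}]",
            "",
            "EMIT BODY:",
        ])


ws = Workspace(
    task="Count how many times a given value appears in the provided list",
    arg_types=["list[int]", "int"],
    return_type="int",
    arg_names=["lst", "value"],
)
rendered = ws.render()
```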

The decoder produces:

func_def [arg0:list[int], arg1:int] -> int
  var_def var0    zero
  loop_block var1 arg0
    cond_block      eq var1 arg1
      var_def var0    add var0 one
    cond_end
  loop_end
  return var0
func_end

Seven body lines. loop_block var1 arg0 iterates over the list, binding each element to var1. The body checks whether var1 equals arg1 and increments the counter var0 when it does. No naming decisions, no syntax choices, just the computation structure.

On the output side the expander reverses the abstraction, taking the CLIR body together with the workspace context to produce idiomatic target-language code. It maps arg0 back to lst, arg1 back to value, and var0/var1 to contextually appropriate names like count and item. The decoder never has to think about any of this because the harness owns both ends of the pipeline.
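
A toy expander for a small subset of CLIR gives a feel for this step. It is a sketch under several assumptions: it targets Python only, handles just the tokens used in the counting example plus a few neighbours, and takes the names for var0 and var1 as input rather than choosing them. The function name `expand` and the target name `count_occurrences` are hypothetical.

```python
import re

BINARY = {"add": "+", "sub": "-", "mult": "*", "div": "//", "mod": "%",
          "eq": "==", "neq": "!=", "gt": ">", "lt": "<", "gte": ">=", "lte": "<="}
LITERALS = {"zero": "0", "one": "1", "two": "2", "true": "True", "false": "False"}


def expand(clir: str, func_name: str, names: dict[str, str]) -> str:
    """Expand a CLIR body into Python source (illustrative subset only)."""

    def operand(tok: str) -> str:
        return LITERALS.get(tok, names.get(tok, tok))

    def expr(toks: list[str]) -> str:
        if len(toks) == 1:
            return operand(toks[0])
        op, *args = toks
        if op in BINARY:
            return f"{operand(args[0])} {BINARY[op]} {operand(args[1])}"
        if op == "len":
            return f"len({operand(args[0])})"
        if op == "index":
            return f"{operand(args[0])}[{operand(args[1])}]"
        raise ValueError(f"unsupported op in this sketch: {op}")

    out: list[str] = []
    depth = 0
    for line in clir.strip().splitlines():
        if not line.strip():
            continue
        head, *rest = line.split()
        if head == "func_def":
            # Parameter order follows the slot order in the signature.
            params = [names.get(a, a) for a in re.findall(r"arg\d+", line)]
            out.append(f"def {func_name}({', '.join(params)}):")
            depth = 1
            continue
        if head in ("cond_end", "loop_end", "func_end"):
            depth -= 1
            continue
        if head == "cond_else":
            out.append("    " * (depth - 1) + "else:")
            continue
        pad = "    " * depth
        if head == "var_def":
            out.append(f"{pad}{operand(rest[0])} = {expr(rest[1:])}")
        elif head == "loop_block":
            out.append(f"{pad}for {operand(rest[0])} in {operand(rest[1])}:")
            depth += 1
        elif head == "loop_eternal":
            out.append(f"{pad}while True:")
            depth += 1
        elif head == "cond_block":
            out.append(f"{pad}if {expr(rest)}:")
            depth += 1
        elif head == "return":
            out.append(f"{pad}return {expr(rest)}")
        elif head == "break":
            out.append(f"{pad}break")
        else:
            raise ValueError(f"unsupported token in this sketch: {head}")
    return "\n".join(out)


COUNT_CLIR = """
func_def [arg0:list[int], arg1:int] -> int
  var_def var0    zero
  loop_block var1 arg0
    cond_block      eq var1 arg1
      var_def var0    add var0 one
    cond_end
  loop_end
  return var0
func_end
"""

source = expand(COUNT_CLIR, "count_occurrences",
                {"arg0": "lst", "arg1": "value", "var0": "count", "var1": "item"})
```

The output is a plain `for item in lst:` loop with `count = count + 1` inside the conditional. A real expander would presumably do more (idiom rewriting such as `+=`, name selection from context), which this sketch leaves out.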

◇ Python to CLIR

Absolute value:

def abs_val(a):
    if a < 0:
        return -a
    return a

func_def [arg0:int] -> int
  cond_block      lt arg0 zero
    var_def var0    sub zero arg0
    return var0
  cond_end
  return arg0
func_end

No named variables, no - operator, no if keyword: five body lines from the fixed vocabulary, where each token's meaning is determined by its position and the grammar rules.

Dot product:

def dot(a, b):
    total = 0
    for i in range(len(a)):
        total += a[i] * b[i]
    return total

func_def [arg0:list, arg1:list] -> int
  var_def var0    zero
  var_def var1    len arg0
  var_def var2    zero
  loop_eternal
    cond_block      gte var2 var1
      break
    cond_end
    var_def var3    index arg0 var2
    var_def var4    index arg1 var2
    var_def var5    mult var3 var4
    var_def var0    add var0 var5
    var_def var2    add var2 one
  loop_end
  return var0
func_end

More lines than the Python, but each line is a short sequence from the 103-token vocabulary with no ambiguity about what it does. Index-based iteration uses loop_eternal because CLIR has no for-range, so the counter is managed manually with var2 incrementing via add var2 one on each pass.

Binary search:

def binary_search(arr, target):
    lo, hi = 0, len(arr) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1

func_def [arg0:list, arg1:any] -> int
  var_def var0    zero
  var_def var1    len arg0
  var_def var1    sub var1 one
  loop_eternal
    cond_block      gt var0 var1
      break
    cond_end
    var_def var2    add var0 var1
    var_def var2    div var2 two
    var_def var3    index arg0 var2
    cond_block      eq var3 arg1
      return var2
    cond_end
    cond_block      lt var3 arg1
      var_def var0    add var2 one
    cond_else
      var_def var1    sub var2 one
    cond_end
  loop_end
  return c0
func_end

The return -1 becomes return c0, where c0 is declared as an int constant in the workspace header; the expander knows its value from the workspace, since the decoder never touches literal values. The CLIR body isn't shorter than the Python (at twenty body lines against eleven, it's roughly twice as long), but every token comes from the 103-token vocabulary with known grammar rules, which is what makes constrained decoding and static validation possible.

◇ Constrained decoding and static validation

Because the decoder vocabulary is finite and the grammar is known, the decoder uses a finite state machine during generation that masks invalid tokens at each step. The FSM tracks nesting depth, block types, and loop ancestors to determine which tokens are legal next, so you can't emit break outside a loop or cond_else without a preceding cond_block, and block nesting is balanced by construction rather than by sampling luck. The transition table is small enough to fit in GPU memory and run batched across sequences without meaningful overhead.
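
The structural part of that masking can be sketched with a stack of open blocks. This is an illustration, not the actual decoder: it only covers the token that starts each line, the state names ('func', 'cond', 'else', 'loop') are made up, and it assumes a bounded maximum nesting depth so that the state space stays finite.

```python
# Sketch: which line-leading tokens are legal given the currently open blocks.
# State names and function names are hypothetical.

OPENERS = {"cond_block": "cond", "loop_block": "loop", "loop_eternal": "loop"}
STATEMENTS = {"var_def", "return", "list_append", "index_set",
              "cond_block", "loop_block", "loop_eternal"}


def legal_structural_tokens(stack: list[str]) -> set[str]:
    """stack lists open blocks, outermost first."""
    if not stack:
        return {"func_def"}
    legal = set(STATEMENTS)
    top = stack[-1]
    if top == "func":
        legal.add("func_end")
    elif top == "cond":
        legal |= {"cond_else", "cond_end"}
    elif top == "else":
        legal.add("cond_end")          # a second cond_else is not allowed
    elif top == "loop":
        legal.add("loop_end")
    if "loop" in stack:                # break needs a loop ancestor
        legal.add("break")
    return legal


def advance(stack: list[str], token: str) -> list[str]:
    """Return the new stack after emitting a line-leading token."""
    if token not in legal_structural_tokens(stack):
        raise ValueError(f"{token} is masked in state {stack}")
    stack = list(stack)
    if token == "func_def":
        stack.append("func")
    elif token in OPENERS:
        stack.append(OPENERS[token])
    elif token == "cond_else":
        stack[-1] = "else"
    elif token in ("cond_end", "loop_end", "func_end"):
        stack.pop()
    return stack
```

During decoding, the set returned by `legal_structural_tokens` would be turned into a logit mask, so that anything outside it has zero probability of being sampled.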

Since the workspace declares types for every argument, constant, and available function including their arities, the FSM can also enforce type constraints during decoding. When the decoder is choosing a source operand for an add it masks out non-numeric variables, and when emitting a call1 func0 it knows func0 expects exactly one argument, which means type mismatches, wrong argument counts, and operand type errors can't be generated at all.
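
A small sketch of the type-aware part, assuming the harness keeps a map from slot to declared type and from function slot to arity. The function names are hypothetical, and reading callN as "call with N arguments" is an interpretation of the call1 example above.

```python
# Sketch: operand and call masks derived from workspace declarations.
# Function names and the dict-based bookkeeping are illustrative assumptions.

NUMERIC_TYPES = {"int", "float"}
NUMERIC_LITERALS = {"zero", "one", "two"}


def numeric_operands(slot_types: dict[str, str]) -> set[str]:
    """Tokens that may appear as an operand of add/sub/mult/div/mod."""
    return NUMERIC_LITERALS | {
        slot for slot, typ in slot_types.items() if typ in NUMERIC_TYPES
    }


def legal_call_targets(func_arity: dict[str, int]) -> dict[str, set[str]]:
    """For each callN token, the func slots that take exactly N arguments."""
    return {
        f"call{n}": {slot for slot, arity in func_arity.items() if arity == n}
        for n in range(1, 5)
    }


slot_types = {"arg0": "list[int]", "arg1": "int", "var0": "int", "var1": "bool"}
allowed = numeric_operands(slot_types)
targets = legal_call_targets({"func0": 1, "func1": 2})
```

Here `allowed` excludes arg0 (a list) and var1 (a bool), and `targets["call1"]` contains only func0, so a `call1 func1` sequence could never be sampled.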

The FSM guarantees syntactic validity, but there are structural properties it can't enforce during generation because they depend on the full token sequence. That's where a post-generation static analysis pass comes in, checking four categories before the output ever reaches the expander:

Scope safety: every variable read was defined earlier in scope, block structure is balanced, slot indices are within declared bounds.

Type safety: arithmetic operations have numeric operands, logic operations have boolean operands, index is called on a list, field_get on a struct, return type matches the declared signature.

Option/Result safety: every unwrap call appears inside a corresponding is_some or is_ok guard. A bare unwrap is a static error.

Data flow: unused variables, unreachable code after unconditional returns, loops with no reachable break.

None of these require executing the code since they're structural properties of the token sequence, cheap to check because the grammar is constrained enough that static analysis covers them fully. Between the FSM and the post-generation checks, every CLIR program that reaches the expander is guaranteed well-formed.
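
As an illustration of how cheap such checks are, here is a sketch of one piece of the scope-safety pass: flagging reads of a variable that was never defined in a visible scope. The scoping rule used (a definition inside a block is not visible after the block closes) is an assumption about CLIR's semantics, and `undefined_reads` is a hypothetical name.

```python
import re

VAR = re.compile(r"var\d+")
OPEN = {"cond_block", "loop_block", "loop_eternal"}
CLOSE = {"cond_end", "loop_end"}


def undefined_reads(clir: str) -> list[tuple[int, str]]:
    """Return (line number, variable) for each read with no visible definition."""
    scopes: list[set[str]] = [set()]       # innermost scope is last
    errors: list[tuple[int, str]] = []
    for lineno, line in enumerate(clir.strip().splitlines(), start=1):
        toks = line.split()
        if not toks or toks[0] in ("func_def", "func_end"):
            continue
        head, rest = toks[0], toks[1:]
        if head in CLOSE:
            scopes.pop()
            continue
        if head == "cond_else":
            scopes[-1] = set()             # else branch starts a fresh scope
            continue
        # The first operand of var_def / loop_block is a definition target.
        target = rest[0] if head in ("var_def", "loop_block") else None
        reads = rest[1:] if target else rest
        visible = set().union(*scopes)
        for tok in reads:
            if VAR.fullmatch(tok) and tok not in visible:
                errors.append((lineno, tok))
        if head in OPEN:
            scopes.append(set())
        if target:
            scopes[-1].add(target)         # loop variables land in the new scope
    return errors
```

A single pass over the lines is enough, and no execution is needed. The other checks (unwrap guards, unreachable code, loops without a reachable break) could plausibly follow the same pattern of walking the lines with a small amount of state.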

◇ Training data and open questions

Training data comes from transpiling real source code into CLIR, and the source languages don't always cooperate. Python has optional typing, and when types are present they're not enforced, so they can be outright wrong, which means the transpiler has to infer, validate, or reject annotations per-function. There's also a specialization problem since CLIR has no generics: a function that works on List[int] and one that works on List[str] become different CLIR programs with different type signatures, so deciding how to specialize generic source functions into concrete instances and which specializations to include is a dataset curation decision that directly affects what the model can learn.

The type system is currently basic: int, float, bool, string, list, struct, enum, and Option/Result wrappers. No generics, no maps or dicts, no tuples. A function over a HashMap<String, Vec<i32>> can't be expressed in the current vocabulary, which limits CLIR to algorithmic functions for now.

Struct and enum handling adds friction because field access is positional (field0, field1, ...) with the human-readable names living in the workspace's type declarations. The expander maps positions back to names when generating target code, which works when types are fully declared but requires careful bookkeeping when types are inferred or partially specified.

The no-names constraint also means constants have to be declared in the workspace header before the decoder runs, so the planner has to anticipate what constants the function will need. For simple cases this is straightforward: -1 for sentinel returns, 0 for accumulators. For functions where the constants depend on the algorithm structure, it's harder, and whether the workspace assembler can reliably infer them is an active question.

There's a more speculative direction too. Since data dependencies between tokens are known at vocabulary-construction time, attention masks during training could theoretically be partially precomputed: a token at position N that references var3 only needs to attend to the var_def var3 line and its subsequent mutations, though whether that's worth the implementation complexity remains to be seen.
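
To show what "partially precomputed" could mean, here is a purely illustrative sketch that builds a boolean mask from the def-use structure of a CLIR body. It assumes each token attends causally within its own line, and that a variable read additionally attends to every earlier line that defined or mutated that variable. Whether a mask this sparse would train well is exactly the open question; the function name and the mask layout are assumptions.

```python
import re

VAR = re.compile(r"var\d+")
DEFINES = {"var_def", "loop_block"}            # toks[1] is a definition target
MUTATES = {"list_append", "index_set"}         # toks[1] is read, then mutated


def dependency_mask(lines: list[list[str]]) -> list[list[bool]]:
    """mask[q][k] is True when the token at position q may attend to position k."""
    # Absolute positions of each line's tokens in the flattened sequence.
    positions: list[list[int]] = []
    n = 0
    for toks in lines:
        positions.append(list(range(n, n + len(toks))))
        n += len(toks)

    mask = [[False] * n for _ in range(n)]
    def_lines: dict[str, list[int]] = {}       # variable -> lines that wrote it
    for li, toks in enumerate(lines):
        for offset, tok in enumerate(toks):
            q = positions[li][offset]
            for k in positions[li][: offset + 1]:
                mask[q][k] = True              # causal attention within the line
            is_target = toks[0] in DEFINES and offset == 1
            if VAR.fullmatch(tok) and not is_target:
                for dl in def_lines.get(tok, []):
                    for k in positions[dl]:
                        mask[q][k] = True      # attend to earlier writes
        if toks[0] in DEFINES or toks[0] in MUTATES:
            def_lines.setdefault(toks[1], []).append(li)
    return mask


example = [
    ["var_def", "var0", "zero"],
    ["var_def", "var1", "one"],
    ["var_def", "var2", "add", "var0", "one"],
]
mask = dependency_mask(example)
```

In the example, the read of var0 on the third line can see the first line (where var0 is defined) but not the second.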

◇ Current status

I'm actively training the encoder-decoder on this, with workspace assembly, grammar-constrained decoding, and validation all in the loop. I'll share results once I'm happy with where the numbers land.

If any of this sounds interesting or you have thoughts, I'd love to hear them.
