DSPy's GEPA optimizer does something simple: it runs your model on training data, has a reflection model look at the failures, proposes a better prompt, and keeps it only if validation accuracy goes up. We wanted to try this on tiny local models (SmolLM2 135M and 1.7B, served through LM Studio) to see what GEPA's optimization loop actually discovers about how these models fail. The process uses a reflection model, for which we used the free tier of the "Big Pickle" model on the OpenCode CLI.
We set up two classification tasks at different difficulty levels.
Sentiment is the straightforward one: given a text, label it positive, negative, or neutral. 22 examples with clear-cut cases like "I absolutely love this product!" and "This is terrible, it broke after one use." Any model with basic language understanding should get most of these right, which makes it a good baseline for seeing whether GEPA can push a model that's already close.
Agent conversation monitoring is what we're actually building toward. When you run an AI agent that calls tools over multiple turns, it can fail in ways that look similar on the surface but mean different things. A doom_loop agent repeats the same action without progressing, while an api_confusion agent also repeats itself but because it's calling the wrong function and ignoring the errors. There's also verbose_thinking (pages of internal monologue for trivial tasks) and making_progress (things are fine). We labeled 16 traces across the four states, and the hard part is that doom_loop and api_confusion both involve repetition, so a model matching on surface patterns will confuse them.
Quick setup note: DSPy's JSONAdapter sends response_format={"type": "json_object"} which LM Studio doesn't support, so we bypassed the adapter layer and built a raw OpenAI client using response_format={"type": "text"} with manual parsing.
The 135M model started at 50% because of a strong positivity bias: it just said "positive" for everything, even obviously negative inputs. GEPA's first iteration caught this and proposed few-shot examples starting with negative cases, which pushed validation to 66.7%.
The more surprising thing was about prohibitions. We tried adding "DO NOT echo the input" to the prompt, and the 135M model literally output "DO NOT echo the input" and then echoed the input. At this model size, prohibition text seems to get treated as content rather than instruction, so the model reproduces it instead of following it. Few-shot examples worked better because the model pattern-matches the output format instead of needing to interpret a rule. Hard to say whether that's specific to SmolLM2 or a more general small-model thing, but it's a useful heuristic: if your prohibitions are getting echoed, switch to examples.
The 1.7B model got 100% out of the box. Three-class sentiment apparently doesn't need prompt engineering at that size.
The 135M model couldn't do this at all. It echoed the class definitions back instead of classifying, which is a capacity problem, not a prompt problem.
The 1.7B model was more interesting. It started at 57.1% and kept mixing up doom_loop and api_confusion, which makes sense since we'd described both classes qualitatively ("agent is repeating the same actions" vs "agent misunderstands tools/APIs") and the model couldn't distinguish those descriptions when the traces looked similar. GEPA's reflection step noticed this and proposed something we hadn't thought of: quantitative thresholds. "3+ identical consecutive actions = doom_loop, repeated same error = api_confusion." Giving the model an explicit counting rule instead of a vague description brought validation to 71.4%.
This feels like it applies more broadly. Qualitative descriptions are fine when the categories are obviously different (positive vs negative), because the model picks up on tone. But when categories share surface features, you need rules the model can check mechanically rather than interpret, and GEPA turns out to be good at finding those because it's working from the exact misclassifications where the boundary gets fuzzy.
Every iteration after the first made things worse, on both tasks.
| Task | Model | Baseline | Iter 1 | Iter 2 | Iter 3 |
|---|---|---|---|---|---|
| Sentiment | 135M | 50.0% | 66.7% | 66.7% | 33.3% |
| Agent monitor | 1.7B | 57.1% | 71.4% | 28.6% | 42.9% |
Iteration 2 on agent monitoring tried to handle every remaining edge case by piling on rules, and training accuracy dropped to 25% because the model couldn't follow the instruction anymore. GEPA's Pareto frontier keeps those bad prompts from propagating, but you're still wasting reflection calls on proposals that make things worse.
At this model size there's only so much instruction the model can absorb. One round of targeted fixes (reorder examples, add a counting rule, clarify a boundary) is about all the prompt can take before it starts working against itself, because each new rule competes with the existing ones for the model's limited attention.
| Task | Model | Baseline | After GEPA | Change |
|---|---|---|---|---|
| Sentiment | 135M | 50.0% | 66.7% | +16.7% |
| Sentiment | 1.7B | 100% | 100% | (already perfect) |
| Agent monitor | 135M | 28.6% | 28.6% | (too small for the task) |
| Agent monitor | 1.7B | 57.1% | 71.4% | +14.3% |
These were small, synthetic experiments with tiny datasets, so we're not drawing grand conclusions from them. That said, even on tasks this simple we got a real uplift (+16.7% on sentiment, +14.3% on agent monitoring), and the benefits were concentrated in a single GEPA pass, which is interesting in itself.
The main friction point was DSPy's assumption that the model can produce structured output. Its default adapter sends response_format=json_object, but a 135M parameter model may not be able to reliably produce valid JSON at all, regardless of what server you're running it on. We ended up bypassing the adapter entirely and parsing raw text output ourselves, which worked but means you're stepping outside DSPy's pipeline to make it happen.
Overall this was a quick first test to get familiar with GEPA's loop, and the setup was genuinely simple: LM Studio serving the models, a reflection model, and a short Python script tying it together. I'm very much looking forward to more complex tests, with an api-based verifier model and more complete & accurate dataset for training.