LEVI: Better LLM Optimization for the Price of a Cup of Coffee
A harness-first framework for LLM-guided evolutionary search.
TLDR
Existing LLM-guided evolutionary frameworks have weak diversity mechanisms that cause early convergence, then compensate by throwing expensive frontier models at the problem. LEVI takes a harness-first approach: fix the search architecture so the archive preserves structurally diverse solutions throughout the run, and strong performance follows even with cheap models. The result is better scores than OpenEvolve, ShinkaEvolve, and GEPA on the ADRS benchmark at 1.5–6.7× lower cost. LEVI will be open-sourced on GitHub soon.

Figure 1: Controlled comparison on Transaction Scheduling. Same model (Qwen3-30B-A3B), same budget (750 evaluations), three seeds. LEVI's archive sustains exploration well past the point where baselines converge.
Background and Motivation
Why existing frameworks couple strong performance with large budgets, and why that coupling is a design choice rather than a fundamental requirement.
The idea of pairing large language models with evolutionary search over programs was introduced by FunSearch, which used an island-based method to discover solutions to problems that are easy to verify but hard to solve. AlphaEvolve scaled the paradigm to stronger LLMs and larger codebases, and subsequent work extended it to mathematical constructions, heuristic design, prompt optimization, and systems research. The core loop is simple: an LLM proposes candidate programs, an evaluator scores them, and a selection mechanism guides the population toward better solutions.
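The loop fits in a few lines. The sketch below is schematic: the archive class, the toy "programs" (numbers), and the jitter "mutation" are stand-ins for the real LLM proposer and task evaluator.

```python
import random

random.seed(0)  # reproducible toy run

class BestOfArchive:
    """Toy selection mechanism: keeps every scored program, samples parents at random."""
    def __init__(self):
        self.entries = []  # list of (program, score)
    def add(self, program, score):
        self.entries.append((program, score))
    def sample(self):
        return random.choice(self.entries)[0]
    def best(self):
        return max(self.entries, key=lambda e: e[1])

def evolve(seed, propose, evaluate, archive, budget):
    """The core loop: a proposer suggests candidates, an evaluator scores them,
    and the archive/selection mechanism steers the population."""
    archive.add(seed, evaluate(seed))
    for _ in range(budget):
        parent = archive.sample()            # selection
        child = propose(parent)              # stand-in for an LLM mutation call
        archive.add(child, evaluate(child))  # scoring + archiving
    return archive.best()

# Toy instantiation: "programs" are numbers, the "LLM" jitters them,
# and the evaluator rewards proximity to a target value.
best, score = evolve(
    seed=0.0,
    propose=lambda p: p + random.uniform(-1, 1),
    evaluate=lambda p: -abs(p - 10),
    archive=BestOfArchive(),
    budget=200,
)
```

Everything interesting in a real framework lives in what this sketch abstracts away: the prompt construction inside `propose` and, as the rest of this post argues, the selection policy inside the archive.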
Several open-source frameworks now implement this loop, with OpenEvolve, ShinkaEvolve, and GEPA the most widely used. These have demonstrated strong results, but they share a common characteristic: strong performance is tightly coupled with large budgets and frontier-scale models. Most published runs assume access to frontier models like Opus, GPT, or Gemini Pro (OpenEvolve's default config uses Claude Opus for mutations; ShinkaEvolve, Ye et al., 2025, relies on frontier-scale models throughout the search), making the paradigm expensive to use and difficult to iterate on.
We believe this coupling reflects a design assumption more than a fundamental requirement. Existing frameworks were built with frontier models as the default, and their search architectures reflect this: when diversity stalls, the response tends to be additional layers of mechanism (islands, embedding-based novelty filters, LLM judges), each patching over convergence that still occurs, rather than preventing it at the archive level. GEPA takes a cleaner approach through per-instance Pareto fronts, but its diversity signal weakens when performance across instances is highly correlated. The result across the board is that capable models end up doing double duty: both proposing new solutions and compensating for a selection layer that lets the population narrow too quickly.
LEVI takes a different starting point. Rather than building the harness around the assumption of a strong model, we ask what the search architecture should look like if model calls are expensive and limited. By improving the archive’s ability to maintain structurally diverse solutions throughout the search, we reduce the burden on the model, making it possible to get strong results with cheaper models and smaller budgets. The goal is not to eliminate the need for capable models, but to ensure they are used where they matter most, and that researchers without frontier-model budgets can still push the state of the art.
LEVI
LEVI is built on two core ideas: stratified model allocation and improved diversity maintenance. Though we explain them separately, they reinforce each other: the archive provides the structure that makes principled allocation possible, and principled allocation is what makes a diversity-preserving archive practical under tight budgets.
Stratified Model Allocation
Match model capacity to task demand: cheap models for refinement, expensive models for paradigm shifts.
Frontier models help, but they are a waste if used for every mutation. Smaller LLMs may even be preferable under tight budgets, since the sheer quantity of solutions they produce can outweigh the quality advantage of larger models (the original FunSearch paper, Romera-Paredes et al., 2024, used smaller models exclusively and reported being unable to benefit from larger ones). However, smaller models have a narrower pretraining distribution, limiting their range of ideas and their ability to propose fundamentally different approaches. Neither model class is strictly better; they have different strengths.
Some existing frameworks already support multiple models, but treat them as interchangeable, sampling from an ensemble uniformly or routing calls without regard to what the mutation actually demands. This ignores a natural asymmetry: proposing an entirely new algorithmic direction requires broad knowledge and creative reasoning, while refining an existing approach (adjusting constants, reordering operations, tuning edge cases) requires far less. The harness should be aware of this distinction and allocate accordingly.
LEVI introduces stratified model allocation, which matches model capacity to task demand. Smaller, cheaper models handle the majority of the search: local refinements and incremental improvements within an established algorithmic family. Larger models are reserved for infrequent paradigm shifts: mutations that aim to propose structurally different approaches rather than polish existing ones. The principle is straightforward: allocate each model toward its strength. Small models for breadth and throughput, large models for creative leaps.
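At its simplest, the routing decision is a stratified coin flip. The sketch below uses illustrative model names and a flat 10% paradigm-shift rate; an actual router could also condition on archive state rather than chance alone.

```python
import random

# Hypothetical tiers and rate, for illustration only.
CHEAP_MODELS = ["small-model-a", "small-model-b"]
FRONTIER_MODEL = "frontier-model"
PARADIGM_SHIFT_RATE = 0.10  # ~10% of calls go to the larger model

def pick_model(rng: random.Random) -> tuple[str, str]:
    """Stratified allocation: route refinements to cheap models and
    reserve infrequent paradigm shifts for the larger model."""
    if rng.random() < PARADIGM_SHIFT_RATE:
        # Paradigm shift: ask for a structurally different approach.
        return FRONTIER_MODEL, "paradigm_shift"
    # Refinement: adjust constants, reorder operations, tune edge cases.
    return rng.choice(CHEAP_MODELS), "refine"

rng = random.Random(0)
calls = [pick_model(rng) for _ in range(10_000)]
shift_frac = sum(kind == "paradigm_shift" for _, kind in calls) / len(calls)
```

The two mutation kinds also get different prompts: a refinement prompt shows one parent and asks for a local edit, while a paradigm-shift prompt shows representatives of several algorithmic families and asks for something structurally new.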
However, this raises two questions. First, how do we select representative solutions from each algorithmic family to give the larger model meaningful context for paradigm shifts? Second, since we now rely more heavily on smaller models and their volume of output, we need a more robust mechanism to prevent the archive from converging, because quantity without diversity is just noise.
Improved Diversity Maintenance
A unified fingerprint space with noise-perturbed initialization keeps the archive structurally diverse throughout the search.
The diversity mechanism must address two things: how to represent solutions and how to initialize the archive.
Unifying structural and performance diversity. Existing frameworks maintain diversity along different axes. OpenEvolve considers structural features like code length; GEPA considers per-instance performance trade-offs through Pareto fronts. Both capture something real, but neither captures the full picture. Structure alone misses behavioral differences, and per-instance scores alone miss solutions that perform similarly but work in fundamentally different ways. Rather than choosing one, we use both as dimensions of a single behavioral descriptor. Each solution is mapped to a fingerprint: a vector of AST-based structural features (depth, loop count, cyclomatic complexity, etc.) alongside per-instance performance scores, normalized and projected to [0, 1]. This fingerprint lives in a CVT-MAP-Elites archive, where a Voronoi tessellation over the combined space maintains geometric structure that neither axis provides alone. This also directly answers the first question from the previous section: the Voronoi regions naturally cluster solutions into algorithmic families, giving us representative solutions for paradigm shifts.
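A minimal sketch of such a fingerprint, using Python's `ast` module. The specific features, caps, and normalizers below are illustrative choices, not LEVI's exact descriptor.

```python
import ast

def fingerprint(source: str, instance_scores: list[float]) -> list[float]:
    """Map a program to a behavioral descriptor: AST structural features
    plus per-instance performance, each squashed into [0, 1]."""
    tree = ast.parse(source)

    def depth(node):
        kids = list(ast.iter_child_nodes(node))
        return 1 + max((depth(k) for k in kids), default=0)

    n_loops = sum(isinstance(n, (ast.For, ast.While)) for n in ast.walk(tree))
    # Rough cyclomatic-complexity proxy: 1 + number of branching constructs.
    n_branch = 1 + sum(isinstance(n, (ast.If, ast.For, ast.While, ast.BoolOp))
                       for n in ast.walk(tree))

    structural = [
        min(depth(tree) / 20.0, 1.0),  # nesting depth, capped
        min(n_loops / 10.0, 1.0),      # loop count, capped
        min(n_branch / 20.0, 1.0),     # branching proxy, capped
    ]
    # Per-instance scores are assumed already normalized to [0, 1].
    return structural + list(instance_scores)

src = (
    "def f(xs):\n"
    "    t = 0\n"
    "    for x in xs:\n"
    "        if x > 0:\n"
    "            t += x\n"
    "    return t\n"
)
fp = fingerprint(src, [0.7, 0.4])  # 3 structural dims + 2 performance dims
```

Two programs with similar scores but different control flow land in different regions of this space, which is exactly the distinction a score-only descriptor cannot make.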
Initializing between two extremes. Traditional CVT-MAP-Elites initializes centroids uniformly across the descriptor space. With the higher dimensionality we use (6 to 10 dims), this leads to an extremely sparse tessellation where most regions will never be visited. A purely data-driven approach (fitting centroids to the first observed solutions) solves sparsity but creates the opposite problem: the archive’s geometry overfits to whatever strategies appear early, leaving little room for novel approaches that emerge later. We take a middle path: data-driven initialization with noise. We generate a small set of structurally distinct seed programs (fewer than 10), expand them into variants, fingerprint them all, and then add Gaussian noise before fitting centroids. The seeds anchor the tessellation in regions of the space that viable programs actually occupy, while the noise broadens each family’s footprint, ensuring the archive can accept innovations that fall between or outside the initial seed families. In practice, this is much more effective than either extreme.
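The initialization and the archive's insertion rule can be sketched in a few lines of NumPy. The seed count, noise scale, and number of Lloyd iterations here are illustrative, not tuned values.

```python
import numpy as np

def init_centroids(seed_fps, k=50, noise=0.05, n_samples=20, rng=None):
    """Data-driven CVT initialization with noise: jitter the seed fingerprints
    into a point cloud, then fit k centroids with a few Lloyd (k-means) steps."""
    rng = rng or np.random.default_rng(0)
    seeds = np.asarray(seed_fps, dtype=float)
    cloud = np.repeat(seeds, n_samples, axis=0)
    cloud = np.clip(cloud + rng.normal(0.0, noise, cloud.shape), 0.0, 1.0)
    centroids = cloud[rng.choice(len(cloud), size=k, replace=False)]
    for _ in range(10):  # Lloyd iterations
        dists = np.linalg.norm(cloud[:, None] - centroids[None], axis=-1)
        assign = dists.argmin(axis=1)
        for j in range(k):
            members = cloud[assign == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids

def insert(archive, centroids, fp, score, solution):
    """CVT-MAP-Elites insertion: each Voronoi cell keeps only its best solution."""
    cell = int(np.linalg.norm(centroids - fp, axis=1).argmin())
    if cell not in archive or score > archive[cell][0]:
        archive[cell] = (score, solution)

rng = np.random.default_rng(1)
seed_fps = rng.random((5, 6))  # 5 seed families, 6-dim fingerprints
centroids = init_centroids(seed_fps, k=50, rng=rng)

archive = {}
fp = rng.random(6)
insert(archive, centroids, fp, 0.5, "prog_a")
insert(archive, centroids, fp, 0.9, "prog_b")  # better score replaces in-cell
```

The noise term is doing the real work: without it, the centroids collapse onto the seed families and the tessellation has no cells for strategies that fall between them.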
Preliminary Results: ADRS Benchmark
We evaluate on the ADRS benchmark suite introduced by Cheng et al. (2025), a UC Berkeley benchmark for LLM-guided optimization on real-world systems problems spanning cloud scheduling, load balancing, SQL optimization, and transaction scheduling. We evaluate on seven of the nine problems. Our archive uses 50 centroids initialized via the fingerprint-then-perturb procedure with 5 structurally distinct seeds. Approximately 90% of LLM calls are routed to lightweight models (Qwen3-30B-A3B and MiMo-v2-Flash), with the remaining 10% reserved for paradigm shifts via Gemini Flash 3. We run 600–2,000 generations per problem.
Benchmark Scores
| Framework | Average | Cloudcast | EPLB | LLM-SQL | Prism | Spot Multi-Reg | Spot Single-Reg | Txn Scheduling |
|---|---|---|---|---|---|---|---|---|
| GEPA | 71.9 | 96.6 | 70.2 | 67.7 | 87.4 | 62.2 | 51.4 | 67.7 |
| OpenEvolve | 70.6 | 92.9 | 62.0 | 72.5 | 87.4 | 66.7 | 42.5 | 70.0 |
| ShinkaEvolve | 67.4 | 72.0 | 66.4 | 68.5 | 87.4 | 63.6 | 45.6 | 68.2 |
| LEVI | **76.5** | **100.0** | **74.6** | **78.3** | 87.4 | **72.4** | **51.7** | **71.1** |
Figure 3: ADRS benchmark scores (%). Bold indicates best per problem. LEVI achieves the highest score on every problem where improvement is possible.
LEVI achieves the highest score on every problem where improvement is possible, with an average of 76.5 compared to 71.9 for the next-best framework (GEPA), a +4.6 point improvement over the prior state of the art. On Cloudcast, LEVI reaches a perfect 100.0, indicating the problem is fully solved under the benchmark's scoring function. The largest gains appear on LLM-SQL (+5.8) and Spot Multi-Reg (+5.7), while more modest improvements on Spot Single-Reg (+0.3) and Transaction Scheduling (+1.1) reflect problems with smaller decision spaces or harder optimization landscapes. Prism remains tied at 87.4 across all frameworks, suggesting the current problem formulation admits a single dominant solution.
An additional observation: no single baseline is consistently second-best across problems, reflecting how the different diversity mechanisms each framework employs interact unevenly with different problem structures. LEVI’s consistent first-place performance suggests that CVT-MAP-Elites with fingerprint-initialized centroids provides a more robust diversity mechanism regardless of problem characteristics.

Figure 4: LEVI best score progression over generations. The archive sustains steady improvement throughout the search rather than plateauing early.
Cost
Stratified allocation drops per-generation cost by ~10x, enabling more generations at lower total spend.
LEVI’s stratified allocation is the primary driver of cost reduction. By routing the majority of mutations through lightweight models, the per-generation cost drops by roughly an order of magnitude compared to baselines that use GPT-5 or Gemini-3.0-Pro for every call. This allows LEVI to run substantially more generations while still spending less in total: $4.50 per problem on most tasks (Transaction Scheduling: $13), versus $15 to $30 for baselines.
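The arithmetic behind the blended per-call cost is simple; the prices below are made up for illustration and real savings depend on token counts and provider pricing.

```python
# Back-of-envelope blended cost per mutation, with hypothetical per-call prices.
cheap_cost = 0.002     # $/call, lightweight model (illustrative)
frontier_cost = 0.030  # $/call, frontier model (illustrative)
cheap_frac = 0.90      # fraction of calls routed to lightweight models

blended = cheap_frac * cheap_cost + (1 - cheap_frac) * frontier_cost
savings = frontier_cost / blended
# With these numbers: blended ~ $0.0048/call vs $0.03/call if every
# mutation used the frontier model, i.e. ~6x cheaper per call.
```

The larger the price gap between tiers, the closer the blended cost gets to pure lightweight-model pricing, which is how a 90/10 split approaches an order-of-magnitude reduction in practice.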
The cost reduction is not the point; it is evidence that the harness-first approach works. When the archive maintains diversity, cheap models suffice for most of the search.
Controlled Architecture Comparison
Same model, same budget, three seeds: isolating the search architecture's contribution.
The main results compare frameworks that differ simultaneously in model choice, budget, and architecture. To isolate the contribution of the search architecture, we run LEVI, OpenEvolve, and GEPA under identical conditions: a single locally served Qwen3-30B-A3B model, 750 successful evaluations (OpenEvolve required reducing its parent count from 5 to 2 for the smaller model and still produced many failures; we report successful rather than total evaluations to keep the comparison fair to OpenEvolve), and three random seeds on two representative problems.
Transaction Scheduling is a variant of an NP-hard ordering problem where multiple algorithmic families (greedy, simulated annealing, genetic) are viable but performance is measured on a single instance, giving Pareto-based diversity no trade-off to exploit. LEVI reaches a score of 62 within the first 100 evaluations, a level neither baseline achieves at any point. Final scores: LEVI 64.9, OpenEvolve 59.9, GEPA 54.4. Both baselines plateau sharply, consistent with early convergence onto a single algorithmic family; LEVI’s curve continues rising past evaluation 500.

Figure 5: Controlled comparison on Transaction Scheduling. Same model, same budget. LEVI reaches 62 within 100 evaluations, a level neither baseline achieves at any point during the run.
Can’t Be Late is scored across 1,080 simulations that give Pareto-based approaches a richer signal. The final-score gap narrows (LEVI 44.9, OpenEvolve 43.2, GEPA 32.6), but the efficiency gap widens dramatically. LEVI reaches near-peak performance by roughly evaluation 50, while OpenEvolve requires over 600 evaluations to approach the same level, a roughly 12× advantage in sample efficiency.

Figure 6: Controlled comparison on Can't Be Late. LEVI reaches near-peak by evaluation 50; OpenEvolve requires 600+ evaluations to approach the same level.
These controlled results confirm that the performance gains are attributable to the search architecture, not to model choice or budget. A 30B model under LEVI’s search regime matches or exceeds what the same model achieves under alternative selection mechanisms.
More benchmarks and domains are in progress. ADRS is a first validation, not the full story.
LEVI will be open-sourced on GitHub soon. Point it at a scoring function and a seed program and it runs until the budget is spent.