
ACE: Teaching AI Agents to Learn from Experience

Stanford researchers introduce ACE (Agentic Context Engineering), a framework that makes LLMs self-improving through evolving 'playbooks.' What can we learn from this approach?

Every developer who has used AI coding assistants has experienced the Groundhog Day problem. You explain your project's conventions. The AI gets it wrong. You correct it. Next session, same mistakes. Rinse, repeat, forever.

The standard industry response has been fine-tuning—training the model on domain-specific data to change its weights. It works, but it's expensive, slow, and requires significant ML expertise. For most teams, it's simply not practical.

A team of researchers from Stanford, SambaNova Systems, and UC Berkeley recently proposed something different. What if we could make models improve without changing their weights at all?

Their framework is called ACE (Agentic Context Engineering), and it treats the context window as an evolving knowledge base—a "playbook" that accumulates wisdom over time.

The Core Insight

ACE builds on a simple observation: the context we feed to LLMs matters enormously. A well-crafted prompt can dramatically improve output quality. But crafting that context manually is tedious work, and the patterns that make a prompt effective are often domain-specific and hard to articulate.

What if we could have agents build and refine their own context automatically?

ACE introduces a three-agent architecture that does exactly this:

Generator — Produces outputs and executes tasks, using whatever playbook knowledge exists. This is the agent doing the actual work.

Reflector — Analyzes outcomes and extracts lessons. When something goes wrong (or right), the Reflector figures out why and documents the insight.

Curator — Manages the playbook itself. It adds new insights, merges duplicates, and prunes advice that isn't helping.
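
A minimal sketch of that division of labor, with a generic llm() stub standing in for the model call (the function names and prompts here are illustrative, not taken from the paper):

def llm(prompt: str) -> str:
    """Stand-in for whatever chat-completion call you use."""
    raise NotImplementedError

def generator(task: str, playbook: str) -> str:
    # Does the actual work; never edits the playbook.
    return llm(f"Follow this playbook where relevant:\n{playbook}\n\nTask:\n{task}")

def reflector(task: str, output: str, outcome: str) -> str:
    # Explains what went right or wrong; returns candidate lessons, one per line.
    return llm(
        f"Task:\n{task}\n\nOutput:\n{output}\n\nOutcome:\n{outcome}\n\n"
        "What reusable lessons follow? One per line."
    )

def curator(playbook: str, insights: str) -> str:
    # Edits the playbook: add bullets, merge duplicates, prune what isn't helping.
    return llm(
        f"Playbook:\n{playbook}\n\nNew insights:\n{insights}\n\n"
        "Return the updated playbook, changing only what these insights require."
    )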

The separation of roles matters. Previous approaches often had a single agent both generating content and updating its own context. This led to "context collapse"—where iterative rewrites gradually eroded useful details until the context became generic and unhelpful.

By giving each role to a specialized agent, ACE avoids this trap.

The Playbook

At the heart of ACE is the playbook—a structured document that accumulates domain knowledge:

STRATEGIES & INSIGHTS

[str-00001] helpful=5 harmful=0 :: Always verify data types before processing
[str-00002] helpful=3 harmful=1 :: Consider edge cases in financial data

COMMON MISTAKES TO AVOID

[mis-00004] helpful=6 harmful=0 :: Don't forget timezone conversions

CODE SNIPPETS & TEMPLATES

[code-00005] helpful=8 harmful=0 :: NPV = Σ(Cash Flow / (1+r)^t)

Each bullet has an ID, effectiveness counters, and the actual advice. The counters get updated by the Reflector based on real outcomes—did following this advice lead to success or failure?

This creates a natural feedback loop. Advice that works gets reinforced. Advice that doesn't gets weakened and eventually pruned.
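
That loop is small enough to sketch directly. Here's one illustrative way to represent a counted bullet; the Bullet class and the should_prune rule are my own naming, not ACE's code:

from dataclasses import dataclass

@dataclass
class Bullet:
    bullet_id: str      # e.g. "str-00002"
    section: str        # e.g. "STRATEGIES & INSIGHTS"
    content: str        # the actual advice
    helpful: int = 0
    harmful: int = 0

    def record_outcome(self, success: bool) -> None:
        # The Reflector credits or blames the bullet based on what actually happened.
        if success:
            self.helpful += 1
        else:
            self.harmful += 1

    def score(self) -> int:
        return self.helpful - self.harmful

def should_prune(bullet: Bullet) -> bool:
    # Advice that hurts more often than it helps is a candidate for removal.
    return bullet.score() < 0

tip = Bullet("str-00002", "STRATEGIES & INSIGHTS",
             "Consider edge cases in financial data", helpful=3, harmful=1)
tip.record_outcome(success=True)        # helpful is now 4
print(tip.score(), should_prune(tip))   # 3 False

The appeal is that quality control becomes arithmetic: the Curator doesn't have to re-judge every piece of advice, it just reads the counters.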

Two Key Design Choices

ACE's effectiveness comes from two design choices that prevent common failure modes:

Incremental Delta Updates — Rather than rewriting the entire playbook from scratch, ACE makes localized edits. Add a bullet here, update a counter there, merge two similar items. This preserves accumulated knowledge instead of risking its loss in a full rewrite.

Grow-and-Refine — The playbook expands when new insights emerge, but the Curator periodically consolidates redundant bullets using semantic similarity. This prevents bloat while maintaining detail.

These sound obvious in retrospect, but previous approaches often tried to have the model regenerate the entire context each time. That's where context collapse happens—each rewrite loses a little fidelity until you're left with generic platitudes.
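
To make the contrast concrete, here's a rough sketch of what delta updates can look like, assuming bullets are stored in a plain dictionary. The operation names and the apply_delta helper are my own illustration, not ACE's actual API:

def apply_delta(playbook: dict, delta: dict) -> None:
    # playbook: {bullet_id: {"content": str, "helpful": int, "harmful": int}}
    # delta:    {"op": "add" | "update_counters" | "remove", "bullet_id": str, ...}
    bullet_id = delta["bullet_id"]
    if delta["op"] == "add":
        playbook[bullet_id] = {"content": delta["content"], "helpful": 0, "harmful": 0}
    elif delta["op"] == "update_counters":
        playbook[bullet_id]["helpful"] += delta.get("helpful", 0)
        playbook[bullet_id]["harmful"] += delta.get("harmful", 0)
    elif delta["op"] == "remove":
        playbook.pop(bullet_id, None)

playbook = {
    "str-00001": {"content": "Always verify data types before processing", "helpful": 5, "harmful": 0},
}
apply_delta(playbook, {"op": "add", "bullet_id": "mis-00004",
                       "content": "Don't forget timezone conversions"})
apply_delta(playbook, {"op": "update_counters", "bullet_id": "str-00001", "helpful": 1})

Every bullet the deltas don't touch survives verbatim, which is exactly the property a full rewrite can't guarantee.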

The Numbers

ACE shows genuine improvements across benchmarks:

    1. AppWorld agent tasks: +10.6% improvement over baselines
    2. Finance reasoning (FiNER + XBRL): +8.6% improvement
    3. Adaptation efficiency: 82-92% latency reduction versus competing approaches

Perhaps most impressively, ReAct+ACE (using the open-source DeepSeek-V3.1 model) achieved 59.4% on the AppWorld leaderboard—matching production-level GPT-4.1-based agents despite using a smaller model.

The framework outperforms both Dynamic Cheatsheet (which uses persistent memory) and GEPA (which evolves prompts through Pareto optimization). And it does so while being significantly cheaper to run.

Online Learning Mode

What makes ACE particularly interesting for real-world applications is its "online" mode. The system can learn while processing actual tasks, not just during dedicated training sessions.

The flow looks like this:

  1. Generator uses current playbook to complete a task
  2. User accepts, modifies, or rejects the output
  3. Reflector analyzes the outcome and extracts insights
  4. Curator integrates new insights into the playbook
  5. Next task benefits from accumulated wisdom

This means the system gets better as you use it, without any special training runs or data collection exercises.
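
In code, that flow is just the three roles run in sequence around every real task. A rough sketch, with stubs standing in for the model calls and the user interaction (none of these function names come from the ACE codebase):

def generate(task: str, playbook: list[str]) -> str:
    ...  # Generator: call the model with the task plus the current playbook

def collect_feedback(output: str) -> str:
    ...  # the user accepts, edits, or rejects the output

def reflect(task: str, output: str, feedback: str) -> list[str]:
    ...  # Reflector: distill the outcome into candidate insights

def curate(playbook: list[str], insights: list[str]) -> list[str]:
    ...  # Curator: add new bullets, merge near-duplicates, prune what isn't helping

def handle_task(task: str, playbook: list[str]) -> tuple[str, list[str]]:
    output = generate(task, playbook)
    feedback = collect_feedback(output)
    insights = reflect(task, output, feedback)
    playbook = curate(playbook, insights)   # the next task sees the updated playbook
    return output, playbook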

Reflections: What Can We Learn?

Building LightSprint, we've been thinking about similar problems from a different angle. Our AI task generation system scans your codebase and creates implementation-ready tasks with related files, todos, and complexity assessments. But right now, every task generation starts fresh—the system doesn't remember what worked well for your specific codebase.

ACE suggests some interesting directions.

Learning from edits is underutilized. When a user modifies a generated task—changing the title, rewriting todos, adding files we missed—that's valuable signal. ACE's Reflector architecture shows how to systematically capture and learn from these corrections rather than just accepting the edit and moving on.

The counter system is elegant. Tracking helpful/harmful counts per insight creates a natural quality filter. Low-performing advice gets pruned automatically. We could apply similar thinking to codebase patterns—if the AI consistently identifies certain file types as relevant and users consistently remove them, that's a signal.

Separation of concerns matters. Having distinct agents for generation, reflection, and curation prevents the feedback loop from destabilizing. A single agent trying to both create content and evaluate its own performance tends to either ignore feedback or over-correct.

Token budgets force discipline. ACE enforces an 80,000 token budget on playbooks. This constraint prevents runaway growth and forces the Curator to make real prioritization decisions. Unbounded accumulation leads to noise drowning out signal.
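
A budget like that is also cheap to enforce mechanically. Here's a rough sketch of one way to do it, ranking bullets by their helpful/harmful counters; the helper names and the four-characters-per-token heuristic are mine, not the paper's:

def estimate_tokens(text: str) -> int:
    # Crude heuristic (~4 characters per token); swap in a real tokenizer if you have one.
    return max(1, len(text) // 4)

def enforce_budget(bullets: list[dict], budget_tokens: int) -> list[dict]:
    # bullets: [{"content": str, "helpful": int, "harmful": int}, ...]
    # Keep the best-scoring advice; skip anything that would blow the budget.
    ranked = sorted(bullets, key=lambda b: b["helpful"] - b["harmful"], reverse=True)
    kept, used = [], 0
    for bullet in ranked:
        cost = estimate_tokens(bullet["content"])
        if used + cost > budget_tokens:
            continue
        kept.append(bullet)
        used += cost
    return kept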

Our automatic todo completion feature already does something Reflector-like—it analyzes commits and matches them against tasks to determine what's actually been accomplished. The difference is we're using this for status tracking, not for improving future task generation. ACE suggests we could do both.

The Bigger Picture

ACE represents a shift in how we think about LLM improvement. Instead of the traditional "train a better model" approach, it asks: what if we built better scaffolding around existing models?

This is appealing for practical reasons. Training models requires massive compute, specialized expertise, and creates versioning headaches. Improving context is something any team can iterate on. The model itself becomes a commodity; the intelligence lives in the evolved playbook.

There's something almost biological about it. Evolution doesn't redesign organisms from scratch each generation—it makes incremental modifications that accumulate over time. ACE brings that same dynamic to AI systems.

Of course, context-based improvement has limits. There are things you genuinely can't teach through prompting alone. But for domain adaptation, for learning project-specific conventions, for avoiding repeated mistakes—the approach shows real promise.

Looking Forward

ACE is research software, and the paper is recent (October 2025). It'll be interesting to see how the ideas get adopted and adapted by the broader community.

For those building AI-assisted development tools, the core insights are immediately applicable:

    1. Systematically capture corrections as training signal
    2. Separate generation from evaluation from curation
    3. Use counters to track what actually helps
    4. Preserve knowledge through incremental updates
    5. Enforce budgets to prevent noise accumulation

The context window is an underexploited resource. ACE shows one way to make it dynamic—an evolving playbook that gets smarter with use rather than staying static.

That's the future we're all building toward: AI systems that genuinely learn from working with you, not just systems that reset to zero every session.


ACE is open source and available at github.com/ace-agent/ace. The paper Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models has full technical details.

Interested in AI-native project management? Try LightSprint—where your task board stays in sync with your codebase, automatically.