Designing a Memory System for Coding Agents: RAG, Vector DBs, and What We Got Wrong
Three weeks into our first agent prototype, we watched it try to add a database migration for a field that already existed. The field had been added six days earlier. By the same agent. In the same codebase.
That's when we realized we had a memory problem.
Building DevOS — where AI agents work as employees inside your sprint — means agents need to remember context across days and weeks, not just within a single chat session. This is the gap that single-agent tools like Devin and Cursor don't address. A developer doesn't forget what they shipped last Tuesday. An agent shouldn't either.
We spent six weeks getting this wrong before we figured out what actually works. This is the story of what broke, what we tried, and the three-tier memory architecture we eventually shipped.
The Problem: Agents Forget Everything
Here's the setup. You have a coding agent that's assigned a ticket: "Add rate limiting to the /api/users endpoint." Simple enough.
But to do this well, the agent needs to know:
- What rate limiting approach the team already uses elsewhere (Redis? In-memory? Token bucket?)
- Whether there's an existing ADR about rate limiting decisions
- That three months ago, someone tried adding rate limiting to /api/posts and it broke the mobile app because of how retry logic worked
- That the team's runbook for rate limit incidents exists and has specific thresholds
None of this is in the ticket. None of this is in the immediate code context. It's scattered across the repo, the docs, the PR history, and the incident log.
An agent without memory treats every ticket like day one on the job.
With memory, it builds institutional knowledge over time — the same way a human engineer does. That's the theory, anyway. Getting there was painful.
What We Tried First: Pure Vector Search
The obvious first attempt: embed everything, throw it in a vector database, retrieve the top-k similar documents when the agent needs context.
We tried Pinecone. We tried pgvector. We tried Chroma. The embedding model was OpenAI's text-embedding-3-large. The chunking strategy was 512 tokens with 50-token overlap. Standard stuff.
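A minimal sketch of that chunking scheme, for readers who haven't built one. The function name `chunk_tokens` is hypothetical, and the synthetic token list stands in for the output of a real tokenizer; only the 512/50 window sizes come from the setup described above.

```python
from typing import List


def chunk_tokens(tokens: List[str], size: int = 512, overlap: int = 50) -> List[List[str]]:
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # how far each window advances
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks


# Synthetic tokens; a real pipeline would use the embedding model's tokenizer.
tokens = [f"tok{i}" for i in range(1000)]
chunks = chunk_tokens(tokens)
```

Each chunk repeats the last 50 tokens of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.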
It worked. Kinda.
When the agent searched for "rate limiting implementation," it found code snippets mentioning rate limits. It found a README section about API throttling. It found a config file with rate limit values.
What it didn't find:
- The ADR from eight months ago explaining why the team chose token bucket over sliding window
- The incident postmortem where rate limiting misconfiguration caused a 3-hour outage
- The PR discussion where the mobile team flagged that aggressive rate limiting breaks their retry logic
Vector similarity isn't structural understanding. "Rate limiting implementation" and "incident postmortem: cascading timeouts from aggressive throttling" have low cosine similarity. The words are different. The embedding space doesn't connect them.
We were retrieving the what but missing the why and the what went wrong before.
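To make the similarity gap concrete, here is a toy illustration. It is an assumption-heavy stand-in: it uses bag-of-words counts instead of real embeddings, and the README sentence is invented for the example. Dense embeddings capture some synonymy, so real scores would not be exactly zero, but the ranking direction is the same.

```python
import math
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercased word counts; a crude stand-in for an embedding vector."""
    return Counter(text.lower().replace(":", "").split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


query = bag_of_words("rate limiting implementation")
readme = bag_of_words("rate limiting config for the API")  # hypothetical README line
postmortem = bag_of_words(
    "incident postmortem: cascading timeouts from aggressive throttling"
)

readme_score = cosine(query, readme)          # shares "rate" and "limiting"
postmortem_score = cosine(query, postmortem)  # shares no words at all
```

The README outranks the postmortem even though the postmortem is the document the agent most needs to see.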
The Second Attempt: Stuff More Into the Prompt
Maybe the problem was retrieval, not representation. So we tried aggressive context stuffing.
Before every agent task, we'd pull:
- The 20 most recent PRs touching related files
- All ADRs in the /docs folder
- The last 10 incident reports
- The full git blame for any file the agent might touch
Then we'd cram all of that into the system prompt. Claude's context window is big, right? Just use it.
This was worse. Way worse.
First, tokens cost money. We were burning through API costs at 4x our budget for marginal improvement. (I'll be honest — my heart sank watching those usage graphs climb. We'd somehow built the most expensive context window in history.) Second, more context isn't always better context. When you dump 80,000 tokens of loosely-related documents on an agent, it gets confused. It hallucinates connections between unrelated things. It references the incident report from the payments service while working on the user service because both happened to mention "timeout."
Noise scales faster than signal. Period.
The failure mode that killed this approach: an agent working on a database migration "remembered" a migration from a completely different project that happened to be in the context window. It copy-pasted patterns that didn't apply. Broke the build. Wasted two hours of human debugging time.
The Architecture That Actually Works: Three-Tier Memory
Here's where we landed. Three memory tiers, each serving a different purpose.
Tier 1: Graphiti Knowledge Graphs
Graphiti builds a knowledge graph from your repository artifacts. Not just embeddings — actual relationships.
When we ingest a repo, Graphiti extracts entities and connections:
UserController → calls → AuthService → documented in → auth-runbook.md
rate_limit_config.yml → referenced in incident → INC-2024-0312 → resolved by → PR #847
migration_20240815.sql → adds field → users.last_login_at → used by → SessionTracker.ts
When the agent needs context for "add rate limiting to /api/users," the graph traversal finds the incident, the runbook, the previous PR, and the config — even though the semantic similarity is low. It follows relationships, not just embeddings.
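The traversal idea can be sketched with a plain adjacency map. This does not use Graphiti's API; `EDGES` and `related` are hypothetical names, and the edge data is just the example chains listed above.

```python
from collections import deque
from typing import Dict, List, Tuple

# (relation, target) edges keyed by source node, copied from the example chains.
EDGES: Dict[str, List[Tuple[str, str]]] = {
    "UserController": [("calls", "AuthService")],
    "AuthService": [("documented in", "auth-runbook.md")],
    "rate_limit_config.yml": [("referenced in incident", "INC-2024-0312")],
    "INC-2024-0312": [("resolved by", "PR #847")],
    "migration_20240815.sql": [("adds field", "users.last_login_at")],
    "users.last_login_at": [("used by", "SessionTracker.ts")],
}


def related(start: str, max_hops: int = 2) -> List[Tuple[str, str, str]]:
    """Breadth-first walk returning (source, relation, target) triples within max_hops."""
    seen = {start}
    queue = deque([(start, 0)])
    found: List[Tuple[str, str, str]] = []
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand past the hop limit
        for relation, target in EDGES.get(node, []):
            found.append((node, relation, target))
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return found


context = related("rate_limit_config.yml")
```

Starting from the config file, two hops reach both the incident and the PR that resolved it, with no text similarity involved. The hop limit is the main knob: too low misses the history, too high drags in unrelated neighbors.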
Building this wasn't trivial. We had to write custom extractors for ADRs, incident reports, and PR discussions. The default Graphiti entity extraction works well for code, less well for freeform docs. Three weeks of tuning — and I'd be embarrassed to show you the first version. It kept classifying every doc as a "README." But it works now.
Tier 2: Embedded Vector Memory
We didn't throw out vector search entirely. It's still useful for the "find me something similar" queries that don't fit a structured traversal.
"What's an example of how we handle pagination in this codebase?"
That's a semantic search. The agent doesn't need to traverse relationships — it needs to find a representative example. Vector retrieval with reranking handles this well.
We use a two-stage approach: broad vector retrieval (top 50) followed by a reranker that scores relevance to the specific task. The reranker catches false positives from the embedding model. It costs an extra 200ms per query. Worth it. (I'd argue most teams drop reranking too early: the extra 200ms sounds worse on paper than it turns out to be in practice.)
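A rough sketch of the retrieve-then-rerank shape. Everything here is illustrative: the two-dimensional vectors are toy embeddings, `word_overlap` is a stand-in for a real reranking model, and the documents are invented.

```python
import math
from typing import Callable, List, Sequence, Tuple

Doc = Tuple[str, Sequence[float]]  # (text, toy embedding)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def word_overlap(task: str, text: str) -> float:
    """Toy reranker: number of words the task and the document share."""
    return float(len(set(task.lower().split()) & set(text.lower().split())))


def retrieve_then_rerank(
    query_vec: Sequence[float],
    task: str,
    docs: List[Doc],
    rerank_score: Callable[[str, str], float],
    broad_k: int = 50,
    final_k: int = 5,
) -> List[str]:
    # Stage 1: cheap, broad recall by embedding similarity.
    candidates = sorted(docs, key=lambda d: cosine(query_vec, d[1]), reverse=True)[:broad_k]
    # Stage 2: slower, task-aware scoring over the short list only.
    reranked = sorted(candidates, key=lambda d: rerank_score(task, d[0]), reverse=True)
    return [text for text, _ in reranked[:final_k]]


docs: List[Doc] = [
    ("page layout styles for the dashboard", [0.95, 0.05]),  # embedding false positive
    ("pagination helper using cursor tokens", [0.9, 0.1]),
    ("billing invoice export", [0.1, 0.9]),
]
top = retrieve_then_rerank(
    [1.0, 0.0], "handle pagination with cursor", docs, word_overlap, broad_k=2, final_k=1
)
```

The layout document wins stage one on raw similarity, and the reranker demotes it once the task text is taken into account.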
Tier 3: Episodic State Recovery
Here's the one nobody talks about: what happens when an agent session crashes? Or when the agent works on a ticket for 30 minutes, gets blocked waiting for CI, and resumes 4 hours later?
Episodic memory tracks what happened, not just what exists. We checkpoint agent state every 60 seconds during active work:
- What the agent has read
- What changes it's made (uncommitted)
- What external calls it's waiting on
- What hypotheses it's explored and rejected
When the agent resumes, it reconstructs where it was. "I was adding rate limiting. I'd already looked at the existing rate limiter in /lib/throttle.ts. I'd drafted a change to /api/users but hadn't tested it yet. CI was still running on my previous commit."
This matters more than you'd expect. Long-horizon tasks — the kind that take hours or days — can't be one-shot. Agents get interrupted. Context windows reset. Without episodic checkpointing, every interruption means starting over.
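One way to picture a checkpoint is as a small serializable record with one field per item in the list above. This is a sketch under assumptions: the `Checkpoint` class, its field names, and the JSON format are hypothetical, not the DevOS schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Checkpoint:
    ticket: str
    files_read: List[str] = field(default_factory=list)
    uncommitted_changes: List[str] = field(default_factory=list)
    pending_calls: List[str] = field(default_factory=list)
    rejected_hypotheses: List[str] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize so the state survives a crash or a long pause."""
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, raw: str) -> "Checkpoint":
        return cls(**json.loads(raw))

    def resume_summary(self) -> str:
        """Short text the agent can read to reorient itself on resume."""
        waiting = ", ".join(self.pending_calls) or "nothing"
        return (
            f"Resuming '{self.ticket}': read {len(self.files_read)} file(s), "
            f"{len(self.uncommitted_changes)} uncommitted change(s), waiting on {waiting}."
        )


saved = Checkpoint(
    ticket="Add rate limiting to /api/users",
    files_read=["/lib/throttle.ts"],
    uncommitted_changes=["/api/users"],
    pending_calls=["CI run on previous commit"],
)
restored = Checkpoint.from_json(saved.to_json())
summary = restored.resume_summary()
```

Recording rejected hypotheses is what stops a resumed agent from re-exploring dead ends it already ruled out.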
What We Still Got Wrong
The three-tier system works. It's what DevOS ships for agent memory. (For context on how we measure whether agents actually work, see our piece on why SWE-bench isn't enough.) But I'd be lying if I said we nailed it.
Ingestion latency is annoying. When a developer pushes a new ADR, it takes 2-3 minutes for Graphiti to re-index and make it available to agents. During that window, an agent might make a decision with stale context. We've thought about real-time streaming ingestion but haven't implemented it yet. For now, we tell users to wait a beat after pushing docs before assigning agent tasks.
Episodic recovery doesn't handle merge conflicts. If an agent pauses, a human pushes changes to the same file, and the agent resumes — it's going to have a bad time. We detect the conflict and alert, but we don't auto-resolve. A human has to intervene. Not ideal.
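Detection of this kind can be as simple as comparing content hashes taken at checkpoint time with the files as they exist on resume. A minimal sketch, assuming in-memory path-to-content maps in place of a real working tree; `snapshot` and `changed_since` are hypothetical helpers, and this is not necessarily how DevOS detects conflicts.

```python
import hashlib
from typing import Dict, List


def digest(content: str) -> str:
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def snapshot(files: Dict[str, str]) -> Dict[str, str]:
    """Record a content hash per path at checkpoint time."""
    return {path: digest(content) for path, content in files.items()}


def changed_since(snap: Dict[str, str], current: Dict[str, str]) -> List[str]:
    """Paths whose content differs from the snapshot, or that were deleted."""
    return sorted(
        path
        for path, old_hash in snap.items()
        if path not in current or digest(current[path]) != old_hash
    )


at_pause = {"/api/users": "handler v1", "/lib/throttle.ts": "limiter v1"}
snap = snapshot(at_pause)

# A human edits one of the files while the agent is paused.
on_resume = {"/api/users": "handler v2 (human edit)", "/lib/throttle.ts": "limiter v1"}
conflicts = changed_since(snap, on_resume)
```

A non-empty result is the signal to stop and alert a human instead of replaying the agent's drafted changes.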
The knowledge graph gets noisy over time. Old incident reports that were resolved years ago still show up in traversals. We haven't implemented decay or archiving for stale nodes. The graph grows monotonically. At some point, this will matter. It already annoys me every time I see 2019 incidents surfacing for a 2026 task.
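For illustration only, one common shape for the decay described as missing here is an exponential recency weight applied to a node's score at query time. The function name and the one-year half-life are assumptions chosen for the example, not something DevOS has built.

```python
from datetime import date


def recency_weight(created: date, today: date, half_life_days: float = 365.0) -> float:
    """Weight that halves every `half_life_days`; 1.0 for a node created today."""
    age_days = max((today - created).days, 0)
    return 0.5 ** (age_days / half_life_days)


today = date(2026, 1, 1)
recent_incident = recency_weight(date(2025, 1, 1), today)  # exactly one half-life old
old_incident = recency_weight(date(2019, 6, 1), today)     # several half-lives old
```

Multiplying a traversal score by this weight would push old incidents down the ranking without deleting them, which matters because some old incidents stay relevant and should still be reachable.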
We're also uncertain whether three tiers is the right abstraction long-term or whether we'll collapse to two (graph + episodic) or expand to four (separating semantic memory from procedural memory). The honest answer: we don't know yet. We're watching how agents use each tier and adjusting.
The Metrics That Matter
Since shipping the three-tier system, we've tracked a few things:
- Context miss rate: How often an agent makes a decision that contradicts existing repo knowledge. Dropped from ~18% to ~4% after adding Graphiti.
- Re-work rate: How often a human reviewer sends a PR back for changes that "should have been obvious from the codebase." Down ~60%.
- Interrupted session recovery: Agents resume successfully ~94% of the time. The 6% failures? Mostly merge conflicts.
The metric we don't have yet: long-horizon learning. Does an agent that's worked on a repo for three months make better decisions than one starting fresh? We believe so, but we haven't instrumented it.
Ask us in six months. (Or don't — we'll probably write about it anyway.)
Why This Architecture for Coding Agents Specifically
If you're building an agent for customer support or document summarization, your memory needs are different. Coding agents have a specific pattern:
- High structural interdependence: Code references code references code. A change in one file ripples through many. Pure semantic search misses these structural relationships.
- History matters a lot: Why something was built a certain way matters as much as how it's built now. ADRs, incident reports, PR discussions — these are institutional memory.
- Long task horizons: A coding task might span hours or days with interruptions. Session memory has to survive those gaps.
Generic RAG pipelines — the kind you'd build for a chatbot — don't handle these patterns well. We tried. (Boy, did we try. Six weeks of "maybe if we just tweak the chunk size...") It's why we ended up building something custom. The broader shift toward agents replacing IDEs entirely demands memory systems that weren't designed for chat interfaces.
If you're designing memory for agents in a different domain, your tiers might look different. But the principle holds: single-tier memory — just embeddings, just context stuffing — breaks down when tasks get complex. Anyone telling you otherwise probably hasn't shipped agents that run for more than ten minutes.
Building Memory-Native Agents
The DevOS positioning is AI agents as employees inside your sprint — agents that take tickets, ship PRs, and accumulate knowledge over time. (We've written about how this architecture differs from other agent marketplaces if you want the full breakdown.) That last part, "accumulate knowledge," only works if memory works.
We're pre-launch. The three-tier system is built and running internally. If you're curious how it holds up in a real sprint workflow — with a backlog, with handoffs, with the kind of context that accumulates over weeks — the waitlist is open.
For teams building their own agent memory systems, here's the short version: don't assume vector search solves everything. Structure matters. History matters. Interruption recovery matters. Plan for all three, or your agents will forget what they shipped last Tuesday. (And if you're running multiple products with AI assistance, here's how VDL structures its AI workforce across the portfolio.)
Frequently Asked Questions
Why does naive vector search fail for coding agent memory?
Vector embeddings capture semantic similarity, not structural relationships. When an agent searches for "authentication flow," it gets code snippets mentioning auth — but not the ADR explaining why you chose JWT over sessions, not the incident report from the refresh token bug, and not the PR discussion about rate limiting. The context is scattered across semantically dissimilar documents that cosine similarity won't surface together.
What's the difference between episodic and semantic memory for AI agents?
Semantic memory stores facts and relationships — "the User model has a hasMany relationship with Posts." Episodic memory stores events and sequences — "last Tuesday, the agent tried to add a field to User, hit a migration conflict, rolled back, and used a join table instead." Both matter. Semantic memory answers "what is true?" Episodic memory answers "what happened before when we tried this?"
How does Graphiti work for agent memory in DevOS?
Graphiti builds a knowledge graph from repository artifacts — code, ADRs, runbooks, PR discussions, incident reports. Instead of pure vector similarity, the agent traverses relationships: "this function calls that service, which is documented in this runbook, which was updated after this incident." The graph structure preserves context that embeddings lose.
How does DevOS handle memory across long-running agent sessions?
Three tiers work together. Graphiti knowledge graphs for structured relationships across the repo. Embedded vector memory for semantic search when finding related concepts. Automatic state recovery that checkpoints agent progress so a session interrupted on Tuesday resumes Thursday without losing context. The tiers complement each other — no single approach handles everything.