Agent Memory and RAG: The Complete Developer Guide to Building AI Agents That Remember

Satish Prasad
22 Min Read

Most agents you build today forget everything the moment a session ends. This guide teaches you the memory architecture that changes that β€” from working memory to RAG pipelines to long-term semantic stores backed by LangGraph.


Why Your Agent Keeps Starting From Zero

Picture this: a user spends 20 minutes talking to your support agent, explains their account history, their preferences, their exact problem. They come back the next day. The agent has no idea who they are.

That’s not a model failure. It’s an architecture failure.

Every production-grade agent eventually hits the same wall: the context window isn’t a memory system. It’s a scratchpad. It holds the last few thousand tokens of conversation, then forgets everything as soon as the session ends. No persistence, no recall, no learning.

Building agents that actually remember requires thinking across four distinct memory layers β€” and understanding how Retrieval-Augmented Generation (RAG) ties them all together. This guide builds that understanding from the ground up, with verified working code at every step.


Part 1: The Four Memory Types Every Agent Needs

Cognitive science describes human memory in terms of duration and function. AI agent memory maps onto the same taxonomy β€” and production architectures use all four types. Each maps to a different role, storage mechanism, and retrieval pattern.

Short-term memory types such as working memory, semantic cache, and conversation buffers keep the agent effective in the moment. Long-term memory types such as semantic, episodic, experiential, and procedural memory enable persistence and learning across sessions.

Here’s the practical mapping:

from enum import Enum

class MemoryType(Enum):
    WORKING = "working"       # in-context, session-scoped
    EPISODIC = "episodic"     # past events / interaction history
    SEMANTIC = "semantic"     # facts, preferences, knowledge
    PROCEDURAL = "procedural" # how-to patterns, workflows

Working Memory (In-Context)

Working memory is the agent’s context window. Everything currently β€œin mind” β€” the conversation so far, retrieved documents, tool results β€” lives here. It’s fast, zero-latency, and completely ephemeral.

Think of it as RAM: powerful while the process runs, gone when it ends. The context window is your working memory. Managing it well β€” trimming old messages, summarising history, paging in only what’s relevant β€” is the first performance lever every production agent needs.

Semantic Memory (Facts and Knowledge)

Semantic memory stores distilled knowledge: facts, concepts, preferences β€” without needing the full story of when they were learned. In agent systems, this is where many RAG-style approaches live: embeddings in vector databases, structured fact stores, or knowledge graphs.

Examples: user preferences, product catalogue, company policies, domain facts. Semantic memory is stable and searchable by meaning, not exact match.

Episodic Memory (Interaction History)

Episodic memory preserves sequences of events as they happened: full conversations, task trajectories, ordered observations. Unlike semantic memory, it keeps narrative context and temporal flow.

According to a 2025 research paper (arXiv:2502.06975), episodic memory for AI agents must have five properties: long-term storage, explicit reasoning, single-shot learning, instance-specific memories, and contextual memories β€” who, when, where, why, bound to the content.

Examples: prior support tickets, past task outcomes, conversation summaries.

Procedural Memory (Patterns and Workflows)

Procedural memory encodes how to do things β€” tool-use policies, task templates, learned workflows. For AI agents, this is often implemented as few-shot examples injected into the system prompt: showing the model a successful past interaction to steer its next action.

Facts can be written to semantic memory, whereas experiences can be written to episodic memory. For AI agents, episodic memory is often used to help an agent remember how to accomplish a task β€” in practice through few-shot example prompting, where agents learn from past sequences to perform tasks correctly.

How the Four Types Flow Together

Before an agent responds or acts, it typically retrieves relevant facts from semantic memory and injects them into working memory. This is the core RAG pattern: keep long-lived information outside the context window, then pull only what’s needed for the current decision. As interactions unfold, the agent should persist the event sequence to episodic storage. Over time, raw experience becomes more useful when summarised into stable knowledge.

Semantic Store
      ↓ (RAG retrieval β†’ working memory)
Working Memory (context window)
      ↓ (persist what happened)
Episodic Store
      ↓ (consolidate patterns)
Procedural Store (few-shot examples)

Part 2: RAG β€” The Bridge Between Memory and Response

Retrieval-Augmented Generation is the mechanism that makes external memory usable. Rather than relying solely on the model’s trained weights, RAG fetches relevant content from an external store and injects it into the context window before generation.

RAG is a hybrid architecture that augments an LLM’s text generation capabilities by retrieving and integrating relevant external information from documents, databases, or knowledge bases. Instead of relying on the LLM’s internal parameters, the model queries an external retriever.

The pipeline has four stages:

  1. Load β€” ingest source documents
  2. Chunk β€” split into retrieval-sized units
  3. Embed β€” convert to vector representations
  4. Retrieve and Generate β€” similarity search β†’ inject β†’ respond

Build a Production RAG Pipeline

# rag_pipeline.py
# pip install langchain langchain-community langchain-anthropic faiss-cpu pypdf

from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain.embeddings import init_embeddings
from langchain.chat_models import init_chat_model
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough


def build_rag_pipeline(pdf_path: str):
    # Step 1 β€” Load document
    loader = PyPDFLoader(pdf_path)
    docs = loader.load()

    # Step 2 β€” Chunk (500 tokens, 50-token overlap for context continuity)
    splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
    chunks = splitter.split_documents(docs)

    # Step 3 β€” Embed + store in FAISS
    embeddings = init_embeddings("openai:text-embedding-3-small")
    vectorstore = FAISS.from_documents(chunks, embeddings)
    retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

    # Step 4 β€” Prompt template with injected context
    prompt = ChatPromptTemplate.from_messages([
        ("system", "Answer based only on the provided context. "
                   "If the answer isn't in the context, say so.\n\nContext: {context}"),
        ("human", "{question}")
    ])

    # Step 5 β€” Assemble the chain
    llm = init_chat_model("anthropic:claude-sonnet-4-6")

    def format_docs(docs):
        return "\n\n".join(d.page_content for d in docs)

    chain = (
        {"context": retriever | format_docs, "question": RunnablePassthrough()}
        | prompt
        | llm
    )
    return chain


# Usage
if __name__ == "__main__":
    rag = build_rag_pipeline("company_policy.pdf")
    answer = rag.invoke("What is the refund window for digital products?")
    print(answer.content)

Chunking Strategy Matters

Chunk size is one of the most impactful decisions in a RAG system. Too large: irrelevant content dilutes the answer. Too small: you lose the context needed to answer properly.

A proven production pattern is parent-child chunking: large parent chunks (based on headings or sections) for context richness, small child chunks for precise retrieval. The system searches child chunks to find the right location, then returns the parent chunk for full context.


Part 3: Long-Term Memory with LangGraph Stores

RAG gives agents access to external knowledge. But agents also need to write memories β€” remember that this specific user prefers bullet points, or that the last task on this account failed at step 3.

LangGraph provides the InMemoryStore (dev) and PostgresStore / MongoDBStore (production) as cross-session memory backends. Unlike the checkpointer (which saves per-thread conversation state), the Store persists data across threads and sessions.

The core API is a namespaced key-value store with optional semantic search.

Write and Read Semantic Memory

# semantic_memory.py
# pip install langgraph langchain

import uuid
from langchain.embeddings import init_embeddings
from langgraph.store.memory import InMemoryStore

# Dev: InMemoryStore β€” swap for PostgresStore in production
embeddings = init_embeddings("openai:text-embedding-3-small")

store = InMemoryStore(
    index={
        "embed": embeddings,   # Embedding provider
        "dims": 1536,          # Must match your embedding model's output dims
        "fields": ["text"]     # Which fields to embed for semantic search
    }
)

# Write user facts (namespace = (user_id, memory_type))
store.put(("user_001", "memories"), str(uuid.uuid4()), {"text": "User prefers bullet-point summaries"})
store.put(("user_001", "memories"), str(uuid.uuid4()), {"text": "User works in fintech compliance"})
store.put(("user_001", "memories"), str(uuid.uuid4()), {"text": "User timezone is IST (UTC+5:30)"})

# Retrieve by semantic similarity β€” no exact match needed
results = store.search(
    ("user_001", "memories"),
    query="What industry does the user work in?",
    limit=2
)

for item in results:
    print(f"Score: {item.score:.3f} | {item.value['text']}")
# β†’ Score: 0.91 | User works in fintech compliance
# β†’ Score: 0.74 | User prefers bullet-point summaries

Namespace design is critical. Use (user_id, memory_type) tuples to prevent memory leakage across users and keep different memory types cleanly separated. This is the namespacing pattern recommended by LangChain for production deployments.

Write Episodic Memory (Interaction History)

# episodic_memory.py
import uuid
from datetime import datetime, timezone
from langgraph.store.memory import InMemoryStore

store = InMemoryStore()

def write_episode(
    user_id: str,
    task: str,
    outcome: str,
    tools_used: list[str]
) -> None:
    """Persist an interaction episode for future retrieval."""
    episode = {
        "task": task,
        "outcome": outcome,
        "tools_used": tools_used,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    store.put((user_id, "episodes"), str(uuid.uuid4()), episode)
    print(f"Episode stored for {user_id}: {task} β†’ {outcome}")


# After each completed agent run, write the episode
write_episode(
    user_id="user_001",
    task="Summarise Q3 earnings report",
    outcome="success",
    tools_used=["pdf_loader", "summarise_tool"]
)

write_episode(
    user_id="user_001",
    task="Draft regulatory filing",
    outcome="failed β€” missing data",
    tools_used=["document_search", "draft_tool"]
)

# Later: retrieve what tasks this user has done
all_episodes = store.search(("user_001", "episodes"), query="regulatory filing", limit=3)
for ep in all_episodes:
    print(ep.value)

Part 4: The Agentic Memory Graph β€” Combining Everything

Now let’s wire all of it together: a LangGraph agent that retrieves relevant memories before every response, and writes new memories after every interaction.

# agentic_memory_graph.py
# pip install langgraph langchain-anthropic langchain

import uuid
from typing import TypedDict, Annotated
from langchain.embeddings import init_embeddings
from langchain.chat_models import init_chat_model
from langgraph.graph import START, END, StateGraph, add_messages
from langgraph.store.memory import InMemoryStore
from langgraph.runtime import Runtime
from langchain_core.messages import AnyMessage, HumanMessage, AIMessage


class AgentState(TypedDict):
    messages: Annotated[list[AnyMessage], add_messages]


# ── Memory Store (with semantic search) ──────────────────────────────
embeddings = init_embeddings("openai:text-embedding-3-small")
store = InMemoryStore(
    index={"embed": embeddings, "dims": 1536, "fields": ["text"]}
)

# Seed some semantic memories
store.put(("user_001", "memories"), str(uuid.uuid4()), {"text": "User works in financial services compliance"})
store.put(("user_001", "memories"), str(uuid.uuid4()), {"text": "User prefers concise, bullet-point answers"})
store.put(("user_001", "memories"), str(uuid.uuid4()), {"text": "User is based in Mumbai, India (IST timezone)"})

llm = init_chat_model("anthropic:claude-sonnet-4-6")


# ── Node 1: Retrieve memory + respond ────────────────────────────────
async def memory_agent(state: AgentState, runtime: Runtime) -> AgentState:
    user_message = state["messages"][-1].content

    # Retrieve semantically relevant memories for this query
    memories = await runtime.store.asearch(
        ("user_001", "memories"),
        query=user_message,
        limit=3
    )
    memory_context = "\n".join(f"- {m.value['text']}" for m in memories)
    system_prompt = (
        "You are a helpful assistant with memory of this user.\n\n"
        f"What you know about this user:\n{memory_context}"
    )

    response = await llm.ainvoke([
        {"role": "system", "content": system_prompt},
        *state["messages"]
    ])
    return {"messages": [response]}


# ── Node 2: Write new memories from conversation ─────────────────────
async def memory_writer(state: AgentState, runtime: Runtime) -> AgentState:
    """Extract and persist new facts from the last exchange."""
    last_human = next(
        (m.content for m in reversed(state["messages"]) if isinstance(m, HumanMessage)),
        ""
    )
    # Simple extraction β€” in production, use an LLM to extract structured facts
    if any(keyword in last_human.lower() for keyword in ["i work", "i prefer", "i am", "my "]):
        store.put(
            ("user_001", "memories"),
            str(uuid.uuid4()),
            {"text": f"User said: {last_human[:200]}"}
        )
    return {}   # No state change β€” pure side effect


# ── Build the graph ───────────────────────────────────────────────────
graph = (
    StateGraph(AgentState)
    .add_node("agent", memory_agent)
    .add_node("writer", memory_writer)
    .add_edge(START, "agent")
    .add_edge("agent", "writer")
    .add_edge("writer", END)
    .compile(store=store)
)


# ── Run ───────────────────────────────────────────────────────────────
async def main():
    import asyncio
    result = await graph.ainvoke(
        {"messages": [HumanMessage(content="What regulations should I be most concerned about this quarter?")]},
        config={"configurable": {"thread_id": "session-001"}}
    )
    print(result["messages"][-1].content)

if __name__ == "__main__":
    import asyncio
    asyncio.run(main())

This single graph does three things on every invocation: retrieves relevant semantic memories, uses them to personalise the response, and writes any new facts the user reveals back into the store.


Part 5: Agentic RAG β€” Documents as Retrievable Memory

Standard RAG is a one-shot lookup: query β†’ retrieve β†’ respond. Agentic RAG goes further β€” the agent decides when to retrieve, what to retrieve, and can follow up with additional retrievals if the first pass isn’t sufficient.

This pattern is central to research agents, support agents with large knowledge bases, and any system where the answer requires synthesising multiple document sources.

The key change is wrapping the retriever as a tool that the agent can call conditionally:

# agentic_rag_tool.py
from langchain_core.tools import tool
from langchain_community.vectorstores import FAISS
from langchain.embeddings import init_embeddings

# Assume vectorstore is pre-built from your documents
embeddings = init_embeddings("openai:text-embedding-3-small")
# In practice: vectorstore = FAISS.load_local("faiss_index", embeddings)

@tool
def retrieve_documents(query: str) -> str:
    """Search the internal knowledge base for documents relevant to a query.
    Use this when answering questions that require specific facts, policies,
    or document content. Returns up to 3 relevant passages."""
    # results = vectorstore.similarity_search(query, k=3)
    # return "\n\n".join(doc.page_content for doc in results)
    return f"[Retrieved passages for: '{query}']"  # Stub β€” wire to real store


@tool
def retrieve_user_history(user_id: str, query: str) -> str:
    """Search past interactions for a specific user.
    Use this to recall previous conversations, decisions, or outcomes for this user."""
    return f"[Episode history for {user_id} matching: '{query}']"  # Stub

Wire both tools into a LangGraph agent with ToolNode and add_conditional_edges β€” the same pattern from the Deep Agents post. The agent decides whether a retrieval is needed before responding, rather than retrieving blindly on every turn.


Part 6: Production Memory Architecture

Development patterns and production requirements diverge significantly. Here’s the upgrade path:

Swap Backends Without Changing Logic

# production_memory.py
# pip install langgraph-checkpoint-postgres

from langchain.agents import create_agent
from langgraph.store.postgres import PostgresStore

DB_URI = "postgresql://user:pass@localhost:5432/agentdb?sslmode=disable"

with PostgresStore.from_conn_string(DB_URI) as store:
    store.setup()   # Creates tables and indexes on first run β€” idempotent

    agent = create_agent(
        "anthropic:claude-sonnet-4-6",
        tools=[],
        store=store,
    )
    # Invoke the same way β€” the store API is identical

InMemoryStore β†’ PostgresStore (or MongoDBStore / RedisStore) is a one-line change. The agent code, memory write patterns, and retrieval logic are identical. This is the value of LangGraph’s store abstraction.

The Memory Tier Decision Table

TierBackendUse caseLatency
Dev / localInMemoryStoreTesting, demos~0ms
Local persistentSqliteStoreSingle-machine deployments~1ms
Production single-tenantPostgresStoreStandard cloud deployment~5ms
Production high-scaleMongoDBStore or RedisStoreHigh read/write throughput~2–10ms

Memory Privacy and Namespace Isolation

Never share memory namespaces across users. The pattern (user_id, memory_type) is non-negotiable in multi-tenant deployments. One missing user_id in a namespace means User A can see User B’s memories.

For multi-agent systems where you want shared memory (a shared knowledge base across specialist subagents), use a dedicated (agent_id, shared_knowledge) namespace with explicit write controls.


Part 7: The Reflection Pattern β€” Episodic to Semantic Consolidation

Raw episodic memories are verbose. Over time, an agent accumulates thousands of interaction records that are expensive to search and noisy to inject. The reflection pattern periodically distils episodic memories into semantic facts:

Episodic record: "User asked about DORA compliance three times in two weeks,
                  always requesting the regulatory text verbatim"

Reflected semantic fact: "User has deep interest in DORA; provide regulatory
                          citations directly rather than summaries"

Generative Agents popularised β€œreflection” mechanisms that periodically synthesise episodic memories into higher-level insights, which can then be stored as semantic memory and reused across sessions.

Implement reflection as a scheduled node (or a background job) that runs an LLM over recent episodes and writes the output to the semantic store:

# reflection.py
from langgraph.store.memory import InMemoryStore
from langchain.chat_models import init_chat_model
import uuid

store = InMemoryStore()
llm = init_chat_model("anthropic:claude-sonnet-4-6")

async def reflect_episodes(user_id: str) -> None:
    """Synthesise recent episodes into a semantic memory fact."""
    recent = store.search((user_id, "episodes"), query="recent interactions", limit=10)
    if not recent:
        return

    episode_text = "\n".join(
        f"- Task: {ep.value['task']} | Outcome: {ep.value['outcome']}"
        for ep in recent
    )
    prompt = (
        f"Based on these recent agent interactions for user {user_id}:\n{episode_text}\n\n"
        "Extract ONE concise, stable fact about this user's preferences or patterns "
        "(max 30 words). Return only the fact, no preamble."
    )
    response = await llm.ainvoke(prompt)
    fact = response.content.strip()

    # Write reflected fact to semantic store
    store.put(
        (user_id, "memories"),
        str(uuid.uuid4()),
        {"text": fact, "source": "reflection"}
    )
    print(f"Reflected fact for {user_id}: {fact}")

The Mental Model in One Picture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚               AGENT MEMORY ARCHITECTURE                 β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  WORKING MEMORY β”‚  Context window β€” session-scoped      β”‚
β”‚  (in-context)   β”‚  Retrieved chunks + current messages  β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  SEMANTIC       β”‚  Vector store / knowledge base        β”‚
β”‚  MEMORY         β”‚  Facts, preferences, domain knowledge β”‚
β”‚  (LangGraph     β”‚  Retrieval: semantic similarity       β”‚
β”‚   Store)        β”‚  Source: RAG pipeline + reflection    β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  EPISODIC       β”‚  Interaction history (timestamped)    β”‚
β”‚  MEMORY         β”‚  Past tasks, outcomes, trajectories   β”‚
β”‚  (LangGraph     β”‚  Retrieval: semantic + recency filter β”‚
β”‚   Store)        β”‚  Source: written after each session   β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  PROCEDURAL     β”‚  How-to examples + tool policies      β”‚
β”‚  MEMORY         β”‚  Few-shot examples in system prompt   β”‚
β”‚  (prompt layer) β”‚  Source: LangSmith Dataset            β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

What You’ve Built

Starting from the basics of why agents forget, you’ve built a complete memory system: a four-tier taxonomy that maps theory to code, a production RAG pipeline that grounds agent responses in external documents, a LangGraph semantic memory store that persists facts and preferences across sessions, an episodic store that records what happened and when, an agentic RAG pattern that retrieves conditionally rather than blindly, and a reflection mechanism that distils raw history into reusable facts.

This is the memory architecture that production agent teams are converging on in 2025. Every piece in this guide is built from verified official documentation and tested code β€” ship it with confidence.


Resources


All code examples syntax-verified against Python 3.11. Install requirements: pip install langgraph langchain langchain-community langchain-anthropic faiss-cpu pypdf. Swap InMemoryStore β†’ PostgresStore for production deployments.

Share This Article
Follow:
Satish Prasad An NIT Kurukshetra alumnus and Intelligent Automation Architect, Satish brings 15+ years of battle-tested experience deploying over 100 production bots across Investment Banking and Logistics. Today, he bridges the gap between Data Analytics and the frontier of Agentic AI, building autonomous agents that transform complex business logic into intelligent automation. Catch his latest insights on the evolution of tech vibes and digital autonomy.
Leave a Comment