The Dual-Memory Architecture: Orchestrating Short-Term Agent Context and Long-Term Knowledge Graphs

2026-04-24 10:00:00+00:00

As AI agents transition from simple conversational chatbots into autonomous software engineers, managing state has become one of the most critical challenges in systems design.

An agent like Claude Code or Cline needs to know what you said two seconds ago, but it also needs to remember that you prefer PostgreSQL UNIX socket authentication, which you mentioned three weeks ago.

If we stuff all historical chats and facts directly into the LLM's context window, three problems follow:

The context window will overflow, causing the agent to forget the beginning of the conversation.
The API costs will skyrocket as the token counts balloon.
The agent’s response latency will degrade, making real-time collaboration slow.
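To make the cost problem concrete, here is a rough sketch of how many prompt tokens are sent when the full history is resent on every turn, compared with a capped window. The turn count, tokens per turn, and cap are illustrative assumptions, not measurements from any real agent.

```python
from typing import Optional


def total_input_tokens(turns: int, tokens_per_turn: int,
                       window_cap: Optional[int] = None) -> int:
    """Sum the prompt tokens sent over a whole conversation.

    On turn n the prompt carries all n turns so far, unless window_cap
    limits how many tokens of history are resent.
    """
    total = 0
    for n in range(1, turns + 1):
        prompt = n * tokens_per_turn
        if window_cap is not None:
            prompt = min(prompt, window_cap)
        total += prompt
    return total


# Hypothetical workload: 100 turns of roughly 500 tokens each
full = total_input_tokens(turns=100, tokens_per_turn=500)
capped = total_input_tokens(turns=100, tokens_per_turn=500, window_cap=8000)
print(full, capped)  # 2525000 740000
```

Resending everything grows quadratically with the number of turns, while a capped prompt grows linearly once the cap is reached.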

To solve this, modern agent platforms use a Dual-Memory Architecture (modeled in NanoClaw). It separates short-term working context from long-term semantic knowledge graphs. This article explores how to architect, partition, and coordinate these memory layers in production.

🏗️ Partitioning Working Context and Graph Memory

A resilient agentic memory engine partitions information into two distinct subsystems based on access speed, search parameters, and persistence:

1. Short-Term Memory: Conversational Context

What it stores: Immediate chat threads, command histories, active file paths, and transient system variables.
Storage format: Locally cached files (like .claude.json or .cline.json) linked to active SQLite database sessions.
Characteristics: Extremely fast, sequential, and ephemeral. When a task session ends, the short-term working context is completely purged to free up context space.
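The short-term layer can be pictured as a small, bounded buffer that is emptied when the session ends. The sketch below is an in-memory stand-in, not the actual .claude.json format; the class name `WorkingContext` and the `max_messages` bound are assumptions made for illustration.

```python
from collections import deque


class WorkingContext:
    """Hypothetical in-memory stand-in for the short-term layer."""

    def __init__(self, max_messages: int = 50):
        # deque(maxlen=...) drops the oldest entry once the buffer is full
        self.messages = deque(maxlen=max_messages)

    def append(self, role: str, text: str) -> None:
        self.messages.append({"role": role, "text": text})

    def purge(self) -> None:
        """Drop everything, as happens when a task session ends."""
        self.messages.clear()


ctx = WorkingContext(max_messages=2)
ctx.append("user", "run the tests")
ctx.append("agent", "3 tests failed")
ctx.append("user", "show the first failure")
recent = [m["text"] for m in ctx.messages]  # only the two newest remain
ctx.purge()
```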

2. Long-Term Memory: The Semantic Knowledge Graph

What it stores: Declared preferences, structural system rules, architectural mappings, and historic lessons learned.
Storage format: Semantic graph entities and vector embeddings hosted in a graph database (like FalkorDB via Segnog).
Characteristics: Persistent, associative, and decoupled. It is searched via vector similarity or Cypher graph queries, and only the relevant facts are injected into the agent's context window on demand.
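As a rough illustration of "injecting only relevant facts", the sketch below ranks stored facts by cosine similarity to a query embedding and keeps the top matches. It runs in plain Python with toy 3-dimensional vectors; the function names, the `min_score` threshold, and the example vectors are assumptions, and a real deployment would delegate the search to the graph database.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_relevant_facts(query_vec: Sequence[float],
                          facts: List[Tuple[str, Sequence[float]]],
                          k: int = 2,
                          min_score: float = 0.5) -> List[str]:
    """Return the text of the k most similar facts above min_score."""
    scored = [(cosine(query_vec, vec), text) for text, vec in facts]
    scored = [pair for pair in scored if pair[0] >= min_score]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy embeddings; real ones would come from an embedding model
facts = [
    ("Prefers PostgreSQL UNIX socket authentication", [1.0, 0.0, 0.0]),
    ("Developer uses staging database at port 8443", [0.9, 0.1, 0.0]),
    ("Uses tabs for indentation", [0.0, 0.0, 1.0]),
]
relevant = select_relevant_facts([1.0, 0.0, 0.0], facts)
prompt_prefix = "Known facts:\n" + "\n".join(f"- {t}" for t in relevant)
```

Only the two database-related facts reach the prompt; the unrelated indentation preference stays in long-term storage.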

🛠️ The Synchronization Logic: Handling the "/reset" Command

When the short-term cache and long-term database are separated, managing state synchronization is essential.

If a user tells the agent, "Forget our current chat context, let's start fresh", the agent must wipe its short-term conversational cache. If it doesn't, stale error logs and outdated code paths stay in the prompt, and the agent can keep retrying the same failed approach.

However, we do not want to wipe the user's long-term memory (e.g., their API keys or server configuration preferences).

To execute this clean split, we implement a /reset state-machine inside our orchestrator. When the webhook handler catches the /reset command, it purges the local SQLite session reference and deletes the container's .claude.json working file, forcing the container to spawn with a clean-slate working cache while keeping long-term graph entities untouched in FalkorDB.

Below is the state-management logic written in TypeScript:

// session-manager.ts (NanoClaw Session Orchestrator)
import { promises as fs } from 'fs';
import path from 'path';
import { logger } from './logger.js';

interface SessionRecord {
  sessionId: string;
  groupFolder: string;
  activeContextPath: string; // Path to local .claude.json
}

// In-memory stand-in for the SQLite session table, keyed by group folder
const sessionsDatabase = new Map<string, SessionRecord>();

/**
 * Handles incoming chat commands, routing and purging working contexts.
 */
export async function handleUserMessage(
  sessionId: string, 
  groupFolder: string, 
  messageText: string
): Promise<string> {
  const contextDir = path.join('/var/nanoclaw/sessions', groupFolder);
  const cachePath = path.join(contextDir, '.claude.json');

  // Catch the fresh start /reset command
  if (messageText.trim() === '/reset') {
    logger.warn({ sessionId, groupFolder }, 'Wiping short-term conversational cache...');

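    try {
      // 1. Delete the ephemeral .claude.json context file
      await fs.unlink(cachePath);
      logger.info({ cachePath }, 'Deleted local session file.');
    } catch (err) {
      // A missing file just means there is nothing to purge
      if ((err as NodeJS.ErrnoException).code !== 'ENOENT') {
        logger.error({ err }, 'Failed to delete session cache file.');
        // Do not report success if the stale cache is still on disk
        return "⚠️ Could not clear the session cache. Check the orchestrator logs and try again.";
      }
    }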

    // 2. Clear the active session key map
    sessionsDatabase.delete(groupFolder);

    return "🔄 Ephemeral chat history cleared! Starting a fresh session. Your long-term settings are preserved in FalkorDB.";
  }

  // Resume standard execution loop
  let session = sessionsDatabase.get(groupFolder);
  if (!session) {
    session = {
      sessionId,
      groupFolder,
      activeContextPath: cachePath
    };
    sessionsDatabase.set(groupFolder, session);
    
    // Ensure directory exists
    await fs.mkdir(contextDir, { recursive: true });
  }

  // Relay message down to the sandbox container
  return `Resuming session ${session.sessionId}...`;
}

🐍 Programmatic Long-Term Fact Insertion

While short-term context is written automatically by the agent to .claude.json, long-term memory is written by extracting semantic facts from the conversation.
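The extraction step itself can take many forms. The sketch below uses a couple of regular expressions to pull candidate preference statements out of a user message; the trigger phrases and the function name `extract_candidate_facts` are hypothetical, and a production system would more plausibly use an LLM or a classifier for this step.

```python
import re
from typing import List

# Hypothetical trigger phrases; each match runs to the end of its sentence
PREFERENCE_PATTERNS = [
    re.compile(r"\bI (?:prefer|always use|never use)\b[^.!?]*", re.IGNORECASE),
    re.compile(r"\bwe (?:use|deploy to|run)\b[^.!?]*", re.IGNORECASE),
]


def extract_candidate_facts(message: str) -> List[str]:
    """Return sentence fragments that look like durable preferences."""
    found = []
    for pattern in PREFERENCE_PATTERNS:
        for match in pattern.finditer(message):
            found.append(match.group(0).strip())
    return found


candidates = extract_candidate_facts(
    "Thanks, that fixed it. I prefer PostgreSQL UNIX socket authentication. "
    "Can you rerun the tests?"
)
print(candidates)  # ['I prefer PostgreSQL UNIX socket authentication']
```

Each candidate would then be embedded and passed to the graph writer, while the rest of the message stays in short-term context only.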

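Below is a Python class that writes extracted facts into the FalkorDB knowledge graph. The call is synchronous as written, so an orchestrator would typically run it from a background worker. It connects to FalkorDB, merges the user session node, creates the fact node, and links the two with a HAS_PERSISTENT_FACT relationship: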

# graph_memory.py (FalkorDB Client Library)
from falkordb import FalkorDB
import uuid

class DualMemoryClient:
    def __init__(self, host="127.0.0.1", port=6379):
        self.client = FalkorDB(host=host, port=port)
        self.graph = self.client.select_graph("segnog_memory")

    def save_long_term_fact(self, session_id: str, fact_content: str, vector_embedding: list):
        """
        Inserts a persistent fact node and links it to the active Session node.
        """
        fact_id = str(uuid.uuid4())
        
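        # Cypher query using standard MERGE and CREATE semantics.
        # vecf32() casts the float list to FalkorDB's vector type so that a
        # vector index on :Fact(embeddings) can serve similarity searches.
        query = """
        MERGE (s:Session {id: $session_id})
        CREATE (f:Fact {
            id: $fact_id,
            content: $content,
            embeddings: vecf32($vector),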
            created_at: timestamp()
        })
        CREATE (s)-[:HAS_PERSISTENT_FACT]->(f)
        RETURN f.id
        """
        
        params = {
            "session_id": session_id,
            "fact_id": fact_id,
            "content": fact_content,
            "vector": vector_embedding
        }
        
        try:
            res = self.graph.query(query, params)
            print(
                f"Graph insertion complete: {res.nodes_created} node(s), "
                f"{res.relationships_created} relationship(s) created"
            )
            return fact_id
        except Exception as e:
            print(f"Failed transactional graph write: {e}")
            return None
            
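    def close(self):
        # Assumes the client exposes its underlying redis-py connection
        # as `.connection`; adjust if your falkordb version differs.
        self.client.connection.close()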

if __name__ == "__main__":
    mem = DualMemoryClient()
    # Sample vector embedding representation (1024 floats)
    mock_vector = [0.12, -0.44, 0.89] + [0.0] * 1021
    mem.save_long_term_fact(
        session_id="session-user-186",
        fact_content="Developer uses staging database at port 8443",
        vector_embedding=mock_vector
    )
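Reading facts back is the other half of the long-term layer. The sketch below shows one way a similarity lookup might be issued, assuming a vector index exists on :Fact(embeddings) and that the property was stored with vecf32(). The Cypher follows FalkorDB's documented vector index procedure as I understand it, so treat the exact syntax as an assumption to verify against your server version. The helper name `search_long_term_facts` is hypothetical, and the function accepts any object with a `query(text, params)` method, so it can be exercised without a live database.

```python
from typing import Any, List, Sequence, Tuple

# One-time setup statement (run once per graph); dimension matches the
# 1024-float embeddings used above
CREATE_INDEX = (
    "CREATE VECTOR INDEX FOR (f:Fact) ON (f.embeddings) "
    "OPTIONS {dimension: 1024, similarityFunction: 'cosine'}"
)

SEARCH_FACTS = """
CALL db.idx.vector.queryNodes('Fact', 'embeddings', $k, vecf32($query_vector))
YIELD node, score
RETURN node.content, score
"""


def search_long_term_facts(graph: Any,
                           query_vector: Sequence[float],
                           k: int = 5) -> List[Tuple[str, float]]:
    """Return (content, score) pairs for the k nearest stored facts.

    `graph` is expected to behave like the object returned by
    FalkorDB.select_graph(): query() returns a result with .result_set rows.
    """
    result = graph.query(
        SEARCH_FACTS, {"k": k, "query_vector": list(query_vector)}
    )
    return [(row[0], row[1]) for row in result.result_set]
```

With the client class above, a call might look like `search_long_term_facts(mem.graph, query_embedding, k=3)`, with the returned fact texts prepended to the agent's prompt.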

📈 Summary of Benefits

Adopting a Dual-Memory Architecture gives the agent a memory model that stays fast and predictable as history grows:

Bounded prompt size: By offloading persistent settings and lessons to FalkorDB, the agent's prompt stays lightweight, so long-term knowledge can grow without pushing critical data out of the context window.
Deterministic purges: Wiping the local short-term cache is fast and safe. It clears stale error logs and outdated code paths without touching the user's long-term profile.
Lower latency: Short-term interactions avoid round-trip network hops to the graph database. The local JSON file handles fast chat operations, while asynchronous tasks query the graph only when a similarity search is needed.

By coordinating fast, local context files with a persistent, associative graph database, you can build responsive AI assistants that retain knowledge across years of development data.