Kill the Session. Earn the Memory.
The Problem With Long-Running Agents
Firetiger has many agents with long task horizons (persistent goals that evolve over months). Some are user-defined: “I want my API service to get healthier…” or “I want to ensure our agent output doesn’t degrade…” Agents monitor the system, propose changes, see what happens, and keep adjusting until they reach the goal. Other agents run in the background, researching the product, the customers, and the code, or optimizing other agents.
But how do you run an agent for a month-long task, or a six-month task, or a year, without it drifting or losing what it learned? Even with memory layers and summarization, most systems collapse into one of two patterns: keep the session alive as long as possible, patching context problems as they appear, or restart the agent on a schedule and lose everything between runs. Most memory systems are a patch; the session is still the unit of work.

A better solution is to invert the framework altogether: make tasks the unit of work and keep sessions deliberately short and disposable. Each run starts clean. The only thing that carries forward is a simple, semi-structured, token-limited memory (a notebook) that the agent reads from and writes to. Over time, that notebook accumulates task-specific knowledge. The agent doesn’t try to remember everything; it learns what actually matters to reach its goal.
Look closer at those two patterns. The first: write a static plan and trigger a fresh agent on a schedule. Clean, predictable, easy to reason about. But static plans can't learn. Every out-of-distribution event (a corrupted data source, a schema change, a shift in what your team cares about) hits the agent fresh, like it's the first time. The only lever you have is editing the plan, and editing the plan is risky. You don't know what you broke until you've run three months of backtests.

The second: give the agent a long-running session with compaction or a database. Let it accumulate context, summarize it periodically or store it all, keep going. This works until it doesn't. After enough cycles the agent starts behaving strangely. Confidently, quietly wrong in ways that are hard to trace. You can't diff a compacted context. You don't know when it diverged.
They both break in the same way. Your telemetry pipeline goes down. The agent that is supposed to be monitoring, suddenly unable to access its data, can’t tell the difference between corrupted input and a real outage. So it alerts. Then alerts again. Seventeen times before anyone realizes the pipeline was the problem.
Static plan: it never learns. You patch the plan, then patch it again. Six months later it’s a wall of fixes, mostly avoiding whatever annoyed you last quarter. It has lost the plot.
Long-running session: it “learns,” but that learning gets folded into everything else (either compacted blindly or stored as 17 raw failures in a database). One issue disappears, another blind spot replaces it. You won’t find it until a real outage goes undetected.
Stateless Runs, Persistent Memory
First, some context on Firetiger: Firetiger runs a network of agents that make your product more reliable by watching your deploys, constantly monitoring for issues, proposing changes, and checking whether they work. The network maintains shared knowledge about the product, the codebase, the customers, and the infrastructure. Each agent maintains its own local memory with operational knowledge specific to its task. This post focuses on that local, task-specific memory. Every run follows the same formula:
plan (stable intent) + local memory (accumulated knowledge) + signals (feedback from users/other agents) → action + updated memory
What this means: Agents are defined by a plan, available tools, and triggers (temporal or event-driven). Each run, they are seeded with the plan and their local memory (a bounded notebook). They execute their explicit task (whatever the user or the network defined) and their implicit task ("update the notebook to get better at my job"). When the bounded notebook exceeds its limit, the write returns an error and the agent has to deal with it.
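As a rough illustration, a single run can be expressed as the sketch below. The names, the token budget, and the `llm` callable are assumptions for the example, not Firetiger's actual code; the point is that the session is built from scratch and the only artifact allowed to persist is the notebook.

```python
# Minimal sketch of one disposable run. Names, limits, and the `llm` callable
# are illustrative assumptions, not a real implementation.
from dataclasses import dataclass

NOTEBOOK_TOKEN_LIMIT = 2_000  # assumed budget; the real number is a tuning choice

def count_tokens(text: str) -> int:
    # Stand-in for whatever tokenizer your model provider exposes.
    return len(text.split())

@dataclass
class RunResult:
    action: str            # what the agent did this run (alert, proposal, no-op...)
    updated_notebook: str  # the only state that survives the session

def run_once(plan: str, notebook: str, signals: list[str], llm) -> RunResult:
    """plan + local memory + signals -> action + updated memory."""
    prompt = (
        f"PLAN (stable intent):\n{plan}\n\n"
        f"NOTEBOOK (your accumulated knowledge, max {NOTEBOOK_TOKEN_LIMIT} tokens):\n{notebook}\n\n"
        "SIGNALS since last run:\n" + "\n".join(f"- {s}" for s in signals) + "\n\n"
        "Do your task, then rewrite the notebook so future runs do the job better."
    )
    action, updated_notebook = llm(prompt)  # your existing model client goes here

    # The limit is enforced, not suggested: an over-budget notebook is an error
    # the agent has to resolve (usually by compressing before writing).
    if count_tokens(updated_notebook) > NOTEBOOK_TOKEN_LIMIT:
        raise ValueError("notebook over budget; compress before persisting")
    return RunResult(action, updated_notebook)
```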
Memory That Earns Its Place
A natural question: why a notebook with a token limit instead of a database? Because the constraint is the point. A database lets the agent store and retrieve whatever seems relevant later. A bounded notebook forces it to decide what's worth keeping, which means discarding specifics and retaining only what generalizes. An agent with unlimited storage will log an incident and move on; an agent with a bounded notebook has to write down the general strategy.
Here's what that looks like in practice. One agent's notebook contained a 165-word rule about OAuth escalation. A full paragraph with named HTTP clients, volume thresholds, and a framework for snapshot analysis. A month later it was 16 words:
OAuth >60 req/hr needs (1) impact + (2) error increase/new behavior.
Error >20%=HIGH PRI standalone

The two clauses that actually mattered survived. Everything else got stripped. Compression forces abstraction. Otherwise you just end up with a log file. The limit is not a technical constraint we're working around; it's doing real work. Broad context (code, product changes, shared agent knowledge) lives in a different layer (retrieval, memory blocks, sleep agents, whatever you're already using). The notebook is for task mastery, not encyclopedic recall. Memory has to be earned.
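Mechanically, the overflow error from the bounded write is what triggers that compression pass. A sketch of one way the write path could look, with assumed helper names:

```python
# Sketch of a write path that makes memory "earned": an over-budget notebook
# is rejected, and the agent is asked to distill before anything persists.
# `ask_model` is an assumed stand-in for your model call.

NOTEBOOK_TOKEN_LIMIT = 2_000
MAX_COMPRESSION_PASSES = 3

def count_tokens(text: str) -> int:
    return len(text.split())  # placeholder tokenizer

def write_notebook(candidate: str, ask_model) -> str:
    for _ in range(MAX_COMPRESSION_PASSES):
        if count_tokens(candidate) <= NOTEBOOK_TOKEN_LIMIT:
            return candidate
        # The instruction is to abstract, not truncate: drop specifics
        # (named clients, one-off incidents) and keep what generalizes.
        candidate = ask_model(
            "This notebook is over its token budget. Rewrite it shorter by "
            "keeping only rules and strategies that will matter on future runs. "
            "Discard incident-specific detail.\n\n" + candidate
        )
    raise ValueError("notebook still over budget after compression passes")
```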
The Signal Layer
The agent doesn’t operate in isolation; it lives in a network and a changing world. Other agents are running in parallel: researching the codebase or telemetry, monitoring adjacent systems, auditing each other’s outputs. When something changes in the world around it (a change in a data pipeline, a new feature, a shift in usage patterns), another agent in the network notices and passes the signal along in plain language. When a user marks an alert as a false positive, the agent doesn't just log it; it updates its notebook. Early updates tend to be shallow rules like “don’t alert on X.” That works for about five minutes. Over time, compression turns those into strategy: “I’ve been miscalibrating the threshold because I’m not accounting for Y.” That’s what makes the feedback loop compounding rather than additive.
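A signal in this scheme can be as small as a plain-language note tagged with where it came from. The structure below is a hypothetical illustration, not the actual wire format:

```python
# Assumed shape of a signal: plain language plus provenance, nothing clever.
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class Signal:
    source: str    # e.g. "user" | "peer_agent" | "validator"
    text: str      # plain-language description of what changed or what was wrong
    at: datetime

signals = [
    Signal("peer_agent", "telemetry pipeline for service X was backfilled; gaps last Tuesday are not real",
           datetime.now(timezone.utc)),
    Signal("user", "marked alert as a false positive: weekend traffic is always this low",
           datetime.now(timezone.utc)),
]
# These get injected verbatim into the next run's prompt; the agent decides
# what, if anything, deserves to survive into the notebook.
```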
Something that came up early on was temporal alerting. Agents would sometimes get confused on weekends, holidays, or after hours, comparing against incorrect baselines. Another agent or a user would mark it as a false positive. Early updates looked like rules: “don’t alert on weekends” or “ignore after-hours spikes.” Over time, those compressed into something more general: "For temporal analysis, compare like-for-like periods (same day last week, same hour range) before flagging an issue." Not a patch for weekends… a strategy for handling time.
Obviously the agents shouldn't have to learn this one. We moved parts of it into the system itself (prompting, validation, better defaults), but the pattern showed up in the notebook first. The agent learned it before we formalized it.
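Once formalized, the like-for-like rule the agents converged on is simple to express. A sketch, with assumed window sizes and threshold:

```python
# "Compare like-for-like periods": baseline against the same hour range on the
# same weekday one week earlier, not against whatever happened an hour ago.
from datetime import datetime, timedelta

def baseline_window(now: datetime, hours: int = 1) -> tuple[datetime, datetime]:
    same_time_last_week = now - timedelta(weeks=1)
    return same_time_last_week - timedelta(hours=hours), same_time_last_week

def is_anomalous(current: float, baseline: float, threshold: float = 0.5) -> bool:
    # Flag only if the metric deviates from the like-for-like baseline.
    if baseline == 0:
        return current > 0
    return abs(current - baseline) / baseline > threshold
```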
Caveats
- Without the signal part of the equation, this strategy can cause serious agent drift. Input from other agents or some sort of validation is key to keeping it in check, along with user feedback (user input is weighted higher when optimizing; see the sketch after this list). The flip side: when signal is good and the notebook has converged, the agent stops updating. We've seen notebooks stay bit-for-bit identical for weeks across multiple deployments. Not stagnation but convergence.
- There is a balance to how much one prescribes what agents can put in their local memory. The more freedom they have the better, but the more signal is required so they don’t converge on whatever works locally or fall off a cliff. It takes a lot of evaluating to see how close you can get your prompt to “Here’s a notebook. Use it to get better at your job” instead of “Here’s a notebook. Store x, y, and z when blah happens…” One agent started date-stamping its learnings to track when rules crystallized. Nobody told it to. That's roughly the right amount of freedom.
- Agents tend to update the local memory with rules instead of general learnings at first. Early sessions produce specific corrections, which can decrease accuracy in early runs. Over time those rules get compressed into strategy. Sometimes a topic exits the notebook entirely: a five-step webhook verification protocol appeared in one notebook, got compressed over several months, then disappeared, replaced with a single higher-level sequencing rule. You can accelerate this with prompting and validation on updates, but the compaction cycle does a lot of the work on its own.
- Interpretability. The agent is not writing the notebook for you to read; it writes it for itself. After enough compaction cycles it can get pretty unreadable. But this isn’t a human artifact, it's an agent artifact. The metric is quality over time… not readability. And you still get the benefit of tracing agent drift to specific edits of local memory.
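As a rough illustration of the weighting mentioned in the first caveat, here is one way the split could look; the numbers are assumptions for the example, not published values:

```python
# Assumed weighting: explicit user feedback counts more than peer-agent signals
# when deciding which corrections are worth folding into the notebook.
SOURCE_WEIGHTS = {"user": 1.0, "validator": 0.7, "peer_agent": 0.5}

def prioritized(signals):
    # signals: iterable of objects with .source and .text (see the Signal sketch above)
    return sorted(signals, key=lambda s: SOURCE_WEIGHTS.get(s.source, 0.3), reverse=True)
```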
Where This Came From
The inspiration for this came from an unlikely place. Titans, a 2025 architecture paper from Google, is concerned with a much lower-level problem: how do you build a transformer that can handle sequences of millions of tokens without quadratic memory costs? It introduced a neural long-term memory module that learns, during inference, what's worth keeping and what to forget. Not through explicit rules but through signal. I won’t go through it all. Read the paper. That idea translated cleanly up a level. If a neural module can learn to distinguish signal from noise and update its weights accordingly, an agent with a notebook and the right prompting can do the same thing. Except the "weights" are natural language, the "surprise signal" is feedback from users and peer agents, and the forgetting is explicit compaction rather than gradient decay (repeated compression slowly removes what signal never reinforced; same effect, discrete and observable). The mechanism is different. The intuition is the same: memory should be earned, not accumulated.
What We Learned After a Year
It takes real work to get this production-ready (validation, signal weighting, prompting), but the core works.
Two things surprised us. First, how robust the agents become when things break around them. Data sources go down, schemas change, new features ship, requirements shift; agents that have been running for months barely flinch. The notebook has seen enough to know the difference between a real problem and noise. Second, how observable drift becomes. When an agent starts behaving strangely, you diff the notebook and find the update that caused it. The more interesting question is why it happened (bad signal weighting, a prompt that encourages rules over strategy, peer-agent corrections pulling in the wrong direction). The notebook diff tells you where to look. What you fix is usually something in the system, not the agent. There's also a meta-signal. Looking across notebooks (what multiple agents struggle with, what corrections keep recurring, where false positives cluster) tells you something about the product itself: where to invest, what's confusing, what's breaking.
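Because the memory is a small text file, the drift check really is just a diff. For example, with nothing but the standard library (the on-disk layout here is an assumption):

```python
# Diff two persisted notebook versions to find the update that changed behavior.
import difflib
from pathlib import Path

def notebook_diff(old_path: str, new_path: str) -> str:
    old = Path(old_path).read_text().splitlines(keepends=True)
    new = Path(new_path).read_text().splitlines(keepends=True)
    return "".join(difflib.unified_diff(old, new, fromfile=old_path, tofile=new_path))

# Hypothetical usage, assuming notebooks are snapshotted per run:
# print(notebook_diff("notebook_run_041.md", "notebook_run_042.md"))
```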
We tracked notebook evolution across three customer deployments over periods ranging from two to six months. The same cycle appeared in all three: raw corrections accumulate, distillation events compress them, false positives get deleted, and what survives gets promoted as multiple sessions independently confirm the same thing. It looks like Rules -> Patterns -> Strategy.
The most dramatic moment across any of them: one notebook rewrote its entire section taxonomy in a single update after four months. The old structure was organized around corrections and observations. It contained a mix of things users explicitly corrected, things the agent noticed itself, and things multiple sessions confirmed. All of it organized around the shape of the data:
Data Extraction # rules on how not to query
Data Interpretation # rules on how to interpret metrics
Logical Conclusions # how to reason toward conclusions

The new structure was organized around judgment:
Investigation & Monitoring # how to run an investigation
Agent Architecture / Data Location # where things live in this system
Query / Escalation Patterns # when and how to act

Not gradual drift. One event. The entire structure, replaced.
The agent stopped organizing around what it had learned and started organizing around how it thinks.