Over the past several weeks, our team has been running an experiment building an internal tool to automate the generation of grant proposal documents using an AI agent.
This post is about what we learned building a production grant writing agent: what broke, what worked, and how we designed a system where the agent's memory and context management became as important as its writing ability.
The goal: a simple vision
The initial goal was straightforward: build an agent that could read uploaded documents, analyze a grant opportunity URL, understand submission requirements, gather company context, conduct necessary research, and generate the required documents.
From the beginning, we wanted the user experience to be conversational and the agent to be capable of handling complex, multi-step workflows with planning and subtask delegation, all while maintaining a natural dialogue with users.
The first challenge: context management
The first major challenge wasn't about writing quality; it was about context management.
Early on, we identified three critical things to track:
- Uploaded documents: Grant documents, RFP materials, and the like
- Generated artifacts: The actual proposal documents being created
- Conversational messages: The dialogue between user and agent
From experience building other agents, we knew that maintaining all of this raw content in the LLM's context window would lead to what we call "context death": the point where the context becomes so bloated that the model loses track of what's important, response quality degrades, and token costs explode.
The solution was to separate content from metadata.
Summaries as reference cards
For uploaded documents, we generate a concise summary containing:
- Document type and purpose
- Key facts and data points
- Relevant dates, names, and numbers
- When the document would be useful
This summary serves as a reference card for the agent. Instead of reading entire 50-page technical specifications every time it needs to check something, the agent first consults the summary. Only when it needs specific details does it read the full document.
The same pattern applies to generated artifacts. Each proposal document has:
- The full generated content (stored but not always in context)
- A summary describing what the document contains
- Metadata about its purpose and status
This architectural decision proved transformative. The agent could now "remember" dozens of documents without drowning in content. A 30-document workspace might only require 3-4 pages of summaries in context, with full documents pulled on demand.
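The content/metadata split described above can be sketched in a few lines. This is an illustrative model, not our production code; the class and function names are ours:

```python
from dataclasses import dataclass, field


@dataclass
class DocumentRecord:
    """Metadata kept in the agent's context; full content stays on disk."""
    path: str
    doc_type: str          # e.g. "rfp", "technical_spec"
    summary: str           # the short "reference card"
    key_facts: list[str] = field(default_factory=list)

    def load_full_text(self) -> str:
        """Pulled on demand, only when the summary isn't enough."""
        with open(self.path, encoding="utf-8") as f:
            return f.read()


def context_view(records: list[DocumentRecord]) -> str:
    """What the LLM sees by default: one summary line per document, no bodies."""
    return "\n".join(
        f"- {r.path} ({r.doc_type}): {r.summary}" for r in records
    )
```

With thirty records, `context_view` produces a few pages of text; `load_full_text` is only invoked when the agent decides a summary isn't sufficient.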
The memory problem: continuity across conversations
Our first MVP worked well for single conversations. But a persistent question kept nagging us: What happens when users want to continue working across multiple chat sessions?
Grant proposals aren't written in one sitting. A team might:
- Generate a technical approach in one session
- Come back the next day to work on the budget narrative
- Open another chat to refine the project timeline
- Return days later to update team qualifications
All of these tasks involve the same underlying documents and context. Losing that context between sessions would be devastating.
Traditional chat-based agents treat each conversation as isolated. When you open a new chat, you start from zero. This works for simple Q&A but fails catastrophically for knowledge work that spans days or weeks.
The workspace: file system as memory
To solve this, we built a workspace, a persistent file system that serves as the agent's memory, inspired by tools like Claude Code and LangChain's deep agents.
Here's how it works:
workspace/
├── projects/
│   ├── nsf-sbir-phase-1/
│   │   ├── uploaded/
│   │   │   ├── company-overview.md
│   │   │   ├── team-bios.md
│   │   │   └── past-proposal-2023.pdf
│   │   ├── artifacts/
│   │   │   ├── technical-approach.md
│   │   │   ├── cover-letter.md
│   │   │   └── budget-narrative.md
│   │   └── PROJECT.md
│   └── doe-energy-grant/
│       ├── uploaded/
│       ├── artifacts/
│       └── PROJECT.md

Each project represents a grant opportunity. Inside each project:
- `uploaded/` contains source documents
- `artifacts/` contains generated proposal documents
- `PROJECT.md` serves as a table of contents and project overview
This workspace persists across all chat sessions. A user can:
- Upload documents in Chat 1
- Generate a technical proposal in Chat 2
- Delete all chat history
- Open Chat 3 and continue working on the budget narrative
The documents and artifacts remain intact. The agent has true continuity.
This unlocked a powerful capability: users could delete all their chat history without losing what mattered. The workspace became the source of truth, not the conversation history.
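Bootstrapping the per-project skeleton shown above is a handful of lines. A minimal sketch, assuming the directory layout from the tree (the `init_project` function is hypothetical, not our actual implementation):

```python
from pathlib import Path


def init_project(workspace: Path, slug: str) -> Path:
    """Create the per-project skeleton: uploaded/, artifacts/, PROJECT.md."""
    project = workspace / "projects" / slug
    (project / "uploaded").mkdir(parents=True, exist_ok=True)
    (project / "artifacts").mkdir(parents=True, exist_ok=True)

    index = project / "PROJECT.md"
    if not index.exists():
        # Seed the index; the agent fills in sections as work progresses.
        index.write_text(
            f"# {slug}\n\n## Uploaded Documents\n\n## Artifacts\n\n## Next Steps\n",
            encoding="utf-8",
        )
    return project
```

Because everything lives on disk rather than in conversation state, deleting chat history leaves this structure untouched.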
Structured metadata: context at a glance
We gave the workspace two layers of structure: a project-level index and consistent metadata on every document. Each has a clear role.
PROJECT.md: the master index
At the root of each project, PROJECT.md serves as the master index. We added it to fix a concrete problem: when a user started a new chat, the agent had to list all files and read several of them just to understand current progress. PROJECT.md acts as a table of contents and gives the agent situational awareness up front: what exists, what's done, and what's left. That cut down tool calls at the start of each conversation and helped the agent stay focused.
---
title: NSF SBIR Phase 1 - AI Sensor Platform
grant_agency: National Science Foundation
grant_program: SBIR Phase 1
deadline: 2024-03-01
status: in_progress
---
# Project Overview
This grant proposal is for developing an AI-powered sensor platform...
## Uploaded Documents
- `company-overview.md` - Company background and capabilities
- `team-bios.md` - Team member qualifications
- `past-work.md` - Relevant previous projects
## Artifacts
- `technical-approach.md` - Technical methodology (DRAFT)
- `cover-letter.md` - Introductory letter (COMPLETE)
- `budget-narrative.md` - Budget justification (NOT STARTED)
## Next Steps
- [ ] Complete technical approach
- [ ] Generate budget narrative
- [ ] Review for consistency

We keep PROJECT.md minimal on purpose. The workspace (uploaded documents, artifacts, and their frontmatter) is the system of record; PROJECT.md is a map, not a place to hold everything. The agent gets the map in context; when it needs details, it reads the right file.
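Because PROJECT.md is derived from the workspace rather than the other way around, its index sections can be regenerated from per-file metadata. A sketch of what rendering the `## Artifacts` section might look like (the function, dict keys, and status values are illustrative, not our production schema):

```python
def render_artifact_index(artifacts: list[dict]) -> str:
    """Render the '## Artifacts' section of PROJECT.md from per-file metadata.

    Each dict mirrors a document's frontmatter: 'file' is the filename,
    'title' is free text, 'status' is one of draft/complete/not_started.
    """
    status_label = {
        "draft": "DRAFT",
        "complete": "COMPLETE",
        "not_started": "NOT STARTED",
    }
    lines = ["## Artifacts"]
    for a in artifacts:
        lines.append(f"- `{a['file']}` - {a['title']} ({status_label[a['status']]})")
    return "\n".join(lines)
```

Regenerating the map from the files, instead of editing it by hand, keeps the index from drifting out of sync with the workspace.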
Uploaded and generated documents: frontmatter
The key to making individual files effective was giving every document structured frontmatter. Every markdown inside the uploaded documents and generated artifacts follows this pattern:
---
title: Technical Approach - NSF SBIR Phase 1
category: artifact
document_type: technical_proposal
description: Detailed technical approach for developing the AI-powered sensor platform, including methodology, timeline, and risk mitigation
status: draft
created: 2024-01-15
last_modified: 2024-01-16
grant_project: nsf-sbir-phase-1
---
# Technical Approach
[Document content...]

We use frontmatter for a few reasons:
Discovery and matching. When the agent lists files, it parses only the frontmatter, not the full body, so it learns what documents exist with minimal context. The title, description, and category help the agent identify which document is relevant for the current task; a clear description with specific keywords makes that matching accurate.
Progressive disclosure. By loading metadata first, the agent stays fast and lean. The full document is only read when the agent actually needs it, for example when drafting a section that depends on a specific uploaded source. That keeps context usage efficient and purposeful.
So at a glance, the agent can understand:
- What each document is
- Its purpose and category
- When it was created
- Its current status
Only when the agent needs specific information does it read the full document body. This two-tier approach (metadata first, content on demand) is what keeps the workspace navigable without drowning the model in text.
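The metadata-first tier depends on being able to read frontmatter without loading the body. A deliberately minimal parser for flat `key: value` frontmatter shows the idea (a sketch; a real implementation would use a proper YAML parser):

```python
def read_frontmatter(path: str) -> dict[str, str]:
    """Parse only the YAML frontmatter block; never read the document body.

    Handles flat `key: value` pairs between the opening and closing `---`
    delimiters, so listing a directory stays cheap in context.
    """
    meta: dict[str, str] = {}
    with open(path, encoding="utf-8") as f:
        if f.readline().strip() != "---":
            return meta                      # no frontmatter present
        for line in f:
            if line.strip() == "---":
                break                        # end of frontmatter; stop before body
            if ":" in line:
                key, _, value = line.partition(":")
                meta[key.strip()] = value.strip()
    return meta
```

Calling this across every file in `uploaded/` and `artifacts/` gives the agent the full catalog (titles, descriptions, statuses) for a fraction of the tokens the bodies would cost.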
Tools for navigating the workspace
A workspace is only useful if the agent can effectively navigate it. We built a focused toolkit:
Filesystem tools
- `list_files` - Show what's in a project folder
- `read_file` - Read full document content
- `delete_file` - Remove outdated documents
- `write_document` - Generate or update artifacts
These tools support parallel execution. The agent can list all files in one call, then read three different documents simultaneously, or delete multiple outdated drafts at once. This dramatically improved efficiency.
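One way the parallel reads might look under the hood, using `asyncio` to batch independent tool calls into a single round-trip (a sketch under our own naming, not the actual tool runtime):

```python
import asyncio
from pathlib import Path


async def read_file(path: Path) -> str:
    """One `read_file` tool call; blocking I/O is offloaded to a thread."""
    return await asyncio.to_thread(path.read_text, encoding="utf-8")


async def read_many(paths: list[Path]) -> list[str]:
    """Batch independent reads: three documents fetched concurrently
    cost one round-trip instead of three sequential ones."""
    return await asyncio.gather(*(read_file(p) for p in paths))
```

The same pattern applies to batched deletes: since the calls don't depend on each other's results, there's no reason to serialize them.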
Planning and reflection tools
- `todo` - Generate and track a step-by-step plan
- `think` - Reflect on progress and decide next steps
The planning tools proved essential for complex proposals. When facing a grant with 14 requirement documents and needing to produce 7 deliverables, the agent first creates a plan, then works step by step through it.
The think tool helps the agent maintain focus. Before each major action, it reflects: "I've completed steps 1-3. I have the requirements. Next, I need to read the company technical capabilities document before drafting the technical approach."
These reflection points keep the agent aligned with its goals and prevent it from going off-track.
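The state behind the `todo` tool can be as simple as an ordered checklist the agent re-reads when it reflects. A minimal sketch (class and method names are ours):

```python
from dataclasses import dataclass, field


@dataclass
class Plan:
    """Backing state for the `todo` tool: an ordered checklist."""
    steps: list[str]
    done: set[int] = field(default_factory=set)

    def complete(self, index: int) -> None:
        """Mark a step finished once the agent has executed it."""
        self.done.add(index)

    def render(self) -> str:
        """The checklist as the agent sees it when it pauses to `think`."""
        return "\n".join(
            f"[{'x' if i in self.done else ' '}] {step}"
            for i, step in enumerate(self.steps)
        )
```

Re-injecting `render()` output before each major action is what anchors the "I've completed steps 1-3, next I need to..." style of reflection.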
Model selection: what worked at scale
We experimented extensively with different language models. For most tasks, several models performed adequately. But when we tested a particularly complex grant (14 requirement documents, 7 different deliverables, synthesis of technical specs, team qualifications, and company history), only one model succeeded consistently: GPT-5.2.
What set GPT-5.2 apart:
- Instruction following: It adhered to complex, multi-step plans without deviation
- Long-context performance: It maintained coherence across lengthy documents
- Focus: It didn't get sidetracked by interesting but irrelevant details
In one test case, GPT-5.2 processed all 14 requirement documents, gathered the necessary company context, and generated all 7 required deliverables in a single execution run. The resulting documents were coherent, consistent, and required only minor revisions.
Other models either required constant re-prompting to stay on task or lost coherence across the full set of deliverables.
What we're still learning
This approach has held up well in internal deployment, but building a real tool for real grant writers forced us to confront practical challenges.
What we don't yet know:
- Long-term workspace maintenance: How does the workspace evolve as a company submits dozens of grants over months or years?
- Multi-user collaboration: How do multiple team members work in the same workspace without conflicts?
- Knowledge transfer: How do insights from one successful grant inform future proposals?
- Error recovery: When the agent generates incorrect content, how do we efficiently get back on track?
What's become clear: grant writing agents need more than writing ability. They need robust memory, efficient context management, and tools that let them navigate complex document ecosystems. The workspace architecture (persistent, structured, and tool-accessible) proved to be the foundation that made everything else possible.
The broader lesson: memory architecture matters
The core insight from building this agent extends beyond grant writing: When building agents for knowledge work, memory architecture matters as much as model capability.
We spent as much time designing:
- How the agent stores and retrieves information
- What metadata makes content discoverable
- Which tools enable efficient navigation
- How context is structured for decision-making
...as we did on the actual writing quality.
The traditional LLM approach (dump everything into the context window and hope for the best) doesn't scale for complex, multi-session work. You need:
- Persistent storage that outlives conversations
- Structured metadata that makes content discoverable
- Selective context loading that keeps the working memory focused
- Tools that enable navigation rather than brute-force search
- Planning and reflection capabilities for complex workflows
These architectural decisions transformed our agent from a one-shot document generator into a persistent assistant that could work alongside humans across days or weeks.
As language models continue improving, we expect these patterns to become even more important. Better models will unlock more complex tasks, which will demand more sophisticated memory and context management. The workspace pattern we discovered (file system as memory, summaries as indexes, tools as navigation) seems like just the beginning.
We hope sharing these lessons helps others building similar systems. The future of knowledge work agents depends not just on better models, but on better architectures for managing what those models remember and how they navigate their knowledge.
