Research Note

Why Narrative Depth Is Computable

Every narrative benchmark measures output quality. None of them ask the prior question: how much computation does this task actually need? That's the gap FableForge fills.

Every narrative benchmark in the literature measures the same thing: output quality. Does the story make sense? Is the character consistent? Does the plot hold together across scenes? ConStory-Bench, LongPage, StoryReasoning — all of them ask whether the model got it right.

None of them ask the prior question: how much computation does getting it right actually require?

That is the gap. And it is not a minor gap. It is the gap that determines whether a loop scheduler — the mechanism that decides how many recurrence steps to run on a given input — has any signal to work with. Without training data annotated for depth requirements, the scheduler is guessing. It might run 4 loops on a task that needs 32, or waste 32 loops on something a surface scan would resolve. There is no principled floor.

FableForge is the training signal that provides that floor.


The Wrong Abstraction

Standard narrative datasets are task datasets. They measure whether the model got it right. They do not annotate what “getting it right” structurally requires.

This is not an oversight — it reflects a reasonable assumption: that reasoning depth is emergent. You train on enough narrative tasks, the model figures out when to think harder, and capability develops through scale. The assumption has powered a decade of benchmark progress.

It is also the wrong abstraction for a class of model that makes reasoning depth a controllable parameter.

Recurrent-Depth Transformers (RDTs) do not have a fixed computational depth. The same weight block runs 4 times, or 16 times, or 32 times, depending on a scheduler’s decision. The depth is not baked into the architecture — it is a variable at inference time. This is the core insight of the RDT line of research (see arXiv:2603.21676 for the compositional generalization analysis, and the OpenMythos reconstruction work for an accessible implementation).
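
To make "the same weight block runs 4 times, or 16 times, or 32 times" concrete, here is a minimal sketch in plain Python. The names `shared_block` and `recurrent_forward` are hypothetical, and the averaging update is a toy stand-in for a real transformer block. It is not the RDT or OpenMythos implementation; it only shows that depth can be a call-time argument while the parameters stay fixed.

```python
from typing import List


def shared_block(state: List[float], inp: List[float]) -> List[float]:
    """One recurrence step: a toy stand-in for the shared block.

    The same fixed coefficients (0.5 and 0.5) are used on every call,
    so running more loops adds computation but no parameters.
    """
    return [0.5 * s + 0.5 * x for s, x in zip(state, inp)]


def recurrent_forward(inp: List[float], n_loops: int) -> List[float]:
    """Apply the same block n_loops times; depth is chosen at call time."""
    state = [0.0] * len(inp)
    for _ in range(n_loops):
        state = shared_block(state, inp)
    return state


# Same "weights", two different depths picked by the caller (the scheduler's job).
shallow = recurrent_forward([1.0, 2.0], n_loops=4)
deep = recurrent_forward([1.0, 2.0], n_loops=32)
```

In this toy, extra loops simply move the state closer to a fixed point. What matters for the argument is the shape of the interface: `n_loops` is an input, so something has to choose it.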

The question RDTs raise immediately: what should drive the scheduler? If you have trained on tasks where depth is unlabeled, you have no ground truth for that decision. The model cannot learn to allocate loops appropriately because the training data never told it what “appropriate” means for any specific task structure. Emergent depth allocation is not a path forward here. Annotated depth requirements are.


The Recurrence Insight

An RDT’s loop count is a compute budget that can be paid at inference time rather than during pretraining. This is a genuine architectural shift: the model can trade recurrence steps for deeper reasoning without changing its weights or its parameter count. Run 4 loops for a surface match; run 32 loops for a task requiring entity state persistence across 54 scene-character slots.

The problem is that “trade recurrence steps for deeper reasoning” is only useful if something tells the model when the trade is warranted. Without a training signal that maps task structure to depth requirements, the scheduler cannot learn. It has no labels. It has no gradient signal pointing toward “this kind of task needs more loops.” It is flying blind over a landscape it has never seen mapped.

FableForge maps the landscape.

It is a training dataset for narrative reasoning tasks with explicit loop-count annotations derived from structural analysis — not from model outputs, not from human raters scoring plausibility, but from the task’s structural complexity itself. The annotation is computable from first principles, before any model touches the input.


Three Task Types, One Principle

FableForge organizes narrative reasoning into three task categories, each with a loop derivation grounded in what the task structurally demands.

| Task Type | Variant | Required Loops | Structural Basis |
| --- | --- | --- | --- |
| character_trace | Full trace (6 chars × 9 scenes) | 32 | 54 scene-character state slots must persist |
| coherence_challenge | trait_reversal | 32 | Established trait must stay active across two fragments |
| coherence_challenge | name_drift | 4 | Surface token match only |
| narrative_completion | 6-constraint | 32 | All constraints active simultaneously through recurrence |

The derivations are not heuristics. They follow from what the task structurally requires of working memory.

character_trace at full complexity gives the model 6 characters across 9 scenes. Entity state must persist across 54 scene-character slots. If loop count is insufficient to maintain that persistence through recurrence, state degrades — characters contradict their established positions, scene history is lost, the trace fails. 32 loops is derived from the persistence requirement, not tuned empirically.

coherence_challenge with trait_reversal requires the model to detect that a character acted “impulsively” when their established trait is “cautious.” Detecting that contradiction requires holding the established trait in working memory across two fragments — the fragment where the trait was established, and the fragment where the violation occurs. The established trait cannot be a cached lookup; it must remain an active signal through recurrence. 32 loops. Compare this to name_drift, where the task is a surface token match across fragments — consistent spelling, no semantic contradiction to detect. 4 loops. The structural difference between these two variants is the difference between a memory maintenance task and a lookup task.

narrative_completion at 6 simultaneous constraints — character consistency, causal coherence, temporal ordering, thematic continuity, setting coherence, tone register — requires all six constraints to function as active signals through recurrence simultaneously. Drop any one and the completion violates that constraint. 32 loops is the derived requirement for holding the full constraint set active.

The principle unifying all three: loop count tracks the number of things that must remain simultaneously active through recurrence. Not the number of words. Not the length of the text. Not some complexity proxy. The structural count of things that cannot be resolved sequentially and must be held in parallel.
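
The structural counts above can be computed before any model runs. The sketch below restates the table as a lookup and shows the slot arithmetic for character_trace. The function names and the variant keys `full_trace` and `6_constraint` are hypothetical spellings chosen for illustration. The article does not give a general formula from slot count to loop count, so none is assumed here.

```python
from typing import Dict, Tuple

# Loop annotations copied from the table above; nothing here is inferred.
# The variant keys "full_trace" and "6_constraint" are hypothetical spellings.
REQUIRED_LOOPS: Dict[Tuple[str, str], int] = {
    ("character_trace", "full_trace"): 32,
    ("coherence_challenge", "trait_reversal"): 32,
    ("coherence_challenge", "name_drift"): 4,
    ("narrative_completion", "6_constraint"): 32,
}


def state_slots(characters: int, scenes: int) -> int:
    """Scene-character slots whose state must persist in a character trace."""
    return characters * scenes


def required_loops(task_type: str, variant: str) -> int:
    """Look up the annotated loop count for a known task variant."""
    try:
        return REQUIRED_LOOPS[(task_type, variant)]
    except KeyError:
        raise ValueError(f"no annotation for {task_type}/{variant}") from None


slots = state_slots(characters=6, scenes=9)  # 6 * 9 = 54
loops = required_loops("coherence_challenge", "name_drift")  # 4
```

Raising on unknown variants instead of guessing a default keeps the annotation honest: a task with no structural derivation gets no label.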

Here is a basic usage example showing how FableForge generates annotated examples:

```python
from openfable import FableForge

forge = FableForge(
    characters=6,
    scenes=9,
    task_type="character_trace"
)

# Returns: {"task": ..., "context": ..., "required_loops": 32}
example = forge.generate()

print(f"Loop annotation: {example['required_loops']}")
# Loop annotation: 32

# Batch generation for training data
dataset = forge.generate_batch(n=1000, split_by_complexity=True)
# Returns examples stratified by required_loops: [4, 8, 16, 32]
```
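
One way a `required_loops` annotation could become a training signal is as a classification target over the strata 4, 8, 16, 32 from the usage example above. The sketch below is an assumption, not FableForge's or OpenFable's training code: `LOOP_BUCKETS`, `softmax`, and `scheduler_loss` are hypothetical names, and the scheduler is represented only by its output logits.

```python
import math
from typing import List

# Strata taken from the usage example above; treated here as class labels.
LOOP_BUCKETS = [4, 8, 16, 32]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def scheduler_loss(logits: List[float], required_loops: int) -> float:
    """Cross-entropy between scheduler logits and the annotated bucket."""
    target = LOOP_BUCKETS.index(required_loops)
    return -math.log(softmax(logits)[target])


# An undecided scheduler pays log(4); one that favors the right bucket pays less.
uniform = scheduler_loss([0.0, 0.0, 0.0, 0.0], required_loops=32)
confident = scheduler_loss([0.0, 0.0, 0.0, 5.0], required_loops=32)
```

Without the annotation there is no `target` index and no loss to minimize, which is the "no labels, no gradient signal" problem in miniature.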

What MythosBridge Adds

FableForge is the narrative-specific training signal. But training an RDT on narrative tasks alone creates a gap: the model learns to allocate loops for story reasoning without first developing a feel for what deep recurrence is.

MythosBridge addresses this as a pretraining bridge. It takes WithinUsAI/claude_mythos_distilled_25k — 25,000 examples of deep technical reasoning across math, code, security analysis, and agentic planning — and re-annotates them with the same recurrence framework FableForge uses for narrative. The depth annotation logic transfers: a multi-step proof has structural depth requirements derivable from its dependency graph; a security analysis has context requirements derivable from its attack surface; an agentic planning task has state-persistence requirements derivable from its action graph.
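
As an illustration of a depth quantity "derivable from its dependency graph", the sketch below computes the longest dependency chain in a small proof graph. This is one plausible structural measure, not MythosBridge's annotation rule. The function name `longest_chain` and the example graph are hypothetical, and how such a number would map to a loop count is left open because the article does not specify it.

```python
from typing import Dict, List


def longest_chain(deps: Dict[str, List[str]]) -> int:
    """Number of nodes on the longest dependency chain.

    `deps` maps each step to the steps it depends on.
    Assumes an acyclic graph; cycles are not handled.
    """
    memo: Dict[str, int] = {}

    def depth(node: str) -> int:
        if node not in memo:
            # A step with no prerequisites has depth 1.
            memo[node] = 1 + max(
                (depth(d) for d in deps.get(node, [])), default=0
            )
        return memo[node]

    return max((depth(n) for n in deps), default=0)


# Hypothetical proof: the theorem needs lemma_b, which needs lemma_a.
proof = {
    "lemma_a": [],
    "lemma_b": ["lemma_a"],
    "theorem": ["lemma_a", "lemma_b"],
}
chain = longest_chain(proof)  # lemma_a -> lemma_b -> theorem
```

The measure depends only on the graph, so it can be computed before any model sees the problem, which is the property the annotation framework relies on.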

The training pipeline then runs in two stages. Stage 1: train on MythosBridge examples. The model learns what deep recurrence feels like on structured problems where the loop-depth relationship is clear. Stage 2: train on FableForge. The model applies that recurrence capacity to narrative-specific task structures.
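
The two-stage ordering can be sketched as a generator that exhausts the bridge data before yielding any narrative data. The function name, the stage labels, and the example dictionaries are all hypothetical. A real pipeline would add shuffling, epochs, and an actual training step.

```python
from typing import Dict, Iterator, List, Tuple

Example = Dict[str, object]


def two_stage_curriculum(
    bridge: List[Example], narrative: List[Example], batch_size: int
) -> Iterator[Tuple[str, List[Example]]]:
    """Yield (stage, batch) pairs: all bridge batches, then all narrative batches."""
    stages = (("stage1_mythosbridge", bridge), ("stage2_fableforge", narrative))
    for stage, data in stages:
        for start in range(0, len(data), batch_size):
            yield stage, data[start:start + batch_size]


# Placeholder examples; only the required_loops field mirrors the article.
bridge_data = [{"required_loops": 16}, {"required_loops": 32}, {"required_loops": 8}]
narrative_data = [{"required_loops": 32}, {"required_loops": 4}]

schedule = [stage for stage, _ in two_stage_curriculum(bridge_data, narrative_data, 2)]
```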

The sequencing is not arbitrary. Narrative task structures are noisier than mathematical ones — the structural signals for loop requirements are real but require more interpretation. A model that has already learned to allocate loops on clean formal problems adapts that capacity to narrative more reliably than a model encountering deep recurrence for the first time in a story context.

Credit where it is due: the MythosBridge framing emerged directly from the OpenMythos reconstruction work. Kye Gomez’s implementation of the RDT architecture as an open hypothesis about Claude Mythos — with its explicit loop-scaled reasoning focus — provided the template for thinking about what a pretraining bridge should accomplish.


The Deeper Connection

This is where the epistemology connects to the rest of what OpenCoven is building.

The Ward’s core claim is that identity-relevant properties are structurally detectable. You do not need to observe a familiar drifting to know a proposed change is dangerous. You can classify the proposal from its structure: does it touch Tier 0 files? Does it modify scope boundaries? Does it propose a behavioral change without an authorization chain? The classification is computable from the proposal’s structure. Oracle judgment — wait and see if the familiar starts acting wrong — is not required.
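
The three structural questions above can be written as a check that never observes behavior. The sketch below is not the Ward's implementation. The `Proposal` fields, the flag names, and the file names in `TIER0_FILES` are hypothetical, chosen only to mirror the questions in the paragraph.

```python
from dataclasses import dataclass, field
from typing import List, Set

# Hypothetical protected files; the article does not list what Tier 0 contains.
TIER0_FILES: Set[str] = {"identity.md", "purpose.md"}


@dataclass
class Proposal:
    touched_files: List[str]
    modifies_scope: bool = False
    changes_behavior: bool = False
    authorization_chain: List[str] = field(default_factory=list)


def structural_flags(p: Proposal, tier0: Set[str] = TIER0_FILES) -> List[str]:
    """Classify a proposal from its structure alone, without running anything."""
    flags: List[str] = []
    if any(f in tier0 for f in p.touched_files):
        flags.append("touches_tier0")
    if p.modifies_scope:
        flags.append("modifies_scope")
    if p.changes_behavior and not p.authorization_chain:
        flags.append("unauthorized_behavior_change")
    return flags


risky = structural_flags(Proposal(["identity.md"], changes_behavior=True))
safe = structural_flags(Proposal(["notes.md"]))
```

Every flag is decided from fields of the proposal itself, so the classification is available before the change is applied.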

FableForge makes the analogous claim about cognition: you do not need to run the model to know how much computation a task requires. You can derive it from the task’s structure. The derivation is computable. Running a loop scheduler without labeled depth requirements and hoping it figures out appropriate allocation is the cognitive equivalent of running a familiar without a Ward and hoping it figures out appropriate authority. Both treat a structurally determinable property as if it were emergent.

Same epistemology. Different domain.

The Ward asks: is this proposal structurally consistent with a familiar’s protected surface?

FableForge asks: is this task structurally complex enough to require deep recurrence?

Both questions have answers derivable from structure before any model output exists. Neither requires waiting for failure.

This matters for how we think about what “computable” means in an AI system. The familiar contract properties — named identity, defined purpose, bounded authority, persistent memory, human belonging — are all structurally definable. They are not emergent from enough training; they are architectural commitments that can be specified, checked, and enforced before inference. The cognitive depth requirements FableForge annotates are the same kind of thing: not emergent from training, but derivable from task structure before a model ever sees the input.

The pattern is: identify what you actually need from the system, derive it from structure, encode that derivation explicitly. Stop relying on emergence for things that have deterministic structural descriptions.


Structure First. Emergence Second.

The same question the Ward asks about identity, FableForge asks about cognition. Both are structural questions with structural answers. Both resist the temptation to treat computable properties as if they required oracle observation.

The narrative benchmark landscape is not going to change overnight. ConStory-Bench and StoryReasoning are measuring real things — output quality matters, and those benchmarks provide genuine signal. But they are not the right tool for training loop schedulers. They do not annotate what tasks require. They measure whether models succeed, not what success structurally demanded.

FableForge fills that gap. OpenFable makes the tooling available. MythosBridge provides the pretraining foundation.

The architecture is the architecture. The loop scheduler needs a signal. That signal is computable.


Links: OpenFable on GitHub · FableForge dataset on HuggingFace · arXiv:2603.21676 (RDT architecture) · OpenMythos

Related reading: The Harness Layer · What’s Inside Your Agent
