RLM Editing Layer for Wren
Context
Current Wren is an iterative editing loop (generate → score → feedback → regenerate). It is not a Recursive Language Model. The sub-model (llm_query) acts as a utility function for targeted edits, not as a recursive reasoning partner.
This plan adds an RLM layer that wraps the existing Wren engine — giving the root model explicit recursive reasoning about editing strategy before producing prose, with the rubric as the structured feedback signal.
Research basis: Zhang et al. RLM paper · PRefLexOR (arXiv:2410.12375) · Wang et al. (arXiv:2603.02615) — depth=1 optimal, depth=2+ causes overthink · Wren rubric system plan
Architecture Overview
Depth constraint: max_depth=2. Wang et al. showed depth=1 is optimal; depth=2 is acceptable with cost guard; depth=3+ overthinks.
Rubric as the RLM Signal
The rubric (D1–D4) is the critical bridge between scoring and reasoning. It makes the RLM's recursion directed rather than aimless.
RubricResult Schema
```
from dataclasses import dataclass
from typing import Literal


@dataclass
class RubricDimensionResult:
    id: str            # "D1_slop"
    name: str          # "Slop Patterns"
    weight: float      # 3.0
    max_score: float   # sum of criterion weights
    actual_score: float
    status: Literal["passed", "failed", "minor"]


@dataclass
class RubricResult:
    overall_score: float       # 0-100, weighted sum
    dimensions: list[RubricDimensionResult]
    priority_order: list[str]  # dim IDs, highest-weight-first

    def format_for_rlm(self) -> str:
        """Human-readable rubric state for the RLM reasoning prompt."""
        lines = ["Rubric score breakdown:"]
        for dim in self.dimensions:
            pct = dim.actual_score / dim.max_score * 100
            status_icon = (
                "✓" if dim.status == "passed"
                else "✗" if dim.status == "failed"
                else "~"
            )
            lines.append(
                f"  {dim.name}: {dim.actual_score:.1f}/{dim.max_score:.1f} "
                f"({pct:.0f}%) {status_icon}"
            )
        return "\n".join(lines)
```
Rubric Evaluation Pipeline
```
def evaluate_rubric(text: str, rubric: Rubric) -> RubricResult:
    # Step 1: rule_based criteria → regex match (Slop Guard already does this)
    # Step 2: computed criteria → statistics (sentence length CV, etc.)
    # Step 3: llm_judgment criteria → batch LLM call (one call for all)
    return RubricResult(...)
```
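Step 2's computed criteria are plain statistics over the text. A minimal sketch of one such check, the sentence-length coefficient of variation (the sentence-split heuristic and function name here are illustrative, not Wren's actual implementation):

```python
import re
import statistics


def sentence_length_cv(text: str) -> float:
    """Coefficient of variation of sentence lengths, in words.

    Uniform sentence lengths (a common LLM tell) push CV toward zero;
    varied rhythm pushes it up.
    """
    # Naive split on terminal punctuation; good enough for a rubric signal.
    sentences = [s for s in re.split(r"[.!?]+\s*", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    if len(lengths) < 2:
        return 0.0
    mean = statistics.mean(lengths)
    if mean == 0:
        return 0.0
    return statistics.stdev(lengths) / mean
```

A criterion would then compare the CV against a persona-specific threshold.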
RLM Prompts
Initial Reasoning (Root Model)
SYSTEM: You are an expert prose editor using recursive reasoning.
Before producing any edited text, you reason step-by-step about
the best editing strategy given the current rubric state.
You will reflect on a critic's feedback before finalizing your approach.
{active_rubric}
CRITERION TYPES:
- rule_based: vocabulary, phrases, patterns (regex-checkable)
- computed: sentence length CV, paragraph variance (statistical)
- llm_judgment: voice, specificity, structure (requires editorial assessment)
Your task:
1. Read the rubric priorities and current prose
2. Reason about WHAT to edit and WHY (not how — that's for later)
3. State your editing strategy in 3-5 sentences
4. Then produce the revised_text
Format your response:
```
STRATEGY:
[your reasoning about editing strategy]
REVISED_TEXT:
[the revised prose]
```
Sub-Model Critique
SYSTEM: You are a rigorous editorial critic.
CRITIQUE_TASK: Evaluate the editing strategy for soundness.
Is the reasoning correct about what's wrong?
Is the proposed strategy likely to fix the rubric failures?
What is missing, wrong, or underweighted?
Be specific. Reference the rubric dimensions by name.
Do not rewrite the text — critique the STRATEGY.
Respond in this format:
```
CRITIQUE:
[your critique of the strategy — 3-5 sentences, specific]
MISSING_DIMENSION: [which rubric dimension, if any, is not addressed]
STRATEGY_SOUND: YES or NO
```
Root Reflection + Final Edit
SYSTEM: You are an expert prose editor. You previously reasoned about an
editing strategy, received critique, and must now reflect before producing
the final revised text.
Your reflection task:
1. Did the critique reveal a gap in your reasoning?
2. Should you adjust strategy or priorities based on critique?
3. What specific changes will you make to the prose?
Format your response:
```
REFLECTION:
[your reflection on the critique and any strategy adjustment]
REVISED_TEXT:
[final revised prose — incorporate critique insights]
```
RLM Wrapper Implementation
```
class RLMWrapper:
    """
    Depth-limited RLM layer that wraps WrenEngine (max_depth=2).

    Loop (per depth):
    1. Root reasons about editing strategy (with rubric signal)
    2. Sub critiques the reasoning
    3. Root reflects + produces revised_text
    4. WrenEngine verifies (mandatory scoring gate)
    5. If score < threshold and depth < max_depth: continue
    6. Return best revision
    """

    def run(self, post: str) -> RLMResult:
        context = post
        reasoning_chain: list[str] = []
        best_text, best_score = "", 0.0
        initial_rubric = self._evaluate_rubric(post)
        current_rubric = initial_rubric
        score_result = None

        for depth in range(self.max_depth):
            # Step 1: Root generates initial reasoning
            reasoning_response = self.root.completion(
                build_initial_reasoning_prompt(context, current_rubric, self.mode)
            )
            reasoning, candidate = extract_reasoning_and_text(reasoning_response)
            reasoning_chain.append(reasoning)

            # Steps 2-3: critique + reflection (skipped on the final depth)
            if depth < self.max_depth - 1:
                # Step 2: Sub critiques the reasoning
                critique = self.sub.completion(
                    build_critique_prompt(reasoning, candidate, current_rubric, context)
                )
                reasoning_chain.append(critique)

                # Step 3: Root reflects on critique
                reflection_response = self.root.completion(
                    build_reflection_prompt(reasoning, critique, current_rubric, context)
                )
                reflection, candidate = extract_reasoning_and_text(reflection_response)
                reasoning_chain.append(reflection)

            # Step 4: WrenEngine verification gate (mandatory)
            score_result = self._engine.score(candidate)
            rubric_result = self._evaluate_rubric(candidate)
            hybrid_score = 0.4 * score_result.score + 0.6 * rubric_result.overall_score
            if hybrid_score > best_score:
                best_score = hybrid_score
                best_text = candidate

            target = 80 if self.mode != "light" else 60
            if score_result.score >= target:
                break  # threshold met — stop
            # Next depth edits the current candidate with its fresh rubric signal
            context = candidate
            current_rubric = rubric_result
        else:
            # Depth exhausted without meeting threshold: fall back to WrenEngine
            return self._fallback_to_engine(post)

        return RLMResult(
            revised_post=best_text or post,
            initial_rubric=initial_rubric,
            final_rubric=self._evaluate_rubric(best_text),
            reasoning_chain=reasoning_chain,
            rlm_score=best_score,
            engine_score=score_result.score,
            depth_used=depth + 1,
            mode=self.mode,
        )
```
Cost Analysis
| Configuration | LLM Calls | Tokens (reasoning) | Latency |
|---|---|---|---|
| WrenEngine (iterative) | ~5–10 (root only) | ~2K–5K | ~15–30s |
| RLMWrapper depth=1 | ~4 (2×root + 1×sub + 1 rubric judge) | ~4K–8K | ~20–40s |
| RLMWrapper depth=2 | ~8 (2 rounds of the above) | ~8K–16K | ~40–80s |
Cost guard: If per-iteration cost exceeds 3× the first iteration AND score improvement < 5 points, abort RLM loop and fall back to WrenEngine.
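The cost guard can be a small pure helper evaluated after each iteration. A sketch under stated assumptions (function name, cost units, and call site are illustrative; the real wrapper would track tokens via its LLM client):

```python
def should_abort(
    iteration_cost: float,
    first_iteration_cost: float,
    score_improvement: float,
    cost_multiplier: float = 3.0,
    min_improvement: float = 5.0,
) -> bool:
    """Abort the RLM loop when cost balloons without score gains.

    Triggers only when BOTH conditions hold: per-iteration cost exceeds
    cost_multiplier times the first iteration AND the score improved by
    less than min_improvement points.
    """
    too_expensive = iteration_cost > cost_multiplier * first_iteration_cost
    stalled = score_improvement < min_improvement
    return too_expensive and stalled
```

Requiring both conditions avoids aborting a run that is expensive but still converging.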
Implementation Phases
Rubric Core
Before the RLM layer makes sense, the rubric system must exist.
- Create `engine/rubric.py` — RubricResult dataclass, `load_rubric()`, `evaluate_rubric()`
- Create `personas/default/RUBRIC.md` — 4-dimension rubric (D1 Slop, D2 Voice, D3 Specificity, D4 Structure)
- Wire rubric scoring into `WrenEngine.score()` — return both Slop Guard score and RubricResult
- Add hybrid scoring — `0.4 * slop + 0.6 * rubric`
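The hybrid score is a straight weighted blend of the two 0–100 signals. A one-line sketch (weights from this plan; the function name is illustrative):

```python
def hybrid_score(slop_score: float, rubric_score: float) -> float:
    """Blend Slop Guard and rubric scores (both on a 0-100 scale)."""
    return 0.4 * slop_score + 0.6 * rubric_score
```

Weighting the rubric higher reflects that it covers voice, specificity, and structure, while Slop Guard covers only rule-based patterns.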
RLM Infrastructure
- Create `engine/rlm_prompts.py` — the three RLM prompt builders
- Create `engine/rlm_wrapper.py` — RLMWrapper class with depth=1 loop
- Create `engine/rlm_sandbox.py` — lightweight sandbox (simpler than Monty — RLM doesn't need full Python execution)
- Add extraction helpers — `extract_reasoning_and_text()` from LLM response
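The extraction helper has to tolerate the fenced response format from the prompts above, with either the STRATEGY or REFLECTION header. A minimal regex-based sketch (a production parser may need to handle more malformed responses):

```python
import re


def extract_reasoning_and_text(response: str) -> tuple[str, str]:
    """Split an RLM response into (reasoning, revised_text).

    Accepts the STRATEGY or REFLECTION header, with or without
    surrounding code fences.
    """
    # Drop any outer code fences, then normalize whitespace.
    body = response.strip().strip("`").strip()
    match = re.search(
        r"(?:STRATEGY|REFLECTION):\s*(.*?)\s*REVISED_TEXT:\s*(.*)",
        body,
        flags=re.DOTALL,
    )
    if not match:
        # Fallback: treat the whole response as revised text, no reasoning.
        return "", body
    return match.group(1).strip(), match.group(2).strip()
```

The empty-reasoning fallback keeps the pipeline alive when a model ignores the format; the verification gate still scores the text.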
Integration
- Modify `editor/editor.py` — add `RLMWrapper` as alternative to `WrenEngine`
- Add `--rlm` CLI flag — `wren edit post.txt --rlm --rubric default`
- Add `use_rlm: bool` parameter to `Wren.__init__()`
- Wire Logfire spans — trace reasoning_chain across depths
Comparison + Tuning
- Add A/B test mode — run both WrenEngine and RLMWrapper on same input, compare scores
- Tune depth — test depth=1 vs depth=2 on hard cases
- Tune rubric weights per persona
- Cost tracking — log tokens used per depth, compare to direct WrenEngine cost
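The A/B decision itself can be a tiny pure function run over each paired result (name and tie-break margin are illustrative assumptions, not part of the plan above):

```python
def compare_ab(engine_score: float, rlm_score: float, margin: float = 1.0) -> str:
    """Pick a winner for one A/B run.

    Ties within `margin` points go to the cheaper iterative engine,
    since the RLM path costs roughly 2x the tokens.
    """
    if rlm_score > engine_score + margin:
        return "rlm"
    return "engine"
```

Aggregating winners across a fixture set then answers the depth=1 vs depth=2 question empirically.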
Files to Create / Modify
| File | Action | Purpose |
|---|---|---|
| `engine/rubric.py` | Create | RubricResult, load_rubric(), evaluate_rubric() |
| `engine/rlm_prompts.py` | Create | Three RLM prompt builders |
| `engine/rlm_wrapper.py` | Create | RLMWrapper class |
| `engine/rlm_sandbox.py` | Create | Lightweight sandbox for RLM (llm_query only) |
| `personas/default/RUBRIC.md` | Create | Default 4-dimension rubric |
| `editor/editor.py` | Modify | Add --rlm flag, RLMWrapper integration |
| `engine/wren_engine.py` | Modify | Return RubricResult alongside ScoreResult |
| `tests/test_rlm_*.py` | Create | RLM unit + integration tests |
Open Questions
- Depth=1 vs depth=2 tradeoff: Wang et al. says depth=1 optimal, but prose editing may benefit from depth=2 for complex cases. How to decide dynamically?
- When to fall back to WrenEngine? If RLM reasoning doesn't produce a candidate above threshold in 2 rounds, should we fall back to the iterative editing loop?
- Rubric + Slop Guard overlap: D1 criteria overlap with Slop Guard's rule-based checks. Should RLM layer only focus on D2/D3/D4 (llm_judgment)?
- Reasoning chain length: Verbose reasoning chains are good for observability but cost tokens. Should there be a max length?
- Multiple sub-model critiques: PRefLexOR uses multi-agent critique. Could multiple sub-model critiques improve reasoning quality?
Test Plan
Unit Tests
````
# test_rlm_extraction.py
def test_extract_reasoning_and_text():
    response = """```
STRATEGY:
The prose has D2 voice failures (hedging) and D3 specificity failures.
REVISED_TEXT:
In 2024, NotYourIdea's team of 4 grew revenue 3x.
```"""
    reasoning, text = extract_reasoning_and_text(response)
    assert "hedging" in reasoning
    assert "2024" in text


# test_rlm_wrapper.py
def test_rlm_wrapper_depth_one():
    wrapper = RLMWrapper(rubric_path="personas/default/RUBRIC.md")
    result = wrapper.run("Some blog post text...")
    assert len(result.reasoning_chain) == 3  # reasoning, critique, reflection
    assert result.depth_used == 1
````
Integration Tests
```
# test_rlm_vs_wren.py
def test_rlm_better_on_hard_cases():
    """RLM should outperform Wren on cases with mixed rubric failures."""
    hard_post = load_test_fixture("mixed_rubric_failures.txt")
    wren_result = WrenEngine().run(hard_post, mode="standard")
    rlm_result = RLMWrapper(rubric_path="personas/default/RUBRIC.md").run(hard_post)
    # RLM should produce better or equal on hard cases
    assert rlm_result.engine_score >= wren_result.final_score - 5


def test_rlm_reasoning_is_interpretable():
    result = RLMWrapper(rubric_path="personas/default/RUBRIC.md").run(test_post)
    assert len(result.reasoning_chain) > 0
    for item in result.reasoning_chain:
        assert len(item) > 50  # non-trivial reasoning
```