RLM Editing Layer for Wren
Context
Current Wren is an iterative editing loop (generate → score → feedback → regenerate). It is not a Recursive Language Model. The sub-model (llm_query) acts as a utility function for targeted edits, not as a recursive reasoning partner.
This plan adds an RLM layer that wraps the existing Wren engine — giving the root model explicit recursive reasoning about editing strategy before producing prose, with the rubric as the structured feedback signal.
Research basis: Zhang et al. RLM paper · PRefLexOR (arXiv:2410.12375) · Wang et al. (arXiv:2603.02615) — depth=1 optimal, depth=2+ causes overthink · Wren rubric system plan
Architecture Overview
Depth constraint: max_depth=2. Wang et al. showed depth=1 is optimal; depth=2 is acceptable with cost guard; depth=3+ overthinks.
Rubric as the RLM Signal
The rubric (D1–D4) is the critical bridge between scoring and reasoning. It makes the RLM's recursion directed rather than aimless.
RubricResult Schema
```
from dataclasses import dataclass
from typing import Literal


@dataclass
class RubricDimensionResult:
    id: str            # "D1_slop"
    name: str          # "Slop Patterns"
    weight: float      # 3.0
    max_score: float   # sum of criterion weights
    actual_score: float
    status: Literal["passed", "failed", "minor"]


@dataclass
class RubricResult:
    overall_score: float       # 0-100, weighted sum
    dimensions: list[RubricDimensionResult]
    priority_order: list[str]  # dim IDs, highest-weight-first

    def format_for_rlm(self) -> str:
        """Human-readable rubric state for the RLM reasoning prompt."""
        lines = ["Rubric score breakdown:"]
        for dim in self.dimensions:
            pct = dim.actual_score / dim.max_score * 100
            status_icon = (
                "✓" if dim.status == "passed"
                else "✗" if dim.status == "failed"
                else "~"
            )
            lines.append(
                f"  {dim.name}: {dim.actual_score:.1f}/{dim.max_score:.1f} "
                f"({pct:.0f}%) {status_icon}"
            )
        return "\n".join(lines)
```
Rubric Evaluation Pipeline
```
def evaluate_rubric(text: str, rubric: Rubric) -> RubricResult:
    # Step 1: rule_based criteria → regex match (Slop Guard already does this)
    # Step 2: computed criteria → statistics (sentence length CV, etc.)
    # Step 3: llm_judgment criteria → batch LLM call (one call for all)
    return RubricResult(...)
```
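Step 2's computed criteria are plain statistics over the text. A minimal sketch of one such check, the sentence-length coefficient of variation (the sentence-split heuristic and function name here are illustrative, not Wren's actual implementation):

```python
import re
import statistics


def sentence_length_cv(text: str) -> float:
    """Coefficient of variation of sentence lengths, in words.

    Uniform sentence lengths (a common LLM tell) push CV toward zero;
    varied rhythm pushes it up.
    """
    # Naive split on terminal punctuation; good enough for a rubric signal.
    sentences = [s for s in re.split(r"[.!?]+\s*", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    if len(lengths) < 2:
        return 0.0
    mean = statistics.mean(lengths)
    if mean == 0:
        return 0.0
    return statistics.stdev(lengths) / mean
```

A criterion would then compare the CV against a persona-specific threshold.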
RLM Prompts
Initial Reasoning (Root Model)
SYSTEM: You are an expert prose editor using recursive reasoning.
Before producing any edited text, you reason step-by-step about
the best editing strategy given the current rubric state.
You will reflect on a critic's feedback before finalizing your approach.
{active_rubric}
CRITERION TYPES:
- rule_based: vocabulary, phrases, patterns (regex-checkable)
- computed: sentence length CV, paragraph variance (statistical)
- llm_judgment: voice, specificity, structure (requires editorial assessment)
Your task:
1. Read the rubric priorities and current prose
2. Reason about WHAT to edit and WHY (not how — that's for later)
3. State your editing strategy in 3-5 sentences
4. Then produce the revised_text
Format your response:
```
STRATEGY:
[your reasoning about editing strategy]
REVISED_TEXT:
[the revised prose]
```
Sub-Model Critique
SYSTEM: You are a rigorous editorial critic.
CRITIQUE_TASK: Evaluate the editing strategy for soundness.
Is the reasoning correct about what's wrong?
Is the proposed strategy likely to fix the rubric failures?
What is missing, wrong, or underweighted?
Be specific. Reference the rubric dimensions by name.
Do not rewrite the text — critique the STRATEGY.
Respond in this format:
```
CRITIQUE:
[your critique of the strategy — 3-5 sentences, specific]
MISSING_DIMENSION: [which rubric dimension, if any, is not addressed]
STRATEGY_SOUND: YES or NO
```
Root Reflection + Final Edit
SYSTEM: You are an expert prose editor. You previously reasoned about an
editing strategy, received critique, and must now reflect before producing
the final revised text.
Your reflection task:
1. Did the critique reveal a gap in your reasoning?
2. Should you adjust strategy or priorities based on critique?
3. What specific changes will you make to the prose?
Format your response:
```
REFLECTION:
[your reflection on the critique and any strategy adjustment]
REVISED_TEXT:
[final revised prose — incorporate critique insights]
```
RLM Wrapper Implementation
```
class RLMWrapper:
    """
    Depth-limited RLM layer that wraps WrenEngine (max_depth=2).

    Loop (per depth):
    1. Root reasons about editing strategy (with rubric signal)
    2. Sub critiques the reasoning
    3. Root reflects + produces revised_text
    4. WrenEngine verifies (mandatory scoring gate)
    5. If score < threshold and depth < max_depth: continue
    6. Return best revision
    """

    def run(self, post: str) -> RLMResult:
        context = post
        reasoning_chain: list[str] = []
        best_text, best_score = "", 0.0
        initial_rubric = self._evaluate_rubric(post)
        current_rubric = initial_rubric
        score_result = None

        for depth in range(self.max_depth):
            # Step 1: Root generates initial reasoning
            reasoning_response = self.root.completion(
                build_initial_reasoning_prompt(context, current_rubric, self.mode)
            )
            reasoning, candidate = extract_reasoning_and_text(reasoning_response)
            reasoning_chain.append(reasoning)

            # Steps 2-3: critique + reflection (skipped on the final depth)
            if depth < self.max_depth - 1:
                # Step 2: Sub critiques the reasoning
                critique = self.sub.completion(
                    build_critique_prompt(reasoning, candidate, current_rubric, context)
                )
                reasoning_chain.append(critique)

                # Step 3: Root reflects on critique
                reflection_response = self.root.completion(
                    build_reflection_prompt(reasoning, critique, current_rubric, context)
                )
                reflection, candidate = extract_reasoning_and_text(reflection_response)
                reasoning_chain.append(reflection)

            # Step 4: WrenEngine verification gate (mandatory)
            score_result = self._engine.score(candidate)
            rubric_result = self._evaluate_rubric(candidate)
            hybrid_score = 0.4 * score_result.score + 0.6 * rubric_result.overall_score
            if hybrid_score > best_score:
                best_score = hybrid_score
                best_text = candidate

            target = 80 if self.mode != "light" else 60
            if score_result.score >= target:
                break  # threshold met — stop
            # Next depth edits the current candidate with its fresh rubric signal
            context = candidate
            current_rubric = rubric_result
        else:
            # Depth exhausted without meeting threshold: fall back to WrenEngine
            return self._fallback_to_engine(post)

        return RLMResult(
            revised_post=best_text or post,
            initial_rubric=initial_rubric,
            final_rubric=self._evaluate_rubric(best_text),
            reasoning_chain=reasoning_chain,
            rlm_score=best_score,
            engine_score=score_result.score,
            depth_used=depth + 1,
            mode=self.mode,
        )
```
Cost Analysis
| Configuration | LLM Calls | Tokens (reasoning) | Latency |
|---|---|---|---|
| WrenEngine (iterative) | ~5–10 (root only) | ~2K–5K | ~15–30s |
| RLMWrapper depth=1 | ~4 (2×root + 1×sub + 1 rubric judge) | ~4K–8K | ~20–40s |
| RLMWrapper depth=2 | ~8 (2 rounds of the above) | ~8K–16K | ~40–80s |
Cost guard: If per-iteration cost exceeds 3× the first iteration AND score improvement < 5 points, abort RLM loop and fall back to WrenEngine.
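The cost guard can be a small pure helper evaluated after each iteration. A sketch under stated assumptions (function name, cost units, and call site are illustrative; the real wrapper would track tokens via its LLM client):

```python
def should_abort(
    iteration_cost: float,
    first_iteration_cost: float,
    score_improvement: float,
    cost_multiplier: float = 3.0,
    min_improvement: float = 5.0,
) -> bool:
    """Abort the RLM loop when cost balloons without score gains.

    Triggers only when BOTH conditions hold: per-iteration cost exceeds
    cost_multiplier times the first iteration AND the score improved by
    less than min_improvement points.
    """
    too_expensive = iteration_cost > cost_multiplier * first_iteration_cost
    stalled = score_improvement < min_improvement
    return too_expensive and stalled
```

Requiring both conditions avoids aborting a run that is expensive but still converging.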
Implementation Phases
Rubric Core
Before the RLM layer makes sense, the rubric system must exist.
- Create `engine/rubric.py` — RubricResult dataclass, `load_rubric()`, `evaluate_rubric()`
- Create `personas/default/RUBRIC.md` — 4-dimension rubric (D1 Slop, D2 Voice, D3 Specificity, D4 Structure)
- Wire rubric scoring into `WrenEngine.score()` — return both Slop Guard score and RubricResult
- Add hybrid scoring — `0.4 * slop + 0.6 * rubric`
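The hybrid score is a straight weighted blend of the two 0–100 signals. A one-line sketch (weights from this plan; the function name is illustrative):

```python
def hybrid_score(slop_score: float, rubric_score: float) -> float:
    """Blend Slop Guard and rubric scores (both on a 0-100 scale)."""
    return 0.4 * slop_score + 0.6 * rubric_score
```

Weighting the rubric higher reflects that it covers voice, specificity, and structure, while Slop Guard covers only rule-based patterns.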
RLM Infrastructure
- Create `engine/rlm_prompts.py` — the three RLM prompt builders
- Create `engine/rlm_wrapper.py` — RLMWrapper class with depth=1 loop
- Create `engine/rlm_sandbox.py` — lightweight sandbox (simpler than Monty — RLM doesn't need full Python execution)
- Add extraction helpers — `extract_reasoning_and_text()` from LLM response
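The extraction helper has to tolerate the fenced response format from the prompts above, with either the STRATEGY or REFLECTION header. A minimal regex-based sketch (a production parser may need to handle more malformed responses):

```python
import re


def extract_reasoning_and_text(response: str) -> tuple[str, str]:
    """Split an RLM response into (reasoning, revised_text).

    Accepts the STRATEGY or REFLECTION header, with or without
    surrounding code fences.
    """
    # Drop any outer code fences, then normalize whitespace.
    body = response.strip().strip("`").strip()
    match = re.search(
        r"(?:STRATEGY|REFLECTION):\s*(.*?)\s*REVISED_TEXT:\s*(.*)",
        body,
        flags=re.DOTALL,
    )
    if not match:
        # Fallback: treat the whole response as revised text, no reasoning.
        return "", body
    return match.group(1).strip(), match.group(2).strip()
```

The empty-reasoning fallback keeps the pipeline alive when a model ignores the format; the verification gate still scores the text.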
Integration
- Modify `editor/editor.py` — add `RLMWrapper` as alternative to `WrenEngine`
- Add `--rlm` CLI flag — `wren edit post.txt --rlm --rubric default`
- Add `use_rlm: bool` parameter to `Wren.__init__()`
- Wire Logfire spans — trace reasoning_chain across depths
Comparison + Tuning
- Add A/B test mode — run both WrenEngine and RLMWrapper on same input, compare scores
- Tune depth — test depth=1 vs depth=2 on hard cases
- Tune rubric weights per persona
- Cost tracking — log tokens used per depth, compare to direct WrenEngine cost
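The A/B decision itself can be a tiny pure function run over each paired result (name and tie-break margin are illustrative assumptions, not part of the plan above):

```python
def compare_ab(engine_score: float, rlm_score: float, margin: float = 1.0) -> str:
    """Pick a winner for one A/B run.

    Ties within `margin` points go to the cheaper iterative engine,
    since the RLM path costs roughly 2x the tokens.
    """
    if rlm_score > engine_score + margin:
        return "rlm"
    return "engine"
```

Aggregating winners across a fixture set then answers the depth=1 vs depth=2 question empirically.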
Files to Create / Modify
| File | Action | Purpose |
|---|---|---|
| `engine/rubric.py` | Create | RubricResult, load_rubric(), evaluate_rubric() |
| `engine/rlm_prompts.py` | Create | Three RLM prompt builders |
| `engine/rlm_wrapper.py` | Create | RLMWrapper class |
| `engine/rlm_sandbox.py` | Create | Lightweight sandbox for RLM (llm_query only) |
| `personas/default/RUBRIC.md` | Create | Default 4-dimension rubric |
| `editor/editor.py` | Modify | Add --rlm flag, RLMWrapper integration |
| `engine/wren_engine.py` | Modify | Return RubricResult alongside ScoreResult |
| `tests/test_rlm_*.py` | Create | RLM unit + integration tests |
Open Questions
- Depth=1 vs depth=2 tradeoff: Wang et al. says depth=1 optimal, but prose editing may benefit from depth=2 for complex cases. How to decide dynamically?
- When to fall back to WrenEngine? If RLM reasoning doesn't produce a candidate above threshold in 2 rounds, should we fall back to the iterative editing loop?
- Rubric + Slop Guard overlap: D1 criteria overlap with Slop Guard's rule-based checks. Should RLM layer only focus on D2/D3/D4 (llm_judgment)?
- Reasoning chain length: Verbose reasoning chains are good for observability but cost tokens. Should there be a max length?
- Multiple sub-model critiques: PRefLexOR uses multi-agent critique. Could multiple sub-model critiques improve reasoning quality?
Test Plan
Unit Tests
````
# test_rlm_extraction.py
def test_extract_reasoning_and_text():
    response = """```
STRATEGY:
The prose has D2 voice failures (hedging) and D3 specificity failures.
REVISED_TEXT:
In 2024, NotYourIdea's team of 4 grew revenue 3x.
```"""
    reasoning, text = extract_reasoning_and_text(response)
    assert "hedging" in reasoning
    assert "2024" in text


# test_rlm_wrapper.py
def test_rlm_wrapper_depth_one():
    wrapper = RLMWrapper(rubric_path="personas/default/RUBRIC.md")
    result = wrapper.run("Some blog post text...")
    assert len(result.reasoning_chain) == 3  # reasoning, critique, reflection
    assert result.depth_used == 1
````
Integration Tests
```
# test_rlm_vs_wren.py
def test_rlm_better_on_hard_cases():
    """RLM should outperform Wren on cases with mixed rubric failures."""
    hard_post = load_test_fixture("mixed_rubric_failures.txt")
    wren_result = WrenEngine().run(hard_post, mode="standard")
    rlm_result = RLMWrapper(rubric_path="personas/default/RUBRIC.md").run(hard_post)
    # RLM should produce better or equal on hard cases
    assert rlm_result.engine_score >= wren_result.final_score - 5


def test_rlm_reasoning_is_interpretable():
    result = RLMWrapper(rubric_path="personas/default/RUBRIC.md").run(test_post)
    assert len(result.reasoning_chain) > 0
    for item in result.reasoning_chain:
        assert len(item) > 50  # non-trivial reasoning
```