♾️
Self-Improving Loop Engineering Agents
Coding · Review · Quality · Orchestrator · Reflection · LangGraph
Point LOOPFORGE_API at your backend. Workflow: clone → pytest → fix → branch → commit → open PR on GitHub.
Coding Agent
Primary: Gemini 2.5 Flash
Fallback: Claude Haiku 4.5
Tools: code_exec, bash, write_file
Output: code + tests + diff
Loop role: Generate → get reviewed → refine
Review Agent
Primary: Groq Llama 3.3 70B
Pattern: Reflection / Critique
Checks: Logic · Security · Style
Output: score + issues[] + suggestions[]
Loop role: Score → annotate → route back
Memory / KB
Storage: Pinecone + Redis
Content: past runs · patterns · errors
Retrieval: RAG on agent decisions
Write: after every successful loop
[Architecture diagram. Nodes: Orchestrator (Claude Sonnet / GPT-4o), Coding Agent (Gemini 2.5 Flash), Review Agent (Groq Llama 3.3 70B), Quality Agent (Groq / Cerebras), Memory / KB (Pinecone + Redis), LangGraph State Engine (interrupt_before · checkpointer). Edge labels: retry loop, HITL gate, self-improve.]
Quality Agent
Primary: Groq Llama 3.3 70B
Fallback: Cerebras (1M tok/day)
Runs: pytest · coverage · lint
Output: pass/fail + metrics
Loop role: Gatekeeper (pass or retry)
Orchestrator
Primary: Claude Sonnet 4.6
Pattern: Plan & Execute + Reflection
Controls: max_iterations · thresholds
Self-improves: rewrites own prompt on fail
Loop role: Route · Decide · Adapt
HITL Gate
Trigger: quality_score < 0.8 after 3 retries
Or: security flag raised
Or: confidence < threshold
LangGraph: interrupt_before=["hitl"]
Resume: POST /hitl/approve
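The three HITL triggers can be read as one predicate. The following is a minimal sketch, assuming the conditions are OR-ed together. The function name `needs_human_review`, its parameter names, and the default `confidence_threshold` of 0.5 are illustrative assumptions; the card does not give a confidence value.

```python
def needs_human_review(
    quality_score: float,
    retries: int,
    security_flag: bool,
    confidence: float,
    confidence_threshold: float = 0.5,  # assumed value, not from the card
) -> bool:
    """Return True when any of the HITL gate triggers fires."""
    # Trigger 1: quality is still below 0.8 after 3 retries
    exhausted = retries >= 3 and quality_score < 0.8
    # Trigger 2: a security flag was raised; Trigger 3: low confidence
    return exhausted or security_flag or confidence < confidence_threshold
```

A run that is still below 0.8 on its second retry keeps looping; the same score on the third retry is handed to a human.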
One Complete Loop Iteration — Steps in Sequence
1. 🧠 Plan: Orchestrator decomposes task
2. ⌨️ Generate: Coding Agent writes code + tests
3. 🔍 Review: Review Agent critiques + scores
4. 🧪 Test: Quality Agent runs pytest/lint
5. ⚖️ Decide: Orchestrator chooses pass/retry/escalate
6. 💾 Learn: Memory Agent stores patterns
7. 🔄 Adapt: Orchestrator rewrites prompts
8. Ship / HITL: Human gate or auto-approve
🧠
Orchestrator Agent
Plans, routes, and self-improves the loop
Claude Sonnet 4.6
Core Capabilities
Decomposes user task into subtasks for coding agent
Routes between agents based on quality scores
Self-improves: rewrites coding prompt if review fails 2× in a row
Controls max_iterations to prevent infinite loops
Reads memory KB to avoid repeating past mistakes
Plan & Execute · Reflection · prompt_rewriter · memory_lookup
♾️ Self-Improving
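Two of the Orchestrator's controls, the prompt rewrite after two consecutive review failures and the `max_iterations` cap, amount to a small piece of bookkeeping. This is a rough sketch only: the class `LoopControl`, its field names, and the default cap of 5 are hypothetical and not taken from the original design.

```python
from dataclasses import dataclass


@dataclass
class LoopControl:
    """Tracks iteration count and consecutive review failures (hypothetical helper)."""
    max_iterations: int = 5      # assumed default
    rewrite_after: int = 2       # "fails 2x in a row" from the card
    iteration: int = 0
    consecutive_failures: int = 0

    def record_review(self, passed: bool) -> str:
        """Record one review result and return 'continue', 'rewrite_prompt', or 'stop'."""
        self.iteration += 1
        self.consecutive_failures = 0 if passed else self.consecutive_failures + 1
        # The iteration cap wins over everything else to prevent infinite loops
        if self.iteration >= self.max_iterations:
            return "stop"
        if self.consecutive_failures >= self.rewrite_after:
            self.consecutive_failures = 0  # start counting again with the new prompt
            return "rewrite_prompt"
        return "continue"
```

Resetting the failure counter after a rewrite gives the new prompt a fair number of attempts before it is rewritten again.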
⌨️
Coding Agent
Generates, refactors, and patches code iteratively
Gemini 2.5 Flash (free)
Core Capabilities
Generates code from task spec + memory context
Writes unit + integration tests alongside code
Receives Review Agent critique → applies targeted patches
Produces structured diff for each iteration
Executes code in sandbox and self-corrects on error
code_execute · write_file · bash · diff_apply · ReAct
⌨️ Generate + Refine
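One plausible way to produce the per-iteration diff is Python's standard `difflib` module. The sketch below assumes the agent keeps the previous and current source as strings; the helper name `make_diff` and the default path `solution.py` are illustrative, and the original system may format its diffs differently.

```python
import difflib


def make_diff(previous: str, current: str, path: str = "solution.py") -> str:
    """Return a unified diff between two versions of a file ('' if identical)."""
    lines = difflib.unified_diff(
        previous.splitlines(keepends=True),
        current.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
    )
    return "".join(lines)


if __name__ == "__main__":
    old = "def add(a, b):\n    return a - b\n"
    new = "def add(a, b):\n    return a + b\n"
    print(make_diff(old, new))
```

An empty string signals that an iteration changed nothing, which an orchestrator could treat as a reason to stop retrying with the same prompt.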
🔍
Review Agent
Critiques code quality, logic, and security
Groq Llama 3.3 70B (free)
Core Capabilities
Scores code on: correctness, complexity, security, style (0–1 each)
Returns structured issues[] with line references
Flags security vulnerabilities (OWASP top 10)
Compares current vs previous iteration — measures improvement
Feeds score into Orchestrator routing decision
Reflection · static_analysis · score_rubric · diff_compare
🔍 Critique Pattern
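The rubric and the iteration-over-iteration comparison can be sketched with a small data structure. The `Review` class, the `improvement` helper, and the unweighted mean used for the overall score are assumptions for illustration; the card does not say how the four dimensions are combined.

```python
from dataclasses import dataclass, field

# The four rubric dimensions listed on the card, each scored 0 to 1
DIMENSIONS = ("correctness", "complexity", "security", "style")


@dataclass
class Review:
    scores: dict[str, float]
    # Each issue is a dict such as {"line": 12, "type": "security", "msg": "..."}
    issues: list[dict] = field(default_factory=list)

    def overall(self) -> float:
        """Unweighted mean of the rubric scores (the weighting is an assumption)."""
        return sum(self.scores[d] for d in DIMENSIONS) / len(DIMENSIONS)


def improvement(previous: Review, current: Review) -> dict[str, float]:
    """Per-dimension score change between two iterations (positive means better)."""
    return {d: round(current.scores[d] - previous.scores[d], 3) for d in DIMENSIONS}
```

A flat or negative delta on every dimension is a cheap signal that the last patch did not help.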
Quality Agent
Runtime validation — tests, coverage, lint, benchmarks
Groq / Cerebras (free)
Core Capabilities
Executes pytest — parses pass/fail + coverage %
Runs ruff / flake8 lint and complexity (radon)
Benchmarks performance against baseline
Hard gate: coverage < 80% → mandatory retry
Emits quality_report to Orchestrator + Memory
pytest_runner · coverage.py · ruff · radon · bandit
✅ Gate Keeper
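The hard gate can be expressed as a pure function over numbers already parsed from the test and coverage runs. This is a sketch under assumptions: the function name `build_quality_report` and the report keys other than `test_passed` and `coverage_pct` are hypothetical, and lint errors are only reported here because the card does not say whether they block a pass.

```python
def build_quality_report(
    tests_passed: int,
    tests_failed: int,
    coverage_pct: float,
    lint_errors: int = 0,
    min_coverage: float = 80.0,  # hard gate from the card
) -> dict:
    """Summarize a test run and decide between 'pass' and 'retry'."""
    reasons = []
    if tests_failed > 0:
        reasons.append(f"{tests_failed} test(s) failed")
    if tests_passed + tests_failed == 0:
        reasons.append("no tests were collected")
    if coverage_pct < min_coverage:
        reasons.append(f"coverage {coverage_pct:.1f}% is below {min_coverage:.1f}%")
    return {
        "test_passed": tests_failed == 0 and tests_passed > 0,
        "coverage_pct": coverage_pct,
        "lint_errors": lint_errors,  # reported only; not part of the gate in this sketch
        "verdict": "retry" if reasons else "pass",
        "reasons": reasons,
    }
```

Returning the reasons alongside the verdict gives the Coding Agent something concrete to act on in the next iteration.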
🗃️
Memory / KB Agent
Persists patterns, errors, and prompt improvements across runs
Pinecone + Redis + SQLite
Core Capabilities
Stores every (task, prompt, code, score, outcome) tuple
RAG retrieval: "find similar past tasks that failed"
Maintains prompt_evolution log per task type
Surfaces "anti-patterns" to Coding Agent at start
Enables cross-run learning: each run can draw on stored patterns, errors, and prompt versions from earlier runs
vector_store · semantic_search · prompt_log · pattern_extract
🗃️ Cross-Run Learning
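The store-then-retrieve behavior can be illustrated without a vector database. The sketch below keeps records in a Python list and ranks them by word overlap (Jaccard similarity), which is a deliberately simple stand-in for the embedding search a Pinecone index would provide. `RunRecord`, `RunMemory`, and their methods are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    """One (task, prompt, code, score, outcome) tuple from a finished loop."""
    task: str
    prompt: str
    code: str
    score: float
    outcome: str  # "pass" or "fail"


class RunMemory:
    def __init__(self) -> None:
        self._records: list[RunRecord] = []

    def write(self, record: RunRecord) -> None:
        self._records.append(record)

    @staticmethod
    def _similarity(a: str, b: str) -> float:
        # Jaccard overlap of lowercase word sets; stands in for vector similarity
        ta, tb = set(a.lower().split()), set(b.lower().split())
        return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

    def similar_failures(self, task: str, k: int = 3) -> list[RunRecord]:
        """Return up to k failed runs whose task text is closest to `task`."""
        failed = [r for r in self._records if r.outcome == "fail"]
        failed.sort(key=lambda r: self._similarity(task, r.task), reverse=True)
        return failed[:k]
```

The records returned by `similar_failures` are the kind of material that could be summarized into the anti-patterns handed to the Coding Agent at the start of a run.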
🔄
Self-Improve Meta-Agent
Rewrites agent prompts based on failure analysis
Claude Sonnet 4.6 + Memory
Core Capabilities
Analyzes failure modes across N recent loop runs
Generates improved system prompt variants (A/B)
Evaluates prompt variants on held-out test cases
Promotes best-performing prompt → Production
Snapshots old prompts for rollback (versioned in Redis)
prompt_mutation · eval_harness · A/B test · promote · rollback
♾️ Meta-Learning
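Promotion and rollback can be modeled as a versioned list with a pointer to the live prompt. This sketch keeps everything in memory, whereas the card says snapshots are versioned in Redis. The `PromptRegistry` class is hypothetical, and the `score` callable stands in for whatever the evaluation harness returns on the held-out cases.

```python
from typing import Callable, Sequence


class PromptRegistry:
    """In-memory prompt version store with promote and rollback (illustrative only)."""

    def __init__(self, initial: str) -> None:
        self._versions: list[str] = [initial]
        self._live = 0  # index of the prompt currently in production

    @property
    def live(self) -> str:
        return self._versions[self._live]

    def promote_best(self, candidates: Sequence[str], score: Callable[[str], float]) -> bool:
        """Promote the top-scoring candidate only if it beats the live prompt."""
        if not candidates:
            return False
        best = max(candidates, key=score)
        if score(best) <= score(self.live):
            return False
        self._versions.append(best)  # older versions stay available for rollback
        self._live = len(self._versions) - 1
        return True

    def rollback(self) -> str:
        """Step back one version (no-op at the first version) and return the live prompt."""
        if self._live > 0:
            self._live -= 1
        return self.live
```

Requiring a candidate to beat the live prompt on the same evaluation keeps a noisy batch of variants from replacing a prompt that already works.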
🔁
Inner Loop — Per-Task Refinement
♾️
Outer Loop — Prompt Self-Improvement
📈
Quality Score Evolution — Across Loop Iterations
Per-Metric Score (latest run)
Correctness: 0.91
Test Coverage: 87%
Security Score: 0.94
Code Complexity: CC=7
Avg Iterations: 2.1
Improvement Across Prompt Versions
prompt_v1 avg: 0.54
prompt_v2 avg: 0.68
prompt_v3 avg: 0.79
prompt_v4 avg: 0.88
prompt_v5 (live): 0.93 ↑
State Schema + Graph Definition
from langgraph.graph import StateGraph, START, END
from typing import TypedDict, List, Optional
from langgraph.checkpoint.redis import RedisSaver

class AgentLoopState(TypedDict):
  """Shared state passed between every node in the loop."""
  # Input
  task: str
  spec: dict
  # Orchestrator
  plan: Optional[dict]
  active_prompt: str
  iteration: int
  # Coding Agent
  generated_code: Optional[str]
  generated_tests: Optional[str]
  code_diff: Optional[str]
  # Review Agent
  review_score: Optional[float]
  review_issues: Optional[List[dict]]
  # Quality Agent
  test_passed: Optional[bool]
  coverage_pct: Optional[float]
  quality_report: Optional[dict]
  # Memory
  memory_context: Optional[str]
  anti_patterns: Optional[List[str]]
  # Control
  max_iterations: int
  should_escalate: bool
  run_id: str

# ── Build the graph ──
# The *_node callables below are assumed to be defined elsewhere in the project.
builder = StateGraph(AgentLoopState)

builder.add_node("orchestrate", orchestrator_node)
builder.add_node("memory_retrieve", memory_retrieve_node)
builder.add_node("code", coding_agent_node)
builder.add_node("review", review_agent_node)
builder.add_node("quality", quality_agent_node)
builder.add_node("memory_write", memory_write_node)
builder.add_node("hitl", hitl_node)          # interrupt
builder.add_node("self_improve", self_improve_node)

builder.add_edge(START, "orchestrate")
builder.add_edge("orchestrate", "memory_retrieve")
builder.add_edge("memory_retrieve", "code")
builder.add_edge("code", "review")
builder.add_edge("review", "quality")

# ── Conditional routing ──
builder.add_conditional_edges(
  "quality", route_after_quality,
  {
    "retry": "orchestrate",     # loop back
    "escalate": "hitl",       # human gate
    "pass": "memory_write",   # success
  }
)
builder.add_edge("memory_write", "self_improve")
builder.add_edge("self_improve", END)
builder.add_edge("hitl", "code")

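# `client` is assumed to be a Redis connection created elsewhere; check the
# langgraph-checkpoint-redis docs for the constructor arguments of your version.
graph = builder.compile(
  checkpointer=RedisSaver(client),
  interrupt_before=["hitl"],  # pause before the human gate until the run is resumed
)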
Routing Function
def route_after_quality(state: AgentLoopState) -> str:
  """Return "pass", "retry", or "escalate" for the conditional edge."""
  score = state["review_score"] or 0.0
  passed = state["test_passed"] or False
  coverage = state["coverage_pct"] or 0.0
  iteration = state["iteration"]
  max_iterations = state["max_iterations"]
  issues = state["review_issues"] or []

  # Hard escalation: checked first, so a run that reaches the cap
  # goes to HITL even if its last attempt would have passed
  if iteration >= max_iterations:
    return "escalate"  # HITL takes over

  # Security flag → always escalate
  if any(issue.get("type") == "security" for issue in issues):
    return "escalate"

  # Pass gate
  if score >= 0.85 and passed and coverage >= 80:
    return "pass"

  return "retry"              # loop
Self-Improve Node
async def self_improve_node(state):
  # memory, meta_llm, eval_harness, PROMPT_ENGINEER_SYSTEM and EVAL_SUITE
  # are assumed to be module-level objects defined elsewhere

  # Only triggers every N=10 runs
  if not should_trigger_improve():
    return state

  # Pull last 10 run traces and keep the low-scoring ones
  traces = memory.get_recent_runs(n=10)
  failures = [t for t in traces if t["avg_score"] < 0.75]
  if not failures:
    return state  # nothing to learn from; keep the current prompt

  # Ask Claude to mutate the prompt
  new_prompts = await meta_llm.generate(
    system=PROMPT_ENGINEER_SYSTEM,
    user=(
      f"Failures: {failures}\n"
      f"Current prompt: {state['active_prompt']}\n"
      "Generate 3 improved variants."
    ),
  )

  # Eval on held-out test cases
  best = await eval_harness.pick_best(
    candidates=new_prompts,
    test_cases=EVAL_SUITE,
  )

  # Persist + promote
  memory.save_prompt_version(best)
  return {**state, "active_prompt": best}
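The node calls `should_trigger_improve()` without showing it. One possible shape, assuming a simple per-process call counter, is sketched here. The `every_n` parameter and the `itertools.count` counter are assumptions; a deployment with several workers would more likely keep the count in shared storage such as Redis.

```python
import itertools

# Process-local run counter; starts at 1 so the 10th call is the first trigger
_run_counter = itertools.count(1)


def should_trigger_improve(every_n: int = 10) -> bool:
    """Return True on every `every_n`-th call (the 10th, 20th, ... by default)."""
    return next(_run_counter) % every_n == 0
```

Because the counter lives in process memory, it resets on restart, which delays the next prompt review but does not break the loop.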
Model Router Config
AGENT_MODELS = {
  "orchestrator": {
    "primary": "anthropic/claude-sonnet-4-6",
    "fallback": "gemini/gemini-2.5-flash",
  },
  "coding": {
    "primary": "gemini/gemini-2.5-flash",  # free
    "fallback": "anthropic/claude-haiku-4-5",
  },
  "review": {
    "primary": "groq/llama-3.3-70b-versatile",  # free+fast
    "fallback": "gemini/gemini-2.5-flash",
  },
  "quality": {
    "primary": "groq/llama-3.3-70b-versatile",  # free
    "fallback": "cerebras/llama-3.3-70b",
  },
  "meta_improve": {
    "primary": "anthropic/claude-sonnet-4-6",  # best reasoning
  },
}
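A config of this shape suggests a try-primary-then-fallback call path. The sketch below shows one way to do that, with loud assumptions: `complete_with_fallback` and the `call_model(model, prompt)` callable are hypothetical stand-ins for whatever client library actually sends the request, and `EXAMPLE_MODELS` repeats two entries from the config only so the block runs on its own.

```python
from typing import Callable

# Two entries copied from the router config so this sketch is self-contained
EXAMPLE_MODELS = {
    "review": {
        "primary": "groq/llama-3.3-70b-versatile",
        "fallback": "gemini/gemini-2.5-flash",
    },
    "meta_improve": {"primary": "anthropic/claude-sonnet-4-6"},
}


def complete_with_fallback(
    agent: str,
    prompt: str,
    call_model: Callable[[str, str], str],  # hypothetical: (model_id, prompt) -> text
    models: dict = EXAMPLE_MODELS,
) -> str:
    """Try the agent's primary model, then its fallback if one is configured."""
    cfg = models[agent]
    candidates = [cfg["primary"]]
    if "fallback" in cfg:
        candidates.append(cfg["fallback"])
    last_error = None
    for model in candidates:
        try:
            return call_model(model, prompt)
        except Exception as exc:  # rate limits, timeouts, provider outages
            last_error = exc
    raise RuntimeError(f"all models failed for agent {agent!r}") from last_error
```

Agents without a `fallback` entry, such as `meta_improve`, fail after a single attempt, which may be acceptable for a job that only runs every few loops.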