Multi-Agentic Research Platform

Answers research questions only with claims it can verify — every stage traced, every citation grounded.

operational overview

The Multi-Agentic Research Platform (MARP) exists because a single LLM call cannot be audited: it answers, but it cannot show its work. The platform decomposes research into five specialized agents run in sequence, iterates until the answer meets a confidence threshold or hits the iteration cap, and returns every step traced and timed for inspection.

architecture

The Planner turns the question into a structured retrieval plan: typed PlanStep objects, each pairing a sub-question with a search query. The Retriever runs cosine-similarity search against PostgreSQL with pgvector, using embeddings generated through Gemini's embedContent API, and returns ranked chunks with source metadata and similarity scores. The Writer drafts from evidence, the Critic challenges the draft, and the Verifier checks claims before release; the Critic→Writer loop repeats until confidence clears the threshold or the iteration cap is reached. Every agent emits typed trace events.
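To make "typed PlanStep objects" and "typed trace events" concrete, here is a minimal sketch. Only the name PlanStep and its two parts (sub-question, search query) come from the description above; the field names, the TraceEvent class, and the traced helper are assumptions made for illustration and may not match the platform's actual schema.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PlanStep:
    """One unit of the retrieval plan: a sub-question plus its search query."""
    sub_question: str   # field name is an assumption
    search_query: str   # field name is an assumption


@dataclass
class TraceEvent:
    """Hypothetical trace record; the real event schema is not shown here."""
    agent: str
    status: str = "ok"
    duration_ms: float = 0.0
    detail: dict = field(default_factory=dict)


@contextmanager
def traced(agent: str, trace: list):
    """Time one agent stage and append exactly one TraceEvent, even on failure."""
    event = TraceEvent(agent=agent)
    start = time.perf_counter()
    try:
        yield event
    except Exception as exc:
        # Record the failure in the trace, then let the caller decide what to do.
        event.status = "error"
        event.detail["error"] = repr(exc)
        raise
    finally:
        event.duration_ms = (time.perf_counter() - start) * 1000.0
        trace.append(event)


# Usage: wrap a stage so its timing and outcome land in the trace.
trace: list = []
with traced("planner", trace) as ev:
    plan = [PlanStep("What is pgvector?", "pgvector postgres extension")]
    ev.detail["steps"] = len(plan)
```

Recording the event in a finally block means a stage that raises still leaves a timed entry behind, which is the property that makes a failed run inspectable.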

constraints

evidence grounding — no claim ships without a retrieval trail behind it
bounded iteration — the critique loop must converge or stop at a hard cap, never spin
LLM output fragility — structured JSON from a model cannot be assumed valid
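The bounded-iteration constraint can be sketched as a loop with two exits: confidence reaching the threshold, or the cap being hit. This is an illustration under stated assumptions, not the platform's code: the function name, the callable signatures, and the default threshold (0.8) and cap (3) are all hypothetical.

```python
from typing import Callable, List, Tuple


def refine_until_confident(
    write: Callable[[str, List[str]], str],
    critique: Callable[[str], Tuple[float, List[str]]],
    question: str,
    threshold: float = 0.8,    # assumed value, not taken from the platform
    max_iterations: int = 3,   # assumed value, not taken from the platform
) -> Tuple[str, float, int]:
    """Run the write/critique loop until confident or until the hard cap.

    Returns (final draft, final confidence, iterations used).
    """
    if max_iterations < 1:
        raise ValueError("max_iterations must be at least 1")

    feedback: List[str] = []
    draft, confidence, iteration = "", 0.0, 0
    for iteration in range(1, max_iterations + 1):
        draft = write(question, feedback)           # Writer drafts from feedback so far
        confidence, feedback = critique(draft)      # Critic scores and objects
        if confidence >= threshold:
            break                                   # converged: stop early
    return draft, confidence, iteration             # otherwise: stopped at the cap


# Usage with stub agents whose confidence rises each round.
scores = iter([0.4, 0.6, 0.9])


def fake_write(question: str, feedback: List[str]) -> str:
    return f"{question} (addressing {len(feedback)} objections)"


def fake_critique(draft: str) -> Tuple[float, List[str]]:
    return next(scores), ["cite a source for the second claim"]


draft, confidence, rounds = refine_until_confident(fake_write, fake_critique, "q")
```

Because the loop is a for over a fixed range, it cannot spin: the worst case is exactly max_iterations Writer and Critic calls.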

tradeoffs

five single-responsibility agents over one omnibus prompt: more inference calls per question, but each stage emits typed traces and can be replaced without retraining the others
loop-until-confident over single-pass answers: response latency deliberately spent on claim-level verification, bounded by a hard iteration cap so the spend cannot run away
pgvector inside Postgres over a managed vector service: one database, one operational surface, one failure domain to observe
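As a rough illustration of the pgvector choice, a cosine-similarity lookup can be a single SQL statement against the same database that holds everything else. pgvector's `<=>` operator returns cosine distance, so similarity is one minus that value. The table name chunks, its columns, and the helper names below are assumptions for the sketch; the platform's real schema is not shown here.

```python
from typing import Dict, Sequence, Union

# Hypothetical schema: chunks(id, source, content, embedding vector(N)).
# `<=>` is pgvector's cosine-distance operator; 1 - distance gives similarity.
TOP_K_SQL = """
SELECT id, source, content,
       1 - (embedding <=> %(query)s::vector) AS similarity
FROM chunks
ORDER BY embedding <=> %(query)s::vector
LIMIT %(k)s
"""


def to_vector_literal(embedding: Sequence[float]) -> str:
    """Format an embedding in pgvector's text form, e.g. '[0.1,2.0,-3.5]'."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


def build_search_params(
    query_embedding: Sequence[float], k: int = 5
) -> Dict[str, Union[str, int]]:
    """Build the named parameters for TOP_K_SQL."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return {"query": to_vector_literal(query_embedding), "k": k}


# With a psycopg cursor this would run as: cur.execute(TOP_K_SQL, params)
params = build_search_params([0.1, 2, -3.5], k=3)
```

Ordering by the distance expression rather than the computed similarity keeps the query in the form pgvector's indexes are designed to accelerate.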

failure notes

the Planner's JSON parsing can fail on malformed LLM output — it degrades to treating the raw output as a single search query rather than aborting the run
the Retriever returns an empty list gracefully when the vector store has nothing — downstream stages handle absence of evidence as a first-class state
the Retriever currently executes only the first PlanStep of a multi-step plan — a known limit, preserved in the trace rather than papered over
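The Planner's fallback described above can be sketched as a parser that never raises: valid JSON becomes a list of steps, and anything else becomes one step whose search query is the raw model output. The expected JSON shape (an array of objects with sub_question and search_query keys) and the function name parse_plan are assumptions for illustration.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PlanStep:
    sub_question: str   # field name is an assumption
    search_query: str   # field name is an assumption


def parse_plan(raw: str, question: str) -> List[PlanStep]:
    """Parse the Planner's LLM output, degrading instead of aborting.

    Assumed input shape: a JSON array of objects with "sub_question"
    and "search_query" keys.
    """
    try:
        data = json.loads(raw)
        steps = [
            PlanStep(str(item["sub_question"]), str(item["search_query"]))
            for item in data
        ]
        if steps:
            return steps
    except (json.JSONDecodeError, KeyError, TypeError):
        # Malformed JSON, missing keys, or the wrong top-level type.
        pass
    # Fallback: treat the raw output as a single search query.
    return [PlanStep(sub_question=question, search_query=raw.strip() or question)]


good = parse_plan(
    '[{"sub_question": "What is pgvector?", "search_query": "pgvector extension"}]',
    "How does the retriever work?",
)
degraded = parse_plan("pgvector cosine search", "How does the retriever work?")
```

Catching KeyError and TypeError alongside JSONDecodeError matters because output that parses as JSON can still have the wrong shape, such as a bare object or a list of strings.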

infrastructure

python · postgres + pgvector · gemini embeddings · typescript · docker

engineering reasoning

The interesting problem was never the model — it was how intelligence behaves under constraints: what pipeline shape makes an LLM's answer auditable instead of plausible. Single-responsibility stages with typed contracts and traces are boring, and boring is what can be debugged.

future work

execute the full retrieval plan, not just its first step
confidence calibration against held-out questions