Multi-Agentic Research Platform

Answers research questions only with claims it can verify — every stage traced, every citation grounded.

operational overview

MARP exists because a single LLM call cannot be audited: it answers, but it cannot show its work. The platform decomposes research into five specialized agents in sequence, iterates until the answer meets a confidence threshold or hits the iteration cap, and returns every step traced and timed for inspection.

architecture

The Planner turns the question into a structured retrieval plan of typed PlanStep objects, each pairing a sub-question with a search query. The Retriever runs cosine-similarity search against PostgreSQL with pgvector, using embeddings generated through Gemini's embedContent API, and returns ranked chunks with source metadata and similarity scores. The Writer drafts from the retrieved evidence, the Critic challenges the draft, and the Verifier checks claims before release; the Critic→Writer loop repeats until confidence clears the threshold or the iteration cap is reached. Every agent emits typed trace events.
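The typed contracts and timed traces described above can be sketched roughly as follows. This is a minimal illustration, not the platform's code: only the PlanStep name and its two fields come from the description, while TraceEvent, Trace, the field names, and the stand-in plan() function are hypothetical.

```python
from __future__ import annotations

import time
from dataclasses import dataclass, field
from typing import Callable, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class PlanStep:
    """One unit of the retrieval plan: a sub-question plus its search query."""
    sub_question: str
    search_query: str


@dataclass(frozen=True)
class TraceEvent:
    """Hypothetical trace record: which agent ran, how long it took, a summary."""
    agent: str
    duration_ms: float
    summary: str


@dataclass
class Trace:
    events: list[TraceEvent] = field(default_factory=list)

    def run(self, agent: str, fn: Callable[[], T], summarize: Callable[[T], str]) -> T:
        """Run one stage, time it, and record a trace event before returning."""
        start = time.perf_counter()
        result = fn()
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        self.events.append(TraceEvent(agent, elapsed_ms, summarize(result)))
        return result


def plan(question: str) -> list[PlanStep]:
    # Stand-in for the Planner's LLM call (assumption: real planning is model-driven).
    return [PlanStep(sub_question=question, search_query=question.lower())]


trace = Trace()
steps = trace.run(
    "planner",
    lambda: plan("What is pgvector?"),
    lambda s: f"{len(s)} step(s)",
)
```

Wrapping each stage in the same run() call is one way to keep tracing uniform: a stage cannot execute without leaving a timed record behind.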

[Pipeline diagram. Labels: IN, PLAN, RETRIEVE, WRITE, CRITIC, VERIFY, OUT, with PGVECTOR as the retrieval store.]

constraints

  • evidence grounding — no claim ships without a retrieval trail behind it
  • bounded iteration — the critique loop must converge or stop at a hard cap, never spin
  • LLM output fragility — structured JSON from a model cannot be assumed valid
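The bounded-iteration constraint can be illustrated with a small loop that stops either on confidence or at the cap. This is a sketch under assumptions: the threshold of 0.8, the cap of 3, the LoopResult type, and the write/critique callables are all hypothetical stand-ins for the Writer and Critic agents.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class LoopResult:
    draft: str
    confidence: float
    iterations: int
    converged: bool  # False means the hard cap ended the loop


def critique_loop(
    write: Callable[[Optional[str]], str],
    critique: Callable[[str], Tuple[float, str]],
    threshold: float = 0.8,   # hypothetical confidence bar
    max_iterations: int = 3,  # hypothetical hard cap
) -> LoopResult:
    """Alternate Writer and Critic until confidence clears the bar or the cap is hit."""
    if max_iterations < 1:
        raise ValueError("max_iterations must be at least 1")
    feedback: Optional[str] = None
    draft, confidence = "", 0.0
    for iteration in range(1, max_iterations + 1):
        draft = write(feedback)                   # first pass has no feedback
        confidence, feedback = critique(draft)    # Critic scores and comments
        if confidence >= threshold:
            return LoopResult(draft, confidence, iteration, True)
    # Cap reached: return the last draft and say so, rather than spinning.
    return LoopResult(draft, confidence, max_iterations, False)


# Usage with scripted critic scores instead of real model calls.
scores = iter([0.4, 0.7, 0.9])
result = critique_loop(
    write=lambda fb: "draft" if fb is None else f"draft revised for: {fb}",
    critique=lambda d: (next(scores), "cite a source for claim 2"),
    max_iterations=5,
)
```

Returning a converged flag alongside the draft keeps the "stopped at the cap" case visible to whatever consumes the result, instead of letting it pass as a confident answer.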

tradeoffs

  • five single-responsibility agents over one omnibus prompt: more inference calls per question, but each stage emits typed traces and can be replaced without reworking the others
  • loop-until-confident over single-pass answers: response latency deliberately spent on claim-level verification, bounded by a hard iteration cap so the spend cannot run away
  • pgvector inside Postgres over a managed vector service: one database, one operational surface, one failure domain to observe
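Keeping vectors inside Postgres means retrieval is an ordinary SQL query. The sketch below builds such a query using pgvector's cosine-distance operator (<=>), where similarity is 1 minus distance. The table name chunks, its columns, and the function names are hypothetical; the %s placeholders assume a psycopg-style driver, and nothing here connects to a database.

```python
from __future__ import annotations

import math
from typing import List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Reference definition of the score the query below asks Postgres to compute."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_similarity_query(embedding: List[float], top_k: int = 5) -> Tuple[str, tuple]:
    """Return (sql, params) for a top-k cosine search over a hypothetical chunks table."""
    # pgvector accepts vectors in text form, e.g. '[0.1,0.2]', cast with ::vector.
    vector_literal = "[" + ",".join(repr(float(x)) for x in embedding) + "]"
    sql = (
        "SELECT id, source, content, "
        "1 - (embedding <=> %s::vector) AS similarity "  # <=> is cosine distance
        "FROM chunks "
        "ORDER BY embedding <=> %s::vector "             # nearest first
        "LIMIT %s"
    )
    return sql, (vector_literal, vector_literal, top_k)


sql, params = build_similarity_query([0.1, 0.2], top_k=3)
```

The returned pair can be passed to a cursor's execute(sql, params) call; passing the vector as a parameter rather than formatting it into the SQL string avoids injection through query text.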

failure notes

  • the Planner's JSON parsing can fail on malformed LLM output — it degrades to treating the raw output as a single search query rather than aborting the run
  • the Retriever returns an empty list gracefully when the vector store has nothing — downstream stages handle absence of evidence as a first-class state
  • the Retriever currently executes only the first PlanStep of a multi-step plan — a known limit, preserved in the trace rather than papered over
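The Planner's degradation path in the first note can be sketched as a parse-or-fallback function. The JSON shape assumed here (a list of objects with sub_question and search_query keys), the parse_plan name, and the degraded flag are illustrative assumptions; only the fallback behavior itself, treating the raw output as a single search query, comes from the note above.

```python
from __future__ import annotations

import json
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class PlanStep:
    sub_question: str
    search_query: str


def parse_plan(raw: str, question: str) -> Tuple[List[PlanStep], bool]:
    """Parse the Planner's output; on any malformed input, degrade instead of aborting.

    Returns (steps, degraded). degraded=True means the fallback path was taken,
    which a caller could record in the trace.
    """
    try:
        data = json.loads(raw)
        steps = [
            PlanStep(str(item["sub_question"]), str(item["search_query"]))
            for item in data
        ]
        if not steps:
            raise ValueError("plan contained no steps")
        return steps, False
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        # Fallback: treat the raw model output as one search query.
        return [PlanStep(sub_question=question, search_query=raw.strip())], True


good, good_degraded = parse_plan(
    '[{"sub_question": "What is pgvector?", "search_query": "pgvector overview"}]',
    "What is pgvector?",
)
bad, bad_degraded = parse_plan("pgvector cosine search  ", "What is pgvector?")
```

Catching KeyError and TypeError alongside the decode error covers output that is valid JSON but the wrong shape, which is as common a failure as invalid JSON.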

infrastructure

python · postgres + pgvector · gemini embeddings · typescript · docker

engineering reasoning

The interesting problem was never the model. It was how intelligence behaves under constraints, and specifically which pipeline shape makes an LLM's answer auditable rather than merely plausible. Single-responsibility stages with typed contracts and traces are boring, and boring is what can be debugged.

future work

  • execute the full retrieval plan, not just its first step
  • confidence calibration against held-out questions