Overview
The product looks like a VC research dashboard: type a company slug, get a verdict, click into the reasoning. The implementation underneath is more particular than that. The goal was not another LLM-on-top-of-search but a system where every numeric score is reproducible, every factual claim is cited back to its source, and the recommendation collapses to insufficient_data when the evidence is too thin to judge.
The pipeline runs offline (Python / LangGraph / Pydantic) and writes its output to a Postgres table on Supabase. The Next.js frontend is a thin read-only renderer that pulls from that table — the demo therefore stays cheap and fast to host, while every analysis is auditable: a single JSON blob per company contains the typed profile, the six specialist verdicts, the synthesis layer, the devil’s advocate output, and the Langfuse trace id that produced all of them.
The pipeline
An analysis is six stages. Each stage is also a Langfuse span, so the whole tree is reproducible inside the observability UI. The flow is sources → typed profile ⇄ gap-filler → six specialists → synthesis → devil’s advocate.
1 · Ingestion (9 adapters)
Each adapter implements the same Protocol — applies_to(slug, data_dir) and load(slug, data_dir) → (Contribution[], IngestionRecord) — and contributes typed leaves into the merge layer. Crunchbase first because it’s the densest, then LinkedIn (Bright Data API), Dealroom, Tavily news search, Owler, Greenhouse / Lever / Workable for hiring, Trustpilot, and a Firecrawl adapter for arbitrary marketing pages.
The adapters never write to the Profile directly: they emit Contribution objects with a field_path (dotted, e.g. funding.rounds[]) and a Cited[T] payload. The merge layer is responsible for picking conflicts apart — adapters stay dumb.
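Concretely, the contract might look like the sketch below. `field_path` and the `Cited` payload come straight from the text; the fields on `IngestionRecord` are assumptions for illustration:

```python
from pathlib import Path
from typing import Protocol
from pydantic import BaseModel

class Contribution(BaseModel):
    """One typed leaf proposed to the merge layer."""
    field_path: str   # dotted path into the Profile, e.g. "funding.rounds[]"
    payload: dict     # the Cited[T] observation (see the next section)

class IngestionRecord(BaseModel):
    """Bookkeeping row for the ingestion log (field names assumed)."""
    source: str
    file_path: str | None = None

class SourceAdapter(Protocol):
    def applies_to(self, slug: str, data_dir: Path) -> bool: ...
    def load(
        self, slug: str, data_dir: Path
    ) -> tuple[list[Contribution], IngestionRecord]: ...
```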
2 · Profile assembly · Cited[T] / Conflict[T]
Every leaf in the Profile Pydantic model is one of three shapes:
Cited[T] — a single observation with value, source, confidence, and an optional note.
Conflict[T] — two or more sources disagree on the same fact. The conflict is preserved with both candidates and both sources, not silently resolved.
None — the fact was never observed. The downstream specialist sees this as a gap, not a zero.
Pydantic 2’s PEP 695 generics (type aliases) give us this without a wrapper hierarchy: the JSON shape of every leaf is self-describing.
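Under stated assumptions about field names (and assuming Python 3.12+ with a recent Pydantic 2), the three shapes might look like this; `SourceRef`'s fields are taken from the provenance section below, and the confidence type is a guess:

```python
from pydantic import BaseModel

class SourceRef(BaseModel):
    url: str
    fetched_via: str
    retrieved_at: str

class Cited[T](BaseModel):
    value: T
    source: SourceRef
    confidence: str          # e.g. "medium"; exact representation assumed
    note: str | None = None

class Conflict[T](BaseModel):
    candidates: list[Cited[T]]   # two or more disagreeing observations, sources kept

# PEP 695 type alias: every Profile leaf is one of the three shapes.
type Leaf[T] = Cited[T] | Conflict[T] | None

class TeamSection(BaseModel):    # illustrative slice of the Profile
    employee_count: Leaf[int] = None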
3 · Gap-filler agent · Firecrawl
Before the specialists run, an LLM inspects the assembled profile and emits a GapFillPlan — a ranked list of URLs to scrape, each tagged with a gap_name (team / competitors / revenue / funding / customers / news / hiring / market / product / tech / other).
Firecrawl fetches each target, the GapFillAdapter surfaces the markdown as sentiment.signals tagged with [gap:<name>], and a small routing map decides which specialist sees which note: team → team, competitors → market + competitive, revenue / funding → financial + traction, and so on. The same enrichment notes also go to the devil’s advocate, unrouted, so the bear case has the full set to mine.
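A sketch of the plan model and routing map, with names inferred from the text (`GapFillTarget` and its exact fields are assumptions):

```python
from typing import Literal
from pydantic import BaseModel

GapName = Literal[
    "team", "competitors", "revenue", "funding", "customers",
    "news", "hiring", "market", "product", "tech", "other",
]

class GapFillTarget(BaseModel):   # hypothetical item shape
    url: str
    gap_name: GapName

class GapFillPlan(BaseModel):
    targets: list[GapFillTarget]  # ranked; capped at 5 Firecrawl calls per company

# Routing map as described above: which specialists see each gap's notes.
GAP_ROUTING: dict[str, list[str]] = {
    "team": ["team"],
    "competitors": ["market", "competitive"],
    "revenue": ["financial", "traction"],
    "funding": ["financial", "traction"],
    # remaining gaps route similarly; the devil's advocate sees everything, unrouted
}
```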
4 · Six specialists (parallel, via LangGraph) · profile to a typed verdict
Each specialist — team, market, product, traction, competitive, financial — receives a sliced view of the profile plus its routed enrichment notes, and returns an AgentResult with score (1–5), confidence, reasoning, evidence citations, and risks.
LangGraph fan-out runs all six in parallel. Each agent is wrapped in @observe_or_noop so the trace tree shows six sibling spans, each containing the underlying chat-completion generation. Specialists never see one another — disagreement is allowed and gets preserved through to synthesis.
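A minimal sketch of the result type and the fan-out, assuming a LangGraph state with an appending reducer; `AnalysisState` and `run_specialist` are hypothetical names, not the repo's:

```python
import operator
from typing import Annotated
from pydantic import BaseModel
from langgraph.graph import StateGraph, START, END

class AgentResult(BaseModel):
    score: int            # 1-5
    confidence: str       # "low" / "medium" / "high"
    reasoning: str
    evidence: list[str]   # citations back into the profile
    risks: list[str]

class AnalysisState(BaseModel):
    profile: dict
    # operator.add merges the six concurrent writes instead of raising a conflict
    results: Annotated[list[tuple[str, AgentResult]], operator.add] = []

SPECIALISTS = ["team", "market", "product", "traction", "competitive", "financial"]

def make_specialist(name: str):
    def node(state: AnalysisState) -> dict:
        result = run_specialist(name, state.profile)  # hypothetical LLM call
        return {"results": [(name, result)]}
    return node

g = StateGraph(AnalysisState)
for name in SPECIALISTS:
    g.add_node(name, make_specialist(name))
    g.add_edge(START, name)   # fan-out: all six run in one parallel superstep
    g.add_edge(name, END)
graph = g.compile()
```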
5 · Weighted synthesis · math, not vibes
The overall score is Σ weight_i × score_i across the six specialists, with weights tuned for a growth fund (traction and team count more than tech). The recommendation banding is a pure switch over that score plus a confidence floor — no LLM picks the verdict.
The LLM is only invoked to draft the bull thesis prose after the score is already decided. Two systems, two responsibilities: deterministic math gets the number, the model gets the words.
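For illustration, the whole synthesis math can be this small. The weights and band thresholds below are made up; only the structure (weighted sum, then a pure switch with a confidence floor) is from the text:

```python
# Illustrative weights (growth-fund tilt: traction and team over tech) and
# illustrative band thresholds -- not the actual tuned values.
WEIGHTS = {
    "team": 0.20, "market": 0.15, "product": 0.10,
    "traction": 0.25, "competitive": 0.10, "financial": 0.20,
}

def synthesize(results: dict[str, "AgentResult"]) -> tuple[float, str]:
    score = sum(WEIGHTS[name] * r.score for name, r in results.items())
    low = sum(1 for r in results.values() if r.confidence == "low")
    if low * 2 >= len(results):        # half or more low-confidence: no verdict
        return score, "insufficient_data"
    if score >= 4.2:
        return score, "high_conviction"
    if score >= 3.5:
        return score, "interested"
    if score >= 2.5:
        return score, "watchlist"
    return score, "pass"
```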
6 · Devil's advocate · steelman the bear
The final stage takes the bull thesis, the six specialist verdicts, and the full set of enrichment notes, and argues the strongest case for the deal failing — heroic assumptions, weakest dimensions, red flags pulled from data conspicuously not in the specialist outputs, and an adjusted recommendation. It can lower the verdict (interested → watchlist) or refuse to judge (insufficient_data) where the bull was speculating.
Without enrichment notes the bear had nothing concrete to mine — every bear case devolved into “there’s no recent funding info.” Wiring all_enrichments(profile) into the prompt sharpened it visibly: the Mews bear now cites specific layoff articles, the Hugging Face bear points at concrete acquihires, etc.
Design philosophy
Six choices the rest of the system is downstream of.
Provenance over confidence words
An LLM can always say “high confidence” — that tells you nothing. Every leaf in the profile carries a SourceRef (URL, fetched-via, retrieved_at). The frontend renders that as a clickable chip so a reviewer can verify any claim in one click. Confidence is still tracked, but it sits next to the source, not in place of it.
Conflict is data, not error
When LinkedIn says 1,500 employees and Crunchbase says 850, the system stores both inside a Conflict[T] and surfaces it. The downstream specialist sees the disagreement, can reason about which to trust, and can flag it as a risk. Silently picking the freshest value would let a stale source win and the reviewer would never know.
`insufficient_data` is a verdict
The five recommendation tiers are insufficient_data · pass · watchlist · interested · high_conviction. The synthesis layer collapses to insufficient_data when half or more of the specialists report low confidence — separating “we have data and it’s negative” from “we don’t have data yet.” The bear can use the same verdict.
Deterministic where possible
The weighted score is a one-line computation. The recommendation banding is a pure switch. The bull thesis is LLM prose drafted after the score is settled. Stochastic systems get the parts where wording matters; everything else stays deterministic so re-running the same profile yields the same number.
Adaptive ingestion
We don’t scrape everything for everyone. Each company’s gap-filler agent reads its own profile, decides which fields are thinnest, and picks five URLs that would fill them. Hugging Face gets different scrapes from Mews. The cost ceiling is the gap-filler’s own constraint: ≤ 5 Firecrawl calls per company.
Auditable end to end
Cache JSON per slug ↔ Supabase row ↔ Frontend render ↔ Langfuse trace. Any cell on the dashboard can be traced back to a specific generation, a specific prompt, a specific source page. The cache JSON is intentionally checked into git for the demo, so anyone can read the exact payload that produced a given verdict.
Langfuse, in depth
The observability backbone — what's wired today, why each piece, and how the seam stays optional.
Langfuse sits behind a single seam at app/observability/langfuse.py. The whole module returns no-ops when LANGFUSE_PUBLIC_KEY or LANGFUSE_SECRET_KEY are missing, so the entire pipeline (and every test) runs clean without credentials. That seam pattern is what lets the integration be deeply wired in yet strictly optional — it has to be both for Langfuse to feel light rather than bolted on.
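A minimal sketch of what that seam might look like, assuming the v3-style `from langfuse import observe`:

```python
import os

def _enabled() -> bool:
    return bool(os.getenv("LANGFUSE_PUBLIC_KEY") and os.getenv("LANGFUSE_SECRET_KEY"))

def observe_or_noop(fn):
    """Wrap fn in a Langfuse span when configured; return it untouched otherwise."""
    if not _enabled():
        return fn
    from langfuse import observe   # deferred, so the import is never required
    return observe()(fn)
```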
What's traced today
Each company analysis produces one trace whose span tree mirrors the pipeline stages — ingestion, profile assembly, gap-filler, the six specialist spans, synthesis, devil’s advocate — all under session_id = <slug>, so a click on “mews” in the Langfuse UI shows the whole pipeline plus any chat-bar follow-ups against the same session.
Generations under each LLM-calling span are populated automatically by langfuse.openai.OpenAI, which we swap in for the standard OpenAI client when Langfuse is enabled. That single line of glue gives us model name, prompt / completion text, usage tokens, latency, and temperature — all per generation, all in the same trace as the surrounding span.
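The swap itself is a conditional import behind the same seam, roughly:

```python
def get_openai_client():
    # Langfuse's drop-in exposes the same surface as the stock client,
    # but records every chat.completions.create() as a generation.
    if _enabled():
        from langfuse.openai import OpenAI
    else:
        from openai import OpenAI
    return OpenAI()
```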
Sessions, tags, and metadata
After run_analysis_graph invokes the LangGraph state, we read back synthesis_result and devils_advocate_result and attach to the trace:
tags = ["bull:watchlist", "bear:pass", "confidence:medium"]
metadata = { overall_score, bull_recommendation, bear_recommendation, bear_red_flag_count, slug }
session_id = the company slug
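Sketched with v3-style SDK calls; the attribute names on the two result objects are assumptions:

```python
from langfuse import get_client

langfuse = get_client()
langfuse.update_current_trace(
    name=f"analysis:{slug}",
    session_id=slug,                                   # groups chat follow-ups too
    tags=[
        f"bull:{synthesis_result.recommendation}",     # attribute names assumed
        f"bear:{devils_advocate_result.adjusted_recommendation}",
        f"confidence:{synthesis_result.confidence}",
    ],
    metadata={
        "overall_score": synthesis_result.overall_score,
        "bear_red_flag_count": len(devils_advocate_result.red_flags),
        "slug": slug,
    },
)
```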
Once those are attached the Langfuse UI becomes browsable: filter by bear:insufficient_data to find every company where the bear refused to commit, or by confidence:low to find the thinnest cases. The trace name (analysis:<slug>) makes the trace list scannable.
Per-trace stats on the dashboard
After the trace finishes, the precompute script calls fetch_trace_stats(trace_id): it flushes Langfuse, retries the public API with a short backoff (the SDK ingests asynchronously), and aggregates the observations into a small dict — generation count, prompt + completion tokens, total cost, wall-clock latency, models used. That dict is stored alongside the analysis in Supabase, so the home page can render a stats banner without making any runtime call to Langfuse.
The retry matters: a fresh trace is typically queryable within 1–3 seconds, but if you query too eagerly you get a 404. Three retries at 2-second intervals turned out to be generous enough for every slug in the dataset.
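In outline it might look like this; `langfuse.api.trace.get` is the call named below, but the response field names in the aggregation are assumptions:

```python
import time

def fetch_trace_stats(trace_id: str) -> dict | None:
    langfuse.flush()                       # push pending events before querying
    for _ in range(3):                     # tuned to the observed 1-3 s lag
        try:
            trace = langfuse.api.trace.get(trace_id)
        except Exception:                  # fresh traces 404 until ingestion lands
            time.sleep(2)
            continue
        gens = [o for o in (trace.observations or []) if o.type == "GENERATION"]
        return {
            "generation_count": len(gens),
            "models": sorted({o.model for o in gens if o.model}),
            "latency": trace.latency,      # field name assumed
        }
    return None
```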
The user-feedback loop
The chat-bar on each analysis page is server-rendered via /api/chat — a Next.js Route Handler. That endpoint wraps the LLM call in a Langfuse trace (chat:<slug> under the same session_id) and returns the trace_id alongside the answer. Two buttons under each answer (👍 / 👎) post to /api/feedback, which calls langfuse.score(name="user_feedback", value=…) against that exact trace.
The score type is NUMERIC (0 or 1) so it averages cleanly across many users.

The 'View trace ↗' link
Inside run_analysis_graph we capture client.get_current_trace_id() immediately after invoking LangGraph, compose the public URL ($LANGFUSE_HOST/trace/$ID), and persist both into the cached payload. The analysis page renders a discreet link in the sticky verdict bar — so an interviewer can click any company and see the full trace tree behind it without leaving the browser tab they started in.
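The capture is two lines plus persistence; `get_current_trace_id` and the URL shape are from the text, the payload keys are hypothetical:

```python
import os

trace_id = client.get_current_trace_id()        # right after LangGraph invocation
trace_url = f"{os.environ['LANGFUSE_HOST']}/trace/{trace_id}"

payload["langfuse_trace_id"] = trace_id         # hypothetical cache keys
payload["langfuse_trace_url"] = trace_url
```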
Langfuse — what's possible beyond the current wiring
The SDK affords more than what's wired today. These are the items that were next on the list.
Prompt Management · version control + A/B for prompts
The eight specialist / synthesis / bear / gap-filler prompts live as .md files in app/agents/prompts/. Moving them to Langfuse Prompt Management would give us version history, diffs in the UI, A/B between prompt variants, and a per-generation link from the trace to the exact prompt version that produced it. The seed step is a one-time script; the runtime call is langfuse.get_prompt("devils_advocate").prompt.
Datasets + Experiments · prompt comparisons over a fixed cohort
The ten demo slugs are a natural golden dataset. Saving them as a Langfuse Dataset and replaying run_analysis_graph over them under different experiment names would make prompt iterations comparable: synthesis-v1 vs synthesis-v2 against the same companies, with aggregate scores per experiment surfacing in the Langfuse UI. No new code on the agent side — the Dataset SDK wraps the call.
LLM-as-judge evaluators · automatic scoring of bear-case grounding
A separate model can be configured to read each devils_advocate output and score it on a custom rubric — e.g. “did the bear case cite at least three enrichment notes?” or “did it propose adjusted = pass without specific adverse evidence?” The score posts back to the same trace, and over time you can see prompt edits move the average up or down. Configured in the Langfuse UI, no app changes needed beyond exposing the rubric.
Public trace sharing · no-login deep links for the demo
Langfuse 4.x exposes trace.public = true so individual traces can be shared without a Langfuse account. For an interview demo, flipping that switch on every cached trace gives the interviewer a click-through path even if they don’t have access to the project.
Cost configuration · model pricing for Ollama Cloud
The stats banner currently shows total_cost_usd = $0.00 because gpt-oss:120b on Ollama Cloud isn’t in Langfuse’s default pricing table. Adding a per-model price entry in the Langfuse UI (or attaching cost_details per generation) would let cost aggregations work out of the box.
Annotation queues · human review at scale
Once the analyst pipeline is producing dozens of verdicts a day, Annotation Queues let a human reviewer batch-score traces (helpful / off / borderline) and feed those scores back into the evaluator loop. The patterns from the chat-bar feedback generalise.
Implementation notes
Smaller decisions worth flagging.
The seam pattern. Every Langfuse call goes through app/observability/langfuse.py, which gates on a single settings.langfuse_enabled flag and returns no-ops below it. Call sites don’t need to know whether Langfuse is configured. The test suite runs with empty env vars; no mocks required.
Prompts ship as package data, not Langfuse-hosted. The eight prompts live in app/agents/prompts/*.md so a fresh clone runs without a Langfuse fetch on cold start, and prompt versions stay pinned to the same git SHA as the code that consumes them. The cost is real: you can’t edit a prompt in the Langfuse UI and replay an analysis without a deploy. Moving prompts to Langfuse Prompt Management is the natural next step once iteration cadence outgrows that constraint.
The frontend Langfuse client is server-only. frontend/lib/langfuse.ts instantiates the SDK from LANGFUSE_* server-side env vars and is imported only from Route Handlers. The secret key never reaches the browser; the chat-bar talks to /api/chat and /api/feedback, which in turn talk to Langfuse.
Async ingestion forces a retry on stats fetch. langfuse.api.trace.get(id) called immediately after a span ends typically 404s — Langfuse ingests asynchronously, with a 1–3 second lag in practice. The stats fetch flushes the client, waits, and retries up to three times before giving up. The retry is tuned to the observed lag rather than the SDK’s default exponential backoff, which was slower than needed for this dataset.
Gap-fill ingestion rows are disambiguated by file_path. GapFillAdapter reuses the firecrawl SourceKind because the underlying scrapes are firecrawl, which leaves the frontend with two indistinguishable rows in the ingestion log. Detection is by file_path containing gap_fill_. Renaming the SourceKind on the backend is the cleaner fix; the frontend filter avoids touching the schema and re-migrating the cached payloads.
Cost on the dashboard reads "free". gpt-oss:120b on Ollama Cloud isn’t in Langfuse’s default model price table, so the per-trace cost aggregates to zero. Configuring a price entry in the Langfuse UI (or passing cost_details per generation) would surface a real number; we left it unconfigured rather than synthesise one.
Open questions
Is confidence calibrated? Every specialist currently reports medium for almost every company. That’s probably a prompt artefact, not an actual property of the data. A targeted eval — does the human reviewer agree with the model when it says “medium confidence in this team score”? — is the next thing I’d want to know.
Is the gap-filler greedy enough? Five URLs per company is a cost cap, not a quality cap. The bear case improved visibly when we wired enrichment notes into it; that suggests the gap-filler could fetch more aggressively, especially for thin profiles.
How do you sanity-check the synthesis weights? They’re hand-tuned for a growth fund. A meta-eval over historic Endeit deals (rank-correlation between model score and outcome) would either justify them or move them — but that needs a labelled set this demo doesn’t have.
Where does this stop being a demo? The ingestion side has at least three real productisation questions — schedule, deduplication, source admission policy — none of which a one-shot demo needs to answer. They’re still worth naming.