Ingestion explanation
Clone, search indexing, and graph extraction after you connect a repository.
Ingestion is the pipeline from “repository URL registered” to “searchable, queryable context for agents”. It spans the backend, worker, and codesearch services in the monorepo.
High-level stages
1. Registration & workflow kick-off
Creating a repository via the API (or UI) stores metadata and starts an OpenWorkflow run (repository-ingestion in the backend). That run coordinates ref resolution, clone paths, and downstream steps.
2. Clone & Zoekt index
The codesearch service owns the working copy on disk and talks to
zoekt-webserver for indexing. Search-oriented tools (search, list_files,
get_file, symbol helpers) ultimately depend on this index being healthy.
3. Graph extraction (LangGraph)
A separate code ingestion graph analyses the tree with LLM-assisted extractors (services, APIs, clients, libraries, streams, infrastructure, patterns, etc.). The output is normalised into claims stored in your graph, which is backed by an OpenCypher-compatible engine (see Graph databases).
4. Readiness flags
Repositories and checkouts carry flags such as index readiness so the UI and APIs can show whether search/graph features are available for a given ref.
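Taken together, the four stages form a linear pipeline ending in readiness flags. The sketch below is purely illustrative: the function names, the Checkout fields, and the claim shape are hypothetical stand-ins for the real service APIs, which this page deliberately does not pin down.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the four ingestion stages. None of these names
# are the product's actual APIs; they only mirror the stages above.

@dataclass
class Checkout:
    repo_url: str
    ref: str
    search_ready: bool = False            # stage 2: Zoekt index is healthy
    graph_ready: bool = False             # stage 3: claims have been extracted
    claims: list = field(default_factory=list)

def register(repo_url: str, ref: str = "main") -> Checkout:
    """Stage 1: store metadata and kick off the ingestion workflow."""
    return Checkout(repo_url=repo_url, ref=ref)

def clone_and_index(co: Checkout) -> None:
    """Stage 2: codesearch clones the repo and asks Zoekt to index it."""
    co.search_ready = True

def extract_graph(co: Checkout) -> None:
    """Stage 3: LLM-assisted extractors emit normalised claims."""
    co.claims.append({"kind": "service", "name": "example-api"})  # illustrative claim
    co.graph_ready = True

def ingest(repo_url: str, ref: str = "main") -> Checkout:
    co = register(repo_url, ref)
    clone_and_index(co)
    extract_graph(co)
    # Stage 4: the readiness flags now tell the UI/API what is available.
    return co

co = ingest("https://example.com/org/repo.git")
print(co.search_ready, co.graph_ready)  # → True True
```

In the real system these stages run across the backend, worker, and codesearch services rather than in one process; the point here is only the ordering and the flags that fall out at the end.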
Operator-facing notes
- Self-hosted: ensure CODESEARCH_URL points at your codesearch service and that Postgres migrations have run; the codesearch app reads repository rows from the same logical DB model as the backend.
- Failures: clone failures, private-repo auth problems, and indexer outages surface as errors on ingestion; check the backend and codesearch logs.
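For a self-hosted deployment, the wiring above might look like the following env-file fragment. Only CODESEARCH_URL is named on this page; the other variable name, the hostnames, and the ports are illustrative assumptions, not documented defaults.

```shell
# backend/.env (illustrative; only CODESEARCH_URL is named on this page)
CODESEARCH_URL=http://codesearch:8080                # hypothetical host and port
DATABASE_URL=postgres://app:app@postgres:5432/app    # shared logical DB model; variable name assumed
```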
Exact extractor sets and graph primitives evolve with the product; treat this page as the conceptual map, not a frozen schema reference.