Document Copilot
A research firm was burning half its analyst capacity on reading filings before any analysis could start. This is how that bottleneck was reframed as a business problem, tested against the question "should AI even be used?", and answered with a system designed for trust.
Quorum Partners is an independent investment-research firm whose ~40 analysts spend roughly half of every week on document intake: reading SEC filings to find the passages worth analysing. The work is necessary but repetitive, and it scales linearly with coverage, so hiring more analysts never removes the bottleneck.
The proposal is Document Copilot: an internal assistant that answers plain-English questions about a curated corpus of filings, with every claim cited to a source filing and page. Because the firm sells being right, the system is engineered so it never answers beyond its sources. Success is defined concretely: in a one-week pilot, five senior analysts each save at least three hours, which makes the return measurable before any firm-wide rollout.
A business whose product is condensed reading
Quorum Partners sells deep equity research to institutional investors (hedge funds, mutual funds, and pension funds) under annual subscriptions from $50K to $500K+ per client. It does not manage money. Its product is research, and access to the analysts who produce it.
Each analyst covers roughly fifteen public companies in a single industry. Their clients, portfolio managers, don't have the bandwidth to read every 10-K, 10-Q, and earnings filing themselves, so they pay Quorum to do it and to turn thousands of pages into a one-page thesis they can act on. The entire value the firm adds is condensation, and its entire reputation rests on being right.
Who feels the problem
- Analysts: spend half the week on intake before producing the analysis they were actually hired for.
- Research directors: watch coverage capacity capped by reading time, not by analytical talent.
- Partners: carry the franchise risk. A single wrong, poorly sourced call dents the firm's reputation.
Why it matters
The intake grind is the largest single drag on output from the firm's most expensive people. It is boring, necessary, and duplicated: multiple analysts read the same Apple 10-K each autumn, when it is filed. It never appears as a line item, yet it quietly halves the productive capacity of the whole firm.
Should AI even be used here?
Before designing anything, the more important question is whether AI is the right instrument at all. The honest test is to compare it against the alternatives on the dimensions that matter to this specific business.
| Approach | Why it falls short here |
|---|---|
| Hire more analysts | Cost scales linearly with coverage and never removes the duplicated reading. It buys output, not leverage. |
| Keyword search / Ctrl-F | Filings use inconsistent language across companies and years; keyword search finds strings, not meaning, and can't compare across documents. |
| A plain chatbot (LLM only) | A model answering from memory will hallucinate financial facts. In a firm that sells being right, a confident wrong answer is worse than none. |
| Retrieval-augmented AI | Fits: the work is semantic search plus summarisation over a bounded, factual corpus, with citation as a hard requirement. |
Why AI, why now: the task is reading comprehension over natural language at volume, exactly what language models do well, and retrieval techniques are now mature enough to keep answers grounded in source text. The shape of the problem (bounded corpus, semantic questions, verifiable answers) is what makes it a good fit, not the fact that AI is fashionable.
The deciding factor isn't capability; it's trust. AI is only the right answer here if the system can prove every claim against a source. That single constraint shapes the entire design that follows.
An assistant that condenses, but never invents
Document Copilot is an internal tool where any analyst can ask a plain-English question about any filing in Quorum's curated corpus and receive an answer that cites the specific filing and page, with the underlying passage one click away. It is delivered in the browser, with email login, and keeps each analyst's conversation history.
Conceptually, the product rests on three non-negotiable promises:
- Never invent. If the answer isn't in the corpus, the assistant says so rather than guessing.
- Always cite. Every claim links back to the source filing and page.
- Show the evidence. The analyst can read the exact passage and verify in seconds.
The goal is not an assistant that sounds smart. It is an assistant an analyst can stake their own reputation on.
That framing matters for adoption: analysts will only fold a tool into their workflow if they trust it enough to base downstream analysis on its output. Verification in one click is what earns that trust.
Engineering for grounded answers
The architecture is shaped by three constraints from the sections above: answers must be grounded in sources, the firm has no infrastructure team (so the footprint must stay small), and the system must be trustworthy enough to depend on daily.
A question is embedded, matched against filing chunks with hybrid vector and full-text search, then answered by the model using only the retrieved passages, each cited back to its source.
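The merging step in that flow can be sketched as follows. The text does not say how the vector and full-text rankings are combined, so reciprocal rank fusion (RRF) is assumed here as one common option; the function name, the constant `k=60`, and the chunk ids are all illustrative rather than taken from the build.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_n=5):
    """Merge several ranked lists of chunk ids into one ranking.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    so a chunk found by both searches outranks one found by only one.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by score descending; break ties by chunk id for determinism.
    ordered = sorted(scores, key=lambda cid: (-scores[cid], cid))
    return ordered[:top_n]


# Hypothetical results from the two searches (ids are illustrative).
vector_hits = ["aapl-2023-p21", "aapl-2022-p20", "aapl-2023-p35"]
fulltext_hits = ["aapl-2023-p35", "aapl-2023-p21", "aapl-2021-p19"]

fused = reciprocal_rank_fusion([vector_hits, fulltext_hits])
```

Chunks that both searches return rise to the top of `fused`, which is the behaviour hybrid retrieval is meant to provide: semantic matches and exact-term matches reinforce each other instead of competing.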
System components
| Layer | Choice & rationale |
|---|---|
| Backend | Python + FastAPI: fast to build, with a strong ecosystem for AI and data work. |
| Frontend | Vite + React + TypeScript: a responsive single-page app analysts use from the browser. |
| Database | Supabase Postgres holds users, chats, documents, and chunks in one managed store. |
| Retrieval | pgvector for semantic similarity + Postgres full-text search for exact terms: hybrid recall in one database. |
| Migrations | SQLAlchemy models + Alembic for versioned, repeatable schema changes. |
| Auth | Supabase Auth, email only: matches the "log in with your Quorum email" requirement without SSO overhead. |
| Hosting | Railway: a small managed footprint a firm with no infra team can run. |
| LLM + embeddings | OpenAI for embeddings and chat completion. |
Key trade-offs
- Retrieval over fine-tuning. The corpus changes as filings are added; retrieval keeps answers current without retraining and keeps every answer traceable to a document.
- Hybrid search over pure vectors. Vector search captures meaning; full-text catches exact tickers, figures, and defined terms. Together they reduce missed-evidence errors.
- One database over a separate vector store. Keeping chunks and embeddings in Postgres lowers operational burden for a team without dedicated infra.
Trust, security & reliability
Grounding is the core reliability mechanism: the model is given only passages retrieved from real filings and is instructed to answer from that evidence rather than from memory, and to say so when the corpus is silent. Access is gated to Quorum email accounts, and because the source data is public SEC filings, the sensitivity sits in the firm's questions and conversations rather than the documents themselves. The corpus is curated and bounded, which keeps both behaviour and cost predictable.
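One way to picture the grounding step is a prompt builder that refuses before the model is ever called when retrieval returns nothing. This is a minimal sketch under stated assumptions: the `Passage` type, the prompt wording, the `REFUSAL` text, and the `call_model` callable are hypothetical and not taken from the deployed system. Instructions in a prompt reduce, but do not by themselves eliminate, unsupported answers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    filing: str   # illustrative label, e.g. "Example 10-K FY2023"
    page: int
    text: str


REFUSAL = "The corpus does not contain an answer to this question."


def build_grounded_prompt(question, passages):
    """Return a prompt restricted to retrieved passages, or None.

    None signals that retrieval found nothing, so the caller should
    return REFUSAL without calling the model at all.
    """
    if not passages:
        return None
    sources = "\n".join(
        f"[{i}] {p.filing}, p. {p.page}: {p.text}"
        for i, p in enumerate(passages, start=1)
    )
    return (
        "Answer using only the numbered sources below. "
        "Cite each claim as [n]. "
        f"If the sources do not answer the question, reply: {REFUSAL}\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )


def answer(question, passages, call_model):
    """call_model is a hypothetical function: prompt str -> reply str."""
    prompt = build_grounded_prompt(question, passages)
    if prompt is None:
        return REFUSAL
    return call_model(prompt)
```

Numbering the sources in the prompt is what lets each claim in the reply be mapped back to a filing and page for the one-click verification described earlier.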
From design to working product
Document Copilot is built and deployed. The sample corpus contains 10-K filings for Apple, Amazon, Alphabet, Microsoft, and NVIDIA across fiscal years 2021–2025, ingested through a chunking and embedding pipeline and served through the retrieval API described above. Analysts can ask cross-document questions such as how Apple's revenue mix shifted over five years, and receive cited, verifiable answers.
Measuring the return, not assuming it
The value here isn't software. It's senior analyst time redirected from reading to thinking. The success criterion was deliberately concrete and set before building: a pilot of five senior analysts uses the tool for a week and reports saving at least three hours each. If met, it rolls out firm-wide.
Beyond raw hours, the return shows up across several dimensions the firm cares about:
- Reclaimed capacity: three hours a week for each of roughly forty analysts is about 120 analyst-hours a week, coverage added without new headcount.
- Decision quality: cited, verifiable answers reduce the risk of analysis built on a misread passage.
- Risk reduction: a system that refuses to guess lowers the chance of a confidently wrong, reputation-denting call.
- Consistency: the duplicated reading of the same filing across analysts is done once, well.
Crucially, the metric is something the business can observe directly during the pilot, not a vague promise of "productivity."
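The pilot criterion and the firm-wide extrapolation are simple enough to write down. The sketch below assumes "at least three hours each" means every pilot analyst must meet the threshold, and it assumes the firm-wide figure scales linearly from the pilot; the function names and the sample reports are hypothetical, not measured results.

```python
PILOT_SIZE = 5            # senior analysts in the pilot (from the text)
THRESHOLD_HOURS = 3.0     # minimum hours saved per analyst per week
FIRM_ANALYSTS = 40        # approximate firm headcount (from the text)


def pilot_passes(hours_saved):
    """True if every pilot analyst reports at least the threshold."""
    return (
        len(hours_saved) == PILOT_SIZE
        and all(h >= THRESHOLD_HOURS for h in hours_saved)
    )


def firm_hours_per_week(hours_per_analyst=THRESHOLD_HOURS):
    """Extrapolate weekly hours reclaimed if the whole firm matched the pilot."""
    return hours_per_analyst * FIRM_ANALYSTS


# Hypothetical pilot reports, not measured data.
example_reports = [3.5, 4.0, 3.0, 5.0, 3.25]
```

At the threshold itself, `firm_hours_per_week()` gives 120 analyst-hours a week, which is the scale of capacity at stake in the rollout decision.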
What would prove this the wrong solution
Sound judgment means naming the evidence that would kill the idea before committing to it. Document Copilot would be the wrong call if any of these held true in the pilot:
- Analysts save fewer than three hours each, or spend the saved time re-checking the tool because they don't trust it.
- Citation accuracy is low enough that verification costs more time than the reading it replaced.
- The corpus can't be kept current cheaply, so answers drift out of date faster than they're useful.
If those signals appeared, the honest response would be to stop, not to reach for a bigger model. Defining failure up front is what keeps the decision honest.
What the build taught
Technical
Hybrid retrieval matters more than model choice for this kind of problem; getting the right passages in front of the model does more for answer quality than swapping the model itself. Chunking strategy, how filings are split before embedding, has an outsized effect on whether the right evidence is even retrievable.
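To make the chunking point concrete, here is a minimal sketch of one possible strategy: fixed-size, overlapping word windows that never cross a page boundary, so every chunk keeps the page number it must be cited to. The build's actual chunking strategy is not described in the text; the function name and the `max_words` and `overlap` defaults are assumptions for illustration.

```python
def chunk_pages(pages, max_words=200, overlap=40):
    """Split page texts into overlapping word windows.

    pages: list of (page_number, text) tuples for one filing.
    Returns dicts with the page number kept on every chunk, so a
    retrieved chunk can always be cited back to its page.
    """
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be >= 0 and smaller than max_words")
    step = max_words - overlap
    chunks = []
    for page_number, text in pages:
        words = text.split()
        for start in range(0, len(words), step):
            window = words[start:start + max_words]
            chunks.append({"page": page_number, "text": " ".join(window)})
            if start + max_words >= len(words):
                break  # the last window already reached the end of the page
    return chunks
```

The overlap exists so that a sentence straddling two windows still appears whole in at least one of them; the trade-off is more chunks to embed and store.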
Business
Defining "done" as a measurable hours-saved threshold kept the scope honest and focused the build on the one job that justified it. It also turns a subjective tool into something the client can evaluate with a number.
Architecture
Designing around a single constraint, never answer beyond the sources, simplified dozens of downstream decisions. When trust is the product, the architecture should make untrustworthy answers structurally difficult, not merely discouraged.
Limitations & future improvements
- The corpus is currently a small sample; scaling to the full S&P 500 raises ingestion and retrieval-quality questions worth a dedicated study.
- A formal evaluation harness that scores answers for groundedness and citation accuracy would turn "trust" from a principle into a continuously measured metric.
- Cross-document numerical reasoning (comparing figures across filings) is where retrieval-grounded systems are weakest, and it deserves careful guardrails.
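As a rough idea of what the evaluation harness mentioned above might measure, the sketch below scores citation accuracy as the share of citations whose quoted passage actually appears on the cited page. The data shapes, function names, and exact-substring matching rule are assumptions; a real harness would likely need fuzzier matching and a separate groundedness score.

```python
def normalise(text):
    """Lowercase and collapse whitespace so layout differences don't matter."""
    return " ".join(text.lower().split())


def citation_accuracy(citations, corpus):
    """Fraction of citations whose quoted passage appears on the cited page.

    citations: list of dicts with "filing", "page", and "quote" keys.
    corpus: dict mapping (filing, page) to that page's full text.
    Returns None when there are no citations to score.
    """
    if not citations:
        return None
    supported = 0
    for c in citations:
        page_text = corpus.get((c["filing"], c["page"]), "")
        quote = normalise(c["quote"])
        # An empty quote is never counted as supported.
        if quote and quote in normalise(page_text):
            supported += 1
    return supported / len(citations)
```

Tracked over time, a number like this would show whether verification is getting cheaper or more expensive for analysts, which is one of the failure signals named earlier.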
See it for yourself
Open the live app, ask a question about a real SEC filing, and watch it answer with its sources attached.