Design — The Borrower Case (multi-statement consolidation)
Living design doc. Last meaningful update: 2026-06-16. Owner: Salil. Roadmap context:
roadmap.md(this is roadmap item #1, the Spine). Touches:report.py(contract),statements/models.py(tables),pipeline/{analytics,scoring,verify}.py(consolidation),accounts/models.py(pricing).Status: COMPLETE — Phase 1 + Phase 2 (backend + API + UI, 2026-06-16). Single-account multi-month AND multi-account consolidation with inter-account transfer dedup are live end to end, tested (
server/tests/test_consolidate.py, 13 cases), and live-verified on Postgres (a ₹50k self-transfer across two banks is detected and netted out of income/spend while per-account balances stay separate). Supersedes parked task Multi-statement cases.Phase 2 added: the multi-account guard is removed;
pipeline/consolidate.pynow runs per-account analytics (real per-account balance), the inter-account transfer matcher (§4a: amount tolerance, 7-day window, confidence scoring, internal vs internal_suspected), consolidated flows with internal legs netted out, andbalance_by_account(labeled, never summed). The UI shows the internal-transfers audit trail and a per-account balance table.Account identity & duplicate uploads: statements are grouped into accounts by bank + the last revealed digits of the account number, not the raw masked string — extraction masks the prefix with a varying number of
X/*(XXXX4821vs****4821), so keying on the literal mask would split one account into several and multiply income. Within an account, re-uploads of the same statement (identical period + closing balance — extraction can vary run-to-run, so we don’t require an identical ledger) are dropped from the consolidation and surfaced as aduplicateStatementsreview flag (_account_key/_dedup_members).What shipped (Phase 1):
Borrower/Case/CaseStatement/CaseReporttables (migration20260616_12), theConsolidatedReportcontract (report.py+ TS mirror), the deterministic consolidation engine (pipeline/consolidate.py: de-overlap, coverage gaps, seam-continuity, identity check, consolidated income/FOIR/cash-flow, LLM-narrated Case verdict), and the/v1/borrowers+/v1/cases/*API (statements/cases.py). Consolidation is free (members already paid per-page; pricing is credit packs, no bundle — seepricing.md). Phase 1 ran synchronously / single-account; Phase 2 shipped multi-account, async on the worker (the 422 guard is gone).
1. The job-to-be-done
A lender underwrites a borrower, not a statement. Today our unit is a Job (one file →
one AnalysisReport). A real underwriting pull is 6–12 months across 1–3 accounts/banks.
The Borrower Case is the object that holds those statements and produces one consolidated,
reconciled financial picture — while keeping every per-account number auditable back to its
source statement.
In scope: group statements under a borrower; dedup money moving between the borrower’s own accounts; stitch overlapping/non-contiguous periods; consolidate income / FOIR / cash flow / risk; one Case-level verdict; per-account drill-down.
Out of scope (for v1): Account Aggregator ingestion, the borrower intake link, the policy / auto-decision layer. Those are separate roadmap items that attach to a Case once it exists.
2. Where this sits in the existing architecture
Nothing about per-statement processing changes. Each statement is still a Job that runs the
existing pipeline (INGEST → EXTRACT → VERIFY → ANALYTICS → VERDICT) and produces a Result
holding an AnalysisReport. The Case is a new layer on top:
Borrower
└── Case (an underwriting pull)
├── Job → Result (AnalysisReport) ← unchanged per-statement pipeline
├── Job → Result (AnalysisReport)
└── Job → Result (AnalysisReport)
⇓ CONSOLIDATE (new Case-level stage, runs once all member Jobs are DONE)
ConsolidatedReportThe consolidation engine is pure deterministic code over already-verified ledgers — same
discipline as analytics.py. No LLM touches the merge; the LLM only narrates the Case verdict
(mirrors the per-statement money-math inversion).
3. Data model
New tables in statements/models.py (conventions: String(36) + _uuid PKs, _now
timestamps, tenant_id for the owning org, JSONType for payloads).
class Borrower(Base):
"""A person being underwritten. Owned by a tenant; lightweight by design —
identity is asserted by the lender and cross-checked against statement holders."""
__tablename__ = "borrowers"
id # str pk
tenant_id # FK tenants.id, indexed
display_name # lender's label for them ("Ramesh K — LAP applicant")
external_ref # optional lender's own application/loan id (indexed, for their lookups)
created_by # user id
created_at
class Case(Base):
"""One underwriting pull for a borrower — the unit the platform (and pricing) revolves around."""
__tablename__ = "cases"
id
tenant_id # FK, indexed
borrower_id # FK borrowers.id, indexed
status # draft | processing | needs_review | consolidated | failed
purpose # optional: "personal_loan" | "lap" | "bnpl" | ... (lender's tag)
consolidated_at # when the merge last succeeded
created_by
created_at, updated_at
class CaseStatement(Base):
"""Link row: a Job belongs to a Case. A Job belongs to at most one Case."""
__tablename__ = "case_statements"
__table_args__ = (UniqueConstraint("job_id"),) # a statement lives in one case
id
case_id # FK cases.id, indexed
job_id # FK jobs.id
added_by, added_at
class CaseReport(Base):
"""The consolidated output — one current row per case (re-merge overwrites)."""
__tablename__ = "case_reports"
__table_args__ = (UniqueConstraint("case_id"),)
id
case_id
report # JSONType — ConsolidatedReport, serialized by alias (camelCase)
coverage_months # distinct months covered (for quick listing)
consolidation_confidence # 0-100: how clean the merge was (dedup + coverage + identity)
created_atNotes:
Jobgains no required column — membership lives inCaseStatement, so single-statement flows are completely unaffected (a Job with noCaseStatementrow is just standalone, as today).- A Job may be added to a Case after it already ran — we re-merge, we don’t re-extract.
- Migration name follows the pinned convention:
yyyyMMdd_NN_borrower_case.py.
4. The consolidation engine (the hard part)
Input: the AnalysisReport (and its verified transaction ledger) of every member Job.
Output: a ConsolidatedReport. Four subproblems, hardest first.
4a. Inter-account transfer dedup ← the headline problem
When a borrower moves ₹50,000 from their HDFC account to their ICICI account, it appears as a
debit in HDFC and a credit in ICICI. Naively consolidated, this inflates income and
spend on both sides and corrupts FOIR. We must detect these pairs and mark them internal
so they’re excluded from income, spend, and cash-flow totals (but still shown, labelled, for
auditability).
Matching algorithm (deterministic):
- Candidate generation. For every debit
din account A, find creditscin account B≠A where:|amount(d) − amount(c)| ≤ tol—tol = max(₹1, 0.5%)to absorb fees/rounding (a small transfer fee means the credit is slightly less than the debit).0 ≤ date(c) − date(d) ≤ window—window = 7 days(policy.transfer_window_days, ≈5 business days; UPI/IMPS are same-day, NEFT/RTGS/cheque can lag). Credit cannot precede debit.- Both accounts belong to the same borrower (guaranteed — they’re in one Case).
- Scoring each candidate pair (0–1), so we can pick the best when amounts collide:
- exact amount = 1.0; within tol scaled down.
- same-day +0.2; narration cross-references (UPI VPA, the other bank’s name, “SELF”, “TRANSFER”, matching reference number) +0.3.
- mode consistency (both UPI, or NEFT↔NEFT) small bonus.
- Assignment. Greedy by descending score, one-to-one (a debit matches at most one credit); upgrade to optimal bipartite matching (Hungarian) only if greedy proves lossy on real data.
- Classification by confidence:
≥ 0.85→ auto-markinternal, excluded from aggregates.0.6–0.85→ markinternal_suspected, excluded but raised as a Case review item (“Possible self-transfer ₹X HDFC→ICICI on dd-mm — confirm”).< 0.6→ left as normal income/spend.
Why this is on-brand: every dedup decision is recorded with its score and the two source rows. The merge is auditable — a lender can see exactly which pairs were netted out and why. The merge is auditable end to end.
✅ SME-reviewed (2026-06-18, Round 4, F1): the match window is 7 calendar days (≈5 business days;
policy.transfer_window_days), and low-confidence (0.6–0.85) matches stay excluded from income and flagged — underwriting optimises against income inflation over understatement.
Nasty edge cases to handle (and test as goldens):
- Two ₹50,000 transfers same day → don’t cross-match; pair by reference/narration, else flag both.
- Transfer fee / GST splits the amount → tolerance + “the remainder is a fee” heuristic.
- Round-tripping (A→B→A) → chain detection; net to zero, don’t count as income twice.
- One leg missing (the other account isn’t in the Case) → cannot dedup; must not silently treat as income — flag “inflow looks like a transfer from an account not in this case.”
4b. Overlapping / non-contiguous periods
Member statements may overlap (same account, Jan–Mar and Mar–Jun share March) or have gaps (Jan, then Apr).
- Per-account de-overlap: within one account, dedup transactions that appear in two statements (match on date + amount + balance + narration). Keep one. The running balance is the disambiguator — duplicated rows carry the same balance, genuine same-amount repeats don’t.
- Coverage timeline: compute, per account, the covered date ranges; derive Case-level coverage. Gaps are surfaced loudly, never silently bridged (“No data Feb 1–Mar 15 for ICICS ••4321 — income/FOIR computed on available months only”).
- Consolidated monthly cash flow is built only over fully-covered months; partially-covered months are marked partial and excluded from averages.
4c. Consolidated metrics
Re-run the deterministic analytics over the merged, deduped, internal-excluded ledger:
- Income: cluster salary/recurring credits across accounts, but after removing
internaltransfers. If salary lands in account A every month, that’s the income; a monthly self-top-up into B is not. Average monthly income = over covered months only. - Obligations / FOIR: collect EMIs/rent/SIPs across all accounts (an EMI debited from B
still counts), dedup obligations that appear in overlap windows, then
FOIR = total_monthly_obligations / consolidated_average_monthly_income. This is the number per-account reports get wrong today — an EMI in one account vs salary in another makes single-statement FOIR meaningless. - Balance: report per-account ADB/AMB/min/negative-days (summing balances across accounts is misleading), plus a consolidated “lowest combined liquidity” view.
- Cash flow: merged monthly credits/debits with internal transfers removed.
- Counterparties / channel mix / categories: union across accounts, internal transfers excluded from counterparties.
4d. Consolidated risk score
scoring.py is already a pure function of (AnalyticsBundle, VerificationResult). Extend it to
a Case-level scorer that takes the consolidated bundle plus Case-level signals:
- Reuse the existing knockout/deduct model on consolidated FOIR, income regularity, risk flags (unioned across accounts), negative-balance days (worst account).
- New Case-level factors:
- coverage knockout/deduction — thin coverage (e.g. < 3 months) caps confidence, not the score.
- identity mismatch (see §5) — a hard flag.
- high
internal_suspectedvolume — reduces confidence.
- Output stays deterministic + fully factor-itemized; the LLM narrates it, never decides it.
5. Verification at the Case level (identity + coverage)
Per-statement reconciliation is unchanged (each Job still passes its own balance gate). The Case adds two checks:
- Account-holder consistency: the
account_holderon every member statement should be the same person (fuzzy match — “RAMESH KUMAR” vs “Ramesh K.”). A mismatch is a critical Case review item (“Statement 3 is held by a different name — confirm same borrower”). This is a fraud/mixup guard that single-statement analysis structurally cannot do. - Coverage adequacy: months covered vs the lender’s expectation (default warn < 6 months).
A Case reaches consolidated only when all member Jobs are DONE and Case-level checks pass;
otherwise needs_review with itemized reasons (reusing the existing ReviewItem pattern,
scoped by case).
6. The report contract
New Pydantic models in report.py (camelCase wire format; TS mirror regenerated via
scripts/export_schema.py). Reuse existing sub-models — a consolidated report is mostly the
same shapes computed over a merged ledger, plus per-account breakdown and Case metadata.
class AccountRef(CamelModel):
job_id: str
meta: StatementMeta # existing — identifies the account/bank/period
risk_band: RiskBand # that statement's standalone band (for drill-down)
class InternalTransfer(CamelModel):
from_job_id: str
to_job_id: str
amount: float
debit_date: str
credit_date: str
confidence: float # the dedup score
status: Literal["internal", "internal_suspected"]
class CoverageGap(CamelModel):
account_label: str
from_: str = Field(alias="from")
to: str
class ConsolidatedReport(CamelModel):
borrower_display_name: str
accounts: list[AccountRef] # the statements in this case
period_covered: Period # earliest → latest across all
coverage_gaps: list[CoverageGap]
verdict: Verdict # reuse — Case-level, LLM-narrated
income: IncomeAnalysis # reuse — consolidated
obligations: ObligationAnalysis # reuse — consolidated FOIR
cash_flow: CashFlowAnalysis # reuse — merged monthly
balance_by_account: list[BalanceAnalysis] # per-account, not summed
risk_flags: list[RiskFlag] # unioned
counterparties: CounterpartyAnalysis # internal transfers excluded
internal_transfers: list[InternalTransfer] # the audit trail of the dedup
consolidation_confidence: float # 0-100
review_flags: list[ReviewFlag] # identity / coverage / suspected transfers
generated_at: Optional[str] = None
disclaimer: str7. Pricing hook
Resolved (2026-06-23): pricing is pure-consumption credit packs, not a per-borrower bundle.
A Case consumes the per-page credits of its member Jobs like any other processing — no separate
bundle SKU. (A bundle was considered and rejected: it exposes us to page-count variance since our
cost is LLM-linked, and bulk-bonus credits already give heavy users a better effective rate.) The
ledger/lots model imposes no constraint here. See pricing.md.
8. API surface (additive, versioned /v1)
POST /v1/borrowers create a borrower
POST /v1/cases create a case (borrower_id, purpose?)
POST /v1/cases/{id}/statements add a statement: existing job_id, OR upload inline
(creates the Job, then links it)
POST /v1/cases/{id}/consolidate run/refresh the merge (idempotent; no-op if unchanged)
GET /v1/cases/{id} case + member job statuses + consolidation state
GET /v1/cases/{id}/report the ConsolidatedReport
POST /v1/cases/{id}/review/{item} resolve a Case review item (identity / transfer)Consolidation is triggered explicitly (or auto when the last member Job finishes). Webhooks
gain a case.consolidated / case.needs_review event (reusing the HMAC-signed notifier).
9. Processing flow
- Lender creates a Case, adds statements (new uploads or existing Jobs).
- Each new statement runs the existing pipeline unchanged.
- When all member Jobs are
DONE, the CONSOLIDATE stage runs: load eachResultledger → per-account de-overlap → inter-account dedup → merged analytics → Case verification → Case scoring → LLM verdict narration → writeCaseReport. - If Case checks fail →
needs_reviewwith itemized reasons; lender resolves; re-consolidate. - Adding/removing a statement later re-runs only CONSOLIDATE (cheap; no re-extraction).
10. UI (app.obsrv.in)
BUILT (Phase 1). A Cases section (sidebar peer of Statements, app/(shell)/cases/):
a case list, a “New case” flow (create borrower + case), and a case workspace
(components/console/views/cases.tsx) that adds completed statements from history, runs
Consolidate, and renders the merged report (components/console/consolidated-report.tsx)
reusing the per-statement report primitives — accounts table, consolidated verdict, income/
FOIR/cash-flow, risk flags, counterparties, a coverage-gaps banner, and the internal-transfers
panel (the dedup audit trail, empty until Phase 2). Report primitives are now exported from
components/report-view.tsx for reuse.
11. Rollout — two phases (de-risks the hard part)
- Phase 1 — single-account, multi-month. Same account across months: only §4b (de-overlap)
- §4c (consolidate) + §5 (coverage/identity). No cross-account dedup → ships most of the value with far less risk. This alone beats single-statement FOIR.
- Phase 2 — multi-account. Add §4a inter-account transfer dedup (the genuinely hard, goldens-heavy piece) and per-account balance breakdown.
Pricing bundle can launch with Phase 1.
12. Open questions / risks
- Dedup precision vs recall. False “internal” tags hide real income; missed transfers
inflate it. The
internal_suspectedmiddle band + review items is the safety valve — but the thresholds need tuning on real multi-account data we don’t have yet. Build the goldens first. - Identity matching strictness. Too loose merges two people; too strict nags on every “K.” vs “Kumar.” Needs a tuned fuzzy threshold + an easy lender override.
- Per-account vs consolidated balance — confirm lenders want per-account (we assume yes; summing balances is misleading).
Bundle price— resolved: credit packs, no bundle (seepricing.md). Account identity is keyed on bank + last-4 digits (consolidation already does this).- Same statement in two Cases? Current model forbids it (
UniqueConstraint(job_id)). Revisit if a lender legitimately re-underwrites the same borrower later (probably a new Case with new statements, so fine).
13. Hardening backlog (from code review)
Fixed in review rounds 1–4: empty-union crash, multi-account score using the union balance,
stale-processing recovery (a dead worker no longer strands a case — the sweep/idle-beat
requeues it, and SQS mode republishes),
stale report after membership change, silent failed-reconciliation, statement cap, addable-
statements endpoint, batched member load, near-linear transfer matching, idempotent analyze(),
add-statement race 409, true covered-month count, and moving consolidation to the worker
(below). Still open:
- ✅ Consolidation now runs on the worker.
POST …/consolidateonly validates readiness and flips the case toqueued; the worker (workers/case_runner.py, claimed skip-locked) does the merge + verdict LLM call and lands the case inconsolidated/needs_review/failed(reason incases.error). Dispatch mirrors statements: SQS (case:<id>messages) in serverless mode, the poll loop in app mode. The console polls the case to terminal. This removed the synchronous request, the lock-during-LLM, and the verdict-failure 500 (now afailedcase the user can retry). Gotcha fixed along the way: worker entrypoints must importaccounts.modelsso theCase/Borrower→tenantsFKs resolve at mapper configuration. Cleared in round 5: consolidation LLM spend is now metered (llm_usage.case_id, migration20260617_14,make_case_recorder);case.terminalwebhook fires (HMAC, both modes);DELETE /v1/cases/{id}and/v1/borrowers/{id}exist;display_name/external_ref/purposeare length-validated (422, not 500); the poll loop interleaves jobs and cases (no starvation); the UI cancels its poll on unmount and has a delete affordance;token_usage.pynow has a case-consolidation spend section (and thellm_usage.case_idFK is ON DELETE SET NULL so a case delete preserves the accounting row).
Round 6 (deletes + storage): case/borrower delete is now soft (deleted_at, hidden from all
lists/lookups, worker skips them; a deleted case releases its statement links so they can be
reused). Jobs record size_bytes at upload, and GET /v1/statements/usage returns lifetime
uploaded bytes + currently-retained (vault) bytes + counts — the basis for a future per-tier
size/count cap. Remaining:
- Inline upload to a case (
POST …/statementswith a file, not just ajob_id) — today you add already-completed statements. Next on the borrower-case backlog. - Documents view (UI). Make the statements list document-centric: original + size + download + the generated report, with a storage-usage header. Constraint to design around: raw originals are deleted after analysis for delete-mode tenants (the privacy default), so re-download only works for retain-mode (vault) tenants — the view must show “deleted after analysis” honestly.
- size_bytes backfill for pre-existing jobs (S3
HeadObject), if historical usage matters.