Design — The Borrower Case (multi-statement consolidation)

Living design doc. Last meaningful update: 2026-06-16. Owner: Salil. Roadmap context: roadmap.md (this is roadmap item #1, the Spine). Touches: report.py (contract), statements/models.py (tables), pipeline/{analytics,scoring,verify}.py (consolidation), accounts/models.py (pricing).

Status: COMPLETE — Phase 1 + Phase 2 (backend + API + UI, 2026-06-16). Single-account multi-month AND multi-account consolidation with inter-account transfer dedup are live end to end, tested (server/tests/test_consolidate.py, 13 cases), and live-verified on Postgres (a ₹50k self-transfer across two banks is detected and netted out of income/spend while per-account balances stay separate). Supersedes parked task Multi-statement cases.

Phase 2 added: the multi-account guard is removed; pipeline/consolidate.py now runs per-account analytics (real per-account balance), the inter-account transfer matcher (§4a: amount tolerance, 7-day window, confidence scoring, internal vs internal_suspected), consolidated flows with internal legs netted out, and balance_by_account (labeled, never summed). The UI shows the internal-transfers audit trail and a per-account balance table.

Account identity & duplicate uploads: statements are grouped into accounts by bank + the last revealed digits of the account number, not the raw masked string — extraction masks the prefix with a varying number of X/* (XXXX4821 vs ****4821), so keying on the literal mask would split one account into several and multiply income. Within an account, re-uploads of the same statement (identical period + closing balance — extraction can vary run-to-run, so we don’t require an identical ledger) are dropped from the consolidation and surfaced as a duplicateStatements review flag (_account_key / _dedup_members).

What shipped (Phase 1): Borrower/Case/CaseStatement/CaseReport tables (migration 20260616_12), the ConsolidatedReport contract (report.py + TS mirror), the deterministic consolidation engine (pipeline/consolidate.py: de-overlap, coverage gaps, seam-continuity, identity check, consolidated income/FOIR/cash-flow, LLM-narrated Case verdict), and the /v1/borrowers + /v1/cases/* API (statements/cases.py). Consolidation is free (members already paid per-page; pricing is credit packs, no bundle — see pricing.md). Phase 1 ran synchronously / single-account; Phase 2 shipped multi-account, async on the worker (the 422 guard is gone).

1. The job-to-be-done

A lender underwrites a borrower, not a statement. Today our unit is a Job (one file → one AnalysisReport). A real underwriting pull is 6–12 months across 1–3 accounts/banks. The Borrower Case is the object that holds those statements and produces one consolidated, reconciled financial picture — while keeping every per-account number auditable back to its source statement.

In scope: group statements under a borrower; dedup money moving between the borrower’s own accounts; stitch overlapping/non-contiguous periods; consolidate income / FOIR / cash flow / risk; one Case-level verdict; per-account drill-down.

Out of scope (for v1): Account Aggregator ingestion, the borrower intake link, the policy / auto-decision layer. Those are separate roadmap items that attach to a Case once it exists.

2. Where this sits in the existing architecture

Nothing about per-statement processing changes. Each statement is still a Job that runs the existing pipeline (INGEST → EXTRACT → VERIFY → ANALYTICS → VERDICT) and produces a Result holding an AnalysisReport. The Case is a new layer on top:


Borrower
  └── Case (an underwriting pull)
        ├── Job  → Result (AnalysisReport)   ← unchanged per-statement pipeline
        ├── Job  → Result (AnalysisReport)
        └── Job  → Result (AnalysisReport)
        ⇓  CONSOLIDATE  (new Case-level stage, runs once all member Jobs are DONE)
      ConsolidatedReport

The consolidation engine is pure deterministic code over already-verified ledgers — same discipline as analytics.py. No LLM touches the merge; the LLM only narrates the Case verdict (mirrors the per-statement money-math inversion).

3. Data model

New tables in statements/models.py (conventions: String(36) + _uuid PKs, _now timestamps, tenant_id for the owning org, JSONType for payloads).


class Borrower(Base):
    """A person being underwritten. Owned by a tenant; lightweight by design —
    identity is asserted by the lender and cross-checked against statement holders."""
    __tablename__ = "borrowers"
    id            # str pk
    tenant_id     # FK tenants.id, indexed
    display_name  # lender's label for them ("Ramesh K — LAP applicant")
    external_ref  # optional lender's own application/loan id (indexed, for their lookups)
    created_by    # user id
    created_at
 
class Case(Base):
    """One underwriting pull for a borrower — the unit the platform (and pricing) revolves around."""
    __tablename__ = "cases"
    id
    tenant_id          # FK, indexed
    borrower_id        # FK borrowers.id, indexed
    status             # draft | processing | needs_review | consolidated | failed
    purpose            # optional: "personal_loan" | "lap" | "bnpl" | ... (lender's tag)
    consolidated_at    # when the merge last succeeded
    created_by
    created_at, updated_at
 
class CaseStatement(Base):
    """Link row: a Job belongs to a Case. A Job belongs to at most one Case."""
    __tablename__ = "case_statements"
    __table_args__ = (UniqueConstraint("job_id"),)   # a statement lives in one case
    id
    case_id            # FK cases.id, indexed
    job_id             # FK jobs.id
    added_by, added_at
 
class CaseReport(Base):
    """The consolidated output — one current row per case (re-merge overwrites)."""
    __tablename__ = "case_reports"
    __table_args__ = (UniqueConstraint("case_id"),)
    id
    case_id
    report             # JSONType — ConsolidatedReport, serialized by alias (camelCase)
    coverage_months    # distinct months covered (for quick listing)
    consolidation_confidence  # 0-100: how clean the merge was (dedup + coverage + identity)
    created_at

Notes:

Job gains no required column — membership lives in CaseStatement, so single-statement flows are completely unaffected (a Job with no CaseStatement row is just standalone, as today).
A Job may be added to a Case after it already ran — we re-merge, we don’t re-extract.
Migration name follows the pinned convention: yyyyMMdd_NN_borrower_case.py.

4. The consolidation engine (the hard part)

Input: the AnalysisReport (and its verified transaction ledger) of every member Job. Output: a ConsolidatedReport. Four subproblems, hardest first.

4a. Inter-account transfer dedup ← the headline problem

When a borrower moves ₹50,000 from their HDFC account to their ICICI account, it appears as a debit in HDFC and a credit in ICICI. Naively consolidated, this inflates income and spend on both sides and corrupts FOIR. We must detect these pairs and mark them internal so they’re excluded from income, spend, and cash-flow totals (but still shown, labelled, for auditability).

Matching algorithm (deterministic):

Candidate generation. For every debit d in account A, find credits c in account B≠A where:
- |amount(d) − amount(c)| ≤ tol — tol = max(₹1, 0.5%) to absorb fees/rounding (a small transfer fee means the credit is slightly less than the debit).
- 0 ≤ date(c) − date(d) ≤ window — window = 7 days (policy.transfer_window_days, ≈5 business days; UPI/IMPS are same-day, NEFT/RTGS/cheque can lag). Credit cannot precede debit.
- Both accounts belong to the same borrower (guaranteed — they’re in one Case).
Scoring each candidate pair (0–1), so we can pick the best when amounts collide:
- exact amount = 1.0; within tol scaled down.
- same-day +0.2; narration cross-references (UPI VPA, the other bank’s name, “SELF”, “TRANSFER”, matching reference number) +0.3.
- mode consistency (both UPI, or NEFT↔NEFT) small bonus.
Assignment. Greedy by descending score, one-to-one (a debit matches at most one credit); upgrade to optimal bipartite matching (Hungarian) only if greedy proves lossy on real data.
Classification by confidence:
- ≥ 0.85 → auto-mark internal, excluded from aggregates.
- 0.6–0.85 → mark internal_suspected, excluded but raised as a Case review item (“Possible self-transfer ₹X HDFC→ICICI on dd-mm — confirm”).
- < 0.6 → left as normal income/spend.

Why this is on-brand: every dedup decision is recorded with its score and the two source rows. The merge is auditable — a lender can see exactly which pairs were netted out and why. The merge is auditable end to end.

✅ SME-reviewed (2026-06-18, Round 4, F1): the match window is 7 calendar days (≈5 business days; policy.transfer_window_days), and low-confidence (0.6–0.85) matches stay excluded from income and flagged — underwriting optimises against income inflation over understatement.

Nasty edge cases to handle (and test as goldens):

Two ₹50,000 transfers same day → don’t cross-match; pair by reference/narration, else flag both.
Transfer fee / GST splits the amount → tolerance + “the remainder is a fee” heuristic.
Round-tripping (A→B→A) → chain detection; net to zero, don’t count as income twice.
One leg missing (the other account isn’t in the Case) → cannot dedup; must not silently treat as income — flag “inflow looks like a transfer from an account not in this case.”

4b. Overlapping / non-contiguous periods

Member statements may overlap (same account, Jan–Mar and Mar–Jun share March) or have gaps (Jan, then Apr).

Per-account de-overlap: within one account, dedup transactions that appear in two statements (match on date + amount + balance + narration). Keep one. The running balance is the disambiguator — duplicated rows carry the same balance, genuine same-amount repeats don’t.
Coverage timeline: compute, per account, the covered date ranges; derive Case-level coverage. Gaps are surfaced loudly, never silently bridged (“No data Feb 1–Mar 15 for ICICS ••4321 — income/FOIR computed on available months only”).
Consolidated monthly cash flow is built only over fully-covered months; partially-covered months are marked partial and excluded from averages.

4c. Consolidated metrics

Re-run the deterministic analytics over the merged, deduped, internal-excluded ledger:

Income: cluster salary/recurring credits across accounts, but after removing internal transfers. If salary lands in account A every month, that’s the income; a monthly self-top-up into B is not. Average monthly income = over covered months only.
Obligations / FOIR: collect EMIs/rent/SIPs across all accounts (an EMI debited from B still counts), dedup obligations that appear in overlap windows, then FOIR = total_monthly_obligations / consolidated_average_monthly_income. This is the number per-account reports get wrong today — an EMI in one account vs salary in another makes single-statement FOIR meaningless.
Balance: report per-account ADB/AMB/min/negative-days (summing balances across accounts is misleading), plus a consolidated “lowest combined liquidity” view.
Cash flow: merged monthly credits/debits with internal transfers removed.
Counterparties / channel mix / categories: union across accounts, internal transfers excluded from counterparties.

4d. Consolidated risk score

scoring.py is already a pure function of (AnalyticsBundle, VerificationResult). Extend it to a Case-level scorer that takes the consolidated bundle plus Case-level signals:

Reuse the existing knockout/deduct model on consolidated FOIR, income regularity, risk flags (unioned across accounts), negative-balance days (worst account).
New Case-level factors:
- coverage knockout/deduction — thin coverage (e.g. < 3 months) caps confidence, not the score.
- identity mismatch (see §5) — a hard flag.
- high internal_suspected volume — reduces confidence.
Output stays deterministic + fully factor-itemized; the LLM narrates it, never decides it.

5. Verification at the Case level (identity + coverage)

Per-statement reconciliation is unchanged (each Job still passes its own balance gate). The Case adds two checks:

Account-holder consistency: the account_holder on every member statement should be the same person (fuzzy match — “RAMESH KUMAR” vs “Ramesh K.”). A mismatch is a critical Case review item (“Statement 3 is held by a different name — confirm same borrower”). This is a fraud/mixup guard that single-statement analysis structurally cannot do.
Coverage adequacy: months covered vs the lender’s expectation (default warn < 6 months).

A Case reaches consolidated only when all member Jobs are DONE and Case-level checks pass; otherwise needs_review with itemized reasons (reusing the existing ReviewItem pattern, scoped by case).

6. The report contract

New Pydantic models in report.py (camelCase wire format; TS mirror regenerated via scripts/export_schema.py). Reuse existing sub-models — a consolidated report is mostly the same shapes computed over a merged ledger, plus per-account breakdown and Case metadata.


class AccountRef(CamelModel):
    job_id: str
    meta: StatementMeta           # existing — identifies the account/bank/period
    risk_band: RiskBand           # that statement's standalone band (for drill-down)
 
class InternalTransfer(CamelModel):
    from_job_id: str
    to_job_id: str
    amount: float
    debit_date: str
    credit_date: str
    confidence: float             # the dedup score
    status: Literal["internal", "internal_suspected"]
 
class CoverageGap(CamelModel):
    account_label: str
    from_: str = Field(alias="from")
    to: str
 
class ConsolidatedReport(CamelModel):
    borrower_display_name: str
    accounts: list[AccountRef]            # the statements in this case
    period_covered: Period                # earliest → latest across all
    coverage_gaps: list[CoverageGap]
    verdict: Verdict                      # reuse — Case-level, LLM-narrated
    income: IncomeAnalysis                # reuse — consolidated
    obligations: ObligationAnalysis       # reuse — consolidated FOIR
    cash_flow: CashFlowAnalysis           # reuse — merged monthly
    balance_by_account: list[BalanceAnalysis]   # per-account, not summed
    risk_flags: list[RiskFlag]            # unioned
    counterparties: CounterpartyAnalysis  # internal transfers excluded
    internal_transfers: list[InternalTransfer]  # the audit trail of the dedup
    consolidation_confidence: float       # 0-100
    review_flags: list[ReviewFlag]        # identity / coverage / suspected transfers
    generated_at: Optional[str] = None
    disclaimer: str

7. Pricing hook

Resolved (2026-06-23): pricing is pure-consumption credit packs, not a per-borrower bundle. A Case consumes the per-page credits of its member Jobs like any other processing — no separate bundle SKU. (A bundle was considered and rejected: it exposes us to page-count variance since our cost is LLM-linked, and bulk-bonus credits already give heavy users a better effective rate.) The ledger/lots model imposes no constraint here. See pricing.md.

8. API surface (additive, versioned `/v1`)


POST   /v1/borrowers                     create a borrower
POST   /v1/cases                         create a case (borrower_id, purpose?)
POST   /v1/cases/{id}/statements         add a statement: existing job_id, OR upload inline
                                          (creates the Job, then links it)
POST   /v1/cases/{id}/consolidate        run/refresh the merge (idempotent; no-op if unchanged)
GET    /v1/cases/{id}                     case + member job statuses + consolidation state
GET    /v1/cases/{id}/report             the ConsolidatedReport
POST   /v1/cases/{id}/review/{item}      resolve a Case review item (identity / transfer)

Consolidation is triggered explicitly (or auto when the last member Job finishes). Webhooks gain a case.consolidated / case.needs_review event (reusing the HMAC-signed notifier).

9. Processing flow

Lender creates a Case, adds statements (new uploads or existing Jobs).
Each new statement runs the existing pipeline unchanged.
When all member Jobs are DONE, the CONSOLIDATE stage runs: load each Result ledger → per-account de-overlap → inter-account dedup → merged analytics → Case verification → Case scoring → LLM verdict narration → write CaseReport.
If Case checks fail → needs_review with itemized reasons; lender resolves; re-consolidate.
Adding/removing a statement later re-runs only CONSOLIDATE (cheap; no re-extraction).

10. UI (app.obsrv.in)

BUILT (Phase 1). A Cases section (sidebar peer of Statements, app/(shell)/cases/): a case list, a “New case” flow (create borrower + case), and a case workspace (components/console/views/cases.tsx) that adds completed statements from history, runs Consolidate, and renders the merged report (components/console/consolidated-report.tsx) reusing the per-statement report primitives — accounts table, consolidated verdict, income/ FOIR/cash-flow, risk flags, counterparties, a coverage-gaps banner, and the internal-transfers panel (the dedup audit trail, empty until Phase 2). Report primitives are now exported from components/report-view.tsx for reuse.

11. Rollout — two phases (de-risks the hard part)

Phase 1 — single-account, multi-month. Same account across months: only §4b (de-overlap)
- §4c (consolidate) + §5 (coverage/identity). No cross-account dedup → ships most of the value with far less risk. This alone beats single-statement FOIR.
Phase 2 — multi-account. Add §4a inter-account transfer dedup (the genuinely hard, goldens-heavy piece) and per-account balance breakdown.

Pricing bundle can launch with Phase 1.

12. Open questions / risks

Dedup precision vs recall. False “internal” tags hide real income; missed transfers inflate it. The internal_suspected middle band + review items is the safety valve — but the thresholds need tuning on real multi-account data we don’t have yet. Build the goldens first.
Identity matching strictness. Too loose merges two people; too strict nags on every “K.” vs “Kumar.” Needs a tuned fuzzy threshold + an easy lender override.
Per-account vs consolidated balance — confirm lenders want per-account (we assume yes; summing balances is misleading).
~~Bundle price~~ — resolved: credit packs, no bundle (see pricing.md). Account identity is keyed on bank + last-4 digits (consolidation already does this).
Same statement in two Cases? Current model forbids it (UniqueConstraint(job_id)). Revisit if a lender legitimately re-underwrites the same borrower later (probably a new Case with new statements, so fine).

13. Hardening backlog (from code review)

Fixed in review rounds 1–4: empty-union crash, multi-account score using the union balance, stale-processing recovery (a dead worker no longer strands a case — the sweep/idle-beat requeues it, and SQS mode republishes), stale report after membership change, silent failed-reconciliation, statement cap, addable- statements endpoint, batched member load, near-linear transfer matching, idempotent analyze(), add-statement race 409, true covered-month count, and moving consolidation to the worker (below). Still open:

✅ Consolidation now runs on the worker. POST …/consolidate only validates readiness and flips the case to queued; the worker (workers/case_runner.py, claimed skip-locked) does the merge + verdict LLM call and lands the case in consolidated / needs_review / failed (reason in cases.error). Dispatch mirrors statements: SQS (case:<id> messages) in serverless mode, the poll loop in app mode. The console polls the case to terminal. This removed the synchronous request, the lock-during-LLM, and the verdict-failure 500 (now a failed case the user can retry). Gotcha fixed along the way: worker entrypoints must import accounts.models so the Case/Borrower → tenants FKs resolve at mapper configuration. Cleared in round 5: consolidation LLM spend is now metered (llm_usage.case_id, migration 20260617_14, make_case_recorder); case.terminal webhook fires (HMAC, both modes); DELETE /v1/cases/{id} and /v1/borrowers/{id} exist; display_name/external_ref/purpose are length-validated (422, not 500); the poll loop interleaves jobs and cases (no starvation); the UI cancels its poll on unmount and has a delete affordance; token_usage.py now has a case-consolidation spend section (and the llm_usage.case_id FK is ON DELETE SET NULL so a case delete preserves the accounting row).

Round 6 (deletes + storage): case/borrower delete is now soft (deleted_at, hidden from all lists/lookups, worker skips them; a deleted case releases its statement links so they can be reused). Jobs record size_bytes at upload, and GET /v1/statements/usage returns lifetime uploaded bytes + currently-retained (vault) bytes + counts — the basis for a future per-tier size/count cap. Remaining:

Inline upload to a case (POST …/statements with a file, not just a job_id) — today you add already-completed statements. Next on the borrower-case backlog.
Documents view (UI). Make the statements list document-centric: original + size + download + the generated report, with a storage-usage header. Constraint to design around: raw originals are deleted after analysis for delete-mode tenants (the privacy default), so re-download only works for retain-mode (vault) tenants — the view must show “deleted after analysis” honestly.
size_bytes backfill for pre-existing jobs (S3 HeadObject), if historical usage matters.