Expand backend financial statement parsers

2026-03-12 21:15:54 -04:00
parent 33ce48f53c
commit 7a7a78340f
13 changed files with 4398 additions and 456 deletions


@@ -0,0 +1,103 @@
# Operating Statement Parser Spec
## Purpose
This document defines the backend-only parsing rules for operating statement hydration in `fiscal-xbrl-core`.
This pass is intentionally limited to Rust parser behavior. It must not change frontend files, frontend rendering logic, or API response shapes.
## Hydration Order
1. Generic compact surface mapping builds initial `surface_rows`, `detail_rows`, and `unmapped` residuals.
2. Universal income parsing rewrites the income statement into canonical operating-statement rows.
3. Canonical income parsing is authoritative for income provenance and must prune any consumed residual rows from `detail_rows["income"]["unmapped"]`.
## Canonical Precedence Rule
For income rows, canonical universal mappings take precedence over generic residual classification.
If an income concept is consumed by a canonical operating-statement row, it must not remain in `unmapped`.
## Alias Flattening Rule
Multiple source aliases for the same canonical operating-statement concept must flatten into a single canonical surface row.
Examples:
- `us-gaap:OtherOperatingExpense`
- `us-gaap:OtherOperatingExpenses`
- `us-gaap:OtherCostAndExpenseOperating`
These may differ by filer or period, but they still represent one canonical row such as `other_operating_expense`.
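A minimal sketch of alias flattening, assuming a static alias table. The function names `canonical_key_for` and `flatten_aliases` are hypothetical and not taken from `fiscal-xbrl-core`; only the three concept names and the `other_operating_expense` key come from this spec.

```rust
use std::collections::BTreeMap;

/// Hypothetical alias table: maps a source concept to its canonical row key.
fn canonical_key_for(concept: &str) -> Option<&'static str> {
    match concept {
        "us-gaap:OtherOperatingExpense"
        | "us-gaap:OtherOperatingExpenses"
        | "us-gaap:OtherCostAndExpenseOperating" => Some("other_operating_expense"),
        _ => None,
    }
}

/// Groups source concepts under their canonical key, so that any number of
/// aliases yields exactly one entry per canonical row.
fn flatten_aliases<'a>(concepts: &[&'a str]) -> BTreeMap<&'static str, Vec<&'a str>> {
    let mut grouped: BTreeMap<&'static str, Vec<&'a str>> = BTreeMap::new();
    for &concept in concepts {
        if let Some(key) = canonical_key_for(concept) {
            grouped.entry(key).or_insert_with(Vec::new).push(concept);
        }
    }
    grouped
}
```

Concepts without a canonical mapping are simply skipped here; in the real parser they would fall through to generic residual classification.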
## Per-Period Resolution Rule
Direct canonical matching is resolved per period, not by selecting one global winner for all periods.
For each canonical income row:
1. Collect all direct statement-row matches.
2. For each period, keep only candidates with a value in that period.
3. Choose the best candidate for that period using existing ranking rules.
4. Build one canonical row whose `values` and `resolved_source_row_keys` are assembled period-by-period.
The canonical row's provenance is the union of all consumed aliases, even when different aliases win in different periods.
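The four steps above can be sketched roughly as follows. The `Candidate` and `CanonicalRow` types, the numeric `rank` field (standing in for the existing ranking rules, lower is better), and the `consumed_row_keys` field are assumptions for illustration; only `values` and `resolved_source_row_keys` are named in this spec. The sketch also assumes an alias counts as consumed once it wins at least one period; the real parser may additionally treat losing matches as consumed.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical shape of one direct statement-row match.
#[derive(Debug, Clone)]
struct Candidate {
    row_key: String,
    /// Stand-in for the existing ranking rules; lower is better.
    rank: u32,
    /// Period id -> reported value. A missing key means no value in that period.
    values: BTreeMap<String, f64>,
}

#[derive(Debug, Default)]
struct CanonicalRow {
    values: BTreeMap<String, f64>,
    /// Period id -> key of the source row that won that period.
    resolved_source_row_keys: BTreeMap<String, String>,
    /// Union of every alias that won at least one period (assumption).
    consumed_row_keys: BTreeSet<String>,
}

fn resolve_per_period(candidates: &[Candidate], periods: &[&str]) -> CanonicalRow {
    let mut row = CanonicalRow::default();
    for &period in periods {
        // Step 2: keep only candidates with a value in this period.
        // Step 3: pick the best-ranked one (first wins on ties).
        let best = candidates
            .iter()
            .filter(|c| c.values.contains_key(period))
            .min_by_key(|c| c.rank);
        // Step 4: assemble the canonical row period by period.
        if let Some(winner) = best {
            row.values.insert(period.to_string(), winner.values[period]);
            row.resolved_source_row_keys
                .insert(period.to_string(), winner.row_key.clone());
            row.consumed_row_keys.insert(winner.row_key.clone());
        }
    }
    row
}
```

Because the candidate slice is built once and only filtered per period, the work stays proportional to candidates times periods for each canonical definition.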
## Residual Pruning Rule
After canonical income rows are resolved:
- collect all consumed source row keys
- collect all consumed concept keys
- remove any residual income detail row from `unmapped` if either identifier matches
`unmapped` is a strict remainder set after income canonicalization.
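The pruning pass might look like the sketch below. `DetailRow` and `prune_unmapped` are hypothetical names, and the two-field row shape is a simplification of whatever the real detail rows carry. It uses `HashSet` membership so each residual row is checked in expected constant time.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified residual detail row.
#[derive(Debug, Clone, PartialEq)]
struct DetailRow {
    row_key: String,
    concept_key: String,
}

/// Removes every residual row whose row key or concept key was consumed by a
/// canonical income row. Returns the number of rows removed, which a caller
/// could use to adjust normalization counts after pruning.
fn prune_unmapped(
    unmapped: &mut Vec<DetailRow>,
    consumed_row_keys: &HashSet<String>,
    consumed_concept_keys: &HashSet<String>,
) -> usize {
    let before = unmapped.len();
    unmapped.retain(|row| {
        !consumed_row_keys.contains(&row.row_key)
            && !consumed_concept_keys.contains(&row.concept_key)
    });
    before - unmapped.len()
}
```

`Vec::retain` preserves the order of the surviving rows, so unrelated residual rows come out exactly as they went in.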
## Synonym vs Aggregate Child Rule
Two cases must remain distinct:
### Synonym aliases
Different concept names representing the same canonical meaning.
Behavior:
- flatten into one canonical surface row
- do not emit as detail rows
- do not leave in `unmapped`
### Aggregate child components
Rows that are true components of a higher-level canonical row. For example, these are used to derive `selling_general_and_administrative`:
- `SalesAndMarketingExpense`
- `GeneralAndAdministrativeExpense`
Behavior:
- may appear as detail rows under the canonical parent
- must not also remain in `unmapped` once consumed by that canonical parent
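One way to keep the two cases distinct in code is to tag each consumed source with its role and derive placement from that tag. The `SourceRole` and `Placement` types below are hypothetical and only encode the behavior listed above.

```rust
/// Hypothetical classification of a consumed source row.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SourceRole {
    /// Same canonical meaning under a different concept name.
    Synonym,
    /// A true component of a higher-level canonical row.
    AggregateChild,
}

/// Where a consumed source is allowed to surface after canonicalization.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Placement {
    may_emit_as_detail: bool,
    keep_in_unmapped: bool,
}

fn placement_for(role: SourceRole) -> Placement {
    match role {
        // Synonyms flatten into the surface row and appear nowhere else.
        SourceRole::Synonym => Placement {
            may_emit_as_detail: false,
            keep_in_unmapped: false,
        },
        // Children may show as detail under the parent, never in `unmapped`.
        SourceRole::AggregateChild => Placement {
            may_emit_as_detail: true,
            keep_in_unmapped: false,
        },
    }
}
```

Neither role ever keeps a row in `unmapped`; the only difference is whether the row may be shown as detail under its canonical parent.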
## Required Invariants
For income parsing, a source row may appear in only one of these places:
- canonical surface provenance
- canonical detail provenance
- `unmapped`
It must never appear in more than one place at the same time.
Additional invariants:
- canonical surface rows are unique by canonical key
- aliases are flattened into one canonical row
- `resolved_source_row_keys` are period-specific
- normalization counts reflect the post-pruning state
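The exclusivity invariant lends itself to a small test helper. The sketch below is an assumption about how such a check could be written, not existing parser code: `duplicated_sources` is a hypothetical function that takes the source keys found in each of the three places and returns any key seen in more than one.

```rust
use std::collections::HashMap;

/// Returns, in sorted order, every source key that appears in more than one
/// of: surface provenance, detail provenance, `unmapped`.
fn duplicated_sources<'a>(
    surface: &[&'a str],
    detail: &[&'a str],
    unmapped: &[&'a str],
) -> Vec<&'a str> {
    // One bit per place; a key with more than one bit set violates the invariant.
    let mut places: HashMap<&'a str, u8> = HashMap::new();
    for (bit, keys) in [(1u8, surface), (2u8, detail), (4u8, unmapped)] {
        for &key in keys {
            *places.entry(key).or_insert(0) |= bit;
        }
    }
    let mut duplicated: Vec<&'a str> = places
        .into_iter()
        .filter(|(_, mask)| mask.count_ones() > 1)
        .map(|(key, _)| key)
        .collect();
    duplicated.sort_unstable();
    duplicated
}
```

An empty result means the invariant holds for the given rows. A key repeated within a single place is not flagged by this helper; uniqueness by canonical key would need its own check.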
## Performance Constraints
- Use `HashSet` membership for consumed-source pruning.
- Build candidate collections once per canonical definition.
- Avoid UI-side dedupe or post-processing.
- Keep the parser close to linear in candidate volume per definition.
## Test Matrix
The parser tests must cover:
- direct alias dedupe for `other_operating_expense`
- period-sparse alias merge into a single canonical row
- pruning of canonically consumed aliases from `income.unmapped`
- preservation of truly unrelated residual rows
- pruning of formula-consumed component rows from `income.unmapped`
## Learnings For Other Statements
The same backend rules should later be applied to balance sheet and cash flow:
- canonical mapping must outrank residual classification
- alias resolution should be per-period
- consumed sources must be removed from `unmapped`
- synonym aliases and aggregate child components must be treated differently
When balance sheet and cash flow are upgraded, they should adopt these invariants without changing frontend response shapes.