Expand backend financial statement parsers

2026-03-12 21:15:54 -04:00
parent 33ce48f53c
commit 7a7a78340f
13 changed files with 4398 additions and 456 deletions
--- a/rust/fiscal-xbrl-core/OPERATING_STATEMENT_PARSER_SPEC.md
+++ b/rust/fiscal-xbrl-core/OPERATING_STATEMENT_PARSER_SPEC.md
@@ -0,0 +1,103 @@
+# Operating Statement Parser Spec
+
+## Purpose
+This document defines the backend-only parsing rules for operating statement hydration in `fiscal-xbrl-core`.
+
+This pass is intentionally limited to Rust parser behavior. It must not change frontend files, frontend rendering logic, or API response shapes.
+
+## Hydration Order
+1. Generic compact surface mapping builds initial `surface_rows`, `detail_rows`, and `unmapped` residuals.
+2. Universal income parsing rewrites the income statement into canonical operating-statement rows.
+3. Canonical income parsing is authoritative for income provenance and must prune any consumed residual rows from `detail_rows["income"]["unmapped"]`.
+
+## Canonical Precedence Rule
+For income rows, canonical universal mappings take precedence over generic residual classification.
+
+If an income concept is consumed by a canonical operating-statement row, it must not remain in `unmapped`.
+
+## Alias Flattening Rule
+Multiple source aliases for the same canonical operating-statement concept must flatten into a single canonical surface row.
+
+Examples:
+- `us-gaap:OtherOperatingExpense`
+- `us-gaap:OtherOperatingExpenses`
+- `us-gaap:OtherCostAndExpenseOperating`
+
+These may differ by filer or period, but they still represent one canonical row such as `other_operating_expense`.
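
The flattening rule above can be sketched as a lookup from source concept to canonical key, followed by a grouping step. This is a minimal illustration, not the crate's code: `canonical_key` and `flatten_aliases` are hypothetical names, and the alias table contains only the three examples listed in the spec.

```rust
use std::collections::HashMap;

/// Hypothetical alias table: maps a source concept to its canonical row key.
/// Only the aliases named in the spec are listed here.
fn canonical_key(concept: &str) -> Option<&'static str> {
    match concept {
        "us-gaap:OtherOperatingExpense"
        | "us-gaap:OtherOperatingExpenses"
        | "us-gaap:OtherCostAndExpenseOperating" => Some("other_operating_expense"),
        _ => None,
    }
}

/// Groups source concepts by canonical key, so every key yields exactly one
/// surface row no matter how many aliases a filer used.
fn flatten_aliases<'a>(concepts: &[&'a str]) -> HashMap<&'static str, Vec<&'a str>> {
    let mut grouped: HashMap<&'static str, Vec<&'a str>> = HashMap::new();
    for &concept in concepts {
        if let Some(key) = canonical_key(concept) {
            grouped.entry(key).or_default().push(concept);
        }
    }
    grouped
}
```

Concepts with no canonical key are simply skipped in this sketch; in the spec they would stay with the generic residual classification.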
+
+## Per-Period Resolution Rule
+Direct canonical matching is resolved per period, not by selecting one global winner for all periods.
+
+For each canonical income row:
+1. Collect all direct statement-row matches.
+2. For each period, keep only candidates with a value in that period.
+3. Choose the best candidate for that period using existing ranking rules.
+4. Build one canonical row whose `values` and `resolved_source_row_keys` are assembled period-by-period.
+
+The canonical row's provenance is the union of all consumed aliases, even if a different alias wins in different periods.
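
A rough sketch of the per-period steps above, under stated assumptions: `Candidate`, `CanonicalRow`, and `resolve_per_period` are hypothetical names, a numeric `rank` (lower wins) stands in for the existing ranking rules, and every direct match is treated as consumed for provenance purposes.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical direct match for one canonical row (not the crate's real type).
struct Candidate {
    row_key: String,
    rank: u32,                     // assumption: lower rank is better
    values: BTreeMap<String, f64>, // period -> reported value
}

/// Hypothetical canonical row assembled period by period.
#[derive(Debug, Default)]
struct CanonicalRow {
    values: BTreeMap<String, f64>,
    resolved_source_row_keys: BTreeMap<String, String>, // period -> winning row key
    consumed_row_keys: BTreeSet<String>,                // union of consumed aliases
}

fn resolve_per_period(candidates: &[Candidate], periods: &[&str]) -> CanonicalRow {
    let mut row = CanonicalRow::default();
    // Provenance is the union of all aliases, regardless of which one wins where.
    for c in candidates {
        row.consumed_row_keys.insert(c.row_key.clone());
    }
    for &period in periods {
        // Keep only candidates with a value in this period, then pick the best.
        let best = candidates
            .iter()
            .filter(|c| c.values.contains_key(period))
            .min_by_key(|c| c.rank);
        if let Some(c) = best {
            row.values.insert(period.to_string(), c.values[period]);
            row.resolved_source_row_keys
                .insert(period.to_string(), c.row_key.clone());
        }
    }
    row
}
```

With one alias reporting only in an earlier period and another reporting in both, the better-ranked alias wins where it has a value and the other fills the remaining period, which is the period-sparse merge case from the test matrix.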
+
+## Residual Pruning Rule
+After canonical income rows are resolved:
+- collect all consumed source row keys
+- collect all consumed concept keys
+- remove any residual income detail row from `unmapped` if either identifier matches
+
+`unmapped` is a strict remainder set after income canonicalization.
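
The pruning step might look like the following sketch. `DetailRow` and `prune_unmapped` are hypothetical stand-ins for whatever types the crate actually uses; the only behavior taken from the spec is that a residual row is dropped when either its row key or its concept key was consumed.

```rust
use std::collections::HashSet;

/// Hypothetical residual detail row; the real struct likely carries more fields.
#[derive(Debug, Clone, PartialEq)]
struct DetailRow {
    row_key: String,
    concept_key: String,
}

/// Returns the rows of `unmapped` whose row key and concept key were both
/// left unconsumed by canonical income parsing.
fn prune_unmapped(
    unmapped: Vec<DetailRow>,
    consumed_row_keys: &HashSet<String>,
    consumed_concept_keys: &HashSet<String>,
) -> Vec<DetailRow> {
    unmapped
        .into_iter()
        .filter(|row| {
            !consumed_row_keys.contains(&row.row_key)
                && !consumed_concept_keys.contains(&row.concept_key)
        })
        .collect()
}
```

Each membership check is a hash lookup, so the pass stays linear in the number of residual rows.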
+
+## Synonym vs Aggregate Child Rule
+Two cases must remain distinct:
+
+### Synonym aliases
+Different concept names representing the same canonical meaning.
+
+Behavior:
+- flatten into one canonical surface row
+- do not emit as detail rows
+- do not leave in `unmapped`
+
+### Aggregate child components
+Rows that are true components of a higher-level canonical row. For example,
+`selling_general_and_administrative` is derived from:
+- `SalesAndMarketingExpense`
+- `GeneralAndAdministrativeExpense`
+
+Behavior:
+- may appear as detail rows under the canonical parent
+- must not also remain in `unmapped` once consumed by that canonical parent
+
+## Required Invariants
+For income parsing, each income source row must appear in exactly one of these places:
+- canonical surface provenance
+- canonical detail provenance
+- `unmapped`
+
+It must never appear in more than one place at the same time.
+
+Additional invariants:
+- canonical surface rows are unique by canonical key
+- aliases are flattened into one canonical row
+- `resolved_source_row_keys` are period-specific
+- normalization counts reflect the post-pruning state
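
One way a test could check the exclusivity invariant is to look for source keys that show up in more than one of the three places. This is only a sketch: `find_duplicated_sources` is a hypothetical helper, and it assumes each place can be reduced to a flat list of source row keys.

```rust
use std::collections::HashSet;

/// Returns every source key that appears in more than one of the three places
/// (surface provenance, detail provenance, `unmapped`), sorted and deduplicated.
/// An empty result means the exclusivity invariant holds.
fn find_duplicated_sources<'a>(
    surface: &'a [String],
    detail: &'a [String],
    unmapped: &'a [String],
) -> Vec<&'a str> {
    let mut seen: HashSet<&'a str> = HashSet::new();
    let mut duplicated: Vec<&'a str> = Vec::new();
    for bucket in [surface, detail, unmapped] {
        // Deduplicate within a bucket so that only cross-bucket repeats count.
        let unique: HashSet<&'a str> = bucket.iter().map(String::as_str).collect();
        for key in unique {
            if !seen.insert(key) {
                duplicated.push(key);
            }
        }
    }
    duplicated.sort_unstable();
    duplicated.dedup();
    duplicated
}
```

A check like this fits naturally at the end of each test matrix case, after pruning has run.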
+
+## Performance Constraints
+- Use `HashSet` membership for consumed-source pruning.
+- Build candidate collections once per canonical definition.
+- Avoid UI-side dedupe or post-processing.
+- Keep the parser close to linear in candidate volume per definition.
+
+## Test Matrix
+The parser tests must cover:
+- direct alias dedupe for `other_operating_expense`
+- period-sparse alias merge into a single canonical row
+- pruning of canonically consumed aliases from `income.unmapped`
+- preservation of truly unrelated residual rows
+- pruning of formula-consumed component rows from `income.unmapped`
+
+## Learnings For Other Statements
+The same backend rules should later be applied to balance sheet and cash flow:
+- canonical mapping must outrank residual classification
+- alias resolution should be per-period
+- consumed sources must be removed from `unmapped`
+- synonym aliases and aggregate child components must be treated differently
+
+When balance sheet and cash flow are upgraded, they should adopt these invariants without changing frontend response shapes.