Expand backend financial statement parsers

2026-03-12 21:15:54 -04:00
parent 33ce48f53c
commit 7a7a78340f
13 changed files with 4398 additions and 456 deletions


@@ -0,0 +1,144 @@
# Balance Sheet Parser Spec
## Purpose
This document defines the backend-only balance-sheet parsing rules for `fiscal-xbrl-core`.
This pass is limited to Rust parser behavior and taxonomy packs. It must not modify frontend files, frontend rendering logic, or frontend response shapes.
## Hydration Order
1. Load the selected surface pack.
2. For non-core packs, merge in any core balance-sheet surfaces that the selected pack does not override.
3. Resolve direct canonical balance rows from statement rows.
4. Resolve aggregate-child rows from detail components when direct canonical rows are absent.
5. Resolve formula-backed balance rows from already-resolved canonical rows.
6. Emit `unmapped` only for rows not consumed by canonical balance parsing.
## Category Taxonomy
Balance rows use these backend category keys:
- `current_assets`
- `noncurrent_assets`
- `current_liabilities`
- `noncurrent_liabilities`
- `equity`
- `derived`
- `sector_specific`
Default rule:
- use economic placement first
- reserve `sector_specific` for rows that cannot be expressed economically
## Canonical Precedence Rule
Canonical balance mappings take precedence over residual classification.
If a statement row is consumed by a canonical balance row, it must not remain in `detail_rows["balance"]["unmapped"]`.
## Alias Flattening Rule
Synonymous balance concepts flatten into one canonical surface row.
Example:
- `AccountsReceivableNetCurrent`
- `ReceivablesNetCurrent`
These must become one `accounts_receivable` row with period-aware provenance.
## Per-Period Resolution Rule
Direct balance matching is resolved per period, not by choosing one row globally.
For each canonical balance row:
1. Collect all direct candidates.
2. For each period, choose the best candidate with a value in that period.
3. Build one canonical row from those period-specific winners.
4. Preserve the union of all consumed aliases in `source_concepts`, `source_row_keys`, and `source_fact_ids`.
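The four steps above can be sketched as follows. The shapes here (`Candidate`, a numeric `rank`, string period keys) are simplified stand-ins for the real parser types, and "best candidate" is reduced to lowest rank; the actual ranking rules live elsewhere in the parser:

```rust
use std::collections::{BTreeMap, BTreeSet};

// Hypothetical simplified candidate shape for this sketch.
#[derive(Clone)]
struct Candidate {
    row_key: String,
    concept: String,
    rank: u32,                     // lower is better, standing in for the real ranking rules
    values: BTreeMap<String, f64>, // period -> value
}

#[derive(Default)]
struct CanonicalRow {
    values: BTreeMap<String, f64>,
    source_concepts: BTreeSet<String>,
    source_row_keys: BTreeSet<String>,
}

// Pick a winner per period, build one canonical row, and keep the union of
// the provenance of every consumed alias.
fn resolve_per_period(periods: &[&str], candidates: &[Candidate]) -> CanonicalRow {
    let mut row = CanonicalRow::default();
    for period in periods {
        let winner = candidates
            .iter()
            .filter(|candidate| candidate.values.contains_key(*period))
            .min_by_key(|candidate| candidate.rank);
        if let Some(winner) = winner {
            row.values.insert(period.to_string(), winner.values[*period]);
            row.source_concepts.insert(winner.concept.clone());
            row.source_row_keys.insert(winner.row_key.clone());
        }
    }
    row
}

fn main() {
    // Period-sparse aliases: each concept carries a value for only one period.
    let ar_net = Candidate {
        row_key: "row-12".into(),
        concept: "AccountsReceivableNetCurrent".into(),
        rank: 0,
        values: BTreeMap::from([("FY2024".to_string(), 410.0)]),
    };
    let receivables = Candidate {
        row_key: "row-31".into(),
        concept: "ReceivablesNetCurrent".into(),
        rank: 1,
        values: BTreeMap::from([("FY2023".to_string(), 388.0)]),
    };
    let row = resolve_per_period(&["FY2023", "FY2024"], &[ar_net, receivables]);
    assert_eq!(row.values.len(), 2);          // one row covering both periods
    assert_eq!(row.source_concepts.len(), 2); // provenance is the union of consumed aliases
}
```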
## Formula Evaluation Rule
Structured formulas are evaluated only after their source surface rows have been resolved.
Supported operators:
- `sum`
- `subtract`
Formula rules:
- formulas operate period by period
- `sum` may treat nulls as zero when `treat_null_as_zero` is true
- `subtract` requires exactly two sources
- formula rows inherit provenance from the source surface rows they consume
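A period-by-period evaluator for the two operators might look like the sketch below. The `Formula` enum is a hypothetical shape for illustration; the real definitions live in the taxonomy packs, and provenance inheritance is omitted here:

```rust
use std::collections::BTreeMap;

type Values = BTreeMap<String, f64>; // period -> value

// Hypothetical formula shape for this sketch.
enum Formula {
    Sum { sources: Vec<String>, treat_null_as_zero: bool },
    Subtract { left: String, right: String }, // exactly two sources
}

// Formulas run only after their source rows are resolved, period by period.
fn evaluate(formula: &Formula, resolved: &BTreeMap<String, Values>, periods: &[&str]) -> Values {
    let mut out = Values::new();
    for period in periods {
        let result = match formula {
            Formula::Sum { sources, treat_null_as_zero } => {
                let mut total = 0.0;
                let mut seen = 0usize;
                for source in sources {
                    if let Some(value) = resolved.get(source).and_then(|v| v.get(*period)) {
                        total += value;
                        seen += 1;
                    }
                }
                // With treat_null_as_zero, missing sources count as zero;
                // otherwise every source must have a value for this period.
                if seen == sources.len() || (*treat_null_as_zero && seen > 0) {
                    Some(total)
                } else {
                    None
                }
            }
            Formula::Subtract { left, right } => match (
                resolved.get(left).and_then(|v| v.get(*period)),
                resolved.get(right).and_then(|v| v.get(*period)),
            ) {
                (Some(l), Some(r)) => Some(l - r),
                _ => None,
            },
        };
        if let Some(value) = result {
            out.insert(period.to_string(), value);
        }
    }
    out
}

fn main() {
    let mut resolved = BTreeMap::new();
    resolved.insert("cash".to_string(), Values::from([("FY2024".to_string(), 100.0)]));
    resolved.insert("short_term_investments".to_string(), Values::from([("FY2024".to_string(), 40.0)]));
    let total = Formula::Sum {
        sources: vec!["cash".into(), "short_term_investments".into()],
        treat_null_as_zero: false,
    };
    assert_eq!(evaluate(&total, &resolved, &["FY2024"])["FY2024"], 140.0);
}
```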
## Residual Pruning Rule
`balance.unmapped` is a strict remainder set.
A balance statement row must be excluded from `unmapped` when either of these is true:
- its row key was consumed by a canonical balance row
- its concept key was consumed by a canonical balance row
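The strict-remainder rule reduces to a membership filter. This sketch assumes simplified residual-row and consumed-set shapes; `HashSet` lookups keep the pass linear in the number of residual rows:

```rust
use std::collections::HashSet;

// Simplified residual row for this sketch.
struct ResidualRow {
    row_key: String,
    concept: String,
}

// A residual row is dropped when its row key OR its concept key was consumed
// by a canonical balance row; everything else survives as `unmapped`.
fn prune_unmapped(
    unmapped: Vec<ResidualRow>,
    consumed_row_keys: &HashSet<String>,
    consumed_concepts: &HashSet<String>,
) -> Vec<ResidualRow> {
    unmapped
        .into_iter()
        .filter(|row| {
            !consumed_row_keys.contains(&row.row_key)
                && !consumed_concepts.contains(&row.concept)
        })
        .collect()
}

fn main() {
    let consumed_rows: HashSet<String> = ["row-7".to_string()].into_iter().collect();
    let consumed_concepts: HashSet<String> = HashSet::new();
    let residuals = vec![
        ResidualRow { row_key: "row-7".into(), concept: "ReceivablesNetCurrent".into() },
        ResidualRow { row_key: "row-9".into(), concept: "OtherAssetsMiscellaneous".into() },
    ];
    let remaining = prune_unmapped(residuals, &consumed_rows, &consumed_concepts);
    assert_eq!(remaining.len(), 1); // only the truly unrelated row remains
}
```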
## Helper Surface Rule
Some balance rows are parser helpers rather than user-facing canonical output.
Current helper rows:
- `deferred_revenue_current`
- `deferred_revenue_noncurrent`
- `current_liabilities`
- `leases`
Behavior:
- they remain available to formulas
- they do not appear in emitted `surface_rows`
- they do not create emitted detail buckets
- they still consume matched backend sources so those rows do not leak into `unmapped`
## Synonym vs Aggregate Child Rule
Two cases must remain distinct.
### Synonym aliases
Different concept names for the same canonical balance meaning.
Behavior:
- flatten into one canonical surface row
- do not emit duplicate detail rows
- do not remain in `unmapped`
### Aggregate child components
Rows that legitimately roll into a subtotal or total.
Behavior:
- may remain as detail rows beneath the canonical parent when grouping is enabled
- must not remain in `unmapped` after being consumed
## Sector Placement Decisions
Sector rows stay inside the same economic taxonomy.
Mappings in this pass:
- `loans` -> `noncurrent_assets`
- `allowance_for_credit_losses` -> `noncurrent_assets`
- `deposits` -> `current_liabilities`
- `policy_liabilities` -> `noncurrent_liabilities`
- `deferred_acquisition_costs` -> `noncurrent_assets`
- `investment_property` -> `noncurrent_assets`
`sector_specific` remains unused by default in this pass.
## Required Invariants
- A consumed balance source must never remain in `balance.unmapped`.
- A synonym alias must never create more than one canonical balance row.
- Hidden helper surfaces may consume sources but must not appear in emitted `surface_rows`.
- Formula-derived rows inherit canonical provenance from their source surfaces.
- The frontend response shape remains unchanged.
## Test Matrix
The parser must cover:
- direct alias flattening for `accounts_receivable`
- period-sparse alias merges into one canonical row
- formula derivation for `total_cash_and_equivalents`
- formula derivation for `unearned_revenue`
- formula derivation for `total_debt`
- formula derivation for `net_cash_position`
- helper rows staying out of emitted balance surfaces
- residual pruning of canonically consumed balance rows
- sector packs receiving merged core balance coverage without changing frontend contracts
## Learnings Reusable For Other Statements
The same parser rules should later apply to cash flow:
- canonical mapping outranks residual classification
- direct aliases should resolve per period
- helper rows can exist backend-only when formulas need them
- consumed sources must be removed from `unmapped`
- sector packs should inherit common canonical coverage rather than duplicating it


@@ -0,0 +1,155 @@
# Cash Flow Statement Parser Spec
## Purpose
This document defines the backend-only cash-flow parsing rules for `fiscal-xbrl-core`.
This pass is limited to Rust parser behavior, taxonomy packs, and backend comparison tooling. It must not modify frontend files, frontend rendering logic, or frontend response shapes.
## Hydration Order
1. Load the selected surface pack.
2. For non-core packs, merge in any core balance-sheet and cash-flow surfaces that the selected pack does not override.
3. Resolve direct canonical cash-flow rows from statement rows.
4. Resolve aggregate-child cash-flow rows from matched detail components when direct canonical rows are absent.
5. Resolve formula-backed cash-flow rows from already-resolved canonical rows and helper rows.
6. Emit `unmapped` only for rows not consumed by canonical cash-flow parsing.
## Category Model
Cash-flow rows use these backend category keys:
- `operating`
- `investing`
- `financing`
- `free_cash_flow`
- `helper`
Rules:
- `helper` rows are backend-only and use `include_in_output: false`.
- Only `operating`, `investing`, `financing`, and `free_cash_flow` should appear in emitted `surface_rows`.
## Canonical Precedence Rule
Canonical cash-flow mappings take precedence over residual classification.
If a statement row is consumed by a canonical cash-flow row, it must not remain in `detail_rows["cash_flow"]["unmapped"]`.
## Alias Flattening Rule
Synonymous cash-flow concepts flatten into one canonical surface row.
Example:
- `NetCashProvidedByUsedInOperatingActivities`
- `NetCashProvidedByUsedInOperatingActivitiesContinuingOperations`
These must become one `operating_cash_flow` row with period-aware provenance.
## Per-Period Resolution Rule
Direct cash-flow matching is resolved per period, not by choosing one row globally.
For each canonical cash-flow row:
1. Collect all direct candidates.
2. For each period, choose the best candidate with a value in that period.
3. Build one canonical row from those period-specific winners.
4. Preserve the union of all consumed aliases in `source_concepts`, `source_row_keys`, and `source_fact_ids`.
## Sign Normalization Rule
Some canonical cash-flow rows require sign normalization.
Supported transform:
- `invert`
Rule:
- sign transforms are applied after direct or aggregate resolution
- sign transforms are applied before formula evaluation consumes the row
- emitted detail rows inherit the same transform when they belong to the transformed canonical row
- provenance is preserved unchanged
## Formula Rule
Structured formulas are evaluated only after their source surface rows have been resolved.
Supported operators:
- `sum`
- `subtract`
Current formulas:
- `changes_unearned_revenue = contract_liability_incurred - contract_liability_recognized`
- `changes_other_operating_activities = changes_other_current_assets + changes_other_current_liabilities + changes_other_noncurrent_assets + changes_other_noncurrent_liabilities`
- `free_cash_flow = operating_cash_flow + capital_expenditures`
## Helper Row Rule
Helper rows exist only to support formulas and canonical grouping.
Current helper rows:
- `contract_liability_incurred`
- `contract_liability_recognized`
- `changes_other_current_assets`
- `changes_other_current_liabilities`
- `changes_other_noncurrent_assets`
- `changes_other_noncurrent_liabilities`
Behavior:
- helper rows remain available for formula evaluation
- helper rows do not appear in emitted `surface_rows`
- helper rows do not create emitted detail buckets
- helper rows still consume matched backend sources so those rows do not leak into `unmapped`
## Residual Pruning Rule
`cash_flow.unmapped` is a strict remainder set.
A cash-flow statement row must be excluded from `unmapped` when either of these is true:
- its row key was consumed by a canonical cash-flow row
- its concept key was consumed by a canonical cash-flow row
## Sector Inheritance Rule
Sector packs inherit the core cash-flow taxonomy unless they provide an explicit cash-flow override.
Current behavior:
- bank/lender inherits core cash-flow rows
- broker/asset manager inherits core cash-flow rows
- insurance inherits core cash-flow rows
- REIT/real estate inherits core cash-flow rows
No first-pass sector-specific cash-flow overrides are required.
## Synonym vs Aggregate Child Rule
Two cases must remain distinct.
### Synonym aliases
Different concept names for the same canonical cash-flow meaning.
Behavior:
- flatten into one canonical surface row
- do not emit duplicate detail rows
- do not remain in `unmapped`
### Aggregate child components
Rows that legitimately roll into a subtotal or grouped adjustment row.
Behavior:
- may remain as detail rows beneath the canonical parent when grouping is enabled
- must not remain in `unmapped` after being consumed
## Required Invariants
- A consumed cash-flow source must never remain in `cash_flow.unmapped`.
- A synonym alias must never create more than one canonical cash-flow row.
- Hidden helper surfaces may consume sources but must not appear in emitted `surface_rows`.
- Formula-derived rows inherit canonical provenance from their source surfaces.
- The frontend response shape remains unchanged.
## Test Matrix
The parser must cover:
- direct sign inversion for `capital_expenditures`
- direct sign inversion for `debt_repaid`
- direct sign inversion for `share_repurchases`
- direct mapping for `operating_cash_flow`
- formula derivation for `changes_unearned_revenue`
- formula derivation for `changes_other_operating_activities`
- formula derivation for `free_cash_flow`
- helper rows staying out of emitted cash-flow surfaces
- residual pruning of canonically consumed cash-flow rows
- sector packs receiving merged core cash-flow coverage without changing frontend contracts
- fallback classification for fact-only cash-flow concepts such as `IncreaseDecreaseInAccountsReceivable` and `PaymentsOfDividends`
## Learnings Reusable For Other Statements
The same parser rules now apply consistently across income, balance, and cash flow:
- canonical mapping outranks residual classification
- direct aliases resolve per period
- helper rows may exist backend-only when formulas need them
- consumed sources must be removed from `unmapped`
- sector packs inherit common canonical coverage instead of duplicating it


@@ -0,0 +1,103 @@
# Operating Statement Parser Spec
## Purpose
This document defines the backend-only parsing rules for operating statement hydration in `fiscal-xbrl-core`.
This pass is intentionally limited to Rust parser behavior. It must not change frontend files, frontend rendering logic, or API response shapes.
## Hydration Order
1. Generic compact surface mapping builds initial `surface_rows`, `detail_rows`, and `unmapped` residuals.
2. Universal income parsing rewrites the income statement into canonical operating-statement rows.
3. Canonical income parsing is authoritative for income provenance and must prune any consumed residual rows from `detail_rows["income"]["unmapped"]`.
## Canonical Precedence Rule
For income rows, canonical universal mappings take precedence over generic residual classification.
If an income concept is consumed by a canonical operating-statement row, it must not remain in `unmapped`.
## Alias Flattening Rule
Multiple source aliases for the same canonical operating-statement concept must flatten into a single canonical surface row.
Examples:
- `us-gaap:OtherOperatingExpense`
- `us-gaap:OtherOperatingExpenses`
- `us-gaap:OtherCostAndExpenseOperating`
These may differ by filer or period, but they still represent one canonical row such as `other_operating_expense`.
## Per-Period Resolution Rule
Direct canonical matching is resolved per period, not by selecting one global winner for all periods.
For each canonical income row:
1. Collect all direct statement-row matches.
2. For each period, keep only candidates with a value in that period.
3. Choose the best candidate for that period using existing ranking rules.
4. Build one canonical row whose `values` and `resolved_source_row_keys` are assembled period-by-period.
The canonical row's provenance is the union of all consumed aliases, even if a different alias wins in different periods.
## Residual Pruning Rule
After canonical income rows are resolved:
- collect all consumed source row keys
- collect all consumed concept keys
- remove any residual income detail row from `unmapped` if either identifier matches
`unmapped` is a strict remainder set after income canonicalization.
## Synonym vs Aggregate Child Rule
Two cases must remain distinct:
### Synonym aliases
Different concept names representing the same canonical meaning.
Behavior:
- flatten into one canonical surface row
- do not emit as detail rows
- do not leave in `unmapped`
### Aggregate child components
Rows that are true components of a higher-level canonical row, such as:
- `SalesAndMarketingExpense`
- `GeneralAndAdministrativeExpense`
which roll up into `selling_general_and_administrative`
Behavior:
- may appear as detail rows under the canonical parent
- must not also remain in `unmapped` once consumed by that canonical parent
## Required Invariants
For income parsing, each consumed source must appear in exactly one of these places:
- canonical surface provenance
- canonical detail provenance
- `unmapped`
It must never appear in more than one place at the same time.
Additional invariants:
- canonical surface rows are unique by canonical key
- aliases are flattened into one canonical row
- `resolved_source_row_keys` are period-specific
- normalization counts reflect the post-pruning state
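The exclusivity invariant lends itself to a small membership check, e.g. in a debug assertion or test helper. The set shapes here are assumptions for the sketch; the real provenance structures are richer:

```rust
use std::collections::HashSet;

// A consumed source identifier must sit in exactly one of: canonical surface
// provenance, canonical detail provenance, or `unmapped`.
fn appears_in_exactly_one(
    key: &str,
    surface: &HashSet<String>,
    detail: &HashSet<String>,
    unmapped: &HashSet<String>,
) -> bool {
    [surface, detail, unmapped]
        .iter()
        .filter(|set| set.contains(key))
        .count()
        == 1
}

fn main() {
    let surface: HashSet<String> = ["us-gaap:OtherOperatingExpense".to_string()]
        .into_iter()
        .collect();
    let detail: HashSet<String> = HashSet::new();
    let unmapped: HashSet<String> = HashSet::new();
    assert!(appears_in_exactly_one(
        "us-gaap:OtherOperatingExpense",
        &surface,
        &detail,
        &unmapped
    ));
}
```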
## Performance Constraints
- Use `HashSet` membership for consumed-source pruning.
- Build candidate collections once per canonical definition.
- Avoid UI-side dedupe or post-processing.
- Keep the parser close to linear in candidate volume per definition.
## Test Matrix
The parser must cover:
- direct alias dedupe for `other_operating_expense`
- period-sparse alias merge into a single canonical row
- pruning of canonically consumed aliases from `income.unmapped`
- preservation of truly unrelated residual rows
- pruning of formula-consumed component rows from `income.unmapped`
## Learnings For Other Statements
The same backend rules should later be applied to balance sheet and cash flow:
- canonical mapping must outrank residual classification
- alias resolution should be per-period
- consumed sources must be removed from `unmapped`
- synonym aliases and aggregate child components must be treated differently
When balance sheet and cash flow are upgraded, they should adopt these invariants without changing frontend response shapes.


@@ -37,10 +37,12 @@ static IDENTIFIER_RE: Lazy<Regex> = Lazy::new(|| {
Regex::new(r#"(?is)<(?:[a-z0-9_\-]+:)?identifier\b[^>]*\bscheme=["']([^"']+)["'][^>]*>(.*?)</(?:[a-z0-9_\-]+:)?identifier>"#).unwrap()
});
static SEGMENT_RE: Lazy<Regex> = Lazy::new(|| {
Regex::new(r#"(?is)<(?:[a-z0-9_\-]+:)?segment\b[^>]*>(.*?)</(?:[a-z0-9_\-]+:)?segment>"#).unwrap()
Regex::new(r#"(?is)<(?:[a-z0-9_\-]+:)?segment\b[^>]*>(.*?)</(?:[a-z0-9_\-]+:)?segment>"#)
.unwrap()
});
static SCENARIO_RE: Lazy<Regex> = Lazy::new(|| {
Regex::new(r#"(?is)<(?:[a-z0-9_\-]+:)?scenario\b[^>]*>(.*?)</(?:[a-z0-9_\-]+:)?scenario>"#).unwrap()
Regex::new(r#"(?is)<(?:[a-z0-9_\-]+:)?scenario\b[^>]*>(.*?)</(?:[a-z0-9_\-]+:)?scenario>"#)
.unwrap()
});
static START_DATE_RE: Lazy<Regex> = Lazy::new(|| {
Regex::new(r#"(?is)<(?:[a-z0-9_\-]+:)?startDate>(.*?)</(?:[a-z0-9_\-]+:)?startDate>"#).unwrap()
@@ -55,7 +57,8 @@ static MEASURE_RE: Lazy<Regex> = Lazy::new(|| {
Regex::new(r#"(?is)<(?:[a-z0-9_\-]+:)?measure>(.*?)</(?:[a-z0-9_\-]+:)?measure>"#).unwrap()
});
static LABEL_LINK_RE: Lazy<Regex> = Lazy::new(|| {
Regex::new(r#"(?is)<(?:[a-z0-9_\-]+:)?labelLink\b[^>]*>(.*?)</(?:[a-z0-9_\-]+:)?labelLink>"#).unwrap()
Regex::new(r#"(?is)<(?:[a-z0-9_\-]+:)?labelLink\b[^>]*>(.*?)</(?:[a-z0-9_\-]+:)?labelLink>"#)
.unwrap()
});
static PRESENTATION_LINK_RE: Lazy<Regex> = Lazy::new(|| {
Regex::new(r#"(?is)<(?:[a-z0-9_\-]+:)?presentationLink\b([^>]*)>(.*?)</(?:[a-z0-9_\-]+:)?presentationLink>"#).unwrap()
@@ -67,12 +70,14 @@ static LABEL_RESOURCE_RE: Lazy<Regex> = Lazy::new(|| {
Regex::new(r#"(?is)<(?:[a-z0-9_\-]+:)?label\b([^>]*)>(.*?)</(?:[a-z0-9_\-]+:)?label>"#).unwrap()
});
static LABEL_ARC_RE: Lazy<Regex> = Lazy::new(|| {
Regex::new(r#"(?is)<(?:[a-z0-9_\-]+:)?labelArc\b([^>]*)/?>(?:</(?:[a-z0-9_\-]+:)?labelArc>)?"#).unwrap()
Regex::new(r#"(?is)<(?:[a-z0-9_\-]+:)?labelArc\b([^>]*)/?>(?:</(?:[a-z0-9_\-]+:)?labelArc>)?"#)
.unwrap()
});
static PRESENTATION_ARC_RE: Lazy<Regex> = Lazy::new(|| {
Regex::new(r#"(?is)<(?:[a-z0-9_\-]+:)?presentationArc\b([^>]*)/?>(?:</(?:[a-z0-9_\-]+:)?presentationArc>)?"#).unwrap()
});
static ATTR_RE: Lazy<Regex> = Lazy::new(|| Regex::new(r#"([a-zA-Z0-9:_\-]+)=["']([^"']+)["']"#).unwrap());
static ATTR_RE: Lazy<Regex> =
Lazy::new(|| Regex::new(r#"([a-zA-Z0-9:_\-]+)=["']([^"']+)["']"#).unwrap());
#[derive(Debug, Deserialize)]
#[serde(rename_all = "camelCase")]
@@ -451,7 +456,8 @@ pub fn hydrate_filing(input: HydrateFilingRequest) -> Result<HydrateFilingRespon
});
};
let instance_text = fetch_text(&client, &instance_asset.url).context("fetch request failed for XBRL instance")?;
let instance_text = fetch_text(&client, &instance_asset.url)
.context("fetch request failed for XBRL instance")?;
let parsed_instance = parse_xbrl_instance(&instance_text, Some(instance_asset.name.clone()));
let mut label_by_concept = HashMap::new();
@@ -459,11 +465,9 @@ pub fn hydrate_filing(input: HydrateFilingRequest) -> Result<HydrateFilingRespon
let mut source = "xbrl_instance".to_string();
let mut parse_error = None;
for asset in discovered
.assets
.iter()
.filter(|asset| asset.is_selected && (asset.asset_type == "presentation" || asset.asset_type == "label"))
{
for asset in discovered.assets.iter().filter(|asset| {
asset.is_selected && (asset.asset_type == "presentation" || asset.asset_type == "label")
}) {
match fetch_text(&client, &asset.url) {
Ok(content) => {
if asset.asset_type == "presentation" {
@@ -515,10 +519,15 @@ pub fn hydrate_filing(input: HydrateFilingRequest) -> Result<HydrateFilingRespon
pack_selection.pack,
&mut compact_model,
)?;
let kpi_result = kpi_mapper::build_taxonomy_kpis(&materialized.periods, &facts, pack_selection.pack)?;
let kpi_result =
kpi_mapper::build_taxonomy_kpis(&materialized.periods, &facts, pack_selection.pack)?;
compact_model.normalization_summary.kpi_row_count = kpi_result.rows.len();
for warning in kpi_result.warnings {
if !compact_model.normalization_summary.warnings.contains(&warning) {
if !compact_model
.normalization_summary
.warnings
.contains(&warning)
{
compact_model.normalization_summary.warnings.push(warning);
}
}
@@ -526,7 +535,11 @@ pub fn hydrate_filing(input: HydrateFilingRequest) -> Result<HydrateFilingRespon
&mut compact_model.concept_mappings,
kpi_result.mapping_assignments,
);
surface_mapper::apply_mapping_assignments(&mut concepts, &mut facts, &compact_model.concept_mappings);
surface_mapper::apply_mapping_assignments(
&mut concepts,
&mut facts,
&compact_model.concept_mappings,
);
let has_rows = materialized
.statement_rows
@@ -572,7 +585,11 @@ pub fn hydrate_filing(input: HydrateFilingRequest) -> Result<HydrateFilingRespon
concepts_count: concepts.len(),
dimensions_count: facts
.iter()
.flat_map(|fact| fact.dimensions.iter().map(|dimension| format!("{}::{}", dimension.axis, dimension.member)))
.flat_map(|fact| {
fact.dimensions
.iter()
.map(|dimension| format!("{}::{}", dimension.axis, dimension.member))
})
.collect::<HashSet<_>>()
.len(),
assets: discovered.assets,
@@ -622,7 +639,10 @@ struct DiscoveredAssets {
assets: Vec<AssetOutput>,
}
fn discover_filing_assets(input: &HydrateFilingRequest, client: &Client) -> Result<DiscoveredAssets> {
fn discover_filing_assets(
input: &HydrateFilingRequest,
client: &Client,
) -> Result<DiscoveredAssets> {
let Some(directory_url) = resolve_filing_directory_url(
input.filing_url.as_deref(),
&input.cik,
@@ -631,12 +651,19 @@ fn discover_filing_assets(input: &HydrateFilingRequest, client: &Client) -> Resu
return Ok(DiscoveredAssets { assets: vec![] });
};
let payload = fetch_json::<FilingDirectoryPayload>(client, &format!("{directory_url}index.json")).ok();
let payload =
fetch_json::<FilingDirectoryPayload>(client, &format!("{directory_url}index.json")).ok();
let mut discovered = Vec::new();
if let Some(items) = payload.and_then(|payload| payload.directory.and_then(|directory| directory.item)) {
if let Some(items) =
payload.and_then(|payload| payload.directory.and_then(|directory| directory.item))
{
for item in items {
let Some(name) = item.name.map(|name| name.trim().to_string()).filter(|name| !name.is_empty()) else {
let Some(name) = item
.name
.map(|name| name.trim().to_string())
.filter(|name| !name.is_empty())
else {
continue;
};
@@ -683,12 +710,19 @@ fn discover_filing_assets(input: &HydrateFilingRequest, client: &Client) -> Resu
score_instance(&asset.name, input.primary_document.as_deref()),
)
})
.max_by(|left, right| left.1.partial_cmp(&right.1).unwrap_or(std::cmp::Ordering::Equal))
.max_by(|left, right| {
left.1
.partial_cmp(&right.1)
.unwrap_or(std::cmp::Ordering::Equal)
})
.map(|entry| entry.0);
for asset in &mut discovered {
asset.score = if asset.asset_type == "instance" {
Some(score_instance(&asset.name, input.primary_document.as_deref()))
Some(score_instance(
&asset.name,
input.primary_document.as_deref(),
))
} else if asset.asset_type == "pdf" {
Some(score_pdf(&asset.name, asset.size_bytes))
} else {
@@ -708,7 +742,11 @@ fn discover_filing_assets(input: &HydrateFilingRequest, client: &Client) -> Resu
Ok(DiscoveredAssets { assets: discovered })
}
fn resolve_filing_directory_url(filing_url: Option<&str>, cik: &str, accession_number: &str) -> Option<String> {
fn resolve_filing_directory_url(
filing_url: Option<&str>,
cik: &str,
accession_number: &str,
) -> Option<String> {
if let Some(filing_url) = filing_url.map(str::trim).filter(|value| !value.is_empty()) {
if let Some(last_slash) = filing_url.rfind('/') {
if last_slash > "https://".len() {
@@ -725,7 +763,10 @@ fn resolve_filing_directory_url(filing_url: Option<&str>, cik: &str, accession_n
}
fn normalize_cik_for_path(value: &str) -> Option<String> {
let digits = value.chars().filter(|char| char.is_ascii_digit()).collect::<String>();
let digits = value
.chars()
.filter(|char| char.is_ascii_digit())
.collect::<String>();
if digits.is_empty() {
return None;
}
@@ -741,16 +782,25 @@ fn classify_asset_type(name: &str) -> &'static str {
return "schema";
}
if lower.ends_with(".xml") {
if lower.ends_with("_pre.xml") || lower.ends_with("-pre.xml") || lower.contains("presentation") {
if lower.ends_with("_pre.xml")
|| lower.ends_with("-pre.xml")
|| lower.contains("presentation")
{
return "presentation";
}
if lower.ends_with("_lab.xml") || lower.ends_with("-lab.xml") || lower.contains("label") {
return "label";
}
if lower.ends_with("_cal.xml") || lower.ends_with("-cal.xml") || lower.contains("calculation") {
if lower.ends_with("_cal.xml")
|| lower.ends_with("-cal.xml")
|| lower.contains("calculation")
{
return "calculation";
}
if lower.ends_with("_def.xml") || lower.ends_with("-def.xml") || lower.contains("definition") {
if lower.ends_with("_def.xml")
|| lower.ends_with("-def.xml")
|| lower.contains("definition")
{
return "definition";
}
return "instance";
@@ -779,7 +829,11 @@ fn score_instance(name: &str, primary_document: Option<&str>) -> f64 {
score += 5.0;
}
}
if lower.contains("cal") || lower.contains("def") || lower.contains("lab") || lower.contains("pre") {
if lower.contains("cal")
|| lower.contains("def")
|| lower.contains("lab")
|| lower.contains("pre")
{
score -= 3.0;
}
score
@@ -819,7 +873,9 @@ fn fetch_text(client: &Client, url: &str) -> Result<String> {
if !response.status().is_success() {
return Err(anyhow!("request failed for {url} ({})", response.status()));
}
response.text().with_context(|| format!("unable to read response body for {url}"))
response
.text()
.with_context(|| format!("unable to read response body for {url}"))
}
fn fetch_json<T: for<'de> Deserialize<'de>>(client: &Client, url: &str) -> Result<T> {
@@ -847,17 +903,36 @@ fn parse_xbrl_instance(raw: &str, source_file: Option<String>) -> ParsedInstance
let mut facts = Vec::new();
for captures in FACT_RE.captures_iter(raw) {
let prefix = captures.get(1).map(|value| value.as_str().trim()).unwrap_or_default();
let local_name = captures.get(2).map(|value| value.as_str().trim()).unwrap_or_default();
let attrs = captures.get(3).map(|value| value.as_str()).unwrap_or_default();
let body = decode_xml_entities(captures.get(4).map(|value| value.as_str()).unwrap_or_default().trim());
let prefix = captures
.get(1)
.map(|value| value.as_str().trim())
.unwrap_or_default();
let local_name = captures
.get(2)
.map(|value| value.as_str().trim())
.unwrap_or_default();
let attrs = captures
.get(3)
.map(|value| value.as_str())
.unwrap_or_default();
let body = decode_xml_entities(
captures
.get(4)
.map(|value| value.as_str())
.unwrap_or_default()
.trim(),
);
if prefix.is_empty() || local_name.is_empty() || is_xbrl_infrastructure_prefix(prefix) {
continue;
}
let attr_map = parse_attrs(attrs);
let Some(context_id) = attr_map.get("contextRef").cloned().or_else(|| attr_map.get("contextref").cloned()) else {
let Some(context_id) = attr_map
.get("contextRef")
.cloned()
.or_else(|| attr_map.get("contextref").cloned())
else {
continue;
};
@@ -870,7 +945,10 @@ fn parse_xbrl_instance(raw: &str, source_file: Option<String>) -> ParsedInstance
.cloned()
.unwrap_or_else(|| format!("urn:unknown:{prefix}"));
let context = context_by_id.get(&context_id);
let unit_ref = attr_map.get("unitRef").cloned().or_else(|| attr_map.get("unitref").cloned());
let unit_ref = attr_map
.get("unitRef")
.cloned()
.or_else(|| attr_map.get("unitref").cloned());
let unit = unit_ref
.as_ref()
.and_then(|unit_ref| unit_by_id.get(unit_ref))
@@ -896,8 +974,12 @@ fn parse_xbrl_instance(raw: &str, source_file: Option<String>) -> ParsedInstance
period_start: context.and_then(|value| value.period_start.clone()),
period_end: context.and_then(|value| value.period_end.clone()),
period_instant: context.and_then(|value| value.period_instant.clone()),
dimensions: context.map(|value| value.dimensions.clone()).unwrap_or_default(),
is_dimensionless: context.map(|value| value.dimensions.is_empty()).unwrap_or(true),
dimensions: context
.map(|value| value.dimensions.clone())
.unwrap_or_default(),
is_dimensionless: context
.map(|value| value.dimensions.is_empty())
.unwrap_or(true),
source_file: source_file.clone(),
});
}
@@ -916,10 +998,7 @@ fn parse_xbrl_instance(raw: &str, source_file: Option<String>) -> ParsedInstance
})
.collect::<Vec<_>>();
ParsedInstance {
contexts,
facts,
}
ParsedInstance { contexts, facts }
}
fn parse_namespace_map(raw: &str, root_tag_hint: &str) -> HashMap<String, String> {
@@ -935,7 +1014,10 @@ fn parse_namespace_map(raw: &str, root_tag_hint: &str) -> HashMap<String, String
.captures_iter(&root_start)
{
if let (Some(prefix), Some(uri)) = (captures.get(1), captures.get(2)) {
map.insert(prefix.as_str().trim().to_string(), uri.as_str().trim().to_string());
map.insert(
prefix.as_str().trim().to_string(),
uri.as_str().trim().to_string(),
);
}
}
@@ -946,16 +1028,26 @@ fn parse_contexts(raw: &str) -> HashMap<String, ParsedContext> {
let mut contexts = HashMap::new();
for captures in CONTEXT_RE.captures_iter(raw) {
let Some(context_id) = captures.get(1).map(|value| value.as_str().trim().to_string()) else {
let Some(context_id) = captures
.get(1)
.map(|value| value.as_str().trim().to_string())
else {
continue;
};
let block = captures.get(2).map(|value| value.as_str()).unwrap_or_default();
let block = captures
.get(2)
.map(|value| value.as_str())
.unwrap_or_default();
let (entity_identifier, entity_scheme) = IDENTIFIER_RE
.captures(block)
.map(|captures| {
(
captures.get(2).map(|value| decode_xml_entities(value.as_str().trim())),
captures.get(1).map(|value| decode_xml_entities(value.as_str().trim())),
captures
.get(2)
.map(|value| decode_xml_entities(value.as_str().trim())),
captures
.get(1)
.map(|value| decode_xml_entities(value.as_str().trim())),
)
})
.unwrap_or((None, None));
@@ -984,7 +1076,10 @@ fn parse_contexts(raw: &str) -> HashMap<String, ParsedContext> {
let mut dimensions = Vec::new();
if let Some(segment_value) = segment.as_ref() {
if let Some(members) = segment_value.get("explicitMembers").and_then(|value| value.as_array()) {
if let Some(members) = segment_value
.get("explicitMembers")
.and_then(|value| value.as_array())
{
for member in members {
if let (Some(axis), Some(member_value)) = (
member.get("axis").and_then(|value| value.as_str()),
@@ -999,7 +1094,10 @@ fn parse_contexts(raw: &str) -> HashMap<String, ParsedContext> {
}
}
if let Some(scenario_value) = scenario.as_ref() {
if let Some(members) = scenario_value.get("explicitMembers").and_then(|value| value.as_array()) {
if let Some(members) = scenario_value
.get("explicitMembers")
.and_then(|value| value.as_array())
{
for member in members {
if let (Some(axis), Some(member_value)) = (
member.get("axis").and_then(|value| value.as_str()),
@@ -1062,10 +1160,16 @@ fn parse_dimension_container(raw: &str) -> serde_json::Value {
fn parse_units(raw: &str) -> HashMap<String, ParsedUnit> {
let mut units = HashMap::new();
for captures in UNIT_RE.captures_iter(raw) {
let Some(id) = captures.get(1).map(|value| value.as_str().trim().to_string()) else {
let Some(id) = captures
.get(1)
.map(|value| value.as_str().trim().to_string())
else {
continue;
};
let block = captures.get(2).map(|value| value.as_str()).unwrap_or_default();
let block = captures
.get(2)
.map(|value| value.as_str())
.unwrap_or_default();
let measures = MEASURE_RE
.captures_iter(block)
.filter_map(|captures| captures.get(1))
@@ -1097,7 +1201,10 @@ fn parse_attrs(raw: &str) -> HashMap<String, String> {
let mut map = HashMap::new();
for captures in ATTR_RE.captures_iter(raw) {
if let (Some(name), Some(value)) = (captures.get(1), captures.get(2)) {
map.insert(name.as_str().to_string(), decode_xml_entities(value.as_str()));
map.insert(
name.as_str().to_string(),
decode_xml_entities(value.as_str()),
);
}
}
map
@@ -1138,12 +1245,20 @@ fn parse_label_linkbase(raw: &str) -> HashMap<String, String> {
let mut preferred = HashMap::<String, (String, i64)>::new();
for captures in LABEL_LINK_RE.captures_iter(raw) {
let block = captures.get(1).map(|value| value.as_str()).unwrap_or_default();
let block = captures
.get(1)
.map(|value| value.as_str())
.unwrap_or_default();
let mut loc_by_label = HashMap::<String, String>::new();
let mut resource_by_label = HashMap::<String, (String, Option<String>)>::new();
for captures in LOC_RE.captures_iter(block) {
let attrs = parse_attrs(
captures
.get(1)
.map(|value| value.as_str())
.unwrap_or_default(),
);
let Some(label) = attrs.get("xlink:label").cloned() else {
continue;
};
@@ -1160,11 +1275,21 @@ fn parse_label_linkbase(raw: &str) -> HashMap<String, String> {
}
for captures in LABEL_RESOURCE_RE.captures_iter(block) {
let attrs = parse_attrs(
captures
.get(1)
.map(|value| value.as_str())
.unwrap_or_default(),
);
let Some(label) = attrs.get("xlink:label").cloned() else {
continue;
};
let body = decode_xml_entities(
captures
.get(2)
.map(|value| value.as_str())
.unwrap_or_default(),
)
.split_whitespace()
.collect::<Vec<_>>()
.join(" ");
@@ -1175,7 +1300,12 @@ fn parse_label_linkbase(raw: &str) -> HashMap<String, String> {
}
for captures in LABEL_ARC_RE.captures_iter(block) {
let attrs = parse_attrs(
captures
.get(1)
.map(|value| value.as_str())
.unwrap_or_default(),
);
let Some(from) = attrs.get("xlink:from").cloned() else {
continue;
};
@@ -1190,7 +1320,11 @@ fn parse_label_linkbase(raw: &str) -> HashMap<String, String> {
};
let priority = label_priority(role.as_deref());
let current = preferred.get(concept_key).cloned();
if current
.as_ref()
.map(|(_, current_priority)| priority > *current_priority)
.unwrap_or(true)
{
preferred.insert(concept_key.clone(), (label.clone(), priority));
}
}
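The arc loop above keeps at most one label per concept: a candidate replaces the stored label only when its role priority is strictly higher. A std-only sketch of that selection rule (the function name and tuple shape are illustrative, not the parser's API):

```rust
use std::collections::HashMap;

// Keep the highest-priority label seen for each concept key.
// Candidates are (concept_key, label, priority); the strict `>` means
// equal-priority candidates never displace the first label stored.
fn select_preferred_labels(
    candidates: &[(&str, &str, i64)],
) -> HashMap<String, (String, i64)> {
    let mut preferred = HashMap::<String, (String, i64)>::new();
    for (concept_key, label, priority) in candidates {
        let current = preferred.get(*concept_key).cloned();
        if current
            .as_ref()
            .map(|(_, current_priority)| *priority > *current_priority)
            .unwrap_or(true)
        {
            preferred.insert((*concept_key).to_string(), ((*label).to_string(), *priority));
        }
    }
    preferred
}
```

The strict comparison makes the outcome independent of how many equal-priority labels the linkbase repeats; only a genuinely higher-priority role changes the stored label.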
@@ -1207,18 +1341,31 @@ fn parse_presentation_linkbase(raw: &str) -> Vec<PresentationNode> {
let mut rows = Vec::new();
for captures in PRESENTATION_LINK_RE.captures_iter(raw) {
let link_attrs = parse_attrs(
captures
.get(1)
.map(|value| value.as_str())
.unwrap_or_default(),
);
let Some(role_uri) = link_attrs.get("xlink:role").cloned() else {
continue;
};
let block = captures
.get(2)
.map(|value| value.as_str())
.unwrap_or_default();
let mut loc_by_label = HashMap::<String, (String, String, bool)>::new();
let mut children_by_label = HashMap::<String, Vec<(String, f64)>>::new();
let mut incoming = HashSet::<String>::new();
let mut all_referenced = HashSet::<String>::new();
for captures in LOC_RE.captures_iter(block) {
let attrs = parse_attrs(
captures
.get(1)
.map(|value| value.as_str())
.unwrap_or_default(),
);
let Some(label) = attrs.get("xlink:label").cloned() else {
continue;
};
@@ -1228,14 +1375,27 @@ fn parse_presentation_linkbase(raw: &str) -> Vec<PresentationNode> {
let Some(qname) = qname_from_href(&href) else {
continue;
};
let Some((concept_key, qname, local_name)) = concept_from_qname(&qname, &namespaces)
else {
continue;
};
loc_by_label.insert(
label,
(
concept_key,
qname,
local_name.to_ascii_lowercase().contains("abstract"),
),
);
}
for captures in PRESENTATION_ARC_RE.captures_iter(block) {
let attrs = parse_attrs(
captures
.get(1)
.map(|value| value.as_str())
.unwrap_or_default(),
);
let Some(from) = attrs.get("xlink:from").cloned() else {
continue;
};
@@ -1248,8 +1408,16 @@ fn parse_presentation_linkbase(raw: &str) -> Vec<PresentationNode> {
let order = attrs
.get("order")
.and_then(|value| value.parse::<f64>().ok())
.unwrap_or_else(|| {
children_by_label
.get(&from)
.map(|children| children.len() as f64 + 1.0)
.unwrap_or(1.0)
});
children_by_label
.entry(from.clone())
.or_default()
.push((to.clone(), order));
incoming.insert(to.clone());
all_referenced.insert(from);
all_referenced.insert(to);
@@ -1281,7 +1449,11 @@ fn parse_presentation_linkbase(raw: &str) -> Vec<PresentationNode> {
return;
}
let parent_concept_key = parent_label.and_then(|parent| {
loc_by_label
.get(parent)
.map(|(concept_key, _, _)| concept_key.clone())
});
rows.push(PresentationNode {
concept_key: concept_key.clone(),
role_uri: role_uri.to_string(),
@@ -1292,7 +1464,11 @@ fn parse_presentation_linkbase(raw: &str) -> Vec<PresentationNode> {
});
let mut children = children_by_label.get(label).cloned().unwrap_or_default();
children.sort_by(|left, right| {
left.1
.partial_cmp(&right.1)
.unwrap_or(std::cmp::Ordering::Equal)
});
for (index, (child_label, _)) in children.into_iter().enumerate() {
dfs(
&child_label,
@@ -1400,7 +1576,10 @@ fn materialize_taxonomy_statements(
.clone()
.or_else(|| fact.period_instant.clone())
.unwrap_or_else(|| filing_date.to_string());
let id = format!(
"{date}-{compact_accession}-{}",
period_by_signature.len() + 1
);
let period_label = if fact.period_instant.is_some() && fact.period_start.is_none() {
"Instant".to_string()
} else if fact.period_start.is_some() && fact.period_end.is_some() {
@@ -1420,7 +1599,10 @@ fn materialize_taxonomy_statements(
accession_number: accession_number.to_string(),
filing_date: filing_date.to_string(),
period_start: fact.period_start.clone(),
period_end: fact
.period_end
.clone()
.or_else(|| fact.period_instant.clone()),
filing_type: filing_type.to_string(),
period_label,
},
@@ -1429,9 +1611,17 @@ fn materialize_taxonomy_statements(
let mut periods = period_by_signature.values().cloned().collect::<Vec<_>>();
periods.sort_by(|left, right| {
let left_key = left
.period_end
.clone()
.unwrap_or_else(|| left.filing_date.clone());
let right_key = right
.period_end
.clone()
.unwrap_or_else(|| right.filing_date.clone());
left_key
.cmp(&right_key)
.then_with(|| left.id.cmp(&right.id))
});
let period_id_by_signature = period_by_signature
.iter()
@@ -1440,7 +1630,10 @@ fn materialize_taxonomy_statements(
let mut presentation_by_concept = HashMap::<String, Vec<&PresentationNode>>::new();
for node in presentation {
presentation_by_concept
.entry(node.concept_key.clone())
.or_default()
.push(node);
}
let mut grouped_by_statement = empty_parsed_fact_map();
@@ -1502,9 +1695,13 @@ fn materialize_taxonomy_statements(
let mut concepts = Vec::<ConceptOutput>::new();
for statement_kind in statement_keys() {
let concept_groups = grouped_by_statement
.remove(statement_kind)
.unwrap_or_default();
let mut concept_keys = HashSet::<String>::new();
for node in presentation.iter().filter(|node| {
classify_statement_role(&node.role_uri).as_deref() == Some(statement_kind)
}) {
concept_keys.insert(node.concept_key.clone());
}
for concept_key in concept_groups.keys() {
@@ -1516,12 +1713,21 @@ fn materialize_taxonomy_statements(
.map(|concept_key| {
let nodes = presentation
.iter()
.filter(|node| {
node.concept_key == concept_key
&& classify_statement_role(&node.role_uri).as_deref()
== Some(statement_kind)
})
.collect::<Vec<_>>();
let order = nodes
.iter()
.map(|node| node.order)
.fold(f64::INFINITY, f64::min);
let depth = nodes.iter().map(|node| node.depth).min().unwrap_or(0);
let role_uri = nodes.first().map(|node| node.role_uri.clone());
let parent_concept_key = nodes
.first()
.and_then(|node| node.parent_concept_key.clone());
(concept_key, order, depth, role_uri, parent_concept_key)
})
.collect::<Vec<_>>();
@@ -1532,8 +1738,13 @@ fn materialize_taxonomy_statements(
.then_with(|| left.0.cmp(&right.0))
});
for (concept_key, presentation_order, depth, role_uri, parent_concept_key) in
ordered_concepts
{
let fact_group = concept_groups
.get(&concept_key)
.cloned()
.unwrap_or_default();
let (namespace_uri, local_name) = split_concept_key(&concept_key);
let qname = fact_group
.first()
@@ -1672,7 +1883,13 @@ fn empty_detail_row_map() -> DetailRowStatementMap {
}
fn statement_keys() -> [&'static str; 5] {
[
"income",
"balance",
"cash_flow",
"equity",
"comprehensive_income",
]
}
fn statement_key_ref(value: &str) -> Option<&'static str> {
@@ -1709,7 +1926,13 @@ fn pick_preferred_fact(grouped_facts: &[(i64, ParsedFact)]) -> Option<&(i64, Par
.unwrap_or_default();
left_date.cmp(&right_date)
})
.then_with(|| {
left.1
.value
.abs()
.partial_cmp(&right.1.value.abs())
.unwrap_or(std::cmp::Ordering::Equal)
})
})
}
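`pick_preferred_fact` above resolves ties with a composite comparator: later date first, then larger absolute value, with `partial_cmp` collapsing to `Equal` so a NaN value cannot panic the comparison. A self-contained sketch of the same pattern; `Fact` here is a pared-down stand-in for the parser's fact type:

```rust
// Illustrative fact: ISO-8601 dates sort chronologically as strings.
struct Fact {
    date: String,
    value: f64,
}

// Prefer the latest-dated fact; among same-date facts, the one with the
// largest absolute value. NaN comparisons fall back to Ordering::Equal.
fn pick_preferred(facts: &[Fact]) -> Option<&Fact> {
    facts.iter().max_by(|left, right| {
        left.date.cmp(&right.date).then_with(|| {
            left.value
                .abs()
                .partial_cmp(&right.value.abs())
                .unwrap_or(std::cmp::Ordering::Equal)
        })
    })
}
```

Using absolute value for the second key keeps the choice symmetric for rows reported with either sign convention.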
@@ -1779,12 +2002,6 @@ fn classify_statement_role(role_uri: &str) -> Option<String> {
fn concept_statement_fallback(local_name: &str) -> Option<String> {
let normalized = local_name.to_ascii_lowercase();
if Regex::new(r#"equity|retainedearnings|additionalpaidincapital"#)
.unwrap()
.is_match(&normalized)
@@ -1794,6 +2011,22 @@ fn concept_statement_fallback(local_name: &str) -> Option<String> {
if normalized.contains("comprehensiveincome") {
return Some("comprehensive_income".to_string());
}
if Regex::new(
r#"deferredpolicyacquisitioncosts(andvalueofbusinessacquired)?$|supplementaryinsuranceinformationdeferredpolicyacquisitioncosts$|deferredacquisitioncosts$"#,
)
.unwrap()
.is_match(&normalized)
{
return Some("balance".to_string());
}
if Regex::new(
r#"netcashprovidedbyusedin.*activities|increasedecreasein|paymentstoacquire|paymentsforcapitalimprovements$|paymentsfordepositsonrealestateacquisitions$|paymentsforrepurchase|paymentsofdividends|dividendscommonstockcash$|proceedsfrom|repaymentsofdebt|sharebasedcompensation$|allocatedsharebasedcompensationexpense$|depreciationdepletionandamortization$|depreciationamortizationandaccretionnet$|depreciationandamortization$|depreciationamortizationandother$|otheradjustmentstoreconcilenetincomelosstocashprovidedbyusedinoperatingactivities"#,
)
.unwrap()
.is_match(&normalized)
{
return Some("cash_flow".to_string());
}
if Regex::new(
r#"asset|liabilit|debt|financingreceivable|loansreceivable|deposits|allowanceforcreditloss|futurepolicybenefits|policyholderaccountbalances|unearnedpremiums|realestateinvestmentproperty|grossatcarryingvalue|investmentproperty"#,
)
@@ -1967,7 +2200,10 @@ mod tests {
vec![],
)
.expect("core pack should load and map");
let income_surface_rows = model
.surface_rows
.get("income")
.expect("income surface rows");
let op_expenses = income_surface_rows
.iter()
.find(|row| row.key == "operating_expenses")
@@ -1978,7 +2214,10 @@ mod tests {
.expect("revenue surface row");
assert_eq!(revenue.values.get("2025").copied().flatten(), Some(120.0));
assert_eq!(
op_expenses.values.get("2024").copied().flatten(),
Some(40.0)
);
assert_eq!(op_expenses.detail_count, Some(2));
let operating_expense_details = model
@@ -1987,8 +2226,12 @@ mod tests {
.and_then(|groups| groups.get("operating_expenses"))
.expect("operating expenses details");
assert_eq!(operating_expense_details.len(), 2);
assert!(operating_expense_details
.iter()
.any(|row| row.key == "sga-row"));
assert!(operating_expense_details
.iter()
.any(|row| row.key == "rd-row"));
let residual_rows = model
.detail_rows
@@ -2003,17 +2246,26 @@ mod tests {
.concept_mappings
.get("http://fasb.org/us-gaap/2024#ResearchAndDevelopmentExpense")
.expect("rd mapping");
assert_eq!(
rd_mapping.detail_parent_surface_key.as_deref(),
Some("operating_expenses")
);
assert_eq!(
rd_mapping.surface_key.as_deref(),
Some("operating_expenses")
);
let residual_mapping = model
.concept_mappings
.get("urn:company#OtherOperatingCharges")
.expect("residual mapping");
assert!(residual_mapping.residual_flag);
assert_eq!(
residual_mapping.detail_parent_surface_key.as_deref(),
Some("unmapped")
);
assert_eq!(model.normalization_summary.surface_row_count, 6);
assert_eq!(model.normalization_summary.detail_row_count, 3);
assert_eq!(model.normalization_summary.unmapped_row_count, 1);
}
@@ -2051,18 +2303,60 @@ mod tests {
#[test]
fn classifies_pack_specific_concepts_without_presentation_roles() {
assert_eq!(
concept_statement_fallback(
"FinancingReceivableExcludingAccruedInterestAfterAllowanceForCreditLoss"
)
.as_deref(),
Some("balance")
);
assert_eq!(
concept_statement_fallback("Deposits").as_deref(),
Some("balance")
);
assert_eq!(
concept_statement_fallback("RealEstateInvestmentPropertyNet").as_deref(),
Some("balance")
);
assert_eq!(
concept_statement_fallback("DeferredPolicyAcquisitionCosts").as_deref(),
Some("balance")
);
assert_eq!(
concept_statement_fallback("DeferredPolicyAcquisitionCostsAndValueOfBusinessAcquired")
.as_deref(),
Some("balance")
);
assert_eq!(
concept_statement_fallback("IncreaseDecreaseInAccountsReceivable").as_deref(),
Some("cash_flow")
);
assert_eq!(
concept_statement_fallback("PaymentsOfDividends").as_deref(),
Some("cash_flow")
);
assert_eq!(
concept_statement_fallback("RepaymentsOfDebt").as_deref(),
Some("cash_flow")
);
assert_eq!(
concept_statement_fallback("ShareBasedCompensation").as_deref(),
Some("cash_flow")
);
assert_eq!(
concept_statement_fallback("PaymentsForCapitalImprovements").as_deref(),
Some("cash_flow")
);
assert_eq!(
concept_statement_fallback("PaymentsForDepositsOnRealEstateAcquisitions").as_deref(),
Some("cash_flow")
);
assert_eq!(
concept_statement_fallback("LeaseIncome").as_deref(),
Some("income")
);
assert_eq!(
concept_statement_fallback("DirectCostsOfLeasedAndRentedPropertyOrEquipment")
.as_deref(),
Some("income")
);
}

File diff suppressed because it is too large

@@ -1,12 +1,22 @@
use anyhow::{anyhow, Context, Result};
use serde::Deserialize;
use std::collections::HashMap;
use std::env;
use std::fs;
use std::path::PathBuf;
use crate::pack_selector::FiscalPack;
fn default_include_in_output() -> bool {
true
}
#[derive(Debug, Deserialize, Clone, Copy, PartialEq, Eq)]
#[serde(rename_all = "snake_case")]
pub enum SurfaceSignTransform {
Invert,
}
#[derive(Debug, Deserialize, Clone)]
pub struct SurfacePackFile {
pub version: String,
@@ -25,9 +35,44 @@ pub struct SurfaceDefinition {
pub rollup_policy: String,
pub allowed_source_concepts: Vec<String>,
pub allowed_authoritative_concepts: Vec<String>,
pub formula_fallback: Option<SurfaceFormulaFallback>,
pub detail_grouping_policy: String,
pub materiality_policy: String,
#[serde(default = "default_include_in_output")]
pub include_in_output: bool,
#[serde(default)]
pub sign_transform: Option<SurfaceSignTransform>,
}
#[derive(Debug, Deserialize, Clone)]
#[serde(untagged)]
pub enum SurfaceFormulaFallback {
LegacyString(#[allow(dead_code)] String),
Structured(SurfaceFormula),
}
impl SurfaceFormulaFallback {
pub fn structured(&self) -> Option<&SurfaceFormula> {
match self {
Self::Structured(formula) => Some(formula),
Self::LegacyString(_) => None,
}
}
}
#[derive(Debug, Deserialize, Clone)]
pub struct SurfaceFormula {
pub op: SurfaceFormulaOp,
pub sources: Vec<String>,
#[serde(default)]
pub treat_null_as_zero: bool,
}
#[derive(Debug, Deserialize, Clone, Copy, PartialEq, Eq)]
#[serde(rename_all = "snake_case")]
pub enum SurfaceFormulaOp {
Sum,
Subtract,
}
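A structured `formula_fallback` carries an operation, an ordered source list, and a `treat_null_as_zero` flag. The sketch below shows one way such a formula could be evaluated over already-resolved canonical rows (hydration step 5 in the spec). The first-minus-rest reading of `subtract` and the `eval_formula` name are assumptions for illustration, not confirmed parser behavior:

```rust
// Illustrative op enum mirroring SurfaceFormulaOp above.
#[derive(Clone, Copy)]
enum Op {
    Sum,
    Subtract,
}

// Evaluate a formula over resolved source values for one period.
// With treat_null_as_zero false, any missing source aborts the fallback.
fn eval_formula(op: Op, sources: &[Option<f64>], treat_null_as_zero: bool) -> Option<f64> {
    let mut values = Vec::with_capacity(sources.len());
    for source in sources {
        match source {
            Some(value) => values.push(*value),
            None if treat_null_as_zero => values.push(0.0),
            None => return None, // a required source is missing
        }
    }
    let first = *values.first()?;
    let rest: f64 = values.iter().skip(1).sum();
    Some(match op {
        Op::Sum => first + rest,
        // Assumed semantics: first source minus the remaining sources.
        Op::Subtract => first - rest,
    })
}
```

Vetoing the result on a missing source is the conservative choice: the row emits nothing rather than a partial aggregate.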
#[derive(Debug, Deserialize, Clone)]
@@ -147,7 +192,9 @@ pub fn resolve_taxonomy_dir() -> Result<PathBuf> {
candidates
.into_iter()
.find(|path| path.is_dir())
.ok_or_else(|| {
anyhow!("taxonomy resolution failed: unable to locate runtime taxonomy directory")
})
}
pub fn load_surface_pack(pack: FiscalPack) -> Result<SurfacePackFile> {
@@ -156,14 +203,52 @@ pub fn load_surface_pack(pack: FiscalPack) -> Result<SurfacePackFile> {
.join("fiscal")
.join("v1")
.join(format!("{}.surface.json", pack.as_str()));
let mut file = load_surface_pack_file(&path)?;
if !matches!(pack, FiscalPack::Core) {
let core_path = taxonomy_dir
.join("fiscal")
.join("v1")
.join("core.surface.json");
let core_file = load_surface_pack_file(&core_path)?;
let pack_inherited_keys = file
.surfaces
.iter()
.filter(|surface| surface.statement == "balance" || surface.statement == "cash_flow")
.map(|surface| (surface.statement.clone(), surface.surface_key.clone()))
.collect::<std::collections::HashSet<_>>();
file.surfaces.extend(
core_file
.surfaces
.into_iter()
.filter(|surface| surface.statement == "balance" || surface.statement == "cash_flow")
.filter(|surface| {
!pack_inherited_keys
.contains(&(surface.statement.clone(), surface.surface_key.clone()))
}),
);
}
let _ = (&file.version, &file.pack);
Ok(file)
}
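`load_surface_pack` above implements hydration step 2: a non-core pack inherits any core balance and cash-flow surfaces it does not itself define, keyed by `(statement, surface_key)`. The same merge reduced to a std-only sketch; `Surface` and `merge_core_surfaces` are illustrative stand-ins for the real types:

```rust
use std::collections::HashSet;

// Pared-down surface row: just the fields the merge keys on.
#[derive(Clone, PartialEq, Debug)]
struct Surface {
    statement: String,
    surface_key: String,
}

// Append core balance/cash-flow surfaces the pack does not override.
fn merge_core_surfaces(pack: &mut Vec<Surface>, core: Vec<Surface>) {
    let inherited_statements = ["balance", "cash_flow"];
    let overridden = pack
        .iter()
        .filter(|surface| inherited_statements.contains(&surface.statement.as_str()))
        .map(|surface| (surface.statement.clone(), surface.surface_key.clone()))
        .collect::<HashSet<_>>();
    pack.extend(
        core.into_iter()
            .filter(|surface| inherited_statements.contains(&surface.statement.as_str()))
            .filter(|surface| {
                !overridden.contains(&(surface.statement.clone(), surface.surface_key.clone()))
            }),
    );
}
```

Keying on the `(statement, surface_key)` pair rather than the key alone means a pack overriding a balance row does not accidentally suppress a same-named cash-flow row from core.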
fn load_surface_pack_file(path: &PathBuf) -> Result<SurfacePackFile> {
let raw = fs::read_to_string(path).with_context(|| {
format!(
"taxonomy resolution failed: unable to read {}",
path.display()
)
})?;
serde_json::from_str::<SurfacePackFile>(&raw).with_context(|| {
format!(
"taxonomy resolution failed: unable to parse {}",
path.display()
)
})
}
pub fn load_crosswalk(regime: &str) -> Result<Option<CrosswalkFile>> {
let file_name = match regime {
"us-gaap" => "us-gaap.json",
@@ -173,10 +258,18 @@ pub fn load_crosswalk(regime: &str) -> Result<Option<CrosswalkFile>> {
let taxonomy_dir = resolve_taxonomy_dir()?;
let path = taxonomy_dir.join("crosswalk").join(file_name);
let raw = fs::read_to_string(&path).with_context(|| {
format!(
"taxonomy resolution failed: unable to read {}",
path.display()
)
})?;
let file = serde_json::from_str::<CrosswalkFile>(&raw).with_context(|| {
format!(
"taxonomy resolution failed: unable to parse {}",
path.display()
)
})?;
let _ = (&file.version, &file.regime);
Ok(Some(file))
}
@@ -188,10 +281,18 @@ pub fn load_kpi_pack(pack: FiscalPack) -> Result<KpiPackFile> {
.join("v1")
.join("kpis")
.join(format!("{}.kpis.json", pack.as_str()));
let raw = fs::read_to_string(&path).with_context(|| {
format!(
"taxonomy resolution failed: unable to read {}",
path.display()
)
})?;
let file = serde_json::from_str::<KpiPackFile>(&raw).with_context(|| {
format!(
"taxonomy resolution failed: unable to parse {}",
path.display()
)
})?;
let _ = (&file.version, &file.pack);
Ok(file)
}
@@ -202,10 +303,18 @@ pub fn load_universal_income_definitions() -> Result<UniversalIncomeFile> {
.join("fiscal")
.join("v1")
.join("universal_income.surface.json");
let raw = fs::read_to_string(&path).with_context(|| {
format!(
"taxonomy resolution failed: unable to read {}",
path.display()
)
})?;
let file = serde_json::from_str::<UniversalIncomeFile>(&raw).with_context(|| {
format!(
"taxonomy resolution failed: unable to parse {}",
path.display()
)
})?;
let _ = &file.version;
Ok(file)
}
@@ -216,10 +325,18 @@ pub fn load_income_bridge(pack: FiscalPack) -> Result<IncomeBridgeFile> {
.join("fiscal")
.join("v1")
.join(format!("{}.income-bridge.json", pack.as_str()));
let raw = fs::read_to_string(&path).with_context(|| {
format!(
"taxonomy resolution failed: unable to read {}",
path.display()
)
})?;
let file = serde_json::from_str::<IncomeBridgeFile>(&raw).with_context(|| {
format!(
"taxonomy resolution failed: unable to parse {}",
path.display()
)
})?;
let _ = (&file.version, &file.pack);
Ok(file)
}
@@ -230,17 +347,20 @@ mod tests {
#[test]
fn resolves_taxonomy_dir_and_loads_core_pack() {
let taxonomy_dir =
resolve_taxonomy_dir().expect("taxonomy dir should resolve during tests");
assert!(taxonomy_dir.exists());
let surface_pack =
load_surface_pack(FiscalPack::Core).expect("core surface pack should load");
assert_eq!(surface_pack.pack, "core");
assert!(!surface_pack.surfaces.is_empty());
let kpi_pack = load_kpi_pack(FiscalPack::Core).expect("core kpi pack should load");
assert_eq!(kpi_pack.pack, "core");
let universal_income =
load_universal_income_definitions().expect("universal income config should load");
assert!(!universal_income.rows.is_empty());
let core_bridge = load_income_bridge(FiscalPack::Core).expect("core bridge should load");

File diff suppressed because it is too large

@@ -156,7 +156,7 @@
"surface_key": "loans",
"statement": "balance",
"label": "Loans",
"category": "noncurrent_assets",
"order": 30,
"unit": "currency",
"rollup_policy": "aggregate_children",
@@ -181,7 +181,7 @@
"surface_key": "allowance_for_credit_losses",
"statement": "balance",
"label": "Allowance for Credit Losses",
"category": "noncurrent_assets",
"order": 40,
"unit": "currency",
"rollup_policy": "aggregate_children",
@@ -201,7 +201,7 @@
"surface_key": "deposits",
"statement": "balance",
"label": "Deposits",
"category": "current_liabilities",
"order": 80,
"unit": "currency",
"rollup_policy": "aggregate_children",
@@ -215,7 +215,7 @@
"surface_key": "total_assets",
"statement": "balance",
"label": "Total Assets",
"category": "derived",
"order": 90,
"unit": "currency",
"rollup_policy": "direct_only",
@@ -229,7 +229,7 @@
"surface_key": "total_liabilities",
"statement": "balance",
"label": "Total Liabilities",
"category": "derived",
"order": 100,
"unit": "currency",
"rollup_policy": "direct_only",
@@ -243,7 +243,7 @@
"surface_key": "total_equity",
"statement": "balance",
"label": "Total Equity",
"category": "equity",
"order": 110,
"unit": "currency",
"rollup_policy": "direct_only",


@@ -63,7 +63,7 @@
"surface_key": "total_assets",
"statement": "balance",
"label": "Total Assets",
"category": "derived",
"order": 90,
"unit": "currency",
"rollup_policy": "direct_only",
@@ -77,7 +77,7 @@
"surface_key": "total_liabilities",
"statement": "balance",
"label": "Total Liabilities",
"category": "derived",
"order": 100,
"unit": "currency",
"rollup_policy": "direct_only",
@@ -91,7 +91,7 @@
"surface_key": "total_equity",
"statement": "balance",
"label": "Total Equity",
"category": "equity",
"order": 110,
"unit": "currency",
"rollup_policy": "direct_only",

File diff suppressed because it is too large

@@ -119,7 +119,7 @@
"surface_key": "policy_liabilities",
"statement": "balance",
"label": "Policy Liabilities",
"category": "noncurrent_liabilities",
"order": 80,
"unit": "currency",
"rollup_policy": "aggregate_children",
@@ -145,17 +145,19 @@
"surface_key": "deferred_acquisition_costs",
"statement": "balance",
"label": "Deferred Acquisition Costs",
"category": "noncurrent_assets",
"order": 90,
"unit": "currency",
"rollup_policy": "aggregate_children",
"allowed_source_concepts": [
"us-gaap:DeferredPolicyAcquisitionCosts",
"us-gaap:DeferredAcquisitionCosts",
"us-gaap:DeferredPolicyAcquisitionCostsAndValueOfBusinessAcquired"
],
"allowed_authoritative_concepts": [
"us-gaap:DeferredPolicyAcquisitionCosts",
"us-gaap:DeferredAcquisitionCosts",
"us-gaap:DeferredPolicyAcquisitionCostsAndValueOfBusinessAcquired"
],
"formula_fallback": null,
"detail_grouping_policy": "group_all_children",
@@ -165,7 +167,7 @@
"surface_key": "total_assets",
"statement": "balance",
"label": "Total Assets",
"category": "derived",
"order": 100,
"unit": "currency",
"rollup_policy": "direct_only",
@@ -179,7 +181,7 @@
"surface_key": "total_liabilities",
"statement": "balance",
"label": "Total Liabilities",
"category": "derived",
"order": 110,
"unit": "currency",
"rollup_policy": "direct_only",
@@ -193,7 +195,7 @@
"surface_key": "total_equity",
"statement": "balance",
"label": "Total Equity",
"category": "equity",
"order": 120,
"unit": "currency",
"rollup_policy": "direct_only",


@@ -78,7 +78,7 @@
"surface_key": "investment_property",
"statement": "balance",
"label": "Investment Property",
"category": "noncurrent_assets",
"order": 40,
"unit": "currency",
"rollup_policy": "aggregate_children",
@@ -99,7 +99,7 @@
"surface_key": "total_assets",
"statement": "balance",
"label": "Total Assets",
"category": "derived",
"order": 90,
"unit": "currency",
"rollup_policy": "direct_only",
@@ -113,7 +113,7 @@
"surface_key": "total_liabilities",
"statement": "balance",
"label": "Total Liabilities",
"category": "derived",
"order": 100,
"unit": "currency",
"rollup_policy": "direct_only",
@@ -127,7 +127,7 @@
"surface_key": "total_equity",
"statement": "balance",
"label": "Total Equity",
"category": "equity",
"order": 110,
"unit": "currency",
"rollup_policy": "direct_only",
@@ -136,6 +136,25 @@
"formula_fallback": null,
"detail_grouping_policy": "top_level_only",
"materiality_policy": "balance_default"
},
{
"surface_key": "capital_expenditures",
"statement": "cash_flow",
"label": "Capital Expenditures",
"category": "investing",
"order": 130,
"unit": "currency",
"rollup_policy": "aggregate_children",
"allowed_source_concepts": [
"us-gaap:PaymentsToAcquireCommercialRealEstate",
"us-gaap:PaymentsForCapitalImprovements",
"us-gaap:PaymentsForDepositsOnRealEstateAcquisitions"
],
"allowed_authoritative_concepts": [],
"formula_fallback": null,
"detail_grouping_policy": "group_all_children",
"materiality_policy": "cash_flow_default",
"sign_transform": "invert"
}
]
}
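The `"sign_transform": "invert"` flag on `capital_expenditures` exists because concepts such as `us-gaap:PaymentsForCapitalImprovements` are filed as positive outflow amounts, while the surface row should carry the conventional negative cash-flow sign. A minimal sketch of applying the transform at resolution time (the names are illustrative, not the parser's API):

```rust
// Mirrors the pack-level SurfaceSignTransform enum above.
#[derive(Clone, Copy)]
enum SignTransform {
    Invert,
}

// Flip the sign of a resolved per-period value when the surface asks for
// it; absent values pass through untouched.
fn apply_sign_transform(value: Option<f64>, transform: Option<SignTransform>) -> Option<f64> {
    match transform {
        Some(SignTransform::Invert) => value.map(|v| -v),
        None => value,
    }
}
```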


@@ -5,7 +5,7 @@ import { hydrateFilingTaxonomySnapshot } from '@/lib/server/taxonomy/engine';
import type { TaxonomyHydrationInput, TaxonomyHydrationResult } from '@/lib/server/taxonomy/types';
type ComparisonTarget = {
statement: Extract<FinancialStatementKind, 'income' | 'balance' | 'cash_flow'>;
surfaceKey: string;
fiscalAiLabels: string[];
allowNotMeaningful?: boolean;
@@ -46,7 +46,7 @@ type FiscalAiTable = {
};
type ComparisonRow = {
statement: Extract<FinancialStatementKind, 'income' | 'balance' | 'cash_flow'>;
surfaceKey: string;
fiscalAiLabel: string | null;
fiscalAiValueM: number | null;
@@ -89,6 +89,11 @@ const CASES: CompanyCase[] = [
surfaceKey: 'net_income',
fiscalAiLabels: ['Net Income Attributable to Common Shareholders', 'Consolidated Net Income', 'Net Income']
},
{ statement: 'balance', surfaceKey: 'current_assets', fiscalAiLabels: ['Current Assets', 'Total Current Assets'] },
{ statement: 'balance', surfaceKey: 'total_assets', fiscalAiLabels: ['Total Assets'] },
{ statement: 'cash_flow', surfaceKey: 'operating_cash_flow', fiscalAiLabels: ['Cash from Operating Activities', 'Operating Cash Flow', 'Net Cash from Operations', 'Net Cash Provided by Operating'] },
{ statement: 'cash_flow', surfaceKey: 'capital_expenditures', fiscalAiLabels: ['Capital Expenditures', 'Capital Expenditure'] },
{ statement: 'cash_flow', surfaceKey: 'free_cash_flow', fiscalAiLabels: ['Free Cash Flow', 'Levered Free Cash Flow'] },
]
},
{
@@ -113,6 +118,11 @@ const CASES: CompanyCase[] = [
surfaceKey: 'net_income',
fiscalAiLabels: ['Net Income to Common', 'Net Income Attributable to Common Shareholders', 'Net Income']
},
{ statement: 'balance', surfaceKey: 'loans', fiscalAiLabels: ['Net Loans', 'Loans', 'Loans Receivable'] },
{ statement: 'balance', surfaceKey: 'total_assets', fiscalAiLabels: ['Total Assets'] },
{ statement: 'cash_flow', surfaceKey: 'operating_cash_flow', fiscalAiLabels: ['Cash from Operating Activities', 'Net Cash from Operating Activities', 'Net Cash Provided by Operating'] },
{ statement: 'cash_flow', surfaceKey: 'investing_cash_flow', fiscalAiLabels: ['Cash from Investing Activities', 'Net Cash from Investing Activities', 'Net Cash Provided by Investing'] },
{ statement: 'cash_flow', surfaceKey: 'financing_cash_flow', fiscalAiLabels: ['Cash from Financing Activities', 'Net Cash from Financing Activities', 'Net Cash Provided by Financing'] },
]
},
{
@@ -137,6 +147,18 @@ const CASES: CompanyCase[] = [
surfaceKey: 'net_income',
fiscalAiLabels: ['Net Income Attributable to Common Shareholders', 'Consolidated Net Income', 'Net Income']
},
{
statement: 'balance',
surfaceKey: 'deferred_acquisition_costs',
fiscalAiLabels: [
'Deferred Acquisition Costs',
'Deferred Policy Acquisition Costs',
'Deferred Policy Acquisition Costs and Value of Business Acquired'
]
},
{ statement: 'balance', surfaceKey: 'total_assets', fiscalAiLabels: ['Total Assets'] },
{ statement: 'cash_flow', surfaceKey: 'operating_cash_flow', fiscalAiLabels: ['Cash from Operating Activities', 'Operating Cash Flow', 'Net Cash from Operations', 'Net Cash Provided by Operating'] },
{ statement: 'cash_flow', surfaceKey: 'free_cash_flow', fiscalAiLabels: ['Free Cash Flow', 'Levered Free Cash Flow'] },
]
},
{
@@ -154,7 +176,22 @@ const CASES: CompanyCase[] = [
statement: 'income',
surfaceKey: 'net_income',
fiscalAiLabels: ['Net Income Attributable to Common Shareholders', 'Consolidated Net Income', 'Net Income']
},
{
statement: 'balance',
surfaceKey: 'investment_property',
fiscalAiLabels: [
'Investment Property',
'Investment Properties',
'Real Estate Investment Property, Net',
'Real Estate Investment Property, at Cost',
'Total real estate held for investment, at cost'
]
},
{ statement: 'balance', surfaceKey: 'total_assets', fiscalAiLabels: ['Total Assets'] },
{ statement: 'cash_flow', surfaceKey: 'operating_cash_flow', fiscalAiLabels: ['Cash from Operating Activities', 'Operating Cash Flow', 'Net Cash from Operations', 'Net Cash Provided by Operating'] },
{ statement: 'cash_flow', surfaceKey: 'capital_expenditures', fiscalAiLabels: ['Capital Expenditures', 'Capital Expenditure'] },
{ statement: 'cash_flow', surfaceKey: 'free_cash_flow', fiscalAiLabels: ['Free Cash Flow', 'Levered Free Cash Flow'] }
]
},
{
@@ -184,6 +221,9 @@ const CASES: CompanyCase[] = [
];
function parseTickerFilter(argv: string[]) {
let ticker: string | null = null;
let statement: Extract<FinancialStatementKind, 'income' | 'balance' | 'cash_flow'> | null = null;
for (const arg of argv) {
if (arg === '--help' || arg === '-h') {
console.log('Compare live Fiscal.ai standardized statement rows against local sidecar output.');
@@ -191,16 +231,26 @@ function parseTickerFilter(argv: string[]) {
console.log('Usage:');
console.log(' bun run scripts/compare-fiscal-ai-statements.ts');
console.log(' bun run scripts/compare-fiscal-ai-statements.ts --ticker=MSFT');
console.log(' bun run scripts/compare-fiscal-ai-statements.ts --statement=balance');
console.log(' bun run scripts/compare-fiscal-ai-statements.ts --statement=cash_flow');
process.exit(0);
}
if (arg.startsWith('--ticker=')) {
const value = arg.slice('--ticker='.length).trim().toUpperCase();
ticker = value.length > 0 ? value : null;
continue;
}
if (arg.startsWith('--statement=')) {
const value = arg.slice('--statement='.length).trim().toLowerCase().replace(/-/g, '_');
if (value === 'income' || value === 'balance' || value === 'cash_flow') {
statement = value;
}
}
}
return { ticker, statement };
}
function normalizeLabel(value: string) {
@@ -295,10 +345,98 @@ function chooseInstantPeriodId(result: TaxonomyHydrationResult) {
return instantPeriods[0]?.id ?? null;
}
function parseColumnLabelPeriodEnd(columnLabel: string) {
const match = columnLabel.match(/^([A-Za-z]{3})\s+'?(\d{2,4})$/);
if (!match) {
return null;
}
const [, monthToken, yearToken] = match;
const monthMap: Record<string, number> = {
jan: 0,
feb: 1,
mar: 2,
apr: 3,
may: 4,
jun: 5,
jul: 6,
aug: 7,
sep: 8,
oct: 9,
nov: 10,
dec: 11
};
const month = monthMap[monthToken.toLowerCase()];
if (month === undefined) {
return null;
}
const parsedYear = Number.parseInt(yearToken, 10);
if (!Number.isFinite(parsedYear)) {
return null;
}
const year = yearToken.length === 2 ? 2000 + parsedYear : parsedYear;
return { month, year };
}
function choosePeriodIdForColumnLabel(
result: TaxonomyHydrationResult,
statement: Extract<FinancialStatementKind, 'income' | 'balance' | 'cash_flow'>,
columnLabel: string
) {
const parsed = parseColumnLabelPeriodEnd(columnLabel);
if (!parsed) {
return null;
}
const matchingPeriods = result.periods
.filter((period): period is ResultPeriod => {
const end = periodEnd(period as ResultPeriod);
if (!end) {
return false;
}
const endDate = new Date(end);
if (Number.isNaN(endDate.getTime())) {
return false;
}
const periodMatchesStatement = statement === 'balance'
? !periodStart(period as ResultPeriod)
: Boolean(periodStart(period as ResultPeriod));
if (!periodMatchesStatement) {
return false;
}
return endDate.getUTCFullYear() === parsed.year && endDate.getUTCMonth() === parsed.month;
})
.sort((left, right) => {
if (statement !== 'balance') {
const leftStart = periodStart(left);
const rightStart = periodStart(right);
const leftDuration = leftStart
? Math.round((Date.parse(periodEnd(left) as string) - Date.parse(leftStart)) / (1000 * 60 * 60 * 24))
: -1;
const rightDuration = rightStart
? Math.round((Date.parse(periodEnd(right) as string) - Date.parse(rightStart)) / (1000 * 60 * 60 * 24))
: -1;
if (leftDuration !== rightDuration) {
return rightDuration - leftDuration;
}
}
return Date.parse(periodEnd(right) as string) - Date.parse(periodEnd(left) as string);
});
return matchingPeriods[0]?.id ?? null;
}
function findSurfaceValue(
result: TaxonomyHydrationResult,
statement: Extract<FinancialStatementKind, 'income' | 'balance' | 'cash_flow'>,
surfaceKey: string,
referenceColumnLabel?: string
) {
const rows = result.surface_rows[statement] ?? [];
const row = rows.find((entry) => entry.key === surfaceKey) ?? null;
@@ -306,9 +444,11 @@ function findSurfaceValue(
return { row: null, value: null };
}
const periodId = (referenceColumnLabel
? choosePeriodIdForColumnLabel(result, statement, referenceColumnLabel)
: null) ?? (statement === 'balance'
? chooseInstantPeriodId(result)
: chooseDurationPeriodId(result));
if (periodId) {
const directValue = row.values[periodId];
@@ -412,14 +552,24 @@ async function fetchLatestAnnualFiling(company: CompanyCase): Promise<TaxonomyHy
async function scrapeFiscalAiTable(
page: import('@playwright/test').Page,
exchangeTicker: string,
statement: 'income' | 'balance' | 'cash_flow'
): Promise<FiscalAiTable> {
const pagePath = statement === 'income'
? 'income-statement'
: statement === 'balance'
? 'balance-sheet'
: 'cash-flow-statement';
const url = `https://fiscal.ai/company/${exchangeTicker}/financials/${pagePath}/annual/?templateType=standardized`;
await page.goto(url, { waitUntil: 'domcontentloaded', timeout: 120_000 });
await page.waitForSelector('table', { timeout: 120_000 });
await page.waitForTimeout(2_500);
await page.evaluate(async () => {
window.scrollTo(0, document.body.scrollHeight);
await new Promise((resolve) => setTimeout(resolve, 750));
window.scrollTo(0, 0);
await new Promise((resolve) => setTimeout(resolve, 250));
});
return await page.evaluate(() => {
function normalizeLabel(value: string) {
@@ -452,45 +602,52 @@ async function scrapeFiscalAiTable(
return Number.isFinite(parsed) ? (negative ? -Math.abs(parsed) : parsed) : null;
}
const tables = Array.from(document.querySelectorAll('table'));
if (tables.length === 0) {
throw new Error('Fiscal.ai table not found');
}
const rowsByLabel = new Map<string, FiscalAiTableRow>();
let columnLabel = 'unknown';
for (const table of tables) {
const headerCells = Array.from(table.querySelectorAll('tr:first-child th, tr:first-child td'))
.map((cell) => cell.textContent?.trim() ?? '')
.filter((value) => value.length > 0);
const annualColumnIndex = headerCells.findIndex((value, index) => index > 0 && value !== 'LTM');
if (annualColumnIndex < 0) {
continue;
}
if (columnLabel === 'unknown') {
columnLabel = headerCells[annualColumnIndex] ?? 'unknown';
}
for (const row of Array.from(table.querySelectorAll('tr')).slice(1)) {
const cells = Array.from(row.querySelectorAll('td'));
if (cells.length <= annualColumnIndex) {
continue;
}
const label = cells[0]?.textContent?.trim() ?? '';
const valueText = cells[annualColumnIndex]?.textContent?.trim() ?? '';
if (!label) {
continue;
}
rowsByLabel.set(label, {
label,
normalizedLabel: normalizeLabel(label),
valueText,
value: parseDisplayedNumber(valueText)
});
}
}
const rows = Array.from(rowsByLabel.values());
return {
columnLabel,
rows
};
});
@@ -536,7 +693,7 @@ function compareRow(
): ComparisonRow {
const fiscalAiRow = findFiscalAiRow(fiscalAiTable.rows, target.fiscalAiLabels);
const fiscalAiValueM = fiscalAiRow?.value ?? null;
const ourSurface = findSurfaceValue(result, target.statement, target.surfaceKey, fiscalAiTable.columnLabel);
const ourValueM = roundMillions(ourSurface.value);
const absDiffM = absoluteDiff(ourValueM, fiscalAiValueM);
const relDiffValue = relativeDiff(ourValueM, fiscalAiValueM);
@@ -587,17 +744,34 @@ async function compareCase(page: import('@playwright/test').Page, company: Compa
throw new Error(`${company.ticker} parse_status=${result.parse_status}${result.parse_error ? ` parse_error=${result.parse_error}` : ''}`);
}
const statementKinds = new Set(company.comparisons.map((target) => target.statement));
const incomeTable = statementKinds.has('income')
? await scrapeFiscalAiTable(page, company.exchangeTicker, 'income')
: null;
const balanceTable = statementKinds.has('balance')
? await scrapeFiscalAiTable(page, company.exchangeTicker, 'balance')
: null;
const cashFlowTable = statementKinds.has('cash_flow')
? await scrapeFiscalAiTable(page, company.exchangeTicker, 'cash_flow')
: null;
const rows = company.comparisons.map((target) => {
const table = target.statement === 'income'
? incomeTable
: target.statement === 'balance'
? balanceTable
: cashFlowTable;
if (!table) {
throw new Error(`Missing scraped table for ${target.statement}`);
}
return compareRow(target, result, table);
});
const failures = rows.filter(
(row) => row.status === 'fail' || row.status === 'missing_ours' || row.status === 'missing_reference'
);
console.log(
`[compare-fiscal-ai] ${company.ticker} filing=${filing.accessionNumber} fiscal_pack=${result.fiscal_pack ?? 'null'} income_column="${incomeTable?.columnLabel ?? 'n/a'}" balance_column="${balanceTable?.columnLabel ?? 'n/a'}" cash_flow_column="${cashFlowTable?.columnLabel ?? 'n/a'}" pass=${rows.length - failures.length}/${rows.length}`
);
for (const row of rows) {
console.log(
@@ -625,18 +799,28 @@ async function compareCase(page: import('@playwright/test').Page, company: Compa
async function main() {
process.env.XBRL_ENGINE_TIMEOUT_MS = process.env.XBRL_ENGINE_TIMEOUT_MS ?? '180000';
const filters = parseTickerFilter(process.argv.slice(2));
const selectedCases = (filters.ticker
? CASES.filter((entry) => entry.ticker === filters.ticker)
: CASES
)
.map((entry) => ({
...entry,
comparisons: filters.statement
? entry.comparisons.filter((target) => target.statement === filters.statement)
: entry.comparisons
}))
.filter((entry) => entry.comparisons.length > 0);
if (selectedCases.length === 0) {
console.error(
`[compare-fiscal-ai] no matching cases for ticker=${filters.ticker ?? 'all'} statement=${filters.statement ?? 'all'}`
);
process.exitCode = 1;
return;
}
const browser = await chromium.launch({ headless: true });
const page = await browser.newPage({
userAgent: BROWSER_USER_AGENT
});