Cache and reuse filing HTML during KPI note extraction #18

Open
opened 2026-03-15 01:25:59 +00:00 by Francy51 · 0 comments

Structured KPI extraction from filing notes repeatedly refetches and reparses the same filing HTML inside nested loops.

Why this is a problem:

  • Needlessly amplifies SEC requests.
  • Repeats expensive HTML parsing work.
  • Increases latency and makes throttling more likely.
  • Violates the performance-first priority.

Observed in:

  • lib/server/financials/kpi-notes.ts
    • extractStructuredKpisFromNotes() loops over definitions and periods
    • each period/definition path can call fetchHtml() for the same filing again
    • each fetched document is reparsed with Cheerio
  • lib/server/financial-taxonomy.ts
    • invokes note extraction in the taxonomy pipeline

Suggested direction:

  • Fetch each filing HTML at most once per extraction run.
  • Parse each filing HTML once and reuse the parsed representation.
  • Restructure the algorithm to iterate filing -> parsed tables -> matching definitions, instead of definition -> period -> fetch.
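One way the direction above could look, as a minimal sketch rather than the actual code in kpi-notes.ts. Every name here (`createFilingLoader`, `extractKpis`, `PeriodSource`, `KpiDefinition`, `ParsedTable`) is hypothetical, and the real `fetchHtml()` signature is assumed to be `(url) => Promise<string>`. The parser is injected so the sketch does not depend on how Cheerio is used today.

```typescript
// Hypothetical shapes; the real types in kpi-notes.ts may differ.
type FetchHtml = (url: string) => Promise<string>;
interface ParsedTable { title: string; rows: string[][] }
interface PeriodSource { filingUrl: string; period: string }
interface KpiDefinition { key: string; matches: (table: ParsedTable) => boolean }
interface KpiHit { key: string; period: string; table: ParsedTable }

// Returns a loader that fetches and parses each URL at most once.
// Create one loader per extraction run so the cache does not outlive it.
function createFilingLoader<T>(
  fetchHtml: FetchHtml,
  parseHtml: (html: string) => T,
): (url: string) => Promise<T> {
  // Cache the promise, not the value, so concurrent callers share one request.
  const cache = new Map<string, Promise<T>>();
  return (url: string): Promise<T> => {
    const cached = cache.get(url);
    if (cached) return cached;
    const pending = fetchHtml(url).then(parseHtml);
    // Drop failed entries so a later call can retry the fetch.
    pending.catch(() => {
      if (cache.get(url) === pending) cache.delete(url);
    });
    cache.set(url, pending);
    return pending;
  };
}

// Iterates filing -> parsed tables -> matching definitions,
// instead of definition -> period -> fetch.
async function extractKpis(
  sources: PeriodSource[],
  definitions: KpiDefinition[],
  loadTables: (url: string) => Promise<ParsedTable[]>,
): Promise<KpiHit[]> {
  const hits: KpiHit[] = [];
  for (const source of sources) {
    const tables = await loadTables(source.filingUrl); // cached after first call
    for (const table of tables) {
      for (const definition of definitions) {
        if (definition.matches(table)) {
          hits.push({ key: definition.key, period: source.period, table });
        }
      }
    }
  }
  return hits;
}
```

In this sketch the loader would be built once per run, for example `createFilingLoader(fetchHtml, parseTables)`, where `parseTables` is an assumed wrapper around the existing Cheerio parsing. Caching the promise rather than the parsed result means two periods that point at the same filing share a single in-flight request.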

Acceptance criteria:

  • Each filing document is fetched at most once per extraction pass.
  • HTML parsing is reused rather than repeated for each definition.
  • Extraction latency and outbound SEC request volume are materially reduced.
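The first criterion could be checked in a test by wrapping the fetcher with a call counter. This is a rough sketch under the same assumption that `fetchHtml()` takes a URL and returns a promise of HTML; `countFetches` and its return shape are hypothetical names, not existing helpers in the repository.

```typescript
type FetchHtml = (url: string) => Promise<string>;

// Wraps a fetcher and records how many times each URL was requested.
function countFetches(fetchHtml: FetchHtml) {
  const counts = new Map<string, number>();
  const counted: FetchHtml = (url: string) => {
    counts.set(url, (counts.get(url) ?? 0) + 1);
    return fetchHtml(url);
  };
  // URLs requested more than once; empty when the criterion holds.
  const repeated = (): string[] =>
    Array.from(counts.entries())
      .filter(([, n]) => n > 1)
      .map(([url]) => url);
  return { fetchHtml: counted, counts, repeated };
}
```

A test would pass the wrapped `fetchHtml` into one extraction pass and then assert that `repeated()` returns an empty array.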
Francy51 added the P1 label 2026-03-15 01:25:59 +00:00
Reference: Francy51/Neon-Desk#18