Build first-party HTML content extractor (replace fivefilters/readability.php) #29

Open
opened 2026-04-26 16:28:26 +02:00 by myrmidex · 0 comments
Owner

Context

Ticket #12 introduces fivefilters/readability.php (dev-master constraint) for HTML main-content extraction (title, main text, outbound links, word count). It works for v0.1, but the dev-master constraint is a long-term liability: the branch carries no semantic-versioning guarantees, so a commit on master could break our code without warning, and the lockfile pin only protects us until the next composer update.

Build a first-party replacement so we own this critical step in the search pipeline.

Acceptance

  • Decide deliverable shape: in-tree class in app/ vs extracted package in packages/Lvl0/Readability/. Likely the latter for the same reasons FediDiscover lives in packages/: testability, future external use, narrow contract.
  • Define a contract — Lvl0\Readability\ContentExtractor (or similar) with extract(string $html, ?string $baseUrl = null): ExtractedContent returning a value object with title, mainText, outboundLinks, wordCount.
  • Implement extraction. Approach options to evaluate:
    • Mozilla Readability.js algorithm in PHP (port the scoring logic)
    • Pragmatic: tag stripping + <p> / <h*> / <li> extraction with a minimum-character threshold to drop sidebar noise
    • Trafilatura-style heuristics (open question whether worth porting from Python)
  • Build a test fixture corpus from real-world HTML (varied sites: WordPress blog, Medium, Substack, Hugo static site, hand-written HTML). Compare output against the fivefilters baseline on every fixture. Don't ship until extraction quality is at least as good as that baseline across the corpus.
  • Drop fivefilters/readability.php dep from composer.json once swap is complete.
  • Update PLATFORM.md.
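
A minimal sketch of the contract named in the checklist above. The namespace, the `extract()` signature and the four field names come from this ticket; the property types, the `final`/`readonly` modifiers (PHP 8.1+) and the `list<string>` shape of `outboundLinks` are assumptions, not a settled design.

```php
<?php
declare(strict_types=1);

namespace Lvl0\Readability;

/**
 * Immutable result of one extraction run.
 * Assumption: readonly promoted properties, which require PHP 8.1 or later.
 */
final class ExtractedContent
{
    /**
     * @param list<string> $outboundLinks URLs found in the main content
     *                                    (assumed resolved against $baseUrl when one is given)
     */
    public function __construct(
        public readonly string $title,
        public readonly string $mainText,
        public readonly array $outboundLinks,
        public readonly int $wordCount,
    ) {
    }
}

/**
 * Narrow contract for the extractor; whichever approach is chosen
 * (scoring port, pragmatic tag extraction, other heuristics) would implement this.
 */
interface ContentExtractor
{
    public function extract(string $html, ?string $baseUrl = null): ExtractedContent;
}
```

Keeping the result a plain value object would let the fixture tests compare a first-party implementation and a thin fivefilters adapter through the same interface.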

Notes

  • This is v0.2 or later — fivefilters is fine for v0.1 launch. Trigger this when (a) we hit a fivefilters bug we can't fix, OR (b) we want a different extraction strategy (e.g. trade some main-text quality for better link extraction), OR (c) we just have time and want to own this layer.
  • The composer.lock does freeze fivefilters' commit hash, so day-to-day stability is OK. The risk is "what happens at the next composer update" — a deliberate update that could pull a breaking master change.
  • Search quality depends critically on this step — sidebar/nav/footer noise pollutes both keywords_tsv (tokenization input from #13) and wordCount (a ranking signal). Worth doing well.
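
To make the noise concern concrete, here is a rough sketch of the "pragmatic" option from the Acceptance list: collect `<p>`, `<h*>` and `<li>` text, skip anything inside nav/aside/footer/header, and drop short non-heading blocks. The function name `extractMainText`, the 40-character default and the list of excluded ancestors are all hypothetical choices for illustration; it assumes the `dom` and `mbstring` extensions.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper, not part of any existing package.
 *
 * @return array{mainText: string, wordCount: int}
 */
function extractMainText(string $html, int $minChars = 40): array
{
    if (trim($html) === '') {
        return ['mainText' => '', 'wordCount' => 0];
    }

    // Encode non-ASCII characters as numeric entities so DOMDocument
    // does not misread UTF-8 input as ISO-8859-1.
    $html = mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $query = '//*[self::p or self::h1 or self::h2 or self::h3'
        . ' or self::h4 or self::h5 or self::h6 or self::li]'
        . '[not(ancestor::nav or ancestor::aside or ancestor::footer or ancestor::header)]';

    $blocks = [];
    foreach ($xpath->query($query) as $node) {
        // Collapse runs of whitespace left over from markup indentation.
        $text = trim(preg_replace('/\s+/u', ' ', $node->textContent) ?? '');
        $isHeading = $node->nodeName[0] === 'h'; // only h1-h6 can match here
        // Headings are kept regardless of length; other blocks must reach the threshold.
        if ($text === '' || (!$isHeading && mb_strlen($text) < $minChars)) {
            continue;
        }
        $blocks[] = $text;
    }

    $mainText = implode("\n", $blocks);
    $words = preg_split('/\s+/u', $mainText, -1, PREG_SPLIT_NO_EMPTY) ?: [];

    return ['mainText' => $mainText, 'wordCount' => count($words)];
}
```

This sketch does not deduplicate nested matches (a `<p>` inside an `<li>` would be counted twice) and does no link extraction, so it only illustrates how the threshold keeps menu items and short widget labels out of `mainText` and `wordCount`.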

Out of scope

  • Performance optimization of the extractor — first version just needs to match fivefilters' output quality.
  • Image/video metadata extraction — different concern, separate ticket if it ever matters.
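
The ticket does not say how "match fivefilters' output quality" would be measured on the fixture corpus. One possible yardstick, offered purely as an assumption, is a token-set (Jaccard) overlap between the baseline's main text and the candidate's; the function names `tokenSet` and `jaccardSimilarity` are hypothetical, and the sketch assumes the `mbstring` extension.

```php
<?php
declare(strict_types=1);

/**
 * Lowercased word tokens of $text as a set (keys are tokens).
 *
 * @return array<array-key, true>
 */
function tokenSet(string $text): array
{
    // Split on anything that is not a Unicode letter or digit.
    $tokens = preg_split('/[^\p{L}\p{N}]+/u', mb_strtolower($text), -1, PREG_SPLIT_NO_EMPTY) ?: [];

    return array_fill_keys($tokens, true);
}

/**
 * Jaccard similarity of the two token sets: |A ∩ B| / |A ∪ B|.
 * Ignores word order and repetition, so it is a coarse signal only.
 */
function jaccardSimilarity(string $baseline, string $candidate): float
{
    $a = tokenSet($baseline);
    $b = tokenSet($candidate);

    if ($a === [] && $b === []) {
        return 1.0; // two empty extractions are treated as identical
    }

    $intersection = count(array_intersect_key($a, $b));
    $union = count($a + $b);

    return $intersection / $union;
}
```

A fixture test could then require a per-site score above some agreed floor, with low scores flagged for manual review, since a low score can mean the candidate is worse or that it dropped noise the baseline kept.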
myrmidex added this to the v0.2 milestone 2026-04-26 16:28:26 +02:00
myrmidex self-assigned this 2026-04-26 16:28:26 +02:00
myrmidex modified the milestone from v0.2 to v0.3 2026-04-29 23:55:38 +02:00
myrmidex added the enhancement label 2026-05-01 01:02:00 +02:00
Reference: lvl0/trove#29