Build first-party HTML content extractor (replace fivefilters/readability.php) #29

Open

opened 2026-04-26 16:28:26 +02:00 by myrmidex · 0 comments

myrmidex commented

2026-04-26 16:28:26 +02:00

Owner

Context

Ticket #12 introduces fivefilters/readability.php (dev-master constraint) for HTML main-content extraction (title, main text, outbound links, word count). It works for v0.1, but the dev-master requirement is a long-term liability: there is no semantic versioning, master could break our code without warning, and the lockfile pin only protects us until the next composer update.

Build a first-party replacement so we own this critical step in the search pipeline.

Acceptance

Decide deliverable shape: in-tree class in app/ vs extracted package in packages/Lvl0/Readability/. Likely the latter for the same reasons FediDiscover lives in packages/: testability, future external use, narrow contract.
Define a contract — Lvl0\Readability\ContentExtractor (or similar) with extract(string $html, ?string $baseUrl = null): ExtractedContent returning a value object with title, mainText, outboundLinks, wordCount.
Implement extraction. Approach options to evaluate:
- Mozilla Readability.js algorithm in PHP (port the scoring logic)
- Pragmatic: tag stripping + <p> / <h*> / <li> extraction with a minimum-character threshold to drop sidebar noise
- Trafilatura-style heuristics (open question whether worth porting from Python)
Build a test fixture corpus from real-world HTML (varied sites: WordPress blog, Medium, Substack, Hugo static site, hand-written HTML). Compare output against the fivefilters baseline. Don't ship until extraction quality is at least as good on the corpus.
Drop fivefilters/readability.php dep from composer.json once swap is complete.
Update PLATFORM.md.
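
To make the contract and the "pragmatic" option above concrete, here is a minimal sketch, assuming PHP 8.1+ with the dom and mbstring extensions. `ContentExtractor`, `ExtractedContent`, and the `extract()` signature come from this ticket. The class name `ParagraphThresholdExtractor`, the 40-character default, and the rule "outbound means an absolute link inside a kept block whose host differs from `$baseUrl`" are assumptions for illustration, not decisions.

```php
<?php

declare(strict_types=1);

namespace Lvl0\Readability;

use DOMDocument;
use DOMElement;
use DOMXPath;

/** Value object with the four fields listed in the acceptance criteria. */
final class ExtractedContent
{
    /** @param list<string> $outboundLinks */
    public function __construct(
        public readonly string $title,
        public readonly string $mainText,
        public readonly array $outboundLinks,
        public readonly int $wordCount,
    ) {
    }
}

interface ContentExtractor
{
    public function extract(string $html, ?string $baseUrl = null): ExtractedContent;
}

/** Hypothetical class for the "pragmatic" option; name and defaults are assumptions. */
final class ParagraphThresholdExtractor implements ContentExtractor
{
    public function __construct(private readonly int $minChars = 40)
    {
    }

    public function extract(string $html, ?string $baseUrl = null): ExtractedContent
    {
        $doc = new DOMDocument();
        // Real-world HTML is rarely valid, so collect libxml warnings instead of emitting them.
        $previous = libxml_use_internal_errors(true);
        // The XML declaration prefix is a common workaround to make libxml read the input as UTF-8.
        $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
        libxml_clear_errors();
        libxml_use_internal_errors($previous);

        $xpath = new DOMXPath($doc);
        $titleNode = $xpath->query('//title')->item(0);
        $title = $titleNode !== null ? trim($titleNode->textContent) : '';
        $baseHost = $baseUrl !== null ? parse_url($baseUrl, PHP_URL_HOST) : null;

        // Leaf-level text blocks only, so a <p> inside an <li> is not counted twice.
        $blockQuery = '//*[(self::p or self::li or self::h1 or self::h2 or self::h3'
            . ' or self::h4 or self::h5 or self::h6) and not(.//p) and not(.//li)]';

        $blocks = [];
        $links = [];
        foreach ($xpath->query($blockQuery) as $node) {
            $text = trim(preg_replace('/\s+/u', ' ', $node->textContent) ?? '');
            if (mb_strlen($text) < $this->minChars) {
                continue; // Too short: likely navigation, sidebar, or footer noise.
            }
            $blocks[] = $text;

            // Only links inside kept blocks are considered, which skips nav links.
            foreach ($xpath->query('.//a[@href]', $node) as $anchor) {
                if (!$anchor instanceof DOMElement) {
                    continue;
                }
                $href = trim($anchor->getAttribute('href'));
                $host = parse_url($href, PHP_URL_HOST);
                // Relative links have no host and same-host links are internal.
                if (is_string($host) && $host !== $baseHost) {
                    $links[$href] = true;
                }
            }
        }

        $mainText = implode("\n\n", $blocks);
        $words = preg_split('/\s+/u', $mainText, -1, PREG_SPLIT_NO_EMPTY) ?: [];

        return new ExtractedContent($title, $mainText, array_keys($links), count($words));
    }
}
```

Known gaps in this sketch: a single threshold also drops short headings, an `<li>` that wraps a nested list loses its own direct text, and relative links are never resolved against `$baseUrl`. Whether any of that is acceptable is exactly what the fixture corpus should tell us.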

Notes

This is v0.2 or later — fivefilters is fine for v0.1 launch. Trigger this when (a) we hit a fivefilters bug we can't fix, OR (b) we want a different extraction strategy (e.g. trade some main-text quality for better link extraction), OR (c) we just have time and want to own this layer.
The composer.lock does freeze fivefilters' commit hash, so day-to-day stability is OK. The risk is "what happens at the next composer update" — a deliberate update that could pull a breaking master change.
Search quality depends critically on this step — sidebar/nav/footer noise pollutes both keywords_tsv (tokenization input from #13) and wordCount (a ranking signal). Worth doing well.
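
Since noise shows up in both the token stream and the word count, the corpus comparison against the fivefilters baseline could track those two signals per fixture. The sketch below is one possible scoring, not an agreed metric: the function names `tokens`, `tokenJaccard`, and `wordCountRatio` are hypothetical, and the tokenizer is a stand-in for whatever #13 actually uses. Note that it measures agreement with the baseline, so a low score flags a fixture for manual review rather than proving a regression.

```php
<?php

declare(strict_types=1);

/**
 * Lowercased word tokens (runs of letters and digits).
 *
 * @return list<string>
 */
function tokens(string $text): array
{
    return preg_split('/[^\p{L}\p{N}]+/u', mb_strtolower($text), -1, PREG_SPLIT_NO_EMPTY) ?: [];
}

/** Jaccard similarity of the two token sets: 1.0 means identical vocabularies. */
function tokenJaccard(string $candidate, string $baseline): float
{
    $a = array_unique(tokens($candidate));
    $b = array_unique(tokens($baseline));
    if ($a === [] && $b === []) {
        return 1.0; // Two empty extractions agree trivially.
    }
    $intersection = count(array_intersect($a, $b));
    $union = count(array_unique(array_merge($a, $b)));

    return $intersection / $union;
}

/** Candidate word count relative to the baseline; values well above 1.0 hint at leaked noise. */
function wordCountRatio(string $candidate, string $baseline): float
{
    return count(tokens($candidate)) / max(1, count(tokens($baseline)));
}
```

A fixture test could then assert, for example, that `tokenJaccard()` stays above a threshold chosen after looking at the corpus, and that `wordCountRatio()` stays within a band around 1.0.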

Out of scope

Performance optimization of the extractor — first version just needs to match fivefilters' output quality.
Image/video metadata extraction — different concern, separate ticket if it ever matters.
