Build first-party HTML content extractor (replace fivefilters/readability.php) #29
Context
Ticket #12 introduces `fivefilters/readability.php` (`dev-master` constraint) for HTML main-content extraction (title, main text, outbound links, word count). It works for v0.1, but the `dev-master` requirement is a long-term liability: no semantic versioning, master could break our code without warning, and the lockfile pin only protects us until the next `composer update`.

Build a first-party replacement so we own this critical step in the search pipeline.
Acceptance
- `app/` vs extracted package in `packages/Lvl0/Readability/`. Likely the latter, for the same reasons FediDiscover lives in `packages/`: testability, future external use, narrow contract.
- `Lvl0\Readability\ContentExtractor` (or similar) with `extract(string $html, ?string $baseUrl = null): ExtractedContent`, returning a value object with `title`, `mainText`, `outboundLinks`, `wordCount`.
- `<p>`/`<h*>`/`<li>` extraction with a minimum-character threshold to drop sidebar noise.
- Remove the `fivefilters/readability.php` dep from `composer.json` once the swap is complete.
Notes
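A minimal sketch of how the acceptance criteria could fit together, assuming PHP 8.1+ with the DOM and mbstring extensions. Only the class name, the `extract()` signature, and the four fields come from this ticket. The 40-character default, the rule that headings bypass the threshold, and treating "outbound" as "absolute link to a host other than the one in `$baseUrl`" are illustrative assumptions, not decisions made here.

```php
<?php
declare(strict_types=1);

namespace Lvl0\Readability;

use DOMDocument;
use DOMXPath;

/** Value object with the four fields named in the acceptance criteria. */
final class ExtractedContent
{
    /** @param list<string> $outboundLinks */
    public function __construct(
        public readonly string $title,
        public readonly string $mainText,
        public readonly array $outboundLinks,
        public readonly int $wordCount,
    ) {
    }
}

final class ContentExtractor
{
    // Assumption: 40 characters is an illustrative default, not a tuned value.
    public function __construct(private readonly int $minChars = 40)
    {
    }

    public function extract(string $html, ?string $baseUrl = null): ExtractedContent
    {
        if (trim($html) === '') {
            return new ExtractedContent('', '', [], 0); // loadHTML() rejects empty input
        }

        $doc = new DOMDocument();
        $previous = libxml_use_internal_errors(true); // real-world HTML is rarely valid
        $doc->loadHTML('<?xml encoding="UTF-8">' . $html); // prefix hints UTF-8 to libxml
        libxml_clear_errors();
        libxml_use_internal_errors($previous);
        $xpath = new DOMXPath($doc);

        $title = $this->squash($xpath->query('//title')->item(0)?->textContent ?? '');

        // An <li> that wraps <p> or a nested <li> is skipped so its text is not counted twice.
        $query = '//p | //h1 | //h2 | //h3 | //h4 | //h5 | //h6 | //li[not(.//p) and not(.//li)]';
        $blocks = [];
        foreach ($xpath->query($query) as $node) {
            $text = $this->squash($node->textContent);
            $isHeading = preg_match('/^h[1-6]$/i', $node->nodeName) === 1;
            // Assumption: headings bypass the threshold; short <p>/<li> count as noise.
            if ($text !== '' && ($isHeading || mb_strlen($text) >= $this->minChars)) {
                $blocks[] = $text;
            }
        }
        $mainText = implode("\n", $blocks);

        // Assumption: "outbound" means an absolute link to a different host than $baseUrl.
        $baseHost = $baseUrl !== null ? parse_url($baseUrl, PHP_URL_HOST) : null;
        $links = [];
        foreach ($xpath->query('//a[@href]') as $anchor) {
            $href = trim($anchor->getAttribute('href'));
            $host = parse_url($href, PHP_URL_HOST);
            if (!is_string($host) || $host === '') {
                continue; // relative, mailto:, or unparseable
            }
            if (is_string($baseHost) && strcasecmp($host, $baseHost) === 0) {
                continue; // same-site link
            }
            $links[$href] = true; // keyed by URL to de-duplicate
        }

        $words = preg_split('/\s+/u', $mainText, -1, PREG_SPLIT_NO_EMPTY) ?: [];

        return new ExtractedContent($title, $mainText, array_keys($links), count($words));
    }

    /** Collapse runs of whitespace and trim the ends. */
    private function squash(string $text): string
    {
        return trim(preg_replace('/\s+/u', ' ', $text) ?? '');
    }
}
```

Calling `(new ContentExtractor())->extract($html, $pageUrl)` returns an immutable `ExtractedContent`. A flat threshold like this will lose the parent text of nested lists and does nothing about boilerplate inside long footers, so a real implementation would likely need scoring on top.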
- `composer.lock` does freeze fivefilters' commit hash, so day-to-day stability is OK. The risk is "what happens at the next `composer update`": a deliberate update that could pull a breaking master change.
- `keywords_tsv` (tokenization input from #13) and `wordCount` (a ranking signal). Worth doing well.
Out of scope