URL-pattern pre-filter: skip non-HTML extensions before page row creation #28

Closed

opened 2026-04-26 16:24:05 +02:00 by myrmidex · 0 comments

Owner

Context

Ticket #12 rejects non-HTML pages at fetch time: when the fetcher gets back Content-Type: application/pdf (or any other non-HTML type), the page row is marked status=Rejected and stays in the DB to prevent re-discovery loops. The trade-off accepted for v0.1: we still pay the network round-trip on every PDF/zip/image URL we encounter.

A cheaper-and-earlier optimization: when a URL is discovered with an obvious non-HTML extension (.pdf, .zip, .mp4, .jpg, etc.), skip creating the page row entirely. No fetch, no pages row, no page_crawls row.

Acceptance

- Define the canonical list of "obviously non-HTML" extensions in config/crawler.php (e.g. pdf, zip, tar, gz, mp3, mp4, webm, jpg, jpeg, png, gif, svg, webp, ico, woff, woff2, ttf, otf, exe, dmg, iso, ...)
- New method on App\Services\UrlService, e.g. looksLikeBinary(string $url): bool, that extracts the path's final segment, parses the extension, and returns true if it matches the list
- Apply at every URL ingestion site:
  - App\Listeners\UrlDiscoveredListener: if the discovered URL looks binary, skip creating the page (no event side effects).
  - App\Livewire\UrlSubmissionForm::submit(): return a validation error to the user ("PDF / image URLs aren't indexed")
  - The crawler's outbound-links extractor (#12 returns links, worker dispatches UrlDiscovered events for each): pre-filter the array before dispatching
- Tests covering each ingestion point, plus unit tests on UrlService::looksLikeBinary() covering edge cases (no extension, double-dot like .tar.gz, query strings like /file.pdf?download=1, fragment like #section)
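
A minimal sketch of how the extension check and the outbound-link pre-filter could look, written as plain PHP so it runs outside the app. Everything beyond the looksLikeBinary(string $url): bool signature is an assumption: the NON_HTML_EXTENSIONS constant stands in for whatever key config/crawler.php ends up exposing, and withoutBinaryLinks() is a hypothetical helper name for the pre-filter step.

```php
<?php

declare(strict_types=1);

// Hypothetical stand-in for the list that would live in config/crawler.php.
// The "..." in the ticket's list is left open; only the named entries appear here.
const NON_HTML_EXTENSIONS = [
    'pdf', 'zip', 'tar', 'gz', 'mp3', 'mp4', 'webm',
    'jpg', 'jpeg', 'png', 'gif', 'svg', 'webp', 'ico',
    'woff', 'woff2', 'ttf', 'otf', 'exe', 'dmg', 'iso',
];

final class UrlService
{
    /** @param list<string> $extensions lowercase, without the leading dot */
    public function __construct(private array $extensions = NON_HTML_EXTENSIONS)
    {
    }

    public function looksLikeBinary(string $url): bool
    {
        // parse_url() splits off the query string and fragment, so
        // "/file.pdf?download=1" and "/file.pdf#section" both yield "/file.pdf".
        $path = parse_url($url, PHP_URL_PATH);
        if (!is_string($path) || $path === '') {
            return false;
        }

        // Only the final path segment counts; dots in directory names are ignored.
        $slash = strrpos($path, '/');
        $segment = $slash === false ? $path : substr($path, $slash + 1);

        // Take whatever follows the last dot, so "backup.tar.gz" is checked as "gz".
        $dot = strrpos($segment, '.');
        if ($dot === false) {
            return false;
        }
        $extension = strtolower(substr($segment, $dot + 1));

        return $extension !== '' && in_array($extension, $this->extensions, true);
    }

    /**
     * Hypothetical helper: drop binary-looking links before dispatching events.
     *
     * @param list<string> $urls
     * @return list<string>
     */
    public function withoutBinaryLinks(array $urls): array
    {
        return array_values(array_filter(
            $urls,
            fn (string $url): bool => !$this->looksLikeBinary($url),
        ));
    }
}
```

Because only the text after the last dot is compared, .tar.gz is caught through the gz entry and the list needs no compound extensions. A URL such as /download/123 has no dot in its final segment, so it passes through to the fetcher as described in the Notes.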

Notes

- This is a pre-filter, not a complete filter. URLs like /download/123 or /api/file still hit the fetcher and get Rejected there. That's fine: the pre-filter catches the obvious 90%, and the post-fetch check catches the rest.
- Worth bikeshedding: should we also pre-filter Content-Disposition-like patterns (e.g. ?download=1)? Probably not, as that would be too aggressive. Stick to extensions only.
- Side benefit: spam URL submissions (https://example.com/malware.exe) get rejected at the form layer, not after a network round-trip.
