URL-pattern pre-filter: skip non-HTML extensions before page row creation #28
Context
Ticket #12 rejects non-HTML pages at fetch time: when the fetcher gets back Content-Type: application/pdf (or any non-HTML type), the page row is marked status=Rejected and stays in the DB to prevent re-discovery loops. The trade-off accepted for v0.1: we still pay the network round-trip on every PDF/zip/image URL we encounter.

A cheaper-and-earlier optimization: when a URL is discovered with an obvious non-HTML extension (.pdf, .zip, .mp4, .jpg, etc.), skip creating the page row entirely. No fetch, no pages row, no page_crawls row.

Acceptance
- config/crawler.php (e.g. pdf, zip, tar, gz, mp3, mp4, webm, jpg, jpeg, png, gif, svg, webp, ico, woff, woff2, ttf, otf, exe, dmg, iso, ...)
- App\Services\UrlService, e.g. looksLikeBinary(string $url): bool: extracts the path's final segment, parses the extension, returns true if it matches the list
- App\Listeners\UrlDiscoveredListener: if the discovered URL looks binary, skip creating the page (no event side effects).
- App\Livewire\UrlSubmissionForm::submit(): return a validation error to the user ("PDF / image URLs aren't indexed")
- (… UrlDiscovered events for each): pre-filter the array before dispatching
- UrlService::looksLikeBinary() covering edge cases (no extension, double-dot like .tar.gz, query strings like /file.pdf?download=1, fragment like #section)

Notes
- … /download/123 or /api/file still hit the fetcher and get Rejected there. That's fine: the pre-filter catches the obvious 90%, the post-fetch check catches the rest.
- (… ?download=1)? Probably no, too aggressive. Stick to extensions only.
- (… https://example.com/malware.exe) get rejected at the form layer, not after a network round-trip.
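
The looksLikeBinary() check named in the acceptance list could be sketched roughly as below. This is a minimal illustration, not the project's implementation: it is written as a standalone function rather than a method on App\Services\UrlService, the SKIP_EXTENSIONS constant is a hypothetical stand-in for the list the ticket places in config/crawler.php, and treating a trailing-slash path as "not binary" is an assumption the ticket does not spell out.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in for the extension list in config/crawler.php.
const SKIP_EXTENSIONS = [
    'pdf', 'zip', 'tar', 'gz', 'mp3', 'mp4', 'webm', 'jpg', 'jpeg', 'png',
    'gif', 'svg', 'webp', 'ico', 'woff', 'woff2', 'ttf', 'otf', 'exe', 'dmg', 'iso',
];

function looksLikeBinary(string $url, array $extensions = SKIP_EXTENSIONS): bool
{
    // parse_url() returns the path without the query string or fragment,
    // so /file.pdf?download=1 and /file.pdf#section are judged on "/file.pdf".
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path) || $path === '') {
        return false; // no path or unparseable URL: leave it to the fetcher
    }

    // Assumption: a path ending in "/" is a directory-style page, not a file.
    if (substr($path, -1) === '/') {
        return false;
    }

    // Only the final segment counts; dots in directory names are ignored.
    // For double extensions such as .tar.gz, pathinfo() yields the last one.
    $extension = strtolower(pathinfo(basename($path), PATHINFO_EXTENSION));

    return $extension !== '' && in_array($extension, $extensions, true);
}

// Pre-filtering a batch of discovered URLs before any events are dispatched.
$discovered = [
    'https://example.com/about',
    'https://example.com/file.pdf?download=1',
    'https://example.com/download/123',
    'https://example.com/backup.tar.gz',
];

$toDispatch = array_values(array_filter(
    $discovered,
    fn (string $url): bool => !looksLikeBinary($url)
));
```

In this sketch, /download/123 passes the pre-filter because it has no extension, which matches the note above that such URLs are still rejected by the post-fetch check.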