Crawler: HTTP fetcher and content extraction #12
Context
Rewritten 2026-04-26. Original text bundled "fetch + extract + content-type filter + HTML hash" into one bullet list. Now scoped explicitly with locked-in design decisions from #7/#8 and the v0.1 "stores signals not content" rule.
The fetcher is the single most consequential action in the crawler pipeline: a pure data-in/data-out function that takes a URL, performs one HTTP GET, parses the response, and returns a structured FetchResult. It does NOT touch the DB and does NOT decide whether to fetch.
Locked-in design decisions
- On Timeout or Failed, the worker decides whether to insert a new page_crawls row with backoff.
- Http::fake() tests.
- Content-Type: text/html only. Anything else returns outcome=Rejected. The pages row gets status=Rejected (worker writes back). The page row STAYS: deleting it would create re-discovery fetch loops as the fediverse re-shares the URL.
- .pdf/.zip URLs at discovery time before they become pages rows) → deferred to a separate v0.2 follow-up ticket.
Schema additions (in this ticket's commits)
- Rejected on App\Enums\CrawlOutcomeEnum (currently 6 cases → 7).
- Rejected on App\Enums\PageStatusEnum (currently 3 cases → 4).
- string with no DB-level constraint.
Action shape
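As context for the outcome carried by the action's result, here is a minimal sketch of the outcome enum after this ticket's addition. It assumes the six existing cases are the ones named elsewhere in this ticket (Success, Failed, Timeout, BlockedRobots, Blocked4xx, Blocked5xx). The string backing values are hypothetical, and the App\Enums namespace is omitted so the snippet runs standalone.

```php
<?php

declare(strict_types=1);

// Sketch only: case names are taken from the ticket text.
// The backing string values are assumptions, not the project's real values.
enum CrawlOutcomeEnum: string
{
    case Success = 'success';
    case Failed = 'failed';
    case Timeout = 'timeout';
    case BlockedRobots = 'blocked_robots';
    case Blocked4xx = 'blocked_4xx';
    case Blocked5xx = 'blocked_5xx';
    case Rejected = 'rejected'; // new in this ticket: fetched, but not text/html
}
```

Because the columns are plain strings with no DB-level constraint, adding a case like this should not need a migration; only tests that assert the case count would need the bump mentioned under Acceptance.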
App\Actions\FetchPageAction. Constructor-injects Illuminate\Http\Client\Factory. One public method: __invoke(string $url): FetchResult.
Returns a FetchResult value object (in App\ValueObjects\FetchResult or App\Crawler\FetchResult) whose fields include outcome, statusCode, finalUrl, errorMessage, and the extracted signals such as extractedText.
The body (raw response) is NOT returned; per "stores signals not content," only the extracted signals leave the action. The body lives only in memory during the fetch.
Fields populated by outcome:
- Success → all fields populated except errorMessage
- Failed / Timeout / BlockedRobots (won't fire; the worker checks upstream) → minimal fields, errorMessage describes the failure
- Blocked4xx / Blocked5xx → statusCode and errorMessage populated, no extracted fields
- Rejected → statusCode (200, since fetch succeeded), finalUrl, errorMessage (e.g. "Content-Type: application/pdf not supported"), no extracted fields
HTTP client behavior
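The response-to-outcome mapping described in this section can be sketched as a pure function. This is an illustration, not the action itself: classifyResponse and isHtmlContentType are hypothetical names, outcomes are returned as plain strings instead of the project enum, and treating a missing Content-Type header as Rejected is an assumption. Connection failures and timeouts are exceptions rather than responses, so they are not covered here.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: true only for text/html, ignoring parameters
 * such as "; charset=utf-8" and ignoring letter case.
 */
function isHtmlContentType(?string $contentType): bool
{
    if ($contentType === null) {
        return false; // assumption: no header means we cannot treat it as HTML
    }

    // Keep only the media type in front of the first ';'.
    $mediaType = strtolower(trim(explode(';', $contentType, 2)[0]));

    return $mediaType === 'text/html';
}

/**
 * Hypothetical helper: maps a received response to an outcome name.
 * Status is checked first, so an error page served as HTML is still Blocked.
 */
function classifyResponse(int $statusCode, ?string $contentType): string
{
    if ($statusCode >= 500) {
        return 'Blocked5xx';
    }

    if ($statusCode >= 400) {
        return 'Blocked4xx';
    }

    if (!isHtmlContentType($contentType)) {
        return 'Rejected';
    }

    return 'Success';
}
```

Checking the status before the Content-Type is a design choice in this sketch: it keeps a 404 served as HTML in Blocked4xx rather than handing it to the extractor.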
- Factory (testable with Http::fake()).
- config/crawler.php (new file): timeout (default 10s), max_redirects (default 5).
- config/crawler.php: user_agent (placeholder for v0.1; the real UA + /bot page lands in #10).
- User-Agent, Accept: text/html only. Keep wire requests boring.
- Illuminate\Http\Client\ConnectionException (DNS / refused / no route) → Failed
- ConnectionException with timeout message; check via response or use Http::timeout(N), which throws ConnectionException when the timeout is exceeded) → Timeout
- 4xx response → Blocked4xx
- 5xx response → Blocked5xx
- Non-text/html Content-Type → Rejected
- HTML response → Success (run extractor)
Content extraction
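Two of the extraction rules in this section, the Unicode-safe word count and the http/https-only, fragment-free link filter, can be sketched in plain PHP. The function names countWords and normalizeOutboundLink are hypothetical. The link helper assumes the URL has already been resolved against finalUrl, and it does not perform the UrlService::host() validation.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: counts whitespace-separated tokens.
 * The /u flag makes \s Unicode-aware; empty pieces are discarded.
 */
function countWords(string $text): int
{
    $words = preg_split('/\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY);

    // preg_split returns false on failure, e.g. invalid UTF-8 input.
    return $words === false ? 0 : count($words);
}

/**
 * Hypothetical helper: returns the URL without its fragment,
 * or null when the scheme is not http or https.
 */
function normalizeOutboundLink(string $absoluteUrl): ?string
{
    $scheme = parse_url($absoluteUrl, PHP_URL_SCHEME);

    if (!is_string($scheme) || !in_array(strtolower($scheme), ['http', 'https'], true)) {
        return null; // mailto:, javascript:, relative paths, malformed URLs
    }

    // Drop everything from the first '#' onward.
    $hashPos = strpos($absoluteUrl, '#');

    return $hashPos === false ? $absoluteUrl : substr($absoluteUrl, 0, $hashPos);
}
```

A caller would map normalizeOutboundLink over the resolved links and drop the null entries before returning the array to the worker.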
For v0.1 we want enough to feed the search engine and the citation graph:
- <title> tag content (use symfony/dom-crawler, already in Laravel).
- andreskrey/readability.php if it works cleanly with our Composer setup; otherwise a primitive tag-stripping fallback. Pick at implementation time, document in PLATFORM.md.
- <a href> extraction within the main content (not header/footer/sidebar). Resolve relative URLs against finalUrl. Filter to http/https only. Strip fragments. Pass through App\Services\UrlService::host() validation to drop IP literals/userinfo URLs at this stage too.
- Word count from splitting extractedText on whitespace. str_word_count is naive; mb_split or preg_split('/\s+/u', ...) is safer for Unicode.
Outbound link dispatch is NOT in this action; it returns the array. The worker (#14) is responsible for firing UrlDiscovered events for each link, which then flow through the existing observer to enqueue more crawls. This keeps the fetcher pure.
Tests
- Http::fake() per the existing MastodonClient pattern in the package.
- finalUrl reflects post-redirect location.
- application/pdf, image/jpeg, text/plain all → Rejected. (Edge case: text/html; charset=utf-8 → still HTML, accept.)
Out of scope (later tickets)
- /bot info page → #10 Crawler: User agent and /bot page
- .pdf/.zip before page exists) → new v0.2 follow-up ticket
- extractedText → #13 Crawler: Language detection (reads extractedText, writes pages.language)
- keywords_tsv → #13 (or successor)
Acceptance
- App\Enums\CrawlOutcomeEnum gains Rejected case (test bump)
- App\Enums\PageStatusEnum gains Rejected case (test bump)
- App\Actions\FetchPageAction with __invoke(string $url): FetchResult
- App\ValueObjects\FetchResult value object
- config/crawler.php with timeout, max_redirects, user_agent placeholders
- Rejected enum semantics