Crawler: HTTP fetcher and content extraction #12

Closed

opened 2026-04-23 02:28:57 +02:00 by myrmidex · 0 comments

myrmidex commented

2026-04-23 02:28:57 +02:00

Owner

Context

Rewritten 2026-04-26. Original text bundled "fetch + extract + content-type filter + HTML hash" into one bullet list. Now scoped explicitly with locked-in design decisions from #7/#8 and the v0.1 "stores signals not content" rule.

The fetcher is the single most consequential action in the crawler pipeline: a pure data-in/data-out function that takes a URL, performs one HTTP GET, parses the response, and returns a structured FetchResult. It does NOT touch the DB and does NOT decide whether to fetch.

Locked-in design decisions

Pure single-responsibility fetcher — no robots.txt check, no per-domain politeness check, no DB writes. The worker (#14) orchestrates around it.
Worker handles retry/backoff — fetcher is one-shot; if it returns Timeout or Failed, the worker decides whether to insert a new page_crawls row with backoff.
Follow up to 5 redirects, return both original and final URL in the result. Cross-domain redirects accepted for v0.1.
No body size limit — accept what comes; if a real large page bites, fix later.
No HTML hash compute or storage — deferred until re-crawl scheduling lands (post-v0.1).
Fetch + extract bundled in one ticket — the action returns title, main text, outbound links, word count alongside fetch metadata. Splitting them into two actions adds ceremony for v0.1; the extractor's correctness is exercised by the same Http::fake() tests.
Reject non-HTML at fetch time — Content-Type: text/html only. Anything else returns outcome=Rejected. The pages row gets status=Rejected (worker writes back). Page row STAYS — deletion would create re-discovery fetch loops as fediverse re-shares the URL.
URL-pattern-based pre-filtering (e.g. skip .pdf/.zip URLs at discovery time before they become pages rows) → deferred to a separate v0.2 follow-up ticket.

Schema additions (in this ticket's commits)

New case Rejected on App\Enums\CrawlOutcomeEnum (currently 6 cases → 7).
New case Rejected on App\Enums\PageStatusEnum (currently 3 cases → 4).
Existing tests asserting "exactly N cases" will need their counts bumped in the same commit.
No DB migration needed: both columns are plain PostgreSQL string (varchar) columns with no DB-level constraint on the allowed values.
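As a rough illustration of the enum change, the sketch below adds `Rejected` to a standalone copy of the outcome enum. The six existing case names are inferred from the outcomes named elsewhere in this ticket, and the string backing values are assumptions; the real enum lives in `App\Enums` and may differ.

```php
<?php
declare(strict_types=1);

// Sketch only: case names are taken from this ticket, backing values are assumed.
enum CrawlOutcomeEnum: string
{
    case Success = 'success';
    case Failed = 'failed';
    case Timeout = 'timeout';
    case BlockedRobots = 'blocked_robots';
    case Blocked4xx = 'blocked_4xx';
    case Blocked5xx = 'blocked_5xx';
    case Rejected = 'rejected'; // new in this ticket
}

// The kind of "exactly N cases" check whose expected count moves from 6 to 7.
echo count(CrawlOutcomeEnum::cases()), PHP_EOL; // 7
```

Because the columns are unconstrained strings, adding the case is a code-only change; only the case-count assertions need to move with it.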

Action shape

App\Actions\FetchPageAction. Constructor-injects Illuminate\Http\Client\Factory. One public method:

public function __invoke(string $url): FetchResult

Returns a FetchResult value object (in App\ValueObjects\FetchResult or App\Crawler\FetchResult) with these fields:

outcome:        CrawlOutcomeEnum
statusCode:     ?int        // null on network errors, set on any HTTP response
finalUrl:       ?string     // post-redirect URL (may differ from input)
title:          ?string     // extracted from <title>
extractedText:  ?string     // readability-style main text (input to tokenization)
outboundLinks:  array       // list of fully-qualified absolute URLs found in <a href> within main content
wordCount:      ?int        // word count of extractedText
errorMessage:   ?string     // null on success

The body (raw response) is NOT returned — per "stores signals not content," only the extracted signals leave the action. The body lives only in memory during the fetch.

Fields populated by outcome:

Success → all fields populated except errorMessage
Failed / Timeout → statusCode is null and no extracted fields are set; errorMessage describes the failure. BlockedRobots is never produced by the fetcher, because the worker performs the robots.txt check upstream.
Blocked4xx / Blocked5xx → statusCode and errorMessage populated, no extracted fields
Rejected → statusCode (200, since fetch succeeded), finalUrl, errorMessage (e.g. "Content-Type: application/pdf not supported"), no extracted fields
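The field list and the per-outcome rules above could be captured in an immutable value object along these lines. This is a minimal sketch, not the final class: the named constructors `rejected()` and `failed()` are hypothetical conveniences, and the enum is a stand-in so the snippet runs on its own.

```php
<?php
declare(strict_types=1);

// Stand-in so the sketch is self-contained; the real enum lives in App\Enums.
enum CrawlOutcomeEnum
{
    case Success; case Failed; case Timeout; case BlockedRobots;
    case Blocked4xx; case Blocked5xx; case Rejected;
}

final class FetchResult
{
    /** @param list<string> $outboundLinks absolute http(s) URLs */
    public function __construct(
        public readonly CrawlOutcomeEnum $outcome,
        public readonly ?int $statusCode = null,
        public readonly ?string $finalUrl = null,
        public readonly ?string $title = null,
        public readonly ?string $extractedText = null,
        public readonly array $outboundLinks = [],
        public readonly ?int $wordCount = null,
        public readonly ?string $errorMessage = null,
    ) {
    }

    // Hypothetical helper: fetch succeeded (200) but the Content-Type is not HTML.
    public static function rejected(string $finalUrl, string $contentType): self
    {
        return new self(
            outcome: CrawlOutcomeEnum::Rejected,
            statusCode: 200,
            finalUrl: $finalUrl,
            errorMessage: "Content-Type: {$contentType} not supported",
        );
    }

    // Hypothetical helper: network-level failure, so there is no HTTP status.
    public static function failed(CrawlOutcomeEnum $outcome, string $message): self
    {
        return new self(outcome: $outcome, errorMessage: $message);
    }
}

$result = FetchResult::rejected('https://example.com/paper.pdf', 'application/pdf');
echo $result->outcome->name, ': ', $result->errorMessage, PHP_EOL;
// Rejected: Content-Type: application/pdf not supported
```

Readonly promoted properties keep the object immutable once the action returns it, and there is deliberately no field for the raw body.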

HTTP client behavior

Use Laravel's HTTP client via the injected Factory (testable with Http::fake()).
Timeout and redirect limit from config/crawler.php (new file): timeout (default 10 s), max_redirects (default 5).
User-Agent from config/crawler.php: user_agent (placeholder for v0.1; the real UA + /bot page lands in #10).
Headers: User-Agent, Accept: text/html only. Keep wire requests boring.
Exception → outcome mapping:
- Illuminate\Http\Client\ConnectionException (DNS / refused / no route) → Failed
- timeout (also surfaces as ConnectionException: Laravel documents that exceeding Http::timeout(N) throws ConnectionException, so distinguish it from other connection failures by the underlying error, e.g. a cURL error 28 / "timed out" message) → Timeout
- 4xx → Blocked4xx
- 5xx → Blocked5xx
- 2xx + non-HTML Content-Type → Rejected
- 2xx + HTML → Success (run extractor)
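The response half of the mapping above (status code plus Content-Type) is pure logic and can be sketched without the HTTP client. `classifyResponse()` is a hypothetical helper that returns outcome names as strings to stay standalone. Two behaviours are assumptions the ticket does not spell out: a missing Content-Type on a 2xx counts as `Rejected`, and any status outside 2xx/4xx/5xx counts as `Failed`.

```php
<?php
declare(strict_types=1);

/**
 * Maps an HTTP status and Content-Type header to an outcome name.
 * Hypothetical helper for illustration; not existing project code.
 */
function classifyResponse(int $status, ?string $contentType): string
{
    if ($status >= 400 && $status <= 499) {
        return 'Blocked4xx';
    }
    if ($status >= 500 && $status <= 599) {
        return 'Blocked5xx';
    }
    if ($status < 200 || $status > 299) {
        // Assumption: e.g. a 3xx left over after the redirect limit is a failure.
        return 'Failed';
    }

    // Compare only the media type, ignoring parameters such as "; charset=utf-8".
    $mediaType = strtolower(trim(explode(';', $contentType ?? '')[0]));

    return $mediaType === 'text/html' ? 'Success' : 'Rejected';
}

foreach ([
    [200, 'text/html; charset=utf-8'],
    [200, 'application/pdf'],
    [404, 'text/html'],
    [503, null],
] as [$status, $type]) {
    echo $status, ' ', $type ?? '(none)', ' => ', classifyResponse($status, $type), PHP_EOL;
}
```

Checking the status before the Content-Type means an HTML error page from a 404 is still reported as `Blocked4xx` rather than `Success`.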

Content extraction

For v0.1 we want enough to feed the search engine and the citation graph:

Title — <title> tag content (use symfony/dom-crawler, already in Laravel).
Main text — readability-style extraction. Library decision: andreskrey/readability.php if it works cleanly with our Composer setup; otherwise a primitive tag-stripping fallback. Pick at implementation time, document in PLATFORM.md.
Outbound links — <a href> extraction within the main content (not header/footer/sidebar). Resolve relative URLs against finalUrl. Filter to http/https only. Strip fragments. Pass through App\Services\UrlService::host() validation to drop IP literals/userinfo URLs at this stage too.
Word count — splitting extractedText on whitespace. str_word_count is naive; mb_split or preg_split('/\s+/u', ...) is safer for Unicode.
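For the word count, a Unicode-aware whitespace split might look like the sketch below. `countWords()` is a hypothetical helper name, and returning 0 for input that is not valid UTF-8 is an assumption rather than a decided behaviour.

```php
<?php
declare(strict_types=1);

/** Counts whitespace-separated tokens in already-extracted text. */
function countWords(string $text): int
{
    // PREG_SPLIT_NO_EMPTY drops the empty tokens caused by leading/trailing whitespace.
    $tokens = preg_split('/\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY);

    // With the /u modifier, preg_split returns false on invalid UTF-8 input.
    return $tokens === false ? 0 : count($tokens);
}

echo countWords("  Zażółć gęślą jaźń\n naïve café  "), PHP_EOL; // 5
echo countWords(''), PHP_EOL;                                    // 0
```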

Outbound link dispatch is NOT in this action; it only returns the array. The worker (#14) is responsible for firing a UrlDiscovered event for each link, which then flows through the existing observer to enqueue more crawls. This keeps the fetcher pure.

Tests

Http::fake() per the existing MastodonClient pattern in the package.
Cover each outcome (success/failed/timeout/4xx/5xx/rejected).
Cover redirect handling — finalUrl reflects post-redirect location.
Cover extractor correctness on a small fixture (title, links, word count).
Cover Content-Type filter — application/pdf, image/jpeg, text/plain all → Rejected. (Edge case: text/html; charset=utf-8 → still HTML, accept.)
Cover relative-URL resolution in outbound links.
Cover invalid outbound links (IP literal, userinfo) — filtered out.
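The two link-related cases above (relative-URL resolution and invalid-link filtering) could be exercised against a helper shaped roughly like this. It is a simplified sketch, not a full RFC 3986 resolver: `normalizeLink()` and `removeDotSegments()` are hypothetical names, and the inline host checks only stand in for whatever `App\Services\UrlService::host()` actually enforces.

```php
<?php
declare(strict_types=1);

/** Collapses "." and ".." segments in an absolute path (simplified). */
function removeDotSegments(string $path): string
{
    $out = [];
    foreach (explode('/', $path) as $segment) {
        if ($segment === '..') {
            array_pop($out);
        } elseif ($segment !== '.') {
            $out[] = $segment;
        }
    }
    return '/' . ltrim(implode('/', $out), '/');
}

/** Returns an absolute http(s) URL without fragment, or null to drop the link. */
function normalizeLink(string $href, string $base): ?string
{
    $href = explode('#', trim($href), 2)[0];          // strip fragment
    $b = parse_url($base);
    if ($href === '' || $b === false || !isset($b['scheme'], $b['host'])) {
        return null;
    }

    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $href) === 1) {
        $abs = $href;                                   // already has a scheme
    } elseif (str_starts_with($href, '//')) {
        $abs = $b['scheme'] . ':' . $href;              // protocol-relative
    } else {
        $origin = $b['scheme'] . '://' . $b['host'] . (isset($b['port']) ? ':' . $b['port'] : '');
        $basePath = $b['path'] ?? '/';
        if (str_starts_with($href, '/')) {
            $path = $href;
        } elseif (str_starts_with($href, '?')) {
            $path = $basePath . $href;
        } else {
            // Replace the last path segment of the base with the relative reference.
            $path = preg_replace('#/[^/]*$#', '/', $basePath) . $href;
        }
        [$pathPart, $query] = array_pad(explode('?', $path, 2), 2, null);
        $abs = $origin . removeDotSegments($pathPart) . ($query !== null ? '?' . $query : '');
    }

    $p = parse_url($abs);
    if ($p === false || !isset($p['scheme'], $p['host'])) {
        return null;                                    // e.g. mailto:, javascript:
    }
    if (!in_array(strtolower($p['scheme']), ['http', 'https'], true)) {
        return null;
    }
    if (isset($p['user']) || isset($p['pass'])) {
        return null;                                    // userinfo URL
    }
    if (filter_var(trim($p['host'], '[]'), FILTER_VALIDATE_IP) !== false) {
        return null;                                    // IPv4/IPv6 literal
    }
    return $abs;
}

$base = 'https://example.com/blog/post.html';
foreach (['../about#team', '/contact', 'mailto:hi@example.com', 'http://127.0.0.1/admin'] as $href) {
    echo $href, ' => ', normalizeLink($href, $base) ?? '(dropped)', PHP_EOL;
}
```

In the real action the base would be `finalUrl`, so links on a redirected page resolve against where the content actually came from.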

Out of scope (later tickets)

Worker logic invoking the fetcher → #14 Queue worker
Per-domain politeness gating → #11 Per-domain politeness
Robots.txt check before fetch → #9 Crawler: robots.txt handling
Real User-Agent string + /bot info page → #10 Crawler: User agent and /bot page
HTML hash for change detection → post-v0.1 (re-crawl ticket)
URL-pattern-based pre-filtering at discovery (skip .pdf/.zip before page exists) → new v0.2 follow-up ticket
Language detection on extractedText → #13 Crawler: Language detection (reads extractedText, writes pages.language)
Tokenization writing keywords_tsv → #13 (or successor)

Acceptance

App\Enums\CrawlOutcomeEnum gains Rejected case (test bump)
App\Enums\PageStatusEnum gains Rejected case (test bump)
App\Actions\FetchPageAction with __invoke(string $url): FetchResult
App\ValueObjects\FetchResult value object
config/crawler.php with timeout, max_redirects, user_agent placeholders
HTML extractor: title, main text, outbound links (filtered + resolved), word count
Tests covering each outcome path, redirect handling, extractor correctness, Content-Type filtering, relative-URL resolution, invalid-link filtering
PLATFORM.md updated with the action contract, the chosen extractor library, and the Rejected enum semantics
