Crawler: HTTP fetcher and content extraction #12

Closed
opened 2026-04-23 02:28:57 +02:00 by myrmidex · 0 comments
Owner

Context

Rewritten 2026-04-26. Original text bundled "fetch + extract + content-type filter + HTML hash" into one bullet list. Now scoped explicitly with locked-in design decisions from #7/#8 and the v0.1 "stores signals not content" rule.

The fetcher is the single most consequential action in the crawler pipeline: a pure data-in/data-out function that takes a URL, performs one HTTP GET, parses the response, and returns a structured FetchResult. It does NOT touch the DB and does NOT decide whether to fetch.

Locked-in design decisions

  • Pure single-responsibility fetcher — no robots.txt check, no per-domain politeness check, no DB writes. The worker (#14) orchestrates around it.
  • Worker handles retry/backoff — fetcher is one-shot; if it returns Timeout or Failed, the worker decides whether to insert a new page_crawls row with backoff.
  • Follow up to 5 redirects, return both original and final URL in the result. Cross-domain redirects accepted for v0.1.
  • No body size limit — accept what comes; if a real large page bites, fix later.
  • No HTML hash compute or storage — deferred until re-crawl scheduling lands (post-v0.1).
  • Fetch + extract bundled in one ticket — the action returns title, main text, outbound links, word count alongside fetch metadata. Splitting them into two actions adds ceremony for v0.1; the extractor's correctness is exercised by the same Http::fake() tests.
  • Reject non-HTML at fetch time: only Content-Type: text/html is accepted. Anything else returns outcome=Rejected. The pages row gets status=Rejected (the worker writes it back). The page row STAYS; deleting it would create re-discovery fetch loops as the fediverse re-shares the URL.
  • URL-pattern-based pre-filtering (e.g. skip .pdf/.zip URLs at discovery time before they become pages rows) → deferred to a separate v0.2 follow-up ticket.

Schema additions (in this ticket's commits)

  • New case Rejected on App\Enums\CrawlOutcomeEnum (currently 6 cases → 7).
  • New case Rejected on App\Enums\PageStatusEnum (currently 3 cases → 4).
  • Existing tests asserting "exactly N cases" will need their counts bumped in the same commit.
  • No DB migration needed — both columns are PG string with no DB-level constraint.
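
As an illustration of the enum change, here is a minimal sketch of CrawlOutcomeEnum after the addition. The six existing cases are inferred from the outcomes this ticket lists, and the string backing values are assumptions (the ticket only says the column is a plain string), so treat both as placeholders for whatever the real enum contains.

```php
<?php
declare(strict_types=1);

// Sketch only: case names come from the ticket, backing values are assumed.
enum CrawlOutcomeEnum: string
{
    case Success = 'success';
    case Failed = 'failed';
    case Timeout = 'timeout';
    case BlockedRobots = 'blocked_robots';
    case Blocked4xx = 'blocked_4xx';
    case Blocked5xx = 'blocked_5xx';
    case Rejected = 'rejected'; // new in this ticket
}

// The kind of "exactly N cases" check that has to move from 6 to 7.
$caseNames = array_map(
    fn (CrawlOutcomeEnum $case): string => $case->name,
    CrawlOutcomeEnum::cases(),
);
```

Because the value is stored as a plain string, adding a case changes only PHP code and the tests that count cases.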

Action shape

App\Actions\FetchPageAction. Constructor-injects Illuminate\Http\Client\Factory. One public method:

public function __invoke(string $url): FetchResult

Returns a FetchResult value object (in App\ValueObjects\FetchResult or App\Crawler\FetchResult) with these fields:

outcome:        CrawlOutcomeEnum
statusCode:     ?int        // null on network errors, set on any HTTP response
finalUrl:       ?string     // post-redirect URL (may differ from input)
title:          ?string     // extracted from <title>
extractedText:  ?string     // readability-style main text (input to tokenization)
outboundLinks:  array       // list of fully-qualified absolute URLs found in <a href> within main content
wordCount:      ?int        // word count of extractedText
errorMessage:   ?string     // null on success

The body (raw response) is NOT returned — per "stores signals not content," only the extracted signals leave the action. The body lives only in memory during the fetch.

Fields populated by outcome:

  • Success → all fields populated except errorMessage
  • Failed / Timeout / BlockedRobots → minimal fields, errorMessage describes the failure (BlockedRobots never originates in the fetcher; the worker's upstream robots.txt check produces it)
  • Blocked4xx / Blocked5xx → statusCode and errorMessage populated, no extracted fields
  • Rejected → statusCode (200, since fetch succeeded), finalUrl, errorMessage (e.g. "Content-Type: application/pdf not supported"), no extracted fields
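
The field list and the per-outcome rules above could be captured in an immutable value object along these lines. This is a sketch under assumptions, not the final class: the named constructors rejected() and networkError() are hypothetical helpers, and the enum is declared without backing values only so the example runs on its own.

```php
<?php
declare(strict_types=1);

// Same seven cases the ticket names; backing values omitted to keep the sketch short.
enum CrawlOutcomeEnum
{
    case Success; case Failed; case Timeout; case BlockedRobots;
    case Blocked4xx; case Blocked5xx; case Rejected;
}

final class FetchResult
{
    /** @param list<string> $outboundLinks fully-qualified absolute URLs */
    public function __construct(
        public readonly CrawlOutcomeEnum $outcome,
        public readonly ?int $statusCode = null,
        public readonly ?string $finalUrl = null,
        public readonly ?string $title = null,
        public readonly ?string $extractedText = null,
        public readonly array $outboundLinks = [],
        public readonly ?int $wordCount = null,
        public readonly ?string $errorMessage = null,
    ) {
    }

    /** Hypothetical helper: 2xx response whose Content-Type is not text/html. */
    public static function rejected(string $finalUrl, string $contentType): self
    {
        return new self(
            outcome: CrawlOutcomeEnum::Rejected,
            statusCode: 200,
            finalUrl: $finalUrl,
            errorMessage: "Content-Type: {$contentType} not supported",
        );
    }

    /** Hypothetical helper: no HTTP response arrived, so there is no status code. */
    public static function networkError(CrawlOutcomeEnum $outcome, string $message): self
    {
        return new self(outcome: $outcome, errorMessage: $message);
    }
}
```

Readonly promoted properties keep the object a pure data carrier: the worker can read every signal but cannot mutate the result after the fetch.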

HTTP client behavior

  • Use Laravel's HTTP client via the injected Factory (testable with Http::fake()).
  • Timeout from config/crawler.php (new file): timeout (default 10s), max_redirects (default 5).
  • User-Agent from config/crawler.php: user_agent (placeholder for v0.1; the real UA + /bot page lands in #10).
  • Headers: User-Agent, Accept: text/html only. Keep wire requests boring.
  • Exception → outcome mapping:
    • Illuminate\Http\Client\ConnectionException (DNS / refused / no route) → Failed
    • timeout (Laravel's HTTP client also reports an exceeded Http::timeout(N) as a ConnectionException, so it has to be told apart from DNS / refused failures, e.g. by inspecting the exception message for a cURL error 28 "timed out" text) → Timeout
    • 4xx → Blocked4xx
    • 5xx → Blocked5xx
    • 2xx + non-HTML Content-Type → Rejected
    • 2xx + HTML → Success (run extractor)
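
The mapping above can be sketched as two pure functions, kept free of Laravel so they are easy to test in isolation. The function names are hypothetical. Two choices are assumptions rather than ticket decisions: a response outside 2xx/4xx/5xx maps to Failed, and a timeout is recognised by sniffing the exception message, which presumes the cURL handler's wording.

```php
<?php
declare(strict_types=1);

// Same seven cases the ticket names; backing values omitted to keep the sketch short.
enum CrawlOutcomeEnum
{
    case Success; case Failed; case Timeout; case BlockedRobots;
    case Blocked4xx; case Blocked5xx; case Rejected;
}

/** True for "text/html", with or without parameters such as "; charset=utf-8". */
function isHtmlContentType(?string $header): bool
{
    if ($header === null) {
        return false; // a missing header counts as "anything else"
    }
    $mediaType = strtolower(trim(explode(';', $header, 2)[0]));

    return $mediaType === 'text/html';
}

/** Map the final HTTP response (after redirects) to an outcome. */
function outcomeForResponse(int $status, ?string $contentType): CrawlOutcomeEnum
{
    return match (true) {
        $status >= 500 => CrawlOutcomeEnum::Blocked5xx,
        $status >= 400 => CrawlOutcomeEnum::Blocked4xx,
        // Assumption: the ticket does not say what a leftover 1xx/3xx should become.
        $status < 200 || $status >= 300 => CrawlOutcomeEnum::Failed,
        !isHtmlContentType($contentType) => CrawlOutcomeEnum::Rejected,
        default => CrawlOutcomeEnum::Success,
    };
}

/** Map the message of a connection-level exception to Timeout or Failed. */
function outcomeForConnectionError(string $message): CrawlOutcomeEnum
{
    // Assumption: cURL reports timeouts as "cURL error 28: ... timed out ...".
    $timedOut = str_contains($message, 'cURL error 28')
        || stripos($message, 'timed out') !== false;

    return $timedOut ? CrawlOutcomeEnum::Timeout : CrawlOutcomeEnum::Failed;
}
```

Inside the action, the caught exception's getMessage() would feed outcomeForConnectionError(), and the response's status and Content-Type header would feed outcomeForResponse().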

Content extraction

For v0.1 we want enough to feed the search engine and the citation graph:

  • Title: <title> tag content (use symfony/dom-crawler, already in Laravel).
  • Main text — readability-style extraction. Library decision: andreskrey/readability.php if it works cleanly with our Composer setup; otherwise a primitive tag-stripping fallback. Pick at implementation time, document in PLATFORM.md.
  • Outbound links: <a href> extraction within the main content (not header/footer/sidebar). Resolve relative URLs against finalUrl. Filter to http/https only. Strip fragments. Pass through App\Services\UrlService::host() validation to drop IP literals/userinfo URLs at this stage too.
  • Word count — splitting extractedText on whitespace. str_word_count is naive; mb_split or preg_split('/\s+/u', ...) is safer for Unicode.
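
For the word count, a minimal sketch of the preg_split variant mentioned above. The function name countWords is hypothetical, and returning 0 for text that is not valid UTF-8 is an assumption about how the action should degrade.

```php
<?php
declare(strict_types=1);

/** Count whitespace-separated tokens in UTF-8 text. */
function countWords(string $text): int
{
    // PREG_SPLIT_NO_EMPTY drops the empty strings that leading or trailing
    // whitespace would otherwise produce.
    $tokens = preg_split('/\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY);

    // preg_split returns false on failure, e.g. when $text is not valid UTF-8.
    return $tokens === false ? 0 : count($tokens);
}
```

Splitting on whitespace treats any run of non-space characters as one word, so punctuation attached to a word does not inflate the count.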

Outbound link dispatch is NOT in this action — it returns the array. The worker (#14) is responsible for firing UrlDiscovered events for each link, which then flows through the existing observer to enqueue more crawls. Keeps the fetcher pure.
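
The array the action hands to the worker could be built roughly as follows. This is a sketch with hypothetical function names: the relative-URL resolution is a simplified take on RFC 3986 rather than a full implementation, isCrawlableUrl() only stands in for the App\Services\UrlService::host() validation (whose actual behaviour is not shown in this ticket), and dropping same-page links such as "#comments" is an assumption.

```php
<?php
declare(strict_types=1);

/** Collapse "." and ".." path segments (simplified RFC 3986 section 5.2.4). */
function removeDotSegments(string $path): string
{
    $segments = explode('/', $path);
    $out = [];
    foreach ($segments as $segment) {
        if ($segment === '..') {
            if (count($out) > 1) {
                array_pop($out); // never pop the leading empty segment (the root)
            }
        } elseif ($segment !== '.') {
            $out[] = $segment;
        }
    }
    $last = end($segments);
    if ($last === '.' || $last === '..') {
        $out[] = ''; // keep the trailing slash, e.g. "/a/b/.." becomes "/a/"
    }

    return implode('/', $out);
}

/** Resolve $href against $base and strip the fragment; null for same-page links. */
function resolveUrl(string $base, string $href): ?string
{
    $b = parse_url($base);
    $href = explode('#', trim($href), 2)[0]; // drop the fragment first
    if ($href === '' || $b === false || !isset($b['scheme'], $b['host'])) {
        return null;
    }
    if (preg_match('/^[a-z][a-z0-9+.\-]*:/i', $href) === 1) {
        return $href; // already absolute
    }
    if (str_starts_with($href, '//')) {
        return $b['scheme'] . ':' . $href; // scheme-relative
    }
    $origin = $b['scheme'] . '://' . $b['host'] . (isset($b['port']) ? ':' . $b['port'] : '');
    $basePath = $b['path'] ?? '/';
    [$path, $query] = array_pad(explode('?', $href, 2), 2, null);
    if ($path === '') {
        $path = $basePath; // query-only reference keeps the base path
    } elseif ($path[0] !== '/') {
        $path = substr($basePath, 0, strrpos($basePath, '/') + 1) . $path;
    }

    return $origin . removeDotSegments($path) . ($query !== null ? '?' . $query : '');
}

/** Stand-in for the UrlService check: http/https only, no userinfo, no IP literals. */
function isCrawlableUrl(string $url): bool
{
    $p = parse_url($url);
    if ($p === false || !isset($p['scheme'], $p['host'])) {
        return false;
    }
    if (!in_array(strtolower($p['scheme']), ['http', 'https'], true)) {
        return false;
    }
    if (isset($p['user']) || isset($p['pass'])) {
        return false;
    }

    // parse_url keeps the brackets around IPv6 literals, so strip them first.
    return filter_var(trim($p['host'], '[]'), FILTER_VALIDATE_IP) === false;
}

/**
 * @param list<string> $hrefs raw href attribute values from the main content
 * @return list<string>
 */
function normalizeOutboundLinks(string $finalUrl, array $hrefs): array
{
    $links = [];
    foreach ($hrefs as $href) {
        $url = resolveUrl($finalUrl, $href);
        if ($url !== null && isCrawlableUrl($url)) {
            $links[] = $url;
        }
    }

    return $links;
}
```

Resolving against finalUrl rather than the input URL matters after a redirect, since relative links are relative to the page that was actually served.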

Tests

  • Http::fake() per the existing MastodonClient pattern in the package.
  • Cover each outcome (success/failed/timeout/4xx/5xx/rejected).
  • Cover redirect handling — finalUrl reflects post-redirect location.
  • Cover extractor correctness on a small fixture (title, links, word count).
  • Cover Content-Type filter — application/pdf, image/jpeg, text/plain all → Rejected. (Edge case: text/html; charset=utf-8 → still HTML, accept.)
  • Cover relative-URL resolution in outbound links.
  • Cover invalid outbound links (IP literal, userinfo) — filtered out.

Out of scope (later tickets)

  • Worker logic invoking the fetcher → #14 Queue worker
  • Per-domain politeness gating → #11 Per-domain politeness
  • Robots.txt check before fetch → #9 Crawler: robots.txt handling
  • Real User-Agent string + /bot info page → #10 Crawler: User agent and /bot page
  • HTML hash for change detection → post-v0.1 (re-crawl ticket)
  • URL-pattern-based pre-filtering at discovery (skip .pdf/.zip before page exists) → new v0.2 follow-up ticket
  • Language detection on extractedText → #13 Crawler: Language detection (reads extractedText, writes pages.language)
  • Tokenization writing keywords_tsv → #13 (or successor)

Acceptance

  • App\Enums\CrawlOutcomeEnum gains Rejected case (test bump)
  • App\Enums\PageStatusEnum gains Rejected case (test bump)
  • App\Actions\FetchPageAction with __invoke(string $url): FetchResult
  • App\ValueObjects\FetchResult value object
  • config/crawler.php with timeout, max_redirects, user_agent placeholders
  • HTML extractor: title, main text, outbound links (filtered + resolved), word count
  • Tests covering each outcome path, redirect handling, extractor correctness, Content-Type filtering, relative-URL resolution, invalid-link filtering
  • PLATFORM.md updated with the action contract, the chosen extractor library, and the Rejected enum semantics
myrmidex added this to the v0.1 milestone 2026-04-23 02:28:57 +02:00
myrmidex added the enhancement label 2026-04-26 01:28:09 +02:00
myrmidex self-assigned this 2026-04-26 16:23:23 +02:00
Reference: lvl0/trove#12