Crawler: Queue worker #14

Closed

opened 2026-04-23 02:29:49 +02:00 by myrmidex · 0 comments

Context

Rewritten 2026-04-26. Original text bundled "queue worker + politeness checks + retry logic + error logging" and referenced crawl_queue (which doesn't exist — we built page_crawls in #7).

Now scoped explicitly with locked-in design decisions:

Politeness gating → out of scope (lands in #11)
robots.txt check → out of scope (lands in #9)
Real User-Agent string → out of scope (lands in #10, currently a placeholder in config/crawler.php)
Language detection on success → out of scope (lands in #13)

This ticket builds the orchestration layer that calls FetchPageAction and writes results back. Until #9/#10/#11 land, the worker is dev-safe (run against your own URLs) but NOT yet safe to point at arbitrary domains at scale.

Major design shift from #7

In #7 we designed page_crawls as history-as-queue (one table doing both jobs). For #14 we're moving the work-queue role to Laravel queues (Redis), leaving page_crawls as pure attempt history.

Why: queue:work is reactive (job picked up within ms of dispatch), retries + backoff + failure tracking are free, no custom polling loop to maintain. The previous "history-as-queue" design predates the decision to use Laravel queues.

This means dropping 3 columns and 2 indexes from page_crawls that no longer have consumers.

Schema changes (commit 1 of this ticket)

Drop from page_crawls:

scheduled_for: the queue handles delay via dispatch()->delay($when) if we ever need it
locked_at: the Redis queue tracks reserved jobs itself (retry_after on the connection)
attempted_at: redundant, since created_at = job created = attempt started

locked_by is already absent (it was dropped in #7).

priority stays. Laravel queues support dispatch()->onQueue('high') etc., so even if we don't use it in v0.1, the column has a forward-looking consumer.

Drop indexes:

(domain) WHERE outcome IS NULL partial — no more "pending" filter at DB level
(scheduled_for, locked_at) WHERE outcome IS NULL partial — same

Keep:

(page_id, created_at) — per-page history view still valuable

So page_crawls loses three columns (from 13 to ~10). PLATFORM.md "history-as-queue" section becomes "history."
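A minimal sketch of what the commit 1 migration could look like, assuming PostgreSQL and assuming the two partial indexes were created with raw SQL. Both index names are hypothetical placeholders; substitute whatever #7 actually created. No down() is shown because the original column types aren't stated in this ticket. It only runs inside the Laravel app.

```php
<?php

use Illuminate\Database\Migrations\Migration;
use Illuminate\Database\Schema\Blueprint;
use Illuminate\Support\Facades\DB;
use Illuminate\Support\Facades\Schema;

return new class extends Migration
{
    public function up(): void
    {
        // Hypothetical index names: replace with the real ones from #7.
        // Dropped first, because one of them references columns removed below.
        DB::statement('DROP INDEX IF EXISTS page_crawls_pending_domain_idx');
        DB::statement('DROP INDEX IF EXISTS page_crawls_pending_schedule_idx');

        Schema::table('page_crawls', function (Blueprint $table) {
            // The three columns named in the acceptance list; priority stays.
            $table->dropColumn(['scheduled_for', 'locked_at', 'attempted_at']);
        });
    }
};
```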

Worker shape

Trigger: PageCrawlObserver::created dispatches ProcessCrawlJob to Redis. (Mirrors PageObserver pattern.)
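As a rough sketch, the observer could be as small as the class below. The App\Models\PageCrawl namespace and the job's single-argument constructor are assumptions; the ticket only names the classes. It depends on the Laravel app and is not standalone.

```php
<?php

namespace App\Observers;

use App\Jobs\ProcessCrawlJob;
use App\Models\PageCrawl; // assumed model namespace

class PageCrawlObserver
{
    // Fired by Eloquent after a page_crawls row is inserted.
    public function created(PageCrawl $pageCrawl): void
    {
        // Pushes the job onto the default queue connection (Redis per this ticket).
        ProcessCrawlJob::dispatch($pageCrawl);
    }
}
```

The observer still has to be registered, for example with PageCrawl::observe(PageCrawlObserver::class) in a service provider's boot method.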

Job class App\Jobs\ProcessCrawlJob implements ShouldQueue:

// CrawlOutcome and PageStatus are assumed enum names; the ticket only gives the case names.
public function handle(FetchPageAction $fetcher, RegisterDiscoveredPageAction $register): void
{
    $page = $this->pageCrawl->page;
    $result = ($fetcher)($page->url);

    // Write the outcome back to page_crawls
    $this->pageCrawl->update([
        'completed_at' => now(),
        'outcome' => $result->outcome,
        'status_code' => $result->statusCode,
        'error_message' => $result->errorMessage,
    ]);

    // Update pages on terminal outcomes
    if ($result->outcome === CrawlOutcome::Success) {
        $page->update([
            'status' => PageStatus::Fetched,
            'fetched_at' => now(),
            'title' => $result->title,
        ]);

        // Discover outbound links
        foreach ($result->outboundLinks as $url) {
            ($register)($url);  // null instanceId: crawler-discovered, not fediverse
        }
    } elseif ($result->outcome === CrawlOutcome::Rejected) {
        $page->update(['status' => PageStatus::Rejected]);
    } elseif (in_array($result->outcome, [
        CrawlOutcome::Blocked4xx, CrawlOutcome::Failed, CrawlOutcome::Timeout, CrawlOutcome::Blocked5xx,
    ], true)) {
        // Retryable outcomes (Failed, Timeout, Blocked5xx) are additionally subject to the retry strategy.
        $page->update(['status' => PageStatus::Failed, 'failed_at' => now()]);
    }
}
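For orientation, the handle method above would sit in a class shell roughly like this one, which is also about what the commit 2 skeleton amounts to. The traits are the standard Laravel queued-job traits; the App\Models\PageCrawl namespace and the $tries value are assumptions. It depends on the Laravel app and is not standalone.

```php
<?php

namespace App\Jobs;

use App\Models\PageCrawl; // assumed model namespace
use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Queue\SerializesModels;

class ProcessCrawlJob implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable, SerializesModels;

    // Job-level retries for unexpected exceptions; 3 is a placeholder value.
    public $tries = 3;

    // SerializesModels stores only the model identifier on the queue
    // and reloads the row when the worker picks the job up.
    public function __construct(public PageCrawl $pageCrawl)
    {
    }

    public function handle(): void
    {
        // Stub for commit 2; the full orchestration lands in commit 3.
    }
}
```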

Retry strategy

Laravel's built-in job retries cover job-level failures (e.g. the job throws an unexpected exception → retried up to $tries times).

For our outcome-level retries (page returned 5xx, want to try again later):

✅ Don't retry: Success, Rejected, Blocked4xx, BlockedRobots
✅ Retry: Failed, Timeout, Blocked5xx
Strategy for retryable: insert a NEW page_crawls row and dispatch its job with ->delay(now()->addHour()), reusing the same observer-dispatch pattern. The observer dispatches immediately by default, so the retry path has to carry the delay through.
Cap at 3 attempts per page, counted with PageCrawl::where('page_id', $id)->count(). Once the cap is reached → pages.status = Failed, no more retries.

Backoff is fixed 1h for v0.1. Exponential is overkill until we have data showing fixed isn't right.
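The retry rules above can be sketched as a small pure function that is testable without the framework. The enum name CrawlOutcome, its string values, the constant, and shouldRetry() are all hypothetical; only the case names and the cap of 3 come from this ticket. Requires PHP 8.1+ for enums.

```php
<?php

// Hypothetical enum: case names are from the ticket, backing values are made up.
enum CrawlOutcome: string
{
    case Success = 'success';
    case Rejected = 'rejected';
    case Blocked4xx = 'blocked_4xx';
    case BlockedRobots = 'blocked_robots';
    case Failed = 'failed';
    case Timeout = 'timeout';
    case Blocked5xx = 'blocked_5xx';

    // Only transient outcomes are worth another attempt.
    public function isRetryable(): bool
    {
        return match ($this) {
            self::Failed, self::Timeout, self::Blocked5xx => true,
            default => false,
        };
    }
}

const MAX_ATTEMPTS_PER_PAGE = 3;

// $attemptsSoFar is the number of page_crawls rows for the page,
// including the attempt that just finished.
function shouldRetry(CrawlOutcome $outcome, int $attemptsSoFar): bool
{
    return $outcome->isRetryable() && $attemptsSoFar < MAX_ATTEMPTS_PER_PAGE;
}
```

Inside the job, the attempt count would come from PageCrawl::where('page_id', $id)->count(); when shouldRetry() returns false for a retryable outcome, the page is marked Failed.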

Outbound link discovery

On Success, iterate $result->outboundLinks and call RegisterDiscoveredPageAction($url, instanceId: null) for each. Each call creates a pages row → fires PageObserver → creates page_crawls row → fires PageCrawlObserver → dispatches ProcessCrawlJob. The pipeline self-feeds.

Production deployment (commit 4 of this ticket)

docker/prod/start.sh needs a php artisan queue:work process alongside FrankenPHP. Two options: a sidecar (separate container) or in-container (background process).

Decide at implementation time. Document in PLATFORM.md.

Commit plan

1. 14 - Simplify page_crawls schema (queue moves to Redis): migration drops 3 columns + 2 indexes. Factory cleanup. Test cleanup. PLATFORM.md update.
2. 14 - Add PageCrawlObserver and ProcessCrawlJob skeleton: observer dispatches job. Job class with empty handle (or a single-line stub). Tests for observer firing.
3. 14 - Implement ProcessCrawlJob orchestration: full handle method that calls the fetcher, writes back outcomes, registers outbound links, and applies retry-on-failure logic. Feature tests covering happy path + each outcome → page status mapping.
4. 14 - Run queue worker in production via start.sh: docker prod startup. May land here as v0.1 hardening, or be filed as a separate ticket if it isn't blocking the v0.1 ship.

Out of scope (later tickets)

Per-domain politeness gating before dispatch → #11
robots.txt check before fetch → #9
Real User-Agent string + /bot info page → #10
Language detection on Success → #13
Re-crawl scheduling (post-v0.1)
Horizon dashboard for queue introspection (post-v0.1, deploy ergonomics)

Acceptance

Migration dropping scheduled_for, locked_at, attempted_at columns and their indexes from page_crawls
App\Observers\PageCrawlObserver dispatches ProcessCrawlJob on created
App\Jobs\ProcessCrawlJob implements ShouldQueue with full outcome-handling logic
Retry pattern: failed/timeout/5xx → new page_crawls row scheduled +1h, capped at 3 attempts/page
Outbound link discovery: Success outcomes call RegisterDiscoveredPageAction per link
Tests covering observer dispatch + each outcome's page status update
PLATFORM.md updated: "history-as-queue" → "history" (Redis is the queue), worker section, production deployment
Production deployment: queue:work runs alongside FrankenPHP (or separate ticket if scope creeps)
