Crawler: Queue worker #14
Context
Rewritten 2026-04-26. The original text bundled "queue worker + politeness checks + retry logic + error logging" and referenced crawl_queue (which doesn't exist; we built page_crawls in #7).
Now scoped explicitly with locked-in design decisions:
config/crawler.php)
This ticket builds the orchestration layer that calls FetchPageAction and writes results back. Until #9/#10/#11 land, the worker is dev-safe (run against your own URLs) but NOT yet safe to point at arbitrary domains at scale.
Major design shift from #7
In #7 we designed page_crawls as history-as-queue (one table doing both jobs). For #14 we're moving the work-queue role to Laravel queues (Redis), leaving page_crawls as pure attempt history.
Why: queue:work is reactive (job picked up within ms of dispatch), retries + backoff + failure tracking are free, and there is no custom polling loop to maintain. The previous "history-as-queue" design predates the decision to use Laravel queues.
This means dropping 3 columns and 2 indexes from page_crawls that no longer have consumers.
Schema changes (commit 1 of this ticket)
Drop from page_crawls:
- scheduled_for: Redis handles delay via dispatch()->delay($when) if we ever need it
- locked_at: Redis handles visibility timeout
- attempted_at: created_at = job created = attempt started, so it is redundant
Already absent: locked_by (dropped in #7).
Keep priority: Laravel queues support dispatch()->onQueue('high') etc., so even if we don't use it in v0.1, the column has a forward-looking consumer.
Drop indexes:
- (domain) WHERE outcome IS NULL partial: no more "pending" filter at DB level
- (scheduled_for, locked_at) WHERE outcome IS NULL partial: same
Keep:
- (page_id, created_at): per-page history view still valuable
So page_crawls shrinks from 13 columns to ~9. PLATFORM.md "history-as-queue" section becomes "history."
Worker shape
Trigger: PageCrawlObserver::created dispatches ProcessCrawlJob to Redis. (Mirrors PageObserver pattern.)
Job class App\Jobs\ProcessCrawlJob implements ShouldQueue:
Retry strategy
Laravel queue retries handle job-level failures (e.g. the job throws an unexpected exception → retried up to $tries times).
For our outcome-level retries (page returned 5xx, want to try again later):
- Create a new page_crawls row with dispatch()->delay(now()->addHour()), which leverages the same observer-dispatch pattern.
- Cap at 3 attempts per page (counted via PageCrawl::where('page_id', $id)->count()). After cap → pages.status = Failed, no more retries.
Backoff is fixed 1h for v0.1. Exponential is overkill until we have data showing fixed isn't right.
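The outcome-level retry rule above can be sketched as a small pure function, independent of Laravel. This is only an illustration under stated assumptions: the names nextRetryAt, MAX_ATTEMPTS_PER_PAGE, and RETRY_DELAY are hypothetical, and the attempt count is assumed to include the attempt that just finished. In the job, that count would come from the PageCrawl query mentioned above.

```php
<?php

// Hypothetical constants: the ticket caps retries at 3 attempts per page
// and fixes the backoff at 1h for v0.1.
const MAX_ATTEMPTS_PER_PAGE = 3;
const RETRY_DELAY = 'PT1H'; // ISO 8601 duration: one hour

/**
 * Decide when to retry after a retryable outcome (e.g. a 5xx response).
 *
 * @param int $attemptsSoFar page_crawls rows for the page, including the
 *                           attempt that just finished (assumption)
 * @return DateTimeImmutable|null time of the next attempt, or null once the cap is reached
 */
function nextRetryAt(int $attemptsSoFar, DateTimeImmutable $now): ?DateTimeImmutable
{
    if ($attemptsSoFar >= MAX_ATTEMPTS_PER_PAGE) {
        return null; // caller would set pages.status = Failed and stop retrying
    }

    return $now->add(new DateInterval(RETRY_DELAY));
}

$now = new DateTimeImmutable('2026-04-26 12:00:00', new DateTimeZone('UTC'));
$next = nextRetryAt(1, $now);
echo $next === null ? "failed\n" : $next->format('H:i') . "\n"; // prints 13:00
```

Keeping the decision separate from the dispatch call would let the cap and the fixed delay be unit-tested without Redis, and would make a later switch to exponential backoff a one-function change.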
Outbound link discovery
On Success, iterate $result->outboundLinks and call RegisterDiscoveredPageAction($url, instanceId: null) for each. Each call creates a pages row → fires PageObserver → creates a page_crawls row → fires PageCrawlObserver → dispatches ProcessCrawlJob. The pipeline self-feeds.
Production deployment (commit 4 of this ticket)
docker/prod/start.sh needs a php artisan queue:work process alongside FrankenPHP. Sidecar approach (separate container) vs in-container (background process). Decide at implementation time. Document in PLATFORM.md.
Commit plan
1. 14 - Simplify page_crawls schema (queue moves to Redis): migration drops 3 columns + 2 indexes. Factory cleanup. Test cleanup. PLATFORM.md update.
2. 14 - Add PageCrawlObserver and ProcessCrawlJob skeleton: observer dispatches job. Job class with empty handle (or a single-line stub). Tests for observer firing.
3. 14 - Implement ProcessCrawlJob orchestration: full handle method that calls the fetcher, writes back outcomes, dispatches outbound links, and applies retry-on-failure logic. Feature tests covering happy path + each outcome → page status mapping.
4. 14 - Run queue worker in production via start.sh: docker prod startup, possibly v0.1 hardening for now (or filed as separate ticket if not blocking v0.1 ship).
Out of scope (later tickets)
- /botinfo page → #10
Acceptance
- Migration drops scheduled_for, locked_at, attempted_at columns and their indexes from page_crawls
- App\Observers\PageCrawlObserver dispatches ProcessCrawlJob on created
- App\Jobs\ProcessCrawlJob implements ShouldQueue with full outcome-handling logic
- Retryable outcomes create a new page_crawls row scheduled +1h, capped at 3 attempts/page
- Outbound links on Success go through RegisterDiscoveredPageAction per link
- queue:work runs alongside FrankenPHP (or separate ticket if scope creeps)