Backfill command for crawler queue (catch pages missed by observer) #27

Open
opened 2026-04-26 14:31:42 +02:00 by myrmidex · 0 comments
Owner

Context

Ticket #8 wires queue population via an Eloquent Page::created observer. That covers the happy path: every newly created pages row gets a corresponding page_crawls row inserted synchronously.

Failure mode: anything that creates a Page without firing the created event misses the queue. Examples:

  • Direct INSERTs (seeders, raw DB::table('pages')->insert(...), restoring from backup, manual SQL)
  • Worker / observer running while the worker process was crashed/restarted (events fire only at insert time, not after)
  • A future bulk-import path that uses insertOrIgnore (skips Eloquent events for performance)
  • Schema changes that re-run the observer logic and need to catch existing rows

A scheduled backfill command catches these orphans.

Acceptance

  • App\Console\Commands\PopulateCrawlQueueCommandcrawler:populate-queue (or similar)
  • Calls a new App\Actions\PopulateCrawlQueueAction that finds pages with status=Discovered AND no pending page_crawls row (outcome IS NULL), and inserts a page_crawls row for each
  • Action is idempotent (safe to run repeatedly) — uses firstOrCreate semantics or equivalent on (page_id) filtered by outcome IS NULL
  • Schedule it everyFifteenMinutes()->withoutOverlapping(5)->runInBackground() in routes/console.php — backfill cadence; observer handles the hot path
  • Command reports counts: pages scanned, queue rows inserted, skipped (already queued)
  • Tests covering: action idempotency, domain extraction (reuse UrlService::host), command exit codes

Notes

  • Reuses App\Services\UrlService::host() from #8 — no duplicate URL parsing
  • Acts as a safety net, not the primary queue-population mechanism — keep cadence relaxed (15 min, not every minute)
  • If observer-only proves sufficient through v0.1, this can stay deferred to v0.2 (or even later)
## Context Ticket #8 wires queue population via an Eloquent `Page::created` observer. That covers the happy path: every newly created `pages` row gets a corresponding `page_crawls` row inserted synchronously. Failure mode: anything that creates a `Page` *without* firing the `created` event misses the queue. Examples: - Direct INSERTs (seeders, raw `DB::table('pages')->insert(...)`, restoring from backup, manual SQL) - Worker / observer running while the worker process was crashed/restarted (events fire only at insert time, not after) - A future bulk-import path that uses `insertOrIgnore` (skips Eloquent events for performance) - Schema changes that re-run the observer logic and need to catch existing rows A scheduled backfill command catches these orphans. ## Acceptance - [ ] `App\Console\Commands\PopulateCrawlQueueCommand` — `crawler:populate-queue` (or similar) - [ ] Calls a new `App\Actions\PopulateCrawlQueueAction` that finds `pages` with `status=Discovered` AND no pending `page_crawls` row (`outcome IS NULL`), and inserts a `page_crawls` row for each - [ ] Action is idempotent (safe to run repeatedly) — uses `firstOrCreate` semantics or equivalent on `(page_id)` filtered by `outcome IS NULL` - [ ] Schedule it `everyFifteenMinutes()->withoutOverlapping(5)->runInBackground()` in `routes/console.php` — backfill cadence; observer handles the hot path - [ ] Command reports counts: pages scanned, queue rows inserted, skipped (already queued) - [ ] Tests covering: action idempotency, domain extraction (reuse `UrlService::host`), command exit codes ## Notes - Reuses `App\Services\UrlService::host()` from #8 — no duplicate URL parsing - Acts as a safety net, not the primary queue-population mechanism — keep cadence relaxed (15 min, not every minute) - If observer-only proves sufficient through v0.1, this can stay deferred to v0.2 (or even later)
myrmidex added this to the v0.2 milestone 2026-04-26 14:31:42 +02:00
myrmidex self-assigned this 2026-04-26 14:31:42 +02:00
myrmidex modified the milestone from v0.2 to v0.3 2026-04-29 23:55:37 +02:00
myrmidex added the
enhancement
label 2026-05-01 01:02:00 +02:00
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference: lvl0/trove#27
No description provided.