Crawler: Per-domain politeness #11

Closed
opened 2026-04-23 02:28:41 +02:00 by myrmidex · 0 comments
Owner

Context

Make the crawler polite. Never hit the same domain too often. Default 10-second minimum gap, configurable, with a hook for #9's robots.txt Crawl-delay to override per-domain when that lands.

Trove targets the small web (personal blogs, indie sites, fediverse-shared pages: small-VPS country). A 10s default errs on the side of caution and is safe for production.

Locked-in decisions

  • Gate location: inside ProcessCrawlJob::handle() via Laravel's Cache::lock($key, $delay)->get(), an atomic Redis SET NX EX that is race-free across multiple workers. If the lock can't be claimed, $this->release($delay) re-queues the job.
  • Default delay: 10 seconds. Configurable via config('crawler.min_domain_delay_seconds').
  • Storage: Redis-backed via Laravel's Cache::lock. The lock's TTL expires it automatically, so no manual cleanup is needed. The test suite uses the array cache driver, which gives the same claim/expire behavior within a single test process.
  • Override hook: new App\Services\PolitenessService with minDelayFor(string $domain): int. v0.1 implementation returns the config value. #9 (robots.txt) extends it later with the parsed Crawl-delay directive.

Implementation shape

App\Services\PolitenessService:

final readonly class PolitenessService
{
    public function minDelayFor(string $domain): int
    {
        // v0.1: static config value
        // #9 will extend: max(config, robotsCrawlDelay($domain) ?? 0)
        return (int) config('crawler.min_domain_delay_seconds', 10);
    }
}
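
The comment in minDelayFor() describes how #9 might merge the config value with a parsed Crawl-delay. A minimal framework-free sketch of that merge rule follows; the function name effectiveDelay and the nullable-int representation of the robots.txt value are assumptions for illustration, not part of this issue's implementation:

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: picks the larger of the configured minimum and the
 * robots.txt Crawl-delay. A null Crawl-delay means "no directive found".
 */
function effectiveDelay(int $configDelay, ?int $robotsCrawlDelay): int
{
    // The config value acts as a floor; robots.txt can only lengthen the gap.
    return max($configDelay, $robotsCrawlDelay ?? 0);
}

$noDirective = effectiveDelay(10, null); // falls back to the config value
$stricter    = effectiveDelay(10, 30);   // robots.txt asks for a longer gap
$looser      = effectiveDelay(10, 2);    // robots.txt shorter than the floor
```

Under this rule a site can ask for a longer gap but cannot push the crawler below the configured minimum.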

config/crawler.php adds:

'min_domain_delay_seconds' => env('CRAWLER_MIN_DOMAIN_DELAY_SECONDS', 10),

ProcessCrawlJob::handle() gets a politeness gate at the top:

public function handle(
    FetchPageAction $fetcher,
    RegisterDiscoveredPageAction $register,
    PolitenessService $politeness,
): void {
    $delay = $politeness->minDelayFor($this->pageCrawl->domain);
    $lock = Cache::lock("crawler:domain:{$this->pageCrawl->domain}", $delay);

    if (! $lock->get()) {
        $this->release($delay);
        return;
    }

    // ... existing fetch/outcome/retry logic
}

Notes:

  • $lock->get() is non-blocking: it returns true if the lock was claimed and false if it is already held. This is SET NX EX semantics under Redis.
  • We don't call $lock->release() after the fetch. The TTL expires naturally, so the domain stays in cooldown for the full delay, counted from the moment the lock was claimed.
  • The release($delay) call uses Laravel's delayed job release, making the job available again $delay seconds later. Workers pick up other domains' jobs in the meantime, so there is no head-of-line blocking across domains.
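
The gate behavior in the notes above can be modeled without Laravel or Redis. The sketch below is only an in-memory stand-in: the class InMemoryTtlLocks, the function gate(), and the explicit $now clock parameter are hypothetical names introduced for illustration, and the real implementation relies on Cache::lock and $this->release() as shown earlier.

```php
<?php

declare(strict_types=1);

// Hypothetical in-memory stand-in for a TTL lock store; not Laravel's implementation.
final class InMemoryTtlLocks
{
    /** @var array<string, int> lock key => expiry time in seconds */
    private array $expiresAt = [];

    /** Non-blocking claim: true if the key was free or expired, false if still held. */
    public function claim(string $key, int $ttlSeconds, int $now): bool
    {
        if (($this->expiresAt[$key] ?? 0) > $now) {
            return false; // still cooling down
        }
        // Never released explicitly; the entry simply expires after the TTL.
        $this->expiresAt[$key] = $now + $ttlSeconds;
        return true;
    }
}

/** Returns 'fetch' if the gate opened, or 'release:<delay>' if the job would be re-queued. */
function gate(InMemoryTtlLocks $locks, string $domain, int $delay, int $now): string
{
    return $locks->claim("crawler:domain:{$domain}", $delay, $now)
        ? 'fetch'
        : "release:{$delay}";
}

$locks = new InMemoryTtlLocks();
$results = [
    gate($locks, 'example.org', 10, 0),  // first job claims the lock
    gate($locks, 'example.org', 10, 3),  // same domain 3s later: re-queued
    gate($locks, 'example.net', 10, 3),  // a different domain is unaffected
    gate($locks, 'example.org', 10, 10), // TTL elapsed: claimable again
];
print_r($results);
```

Passing the clock in as $now keeps the sketch deterministic; it avoids sleeping to wait for a TTL to lapse.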

Tests

  • PolitenessServiceTest (unit): minDelayFor() returns the config default; respects config()->set('crawler.min_domain_delay_seconds', 30) override.
  • ProcessCrawlJobTest (feature) additions:
    • test_handle_acquires_domain_lock_before_fetching — first call to handle claims the lock and proceeds with the fetch
    • test_handle_releases_job_when_domain_is_locked — second concurrent call (lock pre-acquired) re-queues via $this->release() and does NOT call FetchPageAction
    • test_handle_does_not_release_lock_after_completion — the lock stays held for its TTL after the fetch (verifies we're not calling $lock->release())

Out of scope

  • Robots.txt parsing and Crawl-delay integration → #9. The PolitenessService hook is the integration point.
  • Per-domain backoff escalation (e.g. "if 5xx, double the delay") → post-v0.1
  • Admin/debug UI showing current cooldowns → post-v0.1

Acceptance

  • App\Services\PolitenessService with minDelayFor(string $domain): int returning config value
  • config/crawler.php adds min_domain_delay_seconds (default 10)
  • ProcessCrawlJob::handle() gates on Cache::lock("crawler:domain:{$domain}", $delay). On claim failure, $this->release($delay) and return early
  • Tests for PolitenessService and the gate behavior in ProcessCrawlJob
  • PLATFORM.md updated: politeness section under crawler architecture; document the hook for #9 to extend
myrmidex added this to the v0.1 milestone 2026-04-23 02:28:41 +02:00
myrmidex added the
enhancement
label 2026-04-26 01:28:09 +02:00
myrmidex self-assigned this 2026-04-27 00:48:19 +02:00
Reference: lvl0/trove#11