Crawler: Per-domain politeness #11
Context
Make the crawler polite: never hit the same domain too often. Default 10-second minimum gap, configurable, with a hook for #9's robots.txt `Crawl-delay` to override per-domain when that lands.
Trove targets the small web (personal blogs, indie sites, fediverse-shared pages: small VPS country). The 10s default errs on the side of caution. Production-safe.
Locked-in decisions
- Gate in `ProcessCrawlJob::handle()` via Laravel's `Cache::lock($key, $delay)->get()`: atomic Redis SETNX, race-free across multiple workers. If the lock can't be claimed, `$this->release($delay)` re-queues the job.
- The delay is read from `config('crawler.min_domain_delay_seconds')`.
- `Cache::lock`: TTL semantics auto-expire the lock, so there is no manual cleanup. The test suite uses the array cache driver, which has the same atomic semantics.
- `App\Services\PolitenessService` with `minDelayFor(string $domain): int`. The v0.1 implementation returns the config value. #9 (robots.txt) extends it later with the parsed `Crawl-delay` directive.
Implementation shape
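- `App\Services\PolitenessService`: `minDelayFor(string $domain): int`, returning the config value.
- `config/crawler.php` adds: `min_domain_delay_seconds` (default 10).
- `ProcessCrawlJob::handle()` gets a politeness gate at the top: claim `Cache::lock("crawler:domain:{$domain}", $delay)`; on claim failure, `$this->release($delay)` and return early.
Notes: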
- `$lock->get()` is non-blocking: it returns true if the lock was claimed, false if it is already held. This is `SET NX EX` semantics under Redis.
- Do not call `$lock->release()` after the fetch. Let the TTL expire naturally so the domain stays in cooldown for the full delay.
- The `$this->release($delay)` call uses Laravel's job-release-with-delay, requeuing the job for `$delay` seconds later. The worker picks up other domains' jobs in the meantime, so there is no head-of-line blocking across domains.
Tests
- `PolitenessServiceTest` (unit): `minDelayFor()` returns the config default; respects a `config()->set('crawler.min_domain_delay_seconds', 30)` override.
- `ProcessCrawlJobTest` (feature) additions:
  - `test_handle_acquires_domain_lock_before_fetching`: first call to `handle` claims the lock and proceeds with the fetch
  - `test_handle_releases_job_when_domain_is_locked`: second concurrent call (lock pre-acquired) re-queues via `$this->release()` and does NOT call `FetchPageAction`
  - `test_handle_does_not_release_lock_after_completion`: the lock stays held for its TTL after the fetch (verifies we're not calling `$lock->release()`)
Out of scope
- `Crawl-delay` integration → #9. The `PolitenessService` hook is the integration point.
Acceptance
- `App\Services\PolitenessService` with `minDelayFor(string $domain): int` returning the config value
- `config/crawler.php` adds `min_domain_delay_seconds` (default 10)
- `ProcessCrawlJob::handle()` gates on `Cache::lock("crawler:domain:{$domain}", $delay)`. On claim failure, `$this->release($delay)` and return early
- Tests cover `PolitenessService` and the gate behavior in `ProcessCrawlJob`
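
For illustration only, the gate's control flow can be modelled in plain PHP without the framework. This is a sketch under assumptions, not the job's actual code: `InMemoryLockStore`, its `claim()` method, the `handleCrawlJob()` function, and the explicit `$now` argument are all hypothetical stand-ins. In the real job, `Cache::lock($key, $delay)->get()` would play the role of `claim()` and `$this->release($delay)` would replace the `released:` return value. The point is to show the three behaviors the tests target: claim then fetch, re-queue when the domain is held, and no manual release after the fetch.

```php
<?php
declare(strict_types=1);

final class PolitenessService
{
    public function __construct(private int $defaultDelaySeconds = 10) {}

    // v0.1 behavior: every domain gets the configured default.
    public function minDelayFor(string $domain): int
    {
        return $this->defaultDelaySeconds;
    }
}

// Hypothetical stand-in for an atomic "set if absent, with expiry" lock store.
final class InMemoryLockStore
{
    /** @var array<string, int> lock key => expiry timestamp (seconds) */
    private array $expiresAt = [];

    // Non-blocking claim: true if the key was free or expired, false if still held.
    public function claim(string $key, int $ttlSeconds, int $now): bool
    {
        if (isset($this->expiresAt[$key]) && $this->expiresAt[$key] > $now) {
            return false;
        }
        $this->expiresAt[$key] = $now + $ttlSeconds;
        return true;
    }
}

// Returns 'fetched' when the gate opens, or 'released:<delay>' when the job
// should be re-queued. The lock is never released manually; it only expires.
function handleCrawlJob(
    string $url,
    PolitenessService $politeness,
    InMemoryLockStore $locks,
    int $now,
    callable $fetch
): string {
    $domain = (string) parse_url($url, PHP_URL_HOST);
    $delay = $politeness->minDelayFor($domain);

    if (!$locks->claim("crawler:domain:{$domain}", $delay, $now)) {
        return "released:{$delay}";
    }

    $fetch($url);
    return 'fetched';
}

$politeness = new PolitenessService(10);
$locks = new InMemoryLockStore();
$fetch = function (string $url): void { /* the real job would fetch the page here */ };

$a = handleCrawlJob('https://example.org/a', $politeness, $locks, 0, $fetch);  // 'fetched'
$b = handleCrawlJob('https://example.org/b', $politeness, $locks, 3, $fetch);  // 'released:10'
$c = handleCrawlJob('https://example.net/c', $politeness, $locks, 3, $fetch);  // 'fetched'
$d = handleCrawlJob('https://example.org/b', $politeness, $locks, 10, $fetch); // 'fetched'
```

The second call is turned away because `example.org` is still in cooldown, while the `example.net` job at the same instant goes through, which is the "no head-of-line blocking across domains" property. Passing `$now` explicitly is a choice made for this sketch so the expiry can be exercised deterministically.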