Crawler: User agent and /bot page #10

Closed
opened 2026-04-23 02:28:22 +02:00 by myrmidex · 0 comments
Owner

Context

Original text was lightweight and accurate. This rewrite just locks in the design choices.

The crawler currently sends User-Agent: TroveBot/0.1 (+https://trove.lvl0.xyz/bot) (set during #12 as a placeholder), but that URL currently returns a 404. This ticket makes the URL real and ensures the UA is sent on every outbound HTTP request.

Locked-in decisions

  • UA string: keep TroveBot/0.1 (+https://trove.lvl0.xyz/bot). No Mozilla/5.0 (compatible; ...) prefix — we don't pretend to be a browser. Honest identity.
  • /bot page tech: plain Blade view via Route::view('/bot', 'bot'). Uses the existing <x-layout> component. No Livewire (no interactivity needed). No controller.
  • Robots.txt claim: the page documents the bot's contract, i.e. that it respects User-agent: TroveBot rules in robots.txt. The worker isn't pointed at arbitrary domains until #9 (robots.txt handling) lands, so the claim will hold by the time of the actual production rollout. Until then, accept that v0.1 is dev-only.
  • Contact: Forge issues link (https://forge.lvl0.xyz/lvl0/trove/issues). No email — avoids harvesting, transparent process.
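
A minimal sketch of the route decision above, assuming the stock Laravel routes/web.php layout. Only Route::view('/bot', 'bot') comes from this ticket; the surrounding file contents are an assumption.

```php
<?php

// routes/web.php (sketch; the rest of the file is assumed, not shown)

use Illuminate\Support\Facades\Route;

// Static page with no controller and no Livewire:
// renders resources/views/bot.blade.php directly.
Route::view('/bot', 'bot');
```

Route::view registers a GET/HEAD route that returns the named view, which matches the "no controller" decision.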

Page content

  • Title: "About TroveBot" or similar
  • What it is: one paragraph — "Trove is a federated search engine for the small web, seeded by fediverse attention. TroveBot is its crawler — it discovers and indexes URLs shared by people on the fediverse."
  • Identity: User-Agent string verbatim, current version
  • Crawling behavior: respects robots.txt (under User-agent: TroveBot), per-domain rate limit (TBD per #11), follows at most 5 redirects, fetches only HTML and ignores non-HTML responses
  • Opt-out: example robots.txt block:
    User-agent: TroveBot
    Disallow: /
    
  • Contact: link to https://forge.lvl0.xyz/lvl0/trove/issues
  • Source: link to https://forge.lvl0.xyz/lvl0/trove

Acceptance

  • Route::view('/bot', 'bot') registered in routes/web.php
  • resources/views/bot.blade.php — uses <x-layout>, contains all content sections above
  • Confirm config('crawler.user_agent') is the final v0.1 string (already is — 'TroveBot/0.1 (+https://trove.lvl0.xyz/bot)')
  • Feature test: GET /bot returns 200, contains the UA string and the robots.txt opt-out example
  • Feature test: outbound HTTP request from FetchPageAction includes the User-Agent header (already covered? verify; if not, add an explicit assertion)
  • PLATFORM.md updated with the /bot route and the bot's public contract
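
The two feature tests above could look roughly like the sketch below. It makes several assumptions that need checking against the codebase: the class name BotPageTest, the App\Actions namespace, and the execute() method on FetchPageAction are hypothetical, and Http::fake() only intercepts the request if the action goes through Laravel's Http client.

```php
<?php

namespace Tests\Feature;

use App\Actions\FetchPageAction; // assumed namespace, adjust to the real one
use Illuminate\Http\Client\Request;
use Illuminate\Support\Facades\Http;
use Tests\TestCase;

class BotPageTest extends TestCase
{
    private const UA = 'TroveBot/0.1 (+https://trove.lvl0.xyz/bot)';

    public function test_bot_page_shows_identity_and_opt_out(): void
    {
        // The configured UA must be the final v0.1 string.
        $this->assertSame(self::UA, config('crawler.user_agent'));

        $this->get('/bot')
            ->assertOk()
            ->assertSee(self::UA)
            ->assertSee('User-agent: TroveBot')
            ->assertSee('Disallow: /');
    }

    public function test_fetch_page_action_sends_user_agent(): void
    {
        // Fake every outbound request so nothing leaves the test process.
        Http::fake([
            '*' => Http::response('<html></html>', 200, ['Content-Type' => 'text/html']),
        ]);

        // Hypothetical invocation: replace with the action's real entry point.
        app(FetchPageAction::class)->execute('https://example.test/');

        // header() looks the name up case-insensitively and returns an array of values.
        Http::assertSent(fn (Request $request) => in_array(
            self::UA,
            $request->header('User-Agent'),
            true
        ));
    }
}
```

If an existing FetchPageAction test already asserts the header, the second test is redundant and can be dropped.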

Out of scope

  • Actual robots.txt enforcement → #9
  • Per-domain politeness numbers → #11 (the /bot page can stay vague about exact rate; reference "polite, configurable" or similar)
  • Welcome page redesign — / still serves Laravel default welcome until search UI lands
myrmidex added this to the v0.1 milestone 2026-04-23 02:28:22 +02:00
myrmidex added the enhancement label 2026-04-26 01:28:09 +02:00
myrmidex self-assigned this 2026-04-27 00:28:51 +02:00
Reference: lvl0/trove#10