Crawler: User agent and /bot page #10

Closed

opened 2026-04-23 02:28:22 +02:00 by myrmidex (Owner) · 0 comments

Context

Original text was lightweight and accurate. This rewrite just locks in the design choices.

The crawler currently sends User-Agent: TroveBot/0.1 (+https://trove.lvl0.xyz/bot) (set during #12 as a placeholder), but that URL 404s. This ticket makes the URL real and ensures the UA reaches every outbound HTTP request.
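
One way to make sure the header reaches every outbound request is to build all crawler requests from a single preconfigured client. The sketch below assumes the crawler uses Laravel's HTTP client; the `crawler` macro name and its placement in `AppServiceProvider` are hypothetical and not taken from the codebase.

```php
<?php

namespace App\Providers;

use Illuminate\Support\Facades\Http;
use Illuminate\Support\ServiceProvider;

class AppServiceProvider extends ServiceProvider
{
    public function boot(): void
    {
        // Hypothetical macro: any request built via Http::crawler()
        // carries the configured TroveBot User-Agent header.
        Http::macro('crawler', function () {
            return Http::withUserAgent(config('crawler.user_agent'));
        });
    }
}
```

Under these assumptions, crawler code would call `Http::crawler()->get($url)` rather than `Http::get($url)`, so the UA is set in exactly one place.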

Locked-in decisions

UA string: keep TroveBot/0.1 (+https://trove.lvl0.xyz/bot). No Mozilla/5.0 (compatible; ...) prefix — we don't pretend to be a browser. Honest identity.
/bot page tech: plain Blade view via Route::view('/bot', 'bot'). Uses the existing <x-layout> component. No Livewire (no interactivity needed). No controller.
Robots.txt claim: the page documents the bot's contract: TroveBot respects User-agent: TroveBot rules in robots.txt. The worker is not pointed at arbitrary domains until #9 (robots.txt handling) lands, so the claim will hold by the time of the actual production rollout. Until then, accept that v0.1 is dev-only.
Contact: Forge issues link (https://forge.lvl0.xyz/lvl0/trove/issues). No email — avoids harvesting, transparent process.
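
For reference, the route decision above amounts to a single line. A minimal sketch of the relevant part of routes/web.php, assuming no other middleware or naming is needed:

```php
<?php

// routes/web.php

use Illuminate\Support\Facades\Route;

// Serves resources/views/bot.blade.php at /bot; no controller involved.
Route::view('/bot', 'bot');
```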

Page content

Title: "About TroveBot" or similar
What it is: one paragraph — "Trove is a federated search engine for the small web, seeded by fediverse attention. TroveBot is its crawler — it discovers and indexes URLs shared by people on the fediverse."
Identity: User-Agent string verbatim, current version
Crawling behavior: respects robots.txt (under User-agent: TroveBot), per-domain rate limit (TBD per #11), follows ≤5 redirects, processes only HTML (non-HTML responses are ignored)
Opt-out: example robots.txt block:
```
User-agent: TroveBot
Disallow: /
```
Contact: link to https://forge.lvl0.xyz/lvl0/trove/issues
Source: link to https://forge.lvl0.xyz/lvl0/trove

Acceptance

Route::view('/bot', 'bot') registered in routes/web.php
resources/views/bot.blade.php — uses <x-layout>, contains all content sections above
Confirm config('crawler.user_agent') is the final v0.1 string (already is — 'TroveBot/0.1 (+https://trove.lvl0.xyz/bot)')
Feature test: GET /bot returns 200, contains the UA string and the robots.txt opt-out example
Feature test: outbound HTTP request from FetchPageAction includes the User-Agent header (already covered? verify; if not, add an explicit assertion)
PLATFORM.md updated with the /bot route and the bot's public contract
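
The two feature tests above could look roughly like the sketch below. It assumes the crawler uses Laravel's HTTP client (so `Http::fake()` intercepts it), that `FetchPageAction` lives in `App\Actions`, is resolvable from the container, and exposes an `execute(string $url)` method. The namespace, the method name, and the test class name are all hypothetical; adjust them to the real class.

```php
<?php

namespace Tests\Feature;

use App\Actions\FetchPageAction; // assumed namespace
use Illuminate\Http\Client\Request;
use Illuminate\Support\Facades\Http;
use Tests\TestCase;

class BotPageTest extends TestCase
{
    public function test_bot_page_shows_user_agent_and_opt_out_example(): void
    {
        $this->get('/bot')
            ->assertOk()
            ->assertSee(config('crawler.user_agent'))
            ->assertSee('User-agent: TroveBot')
            ->assertSee('Disallow: /');
    }

    public function test_fetch_page_action_sends_user_agent_header(): void
    {
        // Stub every outbound request so nothing leaves the test process.
        Http::fake([
            '*' => Http::response('<html></html>', 200, ['Content-Type' => 'text/html']),
        ]);

        // Assumption: hypothetical execute() signature, see note above.
        app(FetchPageAction::class)->execute('https://example.org/');

        // The recorded request must carry the configured UA string.
        Http::assertSent(function (Request $request) {
            return $request->hasHeader('User-Agent', config('crawler.user_agent'));
        });
    }
}
```

If an existing test already asserts the header on `FetchPageAction`, the second method is redundant and only the `/bot` page test needs adding.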

Out of scope

Actual robots.txt enforcement → #9
Per-domain politeness numbers → #11 (the /bot page can stay vague about exact rate; reference "polite, configurable" or similar)
Welcome page redesign — / still serves Laravel default welcome until search UI lands
