URL normalization on pages.url (strip tracking params, canonicalize) #23

Closed
opened 2026-04-26 11:47:44 +02:00 by myrmidex · 0 comments
Owner

Context

Page::firstOrCreate(['url' => ...]) matches on the raw URL string. Variants of the same canonical resource create distinct pages rows:

  • https://example.com/post
  • https://example.com/post/
  • https://example.com/post?utm_source=x
  • https://example.com/post#fragment
  • HTTPS://Example.com/post

Affects both write paths:

  • App\Listeners\UrlDiscoveredListener (fediverse-sourced)
  • App\Livewire\UrlSubmissionForm (form-sourced)

Acceptance

  • Define a canonicalization rule set (proposed): lowercase scheme + host, strip default ports (:80 for http, :443 for https), strip trailing / on path (except root), drop fragment, drop query params whose names match the tracking pattern ^(utm_.*|fbclid|gclid|ref|ref_src|igshid|mc_eid|mc_cid)$
  • Implement as a single helper (e.g. App\Support\UrlCanonicalizer::canonicalize(string $url): string)
  • Apply at every pages.url write site (UrlDiscoveredListener, UrlSubmissionForm)
  • Backfill migration for existing rows, with a dry-run command (fedi-discover:canonicalize-urls --dry-run or app-level equivalent)
  • Tests: unit tests for the canonicalizer (table-driven), feature tests verifying both write paths apply it
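
The proposed rule set could be sketched roughly as follows. This is only an illustration under stated assumptions, not the final helper: the class name and signature come from the suggestion above, while the handling of non-absolute input (returned unchanged), the "/" added for an empty path, and the dropping of userinfo are assumptions the rule set does not spell out. It uses only PHP built-ins (parse_url, preg_match).

```php
<?php

declare(strict_types=1);

// Hypothetical sketch of the proposed App\Support\UrlCanonicalizer helper
// (namespace omitted so the snippet runs standalone).
final class UrlCanonicalizer
{
    // Tracking-parameter pattern copied from the proposed rule set.
    private const TRACKING = '/^(utm_.*|fbclid|gclid|ref|ref_src|igshid|mc_eid|mc_cid)$/';

    // Ports that are dropped when they are the default for the scheme.
    private const DEFAULT_PORTS = ['http' => 80, 'https' => 443];

    public static function canonicalize(string $url): string
    {
        $parts = parse_url(trim($url));

        // Assumption: anything that is not an absolute URL is returned unchanged.
        if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
            return $url;
        }

        $scheme = strtolower($parts['scheme']);
        $host = strtolower($parts['host']);

        // Keep the port only when it is not the default for the scheme.
        $port = '';
        if (isset($parts['port']) && (self::DEFAULT_PORTS[$scheme] ?? null) !== $parts['port']) {
            $port = ':' . $parts['port'];
        }

        // Strip trailing slashes, but keep the root path as "/".
        // Assumption: an empty path is treated as the root path.
        $path = rtrim($parts['path'] ?? '/', '/');
        if ($path === '') {
            $path = '/';
        }

        // Drop tracking params; keep the remaining pairs verbatim and in order.
        $kept = [];
        foreach (explode('&', $parts['query'] ?? '') as $pair) {
            if ($pair === '') {
                continue;
            }
            $name = rawurldecode(explode('=', $pair, 2)[0]);
            if (preg_match(self::TRACKING, $name) !== 1) {
                $kept[] = $pair;
            }
        }
        $query = $kept === [] ? '' : '?' . implode('&', $kept);

        // The fragment is intentionally left out. Userinfo (user:pass@) is
        // also not preserved in this sketch.
        return $scheme . '://' . $host . $port . $path . $query;
    }
}
```

With this sketch, all five variants listed under Context canonicalize to https://example.com/post, so a table-driven unit test can reuse that list directly. The pattern is applied case-sensitively, as written in the rule set, so a parameter such as UTM_Source would be kept.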

Out of scope

  • Domain-level dedup (e.g. m.example.com vs www.example.com vs example.com). Touchier: these hosts sometimes serve different content.
  • Honoring HTML <link rel="canonical">. That lands with the crawler.
myrmidex added this to the v0.2 milestone 2026-04-26 11:47:44 +02:00
myrmidex self-assigned this 2026-04-26 11:47:44 +02:00
myrmidex added the enhancement label 2026-05-01 01:01:59 +02:00

Reference: lvl0/trove#23