URL normalization on pages.url (strip tracking params, canonicalize) #23

New issue

Closed

opened 2026-04-26 11:47:44 +02:00 by myrmidex · 0 comments

myrmidex commented

2026-04-26 11:47:44 +02:00

Owner

Context

Page::firstOrCreate(['url' => ...]) matches on the raw URL string. Variants of the same canonical resource create distinct pages rows:

https://example.com/post
https://example.com/post/
https://example.com/post?utm_source=x
https://example.com/post#fragment
HTTPS://Example.com/post

Affects both write paths:

App\Listeners\UrlDiscoveredListener (fediverse-sourced)
App\Livewire\UrlSubmissionForm (form-sourced)

Acceptance

Define a canonicalization rule set (proposed): lowercase scheme + host, strip default ports (:80, :443), strip trailing / on path (except root), drop fragment, drop tracking params matching ^(utm_.*|fbclid|gclid|ref|ref_src|igshid|mc_eid|mc_cid)$
Implement as a single helper (e.g. App\Support\UrlCanonicalizer::canonicalize(string $url): string)
Apply at every pages.url write site (UrlDiscoveredListener, UrlSubmissionForm)
Backfill migration for existing rows, with a dry-run command (fedi-discover:canonicalize-urls --dry-run or app-level equivalent)
Tests: unit tests for the canonicalizer (table-driven), feature tests verifying both write paths apply it

Out of scope

Domain-level dedup (e.g. m.example.com vs www.example.com vs example.com). Touchier — different content sometimes.
HTML <link rel="canonical"> — that lands with the crawler.

## Context `Page::firstOrCreate(['url' => ...])` matches on the raw URL string. Variants of the same canonical resource create distinct `pages` rows: - `https://example.com/post` - `https://example.com/post/` - `https://example.com/post?utm_source=x` - `https://example.com/post#fragment` - `HTTPS://Example.com/post` Affects both write paths: - `App\Listeners\UrlDiscoveredListener` (fediverse-sourced) - `App\Livewire\UrlSubmissionForm` (form-sourced) ## Acceptance - [ ] Define a canonicalization rule set (proposed): lowercase scheme + host, strip default ports (`:80`, `:443`), strip trailing `/` on path (except root), drop fragment, drop tracking params matching `^(utm_.*|fbclid|gclid|ref|ref_src|igshid|mc_eid|mc_cid)$` - [ ] Implement as a single helper (e.g. `App\Support\UrlCanonicalizer::canonicalize(string $url): string`) - [ ] Apply at every `pages.url` write site (`UrlDiscoveredListener`, `UrlSubmissionForm`) - [ ] Backfill migration for existing rows, with a dry-run command (`fedi-discover:canonicalize-urls --dry-run` or app-level equivalent) - [ ] Tests: unit tests for the canonicalizer (table-driven), feature tests verifying both write paths apply it ## Out of scope - Domain-level dedup (e.g. `m.example.com` vs `www.example.com` vs `example.com`). Touchier — different content sometimes. - HTML `<link rel="canonical">` — that lands with the crawler.