URL normalization on pages.url (strip tracking params, canonicalize) #23
Context
Page::firstOrCreate(['url' => ...]) matches on the raw URL string. Variants of the same canonical resource create distinct pages rows:
- https://example.com/post
- https://example.com/post/
- https://example.com/post?utm_source=x
- https://example.com/post#fragment
- HTTPS://Example.com/post
Affects both write paths:
- App\Listeners\UrlDiscoveredListener (fediverse-sourced)
- App\Livewire\UrlSubmissionForm (form-sourced)
Acceptance
- […] :80, :443), strip trailing / on path (except root), drop fragment, drop tracking params matching ^(utm_.*|fbclid|gclid|ref|ref_src|igshid|mc_eid|mc_cid)$
- […] App\Support\UrlCanonicalizer::canonicalize(string $url): string)
- […] pages.url write site (UrlDiscoveredListener, UrlSubmissionForm)
- […] fedi-discover:canonicalize-urls --dry-run or app-level equivalent)
Out of scope
- […] m.example.com vs www.example.com vs example.com). Touchier: different content sometimes.
- […] <link rel="canonical">; that lands with the crawler.
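A minimal sketch of what App\Support\UrlCanonicalizer::canonicalize() could look like under the Acceptance rules, using only PHP built-ins. The class name, signature, and tracking-param pattern come from this issue. Several choices are assumptions, not confirmed here: lowercasing scheme and host (suggested by the HTTPS://Example.com/post variant), reading the :80/:443 fragment as "drop default ports", normalizing an empty path to /, and returning unparseable input unchanged.

```php
<?php

declare(strict_types=1);

namespace App\Support;

final class UrlCanonicalizer
{
    // Tracking-param pattern copied from the Acceptance list (case-sensitive as written).
    private const TRACKING_PARAM = '/^(utm_.*|fbclid|gclid|ref|ref_src|igshid|mc_eid|mc_cid)$/';

    public static function canonicalize(string $url): string
    {
        $parts = parse_url(trim($url));

        // Assumption: input without a scheme and host is returned untouched.
        if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
            return $url;
        }

        // Assumption: scheme and host are lowercased.
        $scheme = strtolower($parts['scheme']);
        $host = strtolower($parts['host']);

        // Assumption: ":80" on http and ":443" on https are dropped as default ports.
        $port = $parts['port'] ?? null;
        if (($scheme === 'http' && $port === 80) || ($scheme === 'https' && $port === 443)) {
            $port = null;
        }

        // Strip trailing "/" on the path; the root stays "/".
        // Assumption: an empty path is also normalized to "/".
        $path = rtrim($parts['path'] ?? '', '/');
        if ($path === '') {
            $path = '/';
        }

        // Drop tracking params; keep every other pair byte-for-byte and in order.
        $kept = [];
        foreach (explode('&', $parts['query'] ?? '') as $pair) {
            if ($pair === '') {
                continue;
            }
            $name = urldecode(explode('=', $pair, 2)[0]);
            if (preg_match(self::TRACKING_PARAM, $name) !== 1) {
                $kept[] = $pair;
            }
        }

        // Preserve userinfo if present (rare, but silently dropping it would change the URL).
        $auth = '';
        if (isset($parts['user'])) {
            $auth = $parts['user'] . (isset($parts['pass']) ? ':' . $parts['pass'] : '') . '@';
        }

        // The fragment is dropped simply by never appending it.
        $out = $scheme . '://' . $auth . $host;
        if ($port !== null) {
            $out .= ':' . $port;
        }
        $out .= $path;
        if ($kept !== []) {
            $out .= '?' . implode('&', $kept);
        }

        return $out;
    }
}
```

With this sketch, all five variants listed under Context come out as https://example.com/post, so a write site that canonicalizes before calling Page::firstOrCreate(['url' => ...]) would match a single row. The query string is split by hand rather than with parse_str() because parse_str() rewrites dots and spaces in parameter names, which would alter the params that are kept.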