Tighten URL extraction regex for parens-bearing URLs #17
Context
PollFediverseAction::processLinks() uses the regex ~https?://[^\s<>"\'()\[\]]+~ to extract URLs from post bodies. The character class excludes parentheses, which means it under-extracts URLs that legitimately contain them.

Example miss:
https://en.wikipedia.org/wiki/Foo_(disambiguation): only https://en.wikipedia.org/wiki/Foo_ is matched; the trailing (disambiguation) is dropped.

For the small-web target audience this is probably fine for v0.1 (Wikipedia isn't really small-web), but it's a known under-extraction bug worth flagging.
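The miss can be reproduced outside the application. The sketch below uses Python purely for illustration (the `~` delimiters suggest the project itself is PHP, which is an assumption); the character class is copied from the regex above, and the name `CURRENT_PATTERN` is hypothetical.

```python
import re

# Same character class as the regex quoted in this issue, minus the
# PHP-style ~ delimiters. Parentheses are excluded from the match.
CURRENT_PATTERN = re.compile(r'https?://[^\s<>"\'()\[\]]+')

text = "See https://en.wikipedia.org/wiki/Foo_(disambiguation) for details."
matches = CURRENT_PATTERN.findall(text)
print(matches)  # ['https://en.wikipedia.org/wiki/Foo_']
```

The match stops at the first `(`, so the rest of the path is lost.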
Goal
URL extraction handles balanced parens correctly:
- https://en.wikipedia.org/wiki/Foo_(disambiguation) → extracted intact
- (https://example.com/x) (URL inside surrounding parens) → still extracts https://example.com/x without the wrapping parens
- Trailing punctuation (. , ; : ! ?) still trimmed
Acceptance criteria
- Tests in PollFediverseActionTest covering a Wikipedia-style URL, a parenthesized URL, and a URL with a trailing close-paren that's NOT part of the URL (e.g. "see (https://example.com/foo)" should extract https://example.com/foo, not https://example.com/foo)).
References
GitHub uses this approach: https://github.com/github/cmark-gfm — autolink extension. See their algorithm for paren handling.
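
One possible shape for the fix, in the spirit of the paren rule in the GFM extended autolink spec: match greedily with parentheses allowed, then trim trailing punctuation and drop a trailing `)` only while the candidate has more closing than opening parens. This is a minimal sketch in Python for illustration, not cmark-gfm's actual code and not a drop-in for processLinks(); the names `CANDIDATE`, `trim_url` and `extract_urls` are hypothetical.

```python
import re

# Like the regex in this issue, but parentheses are allowed inside the
# match so they can be balanced afterwards.
CANDIDATE = re.compile(r'https?://[^\s<>"\'\[\]]+')
TRAILING_PUNCT = ".,;:!?"

def trim_url(candidate: str) -> str:
    """Strip trailing punctuation and unmatched trailing close-parens."""
    url = candidate
    while url:
        last = url[-1]
        if last in TRAILING_PUNCT:
            url = url[:-1]
        elif last == ")" and url.count(")") > url.count("("):
            # More ")" than "(": the last one belongs to the surrounding text.
            url = url[:-1]
        else:
            break
    return url

def extract_urls(text: str) -> list[str]:
    urls = []
    for match in CANDIDATE.finditer(text):
        url = trim_url(match.group(0))
        # Skip candidates that were trimmed down to a bare scheme.
        if not url.endswith("://"):
            urls.append(url)
    return urls

print(extract_urls("https://en.wikipedia.org/wiki/Foo_(disambiguation)"))
print(extract_urls("see (https://example.com/foo)."))
```

Counting parens over the whole candidate keeps the Wikipedia-style URL intact while still dropping a wrapping `)`. It does not try to handle parens that are unbalanced inside the URL itself.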