Tighten URL extraction regex for parens-bearing URLs #17


Open

opened 2026-04-26 01:26:42 +02:00 by myrmidex · 0 comments

Owner

Context

PollFediverseAction::processLinks() uses the regex ~https?://[^\s<>"\'()\[\]]+~ to extract URLs from post bodies. Because the character class excludes parentheses, matching stops at the first paren, so URLs that legitimately contain parentheses are under-extracted.

Example miss: for https://en.wikipedia.org/wiki/Foo_(disambiguation), only https://en.wikipedia.org/wiki/Foo_ is matched; the trailing (disambiguation) is dropped.

For the small-web target audience this is probably fine for v0.1 (Wikipedia isn't really small-web), but it's a known under-extraction bug worth flagging.
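
The miss can be reproduced in isolation. The sketch below assumes PHP (suggested by the `::` method syntax and the `~` delimiters); `extractWithCurrentRegex()` is a hypothetical stand-in for the matching step, and only the pattern itself is taken from the text above.

```php
<?php
/**
 * Hypothetical stand-in for the matching step in processLinks();
 * only the pattern is copied from the issue text.
 *
 * @return string[]
 */
function extractWithCurrentRegex(string $body): array
{
    // Parentheses are in the excluded set, so a match stops at the first '('.
    preg_match_all('~https?://[^\s<>"\'()\[\]]+~', $body, $matches);

    return $matches[0];
}

$found = extractWithCurrentRegex(
    'See https://en.wikipedia.org/wiki/Foo_(disambiguation) for details.'
);
// $found === ['https://en.wikipedia.org/wiki/Foo_']
```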

Goal

URL extraction handles balanced parens correctly:

- https://en.wikipedia.org/wiki/Foo_(disambiguation) → extracted intact
- (https://example.com/x) (URL inside surrounding parens) → still extracts https://example.com/x without the wrapping parens
- Trailing punctuation (., ,, ;, :, !, ?) still trimmed
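
One way the three behaviours above could be met is a post-filter: match a candidate that is allowed to contain parens, then repeatedly strip trailing punctuation and any trailing `)` that has no matching `(` inside the candidate. This is only a sketch under assumptions. The helper names `trimUrlCandidate()` and `extractUrls()` are hypothetical, and the relaxed pattern is the existing one with `()` removed from the excluded set.

```php
<?php
/**
 * Hypothetical post-filter: trims trailing punctuation and unmatched ')'.
 */
function trimUrlCandidate(string $url): string
{
    do {
        $before = $url;

        // Drop trailing sentence punctuation.
        $url = rtrim($url, '.,;:!?');

        // Drop one trailing ')' when the candidate holds more ')' than '('.
        if (substr($url, -1) === ')'
            && substr_count($url, ')') > substr_count($url, '(')) {
            $url = substr($url, 0, -1);
        }
    } while ($url !== $before); // repeat until nothing more is trimmed

    return $url;
}

/**
 * Hypothetical extraction step using a pattern that permits parens.
 *
 * @return string[]
 */
function extractUrls(string $body): array
{
    preg_match_all('~https?://[^\s<>"\'\[\]]+~', $body, $matches);

    return array_map('trimUrlCandidate', $matches[0]);
}

$urls = extractUrls('see (https://example.com/foo). Also https://en.wikipedia.org/wiki/Foo_(disambiguation)!');
// $urls === ['https://example.com/foo', 'https://en.wikipedia.org/wiki/Foo_(disambiguation)']
```

Counting over the whole candidate rather than tracking nesting keeps the filter short; it would accept oddly ordered parens such as `)(` inside a URL, which may or may not matter here.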

Acceptance criteria

- New test cases in PollFediverseActionTest covering a Wikipedia-style URL, a parenthesized URL, and a URL followed by a close-paren that is NOT part of the URL (e.g. "see (https://example.com/foo)" should extract https://example.com/foo, without the trailing close-paren).
- Existing tests stay green.
- Implementation can be a smarter regex (balanced-paren matching is non-trivial in regex) OR a post-filter that walks the URL string and counts parens.
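
On the "smarter regex" option: PCRE, which PHP's `preg_*` functions use, supports recursive subpattern calls, so balanced parens can be expressed in the pattern itself. The following is a rough sketch, not a vetted implementation. The function name `extractUrlsRecursive()` and the group name `body` are made up for illustration, and trailing punctuation is still trimmed in a separate step.

```php
<?php
/**
 * Hypothetical regex-only variant using a recursive named group.
 *
 * @return string[]
 */
function extractUrlsRecursive(string $body): array
{
    // body := one or more of: a run of ordinary URL characters,
    //         or a balanced "( ... )" group whose content is again a body.
    // The possessive ++ keeps backtracking bounded on unbalanced input.
    $pattern = '~https?://(?<body>(?:[^\s<>"\'()\[\]]++|\((?&body)?\))+)~';

    preg_match_all($pattern, $body, $matches);

    // Trailing punctuation is handled outside the pattern.
    return array_map(
        static fn (string $url): string => rtrim($url, '.,;:!?'),
        $matches[0]
    );
}

$urls = extractUrlsRecursive('see (https://example.com/foo).');
// $urls === ['https://example.com/foo']
```

With this pattern an unmatched `(` ends the URL at that point, so `https://example.com/foo(bar` would yield `https://example.com/foo`; whether that is the desired behaviour is a design choice the tests would need to pin down.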

References

GitHub's cmark-gfm autolink extension uses the paren-counting approach: https://github.com/github/cmark-gfm. See its algorithm for paren handling.
