Tighten URL extraction regex for parens-bearing URLs #17

Open
opened 2026-04-26 01:26:42 +02:00 by myrmidex · 0 comments
Owner

Context

PollFediverseAction::processLinks() uses the regex ~https?://[^\s<>"\'()\[\]]+~ to extract URLs from post bodies. Because the character class excludes parentheses, a match stops at the first ( or ), so URLs that legitimately contain parentheses are truncated.

Example miss: https://en.wikipedia.org/wiki/Foo_(disambiguation) — only https://en.wikipedia.org/wiki/Foo_ is matched, the trailing (disambiguation) is dropped.
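The miss can be reproduced outside the codebase. The snippet below is only an illustration in Python (the project itself appears to be PHP); it assumes the character class carries over unchanged once the ~ delimiters and the \' string escape are dropped. The name CURRENT_RE is made up for this example.

```python
import re

# Python transcription of the pattern quoted in this issue (assumption:
# identical character class; only the PHP delimiters and \' escape differ).
CURRENT_RE = re.compile(r"""https?://[^\s<>"'()\[\]]+""")

text = "See https://en.wikipedia.org/wiki/Foo_(disambiguation) for details."

# The match stops at the first "(" because parentheses are excluded.
print(CURRENT_RE.findall(text))  # ['https://en.wikipedia.org/wiki/Foo_']
```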

For the small-web target audience this is probably fine for v0.1 (Wikipedia isn't really small-web), but it's a known under-extraction bug worth flagging.

Goal

URL extraction handles balanced parens correctly:

  • https://en.wikipedia.org/wiki/Foo_(disambiguation) → extracted intact
  • (https://example.com/x) (URL inside surrounding parens) → still extracts https://example.com/x without the wrapping parens
  • Trailing punctuation (., ,, ;, :, !, ?) still trimmed
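One way the three goals could fit together is a wider candidate regex followed by a trimming pass. This is a rough Python sketch, not the project's implementation: CANDIDATE_RE, trim_url and extract_urls are hypothetical names, and the sketch assumes that simply allowing ( and ) in the character class is acceptable for the candidate match.

```python
import re

# Assumption: same character class as the current regex, but with "(" and ")"
# allowed so they can be judged after matching.
CANDIDATE_RE = re.compile(r"""https?://[^\s<>"'\[\]]+""")
TRAILING_PUNCT = ".,;:!?"


def trim_url(candidate: str) -> str:
    """Strip trailing punctuation and unbalanced trailing close-parens."""
    url = candidate
    while url:
        last = url[-1]
        if last in TRAILING_PUNCT:
            url = url[:-1]
        elif last == ")" and url.count(")") > url.count("("):
            # More closers than openers: this ")" belongs to the prose.
            url = url[:-1]
        else:
            break
    return url


def extract_urls(text: str) -> list[str]:
    return [trim_url(m.group(0)) for m in CANDIDATE_RE.finditer(text)]


print(extract_urls("https://en.wikipedia.org/wiki/Foo_(disambiguation)"))
print(extract_urls("(https://example.com/x)"))
print(extract_urls("Read https://example.com/a, then stop."))
```

The trim loop only looks at the end of the candidate, so a stray ) in the middle of a URL is left alone; whether that matters for real posts is an open question.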

Acceptance criteria

  • New test cases in PollFediverseActionTest covering a Wikipedia-style URL, a parenthesized URL, and a URL followed by a close-paren that is NOT part of the URL (e.g. "see (https://example.com/foo)" should extract https://example.com/foo, not https://example.com/foo)).
  • Existing tests stay green.
  • Implementation can be a smarter regex (balanced-paren matching is non-trivial in regex) OR a post-filter that walks the URL string and counts parens.
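For the "smarter regex" option, a pattern that accepts a single level of balanced parentheses may be enough in practice. The following Python sketch is an assumption-laden illustration (PLAIN, URL_RE and the one-level nesting limit are choices made here, not taken from the codebase); the same constructs should be available in PCRE, but that port is untested.

```python
import re

# Characters allowed outside parentheses: the current class, unchanged.
PLAIN = r"""[^\s<>"'()\[\]]"""

# A URL body is a run of plain characters or "(...)" groups (one level deep).
# The negative lookbehind makes the engine back off trailing punctuation.
URL_RE = re.compile(
    r"https?://(?:" + PLAIN + r"|\(" + PLAIN + r"*\))+(?<![.,;:!?])"
)

samples = [
    "https://en.wikipedia.org/wiki/Foo_(disambiguation).",
    "see (https://example.com/foo)",
    "Read https://example.com/a, then stop.",
]
for sample in samples:
    print(URL_RE.findall(sample))
```

An unmatched ( simply ends the match, and nested groups such as (a(b)) are not recognized as a unit; the paren-counting post-filter handles those cases more gracefully.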

References

GitHub's cmark-gfm autolink extension (https://github.com/github/cmark-gfm) uses the paren-counting approach. See its algorithm for paren handling.

myrmidex added this to the v0.2 milestone 2026-04-26 01:26:42 +02:00
myrmidex self-assigned this 2026-04-26 01:26:42 +02:00
myrmidex added the enhancement label 2026-04-26 01:28:09 +02:00
myrmidex removed the enhancement label 2026-04-26 01:30:13 +02:00
myrmidex modified the milestone from v0.2 to v0.3 2026-04-29 23:55:27 +02:00
myrmidex added the enhancement label 2026-05-01 01:01:59 +02:00
Reference: lvl0/trove#17