URL Discovery and Storage #4

Closed

opened 2026-04-23 01:32:53 +02:00 by myrmidex · 0 comments

Owner

- [ ] Migration: discovered_urls (url, source, source_instance, source_post_id, engagement_count, first_seen_at, provenance json)
- [ ] DiscoveredUrl model
- [ ] Extract URLs from Mastodon Note content (HTML parsing)
- [ ] Extract URLs from Lemmy Page objects (url + body)
- [ ] URL normalization (strip tracking params, lowercase host)
- [ ] Deduplicate within batch
- [ ] Queue job ProcessFediverseBatch for extraction + storage
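
The two extraction tasks could look roughly like the sketch below. It is written in Python purely for illustration and is not the project's implementation. The helper names (`LinkCollector`, `urls_from_note`, `urls_from_page`) are hypothetical, and so are the input shapes: a Note is assumed to be a dict with an HTML `content` field, and a Lemmy Page a dict with `url` and `body` keys, following the wording of the task list. Skipping links whose class contains `mention` or `hashtag` is likewise an assumption about how Mastodon marks up @mentions and tags.

```python
import re
from html.parser import HTMLParser

# Loose pattern for bare URLs in Markdown or plain text (illustrative only).
URL_RE = re.compile(r"https?://[^\s<>()\[\]]+")


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags, skipping mention and hashtag links."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        # Assumption: @mentions and #hashtags carry these CSS classes.
        if "mention" in classes or "hashtag" in classes:
            return
        href = attr.get("href")
        if href and href.startswith(("http://", "https://")):
            self.urls.append(href)


def urls_from_note(note):
    """Return external links found in a Note's HTML content."""
    parser = LinkCollector()
    parser.feed(note.get("content") or "")
    parser.close()
    return parser.urls


def urls_from_page(page):
    """Return the Page's own url plus any URLs found in its body text."""
    urls = []
    if page.get("url"):
        urls.append(page["url"])
    for match in URL_RE.findall(page.get("body") or ""):
        # Drop punctuation that commonly trails a URL in running text.
        urls.append(match.rstrip(".,;:!?\"'"))
    return urls
```

Parsing the Note with an HTML parser rather than a regex keeps the mention and hashtag filtering tied to the anchor's attributes. The Lemmy body is treated as plain text here, so a regex is enough for a first pass.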

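
For the URL normalization and batch deduplication items, a minimal sketch follows, again in Python for illustration only. The issue does not say which parameters count as tracking parameters, so the `TRACKING_PREFIXES` and `TRACKING_PARAMS` values are assumptions. The function names `normalize_url` and `dedupe_batch` are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed tracking parameters; the issue does not enumerate them.
TRACKING_PREFIXES = ("utm_",)
TRACKING_PARAMS = {"fbclid", "gclid", "mc_cid", "mc_eid", "igshid"}


def is_tracking(name):
    """True if a query parameter name looks like a tracking parameter."""
    lowered = name.lower()
    return lowered in TRACKING_PARAMS or lowered.startswith(TRACKING_PREFIXES)


def normalize_url(url):
    """Lowercase the host and strip tracking parameters; keep path case."""
    parts = urlsplit(url.strip())
    # Lowercase only the host[:port] part; any userinfo is left untouched.
    userinfo, sep, hostport = parts.netloc.rpartition("@")
    netloc = userinfo + sep + hostport.lower()
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not is_tracking(key)
    ]
    return urlunsplit(
        (parts.scheme, netloc, parts.path, urlencode(query), parts.fragment)
    )


def dedupe_batch(urls):
    """Normalize each URL and keep the first occurrence, preserving order."""
    seen = set()
    unique = []
    for url in urls:
        key = normalize_url(url)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique
```

Deduplicating on the normalized form means two posts that link the same page with different tracking parameters collapse to one entry in the batch. The path is left case-sensitive because only the host is safe to lowercase.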