URL Discovery and Storage #4
- [ ] Migration: discovered_urls (url, source, source_instance, source_post_id, engagement_count, first_seen_at, provenance json)
- [ ] DiscoveredUrl model
- [ ] Extract URLs from Mastodon Note content (HTML parsing)
- [ ] Extract URLs from Lemmy Page objects (url + body)
- [ ] URL normalization (strip tracking params, lowercase host)
- [ ] Deduplicate within batch
- [ ] Queue job ProcessFediverseBatch for extraction + storage
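
The extraction, normalization, and in-batch deduplication tasks above could look roughly like the sketch below. It is written in Python purely for illustration and is not the project's code. The function names (`extract_urls`, `normalize_url`, `dedupe_batch`) and the list of tracking parameters are assumptions, since the issue does not specify which parameters to strip. For a Lemmy Page, the `url` field could be passed straight to `normalize_url`, with the body handled like Note content once rendered to HTML.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed tracking parameters; the issue does not say which ones to strip.
TRACKING_PREFIXES = ("utm_",)
TRACKING_NAMES = {"fbclid", "gclid"}


class _LinkCollector(HTMLParser):
    """Collects href values from <a> tags in a Note's HTML content."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_urls(html):
    """Return every link target found in an HTML fragment, in order."""
    collector = _LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.hrefs


def normalize_url(url):
    """Lowercase the host and drop tracking query parameters."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_NAMES
        and not key.lower().startswith(TRACKING_PREFIXES)
    ]
    # Note: lowercasing netloc also lowercases any userinfo or port part;
    # acceptable for a sketch, but a real implementation may want to
    # lowercase only the hostname.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path,
         urlencode(kept), parts.fragment)
    )


def dedupe_batch(urls):
    """Normalize each URL and keep the first occurrence of each result."""
    seen = set()
    unique = []
    for url in urls:
        normalized = normalize_url(url)
        if normalized not in seen:
            seen.add(normalized)
            unique.append(normalized)
    return unique
```

Deduplicating after normalization, rather than before, means two links that differ only in host casing or tracking parameters collapse into a single candidate row within the batch.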
Switch poll loop to per-instance dispatched jobs #18
Label: enhancement
Test environment hardening: APP_KEY override and Postgres test runs #22
Crawler Data Model #7
Release v0.1.0 #30