LemmyClient: page-walking via page=N until cursor reached #16

Open
opened 2026-04-26 01:26:41 +02:00 by myrmidex · 0 comments
Owner

Context

LemmyClient::fetchPostsSince() currently fetches a single page (?sort=New&limit=50) and stops. If a Lemmy instance posts more than 50 items between polls, the older items are dropped.

MastodonClient doesn't have this issue because it uses min_id cursor semantics — the API returns everything newer than min_id in one response.

Lemmy's /api/v3/post/list doesn't support min_id; pagination is page-based (page=1, 2, 3...) sorted by recency.

Goal

Walk Lemmy pages forward from page=1 until reaching a post with id <= last_seen_id, then stop. The new high-water mark is the highest id seen across all walked pages (with sort=New this is expected to be the first post on page 1).

Acceptance criteria

  • First poll (last_seen_id=null): fetch only page 1. Don't backfill the entire history.
  • Subsequent polls: walk page 1, 2, 3, ... until either (a) any post has id <= last_seen_id, or (b) page returns 0 posts, or (c) hard cap (e.g. 10 pages) reached as a safety net.
  • Tests with Http::fake() that simulate 3 pages of posts spanning the cursor — confirm correct stop, correct cursor written, no infinite loop.
  • Tests that verify the safety cap kicks in and a warning is logged.
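
A minimal sketch of the walk described in the criteria above, in plain PHP rather than the real LemmyClient. The function name walkPagesSince(), the $fetchPage callable, and the returned array shape are all hypothetical; in the actual client the callable would wrap the HTTP request for ?sort=New&limit=50&page=N. The sketch filters on id inside each page instead of breaking at the first seen post, which is an assumption made in case page order and id order do not line up exactly.

```php
<?php
declare(strict_types=1);

/**
 * Walk pages newest-first until the cursor is reached (hypothetical helper).
 *
 * @param callable(int): array $fetchPage returns the posts on page N, each with an int 'id'
 * @return array{posts: array, cursor: ?int, capped: bool}
 */
function walkPagesSince(callable $fetchPage, ?int $lastSeenId, int $maxPages = 10): array
{
    $new = [];
    $cursor = $lastSeenId;
    $capped = false;

    for ($page = 1; $page <= $maxPages; $page++) {
        $posts = $fetchPage($page);
        if ($posts === []) {
            break; // (b) empty page: nothing older left to walk
        }

        $reachedCursor = false;
        foreach ($posts as $post) {
            if ($lastSeenId !== null && $post['id'] <= $lastSeenId) {
                $reachedCursor = true; // (a) already seen on an earlier poll
                continue;
            }
            $new[] = $post;
            $cursor = max($cursor ?? $post['id'], $post['id']);
        }

        // First poll (no cursor yet): take page 1 only, no backfill.
        if ($reachedCursor || $lastSeenId === null) {
            break;
        }
        if ($page === $maxPages) {
            $capped = true; // (c) safety cap hit before the cursor was reached
        }
    }

    return ['posts' => $new, 'cursor' => $cursor, 'capped' => $capped];
}

// Usage: three fake pages spanning a cursor of 105.
$pages = [
    1 => [['id' => 109], ['id' => 108]],
    2 => [['id' => 107], ['id' => 106]],
    3 => [['id' => 105], ['id' => 104]],
];
$fetch = fn (int $page): array => $pages[$page] ?? [];

$result = walkPagesSince($fetch, 105);
// $result['posts'] holds ids 109, 108, 107, 106; $result['cursor'] is 109; $result['capped'] is false.
```

The caller would be responsible for logging a warning when capped is true and for persisting cursor as the new last_seen_id.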

Risks

  • An instance with very fast post rate could trigger the cap on every poll → file a follow-up if observed in real corpora.
  • Increased HTTP requests per poll → each request still respects the existing timeout(10s), but 10 pages × 10s means up to 100s per instance in the worst case. This may need to be factored into the scheduler's withoutOverlapping(5) window, or the per-page timeout reduced.
myrmidex added this to the v0.2 milestone 2026-04-26 01:26:41 +02:00
myrmidex self-assigned this 2026-04-26 01:26:41 +02:00
myrmidex added the enhancement label 2026-04-26 01:28:09 +02:00
myrmidex modified the milestone from v0.2 to v0.3 2026-04-29 23:54:15 +02:00
Reference: lvl0/trove#16