Data Sources

AIDRAN ingests public discourse from 8 platforms. Each source has its own adapter, cron schedule, rate-limiting strategy, and authentication requirements. All data is normalized into a shared schema, deduplicated on ingest, and queued for embedding and analysis.

Totals: 469k records · 8 platforms · 4 cron jobs · 29k in the last 24 hours

Every signal begins as raw public discourse — and ends as structured intelligence.

Volume by Platform

| Source      | Total | 24h  | Share |
|-------------|-------|------|-------|
| Reddit      | 239k  | 15k  | 51.0% |
| Bluesky     | 155k  | 5.2k | 33.0% |
| Google News | 32k   | 5.2k | 6.8%  |
| X / Twitter | 22k   | 2.0k | 4.6%  |
| GDELT       | 12k   | 0    | 2.6%  |
| YouTube     | 8.4k  | 872  | 1.8%  |
| arXiv       | 787   | 0    | 0.2%  |
| Hacker News | 170   | 13   | 0.0%  |

Shared Pipeline

Every source adapter normalizes its API response into a shared RawDiscourseRecord shape. From there, the shared pipeline handles the rest: dedup, insert, queue, and log.

Dedup

Each record has a composite sourceId (e.g., {author.did}:{rkey} for Bluesky, reddit-{postId} for Reddit). The seen_source_ids table prevents re-ingestion.

Insert

New records are inserted into the discourse table with source, content_type, topic, content_text, title, url, author, published_at, and source_metadata.

Queue

Each new record is added to both embedding_queue (for Voyage AI vectorization) and analysis_queue (for Claude Haiku sentiment/entity analysis).

Log

Every ingestion run is logged in ingestion_log with source, topic, records fetched/stored, status, and error messages.
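The four stages above can be sketched as one function. This is a minimal, hypothetical illustration: the table and queue names mirror the ones described above, but the in-memory stand-ins and the `ingest` helper are assumptions, not the adapter's actual code.

```typescript
// A trimmed stand-in for RawDiscourseRecord, just enough for the sketch.
interface PipelineRecord {
  source: string;
  sourceId: string;
  topic: string;
  contentText: string | null;
}

// In-memory stand-ins for the seen_source_ids, discourse, and queue tables.
const seenSourceIds = new Set<string>();
const discourse: PipelineRecord[] = [];
const embeddingQueue: string[] = [];
const analysisQueue: string[] = [];

function ingest(records: PipelineRecord[]): { fetched: number; stored: number } {
  let stored = 0;
  for (const rec of records) {
    // 1. Dedup: skip anything whose composite sourceId was seen before.
    if (seenSourceIds.has(rec.sourceId)) continue;
    seenSourceIds.add(rec.sourceId);
    // 2. Insert into the discourse table.
    discourse.push(rec);
    // 3. Queue for embedding (Voyage AI) and analysis (Claude Haiku).
    embeddingQueue.push(rec.sourceId);
    analysisQueue.push(rec.sourceId);
    stored++;
  }
  // 4. Log: return the counts an ingestion_log row would record.
  return { fetched: records.length, stored };
}
```

Running the same batch twice stores nothing the second time, which is the behavior the seen_source_ids table provides.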

Record Schema

The shape every source adapter must produce:

```typescript
interface RawDiscourseRecord {
  source:      'reddit' | 'bluesky' | 'news' | 'gdelt'
             | 'hackernews' | 'arxiv' | 'youtube' | 'twitter';
  contentType: 'post' | 'comment' | 'article';
  sourceId:    string;       // unique per source
  contentText: string | null;
  title:       string | null;
  url:         string | null;
  author:      string | null;
  publishedAt: Date | null;
  topic:       string;       // topic slug
  subreddit?:  string | null;
  language?:   string;
  sourceMetadata: Record<string, unknown>;
}
```

Source Details

High Frequency · ≤1 hour

Bluesky

Status: Healthy · Last run: 21m ago · Next run: every 30 minutes

API: AT Protocol (@atproto/api)

Records: 155k total · +5.2k / 24h
Schedule: Every 30 minutes
Cron: */30 * * * *
Auth: BLUESKY_HANDLE + BLUESKY_APP_PASSWORD
Content: Posts, threads (replies detected via record.reply)
Max Duration: 4.5-minute time budget

Query Strategy

Keyword search across all public posts. Collects all keyword searches for all topics, then runs in waves of 3 concurrent requests.

Rate Limiting

Concurrency of 3 (reduced from 5 — Bluesky rate limits aggressively), 1s wave delay

Last run: 21m ago · 239 fetched · 0 new
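The "waves of concurrent requests" strategy described above can be sketched as a small helper. `runInWaves` is a hypothetical name, not the adapter's actual code; it runs up to `concurrency` tasks in parallel, then pauses between waves.

```typescript
// Run tasks in waves of `concurrency`, pausing `waveDelayMs` between waves.
// Hypothetical helper illustrating the rate-limiting pattern described above.
async function runInWaves<T, R>(
  items: T[],
  concurrency: number,
  waveDelayMs: number,
  task: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = [];
  for (let i = 0; i < items.length; i += concurrency) {
    // One wave of up to `concurrency` tasks in parallel.
    const wave = items.slice(i, i + concurrency);
    results.push(...(await Promise.all(wave.map(task))));
    // Pause between waves to stay under the platform's rate limits.
    if (i + concurrency < items.length) {
      await new Promise((resolve) => setTimeout(resolve, waveDelayMs));
    }
  }
  return results;
}
```

For Bluesky this would be invoked with concurrency 3 and a 1,000 ms wave delay; the Reddit and Hacker News settings below fit the same shape.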

Reddit

Status: Healthy · Last run: 45m ago · Next run: hourly

API: Arctic Shift API (fallback: Reddit OAuth2)

Records: 239k total · +15k / 24h
Schedule: Hourly
Cron: 0 * * * *
Auth: None required
Content: Posts, comments
Max Duration: 5 min

Query Strategy

Per-topic subreddit lists + keyword search. Fetches posts from the last 24 hours per topic, concurrency of 2 with 2s wave delay.

Rate Limiting

Community project — limited to 2 concurrent requests with 2s between waves

Last run: 45m ago · 2411 fetched · 154 new

Hacker News

Status: No data · Next run: hourly

API: Algolia HN Search API

Records: 170 total · +13 / 24h
Schedule: Hourly
Cron: 0 * * * *
Auth: None required
Content: Stories (posts with >5 points)
Max Duration: 5 min

Query Strategy

Per-topic search using the topic's first 3 keywords. Filters to stories with >5 points from the last 24 hours.

Rate Limiting

Very generous — 3 concurrent, 500ms wave delay
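The query described above maps naturally onto Algolia's public HN Search API (`search_by_date`, `tags`, `numericFilters`). The endpoint and parameters are Algolia's; the wrapper function itself is a hypothetical sketch.

```typescript
// Build an Algolia HN Search request matching the strategy above:
// stories only, >5 points, created in the last 24 hours.
function hnSearchUrl(keywords: string[], nowSec: number): string {
  const dayAgo = nowSec - 24 * 60 * 60;
  const params = new URLSearchParams({
    query: keywords.slice(0, 3).join(' '),      // topic's first 3 keywords
    tags: 'story',                              // stories only, no comments
    numericFilters: `points>5,created_at_i>${dayAgo}`,
  });
  return `https://hn.algolia.com/api/v1/search_by_date?${params}`;
}
```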

Periodic · Every 4 hours

Google News

Status: No data · Next run: every 4 hours

API: Google News RSS + Claude Haiku query generation

Records: 32k total · +5.2k / 24h
Schedule: Every 4 hours
Cron: 0 */4 * * *
Auth: None
Content: Articles (title, description, publisher)
Max Duration: 5 min

Query Strategy

For each topic, Claude Haiku generates a natural search query (3–6 words) from the topic keywords. Query varies each run to avoid bot detection.

Rate Limiting

5s delay between topics, 10s fetch timeout
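The fetch side of this adapter can be sketched against Google News's public RSS search endpoint. The `/rss/search` URL and its `hl`/`gl`/`ceid` locale parameters are Google's; in the real adapter the query string would come from Claude Haiku each run, so the fixed argument here is a stand-in.

```typescript
// Build a Google News RSS search feed URL. The query would normally be the
// Haiku-generated 3-6 word search phrase described above.
function googleNewsRssUrl(query: string, lang = 'en-US'): string {
  const params = new URLSearchParams({
    q: query,        // natural-language search query
    hl: lang,        // interface language
    gl: 'US',        // geographic region
    ceid: 'US:en',   // country/edition identifier
  });
  return `https://news.google.com/rss/search?${params}`;
}
```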

YouTube

Status: No data · Next run: every 4 hours

API: YouTube Data API v3

Records: 8.4k total · +872 / 24h
Schedule: Every 4 hours
Cron: 0 */4 * * *
Auth: YOUTUBE_API_KEY
Content: Video metadata (stored as article) + top comments
Max Duration: 5 min

Query Strategy

Per-topic search using the first keyword. Fetches up to 5 videos and their top 10 comments each.

Rate Limiting

1s delay between topics; ~2,520 quota units/day (within the 10,000-unit daily budget)
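The ~2,520 units/day figure can be sanity-checked against YouTube Data API v3's published quota costs (search.list = 100 units, commentThreads.list = 1 unit per call). The topic count of 4 used below is inferred from the arithmetic, not stated anywhere above.

```typescript
// Estimate daily quota usage: one search.list call per topic (100 units),
// plus one commentThreads.list call per fetched video (1 unit each).
// The topic count is an inference from the stated ~2,520 units/day figure.
function dailyQuotaUnits(topics: number, videosPerTopic: number, runsPerDay: number): number {
  const searchCost = 100;    // search.list quota cost
  const commentsCost = 1;    // commentThreads.list quota cost
  const perRun = topics * (searchCost + videosPerTopic * commentsCost);
  return perRun * runsPerDay;
}
// 4 topics × (100 + 5×1) units × 6 runs/day = 2,520 units
```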

arXiv

Status: No data · Next run: every 4 hours

API: arXiv API (Atom XML)

Records: 787 total
Schedule: Every 4 hours
Cron: 0 */4 * * *
Auth: None required
Content: Papers (title, abstract, authors, categories)
Max Duration: 5 min

Query Strategy

Per-topic query using the first keyword, filtered to AI-relevant categories: cs.AI, cs.CL, cs.LG. Sorted by submission date, max 30 results.

Rate Limiting

3-second delay between topics (arXiv honor system)
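The query above maps onto arXiv's public Atom API: `search_query` with `cat:` filters, sorted by submission date, capped at 30 results. The endpoint and parameters are arXiv's; the wrapper function is a hypothetical sketch.

```typescript
// Build an arXiv API query matching the strategy above: first keyword,
// restricted to AI-relevant categories, newest first, max 30 results.
function arxivQueryUrl(keyword: string): string {
  const cats = ['cs.AI', 'cs.CL', 'cs.LG'].map((c) => `cat:${c}`).join(' OR ');
  const params = new URLSearchParams({
    search_query: `all:${keyword} AND (${cats})`,
    sortBy: 'submittedDate',
    sortOrder: 'descending',
    max_results: '30',
  });
  return `https://export.arxiv.org/api/query?${params}`;
}
```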

Daily · Once per day