Data Sources

AIDRAN ingests public discourse from 8 platforms. Each source has its own adapter, cron schedule, rate-limiting strategy, and authentication requirements. All data is normalized into a shared schema, deduplicated on ingest, and queued for embedding and analysis.

Totals: 469k records · 8 platforms · 4 cron jobs · 29k in the last 24 hours

Every signal begins as raw public discourse — and ends as structured intelligence.

Volume by Platform

| Source      | Total | 24h  | Share |
|-------------|-------|------|-------|
| Reddit      | 239k  | 15k  | 51.0% |
| Bluesky     | 155k  | 5.2k | 33.0% |
| Google News | 32k   | 5.2k | 6.8%  |
| X / Twitter | 22k   | 2.0k | 4.6%  |
| GDELT       | 12k   | 0    | 2.6%  |
| YouTube     | 8.4k  | 872  | 1.8%  |
| arXiv       | 787   | 0    | 0.2%  |
| Hacker News | 170   | 13   | 0.0%  |

Shared Pipeline

Every source adapter normalizes its API response into a shared RawDiscourseRecord shape. From there, the shared pipeline handles the rest: dedup, insert, queue, and log.

Dedup

Each record has a composite sourceId (e.g., {author.did}:{rkey} for Bluesky, reddit-{postId} for Reddit). The seen_source_ids table prevents re-ingestion.

Insert

New records are inserted into the discourse table with source, content_type, topic, content_text, title, url, author, published_at, and source_metadata.

Queue

Each new record is added to both embedding_queue (for Voyage AI vectorization) and analysis_queue (for Claude Haiku sentiment/entity analysis).

Log

Every ingestion run is logged in ingestion_log with source, topic, records fetched/stored, status, and error messages.
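The four stages above can be sketched as one function. This is a minimal, hypothetical illustration: the table and queue names mirror the ones described above, but the in-memory stand-ins and the `ingest` helper are assumptions, not the adapter's actual code.

```typescript
// A trimmed stand-in for RawDiscourseRecord, just enough for the sketch.
interface PipelineRecord {
  source: string;
  sourceId: string;
  topic: string;
  contentText: string | null;
}

// In-memory stand-ins for the seen_source_ids, discourse, and queue tables.
const seenSourceIds = new Set<string>();
const discourse: PipelineRecord[] = [];
const embeddingQueue: string[] = [];
const analysisQueue: string[] = [];

function ingest(records: PipelineRecord[]): { fetched: number; stored: number } {
  let stored = 0;
  for (const rec of records) {
    // 1. Dedup: skip anything whose composite sourceId was seen before.
    if (seenSourceIds.has(rec.sourceId)) continue;
    seenSourceIds.add(rec.sourceId);
    // 2. Insert into the discourse table.
    discourse.push(rec);
    // 3. Queue for embedding (Voyage AI) and analysis (Claude Haiku).
    embeddingQueue.push(rec.sourceId);
    analysisQueue.push(rec.sourceId);
    stored++;
  }
  // 4. Log: return the counts an ingestion_log row would record.
  return { fetched: records.length, stored };
}
```

Running the same batch twice stores nothing the second time, which is the behavior the seen_source_ids table provides.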

Record Schema

The shape every source adapter must produce:

```typescript
interface RawDiscourseRecord {
  source:      'reddit' | 'bluesky' | 'news' | 'gdelt'
             | 'hackernews' | 'arxiv' | 'youtube' | 'twitter';
  contentType: 'post' | 'comment' | 'article';
  sourceId:    string;       // unique per source
  contentText: string | null;
  title:       string | null;
  url:         string | null;
  author:      string | null;
  publishedAt: Date | null;
  topic:       string;       // topic slug
  subreddit?:  string | null;
  language?:   string;
  sourceMetadata: Record<string, unknown>;
}
```

Source Details

High Frequency · ≤1 hour

Bluesky

Status: Healthy · Last run: 21m ago · Next run: every 30 minutes

API: AT Protocol (@atproto/api)

Records: 155k total · +5.2k / 24h
Schedule: Every 30 minutes
Cron: */30 * * * *
Auth: BLUESKY_HANDLE + BLUESKY_APP_PASSWORD
Content: Posts, threads (replies detected via record.reply)
Max Duration: 4.5-minute time budget

Query Strategy

Keyword search across all public posts. Collects all keyword searches for all topics, then runs in waves of 3 concurrent requests.

Rate Limiting

Concurrency of 3 (reduced from 5 — Bluesky rate limits aggressively), 1s wave delay

Last run: 21m ago · 239 fetched · 0 new
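The "waves of concurrent requests" strategy described above can be sketched as a small helper. `runInWaves` is a hypothetical name, not the adapter's actual code; it runs up to `concurrency` tasks in parallel, then pauses between waves.

```typescript
// Run tasks in waves of `concurrency`, pausing `waveDelayMs` between waves.
// Hypothetical helper illustrating the rate-limiting pattern described above.
async function runInWaves<T, R>(
  items: T[],
  concurrency: number,
  waveDelayMs: number,
  task: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = [];
  for (let i = 0; i < items.length; i += concurrency) {
    // One wave of up to `concurrency` tasks in parallel.
    const wave = items.slice(i, i + concurrency);
    results.push(...(await Promise.all(wave.map(task))));
    // Pause between waves to stay under the platform's rate limits.
    if (i + concurrency < items.length) {
      await new Promise((resolve) => setTimeout(resolve, waveDelayMs));
    }
  }
  return results;
}
```

For Bluesky this would be invoked with concurrency 3 and a 1,000 ms wave delay; the Reddit and Hacker News settings below fit the same shape.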

Reddit

Status: Healthy · Last run: 45m ago · Next run: hourly

API: Arctic Shift API (fallback: Reddit OAuth2)

Records: 239k total · +15k / 24h
Schedule: Hourly
Cron: 0 * * * *
Auth: None required
Content: Posts, comments
Max Duration: 5 min

Query Strategy

Per-topic subreddit lists + keyword search. Fetches posts from the last 24 hours per topic, concurrency of 2 with 2s wave delay.

Rate Limiting

Community project — limited to 2 concurrent requests with 2s between waves

Last run: 45m ago · 2411 fetched · 154 new

Hacker News

Status: No data · Next run: hourly

API: Algolia HN Search API

Records: 170 total · +13 / 24h
Schedule: Hourly
Cron: 0 * * * *
Auth: None required
Content: Stories (posts with >5 points)
Max Duration: 5 min

Query Strategy

Per-topic search using the topic's first 3 keywords. Filters to stories with >5 points from the last 24 hours.

Rate Limiting

Very generous — 3 concurrent, 500ms wave delay
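The query described above maps naturally onto Algolia's public HN Search API (`search_by_date`, `tags`, `numericFilters`). The endpoint and parameters are Algolia's; the wrapper function itself is a hypothetical sketch.

```typescript
// Build an Algolia HN Search request matching the strategy above:
// stories only, >5 points, created in the last 24 hours.
function hnSearchUrl(keywords: string[], nowSec: number): string {
  const dayAgo = nowSec - 24 * 60 * 60;
  const params = new URLSearchParams({
    query: keywords.slice(0, 3).join(' '),      // topic's first 3 keywords
    tags: 'story',                              // stories only, no comments
    numericFilters: `points>5,created_at_i>${dayAgo}`,
  });
  return `https://hn.algolia.com/api/v1/search_by_date?${params}`;
}
```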

Periodic · Every 4 hours

Google News

Status: No data · Next run: every 4 hours

API: Google News RSS + Claude Haiku query generation

Records: 32k total · +5.2k / 24h
Schedule: Every 4 hours
Cron: 0 */4 * * *
Auth: None
Content: Articles (title, description, publisher)
Max Duration: 5 min

Query Strategy

For each topic, Claude Haiku generates a natural search query (3–6 words) from the topic keywords. Query varies each run to avoid bot detection.

Rate Limiting

5s delay between topics, 10s fetch timeout
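The fetch side of this adapter can be sketched against Google News's public RSS search endpoint. The `/rss/search` URL and its `hl`/`gl`/`ceid` locale parameters are Google's; in the real adapter the query string would come from Claude Haiku each run, so the fixed argument here is a stand-in.

```typescript
// Build a Google News RSS search feed URL. The query would normally be the
// Haiku-generated 3-6 word search phrase described above.
function googleNewsRssUrl(query: string, lang = 'en-US'): string {
  const params = new URLSearchParams({
    q: query,        // natural-language search query
    hl: lang,        // interface language
    gl: 'US',        // geographic region
    ceid: 'US:en',   // country/edition identifier
  });
  return `https://news.google.com/rss/search?${params}`;
}
```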

YouTube

Status: No data · Next run: every 4 hours

API: YouTube Data API v3

Records: 8.4k total · +872 / 24h
Schedule: Every 4 hours
Cron: 0 */4 * * *
Auth: YOUTUBE_API_KEY
Content: Video metadata (stored as article) + top comments
Max Duration: 5 min

Query Strategy

Per-topic search using the first keyword. Fetches up to 5 videos and their top 10 comments each.

Rate Limiting

1s delay between topics; ~2,520 quota units/day (within the 10,000-unit daily budget)
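The ~2,520 units/day figure can be sanity-checked against YouTube Data API v3's published quota costs (search.list = 100 units, commentThreads.list = 1 unit per call). The topic count of 4 used below is inferred from the arithmetic, not stated anywhere above.

```typescript
// Estimate daily quota usage: one search.list call per topic (100 units),
// plus one commentThreads.list call per fetched video (1 unit each).
// The topic count is an inference from the stated ~2,520 units/day figure.
function dailyQuotaUnits(topics: number, videosPerTopic: number, runsPerDay: number): number {
  const searchCost = 100;    // search.list quota cost
  const commentsCost = 1;    // commentThreads.list quota cost
  const perRun = topics * (searchCost + videosPerTopic * commentsCost);
  return perRun * runsPerDay;
}
// 4 topics × (100 + 5×1) units × 6 runs/day = 2,520 units
```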

arXiv

Status: No data · Next run: every 4 hours

API: arXiv API (Atom XML)

Records: 787 total
Schedule: Every 4 hours
Cron: 0 */4 * * *
Auth: None required
Content: Papers (title, abstract, authors, categories)
Max Duration: 5 min

Query Strategy

Per-topic query using the first keyword, filtered to AI-relevant categories: cs.AI, cs.CL, cs.LG. Sorted by submission date, max 30 results.

Rate Limiting

3-second delay between topics (arXiv honor system)
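The query above maps onto arXiv's public Atom API: `search_query` with `cat:` filters, sorted by submission date, capped at 30 results. The endpoint and parameters are arXiv's; the wrapper function is a hypothetical sketch.

```typescript
// Build an arXiv API query matching the strategy above: first keyword,
// restricted to AI-relevant categories, newest first, max 30 results.
function arxivQueryUrl(keyword: string): string {
  const cats = ['cs.AI', 'cs.CL', 'cs.LG'].map((c) => `cat:${c}`).join(' OR ');
  const params = new URLSearchParams({
    search_query: `all:${keyword} AND (${cats})`,
    sortBy: 'submittedDate',
    sortOrder: 'descending',
    max_results: '30',
  });
  return `https://export.arxiv.org/api/query?${params}`;
}
```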

Daily · Once per day