Data Sources
AIDRAN ingests public discourse from 8 platforms. Each source has its own adapter, cron schedule, rate-limiting strategy, and authentication requirements. All data is normalized into a shared schema, deduplicated on ingest, and queued for embedding and analysis.
469k Total Records · 8 Platforms · 4 Cron Jobs · 29k Last 24 Hours
Every signal begins as raw public discourse — and ends as structured intelligence.
Volume by Platform
| Source | Total | 24h | Share |
|---|---|---|---|
| Reddit | 239k | 15k | 51.0% |
| Bluesky | 155k | 5.2k | 33.0% |
| Google News | 32k | 5.2k | 6.8% |
| X / Twitter | 22k | 2.0k | 4.6% |
| GDELT | 12k | 0 | 2.6% |
| YouTube | 8.4k | 872 | 1.8% |
| arXiv | 787 | 0 | 0.2% |
| Hacker News | 170 | 13 | 0.0% |
Shared Pipeline
Every source adapter normalizes its API response into a shared RawDiscourseRecord shape. From there, the shared pipeline handles everything.
Dedup
Each record has a composite sourceId (e.g., {author.did}:{rkey} for Bluesky, reddit-{postId} for Reddit). The seen_source_ids table prevents re-ingestion.
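The composite IDs above can be sketched as a small helper. The per-source shapes (`{author.did}:{rkey}`, `reddit-{postId}`) come from the text; the helper itself and its signature are illustrative assumptions:

```typescript
// Build the composite sourceId used for deduplication against seen_source_ids.
// ID shapes follow the examples in the text; this helper is hypothetical.
function makeSourceId(
  source: "bluesky" | "reddit",
  parts: Record<string, string>,
): string {
  switch (source) {
    case "bluesky":
      return `${parts.did}:${parts.rkey}`; // {author.did}:{rkey}
    case "reddit":
      return `reddit-${parts.postId}`; // reddit-{postId}
    default:
      throw new Error(`unknown source: ${source}`);
  }
}
```

Before insertion, the pipeline looks each generated ID up in `seen_source_ids` and skips any record whose ID is already present.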
Insert
New records are inserted into the discourse table with source, content_type, topic, content_text, title, url, author, published_at, and source_metadata.
Queue
Each new record is added to both embedding_queue (for Voyage AI vectorization) and analysis_queue (for Claude Haiku sentiment/entity analysis).
Log
Every ingestion run is logged in ingestion_log with source, topic, records fetched/stored, status, and error messages.
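The four steps above (dedup, insert, queue, log) can be sketched end to end. Table and queue names come from the text; the in-memory `Store` interface and `ingest` function are assumptions standing in for the real database layer:

```typescript
// Sketch of the shared ingest pipeline: dedup -> insert -> queue -> log.
// The Store shape is a stand-in for the real tables named in the text.
interface Store {
  seenSourceIds: Set<string>;                       // mirrors seen_source_ids
  discourse: { sourceId: string; topic: string }[]; // discourse table (abridged)
  embeddingQueue: string[];                         // embedding_queue
  analysisQueue: string[];                          // analysis_queue
  ingestionLog: { source: string; fetched: number; stored: number }[];
}

function ingest(
  store: Store,
  source: string,
  records: { sourceId: string; topic: string }[],
): number {
  let stored = 0;
  for (const rec of records) {
    if (store.seenSourceIds.has(rec.sourceId)) continue; // 1. dedup
    store.seenSourceIds.add(rec.sourceId);
    store.discourse.push(rec);                           // 2. insert
    store.embeddingQueue.push(rec.sourceId);             // 3. queue for Voyage AI...
    store.analysisQueue.push(rec.sourceId);              //    ...and Claude Haiku
    stored++;
  }
  store.ingestionLog.push({ source, fetched: records.length, stored }); // 4. log
  return stored;
}
```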
Record Schema
The shape every source adapter must produce:
```typescript
interface RawDiscourseRecord {
  source: 'reddit' | 'bluesky' | 'news' | 'hackernews'
        | 'arxiv' | 'youtube' | 'twitter';
  contentType: 'post' | 'comment' | 'article';
  sourceId: string;        // unique per source
  contentText: string | null;
  title: string | null;
  url: string | null;
  author: string | null;
  publishedAt: Date | null;
  topic: string;           // topic slug
  subreddit?: string | null;
  language?: string;
  sourceMetadata: Record<string, unknown>;
}
```

Source Details
Bluesky
AT Protocol (@atproto/api)
Query Strategy
Keyword search across all public posts. Builds the keyword searches for every topic up front, then runs them in waves of 3 concurrent requests.
Rate Limiting
Concurrency of 3 (reduced from 5 — Bluesky rate limits aggressively), 1s wave delay
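The wave pattern described here (N concurrent requests, a fixed pause between waves) can be sketched as a generic helper; the function name and shape are assumptions, not the project's actual code:

```typescript
// Run tasks in waves of `concurrency`, pausing `waveDelayMs` between waves --
// e.g. 3 concurrent with a 1s delay for Bluesky. Hypothetical helper.
async function runInWaves<T>(
  tasks: (() => Promise<T>)[],
  concurrency: number,
  waveDelayMs: number,
): Promise<T[]> {
  const results: T[] = [];
  for (let i = 0; i < tasks.length; i += concurrency) {
    const wave = tasks.slice(i, i + concurrency);
    results.push(...(await Promise.all(wave.map((t) => t()))));
    if (i + concurrency < tasks.length) {
      await new Promise((r) => setTimeout(r, waveDelayMs)); // pause between waves
    }
  }
  return results;
}
```

The same shape covers the other adapters by swapping in their concurrency and delay values (e.g. Reddit's 2/2s, Hacker News's 3/500ms).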
Last run: 21m ago · 239 fetched · 0 new
Reddit
Arctic Shift API (fallback: Reddit OAuth2)
Query Strategy
Per-topic subreddit lists + keyword search. Fetches posts from the last 24 hours per topic, concurrency of 2 with 2s wave delay.
Rate Limiting
Community project — limited to 2 concurrent requests with 2s between waves
Last run: 45m ago · 2411 fetched · 154 new
Hacker News
Algolia HN Search API
Query Strategy
Per-topic search using the topic's first 3 keywords. Filters to stories with >5 points from the last 24 hours.
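That query can be expressed against the public Algolia HN Search API, which supports `tags` and `numericFilters` parameters; the URL-building helper itself is a sketch:

```typescript
// Algolia HN Search request: keyword query, stories only, >5 points,
// last 24 hours. Endpoint and parameters are the public Algolia HN API;
// this helper is illustrative.
function hnSearchUrl(keywords: string[], nowMs: number): string {
  const since = Math.floor(nowMs / 1000) - 24 * 3600; // created_at_i is Unix seconds
  const params = new URLSearchParams({
    query: keywords.slice(0, 3).join(" "), // topic's first 3 keywords
    tags: "story",
    numericFilters: `points>5,created_at_i>${since}`,
  });
  return `https://hn.algolia.com/api/v1/search?${params}`;
}
```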
Rate Limiting
Very generous — 3 concurrent, 500ms wave delay
Google News
Google News RSS + Claude Haiku query generation
Query Strategy
For each topic, Claude Haiku generates a natural search query (3–6 words) from the topic keywords. Query varies each run to avoid bot detection.
Rate Limiting
5s delay between topics, 10s fetch timeout
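Once Claude Haiku has produced the query string, the fetch itself is a plain RSS request. The `news.google.com/rss/search` endpoint is real; the locale parameters here are assumptions:

```typescript
// Google News RSS search feed URL. The query string is generated per run by
// Claude Haiku (per the text); hl/gl/ceid locale values are assumed.
function newsRssUrl(generatedQuery: string): string {
  const params = new URLSearchParams({
    q: generatedQuery, // the Haiku-generated 3-6 word query
    hl: "en-US",       // assumed locale settings
    gl: "US",
    ceid: "US:en",
  });
  return `https://news.google.com/rss/search?${params}`;
}
```

Because the query text varies on every run, the request pattern never repeats exactly, which is the bot-detection mitigation described above.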
YouTube
YouTube Data API v3
Query Strategy
Per-topic search using the first keyword. Fetches up to 5 videos and their top 10 comments each.
Rate Limiting
1s delay between topics, ~2,520 units/day (within 10K budget)
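The ~2,520 units/day figure is consistent with the documented YouTube Data API v3 quota costs (`search.list` = 100 units, `commentThreads.list` = 1 unit per call). The topic count of 24 below is an assumption chosen because it reproduces the stated total:

```typescript
// Rough quota math behind "~2,520 units/day". API costs are the documented
// YouTube Data API v3 quotas; the topic count (24) is an assumption.
const SEARCH_COST = 100;   // search.list, one call per topic
const COMMENTS_COST = 1;   // commentThreads.list, one call per video
const VIDEOS_PER_TOPIC = 5;
const TOPICS = 24;         // assumed

const unitsPerTopic = SEARCH_COST + VIDEOS_PER_TOPIC * COMMENTS_COST; // 105
const unitsPerDay = unitsPerTopic * TOPICS;                           // 2,520
```

Either way, one full daily cycle stays comfortably inside the default 10,000-unit quota.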
arXiv
arXiv API (Atom XML)
Query Strategy
Per-topic query using the first keyword, filtered to AI-relevant categories: cs.AI, cs.CL, cs.LG. Sorted by submission date, max 30 results.
Rate Limiting
3-second delay between topics (arXiv honor system)
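The query above maps onto the public arXiv API's `search_query`, `sortBy`, and `max_results` parameters; how the pieces are combined here is a sketch:

```typescript
// arXiv Atom query: first keyword, AI-relevant categories, newest first,
// 30 results. Parameter names follow the public arXiv API; the composition
// is illustrative.
function arxivQueryUrl(firstKeyword: string): string {
  const cats = ["cs.AI", "cs.CL", "cs.LG"].map((c) => `cat:${c}`).join(" OR ");
  const params = new URLSearchParams({
    search_query: `(${cats}) AND all:${firstKeyword}`,
    sortBy: "submittedDate",
    sortOrder: "descending",
    max_results: "30",
  });
  return `http://export.arxiv.org/api/query?${params}`;
}
```

arXiv has no hard rate limit, but its usage guidelines ask for a ~3-second gap between requests, hence the per-topic delay.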
X / Twitter
Twitter API v2 (recent search)
Query Strategy
Per-topic search using the first 2 keywords. 7-day search window, max 50 results per query.
Rate Limiting
450 requests/15 min on Basic plan, 3 concurrent with 2s wave delay