Data Sources // AIDRAN

Overview

In plain English

AIDRAN uses source-specific ingestion tasks for public discourse, public article discovery, official releases, developer ecosystem signals, regulatory records, and enrichment watchlists. We do not access private messages, locked accounts, or paywalled article bodies.

AIDRAN’s ingestion layer is made of source-specific adapters. The active source set includes arXiv, Bluesky, Hacker News, Google News, Reddit, Twitter/X, YouTube, Exa, Websets, OpenAlex, and Hugging Face. The corpus also recognizes optional expansion lanes for official web sources, GitHub, package registries, Stack Exchange, regulatory and filing sources, Product Hunt, Mastodon, and GDELT. Each adapter runs as a worker that writes source configuration rows and public record rows while preserving the actual publisher or platform attribution. Analysis, signal detection, story generation, and the web app read from that corpus through the Delivery API.

The system is built around public evidence. It does not read private messages, locked accounts, authentication-only pages, or paywalled article bodies. When a public source exposes author handles, bylines, timestamps, links, or engagement metrics, those fields may be stored so AIDRAN can attribute and weigh the record.

Display Buckets

In plain English

The website may show article records as News. That is a display bucket, not a hidden source list.

Source kind and public display label are not always the same thing. The database keeps the upstream source kind on each record, while the web app groups article-category records under a reader-facing News label where that is clearer than naming a discovery or enrichment provider.

Google News is a scheduled ingestion source in the article category. arXiv, Exa, Websets, and OpenAlex are also article-category sources, as are the official web, GitHub, package registry, regulatory, Product Hunt, and GDELT lanes. Hugging Face keeps a distinct Hugging Face label because readers need to distinguish model, paper, and dataset watchlist records from generic article discovery. External articles surfaced during story enrichment may also appear in the News bucket, with the actual publisher domain shown when available. News is therefore a presentation bucket for public article records and web citations, not a separate private feed.
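
The split between stored source kind and reader-facing label can be illustrated with a small lookup. This is only a sketch: the source-kind strings, the `display_label` name, and the fallback for discourse sources are assumptions, since this page does not state the identifiers or labels the web app actually uses.

```python
# Hypothetical source-kind identifiers; the real stored values are not documented here.
ARTICLE_KINDS = {
    "google_news", "arxiv", "exa", "websets", "openalex",
    "official_web", "github", "package_registry", "regulatory",
    "product_hunt", "gdelt",
}


def display_label(source_kind: str) -> str:
    """Map a stored source kind to a reader-facing display bucket."""
    if source_kind == "huggingface":
        # Hugging Face keeps its own label even though its records are articles.
        return "Hugging Face"
    if source_kind in ARTICLE_KINDS:
        return "News"
    # Assumption: other sources are shown under their own kind.
    return source_kind
```

The stored record keeps its source kind either way; only the label shown to readers changes.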

Cadence And Status

In plain English

Ingestion is run by source-specific Cloudflare workflow tasks. Cadence and enabled status are operational settings, not public promises.

Each scheduled ingestion source has its own Cloudflare workflow task and deployment trigger. Cadence, limits, and whether a credentialed source is enabled can change as upstream APIs, quotas, and reliability change. Enrichment article records are created by story-enrichment work or curated imports, not by a public scrape schedule. Delivery exposes source rows with enabled status and recent record volume; public stories and citations are generated from the records and citation links actually present in the corpus.

Reddit

In plain English

Public subreddit discussions provide structured community-level AI discourse.

Reddit records come from public AI-related subreddit listings. The ingestion worker iterates a maintained subreddit set, skips removed or deleted text, and stores public post fields such as title, text when present, URL, author handle, subreddit, score, comment count, flair, and permalink.

Record category: Discourse
Record type: Public posts from AI-related subreddits
Context stored: Subreddit, link metadata, and public engagement fields
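
A minimal sketch of the skip-and-store step, assuming posts arrive as dictionaries shaped like Reddit's public listing JSON (`selftext`, `num_comments`, `link_flair_text`, and so on). The `reddit_record` name and output keys are hypothetical, and the sketch assumes that a post whose text is a removal marker is skipped entirely.

```python
from typing import Optional

# Placeholder strings Reddit shows in place of removed or deleted post text.
REMOVED_MARKERS = {"[removed]", "[deleted]"}


def reddit_record(post: dict) -> Optional[dict]:
    """Return a storable record, or None when the post text was removed or deleted."""
    text = (post.get("selftext") or "").strip()
    if text in REMOVED_MARKERS:
        return None
    return {
        "title": post.get("title", ""),
        "text": text or None,  # link posts have no body text
        "url": post.get("url"),
        "author": post.get("author"),
        "subreddit": post.get("subreddit"),
        "score": post.get("score", 0),
        "comment_count": post.get("num_comments", 0),
        "flair": post.get("link_flair_text"),
        "permalink": "https://www.reddit.com" + post.get("permalink", ""),
    }
```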

Bluesky

In plain English

Public Bluesky search results capture AT Protocol posts about AI.

Bluesky records come from public AT Protocol search results. The adapter searches for AI-related language, paginates through public posts, and stores text, author handle or DID, URL, language, and public reply, repost, and like counts.

Record category: Discourse
Record type: Public posts
Context stored: Handles, language, and public engagement fields
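
Cursor-based pagination of this kind can be sketched as follows. The sketch assumes each search response is a dictionary with a `posts` list and an optional `cursor`, as in the public `app.bsky.feed.searchPosts` response. The `fetch_page` callable, the `max_pages` cap, and both function names are hypothetical stand-ins for whatever the adapter really does.

```python
from typing import Callable, Iterator, Optional


def paginate_posts(
    fetch_page: Callable[[Optional[str]], dict], max_pages: int = 5
) -> Iterator[dict]:
    """Yield posts page by page until the cursor runs out or max_pages is reached."""
    cursor: Optional[str] = None
    for _ in range(max_pages):
        page = fetch_page(cursor)
        yield from page.get("posts", [])
        cursor = page.get("cursor")
        if not cursor:
            break


def post_url(post: dict) -> str:
    """Build a public bsky.app URL from an at:// post URI and the author handle or DID."""
    record_key = post["uri"].rsplit("/", 1)[-1]
    author = post["author"]
    actor = author.get("handle") or author["did"]
    return f"https://bsky.app/profile/{actor}/post/{record_key}"
```

Passing the fetch function in keeps the loop independent of any HTTP client and easy to exercise with canned pages.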

Hacker News

In plain English

Hacker News provides public technical-community discussion through its read-only API.

Hacker News records come from the public Firebase API. The current worker reads the public top-stories feed, fetches item details, filters dead or deleted items, and stores title, text when present, URL, author, score, descendant count, and item metadata.

Record category: Discourse
Record type: Public story items
Context stored: Points, descendant counts, item type, and public links
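
The filter-and-store step might look like the sketch below. It assumes items are dictionaries as returned by the public Firebase item endpoint (`by`, `score`, `descendants`, `dead`, `deleted`). The `hn_record` name, the output keys, and the fallback to the discussion page for items without an external URL are assumptions, not documented AIDRAN behavior.

```python
from typing import Optional

# Public endpoints of the Hacker News Firebase API.
TOP_STORIES_URL = "https://hacker-news.firebaseio.com/v0/topstories.json"
ITEM_URL = "https://hacker-news.firebaseio.com/v0/item/{id}.json"


def hn_record(item: Optional[dict]) -> Optional[dict]:
    """Return a storable record, or None for missing, dead, or deleted items."""
    if not item or item.get("dead") or item.get("deleted"):
        return None
    item_id = item["id"]
    return {
        "id": item_id,
        "type": item.get("type"),
        "title": item.get("title", ""),
        "text": item.get("text"),  # present only on text posts such as Ask HN
        # Assumption: fall back to the discussion page when no external URL exists.
        "url": item.get("url") or f"https://news.ycombinator.com/item?id={item_id}",
        "author": item.get("by"),
        "score": item.get("score", 0),
        "descendants": item.get("descendants", 0),
        "time": item.get("time"),
    }
```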

Google News

In plain English

Google News is a public RSS discovery source for AI-related articles.

Google News records come from public RSS search results. AIDRAN stores article titles, descriptions from the feed, publisher names when present, publication times, and the stable Google News redirect URL. It does not bypass publisher paywalls or claim to store the full article body from the linked site.

Record category: Article
Record type: Public RSS article entries
Context stored: Publisher, description, publication time, and source URL
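
As a rough illustration, the fields listed above can be read from an RSS 2.0 feed using only the standard library. The function name and output keys are hypothetical, and the sketch assumes the publisher name sits in each item's `<source>` element.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime
from typing import List


def parse_rss_items(xml_text: str) -> List[dict]:
    """Extract title, description, publisher, time, and link from RSS items."""
    root = ET.fromstring(xml_text)
    records = []
    for item in root.iter("item"):
        source = item.find("source")
        pub_date = item.findtext("pubDate")
        records.append({
            "title": item.findtext("title", default=""),
            "description": item.findtext("description", default=""),
            "publisher": source.text if source is not None else None,
            # RSS dates use the RFC 822 format, which email.utils can parse.
            "published_at": parsedate_to_datetime(pub_date).isoformat() if pub_date else None,
            # The feed link is stored as-is; the linked article body is not fetched.
            "url": item.findtext("link"),
        })
    return records
```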

YouTube

In plain English

YouTube records capture public video metadata and engagement statistics.

YouTube records come from the YouTube Data API. The worker searches recent public videos about AI and stores video title, description, channel information, publication time, thumbnail URL, tags when the API provides them, and public view, like, and comment counts.

Record category: Discourse
Record type: Public video metadata
Context stored: Channel, description, thumbnail, available tags, and public metrics

arXiv

In plain English

arXiv provides public research preprints in AI-relevant categories.

arXiv records come from the public Atom API. AIDRAN searches AI-relevant categories, including cs.AI, cs.LG, cs.CL, and stat.ML, and stores paper title, abstract, authors, categories, publication time, and canonical arXiv URL.

Record category: Article
Record type: Public preprint metadata and abstracts
Context stored: Authors, categories, abstract, and canonical URL
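
A category search against the public arXiv API can be expressed as a single query URL. The sketch below uses the four categories named above, which the text presents as examples rather than a complete list. The sort order, page size, and function name are assumptions, not AIDRAN's documented settings.

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"
# Categories named on this page; the real set may be larger.
CATEGORIES = ("cs.AI", "cs.LG", "cs.CL", "stat.ML")


def arxiv_query_url(categories=CATEGORIES, start: int = 0, max_results: int = 50) -> str:
    """Build an arXiv API URL that lists the newest papers in the given categories."""
    search = " OR ".join(f"cat:{category}" for category in categories)
    params = {
        "search_query": search,
        "sortBy": "submittedDate",    # assumed ordering: newest submissions first
        "sortOrder": "descending",
        "start": start,               # offset used for paging
        "max_results": max_results,
    }
    return f"{ARXIV_API}?{urlencode(params)}"
```

The response is an Atom feed whose entries carry the title, abstract, authors, categories, and canonical URL listed above.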

OpenAlex

In plain English

OpenAlex provides public scholarly works metadata for AI-relevant research records and enrichment.

OpenAlex Works records come from the public OpenAlex API. AIDRAN stores public scholarly metadata such as titles, abstracts or inverted abstracts when available, authorship and venue metadata, DOI or OpenAlex identifiers, publication dates, concepts, and canonical source URLs.

Record category: Article
Record type: Public scholarly works metadata
Context stored: Work identifiers, authorship, venue, concepts, abstract metadata, and source URLs
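
OpenAlex returns abstracts as an inverted index that maps each word to the positions where it occurs. A minimal sketch of turning that structure back into readable text follows. The function name is hypothetical, and whether AIDRAN stores the rebuilt text, the raw index, or both is not stated here.

```python
from typing import Dict, List, Optional


def rebuild_abstract(inverted_index: Optional[Dict[str, List[int]]]) -> Optional[str]:
    """Rebuild plain abstract text from an OpenAlex-style inverted index."""
    if not inverted_index:
        return None
    # Flatten to (position, word) pairs, then order the words by position.
    positioned = [
        (position, word)
        for word, positions in inverted_index.items()
        for position in positions
    ]
    return " ".join(word for _, word in sorted(positioned))
```

For example, `{"AI": [0, 3], "shapes": [1], "how": [2], "works": [4]}` rebuilds to `AI shapes how AI works`.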

X (Twitter)

In plain English

Twitter/X records capture public posts returned by recent search when API access is configured.

Twitter/X records come from the API v2 recent-search endpoint when a bearer token is configured. The adapter searches public English-language AI posts, excludes retweets and replies in its query, and stores text, author ID, public URL, publication time, language, public metrics, and expanded URLs.

Record category: Discourse
Record type: Public recent-search posts
Context stored: Public metrics, language, linked URLs, and tweet URL
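
The query-level exclusions can be written with standard recent-search operators. This sketch shows one plausible set of request parameters. The search terms, the field list, and the function name are assumptions, and the adapter's actual query is not published on this page.

```python
from typing import Iterable


def recent_search_params(terms: Iterable[str], max_results: int = 100) -> dict:
    """Build request parameters for the API v2 recent-search endpoint."""
    # Multi-word terms are quoted so they match as exact phrases.
    quoted = [f'"{term}"' if " " in term else term for term in terms]
    query = "(" + " OR ".join(quoted) + ") lang:en -is:retweet -is:reply"
    return {
        "query": query,
        "max_results": max_results,
        # Fields needed for the stored record; expanded URLs arrive under entities.
        "tweet.fields": "author_id,created_at,lang,public_metrics,entities",
    }
```

Filtering in the query rather than after the fetch means excluded posts never count against the request quota.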

Exa and Websets

In plain English

Public web articles can supplement story context. The site displays them as News with publisher or domain attribution when available.

Story enrichment can use public web article results from Exa and curated Webset article imports when a story needs outside article context. Live Exa results may be stored as article records, and curated Webset entries can be imported as article records. Some cited web sources appear only as external citation links attached to a story rather than as scheduled ingestion rows.

Record category: Article
Record type: Public web article results and curated public article entries
Context stored: Publisher or domain, title, snippet or excerpt, URL, publication date when available, and provider metadata

Hugging Face

In plain English

Hugging Face records track public AI model, paper, and dataset pages without treating them as generic News.

Hugging Face records come from public watchlist targets on huggingface.co, such as model, paper, organization, and dataset pages relevant to AI discourse. AIDRAN stores public page metadata, titles, URLs, timestamps when available, and provider metadata needed for attribution.

Record category: Article
Record type: Public AI repository and research watchlist items
Context stored: Title, URL, public page metadata, and provider metadata

Official Web

In plain English

Official web sources track public release notes, changelogs, model cards, docs, and policy pages from configured publishers.

Official web records come from configured public feeds, sitemaps, and pages for AI labs, companies, standards bodies, and product teams. AIDRAN stores public titles, summaries, canonical URLs, publisher names or domains, timestamps when available, and source metadata needed to distinguish the configured official source from the linked publisher.

Record category: Article
Record type: Public release, documentation, policy, model-card, and changelog pages
Context stored: Publisher, configured source, URL, summary, and publication or update time when available

GitHub

In plain English

GitHub records track public release metadata from configured repositories.

GitHub records come from configured public repositories. The worker currently focuses on release metadata, storing release names, tags, URLs, authors when present, publication times, and bounded release-note text from the public API.

Record category: Article
Record type: Public repository release metadata
Context stored: Repository, release tag, author, URL, and release-note excerpt
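
"Bounded release-note text" implies that long release bodies are cut to a fixed size before storage. A sketch under that reading follows, using field names from the public GitHub releases API (`tag_name`, `html_url`, `body`, `author.login`). The 2000-character limit, the trailing ellipsis, and the `release_record` name are hypothetical.

```python
MAX_NOTE_CHARS = 2000  # hypothetical bound; the real limit is not documented


def release_record(repo: str, release: dict, limit: int = MAX_NOTE_CHARS) -> dict:
    """Map a GitHub release payload to a storable record with a bounded excerpt."""
    body = release.get("body") or ""
    # Keep short notes whole; trim long ones and mark the cut.
    excerpt = body if len(body) <= limit else body[:limit].rstrip() + "..."
    author = release.get("author") or {}
    return {
        "repository": repo,
        "name": release.get("name") or release.get("tag_name"),
        "tag": release.get("tag_name"),
        "url": release.get("html_url"),
        "author": author.get("login"),  # None when the API omits the author
        "published_at": release.get("published_at"),
        "notes_excerpt": excerpt,
    }
```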

Package Registries

In plain English

Package registry records track public npm and PyPI package release metadata from configured watchlists.

Package registry records come from configured npm and PyPI packages relevant to AI infrastructure. AIDRAN stores public package names, versions, descriptions, release times when available, registry URLs, and provider metadata. It does not install packages or inspect private registry content.

Record category: Article
Record type: Public package and version metadata
Context stored: Registry, package, version, URL, description, and publication time when available

Stack Exchange

In plain English

Stack Exchange records capture public technical Q&A about AI and developer tooling.

Stack Exchange records come from public search results on configured Stack Exchange sites. AIDRAN stores question titles, bounded public excerpts or body text when returned by the API, canonical question URLs, author display names, tags, scores, answer counts, and accepted-answer status.

Record category: Discourse
Record type: Public Q&A questions
Context stored: Site, tags, scores, answer counts, public author display name, and canonical URL

Regulatory and Filings

In plain English

Regulatory records track public filings, notices, standards, and official RSS sources.

Regulatory records come from configured public sources such as the Federal Register, SEC EDGAR company filings, and official RSS feeds. AIDRAN stores titles, summaries, agency or company names, forms or docket identifiers when present, canonical URLs, and publication times.

Record category: Article
Record type: Public filings, notices, standards, and official records
Context stored: Agency or company, document identifiers, URL, summary, and publication time

Product Hunt

In plain English

Product Hunt is an opt-in launch metadata source and remains restricted until API/commercial-use review approves activation.

Product Hunt records, when explicitly enabled, come from the Product Hunt GraphQL API and describe public product launches. The task is disabled by default and requires configured credentials plus operator approval. Raw Product Hunt records are not served to customer API keys until API and commercial-use review approves that access.

Record category: Article
Record type: Public launch metadata
Context stored: Product name, tagline, launch URL, maker names, topics, rankings, votes, and publication time

Mastodon

In plain English

Mastodon records capture public Fediverse posts from configured accounts, instances, or hashtags.

Mastodon records come from public instance API responses for configured accounts and hashtags. AIDRAN stores post text, canonical URLs, account identifiers, publication time, language when present, and public reply, favorite, and reblog counts.

Record category: Discourse
Record type: Public Fediverse posts
Context stored: Instance, account or hashtag, URL, language, and public engagement fields

GDELT

In plain English

GDELT records broaden public article discovery while preserving publisher or domain attribution.

GDELT records come from public GDELT DOC article-list results for configured AI-related queries. AIDRAN stores article titles, URLs, publisher or domain attribution, language, source country when available, and GDELT metadata. The linked publisher remains the attributed source for the article.

Record category: Article
Record type: Public article discovery metadata
Context stored: Publisher or domain, URL, language, source country, and discovery-query metadata

What We Don't Collect

In plain English

No private messages, locked accounts, or paywalled article bodies. Public attribution fields may be stored when the source provides them.

Private messages, DMs, or non-public account content
Content behind authentication walls that the public cannot access
Paywalled article bodies or paywall bypasses
Reader behavior profiles or user-submitted private material
Content from locked or private accounts
Private identity enrichment beyond public handles, bylines, and source attribution fields

Content Removal

In plain English

If your public post or article metadata appears in AIDRAN and you want it removed, send us the original URL or source identifier.

If you are the author or rights holder for content that appears in AIDRAN and would like it removed, please contact us at [email protected] with a link to the original content or enough source information for us to identify the record. We will process removal requests within 30 days.