Same story, twenty sources
When a species comes back from the brink, every nature outlet covers it. Twenty versions of the same event land in our pipeline within hours. Which one do you keep? And what do the other nineteen tell you?
These are two different questions. The first is about removing noise. The second is about detecting a signal. We handle them with two separate mechanisms: deduplication and corroboration.
The dedup problem
Our pipeline collects from over 1,400 sources across 50 countries. Every four hours, new articles arrive. When something genuinely newsworthy happens, a dozen outlets publish within the same day. Showing all of them wastes your attention.
The naive approach is keyword matching: if two titles share enough words, they're the same story. We tried this. Title Jaccard similarity (the number of words two titles share, divided by the number of distinct words across both) seemed promising. In practice, it barely works.
Why naive approaches fail
We analyzed duplicate pairs across several weeks of data. The Jaccard scores between titles that humans would call "the same story" ranged from 0.00 to 0.33. At those values, title overlap is essentially noise: it cannot separate duplicates from unrelated stories.
Why so low? Because different outlets frame the same event with entirely different vocabulary. One writes "Wolf population rebounds in Germany." Another writes "Predator numbers surge in Central Europe." Same story. Zero word overlap after removing stop words.
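To make that zero concrete, here is a minimal TypeScript sketch of title Jaccard with stop-word removal. The stop-word list and the function names (`titleTokens`, `jaccard`) are illustrative assumptions, not the pipeline's actual code.

```typescript
// Hypothetical, deliberately tiny stop-word list for illustration only.
const STOP_WORDS = new Set(["a", "an", "the", "in", "of", "on", "and", "to", "for"]);

// Lowercase a title, split on non-word characters, and drop stop words.
function titleTokens(title: string): Set<string> {
  const words = title
    .toLowerCase()
    .split(/\W+/)
    .filter((w) => w.length > 0 && !STOP_WORDS.has(w));
  return new Set(words);
}

// Jaccard similarity: |A ∩ B| / |A ∪ B|.
// Two empty sets are scored 0 here (a convention chosen for this sketch).
function jaccard(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 && b.size === 0) return 0;
  let shared = 0;
  a.forEach((w) => {
    if (b.has(w)) shared++;
  });
  return shared / (a.size + b.size - shared);
}

const wolf = titleTokens("Wolf population rebounds in Germany");
const predator = titleTokens("Predator numbers surge in Central Europe");
console.log(jaccard(wolf, predator)); // 0
```

The only word the two titles share is "in", which the stop-word filter removes, so the intersection is empty.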
Multilingual sources make this worse. The same event appears in English, Dutch, German, and Spanish with no lexical overlap at all. Semantic similarity (comparing meaning, not words) would help, but requires embedding every title on every run. At our scale, that's a real cost.
What we actually do
Deduplication happens in two places, at two different scales:
Upstream (NexusMind): The scoring service detects duplicates across runs using content hashing and URL normalization. If the same article appears in multiple collection cycles, only the first instance is scored. This catches exact and near-exact duplicates before they reach us.
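As a rough illustration of what URL normalization and content hashing can look like, here is a TypeScript sketch. It is not NexusMind's implementation: the tracking-parameter list, the normalization steps, the function names, and the small FNV-1a-style hash (a stand-in for whatever hash the service actually uses) are all assumptions.

```typescript
// Hypothetical list of tracking parameters to strip before comparing URLs.
const TRACKING_PARAMS = new Set(["utm_source", "utm_medium", "utm_campaign", "fbclid", "gclid"]);

// Reduce a URL to a canonical form so trivially different links compare equal.
function normalizeUrl(raw: string): string {
  const url = new URL(raw); // the parser already lowercases the host
  url.hash = "";
  url.hostname = url.hostname.replace(/^www\./, "");
  for (const key of Array.from(url.searchParams.keys())) {
    if (TRACKING_PARAMS.has(key.toLowerCase())) url.searchParams.delete(key);
  }
  url.searchParams.sort();
  if (url.pathname.length > 1 && url.pathname.endsWith("/")) {
    url.pathname = url.pathname.slice(0, -1);
  }
  return url.toString();
}

// FNV-1a-style 32-bit hash over case-folded, whitespace-collapsed text.
// A stand-in for illustration; a production system would likely use a stronger hash.
function contentFingerprint(text: string): string {
  const canonical = text.toLowerCase().replace(/\s+/g, " ").trim();
  let hash = 0x811c9dc5;
  for (let i = 0; i < canonical.length; i++) {
    hash ^= canonical.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return ("0000000" + hash.toString(16)).slice(-8);
}

interface RawArticle {
  url: string;
  body: string;
}

// Keep only articles whose URL and content have not been seen in earlier cycles.
// `seen` persists across collection cycles (assumed here to be an in-memory Set).
function keepFirstInstances(articles: RawArticle[], seen: Set<string>): RawArticle[] {
  const fresh: RawArticle[] = [];
  for (const article of articles) {
    const urlKey = "url:" + normalizeUrl(article.url);
    const bodyKey = "body:" + contentFingerprint(article.body);
    if (seen.has(urlKey) || seen.has(bodyKey)) continue;
    seen.add(urlKey);
    seen.add(bodyKey);
    fresh.push(article);
  }
  return fresh;
}
```

Treating a match on either key as a duplicate is a design choice in this sketch: it catches the same article republished under a new URL as well as the same URL collected twice.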
At build time (Chief Editor): A rule-based editorial layer compares article titles within each lens using Jaccard similarity. When two titles cross a configurable threshold (currently 0.8) within a 3-day window, the version with the lower score is removed. The threshold is deliberately high: we'd rather show a near-duplicate than accidentally remove a genuinely different story.
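The build-time rule can be sketched in TypeScript roughly as follows. Only the threshold (0.8), the 3-day window, the per-lens comparison, and "lower score is removed" come from the description above. The `Article` shape, the tokenization, the choice of publication date for the window, and treating "cross" as "greater than or equal" are assumptions.

```typescript
interface Article {
  id: string;
  lens: string;
  title: string;
  score: number;
  publishedAt: Date;
}

const JACCARD_THRESHOLD = 0.8;
const WINDOW_MS = 3 * 24 * 60 * 60 * 1000; // 3 days

// Lowercased set of words in a title (assumed tokenization).
function wordSet(title: string): Set<string> {
  return new Set(title.toLowerCase().split(/\W+/).filter((w) => w.length > 0));
}

function titleJaccard(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 && b.size === 0) return 0;
  let shared = 0;
  a.forEach((w) => {
    if (b.has(w)) shared++;
  });
  return shared / (a.size + b.size - shared);
}

// Visit articles from highest to lowest score; drop any article that is a
// near-duplicate of an already-kept article in the same lens and time window.
function dedupeByTitle(articles: Article[]): Article[] {
  const ranked = articles.slice().sort((a, b) => b.score - a.score);
  const kept: Article[] = [];
  for (const candidate of ranked) {
    const words = wordSet(candidate.title);
    const isDuplicate = kept.some(
      (k) =>
        k.lens === candidate.lens &&
        Math.abs(k.publishedAt.getTime() - candidate.publishedAt.getTime()) <= WINDOW_MS &&
        titleJaccard(wordSet(k.title), words) >= JACCARD_THRESHOLD
    );
    if (!isDuplicate) kept.push(candidate);
  }
  return kept;
}
```

Processing in descending score order guarantees that whenever two titles match, the survivor is the higher-scored version.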
This two-layer approach catches most duplicates. It misses some. A story about forest recovery in the Amazon might appear from both a Brazilian and a European source with completely different framing. We accept this trade-off: a duplicate is annoying, but a missed story is invisible.
Corroboration: the flip side
Here's the insight that changed our approach: duplicate coverage is not just noise. When multiple independent sources report the same story, that's evidence of reliability. A reef recovery story covered by one blog could be aspirational. The same story covered by Reuters, a marine research institute, and a local newspaper is corroborated.
Our Chief Editor tracks corroboration. When an article has been independently reported by other outlets (tracked via NexusMind's cross-source analysis), it gets promoted in the feed. Corroborated stories are sorted by three criteria:
- Image quality. A story with a real photograph from the source (og:image) is more credible than one with a stock photo.
- Source credibility. Articles from verified, high-credibility sources are preferred. We check against external credibility databases including IDIAP, Media Bias/Fact Check, and Wikipedia Perennial Sources.
- Corroboration count. More independent sources confirming the story means stronger evidence.
The result: you see one version of each story, and the one you see is the most trustworthy version available. Independent sources are shown in a badge on the article page, linking to each outlet that covered the same story.
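A comparator along these lines could express the promotion and the three-way sort. This is a sketch under assumptions: the field names, the boolean image flag, the 0 to 1 credibility scale, and the idea that the three criteria apply in the order listed are all illustrative, since the text does not specify how they are weighted.

```typescript
interface FeedItem {
  id: string;
  hasSourceImage: boolean; // assumed: true when og:image is a real photo from the source
  credibility: number;     // assumed: 0..1, derived from external credibility databases
  corroborations: number;  // number of independent outlets covering the same story
}

// Order by image, then credibility, then corroboration count (assumed priority).
function compareCorroborated(a: FeedItem, b: FeedItem): number {
  if (a.hasSourceImage !== b.hasSourceImage) return a.hasSourceImage ? -1 : 1;
  if (a.credibility !== b.credibility) return b.credibility - a.credibility;
  return b.corroborations - a.corroborations;
}

// Promote corroborated items to the top of the feed, sorted by the comparator;
// uncorroborated items keep their original relative order below them.
function rankFeed(items: FeedItem[], minCorroborations: number = 1): FeedItem[] {
  const corroborated = items.filter((i) => i.corroborations >= minCorroborations);
  const rest = items.filter((i) => i.corroborations < minCorroborations);
  return corroborated.sort(compareCorroborated).concat(rest);
}
```

A weighted score combining the three signals would be an equally plausible design; the strict ordering used here is simply easier to reason about and to log.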
Noise vs. signal
Dedup and corroboration solve opposite problems using the same raw material:
| | Dedup | Corroboration |
|---|---|---|
| Goal | Remove noise | Amplify signal |
| Input | Multiple versions of same story | Independent coverage of same event |
| Action | Keep best, remove rest | Promote to top of feed |
| Signal | Redundancy (annoying) | Reliability (valuable) |
What we haven't solved
Cross-lingual dedup remains an open problem. A story about the same forest recovery project might appear in the Recovery lens from a Brazilian source in Portuguese and from a German source in German. Our title comparison works on English translations, which helps, but the translations themselves can diverge enough to slip past the Jaccard threshold.
Semantic dedup (comparing article meaning rather than title words) is the likely next step. We're watching embedding costs come down. Until then, the current system catches most duplicates while staying fast enough to run twelve times a day.
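For readers unfamiliar with the idea, semantic comparison usually means turning each text into a vector (an embedding) and measuring the angle between vectors. The sketch below shows only the comparison step in TypeScript. It is not something the pipeline does today; the embedding vectors would come from a model that is not shown, and the 0.9 threshold is a made-up placeholder.

```typescript
// Cosine similarity between two equal-length vectors: dot(a, b) / (|a| * |b|).
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vectors must have the same dimension");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // zero vectors carry no direction
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical duplicate check on precomputed embeddings.
// The default threshold is a placeholder, not a tuned value.
function isSemanticDuplicate(a: number[], b: number[], threshold: number = 0.9): boolean {
  return cosineSimilarity(a, b) >= threshold;
}
```

The comparison itself is cheap; the cost mentioned above lies in producing an embedding for every article on every run.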
Technical details
Both dedup and corroboration are implemented as composable rules in the Chief Editor (ADR-029). Rules are deterministic TypeScript functions with configurable thresholds, not AI. Every action (removal, promotion) is logged with a reason. The configuration can be adjusted at runtime via a JSON file without code changes.
Current settings: Jaccard threshold 0.8, time window 3 days, minimum 1 corroborating source for promotion. These are deliberately conservative. We'd rather show a duplicate than hide a unique story.
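To illustrate the shape of such a setup, here is a TypeScript sketch of a JSON-driven configuration and a logged rule pipeline. The three default values are the settings listed above. The key names, the `Rule` signature, the log format, and the example rule are assumptions for illustration and do not reproduce ADR-029.

```typescript
interface EditorConfig {
  jaccardThreshold: number;
  windowDays: number;
  minCorroboratingSources: number;
}

const DEFAULT_CONFIG: EditorConfig = {
  jaccardThreshold: 0.8,
  windowDays: 3,
  minCorroboratingSources: 1,
};

// Parse the JSON config text, fall back to defaults for missing keys, and validate.
function parseConfig(json: string): EditorConfig {
  const raw = JSON.parse(json) as Partial<EditorConfig>;
  const config: EditorConfig = { ...DEFAULT_CONFIG, ...raw };
  if (!(config.jaccardThreshold > 0 && config.jaccardThreshold <= 1)) {
    throw new Error("jaccardThreshold must be in (0, 1]");
  }
  if (!(config.windowDays > 0)) throw new Error("windowDays must be positive");
  return config;
}

interface Item {
  id: string;
  corroborations: number;
}

interface LogEntry {
  rule: string;
  action: "remove" | "promote";
  articleId: string;
  reason: string;
}

// A rule is a deterministic function: items in, items out, every action logged.
type Rule = (items: Item[], config: EditorConfig, log: LogEntry[]) => Item[];

// Example rule: move sufficiently corroborated items to the front.
const promoteCorroborated: Rule = (items, config, log) => {
  const min = config.minCorroboratingSources;
  const promoted = items.filter((i) => i.corroborations >= min);
  const rest = items.filter((i) => i.corroborations < min);
  promoted.forEach((i) =>
    log.push({
      rule: "promoteCorroborated",
      action: "promote",
      articleId: i.id,
      reason: i.corroborations + " corroborating source(s), minimum " + min,
    })
  );
  return promoted.concat(rest);
};

// Apply rules in order, threading the item list through each one.
function runRules(items: Item[], rules: Rule[], config: EditorConfig) {
  const log: LogEntry[] = [];
  const result = rules.reduce((current, rule) => rule(current, config, log), items);
  return { items: result, log };
}
```

Because each rule has the same signature, rules can be reordered or switched off without touching the others, and the log explains every removal or promotion after the fact.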
Last updated: April 2026