Same story, twenty sources

This section is currently available in English only.

When a species comes back from the brink, every nature outlet covers it. Twenty versions of the same event land in our pipeline within hours. Which one do you keep? And what do the other nineteen tell you?

These are two different questions. The first is about removing noise. The second is about detecting a signal. We handle them with two separate mechanisms: deduplication and corroboration.

The dedup problem

Our pipeline collects from over 1,400 sources across 50 countries. Every four hours, new articles arrive. When something genuinely newsworthy happens, a dozen outlets publish within the same day. Showing all of them wastes your attention.

The naive approach is keyword matching: if two titles share enough words, they're the same story. We tried this. Title Jaccard similarity (the overlap between the sets of words in two titles, divided by their union) seemed promising. In practice, it barely works.

Why naive approaches fail

We analyzed duplicate pairs across several weeks of data. The Jaccard scores between titles that humans would call "the same story" ranged from 0.00 to 0.33. That's essentially random.

Why so low? Because different outlets frame the same event with entirely different vocabulary. One writes "Wolf population rebounds in Germany." Another writes "Predator numbers surge in Central Europe." Same story. Zero word overlap after removing stop words.

Multilingual sources make this worse. The same event appears in English, Dutch, German, and Spanish with no lexical overlap at all. Semantic similarity (comparing meaning, not words) would help, but requires embedding every title on every run. At our scale, that's a real cost.

What we actually do

Deduplication happens in two places, at two different scales:

Upstream (NexusMind): The scoring service detects duplicates across runs using content hashing and URL normalization. If the same article appears in multiple collection cycles, only the first instance is scored. This catches exact and near-exact duplicates before they reach us.

At build time (Chief Editor): A second, title-comparison pass (Jaccard similarity) exists in the editorial layer but is currently switched off — in practice, paraphrased titles defeated it. Cross-source duplicates are instead folded upstream by cluster detection: when several outlets cover the same story, one representative survives and the others become corroborating sources on its page.

This approach catches most duplicates. It misses some. A story about forest recovery in the Amazon might appear from both a Brazilian and a European source with completely different framing. We accept this trade-off: a duplicate is annoying, but a missed story is invisible.

Corroboration: the flip side

Here's the insight that changed our approach: duplicate coverage is not just noise. When multiple independent sources report the same story, that's evidence of reliability. A reef recovery story covered by one blog could be aspirational. The same story covered by Reuters, a marine research institute, and a local newspaper is corroborated.

Our Chief Editor tracks corroboration. When an article has been independently reported by other outlets (tracked via NexusMind's cross-source analysis), it gets promoted in the feed. Corroborated stories are sorted by three criteria:

Image quality. A story with a real photograph from the source (og:image) is more credible than one with a stock photo.
Source credibility. Articles from verified, high-credibility sources are preferred. We check against external credibility databases including IDIAP, Media Bias/Fact Check, and Wikipedia Perennial Sources.
Corroboration count. More independent sources confirming the story means stronger evidence.

The result: you see one version of each story, and the one you see is the most trustworthy version available. Independent sources are shown in a badge on the article page, linking to each outlet that covered the same story.

Noise vs. signal

Dedup and corroboration solve opposite problems using the same raw material:

	Dedup	Corroboration
Goal	Remove noise	Amplify signal
Input	Multiple versions of same story	Independent coverage of same event
Action	Keep best, remove rest	Promote to top of feed
Signal	Redundancy (annoying)	Reliability (valuable)

What we haven't solved

Cross-lingual dedup remains an open problem. A story about the same forest recovery project might appear in the Recovery lens from a Brazilian source in Portuguese and from a German source in German. Our title comparison works on English translations, which helps, but the translations themselves can diverge enough to slip past the Jaccard threshold.

Semantic dedup (comparing article meaning rather than title words) is the likely next step. We're watching embedding costs come down. Until then, the current system catches most duplicates while staying fast enough to run twelve times a day.

Technical details

Corroboration is a composable rule in the Chief Editor (ADR-029) — a deterministic TypeScript function with configurable thresholds, no AI involved. A Chief-Editor title-dedup rule (Jaccard similarity, 0.8 threshold, 3-day window) exists in the same codebase but is currently switched off: paraphrased titles defeat it, so the duplicate removal you see today happens upstream, before scoring. (The Chief Editor also runs a separate AI editorial gate for content-type decisions; see the trust page.) Every action (removal, promotion) is logged with a reason. The configuration can be adjusted at runtime via a JSON file without code changes.

Current settings: minimum 1 corroborating source for promotion. Deliberately conservative: we'd rather show a duplicate than hide a unique story.

Last updated: July 2026