How we select stories

"Just filter the news for what's actually working." It sounds like a weekend project. Grab an LLM, score some articles, train a classifier. Ship it.

We tried. It took us a year, multiple filters, and a few humbling failures before we understood why this is fundamentally harder than it looks. Not because the technology is immature, but because the problem has a structural property that standard approaches don't handle well: the thing you're looking for is rare, and the reason it's rare is the same bias you're trying to correct.

An article qualifies when the constructive finding sits at the center of the story rather than the edge, when its claims rest on evidence rather than aspiration, and when the piece is honest about what didn't work or who didn't benefit. The bar is what still looks true on a second reading.

Five constructive lenses

Each lens looks for a different kind of evidence that the daily news cycle tends to miss:

Thriving: People thriving, health improving, lives getting better
Belonging: Community bonds, rootedness, living cultural practices and heritage
Recovery: Ecosystems healing, species returning, nature bouncing back
Solutions: Technology, policy, and initiatives that work, including long-horizon plans with delivered outcomes
Discovery: Archaeology, rediscovered knowledge, historical context surfacing

Together they surface evidence that the world is more functional than the news makes it appear. The engineering challenge: constructive news is a needle in a haystack, and the haystack is designed to hide the needle.

A note on vocabulary used below. A scorer is a trained filter that reads articles and produces a 0–10 score (we have six, described in llm-distillery). A lens is a reader-facing pool on the site. Each scorer has a default lens; one scorer (foresight) now routes into Solutions. When we describe training challenges below, the "foresight filter" means the scorer, not a tab on the site.

Why keywords and sentiment fail

The first instinct is keyword matching. Surely "community" finds belonging, "long-term" finds foresight, "recovery" finds nature recovery?

It doesn't. What we're looking for isn't a topic — it's a judgment.

Our belonging filter scores articles on six dimensions (defined in llm-distillery/filters/belonging): intergenerational bonds, community fabric, reciprocal care, rootedness, purpose beyond self, and slow presence. A LinkedIn article about "building community at work" contains all the right keywords. It scores 1.3 out of 10. A story about a 94-year-old making pasta with her granddaughter using her mother's recipe scores 8.5. No keyword distinguishes them. The difference is what kind of community — commodified versus organic, optimized versus lived.

Sentiment analysis fails in the opposite direction. Constructive news isn't positive news. A country admitting its drug war failed and shifting to treatment — that reads as negative. Our foresight filter scores it 6.1 because the decision-making process shows evidence-based course correction, systems awareness, and institutional durability. Meanwhile, a cheerful wellness listicle about living longer scores 1.3 on belonging because it commodifies community as a longevity hack.

The judgment we need — "does this article demonstrate genuine foresight / belonging / recovery?" — requires understanding intent, process quality, and evidence, not just topic or tone.

Scoring across dimensions

Our approach: decompose each lens into 6 to 8 weighted dimensions that can be scored independently on a 0–10 scale. Take Thriving as an example:

Thriving lens scoring dimensions and weights
Dimension	Weight	What it measures
Human wellbeing impact	30%	Health, quality of life, personal welfare
Social cohesion impact	25%	Community strengthening, solidarity
Justice & rights impact	15%	Fairness, equality, human rights
Evidence level	12%	Reliability of claims
Benefit distribution	10%	Who benefits, equity of reach
Change durability	8%	Lasting vs. temporary improvement

The Solutions lens weighs completely different things: technology readiness (20%), technical performance (18%), economic competitiveness (15%), environmental life cycle impact (15%). Same scoring system, different definition of what matters. Each of our six scorers has its own dimension set and weights — the exact numbers are published in llm-distillery.

Each article gets a weighted average across all dimensions, then a tier: high, medium, or low. Only the top tiers make it to the site.
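
As a rough illustration of the weighted average and tiering described above, the sketch below uses the Thriving weights from the table. The dimension keys, the tier cut-offs (7.0 and 4.0), and the example scores are assumptions made for illustration; the actual thresholds are not stated on this page.

```python
# Weights copied from the Thriving table; they sum to 1.0.
THRIVING_WEIGHTS = {
    "human_wellbeing_impact": 0.30,
    "social_cohesion_impact": 0.25,
    "justice_rights_impact": 0.15,
    "evidence_level": 0.12,
    "benefit_distribution": 0.10,
    "change_durability": 0.08,
}

# Hypothetical tier cut-offs, chosen only for this example.
HIGH_CUTOFF = 7.0
MEDIUM_CUTOFF = 4.0


def weighted_score(scores, weights):
    """Weighted average of per-dimension scores on a 0-10 scale."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(scores[dim] * w for dim, w in weights.items()) / total_weight


def tier(score):
    """Map a weighted score to a tier using the assumed cut-offs."""
    if score >= HIGH_CUTOFF:
        return "high"
    if score >= MEDIUM_CUTOFF:
        return "medium"
    return "low"


# Made-up dimension scores for a single article.
example_scores = {
    "human_wellbeing_impact": 8,
    "social_cohesion_impact": 7,
    "justice_rights_impact": 5,
    "evidence_level": 6,
    "benefit_distribution": 6,
    "change_durability": 5,
}

article_score = weighted_score(example_scores, THRIVING_WEIGHTS)
article_tier = tier(article_score)
```

Swapping in a different weight dictionary is all it takes to score the same article under another scorer, which is the sense in which the scoring system is shared while the definition of what matters changes.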

The needle problem

A large language model (the oracle) scores articles on these dimensions. It's good at this. But it's a cloud API call for every article, every lens, every run. What if we could capture what it knows and run it locally? We use the oracle to score thousands of articles, then train a small language model (SLM) to replicate those judgments. This is loosely called knowledge distillation — the large model teaches, the small model learns.

But this only works if the student has good training data. When we trained the foresight filter, we scored 300 random articles from our news corpus. The distribution:

Foresight score distribution before screening
Score range	% of articles
0–2 (outside this lens)	90%
2–5 (some foresight)	9%
5+ (genuine foresight)	1%

A low score does not mean bad journalism. It means the article covers territory outside what this particular lens looks for. Most of the 90% is competent, well-reported work on topics that simply aren't about long-term institutional decision-making. But from a training-data perspective, 90% of articles cluster at the bottom of the scale. The 2–5 range, where the model needs to learn the gradient from "a bit of foresight" to "strong foresight", is thin at 27 articles. And the high-scoring articles that define what foresight looks like? Three articles out of 300.

This is not a labeling error. It reflects what the daily news cycle covers. News selects for immediacy: this week's crisis, this quarter's earnings, this election's polls. Genuine foresight — decisions made for generations ahead — is not what newsrooms cover.

A small model trained on this distribution learns exactly one thing: predict low scores for everything. That minimizes average error when 90% of your training data is low-relevance articles. The resulting model has a technically acceptable loss but is useless — it can't distinguish a New Zealand wellbeing budget reform from a celebrity interview.
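
To see why a constant low prediction can look acceptable on paper, here is a small worked example. It assumes 300 labels split 90% / 9% / 1% as in the table, with one hypothetical representative score per bucket (1.0, 3.5, and 6.0); the real labels are spread within each range.

```python
from statistics import median

# 270 low, 27 mid, 3 high articles; the per-bucket scores are assumptions.
labels = [1.0] * 270 + [3.5] * 27 + [6.0] * 3

# The median is the constant that minimizes mean absolute error.
constant_prediction = median(labels)


def mean_absolute_error(targets, prediction):
    """Average absolute gap between each target and one fixed prediction."""
    return sum(abs(t - prediction) for t in targets) / len(targets)


overall_mae = mean_absolute_error(labels, constant_prediction)
high_scoring = [t for t in labels if t >= 5.0]
high_mae = mean_absolute_error(high_scoring, constant_prediction)
```

Under these assumptions the constant predictor is off by about 0.28 points on average across all articles, yet off by a full 5 points on every high-scoring article. The aggregate number hides the failure on exactly the cases the filter exists to find.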

Two-stage screening

The solution separates two questions that the oracle was trying to answer simultaneously:

Stage 1: "Is this article relevant?" — handled by an embedding screener before oracle scoring.

We write 10–15 descriptions of what the concept looks like in practice (for foresight: New Zealand's wellbeing budget, Costa Rica's 30-year reforestation, Wales's Future Generations Commissioner). A small, fast model finds articles that resemble these examples. The top candidates get sent to the oracle.

Stage 2: "How much foresight does it contain?" — handled by the oracle, now scoring only relevant articles instead of everything.

The result:

Foresight score distribution before and after screening
Score range	Before screening	After screening
0–2 (outside lens)	90%	23%
2–5 (some foresight)	9%	55%
5+ (genuine foresight)	1%	20%

The dead zone disappeared: the 2–5 range grew from 9% to 55% of scored articles, and the 5+ range from 1% to 20%. The SLM now has examples across the full score range, and foresight went from unusable to a working production filter.
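
The screening stage described in this section might be sketched roughly as follows. This is a minimal illustration, not the production screener: the three-dimensional vectors stand in for real embeddings, and the use of cosine similarity, the max-over-seeds rule, and all names are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def screen(articles, seeds, top_k):
    """Return ids of the top_k articles most similar to any seed description."""
    ranked = sorted(
        articles,
        key=lambda article_id: max(cosine(articles[article_id], s) for s in seeds),
        reverse=True,
    )
    return ranked[:top_k]


# Toy vectors standing in for embeddings of the seed descriptions.
seeds = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]

# Toy vectors standing in for embeddings of incoming articles.
articles = {
    "wellbeing_budget": [0.85, 0.15, 0.05],
    "celebrity_interview": [0.0, 0.1, 0.95],
    "quarterly_earnings": [0.1, 0.9, 0.2],
    "reforestation_plan": [0.7, 0.3, 0.1],
}

# Stage 1: keep only the closest matches; only these would go to the oracle.
candidates = screen(articles, seeds, top_k=2)
```

Ranking by the best match against any single seed, rather than the average over all seeds, lets an article qualify by resembling one example closely. That choice is part of the sketch, not a documented property of the real screener.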

Knowledge distillation

The deeper framing of distillation is energy. An oracle scoring run is a one-time energy investment. It calls a cloud API for a few hours, scores a few thousand articles, and produces training data. After that, the SLM runs on a local GPU — no data center, no network round-trip. Training takes about 30 minutes on a consumer GPU. At 2,000 articles per day across 6 filters, the daily energy draw of inference is a fraction of what the oracle run consumed.

The summarization pipeline on ovr.news itself runs a 27-billion-parameter model locally, not in the cloud. Full control, no API costs for the bulk of the work.

What we don't solve yet

Dimensions bleed into each other. Some concepts we try to score separately turn out to be genuinely related. The models conflate them, and we haven't fully solved that.
Subtle judgment is hard for small models. Distinguishing "token caveat" from "genuine nuance" requires reading comprehension that may have a floor.
Calibration misleads on small datasets. We can tune accuracy on a validation set, but with only a few hundred examples, the tuning overfits. We report the uncalibrated number as the honest one.
We miss things. Pre-screening finds articles that resemble our seed examples. If a foresighted decision is described in unusual language, the screener won't find it. The seeds themselves encode our editorial judgment — different seeds would find different needles.

The pattern

The solution is not a single technique but a pipeline:

Dimensional scoring — decompose judgment into measurable sub-factors
Embedding pre-screening — find the needles before you score them
Soft scope gating — let the oracle grade on a gradient, not a binary
Knowledge distillation — invest energy once, infer sustainably forever

We didn't design this pipeline in advance. We discovered it by failing. Each failure taught us one piece. The pattern applies beyond news — any domain where the target signal is rare and the noise is systematically produced will have this property.

Once an article passes these filters, it enters the ranking algorithm. See how we rank stories for the full formula.

If you're a publisher and believe your work fits one of our lenses, get in touch.

The scoring dimensions and filter weights are published in llm-distillery. Two trained filters are available on Hugging Face.

Last updated: April 2026