Models Misidentify Historical Indian Artifacts

Why this is here: The TAB-VLM benchmark includes questions about 1,600 Indian cultural artifacts, ranging from prehistoric times to the modern era.

Researchers evaluated ten vision-language models on 1,600 Indian cultural artifacts spanning prehistoric to modern times. They identified a failure mode they call cultural anachronism, in which a model interprets an object using concepts from the wrong time period.

The team created the Temporal Anachronism Benchmark for Vision-Language Models, or TAB-VLM. The benchmark uses 600 questions across six categories to test how well models reason about time. Even the best model, GPT-5.2, achieved only 58.7% accuracy.
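For readers who want to see how a headline accuracy figure and a per-category breakdown relate, here is a minimal sketch of one way such scores could be tallied. It is not the researchers' evaluation code. The function name, the record layout, and the category labels ("dating", "material") are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Tally accuracy per category and overall.

    `records` is an iterable of (category, predicted, expected) tuples.
    This layout is an assumption for illustration, not the benchmark's format.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, predicted, expected in records:
        total[category] += 1
        if predicted == expected:
            correct[category] += 1

    # Fraction of correct answers within each category.
    per_category = {c: correct[c] / total[c] for c in total}
    # Overall accuracy weights every question equally, not every category.
    n = sum(total.values())
    overall = sum(correct.values()) / n if n else 0.0
    return per_category, overall


# Hypothetical answers: category labels and choices are made up.
sample = [
    ("dating", "A", "A"),
    ("dating", "B", "C"),
    ("material", "D", "D"),
    ("material", "A", "A"),
]
per_category, overall = accuracy_by_category(sample)
print(per_category, overall)
```

Because the overall score counts each question once, a category with more questions pulls the headline number more strongly than a small one, so a per-category view can reveal weaknesses that a single figure hides.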

The shortfall persists across model types and sizes. This suggests that current vision-language systems struggle with historical context, especially for visual cultures that are underrepresented in training data. The researchers released the dataset and code for further study.