TL;DR - We introduce Temporal Difference in Vision (TDV), a self-supervised learning (SSL) approach that trains purely from video under a single causal assumption: the past causes the future. TDV enforces that a frame's representation plus an encoded motion signal equals the next frame's representation. This avoids the strong inductive biases (augmentations, masking, cropping) that prior SSL methods rely on, while matching or surpassing their performance on dense spatial tasks.
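To make the constraint concrete, here is a minimal sketch of a temporal-difference consistency check in NumPy. It is an illustration under stated assumptions, not the paper's implementation: the random linear encoders, the use of a raw frame difference as the motion signal, the mean squared error, and every name (`encode_frame`, `encode_motion`, `td_consistency_loss`) are hypothetical. It also uses one vector per frame for brevity, whereas a model aimed at dense spatial tasks would more plausibly keep per-patch features.

```python
import numpy as np

def td_consistency_loss(z_t, z_next, delta):
    """Mean squared residual of the constraint z_t + delta = z_next."""
    residual = z_t + delta - z_next
    return float(np.mean(residual ** 2))

rng = np.random.default_rng(0)
H, W, D = 8, 8, 16  # hypothetical frame height, width, and embedding size

# Two consecutive "video frames" (random stand-ins for real data).
frame_t = rng.random((H, W))
frame_next = rng.random((H, W))

# Assumption: random linear maps stand in for the learned encoders.
W_frame = rng.normal(size=(H * W, D))
W_motion = rng.normal(size=(H * W, D))

def encode_frame(frame):
    """Hypothetical frame encoder: flatten, then project to D dimensions."""
    return frame.reshape(-1) @ W_frame

def encode_motion(frame_a, frame_b):
    """Hypothetical motion encoder: project the raw frame difference."""
    return (frame_b - frame_a).reshape(-1) @ W_motion

z_t = encode_frame(frame_t)
z_next = encode_frame(frame_next)
delta = encode_motion(frame_t, frame_next)

# Training would minimise this quantity over the encoder parameters.
loss = td_consistency_loss(z_t, z_next, delta)
print(f"temporal-difference loss: {loss:.4f}")
```

The loss reaches zero exactly when the encoded motion equals the difference between the two frame representations, which is the equality stated above. How TDV defines the motion signal and keeps the representations from collapsing is not specified here, so this sketch leaves both out.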
@misc{daithankar2026dontneedstrongassumptions,
      title={You Don't Need Strong Assumptions: Visual Representation Learning via Temporal Differences},
      author={Ninad Daithankar and Alexi Gladstone and Yann LeCun and Heng Ji},
      year={2026},
      eprint={2606.15956},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2606.15956},
}