You Don't Need Strong Assumptions: Visual Representation Learning via Temporal Differences

Abstract

Progress in the field of AI has largely been driven by methods that more effectively leverage increasing computation and data. Generally, this takes the form of approaches with weaker inductive biases or assumptions about the data performing asymptotically better than approaches with stronger assumptions. This is particularly characteristic of Visual Representation Learning, where the dominant approach has shifted from Supervised Learning, to Weakly Supervised Learning, to the eventual widespread success of Self-Supervised Learning without human labels. Yet even modern Self-Supervised Learning approaches still depend on strong inductive biases such as augmentations, masking, cropping, or raw data reconstruction. Notably, we show empirically that the optimal strength of these inductive biases decreases as data scale grows, which motivates the search for approaches that rely on as few assumptions as possible. To push this frontier further, we introduce Temporal Difference in Vision (TDV), a new approach for self-supervised learning from video that avoids these inductive biases and relies instead on the causal assumption that the past causes the future. TDV jointly trains an image encoder and a motion encoder so that the current frame's representation plus the encoded motion equals the next frame's representation. Despite not leveraging any strong inductive biases, TDV matches or surpasses state-of-the-art recipes on dense spatial tasks, laying the foundation for representation learning with weaker assumptions.

Intuition

The frame encoder maps the current frame to its latent embedding, while the motion encoder maps the RGB difference between the current and next frames to a latent delta. Training enforces the causal principle that the current representation plus the motion delta equals the next frame's representation. Since consecutive frames are temporally close, the RGB difference captures primarily edges and moving regions, encouraging the motion encoder to learn compact representations of spatial change rather than full scene appearance.
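To make the sparsity of the frame difference concrete, here is a minimal toy sketch. It is not taken from the paper: the 4x4 grayscale frames, the helper names `frame_with_dot` and `frame_difference`, and the sign convention (next minus current) are all assumptions chosen for illustration.

```python
# Toy illustration (hypothetical data): two 4x4 grayscale "frames" in which a
# single bright pixel moves one step to the right. Real inputs would be RGB
# video frames; grayscale is used here only to keep the example small.

def frame_with_dot(col, size=4, row=1):
    """Return a size x size frame of zeros with one bright pixel at (row, col)."""
    return [[1.0 if (r == row and c == col) else 0.0 for c in range(size)]
            for r in range(size)]

def frame_difference(next_frame, current_frame):
    """Per-pixel difference, next minus current (sign convention assumed)."""
    return [[n - c for n, c in zip(next_row, cur_row)]
            for next_row, cur_row in zip(next_frame, current_frame)]

current_frame = frame_with_dot(col=1)
next_frame = frame_with_dot(col=2)
diff = frame_difference(next_frame, current_frame)

# Only the pixel the dot left and the pixel it entered are nonzero.
nonzero = sum(1 for row in diff for value in row if value != 0.0)
total = sum(len(row) for row in diff)
print(f"{nonzero} of {total} pixels changed")  # prints "2 of 16 pixels changed"
```

The static background cancels out entirely, so the difference image carries information only about what moved.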

Architecture

The frame encoder produces a representation of the current frame, while the motion encoder predicts how that representation should change by encoding the RGB difference, conditioned on the current frame's representation via cross-attention. Adding the two yields a predicted next-frame representation, which is supervised via MSE against the next-frame embedding produced by a teacher frame encoder, an EMA (exponential moving average) copy of the frame encoder. A DINO-style categorical cross-entropy loss ensures representations remain discriminative and do not collapse.
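The prediction term described above can be sketched in a few lines. This is a rough illustration under heavy assumptions, not the authors' implementation: the encoders are replaced by plain linear maps, the cross-attention conditioning and the DINO-style loss are omitted, and the names `linear`, `tdv_prediction_loss`, `ema_update` and the momentum value are hypothetical.

```python
def linear(weights, x):
    """Apply a weight matrix (list of rows) to a vector: a stand-in encoder."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)

def tdv_prediction_loss(frame_w, motion_w, teacher_w, frame_t, frame_next):
    """MSE between (current representation + motion delta) and the teacher target."""
    z_t = linear(frame_w, frame_t)                       # student frame encoder
    rgb_diff = [n - c for n, c in zip(frame_next, frame_t)]
    # The described model also conditions on z_t via cross-attention; omitted here.
    delta = linear(motion_w, rgb_diff)                   # motion encoder
    predicted_next = [z + d for z, d in zip(z_t, delta)]
    target = linear(teacher_w, frame_next)               # teacher: no gradient in practice
    return mse(predicted_next, target)

def ema_update(teacher_w, student_w, momentum=0.99):
    """Move teacher weights toward the student (momentum value is an assumption)."""
    return [[momentum * t + (1.0 - momentum) * s for t, s in zip(t_row, s_row)]
            for t_row, s_row in zip(teacher_w, student_w)]

# Flattened toy "frames" with three pixels each, and one shared 2x3 weight matrix.
W = [[1.0, 0.0, 2.0], [0.5, 1.0, 0.0]]
frame_t = [1.0, 2.0, 3.0]
frame_next = [1.0, 3.0, 2.0]

loss = tdv_prediction_loss(W, W, W, frame_t, frame_next)
print(loss)  # 0.0: identical linear encoders satisfy the additive constraint exactly
new_teacher = ema_update([[0.0]], [[1.0]], momentum=0.9)
```

The zero loss is a property of the linear stand-ins, since W x_t + W (x_next - x_t) = W x_next. With nonlinear encoders the constraint is not automatic and has to be learned, which is the point of the objective.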

Results

TDV outperforms iBOT and DINO on most optical flow and stereo depth comparisons. On optical flow, TDV consistently achieves lower EPE (endpoint error). We attribute this to TDV explicitly learning to predict how representations evolve between frames, which naturally preserves local motion structure that image-based methods with invariance augmentations tend to discard. On stereo depth, TDV achieves lower "bad" pixel rates at both the 0.5px and 1px thresholds, indicating significantly fewer large correspondence errors, with a small trade-off on average disparity error.
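For readers unfamiliar with the metrics, the sketch below computes them using their commonly used definitions: EPE as the mean Euclidean distance between predicted and ground-truth flow vectors, and the "bad" pixel rate as the fraction of pixels whose absolute disparity error exceeds a threshold. The function names and toy values are hypothetical, and the paper's exact evaluation protocol (for example, validity masks) may differ.

```python
import math

def endpoint_error(pred_flow, gt_flow):
    """Mean Euclidean distance between predicted and true (u, v) flow vectors."""
    distances = [math.hypot(pu - gu, pv - gv)
                 for (pu, pv), (gu, gv) in zip(pred_flow, gt_flow)]
    return sum(distances) / len(distances)

def bad_pixel_rate(pred_disp, gt_disp, threshold_px):
    """Fraction of pixels whose absolute disparity error exceeds threshold_px."""
    bad = sum(1 for p, g in zip(pred_disp, gt_disp) if abs(p - g) > threshold_px)
    return bad / len(gt_disp)

def mean_abs_error(pred_disp, gt_disp):
    """Average absolute disparity error over all pixels."""
    return sum(abs(p - g) for p, g in zip(pred_disp, gt_disp)) / len(gt_disp)

# Hypothetical flow vectors for four pixels; only the second one is wrong.
pred_flow = [(1.0, 0.0), (0.0, 0.0), (3.0, 4.0), (2.0, 2.0)]
gt_flow = [(1.0, 0.0), (3.0, 4.0), (3.0, 4.0), (2.0, 2.0)]
epe = endpoint_error(pred_flow, gt_flow)

# Hypothetical disparities for four pixels, with errors 0, 0.25, 0.75 and 2.0 px.
pred_disp = [10.0, 10.25, 10.75, 12.0]
gt_disp = [10.0, 10.0, 10.0, 10.0]
bad_05 = bad_pixel_rate(pred_disp, gt_disp, threshold_px=0.5)
bad_10 = bad_pixel_rate(pred_disp, gt_disp, threshold_px=1.0)
avg_err = mean_abs_error(pred_disp, gt_disp)

print(epe, bad_05, bad_10, avg_err)  # 1.25 0.5 0.25 0.75
```

Because the bad-pixel rate counts errors above a threshold while the average error weights every pixel by its error magnitude, a method can improve on one and regress slightly on the other, as in the trade-off reported above.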

BibTeX

@misc{daithankar2026dontneedstrongassumptions,
      title={You Don't Need Strong Assumptions: Visual Representation Learning via Temporal Differences},
      author={Ninad Daithankar and Alexi Gladstone and Yann LeCun and Heng Ji},
      year={2026},
      eprint={2606.15956},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2606.15956},
}