Learning Temporally Consistent Video Depth from Video Diffusion Priors

1Zhejiang University 2University of Bologna 3Ant Group 4Rock Universe
*equal contribution; corresponding author

This work addresses the challenge of video depth estimation, which requires not only per-frame accuracy but, more importantly, cross-frame consistency. Instead of developing a depth estimator from scratch, we reformulate the prediction task as a conditional generation problem. This allows us to leverage the prior knowledge embedded in existing video generation models, thereby reducing the learning difficulty and enhancing generalizability. Concretely, we study how to tame the public Stable Video Diffusion (SVD) to predict reliable depth from input videos using a mixture of image depth and video depth datasets. We empirically confirm that a procedural training strategy, which first optimizes the spatial layers of SVD and then optimizes the temporal layers while keeping the spatial layers frozen, yields the best results in terms of both spatial accuracy and temporal consistency. We further examine the sliding window strategy for inference on arbitrarily long videos. Our observations indicate a trade-off between efficiency and performance, with a one-frame overlap already producing favorable results. Extensive experiments demonstrate the superiority of our approach, termed ChronoDepth, over existing alternatives, particularly in terms of the temporal consistency of the estimated depth. We further highlight the benefits of more consistent video depth in two practical applications: depth-conditioned video generation and novel view synthesis.

Comparison Gallery

Overall Framework


Training pipeline. We add an RGB video conditioning branch to a pre-trained video diffusion model, Stable Video Diffusion, and fine-tune it for consistent depth estimation. Both the RGB and depth videos are projected to a lower-dimensional latent space using a pre-trained encoder. The video depth estimator is trained via denoising score matching (DSM). Our training involves two stages: first, we train the spatial layers with single-frame depths; then, we freeze the spatial layers and train the temporal layers using clips of randomly sampled lengths. This sequential spatial-temporal fine-tuning approach yields better performance than training the full network.
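Below is a minimal PyTorch-style sketch of this two-stage schedule. It is an illustration rather than the released training code: the denoiser call signature, the assumption that temporal layers can be identified by the substring "temporal" in their parameter names, the channel-wise concatenation of RGB and noisy depth latents, and the simplified noise-level sampling are all assumptions made for clarity.

import torch
import torch.nn.functional as F

def dsm_loss(model, z_depth, z_rgb, sigma):
    # Predict the clean depth latent from its noisy version, conditioned on the RGB latent.
    noise = torch.randn_like(z_depth)
    z_noisy = z_depth + sigma * noise
    cond_input = torch.cat([z_noisy, z_rgb], dim=2)        # concatenate along the channel axis
    return F.mse_loss(model(cond_input, sigma), z_depth)

def set_trainable(model, train_temporal):
    # Stage 1 trains only the spatial layers; stage 2 trains only the temporal layers.
    for name, p in model.named_parameters():
        is_temporal = "temporal" in name                   # naming convention assumed here
        p.requires_grad = is_temporal if train_temporal else not is_temporal

def train_stage(model, vae, loader, train_temporal, num_steps, lr=1e-5):
    set_trainable(model, train_temporal)
    params = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(params, lr=lr)
    for _, (rgb, depth) in zip(range(num_steps), loader):  # (B, T, 3, H, W) video batches
        with torch.no_grad():                              # the pre-trained VAE stays frozen
            z_rgb, z_depth = vae.encode(rgb), vae.encode(depth)
        sigma = torch.exp(torch.randn(rgb.shape[0], 1, 1, 1, 1))  # random noise level per sample
        loss = dsm_loss(model, z_depth, z_rgb, sigma)
        optimizer.zero_grad(); loss.backward(); optimizer.step()

# Stage 1: single-frame clips (T = 1) update the spatial layers.
# Stage 2: clips of randomly sampled length update the temporal layers
#          while the spatial layers remain frozen.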


Inference pipeline. We explore two inference strategies. \(\textit{Option 1}\): Separate Inference divides the video into non-overlapping clips and predicts the depth of each clip independently. \(\textit{Option 2}\): Temporal Inpaint Inference inpaints the later frames \(\mathbf{z}_{[W:T]}\) of each clip conditioned on the prediction \(\mathbf{z}_{[0:W]}\) for the frames shared with the previous clip. The proposed temporal inpainting enhances temporal consistency.
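The following sketch contrasts the two options. It assumes z_rgb holds the VAE-encoded RGB latents of the full video with shape (B, T, C, h, w), and that a hypothetical sample(model, z_rgb, z_known, keep_mask) routine runs the reverse diffusion process, re-imposing the known depth latents at the masked frames after each denoising step (latent inpainting). These names are illustrative and do not correspond to the released API.

import torch

def infer_separate(model, sample, z_rgb, clip_len):
    # Option 1: split the video into non-overlapping clips and denoise each independently.
    preds = []
    for s in range(0, z_rgb.shape[1], clip_len):
        preds.append(sample(model, z_rgb[:, s:s + clip_len], z_known=None, keep_mask=None))
    return torch.cat(preds, dim=1)

def infer_temporal_inpaint(model, sample, z_rgb, clip_len, overlap):
    # Option 2: overlap consecutive clips by `overlap` frames; the overlapping frames are
    # fixed to the previous clip's prediction and only the later frames are denoised.
    T = z_rgb.shape[1]
    depth = sample(model, z_rgb[:, :clip_len], z_known=None, keep_mask=None)
    s = clip_len - overlap
    while s + overlap < T:
        clip = z_rgb[:, s:s + clip_len]
        z_known = torch.zeros_like(clip)
        z_known[:, :overlap] = depth[:, -overlap:]         # frames shared with the previous clip
        keep_mask = torch.zeros(clip.shape[1], dtype=torch.bool)
        keep_mask[:overlap] = True                         # keep the shared frames fixed
        pred = sample(model, clip, z_known=z_known, keep_mask=keep_mask)
        depth = torch.cat([depth, pred[:, overlap:]], dim=1)
        s += clip_len - overlap
    return depth

As noted in the abstract, the overlap length trades efficiency for consistency, and a one-frame overlap (overlap=1) already produces favorable results.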

Citation

@misc{shao2024learning,
  title={Learning Temporally Consistent Video Depth from Video Diffusion Priors},
  author={Jiahao Shao and Yuanbo Yang and Hongyu Zhou and Youmin Zhang and Yujun Shen and Matteo Poggi and Yiyi Liao},
  year={2024},
  eprint={2406.01493},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}