DepthFM: Fast Monocular Depth Estimation with Flow Matching

Ming Gui*, Johannes S. Fischer*, Ulrich Prestel, Pingchuan Ma, Dmytro Kotovenko, Olga Grebenkova, Stefan A. Baumann, Vincent Tao Hu, Björn Ommer
CompVis @ LMU Munich, MCML
Teaser image demonstrating DepthFM depth estimation.

Overview

We present DepthFM, a fast, versatile, state-of-the-art generative monocular depth estimation model. Beyond conventional depth estimation, DepthFM also demonstrates state-of-the-art capabilities in downstream tasks such as depth inpainting. DepthFM is efficient and can synthesize depth maps within a few inference steps.

The gallery below presents images sourced from the internet, accompanied by a comparison between our DepthFM, ZoeDepth, and Marigold. Use the slider and gestures to reveal details on either side. Note that our depth maps in the first two rows are generated with 10 inference steps, while the last row shows our results with a single inference step, compared to Marigold with two inference steps. We achieve significantly faster inference at a minimal cost in quality.
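For intuition, few-step inference amounts to integrating the learned vector field from the image latent towards a depth latent with a handful of Euler steps. The following is a minimal sketch of that idea; the model interface and function names are assumptions for illustration, not the released implementation.

import torch

@torch.no_grad()
def sample_depth(model, x_img, num_steps=1):
    # Integrate the learned velocity field from the image latent (t = 0)
    # towards the depth latent (t = 1) with a few Euler steps.
    x = x_img
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((x.shape[0],), i * dt, device=x.device)
        x = x + dt * model(x, t)  # Euler update along the predicted velocity
    return x  # depth latent; decode with the autoencoder afterwards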

Methods

DepthFM training scheme

Data-dependent flow matching

DepthFM regresses a straight vector field between the image distribution $x$ and the depth distribution $d$ by leveraging image-to-depth pairs. This data-dependent coupling enables efficient few-step inference without sacrificing performance.
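A minimal sketch of one training step for this objective, assuming paired image and depth latents and a network that predicts the velocity field (the names below are illustrative, not the released code):

import torch
import torch.nn.functional as F

def flow_matching_step(model, x_img, x_depth):
    # Data-dependent coupling: interpolate on the straight line between the
    # image latent and its paired depth latent, and regress the constant
    # velocity along that line.
    t = torch.rand(x_img.shape[0], device=x_img.device)   # t ~ U(0, 1)
    t_ = t.view(-1, 1, 1, 1)
    x_t = (1.0 - t_) * x_img + t_ * x_depth               # straight-line interpolant
    v_target = x_depth - x_img                             # target velocity
    v_pred = model(x_t, t)                                 # predicted velocity field
    return F.mse_loss(v_pred, v_target)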

Fine-tuning from a diffusion prior

We demonstrate the successful transfer of the strong image prior from a foundation image-synthesis diffusion model (Stable Diffusion v2-1) to a flow matching model, with minimal reliance on training data and without the need for real-world images.
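One plausible way to realize this transfer is to initialize the flow matching backbone from the pretrained Stable Diffusion v2-1 denoiser weights, e.g. via diffusers, before fine-tuning it on synthetic image-to-depth pairs. This is only a sketch of the general idea, not the authors' exact setup.

from diffusers import UNet2DConditionModel

# Initialize the flow matching backbone with the Stable Diffusion v2-1 UNet weights;
# the velocity field is then fine-tuned on synthetic image-to-depth pairs.
unet = UNet2DConditionModel.from_pretrained(
    "stabilityai/stable-diffusion-2-1", subfolder="unet"
)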

Auxiliary surface normal loss

Since we train only on synthetic data, and most synthetic datasets provide ground-truth surface normals, we incorporate a surface normal loss as an auxiliary objective to improve the accuracy of our depth estimates.
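As an illustration, surface normals can be approximated from the predicted depth with finite differences and compared to the ground-truth normals via a cosine distance; the exact formulation in the paper may differ from this sketch.

import torch
import torch.nn.functional as F

def normals_from_depth(depth):
    # Approximate surface normals from a depth map of shape (B, 1, H, W)
    # using finite differences.
    dz_dx = depth[:, :, :, 1:] - depth[:, :, :, :-1]  # horizontal gradient
    dz_dy = depth[:, :, 1:, :] - depth[:, :, :-1, :]  # vertical gradient
    dz_dx = F.pad(dz_dx, (0, 1, 0, 0))
    dz_dy = F.pad(dz_dy, (0, 0, 0, 1))
    ones = torch.ones_like(depth)
    n = torch.cat([-dz_dx, -dz_dy, ones], dim=1)      # unnormalized normal per pixel
    return F.normalize(n, dim=1)

def surface_normal_loss(pred_depth, gt_normals):
    # Cosine-distance loss between normals derived from the predicted depth
    # and the ground-truth normals provided by the synthetic dataset.
    pred_normals = normals_from_depth(pred_depth)
    return (1.0 - F.cosine_similarity(pred_normals, gt_normals, dim=1)).mean()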

Comparison with other methods

Quantitative comparison of DepthFM with affine-invariant depth estimators on several zero-shot benchmarks. Bold numbers mark the best results, underlined numbers the second best. Our method outperforms the other methods in most cases, on both indoor and outdoor scenes, despite relatively little training on purely synthetic datasets.


Refer to the paper (PDF linked above) for more details on qualitative and quantitative results as well as ablation studies.

Citation

@misc{gui2024depthfm,
      title={DepthFM: Fast Monocular Depth Estimation with Flow Matching}, 
      author={Ming Gui and Johannes S. Fischer and Ulrich Prestel and Pingchuan Ma and Dmytro Kotovenko and Olga Grebenkova and Stefan Andreas Baumann and Vincent Tao Hu and Björn Ommer},
      year={2024},
      eprint={2403.13788},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}