Depth Anything - Generating Depth Maps from a Single Image with Neural Networks

  • Published Jul 6, 2024
  • This week we cover the "Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data" paper from TikTok, The University of Hong Kong, Zhejiang Lab, and Zhejiang University. In this paper, they create a large dataset of labeled and unlabeled imagery to train a neural network for depth estimation from a single image, without any extra hardware or algorithmic complexity.
    --
    Get Oxen 🐂 oxen.ai/
    Oxen.ai makes versioning your datasets as easy as versioning your code! Even with millions of unstructured images, we quickly handle any type of data so you can build cutting-edge AI.
    --
    Depth Anything 📜 arxiv.org/abs/2401.10891
    The Dataset 🔢 www.oxen.ai/datasets/HRWSI
    Depth Anything Notes 📜 www.oxen.ai/blog/arxiv-dives-...
    MiDaS 📜 arxiv.org/abs/1907.01341v3
    Demo Depth Anything 🤗 huggingface.co/spaces/LiheYoung/Depth-Anything
    Join Arxiv Dives 🤿 oxen.ai/community
    Discord 🗿 / discord
    --
    Chapters
    0:00 Intro to Depth Anything
    2:00 Use Cases
    3:10 Real World Example
    5:12 What is a Depth Map?
    7:00 Crash Course in Traditional Techniques
    9:42 Enter Depth Anything
    16:00 Learning from the Teacher Model
    18:35 DINOv2 Model
    19:18 Depth Anything Architecture
    21:29 Evaluation
    25:55 Ablation Studies
    28:22 Data, Perturbations, Feature Loss
    31:15 Qualitative Results
    33:00 Limitations
  • Science & Technology

COMMENTS • 11

  • @keshav2136
    @keshav2136 3 months ago

    Great to see someone working on it.
    I have an application for Depth Anything. Would it be possible to talk about it with you on a video call or a meeting? 😃

    • @oxen-ai
      @oxen-ai 3 months ago +1

      Yes! I’ll DM you on LinkedIn and happy to chat through it

  • @mwysocz
    @mwysocz 4 months ago

    I do stereo photography, and I once purchased a custom app to generate depth maps from stereo pairs. Later I also used an open-source Google Workspace solution for that. Neither worked well enough for my needs, but that was 5+ years ago. I'm curious whether this approach has improved now that AI has gotten a boost, and whether using a stereo pair could produce better depth maps than the lens depth-of-field blur used in smartphones. Maybe using twin cameras in smartphones could result in better portrait photos? I know it won't work for faraway objects because the stereo base would be too narrow, but for regular portraits, twin cameras a few cm apart could generate a stereo pair, so the smartphone could combine lens blur with an AI-generated depth map from the stereo pair. That should improve the final depth map used to blur out the portrait background. Any thoughts on that idea?

    • @oxen-ai
      @oxen-ai 4 months ago

      I think combining existing sensors with the AI could potentially give you a cleaner output - what were you using the depth map for? Depth of field on portrait photos?

    • @mwysocz
      @mwysocz 4 months ago

      @@oxen-ai I was doing stereo pairs for 3D photos and occasionally converting them to depth maps to generate 10+ frames for lenticular prints (the 3D-like prints of chosen photos). I think I used StereoPhoto Maker to generate those frames. There are better options for bigger projects; they support multiple layers, each with a corresponding depth map, but those programs are very specialized and expensive and make little sense to buy for private use on single lenticular prints.

  • @user-pt5uu9ue2g
    @user-pt5uu9ue2g 4 months ago

    Regarding the question at the end of the video: they normalize the prediction and the GT before calculating MAE. As the paper says, they do the same thing as MiDaS (subtract the median and divide by the scale).
    With that normalization, it doesn't matter whether the GT is disparity or a depth value. They just convert the GT to the same format, which is inverse depth; if the GT is a metric depth value, they apply 1/depth before normalization.
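
    A minimal sketch of that MiDaS-style, affine-invariant normalization (the NumPy implementation and function names here are my own, not from the paper):

    ```python
    import numpy as np

    def normalize_disparity(d: np.ndarray) -> np.ndarray:
        """Shift by the median and scale by the mean absolute deviation,
        so prediction and ground truth are comparable up to an affine transform."""
        t = np.median(d)                # translation (shift)
        s = np.mean(np.abs(d - t))      # scale
        return (d - t) / (s + 1e-8)

    def affine_invariant_mae(pred_disp: np.ndarray, gt: np.ndarray, gt_is_metric_depth: bool) -> float:
        """MAE between normalized prediction and normalized ground truth.
        If the GT is metric depth, convert it to inverse depth first."""
        if gt_is_metric_depth:
            gt = 1.0 / np.clip(gt, 1e-6, None)
        return float(np.mean(np.abs(normalize_disparity(pred_disp) - normalize_disparity(gt))))
    ```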

    • @oxen-ai
      @oxen-ai 4 months ago +1

      That makes sense for evaluation against the ground truth. I guess my question is: the camera parameters are not the same for every camera, so if I took a photo with my iPhone on the wide lens vs. a random Android phone (a camera which is not in the training set) and ran them through Depth Anything, I'm assuming we get values from 0..1.0. Since the camera parameters are different for those two cameras, how can we know the depth values are correct when you do the inverse? Are we assuming the model has seen enough different types of cameras that it has also learned to estimate the camera parameters internally and scales the 0..1.0 values properly?

  • @entrepreneerit4490
    @entrepreneerit4490 4 months ago

    What's the point of a depth map if the values are relative/normalized and you can't get the actual estimated distance to each pixel? I'd love to know if I'm missing something and it is possible to get distances from a relative depth map, but I haven't been able to.

    • @oxen-ai
      @oxen-ai 4 months ago

      I was wondering the same thing - without the actual depth values it is really only useful for segmenting different depth planes (e.g. place object x behind object y, or blur a background). Obtaining the actual depth values feels hard since they are normalized and every camera has different parameters in terms of field of view, focal length, etc.
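
      As a rough illustration of that background-blur use case (my own sketch, not from the video; it assumes a relative depth map already scaled to 0..1 where larger values mean closer, and the quantile threshold is an arbitrary choice):

      ```python
      import cv2
      import numpy as np

      def portrait_blur(image_bgr: np.ndarray, rel_depth: np.ndarray,
                        foreground_quantile: float = 0.7, blur_ksize: int = 31) -> np.ndarray:
          """Keep the closest depth plane sharp and blur the rest.
          rel_depth: HxW array in [0, 1], larger = closer (disparity-like)."""
          # Split foreground from background at a quantile of the relative depth map.
          thresh = np.quantile(rel_depth, foreground_quantile)
          fg_mask = (rel_depth >= thresh).astype(np.float32)[..., None]  # HxWx1

          blurred = cv2.GaussianBlur(image_bgr, (blur_ksize, blur_ksize), 0)
          # Composite: sharp foreground over blurred background.
          out = fg_mask * image_bgr.astype(np.float32) + (1 - fg_mask) * blurred.astype(np.float32)
          return out.astype(np.uint8)
      ```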

    • @entrepreneerit4490
      @entrepreneerit4490 4 months ago

      @@oxen-ai I used ZoeDepth for metric depth values. The results are okay. The indoor model ZoeD_N actually does well for distances of 1 m or more. It's not good with anything closer than that, but I expected that. It's hard to know how far you are from a wall when it's just a white image.
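
      For reference, this is roughly how the ZoeDepth README loads the indoor model and gets metric depth; treat the exact entry-point and method names as assumptions and verify them against the isl-org/ZoeDepth repo:

      ```python
      import torch
      from PIL import Image

      # Load the indoor (NYU-trained) ZoeDepth model via torch.hub.
      zoe = torch.hub.load("isl-org/ZoeDepth", "ZoeD_N", pretrained=True)
      zoe = zoe.to("cuda" if torch.cuda.is_available() else "cpu").eval()

      img = Image.open("room.jpg").convert("RGB")
      depth_m = zoe.infer_pil(img)  # per-pixel depth estimates in meters (numpy array)

      print(depth_m.shape, depth_m.min(), depth_m.max())
      ```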