[NeurIPS 2023] SNAP: Self-Supervised Neural Maps for Visual Positioning and Semantic Understanding

  • Published 12 Dec 2023
  • This is the video for our paper "SNAP: Self-Supervised Neural Maps for Visual Positioning and Semantic Understanding".
    This work was done at Google and is published at NeurIPS 2023.
    Paper: arxiv.org/pdf/2306.05407
    Code: github.com/google-research/snap
    Authors: Paul-Edouard Sarlin, Eduard Trulls, Marc Pollefeys, Jan Hosang, Simon Lynen
    Abstract: Semantic 2D maps are commonly used by humans and machines for navigation purposes, whether it's walking or driving. However, these maps have limitations: they lack detail, often contain inaccuracies, and are difficult to create and maintain, especially in an automated fashion. Can we use raw imagery to automatically create better maps that can be easily interpreted by both humans and machines? We introduce SNAP, a deep network that learns rich neural 2D maps from ground-level and overhead images. We train our model to align neural maps estimated from different inputs, supervised only with camera poses over tens of millions of StreetView images. SNAP can resolve the location of challenging image queries beyond the reach of traditional methods, outperforming the state of the art in localization by a large margin. Moreover, our neural maps encode not only geometry and appearance but also high-level semantics, discovered without explicit supervision. This enables effective pre-training for data-efficient semantic scene understanding, with the potential to unlock cost-efficient creation of more detailed maps.
  • Science & Technology
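
The abstract above describes the core training signal: neural maps estimated from different inputs (e.g. aerial and StreetView imagery) are registered into a common top-down frame using the known camera poses and trained to agree. Below is a minimal, hypothetical PyTorch sketch of one way to express such an alignment objective, as an InfoNCE-style contrastive loss between corresponding map cells; the function name, shapes, and temperature are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def alignment_loss(map_a, map_b, temperature=0.07):
    """Contrastive alignment between two neural maps of the same scene.

    map_a, map_b: (C, H, W) feature grids already registered into a shared
    top-down frame via known camera poses. Hypothetical shapes and
    temperature; a sketch of the idea, not the released code.
    """
    C, H, W = map_a.shape
    a = F.normalize(map_a.reshape(C, -1), dim=0)  # (C, H*W), unit features
    b = F.normalize(map_b.reshape(C, -1), dim=0)
    # Each cell of map_a should match its own location in map_b more than
    # any other location (InfoNCE over the H*W grid cells).
    logits = (a.t() @ b) / temperature            # (H*W, H*W) similarities
    target = torch.arange(H * W, device=logits.device)
    return F.cross_entropy(logits, target)
```

In the actual system the two maps might come from, say, aerial imagery and ground-level StreetView images of the same tile; supervising only this agreement is what lets semantics emerge without labels.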

COMMENTS • 8

  • @Unique-Concepts 5 months ago +2

    Fantastic work... Love it 👏👏👏🙏🙏👌👌👌👍👍👍

  • @mlachahesaidsalimo9958 2 months ago

    Your work is incredible! Thank you for sharing. I really like the dynamism and playfulness of the presentation. Which software did you use to make the video presentation? Thank you in advance for your reply.

    • @pesarlin 2 months ago

      Thank you! I used only PowerPoint :)

  • @RicanSamurai 6 months ago +3

    Fascinating! A very interesting and novel approach to this problem. At 6:39, it appears as though you have ~12 map images that cover the area of interest (of which you highlight four), and then you are able to successfully get a position prediction from a query image. Do you have a sense of how densely that area needs to be covered by your map images before SNAP beats other models? Similarly, is there a map image density at which you see diminishing returns?
    I'm just curious how many training images are necessary to cover a given region before SNAP's predictions become useful. For that same region in your example, would 50 map images of the region make a meaningful difference to the prediction?
    Thanks!

    • @pesarlin 6 months ago +2

      We use a rig with 3 cameras, so we actually have 36 images in these examples (each triangle is a camera pose). We have an ablation study in Table 1 of the paper: aerial-only is a bit worse than semantic maps, while StreetView-only is a bit worse than aerial+StreetView. So aerial-only can already get you quite far, but having some coverage of ground-level images is important. During training we actually map with fewer images (20 instead of 36), so the model is pretty robust to sparse views, but indeed more is better. I don't have numbers at hand, but I'd guess that performance is already quite saturated at 36 views (about 0.6 views per meter), unless there is strong occlusion (e.g. from trucks) in most views.
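
A minimal sketch (an assumption for illustration, not the released training code) of the view subsampling mentioned in this reply: at training time, each scene is mapped from a random subset of its available images so the model learns to cope with sparse coverage.

```python
import random

def sample_mapping_views(views, num_views=20, seed=None):
    """Randomly keep `num_views` of the available images (e.g. 20 of 36).

    `views` is any sequence of per-image data; the default of 20 follows
    the reply above, but the helper itself is hypothetical.
    """
    rng = random.Random(seed)
    return rng.sample(list(views), min(num_views, len(views)))
```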

  • @user-rs4mf6ju8d 5 months ago

    Very impressive work!
    Question: Can I generate a neural map for localization from a bird's-eye view only? Let's say using images from a downward-looking camera on a flight from Brussels to Amsterdam.

  • @anywallsocket 5 months ago

    How do you choose validation data areas within training data areas?

    • @pesarlin 5 months ago +2

      We randomly sampled a fixed number of S2 cells in each training city.
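
A minimal sketch (hypothetical, not the authors' pipeline) of such a split using the s2sphere library: cover each training city with S2 cells at a fixed level, then randomly reserve a fixed number of them for validation.

```python
import random
import s2sphere

def cell_id_at(lat, lng, level=14):
    """S2 cell id containing a lat/lng at a fixed level.

    Level 14 is an assumption for illustration; the reply above does not
    state which level was used.
    """
    ll = s2sphere.LatLng.from_degrees(lat, lng)
    return s2sphere.CellId.from_lat_lng(ll).parent(level).id()

def split_city_cells(city_cells, num_val_cells, seed=0):
    """Randomly hold out `num_val_cells` S2 cells of one city for validation."""
    rng = random.Random(seed)
    cells = sorted(set(city_cells))  # deterministic order before sampling
    val = set(rng.sample(cells, num_val_cells))
    train = [c for c in cells if c not in val]
    return train, sorted(val)
```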