CSE5519 Advances in Computer Vision (Topic E: 2024: Deep Learning for Geometric Computer Vision)

DUSt3R: Geometric 3D Vision Made Easy

Novelty in DUSt3R

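Represents the 3D scene with a point map: a dense map that assigns a 3D point to every pixel. Given a depth map and the camera intrinsics, the point map follows directly; DUSt3R instead regresses it from the images, without requiring calibrated intrinsics as input.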
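The link between a point map, a depth map, and the intrinsics can be sketched in a few lines. This is a minimal illustration assuming an ideal pinhole camera with no lens distortion; the function names `backproject_depth` and `project` and the parameters `fx`, `fy`, `cx`, `cy` are my own choices, not taken from the paper or its code. DUSt3R itself regresses the point map with a network rather than computing it this way.

```python
def backproject_depth(depth, fx, fy, cx, cy):
    """Turn a depth map into a point map in the camera frame.

    depth[v][u] is the depth at pixel column u, row v.
    Returns a grid of the same shape holding (X, Y, Z) tuples.
    Assumes a pinhole camera with focal lengths (fx, fy) and
    principal point (cx, cy), all in pixels.
    """
    pointmap = []
    for v, row in enumerate(depth):
        out_row = []
        for u, z in enumerate(row):
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            out_row.append((x, y, z))
        pointmap.append(out_row)
    return pointmap


def project(point, fx, fy, cx, cy):
    """Project a camera-frame 3D point back to pixel coordinates (u, v)."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)
```

Projecting any entry of the point map with the same intrinsics lands back on the pixel it came from, which is the one-to-one pixel-to-3D correspondence that makes the point map a convenient representation.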

Direct RGB-to-3D scene reconstruction.

Uses a ViT to encode the images, then two Transformer decoders (with information sharing between them) to decode the two representations of the same scene, $F_1$ and $F_2$. The point map and confidence map are regressed directly from RGB.
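To make the role of the confidence map concrete, here is a simplified sketch of a confidence-weighted regression loss of the kind the paper uses: each pixel's 3D error is multiplied by its predicted confidence, and a `-alpha * log(confidence)` term keeps the network from pushing every confidence toward zero. The function name `confidence_weighted_loss` and the default `alpha` are illustrative assumptions, the result is averaged over pixels for readability, and the scale normalization of the point maps is left out.

```python
import math


def confidence_weighted_loss(pred_points, gt_points, confidences, alpha=0.2):
    """Simplified confidence-weighted 3D regression loss.

    pred_points, gt_points: sequences of (X, Y, Z) tuples, one per pixel.
    confidences: positive per-pixel confidence values.
    alpha: weight of the log-confidence regularizer (illustrative default).
    """
    total = 0.0
    for pred, gt, conf in zip(pred_points, gt_points, confidences):
        err = math.dist(pred, gt)  # Euclidean distance in 3D
        # High confidence amplifies the error; the log term rewards confidence.
        total += conf * err - alpha * math.log(conf)
    return total / len(pred_points)
```

With this form, a pixel that is hard to predict can lower its loss contribution by lowering its confidence, while easy pixels are encouraged to report high confidence.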

Tip

Compared with previous work, this paper regresses the point map and confidence map directly from RGB, which yields a more accurate and efficient 3D scene representation.

However, I’m not sure how information is shared between the two representations in the Transformer decoders. Also, in a multi-view setting, if two image pairs have no overlapping region, how can the model correctly reconstruct the 3D scene?
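On the first question: as far as I can tell from the paper, the sharing is done with cross-attention, where each decoder block lets the tokens of one view attend to the tokens of the other view. The sketch below shows only that idea, under heavy simplifications that are my own: no learned query/key/value projections, a single head, no self-attention or MLP, and hypothetical function names (`cross_attend`, `decoder_block_pair`).

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Each query token becomes a weighted average of the other view's tokens.

    Tokens are plain lists of floats. A real layer would first apply
    learned Q/K/V projections; they are omitted here.
    """
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in keys_values
        ]
        weights = softmax(scores)
        out.append([
            sum(w * kv[d] for w, kv in zip(weights, keys_values))
            for d in range(dim)
        ])
    return out


def decoder_block_pair(tokens_1, tokens_2):
    """One step of two coupled branches with residual cross-attention."""
    attended_1 = cross_attend(tokens_1, tokens_2)  # view 1 reads view 2
    attended_2 = cross_attend(tokens_2, tokens_1)  # view 2 reads view 1
    new_1 = [[t + a for t, a in zip(tok, att)]
             for tok, att in zip(tokens_1, attended_1)]
    new_2 = [[t + a for t, a in zip(tok, att)]
             for tok, att in zip(tokens_2, attended_2)]
    return new_1, new_2
```

After one such step, every token in each branch already carries information from the other image, which is how the two decoders can agree on a common coordinate frame. The sketch says nothing about the second question (pairs with no overlap).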