CSE5519 Advances in Computer Vision (Topic E: 2024: Deep Learning for Geometric Computer Vision)

link to paper 

Novelty in DUSt3R

Uses a point map, i.e. a 3D point for every pixel, to represent the 3D scene. Combined with the camera intrinsics, a point map relates directly to a depth map of the scene.
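
To make the link between point maps and camera intrinsics concrete, here is a minimal sketch that unprojects a depth map into a point map under a pinhole camera model. The function name `depth_to_pointmap` and the parameters `fx`, `fy`, `cx`, `cy` are illustrative assumptions, not DUSt3R's code. DUSt3R regresses the point map with a network rather than computing it this way.

```python
def depth_to_pointmap(depth, fx, fy, cx, cy):
    """Unproject an H x W depth map into an H x W point map of (X, Y, Z) tuples.

    Assumes a pinhole camera with focal lengths (fx, fy) and principal
    point (cx, cy), all in pixels. Hypothetical helper for illustration.
    """
    pointmap = []
    for v, row in enumerate(depth):        # v: pixel row (image y)
        out_row = []
        for u, z in enumerate(row):        # u: pixel column (image x)
            x = (u - cx) * z / fx          # back-project along the x axis
            y = (v - cy) * z / fy          # back-project along the y axis
            out_row.append((x, y, z))
        pointmap.append(out_row)
    return pointmap


# Toy 2 x 3 depth map of a flat wall 2 units in front of the camera.
depth = [[2.0, 2.0, 2.0],
         [2.0, 2.0, 2.0]]
pm = depth_to_pointmap(depth, fx=1.0, fy=1.0, cx=1.0, cy=0.5)
```

Each entry `pm[v][u]` is the 3D point seen by pixel `(u, v)`, expressed in the camera's own coordinate frame, so the point map has the same spatial layout as the image.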

Direct RGB-to-3D scene regression.

Uses a ViT to encode each image, then two Transformer decoders (with information sharing between them) to decode the two representations of the same scene, $F_1$ and $F_2$. The point map and confidence map are regressed directly from RGB.
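
A minimal sketch of one way two decoder branches can share information, assuming the sharing is done with cross-attention: each branch first attends to its own tokens, then to the other branch's tokens. The helper names (`attend`, `decoder_block`, `paired_step`) are hypothetical, and the sketch omits the learned projections, multiple heads, layer normalization and MLP that a real Transformer block would have.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, memory):
    """Single-head dot-product attention with identity projections (a simplification)."""
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in memory]
        weights = softmax(scores)
        out.append([sum(w * m[d] for w, m in zip(weights, memory)) for d in range(dim)])
    return out


def add(xs, ys):
    """Residual connection: element-wise sum of two token lists."""
    return [[a + b for a, b in zip(x, y)] for x, y in zip(xs, ys)]


def decoder_block(own, other):
    """Self-attention on a branch's own tokens, then cross-attention to the other branch."""
    own = add(own, attend(own, own))      # self-attention
    own = add(own, attend(own, other))    # cross-attention: reads the other view
    return own


def paired_step(g1, g2):
    """Run one block in both branches; each reads the other's incoming tokens."""
    return decoder_block(g1, g2), decoder_block(g2, g1)


# Toy token sets standing in for the two encoded views (2-D features).
f1 = [[1.0, 0.0], [0.0, 1.0]]
f2 = [[0.5, 0.5], [1.0, 1.0], [0.0, 2.0]]
g1, g2 = paired_step(f1, f2)
```

In this sketch the output for the first view depends on the tokens of the second view, and vice versa, which is what lets the two branches agree on a common scene.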

Tip

Compared with previous work, this paper regresses the point map and confidence map directly from RGB, aiming for a more accurate and efficient 3D scene representation.

However, I’m not sure how information is shared across the two representations in the Transformer decoders. In a multi-view setting, if two image pairs have no overlapping region, how can the model reconstruct the 3D scene correctly?
