CSE5519 Advances in Computer Vision (Topic I: 2021 and before: Embodied Computer Vision and Robotics)
ViNG: Learning Open-World Navigation with Visual Goals.
We consider the problem of goal-directed visual navigation: a robot is tasked with navigating to a goal location G given an image observation taken at G. In addition to navigating to the goal, the robot must also recognize when it has reached the goal, signaling that the task is complete. The robot does not have a spatial map of the environment, but we assume it has access to a small number of trajectories that it has collected previously. This data is used to construct a graph over the environment using a learned distance and reachability function. We make no assumptions about the nature of the trajectories: they may be obtained by human teleoperation, self-exploration, or a random walk. Each trajectory is a dense sequence of observations recorded by the robot's onboard camera. Since the robot only observes the world from a single onboard camera and does not run any state estimation, the system operates in a partially observed setting. The system commands continuous linear and angular velocities.
The robot carries an NVIDIA Jetson TX2 computer, and the method operates solely on images taken from the onboard camera.
Our experiments demonstrate that two key technical insights contribute to significantly improved performance in the real-world setting: graph pruning (Sec. IV-B2) and negative mining (Sec. IV-A1). Comparisons to prior methods in Section V and ablation studies in Section V-D demonstrate that these improvements enable ViNG to learn goal-conditioned policies entirely from offline data, avoiding the need for simulators and online sampling, while prior methods struggle to attain good performance, particularly for long-horizon goals.
Novelty in ViNG
- Learning dynamical distance: Learn to predict the estimated number of time steps required by a controller to navigate from one observation to another. This function must encapsulate knowledge of physics beyond just geometry (see the first sketch after this list).
- Negative Mining: Augment the training data with pairs of views drawn from different trajectories, labeled as unreachable, to train the distance model (see the second sketch below).
- Graph Pruning: As the robot gathers more experience, maintaining a dense graph of traversability across all observation nodes becomes redundant and infeasible, as the graph size grows quadratically. For our experiments, we sparsify trajectories by thresholding the edges that get added to the graph: edges that are easily traversable are not added, since the controller can traverse those edges with high probability (see the third sketch below).
- Weighted Distance: Use the learned distances (traversability) as edge weights and compute the distance between two nodes with a weighted Dijkstra algorithm (see the final sketch below).
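To make the dynamical-distance idea concrete, here is a minimal sketch of a distance predictor. PyTorch is an assumption, as is the classification head over discretized time-step offsets; the architecture and all names (`DistanceNet`, `MAX_STEPS`) are illustrative, not taken from the paper.

```python
# Minimal sketch of a learned dynamical-distance predictor (hypothetical
# names; PyTorch and the discretized classification head are assumptions).
import torch
import torch.nn as nn

MAX_STEPS = 20  # illustrative: time-step offsets are clipped into this many bins

class DistanceNet(nn.Module):
    """Predict a distribution over the number of time steps a controller
    needs to travel from observation obs_a to observation obs_b."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(            # shared CNN encoder
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Sequential(               # classifier over distance bins
            nn.Linear(2 * 64, 128), nn.ReLU(),
            nn.Linear(128, MAX_STEPS + 1),
        )

    def forward(self, obs_a, obs_b):
        z = torch.cat([self.encoder(obs_a), self.encoder(obs_b)], dim=-1)
        return self.head(z)                      # logits over {0, ..., MAX_STEPS}

# One training step on dummy data: pairs (o_t, o_{t+k}) drawn from the same
# trajectory are labeled with their time-step offset k, for free.
model = DistanceNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
obs_a, obs_b = torch.randn(8, 3, 64, 64), torch.randn(8, 3, 64, 64)
labels = torch.randint(0, MAX_STEPS + 1, (8,))
opt.zero_grad()
loss = nn.functional.cross_entropy(model(obs_a, obs_b), labels)
loss.backward()
opt.step()
```

Because the labels come directly from time offsets within recorded trajectories, this is exactly the kind of supervision that can be trained entirely offline.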
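Negative mining then amounts to a sampling rule on top of this training loop: with some probability, a pair is drawn from two different trajectories and labeled with the maximum ("unreachable") distance bin. The helper below is hypothetical; `trajectories` is assumed to be a list of per-run observation lists, and `p_negative` is an illustrative hyperparameter.

```python
# Sketch of negative mining: pairs drawn from *different* trajectories are
# assumed unreachable and labeled with the maximum distance bin.
import random

def sample_pair(trajectories, max_steps=20, p_negative=0.2):
    """trajectories: list of per-run observation lists (hypothetical format).
    Returns (obs_a, obs_b, label)."""
    if len(trajectories) > 1 and random.random() < p_negative:
        # Negative pair: views from two different trajectories, labeled 'far'.
        traj_a, traj_b = random.sample(trajectories, 2)
        return random.choice(traj_a), random.choice(traj_b), max_steps
    # Positive pair: two views from the same trajectory, labeled by offset k.
    traj = random.choice(trajectories)
    t = random.randrange(len(traj))
    k = random.randint(0, min(max_steps, len(traj) - 1 - t))
    return traj[t], traj[t + k], k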
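Graph construction with pruning might look like the following sketch, assuming `networkx` (the paper does not name a graph library) and a `predict_distance` callable that wraps the trained model. The thresholds `d_easy` and `d_max` are illustrative: pairs below `d_easy` are skipped as trivially traversable, per the pruning rule above, and pairs above `d_max` are treated as unreachable.

```python
# Sketch of graph construction with pruning (networkx is an assumption).
import itertools
import networkx as nx

def build_graph(nodes, predict_distance, d_easy=3.0, d_max=18.0):
    """nodes: observations kept as graph vertices;
    predict_distance(a, b) -> expected steps from a to b (trained model)."""
    graph = nx.DiGraph()
    graph.add_nodes_from(range(len(nodes)))
    for i, j in itertools.permutations(range(len(nodes)), 2):
        d = predict_distance(nodes[i], nodes[j])
        if d_easy < d < d_max:   # prune trivially easy and unreachable edges
            graph.add_edge(i, j, weight=d)
    return graph
```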
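Planning then reduces to a weighted shortest-path query over the learned edge weights; a one-line sketch continuing the `networkx` example above:

```python
# Weighted Dijkstra over the pruned graph; edge weights are the predicted
# traversal times, so the path minimizes total predicted steps.
import networkx as nx

def plan_waypoints(graph, start, goal):
    """Return the node sequence from start to goal with minimum
    total predicted traversal time."""
    return nx.dijkstra_path(graph, start, goal, weight="weight")
```

At execution time, the first waypoint on this path would presumably be handed to the low-level controller as the next subgoal; that usage note is my reading of the pipeline, not a quote from the paper.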
This is a really interesting paper that uses a learned policy and a topological graph to navigate in the real world.
As the authors mention, the performance is sensitive to seasonal and illumination changes. I wonder whether a more advanced pattern-recognition model could help the system recognize the goal more easily. Is there any way to do this?
How does the model generalize its knowledge about the topology of the environment, and how does it know it is on the correct path if the robot is interrupted by other objects, for example, crossing bicycles or leaves dropped on the ground?