CSE5519 Advances in Computer Vision (Topic A: 2025: Semantic Segmentation)

Dual Semantic Guidance for Open Vocabulary Semantic Segmentation

Novelty in Dual Semantic Guidance

Use dual semantic guidance for open-vocabulary semantic segmentation. For each predicted mask, apply a CLIP-like image-text alignment, similar to the one used in open-vocabulary object detection, to match the mask with a text description.
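To make the mask-to-text alignment step concrete, here is a minimal sketch in NumPy. It is not the paper's implementation: the masked average pooling, the function names (`masked_average_pool`, `classify_masks`), and the toy feature map and text embeddings are all assumptions chosen for illustration. In a real system the features and text embeddings would come from a CLIP-like encoder.

```python
import numpy as np


def masked_average_pool(features, mask):
    """Average the per-pixel features that fall inside a binary mask.

    features: (H, W, D) array of dense image features (assumed input).
    mask:     (H, W) boolean array for one mask proposal.
    """
    weights = mask.astype(features.dtype)
    total = weights.sum()
    if total == 0:
        # Empty mask: return a zero vector instead of dividing by zero.
        return np.zeros(features.shape[-1], dtype=features.dtype)
    return (features * weights[..., None]).sum(axis=(0, 1)) / total


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products are cosine similarities."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.clip(norms, eps, None)


def classify_masks(features, masks, text_embeddings):
    """Assign each mask the class whose text embedding is most similar.

    Returns (labels, similarity), where similarity has shape
    (num_masks, num_classes).
    """
    mask_embeddings = np.stack([masked_average_pool(features, m) for m in masks])
    similarity = l2_normalize(mask_embeddings) @ l2_normalize(text_embeddings).T
    return similarity.argmax(axis=1), similarity


# Toy data (hypothetical): a 4x4 feature map whose left half and right half
# carry two different feature directions.
features = np.zeros((4, 4, 3), dtype=np.float32)
features[:, :2] = [1.0, 0.0, 0.0]
features[:, 2:] = [0.0, 1.0, 0.0]

left_mask = np.zeros((4, 4), dtype=bool)
left_mask[:, :2] = True
right_mask = ~left_mask

# Stand-ins for the text embeddings of two class prompts.
text_embeddings = np.array([[0.9, 0.1, 0.0],
                            [0.1, 0.9, 0.0]], dtype=np.float32)

labels, similarity = classify_masks(features, [left_mask, right_mask], text_embeddings)
print(labels.tolist())
```

Because the class list is only a set of text embeddings, adding a new category at inference time means adding one more row to `text_embeddings`, which is what makes this kind of alignment open-vocabulary.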

Tip

This paper proposes a generalizable semantic segmentation model that uses a CLIP-like image-text encoder to refine the mask predictions.

However, I wonder how well this model generalizes to segmenting the different faces of a geometric object and to drawing a clear boundary between objects and the background. In most cases, CLIP may not need the complete image to recognize an object and can make a decision based on a partial view of it. If a novel object combines features of two objects and might fall outside CLIP’s codebook, will the CLIP alignment still work?