JoonHo (Brian) Lee

Multi-modal Pre-training for foundational models (and Open Vocab Semantic Image processing)

The introduction of CLIP brought much attention to Vision-Language Pre-training, or VLP. In general, the CLIP framework involves gathering large amounts of (image, text) data pairs and training a two-stream network with self-supervision to learn a joint vision-language embedding space in which matching pairs are pulled together. The embeddings can then be used for zero-shot multi-modal transfer such as image classification and retrieval. More recently, this has led to open-vocabulary vision tasks such as open-vocabulary semantic segmentation and detection. This survey explores this line of work.
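As a rough illustration of the two-stream contrastive objective described above, here is a minimal PyTorch sketch (not any particular paper's code; the encoders producing the features and the temperature value are placeholders):

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched (image, text) pairs.

    image_features, text_features: (B, D) embeddings from the two streams.
    """
    # L2-normalize so dot products are cosine similarities.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # (B, B) similarity matrix; the diagonal holds the matched pairs.
    logits = image_features @ text_features.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions (image -> text and text -> image).
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```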

Last update date: 2023-10-25

(CLIP) Learning Transferable Visual Models From Natural Language Supervision

[Paper] [Github]

(ALIGN) Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision

[Paper] [Blog]

(UniCL) Unified Contrastive Learning in Image-Text-Label Space

[Paper]

Florence: A New Foundation Model for Computer Vision

[Paper]

Open Vocabulary Perception

Vision-Language Pre-training methods such as CLIP and ALIGN have shown that images and text can be embedded into a joint vision-language space, and hence another line of work has developed to explore the use of such embeddings to detect and/or segment objects (things and stuff) with an open vocabulary.

The main idea is similar across the works surveyed below: instead of predicting a one-hot encoded (or N-way probability distribution) vector over a closed set of classes, the output head directly predicts an embedding in the same space as the vision-language embeddings trained via CLIP, ALIGN, etc., and classification is done by comparing against text embeddings of the class names.
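Concretely, the fixed classifier weights are replaced by text embeddings of arbitrary class names, and per-image, per-region, or per-pixel embeddings are classified by cosine similarity. A minimal sketch, assuming a CLIP-style text encoder and tokenizer (both stand-ins, not a specific library API):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def build_text_classifier(class_names, text_encoder, tokenizer):
    """Embed each class name once; the resulting matrix acts as the classifier."""
    prompts = [f"a photo of a {name}." for name in class_names]
    tokens = tokenizer(prompts)        # (C, L) token ids
    weights = text_encoder(tokens)     # (C, D) text embeddings
    return F.normalize(weights, dim=-1)

def open_vocab_logits(visual_embeddings, text_weights, temperature=0.01):
    """visual_embeddings: (N, D) image/region/pixel embeddings in the joint space.

    Returns (N, C) logits; argmax over C picks the predicted class name.
    """
    visual_embeddings = F.normalize(visual_embeddings, dim=-1)
    return visual_embeddings @ text_weights.t() / temperature
```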

(ViLD) Open-vocabulary Object Detection via Vision and Language Knowledge Distillation

[Paper]

ViLD

ViLD focuses on enabling open-vocabulary object detection by utilizing as much information from CLIP/ALIGN as possible, and it achieves this by incorporating both the text encoder and the image encoder trained by CLIP/ALIGN.
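Roughly, this amounts to two objectives on the detector's region embeddings: classification against text embeddings of the base-category names, and distillation toward CLIP image embeddings of the cropped proposals. A simplified sketch under those assumptions (proposal cropping, background handling, and ensembling details are omitted):

```python
import torch
import torch.nn.functional as F

def vild_style_losses(region_embeds, text_embeds, labels, clip_crop_embeds,
                      temperature=0.01):
    """region_embeds:   (N, D) region embeddings from the detector head.
    text_embeds:      (C, D) CLIP text embeddings of base-category names.
    labels:           (N,) ground-truth class indices for the proposals.
    clip_crop_embeds: (N, D) CLIP image embeddings of the cropped proposals.
    """
    region = F.normalize(region_embeds, dim=-1)
    text = F.normalize(text_embeds, dim=-1)

    # Text branch: classify regions by similarity to class-name embeddings.
    logits = region @ text.t() / temperature
    loss_text = F.cross_entropy(logits, labels)

    # Image branch: distill CLIP's image embeddings into the region embeddings.
    loss_distill = F.l1_loss(region_embeds, clip_crop_embeds)

    return loss_text, loss_distill
```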

(LSeg) Language-driven Semantic Segmentation

[Paper]

Language Driven Semantic Segmentation

(OpenSeg) Scaling Open-Vocabulary Image Segmentation with Image-Level Labels

[Paper]

OpenSeg Training

(VLPart) Going Denser with Open-Vocabulary Part Segmentation

[Paper]

VLPart taxonomy example

VLPart further expands the capabilities of open-vocabulary semantic segmentation with a data engine that generates labels for object parts, so that a segmentation model can be trained with annotations at all (scene, object, part) levels.


(SimpleSeg) A Simple Baseline for Open-Vocabulary Semantic Segmentation with Pre-trained Vision-language Model

[ArXiv]

SimpleSeg Method

(OVSeg) Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP

[Github] [ArXiv]

OVSeg analysis

OVSeg method

  1. A common method for open-vocabulary segmentation is to propose masks and then classify each mask via embeddings from models such as CLIP. However, the authors show that CLIP does not perform well on masked images, leading to rapid performance degradation and misalignment.

  2. A MaskFormer is trained (as in OpenSeg) to output mask proposals and a per-mask CLIP embedding. However, their method differs in that the data pipeline is also modified: a mask-adapted CLIP is used to extract embeddings for the masked regions.

  3. Training data is collected from the pretrained CLIP & MaskFormer by finding regions of interest, parsing nouns from the caption to get candidate object labels, and then matching each noun to its most likely mask as pseudo-label generation.

  4. To alleviate CLIP's domain-shift issue arising from feeding in masked images, the authors first train learnable tokens that replace the zero tokens caused by empty patches in the masked image. This greatly improves CLIP's performance on masked inputs, and the prompting is applied when training the final OVSeg model (a minimal sketch of this idea follows after this list).
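For point 4, the idea can be sketched as swapping the patch embeddings of fully masked (all-zero) patches for learned tokens before they enter CLIP's transformer. This is a hedged illustration, not the released implementation; shapes and names are assumptions:

```python
import torch
import torch.nn as nn

class MaskPromptedPatches(nn.Module):
    """Replace embeddings of blank (fully masked) patches with learnable tokens."""

    def __init__(self, num_patches: int, embed_dim: int):
        super().__init__()
        # One learnable token per patch position (a single shared token is another option).
        self.mask_tokens = nn.Parameter(torch.zeros(num_patches, embed_dim))

    def forward(self, patch_embeds: torch.Tensor, patch_is_blank: torch.Tensor) -> torch.Tensor:
        # patch_embeds:   (B, P, D) patch embeddings of the masked input image
        # patch_is_blank: (B, P) boolean, True where the patch was entirely masked out
        blank = patch_is_blank.unsqueeze(-1)  # (B, P, 1) for broadcasting
        return torch.where(blank, self.mask_tokens.expand_as(patch_embeds), patch_embeds)
```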

Minor detail: the authors used the average of the text embeddings over the following prompt templates: 'a photo of a {}.', 'This is a photo of a {}', 'There is a {} in the scene', 'There is the {} in the scene', 'a photo of a {} in the scene', 'a photo of a small {}.', 'a photo of a medium {}.', 'a photo of a large {}.', 'This is a photo of a small {}.', 'This is a photo of a medium {}.', 'This is a photo of a large {}.', 'There is a small {} in the scene.', 'There is a medium {} in the scene.', 'There is a large {} in the scene.'
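That prompt ensembling amounts to averaging the text embeddings of each template filled with the class name. A small sketch, where `text_encoder` and `tokenizer` are stand-ins for the CLIP text tower rather than a specific API:

```python
import torch
import torch.nn.functional as F

TEMPLATES = [
    "a photo of a {}.",
    "This is a photo of a {}",
    "There is a {} in the scene",
    "a photo of a small {}.",
    "a photo of a large {}.",
    # ... remaining templates from the list above
]

@torch.no_grad()
def ensembled_class_embedding(class_name, text_encoder, tokenizer):
    """Average the normalized text embeddings over all prompt templates."""
    prompts = [t.format(class_name) for t in TEMPLATES]
    embeds = text_encoder(tokenizer(prompts))       # (T, D)
    embeds = F.normalize(embeds, dim=-1)
    return F.normalize(embeds.mean(dim=0), dim=-1)  # (D,) classifier weight for this class
```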

Side Adapter Network for Open-Vocabulary Semantic Segmentation

[ArXiv]

SAN architecture

Alpha-CLIP: A CLIP Model Focusing on Wherever You Want

[ArXiv]

Alpha-CLIP

Alpha-CLIP is a fine-tuned enhancement of the base CLIP model. While CLIP processes the entire image and cannot be directed to attend to specific regions, Alpha-CLIP learns to use an alpha mask as an additional input channel that focuses attention on specified regions. Additionally, since Alpha-CLIP can still see the whole image, it preserves more contextual information than prior work that masks/crops the image to process only the ROI.
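One way the extra alpha channel could enter a CLIP ViT is via a second patch-embedding convolution for the alpha map, zero-initialized so the model starts out equivalent to the original CLIP. This is an illustrative sketch under that assumption, not the released code:

```python
import torch
import torch.nn as nn

class RGBAlphaPatchEmbed(nn.Module):
    """Patch embedding that consumes RGB plus an extra alpha (region) channel."""

    def __init__(self, rgb_conv: nn.Conv2d):
        super().__init__()
        self.rgb_conv = rgb_conv  # pretrained CLIP patch-embedding conv (3 -> D channels)
        # Parallel conv for the alpha channel, zero-initialized so that the
        # network initially behaves exactly like the original CLIP.
        self.alpha_conv = nn.Conv2d(
            1, rgb_conv.out_channels,
            kernel_size=rgb_conv.kernel_size,
            stride=rgb_conv.stride,
            bias=False,
        )
        nn.init.zeros_(self.alpha_conv.weight)

    def forward(self, rgb: torch.Tensor, alpha: torch.Tensor) -> torch.Tensor:
        # rgb: (B, 3, H, W) full image; alpha: (B, 1, H, W) soft region-of-interest mask
        return self.rgb_conv(rgb) + self.alpha_conv(alpha)
```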

Key Contributions:

Minor details in Experiment results:

Future direction & Limitations: