JoonHo (Brian) Lee

Multi-modal Pre-training for Foundation Models (and Open-Vocabulary Semantic Image Processing)

The introduction of CLIP brought much attention to Vision-Language Pre-training, or VLP. The CLIP framework in general involves gathering large amounts of (image, text) data pairs and training a two-stream network with self-supervision to learn a joint vision-language embedding space in which matching pairs are aligned. The embeddings may then be used for zero-shot multi-modal transfer such as image classification and retrieval. More recently, this has led to open-vocabulary vision tasks such as open-vocabulary semantic segmentation and detection. This survey follows this line of work.
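To make the training objective concrete, below is a minimal sketch (in PyTorch, assuming the image and text encoders are defined elsewhere) of the symmetric contrastive loss used by CLIP-style pre-training: within a batch of matched pairs, each image is treated as the positive for its own caption and as a negative for every other caption. Details such as CLIP's learnable temperature are simplified here.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched (image, text) pairs.

    image_features: (B, D) embeddings from the image encoder
    text_features:  (B, D) embeddings from the text encoder
    """
    # Normalize so that dot products are cosine similarities.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # (B, B) similarity matrix; diagonal entries correspond to the matched pairs.
    logits = image_features @ text_features.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions (image-to-text and text-to-image).
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```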

Last updated: 2023-10-25

(CLIP) Learning Transferable Visual Models From Natural Language Supervision

[Paper] [Github]

(ALIGN) Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision

[Paper] [Blog]

(UniCL) Unified Contrastive Learning in Image-Text-Label Space

[Paper]

Florence: A New Foundation Model for Computer Vision

[Paper]

Open Vocabulary Perception

Vision-Language Pre-training methods such as CLIP and ALIGN have shown that encoders can be trained to produce embeddings in a joint vision-language space, and hence another line of work explores the use of such embeddings to detect and/or segment objects (things and stuff) with an open vocabulary.

The main idea is shared across the following works: instead of predicting a one-hot encoded (or N-way probability distribution) vector over a closed set of classes, the output head directly predicts an embedding in the same space as the vision-language embeddings trained via CLIP, ALIGN, etc. A sketch of how such predictions are matched against arbitrary class names is shown below.
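As a minimal illustration of this idea, the sketch below classifies dense per-pixel embeddings by cosine similarity against the text embeddings of an arbitrary list of class names. Here `text_encoder` is a placeholder for any CLIP/ALIGN-style text encoder, and prompt engineering (e.g. wrapping names in "a photo of a {}") is omitted.

```python
import torch
import torch.nn.functional as F

def open_vocab_segment(pixel_embeddings, class_names, text_encoder, temperature=0.07):
    """Assign each pixel to whichever class name it is closest to in the joint space.

    pixel_embeddings: (H, W, D) per-pixel embeddings predicted by the segmentation head
    class_names:      list of N arbitrary category strings, e.g. ["dog", "grass", "sky"]
    text_encoder:     placeholder for a CLIP/ALIGN-style text encoder returning (N, D)
    """
    text_emb = F.normalize(text_encoder(class_names), dim=-1)   # (N, D)
    pix_emb = F.normalize(pixel_embeddings, dim=-1)             # (H, W, D)

    # Cosine similarity of every pixel embedding to every class-name embedding.
    logits = torch.einsum("hwd,nd->hwn", pix_emb, text_emb) / temperature
    return logits.argmax(dim=-1)                                # (H, W) predicted class indices
```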

(ViLD) Open-vocabulary Object Detection via Vision and Language Knowledge Distillation

[Paper]

ViLD

ViLD focuses on enabling open-vocabulary object detection by utilizing as much information from CLIP/ALIGN as possible, which it achieves by incorporating both the text encoder and the image encoder trained by CLIP/ALIGN.
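Concretely, ViLD trains its detection head with two signals: region embeddings are classified against the frozen text embeddings of the base-category names, and they are additionally distilled toward the CLIP image embeddings of the cropped proposals. The sketch below illustrates both losses for a batch of proposals, assuming the embeddings are precomputed; details such as the learned background embedding, the learnable temperature, and the loss weighting are omitted.

```python
import torch.nn.functional as F

def vild_losses(region_emb, text_emb, labels, clip_crop_emb, temperature=0.01):
    """Sketch of ViLD's two training losses for a batch of region proposals.

    region_emb:    (R, D) region embeddings predicted by the detection head
    text_emb:      (C, D) frozen text embeddings of the base-category names
    labels:        (R,) ground-truth base-category indices for the proposals
    clip_crop_emb: (R, D) frozen CLIP image embeddings of the cropped proposals
    """
    region_emb = F.normalize(region_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # ViLD-text: classify each region embedding against the category text embeddings.
    logits = region_emb @ text_emb.t() / temperature
    loss_text = F.cross_entropy(logits, labels)

    # ViLD-image: distill the CLIP image encoder by pulling region embeddings
    # toward the CLIP embeddings of the corresponding proposal crops.
    loss_image = F.l1_loss(region_emb, F.normalize(clip_crop_emb, dim=-1))
    return loss_text, loss_image
```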

(LSeg) Language-driven Semantic Segmentation

[Paper]

Language Driven Semantic Segmentation

(OpenSeg) Scaling Open-Vocabulary Image Segmentation with Image-Level Labels

[Paper]

OpenSeg Training

(VLPart) Going Denser with Open-Vocabulary Part Segmentation

[Paper]

VLPart taxonomy example

VLPart further expands the capabilities of open-vocabulary semantic segmentation by building a data engine that generates labels for object parts, thereby training a segmentation model with annotations at all (scene, object, part) levels.
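Since the different data sources carry annotations at different granularities, one way to picture the joint training is a loss that supervises each batch against the text embeddings of its own vocabulary in the shared space. The sketch below is a hypothetical illustration of that structure, not VLPart's exact pipeline; the function and argument names are made up for this example.

```python
import torch.nn.functional as F

def mixed_granularity_loss(pred_emb, targets, level_text_emb, level, temperature=0.07):
    """Hypothetical loss for a batch annotated at a single granularity level.

    pred_emb:       (N, D) predicted embeddings (per pixel or per region) in the
                    joint vision-language space
    targets:        (N,) ground-truth indices into that level's vocabulary
    level_text_emb: dict mapping "scene" / "object" / "part" to (C_level, D) text
                    embeddings of that level's category names (e.g. "dog" vs. "dog head")
    level:          which annotation level this batch was drawn from
    """
    text_emb = F.normalize(level_text_emb[level], dim=-1)
    logits = F.normalize(pred_emb, dim=-1) @ text_emb.t() / temperature
    return F.cross_entropy(logits, targets)
```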
