JoonHo (Brian) Lee

LLMs, General Multimodal Reasoning, and Foundation Models

Last updated: 2023-10-24

Preliminaries

As part of the preliminaries, or rather as an introduction, I first summarize FiLM and Flamingo, followed by CLIP and TransporterNet.

FiLM: Visual Reasoning with a General Conditioning Layer

Date: Sept 2017
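Since the paper's contribution is exactly the "general conditioning layer" named in the title, a minimal sketch may help before the summary: FiLM applies a feature-wise affine modulation FiLM(x | c) = gamma(c) * x + beta(c), where gamma and beta are predicted from a conditioning input (e.g., an encoded question). The module, parameter names, and sizes below are illustrative assumptions, not the paper's released code.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Minimal sketch of a FiLM (Feature-wise Linear Modulation) layer.

    A small generator predicts per-channel scale (gamma) and shift (beta)
    from a conditioning vector, which then modulate a convolutional
    feature map. Names and sizes are illustrative, not from the paper.
    """

    def __init__(self, cond_dim: int, num_channels: int):
        super().__init__()
        # One linear layer predicts both gamma and beta for every channel.
        self.generator = nn.Linear(cond_dim, 2 * num_channels)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map; cond: (B, cond_dim) conditioning vector.
        gamma, beta = self.generator(cond).chunk(2, dim=-1)
        gamma = gamma.unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        beta = beta.unsqueeze(-1).unsqueeze(-1)    # (B, C, 1, 1)
        return gamma * x + beta


# Usage sketch: modulate a 64-channel feature map with a 128-d text embedding.
film = FiLM(cond_dim=128, num_channels=64)
features = torch.randn(2, 64, 14, 14)
text_embedding = torch.randn(2, 128)
out = film(features, text_embedding)  # same shape as `features`
```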

Flamingo: a Visual Language Model for Few-Shot Learning

Date: April 2022

CLIP

TransporterNet

Main Survey

A Generalist Agent

Date: Nov 2022

CLIPort: What and Where Pathways for Robotic Manipulation

Date: Sept 2021

BC-Z: Zero-Shot Task Generalization with Robotic Imitation Learning

Date: Feb 2022

RT-1: Robotics Transformer for Real-World Control at Scale

Date: Dec 2022 (RT-1 is also explained quite well in this blog post by Google)

PaLM-E: An Embodied Multimodal Language Model

Date: Mar 2023

VC-1: Where are we in the search for an Artificial Visual Cortex for Embodied Intelligence?

Date: March 2023

RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

Date: July 2023 (RT-2 is also explained very well in their blog post)

Open X-Embodiment: Robotic Learning Datasets and RT-X Models

Date: October 2023

Not long after the impressive results of RT-2, a full-fledged large-scale data-pooling effort was proposed via Open X-Embodiment.