Paper-Weekly17-PaLM-E: An Embodied Multimodal Language Model

Intro

PaLM-E is a large embodied multimodal language model that addresses a variety of embodied reasoning tasks and exhibits positive transfer across domains. PaLM-E-562B, with 562B parameters, is the largest vision-language model reported and achieves state-of-the-art performance on OK-VQA without task-specific finetuning. The model benefits from diverse joint training across multiple domains, leveraging the world knowledge embedded in its parameters for embodied reasoning and question answering. Other work has proposed leveraging the zero-shot CoT ability of large language models, but, to the authors' knowledge, not via an end-to-end model. While the focus of the paper is on planning tasks, it also reports results on general vision-language tasks, including OK-VQA, VQA v2, and COCO captioning, showcasing the versatility of the model.

Main Results

The model is named PaLM-E since it uses PaLM as the underlying language model backbone and makes it Embodied. The main results of PaLM-E are twofold: it introduces a trainable multimodal sentence representation, and it shows remarkable proficiency on practical tasks such as remote control of a robotic arm.

Multimodal Sentence

PaLM-E processes multimodal sentences by combining inputs from different modalities, such as images, neural 3D representations, or state vectors, with text tokens, and feeds them into a large language model (LLM) that is trained end-to-end. The authors investigate different input encoders: state estimation vectors, Vision Transformers (ViT), and the Object Scene Representation Transformer (OSRT). The resulting continuous embeddings are interleaved with normal embedded text tokens to form the prefix for the LLM. Each token $x_i$ in the prefix comes from either the word embedder or an input encoder:

{% mathjax %}

\begin{equation}
x_i = \begin{cases}
\gamma(w_i) & \text{if } i \text{ is a text token, or} \\
\phi_j(O_j)_i & \text{if } i \text{ corresponds to observation } O_j.
\end{cases}
\end{equation}

{% endmathjax %}
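As a minimal sketch of this interleaving, the snippet below builds such a prefix in PyTorch; the word embedder stands for $\gamma$ and a simple linear projection plays the role of the observation encoder $\phi_j$. The dimensions and encoder choice are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions; the real PaLM-E uses the PaLM embedding width.
VOCAB_SIZE, EMBED_DIM, NUM_OBS_TOKENS = 32000, 512, 16

class MultimodalPrefix(nn.Module):
    """Interleave encoded observations with embedded text tokens (a sketch)."""
    def __init__(self):
        super().__init__()
        self.word_embedder = nn.Embedding(VOCAB_SIZE, EMBED_DIM)  # gamma(.)
        # Stand-in for a ViT/OSRT encoder: maps an image to NUM_OBS_TOKENS
        # vectors in the LLM's embedding space (phi_j(.)).
        self.obs_encoder = nn.Linear(3 * 224 * 224, NUM_OBS_TOKENS * EMBED_DIM)

    def forward(self, segments):
        """segments: list of ('text', LongTensor[n]) or ('image', Tensor[3,224,224])."""
        pieces = []
        for kind, value in segments:
            if kind == "text":                          # x_i = gamma(w_i)
                pieces.append(self.word_embedder(value))
            else:                                       # x_i = phi_j(O_j)_i
                tokens = self.obs_encoder(value.flatten())
                pieces.append(tokens.view(NUM_OBS_TOKENS, EMBED_DIM))
        return torch.cat(pieces, dim=0)                 # prefix fed to the LLM

prefix = MultimodalPrefix()(
    [("text", torch.tensor([5, 17, 42])),
     ("image", torch.rand(3, 224, 224)),
     ("text", torch.tensor([7, 99]))])
print(prefix.shape)  # (3 + 16 + 2, EMBED_DIM)
```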

ViT and Object-centric Representation

Visual input, unlike language, lacks inherent token structure. While Vision Transformers (ViT) capture visual semantics, their representation is a static grid of patch embeddings rather than a set of individual objects. This complicates interfacing with LLMs for embodied reasoning, which requires interacting with distinct physical objects. To address this, structured encoders are explored that decompose the visual input into object-centric tokens before integrating them into the LLM, as sketched below.
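The following snippet is an illustrative sketch of one way object-centric tokens could be formed, by pooling a ViT patch grid over per-object masks; the masks and the pooling scheme are assumptions for exposition, not the paper's OSRT or object-centric ViT method.

```python
import torch

def object_centric_tokens(patch_feats, obj_masks):
    """Pool a ViT-style patch grid into one token per object (illustrative sketch).

    patch_feats: [H, W, D] grid of patch embeddings.
    obj_masks:   [K, H, W] boolean masks, one per object (assumed given,
                 e.g. from an off-the-shelf segmenter).
    Returns:     [K, D] object tokens that can be interleaved into the prefix.
    """
    H, W, D = patch_feats.shape
    flat = patch_feats.reshape(H * W, D)
    tokens = []
    for mask in obj_masks.reshape(obj_masks.shape[0], -1):
        denom = mask.sum().clamp(min=1)
        # Mean of patch features inside this object's mask.
        tokens.append((flat * mask.unsqueeze(-1)).sum(0) / denom)
    return torch.stack(tokens)

obj_tokens = object_centric_tokens(torch.rand(14, 14, 512),
                                   torch.rand(3, 14, 14) > 0.5)
print(obj_tokens.shape)  # (3, 512)
```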

Embodied Experiments in Real World

PaLM-E is a generative model that produces text conditioned on multimodal sentences as input. For embodied question answering or scene description tasks, its output can be taken directly as the solution. For embodied planning or control tasks, PaLM-E generates text that conditions low-level commands and is integrated into a control loop with a robot executing the decisions.
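The sketch below shows what such a control loop could look like; `palm_e`, `low_level_policy`, and `get_observation` are hypothetical stand-ins, not the paper's actual interfaces.

```python
def control_loop(palm_e, low_level_policy, get_observation, instruction, max_steps=10):
    """Closed-loop planning sketch: PaLM-E emits the next textual step, a
    language-conditioned low-level policy turns it into motor commands, and
    the loop re-plans from the new observation (all callables are stand-ins)."""
    for _ in range(max_steps):
        obs = get_observation()                                # camera image + robot state
        step = palm_e.generate(image=obs, prompt=instruction)  # e.g. "push the red block to ..."
        if step.strip().lower() == "done":
            break
        low_level_policy.execute(step, obs)                    # executes one skill, then re-plan
```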

For robot control, PaLM-E is trained on expert data from three different environments: Task and Motion Planning (TAMP), table-top pushing, and mobile manipulation. TAMP tasks involve complex decision sequences and combinatorics, while the table-top pushing environment features multiple objects and complex dynamics. In the mobile manipulation domain, PaLM-E performs tasks in a kitchen environment and adjusts its plans in the presence of disturbances or control failures.

The authors also evaluate PaLM-E on general vision-language tasks, including OK-VQA, VQA v2, and COCO captioning, as well as on planning tasks in simulated environments, with task prompts covering block manipulation and sorting. PaLM-E-562B achieves the highest reported number on OK-VQA and establishes itself as a competitive visual-language generalist. On general language tasks, larger PaLM-E models show less catastrophic forgetting of language capabilities.

Limitations

In the discussion section, the researchers highlight the transfer they observe: training PaLM-E jointly on different tasks and datasets leads to significantly better performance than training separate models on each task alone. For example, the model solves the Language Table tasks with only 10 to 80 training examples and the TAMP tasks with 320 training examples. They also note that robotics data is far less abundant than internet-scale text and image data; it seems that the model must first be trained on massive (vision-)language data to acquire this transfer ability. Consequently, a more balanced training data distribution holds promise for substantially improving data efficiency.
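As an illustration of such joint training over a mixture of domains, the sketch below samples each training batch from one domain with fixed weights; the domain names and ratios are assumptions for exposition, not the paper's reported configuration.

```python
import random

# Illustrative mixture: internet-scale vision-language data dominates,
# robot data from the three embodiments is comparatively scarce.
DOMAINS = {
    "web_vl": 0.80,   # e.g. VQA, captioning corpora
    "tamp":   0.07,
    "table":  0.07,
    "mobile": 0.06,
}

def sample_domain(rng=random):
    """Pick the domain for the next training batch according to the mixture weights."""
    names, weights = zip(*DOMAINS.items())
    return rng.choices(names, weights=weights, k=1)[0]

print([sample_domain() for _ in range(5)])
```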

In addition, PaLM-E relies on an underlying low-level policy or planner to translate its textual decisions into executable commands, highlighting the need for an integrative framework that bridges high-level linguistic outputs and practical actions. And while the reported improvements are impressive, the gaps between different visual representation methods (e.g., ViT and OSRT) remain large, and a detailed comparison or ablation study is absent.
