OpenVLA: Training Vision-Language-Action Models That Actually Manipulate Objects

Hook

What if your robot could learn to manipulate objects from nearly a million demonstrations across dozens of different robot arms, then follow natural language instructions like "pick up the blue block and place it in the bin"?

Context

For years, the robotics community has struggled with a fundamental problem: models trained on one robot rarely transfer to another, and models trained for one task fail spectacularly on variations. Google's RT-1 and RT-2 showed that large Transformer policies could narrow this gap, and RT-2 in particular showed that a pre-trained vision-language model could be fine-tuned to emit robot actions directly. But RT-2 remained a closed-source black box.

OpenVLA emerged in 2024 as a fully open implementation of this vision-language-action paradigm, trained on roughly 970,000 robot trajectories from the Open X-Embodiment dataset, which spans multiple embodiments from WidowX arms to mobile manipulators. Unlike traditional imitation learning approaches that require task-specific training, OpenVLA treats robot control as a conditional language modeling problem: given pixels and a text instruction, predict the next action token. This framing lets it reuse pre-trained vision-language models and scale with data in ways that small, task-specific behavior cloning policies cannot.

Technical Insight

OpenVLA's architecture is deceptively straightforward: it's a Prismatic VLM adapted to output robot actions instead of text tokens. The input pipeline takes a 224x224 RGB image and processes it through two parallel vision encoders—DINOv2 for spatial reasoning and SigLIP for language alignment. These fused representations get projected into the Llama-2 language model's embedding space, where they're concatenated with tokenized text instructions.

Here's what a basic inference call looks like:

from transformers import AutoModelForVision2Seq, AutoProcessor
import torch
from PIL import Image

# Load the 7B model
processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True
).to("cuda:0")

# Prepare input: one RGB observation plus an instruction in OpenVLA's prompt template
image = Image.open("robot_observation.png").convert("RGB")
instruction = "pick up the blue block"
prompt = f"In: What action should the robot take to {instruction}?\nOut:"
inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)

# Predict a 7-DoF action, un-normalized with the BridgeData V2 (WidowX) statistics
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
print(action)  # Shape: [7] -> [x, y, z, roll, pitch, yaw, gripper]
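The 7-element output is easier to work with once its fields have names. The sketch below is a hypothetical convenience wrapper, not part of the OpenVLA API: `build_prompt` assumes the `In: ... Out:` template shown in the OpenVLA README, and `EndEffectorAction` assumes the [x, y, z, roll, pitch, yaw, gripper] ordering from the comment above.

```python
from dataclasses import dataclass
from typing import Sequence


def build_prompt(instruction: str) -> str:
    """Wrap an instruction in the prompt template (assumed from the OpenVLA README)."""
    return f"In: What action should the robot take to {instruction.strip().lower()}?\nOut:"


@dataclass(frozen=True)
class EndEffectorAction:
    """Hypothetical named view of a 7-DoF action vector."""
    dx: float
    dy: float
    dz: float
    droll: float
    dpitch: float
    dyaw: float
    gripper: float

    @classmethod
    def from_vector(cls, vec: Sequence[float]) -> "EndEffectorAction":
        # Reject anything that is not exactly 7 values instead of silently truncating.
        if len(vec) != 7:
            raise ValueError(f"expected 7 action values, got {len(vec)}")
        return cls(*(float(v) for v in vec))


prompt = build_prompt("Pick up the blue block")
action = EndEffectorAction.from_vector([0.01, 0.0, -0.02, 0.0, 0.0, 0.1, 1.0])
print(repr(prompt))
print(action)
```

Whether the translation and rotation values are deltas or absolute targets, and what the gripper value means, depends on the dataset behind the chosen `unnorm_key`, so treat the field names as a reading aid only.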

The magic happens in how actions are represented. Rather than adding a regression head, OpenVLA discretizes each of the 7 action dimensions (end-effector deltas plus a gripper command) into 256 bins and maps those bins onto rarely used tokens in the Llama tokenizer's vocabulary. Training is then ordinary next-token prediction with a cross-entropy loss. The bin ranges come from per-dataset statistics (the 1st and 99th percentiles of each action dimension in the training data), so at inference time the predicted tokens are decoded back to bin centers and un-normalized into a continuous action. The unnorm_key parameter tells the model which dataset's statistics to apply, which is critical because a WidowX arm operates at different scales than a Franka Panda.
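To show how per-dataset statistics and an un-normalization key fit together, here is a minimal sketch of the scheme the OpenVLA paper describes: scale each action dimension with its 1st and 99th percentiles, map it to one of 256 bins, then reverse the process. The statistics, dataset keys, and function names are made up for illustration and are not the library's internals.

```python
N_BINS = 256

# Hypothetical per-dataset statistics: 1st / 99th percentile for each action dimension.
STATS = {
    "small_arm": {"q01": [-0.03] * 6 + [0.0], "q99": [0.03] * 6 + [1.0]},
    "large_arm": {"q01": [-0.10] * 6 + [0.0], "q99": [0.10] * 6 + [1.0]},
}


def normalize(action, key):
    """Scale each dimension to [-1, 1] using the dataset's percentiles, then clip."""
    s = STATS[key]
    out = []
    for a, lo, hi in zip(action, s["q01"], s["q99"]):
        x = 2.0 * (a - lo) / (hi - lo) - 1.0
        out.append(max(-1.0, min(1.0, x)))
    return out


def to_bins(normalized):
    """Map each value in [-1, 1] to an integer bin index in [0, N_BINS - 1]."""
    return [min(N_BINS - 1, int((x + 1.0) / 2.0 * N_BINS)) for x in normalized]


def from_bins(bins):
    """Map bin indices back to bin-center values in [-1, 1]."""
    return [(b + 0.5) / N_BINS * 2.0 - 1.0 for b in bins]


def unnormalize(normalized, key):
    """Undo normalize() with the statistics of the dataset named by key."""
    s = STATS[key]
    return [(x + 1.0) / 2.0 * (hi - lo) + lo
            for x, lo, hi in zip(normalized, s["q01"], s["q99"])]


action = [0.01, -0.02, 0.0, 0.005, 0.0, -0.01, 1.0]
bins = to_bins(normalize(action, "small_arm"))
recovered = unnormalize(from_bins(bins), "small_arm")
print(bins)
print(recovered)
```

Decoding the same bin indices with the wrong key yields motions at the wrong physical scale, which is the failure a mismatched `unnorm_key` would produce.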

The dual-encoder vision architecture deserves special attention. DINOv2 excels at dense spatial features and geometric understanding, while SigLIP specializes in aligning visual concepts with language. By concatenating both sets of patch features and passing them through a learned projector, OpenVLA gets precise spatial reasoning for "grasp the handle" and semantic understanding for "the red cup." This departs from single-encoder approaches that force one model to handle both responsibilities.
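To make the fusion step concrete, the sketch below traces tensor shapes through the vision pipeline in plain Python. The patch size and feature widths (14-pixel patches, 1024-dimensional DINOv2 features, 1152-dimensional SigLIP features, 4096-dimensional Llama-2 7B embeddings) and the 20-token instruction length are assumptions based on commonly published configurations of those models; check the checkpoint's config before relying on them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shape:
    tokens: int  # sequence length (number of patches or tokens)
    width: int   # feature dimension per token


def patchify(image_size: int, patch_size: int, width: int) -> Shape:
    """Shape of the patch features a ViT-style encoder produces for a square image."""
    side = image_size // patch_size
    return Shape(tokens=side * side, width=width)


def fuse_channelwise(a: Shape, b: Shape) -> Shape:
    """Concatenate two feature maps along the feature axis (same patch grid required)."""
    if a.tokens != b.tokens:
        raise ValueError("encoders must produce the same number of patches")
    return Shape(tokens=a.tokens, width=a.width + b.width)


def project(x: Shape, out_width: int) -> Shape:
    """A learned projector changes the feature width, not the number of tokens."""
    return Shape(tokens=x.tokens, width=out_width)


def concat_sequence(visual: Shape, n_text_tokens: int) -> Shape:
    """Concatenate visual tokens and text tokens along the sequence axis."""
    return Shape(tokens=visual.tokens + n_text_tokens, width=visual.width)


dino = patchify(224, 14, width=1024)      # assumed DINOv2 feature width
siglip = patchify(224, 14, width=1152)    # assumed SigLIP feature width
fused = fuse_channelwise(dino, siglip)
visual = project(fused, out_width=4096)   # assumed Llama-2 7B embedding width
sequence = concat_sequence(visual, n_text_tokens=20)

for name, shape in [("dino", dino), ("siglip", siglip), ("fused", fused),
                    ("visual", visual), ("sequence", sequence)]:
    print(name, shape)
```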

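For faster deployments, a follow-up recipe called OpenVLA-OFT (Optimized Fine-Tuning) changes how the model is fine-tuned and decoded: it predicts chunks of continuous actions in parallel rather than one token at a time, and it accepts multiple input images (for example a third-person and a wrist camera) alongside robot state. This matters for tasks where a single view is not enough. A multi-image call might look roughly like this: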

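# Illustrative sketch only: the checkpoint ID and the multi-image processor call
# below are placeholders. OFT checkpoints are released as task-specific fine-tunes,
# so consult the openvla-oft repository for the exact loading and inference calls.
vla_oft = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b-oft",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True
).to("cuda:0")

# Gather multiple observations (here, four frames from disk)
images = [Image.open(f"frame_{i}.png").convert("RGB") for i in range(4)]
inputs = processor(prompt, images).to("cuda:0", dtype=torch.bfloat16)

# unnorm_key must name a dataset whose statistics ship with the checkpoint
action = vla_oft.predict_action(**inputs, unnorm_key="widowx_bridge")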

The training infrastructure uses PyTorch FSDP (Fully Sharded Data Parallel) to scale across GPUs, with Flash-Attention 2 reducing memory consumption during the attention operations. A full training run on Open X-Embodiment requires substantial compute (the paper reports a cluster of 64 A100 GPUs running for 14 days), but fine-tuning is far more accessible. The repository includes scripts for LoRA adaptation that run on a single GPU, letting you specialize the model for your specific robot and tasks with a modest number of demonstrations; the paper's fine-tuning experiments use roughly 10 to 150 per task.
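A back-of-the-envelope count shows why LoRA fits on one GPU. LoRA freezes a weight matrix W and trains a low-rank update B·A, so only r × (d_in + d_out) parameters per adapted matrix receive gradients. The 4096 × 4096 layer size and rank 32 below are illustrative assumptions, not values read from the OpenVLA fine-tuning scripts.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Parameters in a dense weight matrix of shape (d_out, d_in)."""
    return d_in * d_out


def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the LoRA factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


d_in = d_out = 4096   # assumed projection size for a 7B-class model
rank = 32             # assumed LoRA rank

full = full_params(d_in, d_out)
lora = lora_trainable_params(d_in, d_out, rank)
fraction = lora / full

print(f"full matrix: {full:,} params")
print(f"LoRA update: {lora:,} params ({fraction:.2%} of the full matrix)")
```

Under these assumptions each adapted matrix trains under 2% of its parameters, which is also why optimizer state and gradient memory shrink so sharply compared with full fine-tuning.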

One architectural choice that stands out: OpenVLA treats the entire vision-language-action pipeline as a single differentiable system. There's no separate policy head or action decoder; action tokens come out of the same output layer the language model uses for text. This tight coupling means gradients from action prediction errors can flow all the way back through the vision encoders, fine-tuning visual feature extraction for the specific demands of manipulation rather than generic vision-language tasks.

Gotcha

The Llama-2 backbone means the released weights fall under Meta's Llama 2 Community License, which carries an acceptable use policy and requires a separate license from Meta for products with more than 700 million monthly active users. For academic research, this is fine. For commercial robotics startups, the terms need a careful legal review before anything ships.

Performance degradation outside the training distribution is real and unforgiving. OpenVLA was trained predominantly on tabletop manipulation with specific robot arms (WidowX, Franka, etc.). Deploy it on a humanoid robot or ask it to manipulate deformable objects, and success rates plummet. The repository's README is explicit that the model does not zero-shot generalize to embodiments or setups missing from the pretraining mix: for novel scenarios, fine-tuning is mandatory, not optional. Even with the Open X-Embodiment dataset's diversity, you're looking at collecting domain-specific data and running LoRA adaptation for anything beyond tabletop pick-and-place variations.

Latency is the other constraint. The paper reports the base model running at roughly 6Hz on an RTX 4090 in bfloat16, well below the 30-60Hz that many real-time control loops expect. The 25-50x inference speedup reported for OFT sounds impressive, but it comes from a different fine-tuning recipe (parallel decoding and action chunking), so you only get it after fine-tuning with that recipe. For high-frequency manipulation or contact-rich tasks, you'll need careful system design to compensate for this latency.
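One common way to bridge a slow policy and a fast control loop is to hold each action for several ticks or to predict several actions per inference. The sketch below is a hypothetical budgeting helper with made-up rates; it is arithmetic only and does not call any OpenVLA or OFT API.

```python
import math


def hold_ticks(policy_hz: float, control_hz: float) -> int:
    """Control ticks each action must be held when the policy emits one action per inference."""
    return math.ceil(control_hz / policy_hz)


def effective_action_hz(policy_hz: float, chunk_size: int) -> float:
    """Actions available per second when every inference returns chunk_size actions."""
    return policy_hz * chunk_size


def keeps_up(policy_hz: float, control_hz: float, chunk_size: int = 1) -> bool:
    """True if the policy supplies at least one fresh action per control tick."""
    return effective_action_hz(policy_hz, chunk_size) >= control_hz


policy_hz = 6.0     # assumed inference rate
control_hz = 30.0   # assumed control loop rate

print("ticks per action:", hold_ticks(policy_hz, control_hz))
print("keeps up, one action per inference:", keeps_up(policy_hz, control_hz))
print("keeps up, chunks of 8:", keeps_up(policy_hz, control_hz, chunk_size=8))
```

Chunking raises throughput but not reactivity: actions later in a chunk were predicted from an older observation, which matters for contact-rich tasks.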

Verdict

Use OpenVLA if you're building general-purpose manipulation systems that need to understand natural language instructions, you're already working with Open X-Embodiment formatted data, or you want state-of-the-art vision-language grounding without training a VLA from scratch. It's particularly compelling for research labs exploring cross-embodiment transfer and robotics teams who can afford the fine-tuning step to adapt it to their hardware.

Skip it if you need commercial deployment without licensing headaches (look at Octo instead), require sub-50ms inference latency for high-frequency control, or you're solving a narrow manipulation problem where domain-specific methods like Diffusion Policy will outperform with less data and compute. Also skip if you can't collect or source task-relevant training data: OpenVLA's generalization has limits, and hoping it works out-of-the-box on your unique scenario is a recipe for disappointment.