OpenVLA: Training Robot Policies by Treating Actions as Language Tokens

Hook

What if you could train a robot to manipulate objects by treating its arm movements as tokens in a sequence to predict—similar to how language models process text?

Context

For decades, robotic manipulation has been trapped in a single-task paradigm: train a policy for one robot doing one task in one environment, then start over for anything new. Even with advances in reinforcement learning and imitation learning, generalization across different robot embodiments remained elusive. The robotics community had vast amounts of teleoperation data—the Open X-Embodiment dataset contains 970K trajectories across different platforms—but no unified architecture to learn from all of it simultaneously.

OpenVLA emerged from a simple but powerful insight: vision-language models already solve a challenging problem—mapping images and text to coherent outputs across diverse domains. What if robotic manipulation is just another sequence modeling task, where the ‘answer’ isn’t text but a 7-dimensional action vector? Built on top of Prismatic VLMs, OpenVLA combines a fused DINOv2 and SigLIP vision backbone with Llama-2 to create an open-source generalist robot policy with strong zero-shot transfer across embodiments.

Technical Insight

System architecture (auto-generated diagram): a camera image and a language instruction flow through the processor into the VLM backbone (Prismatic + Llama-2, with fused DINOv2 + SigLIP vision encoders), trained on the Open X-Embodiment 970K trajectories using PyTorch FSDP and Flash Attention. The output takes one of two action formats, continuous or discretized: a 7-DoF action head (position + gripper) or the FAST tokenizer (discrete actions). Actions then pass through denormalization (unnorm_key) to the robot executor.

At its core, OpenVLA treats robotic control as a vision-language task with a twist: instead of generating text tokens, the model outputs continuous actions or discretized action tokens. The architecture fuses DINOv2 and SigLIP vision encoders with Llama-2, processing camera images and language instructions through the VLM, then projecting to a 7-DoF action space (or discrete tokens with the FAST tokenizer for up to 15x faster inference).
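The discrete-token path is easy to picture: after normalization, each of the seven action dimensions is quantized into one of 256 uniform bins, and each bin index maps onto a token in the LLM's vocabulary. A minimal sketch of the binning idea (the bin count follows the OpenVLA paper; the function names are mine):

```python
import numpy as np

def discretize_action(action, n_bins=256):
    """Map each normalized action dimension in [-1, 1] to a bin index in [0, n_bins)."""
    edges = np.linspace(-1.0, 1.0, n_bins + 1)
    return np.clip(np.digitize(action, edges) - 1, 0, n_bins - 1)

def undiscretize_action(idx, n_bins=256):
    """Recover the bin-center value for each bin index."""
    edges = np.linspace(-1.0, 1.0, n_bins + 1)
    centers = (edges[:-1] + edges[1:]) / 2.0
    return centers[idx]

action = np.array([0.1, -0.5, 0.9, 0.0, 0.3, -0.2, 1.0])  # toy normalized 7-DoF action
recon = undiscretize_action(discretize_action(action))
# Round-tripping loses at most half a bin width (1/256) per dimension.
```

With 256 bins the quantization error is well below typical actuation noise, which is why treating control as next-token prediction costs so little precision.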

The inference interface is remarkably clean. Here’s the complete zero-shot control loop for a WidowX robot:

from transformers import AutoModelForVision2Seq, AutoProcessor
from PIL import Image
import torch

# Load processor and model
processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True
).to("cuda:0")

# Control loop
image: Image.Image = get_from_camera(...)
prompt = "In: What action should the robot take to {<INSTRUCTION>}?\nOut:"

inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
robot.act(action, ...)

Notice the unnorm_key="bridge_orig" parameter—this reveals a crucial architectural detail. OpenVLA normalizes actions during training across different embodiments, then denormalizes at inference time using dataset-specific statistics. This normalization scheme is what enables the model to generalize across robots with vastly different action spaces.
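Under the hood this is just a per-dimension affine map and its inverse, computed from dataset statistics. A minimal sketch with made-up bounds (the real statistics ship with the model and are selected by unnorm_key; these numbers are illustrative only):

```python
import numpy as np

# Hypothetical per-dimension bounds for one dataset; the real values are
# dataset-specific statistics stored with the model, not these numbers.
low  = np.array([-0.05, -0.05, -0.05, -0.2, -0.2, -0.2, 0.0])
high = np.array([ 0.05,  0.05,  0.05,  0.2,  0.2,  0.2, 1.0])

def normalize(action):
    """Map a raw action into [-1, 1] using the dataset bounds (training time)."""
    return 2.0 * (action - low) / (high - low) - 1.0

def denormalize(norm_action):
    """Invert the map at inference time -- this is what unnorm_key selects."""
    return 0.5 * (norm_action + 1.0) * (high - low) + low

raw = np.array([0.01, -0.02, 0.0, 0.1, -0.1, 0.05, 1.0])
roundtrip = denormalize(normalize(raw))
```

Because every dataset is squashed into the same [-1, 1] range during training, a WidowX and a Franka arm with very different workspaces can share one policy; picking the wrong unnorm_key silently scales actions for the wrong robot.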

The training infrastructure leverages PyTorch FSDP (Fully Sharded Data Parallel) and Flash Attention 2 to scale from 1B to 34B parameters. Models natively consume datasets in RLDS format, making it straightforward to construct arbitrary mixtures from Open X-Embodiment. The codebase builds on Prismatic VLMs, inheriting their modular vision encoder design—you can swap vision encoders without rewriting data pipelines.
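A dataset mixture is conceptually just a set of sampling weights over sub-datasets. A toy sketch of the weighted-sampling idea (the dataset names and weights here are illustrative, not the repo's actual mixture spec):

```python
import numpy as np

# Hypothetical mixture weights over Open X-Embodiment sub-datasets
# (names and weights are illustrative, not the repo's actual mixture spec).
mixture = {"bridge_orig": 0.5, "fractal": 0.3, "taco_play": 0.2}

rng = np.random.default_rng(seed=0)
names = list(mixture)
weights = list(mixture.values())

# Draw which sub-dataset each training example comes from.
sampled = rng.choice(names, size=1000, p=weights)
counts = {n: int((sampled == n).sum()) for n in names}
```

Reweighting lets you up-sample small, high-quality datasets without duplicating data on disk, which is the main reason arbitrary mixtures over RLDS shards are so convenient.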

Recent optimizations dramatically improve deployment viability. OFT (Optimized Fine-Tuning) delivers 25-50x faster inference and higher task success rates, supports multiple input images and high-frequency bimanual robot control, and uses continuous actions for better model quality. The FAST action tokenizer compresses action chunks into fewer discrete tokens, achieving up to 15x speedup by shrinking the sequence length the transformer must process. These aren’t minor tweaks; they’re the difference between a research demo and a production-ready robot controller.
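The intuition behind the FAST speedup: a chunk of 8 timesteps x 7 dimensions is 56 values, but smooth trajectories concentrate their energy in a few low-frequency DCT coefficients, so most of a chunk compresses away before tokenization. A toy stand-in for the DCT stage only (the real FAST pipeline also quantizes and byte-pair-encodes the coefficients; this sketch just drops high frequencies):

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix (columns are cosine basis vectors)."""
    k = np.arange(n)
    M = np.cos(np.pi / n * (k[:, None] + 0.5) * k[None, :])
    M[:, 0] *= 1.0 / np.sqrt(2.0)
    return M * np.sqrt(2.0 / n)

def compress_chunk(chunk, keep=3):
    """Keep only the first `keep` DCT coefficients per action dimension."""
    n, _ = chunk.shape
    coeffs = dct_matrix(n).T @ chunk     # transform each dimension, shape (n, d)
    coeffs[keep:] = 0.0                  # discard high-frequency content
    return dct_matrix(n) @ coeffs        # reconstruct the chunk

t = np.linspace(0.0, 1.0, 8)
chunk = np.stack([np.sin(np.pi * t)] * 7, axis=1)  # smooth toy 8-step, 7-DoF chunk
recon = compress_chunk(chunk)
```

Here 56 values survive as 21 coefficients with little reconstruction error on a smooth trajectory, which is the property the transformer exploits: fewer tokens per chunk means fewer autoregressive decoding steps.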

For custom tasks, OpenVLA supports multiple fine-tuning strategies. Full fine-tuning updates all parameters, partial fine-tuning freezes the vision backbone, and LoRA fine-tuning (via HuggingFace PEFT) adapts the model with minimal compute. The repository includes deployment infrastructure with a REST API server, letting you separate the model runtime from robot control—run inference on a GPU server and send actions over HTTP to resource-constrained robot hardware.
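The client side of that split is a small HTTP request carrying the camera frame and instruction. A minimal sketch of assembling such a request (the field names, endpoint, and port here are assumptions for illustration; match them to the schema of the repo's actual deployment script):

```python
import json
import numpy as np

def build_act_request(image: np.ndarray, instruction: str, unnorm_key: str) -> dict:
    """Assemble a JSON-serializable body for a remote policy server.

    Field names are illustrative, not the deployment script's real schema.
    """
    return {
        "image": image.tolist(),        # H x W x 3 uint8 frame as nested lists
        "instruction": instruction,
        "unnorm_key": unnorm_key,
    }

# On the robot, a client would then post this body to the GPU server, e.g.:
#   import requests
#   resp = requests.post("http://<gpu-server>:8000/act", json=body)
#   action = resp.json()

frame = np.zeros((224, 224, 3), dtype=np.uint8)
body = build_act_request(frame, "pick up the cup", "bridge_orig")
```

Keeping the robot-side client this thin is the point of the split: the arm's onboard computer only needs a network stack, while the 7B-parameter model stays on the GPU server.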

Gotcha

The licensing situation creates deployment complexity. While the codebase carries an MIT license, the pretrained models inherit restrictions from their base models. Specifically, both released models are derived from Llama-2 and subject to the Llama Community License, which may restrict commercial use. If you fine-tune from scratch with a different base LLM, you avoid this—but that requires significant compute and expertise most teams lack.

Dependency management requires attention. The README explicitly pins transformers==4.40.1, timm==0.9.10, tokenizers==0.19.1, and flash-attn==2.5.5 due to reported regressions and breaking changes in later versions. The repository includes a dedicated ‘VLA Performance Troubleshooting’ section, which indicates that achieving good fine-tuning results requires careful attention to hyperparameters, normalization adjustments, and likely multiple training runs. Training requires multi-GPU setups with FSDP; this isn’t something you can iterate on quickly with a single consumer GPU.

Verdict

Use OpenVLA if you’re building research prototypes that need cross-embodiment generalization, working with Open X-Embodiment datasets, or exploring how vision-language pretraining transfers to robotics. It’s particularly valuable when you have diverse manipulation tasks and want a unified policy rather than training separate models per task. The REST API deployment makes it practical for labs with centralized GPU servers controlling multiple robot stations. Skip it if you need commercial deployment without navigating Llama-2 licensing restrictions, have strict real-time latency requirements without adopting OFT/FAST optimizations, or lack the multi-GPU infrastructure for fine-tuning. For single-task manipulation with limited data, simpler behavior cloning or offline RL methods will likely train faster and perform comparably.
