What are the differences between VLM and VLA? How is autonomous driving faring in real-world deployment?

Last year, after a major tech company showcased its “fast-slow” system, the application of vision-language models (VLMs) in autonomous driving was officially unveiled. This year, the smart-driving industry is pushing forward with vision-language-action models (VLAs). The VLA approach offers stronger scene-reasoning and generalization capabilities than the end-to-end-plus-VLM approach. Many industry leaders consider VLA the 2.0 version of current end-to-end systems: a fusion of end-to-end and VLM methodologies. While VLMs focus primarily on environmental understanding, VLAs extend that scope to planning and control, which is the key difference between the two.

Applications of VLM in Autonomous Driving

In the cockpit domain, VLMs can be applied more directly and facilitate easier interaction, which is why integration in this area tends to be simpler. In autonomous driving, VLM applications generally fall into two categories:

  • Assistive Functions: the “fast-slow” system provided the first clear example of this at its launch.
  • Direct Trajectory Prediction: Although end-to-end models are fast, large models often suffer from low frame rates, making real-time interaction challenging. One promising idea is to use the output of VLMs as a reference for future frames, offering corrective cues at either the feature level or in post-processing (see the sketch after this list).
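
One way to picture the feature-level variant is a cached, low-frequency VLM guidance feature that the high-frequency planner consumes on every frame. The sketch below is a minimal PyTorch illustration; the module names (GuidanceCache, FusedPlanner), dimensions, and update rates are assumptions for illustration, not any specific vendor's implementation.

```python
import torch
import torch.nn as nn

class GuidanceCache:
    """Holds the most recent VLM guidance feature (updated at e.g. ~2 Hz),
    so the fast planner (~20 Hz) can reuse it between VLM refreshes."""
    def __init__(self, dim: int):
        self.latest = torch.zeros(1, dim)   # neutral guidance until the first VLM output

    def update(self, vlm_feature: torch.Tensor) -> None:
        self.latest = vlm_feature.detach()  # no gradients flow back into the slow branch here

class FusedPlanner(nn.Module):
    """Concatenates the per-frame scene feature with the cached VLM guidance
    and regresses a short trajectory (horizon waypoints, x/y)."""
    def __init__(self, scene_dim=256, guide_dim=128, horizon=6):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(scene_dim + guide_dim, 256), nn.ReLU(),
            nn.Linear(256, horizon * 2),
        )
        self.horizon = horizon

    def forward(self, scene_feat, guide_feat):
        x = torch.cat([scene_feat, guide_feat], dim=-1)
        return self.head(x).view(-1, self.horizon, 2)

cache = GuidanceCache(dim=128)
planner = FusedPlanner()
# Slow branch: whenever a new VLM feature arrives, refresh the cache.
cache.update(torch.randn(1, 128))
# Fast branch: every camera frame reuses whatever guidance is currently cached.
waypoints = planner(torch.randn(1, 256), cache.latest)
print(waypoints.shape)  # torch.Size([1, 6, 2])
```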

Horizon’s Senna project proposed a concrete solution by using VLMs for high-level planning decisions, thereby guiding the end-to-end system’s trajectory predictions. Senna noted that while VLMs are not ideal for precise numerical predictions, they excel at forecasting intentions or performing coarse-grained planning — areas where end-to-end systems still struggle, especially with complex long-tail scenarios.
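
Senna's actual interface is richer than this, but the core idea can be sketched as a discrete, VLM-predicted meta-action that conditions the trajectory head, leaving precise geometry to the end-to-end network. The action vocabulary, dimensions, and module names below are illustrative assumptions, not Senna's real design.

```python
import torch
import torch.nn as nn

# Hypothetical meta-action vocabulary; the real command set would differ.
META_ACTIONS = ["keep_lane", "lane_change_left", "lane_change_right", "yield", "stop"]

class CommandConditionedHead(nn.Module):
    """Trajectory head conditioned on a high-level command predicted by the VLM
    (coarse planning), while the end-to-end network handles fine-grained geometry."""
    def __init__(self, scene_dim=256, horizon=6):
        super().__init__()
        self.embed = nn.Embedding(len(META_ACTIONS), 32)
        self.head = nn.Sequential(
            nn.Linear(scene_dim + 32, 256), nn.ReLU(),
            nn.Linear(256, horizon * 2),
        )
        self.horizon = horizon

    def forward(self, scene_feat, command_id):
        cmd = self.embed(command_id)                  # (B, 32) command embedding
        x = torch.cat([scene_feat, cmd], dim=-1)
        return self.head(x).view(-1, self.horizon, 2)

head = CommandConditionedHead()
cmd = torch.tensor([META_ACTIONS.index("lane_change_left")])
traj = head(torch.randn(1, 256), cmd)                # (1, 6, 2) waypoints
```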

Transitioning to VLA: Advantages and Features

Several recent publications, such as DriveGPT, DriveGPT4, DriveVLM, and OmniDrive, have begun directly outputting planning or trajectory-point information, an approach that aligns closely with the VLA concept. However, challenges remain in collecting high-quality real-world data and deploying these models in real time. In principle, with purely visual inputs an end-to-end model differs from a VLA mainly in parameter count, yet significant performance gaps open up as the parameter scale grows.

VLA can be seen as “end-to-end 2.0,” with its standout feature being a chain-of-thought reasoning process. For example, in scenarios involving reversible (tidal) lanes, a VLA-equipped autonomous vehicle can read road signs through text and other inputs, determine from multiple sources whether the reversible lane is usable, and interact with other vehicles via turn signals. It then makes lane changes and steering adjustments to safely enter the reversible lane. This human-like reasoning — taking into account the global context and interacting with other vehicles — enables the vehicle to make optimal and safe decisions.
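
To make the chain-of-thought pattern concrete, one common representation is a structured reasoning trace emitted alongside the final maneuver and trajectory. The sketch below only illustrates the data shape; the field names and the tidal-lane example values are assumptions for illustration, not any production model's schema.

```python
from dataclasses import dataclass, field

@dataclass
class ReasoningStep:
    observation: str   # what the model read from the scene (signs, signals, other agents)
    inference: str     # what it concluded from that observation

@dataclass
class VLAOutput:
    steps: list[ReasoningStep] = field(default_factory=list)   # chain-of-thought trace
    decision: str = ""                                          # high-level maneuver
    waypoints: list[tuple[float, float]] = field(default_factory=list)  # low-level plan

# Illustrative output for the tidal-lane scenario described above.
out = VLAOutput(
    steps=[
        ReasoningStep("Overhead sign: tidal lane open 17:00-19:00; current time 17:32",
                      "The reversible lane is currently usable in my direction"),
        ReasoningStep("Vehicle behind in the target lane holds speed, no turn signal",
                      "A gap exists; signaling and merging is safe"),
    ],
    decision="lane_change_into_tidal_lane",
    waypoints=[(2.0, 0.1), (4.1, 0.5), (6.3, 1.2)],
)
print(out.decision, len(out.steps))
```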

The VLA large model handles what were traditionally multi-layered tasks with unified parameters, similar to current end-to-end models. However, its larger parameter count facilitates easier fine-tuning for downstream tasks and enhances its generalization capabilities, particularly in zero-shot and new-scenario conditions.
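
One reason fine-tuning such a large, unified model for downstream tasks stays tractable is that parameter-efficient methods touch only a small fraction of the weights. Below is a minimal LoRA-style adapter sketch; the rank, dimensions, and scaling are arbitrary assumptions used only to show the idea.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained projection plus a small low-rank update; only A and B
    are trained, so adapting the large backbone to a new downstream task
    updates a tiny fraction of the parameters."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                    # keep pretrained weights frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(nn.Linear(1024, 1024))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable params: {trainable}/{total}")  # ~16k trainable out of ~1.07M
```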

Challenges and Enhancements in Autonomous Driving

Despite its advantages, VLA faces several challenges:

  • Data Quality: High-quality data remains a bottleneck. Even with fine-tuning, issues like imbalanced data distribution and errors arising from model biases must be resolved.
  • Deployment and Computational Power: Current on-vehicle compute is a limiting factor for deploying VLA. NVIDIA’s Thor chip, with its thousand-TOPS-level compute, may offer a viable path. For instance, Zeekr’s self-developed HaoHan intelligent driving system is reported to be the first to bring the NVIDIA Thor chip into a mass-produced vehicle, signaling promising advances in this area (a rough back-of-the-envelope estimate follows this list).
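
To get a feel for why compute and bandwidth are the bottleneck, the estimate below treats autoregressive decoding as memory-bandwidth bound, since every generated token re-reads all model weights. Every number in it (bandwidth, model size, tokens per frame) is an illustrative assumption, not a measured figure for Thor or any specific VLA.

```python
# Back-of-the-envelope feasibility check; all numbers are illustrative assumptions.
mem_bandwidth_gbs = 200     # assumed effective DRAM bandwidth of the SoC, GB/s
params_billion    = 4       # assumed VLA parameter count
bytes_per_param   = 1       # INT8 weights
tokens_per_frame  = 20      # assumed short planning output per camera frame

bytes_per_token = params_billion * 1e9 * bytes_per_param        # weights read per decode step
tokens_per_sec  = mem_bandwidth_gbs * 1e9 / bytes_per_token
fps             = tokens_per_sec / tokens_per_frame
print(f"~{tokens_per_sec:.0f} tokens/s  ->  ~{fps:.1f} frames/s")
# ~50 tokens/s -> ~2.5 frames/s under these assumptions: usable as a slow
# guidance branch, but far from the 10-20 Hz a real-time planner needs.
```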

Enhancing End-to-End Systems with VLM/VLA

Both VLM and VLA can effectively boost existing end-to-end tasks through:

  • Improved Generalization: By incorporating large language models (LLMs), end-to-end systems can better handle rare and complex driving scenarios. Knowledge transfer and zero-shot capabilities allow these systems to learn from long-tail scenarios.
  • Richer Semantic Information: Visual language models generate interpretable outputs, providing end-to-end systems with detailed semantic information that improves environmental understanding.
  • Enhanced Planning Performance: For example, the DiMA system achieved a 37% reduction in L2 trajectory error on the nuScenes dataset.
  • Realistic Multi-modal Trajectory Outputs: Systems like VLM-AD and DiMA have both shown significant reductions in collision rates.
  • Knowledge Distillation for Real-time Deployment: Distilling knowledge from large models into smaller ones can maintain high performance while reducing computational load and model size (a minimal loss sketch follows this list).
  • Greater Interpretability: By predicting human-understandable action labels, models like VLM-AD enhance the transparency of decision-making processes.
  • Reduced Dependency on Massive Datasets: Synthetic data or inference annotations generated by LLMs can be used for training in scenarios where data is scarce or privacy concerns restrict data availability.
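
As referenced in the distillation bullet above, the objective is typically a weighted mix of imitating the large teacher's representation and the standard planning loss. The sketch below is a generic form of that idea, not the specific loss used by VLM-AD or DiMA; the weighting, feature sizes, and trajectory horizon are assumptions.

```python
import torch
import torch.nn.functional as F

def distill_planner(student_feat, teacher_feat, student_traj, gt_traj, alpha=0.5):
    """Generic distillation objective: the small on-vehicle planner imitates the
    large VLM/VLA teacher's features while still regressing the ground-truth trajectory."""
    imitation = F.mse_loss(student_feat, teacher_feat.detach())  # match teacher representation
    task      = F.l1_loss(student_traj, gt_traj)                 # standard planning loss
    return alpha * imitation + (1 - alpha) * task

# Toy shapes: 256-d feature vectors and 6 future (x, y) waypoints, batch of 8.
loss = distill_planner(
    student_feat=torch.randn(8, 256, requires_grad=True),
    teacher_feat=torch.randn(8, 256),
    student_traj=torch.randn(8, 6, 2, requires_grad=True),
    gt_traj=torch.randn(8, 6, 2),
)
loss.backward()
```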

Moreover, VLMs can automatically generate high-quality annotated data, significantly cutting down on the time and cost of manual labeling.
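
A minimal auto-labeling loop might look like the sketch below, assuming some VLM inference endpoint is available. Here query_vlm is a hypothetical placeholder for whatever model or service is actually used, and the label schema is an assumption for illustration.

```python
import json

def query_vlm(image_path: str, prompt: str) -> str:
    """Hypothetical stand-in for the available VLM inference API (local model
    or hosted service); it should return the raw text reply."""
    raise NotImplementedError

LABEL_PROMPT = (
    "Describe this driving scene as JSON with keys: "
    "'weather', 'time_of_day', 'hazards' (list), 'suggested_maneuver'."
)

def auto_label(image_path: str):
    """Ask the VLM for structured labels and keep only replies that parse cleanly;
    malformed replies are dropped rather than guessed at."""
    raw = query_vlm(image_path, LABEL_PROMPT)
    try:
        label = json.loads(raw)
    except json.JSONDecodeError:
        return None   # route to manual review instead of storing a bad label
    required = {"weather", "time_of_day", "hazards", "suggested_maneuver"}
    return label if required.issubset(label) else None
```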

Overall, while VLM and VLA bring promising advances to autonomous driving, they also introduce new challenges that the industry continues to address.
