
Temporal Object Captioning for Street Scene Videos from LiDAR Tracks

Vignesh Gopinathan¹,², Urs Zimmermann², Michael Arnold², Matthias Rottmann³

1 Department of Mathematics, University of Wuppertal, Wuppertal, Germany
2 Aptiv, Germany
3 Institute of Computer Science, Osnabrück University, Osnabrück, Germany

Winter Conference on Applications of Computer Vision (WACV), 2026

Title image 1

Abstract

We introduce a fully automated LiDAR‑based captioning pipeline that produces object‑level, temporally grounded descriptions of street‑scene videos directly from object tracks. The captions encode lane position, motion relative to the host vehicle, and temporal transitions (e.g., lane changes). Such captions enable small, efficient video captioners like SwinBERT, trained only on front‑camera frames, to learn richer temporal semantics. Across three datasets—our proprietary data, Waymo, and NuScenes—models trained with our captions demonstrate improved temporal reasoning and reduced dependence on static background cues. We quantify this reduction using our proposed Visual Bias Measure (VBM), which evaluates a model’s reliance on scene appearance versus motion-driven dynamics.

Method Overview

Our captioning pipeline automatically converts LiDAR object tracks into rich, temporally grounded descriptions without any human annotation. The system uses a set of template sentences with placeholders that are filled using three types of attributes computed from LiDAR over time: lane position, motion relative to the host vehicle, and object type.

The procedure runs in two stages:

A. Host-vehicle captioning: The host vehicle’s velocity and yaw-rate signals are analyzed to segment the video into short clips, each representing a single host action (e.g., “decelerating while steering left”). These segments receive a simple template-based host caption that provides context for interpreting other objects’ behaviors.
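
A minimal Python sketch of this segmentation step is shown below; the thresholds, time step, and label wording are illustrative assumptions, not the exact values and logic used in our pipeline.

import numpy as np

def segment_host_actions(speeds_mps, yaw_rates_rps, dt=0.1,
                         acc_thresh=0.5, yaw_thresh=0.05):
    """Segment a drive into clips, each covering a single host action.
    Thresholds, dt, and label wording are illustrative placeholders."""
    accels = np.gradient(np.asarray(speeds_mps), dt)   # longitudinal acceleration
    labels = []
    for a, w in zip(accels, yaw_rates_rps):
        if a > acc_thresh:
            lon = "accelerating"
        elif a < -acc_thresh:
            lon = "decelerating"
        else:
            lon = "driving at constant speed"
        if w > yaw_thresh:
            lon += " while steering left"
        elif w < -yaw_thresh:
            lon += " while steering right"
        labels.append(lon)
    # Merge consecutive timestamps that share the same action into one segment.
    segments, start = [], 0            # (start_idx, end_idx_exclusive, caption)
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            segments.append((start, i, "The host vehicle is " + labels[start] + "."))
            start = i
    return segments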

B. Neighbor (object) captioning: For each segment, LiDAR-based 3D detection and tracking models yield trajectories for surrounding agents. Only moving objects within a defined region in front of the host vehicle and visible to the front camera are retained. At each timestamp, objects are assigned:
• a lane tag (left, right, host, oncoming, left/right-lateral),
• a motion tag (approaching, moving away, constant distance, stationary), and
• an object tag (car, truck, bike, pedestrian).

These tags form a time series that is compressed into a sequence of behavioral changes. Each unique state generates one sentence, enabling captions that naturally express transitions such as lane changes or shifts in motion. The first sentence introduces the object and the subsequent sentences describe how its behavior evolves over time.

The combinatorial template system supports thousands of distinct captions, allowing scalable generation of detailed, multi-sentence object-level descriptions directly from raw LiDAR data.
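
The tag-compression and template-filling idea can be illustrated with the short Python sketch below; the tag values and sentence templates are simplified stand-ins for the much larger combinatorial template set described above.

from itertools import groupby

# Per-timestamp (lane, motion) tags for one tracked object, e.g. a car.
tags = [
    ("left lane", "approaching"),
    ("left lane", "approaching"),
    ("host lane", "approaching"),                  # lane change into the host lane
    ("host lane", "keeping a constant distance"),
]

# Compress the time series into its sequence of unique behavioral states.
states = [state for state, _ in groupby(tags)]

# First sentence introduces the object; later sentences describe changes.
obj = "car"
sentences = [f"A {obj} in the {states[0][0]} is {states[0][1]}."]
for (prev_lane, prev_motion), (lane, motion) in zip(states, states[1:]):
    parts = []
    if lane != prev_lane:
        parts.append(f"changes to the {lane}")
    if motion != prev_motion:
        parts.append(f"starts {motion}")
    sentences.append("It then " + " and ".join(parts) + ".")

print(" ".join(sentences))
# A car in the left lane is approaching. It then changes to the host lane.
# It then starts keeping a constant distance.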

Video captioning model

method_image0

The LiDAR-generated captions are used to train SwinBERT, a compact video-captioning transformer that operates solely on front-camera RGB frames. Training follows the masked-language-modeling (MLM) strategy from the original SwinBERT design: the model learns to predict the LiDAR-derived object-level captions from frames sampled uniformly from each video clip.
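
The snippet below sketches the two data-side ingredients mentioned here, uniform frame sampling and masked caption targets for MLM training. It uses a generic BERT tokenizer and the conventional 15% masking ratio as assumptions; it is not taken from the SwinBERT codebase.

import numpy as np
import torch
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

def sample_frame_indices(num_frames_in_clip, num_samples=32):
    """Uniformly spaced frame indices covering the whole clip."""
    return np.linspace(0, num_frames_in_clip - 1, num_samples).round().astype(int)

def mask_caption(caption, mask_prob=0.15):
    """Randomly mask caption tokens for MLM training.
    15% is the usual BERT ratio; the exact SwinBERT scheme may differ."""
    enc = tokenizer(caption, return_tensors="pt")
    input_ids = enc["input_ids"].clone()
    labels = input_ids.clone()
    special = torch.tensor(
        tokenizer.get_special_tokens_mask(
            input_ids[0].tolist(), already_has_special_tokens=True),
        dtype=torch.bool)
    mask = (torch.rand(input_ids.shape) < mask_prob) & ~special
    input_ids[mask] = tokenizer.mask_token_id
    labels[~mask] = -100              # loss is computed on masked positions only
    return input_ids, labels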

Object Mask generation (optional step)

method_image1

To further reduce reliance on background appearance and strengthen temporal grounding, the pipeline introduces an automated LiDAR-based object masking step. For each tracked object:

A. 3D point selection: LiDAR points inside the object’s 3D bounding box are extracted.
B. Projection: These points are projected into the front-camera image using calibrated LiDAR-to-camera transforms.
C. Mask construction: A tight convex hull is computed over the projected points (via the monotone-chain algorithm) and rasterized to create a clean object-specific pixel mask.

The resulting masks highlight the exact object referenced by the caption and suppress irrelevant background context, improving the model’s ability to learn dynamics rather than visual bias.
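
A compact Python sketch of steps A to C, assuming a standard pinhole camera model with a known LiDAR-to-camera rotation R, translation t, and intrinsic matrix K; the helper names and the OpenCV rasterization call are illustrative choices rather than the paper's exact implementation.

import numpy as np
import cv2

def monotone_chain(points):
    """2D convex hull of pixel points (Andrew's monotone chain)."""
    pts = sorted(set(map(tuple, points)))
    if len(pts) <= 2:
        return pts
    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def object_mask(box_points_lidar, R, t, K, image_hw):
    """Project in-box LiDAR points into the image and rasterize their hull."""
    # A. box_points_lidar: (N, 3) LiDAR points already inside the 3D box.
    # B. LiDAR -> camera -> pixel projection.
    cam = box_points_lidar @ R.T + t
    cam = cam[cam[:, 2] > 0]                    # keep points in front of the camera
    pix = cam @ K.T
    pix = (pix[:, :2] / pix[:, 2:3]).round().astype(int)
    # C. Convex hull of the projected points, rasterized into a binary mask.
    hull = np.array(monotone_chain(pix), dtype=np.int32)
    mask = np.zeros(image_hw, dtype=np.uint8)
    if len(hull) >= 3:
        cv2.fillPoly(mask, [hull], 1)
    return mask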

Key Contributions and Results

• Scalable, generic captioning from LiDAR for temporally rich, object‑level descriptions without human labels.
• Improved temporal understanding that beats strong baselines on captioning and retrieval across seen and unseen datasets.
• Visual Bias Measure (VBM), a simple BLEU‑based metric that quantifies reliance on static visual similarity. Training with our temporal captions consistently lowers VBM.
• Mask‑augmented training further improves generalization by directing attention to the object of interest, yielding better temporal grounding between the caption and the associated object.

Metrics at a Glance

Captioning Quality (BLEU4 / CIDEr / SPICE)

Trained solely on single-sentence captions from our proprietary dataset, the model generalizes well to both Waymo and NuScenes. It outperforms InternVideo, ViCLIP, and CLIP, even though those baselines face the easier task of selecting from predefined candidate captions.

table1

Visual Bias Measure (VBM)

VBM captures how much retrieval performance drops, measured using BLEU4, when visually similar videos are removed from the candidate set in a video-to-video retrieval task. A lower VBM indicates that the model is less dependent on static scene appearance and relies more on true temporal cues. Our training achieves the lowest VBM across all datasets, and adding mask-based augmentation further reduces visual bias, lowering VBM by roughly 5 percentage points on Waymo and 2 percentage points on NuScenes.

table2
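
One plausible reading of this definition in code: VBM is taken here as the absolute drop in mean retrieval BLEU4 once visually similar candidates are excluded. The exact formula and the retrieval, scoring, and similarity functions below are assumptions, not transcriptions from the paper.

def visual_bias_measure(queries, candidates, retrieve, bleu4, visually_similar):
    """VBM sketch: drop in video-to-video retrieval BLEU4 after removing
    visually similar candidates.

    Assumed placeholder interfaces:
      retrieve(query, pool)         -> caption of the top retrieved video
      bleu4(reference, hypothesis)  -> BLEU4 score in [0, 1]
      visually_similar(query, cand) -> True if the pair is visually similar
    Each query and candidate is a dict with at least a "caption" field.
    """
    def mean_bleu(pool_for):
        scores = [bleu4(q["caption"], retrieve(q, pool_for(q))) for q in queries]
        return sum(scores) / len(scores)

    full = mean_bleu(lambda q: candidates)
    filtered = mean_bleu(lambda q: [c for c in candidates
                                    if not visually_similar(q, c)])
    return full - filtered    # lower VBM = less reliance on static appearance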

Zero‑Shot Generalization to Longer Captions

Despite training only on single‑sentence captions, our embeddings retrieve videos for two‑sentence captions (more complex maneuvers) with a ~20% higher mean BLEU4 than the best baseline.

table3

Broader Impact & Potential

Scaling data generation: The trained video captioning model can be used to generate temporal object captions for camera-only datasets. Additionally, any LiDAR‑equipped dataset can be converted into a rich temporal captioning dataset at low cost using our pseudo-labeling method.
Beyond captioning: The learned embeddings improve video retrieval by maneuver and can aid trajectory forecasting, risk reasoning, or behavior-search tools.
Temporal understanding metric: VBM quantitatively evaluates the temporal understanding capabilities of models. A lower VBM suggests robustness to background changes, which is useful for remaining agnostic to geographic and seasonal shifts.

Resources & Links

Acknowledgments

Citation

@inproceedings{temporal-object-captioning-wacv2026,
  title={Temporal Object Captioning for Street Scene Videos from LiDAR Tracks},
  author={To be updated post-review},
  booktitle={WACV},
  year={2026},
  note={Under review; project page}
}

Contact

Have questions or want to collaborate? Reach out: