LiDAR MOT-DETR: A LiDAR-based Two-Stage
Transformer for 3D Multiple Object Tracking
Martha Teiko Teye1,2 • Ori Maoz2 • Matthias Rottmann3
1 Department of Mathematics, University of Wuppertal, Wuppertal, Germany
2 Aptiv, Germany
3 Institute of Computer Science, Osnabrück University, Osnabrück, Germany
British Machine Vision Conference (BMVC), 2025
Abstract
Multi-object tracking from LiDAR point clouds presents unique challenges due to the sparse and irregular nature of the data, compounded by the need for temporal coherence across frames. Traditional tracking systems often rely on hand-crafted features and motion models, which can struggle to maintain consistent object identities in crowded or fast-moving scenes. We present a LiDAR-based, two-stage, DETR-inspired transformer consisting of a smoother and a tracker. The smoother stage refines LiDAR object detections from any off-the-shelf detector across a moving temporal window. The tracker stage uses a DETR-based attention block to maintain tracks across time by associating tracked objects with the refined detections, using the point cloud as context. The model is trained on the nuScenes and KITTI datasets in both online and offline (forward-peeking) modes, demonstrating strong performance on metrics such as ID switches and multiple object tracking accuracy (MOTA). The numerical results indicate that the online mode outperforms the LiDAR-only baseline and SOTA models on the nuScenes dataset, with an aMOTA of 0.724 and an aMOTP of 0.475, while the offline mode yields a further 3 pp of aMOTP.
Method Overview
The pipeline consists of a smoother transformer followed by a tracking transformer.
Step 1: Detect objects using an off-the-shelf object detector.
Step 2: Refine object identities and trajectories using the smoother's temporal attention.
Step 3: Train the tracker model using features from the LiDAR point cloud, the refined detections, and track queries from the previous frame.
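The three steps above can be sketched as a toy pipeline. This is a minimal illustrative sketch, not the authors' implementation: the real smoother and tracker are transformers (temporal attention, and DETR-style attention with point-cloud context), which are stood in for here by nearest-neighbour averaging and greedy nearest-centre association; all names are hypothetical.

```python
# Illustrative sketch of the two-stage pipeline (hypothetical names, not the
# authors' API). The transformer stages are replaced by simple geometric
# stand-ins so the data flow is runnable end to end.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Detection:
    center: Tuple[float, float, float]  # box center (x, y, z)
    score: float                        # detector confidence
    track_id: int = -1                  # assigned by the tracker; -1 = none


def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def smooth(window: List[List[Detection]]) -> List[Detection]:
    """Stage 1 stand-in: refine the newest frame's detections over a moving
    temporal window (here by averaging with the nearest detection in the
    previous frame; the paper uses temporal attention instead)."""
    current = window[-1]
    previous = window[-2] if len(window) > 1 else []
    refined = []
    for det in current:
        if previous:
            nearest = min(previous, key=lambda p: dist(p.center, det.center))
            c = tuple((a + b) / 2 for a, b in zip(det.center, nearest.center))
            refined.append(Detection(c, det.score))
        else:
            refined.append(Detection(det.center, det.score))
    return refined


def track(tracks: List[Detection], refined: List[Detection],
          next_id: int, gate: float = 2.0):
    """Stage 2 stand-in: associate refined detections with tracks from the
    previous frame (greedy nearest-centre matching within a distance gate;
    the paper uses DETR-style attention with the point cloud as context).
    Unmatched detections start new tracks."""
    out, used = [], set()
    for det in refined:
        best, best_d = None, gate
        for t in tracks:
            d = dist(t.center, det.center)
            if t.track_id not in used and d < best_d:
                best, best_d = t, d
        if best is not None:
            used.add(best.track_id)
            out.append(Detection(det.center, det.score, best.track_id))
        else:
            out.append(Detection(det.center, det.score, next_id))
            next_id += 1
    return out, next_id
```

Feeding two consecutive frames through `smooth` and `track` keeps each object's identity stable across frames, which is the property the real model optimizes with learned attention.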
Metrics at a Glance
Overall tracking results on nuScenes Test Set
Comparison of LiDAR-only tracking methods. Best method in bold, second best underlined.
Detection metrics on nuScenes validation set
Best method in bold.
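For orientation, the MOTA metric reported above follows the standard CLEAR-MOT definition; the aMOTA/aMOTP variants used in the nuScenes protocol average recall-normalized versions of accuracy and precision over recall thresholds. A minimal computation of plain MOTA from per-sequence error counts:

```python
def mota(false_pos: int, false_neg: int, id_switches: int, num_gt: int) -> float:
    """CLEAR-MOT accuracy: MOTA = 1 - (FP + FN + IDS) / GT,
    where GT is the total number of ground-truth objects."""
    return 1.0 - (false_pos + false_neg + id_switches) / num_gt


print(mota(10, 20, 5, 100))  # -> 0.65
```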
Qualitative Results
Comparison between our method (LiDAR MOT-DETR) and 3DMOTFormer.
Both methods use CenterPoint detections.
Resources & Links
Acknowledgments
Citation
@inproceedings{Teye_2025_BMVC,
author = {Martha Teiko Teye and Ori Maoz and Matthias Rottmann},
title = {LiDAR MOT-DETR: A LiDAR-based Two-Stage Transformer for 3D Multiple Object Tracking},
booktitle = {36th British Machine Vision Conference 2025, {BMVC} 2025, Sheffield, UK, November 24-27, 2025},
publisher = {BMVA},
year = {2025},
url = {https://bmva-archive.org.uk/bmvc/2025/papers/Paper_220/paper.pdf}
}
Contact
Have questions or want to collaborate? Reach out: