Lang2Lift: A Framework for Language-Guided Pallet Detection and Pose Estimation Integrated in Autonomous Outdoor Forklift Operation

Huy Hoang Nguyen1       Johannes Huemer1       Markus Murschitz1       Tobias Glück1       Minh Nhat Vu1,2       Andreas Kugi1,2      
1AIT Austrian Institute of Technology   2ACIN - TU Wien  

Abstract

The logistics and construction industries face persistent challenges in automating pallet handling, especially in outdoor environments with variable payloads, inconsistencies in pallet quality and dimensions, and unstructured surroundings. In this paper, we tackle the automation of a critical step in pallet transport: the pallet pick-up operation. Our work is motivated by labor shortages, safety concerns, and inefficiencies in manually locating and retrieving pallets under such conditions. We present Lang2Lift, a framework that leverages foundation models for natural language-guided pallet detection and 6D pose estimation, enabling operators to specify targets through intuitive commands such as “pick up the steel beam pallet near the crane.” The perception pipeline integrates Florence-2 and SAM2 for language-grounded segmentation with FoundationPose for robust pose estimation in cluttered, multi-pallet outdoor scenes under variable lighting. The resulting poses feed into a motion planning module for fully autonomous forklift operation. We validate Lang2Lift on the ADAPT autonomous forklift platform, achieving 0.76 mIoU pallet segmentation accuracy on a real-world test dataset. Timing and error analysis demonstrate the system’s robustness and confirm its feasibility for deployment in operational logistics and construction environments.


Method

Figure: Overview of the Lang2Lift framework.

The Lang2Lift framework consists of two primary pipelines, integrated through a centralized coordination system operating at 5 Hz. The Perception Pipeline processes visual input and natural language instructions to identify and localize target pallets, using vision foundation models for language-grounded object detection and precise pose tracking. The Planning and Control Pipeline translates the resulting pose estimates into executable forklift operations, performs collision-free motion planning, and executes precise control with centimeter-level accuracy. The sketch below illustrates this coordination.
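The paper's implementation is not reproduced here; the following minimal Python sketch only illustrates how such a 5 Hz coordination loop between the two pipelines could be structured. All names (PerceptionPipeline, PlanningControlPipeline, estimate_pose, step) are hypothetical placeholders, not the authors' API.

import time
from dataclasses import dataclass

@dataclass
class PalletPose:
    # 6D pallet pose in the forklift base frame (hypothetical type)
    x: float
    y: float
    z: float
    roll: float
    pitch: float
    yaw: float

class PerceptionPipeline:
    # Placeholder for language-grounded detection (Florence-2 + SAM2)
    # followed by 6D pose estimation (FoundationPose).
    def estimate_pose(self, image, command):
        # 1) Florence-2 grounds the command to a bounding box,
        # 2) SAM2 refines the box into a pallet mask,
        # 3) FoundationPose estimates and tracks the 6D pose.
        return PalletPose(2.0, 0.1, 0.0, 0.0, 0.0, 0.05)  # dummy output

class PlanningControlPipeline:
    # Placeholder for collision-free motion planning and forklift control.
    def step(self, pose):
        print(f"planning approach to pallet at ({pose.x:.2f}, {pose.y:.2f}) m")

def coordination_loop(camera, command, rate_hz=5.0):
    # Centralized coordinator: query perception, feed planning, at a fixed rate.
    perception = PerceptionPipeline()
    planner = PlanningControlPipeline()
    period = 1.0 / rate_hz
    while True:
        t0 = time.monotonic()
        pose = perception.estimate_pose(camera.read(), command)
        if pose is not None:
            planner.step(pose)
        time.sleep(max(0.0, period - (time.monotonic() - t0)))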

Experiment in Real-World Environments

Test Datasets

Example scenes in the real-world test set span four conditions: Sunny, Snowy, Dark, and Occlusion.

Results

Quantitative performance across conditions
The system shows robust zero-shot performance across challenging conditions: mIoU of 0.85 (94.3% pick-up success) in sunny scenes, 0.77 (89.7%) in snow, 0.78 (88.3%) under occlusion, and 0.65 (82.9%) in low light, for an overall 0.76 mIoU and 90.5% success rate. This outperforms traditional CNN baselines (0.4–0.5 mIoU). Foundation-model pre-training enables adaptation to weather, lighting, and visibility variations while supporting diverse natural language references.
Pose tolerance analysis
The operational tolerance analysis shows that lateral accuracy of ±0.05 m and vertical accuracy of ±0.04 m are required for reliable fork insertion. Pose accuracy degrades quadratically with distance, as expected from stereo geometry, and single views at angles beyond 15° often fail; however, continuous refinement during the approach restores compliance with the insertion tolerances. Heavy or tall cargo affects detection of the pallet's upper (Z) boundary, yet the resulting error remains within functional limits.
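As a concrete illustration, the short sketch below checks a pose estimate against these insertion tolerances. The tolerance values are taken from the analysis above; the error convention (signed lateral and vertical offsets of the fork pockets relative to the fork tips) and the function itself are assumptions, not the paper's code.

LATERAL_TOL_M = 0.05   # maximum lateral offset for reliable fork insertion
VERTICAL_TOL_M = 0.04  # maximum vertical offset for reliable fork insertion

def insertion_feasible(lateral_err_m, vertical_err_m):
    # True if the current pose estimate permits fork insertion.
    return abs(lateral_err_m) <= LATERAL_TOL_M and abs(vertical_err_m) <= VERTICAL_TOL_M

print(insertion_feasible(0.03, 0.02))  # True: 3 cm lateral, 2 cm vertical error
print(insertion_feasible(0.07, 0.02))  # False: lateral error exceeds ±0.05 m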
Runtime performance breakdown
The full language-to-pose cycle takes ≈1.05 s. Florence-2 detection runs at 7.1 Hz, SAM2 segmentation at 25.0 Hz, and pose estimation with geometric adjustment at 1.2 Hz. The pose stage is the primary bottleneck, yet the overall rate is sufficient for typical forklift approach dynamics.
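As a back-of-the-envelope check, the reported cycle time is roughly the sum of the per-stage latencies implied by these rates, assuming the stages run sequentially (an assumption; the paper's exact scheduling may differ):

# Per-stage latencies implied by the reported rates (sequential execution assumed).
stages_hz = {"Florence-2 detection": 7.1, "SAM2 segmentation": 25.0, "pose estimation": 1.2}
for name, hz in stages_hz.items():
    print(f"{name}: {1000.0 / hz:.0f} ms")
total_s = sum(1.0 / hz for hz in stages_hz.values())
print(f"total: {total_s:.2f} s")  # ~1.01 s, consistent with the reported ~1.05 s cycle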

BibTeX

@misc{nguyen2025lang2liftframeworklanguageguidedpallet,
      title={Lang2Lift: A Framework for Language-Guided Pallet Detection and Pose Estimation Integrated in Autonomous Outdoor Forklift Operation},
      author={Huy Hoang Nguyen and Johannes Huemer and Markus Murschitz and Tobias Glueck and Minh Nhat Vu and Andreas Kugi},
      year={2025},
      eprint={2508.15427},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2508.15427}
}

Acknowledgements

We borrow the page template from the Nerfies project page. Special thanks to its authors!
This website is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.