Lang2Lift: A Framework for Language-Guided Pallet Detection and Pose Estimation Integrated in Autonomous Outdoor Forklift Operation

Huy Hoang Nguyen1       Johannes Huemer1       Markus Murschitz1       Tobias Glück1       Minh Nhat Vu1,2       Andreas Kugi1,2      
1AIT Austrian Institute of Technology   2ACIN - TU Wien  

Abstract

The logistics and construction industries face persistent challenges in automating pallet handling, especially in outdoor environments with variable payloads, inconsistencies in pallet quality and dimensions, and unstructured surroundings. In this paper, we tackle the automation of a critical step in pallet transport: the pallet pick-up operation. Our work is motivated by labor shortages, safety concerns, and inefficiencies in manually locating and retrieving pallets under such conditions. We present Lang2Lift, a framework that leverages foundation models for natural language-guided pallet detection and 6D pose estimation, enabling operators to specify targets through intuitive commands such as “pick up the steel beam pallet near the crane.” The perception pipeline integrates Florence-2 and SAM2 for language-grounded segmentation with FoundationPose for robust pose estimation in cluttered, multi-pallet outdoor scenes under variable lighting. The resulting poses feed into a motion planning module for fully autonomous forklift operation. We validate Lang2Lift on the ADAPT autonomous forklift platform, achieving 0.76 mIoU pallet segmentation accuracy on a real-world test dataset. Timing and error analysis demonstrate the system’s robustness and confirm its feasibility for deployment in operational logistics and construction environments.


Method

Figure: Overview of the Lang2Lift framework.

The Lang2Lift framework consists of two primary pipelines, integrated through a centralized coordination system operating at 5 Hz. The Perception Pipeline processes visual input and natural language instructions to identify and localize target pallets, using vision foundation models for language-grounded object detection and precise pose tracking. The Planning and Control Pipeline translates the resulting pose estimates into executable forklift operations, performs collision-free motion planning, and executes precise control with centimeter-level accuracy. The sketch below illustrates this coordination.
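The paper's implementation is not reproduced here; the following minimal Python sketch only illustrates how such a 5 Hz coordination loop between the two pipelines could be structured. All names (PerceptionPipeline, PlanningControlPipeline, estimate_pose, step) are hypothetical placeholders, not the authors' API.

import time
from dataclasses import dataclass

@dataclass
class PalletPose:
    # 6D pallet pose in the forklift base frame (hypothetical type)
    x: float
    y: float
    z: float
    roll: float
    pitch: float
    yaw: float

class PerceptionPipeline:
    # Placeholder for language-grounded detection (Florence-2 + SAM2)
    # followed by 6D pose estimation (FoundationPose).
    def estimate_pose(self, image, command):
        # 1) Florence-2 grounds the command to a bounding box,
        # 2) SAM2 refines the box into a pallet mask,
        # 3) FoundationPose estimates and tracks the 6D pose.
        return PalletPose(2.0, 0.1, 0.0, 0.0, 0.0, 0.05)  # dummy output

class PlanningControlPipeline:
    # Placeholder for collision-free motion planning and forklift control.
    def step(self, pose):
        print(f"planning approach to pallet at ({pose.x:.2f}, {pose.y:.2f}) m")

def coordination_loop(camera, command, rate_hz=5.0):
    # Centralized coordinator: query perception, feed planning, at a fixed rate.
    perception = PerceptionPipeline()
    planner = PlanningControlPipeline()
    period = 1.0 / rate_hz
    while True:
        t0 = time.monotonic()
        pose = perception.estimate_pose(camera.read(), command)
        if pose is not None:
            planner.step(pose)
        time.sleep(max(0.0, period - (time.monotonic() - t0)))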

Experiment in Real-World Environments

Test Datasets

Example scenes in the real-world test set span four conditions: Sunny, Snowy, Dark, and Occlusion.

Results

Quantitative performance across conditions
The system shows robust zero-shot performance across challenging conditions: mIoU of 0.85 (94.3% pick-up success) in sunny scenes, 0.77 (89.7%) in snow, 0.78 (88.3%) under occlusion, and 0.65 (82.9%) in low light, for an overall 0.76 mIoU and 90.5% success rate. This outperforms traditional CNN baselines (0.4–0.5 mIoU). Foundation-model pre-training enables adaptation to weather, lighting, and visibility variations while supporting diverse natural language references.
Pose tolerance analysis
The operational tolerance analysis shows that lateral accuracy of ±0.05 m and vertical accuracy of ±0.04 m are required for reliable fork insertion. Pose accuracy degrades quadratically with distance, as expected from stereo geometry, and single views at angles beyond 15° often fail; however, continuous refinement during the approach restores compliance with the insertion tolerances. Heavy or tall cargo affects detection of the pallet's upper (Z) boundary, yet the resulting error remains within functional limits.
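As a concrete illustration, the short sketch below checks a pose estimate against these insertion tolerances. The tolerance values are taken from the analysis above; the error convention (signed lateral and vertical offsets of the fork pockets relative to the fork tips) and the function itself are assumptions, not the paper's code.

LATERAL_TOL_M = 0.05   # maximum lateral offset for reliable fork insertion
VERTICAL_TOL_M = 0.04  # maximum vertical offset for reliable fork insertion

def insertion_feasible(lateral_err_m, vertical_err_m):
    # True if the current pose estimate permits fork insertion.
    return abs(lateral_err_m) <= LATERAL_TOL_M and abs(vertical_err_m) <= VERTICAL_TOL_M

print(insertion_feasible(0.03, 0.02))  # True: 3 cm lateral, 2 cm vertical error
print(insertion_feasible(0.07, 0.02))  # False: lateral error exceeds ±0.05 m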
Runtime performance breakdown
The full language-to-pose cycle takes ≈1.05 s. Florence-2 detection runs at 7.1 Hz, SAM2 segmentation at 25.0 Hz, and pose estimation with geometric adjustment at 1.2 Hz. The pose stage is the primary bottleneck, yet the overall rate is sufficient for typical forklift approach dynamics.
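As a back-of-the-envelope check, the reported cycle time is roughly the sum of the per-stage latencies implied by these rates, assuming the stages run sequentially (an assumption; the paper's exact scheduling may differ):

# Per-stage latencies implied by the reported rates (sequential execution assumed).
stages_hz = {"Florence-2 detection": 7.1, "SAM2 segmentation": 25.0, "pose estimation": 1.2}
for name, hz in stages_hz.items():
    print(f"{name}: {1000.0 / hz:.0f} ms")
total_s = sum(1.0 / hz for hz in stages_hz.values())
print(f"total: {total_s:.2f} s")  # ~1.01 s, consistent with the reported ~1.05 s cycle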

BibTeX

@misc{nguyen2025lang2liftframeworklanguageguidedpallet,
      title={Lang2Lift: A Framework for Language-Guided Pallet Detection and Pose Estimation Integrated in Autonomous Outdoor Forklift Operation},
      author={Huy Hoang Nguyen and Johannes Huemer and Markus Murschitz and Tobias Glueck and Minh Nhat Vu and Andreas Kugi},
      year={2025},
      eprint={2508.15427},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2508.15427}
}

Acknowledgements

We borrow the page template from the Nerfies project page. Special thanks to its authors!
This website is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.