D-PMC: Driving-Scene Point Transformer with Multi-Scale 3D Convolution for Semantic Segmentation

Abstract

For 3D LiDAR semantic segmentation in driving scenes, it is crucial to capture the fine-grained 3D structural details and long-range dependencies across large-scale data. Although 3D convolutions effectively preserve structural information, high computational costs restrict the expansion of their receptive fields. In contrast, serialization-based methods efficiently capture long-range dependencies via attention by organizing unstructured point clouds into sequences, but inevitably lose structural details by flattening 3D data. To address these limitations, we propose D-PMC: Driving-Scene Point Transformer with Multi-Scale Convolution, a novel hybrid architecture that synergistically integrates 3D CNNs and point transformers. Specifically, we introduce Cylindrical Sector Encoding (CSE), a serialization strategy tailored for driving-scene LiDAR data that effectively preserves spatial proximity. Building upon this, we design the Spatial and Global Perception (SGP) block, the core component of D-PMC that utilizes efficient multi-scale 3D convolution and the CSE-based serialized attention to robustly capture local structural context and model global dependencies. Extensive experiments on three large-scale autonomous driving benchmarks—nuScenes, SemanticKITTI, and Waymo Open Dataset—demonstrate that D-PMC consistently achieves state-of-the-art performance, offering a superior balance between segmentation accuracy and computational efficiency.
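To make the serialization idea concrete, the following is a minimal sketch of what a cylindrical-sector serialization key might look like: each point's azimuth, radial distance, and height are binned, and sorting by the combined key keeps spatially close points adjacent in the 1D sequence. The bin counts, ranges, and key layout here are illustrative assumptions, not the exact CSE scheme from the thesis.

```python
import numpy as np

def cylindrical_sector_key(points, n_sectors=64, n_rings=32, n_z=16,
                           r_max=50.0, z_min=-4.0, z_max=4.0):
    """Map each (x, y, z) LiDAR point to a scalar serialization key.

    Hypothetical sketch: azimuth angle -> sector, radial distance ->
    ring, height -> layer; the three bin indices are packed into one
    integer so that sorting by it groups spatially nearby points.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    theta = np.arctan2(y, x)            # azimuth in [-pi, pi)
    rho = np.sqrt(x**2 + y**2)          # radial distance from the ego vehicle

    sector = ((theta + np.pi) / (2 * np.pi) * n_sectors).astype(int) % n_sectors
    ring = np.clip((rho / r_max * n_rings).astype(int), 0, n_rings - 1)
    layer = np.clip(((z - z_min) / (z_max - z_min) * n_z).astype(int),
                    0, n_z - 1)

    # Major order: sector first, then ring, then height layer.
    return (sector * n_rings + ring) * n_z + layer

# Serialize a synthetic point cloud by sorting on the key.
rng = np.random.default_rng(0)
pts = rng.uniform(-40.0, 40.0, size=(1000, 3))
pts[:, 2] = rng.uniform(-3.0, 3.0, size=1000)
order = np.argsort(cylindrical_sector_key(pts), kind="stable")
serialized = pts[order]
```

The serialized sequence can then be chunked into attention windows, which is how serialization-based transformers trade exact 3D neighborhoods for efficient long-range attention.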


Contents

1 Introduction
2 Related Works
2.1 3D CNN-based approaches
2.2 Transformer-based approaches
3 The Proposed Method
3.1 Cylindrical Sector Encoding
3.1.1 Cylindrical sector indexing
3.1.2 Index encoding
3.2 Spatial and Global Perception block
3.2.1 3D Spatial Perception Unit
3.2.2 Serialized Relation Perception Unit
4 Experiments
4.1 Experimental settings
4.1.1 Datasets
4.1.2 Evaluation metrics
4.1.3 Implementation details
4.2 Comparison with state-of-the-art methods
4.2.1 Quantitative results
4.2.2 Qualitative results
4.3 Ablation study
4.3.1 Effectiveness of D-PMC's core components
4.3.2 Efficiency analysis
5 Conclusion
Bibliography