
MAVIS: Image-centric Alignment and Region-Aware Prompt Learning for Industrial Anomaly Detection

Abstract

Vision-language models such as CLIP have gained significant attention in anomaly detection due to their zero-shot generalization and interpretable reasoning capabilities, yet they possess fundamental limitations. CLIP, trained on large-scale general data, lacks the ability to explicitly contrast normal and abnormal states—a capability we term "negation"—which is essential for domain-specific AD tasks. Consequently, normal image features and abnormal text features are not semantically distinguished in the feature space; some abnormal text features are highly similar to normal image features, acting as hard negatives and causing false positives. To address this problem, we propose a Negation-aware N-pair loss inspired by metric learning. Setting image features as anchors, normal text features as positives, and abnormal text features as negatives, we apply this loss at both the global image and local patch levels to enforce explicit separation between abnormal text features and normal image features. The Negation-aware N-pair loss directly targets the hard negative problem by computing distances between vision anchors and abnormal text features, then selecting the top-k closest abnormal text features as negatives. Furthermore, to explicitly convey the spatial locations of anomalies to the vision-language model, we propose an RoI Selector and a Region-aware prompt learner. The RoI Selector extracts coordinates of anomaly candidate regions, which are converted into position embeddings and passed to the Region-aware prompt learner. These position embeddings are combined with learnable queries during training. Through cross-attention modules, the queries are aligned with regional visual features, while the position embeddings explicitly encode spatial context, enabling accurate correspondence between detected anomalies and their locations.
Our method achieves state-of-the-art performance with 97% image-level AUC and 96% pixel-level AUC on the MVTec-AD dataset, and performs competitively on VisA. Additionally, with 92.9% accuracy in language-based anomaly classification, we demonstrate superior detection performance and interpretability simultaneously in a one-shot setting.
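The hard-negative mining described above can be sketched as follows. This is a minimal NumPy illustration, not the thesis implementation: the function name, the margin-free InfoNCE-style formulation, and the temperature value are assumptions, and the patch-level variant (which applies the same objective per patch feature) is omitted. Each image anchor is compared against the pool of abnormal-text features, the top-k most similar (hardest) negatives are kept, and the anchor is pulled toward its normal-text positive and pushed from those negatives.

```python
import numpy as np

def negation_aware_npair_loss(anchors, pos_text, neg_texts, k=3, tau=0.07):
    """Sketch of a Negation-aware N-pair loss (illustrative, not official).

    anchors   : (B, D) L2-normalized image (or patch) features
    pos_text  : (B, D) normal-text features, one positive per anchor
    neg_texts : (M, D) pool of abnormal-text features (negatives)
    """
    # Cosine similarity between every anchor and every abnormal text: (B, M)
    sim_neg = anchors @ neg_texts.T
    # Keep the k most similar abnormal texts per anchor (the hard negatives)
    topk_idx = np.argsort(-sim_neg, axis=1)[:, :k]
    hard_neg = np.take_along_axis(sim_neg, topk_idx, axis=1)      # (B, k)
    # Similarity of each anchor to its own normal-text positive: (B, 1)
    pos = np.sum(anchors * pos_text, axis=1, keepdims=True)
    # Positive at index 0, hard negatives after it; temperature-scaled
    logits = np.concatenate([pos, hard_neg], axis=1) / tau
    # Numerically stable softmax cross-entropy with target index 0
    logits -= logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[:, 0].mean()
```

Because the negatives are chosen by similarity to the anchor rather than sampled at random, the gradient concentrates on exactly the abnormal-text features that would otherwise trigger false positives.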



Table of Contents
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
I . Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
II . Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1 Industrial Anomaly Detection . . . . . . . . . . . . . . . . . 8
2.2 Zero-/Few-shot Industrial Anomaly Detection . . . . . . . . 9
2.3 Metric Learning for Discriminative Features . . . . . . . . . 10
2.4 Large Vision-Language Models . . . . . . . . . . . . . . . . 10
III . Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.1 MAVIS framework . . . . . . . . . . . . . . . . . . . . . . . 11
3.2 Motivation for Negation-aware Learning . . . . . . . . . . 13
3.3 Negation-aware N-pair Loss . . . . . . . . . . . . . . . . . 14
3.3.1 Image-level Negation-aware N-pair Loss . . . . . . . 16
3.3.2 Patch-level Negation-aware N-pair Loss . . . . . . . . 17
3.4 Region-aware Prompt Learner . . . . . . . . . . . . . . . . 19
3.4.1 Region-aware Query Learning . . . . . . . . . . . . 19
3.4.2 Anomaly-guided Cross-attention . . . . . . . . . . . 21
IV . Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.2 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . 24
4.3 Implementation Details . . . . . . . . . . . . . . . . . . . . 24
4.4 Quantitative Results . . . . . . . . . . . . . . . . . . . . . . 25
4.5 Qualitative Results . . . . . . . . . . . . . . . . . . . . . . 27
4.6 Ablation Studies . . . . . . . . . . . . . . . . . . . . . . . . 28
V . Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
VI . Supplementary Material . . . . . . . . . . . . . . . . . . . . . . . 30
6.1 Normal and Abnormal Text Construction . . . . . . . . . . . 30
6.2 Dataset Summary . . . . . . . . . . . . . . . . . . . . . . . 30
6.3 Additional Qualitative Results . . . . . . . . . . . . . . . 31
6.3.1 Visualization of anomaly map . . . . . . . . . . . . 31
6.3.2 Additional Results of Multi-dialogue . . . . . . . . . 31
6.3.3 Limitations and Failure Cases . . . . . . . . . . . . 32

Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
