
MAVIS: Image-centric Alignment and Region-Aware Prompt Learning for Industrial Anomaly Detection

Abstract

Vision-language models such as CLIP have gained significant attention in anomaly detection due to their zero-shot generalization and interpretable reasoning capabilities, yet they possess fundamental limitations. CLIP, trained on large-scale general data, lacks the ability to explicitly contrast normal and abnormal states—a capability we term "negation"—which is essential for domain-specific AD tasks. Consequently, normal image features and abnormal text features are not semantically distinguished in the feature space; some abnormal text features are highly similar to normal image features, acting as hard negatives and causing false positives. To address this problem, we propose a Negation-aware N-pair loss inspired by metric learning. Setting image features as anchors, normal text features as positives, and abnormal text features as negatives, we apply this loss at both the global image and local patch levels to enforce explicit separation between abnormal text features and normal image features. The Negation-aware N-pair loss directly targets the hard negative problem by computing distances between vision anchors and abnormal text features, then selecting the top-k closest abnormal text features as negatives. Furthermore, to explicitly convey the spatial locations of anomalies to the vision-language model, we propose an RoI Selector and a Region-aware prompt learner. The RoI Selector extracts coordinates of anomaly candidate regions, which are converted into position embeddings and passed to the Region-aware prompt learner. These position embeddings are combined with learnable queries during training. Through cross-attention modules, the queries are aligned with regional visual features, while the position embeddings explicitly encode spatial context, enabling accurate correspondence between detected anomalies and their locations.
Our method achieves state-of-the-art performance with 97% image-level AUC and 96% pixel-level AUC on the MVTec-AD dataset, and performs competitively on VisA. Additionally, with 92.9% accuracy in language-based anomaly classification, we demonstrate superior detection performance and interpretability simultaneously in a one-shot setting.
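The hard-negative mining described above can be sketched as follows. This is a minimal NumPy illustration, not the thesis implementation: the function name, the margin-free InfoNCE-style formulation, and the temperature value are assumptions, and the patch-level variant (which applies the same objective per patch feature) is omitted. Each image anchor is compared against the pool of abnormal-text features, the top-k most similar (hardest) negatives are kept, and the anchor is pulled toward its normal-text positive and pushed from those negatives.

```python
import numpy as np

def negation_aware_npair_loss(anchors, pos_text, neg_texts, k=3, tau=0.07):
    """Sketch of a Negation-aware N-pair loss (illustrative, not official).

    anchors   : (B, D) L2-normalized image (or patch) features
    pos_text  : (B, D) normal-text features, one positive per anchor
    neg_texts : (M, D) pool of abnormal-text features (negatives)
    """
    # Cosine similarity between every anchor and every abnormal text: (B, M)
    sim_neg = anchors @ neg_texts.T
    # Keep the k most similar abnormal texts per anchor (the hard negatives)
    topk_idx = np.argsort(-sim_neg, axis=1)[:, :k]
    hard_neg = np.take_along_axis(sim_neg, topk_idx, axis=1)      # (B, k)
    # Similarity of each anchor to its own normal-text positive: (B, 1)
    pos = np.sum(anchors * pos_text, axis=1, keepdims=True)
    # Positive at index 0, hard negatives after it; temperature-scaled
    logits = np.concatenate([pos, hard_neg], axis=1) / tau
    # Numerically stable softmax cross-entropy with target index 0
    logits -= logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[:, 0].mean()
```

Because the negatives are chosen by similarity to the anchor rather than sampled at random, the gradient concentrates on exactly the abnormal-text features that would otherwise trigger false positives.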



Table of Contents
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
I . Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
II . Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1 Industrial Anomaly Detection . . . . . . . . . . . . . . . . . 8
2.2 Zero-/Few-shot Industrial Anomaly Detection . . . . . . . . 9
2.3 Metric Learning for Discriminative Features . . . . . . . . . 10
2.4 Large Vision-Language Models . . . . . . . . . . . . . . . . 10
III . Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.1 MAVIS framework . . . . . . . . . . . . . . . . . . . . . . . 11
3.2 Motivation for Negation-aware Learning . . . . . . . . . . 13
3.3 Negation-aware N-pair Loss . . . . . . . . . . . . . . . . . 14
3.3.1 Image-level Negation-aware N-pair Loss . . . . . . . 16
3.3.2 Patch-level Negation-aware N-pair Loss . . . . . . . . 17
3.4 Region-aware Prompt Learner . . . . . . . . . . . . . . . . 19
3.4.1 Region-aware Query Learning . . . . . . . . . . . . 19
3.4.2 Anomaly-guided Cross-attention . . . . . . . . . . . 21
IV . Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.2 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . 24
4.3 Implementation Details . . . . . . . . . . . . . . . . . . . . 24
4.4 Quantitative Results . . . . . . . . . . . . . . . . . . . . . . 25
4.5 Qualitative Results . . . . . . . . . . . . . . . . . . . . . . 27
4.6 Ablation Studies . . . . . . . . . . . . . . . . . . . . . . . . 28
V . Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
VI . Supplementary Material . . . . . . . . . . . . . . . . . . . . . . . 30
6.1 Normal and Abnormal Text Construction . . . . . . . . . . . 30
6.2 Dataset Summary . . . . . . . . . . . . . . . . . . . . . . . 30
6.3 Additional Qualitative Results . . . . . . . . . . . . . . . 31
6.3.1 Visualization of anomaly map . . . . . . . . . . . . 31
6.3.2 Additional Results of Multi-dialogue . . . . . . . . . 31
6.3.3 Limitations and Failure Cases . . . . . . . . . . . . 32

Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
