High Performance Vector Database Management System for High-Dimensional Vector Similarity Search
- Subject (Keywords): Vector Databases, High-Dimensional Vector Similarity Search, Neural Information Retrieval Systems, Index Caching, Software Stack, Coordination
- Institution: Sogang University Graduate School
- Advisor: 박성용
- Year of Publication: 2026
- Degree Conferred: February 2026
- Degree: Ph.D.
- Department and Major: Graduate School, Department of Computer Science and Engineering
- URI: http://www.dcollection.net/handler/sogang/000000082792
- UCI: I804:11029-000000082792
- Language: English
- Copyright: This thesis is protected by copyright.
Abstract
Modern information retrieval systems increasingly rely on High-Dimensional Vector Similarity Search (HVSS) to retrieve semantically relevant information from large collections of unstructured data such as text, images, and audio. By encoding user queries and datasets into high-dimensional vectors, HVSS enables efficient similarity search over stored vectors and scales to billions of items. To support such workloads in practice, information retrieval systems widely adopt a Vector Database Management System (VDMS), which provides core components such as query routing, vector indexing and retrieval, and reranking. Despite these advantages, existing VDMSs face critical challenges in achieving low-latency, high-throughput vector search. These challenges stem from inefficient index caching, legacy I/O stacks, and limited coordination between upstream data-processing platforms and the VDMS. As a result, a VDMS often suffers from high search latency and throughput degradation, limiting its practicality at scale.

This dissertation addresses these performance bottlenecks through three complementary techniques, each aiming to achieve high throughput for HVSS. First, we propose CALL, a context-aware query grouping mechanism designed to maximize index cache utilization in disk-based vector databases. By exploiting non-uniform cluster access patterns via query reordering and group-aware prefetching, CALL reduces redundant disk accesses and enables low-latency retrieval under memory constraints. Second, we introduce FLARE, a fast late-interaction reranking engine that accelerates the reranking stage. FLARE uses io_uring for asynchronous parallel document loading and a unified document-token layout to eliminate deserialization overhead. By further applying adaptive document packing and state-aware early termination, FLARE maximizes SSD parallelism and drastically reduces reranking latency.
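The query-grouping idea behind CALL can be illustrated with a minimal sketch (the function names, the greedy strategy, and the threshold below are illustrative assumptions, not the dissertation's implementation): queries whose nearest index clusters overlap, measured by Jaccard similarity, are batched together so that a cluster loaded from disk is reused across the whole group instead of being re-fetched per query.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of cluster IDs."""
    return len(a & b) / len(a | b) if a | b else 0.0

def group_queries(query_clusters, threshold=0.5):
    """Greedily group queries whose candidate-cluster sets overlap.

    query_clusters: list of sets, one set of nearest cluster IDs per query.
    Returns a list of groups, each a list of query indices.
    """
    groups = []  # each entry: (group's cluster footprint, [query indices])
    for qid, clusters in enumerate(query_clusters):
        for footprint, members in groups:
            if jaccard(footprint, clusters) >= threshold:
                members.append(qid)
                footprint |= clusters  # widen the group's cluster footprint
                break
        else:
            groups.append((set(clusters), [qid]))
    return [members for _, members in groups]

# Queries 0 and 1 probe mostly the same clusters; query 2 does not,
# so it forms its own group and its clusters are loaded separately.
groups = group_queries([{1, 2, 3}, {1, 2, 4}, {7, 8, 9}], threshold=0.4)
```

Under this sketch, the prefetcher could then load each group's cluster footprint once before the group's queries execute, which is the cache-reuse effect the abstract attributes to query reordering and group-aware prefetching.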
Third, we present StreamRAG, a cross-layer coordination framework optimized for stream-based retrieval-augmented generation (RAG) systems. By combining lock-aware query routing with traffic-aware proactive instance provisioning, StreamRAG mitigates metadata-lock contention and adapts to bursty workloads, sustaining low latency during concurrent index updates.

Collectively, these techniques target different layers of the VDMS stack and provide system-level contributions that improve vector retrieval, reranking, and the end-to-end user experience. Extensive evaluations show that the redesigned VDMS achieves consistently lower latency and higher throughput across diverse workloads. These results establish a practical foundation for scalable vector-data infrastructure that can support the next generation of AI-driven applications.

Key words: Vector Databases, High-Dimensional Vector Similarity Search, Neural Information Retrieval Systems, Index Caching, Software Stack, Coordination
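The traffic-aware provisioning policy can be sketched as follows (a hypothetical illustration motivated by the z-score sensitivity study listed in the table of contents; the class, window size, and threshold are assumptions): per-interval query rates are tracked over a sliding window, and a new instance is provisioned proactively when the current rate's z-score against the recent window exceeds a threshold, rather than reacting only after latency has already degraded.

```python
from collections import deque
import statistics

class BurstDetector:
    """Flag bursty traffic when the current query rate deviates from
    the recent mean by more than z_threshold standard deviations."""

    def __init__(self, window=10, z_threshold=2.0):
        self.rates = deque(maxlen=window)  # sliding window of recent rates
        self.z_threshold = z_threshold

    def observe(self, rate):
        """Record one per-interval query rate; return True on a burst."""
        burst = False
        if len(self.rates) >= 2:  # need at least two samples for a stdev
            mean = statistics.mean(self.rates)
            stdev = statistics.stdev(self.rates)
            if stdev > 0 and (rate - mean) / stdev > self.z_threshold:
                burst = True  # trigger proactive instance provisioning
        self.rates.append(rate)
        return burst

detector = BurstDetector(window=5, z_threshold=2.0)
steady = [detector.observe(r) for r in [100, 102, 98, 101]]  # no bursts
spike = detector.observe(400)  # sudden surge well above recent traffic
```

In this sketch a `True` result would kick off instance provisioning before the burst saturates the serving pool; the window size and z-threshold trade sensitivity against spurious scale-ups, which is presumably what the dissertation's z-score sensitivity analysis examines.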
Table of Contents
List of Figures
List of Tables
Abstract
1 Introduction
1.1 Vector Search for Unstructured Big Data
1.2 VDMS for Unstructured Big Data
1.2.1 Query Routing Layer
1.2.2 Vector Index Layer
1.2.3 Vector Retrieval Layer
1.2.4 Vector Reranking Layer
1.3 Limitations of Existing VDMS
1.3.1 Non-Context Aware Index Caching
1.3.2 Legacy I/O Stack
1.3.3 Static Query Routing
1.3.4 Non-Proactive Instance Scaling
1.4 Research Objectives and Contributions
1.5 Dissertation Organization
2 Background
2.1 Neural Information Retrieval Models
2.2 Vector Database
2.3 Advanced Vector Indexing Algorithms
2.4 Vector Reranking
3 CALL
3.1 Background
3.1.1 Disk-based ANN Vector Search
3.2 Motivations
3.2.1 Non-Uniform Cluster Access Patterns
3.2.2 Analysis of Replacement Effects on Cluster Cache
3.2.3 Prefetch Invisibility
3.2.4 Imbalanced Cluster Loading
3.3 Design
3.3.1 Overview of CALL
3.3.2 Context-aware Grouping Module
3.3.3 Group-aware Prefetch Module
3.3.4 Latency-aware Cluster Load Module
3.4 Evaluation
3.4.1 Experimental Setup
3.4.2 Overall Performance
3.4.3 Module Effectiveness
3.4.4 Memory Usage Analysis
3.4.5 Sensitivity of Batch Size
3.4.6 Sensitivity of Hyperparameter
3.4.7 Practical Use Case Analysis
3.4.8 Sensitivity of Jaccard Threshold
3.4.9 Overhead Analysis
3.4.10 Related Works
4 FLARE
4.1 Background
4.1.1 Index Amplification of Multi-Vector Models
4.2 Motivations
4.2.1 Poor I/O Efficiency on Reranking Stage
4.2.2 Sweet Point in Document Packing
4.2.3 Stability of Partial Reranked Documents
4.3 Design
4.3.1 Overview of FLARE
4.3.2 Parallel Document Load Module
4.3.3 Adaptive Document Packing Module
4.3.4 Early Rerank Module
4.4 Evaluation
4.4.1 Experimental Setup
4.4.2 Tail Latency
4.4.3 Module Effectiveness
4.4.4 Unified Document Token Layout Analysis
4.4.5 Ablation
4.4.6 Sensitivity of io_uring Parameters
4.4.7 Related Works
5 StreamRAG
5.1 Background
5.1.1 Stream-Based RAG System
5.2 Motivations
5.2.1 Metadata Lock Contention
5.2.2 Reactive Scaling for Burst Traffic
5.3 Design
5.3.1 Overall Architecture
5.3.2 Lock-aware Query Routing Module
5.3.3 Traffic-aware Proactive Instance Provision Module
5.4 Implementation
5.5 Evaluation
5.5.1 Experimental Setup
5.5.2 Search Latency
5.5.3 Module Effectiveness
5.5.4 Sensitivity of β
5.5.5 Sensitivity of Z-score
5.5.6 Related Works
6 Conclusion
References

