Deduplication Approaches for High-Performance and Space-Efficient Key-Value Databases

Abstract

The rapid growth of data-intensive applications has posed significant challenges for modern storage systems, particularly key-value store-based systems, which are widely adopted for their simplicity and scalability. However, key-value store systems suffer from inefficiencies, such as high write and space amplification of up to 40×. Moreover, key-value store datasets exhibit high redundancy, which exacerbates write amplification and suboptimal storage utilization. Traditional file system-level deduplication fails in key-value store environments due to granularity mismatches, while compression algorithms are limited to intra-block redundancies, leaving cross-block duplication unoptimized. Additionally, the adoption of Zoned Namespace (ZNS) SSDs introduces complexities such as file system-managed garbage collection, which exacerbates write amplification through valid data migration overhead. Existing deduplication solutions for Log-Structured File Systems fail to address these ZNS-specific challenges, including amplified garbage collection overhead and inefficient file organization.

This dissertation addresses these challenges by proposing novel frameworks to optimize write amplification and storage utilization in key-value store systems and ZNS SSD-based file systems. First (Chapter 3), a flush-integrated inline deduplication framework, HyDe-KV (Inline), introduces Asynchronous Partly Inline Deduplication into LSM-Tree-based key-value stores, effectively reducing write amplification and optimizing storage. Second (Chapter 4), a hybrid deduplication framework, Hyde-KV, combines inline and offline deduplication with an elastic execution model to balance performance and deduplication efficiency across diverse workloads in LSM-Tree-based key-value stores. Finally (Chapter 5), DeZNS, a data placement approach for deduplication-enabled ZNS file systems, addresses garbage collection overhead by segregating unique and duplicate data blocks using a lightweight CRC32 checksum-based mechanism. This segregation reduces valid data migration overhead during garbage collection, while an interruptible garbage collection mechanism maintains performance by preventing delays to ongoing I/O requests during zone resets.

This dissertation advances storage system design by addressing key challenges in write amplification, redundancy management, and deduplication efficiency, offering scalable and high-performance solutions for modern key-value store architectures and ZNS SSD-based systems.

Keywords: Data Deduplication, Key-Value Stores, Zoned Namespace SSDs

Table of Contents

1 Introduction
1.1 Write and Space Amplification in Key-Value Databases
1.2 Redundancy in Key-Value Datasets
1.2.1 Problem Statement
1.3 Research Contributions
1.4 Dissertation Organization
2 Background
2.1 Key-Value Stores as Node-local Storage
2.2 Log-Structured Merge (LSM)-Tree
2.2.1 Limitations of LSM-Tree
2.3 BlobDB: Key-Value Separation in LSM Tree
2.4 Zoned Namespace SSD
2.5 Storage Space Optimization Techniques
2.5.1 Compression
2.5.2 Deduplication
3 HyDe-KV (Inline): Inline Deduplication in LSM-tree
3.1 Introduction
3.2 Background
3.2.1 Log-Structured Merge (LSM)-Tree
3.2.2 Key-Value Dataset Characteristics
3.3 Preliminary Study and Motivation
3.3.1 KV Store-level Compression
3.3.2 File System-level Deduplication
3.3.3 Exploring Deduplication in LSM-Tree-based Key-Value Store
3.4 Design of HyDe-KV (Inline)
3.4.1 System Overview
3.4.2 Data Structures
3.4.3 I/O Flow
3.4.4 Garbage Collection
3.4.5 Performing Analysis on Node-local Key-Value Store
3.5 Evaluation
3.5.1 Implementation
3.5.2 Experimental Setup
3.5.3 Performance Analysis
3.5.4 Write and Space Amplification Analysis
3.5.5 Analysis of HyDe-KV (Inline)'s I/O Delays
3.6 Related Work
3.7 Chapter Summary
4 Hyde-KV: Offline & Hybrid Deduplication in LSM-tree
4.1 Introduction
4.2 Background and Motivation
4.2.1 HyDe-KV (Inline): FLUSH-integrated Inline Deduplication
4.3 Design of Hyde-KV
4.3.1 Design Goals
4.3.2 Design Overview
4.3.3 HyDe-KV (Offline): WAL file-based Offline Deduplication
4.3.4 Hyde-KV: Elastic Execution
4.3.5 I/O Service Flow
4.4 Evaluation
4.4.1 Experiment Setup
4.4.2 Evaluation with Duplicates
4.4.3 Evaluation with Real-world Traces
4.4.4 Evaluation with Standard Benchmark
4.4.5 CPU Utilization Analysis
4.4.6 Range Scan Analysis
4.4.7 Failure Analysis
4.5 Chapter Summary
5 DeZNS: Data Placement in Deduplication-Enabled ZNS Environment
5.1 Introduction
5.2 Background and Motivation
5.2.1 Zone Management with ZenFS
5.2.2 Motivation
5.3 Design of DeZNS
5.3.1 Design Goals
5.3.2 System Architecture
5.3.3 C3PO: CRC32-based Data Placement
5.3.4 Offline Deduplication
5.3.5 Garbage Collection
5.3.6 I/O Flow
5.4 Evaluation
5.4.1 Experimental Setup
5.4.2 Micro-Benchmark Results
5.4.3 Macro-Benchmark Results
5.4.4 Experiments on Real Device
5.4.5 Space Utilization and Metadata Overhead
5.5 Related Work
5.5.1 ZNS SSD
5.5.2 Data Deduplication in File System
5.6 Chapter Summary
6 Conclusion and Future Work
References
