dCollection Digital Academic Information Distribution System

Disk-Based Shared KV Cache Management for Fast Inference in Multi-Instance LLM RAG Systems

Original title (Korean): 다중 인스턴스 LLM RAG 시스템에서 빠른 추론을 위한 디스크 기반 공유 KV 캐시 관리 기법 연구

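Subject (keywords): Disk-based KV cache, LLM, Key-Value Cache, RAG, Vector DB, TTFT
Publisher: Sogang University Graduate School
Advisor: 김영재
Year of publication: 2026
Degree conferred: February 2026
Degree: Master's
Department and major: Graduate School, Interdisciplinary Program in Artificial Intelligence
Actual URI: http://www.dcollection.net/handler/sogang/000000082232
UCI: I804:11029-000000082232
Language of text: Korean
Copyright: This thesis is protected by copyright.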

Abstract

Large language models (LLMs) suffer from growing inference latency as input prompts get longer and models get larger. Retrieval-Augmented Generation (RAG), which is used in most LLM inference services, lengthens the input prompt considerably, which raises the computational overhead of the prefill stage and delays the Time To First Token (TTFT). To address these problems, this thesis proposes Shared RAG-DCache, a KV cache management system that uses a disk-based key-value (KV) cache to reduce prefill computation and shorten TTFT. Shared RAG-DCache targets RAG systems that run multiple LLM instances. It exploits two properties: the locality of the documents retrieved for user queries (query locality) and the time requests spend waiting in the inference service's queue. Using these, it proactively generates KV caches for query-related documents, stores them on disk, and shares them across LLM instances, improving inference performance. In experiments on a single host (2 GPUs, 1 CPU), Shared RAG-DCache increased throughput by 15% to 71% and reduced latency by 12% to 65%, depending on the resource configuration.
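The generate, store, and share flow described in the abstract can be illustrated with a minimal Python sketch. This is not the thesis's implementation: `SharedDiskKVCache`, `prefetch`, and the `compute_kv` callback are hypothetical names introduced here, and `pickle` stands in for whatever tensor serialization a real system would use. The sketch assumes that all LLM instances run on one host and can see the same cache directory, and it models only the idea of warming the cache for retrieved documents while a request waits in the queue.

```python
import hashlib
import os
import pickle
import tempfile
from pathlib import Path
from typing import Any, Callable, Dict, Optional


class SharedDiskKVCache:
    """Toy disk-backed cache that several processes share through one directory."""

    def __init__(self, root: str) -> None:
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, doc_id: str) -> Path:
        # Hash the document ID so any string maps to a safe file name.
        digest = hashlib.sha256(doc_id.encode("utf-8")).hexdigest()
        return self.root / f"{digest}.kv"

    def get(self, doc_id: str) -> Optional[Any]:
        """Return the cached entry for doc_id, or None on a miss."""
        try:
            with open(self._path(doc_id), "rb") as f:
                return pickle.load(f)
        except FileNotFoundError:
            return None

    def put(self, doc_id: str, kv: Any) -> None:
        """Write to a temp file, then rename, so readers never see a partial file."""
        fd, tmp = tempfile.mkstemp(dir=self.root, suffix=".tmp")
        with os.fdopen(fd, "wb") as f:
            pickle.dump(kv, f)
        os.replace(tmp, self._path(doc_id))

    def get_or_compute(self, doc_id: str, text: str,
                       compute_kv: Callable[[str], Any]) -> Any:
        """Reuse the stored entry if present; otherwise compute and store it."""
        kv = self.get(doc_id)
        if kv is None:
            kv = compute_kv(text)  # stands in for the prefill pass over one document
            self.put(doc_id, kv)
        return kv


def prefetch(cache: SharedDiskKVCache, retrieved_docs: Dict[str, str],
             compute_kv: Callable[[str], Any]) -> None:
    """Warm the cache for the documents retrieved for a query still in the queue."""
    for doc_id, text in retrieved_docs.items():
        cache.get_or_compute(doc_id, text, compute_kv)
```

In this sketch, a second `SharedDiskKVCache` object pointed at the same directory plays the role of another LLM instance: once one instance has stored the entry for a document, the other reads it from disk instead of recomputing it. Writing to a temporary file and renaming it keeps a concurrent reader from loading a half-written entry.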

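Table of Contents

1 Introduction
2 Background and Motivation
2.1 Use of the Key-Value Cache in LLM Inference
2.2 RAG Prompt Composition
2.3 Feasibility of a Disk-Based Key-Value Cache and Break-Even Point Analysis
2.4 Locality of Documents Retrieved by Queries in RAG
2.5 Shared Key-Value Cache for Multi-Instance LLM Inference Services
3 Design and Implementation
3.1 Structure and Operation of the Disk-Based Key-Value Cache
3.2 RAG-DCache in a Multi-Instance LLM Service Environment
3.3 Comparison of Shared RAG-DCache with Recent Related Work
3.4 Implementation Environment for RAG-DCache and Shared RAG-DCache
4 Experimental Results
4.1 RAG-DCache Performance Measurements and Analysis
4.2 Shared RAG-DCache Performance Measurements and Analysis
4.3 Discussion of Calculating the Threshold for Using the Disk-Based Key-Value Cache
5 Conclusion
References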
