An Architectural Framework for High-Performance and Energy-Efficient Large Language Model Acceleration on NAND Flash-Based In-Storage Computing Systems
- Keywords: 3D NAND flash memory, in-storage computing, large language models, DRAM buffering, triple-level cell
- Institution: Graduate School, Sogang University
- Advisor: 류성주
- Year of publication: 2026
- Degree conferral date: February 2026
- Degree: Master's
- Department and major: Department of Electronic Engineering, Graduate School
- URI: http://www.dcollection.net/handler/sogang/000000082247
- UCI: I804:11029-000000082247
- Language: English
- Copyright: This thesis is protected by copyright.
Abstract (Summary)
This thesis presents an architectural framework for NAND flash-based in-storage computing (ISC) systems, addressing the challenges of high energy consumption and low performance in accelerating large language models (LLMs). For energy efficiency, E-Flash introduces a novel data mapping methodology that reduces the power consumed when reading static weights. It uses a state-switching algorithm and cell-first allocation to map frequent data patterns to low-power cell states, achieving an energy reduction of up to 37.73% over the baseline. To improve performance, NITRO addresses the latency of handling dynamic activations with a hybrid architecture that buffers these intermediate results in a fast DRAM subsystem. Furthermore, NITRO maximizes throughput by leveraging a distributed dataflow to exploit the parallelism of the NAND array, reducing inference latency by up to 85%. Together, these strategies enable fast and energy-efficient LLM deployment directly inside storage systems.
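The mapping idea summarized above can be sketched in a few lines. The following is a minimal illustration, not the thesis's actual E-Flash algorithm: it assumes triple-level cells (3 bits per cell, 8 states) and assumes that lower-numbered cell states cost less read energy, then assigns the most frequent 3-bit weight patterns to the lowest-power states. The function name and interface are hypothetical.

```python
from collections import Counter

def build_state_map(weight_bits, bits_per_cell=3):
    """Illustrative pattern-to-state mapping (hypothetical sketch).

    Assumes state 0 is the cheapest to read; the most frequent
    bit pattern in the weight stream is therefore mapped to state 0,
    the next most frequent to state 1, and so on.
    """
    # Split the weight bit stream into per-cell groups of 3 bits.
    groups = [tuple(weight_bits[i:i + bits_per_cell])
              for i in range(0, len(weight_bits) - bits_per_cell + 1,
                             bits_per_cell)]
    # Rank observed patterns by frequency, most common first.
    ranked = [pattern for pattern, _ in Counter(groups).most_common()]
    # Append any unobserved patterns so all 2^3 states get an assignment.
    for s in range(2 ** bits_per_cell):
        pattern = tuple((s >> b) & 1 for b in range(bits_per_cell))
        if pattern not in ranked:
            ranked.append(pattern)
    # Most frequent pattern -> state 0 (assumed lowest read power).
    return {pattern: state for state, pattern in enumerate(ranked)}
```

Under these assumptions, a weight stream dominated by one pattern would place that pattern in the lowest-power state, so most cell reads land on cheap states.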
Table of Contents
Abstract
I. Introduction 1
II. Preliminaries 4
2.1. 3D NAND Flash 4
2.2. 3D NAND Flash-Based Matrix-Vector Multiplication 8
III. Bit-Pattern Aware Data Mapping Algorithm 10
3.1. Motivation 10
3.2. Proposed Methodology 13
3.2.1. State-Switching Algorithm 13
3.2.2. Cell-First Allocation 15
3.2.3. Top-Level Architecture 18
3.3. Experiments 20
3.3.1. Experimental Setup 20
3.3.2. Results 22
IV. Heterogeneous In-Storage Computing Architecture 25
4.1. Motivation 25
4.2. Proposed Methodology 28
4.2.1. Activation Buffering 28
4.2.2. Distributed Dataflow 32
4.2.3. Top-Level Architecture 35
4.3. Experiments 38
4.3.1. Experimental Setup 38
4.3.2. Results 41
V. Conclusion 47
References 49

