An Architectural Framework for High-Performance and Energy-Efficient Large Language Model Acceleration on NAND Flash-Based In-Storage Computing Systems

Abstract

This paper presents an architectural framework for NAND flash-based in-storage computing (ISC) systems that addresses two obstacles to accelerating large language models (LLMs): high energy consumption and low performance. For energy efficiency, E-Flash introduces a data mapping methodology that reduces the power consumed when reading static weights. It uses a state-switching algorithm and cell-first allocation to map frequent data patterns to low-power cell states, achieving an energy reduction of up to 37.73% compared to the baseline. For performance, NITRO addresses the latency of handling dynamic activations with a hybrid architecture that buffers these intermediate results in a fast DRAM subsystem. NITRO further maximizes throughput by using a distributed dataflow to exploit the parallelism of the NAND array, reducing inference latency by up to 85%. Together, these strategies enable fast and energy-efficient LLM deployment directly inside storage systems.

Table of Contents

Abstract
I. Introduction 1
II. Preliminaries 4
2.1. 3D NAND Flash 4
2.2. 3D NAND Flash-Based Matrix-Vector Multiplication 8
III. Bit-Pattern Aware Data Mapping Algorithm 10
3.1. Motivation 10
3.2. Proposed Methodology 13
3.2.1. State-Switching Algorithm 13
3.2.2. Cell-First Allocation 15
3.2.3. Top-Level Architecture 18
3.3. Experiments 20
3.3.1. Experimental Setup 20
3.3.2. Results 22
IV. Heterogeneous In-Storage Computing Architecture 25
4.1. Motivation 25
4.2. Proposed Methodology 28
4.2.1. Activation Buffering 28
4.2.2. Distributed Dataflow 32
4.2.3. Top-Level Architecture 35
4.3. Experiments 38
4.3.1. Experimental Setup 38
4.3.2. Results 41
V. Conclusion 47
References 49
