Optimization with Access Frequency-Based Remapping for Recommendation System Inference Accelerators
- Subject (Keywords): Recommendation system, NAND flash memory, in-storage computing, hardware accelerator, data remapping
- Publisher: Sogang University Graduate School
- Advisor: Sungju Ryu
- Year of Publication: 2026
- Degree Conferral Date: February 2026
- Degree: Master's
- Department and Major: Department of Electronic Engineering, Graduate School
- URI: http://www.dcollection.net/handler/sogang/000000082259
- UCI: I804:11029-000000082259
- Language: English
- Copyright: This thesis is protected by copyright.
Abstract
We propose an optimization strategy for recommendation system accelerators to enhance inference on NAND flash-based in-storage computing (ISC). Modern recommendation systems produce personalized outputs from user activities such as clicks and streaming histories. Deep learning models for these tasks employ embedding layers that require large memory and exhibit irregular access patterns. As data volumes grow, embedding tables often exceed DRAM capacity, making NAND flash storage necessary. However, these irregular access patterns cause most of the data fetched from large NAND flash pages to go unused. The mismatch between small embedding vectors and large page buffers leads to underutilized internal bandwidth and degraded performance. To address this, we first apply access frequency-based remapping, which groups frequently accessed embedding vectors onto the same page, combined with plane distribution, which spreads these pages across multiple planes to maximize hardware parallelism. Second, we implement a page-wise cache in the SSD controller that keeps frequently accessed pages in SRAM. Experimental results show that the proposed method reduces latency by up to 81% compared with existing ISC architectures.
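The remapping idea in the abstract can be illustrated with a minimal sketch. The page size, vector size, and plane count below are illustrative assumptions, not values from the thesis: vectors are ranked by access frequency, packed into pages so that one page read serves many hot lookups, and consecutive hot pages are striped round-robin across planes so they can be fetched in parallel.

```python
from collections import Counter

PAGE_SIZE = 16 * 1024    # bytes per NAND flash page (assumed)
VECTOR_SIZE = 256        # bytes per embedding vector (assumed)
NUM_PLANES = 4           # planes available for parallel reads (assumed)
VECTORS_PER_PAGE = PAGE_SIZE // VECTOR_SIZE


def remap(access_trace, num_vectors):
    """Return a vector_id -> (plane, page, slot) placement.

    Hot vectors are packed onto the same page so one page read serves
    many lookups; successive hot pages are striped across planes.
    """
    freq = Counter(access_trace)
    # Rank all vectors by descending access frequency (ties keep id order).
    order = sorted(range(num_vectors), key=lambda v: -freq[v])
    placement = {}
    for rank, vec in enumerate(order):
        page_idx = rank // VECTORS_PER_PAGE   # logical hot-page index
        slot = rank % VECTORS_PER_PAGE        # position within the page
        plane = page_idx % NUM_PLANES         # round-robin plane striping
        page = page_idx // NUM_PLANES         # page number within its plane
        placement[vec] = (plane, page, slot)
    return placement
```

In this toy model, the hottest vectors end up as neighboring slots of page 0 on plane 0, so a single page fetch amortizes across the most frequent lookups; real ISC designs would drive the remapping from profiled traces and rewrite the flash translation layer accordingly.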
Table of Contents
I. Introduction 1
II. Preliminary 6
2.1. Random Data Access Pattern on Recommendation System 6
2.2. Inefficient Bandwidth Utilization on NAND Flash Memory 7
2.3. Data Access Frequency on Recommendation System 9
III. Hardware Accelerator for Recommendation System 11
3.1. Motivation 11
3.2. Embedding Layers in Prior SSD-Based Accelerators 16
3.3. RecFlash 17
3.4. RecFlash Hardware Design 34
IV. Evaluation 38
4.1. Experimental Setup 38
4.2. Latency Analysis of Embedding Operation 41
4.3. Energy Consumption Analysis 43
4.4. End-to-End Model Latency Analysis 45
4.5. Cumulative Inference Time Analysis with Online Training 48
V. Conclusion 51

