
Dysarthria Severity Classification via 1D-Convolutional MoE and RAG-based Explanation Generation System

1D-Convolutional MoE 기반 마비말 장애 심각도 분류 및 RAG 기반 설명 생성 시스템



Abstract

Accurate dysarthria assessment requires models that capture temporal characteristics of speech and provide clinically interpretable evidence. This need is heightened by the severe data imbalance and scarcity common in dysarthria research, necessitating robust modeling strategies. In this study, we employ a 1D convolution-based neural architecture to learn temporal patterns from maximum phonation time (MPT) and diadochokinetic (DDK) tasks, enhancing dysarthria classification while enabling the identification of abnormal segments within the time series. To address data sparsity, a Mixture-of-Experts (MoE) classifier is incorporated, allowing heterogeneous speech patterns across severity levels to be processed through specialized expert pathways, thereby improving model stability and representational capacity in limited-data settings. To further provide temporal interpretability, Grad-CAM is applied to highlight key intervals contributing to classification decisions. Features extracted from these intervals are then interpreted through a retrieval-augmented generation (RAG) process, enabling a large language model (LLM) to generate clinically grounded and contextually coherent explanations. Experimental results demonstrate that the proposed model achieves stable classification performance across both MPT and DDK tasks and offers fine-grained interpretability for articulatory and prosodic abnormalities. Overall, this work presents an interpretable and scalable approach to dysarthria assessment that effectively addresses the data scarcity and heterogeneity inherent in clinical environments, supporting both research and clinical applications.
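To make the core idea concrete, the following is a minimal NumPy sketch of a Mixture-of-Experts head operating on 1D-convolutional features, as described above. It is an illustrative toy, not the thesis architecture: the function names, the number of experts, the feature dimension, and the five severity classes are all assumptions chosen for the example, and real models would use learned kernels and weights rather than fixed arrays.

```python
import numpy as np

def conv1d_features(x, kernels):
    """Slide each 1-D kernel over a speech-derived time series,
    apply ReLU, and global-average-pool to one activation per kernel."""
    feats = []
    for k in kernels:
        act = np.convolve(x, k, mode="valid")   # valid-mode 1-D convolution
        feats.append(np.maximum(act, 0.0).mean())
    return np.array(feats)

def softmax(z):
    z = z - z.max()                              # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def moe_classify(x, kernels, gate_w, expert_ws):
    """MoE head: a gating network produces mixture weights over experts,
    each expert maps the shared conv features to class logits, and the
    gate-weighted combination is normalized into class probabilities."""
    h = conv1d_features(x, kernels)
    gate = softmax(gate_w @ h)                       # (n_experts,)
    logits = np.stack([w @ h for w in expert_ws])    # (n_experts, n_classes)
    return softmax(gate @ logits)                    # (n_classes,)

# Hypothetical usage with random stand-ins for an MPT/DDK time series.
rng = np.random.default_rng(0)
x = rng.standard_normal(200)
kernels = [rng.standard_normal(9) for _ in range(4)]
gate_w = rng.standard_normal((3, 4))                    # 3 experts, 4 features
expert_ws = [rng.standard_normal((5, 4)) for _ in range(3)]  # 5 severity classes
p = moe_classify(x, kernels, gate_w, expert_ws)
```

In the thesis setting, the gating lets different experts specialize in the heterogeneous utterance patterns of different severity levels, which is the stated motivation for using MoE under data scarcity.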


Table of Contents

1.1 The Need for Explainable Dysarthria Diagnosis Models
1.2 Data Scarcity and Imbalance Problems
1.3 Retrieval Augmented Generation (RAG)
1.4 Necessity of the Research
1.5 Contributions
2 Related Work
2.1 Dysarthric Speech Datasets for Severity Classification
2.2 Research on Dysarthria Severity Classification
2.3 Explainability in Speech Classification Models
2.4 Applications of Large Language Models in Medicine
2.4.1 Retrieval-Augmented Generation Research for Preventing Hallucination
2.4.2 Medical LLM Research for Explainability
3 Methodology
3.1 1D Conv MoE Classifier
3.2 Retrieval-Augmented Generation (RAG) Pipeline
3.2.1 Feature Analysis and Embedding Construction
3.2.2 Temporal Retrieval and Filtering Strategy
3.2.3 LLM-Based Explanation Generation
4 Experiments
4.1 Dataset
4.1.1 Speech-Language Pathologist-Labeled Subset
4.2 Prior Work & Experimental Setup
4.2.1 Audio Preprocessing
Whisper Preprocessing
1D Conv. Severity Classifier Preprocessing
4.2.2 Severity Classifier Baseline: Whisper + MLP
4.2.3 Retrieval Baseline and Comparison Settings
4.2.4 LLM & RAG Settings: Zero-shot, Few-shot, Omni-modal
4.2.5 Experimental Settings
Audio Preprocessing and Mel-spectrogram Conversion
Training Strategy and Hyperparameters
Acoustic and Task-Feature Extraction Settings
LLM and Omni-Modal Model
4.3 Experimental Results
4.3.1 Severity Classification Performance
4.3.2 Retriever Performance
4.3.3 Explanation Generation Performance
4.4 Case Analysis
4.4.1 Qualitative Analysis and Case Studies
Comparative Analysis by Method: Speaker V128E
Failure Case Analysis: Speaker EUMCS08
5 Conclusion, Limitations, and Future Work
5.1 Summary
5.2 Limitations and Future Work
A Appendix
A.1 Common System Prompt
A.2 Prompt Templates for Clinical Report Generation
A.2.1 MPT Task Prompt Template
A.2.2 DDK Task Prompt Template
A.3 LLM Judge Evaluation Prompt
Bibliography
