Unified Music Translation between Score Images, MusicXML, MIDI and Audio

Abstract

Music exists in various modalities, such as score images, symbolic scores, MIDI, and audio. Translation between these modalities is established as a core task of music information retrieval, for example automatic music transcription (audio-to-MIDI) and optical music recognition (score image to symbolic score). However, most past work on multimodal translation trains specialized models on individual translation tasks. Here, we propose a unified approach, where we train a general-purpose model on many translation tasks simultaneously. The viability of this approach rests on two key factors. First, we collect a new dataset consisting of more than 1,300 hours of paired audio and score-image data from YouTube videos, an order of magnitude larger than any existing music modality translation dataset. Second, our unified tokenization framework discretizes score images, audio, MIDI, and MusicXML into a shared vocabulary of tokens, enabling a single encoder–decoder Transformer to tackle multiple cross-modal translations as one coherent sequence-to-sequence task. Experimental results confirm that our unified multitask model surpasses single-task baselines, lowering the Symbol Error Rate for optical music recognition from 24.58% to a state-of-the-art 13.67%, with similarly substantial improvements across the other translation tasks. Notably, our approach achieves the first successful score-image-conditioned audio generation, marking a significant breakthrough in cross-modal music generation.
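The abstract reports results in Symbol Error Rate (SER), which is conventionally the edit distance between the predicted and reference symbol sequences, normalized by the reference length. A minimal sketch of that metric follows; the function names are illustrative, not taken from the thesis.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two symbol sequences (one-row DP)."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(
                dp[j] + 1,          # deletion
                dp[j - 1] + 1,      # insertion
                prev + (r != h),    # substitution (free if symbols match)
            )
    return dp[-1]

def symbol_error_rate(ref, hyp):
    """SER = edit distance / reference length (assumed convention)."""
    return edit_distance(ref, hyp) / len(ref)
```

For instance, one substitution in a four-symbol reference yields an SER of 0.25; a reported SER of 13.67% means roughly one symbol error per seven reference symbols.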

Contents

Contents i
List of Figures iv
List of Tables vi
Abstract 1
1 Introduction 3
2 Background 7
2.1 Sequence-to-Sequence Model 7
2.2 Language Model 8
2.3 Transformer 8
2.3.1 Transformer Encoder 9
2.3.2 Transformer Decoder 9
2.4 Vector Quantization 10
2.4.1 Residual Vector Quantization 11
2.4.2 Usage of VQ and RVQ Tokens with Language Model 11
3 Problem Formulation and Related Works 12
3.1 Music Representation in Different Modalities 12
3.2 Tasks in Music Information Retrieval 14
3.2.1 Automatic Music Transcription 14
3.2.2 Optical Music Recognition 15
3.3 Multimodal and Multitask Approaches 16
4 Methodologies 17
4.1 Tokenization 18
4.1.1 Image Tokens 19
4.1.2 Audio Tokens 24
4.1.3 Linearized MusicXML (LMX) 26
4.1.4 MIDI-Like Tokens 27
4.1.5 Unified Vocabulary 27
4.2 Model Architecture and Unified Token Space 27
4.2.1 Decoding with a Sub-Decoder Dsub 28
4.2.2 Training Objective 29
4.2.3 Autoregressive Inference 29
5 Dataset 30
5.1 YouTube Score Video Dataset 30
5.1.1 Data Collection 31
5.1.2 Data Extraction and Processing 32
5.1.3 Corpus-level Filtering (Metadata-based) 35
5.1.4 Sample-Level Filtering 40
5.2 Dataset Collection 43
6 Experiments 46
6.1 Modal Directions 46
6.2 Implementation 47
6.3 Data Split and Test Sets 49
6.4 Evaluation Metrics 51
6.4.1 Optical Music Recognition 51
6.4.2 Automatic Music Transcription 52
6.4.3 Image-to-Audio 54
6.4.4 MIDI-to-Audio 57
6.4.5 Audio-to-Image 57
7 Results 61
7.1 Image-to-Audio Generation 61
7.2 Audio-to-Image Generation 64
7.3 MIDI-to-Audio Synthesis 66
7.4 OMR and AMT 66
8 Conclusion 69
Bibliography 70
Abstract (in Korean) 78