
Interest Keyword Extraction using Keyword Augmentation and Korean BERT for Targeting Advertising System

Original Korean title: 타겟팅 광고 시스템을 위한 키워드 증강 및 한국어 BERT를 활용한 관심사 키워드 추출 연구


Keywords: Data Augmentation, Korean BERT, Semi-Supervised Learning, Keyword Extraction, Natural Language Processing, Word Embedding, Targeting Advertising
Publisher: Sogang University, Graduate School of Information and Technology
Advisor: 양지훈
Year of publication: 2024
Degree conferred: February 2024
Degree: Master's
Department and major: Graduate School of Information and Technology, Data Science · Artificial Intelligence
URI: http://www.dcollection.net/handler/sogang/000000076645
UCI: I804:11029-000000076645
Language of text: Korean
Copyright: Sogang University theses are protected by copyright.

Abstract

The broadcast advertising market is shifting to digital media, led by IPTV, and aims to deliver customized advertising, as distinct from mass advertising, in order to increase the advertising effectiveness of products[1]. Product-category classification is central to customized advertising, and systematizing the keyword extraction process is essential for minimizing the manual work of classifying new products and for improving objectivity. This thesis makes three contributions that distinguish it from prior work on extracting the interest keywords that underlie customized advertising. First, because the quality of training data strongly affects model performance, the system is built so that ground-truth data reflecting domain-expert knowledge is continuously collected and fed back into the model; every step of keyword extraction is automated except review, which remains an essential manual step. Second, the system is implemented as a semi-supervised learning structure so that the supervised and unsupervised models compensate for each other's weaknesses. The small set of ground-truth keywords obtained from domain experts is learned by a supervised model, while the KeyBERT[7] model, which combines a large pretrained BERT[10] with similar-keyword extraction, extracts objective and diverse candidate keywords. The proposed system, which performs semi-supervised learning on the keywords extracted by both models, improved performance by about 30 to 40% over the existing unsupervised and supervised approaches. In addition, comparing performance across different amounts of training data and numbers of training epochs identified the optimal values. Finally, embedding performance was compared across pretrained BERT models to find one suited to Korean, and KoBERT[11], which performed about 10% better, was adopted.
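The candidate-extraction idea summarized in the abstract (score candidate keywords by their similarity to the document, then combine the strongest ones with expert-provided keywords) can be illustrated with a small sketch. This is not the thesis implementation: the `embed` function below is a toy character-bigram count used in place of a BERT or KoBERT embedding, and `rank_candidates`, `augment_keywords`, and the `min_score` threshold are hypothetical names and parameters chosen only for illustration.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for a sentence embedding (assumption, not BERT):
    counts of character bigrams with spaces removed."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + 2] for i in range(len(chars) - 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_candidates(document: str, candidates: list[str], top_n: int = 3):
    """Rank candidates by similarity to the document, highest first."""
    doc_vec = embed(document)
    scored = [(c, cosine(doc_vec, embed(c))) for c in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


def augment_keywords(expert_keywords: list[str], ranked, min_score: float):
    """Hypothetical augmentation step: keep the expert keywords and append
    candidates whose score reaches min_score and that are not already present."""
    merged = list(expert_keywords)
    for keyword, score in ranked:
        if score >= min_score and keyword not in merged:
            merged.append(keyword)
    return merged


if __name__ == "__main__":
    doc = "robot vacuum cleaner with strong suction for pet hair"
    cands = ["vacuum cleaner", "pet hair", "car insurance"]
    ranked = rank_candidates(doc, cands)
    print(ranked)
    print(augment_keywords(["robot vacuum"], ranked, min_score=0.3))
```

In a real pipeline the toy `embed` would be replaced by a pretrained language-model embedding, and the threshold would need to be tuned on reviewed data.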

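Table of Contents

Chapter 1 Introduction
1.1 Background and Purpose of the Study
1.2 Scope and Method of the Study
Chapter 2 Theoretical Background
2.1 Word2Vec
2.2 RNN
2.3 LSTM
2.4 Encoder-Decoder
2.5 Transformer
2.6 BERT
2.7 SKT KoBERT
2.8 KeyBERT
2.9 ROUGE
Chapter 3 Proposed System
3.1 Data Preprocessing
3.1.1 Data Characteristics
3.1.2 Morphological Analysis
3.1.3 Embedding
3.2 Interest Keyword Extraction Using Keyword Augmentation and Korean BERT
Chapter 4 Experiments and Results
4.1 Experimental Data
4.2 Evaluation Metrics
4.3 Model Evaluation
4.3.1 Embedding Performance Comparison
4.3.2 Performance Comparison of the Proposed System Models
Chapter 5 Conclusion
References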
