
Chronica: A Data-Imbalance-Aware Scheduler for Distributed Deep Learning

Abstract

One of the major challenges in distributed deep learning is mitigating the straggler problem. Stragglers increase synchronization delay and significantly slow the convergence of deep learning models. We have empirically observed that imbalanced data samples worsen the straggler problem and further slow model convergence. However, existing approaches such as BOA and EP4DDL do not address the data imbalance issue when solving the straggler problem. To overcome both the straggler and data imbalance problems, we propose Chronica, a new data-imbalance-aware scheduler. Based on the data size and configuration of each worker, Chronica predicts the training time each worker requires. Chronica then equalizes training time across workers by alleviating both mini-batch-level and epoch-level straggler problems. Furthermore, to achieve fast convergence, Chronica proposes a new parameter synchronization scheme based on the weighted average of the training load on each worker. Our extensive evaluation using four deep learning models on eight Amazon EC2 GPU instances shows that Chronica achieves up to 2.55× speedup over BOA and EP4DDL.
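The two ideas named in the abstract, equalizing per-worker training time and synchronizing parameters by a load-weighted average, can be illustrated with a minimal sketch. This is not Chronica's actual algorithm, which the abstract does not specify. The function names `allocate_samples` and `weighted_average`, the use of measured throughput (samples per second) as the time predictor, and the use of sample counts as the "training load" are all assumptions made for illustration.

```python
from typing import Dict, List


def allocate_samples(throughput: Dict[str, float], total_samples: int) -> Dict[str, int]:
    """Split samples in proportion to worker throughput (hypothetical predictor).

    With share_w proportional to throughput_w, the predicted time
    share_w / throughput_w is roughly the same for every worker.
    """
    total_rate = sum(throughput.values())
    shares = {w: int(total_samples * r / total_rate) for w, r in throughput.items()}
    # Hand any rounding remainder to the fastest worker.
    fastest = max(throughput, key=throughput.get)
    shares[fastest] += total_samples - sum(shares.values())
    return shares


def weighted_average(params: Dict[str, List[float]], load: Dict[str, int]) -> List[float]:
    """Average parameter vectors, weighting each worker by its training load.

    Here "load" is assumed to be the number of samples the worker processed.
    """
    total = sum(load.values())
    dim = len(next(iter(params.values())))
    return [sum(params[w][i] * load[w] for w in params) / total for i in range(dim)]


if __name__ == "__main__":
    rates = {"slow_gpu": 100.0, "fast_gpu": 300.0}  # samples/second, made-up values
    shares = allocate_samples(rates, total_samples=400)
    local = {"slow_gpu": [1.0, 2.0], "fast_gpu": [3.0, 4.0]}
    print(shares)
    print(weighted_average(local, shares))
```

In this sketch the worker that processed more samples contributes proportionally more to the synchronized parameters, whereas a plain average would treat all workers equally regardless of how much data each one saw.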
