A GPU Scheduling Framework to Accelerate Hyper-Parameter Optimization in Deep Learning Clusters
- 주제(키워드) 도움말 hyper-parameter optimization , deep learning cluster , GPU scheduling , container
- 발행기관 MDPI
- 발행년도 2021
- 총서유형 Journal
- 본문언어 영어
초록/요약 도움말
This paper proposes Hermes, a container-based preemptive GPU scheduling framework for accelerating hyper-parameter optimization in deep learning (DL) clusters. Hermes accelerates hyper-parameter optimization by time-sharing between DL jobs and prioritizing jobs with more promising hyper-parameter combinations. Hermes's scheduling policy is grounded on the observation that good hyper-parameter combinations converge quickly in the early phases of training. By giving higher priority to fast-converging containers, Hermes's GPU preemption mechanism can accelerate training. This enables users to find optimal hyper-parameters faster without losing the progress of a container. We have implemented Hermes over Kubernetes and compared its performance against existing scheduling frameworks. Experiments show that Hermes reduces the time for hyper-parameter optimization up to 4.04 times against previously proposed scheduling policies such as FIFO, round-robin (RR), and SLAQ, with minimal time-sharing overhead.
more