This post is about how to deploy nvidia-docker containers as Docker swarm services. If you have multiple GPU resources and need to allocate them to Docker services individually, the swarm orchestrator can automatically place services that need GPUs on nodes that have GPUs, without us needing to manually pin tasks to specific nodes. 👍
1. Prerequisites
# UPDATE (2019.07.26)
NVIDIA offers fully tested Docker images with CUDA, TensorFlow, and everything else needed to run deep learning applications. You can get the latest Docker image from here.
We don’t have to reinvent the wheel.
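For example, you could pull one of NVIDIA’s pre-built TensorFlow images from NGC and sanity-check it. The 19.07-py3 tag below is just an illustration; pick whatever release you need.
# Pull a tested TensorFlow image from NVIDIA NGC (tag is illustrative)
docker pull nvcr.io/nvidia/tensorflow:19.07-py3
# Verify the GPUs are visible inside the container
docker run --runtime=nvidia --rm nvcr.io/nvidia/tensorflow:19.07-py3 nvidia-smi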
Obviously, you need a node with a GPU, and nvidia-docker must be installed on it. If you previously had nvidia-docker 1.0 installed, you need to uninstall it and switch to nvidia-docker2 for swarm support. For example:
# uninstall nvidia-docker 1.0
docker volume ls -q -f driver=nvidia-docker | xargs -r -I{} -n1 docker ps -q -a -f volume={} | xargs -r docker rm -f
sudo apt-get purge -y nvidia-docker
# install nvidia-docker2
sudo apt-get -y install nvidia-docker2
To test that nvidia-docker2 is installed properly, run this command:
# Test nvidia-smi with the latest official CUDA image
docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi
If it is installed correctly, you will see output like the following:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 387.26                 Driver Version: 387.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:04:00.0 Off |                    0 |
| N/A   41C    P0    38W / 250W |      0MiB / 16152MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  Off  | 00000000:06:00.0 Off |                    0 |
| N/A   42C    P0    38W / 250W |      0MiB / 16152MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-PCIE...  Off  | 00000000:07:00.0 Off |                    0 |
| N/A   44C    P0    37W / 250W |      0MiB / 16152MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-PCIE...  Off  | 00000000:08:00.0 Off |                    0 |
| N/A   42C    P0    35W / 250W |      0MiB / 16152MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Now that we’re ready, let’s get started.
First, you have to find the identifier of the GPU on a specific node. You can find it and store it in an environment variable with this command:
GPU_ID=$(nvidia-smi -a | grep UUID | awk '{print substr($4,0,12)}')
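To see what this captures, you can inspect the raw UUID lines first; substr($4,0,12) keeps "GPU-" plus the first 8 hex digits of each UUID. The UUID below is made up:
nvidia-smi -a | grep UUID
#     GPU UUID                        : GPU-d1810711-9558-466a-8345-26ba09ba4c97
echo $GPU_ID
# GPU-d1810711
Note that on a node with several GPUs, this pipeline captures one shortened ID per GPU.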
Copy and paste the $GPU_ID you found in the previous step into the configuration file below.
[Service]
ExecStart=
ExecStart=/usr/bin/dockerd --default-runtime=nvidia --node-generic-resource gpu=${GPU_ID}
This configuration changes the default runtime to nvidia and makes the node advertise a generic resource of type gpu, identified by the value previously stored in $GPU_ID.
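The [Service] block above is in systemd drop-in form. One way to apply it, assuming a systemd-based distro (the GPU ID below is made up):
# Open (or create) a drop-in override file for docker.service
sudo systemctl edit docker
# ...then paste the [Service] block, substituting your actual ID, e.g.:
# ExecStart=/usr/bin/dockerd --default-runtime=nvidia --node-generic-resource gpu=GPU-d1810711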
After taking these steps, we need to reload the systemd configuration and restart Docker, because the dockerd configuration has changed.
sudo systemctl daemon-reload
sudo systemctl restart docker
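To confirm that the node now advertises its GPU to the swarm, you can inspect it from a manager node. The exact output shape may vary by Docker version, and the ID is made up:
docker node inspect self --format '{{json .Description.Resources.GenericResources}}'
# [{"NamedResourceSpec":{"Kind":"gpu","Value":"GPU-d1810711"}}]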
2. Deploy a Docker service using GPU resources
Now our cluster nodes are advertising to the swarm that they offer access to a GPU. The final step is to ensure that the service requests a GPU. We do this by adding --generic-resource "gpu=1" to the docker service create command.
The full command looks something like this:
docker service create --generic-resource "gpu=1" --replicas 10 \
--name nvidia-docker-swarm tensorflow-gpu
Now you get 10 replicas of the tensorflow-gpu image, each requesting one GPU.
Finally, the Docker swarm orchestrator will distribute your nvidia-docker containers onto nodes with GPU capability. 🎊
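You can check where the replicas actually landed with:
docker service ps nvidia-docker-swarm
Each task should be scheduled on a node that advertised a gpu generic resource.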
Note: only one Docker service replica can be assigned to a given GPU; there is no time sharing between services on a single GPU. This means you need as many GPUs as replicas that request one. If you have 5 nodes with one GPU each and you start 6 replicas of the service, 1 replica is held pending because of insufficient resources. In the next post I will cover setting up distributed TensorFlow with Docker swarm.