Last week I cloned the CNTK repository and built it with nvidia-docker on a p2.8xlarge instance on AWS. Everything seems to work, except that I get no speedup from running on multiple GPUs with 1-bit SGD enabled. I am running the CMUDict Sequence2Sequence_Distributed.py example. Here is my transcript when I run it on a single GPU:
root@cb3aab88d4e9:/cntk/Examples/SequenceToSequence/CMUDict/Python# python Sequence2Sequence_Distributed.py
Selected GPU[0] Tesla K80 as the process wide default device.
ping [requestnodes (before change)]: 1 nodes pinging each other
ping [requestnodes (after change)]: 1 nodes pinging each other
requestnodes [MPIWrapperMpi]: using 1 out of 1 MPI nodes on a single host (1 requested); we (0) are in (participating)
ping [mpihelper]: 1 nodes pinging each other
-------------------------------------------------------------------
Build info:
Built time: Jun 2 2017 19:46:11
Last modified date: Fri Jun 2 19:21:14 2017
Build type: release
Build target: GPU
With 1bit-SGD: yes
With ASGD: yes
Math lib: mkl
CUDA_PATH: /usr/local/cuda
CUB_PATH: /usr/local/cub-1.4.1
CUDNN_PATH: /usr/local/cudnn
Build Branch: master
Build SHA1: 2bcdc9dff6dc6393813f6043d80e167fb31aed72
Built by Source/CNTK/buildinfo.h$$0 on 72cb11c66133
Build Path: /cntk
MPI distribution: Open MPI
MPI version: 1.10.3
-------------------------------------------------------------------
Finished Epoch[1 of 160]: [Training] loss = 4.234002 * 64, metric = 98.44% * 64 3.014s ( 21.2 samples/s);
Finished Epoch[2 of 160]: [Training] loss = 4.231473 * 71, metric = 85.92% * 71 1.013s ( 70.1 samples/s);
Finished Epoch[3 of 160]: [Training] loss = 4.227827 * 61, metric = 81.97% * 61 0.953s ( 64.0 samples/s);
Finished Epoch[4 of 160]: [Training] loss = 4.227088 * 68, metric = 86.76% * 68 0.970s ( 70.1 samples/s);
Finished Epoch[5 of 160]: [Training] loss = 4.222957 * 62, metric = 88.71% * 62 0.922s ( 67.2 samples/s);
Finished Epoch[6 of 160]: [Training] loss = 4.221479 * 63, metric = 84.13% * 63 0.950s ( 66.3 samples/s);
Here is the transcript when I run it on two GPUs:
root@cb3aab88d4e9:/cntk/Examples/SequenceToSequence/CMUDict/Python# mpiexec --allow-run-as-root --npernode 2 python Sequence2Sequence_Distributed.py -q 1
Selected GPU[0] Tesla K80 as the process wide default device.
Selected CPU as the process wide default device.
ping [requestnodes (before change)]: 2 nodes pinging each other
ping [requestnodes (before change)]: 2 nodes pinging each other
ping [requestnodes (after change)]: 2 nodes pinging each other
ping [requestnodes (after change)]: 2 nodes pinging each other
requestnodes [MPIWrapperMpi]: using 2 out of 2 MPI nodes on a single host (2 requested); we (0) are in (participating)
ping [mpihelper]: 2 nodes pinging each other
requestnodes [MPIWrapperMpi]: using 2 out of 2 MPI nodes on a single host (2 requested); we (1) are in (participating)
ping [mpihelper]: 2 nodes pinging each other
-------------------------------------------------------------------
Build info:
Built time: Jun 2 2017 19:46:11
Last modified date: Fri Jun 2 19:21:14 2017
Build type: release
Build target: GPU
With 1bit-SGD: yes
With ASGD: yes
Math lib: mkl
CUDA_PATH: /usr/local/cuda
CUB_PATH: /usr/local/cub-1.4.1
CUDNN_PATH: /usr/local/cudnn
Build Branch: master
Build SHA1: 2bcdc9dff6dc6393813f6043d80e167fb31aed72
Built by Source/CNTK/buildinfo.h$$0 on 72cb11c66133
Build Path: /cntk
MPI distribution: Open MPI
MPI version: 1.10.3
(the second rank prints an identical Build info block, which I have omitted)
-------------------------------------------------------------------
Here is the error message. Does it mean that the GPUs are not being used when I run the job as two MPI processes? How do I fix this?
NcclComm: disabled, at least one rank using CPU device
NcclComm: disabled, at least one rank using CPU device
You can see that the samples/s went down:
Finished Epoch[1 of 160]: [Training] loss = 4.233786 * 73, metric = 97.26% * 73 5.377s ( 13.6 samples/s);
Finished Epoch[1 of 160]: [Training] loss = 4.233786 * 73, metric = 97.26% * 73 5.877s ( 12.4 samples/s);
Finished Epoch[2 of 160]: [Training] loss = 4.232235 * 67, metric = 94.03% * 67 2.196s ( 30.5 samples/s);
Finished Epoch[2 of 160]: [Training] loss = 4.232235 * 67, metric = 94.03% * 67 2.197s ( 30.5 samples/s);
Finished Epoch[3 of 160]: [Training] loss = 4.229795 * 54, metric = 83.33% * 54 2.227s ( 24.2 samples/s);
Finished Epoch[3 of 160]: [Training] loss = 4.229795 * 54, metric = 83.33% * 54 2.227s ( 24.2 samples/s);
Finished Epoch[4 of 160]: [Training] loss = 4.229072 * 83, metric = 87.95% * 83 2.229s ( 37.2 samples/s);
Finished Epoch[4 of 160]: [Training] loss = 4.229072 * 83, metric = 87.95% * 83 2.229s ( 37.2 samples/s);
Finished Epoch[5 of 160]: [Training] loss = 4.227438 * 46, metric = 86.96% * 46 1.667s ( 27.6 samples/s);
Finished Epoch[5 of 160]: [Training] loss = 4.227438 * 46, metric = 86.96% * 46 1.666s ( 27.6 samples/s);
Finished Epoch[6 of 160]: [Training] loss = 4.225661 * 65, metric = 84.62% * 65 2.388s ( 27.2 samples/s);
Finished Epoch[6 of 160]: [Training] loss = 4.225661 * 65, metric = 84.62% * 65 2.388s ( 27.2 samples/s);
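My guess from the "Selected CPU as the process wide default device." line is that the second rank falls back to the CPU, which is also what disables NCCL. One thing I was thinking of trying is pinning each rank to its own GPU at the top of the script, before any CNTK ops are created. This is only a sketch under my assumptions: that OMPI_COMM_WORLD_LOCAL_RANK is the per-host rank variable Open MPI sets for each spawned process, and that try_set_default_device / gpu are the right CNTK 2.x device API calls.

```python
import os

def local_rank():
    # Open MPI exports OMPI_COMM_WORLD_LOCAL_RANK for each process it spawns
    # on a host; default to 0 when running without mpiexec.
    return int(os.environ.get("OMPI_COMM_WORLD_LOCAL_RANK", "0"))

def pin_gpu():
    # Map local rank i -> GPU i so no rank silently falls back to the CPU.
    try:
        import cntk as C
        C.device.try_set_default_device(C.device.gpu(local_rank()))
    except ImportError:
        pass  # CNTK not importable here; nothing to pin

pin_gpu()
```

Does that look like the right approach, or is there a flag to mpiexec / the example script that does this already?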