"The replica master 0 ran out of disk" during sklearn fit on Google Cloud ML Engine

I am trying to use Google Cloud ML Engine to run a sklearn LDA grid search over 500 MB of data (10,000 rows x 26,000 columns) to determine which number of topics best fits my topic-modeling project.

The maximum number of iterations for each CV fold was set to 100. After 47 iterations the job failed, citing the error below. I tried the BASIC tier, the STANDARD tier, and a CUSTOM tier with masterType=complex_model_m, and hit the same error every time.

I haven't been able to find much else on Stack Overflow about this problem, though I did come across this link, which seems somewhat related. A solution was provided there by the original asker:

Solved: This error was coming not because of storage space but because of the shared-memory tmpfs. The sklearn fit was consuming all the shared memory while training. Solution: setting the JOBLIB_TEMP_FOLDER environment variable to /tmp solved the problem.

I'm afraid I'm not entirely sure how to interpret or implement that solution.
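My best guess at implementing it, entirely unverified, is to point joblib's temporary folder at /tmp from inside the trainer module, before fit is ever called:

import os

# Unverified guess: redirect joblib's scratch files away from the
# shared-memory tmpfs and onto /tmp before any joblib-backed fit runs.
os.environ['JOBLIB_TEMP_FOLDER'] = '/tmp'

I don't know whether this is what the linked answer intended, or whether the variable has to be set earlier, before the worker processes are created.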

Here are the three lines that point to the source of the problem, with their imports:

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import GridSearchCV

lda = LatentDirichletAllocation(learning_method='batch', max_iter=100, n_jobs=-1, verbose=1)
gscv = GridSearchCV(lda, tuned_parameters, cv=3, verbose=10, n_jobs=1)  # tuned_parameters: the grid of topic counts, defined elsewhere
gscv.fit(data)

and I submit the job in the following form:

gcloud ai-platform jobs submit training $JOB_NAME \
        --package-path $TRAINER_PACKAGE_PATH \
        --module-name $MAIN_TRAINER_MODULE \
        --job-dir $JOB_DIR \
        --region $REGION \
        --config config.yaml
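For the CUSTOM-tier runs, config.yaml has to name the machine type. My exact file isn't reproduced here, but a minimal sketch selecting masterType=complex_model_m looks like this:

# Minimal sketch of a CUSTOM-tier config for ai-platform training jobs;
# only the fields mentioned above are shown.
trainingInput:
  scaleTier: CUSTOM
  masterType: complex_model_m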

Here is the absolutely hideous error message from the logs:

sklearn.externals.joblib.externals.loky.process_executor._RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/sklearn/externals/joblib/externals/loky/backend/queues.py", line 150, in _feed
    obj_ = dumps(obj, reducers=reducers)
  File "/usr/local/lib/python3.5/dist-packages/sklearn/externals/joblib/externals/loky/backend/reduction.py", line 243, in dumps
    dump(obj, buf, reducers=reducers, protocol=protocol)
  File "/usr/local/lib/python3.5/dist-packages/sklearn/externals/joblib/externals/loky/backend/reduction.py", line 236, in dump
    _LokyPickler(file, reducers=reducers, protocol=protocol).dump(obj)
  File "/usr/local/lib/python3.5/dist-packages/sklearn/externals/joblib/externals/cloudpickle/cloudpickle.py", line 284, in dump
    return Pickler.dump(self, obj)
  File "/usr/lib/python3.5/pickle.py", line 408, in dump
    self.save(obj)
  File "/usr/lib/python3.5/pickle.py", line 520, in save
    self.save_reduce(obj=obj, *rv)
  File "/usr/lib/python3.5/pickle.py", line 623, in save_reduce
    save(state)
  File "/usr/lib/python3.5/pickle.py", line 475, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python3.5/pickle.py", line 810, in save_dict
    self._batch_setitems(obj.items())
  File "/usr/lib/python3.5/pickle.py", line 836, in _batch_setitems
    save(v)
  File "/usr/lib/python3.5/pickle.py", line 520, in save
    self.save_reduce(obj=obj, *rv)
  File "/usr/lib/python3.5/pickle.py", line 623, in save_reduce
    save(state)
  File "/usr/lib/python3.5/pickle.py", line 475, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python3.5/pickle.py", line 810, in save_dict
    self._batch_setitems(obj.items())
  File "/usr/lib/python3.5/pickle.py", line 841, in _batch_setitems
    save(v)
  File "/usr/lib/python3.5/pickle.py", line 520, in save
    self.save_reduce(obj=obj, *rv)
  File "/usr/lib/python3.5/pickle.py", line 623, in save_reduce
    save(state)
  File "/usr/lib/python3.5/pickle.py", line 475, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python3.5/pickle.py", line 810, in save_dict
    self._batch_setitems(obj.items())
  File "/usr/lib/python3.5/pickle.py", line 836, in _batch_setitems
    save(v)
  File "/usr/lib/python3.5/pickle.py", line 475, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python3.5/pickle.py", line 770, in save_list
    self._batch_appends(obj)
  File "/usr/lib/python3.5/pickle.py", line 797, in _batch_appends
    save(tmp[0])
  File "/usr/lib/python3.5/pickle.py", line 475, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python3.5/pickle.py", line 725, in save_tuple
    save(element)
  File "/usr/lib/python3.5/pickle.py", line 475, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python3.5/pickle.py", line 740, in save_tuple
    save(element)
  File "/usr/lib/python3.5/pickle.py", line 481, in save
    rv = reduce(obj)
  File "/usr/local/lib/python3.5/dist-packages/sklearn/externals/joblib/_memmapping_reducer.py", line 339, in __call__
    for dumped_filename in dump(a, filename):
  File "/usr/local/lib/python3.5/dist-packages/sklearn/externals/joblib/numpy_pickle.py", line 502, in dump
    NumpyPickler(f, protocol=protocol).dump(value)
  File "/usr/lib/python3.5/pickle.py", line 408, in dump
    self.save(obj)
  File "/usr/local/lib/python3.5/dist-packages/sklearn/externals/joblib/numpy_pickle.py", line 289, in save
    wrapper.write_array(obj, self)
  File "/usr/local/lib/python3.5/dist-packages/sklearn/externals/joblib/numpy_pickle.py", line 104, in write_array
    pickler.file_handle.write(chunk.tostring('C'))
OSError: [Errno 28] No space left on device
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/root/.local/lib/python3.5/site-packages/experiment_trainer/experiment.py", line 87, in <module>
    gscv.fit(data)
  File "/usr/local/lib/python3.5/dist-packages/sklearn/model_selection/_search.py", line 722, in fit
    self._run_search(evaluate_candidates)
  File "/usr/local/lib/python3.5/dist-packages/sklearn/model_selection/_search.py", line 1191, in _run_search
    evaluate_candidates(ParameterGrid(self.param_grid))
  File "/usr/local/lib/python3.5/dist-packages/sklearn/model_selection/_search.py", line 711, in evaluate_candidates
    cv.split(X, y, groups)))
  File "/usr/local/lib/python3.5/dist-packages/sklearn/externals/joblib/parallel.py", line 917, in __call__
    if self.dispatch_one_batch(iterator):
  File "/usr/local/lib/python3.5/dist-packages/sklearn/externals/joblib/parallel.py", line 759, in dispatch_one_batch
    self._dispatch(tasks)
  File "/usr/local/lib/python3.5/dist-packages/sklearn/externals/joblib/parallel.py", line 716, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/usr/local/lib/python3.5/dist-packages/sklearn/externals/joblib/_parallel_backends.py", line 182, in apply_async
    result = ImmediateResult(func)
  File "/usr/local/lib/python3.5/dist-packages/sklearn/externals/joblib/_parallel_backends.py", line 549, in __init__
    self.results = batch()
  File "/usr/local/lib/python3.5/dist-packages/sklearn/externals/joblib/parallel.py", line 225, in __call__
    for func, args, kwargs in self.items]
  File "/usr/local/lib/python3.5/dist-packages/sklearn/externals/joblib/parallel.py", line 225, in <listcomp>
    for func, args, kwargs in self.items]
  File "/usr/local/lib/python3.5/dist-packages/sklearn/model_selection/_validation.py", line 526, in _fit_and_score
    estimator.fit(X_train, **fit_params)
  File "/usr/local/lib/python3.5/dist-packages/sklearn/decomposition/online_lda.py", line 570, in fit
    batch_update=True, parallel=parallel)
  File "/usr/local/lib/python3.5/dist-packages/sklearn/decomposition/online_lda.py", line 453, in _em_step
    parallel=parallel)
  File "/usr/local/lib/python3.5/dist-packages/sklearn/decomposition/online_lda.py", line 406, in _e_step
    for idx_slice in gen_even_slices(X.shape[0], n_jobs))
  File "/usr/local/lib/python3.5/dist-packages/sklearn/externals/joblib/parallel.py", line 930, in __call__
    self.retrieve()
  File "/usr/local/lib/python3.5/dist-packages/sklearn/externals/joblib/parallel.py", line 833, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/usr/local/lib/python3.5/dist-packages/sklearn/externals/joblib/_parallel_backends.py", line 521, in wrap_future_result
    return future.result(timeout=timeout)
  File "/usr/lib/python3.5/concurrent/futures/_base.py", line 398, in result
    return self.__get_result()
  File "/usr/lib/python3.5/concurrent/futures/_base.py", line 357, in __get_result
    raise self._exception
_pickle.PicklingError: Could not pickle the task to send it to the workers.

The most important part of which, I can only assume, is:

OSError: [Errno 28] No space left on device

The message The replica master 0 ran out of disk. was given as the official cause of the failure in the console.

I can run this on my desktop through all 100 iterations without issue. Any insight into what is going on here would be greatly appreciated. Thanks!


asked by S. Davis, 26.08.2019
Comments:
Can you try setting that variable using a wrapper script: stackoverflow.com/questions/52616287/ JOBLIB_TEMP_FOLDER=/tmp (NB: UNTESTED CODE)  -  gogasca, 02.09.2019
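A minimal sketch of that wrapper-script idea, also untested, with the module name taken from the traceback above:

#!/bin/bash
# Untested sketch: export the joblib temp dir, then launch the trainer module.
export JOBLIB_TEMP_FOLDER=/tmp
python3 -m experiment_trainer.experiment "$@"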