I run web-scraping scripts with Python, Selenium, and Geckodriver. The problem is that when I test the task with airflow test scrap_dag scrap_data 2020-01-01, everything works and the file I need is downloaded correctly. However, when I trigger a DAG run from the Airflow web UI, the task fails.
At first Airflow could not access geckodriver.log, so I changed the log path to an accessible location. The error then changed to Firefox not being found. After that, I realized that my executable file is not executable. I am still looking for possible solutions for any of these steps.
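To narrow down which of these steps is failing, it may help to log what the task process itself sees. The following is a minimal sketch using only the standard library; describe_executable is a hypothetical helper name, and the paths in the usage lines are placeholders rather than the real ones from this setup.

```python
import os
import shutil


def describe_executable(path_or_name):
    """Report how the current process sees an executable.

    `path_or_name` is either a full path (for example to geckodriver)
    or a bare command name (for example "firefox").
    """
    has_directory = bool(os.path.dirname(path_or_name))
    return {
        "requested": path_or_name,
        # Only meaningful when a full path was given; None for bare names.
        "file_exists": os.path.isfile(path_or_name) if has_directory else None,
        # None when the file is missing, lacks the execute bit,
        # or (for bare names) is not on this process's PATH.
        "resolved": shutil.which(path_or_name),
        "path_env": os.environ.get("PATH", ""),
    }


if __name__ == "__main__":
    # Placeholder arguments; substitute the real driver path.
    for target in ("/path/to/geckodriver", "firefox"):
        print(describe_executable(target))
```

Printing this from inside the PythonOperator callable and again from an interactive shell would show whether the two environments resolve the driver and the browser differently.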
The PythonOperator code, which works fine under airflow test, is the following:
from selenium import webdriver
options = webdriver.FirefoxOptions()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
options.add_argument('--window-size=1920x1080')
driver = webdriver.Firefox(executable_path='geckodriver path', log_path='log path', options=options)
driver.get(url)
EDIT: adding the error messages for each situation.
Setting the executable path and the log path: Expected browser binary location, but unable to find binary in default location, no 'moz:firefoxOptions.binary' capability provided, and no binary flag set on the command line.
I also tried this:
from selenium import webdriver
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary

options = webdriver.FirefoxOptions()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
options.add_argument('--window-size=1920x1080')
# Point Selenium at the Firefox binary explicitly
firefox_binary = FirefoxBinary('/usr/bin/firefox')
driver = webdriver.Firefox(firefox_binary=firefox_binary, log_path='log path', options=options)
driver.get(url)
With this I get: 'geckodriver' executable needs to be in PATH. I have already added it to PATH, but maybe I am doing something wrong. Using this approach also breaks the code when running airflow test.
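The PATH in that message is the PATH of the process that launches the driver, which may differ between an interactive shell and the process that runs the task. A minimal sketch, assuming a hypothetical helper `prepend_to_path` and a placeholder driver directory, of making a directory visible to executable lookup from inside the Python callable:

```python
import os
import shutil


def prepend_to_path(directory):
    """Put `directory` at the front of this process's PATH.

    Does nothing if the directory is already listed. Returns the
    resulting PATH string.
    """
    entries = [e for e in os.environ.get("PATH", "").split(os.pathsep) if e]
    if directory not in entries:
        os.environ["PATH"] = os.pathsep.join([directory] + entries)
    return os.environ["PATH"]


if __name__ == "__main__":
    # Placeholder directory; substitute the folder that holds geckodriver.
    prepend_to_path("/path/to/driver/dir")
    # Prints the resolved path, or None if the lookup still fails.
    print(shutil.which("geckodriver"))
```

The change only affects the current process and the children it starts afterwards, so it would need to run before the webdriver is created.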
Stack traces:
When running with the geckodriver path (the original implementation):
*** Reading local file: /home/observatorio/airflow/logs/energia_cg_dag/scrap_dados/2021-02-04T16:05:06.663528+00:00/1.log
[2021-02-04 13:05:17,474] {taskinstance.py:670} INFO - Dependencies all met for <TaskInstance: energia_cg_dag.scrap_dados 2021-02-04T16:05:06.663528+00:00 [queued]>
[2021-02-04 13:05:17,486] {taskinstance.py:670} INFO - Dependencies all met for <TaskInstance: energia_cg_dag.scrap_dados 2021-02-04T16:05:06.663528+00:00 [queued]>
[2021-02-04 13:05:17,486] {taskinstance.py:880} INFO -
--------------------------------------------------------------------------------
[2021-02-04 13:05:17,486] {taskinstance.py:881} INFO - Starting attempt 1 of 2
[2021-02-04 13:05:17,486] {taskinstance.py:882} INFO -
--------------------------------------------------------------------------------
[2021-02-04 13:05:17,497] {taskinstance.py:901} INFO - Executing <Task(PythonOperator): scrap_dados> on 2021-02-04T16:05:06.663528+00:00
[2021-02-04 13:05:17,503] {standard_task_runner.py:54} INFO - Started process 5151 to run task
[2021-02-04 13:05:17,542] {standard_task_runner.py:77} INFO - Running: ['airflow', 'run', 'energia_cg_dag', 'scrap_dados', '2021-02-04T16:05:06.663528+00:00', '--job_id', '2281', '--pool', 'default_pool', '--raw', '-sd', 'DAGS_FOLDER/energia_cg_dag.py', '--cfg_path', '/tmp/tmpgzryaz90']
[2021-02-04 13:05:17,543] {standard_task_runner.py:78} INFO - Job 2281: Subtask scrap_dados
[2021-02-04 13:05:17,565] {logging_mixin.py:112} INFO - Running %s on host %s <TaskInstance: energia_cg_dag.scrap_dados 2021-02-04T16:05:06.663528+00:00 [running]> ci-dobser-51091
[2021-02-04 13:05:19,534] {taskinstance.py:1150} ERROR - Message: Expected browser binary location, but unable to find binary in default location, no 'moz:firefoxOptions.binary' capability provided, and no binary flag set on the command line
Traceback (most recent call last):
File "/home/observatorio/anaconda3/envs/airflow/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 984, in _run_raw_task
result = task_copy.execute(context=context)
File "/home/observatorio/anaconda3/envs/airflow/lib/python3.7/site-packages/airflow/operators/python_operator.py", line 113, in execute
return_value = self.execute_callable()
File "/home/observatorio/anaconda3/envs/airflow/lib/python3.7/site-packages/airflow/operators/python_operator.py", line 118, in execute_callable
return self.python_callable(*self.op_args, **self.op_kwargs)
File "/home/observatorio/projetos/Chico-2.0/energia/capacidade_geracao/web_scraping.py", line 47, in scrap_dados
driver = webdriver.Firefox(executable_path='/home/observatorio/projetos/Chico-2.0/utils/drivers/geckodriver', log_path='/tmp/geckodriver.log', options=options)
File "/home/observatorio/anaconda3/envs/airflow/lib/python3.7/site-packages/selenium/webdriver/firefox/webdriver.py", line 174, in __init__
keep_alive=True)
File "/home/observatorio/anaconda3/envs/airflow/lib/python3.7/site-packages/selenium/webdriver/remote/webdriver.py", line 157, in __init__
self.start_session(capabilities, browser_profile)
File "/home/observatorio/anaconda3/envs/airflow/lib/python3.7/site-packages/selenium/webdriver/remote/webdriver.py", line 252, in start_session
response = self.execute(Command.NEW_SESSION, parameters)
File "/home/observatorio/anaconda3/envs/airflow/lib/python3.7/site-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
self.error_handler.check_response(response)
File "/home/observatorio/anaconda3/envs/airflow/lib/python3.7/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.SessionNotCreatedException: Message: Expected browser binary location, but unable to find binary in default location, no 'moz:firefoxOptions.binary' capability provided, and no binary flag set on the command line
[2021-02-04 13:05:19,570] {taskinstance.py:1194} INFO - Marking task as UP_FOR_RETRY. dag_id=energia_cg_dag, task_id=scrap_dados, execution_date=20210204T160506, start_date=20210204T160517, end_date=20210204T160519
[2021-02-04 13:05:22,470] {local_task_job.py:102} INFO - Task exited with return code 1
When passing the Firefox binary as an argument to the webdriver:
*** Reading local file: /home/observatorio/airflow/logs/energia_cg_dag/scrap_dados/2021-02-04T16:48:29.730315+00:00/1.log
[2021-02-04 13:48:38,335] {taskinstance.py:670} INFO - Dependencies all met for <TaskInstance: energia_cg_dag.scrap_dados 2021-02-04T16:48:29.730315+00:00 [queued]>
[2021-02-04 13:48:38,346] {taskinstance.py:670} INFO - Dependencies all met for <TaskInstance: energia_cg_dag.scrap_dados 2021-02-04T16:48:29.730315+00:00 [queued]>
[2021-02-04 13:48:38,346] {taskinstance.py:880} INFO -
--------------------------------------------------------------------------------
[2021-02-04 13:48:38,346] {taskinstance.py:881} INFO - Starting attempt 1 of 2
[2021-02-04 13:48:38,346] {taskinstance.py:882} INFO -
--------------------------------------------------------------------------------
[2021-02-04 13:48:38,355] {taskinstance.py:901} INFO - Executing <Task(PythonOperator): scrap_dados> on 2021-02-04T16:48:29.730315+00:00
[2021-02-04 13:48:38,357] {standard_task_runner.py:54} INFO - Started process 31798 to run task
[2021-02-04 13:48:38,371] {standard_task_runner.py:77} INFO - Running: ['airflow', 'run', 'energia_cg_dag', 'scrap_dados', '2021-02-04T16:48:29.730315+00:00', '--job_id', '2284', '--pool', 'default_pool', '--raw', '-sd', 'DAGS_FOLDER/energia_cg_dag.py', '--cfg_path', '/tmp/tmpz3klaxu5']
[2021-02-04 13:48:38,371] {standard_task_runner.py:78} INFO - Job 2284: Subtask scrap_dados
[2021-02-04 13:48:38,391] {logging_mixin.py:112} INFO - Running %s on host %s <TaskInstance: energia_cg_dag.scrap_dados 2021-02-04T16:48:29.730315+00:00 [running]> ci-dobser-51091
[2021-02-04 13:48:39,291] {taskinstance.py:1150} ERROR - Message: 'geckodriver' executable needs to be in PATH.
Traceback (most recent call last):
File "/home/observatorio/anaconda3/envs/airflow/lib/python3.7/site-packages/selenium/webdriver/common/service.py", line 76, in start
stdin=PIPE)
File "/home/observatorio/anaconda3/envs/airflow/lib/python3.7/subprocess.py", line 800, in __init__
restore_signals, start_new_session)
File "/home/observatorio/anaconda3/envs/airflow/lib/python3.7/subprocess.py", line 1551, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'geckodriver': 'geckodriver'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/observatorio/anaconda3/envs/airflow/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 984, in _run_raw_task
result = task_copy.execute(context=context)
File "/home/observatorio/anaconda3/envs/airflow/lib/python3.7/site-packages/airflow/operators/python_operator.py", line 113, in execute
return_value = self.execute_callable()
File "/home/observatorio/anaconda3/envs/airflow/lib/python3.7/site-packages/airflow/operators/python_operator.py", line 118, in execute_callable
return self.python_callable(*self.op_args, **self.op_kwargs)
File "/home/observatorio/projetos/Chico-2.0/energia/capacidade_geracao/web_scraping.py", line 48, in scrap_dados
driver = webdriver.Firefox(firefox_binary=firefox_binary, log_path='/tmp/geckodriver.log', options=options)
File "/home/observatorio/anaconda3/envs/airflow/lib/python3.7/site-packages/selenium/webdriver/firefox/webdriver.py", line 164, in __init__
self.service.start()
File "/home/observatorio/anaconda3/envs/airflow/lib/python3.7/site-packages/selenium/webdriver/common/service.py", line 83, in start
os.path.basename(self.path), self.start_error_message)
selenium.common.exceptions.WebDriverException: Message: 'geckodriver' executable needs to be in PATH.
[2021-02-04 13:48:39,310] {taskinstance.py:1194} INFO - Marking task as UP_FOR_RETRY. dag_id=energia_cg_dag, task_id=scrap_dados, execution_date=20210204T164829, start_date=20210204T164838, end_date=20210204T164839
[2021-02-04 13:48:43,340] {local_task_job.py:102} INFO - Task exited with return code 1
When passing both the driver path and the Firefox binary path:
*** Reading local file: /home/observatorio/airflow/logs/energia_cg_dag/scrap_dados/2021-02-04T17:27:33.734991+00:00/1.log
[2021-02-04 14:27:46,858] {taskinstance.py:670} INFO - Dependencies all met for <TaskInstance: energia_cg_dag.scrap_dados 2021-02-04T17:27:33.734991+00:00 [queued]>
[2021-02-04 14:27:46,876] {taskinstance.py:670} INFO - Dependencies all met for <TaskInstance: energia_cg_dag.scrap_dados 2021-02-04T17:27:33.734991+00:00 [queued]>
[2021-02-04 14:27:46,876] {taskinstance.py:880} INFO -
--------------------------------------------------------------------------------
[2021-02-04 14:27:46,876] {taskinstance.py:881} INFO - Starting attempt 1 of 2
[2021-02-04 14:27:46,876] {taskinstance.py:882} INFO -
--------------------------------------------------------------------------------
[2021-02-04 14:27:46,888] {taskinstance.py:901} INFO - Executing <Task(PythonOperator): scrap_dados> on 2021-02-04T17:27:33.734991+00:00
[2021-02-04 14:27:46,890] {standard_task_runner.py:54} INFO - Started process 58187 to run task
[2021-02-04 14:27:46,907] {standard_task_runner.py:77} INFO - Running: ['airflow', 'run', 'energia_cg_dag', 'scrap_dados', '2021-02-04T17:27:33.734991+00:00', '--job_id', '2302', '--pool', 'default_pool', '--raw', '-sd', 'DAGS_FOLDER/energia_cg_dag.py', '--cfg_path', '/tmp/tmpyetrn10i']
[2021-02-04 14:27:46,907] {standard_task_runner.py:78} INFO - Job 2302: Subtask scrap_dados
[2021-02-04 14:27:46,930] {logging_mixin.py:112} INFO - Running %s on host %s <TaskInstance: energia_cg_dag.scrap_dados 2021-02-04T17:27:33.734991+00:00 [running]> ci-dobser-51091
[2021-02-04 14:27:47,719] {logging_mixin.py:112} INFO -
[2021-02-04 14:27:47,720] {logging_mixin.py:112} INFO -
[2021-02-04 14:27:47,720] {logging_mixin.py:112} WARNING - [WDM] - ====== WebDriver manager ======
[2021-02-04 14:27:47,720] {logger.py:22} INFO - ====== WebDriver manager ======
[2021-02-04 14:27:48,073] {logging_mixin.py:112} WARNING - [WDM] - Driver [/home/observatorio/.wdm/drivers/geckodriver/linux64/v0.29.0/geckodriver] found in cache
[2021-02-04 14:27:48,073] {logger.py:12} INFO - Driver [/home/observatorio/.wdm/drivers/geckodriver/linux64/v0.29.0/geckodriver] found in cache
[2021-02-04 14:27:48,181] {taskinstance.py:1150} ERROR - Message: Process unexpectedly closed with status 127
Traceback (most recent call last):
File "/home/observatorio/anaconda3/envs/airflow/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 984, in _run_raw_task
result = task_copy.execute(context=context)
File "/home/observatorio/anaconda3/envs/airflow/lib/python3.7/site-packages/airflow/operators/python_operator.py", line 113, in execute
return_value = self.execute_callable()
File "/home/observatorio/anaconda3/envs/airflow/lib/python3.7/site-packages/airflow/operators/python_operator.py", line 118, in execute_callable
return self.python_callable(*self.op_args, **self.op_kwargs)
File "/home/observatorio/projetos/Chico-2.0/energia/capacidade_geracao/web_scraping.py", line 50, in scrap_dados
driver = webdriver.Firefox(executable_path=GeckoDriverManager().install(), firefox_binary=firefox_binary, log_path='/tmp/geckodriver.log', options=options)
File "/home/observatorio/anaconda3/envs/airflow/lib/python3.7/site-packages/selenium/webdriver/firefox/webdriver.py", line 174, in __init__
keep_alive=True)
File "/home/observatorio/anaconda3/envs/airflow/lib/python3.7/site-packages/selenium/webdriver/remote/webdriver.py", line 157, in __init__
self.start_session(capabilities, browser_profile)
File "/home/observatorio/anaconda3/envs/airflow/lib/python3.7/site-packages/selenium/webdriver/remote/webdriver.py", line 252, in start_session
response = self.execute(Command.NEW_SESSION, parameters)
File "/home/observatorio/anaconda3/envs/airflow/lib/python3.7/site-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
self.error_handler.check_response(response)
File "/home/observatorio/anaconda3/envs/airflow/lib/python3.7/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: Process unexpectedly closed with status 127
[2021-02-04 14:27:48,183] {taskinstance.py:1194} INFO - Marking task as UP_FOR_RETRY. dag_id=energia_cg_dag, task_id=scrap_dados, execution_date=20210204T172733, start_date=20210204T172746, end_date=20210204T172748
[2021-02-04 14:27:51,862] {local_task_job.py:102} INFO - Task exited with return code 1
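For context on the status code in this last trace: by POSIX shell convention, exit status 127 means a command could not be found, and it is also commonly seen when a program cannot load a shared library it depends on. Whether either of those is what happens to Firefox here is an assumption, not something the log confirms. A small sketch that reproduces the status with sh (the helper name `exit_status` and the missing command name are hypothetical):

```python
import subprocess


def exit_status(shell_command):
    """Run a command line through sh and return its exit status."""
    completed = subprocess.run(
        ["sh", "-c", shell_command],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return completed.returncode


if __name__ == "__main__":
    print(exit_status("true"))                              # 0
    print(exit_status("hypothetical-missing-command-xyz"))  # 127
```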
python web_scraping.py, and it works fine. - Liadz, 04.02.2021