I'm having a problem with an Airflow server where any time I try to run a dag I get the following error:
FileNotFoundError: [Errno 2] No such file or directory: 'airflow': 'airflow'
All dags stay in a queued state unless I set them to a running state or mark the previous task as successful, and whether the previous task succeeded or failed doesn't seem to determine whether the next one fails.
Everything was working before yesterday (3/23), so this issue is fairly recent. It started after a restart of Airflow's webserver and scheduler. Those were stopped to troubleshoot a dag issue, and since nothing was changed the restart shouldn't be relevant, except that the issue started right after those two came back up. There have also been no changes to Airflow, the dags, or the database (aside from Airflow's own use of it) between the previous startup of the two services and the restart.
I'm not sure if this is the cause of the dags and tasks staying in a queued state or if the problem lies elsewhere, but it's the best information I have to start with.
I've included everything I can think of.
The system is Red Hat Enterprise Linux 8 with Airflow 2.2.4, using MySQL 8 as the database. It has Python 3.6; I'm not sure if that is the root issue. Airflow runs as an airflow user that has sudo access.
The scheduler is running the LocalExecutor with the StandardTaskRunner. I have tried combinations of LocalExecutor, SequentialExecutor, StandardTaskRunner, and CgroupTaskRunner. I don't have Celery, Dask, or Kubernetes set up, as this is an initial test of Airflow.
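Since the error names a bare 'airflow' command (and the task logs below show the task runner launching exactly that), one check I can think of is whether that executable is even resolvable in the environment the scheduler gets from systemd. A minimal diagnostic sketch, assuming it is run as the airflow user with the same environment the service sees:

import os
import shutil

# If this prints None, the bare 'airflow' command used by the task runner
# cannot be found on PATH, which would match the FileNotFoundError above.
# (Assumption: run as the airflow user under the same environment systemd
# provides to the scheduler.)
print("PATH =", os.environ.get("PATH"))
print("airflow resolves to:", shutil.which("airflow"))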
This is what shows up in the dag_process_manager.log:
DAG File Processing Stats
File Path PID Runtime # DAGs # Errors Last Runtime Last Run
------------------------------------------------------------------------------------------------ ------ --------- -------- ---------- -------------- -------------------
/opt/airflow/dags/e_transaction_scheduler.py 1 0 0.55s 2022-03-24T16:59:49
/opt/airflow/dags/authdb_transaction_scheduler.py 1 0 0.90s 2022-03-24T16:59:49
/opt/airflow/dags/c_transaction_scheduler.py 1 0 0.62s 2022-03-24T16:59:49
/home/airflow/.local/lib/python3.6/site-packages/airflow/smart_sensor_dags/smart_sensor_group.py 5 0 0.70s 2022-03-24T16:59:49
/opt/airflow/dags/edge_transaction_history.py 316700 0.01s 13 0 0.25s 2022-03-24T16:59:51
================================================================================
[2022-03-24 09:59:56,090] {manager.py:663} INFO - Searching for files in /opt/airflow/dags
[2022-03-24 09:59:56,095] {manager.py:666} INFO - There are 6 files in /opt/airflow/dags
[2022-03-24 09:59:59,045] {manager.py:784} INFO -
================================================================================
The scheduler log for this dag just contains this over and over. Right now I'm using a single dag for troubleshooting.
[2022-03-24 10:47:24,444] {processor.py:163} INFO - Started process (PID=336612) to work on /opt/airflow/dags/authdb_transaction_scheduler.py
[2022-03-24 10:47:24,446] {processor.py:642} INFO - Processing file /opt/airflow/dags/authdb_transaction_scheduler.py for tasks to queue
[2022-03-24 10:47:24,447] {logging_mixin.py:109} INFO - [2022-03-24 10:47:24,447] {dagbag.py:500} INFO - Filling up the DagBag from /opt/airflow/dags/authdb_transaction_scheduler.py
[2022-03-24 10:47:24,489] {processor.py:654} INFO - DAG(s) dict_keys(['authdb_transaction_scheduler']) retrieved from /opt/airflow/dags/authdb_transaction_scheduler.py
[2022-03-24 10:47:24,506] {logging_mixin.py:109} INFO - [2022-03-24 10:47:24,505] {dag.py:2398} INFO - Sync 1 DAGs
[2022-03-24 10:47:24,533] {logging_mixin.py:109} INFO - [2022-03-24 10:47:24,533] {dag.py:2937} INFO - Setting next_dagrun for authdb_transaction_scheduler to None
[2022-03-24 10:47:24,542] {processor.py:171} INFO - Processing /opt/airflow/dags/authdb_transaction_scheduler.py took 0.102 seconds
[2022-03-24 10:47:29,751] {processor.py:163} INFO - Started process (PID=336647) to work on /opt/airflow/dags/authdb_transaction_scheduler.py
[2022-03-24 10:47:29,752] {processor.py:642} INFO - Processing file /opt/airflow/dags/authdb_transaction_scheduler.py for tasks to queue
[2022-03-24 10:47:29,753] {logging_mixin.py:109} INFO - [2022-03-24 10:47:29,753] {dagbag.py:500} INFO - Filling up the DagBag from /opt/airflow/dags/authdb_transaction_scheduler.py
[2022-03-24 10:47:29,794] {processor.py:654} INFO - DAG(s) dict_keys(['authdb_transaction_scheduler']) retrieved from /opt/airflow/dags/authdb_transaction_scheduler.py
[2022-03-24 10:47:29,807] {logging_mixin.py:109} INFO - [2022-03-24 10:47:29,806] {dag.py:2398} INFO - Sync 1 DAGs
[2022-03-24 10:47:29,830] {logging_mixin.py:109} INFO - [2022-03-24 10:47:29,830] {dag.py:2937} INFO - Setting next_dagrun for authdb_transaction_scheduler to None
[2022-03-24 10:47:29,838] {processor.py:171} INFO - Processing /opt/airflow/dags/authdb_transaction_scheduler.py took 0.090 seconds
The contents of the wait_for_previous_run task log:
[2022-03-24 10:04:11,960] {taskinstance.py:1037} INFO - Dependencies all met for <TaskInstance: authdb_transaction_scheduler.wait_for_previous_run manual__2022-03-24T17:04:09.925464+00:00 [queued]>
[2022-03-24 10:04:11,960] {taskinstance.py:1243} INFO -
--------------------------------------------------------------------------------
[2022-03-24 10:04:11,960] {taskinstance.py:1244} INFO - Starting attempt 1 of 154
[2022-03-24 10:04:11,960] {taskinstance.py:1245} INFO -
--------------------------------------------------------------------------------
[2022-03-24 10:04:11,973] {taskinstance.py:1264} INFO - Executing <Task(SqlSensor): wait_for_previous_run> on 2022-03-24 17:04:09.925464+00:00
[2022-03-24 10:04:11,975] {base_task_runner.py:136} INFO - Running on host: airflow.xyz.com
[2022-03-24 10:04:11,976] {base_task_runner.py:137} INFO - Running: ['airflow', 'tasks', 'run', 'authdb_transaction_scheduler', 'wait_for_previous_run', 'manual__2022-03-24T17:04:09.925464+00:00', '--job-id', '61611', '--raw', '--subdir', 'DAGS_FOLDER/authdb_transaction_scheduler.py', '--cfg-path', '/tmp/tmpuei73qnh', '--error-file', '/tmp/tmpvqxdj593']
The log from the next task, extract_merchant_account (it just keeps getting rescheduled after I mark the wait task as successful):
[2022-03-24 10:23:06,831] {taskinstance.py:1037} INFO - Dependencies all met for <TaskInstance: authdb_transaction_scheduler.extract_merchant_account manual__2022-03-24T17:04:09.925464+00:00 [queued]>
[2022-03-24 10:23:06,831] {taskinstance.py:1243} INFO -
--------------------------------------------------------------------------------
[2022-03-24 10:23:06,831] {taskinstance.py:1244} INFO - Starting attempt 4 of 4
[2022-03-24 10:23:06,831] {taskinstance.py:1245} INFO -
--------------------------------------------------------------------------------
[2022-03-24 10:23:06,844] {taskinstance.py:1264} INFO - Executing <Task(SSHOperator): extract_merchant_account> on 2022-03-24 17:04:09.925464+00:00
[2022-03-24 10:23:06,846] {base_task_runner.py:136} INFO - Running on host: airflow.xyz.com
[2022-03-24 10:23:06,846] {base_task_runner.py:137} INFO - Running: ['airflow', 'tasks', 'run', 'authdb_transaction_scheduler', 'extract_merchant_account', 'manual__2022-03-24T17:04:09.925464+00:00', '--job-id', '61731', '--raw', '--subdir', 'DAGS_FOLDER/authdb_transaction_scheduler.py', '--cfg-path', '/tmp/tmpe8za_ayj', '--error-file', '/tmp/tmpeb3agndg']
Here's the dag file authdb_transaction_scheduler.py:
from airflow import DAG
#from airflow.contrib.operators.ssh_operator import SSHOperator
from airflow.providers.ssh.operators.ssh import SSHOperator
from airflow.operators.dummy_operator import DummyOperator
from airflow.sensors.sql_sensor import SqlSensor
from airflow.operators.python_operator import PythonOperator
from datetime import datetime  # needed for start_date in default_args
import time

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2022, 3, 23),
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False
}

with DAG('authdb_transaction_scheduler', schedule_interval=None, catchup=False, default_args=default_args) as dag:
    wait_for_previous_run = SqlSensor(
        task_id='wait_for_previous_run',
        sql='''
            select poll from ( select poll from( select case when state = "running" then 0 else 1 end as poll, execution_date from dag_run
            where dag_id = "{{ dag.dag_id }}" and execution_date < REPLACE( SUBSTRING_INDEX("{{execution_date}}", "+", 1), 'T', ' ' )
            group by execution_date, state order by execution_date desc limit 1 ) a
            ) a union all ( SELECT 1 as poll FROM (
            SELECT count(1) AS count FROM dag_run WHERE dag_id = "{{ dag.dag_id }}" and execution_date < REPLACE( SUBSTRING_INDEX("{{execution_date}}", "+", 1), 'T', ' ' ) ) a
            WHERE count = 0
            ) order by poll desc
        ''',
        poke_interval=300,
        retries=153,
        conn_id='airflow_dag',
        dag=dag
    )

    # delay_task = PythonOperator(
    #     task_id="delay_task",
    #     dag=dag,
    #     python_callable=lambda: time.sleep(120)
    # )

    merchant_account = """
    cd /script_location/scripts/;
    bash extract_mongoDB_data.sh mongoDB_data_extraction_config_dev.csv AuthDB.MERCHANTACCOUNTS
    """
    extract_merchant_account = SSHOperator(
        ssh_conn_id='ssh_extract',
        task_id='extract_merchant_account',
        command=merchant_account,
        dag=dag
    )

    ingest_merchant_account = """
    cd /script_location/scripts/;
    bash data_ingestion.sh AuthDB.merchant_accounts.?????????????????.json N;
    """
    ingestion_merchant_account = SSHOperator(
        ssh_conn_id='ssh_load',
        task_id='ingestion_merchant_account',
        command=ingest_merchant_account,
        dag=dag
    )

    loading_merchant_account = """
    cd /script_location/scripts;
    bash trigger_bqload.sh AuthDB.merchant_accounts append;
    """
    load_merchant_account = SSHOperator(
        ssh_conn_id='ssh_load',
        task_id='load_merchant_account',
        command=loading_merchant_account,
        dag=dag
    )

    wait_for_previous_run >> extract_merchant_account >> ingestion_merchant_account >> load_merchant_account
Both the webserver and the scheduler have been set up to be run by systemd. Here's the service file for airflow-scheduler:
#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
[Unit]
Description=Airflow scheduler daemon
After=network.target mysqld.service
Wants=mysqld.service
[Service]
EnvironmentFile=/etc/sysconfig/airflow
User=airflow
Group=airflow
Type=simple
ExecStart=/home/airflow/.local/bin/airflow scheduler \
--pid $AIRFLOW_SCHDLRPID \
--log-file $AIRFLOW_SCHDLRLOG \
--subdir $AIRFLOW_DAGS
ExecStop=/bin/kill '/bin/cat $AIRFLOW_SCHDLRPID'
Restart=always
RestartSec=5s
[Install]
WantedBy=multi-user.target
Here's the airflow-webserver:
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
[Unit]
Description=Airflow webserver daemon
After=network.target mysqld.service airflow-scheduler.service
Wants=mysqld.service airflow-scheduler.service
[Service]
EnvironmentFile=/etc/sysconfig/airflow
User=airflow
Group=airflow
Type=simple
ExecStart=/home/airflow/.local/bin/airflow webserver \
--port 8443 \
--ssl-cert $AIRFLOW_SSLCRT \
--ssl-key $AIRFLOW_SSLKEY \
--pid $AIRFLOW_PID \
--log-file $AIRFLOW_LOGFILE \
--access-logfile $AIRFLOW_LOGACCS \
--error-logfile $AIRFLOW_LOGERR
ExecStop=/bin/kill '/bin/cat $AIRFLOW_PID'
Restart=on-failure
RestartSec=10s
[Install]
WantedBy=multi-user.target
Here are the environment variables that are set:
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
# This file is the environment file for Airflow. Put this file in /etc/sysconfig/airflow per default
# configuration of the systemd unit files.
#
AIRFLOW_HOME=/opt/airflow
AIRFLOW_PID=/run/airflow/webserver.pid
AIRFLOW_CONFIG=/home/airflow/airflow/airflow.cfg
AIRFLOW_SSLCRT=/ssl_location/xyz.com.crt
AIRFLOW_SSLKEY=/ssl_location/xyz.com.key
AIRFLOW_SSLCA=/ssl_location/xyz-ca.crt
AIRFLOW_LOGACCS=/var/log/airflow/webserver-access.log
AIRFLOW_LOGERR=/var/log/airflow/webserver-error.log
AIRFLOW_LOGFILE=/var/log/airflow/webserver.log
AIRFLOW_SCHDLRPID=/run/airflow/scheduler.pid
AIRFLOW_SCHDLRLOG=/var/log/airflow/scheduler/scheduler.log
AIRFLOW_DAGS=/opt/airflow/dags
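Note that this file does not export PATH, so the services only see systemd's default path. I'm not sure whether that matters here, but for reference, if it did, an added line might look like this (the directory list is an assumption, not something currently in the file):

# Hypothetical addition: make sure the directory that holds the airflow
# executable (a pip --user install in this setup) is on PATH for the services.
PATH=/home/airflow/.local/bin:/usr/local/bin:/usr/bin:/bin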
Here is the output from running airflow config list:
[core]
dags_folder = /opt/airflow/dags
hostname_callable = socket.getfqdn
default_timezone = PST8PDT
executor = LocalExecutor
sql_alchemy_conn = mysql+mysqldb://airflowdbuser:some_password@localhost:3306/airflow_db
sql_engine_encoding = utf-8
sql_alchemy_pool_enabled = True
sql_alchemy_pool_size = 5
sql_alchemy_max_overflow = -1
sql_alchemy_pool_recycle = 1800
sql_alchemy_pool_pre_ping = True
sql_alchemy_schema = airflow_db
parallelism = 200
max_active_tasks_per_dag = 100
dags_are_paused_at_creation = True
max_active_runs_per_dag = 4
load_examples = False
load_default_connections = False
plugins_folder = /opt/airflow/plugins
execute_tasks_new_python_interpreter = False
fernet_key = some_fernet_key
donot_pickle = True
dagbag_import_timeout = 30.0
dagbag_import_error_tracebacks = True
dagbag_import_error_traceback_depth = 5
dag_file_processor_timeout = 50
task_runner = StandardTaskRunner
default_impersonation = airflow
security =
unit_test_mode = False
enable_xcom_pickling = False
killed_task_cleanup_time = 60
dag_run_conf_overrides_params = True
dag_discovery_safe_mode = True
default_task_retries = 3
default_task_weight_rule = downstream
min_serialized_dag_update_interval = 5
min_serialized_dag_fetch_interval = 5
max_num_rendered_ti_fields_per_task = 30
check_slas = True
xcom_backend = airflow.models.xcom.BaseXCom
lazy_load_plugins = False
lazy_discover_providers = False
max_db_retries = 3
hide_sensitive_var_conn_fields = True
sensitive_var_conn_names =
default_pool_task_slot_count = 128
sql_engine_collation_for_ids = utf8mb4_unicode_ci
max_queued_runs_per_dag = 50
[logging]
base_log_folder = /var/log/airflow
remote_logging = False
remote_log_conn_id =
google_key_path =
remote_base_log_folder =
encrypt_s3_logs = False
logging_level = INFO
fab_logging_level = INFO
logging_config_class =
colored_console_log = True
colored_log_format = [%(blue)s%(asctime)s%(reset)s] {%(blue)s%(filename)s:%(reset)s%(lineno)d} %(log_color)s%(levelname)s%(reset)s - %(log_color)s%(message)s%(reset)s
colored_formatter_class = airflow.utils.log.colored_log.CustomTTYColoredFormatter
log_format = [%(asctime)s] {%(filename)s:%(lineno)d} %(levelname)s - %(message)s
simple_log_format = %(asctime)s %(levelname)s - %(message)s
task_log_prefix_template = {{ti.dag_id}}-{{ti.task_id}}-{{execution_date}}-{{try_number}}
log_filename_template = {{ ti.dag_id }}/{{ ti.task_id }}/{{ ts }}/{{ try_number }}.log
log_processor_filename_template = {{ filename }}.log
dag_processor_manager_log_location = /var/log/airflow/dag_processor_manager/dag_processor_manager.log
task_log_reader = task
extra_logger_names =
worker_log_server_port = 8793
[metrics]
statsd_on = False
statsd_host = localhost
statsd_port = 8125
statsd_prefix = airflow
statsd_allow_list =
stat_name_handler =
statsd_datadog_enabled = False
statsd_datadog_tags =
[secrets]
backend =
backend_kwargs =
[cli]
api_client = airflow.api.client.local_client
endpoint_url = http://airflow.xyz.com:8443
[debug]
fail_fast = True
[api]
enable_experimental_api = False
auth_backend = airflow.api.auth.backend.default
maximum_page_limit = 100
fallback_page_limit = 100
google_oauth2_audience =
google_key_path =
access_control_allow_headers =
access_control_allow_methods =
access_control_allow_origins =
access_control_allow_origin =
[lineage]
backend =
[atlas]
sasl_enabled = False
host =
port = 21000
username =
password =
[operators]
default_owner = airflow
default_cpus = 2
default_ram = 512
default_disk = 512
default_gpus = 0
default_queue = default
allow_illegal_arguments = True
[hive]
default_hive_mapred_queue =
mapred_job_name_template =
[webserver]
base_url = http://airflow.xyz.com:8443
default_ui_timezone = PST8PDT
web_server_host = 0.0.0.0
web_server_port = 8443
web_server_ssl_cert = /ssl_location/xyz.com.crt
web_server_ssl_key = /ssl_location/xyz.com.key
session_backend = database
web_server_master_timeout = 120
web_server_worker_timeout = 120
worker_refresh_batch_size = 1
worker_refresh_interval = 6000
reload_on_plugin_change = True
secret_key = a_secret_key
workers = 4
worker_class = sync
access_logfile = /var/log/airflow/webserver-access.log
error_logfile = /var/log/airflow/webserver-error.log
access_logformat =
expose_config = True
expose_hostname = True
expose_stacktrace = True
dag_default_view = tree
dag_orientation = LR
log_fetch_timeout_sec = 5
log_fetch_delay_sec = 2
log_auto_tailing_offset = 30
log_animation_speed = 1000
hide_paused_dags_by_default = False
page_size = 100
navbar_color = #fff
default_dag_run_display_number = 30
enable_proxy_fix = False
proxy_fix_x_for = 1
proxy_fix_x_proto = 1
proxy_fix_x_host = 1
proxy_fix_x_port = 1
proxy_fix_x_prefix = 1
cookie_secure = False
cookie_samesite = Lax
default_wrap = False
x_frame_enabled = False
show_recent_stats_for_completed_runs = True
update_fab_perms = True
session_lifetime_minutes = 43200
auto_refresh_interval = 3
[email]
email_backend = airflow.utils.email.send_email_smtp
email_conn_id = smtp_default
default_email_on_retry = True
default_email_on_failure = True
[smtp]
smtp_host = mail.xyz.com
smtp_starttls = False
smtp_ssl = False
smtp_port = 25
smtp_mail_from = airflow@xyz.com
smtp_timeout = 30
smtp_retry_limit = 5
[sentry]
sentry_on = False
sentry_dsn =
[celery_kubernetes_executor]
kubernetes_queue = kubernetes
[celery]
celery_app_name = airflow.executors.celery_executor
worker_concurrency = 16
worker_umask = 0o077
broker_url = mysql+mysqldb://airflowdbuser:some_password@localhost:3306/airflow_db
result_backend = mysql+mysqldb://airflowdbuser:some_password@localhost:3306/airflow_db
flower_host = 0.0.0.0
flower_url_prefix =
flower_port = 5555
flower_basic_auth =
sync_parallelism = -1
celery_config_options = airflow.config_templates.default_celery.DEFAULT_CELERY_CONFIG
ssl_active = False
ssl_key =
ssl_cert =
ssl_cacert =
pool = prefork
operation_timeout = 10.0
task_track_started = True
task_adoption_timeout = 600
task_publish_max_retries = 3
worker_precheck = True
worker_autoscale =
[celery_broker_transport_options]
[dask]
cluster_address = 127.0.0.1:8786
tls_ca =
tls_cert =
tls_key =
[scheduler]
job_heartbeat_sec = 5
scheduler_heartbeat_sec = 5
num_runs = -1
scheduler_idle_sleep_time = 2
min_file_process_interval = 5
dag_dir_list_interval = 60
print_stats_interval = 5
pool_metrics_interval = 5.0
scheduler_health_check_threshold = 15
orphaned_tasks_check_interval = 300.0
child_process_log_directory = /opt/airflow/log/scheduler
scheduler_zombie_task_threshold = 300
catchup_by_default = False
max_tis_per_query = 512
use_row_level_locking = True
max_dagruns_to_create_per_loop = 20
max_dagruns_per_loop_to_schedule = 20
schedule_after_task_execution = True
parsing_processes = 4
file_parsing_sort_mode = modified_time
use_job_schedule = False
allow_trigger_in_future = False
dependency_detector = airflow.serialization.serialized_objects.DependencyDetector
trigger_timeout_check_interval = 5
[triggerer]
default_capacity = 1000
[kerberos]
ccache = /tmp/airflow_krb5_ccache
principal = airflow
reinit_frequency = 3600
kinit_path = kinit
keytab = airflow.keytab
forwardable = True
include_ip = True
[github_enterprise]
api_rev = v3
[elasticsearch]
host =
log_id_template = {{dag_id}}-{{task_id}}-{{execution_date}}-{{try_number}}
end_of_log_mark = end_of_log
frontend =
write_stdout = False
json_format = False
json_fields = asctime, filename, lineno, levelname, message
host_field = host
offset_field = offset
[elasticsearch_configs]
use_ssl = False
verify_certs = True
[kubernetes]
pod_template_file =
worker_container_repository =
worker_container_tag =
namespace = default
delete_worker_pods = True
delete_worker_pods_on_failure = False
worker_pods_creation_batch_size = 1
multi_namespace_mode = False
in_cluster = True
kube_client_request_args =
delete_option_kwargs =
enable_tcp_keepalive = True
tcp_keep_idle = 120
tcp_keep_intvl = 30
tcp_keep_cnt = 6
verify_ssl = False
worker_pods_pending_timeout = 300
worker_pods_pending_timeout_check_interval = 120
worker_pods_queued_check_interval = 60
worker_pods_pending_timeout_batch_size = 100
[smart_sensor]
use_smart_sensor = True
shard_code_upper_limit = 10000
shards = 5
sensors_enabled = NamedHivePartitionSensor
[code_editor]
enabled = True
git_enabled = True
git_cmd = /bin/git
git_default_args = -c color.ui=true
git_init_repo = False
root_directory = /opt/airflow/dags
line_length = 120
string_normalization = True
mount_name = Airflow_Logs
mount_path = /opt/airflow/logs
I am trying to export a .har file using Firefox, Selenium, BrowserMob Proxy, and Python, using the code below.
from browsermobproxy import Server
from selenium import webdriver

bmp_loc = "/Users/project/browsermob-proxy-2.1.4/bin/browsermob-proxy"
server = Server(bmp_loc)
server.start()
proxy = server.create_proxy(params={'trustAllServers': 'true'})
selenium_proxy = proxy.selenium_proxy()
caps = webdriver.DesiredCapabilities.FIREFOX
caps['marionette'] = False
proxy_settings = {
"proxyType": "MANUAL",
"httpProxy": selenium_proxy.httpProxy,
"sslProxy": selenium_proxy.sslProxy,
}
caps['proxy'] = proxy_settings
driver = webdriver.Firefox(desired_capabilities=caps)
proxy.new_har("generated_har",options={'captureHeaders': True})
driver.get("someurl")
browser_logs = proxy.har
I am interested in getting _transferSize from the .har file to perform some analysis, but I am unable to get it; instead the entries end with 'comment':
"redirectURL": "", "headersSize": 1023, "bodySize": 38, "comment": ""
whereas when I manually download the .har file using Firefox, I do get _transferSize.
Versions used:
browsermob_proxy==2.1.4
selenium==4.0.0
Can anybody please help me to resolve this?
You can get _transferSize by adding headersSize and bodySize from the har file itself.
import json
import time
from collections import defaultdict

# Assumption: `webdriver` here comes from selenium-wire (it accepts
# seleniumwire_options and exposes driver.har); `proxy` is assumed to be
# configured elsewhere in the original script.
from seleniumwire import webdriver

urls = ["https://google.com"]
for ur in urls:
    server = proxy.start_server()
    client = proxy.start_client()
    client.new_har("demo.com")
    # print(client.proxy)
    options = webdriver.ChromeOptions()
    options.add_argument("--disk-cache-size=0")
    options = {
        'enable_har': True
    }
    driver = webdriver.Chrome(seleniumwire_options=options)
    driver.request_interceptor = proxy.interceptor
    driver.get(ur)
    time.sleep(40)

    row_list = []
    json_dictionary = json.loads(driver.har)
    repeat_url_list = []
    repeat_urls = defaultdict(lambda: [])
    resp_size = 0
    count_url = 0
    url_time = 0
    status_list = []
    status_url = defaultdict(lambda: [])
    a_list = []

    with open("network_log2.har", "w", encoding="utf-8") as f:
        # f.write(json.dumps(driver.har))
        for i in json_dictionary['log']['entries']:
            f.write(str(i))
            f.write("\n")
            url = i['request']['url']
            a_list.append(url)
            timing = i['time']
            if timing > 2000:
                timing = round(timing/2000, 1)
                url_time += 1
            status = i['response']['status']
            if status in status_list:
                status_url[status] = status_url[status] + 1
            else:
                status_url[status] = 1
                status_list.append(status)
            size = i['response']['headersSize'] + i['response']['bodySize']
            if size//1000 > 500:
                resp_size += 1
            if url in repeat_url_list:
                repeat_urls[url] = 1
            else:
                repeat_url_list.append(url)
    rurl_count = len(repeat_urls)
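To make the _transferSize part concrete, here is a minimal sketch of the idea, assuming the full HAR was saved as JSON (for example via the commented-out json.dumps(driver.har) line) rather than the per-entry str() dump:

import json

# Approximate _transferSize for each entry as headersSize + bodySize.
# A size of -1 means the value was not recorded in the HAR, so skip it.
with open("network_log2.har", encoding="utf-8") as f:
    har = json.load(f)

for entry in har["log"]["entries"]:
    resp = entry["response"]
    if resp["headersSize"] >= 0 and resp["bodySize"] >= 0:
        approx_transfer_size = resp["headersSize"] + resp["bodySize"]
        print(entry["request"]["url"], approx_transfer_size)

This stays an approximation: the browser's own _transferSize also reflects compression and connection reuse, which a plain headersSize + bodySize sum does not capture.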
I have a script which has to build a new .xml file from an .xml file at a URL. When I run the script on my PC, everything is okay, but when I try to run it on my server it throws urllib.error.URLError: <urlopen error [Errno 110] Connection timed out>
I don't know why, and I haven't found a good source where this problem is described. I want this script to run on my server automatically as a cron job, but I can't get it to work.
Code:
#!/usr/bin/env python3
import urllib.request, urllib.parse, urllib.error
import xml.etree.ElementTree as ET

url = 'http://eshop.cobi.cz/modules/xmlfeeds/xml_files/feed_5.xml'
uh = urllib.request.urlopen(url, timeout=30)
tree = ET.parse(uh)
root = tree.getroot()

data = ET.Element('SHOP')

for r in root.findall('SHOPITEM'):
    name = r.find('PRODUCT').text
    try:
        description = r.find('DESCRIPTION').text
    except AttributeError:
        description = ""
        continue
    price = r.find('WHOLESALEPRICE').text
    new_price = (float(price) * 0.8)
    final_price = round(new_price, 2)
    final_price = str(final_price)
    ean = r.find('EAN').text
    sku = r.find('PRODUCTNO').text
    for i in r.findall('IMAGES'):
        img = i.find('IMGURL').text
    status = r.find('ACTIVE').text
    if status == '1':
        status = '1'
    else:
        status = '0'
        continue

    element2 = ET.SubElement(data, 'SHOPITEM')
    s_elem2_1 = ET.SubElement(element2, 'PRODUCTNAME')
    s_elem2_2 = ET.SubElement(element2, 'PRODUCT')
    s_elem2_3 = ET.SubElement(element2, 'DESCRIPTION')
    s_elem2_4 = ET.SubElement(element2, 'PRICE_VAT')
    s_elem2_5 = ET.SubElement(element2, 'EAN')
    s_elem2_6 = ET.SubElement(element2, 'PRODUCTNO')
    s_elem2_7 = ET.SubElement(element2, 'IMGURL')
    s_elem2_8 = ET.SubElement(element2, 'DELIVERY_DATE')
    s_elem2_9 = ET.SubElement(element2, 'CATEGORYTEXT')
    s_elem2_10 = ET.SubElement(element2, 'MANUFACTURER')
    s_elem2_11 = ET.SubElement(element2, 'ACTIVE')
    s_elem2_1.text = "Cobi " + name
    s_elem2_2.text = "Cobi " + name
    s_elem2_3.text = description
    s_elem2_4.text = final_price
    s_elem2_5.text = ean
    s_elem2_6.text = sku
    s_elem2_7.text = img
    s_elem2_8.text = "7"
    s_elem2_9.text = "Heureka.cz | Dětské zboží | Hračky | Stavebnice | Stavebnice Cobi"
    s_elem2_10.text = "Cobi"
    s_elem2_11.text = status

xml_content = ET.tostring(data)

with open('cobifeed.xml', 'wb') as f:
    f.write(xml_content)
    f.close()
Try to ping eshop.cobi.cz from cmd/bash and wget http://eshop.cobi.cz/modules/xmlfeeds/xml_files/feed_5.xml
You probably don't have network access to the server eshop.cobi.cz.
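A minimal Python sketch of that check, run on the server itself (it uses the same feed URL as the question and only distinguishes a DNS/route failure from a plain timeout):

import socket
import urllib.request
import urllib.error

url = 'http://eshop.cobi.cz/modules/xmlfeeds/xml_files/feed_5.xml'

try:
    # Low-level check: can this host open a TCP connection to port 80 at all?
    socket.create_connection(("eshop.cobi.cz", 80), timeout=10).close()
    print("TCP connection to eshop.cobi.cz:80 works")
    # High-level check: does the feed answer within the timeout?
    with urllib.request.urlopen(url, timeout=30) as uh:
        print("HTTP status:", uh.status)
except OSError as exc:  # covers socket.timeout and urllib.error.URLError
    print("Network problem from this host:", exc)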
The python3 (pygsheets 2.0.1) script below will bold all the cells starting at A2.
Is there an easy way (i.e., in one command) to ask for all these cells not to be bolded?
Code:
import boto3, botocore
import datetime
import json
import pygsheets

currentDT = str(datetime.datetime.now())

def create_spreadsheet(outh_file, spreadsheet_name = "jSonar AWS usage"):
    client = pygsheets.authorize(outh_file=outh_file, outh_nonlocal=True)
    spread_sheet = client.create(spreadsheet_name)
    return spread_sheet

def get_regions():
    region = "us-west-1"
    regions = dict()
    ec2 = boto3.client("ec2", region_name=region)
    ec2_responses = ec2.describe_regions()
    ssm_client = boto3.client('ssm', region_name=region)
    for resp in ec2_responses['Regions']:
        region_id = resp['RegionName']
        tmp = '/aws/service/global-infrastructure/regions/%s/longName' % region_id
        ssm_response = ssm_client.get_parameter(Name = tmp)
        region_name = ssm_response['Parameter']['Value']
        regions[region_id] = region_name
    return(regions)

def rds_worksheet_creation(spread_sheet, regions, spreadsheet_index):
    worksheet = spread_sheet.add_worksheet("RDS", rows=100, cols=26, src_tuple=None, src_worksheet=None, index=spreadsheet_index)
    worksheet.cell('A1').set_text_format('bold', True).value = 'DBInstanceIdentifier'
    worksheet.cell('B1').set_text_format('bold', True).value = 'MasterUsername'
    worksheet.cell('C1').set_text_format('bold', True).value = 'Region'
    worksheet.cell('D1').set_text_format('bold', False).value = 'Sent Query to (Name)'
    worksheet.cell('E1').set_text_format('bold', False).value = 'Sent Query to (email)'
    worksheet.cell('F1').set_text_format('bold', False).value = 'WorksheetCreated: %s' % currentDT
    cells_data = list()
    for region, region_h in sorted(regions.items()):
        client = boto3.client('rds', region_name=region)
        clnt = boto3.client('ssm', region_name=region)
        db_instances = client.describe_db_instances()
        for instance in db_instances['DBInstances']:
            MasterUsername = instance['MasterUsername']
            DBInstanceIdentifier = instance['DBInstanceIdentifier']
            cells_data.append([DBInstanceIdentifier, MasterUsername, region_h])
    worksheet.append_table(cells_data, start='A2')

if __name__ == "__main__":
    spread_sheet = create_spreadsheet(spreadsheet_name = "jSonar AWS usage",
                                      outh_file = '/home/qa/.aws/client_secret.json')
    regions = get_regions()
    rds_worksheet_creation(spread_sheet, regions, 0)
    spread_sheet.share("me@corp.com")
If I understand correctly, you want to un-bold multiple cells in a single command.
To apply a format to a range of cells, create a DataRange and use apply_format.
from pygsheets import Cell, DataRange

model_cell = Cell('A1')
model_cell.set_text_format('bold', False)
DataRange('A1', 'A10', worksheet=wks).apply_format(model_cell)
docs
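Applied to the script in the question, that could look roughly like this (the A2:F100 range is an assumption based on the rows=100 worksheet created above; use whatever range you actually appended to):

from pygsheets import Cell, DataRange

# Sketch: un-bold everything from A2 downward in one apply_format call.
model_cell = Cell('A2')
model_cell.set_text_format('bold', False)
DataRange('A2', 'F100', worksheet=worksheet).apply_format(model_cell)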
Can't search using variable
@api.multi
def send_email(self, invoice_id):
    invoice_data = self.env['account.invoice'].browse(invoice_id)
    email_template_obj = self.env['email.template']
    template_id = self.env.ref('multi_db.email_template_subscription_invoice', False)
    report_id = self.env.ref('account.account_invoices', False)
    print 'invoice_id', str(invoice_data.id)  # Here it prints the invoice_id
    attach_obj = self.pool.get('ir.attachment')
    attachment_id = self.env['ir.attachment'].search([('res_id', '=', invoice_data.id), ('res_model', '=', 'account.invoice')])
    print 'attachment_id1234', attachment_id
    if template_id:
        values = email_template_obj.generate_email(template_id.id, invoice_id)
        values['subject'] = 'Invoice for AMS registration'
        values['email_to'] = invoice_data.partner_id.email
        values['partner_to'] = invoice_data.partner_id
        # values['attachment_ids'] = [(6, 0, report_id.id)]
        values['attachment_ids'] = [(6, 0, [attachment_id.id])]
        # print 'values', values
        mail_obj = self.env['mail.mail']
        msg_id = mail_obj.create(values)
        if msg_id:
            mail_obj.send([msg_id])
        return True
It returns ir.attachment().
But if I hard-code the value, it returns an id:
attachment_id = self.env['ir.attachment'].search([('res_id','=',60),('res_model','=','account.invoice')])
It returns ir.attachment(53).
How can I use a variable instead of a static value?