Apache Airflow with Python Exception: Expected state: SUCCESS. Actual state: failed - python-3.x

I'm running apache-airflow on my Mac M1 to insert data from a staging table (a temporary table created from S3) into AWS Redshift.
The following error was raised, and it doesn't tell me what the underlying issue is. Could anybody please help me debug?
[2022-07-30 16:16:58,226] {subdag.py:186} INFO - Execution finished. State is failed
[2022-07-30 16:16:58,252] {taskinstance.py:1909} ERROR - Task failed with exception
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1471, in _run_raw_task
self._execute_task_with_callbacks(context, test_mode)
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1621, in _execute_task_with_callbacks
self.task.post_execute(context=context, result=result)
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/operators/subdag.py", line 189, in post_execute
raise AirflowException(f"Expected state: SUCCESS. Actual state: {dag_run.state}")
airflow.exceptions.AirflowException: Expected state: SUCCESS. Actual state: failed
[2022-07-30 16:16:58,268] {taskinstance.py:1415} INFO - Marking task as FAILED. dag_id=sparkify_dag, task_id=load_artist_dim_table, execution_date=20220729T174100, start_date=20220730T161557, end_date=20220730T161658
[2022-07-30 16:16:58,298] {standard_task_runner.py:92} ERROR - Failed to execute job 399 for task load_artist_dim_table (Expected state: SUCCESS. Actual state: failed; 55013)
[2022-07-30 16:16:58,356] {local_task_job.py:156} INFO - Task exited with return code 1
[2022-07-30 16:16:58,426] {local_task_job.py:273} INFO - 0 downstream tasks scheduled from follow-on schedule check
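The traceback above is raised by SubDagOperator.post_execute, which only reports that the child DAG run ended in the failed state; the actual error is logged by whichever task failed inside the sub-DAG. A rough sketch of the usual SubDagOperator layout on a recent Airflow 2.x is below (names and schedule are assumptions, not the asker's code); the point is that the child DAG's id must be "<parent_dag_id>.<task_id>", so the real failure lives under sparkify_dag.load_artist_dim_table, reachable via that child DAG's task logs or the "Zoom into Sub DAG" button in the graph view.
from pendulum import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.operators.subdag import SubDagOperator


def make_subdag(parent_dag_id, task_id, start_date):
    # SubDagOperator requires the child dag_id to be "<parent>.<task_id>";
    # that child DAG is where the failing task's traceback is logged.
    with DAG(
        dag_id=f"{parent_dag_id}.{task_id}",
        start_date=start_date,
        schedule_interval="@hourly",  # assumption: conventionally mirrors the parent
    ) as subdag:
        EmptyOperator(task_id="insert_from_staging")  # placeholder for the Redshift insert
    return subdag


with DAG("sparkify_dag", start_date=datetime(2022, 7, 29), schedule_interval="@hourly") as dag:
    load_artist_dim_table = SubDagOperator(
        task_id="load_artist_dim_table",
        subdag=make_subdag(dag.dag_id, "load_artist_dim_table", dag.start_date),
    )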

Related

Why does SparkSubmitOperator kill itself 5 minutes after startup if the DAG is turned off?

I deployed a DAG with a SparkSubmitOperator. I turned it on so that it would create a DAG run, and then turned it off so that it wouldn't create new ones. After that, I started running the task. About 5 minutes after launch, it always sends itself a SIGTERM with this log:
[2023-01-27 12:50:04,783] {local_task_job.py:188} WARNING - State of this instance has been externally set to None. Terminating instance.
[2023-01-27 12:50:04,798] {process_utils.py:100} INFO - Sending Signals.SIGTERM to GPID 968
[2023-01-27 12:50:04,802] {taskinstance.py:1265} ERROR - Received SIGTERM. Terminating subprocesses.
[2023-01-27 12:50:04,804] {spark_submit.py:657} INFO - Sending kill signal to spark-submit
[2023-01-27 12:50:15,985] {spark_submit.py:674} INFO - YARN app killed with return code: 0
[2023-01-27 12:50:16,122] {taskinstance.py:1482} ERROR - Task failed with exception
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.6/site-packages/airflow/models/taskinstance.py", line 1138, in _run_raw_task
self._prepare_and_execute_task_with_callbacks(context, task)
File "/home/airflow/.local/lib/python3.6/site-packages/airflow/models/taskinstance.py", line 1311, in _prepare_and_execute_task_with_callbacks
result = self._execute_task(context, task_copy)
File "/home/airflow/.local/lib/python3.6/site-packages/airflow/models/taskinstance.py", line 1341, in _execute_task
result = task_copy.execute(context=context)
File "/home/airflow/.local/lib/python3.6/site-packages/airflow/providers/apache/spark/operators/spark_submit.py", line 183, in execute
self._hook.submit(self._application)
File "/home/airflow/.local/lib/python3.6/site-packages/airflow/providers/apache/spark/hooks/spark_submit.py", line 440, in submit
self._process_spark_submit_log(iter(self._submit_sp.stdout)) # type: ignore
File "/home/airflow/.local/lib/python3.6/site-packages/airflow/providers/apache/spark/hooks/spark_submit.py", line 494, in _process_spark_submit_log
for line in itr:
File "/home/airflow/.local/lib/python3.6/site-packages/airflow/models/taskinstance.py", line 1267, in signal_handler
raise AirflowException("Task received SIGTERM signal")
airflow.exceptions.AirflowException: Task received SIGTERM signal
I reproduced this error more than 5 times. But when I turned the DAG on and launched it, it worked. Why is this happening?
DAGs are supposed to stay unpaused for their tasks to run correctly. :)
You can limit how many DAG runs are created by setting the DAG parameter max_active_runs to 1, and limit how many instances of a specific task can run at any time by setting the task-level parameter max_active_tis_per_dag to 1. Also make sure you set the DAG parameter catchup to False, to avoid many runs being scheduled if your start_date is a while back in the past.
This DAG should only create one running t1 task every day (based on the @daily schedule).
from airflow import DAG
from pendulum import datetime
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="simple_classic_dag",
    start_date=datetime(2022, 12, 10),
    schedule="@daily",
    catchup=False,
    max_active_runs=1,
    tags=["simple", "debug"]
) as dag:
    t1 = EmptyOperator(task_id="t1", max_active_tis_per_dag=1)
You might find these guides helpful:
DAG scheduling and timetables in Airflow explains scheduling basics
Scaling Airflow to optimize performance goes over scaling and limit parameters

Airflow Logs BrokenPipeException

I'm using a clustered Airflow environment where I have four AWS EC2 instances for the servers.
ec2-instances
Server 1: Webserver, Scheduler, Redis Queue, PostgreSQL Database
Server 2: Webserver
Server 3: Worker
Server 4: Worker
My setup has been working perfectly fine for three months now, but sporadically, about once a week, I get a BrokenPipeException when Airflow is attempting to log something.
*** Log file isn't local.
*** Fetching here: http://ip-1-2-3-4:8793/log/foobar/task_1/2018-07-13T00:00:00/1.log
[2018-07-16 00:00:15,521] {cli.py:374} INFO - Running on host ip-1-2-3-4
[2018-07-16 00:00:15,698] {models.py:1197} INFO - Dependencies all met for <TaskInstance: foobar.task_1 2018-07-13 00:00:00 [queued]>
[2018-07-16 00:00:15,710] {models.py:1197} INFO - Dependencies all met for <TaskInstance: foobar.task_1 2018-07-13 00:00:00 [queued]>
[2018-07-16 00:00:15,710] {models.py:1407} INFO -
--------------------------------------------------------------------------------
Starting attempt 1 of 1
--------------------------------------------------------------------------------
[2018-07-16 00:00:15,719] {models.py:1428} INFO - Executing <Task(OmegaFileSensor): task_1> on 2018-07-13 00:00:00
[2018-07-16 00:00:15,720] {base_task_runner.py:115} INFO - Running: ['bash', '-c', 'airflow run foobar task_1 2018-07-13T00:00:00 --job_id 1320 --raw -sd DAGS_FOLDER/datalake_digitalplatform_arl_workflow_schedule_test_2.py']
[2018-07-16 00:00:16,532] {base_task_runner.py:98} INFO - Subtask: [2018-07-16 00:00:16,532] {configuration.py:206} WARNING - section/key [celery/celery_ssl_active] not found in config
[2018-07-16 00:00:16,532] {base_task_runner.py:98} INFO - Subtask: [2018-07-16 00:00:16,532] {default_celery.py:41} WARNING - Celery Executor will run without SSL
[2018-07-16 00:00:16,534] {base_task_runner.py:98} INFO - Subtask: [2018-07-16 00:00:16,533] {__init__.py:45} INFO - Using executor CeleryExecutor
[2018-07-16 00:00:16,597] {base_task_runner.py:98} INFO - Subtask: [2018-07-16 00:00:16,597] {models.py:189} INFO - Filling up the DagBag from /home/ec2-user/airflow/dags/datalake_digitalplatform_arl_workflow_schedule_test_2.py
[2018-07-16 00:00:16,768] {cli.py:374} INFO - Running on host ip-1-2-3-4
[2018-07-16 00:16:24,931] {logging_mixin.py:84} WARNING - --- Logging error ---
[2018-07-16 00:16:24,931] {logging_mixin.py:84} WARNING - Traceback (most recent call last):
[2018-07-16 00:16:24,931] {logging_mixin.py:84} WARNING - File "/usr/lib64/python3.6/logging/__init__.py", line 996, in emit
self.flush()
[2018-07-16 00:16:24,932] {logging_mixin.py:84} WARNING - File "/usr/lib64/python3.6/logging/__init__.py", line 976, in flush
self.stream.flush()
[2018-07-16 00:16:24,932] {logging_mixin.py:84} WARNING - BrokenPipeError: [Errno 32] Broken pipe
[2018-07-16 00:16:24,932] {logging_mixin.py:84} WARNING - Call stack:
[2018-07-16 00:16:24,933] {logging_mixin.py:84} WARNING - File "/usr/bin/airflow", line 27, in <module>
args.func(args)
[2018-07-16 00:16:24,934] {logging_mixin.py:84} WARNING - File "/usr/local/lib/python3.6/site-packages/airflow/bin/cli.py", line 392, in run
pool=args.pool,
[2018-07-16 00:16:24,934] {logging_mixin.py:84} WARNING - File "/usr/local/lib/python3.6/site-packages/airflow/utils/db.py", line 50, in wrapper
result = func(*args, **kwargs)
[2018-07-16 00:16:24,934] {logging_mixin.py:84} WARNING - File "/usr/local/lib/python3.6/site-packages/airflow/models.py", line 1488, in _run_raw_task
result = task_copy.execute(context=context)
[2018-07-16 00:16:24,934] {logging_mixin.py:84} WARNING - File "/usr/local/lib/python3.6/site-packages/airflow/operators/sensors.py", line 78, in execute
while not self.poke(context):
[2018-07-16 00:16:24,934] {logging_mixin.py:84} WARNING - File "/home/ec2-user/airflow/plugins/custom_plugins.py", line 35, in poke
directory = os.listdir(full_path)
[2018-07-16 00:16:24,934] {logging_mixin.py:84} WARNING - File "/usr/local/lib/python3.6/site-packages/airflow/utils/timeout.py", line 36, in handle_timeout
self.log.error("Process timed out")
[2018-07-16 00:16:24,934] {logging_mixin.py:84} WARNING - Message: 'Process timed out'
Arguments: ()
[2018-07-16 00:16:24,942] {models.py:1595} ERROR - Timeout
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/airflow/models.py", line 1488, in _run_raw_task
result = task_copy.execute(context=context)
File "/usr/local/lib/python3.6/site-packages/airflow/operators/sensors.py", line 78, in execute
while not self.poke(context):
File "/home/ec2-user/airflow/plugins/custom_plugins.py", line 35, in poke
directory = os.listdir(full_path)
File "/usr/local/lib/python3.6/site-packages/airflow/utils/timeout.py", line 37, in handle_timeout
raise AirflowTaskTimeout(self.error_message)
airflow.exceptions.AirflowTaskTimeout: Timeout
[2018-07-16 00:16:24,942] {models.py:1624} INFO - Marking task as FAILED.
[2018-07-16 00:16:24,956] {models.py:1644} ERROR - Timeout
Sometimes the error will also say
*** Log file isn't local.
*** Fetching here: http://ip-1-2-3-4:8793/log/foobar/task_1/2018-07-12T00:00:00/1.log
*** Failed to fetch log file from worker. 404 Client Error: NOT FOUND for url: http://ip-1-2-3-4:8793/log/foobar/task_1/2018-07-12T00:00:00/1.log
I'm not sure why the logs work ~95% of the time but randomly fail at other times. Here are my log settings from my airflow.cfg file:
# The folder where airflow should store its log files
# This path must be absolute
base_log_folder = /home/ec2-user/airflow/logs
# Airflow can store logs remotely in AWS S3 or Google Cloud Storage. Users
# must supply an Airflow connection id that provides access to the storage
# location.
remote_log_conn_id =
encrypt_s3_logs = False
# Logging level
logging_level = INFO
# Logging class
# Specify the class that will specify the logging configuration
# This class has to be on the python classpath
# logging_config_class = my.path.default_local_settings.LOGGING_CONFIG
logging_config_class =
# Log format
log_format = [%%(asctime)s] {%%(filename)s:%%(lineno)d} %%(levelname)s - %%(message)s
simple_log_format = %%(asctime)s %%(levelname)s - %%(message)s
# Name of handler to read task instance logs.
# Default to use file task handler.
task_log_reader = file.task
# Log files for the gunicorn webserver. '-' means log to stderr.
access_logfile = -
error_logfile =
# The amount of time (in secs) webserver will wait for initial handshake
# while fetching logs from other worker machine
log_fetch_timeout_sec = 5
# When you start an airflow worker, airflow starts a tiny web server
# subprocess to serve the workers local log files to the airflow main
# web server, who then builds pages and sends them to users. This defines
# the port on which the logs are served. It needs to be unused, and open
# visible from the main web server to connect into the workers.
worker_log_server_port = 8793
# How often should stats be printed to the logs
print_stats_interval = 30
child_process_log_directory = /home/ec2-user/airflow/logs/scheduler
I'm wondering if maybe I should try a different technique for my logging such as writing to an S3 Bucket or if there is something else I can do to fix this issue.
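For reference, switching task logs to S3 in Airflow 1.10+ comes down to a handful of airflow.cfg keys like the following (a sketch: the bucket path and connection id are placeholders, and while these keys sit under [core] in 1.10, newer releases move them to a [logging] section):
[core]
# ship task logs to S3 instead of only keeping them on the worker
remote_logging = True
remote_base_log_folder = s3://my-airflow-log-bucket/logs
remote_log_conn_id = my_s3_conn
encrypt_s3_logs = False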
Update:
Writing the logs to S3 did not resolve this issue. Also, the error is more consistent now (still sporadic); it's happening more like 50% of the time. One thing I noticed is that the task it happens on is my AWS EMR creation task. Starting an AWS EMR cluster takes about 20 minutes, and then the task has to wait for the Spark commands to run on the EMR cluster, so the single task runs for about 30 minutes. I'm wondering if this is too long for an Airflow task to be running and if that's why it starts to fail writing to the logs. If so, I could break up the EMR task so that there is one task for the EMR creation and another task for the Spark commands on the EMR cluster (sketched below).
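On a recent Airflow with the Amazon provider package, that split looks roughly like this (a sketch only: the cluster and step definitions are placeholders, and the question's Airflow 1.9 would use the older airflow.contrib equivalents):
from pendulum import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import (
    EmrAddStepsOperator,
    EmrCreateJobFlowOperator,
)
from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor

JOB_FLOW_OVERRIDES = {"Name": "kpi-cluster"}  # placeholder cluster definition
SPARK_STEPS = [  # placeholder step definition
    {
        "Name": "run_spark_job",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-bucket/jobs/kpi_job.py"],
        },
    }
]

with DAG("emr_split_example", start_date=datetime(2018, 7, 1),
         schedule_interval=None, catchup=False) as dag:
    create_cluster = EmrCreateJobFlowOperator(
        task_id="create_emr_cluster",
        job_flow_overrides=JOB_FLOW_OVERRIDES,
        aws_conn_id="aws_default",
    )
    add_spark_step = EmrAddStepsOperator(
        task_id="add_spark_step",
        job_flow_id="{{ ti.xcom_pull(task_ids='create_emr_cluster') }}",
        steps=SPARK_STEPS,
        aws_conn_id="aws_default",
    )
    # The sensor pokes in short cycles instead of one 30-minute blocking task.
    wait_for_step = EmrStepSensor(
        task_id="wait_for_spark_step",
        job_flow_id="{{ ti.xcom_pull(task_ids='create_emr_cluster') }}",
        step_id="{{ ti.xcom_pull(task_ids='add_spark_step')[0] }}",
        aws_conn_id="aws_default",
    )
    create_cluster >> add_spark_step >> wait_for_step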
Note:
I've also created a new bug ticket on Airflow's Jira here https://issues.apache.org/jira/browse/AIRFLOW-2844
This issue is a symptom of another issue I just resolved here: AirflowException: Celery command failed - The recorded hostname does not match this instance's hostname.
I didn't see the AirflowException: Celery command failed error for a while because it showed up in the airflow worker logs. It wasn't until I watched the worker logs in real time that I noticed that whenever that error is thrown, I also get the BrokenPipeException in my task.
It gets somewhat weirder, though. I would only see the BrokenPipeException thrown if I did print("something to log") and the AirflowException: Celery command failed... error happened on the worker node. When I changed all of my print statements to use import logging ... logging.info("something to log"), I no longer saw the BrokenPipeException, but the task would still fail because of the AirflowException: Celery command failed... error. Had I not seen the BrokenPipeException in my Airflow task logs, though, I wouldn't have known why the task was failing, because once I eliminated the print statements I never saw any error in the Airflow task logs (only in the airflow worker logs).
So, long story short, there are a few takeaways.
Don't do print("something to log"); use Airflow's built-in logging instead, e.g. import logging and then logging.info("something to log") (see the sketch after this list).
If you're using an AWS EC2 instance as your server for Airflow, then you may be experiencing this issue: https://github.com/apache/incubator-airflow/pull/2484. A fix has already been integrated into Airflow version 1.10 (I'm currently using Airflow version 1.9), so upgrade to 1.10; you can also use the command pip install git+git://github.com/apache/incubator-airflow.git#v1-10-stable. If you don't want to upgrade your Airflow version, you can follow the steps in the GitHub issue to either manually update the file with the fix or fork Airflow and cherry-pick the commit that fixes it.
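As a sketch of the first takeaway (the function and its body are made up, loosely mirroring the custom sensor's poke from the traceback), route everything through the logging module so it lands in Airflow's task log handlers instead of raw stdout:
import logging
import os

log = logging.getLogger(__name__)


def poke(full_path):
    # Goes through the handlers Airflow configures for the task, so it ends up
    # in the task log file served by the worker's log server.
    log.info("listing %s", full_path)
    # print("listing " + full_path)  # avoid: raw stdout write, which is what hit the broken pipe
    return len(os.listdir(full_path)) > 0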

TypeError: must be str, not NoneType when running distributed locust with taurus

I am trying to create a configuration for a distributed Locust run. I have a .py script with the tasks defined, and a simple Taurus configuration just to get it working:
execution:
  executor: locust
  master: true
  slaves: 1
  scenario: tns
  concurrency: 10
  ramp-up: 10s
  iterations: 100
  hold-for: 10s

scenarios:
  tns:
    script: /usr/src/app/scenarios/locust_scenarios/sample.py

reporting:
  - module: final-stats
    dump-csv: test_result.csv
  - module: console
  - module: passfail
    criteria:
      - avg-rt>250ms for 30s, continue as failed
      - failures>5% for 5s, continue as failed
      - failures>50% for 10s, stop as failed
Then I start the locust slave node:
python -m locust.main -f scenarios/locust_scenarios/sample.py --slave --master-host=localhost
and execute the test. Here is the log:
$ bzt -o modules.console.screen=gui locust_tests_execution_config.yaml
12:38:54 INFO: Taurus CLI Tool v1.12.0
12:38:54 INFO: Starting with configs: ['locust_tests_execution_config.yaml']
12:38:54 INFO: Configuring...
12:38:54 INFO: Artifacts dir: /Users/usr/Projects/load/2018-06-20_12-38-54.391229
12:38:54 WARNING: at path 'execution': 'execution' should be a list
12:38:54 INFO: Preparing...
12:38:54 WARNING: Module 'console' can be only used once, will merge all new instances into single
12:38:54 INFO: Starting...
12:38:54 INFO: Waiting for results...
12:38:55 WARNING: Please wait for graceful shutdown...
12:38:55 INFO: Shutting down...
12:38:56 INFO: Terminating process PID 54419 with signal Signals.SIGTERM (59 tries left)
12:38:57 INFO: Terminating process PID 54419 with signal Signals.SIGTERM (58 tries left)
12:38:57 ERROR: TypeError: must be str, not NoneType
File "/Users/usr/.virtualenvs/stfw/lib/python3.6/site-packages/bzt/cli.py", line 250, in perform
self.engine.run()
File "/Users/usr/.virtualenvs/stfw/lib/python3.6/site-packages/bzt/engine.py", line 222, in run
reraise(exc_info)
File "/Users/usr/.virtualenvs/stfw/lib/python3.6/site-packages/bzt/six/py3.py", line 84, in reraise
raise exc
File "/Users/usr/.virtualenvs/stfw/lib/python3.6/site-packages/bzt/engine.py", line 204, in run
self._wait()
File "/Users/usr/.virtualenvs/stfw/lib/python3.6/site-packages/bzt/engine.py", line 243, in _wait
while not self._check_modules_list():
File "/Users/usr/.virtualenvs/stfw/lib/python3.6/site-packages/bzt/engine.py", line 230, in _check_modules_list
finished = bool(module.check())
File "/Users/usr/.virtualenvs/stfw/lib/python3.6/site-packages/bzt/modules/aggregator.py", line 635, in check
for point in self.datapoints():
File "/Users/usr/.virtualenvs/stfw/lib/python3.6/site-packages/bzt/modules/aggregator.py", line 401, in datapoints
for datapoint in self._calculate_datapoints(final_pass):
File "/Users/usr/.virtualenvs/stfw/lib/python3.6/site-packages/bzt/modules/aggregator.py", line 664, in _calculate_datapoints
self._process_underlings(final_pass)
File "/Users/usr/.virtualenvs/stfw/lib/python3.6/site-packages/bzt/modules/aggregator.py", line 649, in _process_underlings
for data in underling.datapoints(final_pass):
File "/Users/usr/.virtualenvs/stfw/lib/python3.6/site-packages/bzt/modules/aggregator.py", line 401, in datapoints
for datapoint in self._calculate_datapoints(final_pass):
File "/Users/usr/.virtualenvs/stfw/lib/python3.6/site-packages/bzt/modules/locustio.py", line 221, in _calculate_datapoints
self.read_buffer += self.file.get_bytes(size=1024 * 1024, last_pass=final_pass)
12:38:57 INFO: Post-processing...
12:38:57 INFO: Test duration: 0:00:03
12:38:57 INFO: Test duration: 0:00:03
12:38:57 INFO: Artifacts dir: /Users/usr/Projects/load/2018-06-20_12-38-54.391229
12:38:57 WARNING: Done performing with code: 1
The locust log shows that the locust slave connected and was ready to swarm.
What should I do to make it run?
Thanks
It seems there is a defect in the bzt library, based on this thread:
https://groups.google.com/forum/#!searchin/codename-taurus/locust%7Csort:date/codename-taurus/woBeH1JeBFo/pHhoGUSoAwAJ
A fix will be included in a new release:
https://github.com/Blazemeter/taurus/pull/871
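For reference, the sample.py the scenario points to is an ordinary Locust file; with the Locust releases current at the time (the 0.x HttpLocust/TaskSet API) it would look roughly like this (a sketch, with class names and the endpoint made up):
from locust import HttpLocust, TaskSet, task


class UserTasks(TaskSet):
    @task
    def index(self):
        # placeholder request against the system under test
        self.client.get("/")


class WebsiteUser(HttpLocust):
    task_set = UserTasks
    min_wait = 1000  # wait 1-3 seconds between tasks
    max_wait = 3000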

Failed to get broadcast_1_piece0 of broadcast_1 in pyspark application

I was building an application on Apache Spark 2.0 with Python 3.4, trying to load some CSV files from HDFS (Hadoop 2.7) and compute some KPIs from that CSV data.
I randomly face a "Failed to get broadcast_1_piece0 of broadcast_1" error in my application, and then it stops.
After a lot of searching on Google and Stack Overflow, the only workaround I found was to manually delete the files the Spark app created in the /tmp directory. It generally happens when an application has been running for a long time and stops responding properly, while its related files are still sitting in the /tmp directory.
I don't declare any broadcast variable myself, but maybe Spark is creating one on its own.
In my case, the error occurs when it tries to load the CSV from HDFS.
I have captured low-level logs for my application and attached them here for support and suggestions/best practices, so that I can resolve the problem.
Sample (full details are attached here):
Traceback (most recent call last):
File "/home/hadoop/development/kpiengine.py", line 258, in <module>
df_ho_raw = sqlContext.read.format('com.databricks.spark.csv').options(header='true').load(HDFS_BASE_URL + HDFS_WORK_DIR + filename)
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 147, in load
File "/usr/local/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/usr/local/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 312, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o44.load.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, 172.26.7.192): java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_1_piece0 of broadcast_1
You should extend Serializable for your class.
To rule out a framework/setup error rather than a problem in your code, you can test with the bundled examples under
$SPARK_HOME/examples/src/main/scala/org/apache/spark/examples/
If those run OK, you should check your own code.
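One way to start checking the code is to run the failing load on its own with a fresh session; a stripped-down version of the call in the traceback would look roughly like this (a sketch: the HDFS path is a placeholder, and the spark-csv package must be on the classpath just as in the original job):
from pyspark.sql import SparkSession

# Reproduce just the CSV load from the traceback in isolation.
spark = SparkSession.builder.appName("csv_load_check").getOrCreate()

df = (spark.read
      .format("com.databricks.spark.csv")  # same external CSV reader as the question uses
      .options(header="true")
      .load("hdfs:///user/hadoop/work/sample.csv"))  # placeholder path

print(df.count())
spark.stop()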

module error in multi-node spark job on google cloud cluster

This code runs perfectly when I set the master to localhost. The problem occurs when I submit it to a cluster with two worker nodes.
All the machines have the same version of Python and the same packages. I have also set the path to point to the desired Python version, i.e. 3.5.1. When I submit my Spark job from the master SSH session, I get the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 5, .c..internal): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/hadoop/yarn/nm-local-dir/usercache//appcache/application_1469113139977_0011/container_1469113139977_0011_01_000004/pyspark.zip/pyspark/worker.py", line 98, in main
command = pickleSer._read_with_length(infile)
File "/hadoop/yarn/nm-local-dir/usercache//appcache/application_1469113139977_0011/container_1469113139977_0011_01_000004/pyspark.zip/pyspark/serializers.py", line 164, in _read_with_length
return self.loads(obj)
File "/hadoop/yarn/nm-local-dir/usercache//appcache/application_1469113139977_0011/container_1469113139977_0011_01_000004/pyspark.zip/pyspark/serializers.py", line 419, in loads
return pickle.loads(obj, encoding=encoding)
File "/hadoop/yarn/nm-local-dir/usercache//appcache/application_1469113139977_0011/container_1469113139977_0011_01_000004/pyspark.zip/pyspark/mllib/init.py", line 25, in
import numpy
ImportError: No module named 'numpy'
I saw other posts where people did not have access to their worker nodes; I do. I get the same message for the other worker node. I'm not sure if I am missing some environment setting. Any help will be much appreciated.
Not sure if this qualifies as a solution. I submitted the same job using Dataproc on the Google platform and it worked without any problem. I believe the best way to run jobs on a Google cluster is via the utilities offered on the Google platform; the Dataproc utility seems to iron out any environment-related issues.
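If you do need a hand-built cluster instead, one quick way to see what the executors' Python environment actually contains is to probe it from a small job (a sketch; the four partitions just spread the probe across the workers):
import sys

from pyspark import SparkContext

sc = SparkContext.getOrCreate()


def probe(_):
    # Runs inside the executors, so it reports their interpreter and numpy status.
    try:
        import numpy
        yield (sys.executable, numpy.__version__)
    except ImportError:
        yield (sys.executable, "numpy missing")


print(sc.parallelize(range(4), 4).mapPartitions(probe).collect())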
