Airflow Scheduler liveness probe crashing (version 2.0) - python-3.x

I have just upgraded my Airflow from 1.10.13 to 2.0. I am running it in Kubernetes (AKS Azure) with the Kubernetes Executor. Unfortunately, my scheduler gets killed every 15-20 minutes because the liveness probe fails, so the pod keeps restarting.
I had no issues in 1.10.13.
This is my Liveness probe:
import os
os.environ['AIRFLOW__CORE__LOGGING_LEVEL'] = 'ERROR'
os.environ['AIRFLOW__LOGGING__LOGGING_LEVEL'] = 'ERROR'

from airflow.jobs.scheduler_job import SchedulerJob
from airflow.utils.db import create_session
from airflow.utils.net import get_hostname
import sys

with create_session() as session:
    job = session.query(SchedulerJob).filter_by(hostname=get_hostname()).order_by(
        SchedulerJob.latest_heartbeat.desc()).limit(1).first()

sys.exit(0 if job.is_alive() else 1)
When I look in the scheduler logs I see the following:
[2021-02-16 12:18:21,883] {scheduler_job.py:309} DEBUG - Waiting for <ForkProcess name='DagFileProcessor489-Process' pid=12812 parent=9286 stopped exitcode=0>
[2021-02-16 12:18:22,228] {scheduler_job.py:933} DEBUG - No tasks to consider for execution.
[2021-02-16 12:18:22,232] {base_executor.py:147} DEBUG - 0 running task instances
[2021-02-16 12:18:22,232] {base_executor.py:148} DEBUG - 0 in queue
[2021-02-16 12:18:22,232] {base_executor.py:149} DEBUG - 32 open slots
[2021-02-16 12:18:22,232] {base_executor.py:158} DEBUG - Calling the <class 'airflow.executors.kubernetes_executor.KubernetesExecutor'> sync method
[2021-02-16 12:18:22,233] {kubernetes_executor.py:337} DEBUG - Syncing KubernetesExecutor
[2021-02-16 12:18:22,233] {kubernetes_executor.py:263} DEBUG - KubeJobWatcher alive, continuing
[2021-02-16 12:18:22,234] {dag_processing.py:383} DEBUG - Received message of type DagParsingStat
[2021-02-16 12:18:22,234] {dag_processing.py:383} DEBUG - Received message of type DagParsingStat
[2021-02-16 12:18:22,236] {dag_processing.py:383} DEBUG - Received message of type DagParsingStat
[2021-02-16 12:18:22,246] {scheduler_job.py:1390} DEBUG - Next timed event is in 0.143059
[2021-02-16 12:18:22,246] {scheduler_job.py:1392} DEBUG - Ran scheduling loop in 0.05 seconds
[2021-02-16 12:18:22,422] {scheduler_job.py:933} DEBUG - No tasks to consider for execution.
[2021-02-16 12:18:22,426] {base_executor.py:147} DEBUG - 0 running task instances
[2021-02-16 12:18:22,426] {base_executor.py:148} DEBUG - 0 in queue
[2021-02-16 12:18:22,426] {base_executor.py:149} DEBUG - 32 open slots
[2021-02-16 12:18:22,427] {base_executor.py:158} DEBUG - Calling the <class 'airflow.executors.kubernetes_executor.KubernetesExecutor'> sync method
[2021-02-16 12:18:22,427] {kubernetes_executor.py:337} DEBUG - Syncing KubernetesExecutor
[2021-02-16 12:18:22,427] {kubernetes_executor.py:263} DEBUG - KubeJobWatcher alive, continuing
[2021-02-16 12:18:22,439] {scheduler_job.py:1751} INFO - Resetting orphaned tasks for active dag runs
[2021-02-16 12:18:22,452] {settings.py:290} DEBUG - Disposing DB connection pool (PID 12819)
[2021-02-16 12:18:22,460] {scheduler_job.py:309} DEBUG - Waiting for <ForkProcess name='DagFileProcessor490-Process' pid=12819 parent=9286 stopped exitcode=0>
[2021-02-16 12:18:23,009] {settings.py:290} DEBUG - Disposing DB connection pool (PID 12826)
[2021-02-16 12:18:23,017] {scheduler_job.py:309} DEBUG - Waiting for <ForkProcess name='DagFileProcessor491-Process' pid=12826 parent=9286 stopped exitcode=0>
[2021-02-16 12:18:23,594] {settings.py:290} DEBUG - Disposing DB connection pool (PID 12833)
... Many of these Disposing DB connection pool entries here
[2021-02-16 12:20:08,212] {scheduler_job.py:309} DEBUG - Waiting for <ForkProcess name='DagFileProcessor675-Process' pid=14146 parent=9286 stopped exitcode=0>
[2021-02-16 12:20:08,916] {settings.py:290} DEBUG - Disposing DB connection pool (PID 14153)
[2021-02-16 12:20:08,924] {scheduler_job.py:309} DEBUG - Waiting for <ForkProcess name='DagFileProcessor676-Process' pid=14153 parent=9286 stopped exitcode=0>
[2021-02-16 12:20:09,475] {settings.py:290} DEBUG - Disposing DB connection pool (PID 14160)
[2021-02-16 12:20:09,484] {scheduler_job.py:309} DEBUG - Waiting for <ForkProcess name='DagFileProcessor677-Process' pid=14160 parent=9286 stopped exitcode=0>
[2021-02-16 12:20:10,044] {settings.py:290} DEBUG - Disposing DB connection pool (PID 14167)
[2021-02-16 12:20:10,053] {scheduler_job.py:309} DEBUG - Waiting for <ForkProcess name='DagFileProcessor678-Process' pid=14167 parent=9286 stopped exitcode=0>
[2021-02-16 12:20:10,610] {settings.py:290} DEBUG - Disposing DB connection pool (PID 14180)
[2021-02-16 12:23:42,287] {scheduler_job.py:746} INFO - Exiting gracefully upon receiving signal 15
[2021-02-16 12:23:43,290] {process_utils.py:95} INFO - Sending Signals.SIGTERM to GPID 9286
[2021-02-16 12:23:43,494] {process_utils.py:201} INFO - Waiting up to 5 seconds for processes to exit...
[2021-02-16 12:23:43,503] {process_utils.py:61} INFO - Process psutil.Process(pid=14180, status='terminated', started='12:20:09') (14180) terminated with exit code None
[2021-02-16 12:23:43,503] {process_utils.py:61} INFO - Process psutil.Process(pid=9286, status='terminated', exitcode=0, started='12:13:35') (9286) terminated with exit code 0
[2021-02-16 12:23:43,506] {process_utils.py:95} INFO - Sending Signals.SIGTERM to GPID 9286
[2021-02-16 12:23:43,506] {scheduler_job.py:1296} INFO - Exited execute loop
[2021-02-16 12:23:43,523] {cli_action_loggers.py:84} DEBUG - Calling callbacks: []
[2021-02-16 12:23:43,525] {settings.py:290} DEBUG - Disposing DB connection pool (PID 7)

I managed to fix the restarts by setting the following configs:
[kubernetes]
...
delete_option_kwargs = {"grace_period_seconds": 10}
enable_tcp_keepalive = True
tcp_keep_idle = 30
tcp_keep_intvl = 30
tcp_keep_cnt = 30
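To confirm the scheduler pod actually picks up these overrides, a quick sanity check can be run inside the container. This is only a sketch and assumes Airflow 2.0's airflow.configuration.conf API; the section and option names match the config above:

# Sketch: print the [kubernetes] keepalive settings the running Airflow actually sees.
# Assumes Airflow 2.0; run inside the scheduler container (e.g. via kubectl exec).
from airflow.configuration import conf

for option in ("enable_tcp_keepalive", "tcp_keep_idle", "tcp_keep_intvl", "tcp_keep_cnt"):
    print(option, "=", conf.get("kubernetes", option))
print("delete_option_kwargs =", conf.get("kubernetes", "delete_option_kwargs"))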
I have another Airflow instance running in Kubernetes on AWS. That one runs fine on either version, so I realized the problem is specific to Azure Kubernetes, namely the REST API calls to the API server.
Just in case this helps someone else....

In my case the problem was with the workers, which had DB connection issues. Fixing that solved the issue for the scheduler as well.
Note: check the worker logs as well.
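If you suspect the same, here is a minimal connectivity check you can run from inside a worker pod (a sketch only; it assumes the standard Airflow settings module and metadata DB session are available in the worker image, and is roughly what `airflow db check` does in 2.x):

# Sketch: verify the worker can reach the Airflow metadata database.
from airflow import settings
from sqlalchemy import text

session = settings.Session()
try:
    session.execute(text("SELECT 1"))
    print("metadata DB reachable")
finally:
    session.close()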

Related

flutter: Gradle build daemon disappeared unexpectedly (it may have been killed or may have crashed)

I'm working with Flutter on Ubuntu 18.04. I can't run my project even though flutter doctor reports no problems. I'm stuck on a Gradle daemon problem. I have tried the method from the Gradle user guide and from other questions on SO of adding org.gradle.daemon=false to gradle.properties, but the problem remains. What's wrong?
Launching lib/main.dart on M2006C3LC in debug mode...
The message received from the daemon indicates that the daemon has disappeared.
Build request sent: Build{id=0846277b-6ded-4171-9a9a-ce34e51b07be, currentDir=/home/byhuang/flutterOpenProject/flutter_init_demo/android}
Attempting to read last messages from the daemon log...
Daemon pid: 3422
log file: /home/byhuang/.gradle/daemon/5.6.2/daemon-3422.out.log
----- Last 20 lines from daemon log file - daemon-3422.out.log -----
11:38:04.753 [DEBUG] [org.gradle.launcher.daemon.server.DefaultIncomingConnectionHandler] Starting executing command: Build{id=0846277b-6ded-4171-9a9a-ce34e51b07be, currentDir=/home/byhuang/flutterOpenProject/flutter_init_demo/android} with connection: socket connection from /0:0:0:0:0:0:0:1:41393 to /0:0:0:0:0:0:0:1:46484.
11:38:04.755 [ERROR] [org.gradle.launcher.daemon.server.DaemonStateCoordinator] Command execution: started DaemonCommandExecution[command = Build{id=0846277b-6ded-4171-9a9a-ce34e51b07be, currentDir=/home/byhuang/flutterOpenProject/flutter_init_demo/android}, connection = DefaultDaemonConnection: socket connection from /0:0:0:0:0:0:0:1:41393 to /0:0:0:0:0:0:0:1:46484] after 0.0 minutes of idle
11:38:04.756 [INFO] [org.gradle.launcher.daemon.server.DaemonRegistryUpdater] Marking the daemon as busy, address: [f54344e9-fcb9-455f-a518-b1d7af504663 port:41393, addresses:[/0:0:0:0:0:0:0:1%lo, /127.0.0.1]]
11:38:04.763 [DEBUG] [org.gradle.launcher.daemon.registry.PersistentDaemonRegistry] Marking busy by address: [f54344e9-fcb9-455f-a518-b1d7af504663 port:41393, addresses:[/0:0:0:0:0:0:0:1%lo, /127.0.0.1]]
11:38:04.790 [DEBUG] [org.gradle.cache.internal.DefaultFileLockManager] Waiting to acquire exclusive lock on daemon addresses registry.
11:38:04.791 [DEBUG] [org.gradle.cache.internal.DefaultFileLockManager] Lock acquired on daemon addresses registry.
11:38:04.804 [DEBUG] [org.gradle.cache.internal.DefaultFileLockManager] Releasing lock on daemon addresses registry.
11:38:04.805 [DEBUG] [org.gradle.launcher.daemon.server.DaemonStateCoordinator] resetting idle timer
11:38:04.805 [DEBUG] [org.gradle.launcher.daemon.server.DaemonStateCoordinator] daemon is running. Sleeping until state changes.
11:38:04.810 [INFO] [org.gradle.launcher.daemon.server.exec.StartBuildOrRespondWithBusy] Daemon is about to start building Build{id=0846277b-6ded-4171-9a9a-ce34e51b07be, currentDir=/home/byhuang/flutterOpenProject/flutter_init_demo/android}. Dispatching build started information...
11:38:04.811 [DEBUG] [org.gradle.launcher.daemon.server.SynchronizedDispatchConnection] thread 17: dispatching class org.gradle.launcher.daemon.protocol.BuildStarted
11:38:04.812 [DEBUG] [org.gradle.launcher.daemon.server.exec.EstablishBuildEnvironment] Configuring env variables: {PATH=/home/byhuang/src/FlutterSDK/flutter/bin/cache/dart-sdk/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/home/byhuang/.pub-cache/bin:~/src/FlutterSDK/flutter/bin, XAUTHORITY=/run/user/1000/gdm/Xauthority, INVOCATION_ID=bb1359976cdb403b93f2e6bd5fa0d6e7, XMODIFIERS=#im=ibus, GDMSESSION=ubuntu, XDG_DATA_DIRS=/usr/share/ubuntu:/usr/local/share:/usr/share:/var/lib/snapd/desktop, TEXTDOMAINDIR=/usr/share/locale/, GTK_IM_MODULE=ibus, DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/1000/bus,guid=e1b93d29865be954f2c06fce5fe015cb, PUB_HOSTED_URL=https://pub.flutter-io.cn, XDG_CURRENT_DESKTOP=ubuntu:GNOME, JOURNAL_STREAM=9:49578, SSH_AGENT_PID=1797, COLORTERM=truecolor, QT4_IM_MODULE=xim, SESSION_MANAGER=local/byhuang-virtual-machine:#/tmp/.ICE-unix/1702,unix/byhuang-virtual-machine:/tmp/.ICE-unix/1702, USERNAME=byhuang, LOGNAME=byhuang, PWD=/home/byhuang/flutterOpenProject/flutter_init_demo/android, MANAGERPID=1667, IM_CONFIG_PHASE=2, LANGUAGE=zh_CN:zh, LESSOPEN=| /usr/bin/lesspipe %s, SHELL=/bin/bash, OLDPWD=/home/byhuang/flutterOpenProject/flutter_init_demo/android, GNOME_DESKTOP_SESSION_ID=this-is-deprecated, GTK_MODULES=gail:atk-bridge, GNOME_TERMINAL_SCREEN=/org/gnome/Terminal/screen/1c43d352_e494_406b_b4b1_8e1a6bf95400, CLUTTER_IM_MODULE=xim, TEXTDOMAIN=im-config, DBUS_STARTER_ADDRESS=unix:path=/run/user/1000/bus,guid=e1b93d29865be954f2c06fce5fe015cb, FLUTTER_STORAGE_BASE_URL=https://storage.flutter-io.cn, XDG_SESSION_DESKTOP=ubuntu, LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:, SHLVL=2, LESSCLOSE=/usr/bin/lesspipe %s %s, QT_IM_MODULE=xim, JAVA_HOME=/home/byhuang/software/android-studio/jre, TERM=xterm-256color, FLUTTER_ROOT=/home/byhuang/src/FlutterSDK/flutter, XDG_CONFIG_DIRS=/etc/xdg/xdg-ubuntu:/etc/xdg, GNOME_TERMINAL_SERVICE=:1.59, LANG=zh_CN.UTF-8, XDG_SESSION_TYPE=x11, XDG_SESSION_ID=4, DISPLAY=:1, FLUTTER_SUPPRESS_ANALYTICS=true, 
GPG_AGENT_INFO=/run/user/1000/gnupg/S.gpg-agent:0:1, DESKTOP_SESSION=ubuntu, USER=byhuang, XDG_MENU_PREFIX=gnome-, VTE_VERSION=5202, WINDOWPATH=1, QT_ACCESSIBILITY=1, XDG_SEAT=seat0, SSH_AUTH_SOCK=/run/user/1000/keyring/ssh, FLUTTER_ALREADY_LOCKED=true, GNOME_SHELL_SESSION_MODE=ubuntu, XDG_RUNTIME_DIR=/run/user/1000, XDG_VTNR=1, DBUS_STARTER_BUS_TYPE=session, HOME=/home/byhuang}
11:38:04.826 [DEBUG] [org.gradle.launcher.daemon.server.exec.LogToClient] About to start relaying all logs to the client via the connection.
11:38:04.826 [INFO] [org.gradle.launcher.daemon.server.exec.LogToClient] The client will now receive all logging from the daemon (pid: 3422). The daemon log file: /home/byhuang/.gradle/daemon/5.6.2/daemon-3422.out.log
11:38:04.832 [DEBUG] [org.gradle.launcher.daemon.server.exec.RequestStopIfSingleUsedDaemon] Requesting daemon stop after processing Build{id=0846277b-6ded-4171-9a9a-ce34e51b07be, currentDir=/home/byhuang/flutterOpenProject/flutter_init_demo/android}
11:38:04.832 [LIFECYCLE] [org.gradle.launcher.daemon.server.DaemonStateCoordinator] Daemon will be stopped at the end of the build stopping after processing
11:38:04.832 [DEBUG] [org.gradle.launcher.daemon.server.DaemonStateCoordinator] Stop as soon as idle requested. The daemon is busy: true
11:38:04.835 [DEBUG] [org.gradle.launcher.daemon.server.DaemonStateCoordinator] daemon stop has been requested. Sleeping until state changes.
11:38:04.835 [DEBUG] [org.gradle.launcher.daemon.server.exec.ExecuteBuild] The daemon has started executing the build.
11:38:04.839 [DEBUG] [org.gradle.launcher.daemon.server.exec.ExecuteBuild] Executing build with daemon context: DefaultDaemonContext[uid=cbbde454-54e4-47ec-b258-28179c1fddfe,javaHome=/home/byhuang/software/android-studio/jre,daemonRegistryDir=/home/byhuang/.gradle/daemon,pid=3422,idleTimeout=120000,priority=NORMAL,daemonOpts=-Xmx1536M,-Dfile.encoding=UTF-8,-Duser.country=CN,-Duser.language=zh,-Duser.variant]
----- End of the daemon log -----
FAILURE: Build failed with an exception.
* What went wrong:
Gradle build daemon disappeared unexpectedly (it may have been killed or may have crashed)
* Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output. Run with --scan to get full insights.
* Get more help at https://help.gradle.org
Running Gradle task 'assembleDebug'...
Running Gradle task 'assembleDebug'... Done 29.2s
Exception: Gradle task assembleDebug failed with exit code 1
I faced the same error today on Ubuntu. To temporarily bypass the error, adding --no-daemon to whichever Gradle command I was running worked. Deleting the daemon subfolder in the /home/{username}/.gradle folder permanently fixed the error.
Go to the project folder -> android -> delete the .gradle folder.
As the other answers suggest, delete the .gradle folder. That alone didn't work for me; I had to do that and then shut down and restart my Ubuntu machine.

Gracefully killing container and pods in kubernetes with spark exception

What is the recommended way to gracefully kill a container and driver pods in Kubernetes when an application fails or hits an exception? Currently, when my application runs into an exception, my pods and executors continue to run, and I noticed that my container doesn't get killed unless an explicit exit 1 is used. For some reason my Spark application doesn't cause an exit 1 status or SIGTERM signal to be sent to the container or pod.
I tried adding the following to the YAML spec based on recommendations, but the driver pod and executors still don't terminate:
spec:
  terminationGracePeriodSeconds: 0
  driver:
    lifecycle:
      preStop:
        exec:
          command:
            - /bin/bash
            - -c
            - touch /var/run/killspark && sleep 65
The preStop lifecycle hook you added won't have any effect, since it is only triggered
before a container is terminated due to an API request or management
event such as liveness probe failure, preemption, resource contention
and others
https://kubernetes.io/docs/concepts/containers/container-lifecycle-hooks/
I suspect what you really have to figure out is why your container's main process keeps running despite the exception.
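One common pattern, sketched below for a PySpark driver (the run_job function and its failure are hypothetical), is to make the main process exit with a non-zero code on any unhandled exception so that Kubernetes sees the container as failed and the pod can terminate:

# Sketch (PySpark assumed, names hypothetical): exit non-zero when the job fails so the
# driver container terminates instead of idling after the exception.
import sys
from pyspark.sql import SparkSession

def run_job(spark):
    # ... real transformation/action logic would go here ...
    raise RuntimeError("simulated failure")

if __name__ == "__main__":
    spark = SparkSession.builder.appName("graceful-failure-demo").getOrCreate()
    try:
        run_job(spark)
    except Exception as exc:
        print(f"Job failed: {exc}", file=sys.stderr)
        spark.stop()
        sys.exit(1)  # non-zero exit code marks the container as failed
    spark.stop()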

could not run codedeployagent on windows 10

I am following this guide: https://docs.aws.amazon.com/codedeploy/latest/userguide/register-on-premises-instance-iam-user-arn.html, but I could not run the AWS CodeDeploy agent on a Windows 10 machine.
Log file:
2019-11-23T20:28:51 INFO [codedeploy-agent(2988)]: Version file found in C:/ProgramData/Amazon/CodeDeploy/.version with agent version OFFICIAL_1.0.1.1597_msi.
2019-11-23T20:28:51 ERROR [codedeploy-agent(2988)]: InstanceAgent::Plugins::CodeDeployPlugin::CommandPoller: Missing credentials - please check if this instance was started with an IAM instance profile
2019-11-23T20:28:51 DEBUG [codedeploy-agent(2988)]: InstanceAgent::Plugins::CodeDeployPlugin::CommandPoller: Sleeping 89 seconds.
2019-11-23T20:30:03 INFO [codedeploy-agent(2988)]: CodeDeploy Instance Agent Service: stopping the agent
2019-11-23T20:30:20 INFO [codedeploy-agent(2988)]: InstanceAgent::Plugins::CodeDeployPlugin::CommandPoller: Gracefully shutting down agent child threads now, will wait up to 7200 seconds
2019-11-23T20:30:20 INFO [codedeploy-agent(2988)]: InstanceAgent::Plugins::CodeDeployPlugin::CommandPoller: All agent child threads have been shut down
2019-11-23T20:30:20 INFO [codedeploy-agent(2988)]: CodeDeploy Instance Agent Service: command execution threads shutdown, agent exiting now
Any idea please?
The CodeDeploy agent is not tested on Windows 10 for EC2 or on-premises installation. Please refer to these tables for the supported OS list:
https://docs.aws.amazon.com/codedeploy/latest/userguide/codedeploy-agent.html#codedeploy-agent-supported-operating-systems-ec2
https://docs.aws.amazon.com/codedeploy/latest/userguide/codedeploy-agent.html#codedeploy-agent-supported-operating-systems-on-premises

Airflow Task Fails

I'm new to Airflow.
Airflow is running from a Docker image with the LocalExecutor and executes a task that moves data from MySQL to Google Cloud Storage, shown below. About 1 million records are expected to be pulled.
# Extract and Load customer table
extract_customer_table_from_mysql = MySqlToGoogleCloudStorageOperator(
    task_id='InitialExtractCustomerToGCS',
    mysql_conn_id='iprocure_staging_db_conn',
    sql='SELECT * FROM iProcureMain.customer',
    bucket=bucket,
    filename='iprocure-bigquery-bucket/customer/{{ ts_nodash }}/iprocure-bigquery-bucket-customer.json',
    schema_filename='iprocure-bigquery-bucket/customer_schema.json',
    google_cloud_storage_conn_id='iprocure_gcs_conn',
    dag=dag)
After some time it fails. Following are the logs for this task execution:
[2019-05-15 10:12:13,392] {{models.py:1595}} INFO - Executing <Task(MySqlToGoogleCloudStorageOperator): InitialExtractCustomerToGCS> on 2019-05-15T09:05:58.731452+00:00
[2019-05-15 10:12:13,393] {{base_task_runner.py:118}} INFO - Running: ['bash', '-c', 'airflow run MySQLtoBQInitalLoad InitialExtractCustomerToGCS 2019-05-15T09:05:58.731452+00:00 --job_id 19 --raw -sd DAGS_FOLDER/initial_load.py --cfg_path /tmp/tmp_f96d99z']
[2019-05-15 10:12:19,852] {{base_task_runner.py:101}} INFO - Job 19: Subtask InitialExtractCustomerToGCS [2019-05-15 10:12:19,849] {{settings.py:174}} INFO - setting.configure_orm(): Using pool settings. pool_size=5, pool_recycle=1800
[2019-05-15 10:12:22,540] {{base_task_runner.py:101}} INFO - Job 19: Subtask InitialExtractCustomerToGCS [2019-05-15 10:12:22,538] {{__init__.py:51}} INFO - Using executor LocalExecutor
[2019-05-15 10:12:27,379] {{base_task_runner.py:101}} INFO - Job 19: Subtask InitialExtractCustomerToGCS [2019-05-15 10:12:27,365] {{models.py:271}} INFO - Filling up the DagBag from /usr/local/airflow/dags/initial_load.py
[2019-05-15 10:12:28,528] {{base_task_runner.py:101}} INFO - Job 19: Subtask InitialExtractCustomerToGCS [2019-05-15 10:12:28,524] {{cli.py:484}} INFO - Running <TaskInstance: MySQLtoBQInitalLoad.InitialExtractCustomerToGCS 2019-05-15T09:05:58.731452+00:00 [running]> on host 3c7603479eef
[2019-05-15 10:12:28,728] {{logging_mixin.py:95}} INFO - [2019-05-15 10:12:28,718] {{base_hook.py:83}} INFO - Using connection to: datawarehousereplica.crgjkux43gqm.us-west-2.rds.amazonaws.com
[2019-05-15 10:32:12,569] {{logging_mixin.py:95}} INFO - [2019-05-15 10:32:12,493] {{jobs.py:2627}} INFO - Task exited with return code -9
There was a similar question: Airflow kills my tasks after 1 minute
It looks like Airflow ran out of memory: return code -9 means the task process was killed by SIGKILL, which usually indicates the kernel's OOM killer terminated it.
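If memory is the limit, one common workaround is to split the extract into several smaller tasks instead of one SELECT * over the whole table. The sketch below reuses the same connections, bucket and dag variables as the question; the customer_id column and the chunk bounds are assumptions for illustration:

# Sketch: chunk the ~1M-row extract so no single task processes the whole result set.
# Assumes an integer customer_id column; adjust CHUNK and the number of ranges to the data.
from airflow.contrib.operators.mysql_to_gcs import MySqlToGoogleCloudStorageOperator

CHUNK = 250000
for i in range(4):
    lo, hi = i * CHUNK, (i + 1) * CHUNK
    MySqlToGoogleCloudStorageOperator(
        task_id=f'InitialExtractCustomerToGCS_chunk_{i}',
        mysql_conn_id='iprocure_staging_db_conn',
        sql=f'SELECT * FROM iProcureMain.customer '
            f'WHERE customer_id >= {lo} AND customer_id < {hi}',
        bucket=bucket,
        filename=f'iprocure-bigquery-bucket/customer/{{{{ ts_nodash }}}}/customer_chunk_{i}.json',
        google_cloud_storage_conn_id='iprocure_gcs_conn',
        dag=dag)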

Why did Spark standalone Worker node-1 terminate after RECEIVED SIGNAL 15: SIGTERM?

Note: this error was thrown before any components were executed by Spark.
Logs
Worker Node1:
17/05/18 23:12:52 INFO Worker: Successfully registered with master spark://spark-master-1.com:7077
17/05/18 23:58:41 ERROR Worker: RECEIVED SIGNAL 15: SIGTERM
Master Node:
17/05/18 23:12:52 INFO Master: Registering worker spark-worker-1com:56056 with 2 cores, 14.5 GB RAM
17/05/18 23:14:20 INFO Master: Registering worker spark-worker-2.com:53986 with 2 cores, 14.5 GB RAM
17/05/18 23:59:42 WARN Master: Removing spark-worker-1com-56056 because we got no heartbeat in 60 seconds
17/05/18 23:59:42 INFO Master: Removing spark-worker-2.com:56056
17/05/19 00:00:03 ERROR Master: RECEIVED SIGNAL 15: SIGTERM
Worker Node2:
17/05/18 23:14:20 INFO Worker: Successfully registered with master spark://spark-master-node-2.com:7077
17/05/18 23:59:40 ERROR Worker: RECEIVED SIGNAL 15: SIGTERM
TL;DR I think someone explicitly ran the kill command or sbin/stop-worker.sh.
"RECEIVED SIGNAL 15: SIGTERM" is reported by a shutdown hook to log TERM, HUP, INT signals on UNIX-like systems:
/** Register a signal handler to log signals on UNIX-like systems. */
def registerLogger(log: Logger): Unit = synchronized {
  if (!loggerRegistered) {
    Seq("TERM", "HUP", "INT").foreach { sig =>
      SignalUtils.register(sig) {
        log.error("RECEIVED SIGNAL " + sig)
        false
      }
    }
    loggerRegistered = true
  }
}
In your case it means that the process received SIGTERM to stop itself:
The SIGTERM signal is a generic signal used to cause program termination. Unlike SIGKILL, this signal can be blocked, handled, and ignored. It is the normal way to politely ask a program to terminate.
That's what is sent when you execute kill, or when you use the ./sbin/stop-master.sh or ./sbin/stop-worker.sh shell scripts, which in turn call sbin/spark-daemon.sh with the stop command, which kills the JVM process for a master or a worker:
kill "$TARGET_ID" && rm -f "$pid"
