We are running Spark jobs on Kubernetes (EKS, not EMR) using the Spark operator.
After some time, some executors receive a SIGTERM. Here is an example log from an executor:
Feb 27 19:44:10.447 s3a-file-system metrics system stopped.
Feb 27 19:44:10.446 Stopping s3a-file-system metrics system...
Feb 27 19:44:10.329 Deleting directory /var/data/spark-05983610-6e9c-4159-a224-0d75fef2dafc/spark-8a21ea7e-bdca-4ade-9fb6-d4fe7ef5530f
Feb 27 19:44:10.328 Shutdown hook called
Feb 27 19:44:10.321 BlockManager stopped
Feb 27 19:44:10.319 MemoryStore cleared
Feb 27 19:44:10.284 RECEIVED SIGNAL TERM
Feb 27 19:44:10.169 block read in memory in 306 ms. row count = 113970
Feb 27 19:44:09.863 at row 0. reading next block
Feb 27 19:44:09.860 RecordReader initialized will read a total of 113970 records.
On the driver side, about two minutes later, the heartbeat timeout expires and the driver decides to kill the executors:
Feb 27 19:46:12.155 Asked to remove non-existent executor 37
Feb 27 19:46:12.155 Removal of executor 37 requested
Feb 27 19:46:12.155 Trying to remove executor 37 from BlockManagerMaster.
Feb 27 19:46:12.154 task 2463.0 in stage 0.0 (TID 2463) failed because while it was being computed, its executor exited for a reason unrelated to the task. Not counting this failure towards the maximum number of failures for the task.
Feb 27 19:46:12.154 Executor 37 on 172.16.52.23 killed by driver.
Feb 27 19:46:12.153 Trying to remove executor 44 from BlockManagerMaster.
Feb 27 19:46:12.153 Asked to remove non-existent executor 44
Feb 27 19:46:12.153 Removal of executor 44 requested
Feb 27 19:46:12.153 Actual list of executor(s) to be killed is 37
Feb 27 19:46:12.152 task 2595.0 in stage 0.0 (TID 2595) failed because while it was being computed, its executor exited for a reason unrelated to the task. Not counting this failure towards the maximum number of failures for the task.
Feb 27 19:46:12.152 Executor 44 on 172.16.55.46 killed by driver.
Feb 27 19:46:12.152 Requesting to kill executor(s) 37
Feb 27 19:46:12.151 Actual list of executor(s) to be killed is 44
Feb 27 19:46:12.151 Requesting to kill executor(s) 44
Feb 27 19:46:12.151 Removing executor 37 with no recent heartbeats: 160277 ms exceeds timeout 120000 ms
Feb 27 19:46:12.151 Removing executor 44 with no recent heartbeats: 122513 ms exceeds timeout 120000 ms
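For what it's worth, the 120000 ms in these messages matches Spark's default spark.network.timeout of 120s. Below is a sketch of how those timeouts could be relaxed (assuming a PySpark entry point), although that would only tolerate late heartbeats rather than explain the SIGTERM:

```python
# Sketch: raise Spark's heartbeat-related timeouts (PySpark assumed).
# This does not fix the underlying executor death; it only makes the
# driver wait longer before declaring an executor lost.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("heartbeat-timeout-tuning")
    # Executors send heartbeats at this interval; keep it well below the
    # network timeout.
    .config("spark.executor.heartbeatInterval", "20s")
    # The driver removes an executor after this long without a heartbeat
    # (default 120s, which is the 120000 ms seen above).
    .config("spark.network.timeout", "300s")
    .getOrCreate()
)
```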
I tried to determine whether we are crossing some resource limit at the Kubernetes level, but couldn't find any evidence of that.
What can I look at to understand why Kubernetes is killing the executors?
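For example, is there anything recorded on the Kubernetes side about why a pod was terminated? A sketch of checking a pod's last terminated state and the namespace events with the official Python client (the pod name and namespace below are placeholders):

```python
# Sketch: inspect why a Spark executor pod was terminated, using the
# official kubernetes Python client. Pod name and namespace are placeholders.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() in-cluster
v1 = client.CoreV1Api()

namespace = "spark-jobs"        # assumed namespace
pod_name = "my-app-exec-37"     # assumed executor pod name

pod = v1.read_namespaced_pod(pod_name, namespace)
for cs in pod.status.container_statuses or []:
    term = cs.last_state.terminated or cs.state.terminated
    if term:
        # e.g. reason="OOMKilled" / "Error", exit_code=137 for SIGKILL
        print(cs.name, term.reason, term.exit_code)

# Evictions, preemptions and node shutdowns usually leave events behind.
for ev in v1.list_namespaced_event(namespace).items:
    if ev.involved_object.name == pod_name:
        print(ev.reason, ev.message)
```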
FOLLOW UP:
I missed a log message on the driver side:
Mar 01 21:04:23.471 Disabling executor 50.
and then on the executor side:
Mar 01 21:04:23.348 RECEIVED SIGNAL TERM
I looked at which class writes the "Disabling executor" log message and found KubernetesDriverEndpoint. It seems the onDisconnected method is called for all these executors, and that method calls disableExecutor in DriverEndpoint.
So the question now is why these executors are considered disconnected.
Looking at the explanation on this site:
https://books.japila.pl/apache-spark-internals/scheduler/DriverEndpoint/#ondisconnected-callback
it says:
Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
But I couldn't find any WARN logs on the driver side. Any suggestions?
The reason the executors were being killed was that we were running them on AWS spot instances. This is how it looks: the first sign we saw of an executor being killed is this line in its log:
Feb 27 19:44:10.284 RECEIVED SIGNAL TERM
Once we moved the executors to on-demand instances as well, not a single executor was terminated in 20-hour jobs.
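If you still need spot nodes in the cluster for other workloads, one option is to pin the Spark pods to on-demand capacity with a node selector. A sketch assuming EKS managed node groups, which label nodes with eks.amazonaws.com/capacityType; with the Spark operator the same thing can be expressed as a nodeSelector on the SparkApplication executor spec:

```python
# Sketch: pin Spark-created pods to on-demand nodes via a node selector.
# Assumes EKS managed node groups, which label nodes with
# eks.amazonaws.com/capacityType = ON_DEMAND | SPOT; adjust to your labels.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("on-demand-executors")
    .config(
        "spark.kubernetes.node.selector.eks.amazonaws.com/capacityType",
        "ON_DEMAND",
    )
    .getOrCreate()
)
```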
I think my webapp is pretty cool. It's a natural language playlist generator. It takes in a description of a playlist, like:
"midwest emo songs to cry to in the shower because my girlfriend broke up with me"
and converts it into an embedding generated by an NLP transformer model (specifically SentenceTransformers) and does recommender system stuff to return songs in a playlist for a user.
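For context, the encode-and-rank step looks roughly like this (placeholder model name and a toy song list, not the real data):

```python
# Rough sketch of the encode-and-rank step. The model name and the
# in-memory song list are placeholders for the real recommender data.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model

songs = ["Never Meant - American Football", "Two Beers In - Free Throw"]
song_embeddings = model.encode(songs, convert_to_tensor=True)

query = "midwest emo songs to cry to in the shower"
query_embedding = model.encode(query, convert_to_tensor=True)

# Cosine similarity between the query and every song; highest scores win.
scores = util.cos_sim(query_embedding, song_embeddings)[0]
ranked = sorted(zip(songs, scores.tolist()), key=lambda p: p[1], reverse=True)
print(ranked)
```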
My website hangs after the user has submitted their description, and I get a 504 load balancer error after 5 minutes. After tracing where the code hangs, it seems to stop during model.encode(text), which runs the user's query through the ML model to get the embedding.
This code runs no problem on my local machine, and when I run it in the console it also has no problem processing the text through the ML model.
What should I do? Add more workers? Free up space in the program? Let me know.
Below are my server logs after model.encode() is run.
2022-11-26 07:53:26 entered the get embedding function
2022-11-26 07:53:27 announcing my loyalty to the Emperor...
2022-11-26 07:54:11 Sat Nov 26 07:54:10 2022 - HARAKIRI ON WORKER 4 (pid: 18, try: 1)
2022-11-26 07:54:11 Sat Nov 26 07:54:10 2022 - HARAKIRI !!! worker 4 status !!!
2022-11-26 07:54:11 Sat Nov 26 07:54:10 2022 - HARAKIRI [core 0] 10.0.0.75 - POST / since 1669448649
2022-11-26 07:54:11 Sat Nov 26 07:54:10 2022 - HARAKIRI !!! end of worker 4 status !!!
2022-11-26 07:54:11 DAMN ! worker 4 (pid: 18) died, killed by signal 9 :( trying respawn ...
2022-11-26 07:54:11 Respawned uWSGI worker 4 (new pid: 33)
2022-11-26 07:54:11 spawned 2 offload threads for uWSGI worker 4
2022-11-26 08:03:28 Sat Nov 26 08:03:27 2022 - HARAKIRI ON WORKER 3 (pid: 15, try: 1)
2022-11-26 08:03:28 Sat Nov 26 08:03:27 2022 - HARAKIRI !!! worker 3 status !!!
2022-11-26 08:03:28 Sat Nov 26 08:03:27 2022 - HARAKIRI [core 0] 10.0.0.75 - POST / since 1669449206
2022-11-26 08:03:28 Sat Nov 26 08:03:27 2022 - HARAKIRI !!! end of worker 3 status !!!
2022-11-26 08:03:28 DAMN ! worker 3 (pid: 15) died, killed by signal 9 :( trying respawn ...
2022-11-26 08:03:28 Respawned uWSGI worker 3 (new pid: 36)
2022-11-26 08:03:28 spawned 2 offload threads for uWSGI worker 3
I tried running this code in the PythonAnywhere console, and it ran just fine. I'm stuck!
I used an always-on task to run the queries through the model and spit them back into the main script.
Works like a dream!
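For anyone with the same problem, the shape of that setup is roughly this (the table, columns, paths and model name are assumptions; the point is that the model lives in one long-running process instead of inside a request-bound uWSGI worker):

```python
# worker.py - hypothetical always-on task: loads the model once, then polls
# a shared SQLite table for pending queries and writes back their embeddings.
# Table name, columns, DB path and model name are all assumptions.
import json
import sqlite3
import time

from sentence_transformers import SentenceTransformer

DB_PATH = "/home/youruser/playlists/jobs.db"     # assumed path
model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model

def main():
    conn = sqlite3.connect(DB_PATH)
    while True:
        row = conn.execute(
            "SELECT id, text FROM jobs WHERE status = 'pending' LIMIT 1"
        ).fetchone()
        if row is None:
            time.sleep(0.5)
            continue
        job_id, text = row
        embedding = model.encode(text).tolist()
        conn.execute(
            "UPDATE jobs SET status = 'done', embedding = ? WHERE id = ?",
            (json.dumps(embedding), job_id),
        )
        conn.commit()

if __name__ == "__main__":
    main()
```

The web request then just inserts a pending row and briefly polls for the result, so the uWSGI worker never has to load or run the model itself.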
cassandra.service: main process exited, code=killed, status=11/SEGV
env: apache-cassandra-4.0.0, jdk-11.0.12, ZGC
jvm: -Xms31G -Xmx31G
host: 16 cores, 128G RAM
/var/log/message:
Jul 4 13:57:10 iZ2zec1q29sosy4bdv893qZ systemd-logind: Removed session 277.
Jul 4 13:57:12 iZ2zec1q29sosy4bdv893qZ cassandra: INFO [CompactionExecutor:4] 2022-07-04 13:57:12,074 CompactionTask.java:245 - Compacted (24af5250-fb5e-11ec-aa2a-6b96728ba428) 4 sstables to [/data/cassandra/data/data/spaceport/xm_coupon_code_realtime1-d77e7f10ebcc11ecae252faeea3c28c4/nb-6494-big,] to level=0. 27.414MiB to 27.412MiB (~99% of original) in 1,812ms. Read Throughput = 15.127MiB/s, Write Throughput = 15.126MiB/s, Row Throughput = ~123,625/s. 32,718 total partitions merged to 32,689. Partition merge counts were {1:32663, 2:23, 3:3, }
Jul 4 13:57:12 iZ2zec1q29sosy4bdv893qZ cassandra: INFO [NonPeriodicTasks:1] 2022-07-04 13:57:12,083 SSTable.java:111 - Deleting sstable: /data/cassandra/data/data/spaceport/xm_coupon_code_realtime1-d77e7f10ebcc11ecae252faeea3c28c4/nb-6490-big
Jul 4 13:57:12 iZ2zec1q29sosy4bdv893qZ cassandra: INFO [NonPeriodicTasks:1] 2022-07-04 13:57:12,084 SSTable.java:111 - Deleting sstable: /data/cassandra/data/data/spaceport/xm_coupon_code_realtime1-d77e7f10ebcc11ecae252faeea3c28c4/nb-6493-big
Jul 4 13:57:12 iZ2zec1q29sosy4bdv893qZ cassandra: INFO [NonPeriodicTasks:1] 2022-07-04 13:57:12,085 SSTable.java:111 - Deleting sstable: /data/cassandra/data/data/spaceport/xm_coupon_code_realtime1-d77e7f10ebcc11ecae252faeea3c28c4/nb-6491-big
Jul 4 13:57:12 iZ2zec1q29sosy4bdv893qZ cassandra: INFO [NonPeriodicTasks:1] 2022-07-04 13:57:12,085 SSTable.java:111 - Deleting sstable: /data/cassandra/data/data/spaceport/xm_coupon_code_realtime1-d77e7f10ebcc11ecae252faeea3c28c4/nb-6492-big
Jul 4 14:00:01 iZ2zec1q29sosy4bdv893qZ systemd: Started Session 293 of user root.
Jul 4 14:01:01 iZ2zec1q29sosy4bdv893qZ systemd: Started Session 294 of user root.
Jul 4 14:01:59 iZ2zec1q29sosy4bdv893qZ systemd: cassandra.service: main process exited, code=killed, status=11/SEGV
Jul 4 14:02:00 iZ2zec1q29sosy4bdv893qZ systemd: Unit cassandra.service entered failed state.
Jul 4 14:02:00 iZ2zec1q29sosy4bdv893qZ systemd: cassandra.service failed.
Jul 4 14:02:05 iZ2zec1q29sosy4bdv893qZ systemd: cassandra.service holdoff time over, scheduling restart.
Jul 4 14:02:05 iZ2zec1q29sosy4bdv893qZ systemd: Stopped Cassandra Server Service.
Jul 4 14:02:05 iZ2zec1q29sosy4bdv893qZ systemd: Started Cassandra Server Service.
Jul 4 14:02:55 iZ2zec1q29sosy4bdv893qZ cassandra: CompileCommand: dontinline org/apache/cassandra/db/Columns$Serializer.deserializeLargeSubset(Lorg/apache/cassandra/io/util/DataInputPlus;Lorg/apache/cassandra/db/Columns;I)Lorg/apache/cassandra/db/Columns;
The log entries you posted on their own don't explain what the problem is. You will need to review the Cassandra system.log for clues.
A friendly reminder that Stack Overflow is for getting help with coding, algorithm, or programming language problems. For future reference, you should post DB admin/ops questions on dba.stackexchange.com. If you post it there, I'd be happy to help. Cheers!
As the title says, I'm having weird behaviour with at:
It puts my jobs in the queue and runs them correctly, but only after another job is scheduled:
This is the situation: when I add a new job, then job 17 gets executed.
The atd service is running fine. My system is Linux 5.10.98-1-MANJARO.
PS: I already tried telling it not to email me (-M), using absolute/relative paths, etc. Jobs are executed only when atd is "triggered" or woken up by scheduling a new job or restarting the systemd service.
PPS: I don't know if this helps, but this is the log for atd.service when I check its status:
feb 22 09:12:01 sant-nuc systemd[1]: Starting Deferred execution scheduler...
feb 22 09:12:01 sant-nuc systemd[1]: Started Deferred execution scheduler.
feb 22 14:16:44 sant-nuc atd[157517]: pam_unix(atd:session): session opened for user santiago(uid=1000) by (uid=2)
feb 22 14:16:44 sant-nuc atd[157517]: pam_env(atd:setcred): deprecated reading of user environment enabled
feb 22 14:16:44 sant-nuc atd[157517]: pam_env(atd:setcred): deprecated reading of user environment enabled
feb 22 14:16:44 sant-nuc atd[157517]: pam_unix(atd:session): session closed for user santiago
This is a bug in at 3.2.4:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1004972
Apparently this will be fixed in version 3.2.5. It's not available yet through the Arch Linux repo, so I have downgraded to 3.2.2, which does not have this issue.
I am trying to add another node to a production Cassandra cluster because disk space utilization across the nodes is over 90%. However, the new node has been in the joining state for over 2 days. I also noticed that one of the existing nodes went down (DN) because it is at 100% disk space utilization; the Cassandra server is unable to run on that instance!
Will this affect bootstrap completion of the new node?
Any immediate solutions for restoring space on the node that went down?
If I remove this node from the ring, it may add more data load and increase disk space usage on the other nodes.
Can I temporarily move some SSTables (like the files listed below) off the instance, bring up the server, run cleanup, and then add the files back?
-rw-r--r--. 1 polkitd input 5551459 Sep 17 2020 mc-572-big-CompressionInfo.db
-rw-r--r--. 1 polkitd input 15859691072 Sep 17 2020 mc-572-big-Data.db
-rw-r--r--. 1 polkitd input 8 Sep 17 2020 mc-572-big-Digest.crc32
-rw-r--r--. 1 polkitd input 22608920 Sep 17 2020 mc-572-big-Filter.db
-rw-r--r--. 1 polkitd input 5634549206 Sep 17 2020 mc-572-big-Index.db
-rw-r--r--. 1 polkitd input 12538 Sep 17 2020 mc-572-big-Statistics.db
-rw-r--r--. 1 polkitd input 44510338 Sep 17 2020 mc-572-big-Summary.db
-rw-r--r--. 1 polkitd input 92 Sep 17 2020 mc-572-big-TOC.txt
If you are using vnodes, then the downed node will surely impact bootstrapping. For immediate relief, identify tables that are not used by live traffic and move their SSTables to a backup location.
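To pick candidates, a quick scan of the data directory by table size can help. A sketch assuming the default <data_dir>/<keyspace>/<table-uuid> layout; DATA_DIR is an assumption, adjust it to your install:

```python
# Sketch: rank Cassandra table directories by on-disk size to find SSTable
# sets worth moving to backup. Assumes the default layout
# <data_dir>/<keyspace>/<table-uuid>/ ; DATA_DIR is an assumption.
import os

DATA_DIR = "/var/lib/cassandra/data"  # assumed data directory

def dir_size(path):
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total

sizes = []
for keyspace in os.listdir(DATA_DIR):
    ks_path = os.path.join(DATA_DIR, keyspace)
    if not os.path.isdir(ks_path):
        continue
    for table in os.listdir(ks_path):
        t_path = os.path.join(ks_path, table)
        if os.path.isdir(t_path):
            sizes.append((dir_size(t_path), keyspace, table))

for size, keyspace, table in sorted(sizes, reverse=True)[:20]:
    print(f"{size / 1024**3:8.2f} GiB  {keyspace}/{table}")
```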
I resolved this by temporarily increasing the EBS volume (disk space) on that node, bringing up the server, then removing the node from the cluster, clearing out the Cassandra data folders, decreasing the EBS volume, and adding the node back to the cluster.
One thing I noticed was that removing the node from the cluster increased disk space usage on the other nodes. So I added additional nodes to distribute the load, then ran cleanup on all the other nodes before moving on to removing the node from the cluster.
We have an XtraDB cluster with three nodes. One node was not stopped properly and won't start. The other two nodes are working and responding correctly. The only thing in the logs is this:
-- Unit mysql.service has begun starting up.
Aug 25 04:40:45 percona-prod-perconaxtradb-vm-0 /etc/init.d/mysql[2503]: MySQL PID not found, pid_file detected/guessed: /var/run/mysql
Aug 25 04:40:52 percona-prod-perconaxtradb-vm-0 mysql[2462]: Starting MySQL (Percona XtraDB Cluster) database server: mysqld . . . . .
Aug 25 04:40:52 percona-prod-perconaxtradb-vm-0 mysql[2462]: failed!
Aug 25 04:40:52 percona-prod-perconaxtradb-vm-0 systemd[1]: mysql.service: control process exited, code=exited status=1
Aug 25 04:40:52 percona-prod-perconaxtradb-vm-0 systemd[1]: Failed to start LSB: Start and stop the mysql (Percona XtraDB Cluster) daem
-- Subject: Unit mysql.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
In /var/lib/mysql/wsrep_recovery.qEEkjd we found this:
2018-08-25T05:49:31.055887Z 0 [ERROR] Found 20 prepared transactions! It means that mysqld was not shut down properly last time and critical recovery information (last binlog or tc.log file) was manually deleted after a crash. You have to start mysqld with --tc-heuristic-recover switch to commit or rollback pending transactions.
2018-08-25T05:49:31.055892Z 0 [ERROR] Aborting
2018-08-25T05:49:31.055901Z 0 [Note] Binlog end
We would like to completely drop these 20 prepared transactions.
The other two nodes are consistent and working, so it would be enough to tell this node "ignore your state and sync with other nodes".
In the end we removed the /data folder on the dead node and restarted it. The node then started SST replication, which takes a long time; the only progress you can see is the growing size of the folder. But then it worked.