We are running spark jobs on Kubernetes (EKS non EMR) using Spark operator.
After some time some executors get SIGNAL TERM, an example log from executor:
Feb 27 19:44:10.447 s3a-file-system metrics system stopped.
Feb 27 19:44:10.446 Stopping s3a-file-system metrics system...
Feb 27 19:44:10.329 Deleting directory /var/data/spark-05983610-6e9c-4159-a224-0d75fef2dafc/spark-8a21ea7e-bdca-4ade-9fb6-d4fe7ef5530f
Feb 27 19:44:10.328 Shutdown hook called
Feb 27 19:44:10.321 BlockManager stopped
Feb 27 19:44:10.319 MemoryStore cleared
Feb 27 19:44:10.284 RECEIVED SIGNAL TERM
Feb 27 19:44:10.169 block read in memory in 306 ms. row count = 113970
Feb 27 19:44:09.863 at row 0. reading next block
Feb 27 19:44:09.860 RecordReader initialized will read a total of 113970 records.
On the driver side, after 2 minutes the driver stops receiving heartbeats and then decides to kill the executor
Feb 27 19:46:12.155 Asked to remove non-existent executor 37
Feb 27 19:46:12.155 Removal of executor 37 requested
Feb 27 19:46:12.155 Trying to remove executor 37 from BlockManagerMaster.
Feb 27 19:46:12.154 task 2463.0 in stage 0.0 (TID 2463) failed because while it was being computed, its executor exited for a reason unrelated to the task. Not counting this failure towards the maximum number of failures for the task.
Feb 27 19:46:12.154 Executor 37 on 172.16.52.23 killed by driver.
Feb 27 19:46:12.153 Trying to remove executor 44 from BlockManagerMaster.
Feb 27 19:46:12.153 Asked to remove non-existent executor 44
Feb 27 19:46:12.153 Removal of executor 44 requested
Feb 27 19:46:12.153 Actual list of executor(s) to be killed is 37
Feb 27 19:46:12.152 task 2595.0 in stage 0.0 (TID 2595) failed because while it was being computed, its executor exited for a reason unrelated to the task. Not counting this failure towards the maximum number of failures for the task.
Feb 27 19:46:12.152 Executor 44 on 172.16.55.46 killed by driver.
Feb 27 19:46:12.152 Requesting to kill executor(s) 37
Feb 27 19:46:12.151 Actual list of executor(s) to be killed is 44
Feb 27 19:46:12.151 Requesting to kill executor(s) 44
Feb 27 19:46:12.151 Removing executor 37 with no recent heartbeats: 160277 ms exceeds timeout 120000 ms
Feb 27 19:46:12.151 Removing executor 44 with no recent heartbeats: 122513 ms exceeds timeout 120000 ms
I tried to understand if we are crossing some resource limit on Kubernetes level but couldn't find something like that.
What can I look for to understand the reason for Kubernetes killing executors?
FOLLOW UP:
I missed a log message on the driver side:
Mar 01 21:04:23.471 Disabling executor 50.
and then on the executor side:
Mar 01 21:04:23.348 RECEIVED SIGNAL TERM
I looked at what class is writing the Disabling executor log message and found this class KubernetesDriverEndpoint, it seems that the onDisconnected method is called for all these executors and this method calls disableExecutor in DriverEndpoint
So the question now is why these executors are considered disconnected.
Looking at the explanation from this site
https://books.japila.pl/apache-spark-internals/scheduler/DriverEndpoint/#ondisconnected-callback
it is said there
Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
But I couldn't find any WARN logs on the driver side, any suggestions?
The reason for the executors being killed was that we were running them on spot instances in AWS, and this is how it looks like, the first sign we saw for an executor being killed is this line in it's log
Feb 27 19:44:10.284 RECEIVED SIGNAL TERM
Once we moved to on demand instances for the executors as well not a single executor was terminated in 20 hour jobs
According to the documentation
https://nodered.org/docs/creating-nodes/node-js
when Node-red (or the specific node in question) closes down,the "close" event is called and if a listener is registered with a parameter it should wait for done() before completely stopping.
this.on('close', function(done) {
doSomethingWithACallback(function() {
done();
});
});
It doesn't work for me though. My mistake, I'm sure, but I don't see where. The following code displays the first "Closing" entry in the log, but not the second entry "Waited enough. Actually finishing now.":
node.on("close", function(done) {
node.log('Closing.');
setTimeout(function(){
node.log('Waited enough.Actually finishing now.');
done();
},5000);
});
Can someone please give me a pointer ?
Using:
Node-red 0.17.5
node.js 6.14.1
Edit: output log added below
pi#raspberrypi:~ $ node-red-start
Start Node-RED
Once Node-RED has started, point a browser at http://192.168.1.17:1880
On Pi Node-RED works better with the Firefox or Chrome browser
Use node-red-stop to stop Node-RED
Use node-red-start to start Node-RED again
Use node-red-log to view the recent log output
Use sudo systemctl enable nodered.service to autostart Node-RED at every boot
Use sudo systemctl disable nodered.service to disable autostart on boot
To find more nodes and example flows - go to http://flows.nodered.org
Starting as a systemd service.
Started Node-RED graphical event wiring tool..
16 Apr 10:11:27 - [info]
Welcome to Node-RED
===================
16 Apr 10:11:27 - [info] Node-RED version: v0.17.5
16 Apr 10:11:27 - [info] Node.js version: v6.14.1
16 Apr 10:11:27 - [info] Linux 4.14.30-v7+ arm LE
16 Apr 10:11:30 - [info] Loading palette nodes
16 Apr 10:11:47 - [info] Dashboard version 2.7.0 started at /ui
16 Apr 10:11:50 - [info] Settings file : /home/pi/.node-red/settings.js
16 Apr 10:11:50 - [info] User directory : /home/pi/.node-red
16 Apr 10:11:50 - [info] Flows file : /home/pi/.node-red/flows_raspberrypi.json
16 Apr 10:11:50 - [info] Server now running at http://127.0.0.1:1880/
16 Apr 10:11:51 - [info] Starting flows
16 Apr 10:11:51 - [info] Started flows
Stopping Node-RED graphical event wiring tool....
16 Apr 10:12:06 - [info] Stopping flows
16 Apr 10:12:06 - [info] [simple-queue:queue1] Closing.
Stopped Node-RED graphical event wiring tool..
You are hitting a bug that was fixed in Node-RED 0.18.
Prior to Node-RED 0.18, the code that handled the shutdown of the runtime did not wait for the all of the node close handlers to complete before the process was terminated.
My node-red installation does not display the nodes needed for accessing beaglebone IOs (node-red-node-beaglebone).
In my opinion error is actually caused because node-red-node-beaglebone loads octalbonescript which needs serialport, and something requests serialPort instead of serialport there.
I already tried the preinstalled, stable and lts nodejs version. Additionally npm#2 and npm#3 for installing the nodes. The .node-red folder was also already deleted by me and the node-red-node-beaglebone packaged got installed in the .node-red/node_modules folder started from scrach.
root#beaglebone:/etc# cat debian_version
8.7
root#beaglebone:~# npm -v
2.15.11
root#beaglebone:~# uname -a
Linux beaglebone 4.4.30-ti-r64 #1 SMP Fri Nov 4 21:23:33 UTC 2016 armv7l GNU/Linux
root#beaglebone:~# export AUTO_LOAD_CAPE=0 #optional
root#beaglebone:~# node-red-pi
1487183133432 Board Looking for connected device
15 Feb 19:25:34 - [info]
Welcome to Node-RED
===================
15 Feb 19:25:34 - [info] Node-RED version: v0.16.2
15 Feb 19:25:34 - [info] Node.js version: v6.9.5
15 Feb 19:25:34 - [info] Linux 4.4.30-ti-r64 arm LE
15 Feb 19:25:44 - [info] Loading palette nodes
15 Feb 19:25:57 - [warn] ------------------------------------------------------
15 Feb 19:25:57 - [warn] [bbb] ReferenceError: serialPort is not defined +seems to be the problem
15 Feb 19:25:57 - [warn] ------------------------------------------------------
15 Feb 19:25:57 - [info] Settings file : /root/.node-red/settings.js
15 Feb 19:25:57 - [info] User directory : /root/.node-red
15 Feb 19:25:57 - [info] Flows file : /root/.node-red/flows_beaglebone.json
15 Feb 19:25:57 - [info] Creating new flow file
15 Feb 19:25:57 - [debug] loaded flow revision: 513fd923d68021b8ee98fcb250470340
15 Feb 19:25:57 - [debug] red/runtime/nodes/credentials.load : no user key present
15 Feb 19:25:57 - [debug] red/runtime/nodes/credentials.load : using default key
15 Feb 19:25:57 - [info] Starting flows
15 Feb 19:25:57 - [info] Started flows
15 Feb 19:25:57 - [info] Server now running at http://127.0.0.1:1880/
after upgrading a node in cassandra, these error in log occured:
I want to investigate but dont have a direction, any clue will help
thanks
2016/09/19 06:24:49 [INFO] core: post-unseal setup starting
2016/09/19 06:24:49 [INFO] core: mounted backend of type generic at secret/
2016/09/19 06:24:49 [INFO] core: mounted backend of type cubbyhole at cubbyhole/
2016/09/19 06:24:49 [INFO] core: mounted backend of type system at sys/
2016/09/19 06:24:49 [INFO] core: mounted backend of type cassandra at cassandra/
2016/09/19 06:24:49 [INFO] rollback: starting rollback manager
2016/09/19 06:24:50 [INFO] expire: restored 2 leases
2016/09/19 06:24:50 [INFO] core: post-unseal setup complete
2016/09/19 06:24:55 gocql: unable to dial control conn node-0.cassandra-app.mesos:9042: dial tcp 10.0.2.42:9042: getsockopt: connection refused
2016/09/19 06:25:12 error: failed to connect to 10.0.2.42:9042 due to error: gocql: no response to connection startup within timeout
Cassandra cluster is not cross-version compatible, you can not upgrade a node only, you have to upgrade the cluster. This is a common mistake people tend to do, please see this video here it mentiones this problem, also it is very very useful with lots of good info.
When I launch my jhipster app using "mvn spring-boot:run", it takes up to 60 seconds to start...
First part of my log is :
[INFO] Scanning for projects...
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building jhipster 0.0.1-SNAPSHOT
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] >>> spring-boot-maven-plugin:1.1.9.RELEASE:run (default-cli) # jhipster >>>
[INFO]
[INFO] --- maven-enforcer-plugin:1.3.1:enforce (enforce-versions) # jhipster ---
[INFO]
[INFO] --- maven-resources-plugin:2.6:resources (default-resources) # jhipster ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] Copying 4 resources
[INFO] Copying 22 resources
[INFO]
[INFO] --- maven-compiler-plugin:3.1:compile (default-compile) # jhipster ---
[INFO] Nothing to compile - all classes are up to date
[INFO]
[INFO] --- maven-resources-plugin:2.6:testResources (default-testResources) # jhipster ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] Copying 3 resources
[INFO]
[INFO] --- maven-compiler-plugin:3.1:testCompile (default-testCompile) # jhipster ---
[INFO] Nothing to compile - all classes are up to date
[INFO]
[INFO] <<< spring-boot-maven-plugin:1.1.9.RELEASE:run (default-cli) # jhipster <<<
[INFO]
[INFO] --- spring-boot-maven-plugin:1.1.9.RELEASE:run (default-cli) # jhipster ---
[INFO] Attaching agents: []
Listening for transport dt_socket at address: 5005
--> Then it hangs for around 30 seconds before continuing :
[INFO] com.mycompany.myapp.Application - Starting Application on MacBook-Pro.local with PID 5130 (/Users/othomas/Developpement/jhipster-1.9.0/target/classes started by othomas in /Users/othomas/Developpement/jhipster-1.9.0)
[DEBUG] com.mycompany.myapp.Application - Running with Spring Boot v1.1.9.RELEASE, Spring v4.0.8.RELEASE
[DEBUG] org.jboss.logging - Logging Provider: org.jboss.logging.Log4jLoggerProvider
...
I remember having used older versions of jhipster generator (0.17 etc) et it started in 15-20 seconds.
Is it normal or is there a problem on my side ? Where to look for ?
Thanks,
O.
I've been suffering slow startup times myself and wondering what the cause was. I get all the console messages saying various things have started and then it hangs just before the final message to say the app has loaded.
Eventually I found I could use Java VisualVM as part of the JDK to see what was going on. If you have the jdk installed its jvisualvm.exe in the bin folder. Then when I select to debug as Application.java the tomcat process pops up and you can track what's going on.
I took a couple of thread dumps where it hangs and it always seemed to be where the swagger API docs are being generated. A bit more digging and this is configured in a class called MetricsConfiguration which is excluded if you run with a profile called "fast".
In eclipse I edited my debug configuration to include a program argument of:
--spring.profiles.active=dev,fast
This cuts down the startup time from 230 seconds to just 25!
I had a quick scan and fast seems to disable all sorts of things. It mainly looks like the stuff under the admin menu which you'll probably not need during development anyway. Personally I would prefer a fast bootup to being able to see the rest docs during development.
Swagger being such a hog made me wonder if it's such a good idea after all. Is it worth the cost? i then read this http://java.dzone.com/articles/swagger-great and I'm considering just removing it altogether. It's a nice idea but seems to add 33mb to the build + for me was causing really slow startup times.
For info I have around 16 entities. So not small but not excessively large either.
Make sure you aren't running the server in debug mode and have a breakpoint set. This reduced the startup time of one of my applications from 3 min to 22 sec.
This is weird.
Indeed, it should start in 5-15 seconds depending on your machine and specific setup.
But it should not hang for 30 seconds: the line you show is a bit new, it's because we launch the application in debug mode when you use the dev profile -> you can attach a debugger on it.
It looks like it's waiting for you to connect a debugger: I've never seen it myself, so maybe you have some specific JVM option for attaching a debugger at start up, with a timeout of 30 seconds?
Thanks for your feedback. I investigated and put more logs in the app (Application.java).
Actually the problem does not come from the debug mode, the application does not hang here.
The first big "pause" comes from the scanning of liquibase packages (addLiquibaseScanPackages(); in Application.java ) : 26 seconds !
My second pause is still related to Liquibase (log "Configuring Liquibase" ) : 20 seconds. During that time, if I put Liquibase log level to DEBUG, I see that a lock is set and then released but it happens very quickly.
I really don't understand. I am using h2 in-memory database, jdk 1.7.0_25 and Maven 3.0.5, running on MacBook Pro with SSD.
Here is my full log when I run with "mvn spring-boot:run".
[INFO] Scanning for projects...
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building jhipster 0.0.1-SNAPSHOT
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] >>> spring-boot-maven-plugin:1.1.9.RELEASE:run (default-cli) # jhipster >>>
[INFO]
[INFO] --- maven-enforcer-plugin:1.3.1:enforce (enforce-versions) # jhipster ---
[INFO]
[INFO] --- maven-resources-plugin:2.6:resources (default-resources) # jhipster ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] Copying 4 resources
[INFO] Copying 22 resources
[INFO]
[INFO] --- maven-compiler-plugin:3.1:compile (default-compile) # jhipster ---
[INFO] Nothing to compile - all classes are up to date
[INFO]
[INFO] --- maven-resources-plugin:2.6:testResources (default-testResources) # jhipster ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] Copying 3 resources
[INFO]
[INFO] --- maven-compiler-plugin:3.1:testCompile (default-testCompile) # jhipster ---
[INFO] Nothing to compile - all classes are up to date
[INFO]
[INFO] <<< spring-boot-maven-plugin:1.1.9.RELEASE:run (default-cli) # jhipster <<<
[INFO]
[INFO] --- spring-boot-maven-plugin:1.1.9.RELEASE:run (default-cli) # jhipster ---
[INFO] Attaching agents: []
Listening for transport dt_socket at address: 5005
Wed Nov 26 16:32:23 CET 2014 Added log : Application is about to start
Wed Nov 26 16:32:28 CET 2014Added log : Application started, now we set banner to false
Wed Nov 26 16:32:28 CET 2014Added log : About to add Default profile
Wed Nov 26 16:32:28 CET 2014Added log : Default Profile added. Now we scan liquibase packages
Wed Nov 26 16:32:28 CET 2014Added log : Liquibase pakages scanned. Now we run the app
2014-11-26 16:32:54,564 [INFO] com.mycompany.myapp.Application - Starting Application on MacBook-Pro.local with PID 25452 (/Users/othomas/Developpement/jhipster-1.9.0/target/classes started by othomas in /Users/othomas/Developpement/jhipster-1.9.0)
2014-11-26 16:32:54,567 [DEBUG] com.mycompany.myapp.Application - Running with Spring Boot v1.1.9.RELEASE, Spring v4.0.8.RELEASE
2014-11-26 16:32:57,429 [DEBUG] org.jboss.logging - Logging Provider: org.jboss.logging.Log4jLoggerProvider
2014-11-26 16:32:57,559 [DEBUG] com.mycompany.myapp.config.AsyncConfiguration - Creating Async Task Executor
2014-11-26 16:32:58,305 [DEBUG] com.mycompany.myapp.config.MetricsConfiguration - Registering JVM gauges
2014-11-26 16:32:58,379 [INFO] com.mycompany.myapp.config.MetricsConfiguration - Initializing Metrics JMX reporting
2014-11-26 16:32:58,445 [DEBUG] com.mycompany.myapp.config.DatabaseConfiguration - Configuring Datasource
2014-11-26 16:32:59,353 [DEBUG] com.mycompany.myapp.config.DatabaseConfiguration - Configuring Liquibase
2014-11-26 16:33:19,489 [DEBUG] com.mycompany.myapp.config.CacheConfiguration - Starting Ehcache
2014-11-26 16:33:19,491 [DEBUG] com.mycompany.myapp.config.CacheConfiguration - Registering Ehcache Metrics gauges
2014-11-26 16:33:23,419 [DEBUG] com.mycompany.myapp.config.MailConfiguration - Configuring mail server
2014-11-26 16:33:24,559 [INFO] com.mycompany.myapp.config.WebConfigurer - Web application configuration, using profiles: [dev]
2014-11-26 16:33:24,560 [DEBUG] com.mycompany.myapp.config.WebConfigurer - Initializing Metrics registries
2014-11-26 16:33:24,564 [DEBUG] com.mycompany.myapp.config.WebConfigurer - Registering Metrics Filter
2014-11-26 16:33:24,565 [DEBUG] com.mycompany.myapp.config.WebConfigurer - Registering Metrics Servlet
2014-11-26 16:33:24,567 [DEBUG] com.mycompany.myapp.config.WebConfigurer - Registering GZip Filter
2014-11-26 16:33:24,569 [DEBUG] com.mycompany.myapp.config.WebConfigurer - Initialize H2 console
2014-11-26 16:33:24,570 [INFO] com.mycompany.myapp.config.WebConfigurer - Web application fully configured
2014-11-26 16:33:29,753 [INFO] com.mycompany.myapp.Application - Running with Spring profile(s) : [dev]
2014-11-26 16:33:30,012 [INFO] com.mycompany.myapp.config.ThymeleafConfiguration - loading non-reloadable mail messages resources
2014-11-26 16:33:30,896 [DEBUG] com.mycompany.myapp.aop.logging.LoggingAspect - Enter: com.mycompany.myapp.repository.CustomAuditEventRepository.auditEventRepository() with argument[s] = []
2014-11-26 16:33:30,905 [DEBUG] com.mycompany.myapp.aop.logging.LoggingAspect - Exit: com.mycompany.myapp.repository.CustomAuditEventRepository.auditEventRepository() with result = com.mycompany.myapp.repository.CustomAuditEventRepository$1#1edce963
2014-11-26 16:33:37,229 [INFO] com.mycompany.myapp.Application - Started Application in 68.311 seconds (JVM running for 73.972)
Wed Nov 26 16:33:37 CET 2014Added log : App is running
Thanks,
Olivier
It is advised to start the application with the debug points disabled unless you want to debug while starting up
you can just modify xmx like java -jar -Xmx1024m.
Because when Spring boot started, it loads lots of spring bean. You can add heap memory to improve it's performance.