Did the Spark job server start properly, and what causes the out of memory response? - apache-spark

I am using spark-jobserver-0.6.2-spark-1.6.1
(1) export OBSERVER_CONFIG = /custom-spark-jobserver-config.yml
(2)./server_start.sh
Execution of the above start shell script returns without error. However, it created a pid file: spark-jobserver.pid
When I cat spark-jobserver.pid, the pid file shows pid=126633
However, when I ran
lsof -i :9999 | grep LISTEN
It shows
java 126634 spark 17u IPv4 189013362 0t0 TCP *:distinct (LISTEN)
I deployed my Scala application to the job server as below, and it returned OK:
curl --data-binary @analytics_2.10-1.0.jar myhost:8090/jars/myservice
OK
When I ran the following curl command to test the REST service deployed on the job server:
curl -d "{data.label.index:15, data.label.field:ROOT_CAUSE, input.string:[\"tt: Getting operation. \"]}" 'myhost:8090/jobs?appName=myservice&classPath=com.test.Test&sync=true&timeout=400'
I got the following out of memory returned response
{
  "status": "ERROR",
  "result": {
    "errorClass": "java.lang.RuntimeException",
    "cause": "unable to create new native thread",
    "stack": ["java.lang.Thread.start0(Native Method)", "java.lang.Thread.start(Thread.java:714)",
      "org.spark-project.jetty.util.thread.QueuedThreadPool.startThread(QueuedThreadPool.java:441)",
      "org.spark-project.jetty.util.thread.QueuedThreadPool.doStart(QueuedThreadPool.java:108)",
      "org.spark-project.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)",
      "org.spark-project.jetty.util.component.AggregateLifeCycle.doStart(AggregateLifeCycle.java:81)",
      "org.spark-project.jetty.server.handler.AbstractHandler.doStart(AbstractHandler.java:58)",
      "org.spark-project.jetty.server.handler.HandlerWrapper.doStart(HandlerWrapper.java:96)",
      "org.spark-project.jetty.server.Server.doStart(Server.java:282)",
      "org.spark-project.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)",
      "org.apache.spark.ui.JettyUtils$.org$apache$spark$ui$JettyUtils$$connect$1(JettyUtils.scala:252)",
      "org.apache.spark.ui.JettyUtils$$anonfun$5.apply(JettyUtils.scala:262)",
      "org.apache.spark.ui.JettyUtils$$anonfun$5.apply(JettyUtils.scala:262)",
      "org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1988)",
      "scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)",
      "org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1979)",
      "org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:262)",
      "org.apache.spark.ui.WebUI.bind(WebUI.scala:137)",
      "org.apache.spark.SparkContext$$anonfun$13.apply(SparkContext.scala:481)",
      "org.apache.spark.SparkContext$$anonfun$13.apply(SparkContext.scala:481)",
      "scala.Option.foreach(Option.scala:236)",
      "org.apache.spark.SparkContext.<init>(SparkContext.scala:481)",
      "spark.jobserver.context.DefaultSparkContextFactory$$anon$1.<init>(SparkContextFactory.scala:53)",
      "spark.jobserver.context.DefaultSparkContextFactory.makeContext(SparkContextFactory.scala:53)",
      "spark.jobserver.context.DefaultSparkContextFactory.makeContext(SparkContextFactory.scala:48)",
      "spark.jobserver.context.SparkContextFactory$class.makeContext(SparkContextFactory.scala:37)",
      "spark.jobserver.context.DefaultSparkContextFactory.makeContext(SparkContextFactory.scala:48)",
      "spark.jobserver.JobManagerActor.createContextFromConfig(JobManagerActor.scala:378)",
      "spark.jobserver.JobManagerActor$$anonfun$wrappedReceive$1.applyOrElse(JobManagerActor.scala:122)",
      "scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)",
      "scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)",
      "scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)",
      "ooyala.common.akka.ActorStack$$anonfun$receive$1.applyOrElse(ActorStack.scala:33)",
      "scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)",
      "scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)",
      "scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)",
      "ooyala.common.akka.Slf4jLogging$$anonfun$receive$1$$anonfun$applyOrElse$1.apply$mcV$sp(Slf4jLogging.scala:26)",
      "ooyala.common.akka.Slf4jLogging$class.ooyala$common$akka$Slf4jLogging$$withAkkaSourceLogging(Slf4jLogging.scala:35)",
      "ooyala.common.akka.Slf4jLogging$$anonfun$receive$1.applyOrElse(Slf4jLogging.scala:25)",
      "scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)",
      "scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)",
      "scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)",
      "ooyala.common.akka.ActorMetrics$$anonfun$receive$1.applyOrElse(ActorMetrics.scala:24)",
      "akka.actor.Actor$class.aroundReceive(Actor.scala:467)",
      "ooyala.common.akka.InstrumentedActor.aroundReceive(InstrumentedActor.scala:8)",
      "akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)",
      "akka.actor.ActorCell.invoke(ActorCell.scala:487)",
      "akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)",
      "akka.dispatch.Mailbox.run(Mailbox.scala:220)",
      "akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)",
      "scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)",
      "scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)",
      "scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)",
      "scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)"],
    "causingClass": "java.lang.OutOfMemoryError",
    "message": "java.lang.OutOfMemoryError: unable to create new native thread"
  }
}
My questions:
(1) Why is the process ID different from the one shown in the pid file (126633 vs 126634)?
(2) Why is spark-jobserver.pid created? Does this mean the Spark job server did not start properly?
(3) How do I start the job server properly?
(4) What causes the out of memory response? Is it because I did not set the heap size or memory correctly, and how do I resolve it?

Jobserver binds to 8090, not 9999, so maybe you should look for that process id instead.
The spark-jobserver.pid file is created for tracking purposes. It does not mean that the job server failed to start.
You are starting spark-jobserver properly.
Maybe try increasing the value of JOBSERVER_MEMORY; the default is 1G. Did you check on the Spark UI whether the application started properly?
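For reference, a minimal sketch of the checks above, assuming the default 8090 HTTP port and that server_start.sh picks JOBSERVER_MEMORY up from the environment (2G is just an example value):
lsof -i :8090 | grep LISTEN      # what is actually listening on the job server port
cat spark-jobserver.pid          # compare with the pid recorded by server_start.sh
./server_stop.sh                 # stop the tracked instance
export JOBSERVER_MEMORY=2G       # raise driver memory from the 1G default
./server_start.sh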

Related

"Address in use" error while running nodejs in jupyter

When I'm using ijavascript in Jupyter to perform a Firebase query, an error regarding "address in use" pops up.
I tried to restart my machine and shut down all other processes that might use port 8888.
... \npm\node_modules\ijavascript\node_modules\zeromq\lib\index.js:451
this._zmq.bindSync(addr);
^
Error: Address in use
at exports.Socket.Socket.bindSync
...
kernel 6d2aa3bc-854c-42e9-9edf-816b3a1dc878 restarted
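A quick way to see what is actually holding a port before restarting anything; a minimal sketch, assuming the default Jupyter port 8888 (the kernel's ZeroMQ sockets bind their own ports, which you can read from the error message):
jupyter notebook list            # running notebook servers and their ports
lsof -i :8888                    # on Linux/macOS
netstat -ano | findstr :8888     # on Windows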

Stopping a job in spark

I'm using Spark version 1.3. I have a job that's taking forever to finish.
To fix it, I made some optimizations to the code, and started the job again. Unfortunately, I launched the optimized code before stopping the earlier version, and now I cannot stop the earlier job.
Here are the things I've tried to kill this app:
Through the Web UI
result: the Spark UI has no "kill" option for apps (I assume "spark.ui.killEnabled" has not been enabled; I'm not the owner of this cluster).
Through the command line: spark-class org.apache.spark.deploy.Client kill mymasterURL app-XXX
result: I get this message:
Driver app-XXX has already finished or does not exist
But I see in the web UI that it is still running, and the resources are still occupied.
Through the command line via spark-submit: spark-submit --master mymasterURL --deploy-mode cluster --kill app-XXX
result: I get this error:
Error: Killing submissions is only supported in standalone mode!
I tried to retrieve the Spark context to stop it (via SparkContext.stop() or cancelAllJobs()), but have been unsuccessful, as ".getOrCreate" is not available in 1.3. I have not been able to retrieve the Spark context of the initial app.
I'd appreciate any ideas!
Edit: I've also tried killing the app through yarn by executing: yarn application -kill app-XXX
result: I got this error:
Exception in thread "main" java.lang.IllegalArgumentException:
Invalid ApplicationId prefix: app-XX. The valid ApplicationId should
start with prefix application
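Since the error says a valid id must start with the prefix "application", a hedged sketch of looking up the real YARN application id first (the id below is made up):
yarn application -list -appStates RUNNING
yarn application -kill application_1490000000000_0001   # use the id from the first column above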

How to know if app is in RUNNING state to kill spark-submit process?

I am creating a shell script which will be executed from Jenkins, because we have many streaming jobs and it seems easier to manage them from Jenkins. So I have created the script below.
#!/bin/bash
# Submit the Spark application in the background and remember the spark-submit pid
spark-submit "spark parameters here" > /dev/null 2>&1 &
processId=$!
echo "$processId"
# Give spark-submit enough time to hand the application over to the cluster before killing it
sleep 5m
kill "$processId"
If I don't have the sleep, the spark-submit process is killed immediately and no Spark application is submitted. With the sleep, the spark-submit process gets enough time to submit the Spark application.
My question is: is there a better way to know whether the Spark application is in the RUNNING state, so that the spark-submit process can be killed?
Spark 1.6.0 with YARN
You should spark-submit your Spark application and then use yarn application -status <ApplicationId>, as described in the application section of the YARN commands documentation:
Prints the status of the application.
You could get <ApplicationId> from the logs of spark-submit (in client deploy mode) or use yarn application -list -appTypes SPARK -appStates RUNNING.
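A hedged sketch of how that could be wired into the Jenkins script; the application name, polling interval, and retry count are assumptions:
#!/bin/bash
# Poll YARN until the submitted app reports RUNNING, instead of sleeping a fixed 5 minutes.
APP_NAME="my-streaming-job"      # hypothetical name passed via --name to spark-submit
for attempt in $(seq 1 60); do
  appId=$(yarn application -list -appTypes SPARK -appStates RUNNING 2>/dev/null |
          awk -v name="$APP_NAME" '$2 == name {print $1; exit}')
  if [ -n "$appId" ]; then
    echo "$APP_NAME is RUNNING as $appId"
    break
  fi
  sleep 5
done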
I don't know what Spark version you are using or if you are running in standalone mode, but anyway, you can use the REST API for submitting/killing your apps. The last time I checked it was pretty much undocumented, but it worked properly.
When you submit an application, you will get a submissionId which you can use later for either getting the current state or killing it. The possible states are documented here:
// SUBMITTED: Submitted but not yet scheduled on a worker
// RUNNING: Has been allocated to a worker to run
// FINISHED: Previously ran and exited cleanly
// RELAUNCHING: Exited non-zero or due to worker failure, but has not yet started running again
// UNKNOWN: The state of the driver is temporarily not known due to master failure recovery
// KILLED: A user manually killed this driver
// FAILED: The driver exited non-zero and was not supervised
// ERROR: Unable to run or restart due to an unrecoverable error (e.g. missing jar file)
This is especially useful for long-running apps (e.g. streaming), since you don't have to babysit the shell script.
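For the standalone REST API route, a minimal sketch, assuming the default REST port 6066 and a made-up submissionId of the form returned by the create call:
curl http://mymasterURL:6066/v1/submissions/status/driver-20170101000000-0001            # current state
curl -X POST http://mymasterURL:6066/v1/submissions/kill/driver-20170101000000-0001      # kill it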

Unable to run nodetool on remote and local server in cassandra

nodetool -h <ipaddress> -p 7199 status
Error connecting to remote Jmx agent!
java.rmi.NoSuchObjectException: no such object in table
I am getting the above error when I try to run nodetool status or any other nodetool command. Cassandra is running fine, and nodetool status on the other nodes in the cluster shows this node is in the UN state. I tried to add the entry below to the cassandra-env.sh file, but I still got the same error:
JVM_OPTS = "$JVM_OPTS -Djava.rmi.server.hostname="
You have to use your listen_address as the nodetool host IP:
nodetool -h <listen_address> -p 7199 status
or, if that does not work, try with sudo.
In the cassandra.yaml file it is written that JMX will, by default, only work from localhost. To run it from a remote host, you need to uncomment and provide values for the parameters written in that file.
Also try
cat /var/log/cassandra/cassandra.log | grep Error
and see if it reports any errors regarding JMX connectivity.
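For reference, a hedged sketch of the kind of JMX settings involved; they typically live in cassandra-env.sh, the exact property names and defaults vary by Cassandra version, and disabling authentication is only sensible for testing:
LOCAL_JMX=no
JVM_OPTS="$JVM_OPTS -Dcom.sun.management.jmxremote.port=7199"
JVM_OPTS="$JVM_OPTS -Dcom.sun.management.jmxremote.authenticate=false"
JVM_OPTS="$JVM_OPTS -Djava.rmi.server.hostname=<listen_address>"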

Address already in use error when Apache Felix starts

I deleted the felix-cache directory. When I started the Felix framework again, I got this error:
ERROR: transport error 202: bind failed: Address already in use
ERROR: JDWP Transport dt_socket failed to initialize, TRANSPORT_INIT(510)
JDWP exit error AGENT_ERROR_TRANSPORT_INIT(197): No transports initialized [debugInit.c:750]
FATAL ERROR in native method: JDWP No transports initialized, jvmtiError=AGENT_ERROR_TRANSPORT_INIT(197)
Any idea how I can fix this?
Another process is still running and using that port. Check the remaining processes using ps -ef | grep java and kill the offending one.
You seem to be launching the JVM in remote debugging mode, but another JVM is already running in remote debug mode on the same port. You cannot share the port number between multiple processes. If you need to debug two Java programs simultaneously, you will have to configure them to use different ports.
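A hedged sketch of both options; 5005 is an assumed debug port, so substitute whatever address= value your Felix launcher actually uses:
lsof -i :5005                    # find the process already holding the debug port, and kill it if it is stale
java -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5006 -jar bin/felix.jar   # or start Felix on a different debug port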
