Spark Java Application no cores and waiting

Spark Java Application no cores and waiting - apache-spark

I'm new to spark and I'm trying to develop my first application. I'm only trying to count lines in a file but I got the error:
2015-11-28 10:21:34 WARN TaskSchedulerImpl:71 - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
I have enough cores and enough memory. I read that can be a firewall problem but I'm getting this error both on my server and on my macbook and for sure on the macbook there is no firewall. If I open the UI it says that the application is WAITING and apparently the application is getting no cores at all:
Application ID Name Cores Memory per Node State
app-20151128102116-0002 (kill) New app 0 1024.0 MB WAITING
My code is very simple:
SparkConf sparkConf = new SparkConf().setAppName(new String("New app"));
sparkConf.setMaster("spark://MacBook-Air.local:7077");
JavaRDD<String> textFile = sc.textFile("/Users/mattiazeni/Desktop/test.csv.bz2");
if(logger.isInfoEnabled()){
logger.info(textFile.count());
}
if I try to run the same program from the shell in scala it works great.
Any suggestion?

Check that workers are running - there should be at least one worker listed on the Spark UI at http://:8080
If not are running, use /sbin/start-slaves.sh (or start-all.sh)

Related

Apache spark error while running with pyspark in a cluster [duplicate]

Here is what I am trying to do.
I have created two nodes of DataStax enterprise cluster,on top of which I have created a java program to get the count of one table (Cassandra database table).
This program was built in eclipse which is actually from a windows box.
At the time of running this program from windows it's failing with the following error at runtime:
Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
The same code has been compiled & run successfully on those clusters without any issue. What could be the reason why am getting above error?
Code:
import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SchemaRDD;
import org.apache.spark.sql.cassandra.CassandraSQLContext;
import com.datastax.bdp.spark.DseSparkConfHelper;
public class SparkProject {
public static void main(String[] args) {
SparkConf conf = DseSparkConfHelper.enrichSparkConf(new SparkConf()).setMaster("spark://10.63.24.14X:7077").setAppName("DatastaxTests").set("spark.cassandra.connection.host","10.63.24.14x").set("spark.executor.memory", "2048m").set("spark.driver.memory", "1024m").set("spark.local.ip","10.63.24.14X");
JavaSparkContext sc = new JavaSparkContext(conf);
CassandraSQLContext cassandraContext = new CassandraSQLContext(sc.sc());
SchemaRDD employees = cassandraContext.sql("SELECT * FROM portware_ants.orders");
//employees.registerTempTable("employees");
//SchemaRDD managers = cassandraContext.sql("SELECT symbol FROM employees");
System.out.println(employees.count());
sc.stop();
}
}

I faced similar issue and after some online research and trial-n-error, I narrowed down to 3 causes for this (except for the first the other two are not even close to the error message):
As indicated by the error, probably you are allocating the resources more than that is available. => This was not my issue
Hostname & IP Address mishaps: I took care of this by specifying the SPARK_MASTER_IP and SPARK_LOCAL_IP in spark-env.sh
Disable Firewall on the client : This was the solution that worked for me. Since I was working on a prototype in-house code, I disabled the firewall on the client node. For some reason the worker nodes, were not able to talk back to the client for me. For production purposes, you would want to open-up certain number of ports required.

My problem was that I was assigning too much memory than my slaves had available. Try reducing the memory size of the spark submit. Something like the following:
~/spark-1.5.0/bin/spark-submit --master spark://my-pc:7077 --total-executor-cores 2 --executor-memory 512m
with my ~/spark-1.5.0/conf/spark-env.sh being:
SPARK_WORKER_INSTANCES=4
SPARK_WORKER_MEMORY=1000m
SPARK_WORKER_CORES=2

Please look at Russ's post
Specifically this section:
This is by far the most common first error that a new Spark user will
see when attempting to run a new application. Our new and excited
Spark user will attempt to start the shell or run their own
application and be met with the following message
...
The short term solution to this problem is to make sure you aren’t
requesting more resources from your cluster than exist or to shut down
any apps that are unnecessarily using resources. If you need to run
multiple Spark apps simultaneously then you’ll need to adjust the
amount of cores being used by each app.

In my case, the problem was that I had the following line in $SPARK_HOME/conf/spark-env.sh:
SPARK_EXECUTOR_MEMORY=3g
of each worker,
and the following line in $SPARK_HOME/conf/spark-default.sh
spark.executor.memory 4g
in the "master" node.
The problem went away once I changed 4g to 3g. I hope that this will help someone with the same issue. The other answers helped me spot this.

I have faced this issue few times even though the resource allocation was correct.
The fix was to restart the mesos services.
sudo service mesos-slave restart
sudo service mesos-master restart

PySpark logging from the executor in a standalone cluster

This question has answers related to how to do this on a YARN cluster. But what if I am running a standalone spark cluster? How can I log from executors? Logging from the driver is easy using the log4j logger that we can derive from spark-context.
But how can I log from within an RDD's foreach or a foreachPartition? Is there any way I can collect these logs and print?

The answer to this is to import python logging and to write the messages using logging and the logged messages will be in the work directory which is created under the spark installation location
There is nothing else which is needed
I went crazy modifying log4j.properties file and adding driver-java-option and spakrk.executor.extraJavaOptions
In your spark program, import logging add log messages straightaway as
logging.warning(whatever is your message and variable values you want to check)
Then if you navigate to the work directory - if i have installed spark at /home/vagrant/spark then we are talking about /home/vagrant/spark/work directory
There will be a directory for each application.And the workers used for the application will have numbers 0, 1, 2, 3 etc.You have to check in each worker.And whichever worker your executor was created to execute the task in the stderr you will see the logging messages
Hope this helps to see the user logged messages on the executor when using the spark standalone cluster mode

Spark 2.0 Standalone mode Dynamic Resource Allocation Worker Launch Error

I'm running Spark 2.0 on Standalone mode, successfully configured it to launch on a server and also was able to configure Ipython Kernel PySpark as option into Jupyter Notebook. Everything works fine but I'm facing the problem that for each Notebook that I launch, all of my 4 workers are assigned to that application. So if another person from my team try to launch another Notebook with PySpark kernel, it simply does not work until I stop the first notebook and release all the workers.
To solve this problem I'm trying to follow the instructions from Spark 2.0 Documentation.
So, on my $SPARK_HOME/conf/spark-defaults.conf I have the following lines:
spark.dynamicAllocation.enabled true
spark.shuffle.service.enabled true
spark.dynamicAllocation.executorIdleTimeout 10
Also, on $SPARK_HOME/conf/spark-env.sh I have:
export SPARK_WORKER_MEMORY=1g
export SPARK_EXECUTOR_MEMORY=512m
export SPARK_WORKER_INSTANCES=4
export SPARK_WORKER_CORES=1
But when I try to launch the workers, using $SPARK_HOME/sbin/start-slaves.sh, only the first worker is successfully launched. The log from the first worker end up like this:
16/11/24 13:32:06 INFO Worker: Successfully registered with master
spark://cerberus:7077
But the log from workers 2-4 show me this error:
INFO ExternalShuffleService: Starting shuffle service on port 7337
with useSasl = false 16/11/24 13:32:08 ERROR Inbox: Ignoring error
java.net.BindException: Address already in use
It seems (to me) that the first worker successfully starts the shuffle-service at port 7337, but the workers 2-4 "does not know" about this and try to launch another shuffle-service on the same port.
The problem occurs also for all workers (1-4) if I first launch a shuffle-service (using $SPARK_HOME/sbin/start-shuffle-service.sh) and then try to launch all the workers ($SPARK_HOME/sbin/start-slaves.sh).
Is any option to get around this? To be able to all workers verfy if there is a shuffle service running and connect to it instead of try to create a new service?

I had the same issue and seemed to get it working by removing the spark.shuffle.service.enabled item from the config file (in fact I don't have any dynamicAllocation-related items in there) and instead put this in the SparkConf when I request a SparkContext:
sconf = pyspark.SparkConf() \
.setAppName("sc1") \
.set("spark.dynamicAllocation.enabled", "true") \
.set("spark.shuffle.service.enabled", "true")
sc1 = pyspark.SparkContext(conf=sconf)
I start the master & slaves as normal:
$SPARK_HOME/sbin/start-all.sh
And I have to start one instance of the shuffler-service:
$SPARK_HOME/sbin/start-shuffle-service.sh
Then I started two notebooks with this context and got them both to do a small job. The first notebook's application does the job and is in the RUNNING state, the second notebook's application is in the WAITING state. After a minute (default idle timeout), the resources get reallocated and the second context gets to do its job (and both are in RUNNING state).
Hope this helps,
John

Spark: Exception in thread "main" akka.actor.ActorNotFound:

I am submitting my spark jobs from a local laptop to a remote standalone Spark cluster (spark://IP:7077). It is submitted successfully. However, I do not get any output and it fails after some time. When i check the workers on my cluster, I find the following exception:
Exception in thread "main" akka.actor.ActorNotFound: Actor not found for: ActorSelection[Actor[akka.tcp://sparkDriver#localhost:54561/]/user/CoarseGrainedScheduler]
When I run the same code on my local system (local[*]), it runs successfully and gives the output.
Note that I run it in spark notebook. The same application runs successfully on the remote standalone cluster when i submit it via terminal using spark-submit
Am I missing something in the configuration of notebook? Any other possible causes?
The code is very simple.
Detailed exception:
Exception in thread "main" akka.actor.ActorNotFound: Actor not found for: ActorSelection[Actor[akka.tcp://sparkDriver#localhost:54561/]/user/CoarseGrainedScheduler]
at akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:66)
at akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:64)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:67)
at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:82)
at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
at akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:58)
at akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.unbatchedExecute(Future.scala:74)
at akka.dispatch.BatchingExecutor$class.execute(BatchingExecutor.scala:110)
at akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.execute(Future.scala:73)
at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:269)
at akka.actor.EmptyLocalActorRef.specialHandle(ActorRef.scala:512)
at akka.actor.DeadLetterActorRef.specialHandle(ActorRef.scala:545)
at akka.actor.DeadLetterActorRef.$bang(ActorRef.scala:535)
at akka.remote.RemoteActorRefProvider$RemoteDeadLetterActorRef.$bang(RemoteActorRefProvider.scala:91)
at akka.actor.ActorRef.tell(ActorRef.scala:125)
at akka.dispatch.Mailboxes$$anon$1$$anon$2.enqueue(Mailboxes.scala:44)
at akka.dispatch.QueueBasedMessageQueue$class.cleanUp(Mailbox.scala:438)
at akka.dispatch.UnboundedDequeBasedMailbox$MessageQueue.cleanUp(Mailbox.scala:650)
at akka.dispatch.Mailbox.cleanUp(Mailbox.scala:309)
at akka.dispatch.MessageDispatcher.unregister(AbstractDispatcher.scala:204)
at akka.dispatch.MessageDispatcher.detach(AbstractDispatcher.scala:140)
at akka.actor.dungeon.FaultHandling$class.akka$actor$dungeon$FaultHandling$$finishTerminate(FaultHandling.scala:203)
at akka.actor.dungeon.FaultHandling$class.terminate(FaultHandling.scala:163)
at akka.actor.ActorCell.terminate(ActorCell.scala:338)
at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:431)
at akka.actor.ActorCell.systemInvoke(ActorCell.scala:447)
at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:262)
at akka.dispatch.Mailbox.run(Mailbox.scala:218)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Sample code
val logFile = "hdfs://hostname/path/to/file"
val conf = new SparkConf()
.setMaster("spark://hostname:7077") // as appears on hostname:8080
.setAppName("myapp")
.set("spark.executor.memory", "20G")
.set("spark.cores.max", "40")
.set("spark.executor.cores","20")
.set("spark.driver.allowMultipleContexts","true")
val sc2 = new SparkContext(conf)
val logData = sc2.textFile(logFile)
val numAs = logData.filter(line => line.contains("hello")).count()
val numBs = logData.filter(line => line.contains("hi")).count()
println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))

Update:
The above issue can be avoided by including the IP address of driver (i.e., local laptop's public IP) within the application code. This can be done by adding the following line in the spark context:
.set("spark.driver.host",YourSystemIPAddress)
However, there can be issue if the driver's IP address is behind the NAT. In this case the workers will not be able to find the IP.

When you say "spark notebook" I am assuming you mean the github project https://github.com/andypetrella/spark-notebook?
I would have to look into specifics of notebook but I notice your worker is trying to connect to a master on "localhost".
For normal Spark configuration, on the worker set SPARK_MASTER_IP in $SPARK_HOME/conf/spark-env.sh and see if that helps, Even if you are running on a single machine in standalone mode, set this. In my experience Spark doesn't always resolve hostnames properly so starting from a baseline of all IPs is a good idea.
The rest is general info, see if it helps with your specific issue:
If you are submitting to a cluster from your laptop you use --deploy-mode to cluster to tell your driver to run on one of the worker nodes. This creates an extra consideration of how you setup your classpath because you don't know which worker the driver will run on.
Here's some general info in the interest of completeness, there is a known Spark bug about hostnames resolving to IP addresses. I am not presenting this as the complete answer in all cases, but I suggest trying with a baseline of just using all IPs, and only use the single config SPARK_MASTER_IP. With just those two practices I get my clusters to work and all the other configs, or using hostnames, just seems to muck things up.
So in your spark-env.sh get rid of SPARK_LOCAL_IP and change SPARK_MASTER_IP to an IP address, not a hostname.
I have treated this more at length in this answer.
For more completeness here's part of that answer:
Can you ping the box where the Spark master is running? Can you ping the worker from the master? More importantly, can you password-less ssh to the worker from the master box? Per 1.5.2 docs you need to be able to do that with a private key AND have the worker entered in the conf/slaves file. I copied the relevant paragraph at the end.
You can get a situation where the worker can contact the master but the master can't get back to the worker so it looks like no connection is being made. Check both directions.
I think the slaves file on the master node, and the password-less ssh can lead to similar errors to what you are seeing.
Per the answer I crosslinked, there's also an old bug but it's not clear how that bug was resolved.

Stopping a Running Spark Application

I'm running a Spark cluster in standalone mode.
I've submitted a Spark application in cluster mode using options:
--deploy-mode cluster –supervise
So that the job is fault tolerant.
Now I need to keep the cluster running but stop the application from running.
Things I have tried:
Stopping the cluster and restarting it. But the application resumes
execution when I do that.
Used Kill -9 of a daemon named DriverWrapper but the job resumes again after that.
I've also removed temporary files and directories and restarted the cluster but the job resumes again.
So the running application is really fault tolerant.
Question:
Based on the above scenario can someone suggest how I can stop the job from running or what else I can try to stop the application from running but keep the cluster running.
Something just accrued to me, if I call sparkContext.stop() that should do it but that requires a bit of work in the code which is OK but can you suggest any other way without code change.

If you wish to kill an application that is failing repeatedly, you may do so through:
./bin/spark-class org.apache.spark.deploy.Client kill <master url> <driver ID>
You can find the driver ID through the standalone Master web UI at http://:8080.
From Spark Doc

Revisiting this because I wasn't able to use the existing answer without debugging a few things.
My goal was to programmatically kill a driver that runs persistently once a day, deploy any updates to the code, then restart it. So I won't know ahead of time what my driver ID is. It took me some time to figure out that you can only kill the drivers if you submitted your driver with the --deploy-mode cluster option. It also took me some time to realize that there was a difference between application ID and driver ID, and while you can easily correlate an application name with an application ID, I have yet to find a way to divine the driver ID through their api endpoints and correlate that to either an application name or the class you are running. So while run-class org.apache.spark.deploy.Client kill <master url> <driver ID> works, you need to make sure you are deploying your driver in cluster mode and are using the driver ID and not the application ID.
Additionally, there is a submission endpoint that spark provides by default at http://<spark master>:6066/v1/submissions and you can use http://<spark master>:6066/v1/submissions/kill/<driver ID> to kill your driver.
Since I wasn't able to find the driver ID that correlated to a specific job from any api endpoint, I wrote a python web scraper to get the info from the basic spark master web page at port 8080 then kill it using the endpoint at port 6066. I'd prefer to get this data in a supported way, but this is the best solution I could find.
#!/usr/bin/python
import sys, re, requests, json
from selenium import webdriver
classes_to_kill = sys.argv
spark_master = 'masterurl'
driver = webdriver.PhantomJS()
driver.get("http://" + spark_master + ":8080/")
for running_driver in driver.find_elements_by_xpath("//*/div/h4[contains(text(), 'Running Drivers')]"):
for driver_id in running_driver.find_elements_by_xpath("..//table/tbody/tr/td[contains(text(), 'driver-')]"):
for class_to_kill in classes_to_kill:
right_class = driver_id.find_elements_by_xpath("../td[text()='" + class_to_kill + "']")
if len(right_class) > 0:
driver_to_kill = re.search('^driver-\S+', driver_id.text).group(0)
print "Killing " + driver_to_kill
result = requests.post("http://" + spark_master + ":6066/v1/submissions/kill/" + driver_to_kill)
print json.dumps(json.loads(result.text), indent=4)
driver.quit()

https://community.cloudera.com/t5/Support-Questions/What-is-the-correct-way-to-start-stop-spark-streaming-jobs/td-p/30183
according this link use to stop if your master use yarn
yarn application -list
yarn application -kill application_id

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string