I am submitting Spark jobs from my local laptop to a remote standalone Spark cluster (spark://IP:7077). The job is submitted successfully, but I get no output and it fails after some time. When I check the workers on the cluster, I find the following exception:
Exception in thread "main" akka.actor.ActorNotFound: Actor not found for: ActorSelection[Actor[akka.tcp://sparkDriver#localhost:54561/]/user/CoarseGrainedScheduler]
When I run the same code on my local system (local[*]), it runs successfully and gives the output.
Note that I run it in Spark Notebook. The same application runs successfully on the remote standalone cluster when I submit it from a terminal using spark-submit.
Am I missing something in the notebook's configuration? Are there any other possible causes?
The code is very simple.
Detailed exception:
Exception in thread "main" akka.actor.ActorNotFound: Actor not found for: ActorSelection[Actor[akka.tcp://sparkDriver#localhost:54561/]/user/CoarseGrainedScheduler]
at akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:66)
at akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:64)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:67)
at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:82)
at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
at akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:58)
at akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.unbatchedExecute(Future.scala:74)
at akka.dispatch.BatchingExecutor$class.execute(BatchingExecutor.scala:110)
at akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.execute(Future.scala:73)
at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:269)
at akka.actor.EmptyLocalActorRef.specialHandle(ActorRef.scala:512)
at akka.actor.DeadLetterActorRef.specialHandle(ActorRef.scala:545)
at akka.actor.DeadLetterActorRef.$bang(ActorRef.scala:535)
at akka.remote.RemoteActorRefProvider$RemoteDeadLetterActorRef.$bang(RemoteActorRefProvider.scala:91)
at akka.actor.ActorRef.tell(ActorRef.scala:125)
at akka.dispatch.Mailboxes$$anon$1$$anon$2.enqueue(Mailboxes.scala:44)
at akka.dispatch.QueueBasedMessageQueue$class.cleanUp(Mailbox.scala:438)
at akka.dispatch.UnboundedDequeBasedMailbox$MessageQueue.cleanUp(Mailbox.scala:650)
at akka.dispatch.Mailbox.cleanUp(Mailbox.scala:309)
at akka.dispatch.MessageDispatcher.unregister(AbstractDispatcher.scala:204)
at akka.dispatch.MessageDispatcher.detach(AbstractDispatcher.scala:140)
at akka.actor.dungeon.FaultHandling$class.akka$actor$dungeon$FaultHandling$$finishTerminate(FaultHandling.scala:203)
at akka.actor.dungeon.FaultHandling$class.terminate(FaultHandling.scala:163)
at akka.actor.ActorCell.terminate(ActorCell.scala:338)
at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:431)
at akka.actor.ActorCell.systemInvoke(ActorCell.scala:447)
at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:262)
at akka.dispatch.Mailbox.run(Mailbox.scala:218)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Sample code
import org.apache.spark.{SparkConf, SparkContext}

val logFile = "hdfs://hostname/path/to/file"
val conf = new SparkConf()
  .setMaster("spark://hostname:7077") // as appears on hostname:8080
  .setAppName("myapp")
  .set("spark.executor.memory", "20G")
  .set("spark.cores.max", "40")
  .set("spark.executor.cores", "20")
  .set("spark.driver.allowMultipleContexts", "true") // the notebook already holds a context
val sc2 = new SparkContext(conf)
val logData = sc2.textFile(logFile)
// Count the lines containing each search term
val numAs = logData.filter(line => line.contains("hello")).count()
val numBs = logData.filter(line => line.contains("hi")).count()
println("Lines with hello: %s, Lines with hi: %s".format(numAs, numBs))
Update:
The above issue can be avoided by setting the driver's IP address (i.e., the local laptop's public IP) in the application code. Add the following line to the SparkConf used to create the Spark context:
.set("spark.driver.host",YourSystemIPAddress)
However, this can still fail if the driver's IP address is behind NAT. In that case the workers will not be able to reach that IP.
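To illustrate why a driver advertised as "localhost" (as in the exception above) or as a NAT-internal address cannot be reached by remote workers, here is a minimal sketch that classifies a candidate spark.driver.host value. The function name and the sample addresses are hypothetical, and the check only looks at the address range; it does not prove that a route actually exists.

```python
import ipaddress


def classify_driver_host(host: str) -> str:
    """Classify a candidate spark.driver.host value by how widely it is reachable."""
    if host == "localhost":
        return "loopback"
    # Raises ValueError for hostnames; this sketch only handles IP literals.
    addr = ipaddress.ip_address(host)
    if addr.is_loopback:
        return "loopback"  # only the driver machine itself can reach this
    if addr.is_private:
        return "private"   # reachable only inside the same network (NAT risk)
    return "public"


# Sample addresses are illustrative only.
for candidate in ("localhost", "127.0.0.1", "192.168.1.20", "8.8.8.8"):
    print(candidate, "->", classify_driver_host(candidate))
```

A "loopback" result matches the failure in the stack trace; a "private" result works only if the workers sit on the same network as the laptop.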
When you say "spark notebook" I am assuming you mean the github project https://github.com/andypetrella/spark-notebook?
I would have to look into the specifics of the notebook, but I notice your worker is trying to connect to a master on "localhost".
For a normal Spark configuration, set SPARK_MASTER_IP in $SPARK_HOME/conf/spark-env.sh on the worker and see if that helps. Set this even if you are running on a single machine in standalone mode. In my experience Spark doesn't always resolve hostnames properly, so starting from a baseline of all IPs is a good idea.
The rest is general info, see if it helps with your specific issue:
If you are submitting to a cluster from your laptop, you can pass --deploy-mode cluster to make the driver run on one of the worker nodes. This adds an extra consideration for how you set up your classpath, because you don't know which worker the driver will run on.
For completeness: there is a known Spark bug about hostnames resolving to IP addresses. I am not presenting this as the complete answer in all cases, but I suggest starting from a baseline of using IP addresses everywhere and setting only the single config SPARK_MASTER_IP. With just those two practices my clusters work; all the other configs, or using hostnames, just seem to muck things up.
So in your spark-env.sh get rid of SPARK_LOCAL_IP and change SPARK_MASTER_IP to an IP address, not a hostname.
I have treated this more at length in this answer.
For more completeness here's part of that answer:
Can you ping the box where the Spark master is running? Can you ping the worker from the master? More importantly, can you password-less ssh to the worker from the master box? Per 1.5.2 docs you need to be able to do that with a private key AND have the worker entered in the conf/slaves file. I copied the relevant paragraph at the end.
You can get a situation where the worker can contact the master but the master can't get back to the worker so it looks like no connection is being made. Check both directions.
I think problems with the slaves file on the master node, or with password-less ssh, can lead to errors similar to what you are seeing.
Per the answer I crosslinked, there's also an old bug but it's not clear how that bug was resolved.
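The "check both directions" advice above can be sketched as a small TCP probe that you run once from the driver or worker towards the master, and once from the cluster machines back towards the driver. This is an illustrative helper, not part of Spark; the function name is hypothetical, and which ports to probe depends on your setup (7077 is the standalone master port used in the question).

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False


# Example usage (addresses are placeholders for your own machines):
#   on the laptop:   can_connect("<master-ip>", 7077)
#   on a worker:     can_connect("<driver-ip>", <driver-port>)
```

A True result in one direction and False in the other points at a firewall or NAT problem rather than a Spark misconfiguration.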
Related
Here is what I am trying to do.
I have created a two-node DataStax Enterprise cluster, on top of which I have written a Java program to get the row count of a Cassandra table.
The program was built in Eclipse on a Windows box.
When I run it from Windows, it fails at runtime with the following error:
Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
The same code compiles and runs successfully on the cluster nodes themselves without any issue. What could be the reason for the above error?
Code:
import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SchemaRDD;
import org.apache.spark.sql.cassandra.CassandraSQLContext;
import com.datastax.bdp.spark.DseSparkConfHelper;
public class SparkProject {
    public static void main(String[] args) {
        SparkConf conf = DseSparkConfHelper.enrichSparkConf(new SparkConf())
            .setMaster("spark://10.63.24.14X:7077")
            .setAppName("DatastaxTests")
            .set("spark.cassandra.connection.host", "10.63.24.14x")
            .set("spark.executor.memory", "2048m")
            .set("spark.driver.memory", "1024m")
            .set("spark.local.ip", "10.63.24.14X");
        JavaSparkContext sc = new JavaSparkContext(conf);
        CassandraSQLContext cassandraContext = new CassandraSQLContext(sc.sc());
        SchemaRDD employees = cassandraContext.sql("SELECT * FROM portware_ants.orders");
        //employees.registerTempTable("employees");
        //SchemaRDD managers = cassandraContext.sql("SELECT symbol FROM employees");
        System.out.println(employees.count());
        sc.stop();
    }
}
I faced a similar issue and, after some online research and trial and error, narrowed it down to three possible causes (apart from the first, they are not even hinted at by the error message):
As the error indicates, you may be requesting more resources than are available. This was not my issue.
Hostname and IP address mishaps: I took care of this by specifying SPARK_MASTER_IP and SPARK_LOCAL_IP in spark-env.sh.
Firewall on the client: disabling it was the solution that worked for me. Since I was working on in-house prototype code, I disabled the firewall on the client node. For some reason the worker nodes were not able to talk back to the client. For production purposes, you would want to open only the specific ports required.
My problem was that I was requesting more memory than my slaves had available. Try reducing the memory requested by spark-submit, for example:
~/spark-1.5.0/bin/spark-submit --master spark://my-pc:7077 --total-executor-cores 2 --executor-memory 512m
with my ~/spark-1.5.0/conf/spark-env.sh being:
SPARK_WORKER_INSTANCES=4
SPARK_WORKER_MEMORY=1000m
SPARK_WORKER_CORES=2
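As a rough sanity check of the numbers above, the following sketch estimates how many executors of a given size fit on one worker. It is a deliberately simplified model under stated assumptions: it ignores resources already held by other applications, it assumes an explicit per-executor core count, and the helper names are hypothetical rather than anything Spark provides.

```python
UNITS_MB = {"m": 1, "g": 1024}


def parse_mem_mb(value: str) -> int:
    """Convert a size such as '512m' or '20G' to mebibytes (whole numbers only)."""
    value = value.strip().lower()
    number, unit = value[:-1], value[-1]
    if unit not in UNITS_MB:
        raise ValueError(f"unsupported memory unit in {value!r}")
    return int(number) * UNITS_MB[unit]


def executors_per_worker(worker_mem: str, worker_cores: int,
                         executor_mem: str, executor_cores: int) -> int:
    """Estimate how many executors fit on one worker, limited by memory and cores."""
    by_memory = parse_mem_mb(worker_mem) // parse_mem_mb(executor_mem)
    by_cores = worker_cores // executor_cores
    return min(by_memory, by_cores)


# Values from the spark-env.sh and spark-submit lines above (1 core per executor assumed).
print(executors_per_worker("1000m", 2, "512m", 1))
# An executor larger than the worker never fits, so the job waits for resources.
print(executors_per_worker("3g", 2, "4g", 1))
```

When the estimate is 0 for every worker, the application stays in the waiting state and Spark reports that the initial job has not accepted any resources.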
Please look at Russ's post
Specifically this section:
This is by far the most common first error that a new Spark user will
see when attempting to run a new application. Our new and excited
Spark user will attempt to start the shell or run their own
application and be met with the following message
...
The short term solution to this problem is to make sure you aren’t
requesting more resources from your cluster than exist or to shut down
any apps that are unnecessarily using resources. If you need to run
multiple Spark apps simultaneously then you’ll need to adjust the
amount of cores being used by each app.
In my case, the problem was that I had the following line in $SPARK_HOME/conf/spark-env.sh on each worker:
SPARK_EXECUTOR_MEMORY=3g
and the following line in $SPARK_HOME/conf/spark-defaults.conf on the "master" node:
spark.executor.memory 4g
The problem went away once I changed 4g to 3g. I hope that this will help someone with the same issue. The other answers helped me spot this.
I have faced this issue a few times even though the resource allocation was correct.
The fix was to restart the Mesos services:
sudo service mesos-slave restart
sudo service mesos-master restart
Suppose I run the pyspark command and get the global variable spark of type SparkSession. As I understand it, spark holds a connection to the Spark master. Can I print out the details of this connection, including the hostname of the Spark master?
For basic information you can use the master property:
spark.sparkContext.master
To get details on YARN you might have to dig through hadoopConfiguration:
hadoopConfiguration = spark.sparkContext._jsc.hadoopConfiguration()
hadoopConfiguration.get("yarn.resourcemanager.hostname")
or
hadoopConfiguration.get("yarn.resourcemanager.address")
When submitted to YARN, Spark uses the Hadoop configuration to determine the resource manager, so these values should match the ones in the configuration placed in HADOOP_CONF_DIR or YARN_CONF_DIR.
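For a standalone cluster, the string returned by the master property is a URL such as spark://host:7077, so the hostname can be pulled out with plain string handling. The sketch below is a hypothetical helper, not a PySpark API; it assumes the standalone URL format, including the comma-separated form used with multiple masters, and returns nothing for values such as local[*] or yarn that carry no host.

```python
def master_endpoints(master: str):
    """Split a standalone master URL into a list of (host, port) pairs."""
    prefix = "spark://"
    if not master.startswith(prefix):
        return []  # e.g. "local[*]" or "yarn" contain no master host
    endpoints = []
    for part in master[len(prefix):].split(","):
        host, sep, port = part.rpartition(":")
        if sep:
            endpoints.append((host, int(port)))
        else:
            endpoints.append((part, None))  # no explicit port in the URL
    return endpoints


# In a pyspark session you would pass spark.sparkContext.master instead.
print(master_endpoints("spark://cerberus:7077"))
print(master_endpoints("local[*]"))
```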
I configured one master on my local PC and a worker node inside VirtualBox. The result file is created on the worker node instead of being sent back to the master node, and I wonder why.
Is it because my worker node cannot send the result back to the master node? How can I verify that?
I use Spark 2.2.
I use same username for master and worker node.
I also configured ssh without password.
I tried --deploy-mode client and --deploy-mode cluster
I tried once then I switched the master/worker node and I got the same result.
val result = joined.distinct()
result.write.mode("overwrite").format("csv")
.option("header", "true").option("delimiter", ";")
.save("file:///home/data/KPI/KpiDensite.csv")
Also, I load the input file like this:
val commerce = spark.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true")
.option("delimiter", "|").load("file:///home/data/equip-serv-commerce-infra-2016.csv").distinct()
But why must the file be present at the same path on both the master and the worker node? I don't use YARN or Mesos right now.
You are exporting to a local file system, which tells Spark to write it on the file system of the machine running the code. On the worker, that will be the file system of the worker machine.
If you want the data to be stored on the file system of the driver (not the master; on a YARN cluster you would also need to know where the driver is running), then you need to collect the RDD or DataFrame and use normal IO code to write the data to a file.
The easiest option, however, especially if you are running your application in cluster mode, is to use a distributed storage system such as HDFS (.save("hdfs://master:port/data/KPI/KpiDensite.csv")) or to export to a database (writing via JDBC or to a NoSQL database).
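The "collect and use normal IO" route mentioned above could look roughly like the sketch below. It assumes the rows have already been brought to the driver (for example with collect() on a small result) and fit in driver memory; the function name and the sample rows are hypothetical. The semicolon delimiter mirrors the option used in the question.

```python
import csv


def write_rows_locally(rows, header, path, delimiter=";"):
    """Write already-collected rows to a CSV file on the driver's local file system."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter=delimiter)
        writer.writerow(header)
        writer.writerows(rows)  # each row is a tuple or list of column values
    return path


# In a PySpark driver this would be something like:
#   write_rows_locally(result.collect(), result.columns, "/home/data/KPI/KpiDensite.csv")
```

Because the file is written by ordinary Python code running in the driver process, it ends up on whichever machine hosts the driver, not on the workers.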
I'm running Spark 2.0 in standalone mode. I successfully configured it to launch on a server and was also able to add a PySpark IPython kernel as an option in Jupyter Notebook. Everything works fine, except that for each notebook I launch, all 4 of my workers are assigned to that application. So if another person on my team tries to launch another notebook with the PySpark kernel, it simply does not work until I stop the first notebook and release all the workers.
To solve this problem I'm trying to follow the instructions from Spark 2.0 Documentation.
So, on my $SPARK_HOME/conf/spark-defaults.conf I have the following lines:
spark.dynamicAllocation.enabled true
spark.shuffle.service.enabled true
spark.dynamicAllocation.executorIdleTimeout 10
Also, on $SPARK_HOME/conf/spark-env.sh I have:
export SPARK_WORKER_MEMORY=1g
export SPARK_EXECUTOR_MEMORY=512m
export SPARK_WORKER_INSTANCES=4
export SPARK_WORKER_CORES=1
But when I try to launch the workers using $SPARK_HOME/sbin/start-slaves.sh, only the first worker launches successfully. The log from the first worker ends like this:
16/11/24 13:32:06 INFO Worker: Successfully registered with master spark://cerberus:7077
But the logs from workers 2-4 show this error:
INFO ExternalShuffleService: Starting shuffle service on port 7337 with useSasl = false
16/11/24 13:32:08 ERROR Inbox: Ignoring error
java.net.BindException: Address already in use
It seems (to me) that the first worker successfully starts the shuffle service on port 7337, but workers 2-4 "do not know" about this and try to launch another shuffle service on the same port.
The problem also occurs for all workers (1-4) if I first launch a shuffle service (using $SPARK_HOME/sbin/start-shuffle-service.sh) and then try to launch all the workers ($SPARK_HOME/sbin/start-slaves.sh).
Is there any way around this? Can the workers verify whether a shuffle service is already running and connect to it instead of trying to create a new one?
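The BindException above means something on the host already holds port 7337. A minimal way to confirm that before starting more workers is to try binding the port yourself, as in the sketch below. The helper is hypothetical and only a diagnostic; it treats any bind failure as "in use", which can also include permission errors on privileged ports.

```python
import socket


def port_in_use(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if binding host:port fails, i.e. another process likely holds it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return True
        return False


# Example usage on the worker host (7337 is the port from the log above):
#   port_in_use(7337)
```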
I had the same issue and seemed to get it working by removing the spark.shuffle.service.enabled item from the config file (in fact I don't have any dynamicAllocation-related items in there) and instead put this in the SparkConf when I request a SparkContext:
sconf = pyspark.SparkConf() \
.setAppName("sc1") \
.set("spark.dynamicAllocation.enabled", "true") \
.set("spark.shuffle.service.enabled", "true")
sc1 = pyspark.SparkContext(conf=sconf)
I start the master & slaves as normal:
$SPARK_HOME/sbin/start-all.sh
And I have to start one instance of the shuffle service:
$SPARK_HOME/sbin/start-shuffle-service.sh
Then I started two notebooks with this context and got them both to do a small job. The first notebook's application does the job and is in the RUNNING state, the second notebook's application is in the WAITING state. After a minute (default idle timeout), the resources get reallocated and the second context gets to do its job (and both are in RUNNING state).
Hope this helps,
John
First, I bought the new O'Reilly Spark book and tried its Cassandra setup instructions. I've also found other Stack Overflow posts and various posts and guides around the web. None of them work as-is. Below is as far as I could get.
This is a test with only a handful of records of dummy test data. I am running the most recent Cassandra 2.0.7 VirtualBox VM provided by planetcassandra.org, linked from the main Cassandra project page.
I downloaded Spark 1.2.1 source and got the latest Cassandra Connector code from github and built both against Scala 2.11. I have JDK 1.8.0_40 and Scala 2.11.6 setup on Mac OS 10.10.2.
I run the spark shell with the cassandra connector loaded:
bin/spark-shell --driver-class-path ../spark-cassandra-connector/spark-cassandra-connector/target/scala-2.11/spark-cassandra-connector-assembly-1.2.0-SNAPSHOT.jar
Then I do what should be a simple row count type test on a test table of four records:
import com.datastax.spark.connector._
sc.stop
val conf = new org.apache.spark.SparkConf(true).set("spark.cassandra.connection.host", "192.168.56.101")
val sc = new org.apache.spark.SparkContext(conf)
val table = sc.cassandraTable("mykeyspace", "playlists")
table.count
I get the following error. What is confusing is that it is getting errors trying to find Cassandra at 127.0.0.1, but it also recognizes the host name that I configured which is 192.168.56.101.
15/03/16 15:56:54 INFO Cluster: New Cassandra host /192.168.56.101:9042 added
15/03/16 15:56:54 INFO CassandraConnector: Connected to Cassandra cluster: Cluster on a Stick
15/03/16 15:56:54 ERROR ServerSideTokenRangeSplitter: Failure while fetching splits from Cassandra
java.io.IOException: Failed to open thrift connection to Cassandra at 127.0.0.1:9160
<snip>
java.io.IOException: Failed to fetch splits of TokenRange(0,0,Set(CassandraNode(/127.0.0.1,/127.0.0.1)),None) from all endpoints: CassandraNode(/127.0.0.1,/127.0.0.1)
BTW, I can also use a configuration file at conf/spark-defaults.conf to do the above without having to close/recreate a Spark context or pass in the --driver-class-path argument. I ultimately hit the same error though, and the above steps seem easier to communicate in this post.
Any ideas?
Check the rpc_address setting in the cassandra.yaml file on your Cassandra node. It's likely that the Spark connector is using that value from the system.local/system.peers tables, and it may be set to 127.0.0.1 in your cassandra.yaml.
The Spark connector uses Thrift to get token range splits from Cassandra. It looks like it's getting the host metadata to find the nearest host and then making the query using Thrift on port 9160. Eventually I'm betting this will be replaced, as C* 2.1.4 has a new table called system.size_estimates (CASSANDRA-7688).
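To check the rpc_address value suggested above without opening an editor, a few lines of plain text scanning are enough. This sketch avoids a YAML library on purpose and only handles the simple top-level "key: value" form; the function name and the sample text are hypothetical, and the real file location depends on your installation.

```python
def find_rpc_address(yaml_text: str):
    """Return the rpc_address value from cassandra.yaml text, or None if unset."""
    for line in yaml_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # skip commented-out settings
        key, sep, value = stripped.partition(":")
        if sep and key.strip() == "rpc_address":
            # Drop any trailing comment and surrounding whitespace.
            return value.split("#", 1)[0].strip() or None
    return None


# Illustrative file content; read your real cassandra.yaml instead.
sample = "cluster_name: 'Test'\n# rpc_address: 1.2.3.4\nrpc_address: 127.0.0.1\n"
print(find_rpc_address(sample))
```

If this reports 127.0.0.1 or localhost, remote clients such as the Spark shell on another machine will be told to contact an address that only works on the Cassandra node itself.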