I'm a newbie in Spark and have a simple question about using worker nodes.
I have a simple Spark program, for example "WordCount" from the Spark examples. I create an EMR cluster with a driver node and 2 workers, connect to the driver over SSH, and submit the task to Spark:
spark-submit --class mytest.application --master local[*] --deploy-mode client mytest.jar
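Note that `--master local[*]` runs the whole application in a single JVM on the node where spark-submit is invoked, so the two worker nodes never receive any work. To actually use the EMR workers, the job would be submitted against YARN instead. A sketch, reusing the class and jar names from the command above:

```shell
# Runs everything in one JVM on the node you SSH'd into -- workers stay idle:
spark-submit --class mytest.application --master local[*] mytest.jar

# Uses the YARN resource manager that EMR provides, so executors
# are launched on the worker nodes:
spark-submit --class mytest.application --master yarn --deploy-mode client mytest.jar
```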
My question is about monitoring Amazon EMR with Prometheus and Grafana when deploying in cluster mode.
When running a job in standalone mode on EMR, metrics are pushed to the endpoint for the master, driver and executors; in cluster mode, however, none of the metrics are there.
I also tried exporting to other metric sinks such as CSV and console, which work perfectly fine in standalone mode but produce nothing in cluster mode.
I am using PrometheusServlet for exporting metrics, with the same endpoints as explained here.
spark-submit --files metrics.properties --conf spark.metrics.conf=metrics.properties --conf spark.ui.prometheus.enabled=true --deploy-mode cluster --master yarn testCluster.py
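For reference, a minimal metrics.properties enabling the PrometheusServlet sink might look like the following. The class and path names come from the template in the Spark monitoring documentation; adjust the paths to match your setup:

```properties
# Expose metrics from every instance (driver, executors) on the Spark UI
*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet
*.sink.prometheusServlet.path=/metrics/prometheus
# Separate paths for master and per-application metrics (standalone master UI)
master.sink.prometheusServlet.path=/metrics/master/prometheus
applications.sink.prometheusServlet.path=/metrics/applications/prometheus
```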
I have two problems with spark-submit in cluster deploy mode on a standalone cluster:
How to specify a node as the driver node in a Spark cluster?
In my case, the driver node was assigned dynamically by Spark.
How to distribute the app to the nodes automatically from my local machine?
In my case, I must deploy the app's jar to every node, because I don't know which node will be the driver node.
PS: My submit command is:
spark-submit --master spark://master_ip:6066 --class appMainClass --deploy-mode cluster file:///tmp/spark_app/sparkrun
The --deploy-mode flag determines whether the job is submitted in cluster or client mode.
In cluster mode all the nodes act as executors. One node submits the JAR, and you can then track the execution through the web UI. That particular node will also act as an executor.
In client mode, the node where spark-submit is invoked acts as the driver. Note that this node will not execute the DAG, since it is designated as the driver for your cluster. All the other nodes are executors. Again, the web UI helps you see the execution of jobs and other useful information such as RDD partitions, cached RDD sizes, etc.
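As a concrete illustration, the same application submitted in the two modes differs only in the flag. A sketch, where the class name, master address, and `app.jar` are placeholders:

```shell
# Client mode: the driver runs in this shell's JVM on the submitting node
spark-submit --class com.example.App --master spark://master:7077 --deploy-mode client app.jar

# Cluster mode: the jar is shipped to the cluster and the driver
# is started on one of the worker nodes
spark-submit --class com.example.App --master spark://master:7077 --deploy-mode cluster app.jar
```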
Can anyone please let me know how to submit a Spark job from my local machine and connect to a Cassandra cluster?
Currently I submit the Spark job after logging in to a Cassandra node through PuTTY, using the dse spark-submit command below.
Command:
dse spark-submit --class ***** --total-executor-cores 6 --executor-memory 2G **/**/**.jar --config-file build/job.conf --args
With the above command, my Spark job is able to connect to the cluster and executes, but I sometimes face issues.
So I want to submit the Spark job from my local machine. Can anyone please guide me on how to do this?
There are several things you could mean by "run my job locally", so here are a few of my interpretations:
Run the Spark Driver on a Local Machine but access a remote Cluster's resources
I would not recommend this for a few reasons, the biggest being that all of your job management will still be handled between your remote machine and the executors in the cluster. This would be the equivalent of having a Hadoop JobTracker running in a different cluster than the rest of the Hadoop distribution.
To accomplish this, though, you need to run spark-submit with the cluster's specific master URI. Additionally, you would need to specify a Cassandra node via spark.cassandra.connection.host:
dse spark-submit --master spark://sparkmasterip:7077 --conf spark.cassandra.connection.host=aCassandraNode --flags jar
It is important that you keep the jar LAST. All arguments after the jar are interpreted as arguments for the application and not spark-submit parameters.
Run Spark Submit on a local Machine but have the Driver run in the Cluster (Cluster Mode)
Cluster mode means your local machine sends the jar and the environment string over to the Spark Master. The Spark Master then chooses a worker to actually run the driver, and the worker starts the driver as a separate JVM. This is triggered with the --deploy-mode cluster flag, in addition to specifying the master and the Cassandra connection host:
dse spark-submit --master spark://sparkmasterip:7077 --deploy-mode cluster --conf spark.cassandra.connection.host=aCassandraNode --flags jar
Run the Spark Driver in Local Mode
Finally, there exists a local mode for Spark which starts the entire Spark framework in a single JVM. This is mainly used for testing. Local mode is activated by passing `--master local`.
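For example, a sketch with placeholder names, where the number in brackets sets how many worker threads the single JVM uses:

```shell
# Driver and executors all run in one local JVM with 4 threads
dse spark-submit --master local[4] --class com.example.App app.jar
```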
For more information, check out the Spark documentation on submitting applications:
http://spark.apache.org/docs/latest/submitting-applications.html
I am running my Spark Streaming application in YARN cluster mode. I want to limit the job to just one executor. How do I do this in Spark?
You can control the number of executors in two ways:
Option 1: Directly on the spark-submit command line:
spark-submit --class ClassName --num-executors 1 .... other parameters
Option 2: Via a --conf flag on the spark-submit command line:
spark-submit --class ClassName --conf "spark.executor.instances=1" .... other parameters
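One caveat: on YARN (and EMR in particular), dynamic allocation may be enabled by default, in which case executors can be added beyond a fixed count. If you see that, disabling dynamic allocation explicitly should make the setting stick. A sketch under that assumption:

```shell
spark-submit --class ClassName \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.executor.instances=1 \
  .... other parameters
```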
You can get more detailed information about the same from Submitting Applications.
I have a Spark standalone cluster running on a few machines. All workers are using 2 cores and 4GB of memory. I can start a job server with ./server_start.sh --master spark://ip:7077 --deploy-mode cluster --conf spark.driver.cores=2 --conf spark.driver.memory=4g, but whenever I try to start a server with more than 2 cores, the driver's state gets stuck at "SUBMITTED" and no worker takes the job.
I tried starting the spark-shell on 4 cores with ./spark-shell --master spark://ip:7077 --conf spark.driver.cores=4 --conf spark.driver.memory=4g and the job gets shared between 2 workers (2 cores each). The spark-shell gets launched as an application and not a driver though.
Is there any way to run a driver split between multiple workers? Or can I run the job server as an application rather than a driver?
The problem was resolved in the chat.
You have to change your JobServer .conf file to set the master parameter to point to your cluster:
master = "spark://ip:7077"
Also, the memory that JobServer program uses can be set in the settings.sh file.
After setting these parameters, you can start JobServer with a simple call:
./server_start.sh
Then, once the service is running, you can create your context via REST, which will ask the cluster for resources and will receive an appropriate number of executors/cores:
curl -d "" '[hostname]:8090/contexts/cassandra-context?context-factory=spark.jobserver.context.CassandraContextFactory&num-cpu-cores=8&memory-per-node=2g'
Finally, every job sent via POST to JobServer on this created context will be able to use the executors allocated to the context and will be able to run in a distributed way.
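For example, submitting a job to that shared context via the REST API might look like the following sketch; the jar file, app name, class path and input string are placeholders following the spark-jobserver conventions:

```shell
# Upload the application jar under a name JobServer can reference
curl --data-binary @job.jar '[hostname]:8090/jars/myapp'

# Run a job from that jar on the already-created context
curl -d "input.string = a b c a b" \
  '[hostname]:8090/jobs?appName=myapp&classPath=spark.jobserver.WordCountExample&context=cassandra-context'
```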