Setting the number of executors for my Spark Streaming application - apache-spark

I am running my Spark Streaming application in YARN cluster mode. I want to limit the number of executors to just one. How do I do this in Spark?

You can control the number of executors in two ways:
Option 1: Directly with the spark-submit command:
spark-submit --class ClassName --num-executors 1 .... other parameters
Option 2: With the --conf flag on the spark-submit command:
spark-submit --class ClassName --conf "spark.executor.instances=1" .... other parameters.
You can get more detailed information from Submitting Applications.
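If you would rather set this from code, the same property can go on the SparkConf before the StreamingContext is created. A minimal Scala sketch, assuming a Streaming app (the app name and batch interval below are placeholders):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("SingleExecutorStreamingApp")        // placeholder app name
  .set("spark.executor.instances", "1")            // same effect as --num-executors 1 on YARN
val ssc = new StreamingContext(conf, Seconds(10))  // placeholder 10-second batch interval
// define your DStreams here, then call ssc.start() and ssc.awaitTermination()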

Related

Why is the executors entry not visible in the Spark web UI

I am running a Spark job, and even though I've set the --num-executors parameter to 3, I can't see any executors in the web UI's Executors tab. Why is this happening?
Spark in local mode is non-distributed: the Spark process runs in a single JVM, and the driver also behaves as the executor.
The only thing you can define is the number of threads in the master URL (for example, local[3]).
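A minimal Scala sketch of that (the app name is a placeholder); local[3] gives you three threads, but still a single executor:
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("LocalModeExample")   // placeholder app name
  .setMaster("local[3]")            // 3 worker threads; still one JVM, the driver doubles as the executor
val sc = new SparkContext(conf)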
You can switch to standalone mode.
Start the master using the command below:
spark-class org.apache.spark.deploy.master.Master
And the worker using:
spark-class org.apache.spark.deploy.worker.Worker spark://<host>:7077
Now run the spark-submit command.
If you have 6 cores, just specifying --executor-cores 2 will create 3 executors, and you can check this on the Spark UI.
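As a rough programmatic equivalent, assuming a single standalone worker with 6 cores (the app name and host are placeholders):
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("StandaloneSizing")       // placeholder app name
  .setMaster("spark://<host>:7077")     // the standalone master started above
  .set("spark.executor.cores", "2")     // 2 cores per executor (same as --executor-cores 2)
  .set("spark.cores.max", "6")          // cap the app at 6 cores, giving up to 3 executors
val sc = new SparkContext(conf)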

Multiple executors for a Spark application

Can one worker have multiple executors for the same Spark application in standalone and YARN mode? If not, what is the reason (for both standalone and YARN mode)?
Yes, you can specify the resources Spark will use. For example, you can use these properties for configuration:
--num-executors 3
--driver-memory 4g
--executor-memory 2g
--executor-cores 2
If your node has enough resources, the cluster will assign more than one executor to the same node.
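If you prefer to keep these settings in code, the flags map to Spark configuration properties; a minimal sketch (the app name is a placeholder):
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("MultiExecutorApp")         // placeholder app name
  .set("spark.executor.instances", "3")   // --num-executors (honoured on YARN)
  .set("spark.executor.memory", "2g")     // --executor-memory
  .set("spark.executor.cores", "2")       // --executor-cores
// --driver-memory is better passed on the command line: the driver JVM is already
// running by the time this code executes, so setting it here has no effect.
val sc = new SparkContext(conf)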
You can read more information about Spark resources configuration here.

How to submit a Spark job from a local machine and connect to a Cassandra cluster

Can anyone please let me know how to submit a Spark job from my local machine and connect to a Cassandra cluster?
Currently I am submitting the Spark job after I log in to a Cassandra node through PuTTY and run the dse spark-submit command below.
Command:
dse spark-submit --class ***** --total-executor-cores 6 --executor-memory 2G **/**/**.jar --config-file build/job.conf --args
With the above command, my Spark job is able to connect to the cluster and execute, but I sometimes face issues.
So I want to submit the Spark job from my local machine. Can anyone please guide me on how to do this?
There are several things you could mean by "run my job locally".
Here are a few interpretations.
Run the Spark Driver on a Local Machine but access a remote Cluster's resources
I would not recommend this for a few reasons, the biggest being that all of your job management will still be handled between your remote machine and the executors in the cluster. This would be the equivalent of having a Hadoop JobTracker running in a different cluster than the rest of the Hadoop distribution.
To accomplish this, though, you need to run spark-submit with a specific master URI. Additionally, you would need to specify a Cassandra node via spark.cassandra.connection.host:
dse spark-submit --master spark://sparkmasterip:7077 --conf spark.cassandra.connection.host=aCassandraNode --flags jar
It is important that you keep the jar LAST. All arguments after the jar are interpreted as arguments for the application and not spark-submit parameters.
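If you build the SparkConf in code instead, the same setting plus the spark-cassandra-connector looks roughly like this (the master, host, keyspace and table names are placeholders):
import com.datastax.spark.connector._          // adds cassandraTable to the SparkContext
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("RemoteCassandraJob")                          // placeholder app name
  .setMaster("spark://sparkmasterip:7077")                   // remote standalone master
  .set("spark.cassandra.connection.host", "aCassandraNode")  // any reachable Cassandra node
val sc = new SparkContext(conf)

val rows = sc.cassandraTable("my_keyspace", "my_table").count()  // placeholder keyspace/table
println(s"row count: $rows")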
Run Spark Submit on a local Machine but have the Driver run in the Cluster (Cluster Mode)
Cluster mode means your local machine sends the jar and environment string over to the Spark Master. The Spark Master then chooses a worker to actually run the driver, and the driver is started as a separate JVM by that worker. This is triggered using the --deploy-mode cluster flag, in addition to specifying the master and the Cassandra connection host:
dse spark-submit --master spark://sparkmasterip:7077 --deploy-mode cluster --conf spark.cassandra.connection.host=aCassandraNode --flags jar
Run the Spark Driver in Local Mode
Finally, there exists a local mode for Spark which starts the entire Spark framework in a single JVM. This is mainly used for testing. Local mode is activated by passing --master local.
For more information, check out the Spark documentation on submitting applications:
http://spark.apache.org/docs/latest/submitting-applications.html

Running a distributed Spark Job Server with multiple workers in a Spark standalone cluster

I have a Spark standalone cluster running on a few machines. All workers are using 2 cores and 4GB of memory. I can start a job server with ./server_start.sh --master spark://ip:7077 --deploy-mode cluster --conf spark.driver.cores=2 --conf spark.driver.memory=4g, but whenever I try to start a server with more than 2 cores, the driver's state gets stuck at "SUBMITTED" and no worker takes the job.
I tried starting the spark-shell on 4 cores with ./spark-shell --master spark://ip:7077 --conf spark.driver.cores=4 --conf spark.driver.memory=4g and the job gets shared between 2 workers (2 cores each). The spark-shell gets launched as an application and not a driver though.
Is there any way to run a driver split between multiple workers? Or can I run the job server as an application rather than a driver?
The problem was resolved in the chat.
You have to change your JobServer .conf file to set the master parameter to point to your cluster:
master = "spark://ip:7077"
Also, the memory that the JobServer program uses can be set in the settings.sh file.
After setting these parameters, you can start JobServer with a simple call:
./server_start.sh
Then, once the service is running, you can create your context via REST, which will ask the cluster for resources and will receive an appropriate number of executors/cores:
curl -d "" '[hostname]:8090/contexts/cassandra-context?context-factory=spark.jobserver.context.CassandraContextFactory&num-cpu-cores=8&memory-per-node=2g'
Finally, every job sent via POST to JobServer on this created context will be able to use the executors allocated to the context and will be able to run in a distributed way.
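For reference, a job submitted this way follows the classic spark.jobserver.SparkJob shape from the JobServer examples; a rough sketch (the input.string parameter is a placeholder, and exact signatures may differ between JobServer versions):
import com.typesafe.config.Config
import org.apache.spark.SparkContext
import spark.jobserver.{SparkJob, SparkJobValid, SparkJobValidation}

object WordCountJob extends SparkJob {
  // accept any configuration here; a real job would check its required parameters
  override def validate(sc: SparkContext, config: Config): SparkJobValidation = SparkJobValid

  // runs on the context created above, so it uses the executors already allocated to it
  override def runJob(sc: SparkContext, config: Config): Any =
    sc.parallelize(config.getString("input.string").split(" ").toSeq).countByValue()
}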

Submit an application to a Spark EC2 (AWS EMR) cluster

I'm a newbie in Spark. I have a simple query about using worker nodes.
I have a simple Spark program, for example "WordCount" from the Spark examples. I create an EMR cluster with a driver node and 2 workers, connect to the driver by SSH, and submit the task to Spark:
spark-submit --class mytest.application --master local[*] --deploy-mode client mytest.jar
