Why don't Spark drivers run when applications run? - python-3.x

I'm a beginner trying to learn about the behavior of applications and drivers by going through some examples. I'm starting off with:
Running a standalone cluster manager
Running a single master calling ./sbin/start-master.sh
Running a single worker calling ./sbin/start-slave.sh spark://localhost:7077
Launching a test application in client mode by calling:
./bin/spark-submit \
--master spark://localhost:7077 \
./examples/src/main/python/pi.py
According to the Docs:
The process running the main() function of the application and creating the SparkContext
My takeaway from this is that there should be at least one driver program running whenever an application runs. However, I'm not seeing this in the web UI for the master:
Alive Workers: 1
Cores in use: 4 Total, 0 Used
Memory in use: 15.0 GB Total, 0.0 B Used
Applications: 0 Running, 1 Completed
Drivers: 0 Running, 0 Completed
Status: ALIVE
Shouldn't I expect to see 1 driver running or completed? I've included some config details below.
./conf/spark-defaults.conf:
spark.master=spark://localhost:7077
spark.eventLog.enabled=true
spark.eventLog.dir=./tmp/spark-events/
spark.history.fs.logDirectory=.tmp/spark-events/
spark.driver.memory=5g

If you are running an interactive shell, e.g. pyspark (CLI or via an IPython notebook), by default you are running in client mode.
In client mode the driver is not launched by the cluster manager; it runs inside the spark-submit (or shell) process itself, so the standalone master's UI never lists it under Drivers. Only drivers launched with cluster deploy mode show up there.
NOTE : AFAIK you cannot run pyspark or any other interactive shell in cluster mode.
So try running the application in cluster mode using --deploy-mode cluster
./bin/spark-submit \
--master spark://localhost:7077 \
--deploy-mode cluster \
./examples/src/main/python/pi.py
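Either way, the driver does exist when you submit in client mode: it lives in the spark-submit process on the machine you launched from rather than on a worker, which is why the master UI never counts it. A quick way to confirm this while pi.py is running (a sketch, assuming the JDK's jps is on your PATH):
# while pi.py runs in client mode, the driver lives in the SparkSubmit JVM on this machine
jps -l | grep org.apache.spark.deploy.SparkSubmit
# the Python side of the driver is a child python process of that JVM
ps -ef | grep pi.py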

Related

dse spark-submit to specific work pool instead of "default"

I am able to successfully build the example project from https://github.com/datastax/SparkBuildExamples/tree/master/scala/sbt/dse/src/main/scala/com/datastax/spark/example
I can also submit it successfully with dse spark-submit; the program runs fine and the results are as expected:
dse spark-submit --class com.datastax.spark.example.WriteRead target/writeRead-0.1.jar
I now wish to submit the above job to an existing workpool as configured in dse.yaml:
resource_manager_options:
  worker_options:
    cores_total: 6
    memory_total: 32G
    workpools:
      - name: alwayson_sql
        cores: 2
        memory: 4G
      - name: pool_1
        cores: 2
        memory: 16G
I am unable to determine what changes to the code or to the spark-submit invocation are needed to submit the application to the workpool "pool_1"; it is always submitted to the default pool. Please help.
After some additional research I figured out the correct way to call dse spark-submit so that it uses the workpool "pool_1":
bin/dse spark-submit \
--master dse://?workpool=pool_1 \
--conf spark.network.timeout=500 \
--class com.datastax.spark.example.WriteRead target/writeRead-0.1.jar
(Per input from Alex) DSE documentation: Documentation link
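If you also want to cap how much of pool_1 a single application takes, the usual Spark resource properties can be passed alongside the workpool selection. A sketch only: the numbers are arbitrary, and applying spark.cores.max / spark.executor.memory here is an assumption based on DSE's standalone-like resource manager, not something from the original thread:
bin/dse spark-submit \
--master dse://?workpool=pool_1 \
--conf spark.cores.max=2 \
--conf spark.executor.memory=4g \
--class com.datastax.spark.example.WriteRead target/writeRead-0.1.jar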

How to understand spark-submit when the master is YARN?

We have 6 machines in total, with HDFS and YARN services on all nodes (1 master and 6 slaves).
Spark is installed on 3 machines: 1 master and 3 workers (one node runs both the master and a worker).
We know that with --master spark://[host]:[port] the job runs on only those 3 nodes, in standalone mode.
When we submit a jar with spark-submit --master yarn, will it use the CPU and memory of all 6 servers, or only of the 3 Spark worker machines?
And if it can run on all 6 nodes, how do the other 3 servers know it is a Spark job?
Spark: 2.3.1
Hadoop: 2.7.3
In YARN mode, spark-submit sends a resource request to YARN, and the executor containers are launched on whichever NodeManagers have resources available, so all 6 nodes can be used. The Spark jars those containers need are shipped to them by YARN, which is how the 3 servers without a Spark installation can still run the job.
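For example (a sketch: the jar path matches a stock Spark 2.3.1 install and the application/attempt ids are placeholders), you can submit to YARN without starting any Spark standalone workers and then ask YARN which hosts received containers:
spark-submit --master yarn --deploy-mode cluster \
--num-executors 6 --executor-cores 1 --executor-memory 2g \
--class org.apache.spark.examples.SparkPi \
$SPARK_HOME/examples/jars/spark-examples_2.11-2.3.1.jar 1000

# list the application, then its attempt, then the containers and the hosts they run on
yarn application -list
yarn applicationattempt -list application_XXXXXXXXXXXXX_XXXX
yarn container -list appattempt_XXXXXXXXXXXXX_XXXX_000001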

Spark Executors - Are they java processes?

I am new to Spark. When I run spark-submit in client mode with 3 executors, I expect 3 java processes (since there are 3 executors) to show up when I execute ps -ef:
$SPARK_HOME/bin/spark-submit --num-executors 3 --class AverageCalculation --master local[1] /home/customer/SimpleETL/target/SimpleETL-0.1.jar hdfs://node1:9000/home/customer/SimpleETL/standard_input.csv
But I don't see 3 java processes. My understanding is that each executor is a java process. Please advise. Thanks.
Because you use local mode (--master local[1]), the executor settings are not applicable. In this case Spark starts only a single JVM that emulates all components, and allocates the number of threads specified in the local definition (1) as executor threads.
In other modes, executors are separate JVM instances.
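To actually see separate executor JVMs, the job has to run against a real cluster manager instead of local mode. A sketch, reusing the jar and input path from the question and assuming a hypothetical standalone master at spark://node1:7077:
$SPARK_HOME/bin/spark-submit --master spark://node1:7077 \
--class AverageCalculation \
/home/customer/SimpleETL/target/SimpleETL-0.1.jar \
hdfs://node1:9000/home/customer/SimpleETL/standard_input.csv

# on each worker node, every executor is its own JVM named CoarseGrainedExecutorBackend
jps | grep CoarseGrainedExecutorBackend
ps -ef | grep CoarseGrainedExecutorBackend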
Each executor is a java process, i.e. its own JVM, so the number of java processes equals the number of executors. If the executors are distributed across worker nodes, you need to check the processes on those worker nodes, for example with jps. You can also get information about the executors and where they were launched from the Spark history server web UI.
In Spark, there are master nodes and worker nodes. Executors run on worker nodes in their own java processes.
In your spark-submit you can add --deploy-mode cluster and see that executors are running on worker nodes in their own JVM instances.
You can check this answer for detailed workflow of Apache Spark.
/home/spark/spark-2.2.1-bin-hadoop2.7/bin/spark-submit --class org.apache.spark.examples.SparkPi \
--num-executors 1000 \
--master yarn --deploy-mode cluster --driver-memory 4g --executor-memory 2g --executor-cores 1 \
--queue default /home/spark/spark-2.2.1-bin-hadoop2.7/examples/jars/spark-examples_2.11-2.2.1.jar
I executed the command above and checked ps -ef | grep java, but I don't see a lot of java processes. Is there an easy way to identify the executors?
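Note that with --deploy-mode cluster on YARN, neither the driver nor the executors run on the machine where spark-submit was typed, so ps -ef there shows very little; the executors live in YARN containers on the NodeManager hosts. On one of those hosts they can be spotted like this (a sketch):
# executors in YARN containers are CoarseGrainedExecutorBackend JVMs
ps -ef | grep CoarseGrainedExecutorBackend
jps -l | grep org.apache.spark.executor.CoarseGrainedExecutorBackend
The Executors tab of the application's Spark UI also lists every executor together with the host it runs on.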

SPARK_WORKER_INSTANCES setting not working in Spark Standalone Windows

I'm trying to setup a standalone Spark 2.0 server to process an analytics function in parallel. To do this I want to run 8 workers, with a single core per each worker. However, the Spark Master/Worker UI doesn't seem to be reflecting my configuration.
I'm using :
Standalone Spark 2.0
8 Cores 24gig RAM
Windows Server 2008
pyspark
spark-env.sh file is configured as follows:
SPARK_WORKER_INSTANCES = 8
SPARK_WORKER_CORES = 1
SPARK_WORKER_MEMORY = 2g
spark-defaults.conf is configured as follows:
spark.cores.max = 8
I start the master:
spark-class org.apache.spark.deploy.master.Master
I start the workers by running this command 8 times within a batch file:
spark-class org.apache.spark.deploy.worker.Worker spark://10.0.0.10:7077
The problem is that the UI doesn't reflect this: each worker shows 8 cores instead of the 1 core I assigned via the SPARK_WORKER_CORES setting, and the memory shown is the entire machine's memory rather than the 2g assigned to each worker. How can I configure Spark to run with 1 core/2g per worker in standalone mode?
I fixed this by adding the cores and memory arguments to the worker command itself:
start spark-class org.apache.spark.deploy.worker.Worker --cores 1 --memory 2g spark://10.0.0.10:7077
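If the 8 workers are launched from a single batch file, the same flags can go in a loop; a sketch for cmd.exe, using the master URL from the question:
rem start 8 single-core, 2g workers against the existing master
for /L %%i in (1,1,8) do (
  start spark-class org.apache.spark.deploy.worker.Worker --cores 1 --memory 2g spark://10.0.0.10:7077
)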

Spark runs endlessly for Pi example

I just set up Spark and ran the command
spark-shell --master yarn-client --driver-memory 512m --executor-memory 512m
However, it just keeps endlessly printing out messages like
16/04/25 17:34:46 INFO Client: Application report for application_1460481694166_0125 (state: ACCEPTED)
I read somewhere that I could try to kill the application, but I'm not sure which one to kill.
When I try
yarn application -list
I see
Application-Id Application-Name Application-Type User Queue State Final-State Progress Tracking-URL
application_1460481694166_0118 org.apache.spark.examples.SparkPi SPARK root default ACCEPTED UNDEFINED 0% N/A
application_1460481694166_0124 Spark shell SPARK root default ACCEPTED UNDEFINED 0% N/A
application_1460481694166_0120 Spark shell ...
Zeppelin SPARK zeppelin default RUNNING UNDEFINED 10% http://10.0.2.15:4040
application_1460481694166_0117 org.apache.spark.examples.SparkPi SPARK root default ACCEPTED UNDEFINED 0% N/A
application_1460481694166_0123 Spark shell
...
I'm not sure why Zeppelin is showing up, because I closed it in my web browser.
What do I need to do now?
I'm guessing Zeppelin is still running even though you closed your browser; closing the browser is not the same as stopping the hosting process, which you do in the CLI tab that started it. As a last resort, you can yarn application -kill any of the running or accepted applications; once they release their resources, your shell should move from ACCEPTED to RUNNING.
yarn application -kill application_1460481694166_0118
That will kill the first Spark application in the list.
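If several stale submissions are stuck in ACCEPTED, they can be cleaned up in one go; a sketch, assuming a bash shell and the Hadoop yarn CLI on the PATH:
# kill every application still waiting in ACCEPTED state
for app in $(yarn application -list -appStates ACCEPTED 2>/dev/null | awk '/^application_/ {print $1}'); do
  yarn application -kill "$app"
done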
