Spark runs endlessly for Pi example

I just set up Spark and ran the command
spark-shell --master yarn-client --driver-memory 512m --executor-memory 512m
However, it just keeps endlessly printing out messages like
16/04/25 17:34:46 INFO Client: Application report for application_1460481694166_0125 (state: ACCEPTED)
I read somewhere that I could try to kill the application, but I'm not sure which one to kill.
When I try
yarn application -list
I see
Application-Id                  Application-Name                   Application-Type  User      Queue    State     Final-State  Progress  Tracking-URL
application_1460481694166_0118  org.apache.spark.examples.SparkPi  SPARK             root      default  ACCEPTED  UNDEFINED    0%        N/A
application_1460481694166_0124  Spark shell                        SPARK             root      default  ACCEPTED  UNDEFINED    0%        N/A
application_1460481694166_0120  Spark shell                        ...
...                             Zeppelin                           SPARK             zeppelin  default  RUNNING   UNDEFINED    10%       http://10.0.2.15:4040
application_1460481694166_0117  org.apache.spark.examples.SparkPi  SPARK             root      default  ACCEPTED  UNDEFINED    0%        N/A
application_1460481694166_0123  Spark shell                        ...
...
I'm not sure why Zeppelin is showing up, because I closed it in my web browser.
What do I need to do now?

I'm guessing Zeppelin is still running even though you closed your browser. Closing the browser is not the same as stopping the hosting process; that has to be done in the terminal that started it. As a last resort, you can run yarn application -kill against any of the running applications from any terminal:
yarn application -kill application_1460481694166_0118
That will kill the (first) Spark application.
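If several stale applications are stuck in the ACCEPTED state, a small shell loop can clear them all at once. A minimal sketch, assuming the default yarn CLI output format (the application ID is the first field of each data row):

# Kill every application currently stuck in ACCEPTED state.
for app in $(yarn application -list -appStates ACCEPTED 2>/dev/null | awk '/^application_/ {print $1}'); do
  yarn application -kill "$app"
done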

Related

Why don't Spark drivers run when applications run?

I'm a beginner trying to learn about the behavior of applications and drivers by going through some examples. I'm starting off with:
Running a standalone cluster manager
Running a single master by calling ./sbin/start-master.sh
Running a single worker by calling ./sbin/start-slave.sh spark://localhost:7077
Launching a test application in client mode by calling:
./bin/spark-submit \
--master spark://localhost:7077 \
./examples/src/main/python/pi.py
According to the Docs:
Driver program: the process running the main() function of the application and creating the SparkContext
My takeaway from this is that there should be at least one driver program running when an application runs. However, I'm not seeing this in the web UI for the master:
Alive Workers: 1
Cores in use: 4 Total, 0 Used
Memory in use: 15.0 GB Total, 0.0 B Used
Applications: 0 Running, 1 Completed
Drivers: 0 Running, 0 Completed
Status: ALIVE
Shouldn't I expect to see 1 driver running or completed? I've included some config details below.
./conf/spark-defaults.conf:
spark.master=spark://localhost:7077
spark.eventLog.enabled=true
spark.eventLog.dir=./tmp/spark-events/
spark.history.fs.logDirectory=./tmp/spark-events/
spark.driver.memory=5g
If you are running an interactive shell, e.g. pyspark (from the CLI or via an IPython notebook), you are running in client mode by default.
In client mode the driver runs inside the spark-submit (or shell) process itself, so the standalone master does not list it under Drivers; only drivers launched by the master in cluster mode show up there.
NOTE: AFAIK you cannot run pyspark or any other interactive shell in cluster mode.
So try running the application in cluster mode using --deploy-mode cluster:
./bin/spark-submit \
--master spark://localhost:7077 \
--deploy-mode cluster \
./examples/src/main/python/pi.py
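One caveat: the standalone cluster manager does not support cluster deploy mode for Python applications, so submitting pi.py this way may be rejected. To see a driver actually tracked by the master, you can submit the Scala SparkPi example instead. A sketch, where the examples jar path and version are assumptions that depend on your Spark distribution:

# Cluster deploy mode with a JVM application; the master launches the
# driver on a worker, so it shows up under "Drivers" in the master UI.
./bin/spark-submit \
  --master spark://localhost:7077 \
  --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  ./examples/jars/spark-examples_2.11-2.4.0.jar 100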

pyspark job execution in yarn cluster

I am trying to understand how a Spark job works in a YARN cluster.
I am using the command below to submit the job:
spark-submit --master yarn --deploy-mode cluster sparksessionexample.py
After submitting the job, the console shows the log below:
2020-05-29 20:52:48,668 INFO yarn.Client: Uploading resource file:/tmp/spark-bcd415f0-a22e-46b2-951c-5b6e4385a0c6/__spark_libs__2908230569257238890.zip -> hdfs://localhost:9000/user/hadoop/.sparkStaging/application_1590759398715_0003/__spark_libs__2908230569257238890.zip
2020-05-29 20:53:14,164 INFO yarn.Client: Uploading resource file:/home/hadoop/pythonprojects/Python/src/spark_jobs/sparksessionexample.py -> hdfs://localhost:9000/user/hadoop/.sparkStaging/application_1590759398715_0003/sparksessionexample.py
2020-05-29 20:53:14,610 INFO yarn.Client: Uploading resource file:/home/hadoop/clouderaapp/apache-spark/python/lib/pyspark.zip -> hdfs://localhost:9000/user/hadoop/.sparkStaging/application_1590759398715_0003/pyspark.zip
2020-05-29 20:53:15,984 INFO yarn.Client: Uploading resource file:/home/hadoop/clouderaapp/apache-spark/python/lib/py4j-0.10.7-src.zip -> hdfs://localhost:9000/user/hadoop/.sparkStaging/application_1590759398715_0003/py4j-0.10.7-src.zip
2020-05-29 20:53:18,362 INFO yarn.Client: Uploading resource file:/tmp/spark-bcd415f0-a22e-46b2-951c-5b6e4385a0c6/__spark_conf__7123551182035223076.zip -> hdfs://localhost:9000/user/hadoop/.sparkStaging/application_1590759398715_0003/__spark_conf__.zip
I just want to understand how YARN executes the sparksessionexample.py file. Does it create a Python virtual env on the node? The log above shows only libs and confs being uploaded, but what about the Python interpreter that runs sparksessionexample.py?
Can anyone help me understand this?
The "Spark client" is used to bootstrap the Spark job execution.
In your case it is the only thing that runs on your local machine, because you requested cluster execution mode:
the "client" contacts the cluster manager (here the YARN ResourceManager; it could be a Kubernetes master, etc.) to start the Spark driver inside an AppMaster container
then the driver contacts the cluster manager again to request containers for the executors
then the driver runs your Python code and distributes the work to the executors
finally the driver de-allocates its executors and then itself
at this point the "client" notices that the YARN job has reached success or failure status, and can terminate
In short, the "client" never gets any kind of useful information from the driver running inside the cluster. You must inspect the YARN logs for the container running the driver (it's the AppMaster container, typically number 000001).
If you want to see some feedback from the driver, run your job in client execution mode: the driver will then run in the same JVM as the "client", on your local machine, and print its logs to your console.
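To pull the driver's output after the fact, the aggregated YARN logs are the place to look. For example, once the application has finished (and assuming log aggregation is enabled on the cluster):

# Fetch the aggregated logs for the whole application; the driver's
# stdout/stderr live in the AppMaster container's log.
yarn logs -applicationId application_1590759398715_0003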

Running python spark on EMR

We're having a hard time running a python spark job on EMR.
aws emr add-steps --cluster-id j-XXXXXXXX --steps \
Type=CUSTOM_JAR,Name="Spark Program",\
Jar="command-runner.jar",ActionOnFailure=CONTINUE,\
Args=["spark-submit",--deploy-mode,cluster,--master,yarn,s3://XXXXXXX/pi.py,2]
We're running the same pyspark compute-pi script that the AWS page suggests.
The script runs, but it runs forever calculating pi; on a local machine it takes seconds to finish. We've also tried client mode, but client mode makes us transfer the files locally.
16/09/20 15:20:32 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1474384831795
final status: UNDEFINED
tracking URL: http://XXXXXXX.ec2.internal:20888/proxy/application_1474381572045_0002/
user: hadoop
16/09/20 15:20:33 INFO Client: Application report for application_1474381572045_0002 (state: ACCEPTED)
It repeats this last line over and over...
Does anyone know how to run the example python spark pi script on EMR without it running forever?
When you see the job in the ACCEPTED state forever, it means that it is not actually running, but rather is waiting for YARN to have enough resources available to run it. Usually this is because you already have some other YARN application running and taking up resources. The easiest way to find out if this is the case is to look at the YARN ResourceManager UI on port 8088 of the master node. You can also run yarn application -list if you have SSHed into the master node.
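A quick way to check from the master node, assuming shell access:

# List everything YARN is currently holding or running; anything besides
# your new job is competing for the same resources.
yarn application -list -appStates ACCEPTED,RUNNING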

Spark cluster setup error

After some research on the internet, I found I can use
sbin/start-master.sh
to start the Spark master service on my Ubuntu Linux machines,
and use
bin/spark-class org.apache.spark.deploy.worker.Worker spark://...
to bring the worker service up and running on the slave nodes.
The good news is that I can see the local web page with the workers listed as alive.
However, after that, I tried to launch the shell with
MASTER=spark://localhost:7077 bin/spark-shell
but it returned:
sparkMaster#localhost:7077 ...
And therefore I modified the command to
MASTER=spark://sparkuser#localhost:7077 bin/spark-shell
where sparkuser is the user connected to the two nodes.
However, with this modification, I got:
ERROR SparkDeploySchedulerBackend: Application has been killed. Reason: All masters are unresponsive! Giving up.
WARN SparkDeploySchedulerBackend: Application ID is not initialized yet.
ERROR TaskSchedulerImpl: Exiting due to error from cluster scheduler: All masters are unresponsive! Giving up.
and when I tried
MASTER=local-cluster[3,2,1024] bin/spark-shell
It did come up with the Spark logo in the shell, but I was afraid the slave nodes were not bound in.
Did I miss anything for the Spark cluster setting?
Just launch spark-shell on the cluster with the --master flag, as follows:
./bin/spark-shell --master spark://localhost:7077
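One thing worth double-checking: the URL passed to --master must match exactly what the master advertises. The master's web UI (port 8080 by default) shows it near the top, e.g. "URL: spark://<hostname>:7077". A sketch of the check, where the hostname is a placeholder:

# Use the exact spark:// URL reported by the master's web UI;
# 'localhost' only works if the master registered itself that way.
./bin/spark-shell --master spark://<hostname>:7077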

Cannot submit Spark app to cluster, stuck on "UNDEFINED"

I use this command to submit a Spark application to the YARN cluster:
export YARN_CONF_DIR=conf
bin/spark-submit --class "Mining" \
  --master yarn-cluster \
  --executor-memory 512m ./target/scala-2.10/mining-assembly-0.1.jar
In the Web UI, it is stuck on UNDEFINED.
In the console, it is stuck at:
14/11/12 16:37:55 INFO yarn.Client: Application report from ASM:
application identifier: application_1415704754709_0017
appId: 17
clientToAMToken: null
appDiagnostics:
appMasterHost: example.com
appQueue: default
appMasterRpcPort: 0
appStartTime: 1415784586000
yarnAppState: RUNNING
distributedFinalState: UNDEFINED
appTrackingUrl: http://example.com:8088/proxy/application_1415704754709_0017/
appUser: rain
Update:
Diving into the logs for the container in the Web UI at http://example.com:8042/node/containerlogs/container_1415704754709_0017_01_000001/rain/stderr/?start=0, I found this:
14/11/12 02:11:47 WARN YarnClusterScheduler: Initial job has not accepted
any resources; check your cluster UI to ensure that workers are registered
and have sufficient memory
14/11/12 02:11:47 DEBUG Client: IPC Client (1211012646) connection to
spark.mvs.vn/192.168.64.142:8030 from rain sending #24418
14/11/12 02:11:47 DEBUG Client: IPC Client (1211012646) connection to
spark.mvs.vn/192.168.64.142:8030 from rain got value #24418
I found that this problem has a solution here: http://hortonworks.com/hadoop-tutorial/using-apache-spark-hdp/
The Hadoop cluster must have sufficient memory for the request.
For example, submitting the following job with 1GB memory allocated for
executor and Spark driver fails with the above error in the HDP 2.1 Sandbox.
Reduce the memory asked for the executor and the Spark driver to 512m and
re-start the cluster.
I'm trying this solution and hopefully it will work.
Solution
Finally, I found that it was caused by a memory problem.
It worked when I changed yarn.nodemanager.resource.memory-mb to 3072 (its previous value was 2048) in the Web UI and restarted the cluster.
I was very happy to see this.
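For reference, on a cluster configured by hand rather than through a management UI, the same change would go into yarn-site.xml on each NodeManager node. A sketch of the entry:

<!-- Total memory this NodeManager may hand out to containers -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>3072</value>
</property>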
With 3 GB available to the YARN NodeManager, my submit command is:
bin/spark-submit \
  --class "Mining" \
  --master yarn-cluster \
  --executor-memory 512m \
  --driver-memory 512m \
  --num-executors 2 \
  --executor-cores 1 \
  ./target/scala-2.10/mining-assembly-0.1.jar
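A rough back-of-the-envelope check of why 2048 MB was not enough, assuming the default memory overhead of max(384 MB, 10%) per container and the default 1024 MB minimum container allocation (both are assumptions about this cluster's configuration):

driver (AppMaster) container:  512m + 384m overhead = 896m  -> rounded up to 1024m
2 executor containers:         2 x (512m + 384m)    = 1792m -> rounded up to 2048m
total requested:                                              3072m

So the job fits exactly once the NodeManager offers 3072 MB, and could never be scheduled with only 2048 MB.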
