Running Python Spark on EMR

We're having a hard time running a python spark job on EMR.
aws emr add-steps --cluster-id j-XXXXXXXX --steps \
Type=CUSTOM_JAR,Name="Spark Program",\
Jar="command-runner.jar",ActionOnFailure=CONTINUE,\
Args=["spark-submit",--deploy-mode,cluster,--master,yarn,s3://XXXXXXX/pi.py,2]
We're running the same PySpark compute-pi script that the AWS page suggests.
This script runs, but it runs forever calculating pi. On a local machine it finishes in seconds. We've tried client mode as well; in client mode it makes us transfer the files locally.
16/09/20 15:20:32 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1474384831795
final status: UNDEFINED
tracking URL: http://XXXXXXX.ec2.internal:20888/proxy/application_1474381572045_0002/
user: hadoop
16/09/20 15:20:33 INFO Client: Application report for application_1474381572045_0002 (state: ACCEPTED)
It repeats this last line over and over...
Does anyone know how to run the example python spark pi script on EMR without it running forever?

When you see the job stuck in the ACCEPTED state forever, it means that it is not actually running; it is waiting for YARN to have enough resources to run the application. Usually this is because some other YARN application is already running and taking up the resources. The easiest way to find out if this is the case is to look at the YARN ResourceManager on port 8088 of the master node. You can also run the command "yarn application -list" if you have SSH'ed into the master node.
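If you would rather check this programmatically than in the browser, a quick sketch (assuming the ResourceManager REST API is reachable on port 8088 of the master node; the hostname below is a placeholder) is:
import json
import urllib.request

master = "xxxxxxx.ec2.internal"  # placeholder: your EMR master node's hostname
url = "http://%s:8088/ws/v1/cluster/apps?states=ACCEPTED,RUNNING" % master

# Ask the ResourceManager which applications are currently queued or running
with urllib.request.urlopen(url) as resp:
    apps = (json.load(resp).get("apps") or {}).get("app", [])

for app in apps:
    print(app["id"], app["name"], app["state"], app.get("allocatedMB", 0), "MB allocated")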

Related

Why Spark Drivers don't run when Applications run?

I'm a beginner trying to learn about the behavior of applications and drivers by going through some examples. I'm starting off with:
Running a standalone cluster manager
Running a single master by calling ./sbin/start-master.sh
Running a single worker by calling ./sbin/start-slave.sh spark://localhost:7077
Launching a test application in client mode by calling:
./bin/spark-submit \
--master spark://localhost:7077 \
./examples/src/main/python/pi.py
According to the Docs, the driver program is:
The process running the main() function of the application and creating the SparkContext
My takeaway from this is there should be at least one driver program that runs when an application runs. However, I'm not seeing this in the web UI for the master:
Alive Workers: 1
Cores in use: 4 Total, 0 Used
Memory in use: 15.0 GB Total, 0.0 B Used
Applications: 0 Running, 1 Completed
Drivers: 0 Running, 0 Completed
Status: ALIVE
Shouldn't I expect to see 1 driver running or completed? I've included some config details below.
./conf/spark-defaults.conf:
spark.master=spark://localhost:7077
spark.eventLog.enabled=true
spark.eventLog.dir=./tmp/spark-events/
spark.history.fs.logDirectory=.tmp/spark-events/
spark.driver.memory=5g
If you are running an interactive shell, e.g. pyspark (CLI or via an IPython notebook), by default you are running in client mode.
Client mode does not trigger a Driver program, but cluster mode does.
NOTE: AFAIK you cannot run pyspark or any other interactive shell in cluster mode.
So try running the application in cluster mode using --deploy-mode cluster:
./bin/spark-submit \
--master spark://localhost:7077 \
--deploy-mode cluster \
./examples/src/main/python/pi.py
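For reference, the pi.py example being submitted is roughly the following sketch (not the exact file shipped with Spark); the process that runs this script is the driver program in question:
import sys
from operator import add
from random import random

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PythonPi").getOrCreate()

partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
n = 100000 * partitions

def inside(_):
    # Sample a random point in the square and test whether it falls inside the unit circle
    x, y = random() * 2 - 1, random() * 2 - 1
    return 1 if x * x + y * y <= 1 else 0

count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(inside).reduce(add)
print("Pi is roughly %f" % (4.0 * count / n))

spark.stop()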

pyspark job execution in yarn cluster

I am trying to understand how a Spark job works in a YARN cluster.
I am using the below command to submit the job:
spark-submit --master yarn --deploy-mode cluster sparksessionexample.py
After submitting the job, the console shows the log below:
2020-05-29 20:52:48,668 INFO yarn.Client: Uploading resource file:/tmp/spark-bcd415f0-a22e-46b2-951c-5b6e4385a0c6/__spark_libs__2908230569257238890.zip -> hdfs://localhost:9000/user/hadoop/.sparkStaging/application_1590759398715_0003/__spark_libs__2908230569257238890.zip
2020-05-29 20:53:14,164 INFO yarn.Client: Uploading resource file:/home/hadoop/pythonprojects/Python/src/spark_jobs/sparksessionexample.py -> hdfs://localhost:9000/user/hadoop/.sparkStaging/application_1590759398715_0003/sparksessionexample.py
2020-05-29 20:53:14,610 INFO yarn.Client: Uploading resource file:/home/hadoop/clouderaapp/apache-spark/python/lib/pyspark.zip -> hdfs://localhost:9000/user/hadoop/.sparkStaging/application_1590759398715_0003/pyspark.zip
2020-05-29 20:53:15,984 INFO yarn.Client: Uploading resource file:/home/hadoop/clouderaapp/apache-spark/python/lib/py4j-0.10.7-src.zip -> hdfs://localhost:9000/user/hadoop/.sparkStaging/application_1590759398715_0003/py4j-0.10.7-src.zip
2020-05-29 20:53:18,362 INFO yarn.Client: Uploading resource file:/tmp/spark-bcd415f0-a22e-46b2-951c-5b6e4385a0c6/__spark_conf__7123551182035223076.zip -> hdfs://localhost:9000/user/hadoop/.sparkStaging/application_1590759398715_0003/__spark_conf__.zip
I just want to understand how YARN executes the sparksessionexample.py file. I mean, does it create a Python virtual env on the node? The log above only shows libs and confs being uploaded, but what about the Python client that executes sparksessionexample.py?
Can anyone help understand this?
The "Spark client" is used to bootstrap the Spark job execution.
In your case it is the only thing that runs on your local machine, because you requested cluster execution mode:
the "client" contacts the cluster manager (here YARN Resource Manager, could be Kubernetes Master, etc.) to start the Spark driver inside an AppMaster container
then the driver contacts the cluster manager again to request some containers for the executors
then the driver runs your Python code and distributes the work to the executors
finally the driver de-allocates its executors and itself
at this point the "client" notices that the YARN job has reached success or failure status, and can terminate
In short, the "client" never gets any useful information from the driver running inside the cluster. You must inspect the YARN logs for the container running the driver (it's the AppMaster container, typically number 00001).
If you want to see some feedback from the driver, then run your job in client execution mode -- it means the driver will run in the same JVM as the "client", on your local machine, and print its logs to your console.
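To make the last point concrete, here is a hypothetical stand-in for sparksessionexample.py (the real contents are not shown in the question): anything it prints ends up in the AppMaster container's log in cluster mode, and in your console in client mode.
# hypothetical stand-in for sparksessionexample.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sparksessionexample").getOrCreate()

# This code runs in the driver: inside the AppMaster container in cluster mode,
# or in the local client process in client mode.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
print("row count:", df.count())  # written to the driver's log, wherever the driver runs

spark.stop()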

apache_beam spark runner with python can't be implemented on remote spark cluster?

I am following the Python guide for the Beam Spark runner. The Beam pipeline can submit a job to a local job server, which is launched by ./gradlew :runners:spark:job-server:runShadow with a local Spark,
and with the additional parameter -PsparkMasterUrl=spark://localhost:7077 it can target a pre-deployed Spark.
But I have a Spark cluster on YARN. I set the launch command as ./gradlew :runners:spark:job-server:runShadow -PsparkMasterUrl=yarn (I also tried yarn-client), but only get org.apache.spark.SparkException: Could not parse Master URL: 'yarn'.
And the source code of the Spark runner (beam\sdks\python\apache_beam\runners\portability\spark_runner.py) shows that:
parser.add_argument('--spark_master_url',
                    default='local[4]',
                    help='Spark master URL (spark://HOST:PORT). '
                         'Use "local" (single-threaded) or "local[*]" '
                         '(multi-threaded) to start a local cluster for '
                         'the execution.')
It doesn't mention 'yarn', and the Provided SparkContext and StreamingListeners are not supported on the Spark portable runner. So does that mean the apache_beam Spark runner with Python can't be used against a remote Spark cluster (YARN mostly) and can only be tested locally? Or maybe I can set the job_endpoint to the remote job server URL of my Spark cluster?
Also, every ./gradlew command blocks at 98%, but the job server starts with info like this:
19/11/28 13:47:48 INFO org.apache.beam.runners.fnexecution.jobsubmission.JobServerDriver: JobService started on localhost:8099
<============-> 98% EXECUTING [16s]
> IDLE
> :runners:spark:job-server:runShadow
> IDLE
So does that mean the apache_beam Spark runner with Python can't be used against a remote Spark cluster (YARN mostly)?
We've recently added portable Spark jars, which can be submitted via spark-submit. This feature isn't scheduled to be included in a Beam release until 2.19.0, however.
I created a JIRA ticket to track the status of YARN support, in case there are other related issues that need to be addressed.
and every ./gradlew command blocks at 98%
That's expected behavior. The job server will stay running until canceled.
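In the meantime, pointing the Python pipeline at that job server via job_endpoint (as you suspected) looks roughly like the sketch below; the endpoint and environment_type values are assumptions about your setup:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# assumption: the job server started by ./gradlew ... runShadow is listening on localhost:8099
options = PipelineOptions([
    "--runner=PortableRunner",
    "--job_endpoint=localhost:8099",
    "--environment_type=LOOPBACK",  # run the Python SDK harness in the submitting process
])

with beam.Pipeline(options=options) as p:
    (p
     | beam.Create(["hello", "beam"])
     | beam.Map(print))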

Can't access SparkUI through YARN

I'm building a Docker image to run Zeppelin or spark-shell locally against a production Hadoop cluster with YARN. Edit: the environment is macOS.
I can execute jobs or a spark-shell fine, but when I try to access the Tracking URL on YARN while the job is running, it hangs the YARN UI for exactly 10 minutes. YARN keeps working, and if I connect via SSH I can execute yarn commands.
If I don't access the SparkUI (directly or through YARN), nothing happens: jobs are executed and the YARN UI does not hang.
More info:
Local, on Docker: Spark 2.1.2, Hadoop 2.6.0-cdh5.4.3
Production: Spark 2.1.0, Hadoop 2.6.0-cdh5.4.3
If I execute it locally (--master local[*]) it works and I can connect to the SparkUI through port 4040.
Spark config:
spark.driver.bindAddress 172.17.0.2 #docker_eth0_ip
spark.driver.host 192.168.XXX.XXX #local_ip
spark.driver.port 5001
spark.ui.port 4040
spark.blockManager.port 5003
Yes, the ApplicationMaster and the nodes have visibility of my local SparkUI/driver (verified with telnet).
As I said, I can execute jobs, so the ports Docker exposes and their binding are working. Some logs proving it:
INFO ApplicationMaster: Driver now available: 192.168.XXX.XXX:5001
INFO TransportClientFactory: Successfully created connection to /192.168.XXX.XXX:5001 after 65 ms (0 ms spent in bootstraps)
INFO ApplicationMaster$AMEndpoint: Add WebUI Filter. AddWebUIFilter(org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter,Map(PROXY_HOSTS -> jobtracker.hadoop, PROXY_URI_BASES -> http://jobtracker.hadoop:8088/proxy/application_000_000),/proxy/application_000_000)
Any ideas, or somewhere I can look to see what's happening?
The problem was related to how Docker manages incoming IP requests when it runs on macOS.
When YARN, which is running inside the Docker container, receives a request, it doesn't see the original IP; it sees the internal Docker proxy IP (in my case 172.17.0.1).
When a request is sent to my local container's SparkUI, it automatically redirects the request to the Hadoop master (that is how YARN works) because it sees that the request is not coming from the Hadoop master, and it only accepts requests from that source.
When the master receives the forwarded request, it tries to send it to the Spark driver (my local Docker container), which forwards the request back to the Hadoop master again because it sees that the source IP is not the master's but the proxy IP.
This loop takes up all the threads reserved for the UI. Until those threads are released, the YARN UI stays hung.
I "solved" it by changing the Docker YARN configuration:
<property>
<name>yarn.web-proxy.address</name>
<value>172.17.0.1</value>
</property>
This allows the SparkUI to handle any request made to the Docker container.

Spark runs endlessly for Pi example

I just set up Spark and ran the command:
spark-shell --master yarn-client --driver-memory 512m --executor-memory 512m
However, it just keeps endlessly printing out messages like
16/04/25 17:34:46 INFO Client: Application report for application_1460481694166_0125 (state: ACCEPTED)
I read somewhere that I could try to kill the application, but I'm not sure which one.
When I try
yarn application -list
I see
Application-Id Application-Name Application-Type User Queue State Final-State Progress Tracking-URL
application_1460481694166_0118 org.apache.spark.examples.SparkPi SPARK root default ACCEPTED UNDEFINED 0% N/A
application_1460481694166_0124 Spark shell SPARK root default ACCEPTED UNDEFINED 0% N/A
application_1460481694166_0120 Spark shell ...
Zeppelin SPARK zeppelin default RUNNING UNDEFINED 10% http://10.0.2.15:4040
application_1460481694166_0117 org.apache.spark.examples.SparkPi SPARK root default ACCEPTED UNDEFINED 0% N/A
application_1460481694166_0123 Spark shell
...
I'm not sure why Zeppelin is showing up because I closed it in my web browser
What do I need to do now?
I'm guessing Zeppelin is still running even though you closed your browser. Closing the browser is not the same as stopping the hosting process; that is done in the CLI tab that started the process. As a last resort, you can yarn application -kill any of the running applications from any tab.
yarn application -kill application_1460481694166_0118
That will kill the (first) spark application.
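If several stuck applications have piled up, a small sketch like this (assuming the yarn CLI is on your PATH) kills everything still sitting in the ACCEPTED state:
import subprocess

# List the applications YARN currently knows about (running or waiting)
out = subprocess.run(["yarn", "application", "-list"],
                     capture_output=True, text=True).stdout

for line in out.splitlines():
    if line.startswith("application_") and "ACCEPTED" in line:
        app_id = line.split()[0]
        # Kill each application that is stuck waiting for resources
        subprocess.run(["yarn", "application", "-kill", app_id])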
