apache_beam Spark runner with Python can't be used on a remote Spark cluster?

I am following the Python guide for the Beam Spark runner. My Beam pipeline can submit a job to a local job server launched with ./gradlew :runners:spark:job-server:runShadow, which uses a local Spark,
and, with the additional parameter -PsparkMasterUrl=spark://localhost:7077, to a pre-deployed Spark.
But I have a Spark cluster on YARN. I set the launch command to ./gradlew :runners:spark:job-server:runShadow -PsparkMasterUrl=yarn (also tried yarn-client), but only get org.apache.spark.SparkException: Could not parse Master URL: 'yarn'
The source code of the Spark runner (beam\sdks\python\apache_beam\runners\portability\spark_runner.py) shows:
parser.add_argument('--spark_master_url',
                    default='local[4]',
                    help='Spark master URL (spark://HOST:PORT). '
                         'Use "local" (single-threaded) or "local[*]" '
                         '(multi-threaded) to start a local cluster for '
                         'the execution.')
It doesn't mention 'yarn', and a provided SparkContext and StreamingListeners are not supported on the Spark portable runner. Does that mean the apache_beam Spark runner with Python can't run on a remote Spark cluster (mostly YARN) and can only be tested locally? Or maybe I can set job_endpoint to the URL of a remote job server on my Spark cluster?
Also, every ./gradlew command blocks at 98%, but the job server starts with output like this:
19/11/28 13:47:48 INFO org.apache.beam.runners.fnexecution.jobsubmission.JobServerDriver: JobService started on localhost:8099
<============-> 98% EXECUTING [16s]
> IDLE
> :runners:spark:job-server:runShadow
> IDLE

"Does that mean the apache_beam Spark runner with Python can't run on a remote Spark cluster (mostly YARN)?"
We've recently added portable Spark jars, which can be submitted via spark-submit. This feature isn't scheduled to be included in a Beam release until 2.19.0, however.
I created a JIRA ticket to track the status of YARN support, in case there are other related issues that need to be addressed.
"Every ./gradlew command blocks at 98%"
That's expected behavior. The job server will stay running until canceled.
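As a rough sketch of the client side, the options below point a Python pipeline at the job server listening on localhost:8099 (the address from the log above). The helper names portable_runner_args and run are hypothetical, and the LOOPBACK environment is an assumption that only suits a job server and Spark running on the same machine as the SDK; a remote cluster would likely need a different environment type.

```python
def portable_runner_args(job_endpoint="localhost:8099",
                         environment_type="LOOPBACK"):
    """Build pipeline options for submitting to a Beam job server.

    The defaults are assumptions for a local test setup, not values
    recommended for a remote cluster.
    """
    return [
        "--runner=PortableRunner",
        "--job_endpoint=" + job_endpoint,
        "--environment_type=" + environment_type,
    ]


def run(argv):
    """Submit a tiny counting pipeline using the given options."""
    # Imported lazily so the option helper works without Beam installed.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    with beam.Pipeline(options=PipelineOptions(argv)) as pipeline:
        (pipeline
         | "Create" >> beam.Create(["a", "b", "a"])
         | "Count" >> beam.combiners.Count.PerElement()
         | "Print" >> beam.Map(print))
```

Calling run(portable_runner_args()) while the job server is up should submit the job. Pointing job_endpoint at a job server on another host is the untested part that the question asks about.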

Related

Spark on YARN : Job Submitted v/s Accepted?

I am running a Spark job in YARN cluster mode. What is the difference between the YARN Accepted and YARN Submitted statuses?
We submit the Spark job using spark-submit (cluster mode on YARN).
YARN submitted: the job has been submitted to the YARN scheduler queue (FIFO/Fair scheduler) and is waiting for its turn.
YARN accepted: YARN has started executing the job, but only the application master is running; the application master has not yet obtained resources from the resource manager to run the job.

How can I run spark in headless mode in my custom version on HDP?

How can I run spark in headless mode?
Currently, I am running Spark on an HDP 2.6.4 cluster (i.e. Spark 2.2 is installed by default).
I have downloaded a Spark 2.4.1 Scala 2.11 release in headless mode (i.e. no Hadoop jars are built in) from https://spark.apache.org/downloads.html. The exact name is: pre-built with scala 2.11 and user provided hadoop
Now when trying to run I follow: https://spark.apache.org/docs/latest/hadoop-provided.html
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
export HADOOP_CONF_DIR=/etc/hadoop/conf
export SPARK_HOME=/home/<<my_user>>/development/software/spark_no_provided_hadoop
./bin/spark-shell --master yarn --deploy-mode client --queue <<my_yarn_queue>>
Unfortunately, it fails to start:
19/05/01 07:12:23 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
19/05/01 07:12:38 ERROR cluster.YarnClientSchedulerBackend: The YARN application has already ended! It might have been killed or the Application Master may have failed to start. Check the YARN application logs for more details.
19/05/01 07:12:38 ERROR spark.SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: Application application_1555489055691_64276 failed 2 times due to AM Container for appattempt_1555489055691_64276_000002 exited with exitCode: 1
When looking at the logs for details I see:
Log Type: prelaunch.err
launch_container.sh: line 30: $PWD:$PWD/__spark_conf__:$PWD/__spark_libs__/*:/etc/hadoop/conf:/usr/hdp/2.6.4.0-91/hadoop/*:/usr/hdp/2.6.4.0-91/hadoop/lib/*:/usr/hdp/current/hadoop-hdfs-client/*:/usr/hdp/current/hadoop-hdfs-client/lib/*:/usr/hdp/current/hadoop-yarn-client/*:/usr/hdp/current/hadoop-yarn-client/lib/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/lib/*:$PWD/mr-framework/hadoop/share/hadoop/common/*:$PWD/mr-framework/hadoop/share/hadoop/common/lib/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/lib/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/lib/*:$PWD/mr-framework/hadoop/share/hadoop/tools/lib/*:/usr/hdp/${hdp.version}/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar:/etc/hadoop/conf/secure:/usr/hdp/2.6.4.0-91/hadoop/conf:/usr/hdp/2.6.4.0-91/hadoop/lib/*:/usr/hdp/2.6.4.0-91/hadoop/.//*:/usr/hdp/2.6.4.0-91/hadoop-hdfs/./:/usr/hdp/2.6.4.0-91/hadoop-hdfs/lib/*:/usr/hdp/2.6.4.0-91/hadoop-hdfs/.//*:/usr/hdp/2.6.4.0-91/hadoop-yarn/lib/*:/usr/hdp/2.6.4.0-91/hadoop-yarn/.//*:/usr/hdp/2.6.4.0-91/hadoop-mapreduce/lib/*:/usr/hdp/2.6.4.0-91/hadoop-mapreduce/.//*:/usr/hdp/2.6.4.0-91/tez/*:/usr/hdp/2.6.4.0-91/tez/lib/*:/usr/hdp/2.6.4.0-91/tez/conf:$PWD/__spark_conf__/__hadoop_conf__: bad substitution
So:
/usr/hdp/${hdp.version}/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar: bad substitution
is the cause (and similar to https://community.hortonworks.com/questions/23699/bad-substitution-error-running-spark-on-yarn.html), but this is completely inside Ambari's management domain. How can I work around it to run a more recent version of Spark (2.4.x) on the existing HDP 2.6.x platform?
edit
Assuming I had passed the wrong configuration directory for HADOOP_CONF_DIR, I unset it. But then:
When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.
so it must be set. Could it be that I am passing the wrong value?
According to the question "Exception: java.lang.Exception: When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment. in spark", my value could be correct. For me, no HADOOP_HOME is set by default.
Even when setting export HADOOP_CONF_DIR=/usr/hdp/current/spark2-client/conf, the same bad substitution error remains.
NOTE: some interesting steps:
https://community.hortonworks.com/articles/244059/steps-to-install-supplementary-spark-on-hdp-cluste.html, but not for the headless edition
https://community.hortonworks.com/questions/85757/how-to-add-the-hadoop-and-yarn-configuration-file.html
Indeed, https://community.hortonworks.com/questions/23699/bad-substitution-error-running-spark-on-yarn.html is the solution:
cd /usr/hdp
ls
2.6.xxx current share
So for me:
./bin/spark-shell --master yarn --deploy-mode client --queue <<my_queue>> --conf spark.driver.extraJavaOptions='-Dhdp.version=2.6.xxx' --conf spark.yarn.am.extraJavaOptions='-Dhdp.version=2.6.xxx'
works
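The two steps above (look up the version directory under /usr/hdp, then pass it as hdp.version to both the driver and the application master) can be sketched in Python. The function names are hypothetical, and the sketch assumes exactly one versioned directory exists next to "current" and "share"; it only builds the command and does not run it.

```python
def detect_hdp_version(entries):
    """Return the single versioned directory name found under /usr/hdp."""
    # Version directories start with a digit; "current" and "share" do not.
    versions = [name for name in entries if name[:1].isdigit()]
    if len(versions) != 1:
        raise ValueError("expected exactly one HDP version, got %r" % versions)
    return versions[0]


def spark_shell_command(hdp_version, queue):
    """Build the spark-shell argv with hdp.version set for driver and AM."""
    java_opt = "-Dhdp.version=" + hdp_version
    return [
        "./bin/spark-shell",
        "--master", "yarn",
        "--deploy-mode", "client",
        "--queue", queue,
        "--conf", "spark.driver.extraJavaOptions=" + java_opt,
        "--conf", "spark.yarn.am.extraJavaOptions=" + java_opt,
    ]
```

For example, spark_shell_command(detect_hdp_version(os.listdir("/usr/hdp")), "my_queue") yields a list that can be handed to subprocess.run from the Spark installation directory.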

How to run an interactive spark application from spark-shell/spark-submit

I have a Spark app that reads a large dataset, loads it into memory, and prepares everything so the user can query the in-memory DataFrame multiple times. Once a query is done, the user is prompted on the console to either continue with a new set of inputs or quit the application.
I can do this just fine in the IDE. However, can I run this interactive Spark app from spark-shell?
I've used Spark Job Server before to run multiple interactive queries against a DataFrame loaded in memory, but not from a shell. Any pointers?
Thanks!
UPDATE 1:
Here is what the project jar looks like; it's packaged with all the other dependencies.
jar tf target/myhome-0.0.1-SNAPSHOT.jar
META-INF/MANIFEST.MF
META-INF/
my_home/
my_home/myhome/
my_home/myhome/App$$anonfun$foo$1.class
my_home/myhome/App$.class
my_home/myhome/App.class
my_home/myhome/Constants$.class
my_home/myhome/Constants.class
my_home/myhome/RecommendMatch$$anonfun$1.class
my_home/myhome/RecommendMatch$$anonfun$2.class
my_home/myhome/RecommendMatch$$anonfun$3.class
my_home/myhome/RecommendMatch$.class
my_home/myhome/RecommendMatch.class
I ran spark-shell with the following options:
spark-shell -i my_home/myhome/RecommendMatch.class --master local --jars /Users/anon/Documents/Works/sparkworkspace/myhome/target/myhome-0.0.1-SNAPSHOT.jar
but the shell prints the following message on startup. The jars are loaded, according to the environment shown at localhost:4040.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/05/16 10:10:01 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/05/16 10:10:06 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Spark context Web UI available at http://192.168.0.101:4040
Spark context available as 'sc' (master = local, app id = local-1494909601904).
Spark session available as 'spark'.
That file does not exist
Welcome to
...
UPDATE 2 (using spark-submit)
Tried with full path to jar. Next, tried by copying project jar to bin location.
pwd
/usr/local/Cellar/apache-spark/2.1.0/bin
spark-submit --master local —-class my_home.myhome.RecommendMatch.class --jars myhome-0.0.1-SNAPSHOT.jar
Error: Cannot load main class from JAR file:/usr/local/Cellar/apache-spark/2.1.0/bin/—-class
Try the -i <path_to_file> option to run the scala code in your file or the scala shell :load <path_to_file> function.
Relevant Q&A: Spark : how to run spark file from spark shell
The following command works to run an interactive spark application.
spark-submit /usr/local/Cellar/apache-spark/2.1.0/bin/myhome-0.0.1-SNAPSHOT.jar
Note that this is an uber jar built with the main class as its entry point and including all dependent libraries. Check out http://maven.apache.org/plugins/maven-shade-plugin/
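The error in UPDATE 2 names the class option itself as the JAR file, which is consistent with that option having been typed with an em dash followed by a hyphen instead of two hyphens, so spark-submit treated it as the application jar. A small Python sketch that flags such arguments and builds a plain argv is below; the function names are hypothetical, and dropping the ".class" suffix from the class name is my assumption about what spark-submit expects here.

```python
# Dash-like characters that editors often substitute for "-" or "--".
UNICODE_DASHES = "\u2012\u2013\u2014\u2015\u2212"


def suspicious_options(argv):
    """Return arguments that start with a non-ASCII dash character."""
    return [arg for arg in argv if arg and arg[0] in UNICODE_DASHES]


def spark_submit_argv(main_class, app_jar, master="local"):
    """Build a spark-submit argv with the jar as the final positional arg."""
    return ["spark-submit", "--master", master, "--class", main_class, app_jar]
```

Running suspicious_options over a pasted command line before executing it is a cheap way to catch this class of typo.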

Spark + Mesos cluster mode, who uploads the jar?

I'm trying to run Spark applications with Mesos cluster mode. (I've got client mode working but still would like to try cluster mode)
I have launched spark-mesos-dispatcher on the Mesos master node.
When I submit the assembly at local path /tmp/assembly.jar using the following command,
bin/spark-submit --master mesos://dispatcher:7077 --deploy-mode cluster --class com.example.Example /tmp/assembly.jar
It fails because the file /tmp/assembly.jar does not exist on the mesos slave nodes.
I1129 10:47:43.839771 5884 fetcher.cpp:414] Fetcher Info: {"cache_directory":"\/tmp\/mesos\/fetch\/slaves\/9d725348-931a-48fb-96f7-d29a4b09f3e8-S9\/deploy","items":[{"action":"BYPASS_CACHE","uri":{"extract":true,"value":"\/tmp\/assembly.jar"}}],"sandbox_directory":"\/var\/lib\/mesos\/slaves\/9d725348-931a-48fb-96f7-d29a4b09f3e8-S9\/frameworks\/9d725348-931a-48fb-96f7-d29a4b09f3e8-0291\/executors\/driver-20151129104742-0008\/runs\/31bf5840-226e-4b87-ae76-d14bd2f17950","user":"user"}
I1129 10:47:43.840710 5884 fetcher.cpp:369] Fetching URI '/tmp/assembly.jar'
I1129 10:47:43.840721 5884 fetcher.cpp:243] Fetching directly into the sandbox directory
I1129 10:47:43.840731 5884 fetcher.cpp:180] Fetching URI '/tmp/assembly.jar'
I1129 10:47:43.840737 5884 fetcher.cpp:160] Copying resource with command:cp '/tmp/assembly.jar' '/var/lib/mesos/slaves/9d725348-931a-48fb-96f7-d29a4b09f3e8-S9/frameworks/9d725348-931a-48fb-96f7-d29a4b09f3e8-0291/executors/driver-20151129104742-0008/runs/31bf5840-226e-4b87-ae76-d14bd2f17950/assembly.jar'
cp: cannot stat `/tmp/assembly.jar': No such file or directory
Failed to fetch '/tmp/assembly.jar': Failed to copy with command 'cp '/tmp/assembly.jar' '/var/lib/mesos/slaves/9d725348-931a-48fb-96f7-d29a4b09f3e8-S9/frameworks/9d725348-931a-48fb-96f7-d29a4b09f3e8-0291/executors/driver-20151129104742-0008/runs/31bf5840-226e-4b87-ae76-d14bd2f17950/assembly.jar'', exit status: 256
Failed to synchronize with slave (it's probably exited)
In the case of YARN cluster mode, Spark's YARN client implementation uploads the application jar to HDFS so that the driver and all executors have access to it, but I could not find such code in RestSubmissionClient, which is used by Mesos and Standalone cluster mode.
Who does the uploading in this case? Or do I need to manually put the application assembly somewhere accessible via an HTTP URI?
From my understanding you could use the SparkContext addJar() method to add a local (to the driver application) JAR file path, which will then be distributed to the executor nodes (in client mode).
As you state that you want to use cluster mode, I'd suggest that you have a look at the Spark Jobserver project, which should make the running of Spark applications on Mesos easier than with the built-in tools.
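The fetcher log in the question shows the agent running a plain cp on /tmp/assembly.jar, so the path has to resolve on the agent itself. As far as I know, the jar passed to spark-submit in Mesos cluster mode is expected to be a URI the agents can reach. A minimal Python sketch of a pre-submit check follows; the function name is hypothetical and the set of schemes is an assumption to adjust for the cluster's fetcher setup.

```python
from urllib.parse import urlparse

# Assumed set of schemes the agents can fetch; adjust for the cluster.
REMOTE_SCHEMES = {"http", "https", "hdfs"}


def reachable_from_agents(jar_uri):
    """Return True if the jar URI uses a scheme agents can fetch remotely."""
    # A bare path such as /tmp/assembly.jar has an empty scheme and only
    # works if the file already exists at that path on every agent.
    return urlparse(jar_uri).scheme in REMOTE_SCHEMES
```

With this check, the /tmp/assembly.jar path from the question is rejected, while an hdfs:// or http:// location for the same assembly passes.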

SparkDeploySchedulerBackend Error: Application has been killed. All masters are unresponsive

While I'm starting Spark shell:
bin>./spark-shell
I get the following error :
Spark assembly has been built with Hive, including Data nucleus jars on classpath
Welcome to SPARK VERSION 1.3.0
Using Scala version 2.10.4 (Java HotSpot(TM) Server VM, Java 1.7.0_75)
Type in expressions to have them evaluated.
Type :help for more information.
15/05/10 12:12:21 ERROR SparkDeploySchedulerBackend: Application has been killed. Reason: All masters are unresponsive! Giving up.
15/05/10 12:12:21 ERROR TaskSchedulerImpl: Exiting due to error from cluster scheduler: All masters are unresponsive! Giving up.
I installed Spark by following this link: http://www.philchen.com/2015/02/16/how-to-install-apache-spark-and-cassandra-stack-on-ubuntu
You should supply your Spark cluster's master URL when starting spark-shell.
At least:
bin/spark-shell --master spark://master-ip:7077
All the options make up a long list and you can find the suitable ones yourself:
bin/spark-shell --help
I am assuming that you are running this in standalone/local mode.
Run your Spark shell with the following line. It tells Spark to use all the available cores of the master, which here is the local machine.
bin/spark-shell --master local[*]
http://spark.apache.org/docs/1.2.1/submitting-applications.html#master-urls
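To make the master URL forms mentioned in these threads concrete, here is a small Python sketch that sorts a --master value into a cluster manager type. The function name and the returned labels are hypothetical, and only the forms that appear on this page are covered; "unknown" means the sketch does not recognise the value, not that Spark would reject it.

```python
import re


def classify_master(url):
    """Map a Spark master URL to a coarse cluster manager label."""
    # local, local[4], local[*]
    if re.fullmatch(r"local(\[(\d+|\*)\])?", url):
        return "local"
    # spark://HOST:PORT for a standalone master
    if re.fullmatch(r"spark://[^:/]+:\d+", url):
        return "standalone"
    # mesos://HOST:PORT for a Mesos master or dispatcher
    if re.fullmatch(r"mesos://.+", url):
        return "mesos"
    if url == "yarn":
        return "yarn"
    return "unknown"
```

A script could call classify_master before launching spark-shell and, for "standalone", first check that the master at that host and port is actually running.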
You also need to start the Spark master and slave before running the spark-submit command:
start-master.sh
start-slave.sh spark://spark:7077
then use
spark-submit --master spark://spark:7077
Look at your log files for "permission denied" errors. It may be that your client service doesn't have the proper permissions to access your master's folders.
