How to run an interactive spark application from spark-shell/spark-submit - apache-spark

I have a Spark app that reads a large dataset, loads it into memory and sets everything up so the user can query the in-memory DataFrame multiple times. Once a query is done, the user is prompted on the console to either continue with a new set of input or quit the application.
I can do this very well in the IDE. However, can I run this interactive Spark app from spark-shell?
I've used Spark Job Server before to run multiple interactive queries against an in-memory DataFrame, but not from a shell. Any pointers?
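Roughly, the driver loop looks like this (a minimal sketch; the data path, column name and prompt are placeholders, not my actual code):
import scala.io.StdIn
import org.apache.spark.sql.SparkSession

object InteractiveApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("interactive-demo").getOrCreate()
    // Load the large dataset once and keep it in memory
    val df = spark.read.parquet("/path/to/large/data").cache()
    df.count()  // materialize the cache up front

    var running = true
    while (running) {
      val input = StdIn.readLine("Enter a value to query (or 'quit'): ")
      if (input == null || input == "quit") running = false
      else df.filter(df("key") === input).show()  // query the cached DataFrame again
    }
    spark.stop()
  }
}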
Thanks!
UPDATE 1:
Here is how the project jar looks; it's packaged with all the other dependencies.
jar tf target/myhome-0.0.1-SNAPSHOT.jar
META-INF/MANIFEST.MF
META-INF/
my_home/
my_home/myhome/
my_home/myhome/App$$anonfun$foo$1.class
my_home/myhome/App$.class
my_home/myhome/App.class
my_home/myhome/Constants$.class
my_home/myhome/Constants.class
my_home/myhome/RecommendMatch$$anonfun$1.class
my_home/myhome/RecommendMatch$$anonfun$2.class
my_home/myhome/RecommendMatch$$anonfun$3.class
my_home/myhome/RecommendMatch$.class
my_home/myhome/RecommendMatch.class
and ran spark-shell with the following options:
spark-shell -i my_home/myhome/RecommendMatch.class --master local --jars /Users/anon/Documents/Works/sparkworkspace/myhome/target/myhome-0.0.1-SNAPSHOT.jar
but the shell throws the following message on startup. The jars are loaded, as per the environment shown at localhost:4040:
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/05/16 10:10:01 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/05/16 10:10:06 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Spark context Web UI available at http://192.168.0.101:4040
Spark context available as 'sc' (master = local, app id = local-1494909601904).
Spark session available as 'spark'.
That file does not exist
Welcome to
...
UPDATE 2 (using spark-submit)
I tried with the full path to the jar. Next, I tried copying the project jar to the bin location.
pwd
/usr/local/Cellar/apache-spark/2.1.0/bin
spark-submit --master local —-class my_home.myhome.RecommendMatch.class --jars myhome-0.0.1-SNAPSHOT.jar
Error: Cannot load main class from JAR file:/usr/local/Cellar/apache-spark/2.1.0/bin/—-class

Try the -i <path_to_file> option to run the Scala code in your file, or the Scala shell's :load <path_to_file> function.
Relevant Q&A: Spark : how to run spark file from spark shell
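For example (a sketch; queries.scala is a hypothetical Scala source file containing the interactive driver code as top-level statements):
spark-shell --master local --jars target/myhome-0.0.1-SNAPSHOT.jar -i my_home/myhome/queries.scala
or, from inside an already running shell:
:load my_home/myhome/queries.scala
Note that -i expects a path to a Scala source file on the local filesystem; pointing it at a .class entry that only exists inside the jar is most likely why the shell printed "That file does not exist".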

Note that the failing spark-submit above uses an em dash instead of a double hyphen for --class (which is why spark-submit treats it as the JAR path) and appends .class to the class name instead of passing the fully qualified class my_home.myhome.RecommendMatch. The following command works to run an interactive Spark application:
spark-submit /usr/local/Cellar/apache-spark/2.1.0/bin/myhome-0.0.1-SNAPSHOT.jar
Note that this is an uber jar built with the main class as the entry point and all dependent libraries included. Check out http://maven.apache.org/plugins/maven-shade-plugin/
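A minimal maven-shade-plugin configuration for building such an uber jar might look like this (a sketch; the plugin version is an assumption, and the main class is taken from the jar listing above):
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>3.2.4</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <transformers>
          <!-- Sets Main-Class in MANIFEST.MF so spark-submit can find the entry point -->
          <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
            <mainClass>my_home.myhome.RecommendMatch</mainClass>
          </transformer>
        </transformers>
      </configuration>
    </execution>
  </executions>
</plugin>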

Related

Pyspark not creating SparkContext (Yarn): bad gateway or network traffic blocked?

Here is some context on my installation of the pyspark binary.
In my company, we use Cloudera Data Science Workbench (CDSW). When we create a session for a new project, I'm guessing it's an image built from a specific Dockerfile, and inside this Dockerfile the installation of the CDH binaries and configuration is baked in.
Now I wish to use those configurations outside CDSW. I have a Kubernetes cluster where I deploy webapps, and I would like to use Spark in YARN mode to allocate very small resources for the webapps.
What I have done is tar.gz all the binaries and config from /opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072 and /var/lib/cdsw/client-config/.
Then I untar them in a container or in a WSL2 instance.
Instead of unpacking everything into /var/ or /opt/ like I should, I've put them in $HOME/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072/* and $USER/etc/client-config/*. Why did I do this? Because I might want to use a mounted volume in my Kubernetes cluster and share the binaries between containers.
I've used sed to modify all the configuration files and adapt the paths:
spark-env.sh
topology.py
Any *.txt, *.sh, *.py
So I managed to run beeline, hadoop, hdfs and hbase by pointing them at the hadoop-conf folder. I can use pyspark, but only in local mode. What I really want is to use pyspark with YARN.
So I set a bunch of env variables to make this work:
export HADOOP_CONF_DIR=$HOME/etc/client-config/spark-conf/yarn-conf
export SPARK_CONF_DIR=$HOME/etc/client-config/spark-conf/yarn-conf
export JAVA_HOME=/usr/local
export BIN_DIR=$HOME/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072/bin
export PATH=$BIN_DIR:$JAVA_HOME/bin:$PATH
export PYSPARK_PYTHON=python3.6
export PYSPARK_DRIVER_PYTHON=python3.6
export OPENBLAS_NUM_THREADS=1
export MKL_NUM_THREADS=1
export SPARK_HOME=/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072/lib/spark
export PYSPARK_ARCHIVES_PATH=$(ZIPS=("$CDH_DIR"/lib/spark/python/lib/*.zip); IFS=:; echo "${ZIPS[*]}"):$PYSPARK_ARCHIVES_PATH
export SPARK_DIST_CLASSPATH=$HOME/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072/lib/hadoop/client/accessors-smart-1.2.jar:<ALL OTHER JARS FOR EVERY BINARIES>
Anyway, all of the paths exist and work. And since I've sed'ed all the config files, they also resolve to the same paths as the exported ones.
I launch my pyspark binary like this:
pyspark --conf "spark.master=yarn" --properties-file $HOME/etc/client-config/spark-conf/spark-defaults.conf --verbose
FYI, it is using pyspark 2.4.0, and I've installed Java(TM) SE Runtime Environment (build 1.8.0_131-b11), the same one I found on the CDSW instance. I added the keystore with the company's public certificate, and I also generated a keytab for Kerberos auth. Both of them are working, since I can use hdfs with HADOOP_CONF_DIR=$HOME/etc/client-config/hadoop-conf.
In verbose mode I can see all the details and configuration from Spark. When I compare it with the CDSW session, they are quite identical (with modified paths), for example:
Using properties file: /home/docker4sg/etc/client-config/spark-conf/spark-defaults.conf
Adding default property: spark.lineage.log.dir=/var/log/spark/lineage
Adding default property: spark.port.maxRetries=250
Adding default property: spark.serializer=org.apache.spark.serializer.KryoSerializer
Adding default property: spark.driver.log.persistToDfs.enabled=true
Adding default property: spark.yarn.jars=local:/home/docker4sg/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072/lib/spark/jars/*,local:/home/docker4sg/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072/lib/spark/hive/*
...
After a few seconds it fails to create a SparkSession:
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2022-02-22 14:44:14 WARN Client:760 - Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error
2022-02-22 14:44:14 ERROR SparkContext:94 - Error initializing SparkContext.
java.lang.IllegalArgumentException: java.net.URISyntaxException: Expected scheme-specific part at index 12: pyspark.zip:
...
Caused by: java.net.URISyntaxException: Expected scheme-specific part at index 12: pyspark.zip:
...
2022-02-22 14:44:15 WARN YarnSchedulerBackend$YarnSchedulerEndpoint:69 - Attempted to request executors before the AM has registered!
2022-02-22 14:44:15 WARN MetricsSystem:69 - Stopping a MetricsSystem that is not running
2022-02-22 14:44:15 WARN SparkContext:69 - Another SparkContext is being constructed (or threw an exception in its constructor). This may indicate an error, since only one SparkContext may be running in this JVM (see SPARK-2243). The other SparkContext was created at:
org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58
From what I understand, it fails for a reason I'm not sure about and then tries to fall back to another mode, which fails too.
In the configuration file spark-conf/yarn-conf/yarn-site.xml, it is specified that it is using ZooKeeper:
<property>
<name>yarn.resourcemanager.zk-address</name>
<value>corporate.machine.node1.name.net:9999,corporate.machine.node2.name.net:9999,corporate.machine.node3.name.net:9999</value>
</property>
Could it be that the YARN cluster does not accept traffic from a random IP (a Kubernetes IP, or the personal IP of my computer)? The IP I'm working from is not on the whitelist, and at the moment I cannot ask to have it added. How can I know for sure I'm looking in the right direction?
Edit 1:
As said in the comments, the URI of pyspark.zip was wrong. I've modified my PYSPARK_ARCHIVES_PATH to point at the real location of pyspark.zip:
PYSPARK_ARCHIVES_PATH=local:$HOME/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072/lib/spark/python/lib/py4j-0.10.7-src.zip,local:$HOME/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072/lib/spark/python/lib/pyspark.zip
Now I get an UnknownHostException:
org.apache.spark.SparkException: Uncaught exception: org.apache.spark.SparkException: Exception thrown in awaitResult
...
Caused by: java.io.IOException: Failed to connect to <HOSTNAME>:13250
...
Caused by: java.net.UnknownHostException: <HOSTNAME>
...
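To check whether this is actually a DNS/whitelist problem, I suppose I can at least test name resolution and connectivity towards the failing endpoint from inside the container (just a diagnostic sketch; <HOSTNAME> and port 13250 come from the stack trace above):
getent hosts <HOSTNAME>    # does the hostname resolve from this network?
nc -vz <HOSTNAME> 13250    # is the port reachable, or is traffic blocked?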

How can I run spark in headless mode in my custom version on HDP?

How can I run Spark in headless mode?
Currently, I am executing Spark on an HDP 2.6.4 cluster (i.e. Spark 2.2 is installed by default).
I have downloaded a Spark 2.4.1 Scala 2.11 release in headless mode (i.e. no Hadoop jars are built in) from https://spark.apache.org/downloads.html. The exact name is: pre-built with Scala 2.11 and user-provided Hadoop.
Now, when trying to run it, I follow https://spark.apache.org/docs/latest/hadoop-provided.html:
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
export HADOOP_CONF_DIR=/etc/hadoop/conf
export SPARK_HOME=/home/<<my_user>>/development/software/spark_no_provided_hadoop
./bin/spark-shell --master yarn --deploy-mode client --queue <<my_yarn_queue>>
Unfortunately, it fails to start:
19/05/01 07:12:23 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
19/05/01 07:12:38 ERROR cluster.YarnClientSchedulerBackend: The YARN application has already ended! It might have been killed or the Application Master may have failed to start. Check the YARN application logs for more details.
19/05/01 07:12:38 ERROR spark.SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: Application application_1555489055691_64276 failed 2 times due to AM Container for appattempt_1555489055691_64276_000002 exited with exitCode: 1
When looking at the logs for details I see:
Log Type: prelaunch.err
launch_container.sh: line 30: $PWD:$PWD/__spark_conf__:$PWD/__spark_libs__/*:/etc/hadoop/conf:/usr/hdp/2.6.4.0-91/hadoop/*:/usr/hdp/2.6.4.0-91/hadoop/lib/*:/usr/hdp/current/hadoop-hdfs-client/*:/usr/hdp/current/hadoop-hdfs-client/lib/*:/usr/hdp/current/hadoop-yarn-client/*:/usr/hdp/current/hadoop-yarn-client/lib/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/lib/*:$PWD/mr-framework/hadoop/share/hadoop/common/*:$PWD/mr-framework/hadoop/share/hadoop/common/lib/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/lib/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/lib/*:$PWD/mr-framework/hadoop/share/hadoop/tools/lib/*:/usr/hdp/${hdp.version}/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar:/etc/hadoop/conf/secure:/usr/hdp/2.6.4.0-91/hadoop/conf:/usr/hdp/2.6.4.0-91/hadoop/lib/*:/usr/hdp/2.6.4.0-91/hadoop/.//*:/usr/hdp/2.6.4.0-91/hadoop-hdfs/./:/usr/hdp/2.6.4.0-91/hadoop-hdfs/lib/*:/usr/hdp/2.6.4.0-91/hadoop-hdfs/.//*:/usr/hdp/2.6.4.0-91/hadoop-yarn/lib/*:/usr/hdp/2.6.4.0-91/hadoop-yarn/.//*:/usr/hdp/2.6.4.0-91/hadoop-mapreduce/lib/*:/usr/hdp/2.6.4.0-91/hadoop-mapreduce/.//*:/usr/hdp/2.6.4.0-91/tez/*:/usr/hdp/2.6.4.0-91/tez/lib/*:/usr/hdp/2.6.4.0-91/tez/conf:$PWD/__spark_conf__/__hadoop_conf__: bad substitution
So:
/usr/hdp/${hdp.version}/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar: bad substitution
is the cause (and similar to https://community.hortonworks.com/questions/23699/bad-substitution-error-running-spark-on-yarn.html), but this is completely inside Ambari's management domain. How can I work around it to run a more recent version of Spark (2.4.x) on the existing 2.6.x HDP platform?
edit
Assuming I passed a wrong configuration directory for HADOOP_CONF_DIR, I unset it. But then:
When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.
so it must be passed. Could it be that I am passing the wrong value?
According to Exception: java.lang.Exception: When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment. in spark, the value could be correct. For me, no HADOOP_HOME is set by default.
Even when setting export HADOOP_CONF_DIR=/usr/hdp/current/spark2-client/conf, the same bad substitution error remains.
NOTE: some interesting steps:
https://community.hortonworks.com/articles/244059/steps-to-install-supplementary-spark-on-hdp-cluste.html, but not for the headless edition
https://community.hortonworks.com/questions/85757/how-to-add-the-hadoop-and-yarn-configuration-file.html
Indeed, https://community.hortonworks.com/questions/23699/bad-substitution-error-running-spark-on-yarn.html is the solution:
cd /usr/hdp
ls
2.6.xxx current share
So for me:
./bin/spark-shell --master yarn --deploy-mode client --queue <<my_queue>> --conf spark.driver.extraJavaOptions='-Dhdp.version=2.6.xxx' --conf spark.yarn.am.extraJavaOptions='-Dhdp.version=2.6.xxx'
works.
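To avoid repeating these options on every invocation, the same two properties can presumably also go into the downloaded Spark's conf/spark-defaults.conf (a sketch; 2.6.xxx stands for the exact version shown by ls /usr/hdp):
spark.driver.extraJavaOptions   -Dhdp.version=2.6.xxx
spark.yarn.am.extraJavaOptions  -Dhdp.version=2.6.xxx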

When running a Spark job in a Hadoop cluster I am getting java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/HBaseConfiguration

When I run my Scala code, which connects to an HBase database, it works perfectly in my local IDE. But when I run the same code in the Hadoop cluster, I get an "Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/HBaseConfiguration" error.
Please help me with this.
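The connection code looks roughly like this (a minimal sketch; the ZooKeeper quorum and table name are placeholders):
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.ConnectionFactory

// HBaseConfiguration is the class the cluster run fails to load
val conf = HBaseConfiguration.create()
conf.set("hbase.zookeeper.quorum", "zk-host")                   // placeholder quorum
val connection = ConnectionFactory.createConnection(conf)
val table = connection.getTable(TableName.valueOf("my_table"))  // placeholder table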
Add all the HBase library jars to HADOOP_CLASSPATH -
export HBASE_HOME="YOUR_HBASE_HOME_PATH"
export HADOOP_CLASSPATH="$HADOOP_CLASSPATH:$HBASE_HOME/lib/*"
You can append any external jar needed to HADOOP_CLASSPATH, so that you don't need to explicitly set it in the spark-submit command. All dependent jars will be loaded and provided to your Spark application.
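Alternatively, if you prefer a per-job setting, you can presumably pass the same jars on submit (a sketch; the class name and application jar are placeholders):
spark-submit --class your.main.Class \
  --jars $(echo $HBASE_HOME/lib/*.jar | tr ' ' ',') \
  your-app.jar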

spark on yarn java.io.IOException: No FileSystem for scheme: s3n

My English is poor, sorry, but I really need help.
I use spark-2.0.0-bin-hadoop2.7 and hadoop 2.7.3, read logs from S3, and write the results to local HDFS. I can run the Spark driver in standalone mode successfully, but when I run the same driver in yarn mode it throws:
17/02/10 16:20:16 ERROR ApplicationMaster: User class threw exception: java.io.IOException: No FileSystem for scheme: s3n
In hadoop-env.sh I added:
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HADOOP_HOME/share/hadoop/tools/lib/*
Running hadoop fs -ls s3n://xxx/xxx/xxx can list the files.
I think it can't find aws-java-sdk-1.7.4.jar and hadoop-aws-2.7.3.jar.
What can I do?
I'm not using the same versions as you, but here is an extract of my [spark_path]/conf/spark-defaults.conf file that was necessary to get s3a working:
# hadoop s3 config
spark.driver.extraClassPath [path]/guava-16.0.1.jar:[path]/aws-java-sdk-1.7.4.jar:[path]/hadoop-aws-2.7.2.jar
spark.executor.extraClassPath [path]/guava-16.0.1.jar:[path]/aws-java-sdk-1.7.4.jar:[path]/hadoop-aws-2.7.2.jar
spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.access.key [key]
spark.hadoop.fs.s3a.secret.key [key]
spark.hadoop.fs.s3a.fast.upload true
Alternatively you can specify paths to the jars in a comma-separated format to the --jars option on job submit:
--jars [path]aws-java-sdk-[version].jar,[path]hadoop-aws-[version].jar
Notes:
Ensure the jars are in the same location on all nodes in your cluster
Replace [path] with your path
Replace s3a with your preferred protocol (last time I checked s3a was best)
I don't think guava is required to get s3a working but I can't remember
Stick the JARs into SPARK_HOME/lib, with the rest of the spark bits.
spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem isn't needed; the JAR will be autoscanned and picked up.
don't play with fast.output.enabled on 2.7.x unless you know what you are doing and are prepared to tune some of the thread pool options. Start without that option.
Add these jars to $SPARK_HOME/jars:
aws-java-sdk-1.7.4.jar,hadoop-aws-2.7.3.jar,jackson-annotations-2.7.0.jar,jackson-core-2.7.0.jar,jackson-databind-2.7.0.jar,joda-time-2.9.6.jar
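If you prefer not to keep the keys in spark-defaults.conf, the same Hadoop options can also be set programmatically on the SparkContext's Hadoop configuration (a sketch; the bucket, path and key values are placeholders):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("s3a-demo").getOrCreate()
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")    // placeholder
hadoopConf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")    // placeholder
val logs = spark.read.textFile("s3a://your-bucket/logs/") // read via s3a instead of s3n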

Spark + Mesos cluster mode, who uploads the jar?

I'm trying to run Spark applications with Mesos cluster mode. (I've got client mode working, but would still like to try cluster mode.)
I have launched spark-mesos-dispatcher on the Mesos master node.
When I submit the assembly at local path /tmp/assembly.jar using the following command,
bin/spark-submit --master mesos://dispatcher:7077 --deploy-mode cluster --class com.example.Example /tmp/assembly.jar
It fails because the file /tmp/assembly.jar does not exist on the mesos slave nodes.
I1129 10:47:43.839771 5884 fetcher.cpp:414] Fetcher Info: {"cache_directory":"\/tmp\/mesos\/fetch\/slaves\/9d725348-931a-48fb-96f7-d29a4b09f3e8-S9\/deploy","items":[{"action":"BYPASS_CACHE","uri":{"extract":true,"value":"\/tmp\/assembly.jar"}}],"sandbox_directory":"\/var\/lib\/mesos\/slaves\/9d725348-931a-48fb-96f7-d29a4b09f3e8-S9\/frameworks\/9d725348-931a-48fb-96f7-d29a4b09f3e8-0291\/executors\/driver-20151129104742-0008\/runs\/31bf5840-226e-4b87-ae76-d14bd2f17950","user":"user"}
I1129 10:47:43.840710 5884 fetcher.cpp:369] Fetching URI '/tmp/assembly.jar'
I1129 10:47:43.840721 5884 fetcher.cpp:243] Fetching directly into the sandbox directory
I1129 10:47:43.840731 5884 fetcher.cpp:180] Fetching URI '/tmp/assembly.jar'
I1129 10:47:43.840737 5884 fetcher.cpp:160] Copying resource with command:cp '/tmp/assembly.jar' '/var/lib/mesos/slaves/9d725348-931a-48fb-96f7-d29a4b09f3e8-S9/frameworks/9d725348-931a-48fb-96f7-d29a4b09f3e8-0291/executors/driver-20151129104742-0008/runs/31bf5840-226e-4b87-ae76-d14bd2f17950/assembly.jar'
cp: cannot stat `/tmp/assembly.jar': No such file or directory
Failed to fetch '/tmp/assembly.jar': Failed to copy with command 'cp '/tmp/assembly.jar' '/var/lib/mesos/slaves/9d725348-931a-48fb-96f7-d29a4b09f3e8-S9/frameworks/9d725348-931a-48fb-96f7-d29a4b09f3e8-0291/executors/driver-20151129104742-0008/runs/31bf5840-226e-4b87-ae76-d14bd2f17950/assembly.jar'', exit status: 256
Failed to synchronize with slave (it's probably exited)
In the case of YARN cluster mode, Spark's YARN client implementation will upload the application jar to HDFS so that the driver and all executors have access to it, but I could not find such code in RestSubmissionClient, which is used by Mesos and Standalone cluster mode.
Who does the uploading in this case? Or do I need to manually put the application assembly somewhere accessible via an HTTP URI?
From my understanding you could use the SparkContext addJar() method to add a local (to the driver application) JAR file path, which will then be distributed to the executor nodes (in client mode).
As you state that you want to use cluster mode, I'd suggest that you have a look at the Spark Jobserver project, which should make running Spark applications on Mesos easier than with the built-in tools.
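If you want to stay with plain spark-submit, the usual workaround is to put the assembly somewhere every Mesos agent can fetch it and pass that URI instead of a local path (a sketch; the namenode address and HDFS path are placeholders):
hdfs dfs -put /tmp/assembly.jar hdfs://namenode:8020/apps/assembly.jar
bin/spark-submit --master mesos://dispatcher:7077 --deploy-mode cluster \
  --class com.example.Example hdfs://namenode:8020/apps/assembly.jar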

Resources