spark-shell returns the error: SparkContext: Error initializing SparkContext / Utils: Uncaught exception in thread main

I tried installing Spark on Windows 10. I followed the steps in this order:
Installed Java (outside the Program Files folder, on the C drive)
Validated the version of Spark downloaded from Apache (spark-3.2.0-bin-hadoop3.2.tgz)
Unzipped Spark into a folder outside Program Files (on the C drive)
Downloaded winutils.exe (from Git, from the folder Hadoop-3.2.0/bin) and put it in the c:/hadoop/bin folder
Set the environment variables JAVA_HOME (path of Java), SPARK_HOME (path of the Spark installation) and HADOOP_HOME (path of winutils)
Added %JAVA_HOME%/bin to the PATH variable, and similarly for the other two.
When I run Spark -version, I get the error "Spark is not recognized as an internal or external command". When I run spark-shell, I get the errors SparkContext: Error initializing SparkContext / Utils: Uncaught exception in thread main / ERROR Main: Failed to initialize Spark session.
Could you please let me know if I missed any steps? Any suggestions on how to resolve these errors when running Spark?
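
A quick way to double-check the setup steps above is a small script that verifies each variable points where Spark and Hadoop expect. This is only a sketch in Python and not part of the original post: check_spark_env is a hypothetical helper, and it assumes the usual layout in which java.exe, spark-shell.cmd and winutils.exe each sit in a bin folder directly under JAVA_HOME, SPARK_HOME and HADOOP_HOME. Under that assumption HADOOP_HOME would be c:/hadoop rather than c:/hadoop/bin.

```python
from pathlib import Path


def check_spark_env(env):
    """Return a list of human-readable problems found in the given environment mapping."""
    problems = []
    # Assumed layout: each variable's value has a bin folder holding this file.
    expected = {
        "JAVA_HOME": Path("bin") / "java.exe",
        "SPARK_HOME": Path("bin") / "spark-shell.cmd",
        "HADOOP_HOME": Path("bin") / "winutils.exe",
    }
    for var, relative in expected.items():
        value = env.get(var)
        if not value:
            problems.append(f"{var} is not set")
            continue
        target = Path(value) / relative
        if not target.is_file():
            problems.append(f"{var}={value} but {target} does not exist")
    return problems
```

Calling check_spark_env(os.environ) (after import os) from a Python prompt would list any variable that is unset or points at the wrong folder. Also, as far as I know the distribution ships spark-shell and spark-submit but no command named spark, so spark-shell --version is probably the version check that was intended.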

Related

Pyspark - spark-submit to an AWS EMR

I have created an EMR cluster (emr-5.36.0) in AWS with the default Spark components (Spark 2.4.8, Hive 2.3.9).
I have installed PySpark (3.3.0) on an EC2 instance, in a Python virtual environment.
From there, I would like to run "spark-submit" commands against the EMR cluster.
To test the command, I am using the Python code at the bottom of this page
To configure the YARN_CONF_DIR environment variable on the EC2 instance, I copied the yarn-site.xml file from /etc/hadoop/conf.empty/ on the EMR's master node to a folder on the EC2 instance.
But now, on the EC2 instance, when I try to run spark-submit, I get:
$ export YARN_CONF_DIR=/home/me/spark/
$ spark-submit --master yarn --deploy-mode cluster spark_test.py
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/shaded/javax/ws/rs/core/NoContentException
at org.apache.hadoop.yarn.util.timeline.TimelineUtils.<clinit>(TimelineUtils.java:60)
at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:200)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:191)
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1327)
at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1764)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:958)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1046)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1055)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.shaded.javax.ws.rs.core.NoContentException
at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:581)
at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
... 13 more
22/07/18 18:36:25 INFO ShutdownHookManager: Shutdown hook called
And from here I am basically lost. I tried to google the error but I am still not clear what the error is about. Did I miss a step? An environment variable maybe?
Ultimately, I want to use the SparkSubmitOperator in Airflow, but I figured I should get the "native" command to work first before using the operator (which is just a wrapper).
If you set YARN_CONF_DIR=/etc/hadoop_files/ locally, the hadoop_files folder needs to contain the contents of the EMR's /etc/hadoop/ folder, not those of /etc/hadoop/conf.empty/.
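
To see whether the copied folder is complete, something like the following could be run on the EC2 side. It is a rough sketch: missing_conf_files is a made-up helper, and the list of expected files is my assumption about what a YARN client usually reads, not something stated in the answer above.

```python
from pathlib import Path

# Assumption: these client-side files are the ones a YARN submission typically needs.
EXPECTED_FILES = ("core-site.xml", "hdfs-site.xml", "yarn-site.xml")


def missing_conf_files(conf_dir):
    """Return the expected Hadoop config files that are absent from conf_dir."""
    directory = Path(conf_dir)
    return [name for name in EXPECTED_FILES if not (directory / name).is_file()]
```

With only yarn-site.xml copied over, as in the question, missing_conf_files("/home/me/spark/") would report the other two files.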

Pyspark not creating SparkContext (Yarn). bad gateway or network traffic blocked?

Here is some context on my installation of the pyspark binary.
In my company, we use a Cloudera Data Science Workbench (CDSW). When we create a session for a new project, I'm guessing it's an image built from a specific Dockerfile, and that this Dockerfile installs the CDH binaries and configuration.
Now I wish to use those configurations outside CDSW. I have a Kubernetes cluster where I deploy webapps, and I would like to use Spark in YARN mode to deploy very small resources for the webapps.
What I have done is tar.gz all binaries and config from /opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072 and /var/lib/cdsw/client-config/.
Then I untar them in a container or in a WSL2 instance.
Instead of unpacking everything in /var/ or /opt/ as I should, I've put them in $HOME/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072/* and $HOME/etc/client-config/*. Why did I do this? Because I might want to use a mounted volume in Kubernetes and share binaries between containers.
I've used sed to modify all configuration files and adapt the paths:
spark-env.sh
topology.py
Any *.txt, *.sh, *.py
So I managed to run beeline, hadoop, hdfs and hbase by pointing them at the hadoop-conf folder. I can use pyspark, but in local mode only. What I really want is to use pyspark with YARN.
So I set a bunch of env variables to make this work:
export HADOOP_CONF_DIR=$HOME/etc/client-config/spark-conf/yarn-conf
export SPARK_CONF_DIR=$HOME/etc/client-config/spark-conf/yarn-conf
export JAVA_HOME=/usr/local
export BIN_DIR=$HOME/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072/bin
export PATH=$BIN_DIR:$JAVA_HOME/bin:$PATH
export PYSPARK_PYTHON=python3.6
export PYSPARK_DRIVER_PYTHON=python3.6
export OPENBLAS_NUM_THREADS=1
export MKL_NUM_THREADS=1
export SPARK_HOME=/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072/lib/spark
export PYSPARK_ARCHIVES_PATH=$(ZIPS=("$CDH_DIR"/lib/spark/python/lib/*.zip); IFS=:; echo "${ZIPS[*]}"):$PYSPARK_ARCHIVES_PATH
export SPARK_DIST_CLASSPATH=$HOME/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072/lib/hadoop/client/accessors-smart-1.2.jar:<ALL OTHER JARS FOR EVERY BINARIES>
Anyway, all of the paths exist and work. And since I've run sed on all config files, they also produce the same paths as the exported ones.
I launch my pyspark binary like this:
pyspark --conf "spark.master=yarn" --properties-file $HOME/etc/client-config/spark-conf/spark-defaults.conf --verbose
FYI, it is using pyspark 2.4.0. I've installed Java(TM) SE Runtime Environment (build 1.8.0_131-b11), the same one I found on the CDSW instance. I added the keystore with the company's public certificate, and I also generated a keytab for Kerberos auth. Both work, since I can use hdfs with HADOOP_CONF_DIR=$HOME/etc/client-config/hadoop-conf
In verbose mode I can see all the details and configuration from Spark. Compared with the CDSW session, they are almost identical (apart from the modified paths). For example:
Using properties file: /home/docker4sg/etc/client-config/spark-conf/spark-defaults.conf
Adding default property: spark.lineage.log.dir=/var/log/spark/lineage
Adding default property: spark.port.maxRetries=250
Adding default property: spark.serializer=org.apache.spark.serializer.KryoSerializer
Adding default property: spark.driver.log.persistToDfs.enabled=true
Adding default property: spark.yarn.jars=local:/home/docker4sg/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072/lib/spark/jars/*,local:/home/docker4sg/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072/lib/spark/hive/*
...
After a few seconds it fails to create a SparkSession:
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2022-02-22 14:44:14 WARN Client:760 - Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error
2022-02-22 14:44:14 ERROR SparkContext:94 - Error initializing SparkContext.
java.lang.IllegalArgumentException: java.net.URISyntaxException: Expected scheme-specific part at index 12: pyspark.zip:
...
Caused by: java.net.URISyntaxException: Expected scheme-specific part at index 12: pyspark.zip:
...
2022-02-22 14:44:15 WARN YarnSchedulerBackend$YarnSchedulerEndpoint:69 - Attempted to request executors before the AM has registered!
2022-02-22 14:44:15 WARN MetricsSystem:69 - Stopping a MetricsSystem that is not running
2022-02-22 14:44:15 WARN SparkContext:69 - Another SparkContext is being constructed (or threw an exception in its constructor). This may indicate an error, since only one SparkContext may be running in this JVM (see SPARK-2243). The other SparkContext was created at:
org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58
From what I understand, it fails for a reason I'm not sure about and then tries to fall back to another mode, which fails too.
In the configuration file spark-conf/yarn-conf/yarn-site.xml, it is specified that it uses ZooKeeper:
<property>
<name>yarn.resourcemanager.zk-address</name>
<value>corporate.machine.node1.name.net:9999,corporate.machine.node2.name.net:9999,corporate.machine.node3.name.net:9999</value>
</property>
Could it be that the YARN cluster does not accept traffic from an arbitrary IP (a Kubernetes IP or my computer's personal IP)? The IP I'm working from is not on the whitelist, and at the moment I cannot ask for it to be added. How can I know for sure that I'm looking in the right direction?
Edit 1:
As pointed out in the comments, the URI of pyspark.zip was wrong. I've changed PYSPARK_ARCHIVES_PATH to the real location of pyspark.zip:
PYSPARK_ARCHIVES_PATH=local:$HOME/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072/lib/spark/python/lib/py4j-0.10.7-src.zip,local:$HOME/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072/lib/spark/python/lib/pyspark.zip
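
For reference, the corrected value can also be generated instead of typed by hand. The sketch below is hypothetical (build_archives_path is not part of Spark or CDH); it assumes, based on the working value above, that the entries should be local: URIs joined with commas. The earlier export joined the zips with ":", which seems to be what produced the "Expected scheme-specific part ... pyspark.zip:" error.

```python
from pathlib import Path


def build_archives_path(lib_dir):
    """Join every .zip directly under lib_dir as comma-separated local: URIs."""
    zips = sorted(Path(lib_dir).glob("*.zip"))
    # A comma separator avoids clashing with the ":" that ends the URI scheme.
    return ",".join(f"local:{z}" for z in zips)
```

The result would then be exported as PYSPARK_ARCHIVES_PATH before launching pyspark.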
Now I get an UnknownHostException:
org.apache.spark.SparkException: Uncaught exception: org.apache.spark.SparkException: Exception thrown in awaitResult
...
Caused by: java.io.IOException: Failed to connect to <HOSTNAME>:13250
...
Caused by: java.net.UnknownHostException: <HOSTNAME>
...
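
One possible first check for the whitelist/DNS question is whether the cluster host names in the client configuration resolve at all from the container or WSL2 instance. The following is a sketch under assumptions: hosts_from_site_xml and unresolvable are hypothetical helpers, and only plain host:port values (like the zk-address above) are picked up. It does not cover the reverse direction, i.e. whether the cluster can resolve the client's own host name, which may also be what the UnknownHostException is about.

```python
import socket
import xml.etree.ElementTree as ET


def hosts_from_site_xml(xml_text):
    """Collect host names from host:port pairs found in a Hadoop *-site.xml document."""
    hosts = set()
    for prop in ET.fromstring(xml_text).iter("property"):
        value = prop.findtext("value") or ""
        for item in value.split(","):
            host, sep, port = item.strip().rpartition(":")
            # Keep only simple host:port entries; URLs such as hdfs://... are skipped.
            if sep and port.isdigit() and host and "/" not in host:
                hosts.add(host)
    return sorted(hosts)


def unresolvable(hosts):
    """Return the hosts that the local resolver cannot look up."""
    failed = []
    for host in hosts:
        try:
            socket.gethostbyname(host)
        except socket.gaierror:
            failed.append(host)
    return failed
```

Feeding the text of yarn-site.xml to hosts_from_site_xml and the result to unresolvable would show which names the client cannot resolve. An empty list would point away from DNS and towards a firewall or whitelist.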

How can I run spark in headless mode in my custom version on HDP?

How can I run spark in headless mode?
Currently, I am running Spark on HDP 2.6.4 (i.e. Spark 2.2 is installed by default on the cluster).
I have downloaded a Spark 2.4.1 / Scala 2.11 release in headless mode (i.e. no Hadoop jars are bundled) from https://spark.apache.org/downloads.html. The exact name is: pre-built with scala 2.11 and user provided hadoop
Now when trying to run it, I follow https://spark.apache.org/docs/latest/hadoop-provided.html:
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
export HADOOP_CONF_DIR=/etc/hadoop/conf
export SPARK_HOME=/home/<<my_user>>/development/software/spark_no_provided_hadoop
./bin/spark-shell --master yarn --deploy-mode client --queue <<my_yarn_queue>>
Unfortunately, it fails to start:
19/05/01 07:12:23 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
19/05/01 07:12:38 ERROR cluster.YarnClientSchedulerBackend: The YARN application has already ended! It might have been killed or the Application Master may have failed to start. Check the YARN application logs for more details.
19/05/01 07:12:38 ERROR spark.SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: Application application_1555489055691_64276 failed 2 times due to AM Container for appattempt_1555489055691_64276_000002 exited with exitCode: 1
When looking at the logs for details I see:
Log Type: prelaunch.err
launch_container.sh: line 30: $PWD:$PWD/__spark_conf__:$PWD/__spark_libs__/*:/etc/hadoop/conf:/usr/hdp/2.6.4.0-91/hadoop/*:/usr/hdp/2.6.4.0-91/hadoop/lib/*:/usr/hdp/current/hadoop-hdfs-client/*:/usr/hdp/current/hadoop-hdfs-client/lib/*:/usr/hdp/current/hadoop-yarn-client/*:/usr/hdp/current/hadoop-yarn-client/lib/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/lib/*:$PWD/mr-framework/hadoop/share/hadoop/common/*:$PWD/mr-framework/hadoop/share/hadoop/common/lib/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/lib/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/lib/*:$PWD/mr-framework/hadoop/share/hadoop/tools/lib/*:/usr/hdp/${hdp.version}/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar:/etc/hadoop/conf/secure:/usr/hdp/2.6.4.0-91/hadoop/conf:/usr/hdp/2.6.4.0-91/hadoop/lib/*:/usr/hdp/2.6.4.0-91/hadoop/.//*:/usr/hdp/2.6.4.0-91/hadoop-hdfs/./:/usr/hdp/2.6.4.0-91/hadoop-hdfs/lib/*:/usr/hdp/2.6.4.0-91/hadoop-hdfs/.//*:/usr/hdp/2.6.4.0-91/hadoop-yarn/lib/*:/usr/hdp/2.6.4.0-91/hadoop-yarn/.//*:/usr/hdp/2.6.4.0-91/hadoop-mapreduce/lib/*:/usr/hdp/2.6.4.0-91/hadoop-mapreduce/.//*:/usr/hdp/2.6.4.0-91/tez/*:/usr/hdp/2.6.4.0-91/tez/lib/*:/usr/hdp/2.6.4.0-91/tez/conf:$PWD/__spark_conf__/__hadoop_conf__: bad substitution
So:
/usr/hdp/${hdp.version}/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar: bad substitution
is the cause (similar to https://community.hortonworks.com/questions/23699/bad-substitution-error-running-spark-on-yarn.html), but this is completely inside Ambari's management domain. How can I work around it to run a more recent version of Spark (2.4.x) on the existing HDP 2.6.x platform?
Edit
Assuming I had passed a wrong configuration directory for HADOOP_CONF_DIR, I unset it. But then:
When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.
so it must be set. Could it be that I am passing the wrong value?
According to the question "Exception: java.lang.Exception: When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment. in spark", my value could be correct. For me, no HADOOP_HOME is set by default.
Even when setting export HADOOP_CONF_DIR=/usr/hdp/current/spark2-client/conf, the same bad substitution error remains.
NOTE: some interesting steps:
https://community.hortonworks.com/articles/244059/steps-to-install-supplementary-spark-on-hdp-cluste.html, but not for the headless edition
https://community.hortonworks.com/questions/85757/how-to-add-the-hadoop-and-yarn-configuration-file.html
Indeed, https://community.hortonworks.com/questions/23699/bad-substitution-error-running-spark-on-yarn.html is the solution:
cd /usr/hdp
ls
2.6.xxx current share
So for me:
./bin/spark-shell --master yarn --deploy-mode client --queue <<my_queue>> --conf spark.driver.extraJavaOptions='-Dhdp.version=2.6.xxx' --conf spark.yarn.am.extraJavaOptions='-Dhdp.version=2.6.xxx'
works
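
The same steps (look up the version folder under /usr/hdp, then pass it as hdp.version to both the driver and the application master) can be sketched in Python. Both function names are made up for illustration, and the sketch assumes exactly one version folder exists next to current and share, as in the listing above.

```python
from pathlib import Path


def detect_hdp_version(hdp_root="/usr/hdp"):
    """Return the only directory under hdp_root whose name starts with a digit."""
    # Entries such as 'current' and 'share' are skipped by the digit check.
    candidates = sorted(
        entry.name
        for entry in Path(hdp_root).iterdir()
        if entry.is_dir() and entry.name[:1].isdigit()
    )
    if len(candidates) != 1:
        raise RuntimeError(f"expected exactly one HDP version under {hdp_root}, found {candidates}")
    return candidates[0]


def spark_shell_command(hdp_version, queue):
    """Build the spark-shell argument list with hdp.version set for the driver and the AM."""
    option = f"-Dhdp.version={hdp_version}"
    return [
        "./bin/spark-shell",
        "--master", "yarn",
        "--deploy-mode", "client",
        "--queue", queue,
        "--conf", f"spark.driver.extraJavaOptions={option}",
        "--conf", f"spark.yarn.am.extraJavaOptions={option}",
    ]
```

The returned list could be handed to subprocess.run from inside SPARK_HOME. Building it as a list sidesteps shell quoting issues such as a missing space between arguments.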

When running a Spark job in a Hadoop cluster I am getting java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/HBaseConfiguration

When I run my Scala code, which connects to an HBase database, it works perfectly in my local IDE. But when I run the same code in the Hadoop cluster I get the error "Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/HBaseConfiguration".
Please help me with this.
Add all the HBase library jars to HADOOP_CLASSPATH:
export HBASE_HOME="YOUR_HBASE_HOME_PATH"
export HADOOP_CLASSPATH="$HADOOP_CLASSPATH:$HBASE_HOME/lib/*"
You can append any external jar you need to HADOOP_CLASSPATH, so that you don't have to pass it explicitly in the spark-submit command. All dependent jars will be loaded and provided to your Spark application.
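
If the error persists, it may help to confirm that the missing class is actually inside one of the jars under $HBASE_HOME/lib. A small sketch for that, where jars_containing is a hypothetical helper that relies only on jars being zip archives:

```python
import zipfile
from pathlib import Path


def jars_containing(lib_dir, class_name):
    """Return names of jars directly under lib_dir that contain the given class."""
    # A fully qualified class a.b.C is stored in the jar as a/b/C.class.
    entry = class_name.replace(".", "/") + ".class"
    matches = []
    for jar in sorted(Path(lib_dir).glob("*.jar")):
        with zipfile.ZipFile(jar) as archive:
            if entry in archive.namelist():
                matches.append(jar.name)
    return matches
```

Calling jars_containing with the HBase lib folder and "org.apache.hadoop.hbase.HBaseConfiguration" would show which jar has to end up on the classpath. An empty result would suggest the HBASE_HOME path is wrong.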

How to run an interactive spark application from spark-shell/spark-submit

I have a Spark app that reads a large dataset, loads it into memory and prepares everything so that the user can query the in-memory DataFrame multiple times. Once a query is done, the user is prompted on the console to either continue with a new set of input or quit the application.
I can do this very well in the IDE. However, can I run this interactive Spark app from spark-shell?
I've used spark job server before to achieve multiple interactive querying on a memory loaded dataframe but not from a shell. Any pointers?
Thanks!
UPDATE 1:
Here is how the project jar looks; it's packaged with all the other dependencies.
jar tf target/myhome-0.0.1-SNAPSHOT.jar
META-INF/MANIFEST.MF
META-INF/
my_home/
my_home/myhome/
my_home/myhome/App$$anonfun$foo$1.class
my_home/myhome/App$.class
my_home/myhome/App.class
my_home/myhome/Constants$.class
my_home/myhome/Constants.class
my_home/myhome/RecommendMatch$$anonfun$1.class
my_home/myhome/RecommendMatch$$anonfun$2.class
my_home/myhome/RecommendMatch$$anonfun$3.class
my_home/myhome/RecommendMatch$.class
my_home/myhome/RecommendMatch.class
and ran spark-shell with the following options
spark-shell -i my_home/myhome/RecommendMatch.class --master local --jars /Users/anon/Documents/Works/sparkworkspace/myhome/target/myhome-0.0.1-SNAPSHOT.jar
but the shell prints the following message on startup. The jars are loaded, according to the environment shown at localhost:4040
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/05/16 10:10:01 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/05/16 10:10:06 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Spark context Web UI available at http://192.168.0.101:4040
Spark context available as 'sc' (master = local, app id = local-1494909601904).
Spark session available as 'spark'.
That file does not exist
Welcome to
...
UPDATE 2 (using spark-submit)
Tried with full path to jar. Next, tried by copying project jar to bin location.
pwd
/usr/local/Cellar/apache-spark/2.1.0/bin
spark-submit --master local —-class my_home.myhome.RecommendMatch.class --jars myhome-0.0.1-SNAPSHOT.jar
Error: Cannot load main class from JAR file:/usr/local/Cellar/apache-spark/2.1.0/bin/—-class
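
The error message above suggests that spark-submit did not recognise the class option and treated it as the application jar. In the command as typed, that option starts with a long dash (possibly an autocorrected double hyphen) rather than two ASCII hyphens. A small, purely illustrative check for this kind of typo follows. suspicious_arguments is a made-up helper, not a Spark feature.

```python
def suspicious_arguments(argv):
    """Return (argument, dash name) pairs for arguments starting with a non-ASCII dash."""
    dashes = {"\u2013": "en dash", "\u2014": "em dash", "\u2212": "minus sign"}
    return [(arg, dashes[arg[0]]) for arg in argv if arg and arg[0] in dashes]
```

Separately, and as far as I know, --class expects the class name without the .class suffix, and the application jar is given as a positional argument rather than through --jars.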
Try the -i <path_to_file> option to run the Scala code in your file, or the Scala shell's :load <path_to_file> command.
Relevant Q&A: Spark : how to run spark file from spark shell
The following command works for running an interactive Spark application.
spark-submit /usr/local/Cellar/apache-spark/2.1.0/bin/myhome-0.0.1-SNAPSHOT.jar
Note that it is an uber jar built with the main class as entry point and all dependent libraries included. Check out http://maven.apache.org/plugins/maven-shade-plugin/
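
Since no --class is passed in that command, spark-submit has to find the entry point in the jar's manifest. To verify that the uber jar really declares one, a sketch like this could be used. manifest_main_class is a hypothetical helper, and it does not handle manifest values wrapped onto continuation lines.

```python
import zipfile


def manifest_main_class(jar_path):
    """Return the Main-Class declared in the jar's manifest, or None if absent."""
    with zipfile.ZipFile(jar_path) as archive:
        try:
            manifest = archive.read("META-INF/MANIFEST.MF").decode("utf-8")
        except KeyError:
            # The jar has no manifest at all.
            return None
    for line in manifest.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "main-class":
            return value.strip()
    return None
```

If this returns None for the jar, spark-submit would need an explicit --class argument.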
