error when starting the spark shell - apache-spark

I just downloaded the latest version of spark and when I started the spark shell I got the following error:
java.net.BindException: Failed to bind to: /192.168.1.254:0: Service 'sparkDriver' failed after 16 retries!
at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272)
at akka.remote.transport.netty.NettyTransport$$anonfun$listen$1.apply(NettyTransport.scala:393)
at akka.remote.transport.netty.NettyTransport$$anonfun$listen$1.apply(NettyTransport.scala:389)
...
...
java.lang.NullPointerException
at org.apache.spark.sql.SQLContext.<init>(SQLContext.scala:193)
at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:71)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
at org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1028)
at $iwC$$iwC.<init>(<console>:9)
...
...
<console>:10: error: not found: value sqlContext
import sqlContext.implicits._
^
<console>:10: error: not found: value sqlContext
import sqlContext.sql
^
Is there something that I missed in setting up spark?

Try setting the Spark env variable SPARK_LOCAL_IP to a local IP address.
In my case, I was running Spark on an Amazon EC2 Linux instance. spark-shell stopped working, with an error message similar to yours. I was able to fix it by adding a setting like the following to the Spark config file spark-env.conf.
export SPARK_LOCAL_IP=172.30.43.105
Could also set it in ~/.profile or ~/.bashrc.
Also check host settings in /etc/hosts

See SPARK-8162.
It looks like it only affects 1.4.1 and 1.5.0 - you're probably best off running the latest release (1.4.0 at time of writing).

I was experiencing the same issue. First got to .bashrc and put
export SPARK_LOCAL_IP=172.30.43.105
then goto
cd $HADOOP_HOME/bin
then run the following command
hdfs dfsadmin -safemode leave
This just switches your safemode of namenode off.
Then delete the metastore_db folder from the spark home folder or /bin. It will be generally be in a folder from which you generally start a spark session.
then I ran my spark-shell using this
spark-shell --master "spark://localhost:7077"
and voila I didnot get the sqlContext.implicits._ error.

Related

Pyspark - spark-submit to an AWS EMR

I have created an EMR cluster (emr-5.36.0) in AWS with the default sparks components (Spark 2.4.8, Hive 2.3.9).
I have installed Pyspark (3.3.0) on an EC2, in an python virtual environment.
From there, I would like to run "spark-submit" commands to the EMR cluster.
To test the command, I am using python the code at the bottom of this page
To configured the YARN_CONF_DIR environment variable on the EC2, I copied the yarn-site.xml file from /etc/hadoop/conf.empty/ on the EMR's master node to a folder on the EC2.
But now, on the EC2, when I try to run spark-submit, I get:
$ export YARN_CONF_DIR=/home/me/spark/
$ spark-submit --master yarn --deploy-mode cluster spark_test.py
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/shaded/javax/ws/rs/core/NoContentException
at org.apache.hadoop.yarn.util.timeline.TimelineUtils.<clinit>(TimelineUtils.java:60)
at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:200)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:191)
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1327)
at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1764)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:958)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1046)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1055)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.shaded.javax.ws.rs.core.NoContentException
at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:581)
at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
... 13 more 22/07/18 18:36:25 INFO ShutdownHookManager: Shutdown hook called
And from here I am basically lost. I tried to google the error but I am still not clear what the error is about. Did I miss a step? An environment variable maybe?
Ultimately, I want to use the SparkSubmitOperator in Airflow, but I figured I should get the "native" command to work first before using the operator (which is just a wrapper).
If you do YARN_CONF_DIR=/etc/hadoop_files/ locally, the content of the folder hadoop_files needs to be the content of the EMR's /etc/hadoop/ folder, not /etc/hadoop/conf.empty/.

Pyspark not creating SparkContext (Yarn). bad gateway or network traffic blocked?

Here is some context of my installation of pyspark binary.
In my company, we use a Cloudera Data Science Workbench (CDSW). When we create a session for a new projet, I'm guessing it's a image from a specific Dockerfile. And inside this dockerfile is pushed the installation of CDH binaries and configuration.
Now I wish to use thoses configurations outside CDSW. I have a kubernetes cluster where I deploy webapps. And I would like to use spark in Yarn mode to deploy very small ressources for the webapps.
What I have done, is to tar.gz all binaries and config from /opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072 and /var/lib/cdsw/client-config/.
Then untar.gz them in a container or in a WSL2 instance.
Instead of unpacking everything in /var/ or /opt/ like I should do, I've put them in $HOME/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072/* and $USER/etc/client-config/*. Why I did this? Because I might want to use a mounted Volume in my kubernetes and share binaries between containers.
I've sed and modifiy all configuration files to adapt paths:
spark-env.sh
topology.py
Any *.txt, *.sh, *.py
So I managed to run beeline hadoop hdfs hbase pointing them with the hadoop-conf folder. I can use pyspark but in local mode only. But What I really want is to use pyspark with yarn.
So I set a bunch of env variables to make this work:
export HADOOP_CONF_DIR=$HOME/etc/client-config/spark-conf/yarn-conf
export SPARK_CONF_DIR=$HOME/etc/client-config/spark-conf/yarn-conf
export JAVA_HOME=/usr/local
export BIN_DIR=$HOME/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072/bin
export PATH=$BIN_DIR:$JAVA_HOME/bin:$PATH
export PYSPARK_PYTHON=python3.6
export PYSPARK_DRIVER_PYTHON=python3.6
export OPENBLAS_NUM_THREADS=1
export MKL_NUM_THREADS=1
export SPARK_HOME=/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072/lib/spark
export PYSPARK_ARCHIVES_PATH=$(ZIPS=("$CDH_DIR"/lib/spark/python/lib/*.zip); IFS=:; echo "${ZIPS[*]}"):$PYSPARK_ARCHIVES_PATH
export SPARK_DIST_CLASSPATH=$HOME/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072/lib/hadoop/client/accessors-smart-1.2.jar:<ALL OTHER JARS FOR EVERY BINARIES>
Anyway, all of the paths are existing and working. And since I've sed all config files, they also generate the same path as the exported one.
I launch my pyspark binary like this:
pyspark --conf "spark.master=yarn" --properties-file $HOME/etc/client-config/spark-conf/spark-defaults.conf --verbose
FYI, it is using pyspark 2.4.0. And I've install Java(TM) SE Runtime Environment (build 1.8.0_131-b11). The same that I found on the CDSW instance. I added the keystore with the public certificate of the company. And I also generate a keytab for the kerberos auth. Both of them are working since I can used hdfs with HADOOP_CONF_DIR=$HOME/etc/client-config/hadoop-conf
In verbose mode I can see all the details and configuration from spark. When I compare it from the CDSW session, they are quite identical (with modified path, for example :
Using properties file: /home/docker4sg/etc/client-config/spark-conf/spark-defaults.conf
Adding default property: spark.lineage.log.dir=/var/log/spark/lineage
Adding default property: spark.port.maxRetries=250
Adding default property: spark.serializer=org.apache.spark.serializer.KryoSerializer
Adding default property: spark.driver.log.persistToDfs.enabled=true
Adding default property: spark.yarn.jars=local:/home/docker4sg/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072/lib/spark/jars/*,local:/home/docker4sg/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072/lib/spark/hive/*
...
After few seconds it fails to create a sparkSession:
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2022-02-22 14:44:14 WARN Client:760 - Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error
2022-02-22 14:44:14 ERROR SparkContext:94 - Error initializing SparkContext.
java.lang.IllegalArgumentException: java.net.URISyntaxException: Expected scheme-specific part at index 12: pyspark.zip:
...
Caused by: java.net.URISyntaxException: Expected scheme-specific part at index 12: pyspark.zip:
...
2022-02-22 14:44:15 WARN YarnSchedulerBackend$YarnSchedulerEndpoint:69 - Attempted to request executors before the AM has registered!
2022-02-22 14:44:15 WARN MetricsSystem:69 - Stopping a MetricsSystem that is not running
2022-02-22 14:44:15 WARN SparkContext:69 - Another SparkContext is being constructed (or threw an exception in its constructor). This may indicate an error, since only one SparkContext may be running in this JVM (see SPARK-2243). The other SparkContext was created at:
org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58
From what I understand, it fails for a reason I'm not sure about and then tries to fall back into an other mode. That fails too.
In the configuration file spark-conf/yarn-conf/yarn-site.xml, it is specified that it is using a zookeeper:
<property>
<name>yarn.resourcemanager.zk-address</name>
<value>corporate.machine.node1.name.net:9999,corporate.machine.node2.name.net:9999,corporate.machine.node3.name.net:9999</value>
</property>
Could it be that the Yarn cluster does not accept traffic from a random IP (kuber IP or personnal IP from computer)? For me, the IP i'm working on is not on the whitelist, but at the moment I cannot ask to add my ip to the whitelist. How can I know for sure I'm looking in the good direction?
Edit 1:
As said in the comment, the URI of the pyspark.zip was wrong. I've modified my PYSPARK_ARCHIVES_PATH to the real location of pyspark.zip.
PYSPARK_ARCHIVES_PATH=local:$HOME/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072/lib/spark/python/lib/py4j-0.10.7-src.zip,local:$HOME/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072/lib/spark/python/lib/pyspark.zip
Now I get an error UnknownHostException:
org.apache.spark.SparkException: Uncaught exception: org.apache.spark.SparkException: Exception thrown in awaitResult
...
Caused by: java.io.IOException: Failed to connect to <HOSTNAME>:13250
...
Caused by: java.net.UnknownHostException: <HOSTNAME>
...

Spark on Yarn Client Mode - HADOOP_CONF_DIR is ignored

I have Spark 2.1.1 installed on top of Hadoop 2.8.1.
I have already specified HADOOP_CONF_DIR on spark-env.sh. I also have the following setting on spark-defaults.sh
spark.yarn.access.namenodes hdfs://hadoop-node0:55555/
But when I execute spark-shell with the following command for instance,
sparkuser#hadoop-node0:/home/apps/spark-2.1.1-bin-hadoop2.7$ bin/spark-shell --master yarn --deploy-mode client
HADOOP_CONF_DIR setting seems to be ignored so it does not retrieve the settings on core-site.xml and hdfs-site.xml, as I always get the following error:
17/07/25 10:15:24 ERROR spark.SparkContext: Error initializing SparkContext.
java.lang.IllegalArgumentException: java.net.UnknownHostException: spark
When I added "spark" on my /etc/hosts as alternative for localhost, I always get the following error:
17/07/25 10:17:15 ERROR spark.SparkContext: Error initializing SparkContext.
java.net.ConnectException: Call From XXXX/XXX.XXX.XXX.XXX to spark:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
So it always tries to reach 127.0.0.1:8020 which of course does not work as nothing is listening on it.
What do you think I missed specifying in the config files?
Thanks a lot in advance.
Kind regards,
Anto

new Spark StreamingContext failes with hdfs errors

I'm using dcos installed via Azure ACS and installed hdfs and spark via dcos tool with default options.
Creating a SparkStreamingContext gives:
16/07/22 01:51:04 WARN DFSUtil: Namenode for hdfs remains unresolved for ID nn1. Check your hdfs-site.xml file to ensure namenodes are configured properly.
16/07/22 01:51:04 WARN DFSUtil: Namenode for hdfs remains unresolved for ID nn2. Check your hdfs-site.xml file to ensure namenodes are configured properly.
Exception in thread "main" java.lang.IllegalArgumentException:
java.net.UnknownHostException: namenode1.hdfs.mesos
I expect I have to redeploy the spark package with dcos package install with –options= but can't figure out what the hdfs.config-url should be. The https://docs.mesosphere.com/1.7/usage/service-guides/spark/install/#hdfs docs seem out of date.
Yes, it is out of date. We'll fix that.
DC/OS HDFS now serves its config on http://hdfs.marathon.mesos:[port]/v1/connect

error: not found: value sqlContext

I would like to create a python application to analyze twitter streaming data using Apache Spark.
Is there any way I can use the functionality of Apache Spark streaming without setting up the Hadoop environment. How to run Apache Spark in standalone mode?
I just downloaded the binaries and tried to run spark-shell, getting NullPointerException. Can someone please help.
<console>:10: error: not found: value sqlContext
import sqlContext.implicits.
<console>:10: error: not found: value sqlContext
import sqlContext.sql
I install spark 1.5.2 using homebrew, and when I started the spark-shell, I met the same error. I add export SPARK_LOCAL_IP=127.0.0.1 to .bashrc or .bash_profile. It Works.
If you work with Spark 1.6, Linux/Unix and if you find the following lines in the error message:
...
java.net.UnknownHostException: <YOURHOSTNAME>: <YOURHOSTNAME>: unknown error at
java.net.InetAddress.getLocalHost(InetAddress.java:1663)
...
Caused by: java.net.UnknownHostException: <YOURHOSTNAME>: unknown error
...
<console>:16: error: not found: value sqlContext
import sqlContext.sql
add in /etc/hosts:
$ sudo vi /etc/hosts
...
127.0.0.1 <YOURHOSTNAME>
...
This solved in my case the missing sqlContext value problem.

Resources