Spark can't connect to YARN resource manager after adding Kafka jar - apache-spark

I am trying to hook Spark up with Kafka. Previously, Spark worked correctly but did not have this functionality. I installed the spark-streaming-kafka-spark-streaming-kafka-0-8-assembly jar into my jars folder for Spark, and now when I try to submit a task I get
INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 0 time(s);
retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
The job hangs while it keeps trying to connect. I have yarn-site.xml that specifies the resource manager IP address - it has
<property>
<name>yarn.resourcemanager.address.rm1</name>
<value>my.Server.Name:8032</value>
</property>
So it seems that the address is being overwritten - I am not sure why or how I can prevent this.
Update: If I move the jar outside of the Jar folders and include it with --jars instead, I don't get the hang. However, when I try to create a direct Kafka stream I get n error occurred while calling o28.createDirectStreamWithoutMessageHandler.
: java.lang.NoClassDefFoundError: scala/collection/GenTraversableOnce. I'm not sure if this is a version mismatch or what.

I fixed this by upgrading the jar to the correct version - 2.11/2.1.0. You also need to have it outside of the Spark jars folder.

Related

Apache PySpark - Failed to connect to master 7077

I set up Spark and HDFS after watching this video. The only difference is that I did it on a server (ubuntu) and not on a VM.
On the server, everything works perfect. Now I wanted to access it from my local machine (Windows) with PySpark.
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("spark://ubuntu-spark:7077").appName("test").getOrCreate()
spark.stop()
However, here I get the following error messages:
22/11/12 10:38:35 WARN Shell: Did not find winutils.exe: java.io.FileNotFoundException:
java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset. -see
https://wiki.apache.org/hadoop/WindowsProblems
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use
setLogLevel(newLevel).
22/11/12 10:38:35 WARN NativeCodeLoader: Unable to load native-hadoop library for your
platform... using builtin-java classes where applicable
22/11/12 10:38:37 WARN StandaloneAppClient$ClientEndpoint: Failed to connect to master
ubuntu-spark:7077
org.apache.spark.SparkException: Exception thrown in awaitResult: ...
According to other posts, the DNS should be correct. I got this from the Spark Master website (at port 8080):
URL: spark://ubuntu-spark:7077
Alive Workers: 1
Cores in use: 2 Total, 0 Used
Memory in use: 6.8 GiB Total, 0.0 B Used
Resources in use:
Applications: 0 Running, 0 Completed
Drivers: 0 Running, 0 Completed
Status: ALIVE
The ports are open. I also don't understand the following message: "HADOOP_HOME and hadoop.home.dir are unset." Hadoop is configured on the server. Why should I do the same thing locally again? My expectation would be that I can use Spark like an API or am I wrong?
Thank you very much for your help. If you need any configuration files I can provide them.
Hadoop should not be necessary for the code shown since you're not using HDFS, but the log is saying it's looking on your Windows machine for those settings.
DNS needs to work between your windows machine and wherever your server is running (a VM can still be server, so it's unclear where you're running this). Start debugging with ping spark-master to check, or you should be able to open spark-master:8080 from Windows browser instance as well.
If you only want to run Spark code, and don't care if it's distributed, you could just use Docker on Windows - https://github.com/jupyter/docker-stacks
Or setup Pycharm all locally for the same

Pyspark not creating SparkContext (Yarn). bad gateway or network traffic blocked?

Here is some context of my installation of pyspark binary.
In my company, we use a Cloudera Data Science Workbench (CDSW). When we create a session for a new projet, I'm guessing it's a image from a specific Dockerfile. And inside this dockerfile is pushed the installation of CDH binaries and configuration.
Now I wish to use thoses configurations outside CDSW. I have a kubernetes cluster where I deploy webapps. And I would like to use spark in Yarn mode to deploy very small ressources for the webapps.
What I have done, is to tar.gz all binaries and config from /opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072 and /var/lib/cdsw/client-config/.
Then untar.gz them in a container or in a WSL2 instance.
Instead of unpacking everything in /var/ or /opt/ like I should do, I've put them in $HOME/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072/* and $USER/etc/client-config/*. Why I did this? Because I might want to use a mounted Volume in my kubernetes and share binaries between containers.
I've sed and modifiy all configuration files to adapt paths:
spark-env.sh
topology.py
Any *.txt, *.sh, *.py
So I managed to run beeline hadoop hdfs hbase pointing them with the hadoop-conf folder. I can use pyspark but in local mode only. But What I really want is to use pyspark with yarn.
So I set a bunch of env variables to make this work:
export HADOOP_CONF_DIR=$HOME/etc/client-config/spark-conf/yarn-conf
export SPARK_CONF_DIR=$HOME/etc/client-config/spark-conf/yarn-conf
export JAVA_HOME=/usr/local
export BIN_DIR=$HOME/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072/bin
export PATH=$BIN_DIR:$JAVA_HOME/bin:$PATH
export PYSPARK_PYTHON=python3.6
export PYSPARK_DRIVER_PYTHON=python3.6
export OPENBLAS_NUM_THREADS=1
export MKL_NUM_THREADS=1
export SPARK_HOME=/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072/lib/spark
export PYSPARK_ARCHIVES_PATH=$(ZIPS=("$CDH_DIR"/lib/spark/python/lib/*.zip); IFS=:; echo "${ZIPS[*]}"):$PYSPARK_ARCHIVES_PATH
export SPARK_DIST_CLASSPATH=$HOME/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072/lib/hadoop/client/accessors-smart-1.2.jar:<ALL OTHER JARS FOR EVERY BINARIES>
Anyway, all of the paths are existing and working. And since I've sed all config files, they also generate the same path as the exported one.
I launch my pyspark binary like this:
pyspark --conf "spark.master=yarn" --properties-file $HOME/etc/client-config/spark-conf/spark-defaults.conf --verbose
FYI, it is using pyspark 2.4.0. And I've install Java(TM) SE Runtime Environment (build 1.8.0_131-b11). The same that I found on the CDSW instance. I added the keystore with the public certificate of the company. And I also generate a keytab for the kerberos auth. Both of them are working since I can used hdfs with HADOOP_CONF_DIR=$HOME/etc/client-config/hadoop-conf
In verbose mode I can see all the details and configuration from spark. When I compare it from the CDSW session, they are quite identical (with modified path, for example :
Using properties file: /home/docker4sg/etc/client-config/spark-conf/spark-defaults.conf
Adding default property: spark.lineage.log.dir=/var/log/spark/lineage
Adding default property: spark.port.maxRetries=250
Adding default property: spark.serializer=org.apache.spark.serializer.KryoSerializer
Adding default property: spark.driver.log.persistToDfs.enabled=true
Adding default property: spark.yarn.jars=local:/home/docker4sg/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072/lib/spark/jars/*,local:/home/docker4sg/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072/lib/spark/hive/*
...
After few seconds it fails to create a sparkSession:
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2022-02-22 14:44:14 WARN Client:760 - Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error
2022-02-22 14:44:14 ERROR SparkContext:94 - Error initializing SparkContext.
java.lang.IllegalArgumentException: java.net.URISyntaxException: Expected scheme-specific part at index 12: pyspark.zip:
...
Caused by: java.net.URISyntaxException: Expected scheme-specific part at index 12: pyspark.zip:
...
2022-02-22 14:44:15 WARN YarnSchedulerBackend$YarnSchedulerEndpoint:69 - Attempted to request executors before the AM has registered!
2022-02-22 14:44:15 WARN MetricsSystem:69 - Stopping a MetricsSystem that is not running
2022-02-22 14:44:15 WARN SparkContext:69 - Another SparkContext is being constructed (or threw an exception in its constructor). This may indicate an error, since only one SparkContext may be running in this JVM (see SPARK-2243). The other SparkContext was created at:
org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58
From what I understand, it fails for a reason I'm not sure about and then tries to fall back into an other mode. That fails too.
In the configuration file spark-conf/yarn-conf/yarn-site.xml, it is specified that it is using a zookeeper:
<property>
<name>yarn.resourcemanager.zk-address</name>
<value>corporate.machine.node1.name.net:9999,corporate.machine.node2.name.net:9999,corporate.machine.node3.name.net:9999</value>
</property>
Could it be that the Yarn cluster does not accept traffic from a random IP (kuber IP or personnal IP from computer)? For me, the IP i'm working on is not on the whitelist, but at the moment I cannot ask to add my ip to the whitelist. How can I know for sure I'm looking in the good direction?
Edit 1:
As said in the comment, the URI of the pyspark.zip was wrong. I've modified my PYSPARK_ARCHIVES_PATH to the real location of pyspark.zip.
PYSPARK_ARCHIVES_PATH=local:$HOME/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072/lib/spark/python/lib/py4j-0.10.7-src.zip,local:$HOME/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072/lib/spark/python/lib/pyspark.zip
Now I get an error UnknownHostException:
org.apache.spark.SparkException: Uncaught exception: org.apache.spark.SparkException: Exception thrown in awaitResult
...
Caused by: java.io.IOException: Failed to connect to <HOSTNAME>:13250
...
Caused by: java.net.UnknownHostException: <HOSTNAME>
...

How to let spark cluster fetch package jars from local path rather than from master?

I found that every time I launch an application in my spark standalone cluster with external packages, say pyspark --master=spark://master:7077 --packages Azure:mmlspark:0.17, executors are always trying to fetch package jars from the driver. Here is the log:
2019-05-23 21:14:56 INFO Executor:54 - Fetching spark://Master:2653/files/com.microsoft.cntk_cntk-2.4.jar with timestamp 1558616430055
2019-05-23 21:14:56 INFO TransportClientFactory:267 - Successfully created connection to Master/192.168.100.2:2653 after 23 ms (0 ms spent in bootstraps)
2019-05-23 21:14:56 INFO Utils:54 - Fetching spark://Master:2653/files/com.microsoft.cntk_cntk-2.4.jar to /tmp/spark-0a60d982-0082-4d37-aea1-e1c0b21ee2be/executor-c9632fd2-29fc-429c-bdfb-31d870ed19e8/spark-15805ad8-ab00-41b3-b466-b0e8e95a3f56/fetchFileTemp5196357990337888981.tmp
Something like this repeats in the logs of the executors. The size of the package is quite big, so the process takes a lot of time.
I have tried using --jars argument of pyspark to upload required jars to each executor. Executors did fetch them from the local path, but I couldn't import the package in the shell.
So how to solve the problem? What should I do to let the executors fetch the package from the local path? Or maybe from HDFS?
We can copy the jar to all nodes and add the path to the jar in spark.executor.extraClassPath config param, so that the jars are available in executor's classpath.

Spark on Yarn Client Mode - HADOOP_CONF_DIR is ignored

I have Spark 2.1.1 installed on top of Hadoop 2.8.1.
I have already specified HADOOP_CONF_DIR on spark-env.sh. I also have the following setting on spark-defaults.sh
spark.yarn.access.namenodes hdfs://hadoop-node0:55555/
But when I execute spark-shell with the following command for instance,
sparkuser#hadoop-node0:/home/apps/spark-2.1.1-bin-hadoop2.7$ bin/spark-shell --master yarn --deploy-mode client
HADOOP_CONF_DIR setting seems to be ignored so it does not retrieve the settings on core-site.xml and hdfs-site.xml, as I always get the following error:
17/07/25 10:15:24 ERROR spark.SparkContext: Error initializing SparkContext.
java.lang.IllegalArgumentException: java.net.UnknownHostException: spark
When I added "spark" on my /etc/hosts as alternative for localhost, I always get the following error:
17/07/25 10:17:15 ERROR spark.SparkContext: Error initializing SparkContext.
java.net.ConnectException: Call From XXXX/XXX.XXX.XXX.XXX to spark:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
So it always tries to reach 127.0.0.1:8020 which of course does not work as nothing is listening on it.
What do you think I missed specifying in the config files?
Thanks a lot in advance.
Kind regards,
Anto

new Spark StreamingContext failes with hdfs errors

I'm using dcos installed via Azure ACS and installed hdfs and spark via dcos tool with default options.
Creating a SparkStreamingContext gives:
16/07/22 01:51:04 WARN DFSUtil: Namenode for hdfs remains unresolved for ID nn1. Check your hdfs-site.xml file to ensure namenodes are configured properly.
16/07/22 01:51:04 WARN DFSUtil: Namenode for hdfs remains unresolved for ID nn2. Check your hdfs-site.xml file to ensure namenodes are configured properly.
Exception in thread "main" java.lang.IllegalArgumentException:
java.net.UnknownHostException: namenode1.hdfs.mesos
I expect I have to redeploy the spark package with dcos package install with –options= but can't figure out what the hdfs.config-url should be. The https://docs.mesosphere.com/1.7/usage/service-guides/spark/install/#hdfs docs seem out of date.
Yes, it is out of date. We'll fix that.
DC/OS HDFS now serves its config on http://hdfs.marathon.mesos:[port]/v1/connect

Resources