Spark Remote execution to Cluster fails - HDFS connection Refused at 8020 - apache-spark

I am having issues submitting a spark-submit job remotely, from a machine outside the Spark cluster, which runs on YARN. The submit fails with:
Exception in thread "main" java.net.ConnectException: Call from remote.dev.local/192.168.10.65 to target.dev.local:8020 failed on connection exception: java.net.ConnectException: Connection refused
In my core-site.xml:
<property>
<name>fs.defaultFS</name>
<value>hdfs://target.dev.local:8020</value>
</property>
Also, in hdfs-site.xml on the cluster I have disabled permission checking for HDFS:
<property>
<name>dfs.permissions.enabled</name>
<value>false</value>
</property>
Also, when I telnet from the machine outside the cluster:
telnet target.dev.local 8020
I get
telnet: connect to address 192.168.10.186: Connection refused
But when I run
telnet target.dev.local 9000
it says Connected.
Also, when I ping target.dev.local, it works.
My spark-submit script from the remote machine is:
export HADOOP_CONF_DIR=/<path_to_conf_dir_copied_from_cluster>/
spark-submit --class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode cluster \
--driver-memory 5g \
--executor-memory 50g \
--executor-cores 5 \
--queue default \
<path to jar>.jar \
10
What am I missing here?

Turns out I had to change
<property>
<name>fs.defaultFS</name>
<value>hdfs://target.dev.local:8020</value>
</property>
to
<property>
<name>fs.defaultFS</name>
<value>hdfs://0.0.0.0:8020</value>
</property>
to allow connections from the outside, since target.dev.local sits on a private network switch.
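A quick way to verify the fix is to check what the NameNode RPC endpoint is actually bound to. The commands below are a sketch using the hostnames and port from this question; run the first two on the NameNode host (target.dev.local) and the last one from the remote machine after restarting HDFS:
# What default filesystem do the Hadoop configs resolve to?
hdfs getconf -confKey fs.defaultFS
# Which address is port 8020 actually listening on? A loopback or LAN-only
# address here would explain "Connection refused" from outside the switch.
netstat -tlnp | grep 8020    # or: ss -ltnp | grep 8020
# From the remote machine, re-test the RPC port
telnet target.dev.local 8020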

Related

Connect to hive metastore from remote spark

I have a Hadoop cluster with Hive and Spark installed. In addition, I have a separate workstation machine, and I am trying to connect to the cluster from it.
I installed Spark on this machine and tried to connect using the following command:
pyspark --name testjob --master spark://hadoop-master.domain:7077
In the results I see a running application on the Spark web UI page.
I want to connect to the Hive database (in the cluster) from my workstation, but I can't do this. I have the hive-site.xml config in my Spark conf directory on the local workstation, with the following contents:
<configuration>
<property>
<name>metastore.thrift.uris</name>
<value>thrift://hadoop-master.domain:9083</value>
<description>IP address (or domain name) and port of the metastore host</description>
</property>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>hdfs://hadoop-master.domain:9000/user/hive/warehouse</value>
<description>Warehouse location</description>
</property>
<property>
<name>metastore.warehouse.dir</name>
<value>hdfs://hadoop-master.domain:9000/user/hive/warehouse</value>
<description>Warehouse location</description>
</property>
<property>
<name>spark.sql.hive.metastore.version</name>
<value>3.1.0</value>
<description>Metastore version</description>
</property>
</configuration>
I tried this construction, but can't make it work with the external Hive databases:
spark = SparkSession \
    .builder \
    .appName('test01') \
    .config('hive.metastore.uris', "thrift://hadoop-master.domain:9083") \
    .config("spark.sql.warehouse.dir", "hdfs://hadoop-master.domain:9000/user/hive/warehouse") \
    .enableHiveSupport() \
    .getOrCreate()
What should I do to connect from local PySpark to the remote Hive database?
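Before changing any Spark settings, a quick sanity check (using the hostnames and ports from the hive-site.xml above) is to confirm that both the metastore Thrift service and the NameNode are reachable from the workstation:
# Is the Hive metastore Thrift port reachable from the workstation?
telnet hadoop-master.domain 9083
# Is the HDFS NameNode used by the warehouse directory reachable?
telnet hadoop-master.domain 9000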

Spark on Yarn Failed to send RPC and Slave lost

I want to deploy Spark 2.3.2 on YARN, with Hadoop 2.7.3.
But when I run:
spark-shell
it always raises errors:
ERROR TransportClient:233 - Failed to send RPC 4858956348523471318 to /10.20.42.194:54288: java.nio.channels.ClosedChannelException
...
ERROR YarnScheduler:70 - Lost executor 1 on dc002: Slave lost
Both dc002 and dc003 raise the "Failed to send RPC" and "Slave lost" errors.
I have one master node and two slave nodes. They all run:
CentOS Linux release 7.5.1804 (Core) with 40 CPUs, 62.6 GB memory, and 31.4 GB swap.
My HADOOP_CONF_DIR:
export HADOOP_CONF_DIR=/home/spark-test/hadoop-2.7.3/etc/hadoop
My /etc/hosts:
10.20.51.154 dc001
10.20.42.194 dc002
10.20.42.177 dc003
In the Hadoop and YARN web UIs I can see both the dc002 and dc003 nodes, and I can run a simple MapReduce task on YARN in Hadoop.
But when I run spark-shell or the SparkPi example program with
./spark-submit --deploy-mode client --class org.apache.spark.examples.SparkPi spark-2.3.2-bin-hadoop2.7/examples/jars/spark-examples_2.11-2.3.2.jar 10
the errors always appear.
I really want to know why those errors happen.
I fixed this problem by changing the yarn-site.xml conf file:
<property>
<name>yarn.nodemanager.pmem-check-enabled</name>
<value>false</value>
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
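Note that changes to yarn-site.xml only take effect after the NodeManagers are restarted. A minimal sketch, assuming a stock Hadoop 2.7.3 layout, that dc001 is the master, and that the updated file has been copied to every node:
# Run on the ResourceManager host (dc001) after distributing the new yarn-site.xml
$HADOOP_HOME/sbin/stop-yarn.sh
$HADOOP_HOME/sbin/start-yarn.sh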
Try this parameter in your code:
spark.conf.set("spark.dynamicAllocation.enabled", "false")
Secondly, while doing spark-submit, define parameters like --executor-memory and --num-executors.
Sample:
spark2-submit --executor-memory 20g --num-executors 15 --class com.executor mapping.jar
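The same idea can also be expressed entirely on the command line instead of in code; a sketch based on the sample above, with dynamic allocation disabled at submit time:
spark2-submit \
  --conf spark.dynamicAllocation.enabled=false \
  --num-executors 15 \
  --executor-memory 20g \
  --class com.executor \
  mapping.jar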

Looking at the local file system instead of HDFS when I run spark-submit

When I run spark-submit, it throws an error indicating that there is no such file in the file system, as below.
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/user/sclee/clustering2/mapTemplate_micron
I think that my file is on HDFS, not on my local file system.
I found that my Hadoop configuration file was correctly configured, as below:
<property>
<name>fs.defaultFS</name>
<value>hdfs://spark.dso.hdm1:9000</value>
</property>
How can I resolve this issue?
Supplement
Below is my submit command.
Actually, I was using Spark fine with the command below. However, I mistakenly removed the Spark directories, so I copied the Spark directory over from a worker node, and then this issue occurred. I hope to fix it. Thanks.
hadoop fs -rm -r /home/hawq2/*
spark-submit \
--class com.bistel.spark.examples.yma.ClusterServiceBasedOnNewAlgo \
--master spark://spark.dso.spkm1:7077 \
--executor-memory 8g \
--executor-cores 4 \
--jars /home/jumbo/user/sclee/clustering/guava-19.0.jar \
--conf spark.eventLog.enabled=true \
--conf spark.eventLog.dir=hdfs://spark.dso.hdm1:9000/user/jumbo/applicationHistory \
--conf spark.memory.offHeap.enabled=true \
--conf spark.memory.offHeap.size=268435456 \
./new.jar \
/user/sclee/clustering2/mapTemplate_micron /user/sclee/clustering2/data/bin3 /user/sclee/clustering2/ret
It looks like your HADOOP_CONF_DIR isn't loaded, or the files in it aren't.
For example, check this in spark-env.sh, setting the correct directory for your config:
HADOOP_CONF_DIR=/etc/hadoop/
Then, ensure that you have configured hdfs-site.xml, core-site.xml, and yarn-site.xml in that directory. (Although it looks like you're not using YARN, so probably just the core and hdfs files.)
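A few commands that help confirm the client side really picks up the cluster configuration (the hostnames and paths are the ones from this question):
# Does the conf directory actually contain the cluster's files?
ls $HADOOP_CONF_DIR
# Which default filesystem does the Hadoop client resolve? It should print hdfs://spark.dso.hdm1:9000
hdfs getconf -confKey fs.defaultFS
# Does the input path exist when addressed with an explicit hdfs:// scheme?
hadoop fs -ls hdfs://spark.dso.hdm1:9000/user/sclee/clustering2/mapTemplate_micron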

Why does spark-shell --master yarn-client fail with "UnknownHostException: Invalid host name"?

This is Spark 1.6.1.
When I do the following in spark/bin:
$ ./spark-shell --master yarn-client
I get the "UnknownHostException: Invalid host name" error.
I checked the hostname in /etc/hosts and also in Hadoop, and they are assigned the same hostname. Any idea?
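Since the message points at host-name resolution, a first check worth doing on the machine where spark-shell is started (a sketch only; the property names are standard Hadoop/YARN keys, the values are whatever your cluster uses):
# Does this machine's own FQDN resolve?
hostname -f
getent hosts $(hostname -f)
# Do the hostnames referenced by the Hadoop configs resolve as well?
grep -E -A1 "fs.defaultFS|yarn.resourcemanager.hostname" $HADOOP_CONF_DIR/*-site.xml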

oozie spark action - how to specify spark-opts

I am running a Spark job in yarn-client mode via the Oozie spark action. I need to specify driver and application-master related settings. I tried configuring spark-opts as documented by Oozie, but it's not working.
Here's from the Oozie doc:
Example:
<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.1">
...
<action name="myfirstsparkjob">
<spark xmlns="uri:oozie:spark-action:0.1">
<job-tracker>foo:8021</job-tracker>
<name-node>bar:8020</name-node>
<prepare>
<delete path="${jobOutput}"/>
</prepare>
<configuration>
<property>
<name>mapred.compress.map.output</name>
<value>true</value>
</property>
</configuration>
<master>local[*]</master>
<mode>client</mode>
<name>Spark Example</name>
<class>org.apache.spark.examples.mllib.JavaALS</class>
<jar>/lib/spark-examples_2.10-1.1.0.jar</jar>
<spark-opts>--executor-memory 20G --num-executors 50</spark-opts>
<arg>inputpath=hdfs://localhost/input/file.txt</arg>
<arg>value=2</arg>
</spark>
<ok to="myotherjob"/>
<error to="errorcleanup"/>
</action>
...
</workflow-app>
In the above, spark-opts is specified as --executor-memory 20G --num-executors 50,
while the description on the same page says:
"The spark-opts element if present, contains a list of spark options that can be passed to spark driver. Spark configuration options can be passed by specifying '--conf key=value' here"
So according to the document it should be --conf executor-memory=20G.
Which one is right here, then? I tried both, but neither seems to work. I am running in yarn-client mode, so I mainly want to set up driver-related settings. I think this is the only place I can set driver settings.
<spark-opts>--driver-memory 10g --driver-java-options "-XX:+UseCompressedOops -verbose:gc" --conf spark.driver.memory=10g --conf spark.yarn.am.memory=2g --conf spark.driver.maxResultSize=10g</spark-opts>
<spark-opts>--driver-memory 10g</spark-opts>
None of the above driver-related settings get set in the actual driver JVM. I verified this via the Linux process info.
reference: https://oozie.apache.org/docs/4.2.0/DG_SparkActionExtension.html
I did find what the issue is. In yarn-client mode you can't specify driver-related parameters using <spark-opts>--driver-memory 10g</spark-opts>, because your driver (the Oozie launcher job) is already launched before that point. The Oozie launcher (which is a MapReduce job) is what launches your actual Spark job (or any other job), and spark-opts is only relevant for that launched job. To set driver parameters in yarn-client mode, you basically need to configure the launcher in the Oozie workflow configuration:
<configuration>
<property>
<name>oozie.launcher.mapreduce.map.memory.mb</name>
<value>8192</value>
</property>
<property>
<name>oozie.launcher.mapreduce.map.java.opts</name>
<value>-Xmx6000m</value>
</property>
<property>
<name>oozie.launcher.mapreduce.map.cpu.vcores</name>
<value>24</value>
</property>
<property>
<name>mapreduce.job.queuename</name>
<value>default</value>
</property>
</configuration>
I haven't tried yarn-cluster mode, but spark-opts may work for driver settings there. My question, however, was about yarn-client mode.
<spark-opts>--executor-memory 20G</spark-opts> should work ideally.
Also, try using:
<master>yarn-cluster</master>
<mode>cluster</mode>
"Spark configuration options can be passed by specifying '--conf key=value' here " is probably referring the configuration tag.
For Ex:
--conf mapred.compress.map.output=true would translate to:
<configuration>
<property>
<name>mapred.compress.map.output</name>
<value>true</value>
</property>
</configuration>
Try changing <master>local[*]</master> to <master>yarn</master>.
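If you go the route of editing the workflow (the master/mode elements or the launcher configuration), remember that the workflow has to be re-submitted for the change to apply; a minimal sketch assuming a standard Oozie CLI setup, with the Oozie URL and properties file as placeholders:
# Re-run the workflow after uploading the updated workflow.xml to HDFS
oozie job -oozie http://oozie-host:11000/oozie -config job.properties -run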
