How to load data into spark from a remote HDFS? - apache-spark

Our data is stored at a remote Hadoop Cluster, but for doing some PoC I need to run spark application locally on my machine. How can I load data from that remote HDFS?

You can configure spark to access any hadoop instance you have access to.(Ports open, nodes reachable)
Custom Hadoop/Hive Configuration
If your Spark application is interacting with Hadoop, Hive, or both, there are probably Hadoop/Hive
configuration files in Spark’s classpath.
Multiple running applications might require different Hadoop/Hive
client side configurations. You can copy and modify hdfs-site.xml,
core-site.xml, yarn-site.xml, hive-site.xml in Spark’s classpath for
each application. In a Spark cluster running on YARN, these
configuration files are set cluster-wide, and cannot safely be changed
by the application.
As you want to access HDFS you need: hdfs-site.xml and core-site.xml from your cluster you are trying to access.

For anyone, who wants to access remote HDFS from Spark Java app, here is steps.
Firstly, you need to add --conf key to your run command. Depends on Spark version:
(Spark 1.x-2.1)
spark.yarn.access.namenodes=hdfs://clusterA,hdfs://clusterB
(Spark 2.2+) spark.yarn.access.hadoopFileSystems=hdfs://clusterA,hdfs://clusterB
Secondly, when you creating Spark’s Java context, add that:
javaSparkContext.hadoopConfiguration().addResource(new Path("core-site-clusterB.xml"));
javaSparkContext.hadoopConfiguration().addResource(new Path("hdfs-site-clusterB.xml"));
If you facing this exception:
java.net.UnknownHostException: clusterB
then try to put full namenode address of your remote HDFS with port (instead of hdfs/cluster short name) to --conf into your running command.
More details in my article: https://mchesnavsky.tech/spark-java-access-remote-hdfs.

Related

Does Spark 2.4.4 support forwarding Delegation Tokens when master is k8s?

I'm currently in the process of setting up a Kerberized environment for submitting Spark Jobs using Livy in Kubernetes.
What I've achieved so far:
Running Kerberized HDFS Cluster
Livy using SPNEGO
Livy submitting Jobs to k8s and spawning Spark executors
KNIME is able to interact with Namenode and Datanodes from outside the k8s Cluster
To achieve this I used the following Versions for the involved components:
Spark 2.4.4
Livy 0.5.0 (The currently only supported version by KNIME)
Namenode and Datanode 2.8.1
Kubernetes 1.14.3
What I'm currently struggling with:
Accessing HDFS from the Spark executors
The error message I'm currently getting, when trying to access HDFS from the executor is the following:
org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]; Host Details : local host is: "livy-session-0-1575455179568-exec-1/10.42.3.242"; destination host is: "hdfs-namenode-0.hdfs-namenode.hdfs.svc.cluster.local":8020;
The following is the current state:
KNIME connects to HDFS after having successfully challenged against the KDC (using Keytab + Principal) --> Working
KNIME puts staging jars to HDFS --> Working
KNIME requests new Session from Livy (SPNEGO challenge) --> Working
Livy submits Spark Job with k8s master / spawns executors --> Working
KNIME submits tasks to Livy which should be executed by the executors --> Basically working
When trying to access HDFS to read a file the error mentioned before occurs --> The problem
Since KNIME is placing jar files on HDFS which have to be included in the dependencies for the Spark Jobs it is important to be able to access HDFS. (KNIME requires this to be able to retrieve preview data from DataSets for example)
I tried to find a solution to this but unfortunately, haven't found any useful resources yet.
I had a look at the code an checked UserGroupInformation.getCurrentUser().getTokens().
But that collection seems to be empty. That's why I assume that there are not Delegation Tokens available.
Has anybody ever achieved running something like this and can help me with this?
Thank you all in advance!
For everybody struggeling with this:
It took a while to find the reason on why this is not working, but basically it is related to Spark's Kubernetes implementation as of 2.4.4.
There is no override defined for CoarseGrainedSchedulerBackend's fetchHadoopDelegationTokens in KubernetesClusterSchedulerBackend.
There has been the pull request which will solve this by passing secrets to executors containing the delegation tokens.
It was already pulled into master and is available in Spark 3.0.0-preview but is not, at least not yet, available in the Spark 2.4 branch.

How to specify where to read data from HDFS when submitting Spark application?

I have been trying to deploy a spark multi-node cluster on three machines (master, slave1 and slave2). I have successfully deployed the spark cluster but I am confused about how to distribute my HDFS data over the slaves? Do I need to manually put data on my slave nodes and how can I specify where to read data from when submitting an application from the client? I have searched multiple forums but haven't been able to figure out how to use HDFS with Spark without using Hadoop.
tl;dr Store files to be processed by a Spark application on Hadoop HDFS and Spark executors will be told how to access them.
From HDFS Users Guide:
This document is a starting point for users working with Hadoop Distributed File System (HDFS) either as a part of a Hadoop cluster or as a stand-alone general purpose distributed file system.
A HDFS cluster primarily consists of a NameNode that manages the file system metadata and DataNodes that store the actual data.
So, HDFS is a mere file system that you can use to store files and use them in a distributed application, incl. a Spark application.
To my great surprise, it's only in HDFS Architecture where you can find a HDFS URI, i.e. hdfs://localhost:8020/user/hadoop/delete/test1 that is a HDFS URL to a resource delete/test1 that belongs to the user hadoop.
The URL that start with hdfs points at a HDFS that in the above example is managed by a NameNode at localhost:8020.
That means that HDFS does not require Hadoop YARN, but is usually used together because they come together and is just simple to use together.
Do I need to manually put data on my slave nodes and how can I specify where to read data from when submitting an application from the client?
Spark supports Hadoop HDFS with or without Hadoop YARN. A cluster manager (aka master URL) is an orthogonal concern to HDFS.
Wrapping it up, just use hdfs://hostname:port/path/to/directory with to access files on HDFS.

How to use custom log4j.properties accessible via hdfs file with dcos Spark

I am trying to pass a custom log4j.properties file located in hdfs to a spark job launched with the dcos cli. I have tried any possible permutation of available spark configuration options (e.g. --conf spark.executor.extraJavaOptions=‘-Dlog4j.configuration=hdfs://conf/custom-log4j.properties’) but there is no way to make it work, the configuration is just ignored.

How to connect to spark (CDH-5.8 docker vms at remote)? Do I need to map port 7077 at container?

Currently, I can access the HDFS from inside my application, but I'd also like to - instead of running my local spark - to use Cloudera's spark as it is enabled in Cloudera Manager.
Righ now I have the HDFS defined at core-site.xml, and I run my app as (--master) YARN. Thus I don't need to set the machine address to my HDFS files. In this way, my SPARK job runs locally and not in the "cluster." I don't want that for now. When I try to set --master to [namenode]:[port] it does not connect. I wonder if I'm directing to the correct port, or if I have to map this port at docker container. Or if I'm missing something about Yarn setup.
Additionally, I've been testing SnappyData (Inc) solution as a Spark SQL in-memory database. So my goal is to run snappy JVMs locally, but redirecting spark jobs to the VM cluster. The whole idea here is to test some performance against some Hadoop implementation. This solution is not a final product (if snappy is local, and spark is "really" remote, I believe it won't be efficient - but in this scenario, I would bring snappy JVMs to the same cluster..)
Thanks in advance!

Spark submit does automatically upload the jar to cluster?

I'm trying to submit a Spark app from local machine Terminal to my Cluster.
I'm using --master yarn-cluster. I need to run the driver program on my Cluster too, not on the machine I do submit the application i.e my local machine
When I provide the path to application jar which is in my local machine, would spark-submit automatically upload it to my Cluster?
I'm using
bin/spark-submit
--class com.my.application.XApp
--master yarn-cluster --executor-memory 100m
--num-executors 50 /Users/nish1013/proj1/target/x-service-1.0.0-201512141101-assembly.jar
1000
and getting error
Diagnostics: java.io.FileNotFoundException: File file:/Users/nish1013/proj1/target/x-service-1.0.0-201512141101- does not exist
In Documentation ,http://spark.apache.org/docs/latest/submitting-applications.html#launching-applications-with-spark-submit
Advanced Dependency Management When using spark-submit, the
application jar along with any jars included with the --jars option
will be automatically transferred to the cluster.
But seems like it does not !
I see you are quoting the spark-submit page from Spark Docs but I would spend a lot more time on the Running Spark on YARN page. Bottom-line, look at:
There are two deploy modes that can be used to launch Spark
applications on YARN. In yarn-cluster mode, the Spark driver runs
inside an application master process which is managed by YARN on the
cluster, and the client can go away after initiating the application.
In yarn-client mode, the driver runs in the client process, and the
application master is only used for requesting resources from YARN.
Further you note, "I need to run the driver program on my Cluster too, not on the machine I do submit the application i.e my local machine"
So I agree with you that you are right to run --master yarn-cluster instead of --master yarn-client
(and one comment notes what might just be a syntax error where you dropped "assembly.jar" but I think this will apply as well...)
Some of the basic assumptions about non-YARN implementations change a lot when YARN is introduced, mostly related to Classpaths and the need to push jars to the workers.
From an email on the Apache Spark User list:
YARN cluster mode. Spark submit does upload your jars to the cluster.
In particular, it puts the jars in HDFS so your driver can just read
from there. As in other deployments, the executors pull the jars from
the driver.
So finally, from the Apache Spark YARN doc:
Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory
which contains the (client side) configuration files for the Hadoop
cluster. These configs are used to write to HDFS and connect to the
YARN ResourceManager.
NOTE: I only see you adding a single JAR, if there's a need to add other JARs there's a special note about doing that with YARN:
In yarn-cluster mode, the driver runs on a different machine than the
client, so SparkContext.addJar won’t work out of the box with files
that are local to the client. To make files on the client available to
SparkContext.addJar, include them with the --jars option in the launch
command.
That page in the link has some examples.
And of course you downloaded or built the YARN-specific version of Spark.
Background, in a standalone cluster deployment using spark-submit and the option --deploy-mode cluster, yes you do need to make sure every worker node has access to all the dependencies, Spark will not push them to the cluster. This is because in "standalone cluster" mode with Spark as the job manager, you don't know which node the driver will run on! But that doesn't apply to your case.
But if I could, depending on the size of the jars you are uploading, I would still explicitly put the jars on each node, or "globally available" via HDFS, for another reason from the docs:
From Advanced Dependency Management, it seems to present the best of both worlds, but also a great reason for manually pushing your jars out to all nodes:
local: - a URI starting with local:/ is expected to exist as a local
file on each worker node. This means that no network IO will be
incurred, and works well for large files/JARs that are pushed to each
worker, or shared via NFS, GlusterFS, etc.
But I assume that local:/... would change to hdfs:/ ... not sure on that one.
Yes and no. It depends on what you mean. Spark deploys the .jar to the nodes in the cluster. However, it won't upload your .jar file from your local machine to the cluster.
You can find more info in the Submitting Applications page. As you can see, in the arguments you pass to spark-submit, there is one that needs to be globally visible: the application-jar.
application-jar: Path to a bundled jar including your application and
all dependencies. The URL must be globally visible inside of your
cluster, for instance, an hdfs:// path or a file:// path that is
present on all nodes.
As far as I understand, what you want is to use yarn-client, not yarn-cluster. This will run the driver in the client (e.g., the machine which you are trying to call spark-submit on, for example your laptop), without the need of copying the .jar file on the cluster. More about this here.
Try adding --jars option before your /path/to/jar/file
spark-submit --jars /tmp/test.jar

Resources