Run Spark executors as specified Linux user - apache-spark

I have a Spark standalone cluster with 5 nodes. All nodes have the same volume mounted via NFS, and the files within this mount have specific Linux file permissions.
When I spark-submit my job as user x (who exists on all nodes with the same UID), I want the Spark executors to also run as user x, so that the job can only access files that user x has permission to read.
I don't have Kerberos and I don't have HDFS.
Is this possible in this setup?
Would it help to use YARN?

As someone who has been playing a lot with Spark Standalone, YARN, HDFS and so on, here's what my experience has taught me:
Spark Standalone has absolutely no form of access control; there is nothing there to regulate this.
YARN without HDFS is possible, but your job will always run as the yarn user, so if you write files to anything other than HDFS, the files will be owned by the yarn user.
Kerberos is not a solution for this kind of usage. HDFS and YARN work hand in hand: if you run a job as user spark with Kerberos and write to HDFS, the files will belong to spark; if you do the same with NFS or any other distributed file system, the files will belong to the system user used to run YARN.
Finally, you might be able to mitigate some issues with Ranger or Livy, but files written outside of HDFS will still belong to the system user that writes them.
My conclusion to such a problem is that HDFS is the centrepiece of the whole Hadoop ecosystem, and not using it is problematic.
That is somewhat unfortunate, as HDFS is really complex to maintain compared to NFS.

Related

Run spark cluster using an independent YARN (without using Hadoop's YARN)

I want to deploy a Spark cluster with YARN as the cluster manager.
This Spark cluster needs to read data from an external HDFS filesystem belonging to an existing Hadoop ecosystem that also has its own YARN (however, I am not allowed to use that Hadoop cluster's YARN).
My questions are:
Is it possible to run a Spark cluster using an independent YARN, while still reading data from an outside HDFS filesystem?
If yes, is there any downside or performance penalty to this approach?
If no, can I run Spark as a standalone cluster, and will there be any performance issue?
Assume both the Spark cluster and the Hadoop cluster are running in the same data center.
"using an independent YARN, while still reading data from an outside HDFS filesystem"
Yes. Configure yarn-site.xml for your own cluster and use the full FQDN to refer to external file locations, such as hdfs://namenode-external:8020/file/path.
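A minimal sketch of what that looks like from the Spark side (the external namenode host, port, and path here are placeholders):

    import org.apache.spark.sql.SparkSession

    // Submitted with --master yarn against your own, independent YARN cluster.
    val spark = SparkSession.builder()
      .appName("read-external-hdfs")
      .getOrCreate()

    // Fully qualified URI pointing at the external HDFS namenode,
    // not at the fs.defaultFS of the cluster the job runs on.
    val df = spark.read.textFile("hdfs://namenode-external:8020/file/path")
    println(df.count())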
"any downside or performance penalty to this approach"
Yes. All reads will be remote rather than cluster-local. The performance degradation would effectively be similar to reading from S3 or other remote locations, however.
"can I run Spark as a standalone cluster"
You could, or you could use Kubernetes if that's available, but both are pointless IMO if there's already a YARN cluster (with enough resources) available.

Run Spark or Flink on a distributed file system other than HDFS or S3

Is there a way to run Spark or Flink on a distributed file system, say Lustre or anything other than HDFS or S3?
We are able to set up a distributed file system on a Unix cluster; can we then run Spark/Flink in cluster mode rather than standalone?
You can use file:/// as a DFS, provided every node has access to common paths and your app is configured to use those common paths for sharing source libraries, source data, intermediate data, and final data (see the sketch below).
Filesystems like Lustre tend to support that, and/or ship a specific Hadoop filesystem client library which wraps/extends it.
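A minimal sketch of the Spark side, assuming /shared/data is a placeholder for a path mounted identically on every node:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("shared-posix-fs")
      .getOrCreate()

    // file:/// works as long as the exact same path is visible on the driver
    // and on every executor (e.g. a Lustre or NFS mount).
    val lines = spark.read.textFile("file:///shared/data/input.txt")
    lines.filter(_.nonEmpty).write.text("file:///shared/data/output")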

Is it possible to run ANY application or program with HADOOP YARN?

I've been studying distributed computing recently and found out that Hadoop YARN is one such system.
So I thought that if I just set up a Hadoop YARN cluster, then every application would run distributed.
But now someone told me that Hadoop YARN cannot do anything by itself and needs other things like MapReduce, Spark, and HBase.
If this is correct, does that mean only a limited set of tasks can be run with YARN?
Or can I apply YARN's distributed computing to any application I want?
Hadoop is the name that refers to the entire system.
HDFS is the actual storage system. Think of it as S3 or a distributed Linux filesystem.
YARN is a framework for scheduling jobs and allocating resources. It handles these things for you, but you don't interact with it very much.
Spark and MapReduce are managed by YARN. With these two, you can actually write your code/applications and give work to the cluster.
HBase uses HDFS storage (which is file based) and provides NoSQL storage on top of it.
Theoretically you can run more than just Spark and MapReduce on YARN, and you can use something other than YARN (Kubernetes is now supported or in the works). You can even write your own processing tool, queue/resource-management system, or storage layer... Hadoop has many pieces which you may use or not, depending on your case. But the majority of Hadoop systems use YARN and Spark.
If you want to deploy Docker containers, for example, a plain Kubernetes cluster would be a better choice. If you need batch/real-time processing with Spark, use Hadoop.
YARN itself is a resource manager. You will need to write code that can be deployed onto those resources, and then it can do anything, provided that the nodes running the tasks are themselves capable of running the job. For example, you cannot distribute a Python script without first installing its dependencies on those nodes. Mesos is a bit more generalized/accessible than YARN, if you want more flexibility for the same effect.
YARN mostly supports running JAR files; shell scripts (at least from Oozie) or Docker containers can be deployed to it as well (refer to the Apache docs).
You may also refer to the Apache Slider or Twill projects for more information.

How do I set up a HDFS file system to run a Spark job with HDFS?

I am interested in running Spark in standalone mode with Minio/HDFS.
This question asked exactly what I want: "I require a HDFS, is it thus enough to just use the file-system part of Hadoop?" -- but the accepted answer was not helpful, as it did not mention how to use HDFS with Spark.
I have downloaded Spark 2.4.3 pre-built for Apache Hadoop 2.7 and later.
I have followed the Apache Spark tutorials and successfully deployed one master (my local machine) and one worker (my RPi4 on the same local network). I was able to run a simple word count (counting words in /opt/spark/README.md).
Now I want to count words of a file that exists only on the master. I understand that I will need to use HDFS for this to share files across the local network. However, I don't have any idea how to do this, despite perusing both the Apache Spark and Hadoop documentation.
I am confused about the interplay between Spark and Hadoop. I don't know if I should be setting up a Hadoop cluster in addition to a Spark cluster. This tutorial on hadoop.apache.org doesn't seem to help, as it says that "you will need to start both the HDFS and YARN cluster". I want to run Spark in standalone mode, not YARN.
What do I need to do in order for me to run
val textFile = spark.read.textFile("file_that_exists_only_on_my_master")
and have the file be propagated to the worker nodes, i.e. not get a "File does not exist" error on the worker nodes?
I set up MinIO instead, and wrote the following GitHub Gist with instructions.
The trick is to set up core-site.xml to point to the MinIO server.
GitHub Gist: https://gist.github.com/lieuzhenghong/c062aa2c5544d6b1a0fa5139e10441ad
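If you prefer not to touch core-site.xml, the same S3A properties can also be set directly on the SparkSession. A rough sketch (the endpoint, credentials and bucket are placeholders, and the hadoop-aws / AWS SDK jars must be on the classpath):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("spark-minio")
      // spark.hadoop.* entries are forwarded to the Hadoop configuration,
      // i.e. they play the role of core-site.xml settings.
      .config("spark.hadoop.fs.s3a.endpoint", "http://minio-host:9000")
      .config("spark.hadoop.fs.s3a.access.key", "MINIO_ACCESS_KEY")
      .config("spark.hadoop.fs.s3a.secret.key", "MINIO_SECRET_KEY")
      .config("spark.hadoop.fs.s3a.path.style.access", "true")
      .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "false")
      .getOrCreate()

    val textFile = spark.read.textFile("s3a://my-bucket/README.md")
    println(textFile.count())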

How to specify where to read data from HDFS when submitting Spark application?

I have been trying to deploy a Spark multi-node cluster on three machines (master, slave1 and slave2). I have successfully deployed the Spark cluster, but I am confused about how to distribute my HDFS data over the slaves. Do I need to manually put data on my slave nodes, and how can I specify where to read data from when submitting an application from the client? I have searched multiple forums but haven't been able to figure out how to use HDFS with Spark without using Hadoop.
tl;dr Store files to be processed by a Spark application on Hadoop HDFS and Spark executors will be told how to access them.
From HDFS Users Guide:
This document is a starting point for users working with Hadoop Distributed File System (HDFS) either as a part of a Hadoop cluster or as a stand-alone general purpose distributed file system.
A HDFS cluster primarily consists of a NameNode that manages the file system metadata and DataNodes that store the actual data.
So, HDFS is simply a file system that you can use to store files and access them from a distributed application, including a Spark application.
To my great surprise, it's only in the HDFS Architecture documentation that you can find an example HDFS URI, i.e. hdfs://localhost:8020/user/hadoop/delete/test1, which is an HDFS URL for the resource delete/test1 that belongs to the user hadoop.
A URL that starts with hdfs points at HDFS, which in the above example is managed by a NameNode at localhost:8020.
That means that HDFS does not require Hadoop YARN; they are usually used together simply because they ship together and are easy to use together.
"Do I need to manually put data on my slave nodes and how can I specify where to read data from when submitting an application from the client?"
Spark supports Hadoop HDFS with or without Hadoop YARN. A cluster manager (aka master URL) is an orthogonal concern to HDFS.
Wrapping it up, just use hdfs://hostname:port/path/to/directory to access files on HDFS.
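For example (the namenode host, port and path below are placeholders for your cluster):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("read-from-hdfs")
      .getOrCreate()

    // Works the same whether the cluster manager is Spark standalone or YARN;
    // the executors read the blocks directly from the HDFS DataNodes.
    val data = spark.read.textFile("hdfs://namenode:8020/path/to/directory")
    println(data.count())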
