spark standalone mode -- permission to submit and file read/write permission

spark standalone mode -- permission to submit and file read/write permission - apache-spark

I have a simple Spark Standalone cluster setup, it is launched by a system user called 'spark'. It's been a few years since such questions have been raised so I am wondering if things have changed.
Firstly, the cluster is reading/writing from NFS (no HDFS). Job is submitted by a real user, but it appears that user 'spark' is doing the actual read/write on NFS, without granting permission to user 'spark' to the real users' directory, we run into issues.
If I give user 'spark' access this the real user's directory. Then another users can use this cluster to access this former user's directory. This seems like a security concern.
Is this still the case? We are not ready to explore YARN, HDFS etc.
Secondly, is there a way for the real user to start the standalone cluster AND only allow the user to submit jobs and no other users?
Thanks!
Read through all the posts in stack overflow

Related

Does calling `cache` on a spark dataframe eliminate the need for future calls to Hive/HDFS?

We have a spark application that reads data using spark SQL from HMS tables built on parquet files stored in HDFS. The spark application is running on a seperate hadoop environment. We use delegation tokens to allow the spark application to authenticate to Kerberized HMS/HDFS. We cannot and must not use keytabs to authenticate the spark application directly.
Because delegation tokens expire, after certain period of time our spark application will no longer be able to authenticate and will fail if it has not completed within the timeframe during which the token is valid.
My question is this.
If I call .cache or .persist on the source dataframe against which all subsequent operations are executed, my understanding is that this will cause spark to store all the data in memory. If all the data is in memory, it should not need to make subsequent calls to read leaf files in HDFS and the authentication error could be avoided. Not that the spark application has its own local file system, it is not using the remote HDFS source as its default fs.
Is this assumption about the behavior of .cache or .persist correct, or is the only solution to rewrite the data to intermediate storage?

Solve Kerberos issue, instead of adding work arounds. I'm not sure how you are using the kerberos principal, but I will point out that the documentation maintains a solution for this issue:
Long-Running
Applications Long-running applications may run into
issues if their run time exceeds the maximum delegation token lifetime
configured in services it needs to access.
This feature is not available everywhere. In particular, it’s only
implemented on YARN and Kubernetes (both client and cluster modes),
and on Mesos when using client mode.
Spark supports automatically creating new tokens for these
applications. There are two ways to enable this functionality.
Using a Keytab
By providing Spark with a principal and keytab (e.g.
using spark-submit with --principal and --keytab parameters), the
application will maintain a valid Kerberos login that can be used to
retrieve delegation tokens indefinitely.
Note that when using a keytab in cluster mode, it will be copied over
to the machine running the Spark driver. In the case of YARN, this
means using HDFS as a staging area for the keytab, so it’s strongly
recommended that both YARN and HDFS be secured with encryption, at
least.
I would also point out that caching will reduce visits to HDFS but may still require reads from HDFS if there isn't sufficient space in memory. If you don't solve the Kerberos issue because of [reasons]. You may wish to instead use checkpoints. They are slower than caching, but are made specifically to help [long running process that sometimes fail] get over that hurdle of expensive recalculation, but they do require disk to be written to. This will remove any need to revisit the original HDFS cluster. Typically they're used in Streaming to remove data lineage, but they also have their place in expensive long running spark applications. (You also need to manage their cleanup.)
How to recover with a checkpoint file.

Run Spark executors as specified Linux user

I have a spark standalone cluster having 5 nodes. All nodes have mounted the same volume via nfs. Files within these mount have certain linux file permission.
When I Spark-submit my Job as user x (who is available on all nodes and has on all nodes the same uid) then I want that the spark executors also run as user x so that the job can only access files user x has permissions to.
I don't have Kerberos and I don't have HDFS.
Is this possible in this setup?
Would it help to use YARN?

As someone who as been playing a lot with Spark Standalone, Yarn, HDFS and so on.
Here's what my experience told me:
Spark Standalone has absolutly no form of access control or regulation possible.
Yarn without HDFS is possible but your job will always run as Yarn, if you write files to something else than HDFS the files will be owned by yarn user.
Kerberos is not a solution to that kind of usage, HDFS/yarn works hand in hand and if you run a job as spark with kerberos and write in HDFS the files will belong to spark. If you do the same with NFS or any other distributed file system files will belong to the system user used to run Yarn.
Finally you might be able to mitigate some issues with Ranger or Livy but files written outside of HDFS will belong to the system user that write them.
My conclusion to such a problem is that HDFS is the centre piece of all the Hadoop ecosystem and not using it is problematic.
It kinda s**** somehow as HDFS is really complex to maintain compared to NFS.

how to deal with shared file permissions in standalone spark cluster?

We are setting up a spark cluster using the standalone deploy method. Master and all workers are looking at a shared (networked) file system. (This is a cluster that is spun up every now and then to do heavy lifting data wrangling, no need for the (beautiful but intense) HDFS).
The services are running as user spark with group spark. My user is a member of group spark. When I start a session, an application gets created on the cluster. That application can read any file on the shared file system that is readable by the group spark.
But when I write a file to it, in this setup (for instance: orders.write.parquet("file:///srv/spark-data/somefile.parquet")), the various steps are performed by different users - depending on which service in the application is performing it.
It seems that the directory is created by my user. The the spark user writes files to it (in _temporary). And then my user gets to move these temporary files to their final destination.
And this is where it goes wrong. These temporary files only have read access for the group spark. Therefore my user cannot move them accross to the permanent place.
I have not yet found a solution to either a) have all workers run under my user account or b) have the file permissions on these temporary files as read + write.
My current work-around it to create my session as user spark. That works fine, of course, but is not ideal for obvious reasons.

Securing Apache Spark

I'm trying to work out how one might enforce security when running spark jobs on a shared cluster. I understand how one can ensure unauthorised nodes cannot join the cluster (setup shared secret kerberos auth) and how one can restrict who can submit jobs (run under yarn and then use something like ranger to restrict who can access each queue). I am however, struggling to understand how one might restrict access to resources needed by the spark job.
If I understand correctly all Spark processes on the worker nodes will run as the spark user. Presumably the spark user itself should have pretty minimal permissions, however the question then becomes what to do if your spark job needs to access e.g. sql server. The Spark security docs make mention of a key store. Does that mean that a user submitting a job can pass through a principal and keytab with spark-submit which can be used to authenticate with the external resource as if it were the submitter making the request.
A follow up question is that the security docs also mention that temporary files (shuffle files etc) are not encrypted. Does that mean that you have to assume that any data processed by spark may be potentially leaked to any other user of your spark cluster? If so is it possible to use their proposed workaround (use an encrypted partition for this data) to solve this? I'm assuming not as the spark user itself must have the ability to decrypt this data and all programs are runining as this user....

I'm trying to work out how one might enforce security when running
spark jobs on a shared cluster. I understand how one can ensure
unauthorised nodes cannot join the cluster (setup shared secret
kerberos auth) and how one can restrict who can submit jobs (run under
yarn and then use something like ranger to restrict who can access
each queue). I am however, struggling to understand how one might
restrict access to resources needed by the spark job.
You use YARN queue to do that. Each queue could have minimal amount of resources available for the queue. Thus, you define queue ACL to ensure that only trusted users will submit to the queue and define minimum amount of resources this queue will have.
If I understand correctly all Spark processes on the worker nodes will
run as the spark user.
Your understanding is not accurate. With Kerberos enabled (which is precondition for any security discussion) Spark jobs will be executed as the Kerberos user, who launched them. There is an important caveat to the matter — Kerberos usernames must match operating system usernames.
Presumably the spark user itself should have
pretty minimal permissions, however the question then becomes what to
do if your spark job needs to access e.g. sql server. The Spark
security docs make mention of a key store. Does that mean that a user
submitting a job can pass through a principal and keytab with
spark-submit which can be used to authenticate with the external
resource as if it were the submitter making the request.
This key store is used for a different and very specific purpose — support TLS encryption for HTTP communication (e.g. Spark UI). Thus, you can not use it as a secret storage to access third-party systems. Overall, in Hadoop infrastructure there is no way to share credentials with the job. Thus, mechanism should be reinvented every time. As jobs will be executed on OS-level on behalf of users that start them, you could rely on OS controls to distribute credentials to third-party resources (e.g. file system permissions).
A follow up question is that the security docs also mention that
temporary files (shuffle files etc) are not encrypted. Does that mean
that you have to assume that any data processed by spark may be
potentially leaked to any other user of your spark cluster? If so is
it possible to use their proposed workaround (use an encrypted
partition for this data) to solve this? I'm assuming not as the spark
user itself must have the ability to decrypt this data and all
programs are runining as this user....
There couple things to note. First of all, as already mentioned, Spark job on Kerberized-cluster will be executed as a user, who started the job. All temporary files produced by the job will have file system permissions that grant access to only that specific user and yarn group (includes only yarn user). Secondly, disk encryption will protect you from disk being stolen, but will never guaranty safety for OS-level attacks. Thirdly, as of Spark 2.1 temporary files encryption is available.
If you are interested to get more in-depth understanding of Spark-on-YARN security model I would encourage you to read Apache Spark on YARN Security Model Analysis (disclaimer I'm the author).

How can each user on a machine install his/her own pseudodistributed hadoop cluster?

So, let us say, there are different users which use a server, and each one of them want to experiment/play with Hadoop by installing their own Hadoop in pseudo-distributed mode. How can each of them have that done without causing conflicts between the ports of different users? Also, notice that none of them has admin rights.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string