how to deal with shared file permissions in standalone spark cluster? - apache-spark

We are setting up a spark cluster using the standalone deploy method. Master and all workers are looking at a shared (networked) file system. (This is a cluster that is spun up every now and then to do heavy lifting data wrangling, no need for the (beautiful but intense) HDFS).
The services are running as user spark with group spark. My user is a member of group spark. When I start a session, an application gets created on the cluster. That application can read any file on the shared file system that is readable by the group spark.
But when I write a file to it, in this setup (for instance: orders.write.parquet("file:///srv/spark-data/somefile.parquet")), the various steps are performed by different users - depending on which service in the application is performing it.
It seems that the directory is created by my user. Then the spark user writes files to it (in _temporary). And then my user gets to move these temporary files to their final destination.
And this is where it goes wrong. These temporary files only have read access for the group spark, so my user cannot move them across to the permanent place.
I have not yet found a solution to either a) have all workers run under my user account or b) have these temporary files created with group read + write permissions.
My current work-around is to create my session as user spark. That works fine, of course, but is not ideal for obvious reasons.
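One avenue that might be worth testing (a sketch under assumptions, not a verified fix) is to pass a group-writable umask to Hadoop's FileSystem layer through the session configuration, and to set a matching OS umask (e.g. umask 002 in spark-env.sh) before starting the worker daemons. fs.permissions.umask-mode is a standard Hadoop key, but whether the local file:// committer honors it in a standalone cluster is an assumption to verify; the app name and stand-in DataFrame below are placeholders.

import org.apache.spark.sql.SparkSession

// Sketch: ask Hadoop's FileSystem layer for group-writable files (umask 002
// instead of the default 022). Whether this reaches the _temporary files the
// committer writes on a local file:// path is an assumption to test.
val spark = SparkSession.builder()
  .appName("group-writable-output")                            // placeholder
  .config("spark.hadoop.fs.permissions.umask-mode", "002")
  .getOrCreate()

// The write from the question, with a stand-in DataFrame.
val orders = spark.range(10).toDF("order_id")
orders.write.mode("overwrite").parquet("file:///srv/spark-data/somefile.parquet")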

Related

spark standalone mode -- permission to submit and file read/write permission

I have a simple Spark Standalone cluster setup; it is launched by a system user called 'spark'. It's been a few years since such questions were raised, so I am wondering if things have changed.
Firstly, the cluster is reading/writing from NFS (no HDFS). The job is submitted by a real user, but it appears that user 'spark' does the actual read/write on NFS; without granting user 'spark' permission to the real user's directory, we run into issues.
If I give user 'spark' access to the real user's directory, then any other user can use this cluster to access that directory. This seems like a security concern.
Is this still the case? We are not ready to explore YARN, HDFS etc.
Secondly, is there a way for the real user to start the standalone cluster AND allow only that user, and no other users, to submit jobs?
Thanks!
I have read through all the related posts on Stack Overflow.

Does calling `cache` on a spark dataframe eliminate the need for future calls to Hive/HDFS?

We have a spark application that reads data using Spark SQL from HMS tables built on parquet files stored in HDFS. The spark application is running on a separate Hadoop environment. We use delegation tokens to allow the spark application to authenticate to Kerberized HMS/HDFS. We cannot and must not use keytabs to authenticate the spark application directly.
Because delegation tokens expire, after a certain period of time our spark application will no longer be able to authenticate and will fail if it has not completed within the timeframe during which the token is valid.
My question is this.
If I call .cache or .persist on the source dataframe against which all subsequent operations are executed, my understanding is that this will cause spark to store all the data in memory. If all the data is in memory, it should not need to make subsequent calls to read leaf files in HDFS, and the authentication error could be avoided. Note that the spark application has its own local file system; it is not using the remote HDFS source as its default fs.
Is this assumption about the behavior of .cache or .persist correct, or is the only solution to rewrite the data to intermediate storage?
Solve the Kerberos issue instead of adding workarounds. I'm not sure how you are using the Kerberos principal, but I will point out that the documentation describes a solution for this issue:
Long-Running Applications

Long-running applications may run into issues if their run time exceeds the maximum delegation token lifetime configured in services it needs to access.

This feature is not available everywhere. In particular, it’s only implemented on YARN and Kubernetes (both client and cluster modes), and on Mesos when using client mode.

Spark supports automatically creating new tokens for these applications. There are two ways to enable this functionality.

Using a Keytab

By providing Spark with a principal and keytab (e.g. using spark-submit with --principal and --keytab parameters), the application will maintain a valid Kerberos login that can be used to retrieve delegation tokens indefinitely.

Note that when using a keytab in cluster mode, it will be copied over to the machine running the Spark driver. In the case of YARN, this means using HDFS as a staging area for the keytab, so it’s strongly recommended that both YARN and HDFS be secured with encryption, at least.
I would also point out that caching will reduce visits to HDFS but may still require reads from HDFS if there isn't sufficient space in memory. If you don't solve the Kerberos issue because of [reasons], you may wish to instead use checkpoints. They are slower than caching, but they are made specifically to help [long running processes that sometimes fail] get over that hurdle of expensive recalculation, though they do require disk to be written to. This removes any need to revisit the original HDFS cluster. Typically they're used in Streaming to remove data lineage, but they also have their place in expensive long-running Spark applications. (You also need to manage their cleanup.)
How to recover with a checkpoint file.
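A minimal sketch of the checkpoint approach, assuming a checkpoint directory on storage the application's own cluster owns; the table name, column, and paths are placeholders. checkpoint(eager = true) forces the read to happen while the delegation token is still valid, and later stages run against the checkpointed copy instead of revisiting the remote HDFS.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("token-safe-job").getOrCreate()

// Checkpoint to storage this cluster owns (placeholder path); remember to clean it up.
spark.sparkContext.setCheckpointDir("/data/spark/checkpoints")

// Placeholder HMS table, read while the delegation token is valid.
val source = spark.sql("SELECT * FROM some_hms_db.some_table")

// eager = true materializes the data now, cutting the lineage back to HDFS.
val snapshot = source.checkpoint(eager = true)

// Subsequent work uses the checkpointed copy, not the Kerberized source.
val result = snapshot.groupBy("some_column").count()
result.write.mode("overwrite").parquet("/data/spark/output/result")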

Cannot read persisted spark warehouse databases on subsequent sessions

I am trying to create a locally persisted spark warehouse database that will be present/loaded/accessible to future spark sessions created by the same application.
I have configured the spark session conf with:
.config("spark.sql.warehouse.dir", "C:/path/to/my/long/lived/mock-hive")
When I create the databases, I see the mock-hive folder get created, and underneath two distinct databases that I create have folders: db1.db and db2.db
However, these folders are EMPTY after the session completes, despite the databases being successfully created and subsequently queried in the run that stands them up.
On a subsequent run with the same configured spark session, if I call baseSparkSession.catalog.listDatabases().collect(), I only see the default database. The two I created did not persist into the second spark session.
What is the trick to get these local persisted databases to be available to read in subsequent execution?
I've noticed that the spark.sql.warehouse.dir *.db folders are empty after creation, which might have something to do with it...
Spark Version: 3.0.1
Turns out spark.sql.warehouse.dir is not where the local db data is stored... it's in the Derby database stored in metastore_db. To relocate that, you need to change a system property:
System.setProperty("derby.system.home", derbyPath)
I didn't even have to set spark.sql.warehouse.dir, just relocate the derbyPath to a common location that all spark sessions use.
NOTE - You don't need to specify the "metastore_db" portion of the derbyPath; it will be appended to the location automatically.
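A minimal sketch of that fix, assuming Hive support is enabled (the embedded Derby metastore, i.e. the metastore_db directory, only exists in that case); the paths and names are placeholders, and derby.system.home has to be set before the first SparkSession is created.

import org.apache.spark.sql.SparkSession

// metastore_db will be created under this directory (placeholder path).
System.setProperty("derby.system.home", "C:/spark/derby-home")

val spark = SparkSession.builder()
  .appName("persistent-warehouse")            // placeholder
  .master("local[*]")
  .config("spark.sql.warehouse.dir", "C:/path/to/my/long/lived/mock-hive")
  .enableHiveSupport()                        // assumption: the Derby-backed metastore is only used with Hive support
  .getOrCreate()

spark.sql("CREATE DATABASE IF NOT EXISTS db1")
spark.sql("CREATE TABLE IF NOT EXISTS db1.t (id INT) USING parquet")

// A later session built with the same settings should now list db1 as well as default.
spark.catalog.listDatabases().collect().foreach(db => println(db.name))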

Securing Apache Spark

I'm trying to work out how one might enforce security when running spark jobs on a shared cluster. I understand how one can ensure unauthorised nodes cannot join the cluster (setup shared secret kerberos auth) and how one can restrict who can submit jobs (run under yarn and then use something like ranger to restrict who can access each queue). I am, however, struggling to understand how one might restrict access to resources needed by the spark job.
If I understand correctly all Spark processes on the worker nodes will run as the spark user. Presumably the spark user itself should have pretty minimal permissions, however the question then becomes what to do if your spark job needs to access e.g. sql server. The Spark security docs make mention of a key store. Does that mean that a user submitting a job can pass through a principal and keytab with spark-submit which can be used to authenticate with the external resource as if it were the submitter making the request?
A follow up question is that the security docs also mention that temporary files (shuffle files etc) are not encrypted. Does that mean that you have to assume that any data processed by spark may be potentially leaked to any other user of your spark cluster? If so, is it possible to use their proposed workaround (use an encrypted partition for this data) to solve this? I'm assuming not, as the spark user itself must have the ability to decrypt this data and all programs are running as this user....
I'm trying to work out how one might enforce security when running spark jobs on a shared cluster. I understand how one can ensure unauthorised nodes cannot join the cluster (setup shared secret kerberos auth) and how one can restrict who can submit jobs (run under yarn and then use something like ranger to restrict who can access each queue). I am, however, struggling to understand how one might restrict access to resources needed by the spark job.
You use YARN queues to do that. Each queue can have a minimum amount of resources guaranteed to it. Thus, you define a queue ACL to ensure that only trusted users can submit to the queue, and define the minimum amount of resources this queue will have.
If I understand correctly all Spark processes on the worker nodes will run as the spark user.
Your understanding is not accurate. With Kerberos enabled (which is a precondition for any security discussion), Spark jobs will be executed as the Kerberos user who launched them. There is an important caveat here: Kerberos usernames must match operating system usernames.
Presumably the spark user itself should have pretty minimal permissions, however the question then becomes what to do if your spark job needs to access e.g. sql server. The Spark security docs make mention of a key store. Does that mean that a user submitting a job can pass through a principal and keytab with spark-submit which can be used to authenticate with the external resource as if it were the submitter making the request?
This key store is used for a different and very specific purpose: supporting TLS encryption for HTTP communication (e.g. the Spark UI). Thus, you cannot use it as a secret storage to access third-party systems. Overall, in Hadoop infrastructure there is no standard way to share credentials with a job, so this mechanism has to be reinvented every time. As jobs are executed at the OS level on behalf of the users who start them, you can rely on OS controls to distribute credentials to third-party resources (e.g. file system permissions).
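For illustration, a hedged sketch of what that key store actually configures: TLS for Spark's HTTP endpoints such as the UI. These spark.ssl.* options are normally set in spark-defaults.conf; the keystore path, app name, and environment variable below are placeholders.

import org.apache.spark.sql.SparkSession

// Sketch: the key store feeds TLS for the UI / HTTP endpoints, nothing more.
val spark = SparkSession.builder()
  .appName("tls-ui-demo")                                                    // placeholder
  .config("spark.ssl.enabled", "true")
  .config("spark.ssl.keyStore", "/etc/spark/conf/spark-keystore.jks")        // placeholder path
  .config("spark.ssl.keyStorePassword", sys.env.getOrElse("SPARK_KEYSTORE_PASSWORD", ""))
  .getOrCreate()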
A follow up question is that the security docs also mention that temporary files (shuffle files etc) are not encrypted. Does that mean that you have to assume that any data processed by spark may be potentially leaked to any other user of your spark cluster? If so, is it possible to use their proposed workaround (use an encrypted partition for this data) to solve this? I'm assuming not, as the spark user itself must have the ability to decrypt this data and all programs are running as this user....
There are a couple of things to note. First of all, as already mentioned, a Spark job on a Kerberized cluster will be executed as the user who started the job. All temporary files produced by the job will have file system permissions that grant access only to that specific user and the yarn group (which includes only the yarn user). Secondly, disk encryption will protect you from the disk being stolen, but will never guarantee safety against OS-level attacks. Thirdly, as of Spark 2.1, encryption of temporary files is available.
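A hedged sketch of enabling that Spark 2.1+ option so shuffle and spill files written to local disk are encrypted; spark.authenticate must be enabled first (YARN manages the secret, other deployments need spark.authenticate.secret), and the app name is a placeholder.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("encrypted-temp-files")                    // placeholder
  .config("spark.authenticate", "true")               // prerequisite for the crypto options
  .config("spark.network.crypto.enabled", "true")     // AES-based RPC encryption
  .config("spark.io.encryption.enabled", "true")      // encrypt blocks spilled/shuffled to local disk
  .getOrCreate()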
If you are interested in getting a more in-depth understanding of the Spark-on-YARN security model, I would encourage you to read Apache Spark on YARN Security Model Analysis (disclaimer: I'm the author).

Spark RDD access restrictions and location within the cluster

I have a question regarding RDD access control.
There is data which has to be kept only on a given server (or list of servers); no raw data is allowed to leave it. The data can be processed by some map function, and only after that can it be transferred further.
Are there any features for this in Spark or in supported cluster management solutions (e.g. Mesos)?
A HadoopRDD (used by sc.textFile, for example) has an affinity to be located on the machine that has the file data. (See HadoopRDD.getPreferredLocations.) The map is then performed on the same machine.
But this does not guarantee that the raw data will not leave the machine. If the Spark worker on the machine dies, for example, then another worker will load it from a different machine.
I think the safe option is to run one Spark cluster (or other processing system) on the "secure" machines, perform the map step in this cluster, and write out the result to the HDFS (or other storage system) running on the "unsecure" machines. Then a separate Spark cluster running on the "unsecure" machines can process the data.
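For illustration, a small sketch (placeholder input path, assumed running cluster) that inspects the locality preferences Spark derives for a file-backed RDD; these are scheduling preferences only, not a guarantee that raw data never leaves those hosts.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("locality-inspection").getOrCreate()
val sc = spark.sparkContext

// Placeholder path on the "secure" storage.
val rdd = sc.textFile("hdfs:///secure/dataset/part-*")

// preferredLocations exposes what HadoopRDD.getPreferredLocations reports per partition.
rdd.partitions.foreach { p =>
  println(s"partition ${p.index} prefers: ${rdd.preferredLocations(p).mkString(", ")}")
}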

Resources