Securing Apache Spark

I'm trying to work out how one might enforce security when running Spark jobs on a shared cluster. I understand how one can ensure unauthorised nodes cannot join the cluster (set up shared-secret Kerberos auth) and how one can restrict who can submit jobs (run under YARN and then use something like Ranger to restrict who can access each queue). I am, however, struggling to understand how one might restrict access to resources needed by the Spark job.
If I understand correctly, all Spark processes on the worker nodes will run as the spark user. Presumably the spark user itself should have pretty minimal permissions; however, the question then becomes what to do if your Spark job needs to access e.g. SQL Server. The Spark security docs make mention of a key store. Does that mean that a user submitting a job can pass through a principal and keytab with spark-submit, which can be used to authenticate with the external resource as if it were the submitter making the request?
A follow-up question is that the security docs also mention that temporary files (shuffle files etc.) are not encrypted. Does that mean that you have to assume that any data processed by Spark may potentially be leaked to any other user of your Spark cluster? If so, is it possible to use their proposed workaround (use an encrypted partition for this data) to solve this? I'm assuming not, as the spark user itself must have the ability to decrypt this data and all programs are running as this user....

I'm trying to work out how one might enforce security when running
spark jobs on a shared cluster. I understand how one can ensure
unauthorised nodes cannot join the cluster (set up shared-secret
Kerberos auth) and how one can restrict who can submit jobs (run under
YARN and then use something like Ranger to restrict who can access
each queue). I am, however, struggling to understand how one might
restrict access to resources needed by the spark job.
You use YARN queues for that. Each queue can have a minimum amount of resources guaranteed to it. Thus, you define a queue ACL to ensure that only trusted users can submit to the queue, and you define the minimum amount of resources this queue will have.
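For illustration, queue capacity and submit ACLs might be configured along these lines in capacity-scheduler.xml (a sketch only: queue, user, and group names are made up; verify the property names against your Hadoop version):

```xml
<!-- Hypothetical capacity-scheduler.xml fragment for the YARN CapacityScheduler -->
<configuration>
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>analytics,default</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.analytics.capacity</name>
    <!-- guaranteed share of cluster resources, in percent -->
    <value>40</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.analytics.acl_submit_applications</name>
    <!-- format is "users groups": only alice, bob and members of analytics-admins may submit -->
    <value>alice,bob analytics-admins</value>
  </property>
</configuration>
```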
If I understand correctly all Spark processes on the worker nodes will
run as the spark user.
Your understanding is not accurate. With Kerberos enabled (which is a precondition for any security discussion), Spark jobs will be executed as the Kerberos user who launched them. There is an important caveat here: Kerberos usernames must match operating system usernames.
Presumably the spark user itself should have
pretty minimal permissions; however, the question then becomes what to
do if your spark job needs to access e.g. SQL Server. The Spark
security docs make mention of a key store. Does that mean that a user
submitting a job can pass through a principal and keytab with
spark-submit which can be used to authenticate with the external
resource as if it were the submitter making the request?
This key store is used for a different and very specific purpose: supporting TLS encryption for HTTP communication (e.g. the Spark UI). Thus, you cannot use it as a secret store for accessing third-party systems. Overall, in the Hadoop infrastructure there is no built-in way to share credentials with a job, so the mechanism has to be reinvented each time. As jobs are executed at the OS level on behalf of the users who start them, you can rely on OS controls to distribute credentials to third-party resources (e.g. file system permissions).
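As a concrete sketch of the file-permission approach (this is an illustration of OS-level controls, not a Spark API; the file name and secret are made up): the submitting user stores the SQL Server password in a file only they can read, and the job, running as the same OS user, reads it at runtime.

```python
# Sketch: distributing a per-user secret via OS file permissions.
import os
import stat
import tempfile

cred_path = os.path.join(tempfile.mkdtemp(), "sqlserver.pass")

# Create the secret file with mode 0600 so no other user on the node can read it.
fd = os.open(cred_path, os.O_WRONLY | os.O_CREAT, 0o600)
with os.fdopen(fd, "w") as f:
    f.write("s3cret")

mode = stat.S_IMODE(os.stat(cred_path).st_mode)
print(oct(mode))  # 0o600: owner read/write only

# The executor process, running as the submitting user, can read it back.
with open(cred_path) as f:
    password = f.read()
```

In practice the file would live in the user's home directory (or be shipped via `spark-submit --files`), but the principle is the same: the OS, not Spark, enforces who can read the credential.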
A follow up question is that the security docs also mention that
temporary files (shuffle files etc) are not encrypted. Does that mean
that you have to assume that any data processed by spark may be
potentially leaked to any other user of your spark cluster? If so is
it possible to use their proposed workaround (use an encrypted
partition for this data) to solve this? I'm assuming not, as the spark
user itself must have the ability to decrypt this data and all
programs are running as this user....
There are a couple of things to note. First of all, as already mentioned, a Spark job on a Kerberized cluster will be executed as the user who started the job. All temporary files produced by the job will have file system permissions granting access only to that specific user and the yarn group (which includes only the yarn user). Secondly, disk encryption will protect you from a disk being stolen, but will never guarantee safety against OS-level attacks. Thirdly, as of Spark 2.1, temporary file encryption is available.
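For reference, the Spark 2.1+ temporary file encryption is enabled through the I/O encryption settings, e.g. in spark-defaults.conf (a minimal sketch; check the security docs of your Spark version for the full set of related options):

```
# Encrypt shuffle files and other temporary data spilled to local disk
spark.io.encryption.enabled        true
# Key size in bits for the per-application encryption key (128 is the default)
spark.io.encryption.keySizeBits    128
```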
If you are interested in getting a more in-depth understanding of the Spark-on-YARN security model, I would encourage you to read Apache Spark on YARN Security Model Analysis (disclaimer: I'm the author).

Related

Does calling `cache` on a spark dataframe eliminate the need for future calls to Hive/HDFS?

We have a Spark application that reads data using Spark SQL from HMS tables built on parquet files stored in HDFS. The Spark application is running in a separate Hadoop environment. We use delegation tokens to allow the Spark application to authenticate to Kerberized HMS/HDFS. We cannot and must not use keytabs to authenticate the Spark application directly.
Because delegation tokens expire, after a certain period of time our Spark application will no longer be able to authenticate, and will fail if it has not completed within the timeframe during which the token is valid.
My question is this.
If I call .cache or .persist on the source dataframe against which all subsequent operations are executed, my understanding is that this will cause Spark to store all the data in memory. If all the data is in memory, it should not need to make subsequent calls to read leaf files in HDFS, and the authentication error could be avoided. Note that the Spark application has its own local file system; it is not using the remote HDFS source as its default fs.
Is this assumption about the behavior of .cache or .persist correct, or is the only solution to rewrite the data to intermediate storage?
Solve the Kerberos issue instead of adding workarounds. I'm not sure how you are using the Kerberos principal, but I will point out that the documentation describes a solution for this issue:
Long-Running Applications
Long-running applications may run into issues if their run time
exceeds the maximum delegation token lifetime configured in services
it needs to access.
This feature is not available everywhere. In particular, it’s only
implemented on YARN and Kubernetes (both client and cluster modes),
and on Mesos when using client mode.
Spark supports automatically creating new tokens for these
applications. There are two ways to enable this functionality.
Using a Keytab
By providing Spark with a principal and keytab (e.g.
using spark-submit with --principal and --keytab parameters), the
application will maintain a valid Kerberos login that can be used to
retrieve delegation tokens indefinitely.
Note that when using a keytab in cluster mode, it will be copied over
to the machine running the Spark driver. In the case of YARN, this
means using HDFS as a staging area for the keytab, so it’s strongly
recommended that both YARN and HDFS be secured with encryption, at
least.
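Concretely, the keytab-based renewal described in the quoted docs is requested at submit time, along these lines (principal, keytab path, and script name are illustrative):

```shell
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --principal etl_user@EXAMPLE.COM \
  --keytab /etc/security/keytabs/etl_user.keytab \
  my_job.py
```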
I would also point out that caching will reduce visits to HDFS, but may still require reads from HDFS if there isn't sufficient space in memory. If you don't solve the Kerberos issue because of [reasons], you may wish to instead use checkpoints. They are slower than caching, but are made specifically to help [long-running processes that sometimes fail] get over the hurdle of expensive recalculation; the trade-off is that they require disk to be written to. This will remove any need to revisit the original HDFS cluster. Typically they're used in Streaming to remove data lineage, but they also have their place in expensive long-running Spark applications. (You also need to manage their cleanup.)
How to recover with a checkpoint file.
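The checkpoint approach might look roughly like this in PySpark (a sketch only: it requires a live Spark session, paths and table names are made up, and behavior should be verified against your Spark version):

```python
# Sketch: use checkpointing to cut the lineage back to the remote
# Kerberized HDFS, so later recomputation does not re-read it.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("checkpoint-sketch").getOrCreate()

# Point the checkpoint dir at storage local to this application's
# environment, not at the remote Kerberized HDFS.
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

# Read while the delegation token is still valid...
df = spark.read.parquet("hdfs://remote-cluster/warehouse/some_table")

# ...then materialize the data and truncate the lineage, so subsequent
# actions read from the checkpoint instead of the remote leaf files.
df = df.checkpoint(eager=True)

df.groupBy("some_column").count().show()
```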

Why shouldn't local mode in Spark be used for production?

The official documentation and all sorts of books and articles repeat the recommendation that Spark in local mode should not be used for production purposes. Why not? Why is it a bad idea to run a Spark application on one machine for production purposes? Is it simply because Spark is designed for distributed computing and if you only have one machine there are much easier ways to proceed?
Local mode in Apache Spark is intended for development and testing purposes, and should not be used in production because:
Scalability: Local mode only uses a single machine, so it cannot
handle large data sets or meet the processing needs of a production
environment.
Resource Management: Spark’s standalone cluster manager or a cluster
manager like YARN, Mesos, or Kubernetes provides more advanced
resource management capabilities for production environments compared
to local mode.
Fault Tolerance: Local mode does not have the ability to recover from
failures, while a cluster manager can provide fault tolerance by
automatically restarting failed tasks on other nodes.
Security: Spark’s cluster manager provides built-in security features
such as authentication and authorization, which are not present in
local mode.
Therefore, it is recommended to use a cluster manager for production environments to ensure scalability, resource management, fault tolerance, and security.
I have the same question. I am certainly not an authority on the subject, but because no-one has answered this question, I'll try to list the reasons I've encountered while using Spark local mode in Java. So far:
Spark uses System.exit() calls on certain occasions, such as an out-of-memory error or when the local dir does not have write permissions, so if such a call is triggered, the entire JVM shuts down (including your own application from within which Spark runs; see e.g. https://github.com/apache/spark/blob/0494dc90af48ce7da0625485a4dc6917a244d580/core/src/main/scala/org/apache/spark/util/SparkUncaughtExceptionHandler.scala#L45 and https://github.com/apache/spark/blob/b22946ed8b5f41648a32a4e0c4c40226141a06a0/core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala#L63). Moreover, it circumvents your own shutdownHook, so there is no way to gracefully handle such system exits in your own application. On a cluster it is usually fine if a certain machine restarts, but if all Spark components are contained in a single JVM, the assumption that we can shut down the entire JVM upon a Spark failure does not (always) hold.

Is it possible to know the resources used by a specific Spark job?

I'm exploring the idea of a multi-tenant Spark cluster. The cluster executes jobs on demand for a specific tenant.
Is it possible to "know" the specific resources used by a specific job (for billing reasons)? E.g. if a job requires that several nodes in Kubernetes are automatically allocated, is it then possible to track which Spark jobs (and, ultimately, which tenant) initiated these resource allocations? Or are jobs always evenly spread out over the allocated resources?
I tried to find information on the Apache Spark site and elsewhere on the internet, without success.
See https://spark.apache.org/docs/latest/monitoring.html
You can save data from the Spark History Server as JSON and then write your own resource-calculation logic.
(Strictly speaking, it is a Spark application, not a job, that you are metering.)
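A sketch of such resource accounting over History Server output (the endpoint and field names are taken from the `/api/v1/applications/<app-id>/executors` REST API, but should be verified against your Spark version; the payload here is made up):

```python
# Compute a rough per-application "resource usage" figure from a
# History Server executors payload saved as JSON.
import json

# Trimmed example of /api/v1/applications/<app-id>/executors output.
executors_json = '''
[
  {"id": "1", "totalCores": 4, "totalDuration": 600000},
  {"id": "2", "totalCores": 4, "totalDuration": 540000}
]
'''

def core_milliseconds(payload: str) -> int:
    """Sum cores * total task time over all executors of one application."""
    return sum(e["totalCores"] * e["totalDuration"] for e in json.loads(payload))

print(core_milliseconds(executors_json))  # 4560000
```

Billing per tenant then reduces to mapping application IDs (or YARN/K8s labels) to tenants and aggregating such figures.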

How to use authentication on Spark Thrift server on Spark standalone cluster

I have a standalone spark cluster on Kubernetes and I want to use that to load some temp views in memory and expose them via JDBC using spark thrift server.
I already got it working with no security by submitting a spark job (pyspark in my case) and starting thrift server in this same job so I can access the temp views.
Since I'll need to expose some sensitive data, I want to apply at least an authentication mechanism.
I've been reading a lot and I see basically 2 methods to do so:
PAM - which is not advised for production, since some critical files would need to be readable by users other than root.
Kerberos - which appears to be the most appropriate one for this situation.
My question is:
- For a standalone Spark cluster (running on K8s), is Kerberos the best approach? If not, which one is?
- If Kerberos is the best one, it's really hard to find guidance or a step-by-step walkthrough on how to set up Kerberos to work with Spark Thrift server, especially in my case where I'm not using any specific distribution (MapR, Hortonworks, etc.).
Appreciate your help

How to ensure that DAG is not recomputed after the driver is restarted?

How can I ensure that an entire DAG of spark is highly available i.e. not recomputed from scratch when the driver is restarted (default HA in yarn cluster mode).
Currently, I use spark to orchestrate multiple smaller jobs i.e.
read table1
hash some columns
write to HDFS
this is performed for multiple tables.
Now when the driver is restarted i.e. when working on the second table the first one is reprocessed - though it already would have been stored successfully.
I believe that the default mechanism of checkpointing (the raw input values) would not make sense.
What would be a good solution here?
Is it possible to checkpoint the (small) configuration information and only reprocess what has not already been computed?
TL;DR Spark is not a task orchestration tool. While it has a built-in scheduler and some fault tolerance mechanisms, it is about as suitable for granular task management as it is for, say, server orchestration (hey, we can call pipe on each machine to execute bash scripts, right?).
If you want granular recovery choose a minimal unit of computation that makes sense for a given process (read, hash, write looks like a good choice, based on the description), make it an application and use external orchestration to submit the jobs.
You can build a poor man's alternative by checking if the expected output exists and skipping that part of the job, but really, don't - we have a variety of battle-tested tools which can do a way better job than this.
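For completeness, the "poor man's" skip-if-output-exists pattern looks roughly like this (names and paths are illustrative; as the answer says, prefer an external orchestrator for real pipelines):

```python
# Skip a read -> hash -> write step when its expected output already exists,
# so a restarted driver does not redo completed work.
import os
import tempfile

def run_step(name, output_path, compute):
    """Run one unit of work, or skip it if its output is already present."""
    if os.path.exists(output_path):
        return "skipped"
    result = compute()
    with open(output_path, "w") as f:
        f.write(result)
    return "computed"

out = os.path.join(tempfile.mkdtemp(), "table1.out")
print(run_step("hash-table1", out, lambda: "hashed rows"))  # computed
print(run_step("hash-table1", out, lambda: "hashed rows"))  # skipped (simulated restart)
```

Note this only detects completed outputs; it does nothing about partially written ones, which is one reason dedicated orchestrators do a better job.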
As a side note Spark doesn't provide HA for the driver, only supervision with automatic restarts. Also independent jobs (read -> transform -> write) create independent DAGs - there is no global DAG and proper checkpoint of the application would require full snapshot of its state (like good old BLCR).
when the driver is restarted (default HA in yarn cluster mode).
When the driver of a Spark application is gone, your Spark application is gone and so are all the cached datasets. That's by default.
You have to use some sort of caching solution like https://www.alluxio.org/ or https://ignite.apache.org/. Both work with Spark, and both claim to offer storage that outlives a Spark application.
There have been times when people used Spark Job Server to share data across Spark applications (which is similar to restarting Spark drivers).
