How to use authentication on Spark Thrift Server on a Spark standalone cluster

I have a standalone Spark cluster on Kubernetes and I want to use it to load some temp views in memory and expose them via JDBC using the Spark Thrift Server.
I already got it working with no security by submitting a Spark job (PySpark in my case) and starting the Thrift Server inside that same job so I can access the temp views.
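For reference, the working no-security setup looks roughly like this (a rough sketch; it assumes a Spark build that includes the Hive Thrift Server module, and the py4j accessor for the underlying SQLContext differs between Spark versions):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("thrift-views")
         .config("hive.server2.thrift.port", "10000")
         .enableHiveSupport()
         .getOrCreate())

# Register the temp views that should be reachable over JDBC (path and view name are placeholders).
spark.read.parquet("/data/example").createOrReplaceTempView("example_view")

# Start the Thrift Server inside this same job via the JVM gateway.
# Older releases expose the Java SQLContext as spark._jwrapped instead.
spark._jvm.org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 \
    .startWithContext(spark._jsparkSession.sqlContext())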
Since I'll need to expose some sensitive data, I want to apply at least an authentication mechanism.
I've been reading a lot and I see basically 2 methods to do so:
PAM - which is not advised for production, since some critical files would need to be readable by users other than root.
Kerberos - which appears to be the most appropriate one for this situation.
My question is:
- For a standalone Spark cluster (running on K8s), is Kerberos the best approach? If not, which one?
- If Kerberos is the best one, it's really hard to find guidance or a step-by-step on how to set up Kerberos to work with the Spark Thrift Server, especially in my case where I'm not using any specific distribution (MapR, Hortonworks, etc.).
Appreciate your help

Related

Spark job as a web service?

A peer of mine has created code that opens a RESTful API web service inside an interactive Spark job. The intent of our company is to use his code as a means of extracting data from various data sources. He can get it to work on his machine with a local instance of Spark. He insists that this is a good idea and it is my job as DevOps to implement it with Azure Databricks.
As I understand it, interactive jobs are meant for one-off analytics queries and for developing non-interactive jobs that then run solely as ETL/ELT work between data sources. There is of course the added problem of determining the endpoint for the service binding within the Spark cluster.
But I'm new to spark and I have scarcely delved into the mountain of documentation that exists for all the implementations of spark. Is what he's trying to do a good idea? Is it even possible?
The web service would need to act as a Spark driver. Just as you'd run spark-shell, run some commands, and then use collect() to bring all the data back to be shown in the local environment, that all runs in a single JVM. The service would request executors on a remote Spark cluster, then bring the data back over the network. Apache Livy is one existing implementation of a REST Spark submission server.
It can be done, but depending on the process it would be very asynchronous, and it is not well suited to the large datasets Spark is meant for. Depending on the data you need (e.g. if you're making heavy use of Spark SQL), it may be better to query a database directly.
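If you do go the Livy route, submission is a plain HTTP call. A minimal sketch (the Livy host, port and job path are assumptions):

import requests

LIVY_URL = "http://livy-host:8998"  # assumed Livy endpoint

# Submit a batch job; the file must be reachable by the cluster (e.g. an HDFS path).
payload = {
    "file": "hdfs:///jobs/my_spark_job.py",      # placeholder job
    "conf": {"spark.executor.memory": "2g"},
}
resp = requests.post(LIVY_URL + "/batches", json=payload)
resp.raise_for_status()
batch_id = resp.json()["id"]

# Poll for the batch state ("running", "success", "dead", ...).
state = requests.get("{}/batches/{}".format(LIVY_URL, batch_id)).json()["state"]
print(batch_id, state)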

Does calling `cache` on a spark dataframe eliminate the need for future calls to Hive/HDFS?

We have a Spark application that reads data using Spark SQL from HMS tables built on Parquet files stored in HDFS. The Spark application runs in a separate Hadoop environment. We use delegation tokens to allow the Spark application to authenticate to the Kerberized HMS/HDFS. We cannot and must not use keytabs to authenticate the Spark application directly.
Because delegation tokens expire, after a certain period of time our Spark application will no longer be able to authenticate, and it will fail if it has not completed within the timeframe during which the token is valid.
My question is this.
If I call .cache or .persist on the source dataframe against which all subsequent operations are executed, my understanding is that this will cause Spark to store all the data in memory. If all the data is in memory, it should not need to make subsequent calls to read leaf files in HDFS, and the authentication error could be avoided. Note that the Spark application has its own local file system; it is not using the remote HDFS source as its default FS.
Is this assumption about the behavior of .cache or .persist correct, or is the only solution to rewrite the data to intermediate storage?
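For context, the pattern I have in mind is roughly the following (paths are placeholders); since caching is lazy, I'd run a full action up front so the cached copy is built while the token is still valid:

from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.appName("cache-test").getOrCreate()

df = spark.read.parquet("hdfs://remote-nn/warehouse/source_table")  # placeholder remote source
df = df.persist(StorageLevel.MEMORY_AND_DISK)
df.count()  # full scan that populates the cache before the delegation token expires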
Solve the Kerberos issue instead of adding workarounds. I'm not sure how you are using the Kerberos principal, but I will point out that the documentation describes a solution for this issue:
Long-Running Applications
Long-running applications may run into issues if their run time exceeds the maximum delegation token lifetime configured in services it needs to access.
This feature is not available everywhere. In particular, it's only implemented on YARN and Kubernetes (both client and cluster modes), and on Mesos when using client mode.
Spark supports automatically creating new tokens for these applications. There are two ways to enable this functionality.
Using a Keytab
By providing Spark with a principal and keytab (e.g. using spark-submit with --principal and --keytab parameters), the application will maintain a valid Kerberos login that can be used to retrieve delegation tokens indefinitely.
Note that when using a keytab in cluster mode, it will be copied over to the machine running the Spark driver. In the case of YARN, this means using HDFS as a staging area for the keytab, so it's strongly recommended that both YARN and HDFS be secured with encryption, at least.
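For completeness, the same flags can also be supplied as configuration keys; in Spark 3.x they are spark.kerberos.principal and spark.kerberos.keytab (older releases used spark.yarn.principal and spark.yarn.keytab). They are normally passed to spark-submit, but a sketch with placeholder values looks like this:

from pyspark.sql import SparkSession

# Equivalent of `spark-submit --principal ... --keytab ...`; principal and path are placeholders.
spark = (SparkSession.builder
         .appName("long-running-app")
         .config("spark.kerberos.principal", "svc_spark@EXAMPLE.COM")
         .config("spark.kerberos.keytab", "/etc/security/keytabs/svc_spark.keytab")
         .getOrCreate())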
I would also point out that caching will reduce visits to HDFS, but it may still require reads from HDFS if there isn't sufficient space in memory. If you can't solve the Kerberos issue because of [reasons], you may wish to use checkpoints instead. They are slower than caching, but they are made specifically to help [long-running processes that sometimes fail] get over that hurdle of expensive recalculation, though they do require disk to be written to. This removes any need to revisit the original HDFS cluster. Typically they're used in streaming to truncate data lineage, but they also have their place in expensive long-running Spark applications. (You also need to manage their cleanup.)
How to recover with a checkpoint file.
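A minimal checkpointing sketch (the checkpoint directory is a placeholder and should live on storage the application can always reach, not the remote Kerberized HDFS):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("checkpoint-example").getOrCreate()

# Directory where checkpointed data is written; placeholder path.
spark.sparkContext.setCheckpointDir("hdfs://local-cluster/tmp/app-checkpoints")

df = spark.read.parquet("hdfs://remote-nn/warehouse/source_table")  # placeholder remote source
df = df.checkpoint()  # materializes the data and cuts the lineage back to the remote files

# Later operations read the checkpointed copy instead of the original leaf files.
df.groupBy("some_column").count().show()  # "some_column" is a placeholder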

why Livy or spark-jobserver instead of a simple web framework?

I'm building a RESTful API on top of Apache Spark. Serving the following Python script with spark-submit seems to work fine:
import cherrypy
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('myApp').getOrCreate()
sc = spark.sparkContext

class doStuff(object):
    @cherrypy.expose
    def compute(self, user_input):
        # do something spark-y with the user input
        return user_output

cherrypy.quickstart(doStuff())
But googling around I see things like Livy and spark-jobserver. I read these projects' documentation and a couple of tutorials but I still don't fully understand the advantages of Livy or spark-jobserver over a simple script with CherryPy or Flask or any other web framework. Is it about scalability? Context management? What am I missing here? If what I want is a simple RESTful API with not many users, are Livy or spark-jobserver worth the trouble? If so, why?
If you use spark-submit, you must manually upload the JAR file to the cluster and run the command. Everything must be prepared before the run.
If you use Livy or spark-jobserver, you can programmatically upload the file and run the job. You can add additional applications that connect to the same cluster and upload a JAR with the next job.
What's more, Livy and spark-jobserver allow you to use Spark in interactive mode, which is hard to do with spark-submit ;)
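To make the interactive-mode point concrete, here's a rough sketch of Livy's sessions API (host and code are assumptions): you create a long-lived session once, then keep posting statements to it and reading results back.

import time
import requests

LIVY_URL = "http://livy-host:8998"  # assumed Livy endpoint

# Create an interactive PySpark session that stays alive between requests.
session_id = requests.post(LIVY_URL + "/sessions", json={"kind": "pyspark"}).json()["id"]

# Wait until the session is ready to accept statements.
while requests.get("{}/sessions/{}".format(LIVY_URL, session_id)).json()["state"] != "idle":
    time.sleep(1)

# Run a statement in the live session.
stmt = requests.post(
    "{}/sessions/{}/statements".format(LIVY_URL, session_id),
    json={"code": "spark.range(100).count()"},
).json()

# Poll the statement until its result is available.
while True:
    result = requests.get(
        "{}/sessions/{}/statements/{}".format(LIVY_URL, session_id, stmt["id"])
    ).json()
    if result["state"] == "available":
        print(result["output"])
        break
    time.sleep(1)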
I won't comment on using Livy or spark-jobserver specifically, but there are at least three reasons to avoid embedding the Spark context directly in your application:
Security, with the main focus on reducing the exposure of your cluster to the outside world. An attacker who gains control over your application can do anything from gaining access to your data to executing arbitrary code on your cluster, if the cluster is not correctly configured.
Stability. Spark is a complex framework and there are many factors which can affect its long-term performance and stability. Decoupling the Spark context from the application lets you handle Spark issues gracefully, without full downtime of your application.
Responsiveness. The user-facing Spark API is mostly (in PySpark, exclusively) synchronous. Using an external service basically solves this problem for you.

Securing Apache Spark

I'm trying to work out how one might enforce security when running Spark jobs on a shared cluster. I understand how one can ensure unauthorised nodes cannot join the cluster (set up shared-secret Kerberos auth) and how one can restrict who can submit jobs (run under YARN and then use something like Ranger to restrict who can access each queue). I am, however, struggling to understand how one might restrict access to the resources needed by the Spark job.
If I understand correctly, all Spark processes on the worker nodes will run as the spark user. Presumably the spark user itself should have pretty minimal permissions, but the question then becomes what to do if your Spark job needs to access e.g. SQL Server. The Spark security docs make mention of a key store. Does that mean that a user submitting a job can pass through a principal and keytab with spark-submit, which can be used to authenticate with the external resource as if it were the submitter making the request?
A follow-up question is that the security docs also mention that temporary files (shuffle files etc.) are not encrypted. Does that mean that you have to assume that any data processed by Spark may potentially be leaked to any other user of your Spark cluster? If so, is it possible to use their proposed workaround (use an encrypted partition for this data) to solve this? I'm assuming not, as the spark user itself must have the ability to decrypt this data and all programs are running as this user....
I'm trying to work out how one might enforce security when running spark jobs on a shared cluster. I understand how one can ensure unauthorised nodes cannot join the cluster (setup shared secret kerberos auth) and how one can restrict who can submit jobs (run under yarn and then use something like ranger to restrict who can access each queue). I am however, struggling to understand how one might restrict access to resources needed by the spark job.
You use YARN queues to do that. Each queue can have a minimum amount of resources reserved for it. So you define a queue ACL to ensure that only trusted users can submit to the queue, and you define the minimum amount of resources this queue will have.
If I understand correctly all Spark processes on the worker nodes will run as the spark user.
Your understanding is not accurate. With Kerberos enabled (which is a precondition for any security discussion), Spark jobs are executed as the Kerberos user who launched them. There is an important caveat here: Kerberos usernames must match operating system usernames.
Presumably the spark user itself should have pretty minimal permissions, however the question then becomes what to do if your spark job needs to access e.g. sql server. The Spark security docs make mention of a key store. Does that mean that a user submitting a job can pass through a principal and keytab with spark-submit which can be used to authenticate with the external resource as if it were the submitter making the request.
This key store is used for a different and very specific purpose: supporting TLS encryption for HTTP communication (e.g. the Spark UI). Thus, you cannot use it as a secret store for accessing third-party systems. Overall, in the Hadoop infrastructure there is no way to share credentials with the job, so the mechanism has to be reinvented every time. As jobs are executed at the OS level on behalf of the users that start them, you can rely on OS controls to distribute credentials to third-party resources (e.g. file system permissions).
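For illustration, the kind of thing that key store actually configures looks roughly like this (these keys normally go in spark-defaults.conf; paths and passwords below are placeholders):

from pyspark.sql import SparkSession

# TLS for Spark's HTTP endpoints such as the UI; not a credential store for external systems.
spark = (SparkSession.builder
         .appName("tls-ui-example")
         .config("spark.ssl.enabled", "true")
         .config("spark.ssl.keyStore", "/etc/spark/conf/keystore.jks")
         .config("spark.ssl.keyStorePassword", "changeit")
         .config("spark.ssl.trustStore", "/etc/spark/conf/truststore.jks")
         .config("spark.ssl.trustStorePassword", "changeit")
         .getOrCreate())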
A follow up question is that the security docs also mention that temporary files (shuffle files etc) are not encrypted. Does that mean that you have to assume that any data processed by spark may be potentially leaked to any other user of your spark cluster? If so is it possible to use their proposed workaround (use an encrypted partition for this data) to solve this? I'm assuming not as the spark user itself must have the ability to decrypt this data and all programs are running as this user....
There are a couple of things to note. First of all, as already mentioned, a Spark job on a Kerberized cluster is executed as the user who started the job. All temporary files produced by the job will have file system permissions granting access only to that specific user and the yarn group (which includes only the yarn user). Secondly, disk encryption will protect you from the disk being stolen, but it will never guarantee safety against OS-level attacks. Thirdly, as of Spark 2.1, encryption of temporary files is available.
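That local-storage encryption is switched on through configuration; a small sketch of the relevant keys (values are only illustrative, and like most security settings they are usually set cluster-wide rather than per application):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("encrypted-shuffle")
         # Encrypt temporary data Spark writes to local disks (shuffle, spill); available since Spark 2.1.
         .config("spark.io.encryption.enabled", "true")
         .config("spark.io.encryption.keySizeBits", "256")  # optional; the default key size is 128
         .getOrCreate())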
If you are interested in a more in-depth understanding of the Spark-on-YARN security model, I would encourage you to read Apache Spark on YARN Security Model Analysis (disclaimer: I'm the author).

Typical Hadoop setup for remote job submission

So I am still a bit new to Hadoop and am currently in the process of setting up a small test cluster on Amazon AWS. My question relates to some tips on structuring the cluster so that it is possible to submit jobs from remote machines.
Currently I have 5 machines. 4 of them basically form the Hadoop cluster with the NameNodes, YARN, etc. One machine is used as a manager machine (Cloudera Manager). I am going to describe my thinking on the setup, and if anyone can chime in on the points I am not clear on, that would be great.
I was thinking about what the best setup for a small cluster would be. So I decided to expose only the manager machine and probably use that to submit all the jobs through it. The other machines will see each other, etc., but not be accessible from the outside world. I have a conceptual idea of how to do this, but I am not sure how to properly go about it; if anyone could point me in the right direction, that would be great.
Another big point is that I want to be able to submit jobs to the cluster through the exposed machine from a client machine (which might be Windows). I am not so clear on this setup either. Do I need to have Hadoop installed on that machine in order to use the normal hadoop commands and to write/submit jobs, say from Eclipse or something similar?
So to sum it up, my questions are:
Is this an OK setup for a small test cluster?
How can I go about using one exposed machine to submit/route jobs to the cluster, without having any of the Hadoop nodes on it?
How do I set up a client machine to submit jobs to a remote cluster, and is there an example of how to do it on Windows? Also, are there any reasons not to use Windows as a client machine in this setup?
Thanks, I would greatly appreciate any advice or help on this.
Since this has not been answered, I will attempt to answer it.
1. REST API to submit an application (a minimal code sketch follows at the end of this list):
Resource 1(Cluster Applications API(Submit Application)): https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Applications_APISubmit_Application
Resource 2: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.5/bk_yarn-resource-management/content/ch_yarn_rest_apis.html
Resource 3: https://hadoop-forum.org/forum/general-hadoop-discussion/miscellaneous/2136-how-can-i-run-mapreduce-job-by-rest-api
Resource 4: Run a MapReduce job via rest api
2. Submitting a Hadoop job from a client machine
Resource 1: https://pravinchavan.wordpress.com/2013/06/18/submitting-hadoop-job-from-client-machine/
3. Sending a program to a remote Hadoop cluster
It is possible to send a program to a remote Hadoop cluster to run it. All you need to ensure is that you have set the ResourceManager address, fs.defaultFS, library files, and mapreduce.framework.name correctly before running the actual job.
Resource 1: (how to submit mapreduce job with yarn api in java)
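As mentioned under point 1, a rough sketch of the ResourceManager's Cluster Applications API looks like this (hostname, jar path and launch command are placeholders; see the linked docs for the full submission schema):

import requests

RM = "http://resourcemanager-host:8088"  # assumed ResourceManager web address

# Step 1: ask the ResourceManager for a new application id.
app_id = requests.post(RM + "/ws/v1/cluster/apps/new-application").json()["application-id"]

# Step 2: submit the application. A real submission also needs local resources
# (the job jar uploaded to HDFS) and environment settings; this is only the bare shape.
payload = {
    "application-id": app_id,
    "application-name": "rest-submitted-job",
    "application-type": "MAPREDUCE",
    "am-container-spec": {
        "commands": {"command": "yarn jar /path/my-job.jar MyMainClass"}  # placeholder command
    },
    "resource": {"memory": 1024, "vCores": 1},
}
requests.post(RM + "/ws/v1/cluster/apps", json=payload).raise_for_status()

# The application's status can then be polled.
print(requests.get("{}/ws/v1/cluster/apps/{}".format(RM, app_id)).json())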
