We've built a Hadoop cluster based on CDH and would now like to implement multi-user access so that each user's data is protected. Is there any good solution for this?
Hadoop usually comes with Linux-style (POSIX) permissions on HDFS: one user cannot access another user's data unless they are the superuser.
Apart from that, you can consider using HDFS ACLs and Apache Sentry.
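For example, here is a minimal sketch of locking down per-user data with POSIX permissions plus an HDFS ACL; the usernames and paths are made up for illustration, and ACLs must be switched on first (dfs.namenode.acls.enabled=true in hdfs-site.xml):

```bash
# Assumes ACLs are enabled: dfs.namenode.acls.enabled=true in hdfs-site.xml.
# "alice", "bob" and the paths are hypothetical examples.

# Restrict a user's home directory to the owner only
hdfs dfs -chmod 700 /user/alice

# Grant one extra user read/execute access on a shared dataset
hdfs dfs -setfacl -m user:bob:r-x /data/shared
hdfs dfs -setfacl -m default:user:bob:r-x /data/shared   # inherited by new files/dirs

# Inspect the resulting ACL
hdfs dfs -getfacl /data/shared
```

Sentry then sits on top of this for role-based, per-service authorization (e.g. Hive, Impala, Solr).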
Related
I have a standalone Spark cluster on Kubernetes and I want to use it to load some temp views in memory and expose them via JDBC using the Spark Thrift Server.
I already got it working with no security by submitting a Spark job (PySpark in my case) and starting the Thrift Server in that same job, so I can access the temp views.
Since I'll need to expose some sensitive data, I want to apply at least an authentication mechanism.
I've been reading a lot and I see basically 2 methods to do so:
PAM - which is not advised for production, since some critical files need to be readable by users other than root.
Kerberos - which appears to be the most appropriate one for this situation.
My questions are:
- For a standalone Spark cluster (running on K8s), is Kerberos the best approach? If not, which one is?
- If Kerberos is the best option, it's really hard to find guidance or a step-by-step walkthrough on how to set up Kerberos to work with the Spark Thrift Server, especially in my case where I'm not using any specific distribution (MapR, Hortonworks, etc.).
Appreciate your help
I'm trying to work out how one might enforce security when running Spark jobs on a shared cluster. I understand how one can ensure unauthorised nodes cannot join the cluster (set up shared-secret Kerberos auth) and how one can restrict who can submit jobs (run under YARN and then use something like Ranger to restrict who can access each queue). I am, however, struggling to understand how one might restrict access to resources needed by the Spark job.
If I understand correctly, all Spark processes on the worker nodes will run as the spark user. Presumably the spark user itself should have pretty minimal permissions; however, the question then becomes what to do if your Spark job needs to access, e.g., SQL Server. The Spark security docs make mention of a key store. Does that mean that a user submitting a job can pass a principal and keytab with spark-submit, which can then be used to authenticate with the external resource as if it were the submitter making the request?
A follow-up question: the security docs also mention that temporary files (shuffle files etc.) are not encrypted. Does that mean you have to assume that any data processed by Spark may potentially be leaked to any other user of your Spark cluster? If so, is it possible to use their proposed workaround (use an encrypted partition for this data) to solve this? I'm assuming not, as the spark user itself must be able to decrypt this data and all programs are running as this user...
"I'm trying to work out how one might enforce security when running spark jobs on a shared cluster. I understand how one can ensure unauthorised nodes cannot join the cluster (setup shared secret kerberos auth) and how one can restrict who can submit jobs (run under yarn and then use something like ranger to restrict who can access each queue). I am however, struggling to understand how one might restrict access to resources needed by the spark job."
You use YARN queues to do that. Each queue can be guaranteed a minimum amount of cluster resources. So you define a queue ACL to ensure that only trusted users can submit to the queue, and you define the minimum amount of resources that queue will have.
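As a rough sketch (the "analytics" queue and the group names are invented for the example), the Capacity Scheduler ACLs and capacity guarantees live in capacity-scheduler.xml, followed by a queue refresh:

```bash
# Sketch only: merge these properties into capacity-scheduler.xml by hand.
# Note that the root queue's acl_submit_applications defaults to "*",
# so it needs to be restricted as well for queue ACLs to be effective.
cat > /tmp/analytics-queue-fragment.xml <<'EOF'
<property>
  <name>yarn.scheduler.capacity.root.analytics.capacity</name>
  <value>30</value>   <!-- guaranteed minimum share of cluster resources -->
</property>
<property>
  <!-- value format is "<users> <groups>"; a leading space means "no users, only these groups" -->
  <name>yarn.scheduler.capacity.root.analytics.acl_submit_applications</name>
  <value> datasci</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.analytics.acl_administer_queue</name>
  <value> hadoop-admins</value>
</property>
EOF

# After updating capacity-scheduler.xml, reload the scheduler configuration
yarn rmadmin -refreshQueues
```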
"If I understand correctly all Spark processes on the worker nodes will run as the spark user."
Your understanding is not accurate. With Kerberos enabled (which is a precondition for any security discussion), Spark jobs are executed as the Kerberos user who launched them. There is an important caveat here: the Kerberos usernames must match the operating system usernames.
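Concretely (the principal and job name below are placeholders), on a Kerberized YARN cluster the submitter's identity travels with the job:

```bash
# Hypothetical user "alice"; on a secure cluster YARN's LinuxContainerExecutor
# launches the containers as the OS account matching the Kerberos principal.
kinit alice@EXAMPLE.COM

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  my_job.py

# Executors, and any HDFS files the job writes, are owned by "alice",
# not by a shared "spark" service account.
```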
"Presumably the spark user itself should have pretty minimal permissions, however the question then becomes what to do if your spark job needs to access e.g. sql server. The Spark security docs make mention of a key store. Does that mean that a user submitting a job can pass through a principal and keytab with spark-submit which can be used to authenticate with the external resource as if it were the submitter making the request?"
This key store is used for a different and very specific purpose: supporting TLS encryption of HTTP communication (e.g. the Spark UI). So you cannot use it as secret storage for accessing third-party systems. Overall, in the Hadoop infrastructure there is no built-in way to share credentials with a job, so the mechanism tends to be reinvented every time. Since jobs are executed at the OS level on behalf of the users who start them, you can rely on OS controls to distribute credentials to third-party resources (e.g. file system permissions).
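A hedged sketch of what that can look like in practice; the file names and the idea of reading the secret from a shipped file are illustrative, not a standard Spark mechanism:

```bash
# Keep the third-party credential in a file only the submitting user can read.
mkdir -p ~/secrets
echo 'my-jdbc-password' > ~/secrets/sqlserver.pass
chmod 600 ~/secrets/sqlserver.pass

# --files stages the file into each executor's working directory; the job code
# then reads the local file "sqlserver.pass" when opening the JDBC connection.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --files ~/secrets/sqlserver.pass \
  my_job.py
```

(For Hadoop services themselves, spark-submit's --principal/--keytab options handle Kerberos ticket handling and renewal, but they do not help with external systems such as SQL Server.)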
"A follow up question is that the security docs also mention that temporary files (shuffle files etc) are not encrypted. Does that mean that you have to assume that any data processed by spark may be potentially leaked to any other user of your spark cluster? If so is it possible to use their proposed workaround (use an encrypted partition for this data) to solve this? I'm assuming not as the spark user itself must have the ability to decrypt this data and all programs are running as this user..."
There are a couple of things to note. First of all, as already mentioned, a Spark job on a Kerberized cluster is executed as the user who started the job. All temporary files produced by the job have file system permissions granting access only to that specific user and the yarn group (which includes only the yarn user). Secondly, disk encryption protects you if a disk is stolen, but it never guarantees safety against OS-level attacks. Thirdly, as of Spark 2.1, encryption of temporary files is available.
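If you are on Spark 2.1 or later, the relevant settings are roughly the following (a sketch; check the security documentation of your exact Spark version, and note that my_job.py is a placeholder):

```bash
# spark.authenticate is a prerequisite for the encryption settings below.
# spark.io.encryption.enabled encrypts local/shuffle spill files (Spark 2.1+);
# spark.network.crypto.enabled encrypts RPC traffic between Spark processes (Spark 2.2+).
spark-submit \
  --master yarn \
  --conf spark.authenticate=true \
  --conf spark.io.encryption.enabled=true \
  --conf spark.network.crypto.enabled=true \
  my_job.py
```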
If you are interested in a more in-depth understanding of the Spark-on-YARN security model, I would encourage you to read Apache Spark on YARN Security Model Analysis (disclaimer: I'm the author).
From the wikis provided by these two projects, it seems that they do a similar job. But there must be some difference, or there would be no need for two of them.
So what are the differences, and what is the practical advice for choosing one over the other?
Thanks a lot!
Great answers above.
Just a quick update following the Cloudera + Hortonworks merger last year.
These companies have decided to standardize on Ranger.
CDH5 and CDH6 will still use Sentry until the CDH product line retires in ~2-3 years.
Ranger will be used for Cloudera+Hortonworks' combined "Unity" platform / CDP product.
Cloudera were saying to us that Ranger is a more "mature" product.
Since Unity hasn't released yet (as of May 2019), something may come up in the future, but that's the current direction. (Oct 2019 update: Unity is now known as CDP and is available for beta testing; will be available for cloud deployments soon, and in 2020 for on-prem customers)
If you're a customer of pre-merger Cloudera (i.e. a CDH user), you would still have to use Apache Sentry for now. There is significant overlap between Sentry and Ranger, but if you are starting fresh, definitely look at Ranger.
Whether you use Sentry or Ranger depends on which Hadoop distribution you are using, e.g. Cloudera or Hortonworks.
Apache Sentry - backed by Cloudera. Supports HDFS, Hive, Solr, and Impala. (Ranger does not support Impala.)
Apache Ranger - backed by Hortonworks. Apache Ranger offers a centralized security framework to manage fine-grained access control across HDFS, Hive, HBase, Storm, Knox, Solr, Kafka, and YARN.
https://cwiki.apache.org/confluence/display/SENTRY/Sentry+Tutorial
http://hortonworks.com/apache/ranger/
Thx Kumar
Apache Ranger overlaps with Apache Sentry since it also deals with authorization and permissions. It adds an authorization layer to Hive, HBase, and Knox. Both Sentry and Ranger support column-level permissions in Hive (starting from the 1.5 release).
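For instance, with Sentry enabled, a column-level grant in Hive looks roughly like this (the database, table, column, role, and group names are invented for the example):

```bash
# Hypothetical column-level grant issued through beeline against HiveServer2;
# requires Sentry to be enabled and the connecting user to be a Sentry admin.
beeline -u "jdbc:hive2://hiveserver:10000/default;principal=hive/_HOST@EXAMPLE.COM" -e "
  CREATE ROLE analysts;
  GRANT SELECT(customer_id, order_total) ON TABLE sales.orders TO ROLE analysts;
  GRANT ROLE analysts TO GROUP analyst_group;
"
```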
Ref: https://www.xplenty.com/blog/2014/11/5-hadoop-security-projects/
You can also check out RecordService.
RecordService provides an abstraction layer between compute frameworks and data storage; it offers row- and column-level security, among other advantages.
Ref: http://blog.cloudera.com/blog/2015/09/recordservice-for-fine-grained-security-enforcement-across-the-hadoop-ecosystem/
http://recordservice.io/
Both manage permissions based on role-table grants. Ranger also provides dynamic data masking (in transit). Both are integrated with Informatica's Secure at Source (which identifies risky data stores in the enterprise) to deliver a data governance solution.
I'm trying to implement security on my Hadoop data. I'm using Cloudera Hadoop.
Below are the two specific things I'm looking for:
1. Role-based authorization and authentication
2. Encryption of data residing in HDFS
I have looked into Kerberos, but it doesn't provide encryption for data already residing in HDFS.
Are there any other security tools I can go for? Has anyone implemented the above two security features on Cloudera Hadoop?
Please suggest.
I think Apache Sentry will be best for you. You can find more information here.
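That covers requirement 1 (role-based authorization). For requirement 2, encrypting data residing in HDFS, the usual mechanism on CDH is HDFS transparent encryption (encryption zones). A rough sketch, assuming a Hadoop KMS is already configured and using made-up key and path names; note that pre-existing files must be copied into the zone to become encrypted:

```bash
# Sketch only: assumes the Hadoop KMS is set up; key and path names are examples.
hadoop key create sensitive-data-key

# An encryption zone can only be created on an empty directory
hdfs dfs -mkdir /secure
hdfs crypto -createZone -keyName sensitive-data-key -path /secure

# Existing data is not encrypted in place; copy it into the zone
hadoop distcp /data/existing /secure/existing
```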
Do you need to set up a Linux cluster first in order to set up a Hadoop cluster?
No. Hadoop has its own software to manage a "cluster". Just install Linux and make sure the machines can talk to each other.
Deploying the Hadoop software, along with the appropriate config files, and starting it on each node (which Hadoop can do automatically) creates the cluster from the Linux machines you have. So, no, by that definition you don't need a separate Linux cluster. If your question is whether you need a multi-machine cluster to use Hadoop: no, you can run Hadoop on a single machine for testing or small jobs, via either local mode (where everything is confined to a single process) or pseudo-distributed mode (where you trick Hadoop into thinking it's running on multiple computers).
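As a minimal illustration of pseudo-distributed mode on a single Linux box (this follows the standard single-node setup; the port and file paths are the usual defaults and are assumptions about your layout):

```bash
# Assumes Java is installed and passwordless SSH to localhost works.
# Point HDFS at the local machine and keep a single replica.
cat > $HADOOP_HOME/etc/hadoop/core-site.xml <<'EOF'
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF

cat > $HADOOP_HOME/etc/hadoop/hdfs-site.xml <<'EOF'
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF

# Format the namenode once, then start the HDFS daemons on this one host
hdfs namenode -format
start-dfs.sh
```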