I'm trying to implement security on my Hadoop data. I'm using Cloudera Hadoop.
Below are the two specific things I'm looking for:
1. Role-based authorization and authentication
2. Encryption on data residing in HDFS
I have looked into Kerberos, but it doesn't provide encryption for data already residing in HDFS.
Are there any other security tools I can go for? Has anyone implemented the above two security features in Cloudera Hadoop?
Please suggest.
I think Apache Sentry will be best for you. You can find more information here.
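Sentry covers the role-based authorization and authentication side. For point 2, encryption of data at rest is normally handled with HDFS transparent encryption (encryption zones backed by the Hadoop KMS), which CDH also ships. Note that creating a zone does not retroactively encrypt files that are already in place; existing data has to be copied into the zone (e.g. with distcp). As a minimal sketch, assuming the KMS is configured and a key with the made-up name "myKey" has already been created:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.client.HdfsAdmin;

public class EncryptionZoneExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode URI; replace with your cluster's fs.defaultFS.
        URI nameNode = new URI("hdfs://namenode:8020");

        // The zone directory must exist and be empty before it becomes a zone.
        Path zone = new Path("/secure/data");
        FileSystem fs = FileSystem.get(nameNode, conf);
        fs.mkdirs(zone);

        // Requires HDFS superuser privileges and a KMS key named "myKey"
        // (e.g. created beforehand with `hadoop key create myKey`).
        HdfsAdmin admin = new HdfsAdmin(nameNode, conf);
        admin.createEncryptionZone(zone, "myKey");
    }
}
```

Everything written into /secure/data from then on is encrypted at rest and decrypted transparently for authorized clients; the same zone can also be created from the shell with `hdfs crypto -createZone -keyName myKey -path /secure/data`.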
I have a standalone Spark cluster on Kubernetes, and I want to use it to load some temp views in memory and expose them via JDBC using the Spark Thrift Server.
I already got it working with no security by submitting a Spark job (PySpark in my case) and starting the Thrift Server in that same job, so I can access the temp views.
Since I'll need to expose some sensitive data, I want to apply at least an authentication mechanism.
I've been reading a lot and I see basically 2 methods to do so:
PAM - which is not advised for production, since some critical files need to be granted permissions to users besides root.
Kerberos - which appears to be the most appropriate one for this situation.
My question is:
- For a standalone Spark cluster (running on K8s), is Kerberos the best approach? If not, which one?
- If Kerberos is the best one, it's really hard to find guidance or a step-by-step on how to set up Kerberos to work with the Spark Thrift Server, especially in my case where I'm not using any specific distribution (MapR, Hortonworks, etc.).
Appreciate your help
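For reference, this is roughly how I expect a client to connect once Kerberos is in place on the Thrift Server side (i.e. hive.server2.authentication set to KERBEROS plus a service principal and keytab in its Hive configuration). Hostnames, principals, and keytab paths below are placeholders, and the Hive JDBC driver is assumed to be on the classpath:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

public class KerberizedThriftClient {
    public static void main(String[] args) throws Exception {
        // Switch the Hadoop client libraries to Kerberos authentication.
        Configuration conf = new Configuration();
        conf.set("hadoop.security.authentication", "kerberos");
        UserGroupInformation.setConfiguration(conf);

        // Placeholder principal/keytab; alternatively run `kinit` beforehand
        // and rely on the ticket cache instead of a keytab login.
        UserGroupInformation.loginUserFromKeytab(
                "analyst@EXAMPLE.COM", "/etc/security/keytabs/analyst.keytab");

        // Register the Hive JDBC driver (provided by the hive-jdbc artifact).
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // The `principal` part must match the service principal the Thrift Server runs as.
        String url = "jdbc:hive2://thrift-server.example.com:10000/default;"
                + "principal=spark/thrift-server.example.com@EXAMPLE.COM";

        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT * FROM my_temp_view LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}
```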
I have been looking for a way to secure Parquet files, column-wise, for Spark access. Ideally, that would work the same way Apache Ranger works for Hive, i.e., a Sysadmin defines the access policies for different groups and columns.
I have been trying Ranger through Hortonworks HDP; however, it seems that the plug-ins for Spark and Parquet are not there yet.
I have also been able to devise a solution using Apache Drill and views; however, it is not acceptable right now, mainly because of the still scarce community support for Drill.
Has anyone faced the same requirement and/or have some directions for a solution?
After a great deal of research, I've come to a conclusion that this is not possible.
The way Ranger works with other tools (HDFS, Hive, HBase, etc.) is by using plug-ins that implement hooks provided by those tools. For instance, to create a custom plug-in to secure Hive, one needs to create a HiveAuthorizer through the HiveAuthorizerFactory. But there's no such hook for Parquet, as it is nothing more than a file format.
A possible solution that would allow securing Parquet files at the column level from Ranger is to create an extension for Ranger's HDFS plug-in. This extension would implement the access rules for Parquet files defined through Ranger. That way, we could seamlessly secure Parquet files the same way we do for Hive or HBase, as long as the files are stored in HDFS.
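To make that more concrete, such an extension would essentially embed a Ranger plug-in that downloads policies from Ranger Admin and evaluates each access against them. Below is a rough sketch of that evaluation path using RangerBasePlugin; the "parquet" service type and the path/column resource keys are hypothetical (they would need a matching custom service definition registered in Ranger Admin), and the plug-in API details can vary between Ranger versions:

```java
import java.util.Set;

import org.apache.ranger.plugin.policyengine.RangerAccessRequestImpl;
import org.apache.ranger.plugin.policyengine.RangerAccessResourceImpl;
import org.apache.ranger.plugin.policyengine.RangerAccessResult;
import org.apache.ranger.plugin.service.RangerBasePlugin;

public class ParquetColumnAuthorizer {

    // "parquet" is a hypothetical service type that would have to be defined in Ranger Admin.
    private final RangerBasePlugin plugin =
            new RangerBasePlugin("parquet", "parquet-column-authorizer");

    public ParquetColumnAuthorizer() {
        // Downloads policies from Ranger Admin and keeps them refreshed.
        plugin.init();
    }

    /** Returns true if the user may read the given column of the given file. */
    public boolean canReadColumn(String user, Set<String> groups, String hdfsPath, String column) {
        RangerAccessResourceImpl resource = new RangerAccessResourceImpl();
        resource.setValue("path", hdfsPath);   // resource keys must match the service definition
        resource.setValue("column", column);

        RangerAccessRequestImpl request = new RangerAccessRequestImpl();
        request.setResource(resource);
        request.setAccessType("read");
        request.setUser(user);
        request.setUserGroups(groups);

        RangerAccessResult result = plugin.isAccessAllowed(request);
        return result != null && result.getIsAllowed();
    }
}
```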
We've built a Hadoop cluster based on CDH, and now we would like to implement multi-user support to protect each user's data. Is there any good solution for this?
Usually Hadoop comes with Linux-style permissions. One user cannot access another user's data unless they are the superuser.
Apart from that, you can consider using HDFS ACLs and Sentry.
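For the ACLs, a minimal sketch using the Hadoop FileSystem API (requires dfs.namenode.acls.enabled=true on the NameNode; the directory and user below are made up). The equivalent from the command line is `hdfs dfs -setfacl`:

```java
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.AclEntry;
import org.apache.hadoop.fs.permission.AclEntryScope;
import org.apache.hadoop.fs.permission.AclEntryType;
import org.apache.hadoop.fs.permission.FsAction;

public class HdfsAclExample {
    public static void main(String[] args) throws Exception {
        // Uses fs.defaultFS from the core-site.xml on the classpath.
        FileSystem fs = FileSystem.get(new Configuration());

        // Hypothetical per-user directory that only "alice" should be able to read.
        Path userDir = new Path("/data/alice");

        AclEntry aliceRead = new AclEntry.Builder()
                .setScope(AclEntryScope.ACCESS)
                .setType(AclEntryType.USER)
                .setName("alice")
                .setPermission(FsAction.READ_EXECUTE)
                .build();

        // Grants alice read access on top of the normal owner/group/other bits.
        fs.modifyAclEntries(userDir, Arrays.asList(aliceRead));
    }
}
```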
From the wikis provided by those two projects, it seems they do a similar job. But there must be some difference, or there would be no need for both.
So what are the differences, and what is the practical advice for choosing one over the other?
thx a lot!
Great answers above.
Just quick update with Cloudera+Hortonworks merge last year.
These companies have decided to standardize on Ranger.
CDH5 and CDH6 will still use Sentry until the CDH product line retires in ~2-3 years.
Ranger will be used for Cloudera+Hortonworks' combined "Unity" platform / CDP product.
Cloudera were saying to us that Ranger is a more "mature" product.
Since Unity hasn't released yet (as of May 2019), something may come up in the future, but that's the current direction. (Oct 2019 update: Unity is now known as CDP and is available for beta testing; will be available for cloud deployments soon, and in 2020 for on-prem customers)
If you're a pre-merge Cloudera customer or CDH user, you would still have to use Apache Sentry. There is a significant overlap between Sentry and Ranger, but if you're starting fresh, definitely look at Ranger.
Whether you use Sentry or Ranger depends on which Hadoop distribution you are using, such as Cloudera or Hortonworks.
Apache Sentry - Owned by Cloudera. Supports HDFS, Hive, Solr and Impala. (Ranger will not support Impala)
Apache Ranger - Owned by Hortonworks. Apache Ranger offers a centralized security framework to manage fine-grained access control across: HDFS, Hive, HBase, Storm, Knox, Solr, Kafka, and YARN
https://cwiki.apache.org/confluence/display/SENTRY/Sentry+Tutorial
http://hortonworks.com/apache/ranger/
Thx Kumar
Apache Ranger overlaps with Apache Sentry since it also deals with authorization and permissions. It adds an authorization layer to Hive, HBase, and Knox. Both Sentry and Ranger support column-level permissions in Hive (starting from the 1.5 release).
Ref: https://www.xplenty.com/blog/2014/11/5-hadoop-security-projects/
You can also check RecordService.
RecordService provides an abstraction layer between compute frameworks and data storage. It provides row- and column-level security, and other advantages.
Ref: http://blog.cloudera.com/blog/2015/09/recordservice-for-fine-grained-security-enforcement-across-the-hadoop-ecosystem/
http://recordservice.io/
Both manage permissions based on role-table grants. Ranger provides dynamic data masking (in transit). Both are integrated with Informatica's Secure at Source (which identifies risky data stores in the enterprise) to deliver a data governance solution.
As I have a requirement to store a large amount of data with faster processing and higher scalability, I have chosen Hadoop for this, but I also require data collaboration, and I know SharePoint is the best candidate for that.
Please let me know how to integrate SharePoint with Hadoop.
I know about SSIS, which is used for SQL Server integration with Hadoop, but I need real-world examples so I can work out the exact solution for it.
Set up the HDFS NFS Gateway and copy the SharePoint files over. You could also use a basic script to PUT the files to HDFS. That would require an edge node that has access to the SharePoint repository and has an HDFS client installed.
HDFS NFS Gateway: https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html
HDFS PUT: https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/FileSystemShell.html#put
If you already use HDP and it is installed with Ambari, HDFS NFS Gateway is just another service to add via Ambari.
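For the "basic script" option, this is roughly the programmatic equivalent of `hdfs dfs -put`, assuming the SharePoint documents have already been exported to a local directory on the edge node (both paths below are made up):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SharePointToHdfs {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from core-site.xml on the edge node's classpath.
        FileSystem fs = FileSystem.get(new Configuration());

        // Hypothetical local export directory and HDFS target directory.
        Path localExport = new Path("/data/sharepoint-export");
        Path hdfsTarget = new Path("/landing/sharepoint");

        // Equivalent of: hdfs dfs -put /data/sharepoint-export /landing/sharepoint
        fs.copyFromLocalFile(localExport, hdfsTarget);

        fs.close();
    }
}
```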