I have been looking for a way to secure Parquet files at the column level for Spark access. Ideally, it would work the same way Apache Ranger works for Hive, i.e., a sysadmin defines the access policies for different groups and columns.
I have been trying Ranger through Hortonworks HDP; however, it seems that plug-ins for Spark and Parquet are not there yet.
I have also been able to devise a solution using Apache Drill and views; however, it is not acceptable right now, mainly because of the still-scarce community support for Drill.
Has anyone faced the same requirement and/or have some directions for a solution?
After a great deal of research, I've come to the conclusion that this is not possible.
The way Ranger works with other tools (HDFS, Hive, HBase, etc.) is by using plug-ins that implement hooks provided by those tools. For instance, to create a custom plug-in to secure Hive, one needs to create a HiveAuthorizer through the HiveAuthorizerFactory. But there is no such hook for Parquet, as it is nothing more than a file format.
A possible solution that would allow securing Parquet files at the column level from Ranger is to create an extension for Ranger's HDFS plug-in. This extension would implement the access rules for Parquet files defined through Ranger. That way, we could seamlessly secure Parquet files the same way we do for Hive or HBase, as long as the files are stored in HDFS.
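Until such a plug-in exists, column restrictions can only be approximated at the application layer. Here is a minimal PySpark sketch of that idea (the path, group names, and allow-list are hypothetical, and this is not real enforcement, since anyone with direct HDFS read access to the file bypasses it):

```python
from pyspark.sql import SparkSession

# Hypothetical allow-list: the columns each group may read.
# In a real deployment this mapping would come from a policy store.
ALLOWED_COLUMNS = {
    "analysts": ["customer_id", "region", "order_total"],
    "auditors": ["customer_id", "region", "order_total", "ssn"],
}

def read_parquet_for_group(spark, path, group):
    """Read a Parquet file, keeping only the columns the group may see."""
    df = spark.read.parquet(path)
    allowed = [c for c in df.columns if c in ALLOWED_COLUMNS.get(group, [])]
    return df.select(*allowed)

if __name__ == "__main__":
    spark = SparkSession.builder.appName("column-filter-demo").getOrCreate()
    secured = read_parquet_for_group(spark, "/data/orders.parquet", "analysts")
    secured.show()
```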
I am new to both Spark and Talend.
But I read everywhere that both of these are ETL tools. I read another Stack Overflow answer here. From that answer, what I understood is that Talend does use Spark for large data processing. But can Talend do all the ETL work that Spark does, and do it efficiently, without using Spark under the hood? Or is it essentially a wrapper over Spark, where all the data sent to Talend is actually handed to the Spark engine inside Talend for processing?
I am quite confused by this. Can someone clarify?
Unlike Informatica BDM, which has its own Blaze framework for native processing on Hadoop, Talend relies on other frameworks such as MapReduce (possibly with Tez underneath) or the Spark engine. So you could avoid Spark, but there is little point in doing so. The key point is that you can expect some productivity gains using Talend, as it is graphically based, which is handy when there are many fields and you do not necessarily need the most skilled staff.
For NoSQL stores like HBase, Talend provides specific connectors, or you could take the Phoenix route. Talend also has connectors for Kafka.
Spark is just one of the frameworks supported by Talend. When you create a new job, you can pick Spark from the dropdown list. You can get more details in the docs.
Which option is better: using Spark as an execution engine on Hive, or accessing Hive tables using Spark SQL? And why?
A few assumptions here are:
The reason to opt for SQL is to stay user-friendly, e.g. if you have business users trying to access data.
Hive is in consideration because it provides a SQL-like interface and persistence of data.
If that is true, Spark SQL is perhaps the better way forward. It is better integrated within Spark, and as an integral part of Spark it will provide more features (one example is structured streaming). You will still get user-friendliness and a SQL-like interface to Spark, so you get the full benefits, and you will need to manage your system only from Spark's point of view. Hive installation and management will still be there, but from a single perspective.
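As a concrete illustration, here is a minimal PySpark sketch of the Spark SQL route (the database, table, and column names are hypothetical); enabling Hive support on the session lets Spark SQL query tables registered in the Hive metastore directly:

```python
from pyspark.sql import SparkSession

# Enable Hive support so Spark SQL can see tables in the Hive metastore.
spark = (SparkSession.builder
         .appName("spark-sql-on-hive")
         .enableHiveSupport()
         .getOrCreate())

# Query an existing Hive table with plain SQL; Spark is the execution engine.
top_customers = spark.sql("""
    SELECT customer_id, SUM(order_total) AS total
    FROM sales.orders
    GROUP BY customer_id
    ORDER BY total DESC
    LIMIT 10
""")
top_customers.show()
```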
Using Hive with Spark as the execution engine will keep you limited by how well Hive's libraries can translate your HQL to Spark. They may do a pretty good job, but you will still lose the advanced features of Spark SQL. And new features may take longer to get integrated into Hive than into Spark SQL.
Also, with Hive exposed to end users, some advanced users or data engineering teams may want access to Spark itself, which would leave you managing two tools. System management may get more tedious than with Spark SQL alone, since Spark SQL has the potential to serve both non-technical and advanced users; even if advanced users use pyspark, spark-shell, or more, they are still within the same toolset.
I am currently working on batch applications using Apache Spark, and we mainly use delimiter-separated text files and Parquet as the storage formats.
Is there any storage format developed by Spark itself, or are there plans to develop one?
Spark is highly agnostic when it comes to languages, cluster managers, and supported data sources, including file formats and file systems. Moreover, it is a general-purpose framework, so finding one solution which fits all scenarios is rather unlikely.
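To illustrate that agnosticism, here is a minimal PySpark sketch (the paths are hypothetical) that reads delimiter-separated text and writes the same data back out in other formats through the same DataFrame API:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("format-demo").getOrCreate()

# Read delimiter-separated text; the same reader API handles JSON, Parquet, etc.
df = (spark.read
      .option("header", "true")
      .option("delimiter", "|")
      .csv("/data/input/events.psv"))

# Write the same DataFrame out in other formats; the code barely changes.
df.write.mode("overwrite").parquet("/data/output/events_parquet")
df.write.mode("overwrite").json("/data/output/events_json")
```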
That being said, it is always worth watching the AMPLab projects page.
In my Lambda architecture, I am debating whether to use HDFS or Cassandra to store my immutable data. I need Cassandra to serve the online requests, so it is a mandatory part of the tech stack. Now, I do not want to introduce a new tool (HDFS) into the stack if I don't have to. So my question is: what will I be missing if I skip HDFS and use Cassandra to host my immutable data as well?
EDIT:
I understand that HDFS is a distributed filesystem and Cassandra is a NoSQL DB. Still, both support data replication and high-throughput writes. In addition, Cassandra supports low-latency data retrieval. So am I right in saying that HDFS isn't going to provide me much lift?
As I understand it, you are trying to clarify the serving layer of your Lambda architecture.
If that is true, you want to store your batch views and real-time views in a database.
And as I understand it, you do not have a Hadoop cluster in your batch layer, and your batch views are not produced into HDFS.
At that point, your architecture sits outside of HDFS.
HBase is a distributed column-oriented database built on top of the Hadoop file system. It is an open-source project and is horizontally scalable.
If you don't want a Hadoop cluster, omit HBase.
Cassandra is a distributed, column-oriented NoSQL database, and it works outside of a Hadoop cluster and HDFS.
If I understand your architecture and your needs right, I think Cassandra is best for you.
Additionally, you can get a quick overview of the Lambda architecture from this link:
http://artofbigdata.blogspot.com.tr/2016/01/lambda-architecture.html
HDFS supports different file formats for storage, for example SequenceFiles, Avro, and Parquet, so you can choose a file format suited to your application's needs.
Also note that you can efficiently read the data using SQL-like queries (for example, through Hive or Spark SQL).
So, compared to Cassandra, HDFS gives you a wider choice of data models for hosting the data.
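As a small illustration of that SQL access pattern, here is a minimal PySpark sketch (the HDFS path and column names are hypothetical) that exposes a Parquet file to plain SQL through a temporary view:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-over-hdfs").getOrCreate()

# Register a Parquet file stored in HDFS as a temporary view...
spark.read.parquet("hdfs:///data/clicks.parquet").createOrReplaceTempView("clicks")

# ...and query it with plain SQL.
spark.sql("SELECT page, COUNT(*) AS hits FROM clicks GROUP BY page").show()
```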
I have a requirement to store a large amount of data with fast processing and high scalability, so I chose Hadoop. But I also require data collaboration, and I know SharePoint is a strong candidate for that.
Please let me know how to integrate SharePoint with Hadoop.
I know about SSIS, which is used for SQL Server integration with Hadoop, but I need real-world examples so I can work out the exact solution.
Set up the HDFS NFS Gateway and copy the SharePoint files. You could also use a basic script to PUT the files into HDFS. That would require an edge node that has access to the SharePoint repository and an HDFS client.
HDFS NFS Gateway: https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html
HDFS PUT: https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/FileSystemShell.html#put
If you already use HDP and it is installed with Ambari, HDFS NFS Gateway is just another service to add via Ambari.
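As a rough sketch of the scripted PUT approach (the local export directory and HDFS target below are hypothetical, and the HDFS client is assumed to be on the edge node's PATH), a small Python script could walk the exported SharePoint files and upload them:

```python
import os
import subprocess

# Hypothetical paths: a local directory where SharePoint files have been
# exported (or NFS-mounted) and the HDFS target directory.
LOCAL_DIR = "/mnt/sharepoint_export"
HDFS_DIR = "/data/sharepoint"

def put_file(local_path, hdfs_dir):
    """Copy one local file into HDFS using the 'hdfs dfs -put' shell command."""
    subprocess.run(["hdfs", "dfs", "-put", local_path, hdfs_dir], check=True)

# Make sure the target directory exists, then upload every regular file.
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", HDFS_DIR], check=True)
for name in os.listdir(LOCAL_DIR):
    path = os.path.join(LOCAL_DIR, name)
    if os.path.isfile(path):
        put_file(path, HDFS_DIR)
```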