How to support row level security in Apache Spark SQL? - apache-spark

I would like to have row-level security enforced in Apache Spark SQL. Is that supported? (That is: allow users to send raw HiveQL / Spark SQL queries, but only show them the data they are supposed to see.) Is there a built-in way to do so in Apache Spark?

No, Spark does not provide security at this level. If you want that kind of security, look at Apache Accumulo. Accumulo was created in 2008 by the US National Security Agency and contributed to the Apache Software Foundation. It is a system built on top of Apache Hadoop, Apache ZooKeeper, and Apache Thrift. Written in Java, Accumulo has cell-level access labels and server-side programming mechanisms. You can refer to the book Accumulo: Application Development, Table Design, and Best Practices.
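That said, a common workaround is to approximate row-level security with filtered views. Below is a minimal sketch, not a definitive implementation: the table sales and the column region are hypothetical, and actually restricting who may read which view has to be enforced outside Spark (e.g. storage-level ACLs or a tool such as Apache Ranger).

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("row-level-filter-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // Expose only the rows a given group of users may see.
    // (Table and column names are hypothetical.)
    spark.sql("""
      CREATE OR REPLACE VIEW sales_eu AS
      SELECT * FROM sales WHERE region = 'EU'
    """)

    // Restricted users query the view, never the base table.
    spark.sql("SELECT count(*) FROM sales_eu").show()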

Related

Understanding kappa architecture with Apache Superset

There is a lot of information about kappa architecture on the internet, and after going through some of the conceptual aspects I am trying to drill down to something more concrete. As my main source I used this website.
Let's imagine you want to implement a kappa architecture involving the following tech stack:
Apache Kafka
Apache Spark
Apache Superset
Now imagine the application you want to run data analytics against has a PostgreSQL database. Of course you can easily connect Apache Superset directly to the PostgreSQL database and create charts.
But now you want to see how you would do this with a kappa architecture, so you add Kafka and Spark.
You can emit events to Kafka and read such events in Apache Spark. Kafka will retain messages for topics for a certain period, as pointed out in the answers to this question. When I read about connecting Superset with Spark in the docs, it says Hive should be used as a connector (the project website also states the tool is unsupported, and if you look at this issue on PyHive you find that Impyla could be an alternative). But Apache Hive is a completely different project for a storage system. So how would this connection work?
Assume you have Kafka nodes running (with ZooKeeper, obviously) and Spark running, and you then connect Apache Superset to Spark through this Hive connector.
How can you write queries against the data that is in Kafka (which is in fact the live data)?
On the Spark side you can easily write a Scala program that reads data from Kafka and does something with it, but how can you achieve this from Apache Superset?
Or is this not the intended way of connecting these things?
If I understood your question, you'd need to use Spark Structured Streaming to register a streaming SQL table in the Hive metastore, which could then be queried from Superset via the Spark Thrift server.
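A minimal sketch of that idea, under some assumptions: the topic name "events" and the broker address are hypothetical, and the spark-sql-kafka-0-10 package must be on the classpath. The stream is materialized into an in-memory table that a Thrift server sharing this SparkSession (e.g. started via HiveThriftServer2.startWithContext) can expose to Superset.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("kappa-sketch")
      .enableHiveSupport()
      .getOrCreate()

    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "events")
      .load()
      .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")

    // Continuously append the live Kafka data into a queryable table.
    events.writeStream
      .format("memory")
      .queryName("events_live")
      .outputMode("append")
      .start()

    // Any client connected through the shared session can now run SQL on it.
    spark.sql("SELECT count(*) FROM events_live").show()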
Hive itself doesn't store any of the data. Hive also has a built-in Kafka query handler, so Spark isn't completely necessary.
But Hive/Spark isn't the only option. You could use Spark to write to HDFS/S3 and have Presto query that from Superset.
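For that variant, a minimal sketch continuing the events stream from the sketch above (bucket and paths are hypothetical): the stream is persisted as Parquet files that Presto, pointed at the same location through a Hive-compatible catalog, can then query.

    // Persist the Kafka stream as Parquet files on S3 for Presto to read.
    events.writeStream
      .format("parquet")
      .option("path", "s3a://my-bucket/events/")
      .option("checkpointLocation", "s3a://my-bucket/checkpoints/events/")
      .outputMode("append")
      .start()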
Or you can remove Spark and use Kafka Connect to write to anything a dashboarding tool (Tableau is another popular one) can support: a JDBC database (e.g. Postgres), Mongo, Cassandra, etc. Then you'd just refresh the panels to run a new query.

Spark as execution engine or Spark as an application?

Which option is better: using Spark as an execution engine for Hive, or accessing Hive tables using Spark SQL? And why?
A few assumptions here are:
The reason to opt for SQL is to stay user-friendly, e.g. if you have business users trying to access data.
Hive is in consideration because it provides an SQL-like interface and persistence of data.
If that is true, Spark SQL is perhaps the better way forward. It is better integrated with Spark and, as an integral part of Spark, it will provide more features (one example is Structured Streaming). You will still get user-friendliness and an SQL-like interface to Spark, so you get the full benefits, but you will need to manage your system only from Spark's point of view. Hive installation and management will still be there, but from a single perspective.
Using Hive with Spark as the execution engine will keep you limited by how well Hive's libraries can translate your HQL to Spark. They may do a pretty good job, but you will still lose the advanced features of Spark SQL. And new features may take longer to get integrated into Hive than into Spark SQL.
Also, with Hive exposed to end users, some advanced users or data engineering teams may want access to Spark itself. That will leave you managing two tools. System management may get more tedious than using only Spark SQL in this scenario, since Spark SQL has the potential to serve both non-technical and advanced users, and even if advanced users use pyspark, spark-shell, or more, they are still within the same toolset.
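As a rough illustration of the Spark-SQL route (the table name orders is hypothetical): Spark reads the existing Hive metastore directly, so Hive-managed tables stay queryable while the queries themselves run on Spark's engine.

    import org.apache.spark.sql.SparkSession

    // enableHiveSupport points Spark at the existing Hive metastore.
    val spark = SparkSession.builder()
      .appName("spark-sql-over-hive")
      .enableHiveSupport()
      .getOrCreate()

    // Plain SQL against a Hive-managed table, executed by Spark SQL.
    spark.sql("SELECT region, sum(amount) AS total FROM orders GROUP BY region").show()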

How to choose between apache ranger and sentry

From the wikis provided by these two projects, it seems they do a similar job. But there must be some difference, or there would be no need for both.
So what are the differences, and what is the practical advice for choosing one over the other?
Thanks a lot!
Great answers above.
Just a quick update following the Cloudera+Hortonworks merger last year.
These companies have decided to standardize on Ranger.
CDH5 and CDH6 will still use Sentry until the CDH product line retires in ~2-3 years.
Ranger will be used for Cloudera+Hortonworks' combined "Unity" platform / CDP product.
Cloudera told us that Ranger is the more "mature" product.
Since Unity hasn't been released yet (as of May 2019), something may change in the future, but that's the current direction. (Oct 2019 update: Unity is now known as CDP and is available for beta testing; it will be available for cloud deployments soon, and in 2020 for on-prem customers.)
If you're an existing Cloudera customer or CDH user, you would still have to use Apache Sentry for now. There is significant overlap between Sentry and Ranger, but if you start fresh, definitely look at Ranger.
Whether you use Sentry or Ranger depends on which Hadoop distribution you are using, e.g. Cloudera or Hortonworks.
Apache Sentry - owned by Cloudera. Supports HDFS, Hive, Solr, and Impala. (Ranger does not support Impala.)
Apache Ranger - owned by Hortonworks. Apache Ranger offers a centralized security framework to manage fine-grained access control across HDFS, Hive, HBase, Storm, Knox, Solr, Kafka, and YARN.
https://cwiki.apache.org/confluence/display/SENTRY/Sentry+Tutorial
http://hortonworks.com/apache/ranger/
Thanks, Kumar
Apache Ranger overlaps with Apache Sentry since it also deals with authorization and permissions. It adds an authorization layer to Hive, HBase, and Knox. Both Sentry and Ranger support column-level permissions in Hive (starting from the 1.5 release).
Ref: https://www.xplenty.com/blog/2014/11/5-hadoop-security-projects/
You can also check out RecordService.
RecordService provides an abstraction layer between compute frameworks and data storage. It provides row- and column-level security, among other advantages.
Ref: http://blog.cloudera.com/blog/2015/09/recordservice-for-fine-grained-security-enforcement-across-the-hadoop-ecosystem/
http://recordservice.io/
Both manage permissions based on role-table grants. Ranger additionally provides dynamic data masking (in transit). Both integrate with Informatica's Secure at Source (which identifies risky data stores in the enterprise) to deliver a data governance solution.

What is the differences between Apache Spark and Apache Apex?

Apache Apex is an open-source, enterprise-grade, unified stream and batch processing platform. It is used in the GE Predix platform for IoT.
What are the key differences between these 2 platforms?
Questions
From a data science perspective, how is it different from Spark?
Does Apache Apex provide functionality like Spark MLlib? If we have to build scalable ML models on Apache Apex, how do we do it, and which language should we use?
Will data scientists have to learn Java to build scalable ML models? Does it have a Python API like PySpark?
Can Apache Apex be integrated with Spark, and can we use Spark MLlib on top of Apex to build ML models?
Apache Apex is an engine for processing streaming data. Others that try to achieve the same are Apache Storm and Apache Flink. The differentiating factor for Apache Apex is that it comes with built-in support for fault tolerance, scalability, and a focus on operability, which are key considerations in production use cases.
Comparing it with Spark: Apache Spark is actually a batch processing engine. If you consider Spark Streaming (which uses Spark underneath), then it is micro-batch processing. In contrast, Apache Apex is true stream processing, in the sense that an incoming record does NOT have to wait for the next record to be processed. A record is processed and sent to the next level of processing as soon as it arrives.
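To make the micro-batch point concrete, here is a minimal Spark Structured Streaming sketch; the socket source and the one-second interval are arbitrary choices for illustration.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.streaming.Trigger

    val spark = SparkSession.builder().appName("micro-batch-sketch").getOrCreate()

    // Read lines from a local socket (run e.g. `nc -lk 9999` to feed it).
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // Each trigger fires one micro-batch: a record can wait up to a second
    // before it is processed, unlike a record-at-a-time engine such as Apex.
    lines.writeStream
      .format("console")
      .trigger(Trigger.ProcessingTime("1 second"))
      .start()
      .awaitTermination()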
Currently, work is in progress to add support for integrating Apache Apex with machine learning libraries like Apache SAMOA and H2O.
Refer to https://issues.apache.org/jira/browse/SAMOA-49
Currently, it has support for Java and Scala.
https://www.datatorrent.com/blog/blog-writing-apache-apex-application-in-scala/
For Python, you may try it using Jython. But I haven't tried it myself, so I'm not very sure about it.
Integration with Spark may not be a good idea considering they are two different processing engines. But Apache Apex integration with machine learning libraries is in progress.
If you have any other questions or feature requests, you can post them on the mailing list for Apache Apex users: https://mail-archives.apache.org/mod_mbox/incubator-apex-users/

Performing Analytics over Cassandra DB

I am working for a small concern and am very new to Apache Cassandra. I am studying Cassandra and performing some small analytics, like a sum function, on the Cassandra DB for creating reports. For that, Hive and Acunu could be choices.
DataStax Enterprise provides a solution for Apache Cassandra and Hive integration. Is DataStax Enterprise the only solution for such integration, or is there another way to integrate Hive and Cassandra? If so, can I get links or documents about it? And is it possible to do the same on the Windows platform?
Is there any other solution for performing analytics on a Cassandra DB?
Thanks in advance.
I was trying to download DataStax Enterprise (DSE) for Windows but found there is no such option on their website. I suppose they do not support DSE on Windows.
Apache Cassandra does have built-in Hadoop support. You need to set up a standalone Hadoop cluster colocated with the Apache Cassandra nodes, and then use ColumnFamilyInputFormat and ColumnFamilyOutputFormat to read data from and write data to Cassandra in your Hadoop jobs.
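If you prefer Spark over plain Hadoop MapReduce for this, one alternative worth noting is the open-source Spark Cassandra Connector. A minimal sketch, assuming the spark-cassandra-connector package is on the classpath and with a hypothetical keyspace shop and table sales:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("cassandra-analytics-sketch")
      .config("spark.cassandra.connection.host", "127.0.0.1")
      .getOrCreate()

    // Load a Cassandra table as a DataFrame via the connector.
    val sales = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "shop", "table" -> "sales"))
      .load()

    // The kind of small aggregation the question asks about.
    sales.groupBy("region").sum("amount").show()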
