'hive on spark' in datastax enterprise DSE? - apache-spark

DSE 6 comes pre-bundled with Cassandra and Spark SQL. Has anyone also set up 'Hive on Spark' there? I wonder whether Spark version conflicts would be an issue. The reason I want this is that Hive seems to allow masking/authorization with Ranger, but Spark SQL doesn't.

This answer is not directly about setting up Hive, but DSE has security (authentication, authorization, ...) built in (see the FAQ), and it is supported by all components, including Spark SQL. If you want more granular permissions, you can set up row-level access control.
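To make the row-level access control suggestion more concrete, here is a minimal sketch that only assembles the two CQL statements I believe DSE uses for it (RESTRICT ROWS and a filtered GRANT). The table, column, filter value and role names are hypothetical, and the exact syntax should be checked against the DSE documentation for your version before running anything.

```python
def rlac_statements(table, filter_column, filter_value, role):
    """Build the CQL statements for a row-level access control setup.

    Assumption: DSE syntax of the form
      RESTRICT ROWS ON <table> USING <column>;
      GRANT SELECT ON '<value>' ROWS IN <table> TO <role>;
    All names passed in below are hypothetical examples.
    """
    # Escape single quotes so the filter value stays a valid CQL string literal.
    escaped = filter_value.replace("'", "''")
    return [
        f"RESTRICT ROWS ON {table} USING {filter_column};",
        f"GRANT SELECT ON '{escaped}' ROWS IN {table} TO {role};",
    ]


# Hypothetical example: let the role emea_analyst read only rows where region = 'emea'.
for statement in rlac_statements("sales.orders", "region", "emea", "emea_analyst"):
    print(statement)
```

The printed statements would then be run in cqlsh by a user with sufficient permissions.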

Related

Spark Integration with hive

Currently in our project we are using HDInsight 3.6, in which Spark and Hive integration is enabled by default because both share the same catalog. Now we want to migrate to HDInsight 4.0, where Spark and Hive will have different catalogs. I went through the Microsoft document (https://learn.microsoft.com/en-us/azure/hdinsight/interactive-query/apache-hive-warehouse-connector), according to which an additional cluster is required to integrate them with the help of the Hive Warehouse Connector. I would like to know whether there is any other approach that avoids the extra cluster. Any suggestions would be highly appreciated.
Thanks
If you are using external tables, you can point both Spark and Hive at the same metastore. This applies only to external tables.
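As a rough illustration of the external-table idea, the sketch below builds a HiveQL CREATE EXTERNAL TABLE statement whose data lives at an explicit storage location, so both engines can read the same files. The table name, columns and location are hypothetical, and whether this works without the Hive Warehouse Connector on a particular HDInsight 4.0 cluster is an assumption that needs to be verified.

```python
def external_table_ddl(table, columns, location, file_format="PARQUET"):
    """Build a CREATE EXTERNAL TABLE statement for a shared data location.

    columns is a list of (name, type) pairs. All names used in the
    example call below are hypothetical.
    """
    column_list = ", ".join(f"{name} {dtype}" for name, dtype in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} ({column_list}) "
        f"STORED AS {file_format} LOCATION '{location}'"
    )


# Hypothetical table and path; replace with your own storage location.
ddl = external_table_ddl(
    "sales.events",
    [("id", "BIGINT"), ("payload", "STRING")],
    "/shared/warehouse/events",
)
print(ddl)
```

The same statement can be submitted from Hive and, with Hive support enabled, from Spark SQL. Because the table is external, dropping it in one engine does not delete the underlying files.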

Spark JobServer can use Cassandra as SharedDb

I have been researching how to configure the Spark JobServer backend (SharedDb) with Cassandra.
I saw that the SJS documentation cites Cassandra as one of the shared DBs that can be used.
Here is the documentation part:
Spark Jobserver offers a variety of options for backend storage such as:
H2/PostgreSQL or other SQL Databases
Cassandra
Combination of SQL DB or Zookeeper with HDFS
But I didn't find any configuration example for this.
Would anyone have an example? Or can help me to configure it?
Edited:
I want to use Cassandra to store metadata and jobs from Spark JobServer, so that I can reach any of the servers through a proxy in front of them.
Cassandra was supported in previous versions of Jobserver. You just needed to have Cassandra running, add the correct settings to your Jobserver configuration file (https://github.com/spark-jobserver/spark-jobserver/blob/0.8.0/job-server/src/main/resources/application.conf#L60), and specify spark.jobserver.io.JobCassandraDAO as the DAO.
But the Cassandra DAO was recently deprecated and removed from the project, because it was not really used or maintained by the community.

Why was the Cassandra Context removed from DataStax Enterprise 4.7

I came to know from this link that the Cassandra context was removed from DataStax Enterprise 4.7. Does this mean it will be removed from the Spark Cassandra Connector? Also, what is the reason for removing it? Is it performance related?
Cassandra Context
The 'CassandraContext' object was DataStax-only and never existed in the Spark Cassandra Connector. It was basically a compiled mapping of Cassandra tables to Scala objects and case classes. It required compiling a new object every time the underlying Cassandra schema changed, and it created a divergence from the OSS Spark Cassandra Connector API. The additional performance cost of creating this object was seen as a waste of time compared with the limited convenience it offered. In addition, the code would only work in the Spark shell, so it was not suitable for prototyping code for standalone applications.
Edit: I was mistaken; the Cassandra Context is a separate structure from the CassandraSQLContext. My memory was wrong.
The CassandraSQLContext's main purpose was to provide a persistent catalogue and automatic mapping to Cassandra tables from Spark when the system does not have a HiveMetastore present. When using the CassandraSQLContext, the user is limited to a tiny subset of ANSI SQL, as opposed to a HiveContext, which supports 99% of HiveQL. The code for the CassandraSQLContext is still present in the Connector and you are still able to create a CassandraSQLContext in DSE.
In DataStax Enterprise there is already a HiveMetastore written to work with Cassandra. The custom Metastore automatically registers all Cassandra tables as well, so having the CassandraSQLContext was seen as redundant, confusing, and less featured than its Hive counterpart. To this end, it is recommended that all users use a HiveContext instead of the CassandraSQLContext, and we removed the automatic cc object from the shell.

Why use Hive on Spark instead of Spark-SQL?

I'm new to the Data Science field and I don't understand why someone would want to connect Hive to Spark instead of just using Spark-SQL.
What benefits are there to using Hive on Spark rather than Spark-SQL (other than being able to use Hive code already in production)?
Thanks
That answer above is not correct. The one component that is common between Hive and SparkSQL is SemanticAnalyzer.
Hive has significantly better SQL support and a more sophisticated cost based optimizer.
My recommendation is to use Hive on Tez as opposed to Hive on Spark or SparkSQL, as it is production ready, more stable, and more scalable.
Hmm, it seems the only answer here advises using Tez...
Back to the original question: IMHO, the benefits of using Hive on Spark are mainly better Hive feature support rather than HiveQL language support. Hive on Spark has much better support for HiveServer2 and security features.
In SparkSQL these are really buggy. There is a HiveServer2 implementation in SparkSQL, but in the latest release version (1.6.x) it no longer works with the hivevar and hiveconf arguments, and the username for login via JDBC doesn't work either; see https://issues.apache.org/jira/browse/SPARK-13983
Our requirement is to use Spark with HiveServer2 in a secure way (with authentication and authorization). Currently SparkSQL alone cannot provide this. We do not need other Hadoop components like HDFS or YARN, since we are using Spark standalone, so for our requirement we are using Ranger/Sentry + Hive on Spark.

Performing Analytics over Cassandra DB

I am working for a small concern and am very new to Apache Cassandra. I am studying Cassandra and performing some small analytics, like a sum function, on a Cassandra DB for creating reports. For this, Hive and Acunu may be choices.
DataStax Enterprise provides a solution for Apache Cassandra and Hive integration. Is DataStax Enterprise the only solution for such integration? Is there any other way to integrate Hive and Cassandra? If so, can I get links or documents about it? Is it possible to do the same on the Windows platform?
Is there any other solution for performing analytics on a Cassandra DB?
Thanks in advance.
I was trying to download DataStax Enterprise (DSE) for Windows but found there is no such option on their website. I suppose they do not support DSE for Windows.
Apache Cassandra does have built-in Hadoop support. You need to set up a standalone Hadoop cluster colocated with the Apache Cassandra nodes and then use ColumnFamilyInputFormat and ColumnFamilyOutputFormat to read and write Cassandra data from your Hadoop jobs.
