Currently in our project we are using HDInsight 3.6, in which Spark and Hive integration is enabled by default because both share the same catalog. Now we want to migrate to HDInsight 4.0, where Spark and Hive have separate catalogs. I went through the Microsoft documentation (https://learn.microsoft.com/en-us/azure/hdinsight/interactive-query/apache-hive-warehouse-connector), which says an additional cluster is required to integrate them with the help of the Hive Warehouse Connector. I wanted to know whether there is any other approach instead of using an extra cluster. Any suggestions would be highly appreciated.
Thanks
If you are using external tables, you can point both Spark and Hive at the same metastore. This only applies to external tables.
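To make that concrete, here is a minimal PySpark sketch of pointing Spark at the Hive catalog for external tables on HDInsight 4.0. The metastore URI and the metastore.catalog.default setting are assumptions based on typical HDP 3.x / HDInsight 4.0 setups; check your cluster's hive-site.xml and the Microsoft docs for the exact values (these can also be set cluster-wide through Ambari instead of in code).

from pyspark.sql import SparkSession

# Sketch: make Spark resolve table names against the Hive catalog (external tables only).
spark = (SparkSession.builder
         .appName("shared-external-tables")
         # Assumed setting: use the Hive catalog instead of Spark's own catalog.
         .config("spark.hadoop.metastore.catalog.default", "hive")
         # Placeholder metastore endpoint; take the real one from hive-site.xml.
         .config("spark.hadoop.hive.metastore.uris", "thrift://headnode0:9083")
         .enableHiveSupport()
         .getOrCreate())

# External tables registered in Hive should now be visible from Spark SQL.
spark.sql("SHOW TABLES IN default").show()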
I am reading the Spark Definitive Guide.
In the "Spark's Relationship to Hive" section, the following lines are given:
"With Spark SQL, you can connect to your Hive metastore (if you already have one) and access table metadata to reduce file listing when accessing information. This is popular for users who are migrating from a legacy Hadoop environment and beginning to run all their workloads using Spark."
I am not able to understand what this means. Could someone please help me with examples of the above use case?
Spark, being the latest tool in the Hadoop ecosystem, has connectivity with the earlier Hadoop tools. Hive was the most popular until recently: most Hadoop platforms have data stored in Hive tables, which can be accessed using Hive as a SQL engine. Spark can do the same things.
So the quoted passage means that you can connect Spark to the Hive metastore (which contains information about existing tables and databases, their locations, schemas, file types, etc.) and then run queries on those tables just as you would with Hive.
Below are two examples of what you can do with Spark once it is connected to the Hive metastore:
spark.sql("show databases")
spark.sql("select * from test_db.test_table")
I hope this answers your question.
Previously I could work entirely within the spark.sql API to interact with both Hive tables and Spark data frames. I could query views registered with Spark or Hive tables through the same API.
I'd like to confirm that this is no longer possible with Hadoop 3.1 and PySpark 2.3.2. To do any operation on a Hive table, one must use the HiveWarehouseSession API and not the spark.sql API. Is there any way to continue using the spark.sql API to interact with Hive, or will I have to refactor all my code?
hive = HiveWarehouseSession.session(spark).build()
hive.execute("arbitrary example query here")
spark.sql("arbitrary example query here")
It's confusing because the spark documentation says
Connect to any data source the same way
and specifically gives Hive as an example, but then the Hortonworks Hadoop 3 documentation says
As a Spark developer, you execute queries to Hive using the JDBC-style HiveWarehouseSession API
These two statements are in direct contradiction.
The Hadoop documentation continues "You can use the Hive Warehouse Connector (HWC) API to access any type of table in the Hive catalog from Spark. When you use SparkSQL, standard Spark APIs access tables in the Spark catalog."
So at least at present, spark.sql is no longer universal, correct? I can no longer seamlessly interact with Hive tables using the same API?
Yep, correct. I'm using Spark 2.3.2, but I can no longer access Hive tables using the default Spark SQL API.
From HDP 3.0, the catalogs for Apache Hive and Apache Spark are separated and mutually exclusive.
As you mentioned, you have to use HiveWarehouseSession from the pyspark-llap library.
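For reference, a rough PySpark sketch of that flow. It assumes the HWC package (pyspark-llap) and the cluster-specific settings (HiveServer2 Interactive JDBC URL, metastore URI, etc.) have already been supplied at spark-submit time; the table name is just an illustration.

from pyspark.sql import SparkSession
from pyspark_llap import HiveWarehouseSession

spark = SparkSession.builder.appName("hwc-example").getOrCreate()

# Build the connector session on top of the regular Spark session.
hive = HiveWarehouseSession.session(spark).build()

# Tables in the Hive catalog go through the connector...
hive.executeQuery("SELECT * FROM test_db.test_table").show()

# ...while spark.sql only sees tables in the Spark catalog.
spark.sql("SHOW TABLES").show()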
I'm new to the data science field and I don't understand why someone would want to connect Hive to Spark instead of just using Spark-SQL.
What benefits are there to using Hive on Spark rather than Spark-SQL (other than being able to use Hive code already in production)?
Thanks
That answer above is not correct. The one component that is common between Hive and SparkSQL is the SemanticAnalyzer.
Hive has significantly better SQL support and a more sophisticated cost-based optimizer.
My recommendation is to use Hive on Tez as opposed to Hive on Spark or SparkSQL, as it is production ready, more stable, and more scalable.
Hmm, it seems the only answer here advises using Tez...
Back to the original question about the benefits of using Hive on Spark: IMHO, the benefits are mainly better support for Hive features, not HiveQL language support. Hive on Spark has much better support for HiveServer2 and for security features.
In SparkSQL these are really buggy. There is a HiveServer2 implementation in SparkSQL, but in the latest release version (1.6.x) it no longer works with the hivevar and hiveconf arguments, and the username for login via JDBC doesn't work either; see https://issues.apache.org/jira/browse/SPARK-13983
Our requirement is to use Spark with HiveServer2 in a secure way (with authentication and authorization). Currently SparkSQL alone cannot provide this, and we do not need other Hadoop components like HDFS or YARN (we are using Spark standalone), so for our requirement we are using Ranger/Sentry + Hive on Spark.
Currently we are building a reporting platform and have used Shark as the data store. Since development of Shark has stopped, we are in the phase of evaluating Spark SQL. Based on our use cases we have a few questions.
1) We have data from various sources (MySQL, Oracle, Cassandra, Mongo). We would like to know how we can get this data into Spark SQL. Does a utility exist which we can use? Does this utility support a continuous refresh of data (i.e. syncing new adds/updates/deletes on the data store to Spark SQL)?
2) Is there a way to create multiple databases in Spark SQL?
3) For the reporting UI we use Jasper, and we would like to connect from Jasper to Spark SQL. In our initial search we learned that there is currently no support for consumers to connect to Spark SQL through JDBC, but that this may be added in future releases. We would like to know when Spark SQL will have a stable release with JDBC support. Meanwhile we took the source code from https://github.com/amplab/shark/tree/sparkSql, but we had some difficulty setting it up locally and evaluating it. It would be great if you could help us with setup instructions. (I can share the issue we are facing; please let me know where I can post the error logs.)
4) We would also require a SQL prompt where we can execute queries. Currently the Spark shell provides a Scala prompt where Scala code can be executed, and from Scala code we can fire SQL queries. Like Shark, we would like to have a SQL prompt in Spark SQL. In our search we found that this would be added in a future release of Spark. It would be great if you could tell us which release of Spark will address this.
As for:
3) Spark 1.1 provides better support for the SparkSQL Thrift server interface, which you may want to use for JDBC interfacing. Hive JDBC clients that support v0.12.0 are able to connect to and interface with such a server (a Python sketch is shown below).
4) Spark 1.1 also provides a SparkSQL CLI interface that can be used for entering queries, in the same fashion as the Hive CLI or Impala shell.
Please provide more details about what you are trying to achieve for 1 and 2.
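In case it helps with 3), here is a rough sketch of querying the SparkSQL Thrift server from Python. It assumes the PyHive client library and the default HiveServer2 port of 10000, neither of which is part of the answer above; host, user, and table names are placeholders.

from pyhive import hive

# Connect to the SparkSQL Thrift server over the HiveServer2 protocol.
conn = hive.connect(host="thriftserver-host", port=10000, username="reporting")
cursor = conn.cursor()
cursor.execute("SELECT count(*) FROM some_table")  # placeholder table
print(cursor.fetchall())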
I can answer (1):
Apache Sqoop was made specifically to solve this problem for relational databases. The tool was made for HDFS, HBase, and Hive -- as such, it can be used to make data available to Spark via HDFS and the Hive metastore.
http://sqoop.apache.org/
I believe Cassandra is available to SparkContext via this connector from DataStax: https://github.com/datastax/spark-cassandra-connector -- which I have never used.
I'm not aware of any connector for MongoDB.
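For the DataStax connector mentioned above, here is a rough, untested sketch of what a read looks like with more recent versions that expose a DataFrame source; the host, keyspace, and table names are placeholders, and the connector package has to be added at submit time (for example via --packages).

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("cassandra-read")
         # Placeholder contact point for the Cassandra cluster.
         .config("spark.cassandra.connection.host", "cassandra-host")
         .getOrCreate())

# Read one Cassandra table into a DataFrame through the connector's data source.
df = (spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(keyspace="my_keyspace", table="my_table")  # placeholders
      .load())
df.show()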
1) We have data from various sources (MySQL, Oracle, Cassandra, Mongo)
You have to use a different driver for each case. For Cassandra there is the DataStax driver (but I encountered some compatibility problems with SparkSQL). For any SQL system you can use JdbcRDD. The usage is straightforward; look at this Scala example:
test("basic functionality") {
sc = new SparkContext("local", "test")
val rdd = new JdbcRDD(
sc,
() => { DriverManager.getConnection("jdbc:derby:target/JdbcRDDSuiteDb") },
"SELECT DATA FROM FOO WHERE ? <= ID AND ID <= ?",
1, 100, 3,
(r: ResultSet) => { r.getInt(1) } ).cache()
assert(rdd.count === 100)
assert(rdd.reduce(_+_) === 10100)
}
But note that this gives you just an RDD, so you work with the data through the RDD (map/reduce) API, not through SQLContext.
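For comparison, later Spark releases can land the same JDBC read directly in Spark SQL as a DataFrame rather than a plain RDD. A rough sketch, with the connection URL, credentials, and table name as placeholders (the JDBC driver jar still has to be on the classpath):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-read").getOrCreate()

# Read a relational table straight into a DataFrame that Spark SQL can query.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:mysql://db-host:3306/reporting")  # placeholder URL
      .option("dbtable", "FOO")
      .option("user", "report_user")
      .option("password", "secret")
      .load())

df.createOrReplaceTempView("foo")
spark.sql("SELECT count(*) FROM foo").show()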
Does there exist any utility which we can use?
There is the Apache Sqoop project, but it is still in active development. The current stable version doesn't even save files in Parquet format.
Spark SQL is a capability of the Spark framework. It shouldn't be compared to Shark because Shark is a service. (Recall that with Shark, you run a ThriftServer that you can then connect to from your Thrift app or even ODBC.)
Can you elaborate on what you mean by "get this data into Spark SQL"?
There are a couple of Spark-MongoDB connectors:
- the MongoDB connector for Hadoop (which doesn't actually need Hadoop at all!): https://databricks.com/blog/2015/03/20/using-mongodb-with-spark.html
- the Stratio MongoDB connector: https://github.com/Stratio/spark-mongodb
If your data is huge and you need to perform a lot of transformations, then Spark SQL can be used for ETL purposes; otherwise Presto could solve all your problems. Addressing your queries one by one:
As your data is in MySQL, Oracle, Cassandra, and Mongo, all of these can be integrated in Presto, as it has connectors (https://prestodb.github.io/docs/current/connector.html) for all of these databases.
Once you install Presto in cluster mode you can query all these databases together on one platform, which even lets you join a table from Cassandra with other tables from Mongo; this flexibility is unparalleled (see the sketch below).
Presto can be used to connect to Apache Superset (https://superset.incubator.apache.org/), which is open source and provides all sorts of dashboarding. Presto can also be connected to Tableau.
You can install MySQL Workbench with the Presto connection details, which helps in providing a UI for all your databases in one place.
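To make the cross-source join concrete, here is a rough sketch of issuing such a federated query from Python. It assumes the PyHive client library and Presto catalogs named cassandra and mongodb; these names are illustrative, not part of the answer above.

from pyhive import presto

# One Presto query that joins a Cassandra table with a Mongo collection.
conn = presto.connect(host="presto-coordinator", port=8080, username="reporting")
cursor = conn.cursor()
cursor.execute("""
    SELECT c.user_id, m.plan
    FROM cassandra.app.users AS c
    JOIN mongodb.billing.subscriptions AS m
      ON c.user_id = m.user_id
""")
print(cursor.fetchmany(10))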
I am working for a small company and am very new to Apache Cassandra. I am studying Cassandra and performing some small analytics, such as a sum function, on the Cassandra DB for creating reports. For this, Hive and Accunu could be choices.
DataStax Enterprise provides a solution for Apache Cassandra and Hive integration. Is DataStax Enterprise the only solution for such integration? Is there any other way to achieve Hive and Cassandra integration? If so, can I get links or documents regarding it? Is it possible to make this work on the Windows platform?
Is there any other solution for performing analytics on the Cassandra DB?
Thanks in advance.
I was trying to download DataStax Enterprise (DSE) for Windows but found there is no such option on their website. I suppose they do not support DSE for Windows.
Apache Cassandra does have built-in Hadoop support. You need to set up a standalone Hadoop cluster colocated with the Apache Cassandra nodes and then use ColumnFamilyInputFormat and ColumnFamilyOutputFormat to read/write data from/to your Hadoop cluster.