I am developing Spark SQL code with the Eclipse IDE. I want to access the tables I have in HDFS, particularly in the hdfs:/data directory. What configuration should I set at the beginning of my program? By default, Hive only gives me tables under /tmp.
Thank you
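For reference, a minimal sketch of the kind of configuration meant here, assuming Spark 2.x and that hdfs:/data is the intended warehouse location (the app name and master setting are placeholders):

import org.apache.spark.sql.SparkSession

// Sketch: point Spark SQL's warehouse at an HDFS directory instead of the
// default location. The path comes from the question; adjust the namenode
// URI for your cluster.
val spark = SparkSession.builder()
  .appName("EclipseSparkSQL")                          // placeholder name
  .master("local[*]")                                  // typical for IDE development
  .config("spark.sql.warehouse.dir", "hdfs:///data")
  .enableHiveSupport()                                 // only if Hive classes / hive-site.xml are available
  .getOrCreate()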
Currently in our project we are using HDInsight 3.6, in which Spark and Hive integration is enabled by default because both share the same catalog. Now we want to migrate to HDInsight 4.0, where Spark and Hive have separate catalogs. I have gone through the Microsoft documentation (https://learn.microsoft.com/en-us/azure/hdinsight/interactive-query/apache-hive-warehouse-connector), which says an additional cluster is required to integrate them with the help of the Hive Warehouse Connector. I wanted to know if there is any other approach instead of using an extra cluster. Any suggestions would be highly appreciated.
Thanks
If you are using external tables, they can point both Spark and Hive at the same data and metastore entries. This only applies to external tables.
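As an illustration of that answer (the table name, columns, and HDFS location below are made up), an external table that both engines could read might look like:

// Hypothetical external table: Spark and Hive both read the files at the
// given HDFS location; dropping the table does not delete the data.
spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS flights_ext (
    dest_country_name STRING,
    origin_country_name STRING,
    cnt BIGINT
  )
  STORED AS PARQUET
  LOCATION 'hdfs:///data/flights'
""")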
I have installed Spark 2.4.0 on a clean Ubuntu instance. Spark DataFrames work fine, but when I try to use spark.sql against a DataFrame, as in the example below, I get the error "Failed to access metastore. This class should not accessed in runtime."
spark.read.json("/data/flight-data/json/2015-summary.json")
  .createOrReplaceTempView("some_sql_view")

spark.sql("""SELECT DEST_COUNTRY_NAME, sum(count)
             FROM some_sql_view
             GROUP BY DEST_COUNTRY_NAME""")
  .where("DEST_COUNTRY_NAME like 'S%'")
  .where("`sum(count)` > 10")
  .count()
Most of the fixes that I have seen in relation to this error refer to environments where Hive is installed. Is Hive required if I want to use SQL statements against DataFrames in Spark, or am I missing something else?
To follow up with my fix: the problem in my case was that Java 11 was the default on my system. As soon as I set Java 8 as the default, metastore_db started working.
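For anyone hitting the same thing, a quick way to confirm which JVM the Spark driver is actually using (run from spark-shell or your application; nothing here is environment-specific):

// Spark 2.4 and the bundled Hive/Derby metastore code generally expect Java 8.
println(System.getProperty("java.version"))
println(System.getProperty("java.home"))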
Yes, we can run Spark SQL queries without installing Hive. (By default Hive uses MapReduce as its execution engine; it can be configured to use Spark or Tez to execute queries much faster, and Hive-on-Spark still relies on the Hive metastore to run Hive queries.) At the same time, SQL queries can be executed directly through Spark. If Spark is used to execute simple SQL queries and is not connected to a Hive metastore server, it uses an embedded Derby database, and a new folder named metastore_db is created under the home folder of the user who executes the query.
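To make that concrete, a small sketch (the JSON path is a placeholder) of SQL against a temp view with no Hive installation; running it locally leaves a metastore_db folder and a derby.log behind:

import org.apache.spark.sql.SparkSession

// Plain SparkSession without enableHiveSupport(): Spark uses its built-in
// catalog backed by an embedded Derby metastore.
val spark = SparkSession.builder()
  .appName("NoHiveSql")
  .master("local[*]")
  .getOrCreate()

spark.read.json("/data/flight-data/json/2015-summary.json")   // placeholder path
  .createOrReplaceTempView("flights")

spark.sql("SELECT DEST_COUNTRY_NAME, sum(count) AS total FROM flights GROUP BY DEST_COUNTRY_NAME")
  .show()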
I have Spark 2.2 installed, but not Hive, and I would like to expose Spark tables through ODBC. I am able to start the Thrift Server with apparently no errors, and my ODBC driver application is able to connect to the Thrift Server, but it can't see any Spark tables. Do I need to have Hive installed and running in order for my ODBC applications to access the Spark tables that I create?
Thanks
Spark uses the Hive metastore.
You need to set up HiveServer as well to get access to Hive tables.
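One common reason no tables show up over ODBC/JDBC is that temp views only exist in the session that created them. A hedged sketch (path and table name are made up) of persisting a table into the metastore so the Thrift Server and its clients can see it:

// saveAsTable registers the table in the metastore (embedded Derby or Hive),
// which is what the Thrift Server exposes to ODBC/JDBC clients; temp views
// created in a different session are not visible to it.
spark.read.json("/data/flight-data/json/2015-summary.json")   // placeholder path
  .write
  .mode("overwrite")
  .saveAsTable("flights")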
Can you please let me know the steps to connect the Scala IDE, which I use for developing Spark, to Hive? Currently the output goes to HDFS and then I create an external table on top of it. But as Spark Streaming creates small files, performance is getting worse, and I want Spark to write directly to Hive. I am not sure what configuration I should make on my PC for that to happen during development.
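A rough sketch of what that setup might look like from an IDE, assuming the cluster's Hive metastore is reachable at a hypothetical thrift URI (the database and table names are also made up):

import org.apache.spark.sql.SparkSession

// Hypothetical configuration: point a locally launched SparkSession at the
// cluster's Hive metastore and write straight into a Hive table rather than
// producing small files on HDFS first.
val spark = SparkSession.builder()
  .appName("IdeToHive")                                            // placeholder
  .master("local[*]")
  .config("hive.metastore.uris", "thrift://metastore-host:9083")   // assumed URI
  .enableHiveSupport()
  .getOrCreate()

val df = spark.range(10).toDF("id")                            // stand-in for the real output DataFrame
df.write.mode("append").saveAsTable("mydb.streaming_output")   // hypothetical Hive table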
I'm looking for a client jdbc driver that supports Spark SQL.
I have been using Jupyter so far to run SQL statements on Spark (running on HDInsight) and I'd like to be able to connect using JDBC so I can use third-party SQL clients (e.g. SQuirreL, SQL Explorer, etc.) instead of the notebook interface.
I found an ODBC driver from Microsoft, but this doesn't help me with Java-based SQL clients. I also tried downloading the Hive JDBC driver from my cluster, but the Hive JDBC driver does not appear to support the more advanced SQL features that Spark does. For example, the Hive driver complains about not supporting join conditions that are not equi-joins, which I know is a supported feature of Spark because I've executed the same SQL successfully in Jupyter.
the Hive JDBC driver does not appear to support the more advanced SQL features that Spark does
Regardless of the SQL features it supports, the Spark Thrift Server is fully compatible with Hive/Beeline's JDBC connection.
Therefore, that is the JAR you need to use. I have verified this works in DBVisualizer.
The alternative solution would be to run Spark code in your Java clients (non-third party tools) directly and skip the need for the JDBC connection.
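For illustration, a minimal sketch of connecting to the Spark Thrift Server with the plain Hive JDBC driver (host, port, and user are placeholders; 10000 is the usual default Thrift Server port):

import java.sql.DriverManager

// Requires org.apache.hive:hive-jdbc (and its dependencies) on the classpath.
Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://thrift-host:10000/default", "user", "")
val stmt = conn.createStatement()
val rs   = stmt.executeQuery("SHOW TABLES")
while (rs.next()) println(rs.getString(1))
rs.close(); stmt.close(); conn.close()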