I have a requirement to read an ACID-enabled Hive table from Spark.
Spark natively doesn't support reading ORC files that are ACID enabled; the only option seems to be Spark JDBC.
We can also use the Hive Warehouse Connector to read the files. Can someone explain the steps to read using the Hive Warehouse Connector?
Does HWC only work on HDP 3? Kindly advise.
Spark version: 2.3.0
HDP: 2.6.5
Spark can read ORC files; check the documentation here: https://spark.apache.org/docs/2.3.0/sql-programming-guide.html#orc-files
Here is a code sample to read an ORC file:
spark.read.format("orc").load("example.orc")
HWC is made for HDP 3, as the Hive and Spark catalogs are no longer compatible in HDP 3 (Hive is at version 3, and Spark at version 2).
See documentation on it here: https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.5/integrating-hive/content/hive_hivewarehouseconnector_for_handling_apache_spark_data.html
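For reference, a minimal sketch of what an HWC read looks like on HDP 3 (it assumes the HWC assembly jar is on the Spark classpath and that spark.sql.hive.hiveserver2.jdbc.url and the other connector settings from the linked documentation are already configured; the table name is a placeholder):

import com.hortonworks.hwc.HiveWarehouseSession

// build an HWC session on top of the existing SparkSession ("spark" in spark-shell)
val hive = HiveWarehouseSession.session(spark).build()

// the query runs through HiveServer2/LLAP, so ACID tables can be read
val df = hive.executeQuery("SELECT * FROM some_db.some_acid_table")
df.show()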
Related
I'm new to Spark. I want to use Spark to read some data and write it to tables defined by Hive. I'm using Spark 3.3.1 and Hadoop 3.3.2. Can I download Hive 3 and configure it to work together with Spark 3? Some materials I found on the internet say Spark can't work with every version of Hive.
Thanks
According to the Spark 3.2.1 documentation, it is compatible with Hive 3.1.0. If the Spark and Hive versions can be modified, I would suggest starting with that combination.
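If the versions do line up, a minimal sketch of pointing Spark 3 at an existing Hive 3 metastore (the metastore URI and version string are placeholders; see the "Hive Tables" page of the Spark SQL documentation for the full set of options):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("spark3-with-hive3")
  // placeholder URI: point this at the Hive 3 metastore service
  .config("hive.metastore.uris", "thrift://metastore-host:9083")
  // load a Hive metastore client matching the metastore's actual version
  .config("spark.sql.hive.metastore.version", "3.1.2")
  .config("spark.sql.hive.metastore.jars", "maven")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("show databases").show()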
I tried to integrate Hive 3.1.2 with Spark 3.2.1. There is a Hive fork for Spark 3:
https://github.com/forsre/hive3.1.2
You can use it to recompile Hive against Spark 3, and Hive on Spark then works.
But the Spark Thrift Server is incompatible with Hive 3; Apache Kyuubi is suggested as a replacement for the Spark Thrift Server and HiveServer2.
https://kyuubi.apache.org/
You can just use the standard Hive 3.1.2 and Spark 3.2.1 packages with Kyuubi 1.6.0 to make them work together.
I use HDP 3.1 and added Spark2, Hive, and the other services that are needed. I turned off the ACID feature in Hive. The Spark job can't find the table in Hive, but the table exists in Hive. The exception looks like:
org.apache.spark.sql.AnalysisException: Table or view not found
There is a hive-site.xml in Spark's conf folder. It is automatically created by HDP, but it isn't the same as the file in Hive's conf folder. From the log, Spark does get the thrift URI of Hive correctly.
I use Spark SQL and created one Hive table in spark-shell. I found the table was created in the folder specified by spark.sql.warehouse.dir. I changed its value to the value of hive.metastore.warehouse.dir, but the problem is still there.
I also enabled Hive support when creating the Spark session:
val ss = SparkSession.builder().appName("统计").enableHiveSupport().getOrCreate()
There is metastore.catalog.default in the hive-site.xml in Spark's conf folder. Its value is spark; it should be changed to hive. And by the way, the ACID feature of Hive should be disabled.
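In the hive-site.xml that Spark reads (not Hive's own copy), the property should end up looking roughly like this; in HDP the file is managed by Ambari, so change it through Ambari rather than editing it by hand:

<property>
  <name>metastore.catalog.default</name>
  <value>hive</value>
</property>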
You can use the Hive Warehouse Connector and enable LLAP in the Hive conf.
In HDP 3.0 and later, Spark and Hive use independent catalogs for accessing SparkSQL or Hive tables on the same or different platforms.
Spark, by default, only reads the Spark catalog. This means Spark applications that attempt to read from or write to tables created using the Hive CLI will fail with a "table not found" exception.
Workarounds:
Create the table in both the Hive CLI and Spark SQL
Use the Hive Warehouse Connector
I am using snappydata-1.0.1 on HDP 2.6.2 with Spark 2.1.1 and was able to connect from an external Spark application. But when I enable Hive support by adding hive-site.xml to the Spark conf, SnappySession lists the tables from the Hive metastore instead of the SnappyData store.
SparkConf sparkConf = new SparkConf().setAppName("TEST APP");
JavaSparkContext javaSparkContxt = new JavaSparkContext(sparkConf);
SparkSession sps = new SparkSession.Builder().enableHiveSupport().getOrCreate();
SnappySession snc = new SnappySession(new SparkSession(javaSparkContxt.sc()).sparkContext());
snc.sqlContext().sql("show tables").show();
The above code gives me the list of tables in the SnappyData store when hive-site.xml is not in the Spark conf; once hive-site.xml is added, it lists the tables from the Hive metastore.
Is it not possible to use the Hive metastore and the SnappyData metastore in the same application?
Can I read a Hive table into one DataFrame and a SnappyData table into another DataFrame in the same application?
Thanks in advance
So, it isn't the Hive metastore that is the problem. You can use Hive tables and Snappy tables in the same application, e.g. copy a Hive table into Snappy in-memory.
But, we will need to test the use of an external Hive metastore configured in hive-site.xml. Perhaps a bug.
You should try using the Snappy smart connector, i.e. run your Spark app using the Spark distribution in HDP and connect to the SnappyData cluster using the connector (see the docs). Here it looks like you are trying to run your Spark app using the SnappyData distribution.
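A rough sketch of what that can look like in one application (the table names are placeholders, and whether the external Hive metastore behaves correctly here is exactly the open question above):

import org.apache.spark.sql.{SparkSession, SnappySession}

// Hive-enabled session: resolves tables against the external Hive metastore from hive-site.xml
val spark = SparkSession.builder().appName("hive-and-snappy").enableHiveSupport().getOrCreate()

// SnappySession sharing the same SparkContext, intended to resolve tables in the SnappyData store
val snc = new SnappySession(spark.sparkContext)

val hiveDF = spark.table("hive_db.hive_table")   // placeholder Hive table
val snappyDF = snc.table("APP.SNAPPY_TABLE")     // placeholder SnappyData table

hiveDF.show()
snappyDF.show()

// Copying hiveDF into an in-memory Snappy column table would then be a write through
// SnappyData's column data source; check the SnappyData docs for the exact write options.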
This is my opinion:
Hive on Spark provides Hive with the ability to utilize Apache Spark as its execution engine. Spark SQL also supports reading and writing data stored in Apache Hive. Hive on Spark only uses Spark as the execution engine. Spark SQL with the Hive metastore not only uses the Spark execution engine, but also uses Spark SQL, which is a Spark module for structured data processing and for executing SQL queries. Because Spark SQL with the Hive metastore does not support all Hive configurations or every version of the Hive metastore (available versions are 0.12.0 through 1.2.1), Hive on Spark is the better and more effective deployment mode in production.
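To make that distinction concrete, a rough sketch of each side (connection details and table names are placeholders): Hive on Spark is switched on inside Hive, while Spark SQL with the Hive metastore is driven from a Spark application.

// Hive on Spark: configured in Hive itself, e.g. via HiveQL or hive-site.xml:
//   SET hive.execution.engine=spark;
// Queries submitted through the Hive CLI / HiveServer2 then run on Spark as the execution engine.

// Spark SQL with the Hive metastore: driven from a Spark application instead
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("spark-sql-with-hive-metastore")
  .enableHiveSupport()   // use the Hive metastore configured in hive-site.xml
  .getOrCreate()

spark.sql("SELECT count(*) FROM some_db.some_table").show()   // placeholder table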
So, am I wrong? Does anyone have other ideas?
I am using Spark (standalone) from CDH 5.4.2.
After copying hive-site.xml to $SPARK_HOME/conf, I can query Hive from spark-shell, as below:
scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc);
hiveContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@6c6f3a15
scala> hiveContext.sql("show tables").show();
But when I open spark-sql, it shows an error:
java.lang.ClassNotFoundException: org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver
Failed to load main class org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.
You need to build Spark with -Phive and -Phive-thriftserver.
What is the difference between spark-shell and spark-sql? If CDH's Spark doesn't support Hive, why can I use HiveContext?
Cloudera has a list of unsupported features here:
https://docs.cloudera.com/runtime/7.2.6/spark-overview/topics/spark-unsupported-features.html
The Thrift server is not supported.
This is a copy of the list for 7.2.6:
Apache Spark experimental features/APIs are not supported unless stated otherwise.
Using the JDBC Datasource API to access Hive or Impala is not supported
ADLS not supported for All Spark Components. Microsoft Azure Data Lake Store (ADLS) is a cloud-based filesystem that you can access through Spark applications. Spark with Kudu is not currently supported for ADLS data. (Hive on Spark is available for ADLS.)
IPython / Jupyter notebooks is not supported. The IPython notebook system (renamed to Jupyter as of IPython 4.0) is not supported.
Certain Spark Streaming features, such as the mapWithState method, are not supported.
Thrift JDBC/ODBC server is not supported
Spark SQL CLI is not supported
GraphX is not supported
SparkR is not supported
Structured Streaming is supported, but the following features of it are not:
Continuous processing, which is still experimental, is not supported.
Stream static joins with HBase have not been tested and therefore are not supported.
Spark cost-based optimizer (CBO) not supported.