Read all tables from one Hive, then write to another Hive on another cluster using Spark

We can read or write Hive tables from Spark by putting hive-site.xml in Spark's "conf" directory. But now I have two clusters that can connect to each other: say, Hive 1 on one cluster and Hive 2 on another.
Now I need to read data from Hive 1, do some transformations, then write to Hive 2. The problem is that I can only put one hive-site.xml file in the Spark conf, which means that when I execute
someDataFrame.write.saveAsTable("dbName.tableName")
the table is saved to Hive 1, not Hive 2, because Spark only recognizes one Hive (Hive 1).
My question is: can I read from and write to different Hives on different clusters using Spark?

Since there would only be one Hive Context active during this operation, I'm going to say it's not possible.
At a minimum, you would have to register the table in the "local" Hive metastore as an external table with LOCATION hdfs://othernamenode:9000/table/path, then make Spark write to it that way, but I've not tried it.
Alternatively, look into the Circus Train project for migrating Hive tables.
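The external-table idea above could be sketched roughly as follows. This is untested against a real two-cluster setup; the helper names, the column list, and the choice of Parquet are assumptions made for illustration, and `spark` is assumed to be a Hive-enabled SparkSession.

```python
def external_table_ddl(table, columns, location, fmt="PARQUET"):
    """Build a CREATE EXTERNAL TABLE statement whose data lives at `location`.

    `columns` is a list of (name, hive_type) pairs (hypothetical schema).
    """
    cols = ", ".join(f"`{name}` {dtype}" for name, dtype in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} ({cols}) "
        f"STORED AS {fmt} LOCATION '{location}'"
    )


def write_to_remote_location(spark, df, table, columns, location):
    """Register `table` in the local metastore, then append `df` into it."""
    spark.sql(external_table_ddl(table, columns, location))
    # insertInto matches columns by position, so `df` must follow `columns`.
    df.write.insertInto(table)
```

For example, write_to_remote_location(spark, df, "dbName.tableName", [("id", "INT")], "hdfs://othernamenode:9000/table/path"). Note that this only places the files on the other cluster's HDFS; Hive 2's own metastore would still need a table definition pointing at the same path.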

Related

Hive beeline and spark load count doesn't match for hive tables

I am using spark 2.4.4 and hive 2.3 ...
Using spark, I am loading a dataframe as Hive table using DF.insertInto(hiveTable)
If a new table is created during the run (before insertInto, through spark.sql), or the existing table was created by Spark 2.4.4, everything works fine.
The issue is that when I load into existing tables (older tables created by Spark 2.2 or earlier), the record counts differ: counting the target Hive table through beeline gives a different result than counting through Spark SQL.
Please assist.
There seems to be an issue with the sync between the Hive metastore and the Spark catalog for Hive tables (in Parquet file format) that were created on Spark 2.2 or before, with complex/nested data types, and loaded using Spark 2.4.
In the usual case, spark.catalog.refreshTable(<hive-table-name>) will refresh the stats from the Hive metastore into spark.catalog.
In this case, an explicit spark.catalog.refreshByPath(<location-maprfs-path>) needs to be executed to refresh the stats.
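A minimal sketch of that refresh step, assuming `spark` is a Hive-enabled SparkSession. The helper name and its arguments are made up for illustration; refreshTable and refreshByPath are the PySpark catalog calls.

```python
def refresh_and_count(spark, table_name, table_path=None):
    """Drop cached metadata for a table, then return a fresh row count."""
    # Invalidate the cached metadata for the table itself.
    spark.catalog.refreshTable(table_name)
    if table_path is not None:
        # Also invalidate anything cached for the underlying files.
        spark.catalog.refreshByPath(table_path)
    return spark.table(table_name).count()
```

Calling it after the load and comparing the result with the beeline count is one way to check whether the stale-cache explanation applies.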

Hive table requires 'repair' for every new partitions while inserting parquet files using pyspark

I have spark conf as:
sparkConf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
sparkConf.set("hive.exec.dynamic.partition", "true")
sparkConf.set("hive.exec.dynamic.partition.mode", "nonstrict")
I am using the spark context to write the parquet files into hdfs location as:
df.write.partitionBy('asofdate').mode('append').parquet('parquet_path')
In the HDFS location, the parquet files are stored partitioned by 'asofdate', but for the Hive table I have to run 'MSCK REPAIR TABLE <tbl_name>' every day. I am looking for a way to recover the table for every new partition from the Spark script (or at partition creation time itself).
It's better if you integrate hive with spark to make your job easier.
After the hive-spark integration setup, you can enable hive support while creating SparkSession.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
Now you can access hive tables from spark.
You can run repair command from spark itself.
spark.sql("MSCK REPAIR TABLE <tbl_name>")
I would suggest writing the DataFrame directly as a Hive table instead of writing it to parquet and then repairing the table.
df.write.partitionBy("<partition_column>").mode("append").format("parquet").saveAsTable("<table>")
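If you keep writing plain parquet files, a narrower alternative to a full MSCK REPAIR is to register only the partitions you just wrote. The sketch below assumes Spark's usual column=value directory layout and a single string-like partition column; the helper names are hypothetical, and `spark` is assumed to be a Hive-enabled SparkSession.

```python
def add_partition_sql(table, column, value, base_path):
    """Build an ALTER TABLE statement for one partition directory."""
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION ({column}='{value}') "
        f"LOCATION '{base_path}/{column}={value}'"
    )


def register_partitions(spark, table, column, values, base_path):
    """Run one ALTER TABLE per new partition value and return the statements."""
    statements = [add_partition_sql(table, column, v, base_path) for v in values]
    for statement in statements:
        spark.sql(statement)
    return statements
```

After the df.write...parquet('parquet_path') call, you would pass the asofdate values written in that run, for example register_partitions(spark, "<tbl_name>", "asofdate", ["2020-01-01"], "parquet_path").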

Apache Spark not using partition information from Hive partitioned external table

I have a simple Hive-External table which is created on top of S3 (Files are in CSV format). When I run the hive query it shows all records and partitions.
However, when I use the same table in Spark (where the Spark SQL has a WHERE condition on the partition column), the plan does not show that a partition filter is applied. For a Hive managed table, on the other hand, Spark is able to use the partition information and apply the partition filter.
Is there any flag or setting that can help me make use of partitions of Hive external tables in Spark ? Thanks.
Update :
For some reason, it is only the Spark plan that does not show the Partition Filters. When you look at the data loaded, it's only loading the data needed from the partitions.
Ex: with WHERE rating=0, it loads only one file of 1 MB; when I don't have a filter, it reads all 3 partitions for 3 MB.
TL;DR: set the following before running the SQL for the external table
spark.sql("set spark.sql.hive.convertMetastoreOrc=true")
The difference in behaviour is not because of external vs. managed tables.
The behaviour depends on two factors:
1. Where the table was created (Hive or Spark)
2. File format (I believe it is ORC in this case, from the screen capture)
Where the table was created (Hive or Spark)
If the table was created using Spark APIs, it is considered a Datasource table.
If the table was created using HiveQL, it is considered a Hive native table.
The metadata of both kinds of table is stored in the Hive metastore; the only difference is the provider field in the TBLPROPERTIES of the table (describe extended <tblName>). The value of the property is orc or empty for a Spark table, and hive for a Hive table.
How Spark uses this information
When the provider is not hive (Datasource table), Spark uses its native way of processing the data.
If the provider is hive, Spark uses Hive code to process the data.
File format
Spark provides config flags that instruct the engine to use the Datasource way of processing the data for the following file formats: ORC and Parquet
Flags:
Orc
val CONVERT_METASTORE_ORC = buildConf("spark.sql.hive.convertMetastoreOrc")
  .doc("When set to true, the built-in ORC reader and writer are used to process " +
    "ORC tables created by using the HiveQL syntax, instead of Hive serde.")
  .booleanConf
  .createWithDefault(true)
Parquet
val CONVERT_METASTORE_PARQUET = buildConf("spark.sql.hive.convertMetastoreParquet")
  .doc("When set to true, the built-in Parquet reader and writer are used to process " +
    "parquet tables created by using the HiveQL syntax, instead of Hive serde.")
  .booleanConf
  .createWithDefault(true)
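To check whether the flag changed anything, one option is to turn it on and then look for a non-empty PartitionFilters entry in the explain output. This is only a sketch: the helper names are made up, and the text check assumes the `PartitionFilters: [...]` form that Spark's native file scan nodes print. A Hive serde scan prints its plan differently, so a False result there does not by itself prove that pruning is absent.

```python
import re

# Config keys quoted from the Spark source above, keyed by file format.
_CONVERT_FLAGS = {
    "orc": "spark.sql.hive.convertMetastoreOrc",
    "parquet": "spark.sql.hive.convertMetastoreParquet",
}


def enable_native_reader(spark, file_format):
    """Ask Spark to use its built-in reader for Hive tables of this format."""
    key = _CONVERT_FLAGS[file_format.lower()]
    spark.conf.set(key, "true")
    return key


def has_partition_filters(plan_text):
    """True if the plan text contains a non-empty PartitionFilters list."""
    return re.search(r"PartitionFilters: \[(?!\])", plan_text) is not None
```

The plan text can be copied from the output of spark.sql("...").explain(True) after calling enable_native_reader(spark, "orc").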
I also ran into this kind of problem with multiple joins of internal and external tables.
None of the tricks worked, including:
spark.sql("set spark.sql.hive.convertMetastoreParquet=false")
spark.sql("set spark.sql.hive.metastorePartitionPruning=true")
spark.sql("set spark.sql.hive.caseSensitiveInferenceMode=NEVER_INFER")
Does anyone know how to solve this problem?

Is it possible to use Spark with ORC file format without Hive?

I am working with HDP 2.6.4, to be more specific Hive 1.2.1 with TEZ 0.7.0 , Spark 2.2.0.
My task is simple. Store data in ORC file format then use Spark to process the data. To achieve this, I am doing this:
Create a Hive table through HiveQL
Use spark.sql("select ... from ...") to load data into dataframe
Process against the dataframe
My questions are:
1. What is Hive's role behind the scene?
2. Is it possible to skip Hive?
You can skip Hive and use Spark SQL to run the command in step 1.
In your case, Hive is defining a schema over your data and providing a query layer through which Spark and external clients can communicate.
Otherwise, spark.read.orc and df.write.orc exist for reading and writing DataFrames directly on the filesystem.
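A minimal sketch of the Hive-free route, assuming `spark` is a plain SparkSession (no Hive support needed). The function name, paths, and the optional filter are placeholders for whatever processing you do.

```python
def process_orc(spark, src_path, dst_path, condition=None):
    """Read ORC files directly, optionally filter them, and write ORC back."""
    df = spark.read.orc(src_path)  # schema comes from the ORC files themselves
    if condition is not None:
        df = df.filter(condition)  # e.g. a SQL expression string
    df.write.mode("overwrite").orc(dst_path)
    return df
```

Because the schema is read from the ORC files, no table definition is involved; the trade-off is that other clients cannot find the data by table name.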

How to use different Hive metastore for saveAsTable?

I am using Spark SQL (Spark 1.6.1) using PySpark and I have a requirement of loading a table from one Hive metastore and writing the result of the dataframe into a different Hive metastore.
I am wondering how can I use two different metastores for one spark SQL script?
Here is what my script looks like.
# Hive metastore 1
sc1 = SparkContext()
hiveContext1 = HiveContext(sc1)
hiveContext1.setConf("hive.metastore.warehouse.dir", "tmp/Metastore1")
#Hive metastore 2
sc2 = SparkContext()
hiveContext2 = HiveContext(sc2)
hiveContext2.setConf("hive.metastore.warehouse.dir", "tmp/Metastore2")
# Reading from a table present in metastore1
df_extract = hiveContext1.sql("select * from emp where emp_id =1")
# Need to write the result into a different dataframe
df_extract.saveAsTable('targetdbname.target_table',mode='append',path='maprfs:///abc/datapath...')
HotelsDotCom have developed an application (WaggleDance) specifically for this: https://github.com/HotelsDotCom/waggle-dance. Using it as a proxy, you should be able to achieve what you're trying to do.
TL;DR It is not possible to use one Hive metastore (for some tables) and another (for other tables).
Since Spark SQL supports a single Hive metastore (in a SharedState) regardless of the number of SparkSessions, reading from and writing to different Hive metastores is technically impossible.
