Getting duplicate records while querying a Hudi table using Hive on Spark engine in EMR 6.3.1

I am querying a Hudi table using Hive running on the Spark engine in an EMR 6.3.1 cluster.
The Hudi version is 0.7.
I inserted a few records and then updated them using Hudi Merge on Read (MoR). Internally, this creates new files under the same partition containing the updated records.
Now, when I query the same table using Spark SQL, it works fine and returns no duplicates: it honours only the latest records/parquet files. It also works fine when I use Tez as the underlying engine for Hive.
But when I run the same query at the Hive prompt with Spark as the execution engine, it returns all the records and does not filter out the older parquet files.
I have tried setting the property spark.sql.hive.convertMetastoreParquet=false, but it did not help.
Please help.
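What the working engines (Spark SQL, Tez) are doing is snapshot resolution: for each record key, only the version from the latest commit is returned. A minimal pure-Python sketch of that idea (the field names "key", "commit_time", and "val" are illustrative, not Hudi's actual meta columns):

```python
# Sketch of merge-on-read snapshot resolution: for each record key,
# keep only the version written by the latest commit.

def latest_per_key(records):
    latest = {}
    for rec in records:
        k = rec["key"]
        # Keep this record if it is the first seen for the key,
        # or if it was written by a later commit than the one we have.
        if k not in latest or rec["commit_time"] > latest[k]["commit_time"]:
            latest[k] = rec
    return list(latest.values())

rows = [
    {"key": "id1", "commit_time": 1, "val": "a"},   # original insert
    {"key": "id1", "commit_time": 2, "val": "a2"},  # update written to a new file slice
    {"key": "id2", "commit_time": 1, "val": "b"},
]

deduped = latest_per_key(rows)
```

An engine that skips this step (as Hive-on-Spark appears to here) returns both versions of id1.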

This is a known issue in Hudi.
Still, with the property below I am able to remove the duplicates from RO (read-optimised) Hudi tables. The issue persists for RT (real-time) tables.
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat
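For context, the property is set per session before querying. A sketch assuming a MoR table registered as my_table, so Hive exposes my_table_ro and my_table_rt views (the table name is illustrative):

```sql
-- In Beeline / Hive CLI, before running the query:
SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;

-- Query the read-optimised view (deduplicated per this answer);
-- the _rt view may still show duplicates.
SELECT * FROM my_table_ro;
```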

Related

Creating an Athena view on a HUDI table returns soft deleted records when the view is read using SPARK

I have multiple HUDI tables with differing column names and I built a view on top of them to standardize the column names. When this view is read from Athena, it returns a correct response. But when the same view is read from Spark using spark.read.parquet("<>"), it returns the soft-deleted records too.
I understand a HUDI table needs to be read with spark.read.format("hudi"), but since this is a view on it, I have to use spark.read.parquet("").
Is there a way to force HUDI to retain only the latest commit in the table and suppress all the old commits?
An Athena view is a virtual table stored in the Glue metastore. The best way to get the same result as Athena in Spark is to use AWS Glue as the metastore/catalog for your Spark session. To do that you can use this lib, which lets you use AWS Glue as a Hive metastore; then you can read the view using spark.read.table("<database name>.<view name>") or via a SQL query:
val df = spark.sql("SELECT * FROM <database name>.<view name>")
Try to avoid spark.read.parquet("") because it does not use the Hudi metadata at all. If you have issues with Glue, you can use Hive to create the same view you created in Athena for Spark.
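As a sketch of the Glue-backed session the answer describes (the factory class below is the one used by AWS's Glue Data Catalog client for the Hive metastore; verify it against your EMR release, and the database/view placeholders are kept from the answer):

```python
# PySpark sketch: point the session at the Glue Data Catalog so the Athena
# view resolves, then read it as a table rather than raw parquet.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("read-athena-view")
         .config("spark.hadoop.hive.metastore.client.factory.class",
                 "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
         .enableHiveSupport()
         .getOrCreate())

df = spark.read.table("<database name>.<view name>")
```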

Hive Beeline and Spark load counts don't match for Hive tables

I am using Spark 2.4.4 and Hive 2.3 ...
Using Spark, I am loading a dataframe into a Hive table using DF.insertInto(hiveTable).
If the table is created during the run (before insertInto, through spark.sql), or if it is an existing table created by Spark 2.4.4, everything works fine.
The issue is with loading existing tables created by Spark 2.2 or earlier: the record COUNT differs when the target Hive table is counted through Beeline versus Spark SQL.
Please assist.
There seems to be an issue with the sync between the Hive metastore and the Spark catalog for Hive tables (with parquet file format) created on Spark 2.2 or before with complex/nested data types and loaded using Spark 2.4.
In the usual case, spark.catalog.refreshTable(<hive-table-name>) refreshes the stats from the Hive metastore into the Spark catalog.
In this case, an explicit spark.catalog.refreshByPath(<location-maprfs-path>) needs to be executed to refresh the stats.
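As a sketch of the two calls (assuming an active SparkSession named spark; the table name and path are placeholders, not from the thread):

```python
# Usual case: refresh cached metadata/stats via the table name.
spark.catalog.refreshTable("db.my_table")

# This case: refresh via the storage path the table is backed by.
spark.catalog.refreshByPath("maprfs:///path/to/db.db/my_table")
```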

Pyspark on EMR and external hive/glue - can drop but not create tables via sqlContext

I'm writing a dataframe to an external hive table from pyspark running on EMR. The work involves dropping/truncating data from an external hive table, writing the contents of a dataframe into the aforementioned table, then writing the data from hive to DynamoDB. I am not looking to write to an internal table on the EMR cluster because I would like the hive data to be available to subsequent clusters. I could write to the Glue catalog directly and force it to be registered, but that is a step further than I need to go.
All components work fine individually on a given EMR cluster: I can create an external hive table on EMR, either using a script or ssh and hive shell. This table can be queried by Athena and can be read from by pyspark. I can create a dataframe and INSERT OVERWRITE the data into the aforementioned table in pyspark.
I can then use hive shell to copy the data from the hive table into a DynamoDB table.
I'd like to wrap all of the work into the one pyspark script instead of having to submit multiple distinct steps.
I am able to drop tables using
sqlContext.sql("drop table if exists default.my_table")
When I try to create a table using sqlContext.sql("create table default.mytable(id string,val string) STORED AS ORC") I get the following error:
org.apache.hadoop.net.ConnectTimeoutException: Call From ip-xx-xxx-xx-xxx/xx.xxx.xx.xx to ip-xxx-xx-xx-xx:8020 failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=ip-xxx-xx-xx-xx:8020]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout
I can't figure out why I can create an external hive table in Glue using the hive shell on the cluster, and drop the table using the hive shell or the pyspark sqlContext, but I can't create a table using sqlContext. I have checked around and the solutions offered don't make sense in this context (copying hive-site.xml), as I can clearly write to the required addresses with no hassle, just not in pyspark. And it is doubly strange that I can drop the tables, and they are definitely dropped when I check in Athena.
Running on:
emr-5.28.0,
Hadoop distribution Amazon 2.8.5
Spark 2.4.4
Hive 2.3.6
Livy 0.6.0 (for notebooks but my experimentation is via ssh and pyspark shell)
Turns out I could create tables via a spark.sql() call as long as I provided a location for the tables. It seems the Hive shell doesn't require this, yet spark.sql() does. Unexpected, but not entirely surprising.
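A sketch of the workaround described above, using the sqlContext from the question (the S3 path and table name are illustrative):

```python
# Creating the table works once an explicit LOCATION is supplied.
sqlContext.sql("""
    CREATE TABLE IF NOT EXISTS default.my_table (id STRING, val STRING)
    STORED AS ORC
    LOCATION 's3://my-bucket/my_table/'
""")
```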
Complementing #Zeathor's answer. After configuring the EMR and Glue connection and permissions (you can check more here: https://www.youtube.com/watch?v=w20tapeW1ME), you will just need to write Spark SQL commands:
spark = SparkSession.builder.appName('TestSession').getOrCreate()
spark.sql("create database if not exists test")
You can then create your tables from dataframes:
df.createOrReplaceTempView("first_table")
spark.sql("create table test.table_name as select * from first_table")
All the database and table metadata will then be stored in the AWS Glue Data Catalog.

SparkSQL attempts to read data from non-existing path

I am having an issue with the pyspark sql module. I created a partitioned table and saved it as parquet files in a Hive table by running a Spark job after multiple transformations.
The data loads successfully into Hive and I am able to query it there. But when I try to query the same data from Spark, it says the file path doesn't exist:
java.io.FileNotFoundException: File hdfs://localhost:8020/data/path/of/partition partition=15f244ee8f48a2f98539d9d319d49d9c does not exist
The partition mentioned in the above error is old partition data that doesn't even exist anymore.
I have run the Spark job which populates a new partition value.
I searched for solutions, but all I can find is people saying there was no issue in Spark 1.4 and there is an issue in 1.6.
Can someone please suggest a solution for this problem?
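A hedged sketch of the usual first mitigation for stale partition metadata of this kind (the table name is a placeholder; this is a common suggestion, not confirmed by this thread):

```sql
-- Spark SQL: invalidate the cached file listing / metadata for the table
-- so the dropped partition is no longer expected on disk.
REFRESH TABLE db.my_table;

-- Hive: re-sync the partition list with what actually exists on storage.
MSCK REPAIR TABLE db.my_table;
```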

How to save temptable to Hive metastore (and analyze it in Hive)?

I use Spark 1.3.1.
How do I store/save DataFrame data to a Hive metastore?
In Hive, if I run show tables, the DataFrame does not appear as a table in the Hive databases. I have copied hive-site.xml to $SPARK_HOME/conf, but it didn't help (and the dataframe does not appear in the Hive metastore either).
I am following this document, using Spark 1.4:
dataframe.registerTempTable("people")
How can I analyze the Spark table in Hive?
You can use the insertInto method to save a dataframe into Hive.
Example:
dataframe.insertInto("table_name", false)
// 2nd arg specifies whether the table should be overwritten
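Note that insertInto requires the Hive table to already exist. If the goal is for the table to show up under show tables in Hive, saveAsTable registers it in the metastore. A sketch against the Spark 1.3-era DataFrame API (verify the method on your version):

```python
# Persist the dataframe as a managed Hive table, visible to `show tables` in Hive.
# Spark 1.3-era API; on 1.4+ the equivalent is dataframe.write.saveAsTable("people").
dataframe.saveAsTable("people")
```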
Got the solution. I was using Spark 1.3.1, which does not support all of this. After moving to Spark 1.5.1 the problem was resolved.
I have noticed dataframes fully working after 1.4.0. Many commands are deprecated.
