Creating an Athena view on a Hudi table returns soft-deleted records when the view is read using Spark - apache-spark

I have multiple Hudi tables with differing column names, and I built a view on top of them to standardize the column names. When this view is read from Athena, it returns the correct result. But when the same view is read in Spark using spark.read.parquet("<>"), it returns the soft-deleted records too.
I understand a Hudi table needs to be read with spark.read.format("hudi"), but since this is a view on top of it, I have to use spark.read.parquet("").
Is there a way to force Hudi to retain only the latest commit in the table and suppress all the old commits?

An Athena view is a virtual table stored in the Glue metastore, so the best way to get the same result in Spark as in Athena is to use AWS Glue as the metastore/catalog for your Spark session. To do that you can use this lib, which allows you to use AWS Glue as a Hive metastore; then you can read the view using spark.read.table("<database name>.<view name>") or via a SQL query:
val df = spark.sql("SELECT * FROM <database name>.<view name>")
Try to avoid spark.read.parquet("") because it doesn't use the Hudi metadata at all. If you have issues with Glue, you can use Hive to create the same view you created in Athena for Spark.
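As a rough sketch of that setup (assuming the Glue Data Catalog client library is on the classpath, as it is on EMR when the Glue Data Catalog option is enabled; the placeholder names are illustrative):
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("read-athena-view")
    # Point the Hive metastore client at the AWS Glue Data Catalog
    .config("spark.hadoop.hive.metastore.client.factory.class",
            "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
    .enableHiveSupport()
    .getOrCreate()
)

df = spark.read.table("<database name>.<view name>")
# or equivalently:
df = spark.sql("SELECT * FROM <database name>.<view name>")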

Related

Write a spark DataFrame to a table

I am trying to understand the Spark DataFrame API method called saveAsTable.
I have the following questions:
If I simply write a dataframe using the saveAsTable API,
df7.write.saveAsTable("t1") (assuming t1 did not exist earlier), will the newly created table be a Hive table which can be read outside Spark using HiveQL?
Does Spark also create some non-Hive tables (tables created using the saveAsTable API that cannot be read outside Spark using HiveQL)?
How can I check if a table is a Hive table or a non-Hive table?
(I am new to big data processing, so pardon me if the question is not phrased properly.)
Yes. The newly created table will be a Hive table and can be queried from the Hive CLI (only if the DataFrame is created from a single input HDFS path, i.e. from a non-partitioned single input HDFS path).
Below is the documentation comment in DataFrameWriter.scala class. Documentation link
When the DataFrame is created from a non-partitioned
HadoopFsRelation with a single input path, and the data source
provider can be mapped to an existing Hive builtin SerDe (i.e. ORC and
Parquet), the table is persisted in a Hive compatible format, which
means other systems like Hive will be able to read this table.
Otherwise, the table is persisted in a Spark SQL specific format.
Yes, you can. Your table can be partitioned by a column, but it cannot use bucketing (that's an incompatibility between Spark and Hive).
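A short sketch of both points, assuming df7 is the DataFrame from the question and the table names are illustrative:
# Hive-compatible when the source maps to a builtin SerDe (Parquet/ORC)
df7.write.format("parquet").saveAsTable("t1")

# Partitioning by a column is fine; bucketBy() produces a Spark-specific table
# that HiveQL cannot read
df7.write.partitionBy("dt").format("parquet").saveAsTable("t1_partitioned")

# To check whether a table is a Hive table or a Spark-specific one, inspect its
# provider/SerDe details
spark.sql("DESCRIBE EXTENDED t1").show(truncate=False)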

Pyspark on EMR and external hive/glue - can drop but not create tables via sqlContext

I'm writing a dataframe to an external Hive table from pyspark running on EMR. The work involves dropping/truncating data from an external Hive table, writing the contents of a dataframe into the aforementioned table, then writing the data from Hive to DynamoDB. I am looking to write to an internal table on the EMR cluster, but for now I would like the Hive data to be available to subsequent clusters. I could write to the Glue catalog directly and force it to be registered, but that is a step further than I need to go.
All components work fine individually on a given EMR cluster: I can create an external hive table on EMR, either using a script or ssh and hive shell. This table can be queried by Athena and can be read from by pyspark. I can create a dataframe and INSERT OVERWRITE the data into the aforementioned table in pyspark.
I can then use hive shell to copy the data from the hive table into a DynamoDB table.
I'd like to wrap all of the work into the one pyspark script instead of having to submit multiple distinct steps.
I am able to drop tables using
sqlContext.sql("drop table if exists default.my_table")
When I try to create a table using sqlContext.sql("create table default.mytable(id string,val string) STORED AS ORC") I get the following error:
org.apache.hadoop.net.ConnectTimeoutException: Call From ip-xx-xxx-xx-xxx/xx.xxx.xx.xx to ip-xxx-xx-xx-xx:8020 failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=ip-xxx-xx-xx-xx:8020]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout
I can't figure out why I can create an external Hive table in Glue using the hive shell on the cluster, and drop the table using the hive shell or the pyspark sqlContext, but I can't create a table using sqlContext. I have checked around and the solutions offered don't make sense in this context (copying hive-site.xml), as I can clearly write to the required addresses with no hassle, just not in pyspark. And it is doubly strange that I can drop the tables, and they are definitely dropped when I check in Athena.
Running on:
emr-5.28.0,
Hadoop distribution Amazon 2.8.5
Spark 2.4.4
Hive 2.3.6
Livy 0.6.0 (for notebooks but my experimentation is via ssh and pyspark shell)
It turns out I could create tables via a spark.sql() call as long as I provided a location for the tables. It seems the Hive shell doesn't require a location, yet spark.sql() does. Unexpected, but not entirely surprising.
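For illustration, a create statement with an explicit location along the lines of the fix described (the S3 path is a placeholder):
sqlContext.sql("""
    CREATE TABLE IF NOT EXISTS default.mytable (id STRING, val STRING)
    STORED AS ORC
    LOCATION 's3://<your-bucket>/<prefix>/mytable/'
""")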
Complementing @Zeathor's answer. After configuring the EMR and Glue connection and permissions (you can check more here: https://www.youtube.com/watch?v=w20tapeW1ME), you will just need to write Spark SQL commands:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('TestSession').getOrCreate()
spark.sql("create database if not exists test")
You can then create your tables from dataframes:
df.createOrReplaceTempView("first_table")
spark.sql("create table test.table_name as select * from first_table")
All the database and table metadata will then be stored in the AWS Glue Data Catalog.
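As a quick sanity check (assuming the session is backed by the Glue Data Catalog), you can list what was registered:
spark.sql("SHOW TABLES IN test").show()
# or
print(spark.catalog.listTables("test"))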

Apache Spark not using partition information from Hive partitioned external table

I have a simple Hive-External table which is created on top of S3 (Files are in CSV format). When I run the hive query it shows all records and partitions.
However, when I use the same table in Spark (where the Spark SQL has a WHERE condition on the partition column), it does not show that a partition filter is applied. However, for a Hive managed table, Spark is able to use the partition information and apply the partition filter.
Is there any flag or setting that can help me make use of partitions of Hive external tables in Spark ? Thanks.
Update :
For some reason, only the Spark plan is not showing the partition filters. However, when I look at the data loaded, it only loads the data needed from the partitions.
Ex: WHERE rating=0 loads only one file of 1 MB; when I don't have the filter, it reads all 3 partitions for 3 MB.
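One way to see what the update describes is to compare the physical plan with the files that are actually read; the table name below is illustrative (rating is the partition column from the question):
df = spark.sql("SELECT * FROM my_external_table WHERE rating = 0")
df.explain(True)
# A native datasource scan prints "PartitionFilters: [...]" in the physical plan;
# a HiveTableRelation scan does not print pruning the same way, which can make it
# look as if no partition filter is applied even when only one partition is read.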
tl;dr: set the following before running the SQL for the external table
spark.sql("set spark.sql.hive.convertMetastoreOrc=true")
The difference in behaviour is not because of external/managed tables.
The behaviour depends on two factors:
1. Where the table was created (Hive or Spark)
2. File format (I believe it is ORC in this case, from the screen capture)
Where the table was created (Hive or Spark)
If the table was created using Spark APIs, it is considered a datasource table.
If the table was created using HiveQL, it is considered a Hive native table.
The metadata of both these tables is stored in the Hive metastore; the only difference is the provider field in the TBLPROPERTIES of the tables (describe extended <tblName>). The value of the property is orc or empty for a Spark table and hive for a Hive native table.
How Spark uses this information
When the provider is not hive (datasource table), Spark uses its native way of processing the data.
If the provider is hive, Spark uses Hive code to process the data.
File format
Spark provides a config flag to instruct the engine to use the datasource way of processing the data for the following file formats: ORC and Parquet.
Flags:
Orc
val CONVERT_METASTORE_ORC = buildConf("spark.sql.hive.convertMetastoreOrc")
  .doc("When set to true, the built-in ORC reader and writer are used to process " +
    "ORC tables created by using the HiveQL syntax, instead of Hive serde.")
  .booleanConf
  .createWithDefault(true)
Parquet
val CONVERT_METASTORE_PARQUET = buildConf("spark.sql.hive.convertMetastoreParquet")
  .doc("When set to true, the built-in Parquet reader and writer are used to process " +
    "parquet tables created by using the HiveQL syntax, instead of Hive serde.")
  .booleanConf
  .createWithDefault(true)
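Putting that together, a small sketch (the table name is a placeholder, following the describe extended example above):
# 1. Check whether the table is a Hive native table or a Spark datasource table
spark.sql("DESCRIBE EXTENDED <tblName>").show(truncate=False)  # look at the provider field

# 2. For Hive native ORC/Parquet tables, let Spark use its native readers
#    (true is already the default in the Spark source quoted above)
spark.sql("set spark.sql.hive.convertMetastoreOrc=true")
spark.sql("set spark.sql.hive.convertMetastoreParquet=true")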
I also ran into this kind of problem when having multiple joins of internal and external tables.
None of the tricks work, including:
spark.sql("set spark.sql.hive.convertMetastoreParquet=false")
spark.sql("set spark.sql.hive.metastorePartitionPruning=true")
spark.sql("set spark.sql.hive.caseSensitiveInferenceMode=NEVER_INFER")
Does anyone know how to solve this problem?

SparkSQL on hive partitioned external table on amazon s3

I am planning to use SparkSQL (not PySpark) on top of data in Amazon S3. So I believe I need to create a Hive external table and then I can use SparkSQL. But the S3 data is partitioned, and I want the partitions reflected in the Hive external table as well.
What is the best way to manage the Hive table on a daily basis, given that new partitions can be created or old partitions overwritten every day? What should be done to keep the Hive external table up-to-date?
Create an intermediate table and load into your Hive table with INSERT OVERWRITE on the date partition.
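A minimal sketch of that daily routine (the database, table, and partition names are made up for illustration):
# register any new S3 prefixes as partitions of the external table
spark.sql("MSCK REPAIR TABLE mydb.events_external")

# or overwrite a single daily partition from the intermediate table
spark.sql("""
    INSERT OVERWRITE TABLE mydb.events_external PARTITION (dt = '2020-01-01')
    SELECT id, val FROM mydb.events_staging WHERE dt = '2020-01-01'
""")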

Hive Bucketed Tables enabled for Transactions

So we are trying to create a Hive table in ORC format, bucketed and enabled for transactions, using the statement below:
create table orctablecheck (id int, name string) clustered by (sno) into 3 buckets stored as orc TBLPROPERTIES ('transactional'='true')
The table is created in Hive and is also reflected in Beeline, both in the metastore as well as in Spark SQL (which we have configured to run on top of Hive JDBC).
We are now inserting data into this table via Hive. However, we see that after insertion the data doesn't show up in Spark SQL. It only shows up correctly in Hive.
The table only shows the data if we restart the Thrift Server.
Is the transactional attribute set on your table? I have observed that the Hive transactional storage structure does not work with Spark yet. You can confirm this by looking at the transactional attribute in the output of the command below in the Hive console.
desc extended <tablename>;
If you need to access the transactional table, consider doing a major compaction and then try accessing the table:
ALTER TABLE <tablename> COMPACT 'major';
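For illustration, a small sketch of those two checks (the DESCRIBE can run from Spark; the compaction itself has to be issued from the Hive console or Beeline, since Spark SQL does not execute ALTER TABLE ... COMPACT):
spark.sql("DESCRIBE EXTENDED orctablecheck").show(truncate=False)  # look for transactional=true

# then, from the Hive console or Beeline:
#   ALTER TABLE orctablecheck COMPACT 'major';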
I created a transactional table in Hive, and stored data in it using Spark (records 1,2,3) and Hive (record 4).
After major compaction,
I can see all 4 records in Hive (using Beeline)
only records 1, 2, 3 in Spark (using spark-shell)
I am unable to update records 1, 2, 3 in Hive
the update to record 4 in Hive is OK
