Apache Hive: CREATE TABLE statement without schema over parquet can fail to infer partition column - apache-spark

I have a partitioned parquet at the following path:
/path/to/partitioned/parq/
with partitions like:
/path/to/partitioned/parq/part_date=2021_01_01_01_01_01
/path/to/partitioned/parq/part_date=2021_01_02_01_01_01
/path/to/partitioned/parq/part_date=2021_01_03_01_01_01
When I run a Spark SQL CREATE TABLE statement like:
CREATE TABLE IF NOT EXISTS
my_db.my_table
USING PARQUET
LOCATION '/path/to/partitioned/parq'
The partition column part_date shows up in my dataset, but DESCRIBE EXTENDED indicates there are no PARTITIONS. SHOW PARTITIONS my_db.my_table shows no partition data.
This seems to happen intermittently: sometimes Spark infers the partitions, other times it doesn't. This causes issues downstream, where we add a partition and run MSCK REPAIR TABLE my_db.my_table, which fails with an error saying it can't be run on non-partitioned tables.
I see that if you DO declare a schema, you can force the PARTITIONED BY part of the clause, but we do not have the luxury of a schema, just the underlying files.
Why is Spark intermittently unable to determine partition columns from a parquet in this shape?

Unfortunately, with Hive you need to specify the schema, even though the Parquet files obviously carry it themselves.
You need to add a PARTITIONED BY clause to the DDL.
Then use an ALTER TABLE statement to add each partition separately with its location.
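As a rough sketch of that last step, the snippet below builds one ALTER TABLE ... ADD PARTITION statement per partition directory. The helper name add_partition_statements is hypothetical; the table name and paths come from the question. It assumes the table was already created with an explicit schema and PARTITIONED BY (part_date), and that each resulting string is then run through spark.sql or in Hive.

```python
import posixpath


def add_partition_statements(table, partition_dirs):
    """Build one ALTER TABLE ... ADD PARTITION statement per directory.

    Hypothetical helper: it assumes a single partition column encoded in
    the last path segment as `column=value`.
    """
    statements = []
    for path in partition_dirs:
        clean = path.rstrip("/")
        column, _, value = posixpath.basename(clean).partition("=")
        if not column or not value:
            raise ValueError(f"not a partition directory: {path}")
        statements.append(
            f"ALTER TABLE {table} ADD IF NOT EXISTS "
            f"PARTITION ({column}='{value}') LOCATION '{clean}'"
        )
    return statements


dirs = [
    "/path/to/partitioned/parq/part_date=2021_01_01_01_01_01",
    "/path/to/partitioned/parq/part_date=2021_01_02_01_01_01",
]
for stmt in add_partition_statements("my_db.my_table", dirs):
    print(stmt)  # in a Spark job: spark.sql(stmt)
```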

Related

Spark Parquet partitioning removes the partition column

If I use df.write.partitionBy(col1).parquet(path), the partition column is removed from the written data.
How can I avoid this?
You can duplicate col1 before writing:
df.withColumn("partition_col", col("col1")).write.partitionBy("partition_col").parquet(path)
Note that this step is not really necessary, because whenever you read a Parquet file in a partitioned directory structure, Spark will automatically add that as a new column to the dataframe.
Spark does not actually drop the column. It encodes the column's values in the directory names used to organize the files, and when you read the data back it restores them as a regular column. If you check the schema of the table or dataframe after reading, the column is still there.
Presumably you partitioned on this column because the data is frequently queried by it, so that reads become faster and more efficient.
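To make the mechanism concrete, here is a simplified plain-Python illustration of how a column value can be recovered from a `column=value` directory name. It is not Spark's implementation (Spark's partition discovery also infers value types and validates the layout), and the function name and the /data/out path are made up for the example.

```python
def partition_values(file_path, base_path):
    """Return {column: value} pairs encoded in the directories under base_path.

    Simplified illustration only; values are kept as strings.
    """
    if not file_path.startswith(base_path):
        raise ValueError(f"{file_path} is not under {base_path}")
    relative = file_path[len(base_path):].strip("/")
    values = {}
    for segment in relative.split("/")[:-1]:  # last segment is the file name
        column, sep, value = segment.partition("=")
        if sep:
            values[column] = value
    return values


row_from_file = {"col2": 42}  # columns physically stored in the file
path = "/data/out/col1=a/part-00000.parquet"
# The partition column is added back from the path when the row is read.
row = {**row_from_file, **partition_values(path, "/data/out")}
print(row)
```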

Databricks - How to change a partition of an existing Delta table?

I have a table in Databricks delta which is partitioned by transaction_date. I want to change the partition column to view_date. I tried to drop the table and then create it with a new partition column using PARTITIONED BY (view_date).
However my attempt failed since the actual files reside in S3 and even if I drop a hive table the partitions remain the same.
Is there any way to change the partition of an existing Delta table? Or is the only solution to drop the actual data and reload it with the new partition column?
There's actually no need to drop tables or remove files. All you need to do is read the current table, overwrite the contents AND the schema, and change the partition column:
val input = spark.read.table("mytable")
input.write.format("delta")
.mode("overwrite")
.option("overwriteSchema", "true")
.partitionBy("colB") // different column
.saveAsTable("mytable")
UPDATE: There previously was a bug with time travel and changes in partitioning that has now been fixed.
As Silvio pointed out, there is no need to drop the table. In fact, the approach strongly recommended by Databricks is to replace the table.
https://docs.databricks.com/sql/language-manual/sql-ref-syntax-ddl-create-table-using.html#parameters
In Spark SQL, this can be done easily with:
REPLACE TABLE <tablename>
USING DELTA
PARTITIONED BY (view_date)
AS
SELECT * FROM <tablename>
Modified example from:
https://docs.databricks.com/delta/best-practices.html#replace-the-content-or-schema-of-a-table
Python solution:
If you need more than one partition column, pass them all: partitionBy(column, column_2, ...)
def change_partition_of(table_name, column):
    # Read the current table, then overwrite both data and schema with the new partitioning
    df = spark.read.table(table_name)
    df.write.format("delta").mode("overwrite").option("overwriteSchema", "true").partitionBy(column).saveAsTable(table_name)

change_partition_of("i.love_python", "column_a")

Replace a hive partition from Spark

Is there a way to replace an existing Hive partition from a Spark program? I want to replace only the latest partition and leave the rest of the partitions unchanged.
Below is the idea I am working on.
We get transactional data from our RDBMS systems coming into HDFS every minute. A Spark program (running every 5 or 10 minutes) reads the data, performs the ETL, and writes the output into a Hive table.
Since overwriting the entire Hive table would be expensive,
we would like to overwrite only today's partition.
At the end of the day, the source and destination partitions would switch to the next day.
Thanks in advance
Since you know the Hive table location and the table is partitioned by date, append the current date's partition directory to the location and overwrite that HDFS path:
df.write.format(source).mode("overwrite").save(path)
Once the write completes, run MSCK REPAIR TABLE on the Hive table.
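A minimal sketch of building that partition path is shown below. The helper name, the table location, and the partition column load_date are hypothetical, and the sketch assumes Hive-style `column=value` directories with ISO dates as values; adjust these to the real table.

```python
from datetime import date


def partition_path(table_location, column, day):
    """Return the HDFS directory backing one date partition.

    Assumes Hive-style `column=value` directories and an ISO date value.
    """
    return f"{table_location.rstrip('/')}/{column}={day.isoformat()}"


# Hypothetical table location and partition column.
path = partition_path("/warehouse/my_db.db/my_table", "load_date", date(2021, 1, 3))
print(path)
# In the Spark job, only this directory would then be overwritten:
#   df.write.format(source).mode("overwrite").save(path)
# followed by:
#   spark.sql("MSCK REPAIR TABLE my_db.my_table")
```

When writing straight into a partition directory like this, it is usually best to drop the partition column from the dataframe first, since its value is already encoded in the path.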

Spark-Hive partitioning

The Hive table was created with 4 buckets.
CREATE TABLE IF NOT EXISTS hourlysuspect ( cells int, sms_in int) partitioned by (traffic_date_hour string) stored as ORC into 4 buckets
The following lines in the spark code insert data into this table
hourlies.write.partitionBy("traffic_date_hour").insertInto("hourly_suspect")
and in the spark-defaults.conf, the number of parallel processes is 128
spark.default.parallelism=128
The problem is that when the inserts happen in the hive table, it has 128 partitions instead of 4 buckets.
The default parallelism cannot be reduced to 4, as that makes the system very slow. I have also tried the DataFrame.coalesce method, but that makes the inserts too slow.
Is there any other way to force the number of buckets to be 4 when the data is inserted into the table?
As of today (Spark 2.2.0), Spark does not support writing to bucketed Hive tables natively using spark-sql. When creating a bucketed table, there should be a CLUSTERED BY clause on one of the columns from the table schema. I don't see that in the specified CREATE TABLE statement. Assuming that it does exist and you know the clustering column, you could add the
.bucketBy(numBuckets, colName)
call when using the DataFrameWriter API. Note that bucketBy is supported with saveAsTable, not with insertInto or save.
More details for Spark 2.0+: https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/DataFrameWriter.html
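Conceptually, a bucketed table assigns each row to a bucket by hashing the clustering column modulo the bucket count, so the number of buckets does not depend on spark.default.parallelism. The plain-Python sketch below illustrates only that idea. The function name and sample values are hypothetical, and zlib.crc32 is a stand-in: Hive and Spark each use their own hash functions.

```python
import zlib


def bucket_for(value, num_buckets):
    """Map a clustering-column value to a bucket id in [0, num_buckets).

    zlib.crc32 is only a stand-in so the sketch is deterministic.
    """
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets


# Hypothetical clustering column values.
cells = [101, 102, 103, 104, 105, 106, 107, 108]
buckets = {}
for cell in cells:
    buckets.setdefault(bucket_for(cell, 4), []).append(cell)

# However many tasks write the data, rows can land in at most 4 buckets.
print(sorted(buckets))
```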

How to store Spark data frame as a dynamic partitioned Hive table in Parquet format?

The current raw data is in Hive. I want to join several partitioned, terabyte-scale Hive tables and then output the result as a partitioned Hive table in Parquet format.
I am considering loading all partitions of the Hive tables as Spark dataframes and then doing the join, group by, and so on. Is this the right approach?
Finally, I will need to save the data. Can we save a Spark dataframe as a dynamically partitioned Hive table in Parquet format? How should the metadata be handled?
If one of the data sets is sufficiently smaller than the others, you may want to consider broadcasting it to make the data transfer more efficient.
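The idea behind a broadcast join can be illustrated in plain Python: build a lookup table from the small side, then stream the large side past it. This is only a sketch of the technique with made-up data and a hypothetical function name, not how Spark implements it.

```python
def broadcast_hash_join(large_rows, small_rows, key):
    """Join two lists of dicts by building a lookup table from the small side.

    Plain-Python illustration only; in Spark the small table would be
    shipped to every executor instead of shuffling the large one.
    """
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)

    joined = []
    for row in large_rows:  # the large side is streamed, never shuffled
        for match in lookup.get(row[key], []):
            joined.append({**match, **row})
    return joined


# Hypothetical data: a large fact table and a small dimension table.
facts = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}, {"id": 1, "amount": 5}]
dims = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
result = broadcast_hash_join(facts, dims, "id")
print(result)
```

In PySpark, the equivalent hint would look like large_df.join(broadcast(small_df), "id"), with broadcast imported from pyspark.sql.functions.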
Depending on the nature of the data, you could try the group by first and then the join, so that each machine only needs to process a specific subset of the data, which reduces the amount of data transferred while tasks run.
Hive supports storing data in Parquet format directly: https://cwiki.apache.org/confluence/display/Hive/Parquet. Have you given that a try?
