Inserting data into a static Hive partition using Spark SQL - apache-spark

I have trouble figuring out how to insert data into a static partition of a Hive table using Spark SQL. I can use code like this to write into dynamic partitions:
df.write.partitionBy("key").insertInto("my_table")
However, I can't figure out how to insert the data into a static partition. That means, I want to define the partition where the entire DataFrame should be written without the need to add the column to the DataFrame.
I see static partitioning mentioned in the
InsertIntoHiveTable class, so I guess it is supported. Is there a public API to do what I want?

You can use
DataFrame tableMeta = sqlContext.sql(String.format("DESCRIBE FORMATTED %s", tableName));
String location = tableMeta.filter("result LIKE 'Location:%'").first().getString(0);
and use regex to get your table partition. Once you get the table location, you can easily construct the partition location like
String partitionLocation = location + "/" + partitionKey
(partitionKey is something like dt=20160329/hr=21)
Then, you can write to that path
df.write.parquet(partitionLocation)
(in my case when I build the dataframe, I do not include the partition columns in. Not sure if there is any error when partition columns are included)

Related

Read partioned hive table in pyspark instead of a parquet

I have a partionned parquet. It is partioned by date like:
/server/my_dataset/dt=2021-08-02
/server/my_dataset/dt=2021-08-01
/server/my_dataset/dt=2021-07-31
...
The size is huge, so I do not want to read it at the time and I need only august part, therefore I use:
spark.read.parquet("/server/my_dataset/dt=2021-08*")
It works just fine. However I am forced to move from reading parquet directly to reading from the corresponding hive table. Something like:
spark.read.table("schema.my_dataset")
However I want to keep the same logic of reading only certain partitions of the data. Is there a way to do so?
Try with filter and like operator.
Example:
spark.read.table("schema.my_dataset").filter(col("dt").like("2021-08%"))
UPDATE:
You can get all the august partition values into a variable then use filter query with in statement.
Example:
#get the partition values into variable and filter required
val lst=df.select("dt").distinct.collect().map(x => x(0).toString)
#then use isin function to filter only required partitions
df.filter(col("dt").isin(lst:_*)).show()
For python sample code:
lst=[1,2]
df.filter(col("dt").isin(*lst)).show()

Spark find max of date partitioned column

I have a parquet partitioned in the following way:
data
/batch_date=2020-01-20
/batch_date=2020-01-21
/batch_date=2020-01-22
/batch_date=2020-01-23
/batch_date=2020-01-24
Here batch_date which is the partition column is of date type.
I want only read the data from the latest date partition but as a consumer I don't know what is the latest value.
I could use a simple group by something like
df.groupby().agg(max(col('batch_date'))).first()
While this would work it's a very inefficient way since it involves a groupby.
I want to know if we can query the latest partition in a more efficient way.
Thanks.
Doing the method suggested by #pasha701 would involve loading the entire spark data frame with all the batch_date partitions and then finding max of that. I think the author is asking for a way to directly find the max partition date and load only that.
One way is to use hdfs or s3fs, and load the contents of the s3 path as a list and then finding the max partition and then loading only that. That would be more efficient.
Assuming you are using AWS s3 format, something like this:
import sys
import s3fs
datelist=[]
inpath="s3:bucket_path/data/"
fs = s3fs.S3FileSystem(anon=False)
Dirs = fs.ls(inpath)
for paths in Dirs:
date=paths.split('=')[1]
datelist.append(date)
maxpart=max(datelist)
df=spark.read.parquet("s3://bucket_path/data/batch_date=" + maxpart)
This would do all the work in lists without loading anything into memory until it finds the one you want to load.
Function "max" can be used without "groupBy":
df.select(max("batch_date"))
Using Show partitions to get all partition of table
show partitions TABLENAME
Output will be like
pt=2012.07.28.08/is_complete=1
pt=2012.07.28.09/is_complete=1
we can get data form specific partition using below query
select * from TABLENAME where pt='2012.07.28.10' and is_complete='1' limit 1;
Or additional filter or group by can be applied on it.
This worked for me in Pyspark v2.4.3. First extract partitions (this is for a dataframe with a single partition on a date column, haven't tried it when a table has >1 partitions):
df_partitions = spark.sql("show partitions database.dataframe")
"show partitions" returns dataframe with single column called 'partition' with values like partitioned_col=2022-10-31. Now we create a 'value' column extracting just the date part as string. This is then converted to date and the max is taken:
date_filter = df_partitions.withColumn('value', to_date(split('partition', '=')[1], 'yyyy-MM-dd')).agg({"value":"max"}).first()[0]
date_filter contains the maximum date from the partition and can be used in a where clause pulling from the same table.

Databricks - How to change a partition of an existing Delta table?

I have a table in Databricks delta which is partitioned by transaction_date. I want to change the partition column to view_date. I tried to drop the table and then create it with a new partition column using PARTITIONED BY (view_date).
However my attempt failed since the actual files reside in S3 and even if I drop a hive table the partitions remain the same.
Is there any way to change the partition of an existing Delta table? Or the only solution will be to drop the actual data and reload it with a newly indicated partition column?
There's actually no need to drop tables or remove files. All you need to do is read the current table, overwrite the contents AND the schema, and change the partition column:
val input = spark.read.table("mytable")
input.write.format("delta")
.mode("overwrite")
.option("overwriteSchema", "true")
.partitionBy("colB") // different column
.saveAsTable("mytable")
UPDATE: There previously was a bug with time travel and changes in partitioning that has now been fixed.
As Silvio pointed out there is no need to drop the table. In fact the strongly recommended approach by databricks is to replace the table.
https://docs.databricks.com/sql/language-manual/sql-ref-syntax-ddl-create-table-using.html#parameters
in spark SQL, This can be done easily by
REPLACE TABLE <tablename>
USING DELTA
PARTITIONED BY (view_date)
AS
SELECT * FROM <tablename>
Modded example from:
https://docs.databricks.com/delta/best-practices.html#replace-the-content-or-schema-of-a-table
Python solution:
If you need more than one column in the partition
partitionBy(column, column_2, ...)
def change_partition_of(table_name, column):
df = spark.read.table(tn)
df.write.format("delta").mode("overwrite").option("overwriteSchema", "true").partitionBy(column).saveAsTable(table_name)
change_partition_of("i.love_python", "column_a")

drop column in a table/view using spark sql only

i have 30 columns in a table i.e table_old
i want to use 29 columns in that table except one . that column is dynamic.
i am using string interpolation.
the below sparksql query i am using
drop_column=now_current_column
var table_new=spark.sql(s"""alter table table_old drop $drop_column""")
but its throwing error
mismatched input expecting 'partition'
i dont want to drop the column using dataframe. i requirement is to drop the column in a table using sparksql only
As mentioned in previous answer, DROP COLUMN is not supported by spark yet.
But, there is a workaround to achieve the same, without much overhead. This trick works for both EXTERNAL and InMemory tables. The code snippet below works for EXTERNAL table, you can easily modify it and use it for InMemory tables as well.
val dropColumnName = "column_name"
val tableIdentifier = "table_name"
val tablePath = "table_path"
val newSchema=StructType(spark.read.table(tableIdentifier).schema.filter(col => col.name != dropColumnName))
spark.sql(s"drop table ${tableIdentifier}")
spark.catalog.createTable(tableIdentifier, "orc", newSchema, Map("path" -> tablePath))
orc is the file format, it should be replaced with the required format. For InMemory tables, remove the tablePath and you are good to go. Hope this helps.
DROP COLUMN (and in general majority of ALTER TABLE commands) are not supported in Spark SQL.
If you want to drop column you should create a new table:
CREATE tmp_table AS
SELECT ... -- all columns without drop TABLE
FROM table_old
and then drop the old table or view, and reclaim the name.
Now drop columns is supported by Spark if you´re using v2 tables. You can check this link
https://spark.apache.org/docs/latest/sql-ref-syntax-ddl-alter-table.html

spark to hive data types

is there a way, how to convert an input string field into orc table with column specified as varchar(xx) in sparksql select query? Or I have to use some workaroud? I'm using Spark 1.6.
I found on Cloudera forum, Spark does not care about length, it saves the value as string with no size limit.
The table is inserted into Hive OK, but I'm little bit worried about data quality.
temp_table = sqlContext.table(ext)
df = temp_table.select(temp_dable.day.cast('string'))
I would like to see something like that :)))
df = temp_table.select(temp_dable.day.cast('varchar(100)'))
Edit:
df.write.partitionBy(part).mode('overwrite').insertInto(int)
Table I'm inserting into is saved as an ORC file (the line above probably should have .format('orc')).
I found here, that If I specify a column as a varchar(xx) type, than the input string will be cutoff to the xx length.
Thx

Resources