How can I preserve HIVE table order in pyspark? - apache-spark

There is a HIVE table where the rows have been saved in a certain order (time-based). However, the timestamp of each row has not been saved.
I want to read this table through PySpark and I want to preserve the order of the rows as they are in HIVE. Is there a way to do that?

Related

SparkSQL cannot find existing rows in a Hive table that Trino (Presto) can

We are using Trino (Presto) and SparkSQL to query Hive tables on S3, but they return different results for the same query on the same tables. We have narrowed down the main problem: there are rows in a problematic Hive table that can be found with a simple WHERE filter on a specific column in Trino, but cannot be found with SparkSQL. The SQL statements are identical in both engines.
On the other hand, SparkSQL can find these rows in the source table of that problematic table, filtering on the same column.
The CTAS statement:
CREATE TABLE problematic_hive_table AS SELECT c1,c2,c3 FROM source_table
The SELECT that finds the missing rows in Trino but not in SparkSQL:
SELECT * FROM problematic_hive_table WHERE c1='missing_rows_value_in_column'
And this is the SELECT, against the source table, that does find these missing rows in SparkSQL:
SELECT * FROM source_table WHERE c1='missing_rows_value_in_column'
We execute the CTAS in Trino (Presto). If we use ...WHERE trim(c1) = 'missing_key' then Spark can also find the missing rows, yet the fields do not contain trailing spaces (the lengths of these fields are the same in the source table and in the problematic table). In the source table Spark finds these missing rows without trim.
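A quick diagnostic sketch in PySpark (purely illustrative, reusing the table and column names from above) that compares the raw and trimmed filters and inspects the stored lengths of c1, to check whether hidden padding explains the mismatch:
# raw filter, as in the failing query
spark.sql("SELECT count(*) AS n FROM problematic_hive_table WHERE c1 = 'missing_rows_value_in_column'").show()
# trimmed filter, which reportedly does match in Spark
spark.sql("SELECT count(*) AS n FROM problematic_hive_table WHERE trim(c1) = 'missing_rows_value_in_column'").show()
# distribution of stored lengths of c1, to compare against the source table
spark.sql("SELECT length(c1) AS len, count(*) AS n FROM problematic_hive_table GROUP BY length(c1)").show()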

Hbase filter specific rows

I have a Java Spark (v2.4.7) job that currently reads the entire table from HBase.
My table has millions of rows and reading the entire table is very expensive (memory-wise).
My process doesn't need all the data from the HBase table; how can I avoid reading rows with specific keys?
Currently, I read from Hbase as following:
JavaRDD<Tuple2<ImmutableBytesWritable, Result>> jrdd = sparkSession.sparkContext().newAPIHadoopRDD(
        DataContext.getConfig(), TableInputFormat.class, ImmutableBytesWritable.class, Result.class);
I saw the answer in this post, but I couldn't work out how to filter out specific keys.
Any help?
Thanks!

Spark parquet partitioning removes the partition column

If I am using df.write.partitionBy(col1).parquet(path),
the written data no longer contains the partition column.
How can I avoid that?
You can duplicate col1 before writing:
df.withColumn("partition_col", col("col1")).write.partitionBy("partition_col").parquet(path)
Note that this step is not really necessary, because whenever you read a Parquet file in a partitioned directory structure, Spark will automatically add that as a new column to the dataframe.
Actually, Spark does not remove the column; it uses that column to organize the files on disk, so when you read the files back it adds the column again and displays it in the table. If you check the schema of the table or of the dataframe, you will still see it as a column.
Also, you partition your data because you know how the table is queried most frequently; based on that information you decided to partition it so that your reads become faster and more efficient.
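A small sketch of that behavior (the column name matches the example above; the output path is illustrative): write a dataframe partitioned by a column, read the directory back, and the partition column reappears in the schema.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "col1"])
df.write.mode("overwrite").partitionBy("col1").parquet("/tmp/partitioned_example")

# Spark rebuilds col1 from the directory names (col1=a/, col1=b/),
# so it still shows up in the schema of the data read back.
spark.read.parquet("/tmp/partitioned_example").printSchema()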

Databricks - How to change a partition of an existing Delta table?

I have a table in Databricks Delta which is partitioned by transaction_date. I want to change the partition column to view_date. I tried to drop the table and then create it with a new partition column using PARTITIONED BY (view_date).
However, my attempt failed since the actual files reside in S3, and even if I drop the Hive table the partitions remain the same.
Is there any way to change the partition of an existing Delta table? Or is the only solution to drop the actual data and reload it with the newly specified partition column?
There's actually no need to drop tables or remove files. All you need to do is read the current table, overwrite the contents AND the schema, and change the partition column:
val input = spark.read.table("mytable")
input.write.format("delta")
.mode("overwrite")
.option("overwriteSchema", "true")
.partitionBy("colB") // different column
.saveAsTable("mytable")
UPDATE: There previously was a bug with time travel and changes in partitioning that has now been fixed.
As Silvio pointed out, there is no need to drop the table. In fact, the approach strongly recommended by Databricks is to replace the table:
https://docs.databricks.com/sql/language-manual/sql-ref-syntax-ddl-create-table-using.html#parameters
In Spark SQL, this can be done easily with:
REPLACE TABLE <tablename>
USING DELTA
PARTITIONED BY (view_date)
AS
SELECT * FROM <tablename>
Adapted example from:
https://docs.databricks.com/delta/best-practices.html#replace-the-content-or-schema-of-a-table
Python solution:
If you need more than one column in the partition, pass them all to partitionBy(column, column_2, ...); see the multi-column sketch after the helper below.
def change_partition_of(table_name, column):
    df = spark.read.table(table_name)
    df.write.format("delta").mode("overwrite").option("overwriteSchema", "true").partitionBy(column).saveAsTable(table_name)
change_partition_of("i.love_python", "column_a")

On saveAsTable from Spark

We are trying to write into a Hive table from Spark and we are using the saveAsTable function. I want to know whether saveAsTable drops and recreates the Hive table every time or not. If it does, is there any other Spark function that will just truncate and load the table, instead of dropping and recreating it?
It depends on which .mode value you are specifying:
overwrite --> Spark drops the table first, then recreates it
append --> inserts the new data into the existing table
1. Drop if exists / create if not exists the default.spark1 table in parquet format:
>>> df.write.mode("overwrite").saveAsTable("default.spark1")
2. Drop if exists / create if not exists the default.spark1 table in ORC format:
>>> df.write.format("orc").mode("overwrite").saveAsTable("default.spark1")
3. Append the new data to the existing data in the table (doesn't drop/recreate the table):
>>> df.write.format("orc").mode("append").saveAsTable("default.spark1")
Achieve Truncate and Load using Spark:
Method 1:
You can register your dataframe as a temp table and then execute an INSERT OVERWRITE statement to overwrite the target table:
>>> df.registerTempTable("temp") --registering df as temptable
>>> spark.sql("insert overwrite table default.spark1 select * from temp") --overwriting the target table.
This method works for internal and external tables alike.
Method 2:
In the case of internal tables we can truncate the table first and then append the data to it; this way we are not recreating the table, just appending data to it.
>>> spark.sql("truncate table default.spark1")
>>> df.write.format("orc").mode("append").saveAsTable("default.spark1")
This method will only work for internal tables.
Even for external tables there is a workaround: truncate the table by changing its table properties.
Let's assume default.spark1 is an external table:
# change the external table to an internal table
>>> spark.sql("alter table default.spark1 set tblproperties('EXTERNAL'='FALSE')")
# once the table is internal we can run the truncate table statement
>>> spark.sql("truncate table default.spark1")
# change the table back to an external table
>>> spark.sql("alter table default.spark1 set tblproperties('EXTERNAL'='TRUE')")
# then append data to the table
>>> df.write.format("orc").mode("append").saveAsTable("default.spark1")
You can also use insertInto("table"), which doesn't recreate the table.
The main difference from saveAsTable is that insertInto expects the table to already exist and matches columns by position instead of by name.
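A minimal sketch of that approach, reusing the default.spark1 table from above and assuming it already exists with a column order matching the dataframe:
>>> df.write.mode("append").insertInto("default.spark1")  # appends without dropping/recreating; columns are matched by position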
