How to partition a SQL Server table in PySpark when the partition column is an integer in date format (20170101 to 20200306)? - apache-spark

I have an integer column that actually holds a date, like this:
20170101
20170103
20170102
.....
20200101
There are around 10 million rows in each partition.
How can I read the table in PySpark using this field as the partition column?

Run a Spark SQL query:
spark.sql("select * from table where intPartitionColumn=20200101")
If the table is stored as files partitioned by this column, the partition filter is pushed down to the source, so only the directory intPartitionColumn=20200101 is read.
You can also check the physical plan (PartitionFilters and PushedFilters) to verify this.
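Since the question concerns a SQL Server table read over JDBC, one possible alternative (not taken from the answer above) is to pass Spark a list of predicates, one per calendar month, so that each predicate becomes one read partition. Splitting the raw integer column with lowerBound/upperBound would give uneven strides, because integer-encoded dates jump from 20171231 to 20180101. The sketch below is only an illustration: month_predicates and read_by_month are hypothetical helper names, and the JDBC URL, table name and connection properties are assumed to be supplied by the caller.

```python
from datetime import date


def month_predicates(column, start, end):
    """Return one SQL predicate per calendar month for an int yyyymmdd column."""
    first = start.year * 10000 + start.month * 100 + start.day
    last = end.year * 10000 + end.month * 100 + end.day
    predicates = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        next_year, next_month = (year + 1, 1) if month == 12 else (year, month + 1)
        # Clamp the first and last month to the requested date range.
        lo = max(first, year * 10000 + month * 100 + 1)
        hi = min(last + 1, next_year * 10000 + next_month * 100 + 1)
        predicates.append(f"{column} >= {lo} AND {column} < {hi}")
        year, month = next_year, next_month
    return predicates


def read_by_month(spark, url, table, column, start, end, properties):
    """Read `table` over JDBC with one Spark partition per month predicate."""
    return spark.read.jdbc(
        url=url,
        table=table,
        predicates=month_predicates(column, start, end),
        properties=properties,
    )
```

For the range in the question, month_predicates("intPartitionColumn", date(2017, 1, 1), date(2020, 3, 6)) yields 39 predicates and therefore 39 read partitions; whether a finer split (per week or per day) works better is a matter of tuning.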

Related

How to repartition into fixed number of partition per column in Spark?

I need to read data from one Hive table and insert it into another Hive table. Both tables have the same schema. The table is partitioned by date and country. The size of each partition is ~500 MB. I want to insert this data into a new table where the files inside each partition are roughly 128 MB (i.e., 4 files).
Step 1: Read data from the source table in Spark.
Step 2: Repartition by the columns (country, date) and set the number of partitions to 4.
df.repartition(4, col("country_code"), col("record_date"))
I am getting only 1 partition per country_code & record_date.
Step 2 repartitions your data into 4 partitions in memory, but it does not by itself save 4 files when you call df.write.
To write the output partitioned by both columns, you can use the code below:
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.col

df.repartition(4, col("country_code"), col("record_date"))
  .write
  .partitionBy("country_code", "record_date") // partitionBy takes column names, not Column objects
  .mode(SaveMode.Append)
  .saveAsTable("TableName")
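One caveat worth checking: repartition(4, country_code, record_date) hashes the two columns, so all rows of a given (country_code, record_date) pair land in the same in-memory partition and end up in a single file for that directory. A common workaround is to add a random salt column to the repartition keys. The PySpark sketch below assumes that approach; the _salt column name, the helper names and the default of 4 files are illustrative, and hash collisions can leave fewer than 4 files in a directory.

```python
def salt_expr(files_per_partition, name="_salt"):
    """SQL expression for a random integer in [0, files_per_partition)."""
    return f"cast(floor(rand() * {int(files_per_partition)}) as int) as {name}"


def write_with_salt(df, table_name, files_per_partition=4):
    """Append df to table_name with up to files_per_partition files per directory."""
    salted = df.selectExpr("*", salt_expr(files_per_partition))
    (
        salted
        # Rows of one (country_code, record_date) pair now spread over several tasks.
        .repartition("country_code", "record_date", "_salt")
        .drop("_salt")
        .write
        .partitionBy("country_code", "record_date")
        .mode("append")
        .saveAsTable(table_name)
    )
```

The salt is dropped before writing, so the target table keeps the same schema as the source.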

Automatically Updating a Hive View Daily

I have a requirement I want to meet. I need to sqoop over data from a DB to Hive. I am sqooping on a daily basis since this data is updated daily.
This data will be used as lookup data by a Spark consumer for enrichment. We want to keep a history of all the data we have received, but for lookup we need only the latest data (same day), not all of it. I was thinking of creating a Hive view on the historical table that shows only the records inserted that day. Is there a way to automate the view on a daily basis so that the view query always returns the latest data?
Q: Is there a way to automate the view on a daily basis so that the
view query will always have the latest data?
There is no need to update or automate anything if you use a table partitioned by date.
Q: We want to keep a history of all the data we have received but we
don't need all the data for lookup only the latest data (same day).
Note: whether you use a Hive view or a Hive table, always avoid a full table scan when fetching the latest partition's data.
Option 1: Hive approach to query data
If you want to adopt the Hive approach, use a partitioned table in Hive with a partition column, for example partition_date:
select * from yourpartitionedTable where partition_date in
(select max(partition_date) from yourpartitionedTable)
or
select * from (select *, dense_rank() over (order by partition_date desc) dt_rnk from db.yourpartitionedTable) myview
where myview.dt_rnk = 1
Either query always returns the latest partition of the partitioned table: the current day's partition if today's date is present in the data, otherwise the partition with the maximum partition_date.
Option 2: Plain Spark approach to query data
Use the Spark show partitions command, i.e. spark.sql(s"show partitions $yourpartitionedtablename"), collect the result into an array, and sort it to get the latest partition date. With that value, your Spark component can query only the latest partition as lookup data.
See my answer for an idea of how to get the latest partition date.
I prefer option 2, since it needs no Hive query and no full table scan: the show partitions command reads only partition metadata, so it avoids that performance bottleneck.
Another idea is to query with HiveMetastoreClient, or to combine it with option 2... see this and my answer and the other
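Option 2 above could be sketched in PySpark roughly as follows. This is a minimal illustration, not the answerer's code: the helper names are hypothetical, the partition column is assumed to be called partition_date, and its values are assumed to be formatted so that string order matches date order (for example yyyy-MM-dd or yyyyMMdd).

```python
def latest_partition_value(partition_specs, column="partition_date"):
    """Return the greatest value of `column` among specs like 'partition_date=2019-08-06'."""
    values = []
    for spec in partition_specs:
        # A spec may hold several levels, e.g. 'partition_date=2019-08-06/country=IN'.
        for part in spec.split("/"):
            key, _, value = part.partition("=")
            if key == column:
                values.append(value)
    return max(values) if values else None


def read_latest_partition(spark, table, column="partition_date"):
    """Query only the newest partition of `table`, found via SHOW PARTITIONS."""
    specs = [row[0] for row in spark.sql(f"SHOW PARTITIONS {table}").collect()]
    latest = latest_partition_value(specs, column)
    if latest is None:
        return None  # the table has no partitions yet
    return spark.sql(f"SELECT * FROM {table} WHERE {column} = '{latest}'")
```

Because the final query filters on the partition column with a literal value, Spark can prune all other partitions instead of scanning the full history table.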
I am assuming that you load daily transaction records into your history table with some last-modified date. Every time you insert or update a record in the history table, its last_modified_date column is updated. It could be a date or a timestamp.
You can create a view in Hive that fetches the latest data using an analytic function.
Here's some sample data:
CREATE TABLE IF NOT EXISTS db.test_data
(
user_id int
,country string
,last_modified_date date
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS orc
;
I am inserting a few sample records. Note that the same id has multiple records for different dates.
INSERT INTO TABLE db.test_data VALUES
(1,'India','2019-08-06'),
(2,'Ukraine','2019-08-06'),
(1,'India','2019-08-05'),
(2,'Ukraine','2019-08-05'),
(1,'India','2019-08-04'),
(2,'Ukraine','2019-08-04');
Creating a view in Hive:
CREATE VIEW db.test_view AS
select user_id, country, last_modified_date
from ( select user_id, country, last_modified_date,
max(last_modified_date) over (partition by user_id) as max_modified
from db.test_data ) as sub
where last_modified_date = max_modified
;
hive> select * from db.test_view;
1 India 2019-08-06
2 Ukraine 2019-08-06
Time taken: 5.297 seconds, Fetched: 2 row(s)
The view shows only the rows with the maximum date for each user.
If you then insert another record with a newer last modified date:
hive> INSERT INTO TABLE db.test_data VALUES
> (1,'India','2019-08-07');
hive> select * from db.test_view;
1 India 2019-08-07
2 Ukraine 2019-08-06
For reference: Hive View manual
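To make the view's logic easier to check against expectations, here is a plain-Python model of what it computes, using the same sample rows. It is only an illustration of the max-over-partition filter, not a replacement for the Hive view; latest_rows_per_user is a hypothetical name. As in the view, ties on the maximum date would return more than one row per user.

```python
def latest_rows_per_user(rows):
    """Keep, for each user_id, the rows whose last_modified_date equals that user's maximum."""
    max_date = {}
    for user_id, _country, modified in rows:
        # ISO-formatted date strings compare in chronological order.
        if user_id not in max_date or modified > max_date[user_id]:
            max_date[user_id] = modified
    return [row for row in rows if row[2] == max_date[row[0]]]


rows = [
    (1, "India", "2019-08-06"),
    (2, "Ukraine", "2019-08-06"),
    (1, "India", "2019-08-05"),
    (2, "Ukraine", "2019-08-05"),
    (1, "India", "2019-08-04"),
    (2, "Ukraine", "2019-08-04"),
]
print(latest_rows_per_user(rows))
```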

Databricks - How to change a partition of an existing Delta table?

I have a table in Databricks delta which is partitioned by transaction_date. I want to change the partition column to view_date. I tried to drop the table and then create it with a new partition column using PARTITIONED BY (view_date).
However, my attempt failed because the actual files reside in S3, and even if I drop the Hive table the partitions remain the same.
Is there any way to change the partition of an existing Delta table, or is the only solution to drop the actual data and reload it with the newly indicated partition column?
There's actually no need to drop tables or remove files. All you need to do is read the current table, overwrite the contents AND the schema, and change the partition column:
val input = spark.read.table("mytable")
input.write.format("delta")
  .mode("overwrite")
  .option("overwriteSchema", "true")
  .partitionBy("colB") // different column
  .saveAsTable("mytable")
UPDATE: There previously was a bug with time travel and changes in partitioning that has now been fixed.
As Silvio pointed out, there is no need to drop the table. In fact, the approach strongly recommended by Databricks is to replace the table.
https://docs.databricks.com/sql/language-manual/sql-ref-syntax-ddl-create-table-using.html#parameters
In Spark SQL, this can be done easily with:
REPLACE TABLE <tablename>
USING DELTA
PARTITIONED BY (view_date)
AS
SELECT * FROM <tablename>
Modified example from:
https://docs.databricks.com/delta/best-practices.html#replace-the-content-or-schema-of-a-table
Python solution:
If you need more than one column in the partition, pass them all:
partitionBy(column, column_2, ...)
def change_partition_of(table_name, column):
    # Read the current table, then overwrite both its data and schema
    # so that the new partition column takes effect.
    df = spark.read.table(table_name)
    df.write.format("delta").mode("overwrite").option("overwriteSchema", "true").partitionBy(column).saveAsTable(table_name)

change_partition_of("i.love_python", "column_a")

How to speed up a Spark DataFrame write to a Hive table stored as ORC

thirdCateBrandres.createOrReplaceTempView("tempTable2")
sql("insert overwrite table temp_cate3_Brand_List select * from tempTable2")
In the code above, thirdCateBrandres is a Spark DataFrame that is registered as a temp view and then written to the table temp_cate3_Brand_List. The table has 3 billion rows with 7 fields, and the data size is about 4 GB in ORC+SNAPPY format.
This code took about 20 minutes.
How can I speed up the program?

Spark-Hive partitioning

The Hive table was created using 4 partitions.
CREATE TABLE IF NOT EXISTS hourlysuspect ( cells int, sms_in int) partitioned by (traffic_date_hour string) stored as ORC into 4 buckets
The following lines in the spark code insert data into this table
hourlies.write.partitionBy("traffic_date_hour").insertInto("hourly_suspect")
and in the spark-defaults.conf, the number of parallel processes is 128
spark.default.parallelism=128
The problem is that when the inserts happen in the hive table, it has 128 partitions instead of 4 buckets.
The defaultParallelism cannot be reduced to 4 as that leads to a very very slow system. Also, I have tried the DataFrame.coalesce method but that makes the inserts too slow.
Is there any other way to force the number of buckets to be 4 when the data is inserted into the table?
As of Spark 2.2.0, Spark does not natively support writing to bucketed Hive tables using Spark SQL. When creating a bucketed table, there should be a CLUSTERED BY clause on one of the columns from the table schema. I don't see that in the given CREATE TABLE statement. Assuming that it does exist and you know the clustering column, you could add the
.bucketBy(numBuckets, colName)
call when using the DataFrameWriter API.
More details for Spark 2.0+: https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/DataFrameWriter.html
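A rough PySpark sketch of the bucketBy call mentioned above is shown below. Several things here are assumptions rather than facts from the question: the clustering column (the original CREATE TABLE statement does not name one), the helper name write_bucketed, and the use of PySpark 2.3 or later, where DataFrameWriter.bucketBy is available. Note also that bucketBy works with saveAsTable rather than insertInto, and that the table is then bucketed in Spark's own scheme, which Hive may not treat as a bucketed table.

```python
def write_bucketed(df, table_name, bucket_column, num_buckets=4,
                   partition_column="traffic_date_hour"):
    """Save df as a table partitioned by partition_column and bucketed by bucket_column."""
    (
        df.write
        .partitionBy(partition_column)
        .bucketBy(num_buckets, bucket_column)  # bucket_column is an assumed clustering column
        .mode("append")
        .saveAsTable(table_name)  # bucketBy is not supported with insertInto
    )
```

A hypothetical call would be write_bucketed(hourlies, "hourly_suspect_bucketed", "cells"), where both the target table name and the choice of cells as the bucket column are placeholders.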
