Avoiding use of SELECT in WHERE - apache-spark

I have an input file on hdfs in CSV format with following cols: date, time, public_ip
Using this I need to filter out data from quite a big table (~100M rows daily). The table has the following structure (roughly):
CREATE TABLE big_table (
`user_id` int,
`ip` string,
`timestamp_from` timestamp,
`timestamp_to` timestamp)
PARTITIONED BY (`PARTITION_DATE` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat';
I need to read CSV data and then filter big_table checking which user_ids have been using the desired ip address in selected period.
I tried using spark SQL with different joins, without much luck. Whatever I do, spark is simply not "smart" enough to limit big table per partition. I also tried using WHERE PARTITION_DATE IN (SELECT DISTINCT date FROM csv_file, but this was also quite slow.
CSV should have up to 20 different days or so.
Here's my solution - I ended up picking up distinct days and using this as a string:
spark.sql("select date from csv_file group by date").createOrReplaceTempView("csv_file_uniq_date")
val partitions=spark.sql("select * from csv_file_uniq_date").collect.mkString(sep=",").replaceAll("[\\[\\]]","")
spark.sql("select user_id, timestamp_from, timestamp_to from big_table where partition_date in (" + partitions + ") group by user_id, timestamp_from, timestamp_to").write.csv("output.csv")
Now, this does the work - I cut the tasks from 100s of thousands to thousands, but I feel quite unhappy with the implementation. Could someone point me to the right direction? How to avoid pulling this as a string of comma separated partition values?
Using spark 2.2
Cheers!

What you expect is called Dynamic Partition Pruning, by which Spark will be smart enough to resolve the partitions to filter from the direct join condition.
This feature is available from Spark 3.0 as part of Adaptive Query Execution improvements.
Find more details from this link
It is disabled by default, can be enabled by setting spark.sql.adaptive.enabled=true

Related

Performance of pyspark + hive when a table has many partition columns

I am trying to understand the performance impact on the partitioning scheme when Spark is used to query a hive table. As an example:
Table 1 has 3 partition columns, and data is stored in paths like
year=2021/month=01/day=01/...data...
Table 2 has 1 partition column
date=20210101/...data...
Anecdotally I have found that queries on the second type of table are faster, but I don't know why, and I don't why. I'd like to understand this so I know how to design the partitioning of larger tables that could have more partitions.
Queries being tested:
select * from table limit 1
I realize this won't benefit from any kind of query pruning.
The above is meant as an example query to demonstrate what I am trying to understand. But in case details are important
This is using s3 not HDFS
The data in the table is very small, and there are not a large number of partitons
The time for running the query on the first table is ~2 minutes, and ~10 seconds on the second
Data is stored as parquet
Except all other factors which you did not mention: storage type, configuration, cluster capacity, the number of files in each case, your partitioning schema does not correspond to the use-case.
Partitioning schema should be chosen based on how the data will be selected or how the data will be written or both. In your case partitioning by year, month, day separately is over-partitioning. Partitions in Hive are hierarchical folders and all of them should be traversed (even if using metadata only) to determine the data path, in case of single date partition, only one directory level is being read. Two additional folders: year+month+day instead of date do not help with partition pruning because all columns are related and used together always in the where.
Also, partition pruning probably does not work at all with 3 partition columns and predicate like this: where date = concat(year, month, day)
Use EXPLAIN and check it and compare with predicate like this where year='some year' and month='some month' and day='some day'
If you have one more column in the WHERE clause in the most of your queries, say category, which does not correlate with date and the data is big, then additional partition by it makes sense, you will benefit from partition pruning then.

Efficient reading/transforming partitioned data in delta lake

I have my data in a delta lake in ADLS and am reading it through Databricks. The data is partitioned by year and date and z ordered by storeIdNum, where there are about 10 store Id #s, each with a few million rows per date. When I read it, sometimes I am reading one date partition (~20 million rows) and sometimes I am reading in a whole month or year of data to do a batch operation. I have a 2nd much smaller table with around 75,000 rows per date that is also z ordered by storeIdNum and most of my operations involve joining the larger table of data to the smaller table on the storeIdNum (and some various other fields - like a time window, the smaller table is a roll up by hour and the other table has data points every second). When I read the tables in, I join them and do a bunch of operations (group by, window by and partition by with lag/lead/avg/dense_rank functions, etc.).
My question is: should I have the date in all of the joins, group by and partition by statements? Whenever I am reading one date of data, I always have the year and the date in the statement that reads the data as I know I only want to read from a certain partition (or a year of partitions), but is it important to also reference the partition col. in windows and group bus for efficiencies, or is this redundant? After the analysis/transformations, I am not going to overwrite/modify the data I am reading in, but instead write to a new table (likely partitioned on the same columns), in case that is a factor.
For example:
dfBig = spark.sql("SELECT YEAR, DATE, STORE_ID_NUM, UNIX_TS, BARCODE, CUSTNUM, .... FROM STORE_DATA_SECONDS WHERE YEAR = 2020 and DATE='2020-11-12'")
dfSmall = spark.sql("SELECT YEAR, DATE, STORE_ID_NUM, TS_HR, CUSTNUM, .... FROM STORE_DATA_HRS WHERE YEAR = 2020 and DATE='2020-11-12'")
Now, if I join them, do I want to include YEAR and DATE in the join, or should I just join on STORE_ID_NUM (and then any of the timestamp fields/customer Id number fields I need to join on)? I definitely need STORE_ID_NUM, but I can forego YEAR AND DATE if it is just adding another column and makes it more inefficient because it is more things to join on. I don't know how exactly it works, so I wanted to check as by foregoing the join, maybe I am making it more inefficient as I am not utilizing the partitions when doing the operations? Thank you!
The key with delta is to choose the partitioned columns very well, this could take some trial and error, if you want to optimize the performance of the response, a technique I learned was to choose a filter column with low cardinality (you know if the problem is of time series, it will be the date, on the other hand if it is about a report for all clients in that case it may be convenient to choose your city), remember that if you work with delta each partition represents a level of the file structure where its cardinality will be the number of directories.
In your case I find it good to partition by YEAR, but I would add the MONTH given the number of records that would help somewhat with the dynamic pruning of spark
Another thing you can try is to use BRADCAST JOIN if the table is very small compared to the other.
Broadcast Hash Join en Spark (ES)
Join Strategy Hints for SQL Queries
The latter link explains how dynamic pruning helps in MERGE operations.
How to improve performance of Delta Lake MERGE INTO queries using partition pruning

Spark find max of date partitioned column

I have a parquet partitioned in the following way:
data
/batch_date=2020-01-20
/batch_date=2020-01-21
/batch_date=2020-01-22
/batch_date=2020-01-23
/batch_date=2020-01-24
Here batch_date which is the partition column is of date type.
I want only read the data from the latest date partition but as a consumer I don't know what is the latest value.
I could use a simple group by something like
df.groupby().agg(max(col('batch_date'))).first()
While this would work it's a very inefficient way since it involves a groupby.
I want to know if we can query the latest partition in a more efficient way.
Thanks.
Doing the method suggested by #pasha701 would involve loading the entire spark data frame with all the batch_date partitions and then finding max of that. I think the author is asking for a way to directly find the max partition date and load only that.
One way is to use hdfs or s3fs, and load the contents of the s3 path as a list and then finding the max partition and then loading only that. That would be more efficient.
Assuming you are using AWS s3 format, something like this:
import sys
import s3fs
datelist=[]
inpath="s3:bucket_path/data/"
fs = s3fs.S3FileSystem(anon=False)
Dirs = fs.ls(inpath)
for paths in Dirs:
date=paths.split('=')[1]
datelist.append(date)
maxpart=max(datelist)
df=spark.read.parquet("s3://bucket_path/data/batch_date=" + maxpart)
This would do all the work in lists without loading anything into memory until it finds the one you want to load.
Function "max" can be used without "groupBy":
df.select(max("batch_date"))
Using Show partitions to get all partition of table
show partitions TABLENAME
Output will be like
pt=2012.07.28.08/is_complete=1
pt=2012.07.28.09/is_complete=1
we can get data form specific partition using below query
select * from TABLENAME where pt='2012.07.28.10' and is_complete='1' limit 1;
Or additional filter or group by can be applied on it.
This worked for me in Pyspark v2.4.3. First extract partitions (this is for a dataframe with a single partition on a date column, haven't tried it when a table has >1 partitions):
df_partitions = spark.sql("show partitions database.dataframe")
"show partitions" returns dataframe with single column called 'partition' with values like partitioned_col=2022-10-31. Now we create a 'value' column extracting just the date part as string. This is then converted to date and the max is taken:
date_filter = df_partitions.withColumn('value', to_date(split('partition', '=')[1], 'yyyy-MM-dd')).agg({"value":"max"}).first()[0]
date_filter contains the maximum date from the partition and can be used in a where clause pulling from the same table.

Pyspark parquet file sizes are drastically different

I use pyspark to process a fix set of data records on a daily basis and store them as 16 parquet files in a Hive table using the date as partition. In theory, the number of records every day should be on the same order of magnitude showing below, about 1.2 billion rows and it is indeed on the same order.
When I look at the parquet files, the size of every parquet files in each day is around 86MB like 2019-09-04 showing below
But one thing I noticed to be very strange is the date of 2019-08-03, the file size is 10x larger than the files in other date, but the number of records seems to be more or less the same. I am so confused and could not come up with a reason for it. If you have any idea as to why, please share it with me. Thank you.
I've just realised that the way I saved the data for 2019-08-03 is as follows
cols = sparkSession \
.sql("SELECT * FROM {} LIMIT 1".format(table_name)).columns
df.select(cols).write.insertInto(table_name, overwrite=True)
For other days
insertSQL = """
INSERT OVERWRITE TABLE {}
PARTITION(crawled_at_ds = '{}')
SELECT column1, column2, column3, column4
FROM calendarCrawlsDF
"""
sparkSession.sql(
insertSQL.format(table_name,
calendarCrawlsDF.take(1)[0]["crawled_at_ds"]))
For 2019-08-03, I used Dataframe insertInto method. For other days, I used sparkSession sql to execute INSERT OVERWRITE TABLE
Could this be the reason?

Spark SQL ignoring dynamic partition filter value

Running into an issue on Spark 2.4 on EMR 5.20 in AWS.
I have a string column as a partition, which has date values. My goal is to have the max value of this column be referenced as a filter. The values look like this 2019-01-01 for January 1st, 2019.
In this query, I am trying to filter to a certain date value (which is a string data type), and Spark ends up reading all directories, not just the resulting max(value).
spark.sql("select mypartitioncolumn, column1, column2 from mydatabase.mytable where mypartitioncolumn= (select max(mypartitioncolumn) from myothertable) group by 1,2,3 ").show
However, in this instance, If I hardcode the value, it only reads the proper directory.
spark.sql("select mypartitioncolumn, column1, column2 from mydatabase.mytable where mypartitioncolumn= '2019-01-01' group by 1,2,3 ").show
Why is Spark not recognizing both methods in the same way? I made sure that if I run the select max(mypartitioncolumn) from myothertable query, it shows the exact same value as my hardcoded method (as well as the same datatype).
I can't find anything in the documentation that differentiates partition querying other than data type differences. I checked to make sure that my schema in both the source table as well as value are string types, and also tried to cast my value as a string as well cast( (select max(mypartitioncolumn) from myothertable) as string), it doesn't make any difference.
Workaround by changing configuration
sql("set spark.sql.hive.convertMetastoreParquet = false")
Spark docs
"When reading from and writing to Hive metastore Parquet tables, Spark SQL will try to use its own Parquet support instead of Hive SerDe for better performance. This behavior is controlled by the spark.sql.hive.convertMetastoreParquet configuration, and is turned on by default."

Resources