Performance of pyspark + hive when a table has many partition columns - apache-spark

I am trying to understand the performance impact on the partitioning scheme when Spark is used to query a hive table. As an example:
Table 1 has 3 partition columns, and data is stored in paths like
year=2021/month=01/day=01/...data...
Table 2 has 1 partition column
date=20210101/...data...
Anecdotally I have found that queries on the second type of table are faster, but I don't know why, and I don't why. I'd like to understand this so I know how to design the partitioning of larger tables that could have more partitions.
Queries being tested:
select * from table limit 1
I realize this won't benefit from any kind of query pruning.
The above is meant as an example query to demonstrate what I am trying to understand. But in case details are important
This is using s3 not HDFS
The data in the table is very small, and there are not a large number of partitons
The time for running the query on the first table is ~2 minutes, and ~10 seconds on the second
Data is stored as parquet

Except all other factors which you did not mention: storage type, configuration, cluster capacity, the number of files in each case, your partitioning schema does not correspond to the use-case.
Partitioning schema should be chosen based on how the data will be selected or how the data will be written or both. In your case partitioning by year, month, day separately is over-partitioning. Partitions in Hive are hierarchical folders and all of them should be traversed (even if using metadata only) to determine the data path, in case of single date partition, only one directory level is being read. Two additional folders: year+month+day instead of date do not help with partition pruning because all columns are related and used together always in the where.
Also, partition pruning probably does not work at all with 3 partition columns and predicate like this: where date = concat(year, month, day)
Use EXPLAIN and check it and compare with predicate like this where year='some year' and month='some month' and day='some day'
If you have one more column in the WHERE clause in the most of your queries, say category, which does not correlate with date and the data is big, then additional partition by it makes sense, you will benefit from partition pruning then.

Related

Spark dataframe distinct write is increasing the output size by almost 10 fold

I have a case where i am trying to write some results using dataframe write into S3 using the below query with input_table_1 size is 13 Gb and input_table_2 as 1 Mb
input_table_1 has columns account, membership and
input_table_2 has columns role, id , membership_id, quantity, start_date
SELECT
/*+ BROADCASTJOIN(input_table_2) */
account,
role,
id,
quantity,
cast(start_date AS string) AS start_date
FROM
input_table_1
INNER JOIN
input_table_2
ON array_contains(input_table_1.membership, input_table_2.membership_id)
where membership array contains list of member_ids
This dataset write using Spark dataframe is generating around 1.1TiB of data in S3 with around 700 billion records.
We identified that there are duplicates and used dataframe.distinct.write.parquet("s3path") to remove the duplicates . The record count is reduced to almost 1/3rd of the previous total count with around 200 billion rows but we observed that the output size in S3 is now 17.2 TiB .
I am very confused how this can happen.
I have used the following spark conf settings
spark.sql.shuffle.partitions=20000
I have tried to do a coalesce and write to s3 but it did not work.
Please suggest if this is expected and when can be done ?
There's two sides to this:
1) Physical translation of distinct in Spark
The Spark catalyst optimiser turns a distinct operation into an aggregation by means of the ReplaceDeduplicateWithAggregate rule (Note: in the execution plan distinct is named Deduplicate).
This basically means df.distinct() on all columns is translated into a groupBy on all columns with an empty aggregation:
df.groupBy(df.columns:_*).agg(Map.empty).
Spark uses a HashPartitioner when shuffling data for a groupBy on respective columns. Since the groupBy clause in your case contains all columns (well, implicitly, but it does), you're more or less randomly shuffling data to different nodes in the cluster.
Increasing spark.sql.shuffle.partitions in this case is not going to help.
Now on to the 2nd side, why does this affect the size of your parquet files so much?
2) Compression in parquet files
Parquet is a columnar format, will say your data is organised in columns rather than row by row. This allows for powerful compression if data is adequately laid-out & ordered. E.g. if a column contains the same value for a number of consecutive rows, it is enough to write that value just once and make a note of the number of repetitions (a strategy called run length encoding). But Parquet also uses various other compression strategies.
Unfortunately, data ends up pretty randomly in your case after shuffling to remove duplicates. The original partitioning of input_table_1 was much better fitted.
Solutions
There's no single answer how to solve this, but here's a few pointers I'd suggest doing next:
What's causing the duplicates? Could these be removed upstream? Or is there a problem with the join condition causing duplicates?
A simple solution is to just repartition the dataset after distinct to match the partitioning of your input data. Adding a secondary sorting (sortWithinPartition) is likely going to give you even better compression. However, this comes at the cost of an additional shuffle!
As #matt-andruff pointed out below, you can also achieve this in SQL using cluster by. Obviously, that also requires you to move the distinct keyword into your SQL statement.
Write your own deduplication algorithm as Spark Aggregator and group / shuffle the data just once in a meaningful way.

Can sort() and cache() combined in spark increase filter speed like creating index column in SQL?

We know in SQL, an index can be created on a column if it is frequently used for filtering. Is there anything similar I can do in spark? Let's say I have a big table T containing a column C I want to filter on. I want to filter 10s of thousands of id sets on the column C. Can I sort/orderBy column C, cache the result, and then filter all the id sets with the sorted table? Will it help like indexing in SQL?
You should absolutely build the table/dataset/dataframe with a sorted id if you will query on it often. It will help predicate pushdown. and in general give a boost in performance.
When executing queries in the most generic and basic manner, filtering
happens very late in the process. Moving filtering to an earlier phase
of query execution provides significant performance gains by
eliminating non-matches earlier, and therefore saving the cost of
processing them at a later stage. This group of optimizations is
collectively known as predicate pushdown.
Even if you aren't sorting data you may want to look at storing the data in file with 'distribute by' or 'cluster by'. It is very similar to repartitionBy. And again only boosts performance if you intend to query the data as you have distributed the data.
If you intend to requery often yes, you should cache data, but in general there aren't indexes. (There are file types that help boost performance if you have specific query type needs. (Row based/columnar based))
You should look at the Spark Specific Performance tuning options. Adaptive query is a next generation that helps boost performance, (without indexes)
If you are working with Hive: (Note they have their own version of partitions)
Depending on how you will query the data you may also want to look at partitioning or :
[hive] Partitioning is mainly helpful when we need to filter our data based
on specific column values. When we partition tables, subdirectories
are created under the table’s data directory for each unique value of
a partition column. Therefore, when we filter the data based on a
specific column, Hive does not need to scan the whole table; it rather
goes to the appropriate partition which improves the performance of
the query. Similarly, if the table is partitioned on multiple columns,
nested subdirectories are created based on the order of partition
columns provided in our table definition.
Hive Partitioning is not a magic bullet and will slow down querying if the pattern of accessing data is different than the partitioning. It make a lot of sense to partition by month if you write a lot of queries looking at monthly totals. If on the other hand the same table was used to look at sales of product 'x' from the beginning of time, it would actually run slower than if the table wasn't partitioned. (It's a tool in your tool shed.)
Another hive specific tip:
The other thing you want to think about, and is keeping your table stats. The Cost Based Optimizer uses those statistics to query your data. You should make sure to keep them up to date. (Re-run after ~30% of your data has changed.)
ANALYZE TABLE [db_name.]tablename [PARTITION(partcol1[=val1], partcol2[=val2], ...)] -- (Note: Fully support qualified table name
since Hive 1.2.0, see HIVE-10007.)
COMPUTE STATISTICS
[FOR COLUMNS] -- (Note: Hive 0.10.0 and later.)
[CACHE METADATA] -- (Note: Hive 2.1.0 and later.)
[NOSCAN];

Efficient reading/transforming partitioned data in delta lake

I have my data in a delta lake in ADLS and am reading it through Databricks. The data is partitioned by year and date and z ordered by storeIdNum, where there are about 10 store Id #s, each with a few million rows per date. When I read it, sometimes I am reading one date partition (~20 million rows) and sometimes I am reading in a whole month or year of data to do a batch operation. I have a 2nd much smaller table with around 75,000 rows per date that is also z ordered by storeIdNum and most of my operations involve joining the larger table of data to the smaller table on the storeIdNum (and some various other fields - like a time window, the smaller table is a roll up by hour and the other table has data points every second). When I read the tables in, I join them and do a bunch of operations (group by, window by and partition by with lag/lead/avg/dense_rank functions, etc.).
My question is: should I have the date in all of the joins, group by and partition by statements? Whenever I am reading one date of data, I always have the year and the date in the statement that reads the data as I know I only want to read from a certain partition (or a year of partitions), but is it important to also reference the partition col. in windows and group bus for efficiencies, or is this redundant? After the analysis/transformations, I am not going to overwrite/modify the data I am reading in, but instead write to a new table (likely partitioned on the same columns), in case that is a factor.
For example:
dfBig = spark.sql("SELECT YEAR, DATE, STORE_ID_NUM, UNIX_TS, BARCODE, CUSTNUM, .... FROM STORE_DATA_SECONDS WHERE YEAR = 2020 and DATE='2020-11-12'")
dfSmall = spark.sql("SELECT YEAR, DATE, STORE_ID_NUM, TS_HR, CUSTNUM, .... FROM STORE_DATA_HRS WHERE YEAR = 2020 and DATE='2020-11-12'")
Now, if I join them, do I want to include YEAR and DATE in the join, or should I just join on STORE_ID_NUM (and then any of the timestamp fields/customer Id number fields I need to join on)? I definitely need STORE_ID_NUM, but I can forego YEAR AND DATE if it is just adding another column and makes it more inefficient because it is more things to join on. I don't know how exactly it works, so I wanted to check as by foregoing the join, maybe I am making it more inefficient as I am not utilizing the partitions when doing the operations? Thank you!
The key with delta is to choose the partitioned columns very well, this could take some trial and error, if you want to optimize the performance of the response, a technique I learned was to choose a filter column with low cardinality (you know if the problem is of time series, it will be the date, on the other hand if it is about a report for all clients in that case it may be convenient to choose your city), remember that if you work with delta each partition represents a level of the file structure where its cardinality will be the number of directories.
In your case I find it good to partition by YEAR, but I would add the MONTH given the number of records that would help somewhat with the dynamic pruning of spark
Another thing you can try is to use BRADCAST JOIN if the table is very small compared to the other.
Broadcast Hash Join en Spark (ES)
Join Strategy Hints for SQL Queries
The latter link explains how dynamic pruning helps in MERGE operations.
How to improve performance of Delta Lake MERGE INTO queries using partition pruning

Optimize Partitionning for billions of distinct keys

I'm processing a file each day with PySpark for contaning information about device navigation through the web. At the end of each month I want to use window functions in order to have the navigation journey for each device. It's a very slow processing, even with many nodes, so I'm looking for ways to speed it up.
My idea was to partition the data but I have 2 billion distinct keys, so partitionBy does not seem appropriate. Even bucketBy might not be a good choice because I create n buckets each day, so the files are not appended but for each day there are x parts of files that are created.
Does anyone have a solution ?
So here is an example of the export for each day (inside of each parquet file we find 9 partitions):
And here is the partitionBy query that we launch at the beggining of each month (compute_visit_number and compute_session_number are two udf that i've created on the notebook):
You want to ensure that each devices data is in the same partition to prevent exchanges when you do your window function. Or at least minimise the number of partitions the data could be in.
To do this I would create a column called partitionKey when you write the data - which contained a mod on the mc_device column - where the number you mod by is the number of partitions you want. Base this number of the size of the cluster that will run the end of month query. (If mc_device is not an integer then create a checksum first).
You can create a secondary partition on the date column if still needed.
Your end of month query should change:
w = Windows.partitionBy('partitionKey', 'mc_device').orderBy(event_time')
If you kept the date as a secondary partition column then repartition the dataframe to partitionKey only:
df = df.repartition('partitionKey')
At this point each devices data will be in the same partition and no exchanges should be needed. The sort should be faster and your query will hopefully complete in a sensible time.
If it is still slow you need more partitions when writing the data.

Querying split partitions on Cassandra in a single request

I am in the process of learning Cassandra as an alternative to SQL databases for one of the projects I am working for, that involves Big Data.
For the purpose of learning, I've been watching the videos offered by DataStax, more specifically DS220 which covers modeling data in Cassandra.
While watching one of the videos in the course series I was introduced to the concept of splitting partitions to manage partition size.
My current understanding is that Cassandra has a max logical capacity of 2B entries per partition, but a suggested max of a couple 100s MB per partition.
I'm currently dealing with large amounts of real-time financial data that I must store (time series), meaning I can easily fill out GBs worth of data in a day.
The video course talks about introducing an additional partition key in order to split a partition with the purpose or reducing the size per partition requirement.
The video pointed out to using either a time based key or an arbitrary "bucket" key that gets incremented when the number of manageable rows has been reached.
With that in mind, this led me to the following problem: given that partition keys are only used as equality criteria (ie. point to the partition to find records), how do I find all the records that end up being spread across multiple partitions without having to specify either the bucket or timestamp key?
For example, I may receive 1M records in a single day, which would likely go over the 100-500Mb partition limit, so I wouldn't be able to set a partition on a per date basis, that means that my daily data would be broken down into hourly partitions, or alternatively, into "bucketed" partitions (for balanced partition sizes). This means that all my daily data would be spread across multiple partitions splits.
Given this scenario, how do I go about querying for all records for a given day? (additional clustering keys could include a symbol for which I want to have the results for, or I want all the records for that specific day)
Any help would be greatly appreciated.
Thank you.
Basically this goes down to choosing right resolution for your data. I would say first step for you would be to determinate what is best fit for your data. Lets for sake of example take 1 hour as something that is good and question is how to fetch all records for particular date.
Your application logic will be slightly more complicated since you are trading simplicity for ability to store large amounts of data in distributed fashion. You take date which you need and issue 24 queries in a loop and glue data on application level. However when you glue that in can be huge (I do not know your presentation or export requirements so this can pull 1M to memory).
Other idea can be having one table as simple lookup table which has key of date and values of partition keys having financial data for that date. Than when you read you go first to lookup table to get keys and then to partitions having results. You can also store counter of values per partition key so you know what amount of data you expect.
All in all it is best to figure out some natural bucket in your data set and add it to date (organization, zip code etc.) and you can use trick with additional lookup table. This approach can be used for symbol you mentioned. You can have symbols as partition keys, clustering per date and values of partitions having results for that date as values. Than you query for symbol # on 29-10-2015 and you see partitions A, D and Z have results so you go to those partitions and get financial data from them and glue it together on application level.

Resources