Partitioning the data into an equal number of records for each group in a Spark data frame - apache-spark

We have 1 month of data, and each day's data is between 10 and 100 GB. We will be writing this data set in a partitioned manner. In our case, we have a DATE column that we use to partition the data frame on write (partitionBy("DATE")). We also apply repartition to the data frame to create single or multiple files. If we repartition to 1, it creates 1 file for each partition. If we set it to 5, it creates 5 partition files, which is fine.
What we are trying to do here is make sure that each group (the partitioned data for a date) is written as equal-sized files (either by number of records or by file size).
We have used the Spark data frame writer option "maxRecordsPerFile" and set it to 10 million records, and this works as expected. But for 10 days of data, if I do this in one go, it eats up the execution time, as it collects all 10 days of data and tries to do some distribution.
If I don't set this parameter and don't repartition to 1, the activity completes in 5 minutes, but if I just set partitionBy("DATE") and the maxRecordsPerFile option, it takes almost an hour.
Looking forward to some help on this!
~Krish
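To make the "equal number of records per file" arithmetic concrete, here is a minimal sketch in plain Python. It assumes that all rows for a given date reach the writer in a single task, in which case a cap of N records per file yields ceil(rows / N) files for that date, with only the last file smaller. The daily row counts and the helper name files_per_date are hypothetical and used only for illustration.

```python
def files_per_date(record_counts, max_records_per_file):
    """Return {date: file count} when each file holds at most max_records_per_file rows."""
    return {
        # Integer ceiling division: -(-a // b) == ceil(a / b) for positive ints.
        day: -(-count // max_records_per_file)
        for day, count in record_counts.items()
    }


# Hypothetical daily row counts, for illustration only.
counts = {"2021-01-01": 25_000_000, "2021-01-02": 8_000_000}
print(files_per_date(counts, 10_000_000))
```

A date with 25 million rows would end up as three files (10M, 10M, 5M), so the cap bounds file size but does not by itself make every file the same size.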

Related

Spark Job stuck writing dataframe to partitioned Delta table

I am running Databricks to read CSV files and then save them as a partitioned Delta table.
The files contain 179619219 records in total. The data is being split on column A (8419 unique values), year (10 years) and month.
df.write.partitionBy("A","year","month").format("delta") \
.mode("append").save(path)
The job gets stuck on the write step and aborts after running for 5-6 hours.
This is a very bad partitioning scheme. You simply have too many unique values for column A, and the additional partitioning creates even more partitions. Spark will need to create at least 90k partitions, and this requires creating separate (small) files for them, etc. Small files harm performance.
For non-Delta tables, partitioning is primarily used for data skipping when reading data. But for Delta Lake tables, partitioning may not be so important, as Delta on Databricks includes features like data skipping, and you can apply ZOrder, etc.
I would recommend using a different partitioning scheme, for example year + month only, and running OPTIMIZE with ZOrder on column A after the data is written. This will lead to the creation of only a few partitions with bigger files.
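A rough back-of-the-envelope check, using only the figures from the question, shows why the finer scheme produces tiny files. The sketch assumes the worst case where every combination of A, year and month actually occurs in the data, which gives an upper bound rather than the exact partition count.

```python
total_records = 179_619_219
distinct_a = 8_419
years = 10
months_per_year = 12

# Upper bound on partition directories for partitionBy("A", "year", "month").
max_partitions_fine = distinct_a * years * months_per_year
# Partition directories for partitionBy("year", "month") only.
max_partitions_coarse = years * months_per_year

print(max_partitions_fine)                    # upper bound on directories
print(total_records // max_partitions_fine)   # average rows per directory
print(max_partitions_coarse)
print(total_records // max_partitions_coarse)
```

Under that assumption the fine scheme averages fewer than 200 rows per partition directory, while year + month alone averages roughly 1.5 million rows per directory.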

Joining two large tables which have large regions of no overlap

Let's say I have the following join (modified from Spark documentation):
impressionsWithWatermark.join(
clicksWithWatermark,
expr("""
clickAdId = impressionAdId AND
clickTime >= cast(impressionTime as date) AND
clickTime <= cast(impressionTime as date) + interval 1 day
""")
)
Assume that both tables have trillions of rows for 2 years of data. I think that joining everything from both tables is unnecessary. What I want to do is create subsets, similar to this: create 365 * 2 * 2 smaller dataframes so that there is 1 dataframe for each day of each table for 2 years, then create 365 * 2 join queries and take a union of them. But that is inefficient. I am not sure how to do it properly. I think I should add table.repartition(factor/multiple of 365 * 2) for both tables and add write.partitionBy(cast(impressionTime as date), cast(impressionTime as date)) to the streamwriter, and set the number of executors times cores to a factor or multiple of 365 * 2.
What is a proper way to do this? Does Spark analyze the query and optimize it so that the entries from a single day are automatically put in the same partition? What if I am not joining all records from the same day, but rather from the same hour but there are very few records from 11pm to 1am? Does Spark know that it is most efficient to partition by day or will it be even more efficient?
First, let me restate what I have understood from your question. You have two tables with two years' worth of data, each with around a trillion records. You want to join them efficiently based on a timeframe that you provide. For example, it could be a specific month of a year or specific custom dates, but the job should read only that much data and not all of it.
Now, to answer your question, you can do the following:
First of all, when you write the data to create the tables, you should partition each table by a day column so that each day's data is in a separate directory/partition for both tables. Spark won't do that for you by default. You will have to decide that based on your dataset.
Second, when you read the data and perform the join, it should not be done on the whole table. Read only the specific partitions by applying a filter condition on the dataframe, so that Spark applies partition pruning and reads only the partitions that satisfy the filter.
Once you have filtered the data at read time and stored it in dataframes, join those dataframes on the key relationship. That would be the most efficient and performant way of doing it as a first attempt.
If it is still not fast enough, you can look at bucketing your data in addition to partitioning, but in most cases that is not required.
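To illustrate what partition pruning buys you, the sketch below lists the directories a date filter would need to touch, assuming the tables were written with a Hive-style layout of one `day=YYYY-MM-DD` directory per day. The column name `day` and the helper name are assumptions made for this example, not something from the question.

```python
from datetime import date, timedelta


def partitions_for_range(start, end, column="day"):
    """List Hive-style partition directory names for every day in [start, end]."""
    n_days = (end - start).days + 1
    return [f"{column}={start + timedelta(days=i)}" for i in range(n_days)]


# A three-day filter touches three directories, not two years of data.
print(partitions_for_range(date(2020, 2, 28), date(2020, 3, 1)))
```

With a filter on the partition column, the reader only has to open the directories in this list on each side of the join; everything else is skipped before any rows are read.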

Optimize Partitionning for billions of distinct keys

Each day I process a file with PySpark containing information about device navigation through the web. At the end of each month I want to use window functions to build the navigation journey for each device. The processing is very slow, even with many nodes, so I'm looking for ways to speed it up.
My idea was to partition the data, but I have 2 billion distinct keys, so partitionBy does not seem appropriate. Even bucketBy might not be a good choice, because I create n buckets each day, so the files are not appended; instead, x part files are created for each day.
Does anyone have a solution?
So here is an example of the export for each day (inside of each parquet file we find 9 partitions):
And here is the partitionBy query that we launch at the beginning of each month (compute_visit_number and compute_session_number are two UDFs that I've created in the notebook):
You want to ensure that each device's data is in the same partition to prevent exchanges when you run your window function, or at least minimise the number of partitions the data could be in.
To do this I would create a column called partitionKey when you write the data, containing mc_device modulo the number of partitions you want. Base this number on the size of the cluster that will run the end-of-month query. (If mc_device is not an integer, create a checksum first.)
You can create a secondary partition on the date column if still needed.
Your end of month query should change:
from pyspark.sql import Window
w = Window.partitionBy('partitionKey', 'mc_device').orderBy('event_time')
If you kept the date as a secondary partition column then repartition the dataframe to partitionKey only:
df = df.repartition('partitionKey')
At this point each device's data will be in the same partition and no exchanges should be needed. The sort should be faster and your query will hopefully complete in a sensible time.
If it is still slow you need more partitions when writing the data.
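The partitionKey idea can be sketched in plain Python as a checksum followed by a modulo. This is only an illustration of the logic: the partition count of 200 and the use of CRC32 as the checksum are assumptions, and in a real job the same computation would be expressed as Spark column expressions rather than a Python function.

```python
import zlib

NUM_PARTITIONS = 200  # assumption: chosen to suit the end-of-month cluster


def partition_key(mc_device, num_partitions=NUM_PARTITIONS):
    """Map a device id (any type) to a stable bucket in [0, num_partitions)."""
    # CRC32 turns a non-integer id into an unsigned integer checksum.
    checksum = zlib.crc32(str(mc_device).encode("utf-8"))
    return checksum % num_partitions


# The same device always lands in the same bucket, whatever day it is written.
print(partition_key("device-123") == partition_key("device-123"))
```

Because the key depends only on the device id, every daily export places a given device in the same bucket, which is what lets the monthly window avoid moving that device's rows between partitions.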

Repartition to avoid large number of small files

Currently I have an ETL job that reads a few tables, performs certain transformations and writes the results back to the daily table.
I use the following query in Spark SQL:
"INSERT INTO dbname.tablename PARTITION(year_month)
SELECT * from Spark_temp_table "
The target table into which all these records are inserted is partitioned at a year X month level. The records generated on a daily basis are not that many, hence I am partitioning at the year X month level.
However, when I check the partition, it has a small ~50MB file for each day my code runs (the code has to run daily), and eventually I will end up with around 30 files in my partition totalling ~1500MB.
I want to know if there is a way for me to create just one file (or maybe 2-3 files, as per block size restrictions) in one partition while I append my records on a daily basis.
The way I think I can do it is to read everything from the concerned partition into my Spark dataframe, append the latest records to it and repartition it before writing back. How do I ensure I only read data from the concerned partition, and that only that partition is overwritten with a smaller number of files?
You can use the DISTRIBUTE BY clause to control how the records will be distributed across files inside each partition.
To have a single file per partition, you can use DISTRIBUTE BY year, month
and to have 3 files per partition, you can use DISTRIBUTE BY year, month, day % 3
The full query:
INSERT INTO dbname.tablename
PARTITION(year_month)
SELECT * from Spark_temp_table
DISTRIBUTE BY year, month, day % 3
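To see why `day % 3` gives about three files per month, the sketch below groups the days of a 30-day month by that expression. DISTRIBUTE BY sends rows with equal values of the listed expressions to the same shuffle partition, and each shuffle partition writes its own file. Note that two distinct groups can still hash to the same shuffle partition, so three is an upper bound rather than a guarantee.

```python
from collections import Counter

# Days of a hypothetical 30-day month, grouped by the DISTRIBUTE BY expression.
days = range(1, 31)
groups = Counter(day % 3 for day in days)

print(sorted(groups.items()))
```

Each of the three groups receives ten days of data, so the files within a month come out roughly balanced as long as the daily volumes are similar.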

Querying split partitions on Cassandra in a single request

I am in the process of learning Cassandra as an alternative to SQL databases for one of the projects I am working on, which involves Big Data.
For the purpose of learning, I've been watching the videos offered by DataStax, more specifically DS220 which covers modeling data in Cassandra.
While watching one of the videos in the course series I was introduced to the concept of splitting partitions to manage partition size.
My current understanding is that Cassandra has a max logical capacity of 2B entries per partition, but a suggested max of a few hundred MB per partition.
I'm currently dealing with large amounts of real-time financial data that I must store (time series), meaning I can easily fill out GBs worth of data in a day.
The video course talks about introducing an additional partition key in order to split a partition, with the purpose of reducing the size per partition.
The video suggested using either a time-based key or an arbitrary "bucket" key that gets incremented when the number of manageable rows has been reached.
With that in mind, this led me to the following problem: given that partition keys are only used as equality criteria (i.e. they point to the partition where records are found), how do I find all the records that end up being spread across multiple partitions without having to specify either the bucket or timestamp key?
For example, I may receive 1M records in a single day, which would likely go over the 100-500MB partition limit, so I wouldn't be able to partition on a per-date basis. That means my daily data would be broken down into hourly partitions or, alternatively, into "bucketed" partitions (for balanced partition sizes). This means that all my daily data would be spread across multiple partition splits.
Given this scenario, how do I go about querying for all records for a given day? (additional clustering keys could include a symbol for which I want to have the results for, or I want all the records for that specific day)
Any help would be greatly appreciated.
Thank you.
Basically this comes down to choosing the right resolution for your data. I would say the first step for you is to determine what fits your data best. Let's, for the sake of example, take 1 hour as a good resolution; the question is then how to fetch all records for a particular date.
Your application logic will be slightly more complicated, since you are trading simplicity for the ability to store large amounts of data in a distributed fashion. You take the date you need, issue 24 queries in a loop and glue the data together at the application level. However, the glued result can be huge (I do not know your presentation or export requirements, so this can pull 1M records into memory).
Another idea is to have one table as a simple lookup table, keyed by date, whose values are the partition keys holding financial data for that date. Then when you read, you go first to the lookup table to get the keys and then to the partitions holding the results. You can also store a counter of values per partition key so you know what amount of data to expect.
All in all, it is best to figure out some natural bucket in your data set and add it to the date (organization, zip code etc.), and you can use the trick with the additional lookup table. This approach can be used for the symbol you mentioned. You can have symbols as partition keys, clustering per date, and the partitions holding results for that date as values. Then you query for symbol # on 29-10-2015 and you see that partitions A, D and Z have results, so you go to those partitions, get the financial data from them and glue it together at the application level.
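The 24-query loop can be sketched as follows. This is a plain Python illustration, not driver code: `fetch_partition` is a hypothetical callable standing in for whatever executes one query against a single (date, hour) partition, and the in-memory `fake_store` with its made-up rows only exists so the example runs on its own.

```python
from datetime import date


def hourly_partition_keys(day):
    """Return the 24 (date, hour) partition keys that together cover one day."""
    return [(day.isoformat(), hour) for hour in range(24)]


def fetch_day(day, fetch_partition):
    """Query each hourly partition in turn and glue the rows together."""
    rows = []
    for key in hourly_partition_keys(day):
        rows.extend(fetch_partition(key))
    return rows


# Hypothetical stand-in for the database: partition key -> rows.
fake_store = {
    ("2015-10-29", 9): ["trade-1", "trade-2"],
    ("2015-10-29", 15): ["trade-3"],
}
result = fetch_day(date(2015, 10, 29), lambda key: fake_store.get(key, []))
print(result)
```

If the full day does not fit comfortably in memory, the same loop could yield rows per partition instead of accumulating them in one list.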
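The lookup-table approach amounts to a two-step read, sketched here with plain dictionaries standing in for the two tables. The symbol name, partition names and rows are all made up for illustration; in practice each dictionary access would be one query against the corresponding table.

```python
def read_via_lookup(symbol, day, lookup_table, data_table):
    """First find which partitions hold data, then read only those partitions."""
    rows = []
    for partition_key in lookup_table.get((symbol, day), []):
        rows.extend(data_table.get(partition_key, []))
    return rows


# Hypothetical lookup table: (symbol, date) -> partitions holding data.
lookup_table = {("SYM", "2015-10-29"): ["A", "D", "Z"]}
# Hypothetical data table: partition -> rows.
data_table = {"A": ["row-1"], "B": ["row-2"], "D": ["row-3"], "Z": ["row-4"]}

print(read_via_lookup("SYM", "2015-10-29", lookup_table, data_table))
```

Partition B is never touched, which is the point of the extra table: the reader visits only the partitions known to hold data for that symbol and date.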

Resources