Align Dataset partitioning to table partitioning scheme - apache-spark

I am writing to a table partitioned by month. I know that my data is ≈100MB per partition, no skew - it is going to fit within single HDFS block and I want to ensure that every partition gets a single file written. I also know the exact number of months in my dataset (which is something between 1 and 10), therefore:
ds.repartition(nMonths, $"month").write.<options>.insertInto(<...>)
This works. However I'm thinking from here... As Spark uses key's hash to determine the partition, I have no guarantee that every partition will receive a single month's data. The more partitions I have, the less likely this actually is - right?
Does it make sense then to increase the number of partitions above number of distinct keys?
ds.repartition(nMonths * 3, $"month").write.<options>.insertInto(<...>)
Lots of partitions will be empty, but this shouldn't be that much of a pain (should it?) and we're reducing the probability that some unlucky partitions get 3x/4x data, increasing overall execution time. Does this make sense? Is there any rule of thumb regarding the factor? Or any other approach to achieve the same?

If you want to be super-safe you can use range partitioning, something like:
ds.repartitionByRange(nMonths,$"month").write...
This way you also won't be having empty partitions, which in turn means you won't produce zero-size files in HDFS too.

Related

Get PySpark to output one file per column value (repartition / partitionBy not working)

I've seen many answers and blob posts suggesting that:
df.repartition('category').write().partitionBy('category')
Will output one file per category, but this doesn't appear to be true if the number of unique 'category' values in df is less than the number of default partitions (usually 200).
When I use the above code on a file with 100 categories, I end up with 100 folders each containing between 1 and 3 "part" files, rather than having all rows with a given "category" value in the same "part". The answer at https://stackoverflow.com/a/42780452/529618 seems to explain this.
What is the fastest way get exactly one file per partition value?
Things I've tried
I've seen many claims that
df.repartition(1, 'category').write().partitionBy('category')
df.repartition(2, 'category').write().partitionBy('category')
Will create "exactly one file per category" and "exactly two files per category" respectively, but this doesn't appear to be how this parameter works. The documentation makes it clear that the numPartitions argument is the total number of partitions to create, not the number of partitions per column value. Based on that documentation, specifying this argument as 1 should (accidentally) output a single file per partition when the file is written, but presumably only because it removes all parallelism and forces your entire RDD to be shuffled / recalculated on a single node.
required_partitions = df.select('category').distinct().count()
df.repartition(required_partitions, 'category').write().partitionBy('category')
The above seems like a workaround based on the documented behaviour, but one that would be costly for several reasons. For one, a separate count if df is expensive and not cached (and/or so big that it would be wasteful to cache just for this purpose), and also any repartitioning of a dataframe can cause unnecessary shuffling in a multi-stage workflow that has various dataframe outputss along the way.
The "fastest" way probably depends on the actual hardware set-up and actual data (in case it is skewed). To my knowledge, I also agree that df.repartition('category').write().partitionBy('category') will not help solving your problem.
We faced a similar problem in our application but instead of doing first a count and then the repartition, we separated the writing of the data and the requirement to have only a single file per partition into two different Spark jobs. The first job is optimized to write the data. The second job just iterates over the partitioned folder structure and simply reads the data per folder/partition, coalesces its data to one partition and overwrites them back. Again, I can not tell if that is the fastest way also to your environment, but for us it did the trick.
Having done some research on this topic lead to the Auto Optimize Writes feature on Databricks for writing to a Delta Table. Here, they use a similar approach: First writing the data and then running a separate OPTIMIZE job to aggregate the files into a single file. In the mentioned link you will find this explanation:
"After an individual write, Azure Databricks checks if files can further be compacted, and runs an OPTIMIZE job [...] to further compact files for partitions that have the most number of small files."
As a side note: Make sure to keep the configuration spark.sql.files.maxRecordsPerFile to 0 (default value) or to a negative number. Otherwise, this configuration alone could lead to multiple files for data with the same value in the column "category".
You can try coalesce(n); coalesce is used to decrease the number of partitions, which is an optimized version of repartition.
n = The number of partitions you want to be output.

Write spark dataframe to single parquet file

I am trying to do something very simple and I'm having some very stupid struggles. I think it must have to do with a fundamental misunderstanding of what spark is doing. I would greatly appreciate any help or explanation.
I have a very large (~3 TB, ~300MM rows, 25k partitions) table, saved as parquet in s3, and I would like to give someone a tiny sample of it as a single parquet file. Unfortunately, this is taking forever to finish and I don't understand why. I have tried the following:
tiny = spark.sql("SELECT * FROM db.big_table LIMIT 500")
tiny.coalesce(1).write.saveAsTable("db.tiny_table")
and then when that didn't work I tried this, which I thought should be the same, but I wasn't sure. (I added the print's in an effort to debug.)
tiny = spark.table("db.big_table").limit(500).coalesce(1)
print(tiny.count())
print(tiny.show(10))
tiny.write.saveAsTable("db.tiny_table")
When I watch the Yarn UI, both print statements and the write are using 25k mappers. The count took 3 mins, the show took 25 mins, and the write took ~40 mins, although it finally did write the single file table I was looking for.
It seems to me like the first line should take the top 500 rows and coalesce them to a single partition, and then the other lines should happen extremely fast (on a single mapper/reducer). Can anyone see what I'm doing wrong here? I've been told maybe I should use sample instead of limit but as I understand it limit should be much faster. Is that right?
Thanks in advance for any thoughts!
I’ll approach the print functions issue first, as it’s something fundamental to understanding spark. Then limit vs sample. Then repartition vs coalesce.
The reasons the print functions take so long in this manner is because coalesce is a lazy transformation. Most transformations in spark are lazy and do not get evaluated until an action gets called.
Actions are things that do stuff and (mostly) dont return a new dataframe as a result. Like count, show. They return a number, and some data, whereas coalesce returns a dataframe with 1 partition (sort of, see below).
What is happening is that you are rerunning the sql query and the coalesce call each time you call an action on the tiny dataframe. That’s why they are using the 25k mappers for each call.
To save time, add the .cache() method to the first line (for your print code anyway).
Then the data frame transformations are actually executed on your first line and the result persisted in memory on your spark nodes.
This won’t have any impact on the initial query time for the first line, but at least you’re not running that query 2 more times because the result has been cached, and the actions can then use that cached result.
To remove it from memory, use the .unpersist() method.
Now for the actual query youre trying to do...
It really depends on how your data is partitioned. As in, is it partitioned on specific fields etc...
You mentioned it in your question, but sample might the right way to go.
Why is this?
limit has to search for 500 of the first rows. Unless your data is partitioned by row number (or some sort of incrementing id) then the first 500 rows could be stored in any of the the 25k partitions.
So spark has to go search through all of them until it finds all the correct values. Not only that, it has to perform an additional step of sorting the data to have the correct order.
sample just grabs 500 random values. Much easier to do as there’s no order/sorting of the data involved and it doesn’t have to search through specific partitions for specific rows.
While limit can be faster, it also has its, erm, limits. I usually only use it for very small subsets like 10/20 rows.
Now for partitioning....
The problem I think with coalesce is it virtually changes the partitioning. Now I’m not sure about this, so pinch of salt.
According to the pyspark docs:
this operation results in a narrow dependency, e.g. if you go from 1000 partitions to 100 partitions, there will not be a shuffle, instead each of the 100 new partitions will claim 10 of the current partitions.
So your 500 rows will actually still sit across your 25k physical partitions that are considered by spark to be 1 virtual partition.
Causing a shuffle (usually bad) and persisting in spark memory with .repartition(1).cache() is possibly a good idea here. Because instead of having the 25k mappers looking at the physical partitions when you write, it should only result in 1 mapper looking at what is in spark memory. Then write becomes easy. You’re also dealing with a small subset, so any shuffling should (hopefully) be manageable.
Obviously this is usually bad practice, and doesn’t change the fact spark will probably want to run 25k mappers when it performs the original sql query. Hopefully sample takes care of that.
edit to clarify shuffling, repartition and coalesce
You have 2 datasets in 16 partitions on a 4 node cluster. You want to join them and write as a new dataset in 16 partitions.
Row 1 for data 1 might be on node 1, and row 1 for data 2 on node 4.
In order to join these rows together, spark has to physically move one, or both of them, then write to a new partition.
That’s a shuffle, physically moving data around a cluster.
It doesn’t matter that everything is partitioned by 16, what matters is where the data is sitting on he cluster.
data.repartition(4) will physically move data from each 4 sets of partitions per node into 1 partition per node.
Spark might move all 4 partitions from node 1 over to the 3 other nodes, in a new single partition on those nodes, and vice versa.
I wouldn’t think it’d do this, but it’s an extreme case that demonstrates the point.
A coalesce(4) call though, doesn’t move the data, it’s much more clever. Instead, it recognises “I already have 4 partitions per node & 4 nodes in total... I’m just going to call all 4 of those partitions per node a single partition and then I’ll have 4 total partitions!”
So it doesn’t need to move any data because it just combines existing partitions into a joined partition.
Try this, in my empirical experience repartition works better for this kind of problems:
tiny = spark.sql("SELECT * FROM db.big_table LIMIT 500")
tiny.repartition(1).write.saveAsTable("db.tiny_table")
Even better if you are interested in the parquet you don't need to save it as a table:
tiny = spark.sql("SELECT * FROM db.big_table LIMIT 500")
tiny.repartition(1).write.parquet(your_hdfs_path+"db.tiny_table")

Spark count dataframe to estimate output partitions, then write, efficiently without caching?

As my spark program runs on more data, I think I am crashing because I'm picking up the default number of output partitions for aggregation - namely the 200. I've learned how to control this, but it seems ideally, I would set the number of output partitions based on the amount of data I'm writing. Here in lies the conundrum - I need to first call count() on the dataframe, and then write it. That means I may re-ready it from S3 twice. I could cache and then count, but I've seen spark crash when I cache this data, caching seems to use the most resources, whereas if I just write it - it can do something more optimal.
So my questions are, if you think this is a decent approach - doing a count first (the count is a proxy to the size on disk) or should you just hard code some numbers, change them when you need? And if I am going to count first, is their some clever way to optimize things so that the count and write share work? Other than caching the whole dataframe?
Yes the count approach is actually correct way to go. Ideally you want your rdd partitions to be some considerable size like 50MB before writing. Otherwise you will end up with "small file problem".
Now if you have large data caching in memory could be hard. You could try MEMORY_AND_DISK but then the data will spill to disk and cause slowdown.
I have faced this predicament multiple times and every time I have chosen a "magic number" for the number of partitions. The number is parameterized so when I need to change I don't need to change the code, rather pass the different parameter.
If you know your datasize is generally in a particular range you could set the partition number hard coded. It is not ideal but gets the job done.
Also you could pump the metrics like size of the data in s3 and if that breaches some threshold raise an alarm then someone could change the partition number manually.
In generally if you keep the partition number moderately high like 5000 for approximately 500GB data that works for a large range i.e from 300GB to 1.2TB amount of data. This means probably you don't need to change the partition number too often if you have moderate inflow of data.

Does joining additional columns in Spark scale horizontally?

I have a dataset with about 2.4M rows, with a unique key for each row. I have performed some complex SQL queries on some other tables, producing a dataset with two columns, a key and the value true. This dataset is about 500 rows. Now I would like to (outer) join this dataset with my original table.
This produces a new table with a very sparse set of values (true in about 500 rows, null elsewhere).
Finally, I would like to do this about 200 times, giving me a final table of about 201 columns (the key, plus the 200 sparse columns).
When I run this, I notice that as it runs it gets considerably slower. The first join takes 2 seconds, then 4s, then 6s, then 10s, then 20s and after about 30 joins the system never recovers. Of course, the actual numbers are irrelevant as that depends on the cluster I'm running, but I'm wondering:
Is this slowdown is expected?
I am using parquet as a data storage format (columnar storage) so I was hopeful that adding more columns would scale horizontally, is that a correct assumption?
All the columns I've joined so far are not needed for the Nth join, can they be unloaded from memory?
Are there other things I can do when combining lots of columns in spark?
Calling explain on each join in the loop shows that each join is getting more complex (appears to include all previous joins and it also includes the complex sql queries, even though those have been checkpointed). Is there a way to really checkpoint so each join is just a join? I am actually calling show() after each join, so I assumed the join is actually happening at that point.
Is this slowdown is expected
Yes, to some extent it is. Joins belong to the most expensive operations in a data intensive systems (it is not a coincidence that products which claim linear scalability usually take joins out of the table). Join-like operation in a distributed system typically require data exchange between nodes hitting a bunch of high latency numbers.
In Spark SQL there is also additional cost of computing execution plan, which has larger than linear complexity.
I am using parquet as a data storage format (columnar storage) so I was hopeful that adding more columns would scale horizontally, is that a correct assumption?
No. Input format doesn't affect join logic at all.
All the columns I've joined so far are not needed for the Nth join, can they be unloaded from memory?
If truly excluded from the final output they will be pruned from the execution plan. But since you for a reason, I assume it is not the case and there are required for the final output.
Is there a way to really checkpoint so each join is just a join? I am actually calling show() after each join, so I assumed the join is actually happening at that point.
show computes only a small subset of data required for the output. It doesn't cache, although shuffle files might be reused.
(appears to include all previous joins and it also includes the complex sql queries, even though those have been checkpointed).
Checkpoints are created only if data is fully computed and don't remove stages from the execution plan. If you want to do it explicitly, write partial result to persistent storage and read it back at the beginning of each iteration (it is probably an overkill).
Are there other things I can do when combining lots of columns in spark?
The best thing you can do is to find a way to avoid joins completely. If key is always the same then single shuffle, and operation on groups / partitions (with byKey method, window functions) might be better choice.
However if you
have a dataset with about 2.4M rows
then using non-distributed system that supports in-place modification might be much better choice.
In the most naive implementation you can compute each aggregate separately, sort by key and write to disk. Then data can be merged together line by line with negligible memory footprint.

How to deal with strongly varying data sizes in spark

I'm wondering about the best practice in designing spark-jobs where the volume of data is not known in advance (or is strongly varying). In my case, the application should both handle initial loads and later on incremental data.
I wonder how I should set the number of partitions in my data (e.g. using repartition or setting parameters like spark.sql.shuffle.partitions in order to avoid OOM excpetion in the executors (giving fixed amount of allocated memory per executor). I could
define a very high number of partition to make sure that even on very high workloads, the job does not fail
Set number of partitions at runtime depending on the size of source-data
Introduce an iteration over independent chunks of data (i.e. looping)
In all option, I see issues:
1: I imagine this to be inefficient for small data sizes as taks get very small
2: Needs additional querys (e.g. count) and e.g. for setting spark.sql.shuffle.partitions, SparkContext needs to be restartet which I would like to avoid
3: Seems to contradict the spirit of Spark
So I wonder what the most efficient strategy is for strongly varying data volumes.
EDIT:
I was wrong about setting spark.sql.shuffle.partitions, this can be set at runtime woutout restarting spark context
Do not set a high number of partitions without knowing this is needed. You will absolutely kill the performance of your job.
Yes
As you said, don't loop!
As you mention, you introduce an extra step which is to count your data, which at first glance seems wrong. However, you shouldn't think of this as mis-spent computation. Usually, the time it takes to count your data is significantly less than the time it would take to do further processing if you partition the data badly. Think of the count operation as an investment, it's certainly worth it.
You do not need to set partitions through the config and restart Spark. Instead, do the following:
Note current number of partitions for RDD / Dataframe / Dataset
Count number of entries / rows in your data
Based on an estimate of average row size, compute the target number of partitions
If #targetPartitions << #actualPartitions Then coalesce
Else If #targetPartitions >> #actualPartitions Then repartition
Else #targetPartitions ~= #actualPartitions Then do nothing
The coalesce operation will re-partition your data without shuffling, and so is much more efficient when it is available.
Ideally you can estimate the number of rows your will generate, rather than count them. Also, you will need to think carefully about when it is appropriate to perform this operation. With a long RDD lineage you can kill performance, because you may inadvertently reduce the number of cores which can execute complex code, due to scala lazy execution. Look into checkpointing to mitigate this problem.

Resources