Spark count dataframe to estimate output partitions, then write, efficiently without caching?

Spark count dataframe to estimate output partitions, then write, efficiently without caching? - apache-spark

As my spark program runs on more data, I think I am crashing because I'm picking up the default number of output partitions for aggregation - namely the 200. I've learned how to control this, but it seems ideally, I would set the number of output partitions based on the amount of data I'm writing. Here in lies the conundrum - I need to first call count() on the dataframe, and then write it. That means I may re-ready it from S3 twice. I could cache and then count, but I've seen spark crash when I cache this data, caching seems to use the most resources, whereas if I just write it - it can do something more optimal.
So my questions are, if you think this is a decent approach - doing a count first (the count is a proxy to the size on disk) or should you just hard code some numbers, change them when you need? And if I am going to count first, is their some clever way to optimize things so that the count and write share work? Other than caching the whole dataframe?

Yes the count approach is actually correct way to go. Ideally you want your rdd partitions to be some considerable size like 50MB before writing. Otherwise you will end up with "small file problem".
Now if you have large data caching in memory could be hard. You could try MEMORY_AND_DISK but then the data will spill to disk and cause slowdown.
I have faced this predicament multiple times and every time I have chosen a "magic number" for the number of partitions. The number is parameterized so when I need to change I don't need to change the code, rather pass the different parameter.
If you know your datasize is generally in a particular range you could set the partition number hard coded. It is not ideal but gets the job done.
Also you could pump the metrics like size of the data in s3 and if that breaches some threshold raise an alarm then someone could change the partition number manually.
In generally if you keep the partition number moderately high like 5000 for approximately 500GB data that works for a large range i.e from 300GB to 1.2TB amount of data. This means probably you don't need to change the partition number too often if you have moderate inflow of data.

Related

Align Dataset partitioning to table partitioning scheme

I am writing to a table partitioned by month. I know that my data is ≈100MB per partition, no skew - it is going to fit within single HDFS block and I want to ensure that every partition gets a single file written. I also know the exact number of months in my dataset (which is something between 1 and 10), therefore:
ds.repartition(nMonths, $"month").write.<options>.insertInto(<...>)
This works. However I'm thinking from here... As Spark uses key's hash to determine the partition, I have no guarantee that every partition will receive a single month's data. The more partitions I have, the less likely this actually is - right?
Does it make sense then to increase the number of partitions above number of distinct keys?
ds.repartition(nMonths * 3, $"month").write.<options>.insertInto(<...>)
Lots of partitions will be empty, but this shouldn't be that much of a pain (should it?) and we're reducing the probability that some unlucky partitions get 3x/4x data, increasing overall execution time. Does this make sense? Is there any rule of thumb regarding the factor? Or any other approach to achieve the same?

If you want to be super-safe you can use range partitioning, something like:
ds.repartitionByRange(nMonths,$"month").write...
This way you also won't be having empty partitions, which in turn means you won't produce zero-size files in HDFS too.

Write spark dataframe to single parquet file

I am trying to do something very simple and I'm having some very stupid struggles. I think it must have to do with a fundamental misunderstanding of what spark is doing. I would greatly appreciate any help or explanation.
I have a very large (~3 TB, ~300MM rows, 25k partitions) table, saved as parquet in s3, and I would like to give someone a tiny sample of it as a single parquet file. Unfortunately, this is taking forever to finish and I don't understand why. I have tried the following:
tiny = spark.sql("SELECT * FROM db.big_table LIMIT 500")
tiny.coalesce(1).write.saveAsTable("db.tiny_table")
and then when that didn't work I tried this, which I thought should be the same, but I wasn't sure. (I added the print's in an effort to debug.)
tiny = spark.table("db.big_table").limit(500).coalesce(1)
print(tiny.count())
print(tiny.show(10))
tiny.write.saveAsTable("db.tiny_table")
When I watch the Yarn UI, both print statements and the write are using 25k mappers. The count took 3 mins, the show took 25 mins, and the write took ~40 mins, although it finally did write the single file table I was looking for.
It seems to me like the first line should take the top 500 rows and coalesce them to a single partition, and then the other lines should happen extremely fast (on a single mapper/reducer). Can anyone see what I'm doing wrong here? I've been told maybe I should use sample instead of limit but as I understand it limit should be much faster. Is that right?
Thanks in advance for any thoughts!

I’ll approach the print functions issue first, as it’s something fundamental to understanding spark. Then limit vs sample. Then repartition vs coalesce.
The reasons the print functions take so long in this manner is because coalesce is a lazy transformation. Most transformations in spark are lazy and do not get evaluated until an action gets called.
Actions are things that do stuff and (mostly) dont return a new dataframe as a result. Like count, show. They return a number, and some data, whereas coalesce returns a dataframe with 1 partition (sort of, see below).
What is happening is that you are rerunning the sql query and the coalesce call each time you call an action on the tiny dataframe. That’s why they are using the 25k mappers for each call.
To save time, add the .cache() method to the first line (for your print code anyway).
Then the data frame transformations are actually executed on your first line and the result persisted in memory on your spark nodes.
This won’t have any impact on the initial query time for the first line, but at least you’re not running that query 2 more times because the result has been cached, and the actions can then use that cached result.
To remove it from memory, use the .unpersist() method.
Now for the actual query youre trying to do...
It really depends on how your data is partitioned. As in, is it partitioned on specific fields etc...
You mentioned it in your question, but sample might the right way to go.
Why is this?
limit has to search for 500 of the first rows. Unless your data is partitioned by row number (or some sort of incrementing id) then the first 500 rows could be stored in any of the the 25k partitions.
So spark has to go search through all of them until it finds all the correct values. Not only that, it has to perform an additional step of sorting the data to have the correct order.
sample just grabs 500 random values. Much easier to do as there’s no order/sorting of the data involved and it doesn’t have to search through specific partitions for specific rows.
While limit can be faster, it also has its, erm, limits. I usually only use it for very small subsets like 10/20 rows.
Now for partitioning....
The problem I think with coalesce is it virtually changes the partitioning. Now I’m not sure about this, so pinch of salt.
According to the pyspark docs:
this operation results in a narrow dependency, e.g. if you go from 1000 partitions to 100 partitions, there will not be a shuffle, instead each of the 100 new partitions will claim 10 of the current partitions.
So your 500 rows will actually still sit across your 25k physical partitions that are considered by spark to be 1 virtual partition.
Causing a shuffle (usually bad) and persisting in spark memory with .repartition(1).cache() is possibly a good idea here. Because instead of having the 25k mappers looking at the physical partitions when you write, it should only result in 1 mapper looking at what is in spark memory. Then write becomes easy. You’re also dealing with a small subset, so any shuffling should (hopefully) be manageable.
Obviously this is usually bad practice, and doesn’t change the fact spark will probably want to run 25k mappers when it performs the original sql query. Hopefully sample takes care of that.
edit to clarify shuffling, repartition and coalesce
You have 2 datasets in 16 partitions on a 4 node cluster. You want to join them and write as a new dataset in 16 partitions.
Row 1 for data 1 might be on node 1, and row 1 for data 2 on node 4.
In order to join these rows together, spark has to physically move one, or both of them, then write to a new partition.
That’s a shuffle, physically moving data around a cluster.
It doesn’t matter that everything is partitioned by 16, what matters is where the data is sitting on he cluster.
data.repartition(4) will physically move data from each 4 sets of partitions per node into 1 partition per node.
Spark might move all 4 partitions from node 1 over to the 3 other nodes, in a new single partition on those nodes, and vice versa.
I wouldn’t think it’d do this, but it’s an extreme case that demonstrates the point.
A coalesce(4) call though, doesn’t move the data, it’s much more clever. Instead, it recognises “I already have 4 partitions per node & 4 nodes in total... I’m just going to call all 4 of those partitions per node a single partition and then I’ll have 4 total partitions!”
So it doesn’t need to move any data because it just combines existing partitions into a joined partition.

Try this, in my empirical experience repartition works better for this kind of problems:
tiny = spark.sql("SELECT * FROM db.big_table LIMIT 500")
tiny.repartition(1).write.saveAsTable("db.tiny_table")
Even better if you are interested in the parquet you don't need to save it as a table:
tiny = spark.sql("SELECT * FROM db.big_table LIMIT 500")
tiny.repartition(1).write.parquet(your_hdfs_path+"db.tiny_table")

Why is it so bad to have large partitions in Cassandra?

I have seen this warning everywhere but cannot find any detailed explanation on this topic.

For starters
The maximum number of cells (rows x columns) in a single partition is
2 billion.
If you allow a partition to grow unbounded you will eventually hit this limitation.
Outside that theoretical limit, there are practical limitations tied to the impacts large partitions have on the JVM and read times. These practical limitations are constantly increasing from version to version. This practical limitation is not fixed but variable with data model, query patterns, heap size, and configurations which makes it hard to be give a straight answer on whats too large.
As of 2.1 and early 3.0 releases, the primary cost on reads and compactions comes from deserializing the index which marks a row every column_index_size_in_kb. You can increase the key_cache_size_in_mb for reads to prevent unnecessary deserialization but that reduces heap space and fills old gen. You can increase the column index size but it will increase worst case IO costs on reads. Theres also many different settings for CMS and G1 to tune the impact of a huge spike in object allocations when reading these big partitions. There are active efforts on improving this so in the future it might no longer be the bottleneck.
Repairs also only go down to (in best case scenario) the partition level. So if say you are constantly appending to a partition, and a hash of that partition on 2 nodes are compared at not an exact time (distributed system essentially guarantees this), the entire partition must be streamed over to ensure consistency. Incremental repairs can reduce impact of this, but your still streaming massive amounts of data and fluctuating disk significantly which will then need to be compacted together unnecessarily.
You can probably keep adding onto this of corner cases and scenarios that have issues. Many times large partitions are possible to read, but the tuning and corner cases involved in them are not really worth it, better to just design data model to be friendly with how Cassandra expects it. I would recommend targeting 100mb but you can go far beyond that comfortably. Into the Gbs and you will need to start consider tuning for it (depending on data model, use case etc).

How to deal with strongly varying data sizes in spark

I'm wondering about the best practice in designing spark-jobs where the volume of data is not known in advance (or is strongly varying). In my case, the application should both handle initial loads and later on incremental data.
I wonder how I should set the number of partitions in my data (e.g. using repartition or setting parameters like spark.sql.shuffle.partitions in order to avoid OOM excpetion in the executors (giving fixed amount of allocated memory per executor). I could
define a very high number of partition to make sure that even on very high workloads, the job does not fail
Set number of partitions at runtime depending on the size of source-data
Introduce an iteration over independent chunks of data (i.e. looping)
In all option, I see issues:
1: I imagine this to be inefficient for small data sizes as taks get very small
2: Needs additional querys (e.g. count) and e.g. for setting spark.sql.shuffle.partitions, SparkContext needs to be restartet which I would like to avoid
3: Seems to contradict the spirit of Spark
So I wonder what the most efficient strategy is for strongly varying data volumes.
EDIT:
I was wrong about setting spark.sql.shuffle.partitions, this can be set at runtime woutout restarting spark context

Do not set a high number of partitions without knowing this is needed. You will absolutely kill the performance of your job.
Yes
As you said, don't loop!
As you mention, you introduce an extra step which is to count your data, which at first glance seems wrong. However, you shouldn't think of this as mis-spent computation. Usually, the time it takes to count your data is significantly less than the time it would take to do further processing if you partition the data badly. Think of the count operation as an investment, it's certainly worth it.
You do not need to set partitions through the config and restart Spark. Instead, do the following:
Note current number of partitions for RDD / Dataframe / Dataset
Count number of entries / rows in your data
Based on an estimate of average row size, compute the target number of partitions
If #targetPartitions << #actualPartitions Then coalesce
Else If #targetPartitions >> #actualPartitions Then repartition
Else #targetPartitions ~= #actualPartitions Then do nothing
The coalesce operation will re-partition your data without shuffling, and so is much more efficient when it is available.
Ideally you can estimate the number of rows your will generate, rather than count them. Also, you will need to think carefully about when it is appropriate to perform this operation. With a long RDD lineage you can kill performance, because you may inadvertently reduce the number of cores which can execute complex code, due to scala lazy execution. Look into checkpointing to mitigate this problem.

Does having 1000's of CF's will lead to OOM in Cassandra

I am having a cluster with multiple CF's (around 1000 maybe more). And I get OOM errors time to time from different nodes. We have three Cassandra nodes? Is it an expected behavior in cassandra?

Each table (columnfamily) requires a minimum of 1MB of heap memory, so it's quite possible this is causing some pressure for you.
The best solution is to redesign your application to use less tables; most of the time I've seen this it's because someone designed it to have "one table per X" where X is a customer or a data source or even a time period. Instead, combine tables with a common schema and add a column to the primary key with the distinguishing element.
In the short term, you probably need to increase your heap size.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string