SparkSQL DataFrame order by across partitions - apache-spark

I'm using spark sql to run a query over my dataset. The result of the query is pretty small but still partitioned.
I would like to coalesce the resulting DataFrame and order the rows by a column. I tried
DataFrame result = sparkSQLContext.sql("my sql").coalesce(1).orderBy("col1")
I also tried
DataFrame result = sparkSQLContext.sql("my sql").repartition(1).orderBy("col1")
In both cases the output file is ordered in chunks (i.e. each partition is ordered, but the data frame as a whole is not). For example, instead of
1, value
2, value
4, value
4, value
5, value
5, value
I get
2, value
4, value
5, value
-----------> partition boundary
1, value
4, value
5, value
What is the correct way to get an absolute ordering of my query result?
Why isn't the data frame being coalesced into a single partition?

I want to mention a couple of things here.
1- The source code shows that orderBy internally calls the sorting API with global ordering set to true. So a lack of ordering in the output suggests that the ordering was lost while writing to the target. My point is that a call to orderBy always requests a global order.
2- A drastic coalesce, such as forcing a single partition in your case, can be really dangerous, and I would recommend against it. The source code suggests that calling coalesce(1) can cause upstream transformations to run on a single partition, which would be brutal performance-wise.
3- You seem to expect orderBy to be executed on a single partition. I do not think that is the case; it would make Spark a really silly distributed framework.
Community, please let me know if you agree or disagree with these statements.
How are you collecting data from the output anyway?
Maybe the output actually contains sorted data, but the transformations or actions you perform to read the output are responsible for losing the order.

The orderBy will produce new partitions after your coalesce. To have a single output partition, reorder the operations...
DataFrame result = spark.sql("my sql").orderBy("col1").coalesce(1)
As @JavaPlanet mentioned, for really big data you don't want to coalesce into a single partition. It will drastically reduce your level of parallelism.
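To see why the order of the two calls matters, here is a toy model in plain Python (not Spark code). It assumes that orderBy range-partitions the rows and sorts inside each new partition, and that coalesce(1) simply concatenates partitions in order. The function names and the `bounds` parameter are made up for this illustration; real Spark picks range boundaries by sampling.

```python
def range_partition(rows, bounds):
    """Place each row in the range its key falls into (bounds are upper limits)."""
    parts = [[] for _ in range(len(bounds) + 1)]
    for row in rows:
        idx = sum(1 for b in bounds if row[0] > b)
        parts[idx].append(row)
    return parts


def order_by(partitions, bounds):
    """Toy orderBy: range-shuffle all rows, then sort inside each new partition."""
    all_rows = [row for part in partitions for row in part]
    return [sorted(p) for p in range_partition(all_rows, bounds)]


def coalesce_to_one(partitions):
    """Toy coalesce(1): concatenate partitions in order, without re-sorting."""
    return [[row for part in partitions for row in part]]


# Two input partitions, as in the question.
source = [
    [(2, "value"), (4, "value"), (5, "value")],
    [(1, "value"), (4, "value"), (5, "value")],
]

result = coalesce_to_one(order_by(source, bounds=[3]))
keys = [k for k, _ in result[0]]
print(keys)  # [1, 2, 4, 4, 5, 5]
```

In this model the sort happens before the merge, so the single output partition is globally ordered. With the calls the other way round, orderBy would split the single partition up again.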


Repartition on non-deterministic expression

I want to write code like this:
df.repartition(42, monotonically_increasing_id() / lit(10000))
Is this code going to break something because of the non-deterministic expression in repartition? I understand that this code will turn into HashPartitioning, which is deterministic.
What worries me is that Spark sorts partitions internally before applying RoundRobin partitioning because of its non-deterministic nature.
I want my DataFrame to be reshuffled in bigger chunks to get some data homogeneity for better compression.
RangePartitioning is too slow and may have similar problems with non-determinism.
I ran this code and it works correctly, but I want to make sure it is resilient to node failures.
Yes, this code will turn into HashPartitioning. RoundRobin is used only when you pass the number of partitions to repartition without any partitioning expression.
In your case I think you should be fine. Let's take a look at what Spark produces in its plan. The most important part for us is here:
(2) Project [codegen id : 1]
Output [1]: [monotonically_increasing_id() AS _nondeterministic#64L]
Input: []
(3) Exchange
Input [1]: [_nondeterministic#64L]
Arguments: hashpartitioning((cast(_nondeterministic#64L as double) / 10000.0), 42), REPARTITION_BY_NUM, [id=#231]
So we have two steps: first a Project, which gets the value from monotonically_increasing_id, and then the hashpartitioning exchange.
Let's say our input has 10 partitions. We do the project and then the exchange successfully for 9 partitions, but 1 fails and needs to be recomputed. At this stage the data from partitions 1-9 is already calculated, but for partition 10 Spark needs to call monotonically_increasing_id() again.
See the Spark 3.0 source code for this function.
It looks like this function is non-deterministic because its result depends on the partition number. So the question is whether the partition number changes during recomputation, and at the moment I don't have an answer. If it does not change (that's my expectation), you are going to get the same values. If it does change, you are going to get different values and your data may be distributed a little differently, but that should still be OK in your case (the data distribution should be very similar).
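A rough plain-Python sketch of that reasoning, assuming the documented ID layout (partition index in the upper bits, per-partition row number in the lower 33 bits). The helper names are made up, and Python's hash() only stands in for Spark's real hash function, so the partition numbers themselves are not what Spark would produce.

```python
def monotonically_increasing_ids(partition_index, num_rows):
    """IDs for one partition: partition index in the upper bits, row number in the lower 33."""
    return [(partition_index << 33) + row for row in range(num_rows)]


def target_partition(row_id, chunk=10000, num_partitions=42):
    """Stand-in for hashpartitioning(id / chunk, num_partitions)."""
    return hash(row_id / chunk) % num_partitions


# First attempt and a recomputation of the same input partition (index 9).
first_attempt = monotonically_increasing_ids(9, 5)
retry_same_index = monotonically_increasing_ids(9, 5)

# Hypothetical case: the recomputed task sees a different partition index.
retry_other_index = monotonically_increasing_ids(8, 5)

print(first_attempt == retry_same_index)   # True
print(first_attempt == retry_other_index)  # False
print([target_partition(i) for i in first_attempt] ==
      [target_partition(i) for i in retry_same_index])  # True
```

As long as the recomputed task keeps the same partition index, the IDs (and therefore the hash targets) come out the same. Whether the index can change on retry is exactly the open question above.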

Get PySpark to output one file per column value (repartition / partitionBy not working)

I've seen many answers and blog posts suggesting that:
df.repartition('category').write().partitionBy('category')
will output one file per category, but this doesn't appear to be true if the number of unique 'category' values in df is less than the number of default partitions (usually 200).
When I use the above code on a file with 100 categories, I end up with 100 folders each containing between 1 and 3 "part" files, rather than having all rows with a given "category" value in the same "part". A linked answer seems to explain this.
What is the fastest way to get exactly one file per partition value?
Things I've tried
I've seen many claims that
df.repartition(1, 'category').write().partitionBy('category')
df.repartition(2, 'category').write().partitionBy('category')
Will create "exactly one file per category" and "exactly two files per category" respectively, but this doesn't appear to be how this parameter works. The documentation makes it clear that the numPartitions argument is the total number of partitions to create, not the number of partitions per column value. Based on that documentation, specifying this argument as 1 should (accidentally) output a single file per partition when the file is written, but presumably only because it removes all parallelism and forces your entire RDD to be shuffled / recalculated on a single node.
required_partitions = df.select('category').distinct().count()
df.repartition(required_partitions, 'category').write().partitionBy('category')
The above seems like a workaround based on the documented behaviour, but one that would be costly for several reasons. For one, a separate count is wasteful if df is expensive to compute and not cached (and/or so big that it would be wasteful to cache just for this purpose). Also, any repartitioning of a dataframe can cause unnecessary shuffling in a multi-stage workflow that has various dataframe outputs along the way.
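To make the meaning of numPartitions concrete, here is a toy model of hash repartitioning in plain Python. It assumes each row goes to partition hash(key) mod numPartitions. The `repartition` function below is a made-up stand-in, and zlib.crc32 replaces Spark's actual hash function, so the exact assignment will differ from a real cluster.

```python
import zlib
from collections import defaultdict


def repartition(rows, num_partitions, key):
    """Toy repartition(num_partitions, key): partition = hash(key) mod num_partitions."""
    partitions = defaultdict(list)
    for row in rows:
        h = zlib.crc32(str(row[key]).encode("utf-8"))
        partitions[h % num_partitions].append(row)
    return dict(partitions)


# 1000 rows spread over 100 distinct categories.
rows = [{"category": f"cat{i % 100}", "value": i} for i in range(1000)]

for n in (1, 2, 100):
    parts = repartition(rows, n, "category")
    categories_per_partition = [len({r["category"] for r in p}) for p in parts.values()]
    # n, number of non-empty partitions, most categories sharing one partition
    print(n, len(parts), max(categories_per_partition))
```

The number of non-empty partitions never exceeds n, however many distinct categories there are, which matches the reading of the documentation above: n is a total, not a per-category count.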
The "fastest" way probably depends on the actual hardware set-up and the actual data (in case it is skewed). I also agree that, to my knowledge, df.repartition('category').write().partitionBy('category') will not solve your problem.
We faced a similar problem in our application, but instead of doing a count first and then the repartition, we split the writing of the data and the requirement to have a single file per partition into two different Spark jobs. The first job is optimized to write the data. The second job just iterates over the partitioned folder structure, reads the data per folder/partition, coalesces it into one partition and overwrites the folder. Again, I cannot tell whether that is also the fastest way in your environment, but for us it did the trick.
Some research on this topic led me to the Auto Optimize Writes feature on Databricks for writing to a Delta Table. They use a similar approach: first write the data, then run a separate OPTIMIZE job to aggregate the files into a single file. In the mentioned link you will find this explanation:
"After an individual write, Azure Databricks checks if files can further be compacted, and runs an OPTIMIZE job [...] to further compact files for partitions that have the most number of small files."
As a side note: make sure the configuration spark.sql.files.maxRecordsPerFile stays at 0 (the default value) or is set to a negative number. Otherwise, this configuration alone can lead to multiple files for data with the same value in the column "category".
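The effect of that setting on the file count can be sketched with a little arithmetic. This is plain Python, not Spark code; `files_written` is a made-up helper, and it assumes all rows of one category are written by a single task.

```python
import math


def files_written(rows_in_category, max_records_per_file=0):
    """Files produced for one category when a single task writes all of its rows."""
    if rows_in_category <= 0:
        return 0
    if max_records_per_file <= 0:
        # Zero or negative means "no limit", so everything fits in one file.
        return 1
    return math.ceil(rows_in_category / max_records_per_file)


print(files_written(25000))         # 1
print(files_written(25000, 10000))  # 3
print(files_written(25000, -1))     # 1
```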
You can try coalesce(n), where n is the number of output partitions you want. coalesce is used to decrease the number of partitions and is an optimized version of repartition for that case.

Spark: where doesn't work properly

I have 2 datasets, and I want to join them, so I did
Dataset<Row> join = ds1.join(ds2, "id");
However, to improve performance I tried to replace join with .where(cond) (I also tried .filter(cond)) like this:
Dataset<Row> join = ds1.where(col("id").equalTo(ds2.col("id")));
which also works, but not when one of the datasets is empty (in that case it returns the non-empty dataset), which is not the expected result.
So my question is: why doesn't .where work properly in that case, and is there another optimized way of joining 2 datasets without using join()?
A join and a where condition are two different things. Your where condition will fail because of an attribute resolution issue. A where or filter condition is specific to the DataFrame it is applied to. If you mention a second DataFrame in the condition, it won't iterate over it the way a join does. Please check whether your code returns a result at all.
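The difference can be illustrated with plain Python lists standing in for the two datasets. This is only a sketch: `inner_join` and `where` are made-up helpers, and the predicate assumes (a guess, not something confirmed here) that in the cases where the where call does run, the condition ends up comparing ds1's id column with itself.

```python
def inner_join(left, right, key):
    """Inner join: one merged row for every matching (left, right) pair."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]


def where(rows, predicate):
    """Filter: looks at one row of a single dataset at a time."""
    return [row for row in rows if predicate(row)]


ds1 = [{"id": 1, "a": "x"}, {"id": 2, "a": "y"}]
ds2 = []  # the empty dataset from the question

joined = inner_join(ds1, ds2, "id")
filtered = where(ds1, lambda row: row["id"] == row["id"])  # id compared with itself

print(joined)    # []
print(filtered)  # both rows of ds1
```

The join pairs rows from both sides, so an empty side gives an empty result. The filter never looks at ds2's rows, which would explain getting the non-empty dataset back.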
One of the key points when you want to join two RDDs is the partitioner used by each of them. If the first and the second RDD have the same partitioner, the join will perform as well as it can. If the partitioners differ, the first RDD's partitioner is used to partition the second RDD.
So try to use a "light key", e.g. an encoded or hashed form of a String instead of the raw value, and the same partitioner for both RDDs.

DataFrame orderBy followed by limit in Spark

I have a program that generates a DataFrame, on which it runs something like
Select Col1, Col2...
orderBy(ColX) limit(N)
However, when I collect the data at the end, I find that it causes the driver to OOM if I take a large enough top N.
Another observation is that if I do just the sort or just the top, the problem does not happen. It only happens when sort and top are used together.
I am wondering why this could be happening. In particular, what really goes on underneath this combination of transformations? How does Spark evaluate a query with both sorting and limit, and what is the corresponding execution plan?
Also, just curious: does Spark handle sort and top differently between DataFrame and RDD?
Sorry, I didn't mean collect.
What I originally meant is that this happens when I call any action to materialize the data, regardless of whether it is collect (or any other action sending data back to the driver) or not, so the problem is definitely not the output size.
While it is not clear why this fails in this particular case, there are multiple issues you may encounter:
When you use limit, it simply puts all data on a single partition, no matter how big n is. So while it doesn't explicitly collect, it is almost as bad.
On top of that, orderBy requires a full shuffle with range partitioning, which can cause further issues when the data distribution is skewed.
Finally, when you collect, the results can be larger than the amount of memory available on the driver.
If you collect anyway, there is not much you can improve here. At the end of the day driver memory will be a limiting factor, but there are still some possible improvements:
First of all, don't use limit.
Replace collect with toLocalIterator.
Use either orderBy |> rdd |> zipWithIndex |> filter or, if an exact number of values is not a hard requirement, filter the data directly based on an approximated distribution, as shown in "Saving a spark dataframe in multiple parts without repartitioning" (in Spark 2.0.0+ there is a handy approxQuantile method).
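The orderBy |> rdd |> zipWithIndex |> filter idea can be sketched on plain Python lists. This is a toy model, not Spark code: it assumes the partitions are already globally sorted and shows that each partition can keep its own share of the top n without everything being moved to one partition. The helper names are made up.

```python
def zip_with_index(partitions):
    """Toy zipWithIndex: global index = rows in earlier partitions + local position."""
    offsets, total = [], 0
    for part in partitions:
        offsets.append(total)
        total += len(part)
    return [
        [(row, offset + i) for i, row in enumerate(part)]
        for part, offset in zip(partitions, offsets)
    ]


def top_n(sorted_partitions, n):
    """Keep rows whose global index is below n; partitions stay where they are."""
    indexed = zip_with_index(sorted_partitions)
    return [[row for row, idx in part if idx < n] for part in indexed]


# Partitions as they might look after a global sort.
sorted_partitions = [[1, 2], [4, 4], [5, 5]]
print(top_n(sorted_partitions, 3))  # [[1, 2], [4], []]
```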

Why does df.limit keep changing in Pyspark?

I'm creating a data sample from some dataframe df with
rdd = df.limit(10000).rdd
This operation takes quite some time (why, actually? Can it not short-cut after 10000 rows?), so I assume I have a new RDD now.
However, when I now work on rdd, it contains different rows every time I access it, as if it resamples over and over again. Caching the RDD helps a bit, but surely that's not safe?
What is the reason behind it?
Update: Here is a reproduction on Spark 1.5.2
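from operator import add
from pyspark.sql import Row
# sc is the SparkContext provided by the PySpark shell
rdd = sc.parallelize([Row(i=i) for i in range(1000000)], 100)
# limit value taken from the question above
rdd1 = rdd.toDF().limit(10000).rdd
for _ in range(3):
    print(rdd1.map(lambda row: row.i).reduce(add))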
The output is a different sum on each iteration.
I'm surprised that .rdd doesn't fix the data.
To show that it gets trickier than the re-execution issue, here is a single action which produces incorrect results on Spark
from pyspark.sql import Row
rdd = sc.parallelize([Row(i=i) for i in range(1000000)], 200)
rdd1 = rdd.toDF().limit(12345).rdd
rdd2 = rdd1.map(lambda x: (x, x))
rdd2.join(rdd2).count()
# result is 10240 despite doing a self-join
Basically, whenever you use limit your results might be wrong. I don't mean "just one of many samples", but really incorrect (since in this case the result should always be 12345).
Because Spark is distributed, in general it's not safe to assume deterministic results. Your example is taking the "first" 10,000 rows of a DataFrame. Here, there's ambiguity (and hence non-determinism) in what "first" means. That will depend on the internals of Spark. For example, it could be the first partition that responds to the driver. That partition could change with networking, data locality, etc.
Even once you cache the data, I still wouldn't rely on getting the same data back every time, though I certainly would expect it to be more consistent than reading from disk.
Spark is lazy, so each action you take recalculates the data returned by limit(). If the underlying data is split across multiple partitions, then every time you evaluate it, limit might be pulling from a different partition (i.e. if your data is stored across 10 Parquet files, the first limit call might pull from file 1, the second from file 7, and so on).
From the Spark docs:
The LIMIT clause is used to constrain the number of rows returned by the SELECT statement. In general, this clause is used in conjunction with ORDER BY to ensure that the results are deterministic.
So you need to sort the rows beforehand if you want the call to .limit() to be deterministic. But there is a catch! If you sort by a column that doesn't have unique values for every row, the so called "tied" rows (rows with same sorting key value) will not be deterministically ordered, thus the .limit() might still be nondeterministic.
You have two options to work around this:
Make sure you include a unique row id in the sorting call.
For example df.orderBy('someCol', 'rowId').limit(n).
You can define the rowId like so:
df = df.withColumn('rowId', func.monotonically_increasing_id())
If you only need a deterministic result within a single run, you could simply cache the result of the limit, df.limit(n).cache(), so that at least the rows from that limit do not change across consecutive actions that would otherwise recompute the limit and change the results.
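The first option can be illustrated in plain Python. This is a toy model, not Spark code: the two lists stand for the same rows arriving in a different order on two evaluations, and `limit_after_sort` is a made-up helper.

```python
def limit_after_sort(rows, n, sort_key):
    """Sort the rows, then keep the first n (a stand-in for orderBy(...).limit(n))."""
    return sorted(rows, key=sort_key)[:n]


# Same rows, different arrival order on two evaluations; all tie on someCol.
run_a = [
    {"rowId": 0, "someCol": 1},
    {"rowId": 1, "someCol": 1},
    {"rowId": 2, "someCol": 1},
]
run_b = list(reversed(run_a))


def by_col(row):
    return row["someCol"]


def by_col_and_id(row):
    return (row["someCol"], row["rowId"])


# Ties on someCol: the rows that survive depend on arrival order.
print(limit_after_sort(run_a, 2, by_col) == limit_after_sort(run_b, 2, by_col))  # False

# Unique tie-breaker: the same rows survive on every run.
print(limit_after_sort(run_a, 2, by_col_and_id) == limit_after_sort(run_b, 2, by_col_and_id))  # True
```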
