how does sortWithinPartitions sort? - apache-spark

After applying sortWithinPartitions to a df and writing the output to a table I'm getting a result I'm not sure how to interpret.
df
.select($"type", $"id", $"time")
.sortWithinPartitions($"type", $"id", $"time")
result file looks somewhat like
1 a 5
2 b 1
1 a 6
2 b 2
1 a 7
2 b 3
1 a 8
2 b 4
It's not actually random, but neither is it sorted like I would expect it to be. Namely, first by type, then id, then time.
If I try to use a repartition before sorting, then I get the result I want. But for some reason the files weight 5 times more(100gb vs 20gb).
I'm writing to a hive orc table with compresssion set to snappy.
Does anyone know why it's sorted like this and why a repartition gets the right order, but a larger size?
Using spark 2.2.

The documentation of sortWithinPartition states
Returns a new Dataset with each partition sorted by the given expressions
The easiest way to think of this function is to imagine a fourth column (the partition id) that is used as primary sorting criterion. The function spark_partition_id() prints the partition.
For example if you have just one large partition (something that you as a Spark user would never do!), sortWithinPartition works as a normal sort:
df.repartition(1)
.sortWithinPartitions("type","id","time")
.withColumn("partition", spark_partition_id())
.show();
prints
+----+---+----+---------+
|type| id|time|partition|
+----+---+----+---------+
| 1| a| 5| 0|
| 1| a| 6| 0|
| 1| a| 7| 0|
| 1| a| 8| 0|
| 2| b| 1| 0|
| 2| b| 2| 0|
| 2| b| 3| 0|
| 2| b| 4| 0|
+----+---+----+---------+
If there are more partitions, the results are only sorted within each partition:
df.repartition(4)
.sortWithinPartitions("type","id","time")
.withColumn("partition", spark_partition_id())
.show();
prints
+----+---+----+---------+
|type| id|time|partition|
+----+---+----+---------+
| 2| b| 1| 0|
| 2| b| 3| 0|
| 1| a| 5| 1|
| 1| a| 6| 1|
| 1| a| 8| 2|
| 2| b| 2| 2|
| 1| a| 7| 3|
| 2| b| 4| 3|
+----+---+----+---------+
Why would one use sortWithPartition instead of sort? sortWithPartition does not trigger a shuffle, as the data is only moved within the executors. sort however will trigger a shuffle. Therefore sortWithPartition executes faster. If the data is partitioned by a meaningful column, sorting within each partition might be enough.

Related

Check if a column is consecutive with groupby in pyspark

I have a pyspark dataframe that looks like this:
import pandas as pd
foo = pd.DataFrame({'group': ['a','a','a','b','b','c','c','c'], 'value': [1,2,3,4,5,2,4,5]})
I would like to create a new binary column is_consecutive that indicates if the values in the value column are consecutive by group.
The output should look like this:
foo = pd.DataFrame({'group': ['a','a','a','b','b','c','c','c'], 'value': [1,2,3,4,5,2,4,5],
'is_consecutive': [1,1,1,1,1,0,0,0]})
How could I do that in pyspark?
You can use lag to compare values with the previous row and check if they are consecutive, then use min to determine whether all rows are consecutive in a given group.
from pyspark.sql import functions as F, Window
df2 = df.withColumn(
'consecutive',
F.coalesce(
F.col('value') - F.lag('value').over(Window.partitionBy('group').orderBy('value')) == 1,
F.lit(True)
).cast('int')
).withColumn(
'all_consecutive',
F.min('consecutive').over(Window.partitionBy('group'))
)
df2.show()
+-----+-----+-----------+---------------+
|group|value|consecutive|all_consecutive|
+-----+-----+-----------+---------------+
| c| 2| 1| 0|
| c| 4| 0| 0|
| c| 5| 1| 0|
| b| 4| 1| 1|
| b| 5| 1| 1|
| a| 1| 1| 1|
| a| 2| 1| 1|
| a| 3| 1| 1|
+-----+-----+-----------+---------------+
You can use lead and subtract the same with the existing value then find max of the window, once done , put a condition saying return 0 is max is >1 else return 1
w = Window.partitionBy("group").orderBy(F.monotonically_increasing_id())
(foo.withColumn("Diff",F.lead("value").over(w)-F.col("value"))
.withColumn("is_consecutive",F.when(F.max("Diff").over(w)>1,0).otherwise(1))
.drop("Diff")).show()
+-----+-----+--------------+
|group|value|is_consecutive|
+-----+-----+--------------+
| a| 1| 1|
| a| 2| 1|
| a| 3| 1|
| b| 4| 1|
| b| 5| 1|
| c| 2| 0|
| c| 4| 0|
| c| 5| 0|
+-----+-----+--------------+

aggregate function in lit() of pyspark along with withColumn

I have column quantity in dataframe. I want to add a new column to this dataframe with each record having min("Quantity"). I am trying to use lit() in pyspark. something like below
df.withColumn("min_quant", lit(min(col("Quantity")))).show().
It's resulting in the getting below error
grouping expressions sequence is empty, and `InvoiceNo` is not an aggregate function.
Wrap (min(`Quantity`) AS `min_quant`) in windowing function(s) or wrap
This is working:
df.withColumn("min_quant", lit(2)).show().
But, in place of 2 here, I want min(Quantity). Am I missing something?
Please try using window function as min() function needs aggregation.
val windowSpec = Window.orderBy("InvoiceNo")
df.withColumn("min_quant", min("Quantity") over(windowSpec)).show()
Sample Result:
+---------+----+--------+---------+
|InvoiceNo|name|Quantity|min_quant|
+---------+----+--------+---------+
| 1| ABC| 19| 1|
| 1| ABC| 1| 1|
| 1| ABC| 8| 1|
| 1| ABC| 389| 1|
| 1| ABC| 196| 1|
| 2| CBD| 10| 1|
| 2| CBD| 946| 1|
| 3| XYZ| 3| 1|
+---------+----+--------+---------+

Skewed By in Spark

I have a dataset that I want to partition by a particular key (clientID) but some clients produce far, far more data that others. There's a feature in Hive called either "ListBucketing" invoked by "skewed by" specifically to deal with this situation.
However, I cannot find any indication that Spark supports this feature, or how (if it does support it) to make use of it.
Is there a Spark feature that is the equivalent? Or, does Spark have some other set of features by which this behavior can be replicated?
(As a bonus - and requirement for my actual use-case - does your suggest method work with Amazon Athena?)
As far as I know, there is no such out of the box tool in Spark. In case of skewed data, what's very common is to add an artificial column to further bucketize the data.
Let's say you want to partition by column "y", but the data is very skewed like in this toy example (1 partition with 5 rows, the others with only one row):
val df = spark.range(8).withColumn("y", when('id < 5, 0).otherwise('id))
df.show()
+---+---+
| id| y|
+---+---+
| 0| 0|
| 1| 0|
| 2| 0|
| 3| 0|
| 4| 0|
| 5| 5|
| 6| 6|
| 7| 7|
+-------+
Now let's add an artificial random column and write the dataframe.
val maxNbOfBuckets = 3
val part_df = df.withColumn("r", floor(rand() * nbOfBuckets))
part_df.show
+---+---+---+
| id| y| r|
+---+---+---+
| 0| 0| 2|
| 1| 0| 2|
| 2| 0| 0|
| 3| 0| 0|
| 4| 0| 1|
| 5| 5| 2|
| 6| 6| 2|
| 7| 7| 1|
+---+---+---+
// and writing. We divided the partition with 5 elements into 3 partitions.
part_df.write.partitionBy("y", "r").csv("...")

Spark pairwise differences within groups

I have a spark dataframe, for the sake of argument lets take it to be:
val df = sc.parallelize(
Seq(("a",1,2),("a",1,4),("b",5,6),("b",10,2),("c",1,1))
).toDF("id","x","y")
+---+---+---+
| id| x| y|
+---+---+---+
| a| 1| 2|
| a| 1| 4|
| b| 5| 6|
| b| 10| 2|
| c| 1| 1|
+---+---+---+
I would like to compute all pairwise differences between entries in the dataframe with the same id and output the result to another dataframe. For a small dataframe I can accomplish this by:
df.crossJoin(
df.select(
(df.columns.map(x=>col(x).as("_"+x))):_*)
).where(
col("id")===col("_id")
).select(
col("id"),
(col("x")-col("_x")).as("dx"),
(col("y")-col("_y")).as("dy")
)
+---+---+---+
| id| dx| dy|
+---+---+---+
| c| 0| 0|
| b| 0| 0|
| b| -5| 4|
| b| 5| -4|
| b| 0| 0|
| a| 0| 0|
| a| 0| -2|
| a| 0| 2|
| a| 0| 0|
+---+---+---+
However, for large dataframes this isn't a reasonable approach as the crossJoin will mostly produce data that will be discarded by the subsequent where clause.
I'm still pretty new to spark and groupBy seemed like a natural place to start looking, but I can't figure out how to accomplish this using groupBy. Any help would be welcome.
I would eventually like to remove redundancy, for instance in:
val df1 = df.withColumn("idx",monotonicallyIncreasingId)
df.crossJoin(
df.select(
(df.columns.map(x=>col(x).as("_"+x))):_*)
).where(
col("id")===col("_id") && col("idx") < col("_idx")
).select(
col("id"),
(col("x")-col("_x")).as("dx"),
(col("y")-col("_y")).as("dy")
)
+---+---+---+
| id| dx| dy|
+---+---+---+
| b| -5| 4|
| a| 0| -2|
+---+---+---+
But if its easier to accomplish this with redundancy, then I can live with that.
This is not an uncommon transformation to perform in ML so I thought something out of MLlib might be appropriate, but again I haven't found anything there either.
Can be achived via inner join, result the same as expected:
df.alias("left").join(df.alias("right"),"id")
.select($"id",
($"left.x"-$"right.x").alias("dx"),
($"left.y"-$"right.y").alias("dy"))

Convert Dataframe to multiple 2D arrays

I have this dataset:
+----+-----+-------+-----+
|code|code2|machine|value|
+----+-----+-------+-----+
| 1| 2| A| 42|
| 2| 1| A| 11|
| 1| 4| A| 55|
| 1| 1| B| 2|
| 3| 3| B| 34|
| 3| 2| B| 111|
+----+-----+-------+-----+
I want that for each machine a kind of matrix like the following:
code and code2 are the column and at the intersection I want to fill the value.
Machine A
+----+----+----+----+----+
| A| 1| 2| 3| 4|
+----+----+----+----+----+
| 1| 0| 11| 0| 0|
| 2| 42| 0| 0| 0|
| 3| 0| 0| 0| 0|
| 4| 55| 0| 0| 0|
+----+----+----+----+----+
Machine B
+----+----+----+----+----+
| B| 1| 2| 3| 4|
+----+----+----+----+----+
| 1| 2| 0| 0| 0|
| 2| 0| 0| 111| 0|
| 3| 0| 0| 34| 0|
| 4| 0| 0| 0| 0|
+----+----+----+----+----+
I have multiple machine there (unknown number) and the codes can only be 0-255.
So my problem is how to achieve that matrix...
My fist naive idea was to make a hashmap and as key the machine name and as value a 256x256 2D array. But I don't think it would be efficient and I also don't know how to achieve that.
Or probably have a dataset for each machine??
If someone has an idea I would like to listen.
Btw I'm using Scala.
For maximum coding flexibility, you could switch to the RDD API. An example of a solution would give you a RDD that maps a machine to its matrix, represented as a scala two-dimensional array. Note that Array.ofDimInt creates a two-dim array of sine n*m with zeros everywhere.
df
.map(x=> x.getAs[String]("machine") -> (x.getAs[Int]("code"), x.getAs[Int]("code2"),x.getAs[Int]("value")))
.groupByKey
.mapValues( seq => {
var result = Array.ofDim[Int](256, 256)
seq.foreach{ case (i,j,value) => result(i)(j) = value }
result
})

Resources