I used the first and last functions to get the first and last values of one column, but I found that neither function works the way I expected. I referred to the answer by zero323, but I am still confused about both of them. The code looks like:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

df = spark.sparkContext.parallelize([
    ("a", None), ("a", 1), ("a", -1), ("b", 3), ("b", 1)
]).toDF(["k", "v"])
w = Window().partitionBy("k").orderBy('k', 'v')
df.select(F.col("k"), F.last("v", True).over(w).alias('v')).show()
The result:
+---+----+
| k| v|
+---+----+
| b| 1|
| b| 3|
| a|null|
| a| -1|
| a| 1|
+---+----+
I expected it to be:
+---+----+
| k| v|
+---+----+
| b| 3|
| b| 3|
| a| 1|
| a| 1|
| a| 1|
+---+----+
because when I order df by 'k' and 'v', I get:
df.orderBy('k','v').show()
+---+----+
| k| v|
+---+----+
| a|null|
| a| -1|
| a| 1|
| b| 1|
| b| 3|
+---+----+
Additionally, I tried another way to test this kind of problem:
df.orderBy('k','v').groupBy('k').agg(F.first('v')).show()
I found that its results can be different every time I run it. Has anyone else run into this? I would like to use both functions in my project, but these results seem inconclusive.
Try inverting the sort order using .desc(); then first() will give the desired output.
w2 = Window().partitionBy("k").orderBy(df.v.desc())
df.select(F.col("k"), F.first("v",True).over(w2).alias('v')).show()
F.first("v",True).over(w2).alias('v').show()
Outputs:
+---+---+
| k| v|
+---+---+
| b| 3|
| b| 3|
| a| 1|
| a| 1|
| a| 1|
+---+---+
You should also be careful about partitionBy vs. orderBy. Since you are partitioning by 'k', all of the values of k in any given window are the same. Sorting by 'k' does nothing.
The last function is not really the opposite of first in terms of which item from the window it returns. It returns the last non-null value it has seen as it progresses through the ordered rows.
To compare their effects, here is a dataframe with both function/ordering combinations. Notice how in column 'last_w2', the null value has been replaced by -1.
df = spark.sparkContext.parallelize([
("a", None), ("a", 1), ("a", -1), ("b", 3), ("b", 1)]).toDF(["k", "v"])
#create two windows for comparison.
w = Window().partitionBy("k").orderBy('v')
w2 = Window().partitionBy("k").orderBy(df.v.desc())
df.select('k','v',
F.first("v",True).over(w).alias('first_w1'),
F.last("v",True).over(w).alias('last_w1'),
F.first("v",True).over(w2).alias('first_w2'),
F.last("v",True).over(w2).alias('last_w2')
).show()
Output:
+---+----+--------+-------+--------+-------+
| k| v|first_w1|last_w1|first_w2|last_w2|
+---+----+--------+-------+--------+-------+
| b| 1| 1| 1| 3| 1|
| b| 3| 1| 3| 3| 3|
| a|null| null| null| 1| -1|
| a| -1| -1| -1| 1| -1|
| a| 1| -1| 1| 1| 1|
+---+----+--------+-------+--------+-------+
Have a look at Question 47130030.
The issue is not with the last() function but with the default window frame, which, when an orderBy is present, includes only rows from the start of the partition up to the current one.
Using
w = Window().partitionBy("k").orderBy('k','v').rowsBetween(W.unboundedPreceding,W.unboundedFollowing)
will yield correct results for first() and last().
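For example, a quick sketch with the unbounded frame, assuming the same df and imports as in the question (w_full is just an illustrative name):
w_full = (Window().partitionBy("k").orderBy('k', 'v')
          .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))
df.select(F.col("k"), F.last("v", True).over(w_full).alias('v')).show()
# Every row now sees its whole partition, so last("v", True) returns
# 1 for every row of k = 'a' and 3 for every row of k = 'b'.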
Related
I want to write data into Delta tables incrementally while replacing (overwriting) partitions already present in the sink. Example:
Consider this data inside my Delta table, already partitioned by the id column:
+---+---+
| id| x|
+---+---+
| 1| A|
| 2| B|
| 3| C|
+---+---+
Now, I would like to insert the following dataframe:
+---+---------+
| id| x|
+---+---------+
| 2| NEW|
| 2| NEW|
| 4| D|
| 5| E|
+---+---------+
The desired output is this
+---+---------+
| id| x|
+---+---------+
| 1| A|
| 2| NEW|
| 2| NEW|
| 3| C|
| 4| D|
| 5| E|
+---+---------+
What I did is the following:
df = spark.read.format("csv").option("sep", ";").option("header", "true").load("/mnt/blob/datafinance/bronze/simba/test/in/input.csv")
Ids = [x.id for x in df.select("id").distinct().collect()]
for Id in Ids:
    df.filter(df.id == Id).write.format("delta") \
        .option("mergeSchema", "true") \
        .partitionBy("id") \
        .option("replaceWhere", "id == '$i'".format(i=Id)) \
        .mode("append") \
        .save("/mnt/blob/datafinance/bronze/simba/test/res/")
spark.read.format("delta").option("sep", ";").option("header", "true").load("/mnt/blob/datafinance/bronze/simba/test/res/").show()
And this is the result:
+---+---------+
| id| x|
+---+---------+
| 2| B|
| 1| A|
| 5| E|
| 2| NEW|
| 2| NEW|
| 3| C|
| 4| D|
+---+---------+
As you can see, it appended all the values without replacing the partition id=2, which was already present in the table.
I think it is because of mode("append").
But changing it to mode("overwrite") throws the following error:
Data written out does not match replaceWhere 'id == '$i''.
Can anyone tell me how to achieve what I want, please?
Thank you.
I actually had an error in the code. I replaced
.option("replaceWhere", "id == '$i'".format(i=idd))
with
.option("replaceWhere", "id == '{i}'".format(i=idd))
and it worked.
Thanks to ggordon, who pointed out the error to me on another question.
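Putting it together, a sketch of the corrected loop, assuming df and Ids as defined in the question (note that replaceWhere only takes effect with mode("overwrite")):
for Id in Ids:
    df.filter(df.id == Id).write.format("delta") \
        .option("mergeSchema", "true") \
        .partitionBy("id") \
        .option("replaceWhere", "id == '{i}'".format(i=Id)) \
        .mode("overwrite") \
        .save("/mnt/blob/datafinance/bronze/simba/test/res/")
# The fixed placeholder '{i}' now matches the current Id, so only the
# partition being written is replaced; other partitions are left intact.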
I thought rangeBetween(start, end) looks at values in the range [cur_value + start, cur_value + end]. https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/expressions/WindowSpec.html
But I saw an example where a descending orderBy() was used on a timestamp, and then rangeBetween(unboundedPreceding, 0) was applied. That led me to explore the following example:
from pyspark.sql import Window
from pyspark.sql.functions import desc, sum as Fsum

dd = spark.createDataFrame(
    [(1, "a"), (3, "a"), (3, "a"), (1, "b"), (2, "b"), (3, "b")],
    ['id', 'category']
)
dd.show()
# output
+---+--------+
| id|category|
+---+--------+
| 1| a|
| 3| a|
| 3| a|
| 1| b|
| 2| b|
| 3| b|
+---+--------+
With start = -1, it seems to include preceding rows whose value is higher by at most 1:
byCategoryOrderedById = Window.partitionBy('category')\
.orderBy(desc('id'))\
.rangeBetween(-1, Window.currentRow)
dd.withColumn("sum", Fsum('id').over(byCategoryOrderedById)).show()
# output
+---+--------+---+
| id|category|sum|
+---+--------+---+
| 3| b| 3|
| 2| b| 5|
| 1| b| 3|
| 3| a| 6|
| 3| a| 6|
| 1| a| 1|
+---+--------+---+
And with start set to -2, it includes preceding rows whose value is greater by up to 2:
byCategoryOrderedById = Window.partitionBy('category')\
.orderBy(desc('id'))\
.rangeBetween(-2,Window.currentRow)
dd.withColumn("sum", Fsum('id').over(byCategoryOrderedById)).show()
# output
+---+--------+---+
| id|category|sum|
+---+--------+---+
| 3| b| 3|
| 2| b| 5|
| 1| b| 6|
| 3| a| 6|
| 3| a| 6|
| 1| a| 7|
+---+--------+---+
So, what is the exact behavior of rangeBetween with desc orderBy?
It's not well documented, but when using range (i.e. value-based) frames, the ascending or descending order affects which values are included in the frame.
Let's take the example you provided:
RANGE BETWEEN 1 PRECEDING AND CURRENT ROW
Depending on the order by direction, 1 PRECEDING means:
current_row_value - 1 if ASC
current_row_value + 1 if DESC
Consider the row with value 1 in partition b.
With the descending order, the frame includes:
(current_value and all preceding values x where x <= current_value + 1) = (1, 2)
With the ascending order, the frame includes:
(current_value and all preceding values x where x >= current_value - 1) = (1)
PS: using rangeBetween(-1, Window.currentRow) with desc ordering is just equivalent to rangeBetween(Window.currentRow, 1) with asc ordering.
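A quick sketch to check that equivalence, assuming the same dd, desc and Fsum as above (the column names desc_sum and asc_sum are just illustrative):
byDescMinus1 = Window.partitionBy('category').orderBy(desc('id')) \
    .rangeBetween(-1, Window.currentRow)
byAscPlus1 = Window.partitionBy('category').orderBy('id') \
    .rangeBetween(Window.currentRow, 1)
dd.withColumn("desc_sum", Fsum('id').over(byDescMinus1)) \
  .withColumn("asc_sum", Fsum('id').over(byAscPlus1)) \
  .show()
# The two sum columns match row for row: each frame covers the current
# value plus values up to 1 greater within the same partition.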
Here I have student marks as shown below, and I want to pivot the subject name column and also get the total marks after the pivot.
Source table like:
+---------+-----------+-----+
|StudentId|SubjectName|Marks|
+---------+-----------+-----+
| 1| A| 10|
| 1| B| 20|
| 1| C| 30|
| 2| A| 20|
| 2| B| 25|
| 2| C| 30|
| 3| A| 10|
| 3| B| 20|
| 3| C| 20|
+---------+-----------+-----+
Destination:
+---------+---+---+---+-----+
|StudentId| A| B| C|Total|
+---------+---+---+---+-----+
| 1| 10| 20| 30| 60|
| 3| 10| 20| 20| 50|
| 2| 20| 25| 30| 75|
+---------+---+---+---+-----+
Please find the source code below:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("test").master("local[*]").getOrCreate()
import spark.implicits._
val list = List((1, "A", 10), (1, "B", 20), (1, "C", 30), (2, "A", 20), (2, "B", 25), (2, "C", 30), (3, "A", 10),
(3, "B", 20), (3, "C", 20))
val df = list.toDF("StudentId", "SubjectName", "Marks")
df.show() // source table as per above
val df1 = df.groupBy("StudentId").pivot("SubjectName", Seq("A", "B", "C")).agg(sum("Marks"))
df1.show()
val df2 = df1.withColumn("Total", col("A") + col("B") + col("C"))
df2.show() // required destination
val df3 = df.groupBy("StudentId").agg(sum("Marks").as("Total"))
df3.show()
df1 is not displaying the sum/total column; it displays like below:
+---------+---+---+---+
|StudentId| A| B| C|
+---------+---+---+---+
| 1| 10| 20| 30|
| 3| 10| 20| 20|
| 2| 20| 25| 30|
+---------+---+---+---+
df3 is able to create a new Total column, but why is df1 not able to create one?
Please, can anybody tell me what I am missing, or whether something is wrong with my understanding of the pivot concept?
This is expected behaviour of the Spark pivot function, as the .agg function is applied within each of the pivoted columns; that is why you are not able to see the sum of marks as a new column.
Refer to the official Spark documentation about pivot.
Example:
scala> df.groupBy("StudentId").pivot("SubjectName").agg(sum("Marks") + 2).show()
+---------+---+---+---+
|StudentId| A| B| C|
+---------+---+---+---+
| 1| 12| 22| 32|
| 3| 12| 22| 22|
| 2| 22| 27| 32|
+---------+---+---+---+
In the above example we have added 2 to all the pivoted columns.
Example2:
To get counts using pivot and agg:
scala> df.groupBy("StudentId").pivot("SubjectName").agg(count("*")).show()
+---------+---+---+---+
|StudentId| A| B| C|
+---------+---+---+---+
| 1| 1| 1| 1|
| 3| 1| 1| 1|
| 2| 1| 1| 1|
+---------+---+---+---+
The .agg that follows pivot applies only to the pivoted data. To find the total, you should add a new column and sum the pivoted columns, as below.
val cols = Seq("A", "B", "C")
val result = df.groupBy("StudentId")
.pivot("SubjectName")
.agg(sum("Marks"))
.withColumn("Total", cols.map(col _).reduce(_ + _))
result.show(false)
Output:
+---------+---+---+---+-----+
|StudentId|A |B |C |Total|
+---------+---+---+---+-----+
|1 |10 |20 |30 |60 |
|3 |10 |20 |20 |50 |
|2 |20 |25 |30 |75 |
+---------+---+---+---+-----+
I have a Spark dataframe; for the sake of argument, let's take it to be:
val df = sc.parallelize(
Seq(("a",1,2),("a",1,4),("b",5,6),("b",10,2),("c",1,1))
).toDF("id","x","y")
+---+---+---+
| id| x| y|
+---+---+---+
| a| 1| 2|
| a| 1| 4|
| b| 5| 6|
| b| 10| 2|
| c| 1| 1|
+---+---+---+
I would like to compute all pairwise differences between entries in the dataframe with the same id and output the result to another dataframe. For a small dataframe I can accomplish this by:
df.crossJoin(
df.select(
(df.columns.map(x=>col(x).as("_"+x))):_*)
).where(
col("id")===col("_id")
).select(
col("id"),
(col("x")-col("_x")).as("dx"),
(col("y")-col("_y")).as("dy")
)
+---+---+---+
| id| dx| dy|
+---+---+---+
| c| 0| 0|
| b| 0| 0|
| b| -5| 4|
| b| 5| -4|
| b| 0| 0|
| a| 0| 0|
| a| 0| -2|
| a| 0| 2|
| a| 0| 0|
+---+---+---+
However, for large dataframes this isn't a reasonable approach as the crossJoin will mostly produce data that will be discarded by the subsequent where clause.
I'm still pretty new to Spark, and groupBy seemed like a natural place to start looking, but I can't figure out how to accomplish this using groupBy. Any help would be welcome.
I would eventually like to remove redundancy, for instance in:
val df1 = df.withColumn("idx",monotonicallyIncreasingId)
df.crossJoin(
df.select(
(df.columns.map(x=>col(x).as("_"+x))):_*)
).where(
col("id")===col("_id") && col("idx") < col("_idx")
).select(
col("id"),
(col("x")-col("_x")).as("dx"),
(col("y")-col("_y")).as("dy")
)
+---+---+---+
| id| dx| dy|
+---+---+---+
| b| -5| 4|
| a| 0| -2|
+---+---+---+
But if it's easier to accomplish this with redundancy, then I can live with that.
This is not an uncommon transformation to perform in ML, so I thought something out of MLlib might be appropriate, but I haven't found anything there either.
This can be achieved via an inner join; the result is the same as expected:
df.alias("left").join(df.alias("right"),"id")
.select($"id",
($"left.x"-$"right.x").alias("dx"),
($"left.y"-$"right.y").alias("dy"))
Given a DataFrame
+---+---+----+
| id| v|date|
+---+---+----+
| 1| a| 1|
| 2| a| 2|
| 3| b| 3|
| 4| b| 4|
+---+---+----+
And we want to add a column with the mean value of date by v
+---+---+----+---------+
| v| id|date|avg(date)|
+---+---+----+---------+
| a| 1| 1| 1.5|
| a| 2| 2| 1.5|
| b| 3| 3| 3.5|
| b| 4| 4| 3.5|
+---+---+----+---------+
Is there a better way (e.g. in terms of performance) than the following?
val df = sc.parallelize(List((1,"a",1), (2, "a", 2), (3, "b", 3), (4, "b", 4))).toDF("id", "v", "date")
val aggregated = df.groupBy("v").agg(avg("date"))
df.join(aggregated, usingColumn = "v")
More precisely, I think this join will trigger a shuffle.
[update] Adding some details, because I don't think this is a duplicate: the join has a key in this case.
I see different options to avoid the shuffle:
automatic: Spark has an automaticBroadcastJoin, but it requires that the Hive metadata has been computed. Right?
by using a known partitioner? If yes, how can I do that with a DataFrame?
by forcing a broadcast: leftDF.join(broadcast(rightDF), usingColumn = "v")?