Difference in dense rank and row number in spark - apache-spark

I tried to understand the difference between dense rank and row number.Each new window partition both is starting from 1. Does rank of a row is not always start from 1 ? Any help would be appreciated

The difference is when there are "ties" in the ordering column. Check the example below:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val df = Seq(("a", 10), ("a", 10), ("a", 20)).toDF("col1", "col2")
val windowSpec = Window.partitionBy("col1").orderBy("col2")
df
.withColumn("rank", rank().over(windowSpec))
.withColumn("dense_rank", dense_rank().over(windowSpec))
.withColumn("row_number", row_number().over(windowSpec)).show
+----+----+----+----------+----------+
|col1|col2|rank|dense_rank|row_number|
+----+----+----+----------+----------+
| a| 10| 1| 1| 1|
| a| 10| 1| 1| 2|
| a| 20| 3| 2| 3|
+----+----+----+----------+----------+
Note that the value "10" exists twice in col2 within the same window (col1 = "a"). That's when you see a difference between the three functions.

I'm showing #Daniel's answer in Python and I'm adding a comparison with count('*') that can be used if you want to get top-n at most rows per group.
from pyspark.sql.session import SparkSession
from pyspark.sql import Window
from pyspark.sql import functions as F
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([
['a', 10], ['a', 20], ['a', 30],
['a', 40], ['a', 40], ['a', 40], ['a', 40],
['a', 50], ['a', 50], ['a', 60]], ['part_col', 'order_col'])
window = Window.partitionBy("part_col").orderBy("order_col")
df = (df
.withColumn("rank", F.rank().over(window))
.withColumn("dense_rank", F.dense_rank().over(window))
.withColumn("row_number", F.row_number().over(window))
.withColumn("count", F.count('*').over(window))
)
df.show()
+--------+---------+----+----------+----------+-----+
|part_col|order_col|rank|dense_rank|row_number|count|
+--------+---------+----+----------+----------+-----+
| a| 10| 1| 1| 1| 1|
| a| 20| 2| 2| 2| 2|
| a| 30| 3| 3| 3| 3|
| a| 40| 4| 4| 4| 7|
| a| 40| 4| 4| 5| 7|
| a| 40| 4| 4| 6| 7|
| a| 40| 4| 4| 7| 7|
| a| 50| 8| 5| 8| 9|
| a| 50| 8| 5| 9| 9|
| a| 60| 10| 6| 10| 10|
+--------+---------+----+----------+----------+-----+
For example if you want to take at most 4 without randomly picking one of the 4 "40" of the sorting column:
df.where("count <= 4").show()
+--------+---------+----+----------+----------+-----+
|part_col|order_col|rank|dense_rank|row_number|count|
+--------+---------+----+----------+----------+-----+
| a| 10| 1| 1| 1| 1|
| a| 20| 2| 2| 2| 2|
| a| 30| 3| 3| 3| 3|
+--------+---------+----+----------+----------+-----+
In summary, if you filter <= n those columns you will get:
rank at least n rows
dense_rank at least n different order_col values
row_number exactly n rows
count at most n rows

Related

Iterating through rows to create custom formula structure in PySpark

I have a dataframe with variable names and numerator and denominator.
Each variable is a ratio, eg below:
And another dataset with actual data to compute the attributes:
Goal is to create these attributes with formulas in 1st and compute with 2nd.
Currently my approach is very naive:
df = df.withColumn("var1", col('a')/col('b'))./
.
.
.
Desired Output:
Since I have >500 variables, any suggestions for a smarter way to get around this are welcome!
This can be achieved by cross join , unpivot and pivot function in PySpark.
import pyspark.sql.functions as f
from pyspark.sql.functions import *
from pyspark.sql.types import *
data = [
("var1", "a","c"),
("var2", "b","d"),
("var3", "b","a"),
("var4", "d","c")
]
schema = StructType([
StructField('name', StringType(),True), \
StructField('numerator', StringType(),True), \
StructField('denonminator', StringType(),True)
])
data2 = [
("ID1", 6,4,3,7),
("ID2", 1,2,3,9)
]
schema2 = StructType([
StructField('ID', StringType(),True), \
StructField('a', IntegerType(),True), \
StructField('b', IntegerType(),True),\
StructField('c', IntegerType(),True), \
StructField('d', IntegerType(),True)
])
df = spark.createDataFrame(data=data, schema=schema)
df2 = spark.createDataFrame(data=data2, schema=schema2)
df.createOrReplaceTempView("table1")
df2.createOrReplaceTempView("table2")
df.createOrReplaceTempView("table1")
df2.createOrReplaceTempView("table2")
""" CRoss Join for Duplicating the values """
df3=spark.sql("select * from table1 cross join table2")
df3.createOrReplaceTempView("table3")
""" Unpivoting the values and joining to fecth the value of numerator and denominator"""
cols = df2.columns[1:]
df4=df2.selectExpr('ID', "stack({}, {})".format(len(cols), ', '.join(("'{}', {}".format(i, i) for i in cols))))
df4.createOrReplaceTempView("table4")
df5=spark.sql("select name,B.ID,round(B.col1/C.col1,2) as value from table3 A left outer join table4 B on A.ID=B.ID and a.numerator=b.col0 left outer join table4 C on A.ID=C.ID and a.denonminator=C.col0 order by name,ID")
""" Pivot for fetching the results """
df_final=df5.groupBy("ID").pivot("name").max("value")
The results of all intermediate and final dataframes
>>> df.show()
+----+---------+------------+
|name|numerator|denonminator|
+----+---------+------------+
|var1| a| c|
|var2| b| d|
|var3| b| a|
|var4| d| c|
+----+---------+------------+
>>> df2.show()
+---+---+---+---+---+
| ID| a| b| c| d|
+---+---+---+---+---+
|ID1| 6| 4| 3| 7|
|ID2| 1| 2| 3| 9|
+---+---+---+---+---+
>>> df3.show()
+----+---------+------------+---+---+---+---+---+
|name|numerator|denonminator| ID| a| b| c| d|
+----+---------+------------+---+---+---+---+---+
|var1| a| c|ID1| 6| 4| 3| 7|
|var2| b| d|ID1| 6| 4| 3| 7|
|var1| a| c|ID2| 1| 2| 3| 9|
|var2| b| d|ID2| 1| 2| 3| 9|
|var3| b| a|ID1| 6| 4| 3| 7|
|var4| d| c|ID1| 6| 4| 3| 7|
|var3| b| a|ID2| 1| 2| 3| 9|
|var4| d| c|ID2| 1| 2| 3| 9|
+----+---------+------------+---+---+---+---+---+
>>> df4.show()
+---+----+----+
| ID|col0|col1|
+---+----+----+
|ID1| a| 6|
|ID1| b| 4|
|ID1| c| 3|
|ID1| d| 7|
|ID2| a| 1|
|ID2| b| 2|
|ID2| c| 3|
|ID2| d| 9|
+---+----+----+
>>> df5.show()
+----+---+-----+
|name| ID|value|
+----+---+-----+
|var1|ID1| 2.0|
|var1|ID2| 0.33|
|var2|ID1| 0.57|
|var2|ID2| 0.22|
|var3|ID1| 0.67|
|var3|ID2| 2.0|
|var4|ID1| 2.33|
|var4|ID2| 3.0|
+----+---+-----+
>>> df_final.show() final
+---+----+----+----+----+
| ID|var1|var2|var3|var4|
+---+----+----+----+----+
|ID2|0.33|0.22| 2.0| 3.0|
|ID1| 2.0|0.57|0.67|2.33|
+---+----+----+----+----+

Check if a column is consecutive with groupby in pyspark

I have a pyspark dataframe that looks like this:
import pandas as pd
foo = pd.DataFrame({'group': ['a','a','a','b','b','c','c','c'], 'value': [1,2,3,4,5,2,4,5]})
I would like to create a new binary column is_consecutive that indicates if the values in the value column are consecutive by group.
The output should look like this:
foo = pd.DataFrame({'group': ['a','a','a','b','b','c','c','c'], 'value': [1,2,3,4,5,2,4,5],
'is_consecutive': [1,1,1,1,1,0,0,0]})
How could I do that in pyspark?
You can use lag to compare values with the previous row and check if they are consecutive, then use min to determine whether all rows are consecutive in a given group.
from pyspark.sql import functions as F, Window
df2 = df.withColumn(
'consecutive',
F.coalesce(
F.col('value') - F.lag('value').over(Window.partitionBy('group').orderBy('value')) == 1,
F.lit(True)
).cast('int')
).withColumn(
'all_consecutive',
F.min('consecutive').over(Window.partitionBy('group'))
)
df2.show()
+-----+-----+-----------+---------------+
|group|value|consecutive|all_consecutive|
+-----+-----+-----------+---------------+
| c| 2| 1| 0|
| c| 4| 0| 0|
| c| 5| 1| 0|
| b| 4| 1| 1|
| b| 5| 1| 1|
| a| 1| 1| 1|
| a| 2| 1| 1|
| a| 3| 1| 1|
+-----+-----+-----------+---------------+
You can use lead and subtract the same with the existing value then find max of the window, once done , put a condition saying return 0 is max is >1 else return 1
w = Window.partitionBy("group").orderBy(F.monotonically_increasing_id())
(foo.withColumn("Diff",F.lead("value").over(w)-F.col("value"))
.withColumn("is_consecutive",F.when(F.max("Diff").over(w)>1,0).otherwise(1))
.drop("Diff")).show()
+-----+-----+--------------+
|group|value|is_consecutive|
+-----+-----+--------------+
| a| 1| 1|
| a| 2| 1|
| a| 3| 1|
| b| 4| 1|
| b| 5| 1|
| c| 2| 0|
| c| 4| 0|
| c| 5| 0|
+-----+-----+--------------+

Drop function doesn't work properly after joining same columns of Dataframe

I am facing this same issue while joining two Data frame A, B.
For ex:
c = df_a.join(df_b, [df_a.col1 == df_b.col1], how="left").drop(df_b.col1)
And when I try to drop the duplicate column like as above this query doesn't drop the col1 of df_b. Instead when I try to drop col1 of df_a, then it able to drop the col1 of df_a.
Could anyone please say about this.
Note: I tried the same in my project which has more than 200 columns and shows the same problem. Sometimes this drop function works properly if we have few columns but not if we have more columns.
Drop function not working after left outer join in pyspark
function to drop duplicates column after merge.
def dropDupeDfCols(df):
newcols = []
dupcols = []
for i in range(len(df.columns)):
if df.columns[i] not in newcols:
newcols.append(df.columns[i])
else:
dupcols.append(i)
df = df.toDF(*[str(i) for i in range(len(df.columns))])
for dupcol in dupcols:
df = df.drop(str(dupcol))
return df.toDF(*newcols)
There are some similar issues I faced recently. Let me show them below with your case.
I am creating two dataframes with the same data
scala> val df_a = Seq((1, 2, "as"), (2,3,"ds"), (3,4,"ew"), (4, 1, "re"), (3,1,"ht")).toDF("a", "b", "c")
df_a: org.apache.spark.sql.DataFrame = [a: int, b: int ... 1 more field]
scala> val df_b = Seq((1, 2, "as"), (2,3,"ds"), (3,4,"ew"), (4, 1, "re"), (3,1,"ht")).toDF("a", "b", "c")
df_b: org.apache.spark.sql.DataFrame = [a: int, b: int ... 1 more field]
Joining them
scala> val df = df_a.join(df_b, df_a("b") === df_b("a"), "leftouter")
df: org.apache.spark.sql.DataFrame = [a: int, b: int ... 4 more fields]
scala> df.show
+---+---+---+---+---+---+
| a| b| c| a| b| c|
+---+---+---+---+---+---+
| 1| 2| as| 2| 3| ds|
| 2| 3| ds| 3| 1| ht|
| 2| 3| ds| 3| 4| ew|
| 3| 4| ew| 4| 1| re|
| 4| 1| re| 1| 2| as|
| 3| 1| ht| 1| 2| as|
+---+---+---+---+---+---+
Let's drop a column that is not present in the above dataframe
+---+---+---+---+---+---+
| a| b| c| a| b| c|
+---+---+---+---+---+---+
| 1| 2| as| 2| 3| ds|
| 2| 3| ds| 3| 1| ht|
| 2| 3| ds| 3| 4| ew|
| 3| 4| ew| 4| 1| re|
| 4| 1| re| 1| 2| as|
| 3| 1| ht| 1| 2| as|
+---+---+---+---+---+---+
Ideally we will expect spark to throw an error, but it executes successfully.
Now, if you drop a column from the above dataframe
scala> df.drop("a").show
+---+---+---+---+
| b| c| b| c|
+---+---+---+---+
| 2| as| 3| ds|
| 3| ds| 1| ht|
| 3| ds| 4| ew|
| 4| ew| 1| re|
| 1| re| 2| as|
| 1| ht| 2| as|
+---+---+---+---+
It drops all the columns with provided column name in the input dataframe.
If you want to drop specific columns, it should be done as below:
scala> df.drop(df_a("a")).show()
+---+---+---+---+---+
| b| c| a| b| c|
+---+---+---+---+---+
| 2| as| 2| 3| ds|
| 3| ds| 3| 1| ht|
| 3| ds| 3| 4| ew|
| 4| ew| 4| 1| re|
| 1| re| 1| 2| as|
| 1| ht| 1| 2| as|
+---+---+---+---+---+
I don't think spark accepts the input as give by you(see below):
scala> df.drop(df_a.a).show()
<console>:30: error: value a is not a member of org.apache.spark.sql.DataFrame
df.drop(df_a.a).show()
^
scala> df.drop(df_a."a").show()
<console>:1: error: identifier expected but string literal found.
df.drop(df_a."a").show()
^
If you provide the input to drop, as below, it executes but will have no impact
scala> df.drop("df_a.a").show
+---+---+---+---+---+---+
| a| b| c| a| b| c|
+---+---+---+---+---+---+
| 1| 2| as| 2| 3| ds|
| 2| 3| ds| 3| 1| ht|
| 2| 3| ds| 3| 4| ew|
| 3| 4| ew| 4| 1| re|
| 4| 1| re| 1| 2| as|
| 3| 1| ht| 1| 2| as|
+---+---+---+---+---+---+
The reason being, spark interprets "df_a.a" as a nested column. As that column is not present ideally it should have thrown error, but as explained above, it just executes.
Hope this helps..!!!

Pyspark apply function to column value if condition is met [duplicate]

This question already has answers here:
Spark Equivalent of IF Then ELSE
(4 answers)
Closed 3 years ago.
Given a pyspark dataframe, for example:
ls = [
['1', 2],
['2', 7],
['1', 3],
['2',-6],
['1', 3],
['1', 5],
['1', 4],
['2', 7]
]
df = spark.createDataFrame(pd.DataFrame(ls, columns=['col1', 'col2']))
df.show()
+----+-----+
|col1| col2|
+----+-----+
| 1| 2|
| 2| 7|
| 1| 3|
| 2| -6|
| 1| 3|
| 1| 5|
| 1| 4|
| 2| 7|
+----+-----+
How can I apply a function to col2 values where col1 == '1' and store result in a new column?
For example the function is:
f = x**2
Result should look like this:
+----+-----+-----+
|col1| col2| y|
+----+-----+-----+
| 1| 2| 4|
| 2| 7| null|
| 1| 3| 9|
| 2| -6| null|
| 1| 3| 9|
| 1| 5| 25|
| 1| 4| 16|
| 2| 7| null|
+----+-----+-----+
I tried defining a separate function, and use df.withColumn(y).when(condition,function) but it wouldn't work.
So what is a way to do this?
I hope this helps:
def myFun(x):
return (x**2).cast(IntegerType())
df2 = df.withColumn("y", when(df.col1 == 1, myFun(df.col2)).otherwise(None))
df2.show()
+----+----+----+
|col1|col2| y|
+----+----+----+
| 1| 2| 4|
| 2| 7|null|
| 1| 3| 9|
| 2| -6|null|
| 1| 3| 9|
| 1| 5| 25|
| 1| 4| 16|
| 2| 7|null|
+----+----+----+

[Py]Spark SQL: Constrain each frame of a Window using the frame's input row

I would like to constrain what rows in a Window frame are used by the aggregate function based on the current input row. For example, given a DataFrame df and a Window w, I want to be able to do something like:
df2 = df.withColumn("foo", first(col("bar").filter(...)).over(w))
where .filter would remove rows from the current Window frame based on the frame's input row.
My specific use case is as follows: Given a DataFrame df
+-----+--+--+
|group|n1|n2|
+-----+--+--+
| 1| 1| 6|
| 1| 0| 3|
| 1| 2| 2|
| 1| 3| 5|
| 2| 0| 5|
| 2| 0| 7|
| 2| 3| 2|
| 2| 5| 9|
+-----+--+--+
window
w = Window.partitionBy("group")\
.orderBy("n1", "n2")\
.rowsBetween(Window.currentRow + 1, Window.unboundedFollowing)
and some positive Long i, how would you find the first row (fr) in each input row r's frame such that r.n1 < fr.n1, r.n2 < fr.n2, and max(fr.n1 - r.n1, fr.n2 - r.n2) < i? The value returned can be either fr.n1 or fr's row index in df. So, for i = 6, the output for the example df would be
+-----+--+--+-----+
|group|n1|n2|fr.n1|
+-----+--+--+-----+
| 1| 1| 6| null|
| 1| 0| 3| 1|
| 1| 2| 2| 3|
| 1| 3| 5| null|
| 2| 0| 5| 5|
| 2| 0| 7| 5|
| 2| 3| 2| null|
| 2| 5| 9| null|
+-----+--+--+-----+
I've been studying the Spark API and looking at examples of Window, first, and when, but I can't seem to piece it together. Is this even possible with Window and aggregate functions or am I completely off the mark?
You won't be able to do it with just window functions and aggregations, you'll need a self join:
For the join:
df = sc.parallelize([[1, 1, 6],[1, 0, 3],[1, 2, 2],[1, 3, 5],[2, 0, 5],[2, 0, 7],[2, 3, 2],[2, 5, 9]]).toDF(["group","n1","n2"])
import pyspark.sql.functions as psf
df_r = df.select([df[c].alias("r_" + c) for c in df.columns])
df_join = df_r\
.join(df, (df_r.r_group == df.group)
& (df_r.r_n1 < df.n1)
& (df_r.r_n2 < df.n2)
& (psf.greatest(df.n1 - df_r.r_n1, df.n2 - df_r.r_n2) < i), "leftouter")\
.drop("group")
Now we can apply the window function to only keep the first row:
w = Window.partitionBy("r_group", "r_n1", "r_n2").orderBy("n1", "n2")
res = df_join\
.withColumn("rn", psf.row_number().over(w))\
.filter("rn = 1").drop("rn")
+-------+----+----+----+----+
|r_group|r_n1|r_n2| n1| n2|
+-------+----+----+----+----+
| 1| 0| 3| 1| 6|
| 1| 1| 6|null|null|
| 1| 2| 2| 3| 5|
| 1| 3| 5|null|null|
| 2| 0| 5| 5| 9|
| 2| 0| 7| 5| 9|
| 2| 3| 2|null|null|
| 2| 5| 9|null|null|
+-------+----+----+----+----+

Resources