I have a dataframe like
+---+---+---+---+
| q| w| e| r|
+---+---+---+---+
| a| 1| 20| y|
| a| 2| 22| z|
| b| 3| 10| y|
| b| 4| 12| y|
+---+---+---+---+
I want to mark the row with the minimum e among the rows where r = z. If there are no rows with r = z, I want to mark the row with the minimum e overall, even if r = y.
Essentially, something like
+---+---+---+---+---+
| q| w| e| r| t|
+---+---+---+---+---+
| a| 1| 20| y| 0|
| a| 2| 22| z| 1|
| b| 3| 10| y| 1|
| b| 4| 12| y| 0|
+---+---+---+---+---+
I can do it using a number of joins, but that would be too expensive.
So I was looking for a window-based solution.
You can calculate two minimums per group: once over only the rows with r = z, and once over all rows in the group. The first non-null of the two can then be compared to e:
from pyspark.sql import functions as F
from pyspark.sql import Window
df = ...
w = Window.partitionBy("q")
# When ordering is not defined, an unbounded window frame is used by default.
df.withColumn("min_e_with_r_eq_z", F.expr("min(case when r='z' then e else null end)").over(w)) \
    .withColumn("min_e_overall", F.min("e").over(w)) \
    .withColumn("t", F.coalesce("min_e_with_r_eq_z", "min_e_overall") == F.col("e")) \
    .orderBy("w") \
    .show()
Output:
+---+---+---+---+-----------------+-------------+-----+
| q| w| e| r|min_e_with_r_eq_z|min_e_overall| t|
+---+---+---+---+-----------------+-------------+-----+
| a| 1| 20| y| 22| 20|false|
| a| 2| 22| z| 22| 20| true|
| b| 3| 10| y| null| 10| true|
| b| 4| 12| y| null| 10|false|
+---+---+---+---+-----------------+-------------+-----+
Note: I assume that q is the grouping column for the window.
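If t needs to be 0/1 as in the desired output rather than true/false, the comparison can be cast to an integer and the helper columns dropped. A minimal variant of the same approach (result is just an illustrative name):
from pyspark.sql import functions as F
from pyspark.sql import Window

w = Window.partitionBy("q")

result = (
    df.withColumn("min_e_with_r_eq_z", F.min(F.when(F.col("r") == "z", F.col("e"))).over(w))
      .withColumn("min_e_overall", F.min("e").over(w))
      # cast the boolean comparison to 0/1 to match the desired output
      .withColumn("t", (F.coalesce("min_e_with_r_eq_z", "min_e_overall") == F.col("e")).cast("int"))
      .drop("min_e_with_r_eq_z", "min_e_overall")
)
result.orderBy("q", "w").show()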
You can assign row numbers based on whether r = z and the value of column e:
from pyspark.sql import functions as F, Window
df2 = df.withColumn(
    't',
    F.when(
        F.row_number().over(
            Window.partitionBy('q')
                  .orderBy((F.col('r') == 'z').desc(), 'e')
        ) == 1,
        1
    ).otherwise(0)
)
df2.show()
+---+---+---+---+---+
| q| w| e| r| t|
+---+---+---+---+---+
| a| 2| 22| z| 1|
| a| 1| 20| y| 0|
| b| 3| 10| y| 1|
| b| 4| 12| y| 0|
+---+---+---+---+---+
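Note that row_number assigns t = 1 to exactly one row per group even if several rows tie on the ordering key; if all tied rows should be marked, rank could be used instead. A sketch of that variant (not part of the original answer):
from pyspark.sql import functions as F, Window

w = Window.partitionBy('q').orderBy((F.col('r') == 'z').desc(), 'e')

# rank() gives tied rows the same rank, so every row tied for the minimum gets t = 1
df3 = df.withColumn('t', F.when(F.rank().over(w) == 1, 1).otherwise(0))
df3.show()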
Adding the Spark Scala version of @werner's accepted answer:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

val w = Window.partitionBy("q")
df.withColumn("min_e_with_r_eq_z", min(when($"r" === "z", $"e").otherwise(null)).over(w))
  .withColumn("min_e_overall", min("e").over(w))
  .withColumn("t", coalesce($"min_e_with_r_eq_z", $"min_e_overall") === $"e")
  .orderBy("w")
  .show()
I have a pyspark dataframe that looks like this:
import pandas as pd
foo = pd.DataFrame({'group': ['a','a','a','b','b','c','c','c'], 'value': [1,2,3,4,5,2,4,5]})
I would like to create a new binary column is_consecutive that indicates if the values in the value column are consecutive by group.
The output should look like this:
foo = pd.DataFrame({'group': ['a','a','a','b','b','c','c','c'], 'value': [1,2,3,4,5,2,4,5],
'is_consecutive': [1,1,1,1,1,0,0,0]})
How could I do that in pyspark?
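The answers below operate on a Spark DataFrame. A minimal conversion sketch, assuming an active SparkSession named spark:
# convert the pandas frame to a Spark DataFrame; the answers below refer to it as df (or foo)
df = spark.createDataFrame(foo)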
You can use lag to compare each value with the previous row's value and check that they are consecutive, coalescing the first row of each group (where lag is null) to True, then use min to determine whether all rows in a given group are consecutive:
from pyspark.sql import functions as F, Window
df2 = df.withColumn(
    'consecutive',
    F.coalesce(
        (F.col('value') - F.lag('value').over(Window.partitionBy('group').orderBy('value'))) == 1,
        F.lit(True)
    ).cast('int')
).withColumn(
    'all_consecutive',
    F.min('consecutive').over(Window.partitionBy('group'))
)
df2.show()
+-----+-----+-----------+---------------+
|group|value|consecutive|all_consecutive|
+-----+-----+-----------+---------------+
| c| 2| 1| 0|
| c| 4| 0| 0|
| c| 5| 1| 0|
| b| 4| 1| 1|
| b| 5| 1| 1|
| a| 1| 1| 1|
| a| 2| 1| 1|
| a| 3| 1| 1|
+-----+-----+-----------+---------------+
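To match the desired output, the group-level flag can then be selected under the name is_consecutive, e.g.:
result = df2.select('group', 'value', F.col('all_consecutive').alias('is_consecutive'))
result.show()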
You can use lead to subtract each value from the next one within the group, then take the max of that difference over the whole group; if the max is greater than 1, return 0, otherwise 1:
w = Window.partitionBy("group").orderBy(F.monotonically_increasing_id())
# unordered window so that max() is computed over the whole group rather than a running frame
w_all = Window.partitionBy("group")

(foo.withColumn("Diff", F.lead("value").over(w) - F.col("value"))
    .withColumn("is_consecutive", F.when(F.max("Diff").over(w_all) > 1, 0).otherwise(1))
    .drop("Diff")).show()
+-----+-----+--------------+
|group|value|is_consecutive|
+-----+-----+--------------+
| a| 1| 1|
| a| 2| 1|
| a| 3| 1|
| b| 4| 1|
| b| 5| 1|
| c| 2| 0|
| c| 4| 0|
| c| 5| 0|
+-----+-----+--------------+
I have a pyspark dataframe and want to add a column that takes values from a list in a repeating fashion. If this were plain Python, I would probably use itertools' cycle function, but I don't know how to do this in pyspark.
names = ['Julia', 'Tim', 'Zoe']
My dataframe looks like this:
+-----+------+
| id_A| idx_B|
+-----+------+
| a| 0|
| b| 0|
| b| 2|
| b| 2|
| b| 2|
| b| 2|
+-----+------+
I want it to look like this:
+-----+------+--------+
| id_A| idx_B| names |
+-----+------+--------+
| a| 0| Julia|
| b| 0| Tim|
| b| 2| Zoe|
| b| 2| Julia|
| b| 2| Tim|
| b| 2| Zoe|
+-----+------+--------+
Here's one way.
1 - add a unique incremental id for your dataframe:
from pyspark.sql import Row

df = spark.createDataFrame(
    df.rdd.zipWithIndex().map(lambda x: Row(*x[0], x[1]))
).toDF("id_A", "idx_B", "id")
df.show()
#+----+-----+---+
#|id_A|idx_B| id|
#+----+-----+---+
#| a| 0| 0|
#| b| 0| 1|
#| b| 2| 2|
#| b| 2| 3|
#| b| 2| 4|
#| b| 2| 5|
#+----+-----+---+
2 - create dataframe from the list of names:
names_df = spark.createDataFrame([(idx, name) for idx, name in enumerate(names)], ["name_id", "names"])
3 - join using modulo 3 (the length of the names list) in the join condition:
from pyspark.sql import functions as F
result = df.join(
    names_df,
    F.col("id") % 3 == F.col("name_id")
).orderBy("id").drop("id", "name_id")
result.show()
#+----+-----+-----+
#|id_A|idx_B|names|
#+----+-----+-----+
#| a| 0|Julia|
#| b| 0| Tim|
#| b| 2| Zoe|
#| b| 2|Julia|
#| b| 2| Tim|
#| b| 2| Zoe|
#+----+-----+-----+
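A join-free variant (my own sketch, not part of the original answer) indexes into an array literal built from names; Spark SQL's [] indexing on arrays is 0-based, so id % len(names) cycles through the list:
from pyspark.sql import functions as F

# build a Spark SQL array literal from the Python list, e.g. array('Julia', 'Tim', 'Zoe')
names_sql = "array({})".format(", ".join("'{}'".format(n) for n in names))

# cast the modulo to int because the generated id column is a long
result = df.withColumn("names", F.expr("{}[cast(id % {} as int)]".format(names_sql, len(names)))).drop("id")
result.show()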
How does one set the default value for pyspark.sql.functions.lag to a value within the current row?
For example, given:
from pyspark.sql import functions as F, Window
from pyspark.sql.functions import col
testInput = [(1, 'a'),(2, 'c'),(3, 'e'),(1, 'a'),(1, 'b'),(1, 'b')]
columns = ['Col A', 'Col B']
df = sc.parallelize(testInput).toDF(columns)
df.show()
windowSpecification = Window.partitionBy(col('Col A')).orderBy(col('Col B'))
changedRows = col('Col B') != F.lag(col('Col B'), 1).over(windowSpecification)
df.select(col('Col A'), col('Col B'), changedRows.alias('New Col C')).show()
which outputs:
+-----+-----+
|Col A|Col B|
+-----+-----+
| 1| a|
| 2| c|
| 3| e|
| 1| a|
| 1| b|
| 1| b|
+-----+-----+
+-----+-----+---------+
|Col A|Col B|New Col C|
+-----+-----+---------+
| 1| a| null|
| 1| a| false|
| 1| b| true|
| 1| b| false|
| 3| e| null|
| 2| c| null|
+-----+-----+---------+
I would like the output to look like:
+-----+-----+---------+
|Col A|Col B|New Col C|
+-----+-----+---------+
| 1| a| false|
| 1| a| false|
| 1| b| true|
| 1| b| false|
| 3| e| false|
| 2| c| false|
+-----+-----+---------+
My current workaround is to add a second lag call to the changedRows, like so:
changedRows = (col('Col B') != F.lag(col('Col B'), 1).over(windowSpecification)) & F.lag(col('Col B'), 1).over(windowSpecification).isNotNull()
but this does not look clean to me.
I would like to do something like
changedRows = col('Col B') != F.lag(col('Col B'), 1, col('Col B')).over(windowSpecification)
but I get the error TypeError: 'Column' object is not callable.
You can use column values as parameters if you use pyspark.sql.functions.expr. In your case, make the following modification to changedRows:
changedRows = F.expr(
    "`Col B` != lag(`Col B`, 1, `Col B`) over (PARTITION BY `Col A` ORDER BY `Col B`)"
)
df.select('Col A', 'Col B', changedRows.alias('New Col C')).show()
#+-----+-----+---------+
#|Col A|Col B|New Col C|
#+-----+-----+---------+
#| 1| a| false|
#| 1| a| false|
#| 1| b| true|
#| 1| b| false|
#| 3| e| false|
#| 2| c| false|
#+-----+-----+---------+
You have to wrap the column names in backticks because of the space in the names.
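If you prefer to stay in the DataFrame API, coalescing the lagged value with the current value has the same effect. A sketch reusing windowSpecification from the question:
from pyspark.sql import functions as F
from pyspark.sql.functions import col

changedRows = col('Col B') != F.coalesce(
    F.lag(col('Col B'), 1).over(windowSpecification),  # previous value within the partition
    col('Col B')                                       # falls back to the current value on the first row
)
df.select(col('Col A'), col('Col B'), changedRows.alias('New Col C')).show()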
Below is a data frame in pyspark. I want to update the column val in the data frame based on the values in the tests column.
df.show()
+---------+----+---+
| tests| val|asd|
+---------+----+---+
| test1| Y| 1|
| test2| N| 2|
| test2| Y| 1|
| test1| N| 2|
| test1| N| 3|
| test3| N| 4|
| test4| Y| 5|
+---------+----+---+
I want to update the values so that whenever any row of a given test has val Y, all vals of that particular test are updated to Y; otherwise they keep whatever values they have.
Basically, I want the data frame to look like below.
result_df.show()
+---------+----+---+
| tests| val|asd|
+---------+----+---+
| test1| Y| 1|
| test2| Y| 2|
| test2| Y| 1|
| test1| Y| 2|
| test1| Y| 3|
| test3| N| 4|
| test4| Y| 5|
+---------+----+---+
What should I do to achieve that?
Use max window function and selectExpr:
df.selectExpr(
    'tests', 'max(val) over (partition by tests) as val', 'asd'
).show()
+-----+---+---+
|tests|val|asd|
+-----+---+---+
|test4| Y| 5|
|test3| N| 4|
|test1| Y| 1|
|test1| Y| 2|
|test1| Y| 3|
|test2| Y| 2|
|test2| Y| 1|
+-----+---+---+
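The same thing can be written with the DataFrame API instead of selectExpr (an equivalent sketch, not part of the original answer); since 'Y' > 'N' in string ordering, the per-group max of val is 'Y' whenever any row in the group has it:
from pyspark.sql import functions as F, Window

result_df = df.withColumn('val', F.max('val').over(Window.partitionBy('tests')))
result_df.show()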
Here is a solution.
First we find out for each test whether it has val Y.
import pyspark.sql.functions as sf
by_test = df.groupBy('tests').agg(sf.sum((sf.col('val') == 'Y').cast('int')).alias('HasY'))
by_test.show()
+-----+----+
|tests|HasY|
+-----+----+
|test4| 1|
|test3| 0|
|test1| 1|
|test2| 1|
+-----+----+
Join back to the original dataframe:
df = df.join(by_test, on='tests')
df.show()
+-----+---+---+----+
|tests|val|asd|HasY|
+-----+---+---+----+
|test4| Y| 5| 1|
|test3| N| 4| 0|
|test1| Y| 1| 1|
|test1| N| 2| 1|
|test1| N| 3| 1|
|test2| N| 2| 1|
|test2| Y| 1| 1|
+-----+---+---+----+
Create a new column with the same name using when/otherwise
df = df.withColumn('val', sf.when(sf.col('HasY') > 0, 'Y').otherwise(sf.col('val')))
df = df.drop('HasY')
df.show()
+-----+---+---+
|tests|val|asd|
+-----+---+---+
|test4| Y| 5|
|test3| N| 4|
|test1| Y| 1|
|test1| Y| 2|
|test1| Y| 3|
|test2| Y| 2|
|test2| Y| 1|
+-----+---+---+
I would like to achieve the following thing:
val df1 = Seq(("a","x",20),("z","x",10),("b","y",7),("z","y",5),("c","w",1),("z","w",2)).toDS
+---+---+---+
| _1| _2| _3|
+---+---+---+
| a| x| 20|
| z| x| 10|
| b| y| 7|
| z| y| 5|
| c| w| 1|
| z| w| 2|
+---+---+---+
should be reduced to
val df2 = Seq(("a","x",30),("b","y",12),("c","w",3)).toDS
+---+---+---+
| _1| _2| _3|
+---+---+---+
| a| x| 30|
| b| y| 12|
| c| w| 3|
+---+---+---+
I am aware of the dropDuplicates() command and its options, but for what I would like to achieve it does not work. Somehow one has to detect the duplicates according to column _2, then always remove the row with the z entry in _1 and add its _3 value to the _3 column of the row that one keeps.
Thank you in advance.
As per your question, this is what you are looking for:
import spark.implicits._
import org.apache.spark.sql.functions._

val df1 = Seq(("a","x",20),("z","x",10),("b","y",7),("z","y",5),("c","w",1),("z","w",2)).toDS
val resultDf = df1.groupBy("_2").agg(collect_list("_1")(0).as("_1"), sum("_3").as("_3"))
Output:
+---+---+---+
| _2| _1| _3|
+---+---+---+
| x| a| 30|
| w| c| 3|
| y| b| 12|
+---+---+---+
You will get the result, but note that collect_list does not guarantee element order, so the value taken at index 0 is not guaranteed to be the non-z key.
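A variant that explicitly keeps the non-z key, shown here as a PySpark sketch (my own, assuming the same data loaded into a DataFrame named df1):
from pyspark.sql import functions as F

result = df1.groupBy("_2").agg(
    # keep the first key that is not 'z'; rows where the when() yields null are ignored
    F.first(F.when(F.col("_1") != "z", F.col("_1")), ignorenulls=True).alias("_1"),
    F.sum("_3").alias("_3"),
)
result.show()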