This might be related to pivoting, but I am not sure. Basically, what I want to achieve is the following binary transformation:
+-----------------+
| C1 | C2 |
+--------+--------+
| A | xxx |
| B | yyy |
| A | yyy |
| B | www |
| B | xxx |
| A | zzz |
| A | xxx |
| A | yyy |
+-----------------+
to
+--------------------------------------------+
| C1 | www | xxx | yyy | zzz |
+--------+--------+--------+--------+--------+
| A | 0 | 1 | 1 | 1 |
| B | 1 | 1 | 1 | 0 |
+--------------------------------------------+
How does one attain this in PySpark? Presence is 1 and absence is 0.
Yes, you will need pivot. For the aggregation, in your case it's simplest to use F.first(F.lit(1)); the combinations that never occur come back as null, so replace those nulls with 0 using df.fillna(0).
from pyspark.sql import functions as F
df = spark.createDataFrame(
[('A', 'xxx'),
('B', 'yyy'),
('A', 'yyy'),
('B', 'www'),
('B', 'xxx'),
('A', 'zzz'),
('A', 'xxx'),
('A', 'yyy')],
['C1', 'C2'])
df = df.groupBy('C1').pivot('C2').agg(F.first(F.lit(1)))
df = df.fillna(0)
df.show()
# +---+---+---+---+---+
# | C1|www|xxx|yyy|zzz|
# +---+---+---+---+---+
# | B| 1| 1| 1| 0|
# | A| 0| 1| 1| 1|
# +---+---+---+---+---+
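If the distinct values of C2 are known up front, you can also pass them to pivot explicitly; Spark then skips the extra pass it would otherwise run to collect the pivot values. A small sketch, applied to the original long-format DataFrame (the value list here is just the one from the example data):
from pyspark.sql import functions as F

# Same pivot as above, but with the pivot values supplied explicitly,
# so Spark does not have to compute the distinct C2 values first.
# (Run this on the original df, before the pivot above reassigns it.)
pivoted = df.groupBy('C1') \
    .pivot('C2', ['www', 'xxx', 'yyy', 'zzz']) \
    .agg(F.first(F.lit(1))) \
    .fillna(0)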
I have two Spark dataframes that share the same ID column:
df1:
+------+---------+---------+
|ID | Name1 | Name2 |
+------+---------+---------+
| 1 | A | B |
| 2 | C | D |
| 3 | E | F |
+------+---------+---------+
df2:
+------+-------+
|ID | key |
+------+-------+
| 1 | w |
| 1 | x |
| 2 | y |
| 3 | z |
+------+-------+
Now, I want to create a new column in df1 that contains all key values denoted in df2. So, I aim for the result:
+------+---------+---------+---------+
|ID | Name1 | Name2 | keys |
+------+---------+---------+---------+
| 1 | A | B | w,x |
| 2 | C | D | y |
| 3 | E | F | z |
+------+---------+---------+---------+
Ultimately, I want to find a solution for an arbitrary number of keys.
My attempt in PySpark:
def get_keys(id):
x = df2.where(df2.ID == id).select('key')
return x
df_keys = df1.withColumn("keys", get_keys(col('ID')))
In the above code, x is a DataFrame. Since the second argument of .withColumn needs to be a Column, I am not sure how to convert x correctly.
You are looking for the collect_list function.
from pyspark.sql.functions import collect_list
df3 = df1.join(df2, df1.ID == df2.ID).drop(df2.ID)
df3.groupBy('ID','Name1','Name2').agg(collect_list('key').alias('keys')).show()
#+---+-----+-----+------+
#| ID|Name1|Name2| keys|
#+---+-----+-----+------+
#| 1| A| B|[w, x]|
#|  3|    E|    F|   [z]|
#|  2|    C|    D|   [y]|
#+---+-----+-----+------+
If you want only unique keys, you can use collect_set instead.
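If you also want the comma-separated string from the desired output rather than an array, you can wrap the collected values in concat_ws. A sketch reusing df3 from above:
from pyspark.sql.functions import collect_set, concat_ws

# Collect the distinct keys per ID and join them into a single "w,x"-style string.
df3.groupBy('ID', 'Name1', 'Name2') \
   .agg(concat_ws(',', collect_set('key')).alias('keys')) \
   .show()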
I'm interested in whether there is a way to use lead/lag to compute something like this.
First step: I have a dataframe:
+----+-----------+------+
| id | timestamp | sess |
+----+-----------+------+
| xx | 1 | A |
+----+-----------+------+
| yy | 2 | A |
+----+-----------+------+
| zz | 1 | B |
+----+-----------+------+
| yy | 3 | B |
+----+-----------+------+
| tt | 4 | B |
+----+-----------+------+
And I want to collect the ids that immediately precede each id, partitioning by session (sess):
+----+---------+
| id | id_list |
+----+---------+
| yy | [xx,zz] |
+----+---------+
| xx | [] |
+----+---------+
| zz | [] |
+----+---------+
| tt | [yy] |
+----+---------+
You can create a window over the column sess and lag the IDs as you mentioned in the question. Then you can use groupBy with the aggregate function collect_list to get the output.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lag, collect_list}
val w = Window.partitionBy($"sess").orderBy($"timestamp")
val df1 = df.withColumn("lagged", lag($"id", 1).over(w))
df1.select("id", "lagged").groupBy($"id").agg(collect_list($"lagged").as("id_list")).show
//+---+--------------------+
//| id| id_list|
//+---+--------------------+
//| tt| [yy]|
//| xx| []|
//| zz| []|
//| yy| [zz, xx]|
//+---+--------------------+
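For reference, roughly the same lag-then-collect_list approach in PySpark, as a sketch assuming the data is loaded into a DataFrame df with the columns id, timestamp, and sess from the question:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Lag the id within each session; collect_list skips nulls, so ids with
# no predecessor end up with an empty list.
w = Window.partitionBy('sess').orderBy('timestamp')
df.withColumn('lagged', F.lag('id', 1).over(w)) \
  .groupBy('id') \
  .agg(F.collect_list('lagged').alias('id_list')) \
  .show()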
I have a table; take the table as a dataframe.
id | Formula | Step | Value |
1 | A*(B+C) | A | 5 |
1 | A*(B+C) | B | 6 |
1 | A*(B+C) | C | 7 |
2 | A/B | A | 12 |
2 | A/B | B | 6 |
Expected result dataframe (a solution using Spark and Scala is required):
id | Formula | Value |
1 | A*(B+C) | 65 |
2 | A/B | 2 |
scala> val df = Seq((1,"A*(B+C)","A",5),(1,"A*(B+C)","B",6),(1,"A*(B+C)","C",7),(2,"A/B","A",12),(2,"A/B","B",6)).toDF("ID","Formula","Step","Value")
df: org.apache.spark.sql.DataFrame = [ID: int, Formula: string ... 2 more fields]
scala> df.show
+---+-------+----+-----+
| ID|Formula|Step|Value|
+---+-------+----+-----+
| 1|A*(B+C)| A| 5|
| 1|A*(B+C)| B| 6|
| 1|A*(B+C)| C| 7|
| 2| A/B| A| 12|
| 2| A/B| B| 6|
+---+-------+----+-----+
You can group by Formula and collect the Step & Value as a key value pair.
scala> df.groupBy($"Formula").agg(collect_list(map($"Step",$"Value")) as "map").show(false)
+-------+---------------------------------------+
|Formula|map |
+-------+---------------------------------------+
|A*(B+C)|[Map(A -> 5), Map(B -> 6), Map(C -> 7)]|
|A/B |[Map(A -> 12), Map(B -> 6)] |
+-------+---------------------------------------+
Now you can write a UDF that substitutes the variable values from the map into the Formula and computes the result. Note that the collected column is an array of single-entry maps (one per Step), so the UDF receives a Seq[Map[String, Int]], and it has to be applied to the grouped DataFrame rather than the original df:
val grouped = df.groupBy($"ID", $"Formula")
  .agg(collect_list(map($"Step", $"Value")) as "map")

val evalUDF = udf((valueMaps: Seq[Map[String, Int]], formula: String) => {
  ...
})

val output = grouped.withColumn("Value", evalUDF($"map", $"Formula")).drop("map")
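The question asks for Scala, but for illustration here is a PySpark sketch of the same substitute-and-evaluate idea, assuming a DataFrame df with the columns id, Formula, Step, and Value from the table above (map_from_entries needs Spark 2.4+, and Python's eval should only be used on trusted formula strings); a Scala UDF would follow the same shape:
import re
from pyspark.sql import functions as F

# Build one Step -> Value map per (id, Formula).
agg = df.groupBy('id', 'Formula').agg(
    F.map_from_entries(F.collect_list(F.struct('Step', 'Value'))).alias('vars'))

@F.udf('double')
def eval_formula(formula, variables):
    # Replace each variable name with its numeric value, then evaluate the
    # resulting arithmetic expression (eval is fine for a sketch, not for
    # untrusted input).
    expr = re.sub(r'[A-Za-z]+', lambda m: str(variables[m.group(0)]), formula)
    return float(eval(expr))

result = agg.withColumn('Value', eval_formula('Formula', 'vars')).drop('vars')
result.show()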
I have a pyspark dataframe with the following data:
| y | date | amount| id |
-----------------------------
| 1 | 2017-01-01 | 10 | 1 |
| 0 | 2017-01-01 | 2 | 1 |
| 1 | 2017-01-02 | 20 | 1 |
| 0 | 2017-01-02 | 3 | 1 |
| 1 | 2017-01-03 | 2 | 1 |
| 0 | 2017-01-03 | 5 | 1 |
I want to apply a window function, but have the sum aggregate consider only the rows with y==1, while still keeping all the other rows and columns.
The window that I would apply is:
w = Window \
.partitionBy(df.id) \
.orderBy(df.date.asc()) \
.rowsBetween(Window.unboundedPreceding, -1)
And the resulting dataframe would look like:
| y | date | amount| id | sum |
-----------------------------------
| 1 | 2017-01-01 | 10 | 1 | 0 |
| 0 | 2017-01-01 | 2 | 1 | 0 |
| 1 | 2017-01-02 | 20 | 1 | 10 | // =10 (considering only the row with y==1)
| 0 | 2017-01-02 | 3 | 1 | 10 | // same as above
| 1 | 2017-01-03 | 2 | 1 | 30 | // =10+20
| 0 | 2017-01-03 | 5 | 1 | 30 | // same as above
Is this feasible at all?
I tried sum(when(df.y==1, df.amount)).over(w), but it didn't return the correct results.
It is actually difficult to handle this with a single window function. I think you should first create some helper columns to calculate the sum column. You can find my solution below.
>>> from pyspark.sql.window import Window
>>> import pyspark.sql.functions as F
>>>
>>> df.show()
+---+----------+------+---+
| y| date|amount| id|
+---+----------+------+---+
| 1|2017-01-01| 10| 1|
| 0|2017-01-01| 2| 1|
| 1|2017-01-02| 20| 1|
| 0|2017-01-02| 3| 1|
| 1|2017-01-03| 2| 1|
| 0|2017-01-03| 5| 1|
+---+----------+------+---+
>>>
>>> df = df.withColumn('c1', F.when(F.col('y')==1,F.col('amount')).otherwise(0))
>>>
>>> window1 = Window.partitionBy(df.id).orderBy(df.date.asc()).rowsBetween(Window.unboundedPreceding, -1)
>>> df = df.withColumn('c2', F.sum(df.c1).over(window1)).fillna(0)
>>>
>>> window2 = Window.partitionBy(df.id).orderBy(df.date.asc())
>>> df = df.withColumn('c3', F.lag(df.c2).over(window2)).fillna(0)
>>>
>>> df = df.withColumn('sum', F.when(df.y==0,df.c3).otherwise(df.c2))
>>>
>>> df = df.select('y','date','amount','id','sum')
>>>
>>> df.show()
+---+----------+------+---+---+
| y| date|amount| id|sum|
+---+----------+------+---+---+
| 1|2017-01-01| 10| 1| 0|
| 0|2017-01-01| 2| 1| 0|
| 1|2017-01-02| 20| 1| 10|
| 0|2017-01-02| 3| 1| 10|
| 1|2017-01-03| 2| 1| 30|
| 0|2017-01-03| 5| 1| 30|
+---+----------+------+---+---+
Note that this solution may not work if there are multiple y=1 or y=0 rows per day; please keep that in mind.
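One way around that per-day restriction is to pre-aggregate the y==1 amounts per (id, date), run the cumulative window on that daily frame, and join the running total back to the original rows. A sketch, starting again from the original y/date/amount/id DataFrame and reusing the F and Window imports above:
# Sum the y==1 amounts per (id, date) first, so multiple rows on the same
# date cannot leak into the running total.
daily = df.groupBy('id', 'date').agg(
    F.sum(F.when(F.col('y') == 1, F.col('amount')).otherwise(0)).alias('daily_y1'))

# Running total over strictly earlier dates only.
w = Window.partitionBy('id').orderBy('date').rowsBetween(Window.unboundedPreceding, -1)
daily = daily.withColumn('sum', F.coalesce(F.sum('daily_y1').over(w), F.lit(0)))

# Attach the per-date running total back to every original row.
result = df.join(daily.select('id', 'date', 'sum'), on=['id', 'date'], how='left')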
I have a pyspark DataFrame like the following:
+--------+--------+-----------+
| col1 | col2 | groupId |
+--------+--------+-----------+
| val11 | val21 | 0 |
| val12 | val22 | 1 |
| val13 | val23 | 2 |
| val14 | val24 | 0 |
| val15 | val25 | 1 |
| val16 | val26 | 1 |
+--------+--------+-----------+
Each row has a groupId and multiple rows can have the same groupId.
I want to randomly split this data into two datasets. But all the data having a particular groupId must be in one of the splits.
This means that if d1.groupId = d2.groupId, then d1 and d2 are in the same split.
For example:
# Split 1:
+--------+--------+-----------+
| col1 | col2 | groupId |
+--------+--------+-----------+
| val11 | val21 | 0 |
| val13 | val23 | 2 |
| val14 | val24 | 0 |
+--------+--------+-----------+
# Split 2:
+--------+--------+-----------+
| col1 | col2 | groupId |
+--------+--------+-----------+
| val12 | val22 | 1 |
| val15 | val25 | 1 |
| val16 | val26 | 1 |
+--------+--------+-----------+
What is a good way to do this in PySpark? Can I use the randomSplit method somehow?
You can use randomSplit to split just the distinct groupIds, and then use the results to split the source DataFrame using join.
For example:
split1, split2 = df.select("groupId").distinct().randomSplit(weights=[0.5, 0.5], seed=0)
split1.show()
#+-------+
#|groupId|
#+-------+
#| 1|
#+-------+
split2.show()
#+-------+
#|groupId|
#+-------+
#| 0|
#| 2|
#+-------+
Now join these back to the original DataFrame:
df1 = df.join(split1, on="groupId", how="inner")
df2 = df.join(split2, on="groupId", how="inner")
df1.show()
#+-------+-----+-----+
#|groupId| col1| col2|
#+-------+-----+-----+
#| 1|val12|val22|
#| 1|val15|val25|
#| 1|val16|val26|
#+-------+-----+-----+
df2.show()
#+-------+-----+-----+
#|groupId| col1| col2|
#+-------+-----+-----+
#| 0|val11|val21|
#| 0|val14|val24|
#| 2|val13|val23|
#+-------+-----+-----+
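As a quick sanity check, you can confirm that no groupId ended up in both splits (a small sketch):
# The two splits should have no groupId in common.
overlap = df1.select("groupId").intersect(df2.select("groupId")).count()
print(overlap)  # expected: 0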