I have a dataframe in Spark:
id | itemid | itemquant | itemprice
-------------------------------------------------
A | 1,2,3 | 2,2,1 | 30,19,10
B | 3,5 | 5,8 | 18,40
Here, all the columns are of string datatype.
How can I use the explode function across multiple columns to create the new dataframe shown below:
id | itemid | itemquant | itemprice
-------------------------------------------------
A | 1 | 2 | 30
A | 2 | 2 | 19
A | 3 | 1 | 10
B | 3 | 5 | 18
B | 5 | 8 | 40
In the new dataframe, all the columns should also be of string datatype.
You need a UDF for that:
import org.apache.spark.sql.functions._  // udf, explode
import spark.implicits._                 // toDF and the $ column syntax (assumes a SparkSession named spark)

val df = Seq(
  ("A", "1,2,3", "2,2,1", "30,19,10"),
  ("B", "3,5", "5,8", "18,40")
).toDF("id", "itemid", "itemquant", "itemprice")

// zip the three comma-separated lists element-wise into an array of (id, quant, price) tuples
val splitAndZip = udf((col1: String, col2: String, col3: String) => {
  col1.split(',').zip(col2.split(',')).zip(col3.split(',')).map { case ((a, b), c) => (a, b, c) }
})

df
  .withColumn("tmp", explode(splitAndZip($"itemid", $"itemquant", $"itemprice")))
  .select(
    $"id",
    $"tmp._1".as("itemid"),
    $"tmp._2".as("itemquant"),
    $"tmp._3".as("itemprice")
  )
  .show()
+---+------+---------+---------+
| id|itemid|itemquant|itemprice|
+---+------+---------+---------+
| A| 1| 2| 30|
| A| 2| 2| 19|
| A| 3| 1| 10|
| B| 3| 5| 18|
| B| 5| 8| 40|
+---+------+---------+---------+
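If you would rather avoid a UDF, on Spark 2.4+ the same result can be obtained with split, arrays_zip and explode. A minimal PySpark sketch, assuming a dataframe df with the columns from the question:
from pyspark.sql import functions as F

result = (
    df.withColumn("ids", F.split("itemid", ","))
      .withColumn("quants", F.split("itemquant", ","))
      .withColumn("prices", F.split("itemprice", ","))
      # arrays_zip pairs up the n-th elements of the three arrays into structs
      .withColumn("tmp", F.explode(F.arrays_zip("ids", "quants", "prices")))
      .select(
          "id",
          F.col("tmp.ids").alias("itemid"),
          F.col("tmp.quants").alias("itemquant"),
          F.col("tmp.prices").alias("itemprice"),
      )
)
result.show()
All columns stay strings, since split produces arrays of strings.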
I have a pyspark dataframe:
date | cust | amount | is_delinquent
---------------------------------------
1/1/20 | A | 5 | 0
13/1/20 | A | 1 | 0
15/1/20 | A | 3 | 1
19/1/20 | A | 4 | 0
20/1/20 | A | 4 | 1
27/1/20 | A | 2 | 0
1/2/20 | A | 2 | 0
5/2/20 | A | 1 | 0
1/1/20 | B | 7 | 0
9/1/20 | B | 5 | 0
Now I want to calculate the average of amount over a rolling 30-day window, but only over the rows where is_delinquent equals 0. Rows where is_delinquent equals 1 should be skipped, and their avg_amount shown as null.
My expected final dataframe is:
date | cust | amount | is_delinquent | avg_amount
----------------------------------------------------------
1/1/20 | A | 5 | 0 | null
13/1/20 | A | 1 | 0 | 5
15/1/20 | A | 3 | 1 | null
19/1/20 | A | 4 | 0 | 3
20/1/20 | A | 4 | 1 | null
27/1/20 | A | 2 | 0 | 3.333
1/2/20 | A | 2 | 0 | null
5/2/20 | A | 1 | 0 | 2
1/1/20 | B | 7 | 0 | null
9/1/20 | B | 5 | 0 | 7
Without the filtering, my code would be like this:
import pyspark.sql.functions as F
from pyspark.sql.window import Window

days = lambda i: i * 86400
w_pay_30x = (Window.partitionBy("cust")
             .orderBy(F.col("date").cast("timestamp").cast("long"))
             .rangeBetween(-days(30), -days(1)))
data.withColumn("avg_amount", F.avg("amount").over(w_pay_30x))
Any idea how I can add this filter?
You can use when to calculate and show the average only when is_delinquent equals 0. Also, to match your expected output (where 1/2/20 starts over with null), you may want to include the month in the partitionBy clause of the window.
from pyspark.sql import functions as F, Window
days = lambda i: i * 86400
w_pay_30x = (
    Window.partitionBy("cust", F.month(F.to_timestamp('date', 'd/M/yy')))
    .orderBy(F.to_timestamp('date', 'd/M/yy').cast('long'))
    .rangeBetween(-days(30), -days(1))
)
data2 = data.withColumn(
    'avg_amount',
    F.when(
        F.col('is_delinquent') == 0,
        F.avg(
            F.when(F.col('is_delinquent') == 0, F.col('amount'))
        ).over(w_pay_30x)
    )
).orderBy('cust', F.to_timestamp('date', 'd/M/yy'))
data2.show()
+-------+----+------+-------------+------------------+
| date|cust|amount|is_delinquent| avg_amount|
+-------+----+------+-------------+------------------+
| 1/1/20| A| 5| 0| null|
|13/1/20| A| 1| 0| 5.0|
|15/1/20| A| 3| 1| null|
|19/1/20| A| 4| 0| 3.0|
|20/1/20| A| 4| 1| null|
|27/1/20| A| 2| 0|3.3333333333333335|
| 1/2/20| A| 2| 0| null|
| 5/2/20| A| 1| 0| 2.0|
| 1/1/20| B| 7| 0| null|
| 9/1/20| B| 5| 0| 7.0|
+-------+----+------+-------------+------------------+
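For completeness, a sketch of how the input frame above could be recreated (dates kept as d/M/yy strings, as in the question):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

data = spark.createDataFrame(
    [
        ("1/1/20", "A", 5, 0), ("13/1/20", "A", 1, 0), ("15/1/20", "A", 3, 1),
        ("19/1/20", "A", 4, 0), ("20/1/20", "A", 4, 1), ("27/1/20", "A", 2, 0),
        ("1/2/20", "A", 2, 0), ("5/2/20", "A", 1, 0),
        ("1/1/20", "B", 7, 0), ("9/1/20", "B", 5, 0),
    ],
    ["date", "cust", "amount", "is_delinquent"],
)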
Assume we have a Spark DataFrame that looks like the following (ordered by time):
+------+-------+
| time | value |
+------+-------+
| 1 | A |
| 2 | A |
| 3 | A |
| 4 | B |
| 5 | B |
| 6 | A |
+------+-------+
I'd like to calculate the start/end times of each sequence of uninterrupted values. The expected output from the above DataFrame would be:
+-------+-------+-----+
| value | start | end |
+-------+-------+-----+
| A | 1 | 3 |
| B | 4 | 5 |
| A | 6 | 6 |
+-------+-------+-----+
(The end value for the final row could also be null.)
Doing this with a simple group aggregation:
.groupBy("value")
.agg(
F.min("time").alias("start"),
F.max("time").alias("end")
)
doesn't take into account the fact that the same value can appear in multiple different intervals.
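To make the problem concrete, here is what that naive aggregation would return (a hedged illustration, with the usual functions import assumed):
from pyspark.sql import functions as F

df.groupBy("value").agg(
    F.min("time").alias("start"),
    F.max("time").alias("end"),
).show()
# value A comes back as a single row with start=1 and end=6,
# merging the two separate A runs (1-3 and 6-6) into one interval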
The idea is to create an identifier for each consecutive group and use it to group by and compute your min and max time.
Assuming df is your dataframe:
from pyspark.sql import functions as F, Window
# flag rows where the value changes compared to the previous row
df = df.withColumn(
    "fg",
    F.when(
        F.lag('value').over(Window.orderBy("time")) == F.col("value"),
        0
    ).otherwise(1)
)
# a running sum of the flags yields one identifier per consecutive group
df = df.withColumn(
    "rn",
    F.sum("fg").over(
        Window.orderBy("time")
        .rowsBetween(Window.unboundedPreceding, Window.currentRow)
    )
)
From that point, you have your dataframe with an identifier for each consecutive group.
df.show()
+----+-----+---+---+
|time|value| rn| fg|
+----+-----+---+---+
| 1| A| 1| 1|
| 2| A| 1| 0|
| 3| A| 1| 0|
| 4| B| 2| 1|
| 5| B| 2| 0|
| 6| A| 3| 1|
+----+-----+---+---+
Then you just have to do the aggregation:
df.groupBy("value", "rn").agg(
    F.min("time").alias("start"),
    F.max("time").alias("end")
).drop("rn").show()
+-----+-----+---+
|value|start|end|
+-----+-----+---+
| A| 1| 3|
| B| 4| 5|
| A| 6| 6|
+-----+-----+---+
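One caveat on the answer above: Window.orderBy("time") without a partitionBy moves all rows into a single partition, which Spark warns about on larger data. If the runs are tracked per key, say a hypothetical device_id column, the same logic can be partitioned; a sketch under that assumption:
from pyspark.sql import functions as F, Window

w = Window.partitionBy("device_id").orderBy("time")  # device_id is hypothetical
df = (
    df.withColumn(
        "fg",
        F.when(F.lag("value").over(w) == F.col("value"), 0).otherwise(1),
    )
    .withColumn(
        "rn",
        F.sum("fg").over(w.rowsBetween(Window.unboundedPreceding, Window.currentRow)),
    )
)
The groupBy at the end would then also include device_id.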
I have a pyspark dataframe with the following data:
| y | date | amount| id |
-----------------------------
| 1 | 2017-01-01 | 10 | 1 |
| 0 | 2017-01-01 | 2 | 1 |
| 1 | 2017-01-02 | 20 | 1 |
| 0 | 2017-01-02 | 3 | 1 |
| 1 | 2017-01-03 | 2 | 1 |
| 0 | 2017-01-03 | 5 | 1 |
I want to apply a window function, but have the sum aggregate only the rows with y==1, while still keeping the other columns.
The window that I would apply is:
w = Window \
.partitionBy(df.id) \
.orderBy(df.date.asc()) \
.rowsBetween(Window.unboundedPreceding, -1)
And the resulting dataframe would look like:
| y | date | amount| id | sum |
-----------------------------------
| 1 | 2017-01-01 | 10 | 1 | 0 |
| 0 | 2017-01-01 | 2 | 1 | 0 |
| 1 | 2017-01-02 | 20 | 1 | 10 | // =10 (considering only the row with y==1)
| 0 | 2017-01-02 | 3 | 1 | 10 | // same as above
| 1 | 2017-01-03 | 2 | 1 | 30 | // =10+20
| 0 | 2017-01-03 | 5 | 1 | 30 | // same as above
Is this feasible at all?
I tried sum(when(df.y==1, df.amount)).over(w), but it didn't return the correct results.
Actually, it is difficult to handle this with a single window function. I think you should create some dummy columns first to calculate the sum column. You can find my solution below.
>>> from pyspark.sql.window import Window
>>> import pyspark.sql.functions as F
>>>
>>> df.show()
+---+----------+------+---+
| y| date|amount| id|
+---+----------+------+---+
| 1|2017-01-01| 10| 1|
| 0|2017-01-01| 2| 1|
| 1|2017-01-02| 20| 1|
| 0|2017-01-02| 3| 1|
| 1|2017-01-03| 2| 1|
| 0|2017-01-03| 5| 1|
+---+----------+------+---+
>>>
>>> df = df.withColumn('c1', F.when(F.col('y')==1,F.col('amount')).otherwise(0))
>>>
>>> window1 = Window.partitionBy(df.id).orderBy(df.date.asc()).rowsBetween(Window.unboundedPreceding, -1)
>>> df = df.withColumn('c2', F.sum(df.c1).over(window1)).fillna(0)
>>>
>>> window2 = Window.partitionBy(df.id).orderBy(df.date.asc())
>>> df = df.withColumn('c3', F.lag(df.c2).over(window2)).fillna(0)
>>>
>>> df = df.withColumn('sum', F.when(df.y==0,df.c3).otherwise(df.c2))
>>>
>>> df = df.select('y','date','amount','id','sum')
>>>
>>> df.show()
+---+----------+------+---+---+
| y| date|amount| id|sum|
+---+----------+------+---+---+
| 1|2017-01-01| 10| 1| 0|
| 0|2017-01-01| 2| 1| 0|
| 1|2017-01-02| 20| 1| 10|
| 0|2017-01-02| 3| 1| 10|
| 1|2017-01-03| 2| 1| 30|
| 0|2017-01-03| 5| 1| 30|
+---+----------+------+---+---+
Note that this solution may not work if there are multiple y=1 or y=0 rows per day, so please take that into account.
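As a possible alternative (not the approach above), a range-based window that stops one day before the current row sidesteps the ordering ambiguity between same-day rows, because all rows of the current date are excluded from the frame. A sketch, assuming date is a yyyy-MM-dd string as in the question:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

days = lambda i: i * 86400
w = (
    Window.partitionBy("id")
    .orderBy(F.col("date").cast("timestamp").cast("long"))
    # everything up to, but not including, the current date
    .rangeBetween(Window.unboundedPreceding, -days(1))
)
df = df.withColumn(
    "sum",
    F.coalesce(F.sum(F.when(F.col("y") == 1, F.col("amount"))).over(w), F.lit(0)),
)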
I have a pyspark DataFrame like the following:
+--------+--------+-----------+
| col1 | col2 | groupId |
+--------+--------+-----------+
| val11 | val21 | 0 |
| val12 | val22 | 1 |
| val13 | val23 | 2 |
| val14 | val24 | 0 |
| val15 | val25 | 1 |
| val16 | val26 | 1 |
+--------+--------+-----------+
Each row has a groupId and multiple rows can have the same groupId.
I want to randomly split this data into two datasets. But all the data having a particular groupId must be in one of the splits.
This means that if d1.groupId = d2.groupId, then d1 and d2 are in the same split.
For example:
# Split 1:
+--------+--------+-----------+
| col1 | col2 | groupId |
+--------+--------+-----------+
| val11 | val21 | 0 |
| val13 | val23 | 2 |
| val14 | val24 | 0 |
+--------+--------+-----------+
# Split 2:
+--------+--------+-----------+
| col1 | col2 | groupId |
+--------+--------+-----------+
| val12 | val22 | 1 |
| val15 | val25 | 1 |
| val16 | val26 | 1 |
+--------+--------+-----------+
What is a good way to do this in PySpark? Can I use the randomSplit method somehow?
You can use randomSplit to split just the distinct groupIds, and then use the results to split the source DataFrame using join.
For example:
split1, split2 = df.select("groupId").distinct().randomSplit(weights=[0.5, 0.5], seed=0)
split1.show()
#+-------+
#|groupId|
#+-------+
#| 1|
#+-------+
split2.show()
#+-------+
#|groupId|
#+-------+
#| 0|
#| 2|
#+-------+
Now join these back to the original DataFrame:
df1 = df.join(split1, on="groupId", how="inner")
df2 = df.join(split2, on="groupId", how="inner")
df1.show()
#+-------+-----+-----+
#|groupId| col1| col2|
#+-------+-----+-----+
#| 1|val12|val22|
#| 1|val15|val25|
#| 1|val16|val26|
#+-------+-----+-----+
df2.show()
#+-------+-----+-----+
#|groupId| col1| col2|
#+-------+-----+-----+
#| 0|val11|val21|
#| 0|val14|val24|
#| 2|val13|val23|
#+-------+-----+-----+
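A quick sanity check that nothing was lost or duplicated: the two splits should be disjoint on groupId and together cover the original frame.
# every row lands in exactly one split
assert df1.count() + df2.count() == df.count()
# no groupId appears in both splits
assert split1.join(split2, on="groupId", how="inner").count() == 0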
I have a dataframe:
# +---+--------+---------+
# | id| rank | value |
# +---+--------+---------+
# | 1| A | 10 |
# | 2| B | 46 |
# | 3| D | 8 |
# | 4| C | 8 |
# +---+--------+---------+
I want to sort it by value, then rank. This seems like it should be simple, but I'm not seeing how it's done in the documentation or on SO for PySpark, only for R and Scala.
This is how it should look after sorting; .show() should print:
# +---+--------+---------+
# | id| rank | value |
# +---+--------+---------+
# | 4| C | 8 |
# | 3| D | 8 |
# | 1| A | 10 |
# | 2| B | 46 |
# +---+--------+---------+
df.orderBy(["value", "rank"], ascending=[1, 1])
Reference: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.orderBy
Say your dataframe is stored in a variable called df.
You'd do df.orderBy('value', 'rank').show() to get it sorted by value and then rank.
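If the two sort columns need different directions (say value ascending but rank descending), the Column API can be used instead of the ascending flags; a small sketch:
from pyspark.sql import functions as F

df.orderBy(F.col("value").asc(), F.col("rank").desc()).show()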