I'm trying to turn a Python notebook into a PySpark pipeline and I'm blocked by what seems like a simple problem.
I have this dataframe after a count aggregation grouped by Id:
| Id | count |
| 0 | 5 |
| 1 | 3 |
| 4 | 6 |
And I want this:
| Id | count |
| 0 | 5 |
| 1 | 3 |
| 2 | 0 |
| 3 | 0 |
| 4 | 6 |
| 5 | 0 |
I have tried adding a [0,1,3,4,5] array to each row, then doing an outer explode on this array, then looking for a way to keep only the rows I need, but it seems overly complicated for such a simple case.
Do you have any tips?
Thanks in advance.
original.show()
+---+-----+
| id|count|
+---+-----+
| 1| 12|
| 3| 15|
+---+-----+
df = spark.createDataFrame([(0,0),(1,0),(2,0),(3,0),(4,0),(5,0)],['id', 'default_count'])
df.show()
+---+-------------+
| id|default_count|
+---+-------------+
| 0| 0|
| 1| 0|
| 2| 0|
| 3| 0|
| 4| 0|
| 5| 0|
+---+-------------+
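from pyspark.sql import functions as F

# The right join keeps every id from df; coalesce falls back to default_count
# for ids that have no row in original.
result = (
    original.join(df, on='id', how='right')
    .withColumn('count', F.coalesce(F.col('count'), F.col('default_count')))
    .orderBy(F.col('id'))
    .drop('default_count')
)
result.show()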
+---+-----+
| id|count|
+---+-----+
| 0| 0|
| 1| 12|
| 2| 0|
| 3| 15|
| 4| 0|
| 5| 0|
+---+-----+
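The right join plus coalesce above amounts to "start from every expected id, take the observed count when there is one, otherwise use the default". Below is a minimal plain-Python sketch of that logic. It is not Spark code: the names `fill_missing`, `observed` and `expected_ids` are made up for illustration, and the id range 0 to 5 is taken from the example.

```python
def fill_missing(observed, expected_ids, default=0):
    """Return (id, count) pairs for every expected id, in ascending id order.

    Mirrors a right join onto the full id list followed by
    coalesce(count, default): ids absent from `observed` get `default`.
    """
    return [(i, observed.get(i, default)) for i in sorted(expected_ids)]


# Counts from the `original` dataframe in the answer above.
observed = {1: 12, 3: 15}
result = fill_missing(observed, range(6))
print(result)
```

In Spark itself the full id list could probably be generated rather than typed out, for example with `spark.range(6)`, which yields a single `id` column.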
df.show()
+---+-----+
| Id|count|
+---+-----+
| 0| 5|
| 1| 3|
| 4| 6|
+---+-----+
extra_rows = spark.createDataFrame([(2, 0),
(3, 0),
(5, 0)],
['Id', 'count'])
df.unionByName(extra_rows).orderBy('Id').show()
+---+-----+
| Id|count|
+---+-----+
| 0| 5|
| 1| 3|
| 2| 0|
| 3| 0|
| 4| 6|
| 5| 0|
+---+-----+
I am trying to stitch a few event rows in a dataframe together based on the time difference between them. I have created a new column in the dataframe that holds the time difference from the previous row, computed using lag. The dataframe looks as follows:
sc=spark.sparkContext
df = spark.createDataFrame(
sc.parallelize(
[['x',1, "9999"], ['x',2, "120"], ['x',3, "102"], ['x',4, "3000"],['x',5, "299"],['x',6, "100"]]
),
['id',"row_number", "time_diff"]
)
I want to stitch rows together if the time_diff from the previous event is less than 160.
For this, I was planning to assign a new row number to all the events that are within 160 of each other and then group by the new row number.
For the above dataframe I want this output:
+---+----------+---------+--------------+
| id|row_number|time_diff|new_row_number|
+---+----------+---------+--------------+
|  x|         1|     9999|             1|
|  x|         2|      120|             1|
|  x|         3|      102|             1|
|  x|         4|     3000|             4|
|  x|         5|      299|             5|
|  x|         6|      100|             5|
+---+----------+---------+--------------+
I wrote a program as follows:
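from pyspark.sql import Window
from pyspark.sql import functions as f
from pyspark.sql.functions import when, col

window = Window.partitionBy('id').orderBy('row_number')
# Seed the new column with row_number (the output below contains row numbers, not ids).
df2 = df.withColumn('new_row_number', col('row_number'))
df3 = df2.withColumn('new_row_number', when(col('time_diff') >= 160, col('row_number'))\
    .otherwise(f.lag(col('new_row_number')).over(window)))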
but the output I got was as follows:
+---+----------+---------+--------------+
| id|row_number|time_diff|new_row_number|
+---+----------+---------+--------------+
|  x|         1|     9999|             1|
|  x|         2|      120|             1|
|  x|         3|      102|             2|
|  x|         4|     3000|             4|
|  x|         5|      299|             5|
|  x|         6|      100|             5|
+---+----------+---------+--------------+
Can someone help me out in resolving this?
Thanks
You want the previous value of the column that is currently being populated, which is not possible with lag. Instead, mark the rows that start a new group and then forward-fill that marker:
from pyspark.sql import Window
from pyspark.sql import functions as f

window = Window.partitionBy('id').orderBy('row_number')
df3 = df.withColumn('new_row_number', f.when(f.col('time_diff') >= 160, f.col('row_number')))\
    .withColumn('new_row_number', f.last(f.col('new_row_number'), ignorenulls=True).over(window))
df3.show()
+---+----------+---------+--------------+
| id|row_number|time_diff|new_row_number|
+---+----------+---------+--------------+
| x| 1| 9999| 1|
| x| 2| 120| 1|
| x| 3| 102| 1|
| x| 4| 3000| 4|
| x| 5| 299| 5|
| x| 6| 100| 5|
+---+----------+---------+--------------+
To explain:
First we set new_row_number to the row's own row_number wherever time_diff is at least 160, and to null otherwise:
df2=df.withColumn('new_row_number', f.when(f.col('time_diff')>=160, f.col('row_number')))
df2.show()
+---+----------+---------+--------------+
| id|row_number|time_diff|new_row_number|
+---+----------+---------+--------------+
| x| 1| 9999| 1|
| x| 2| 120| null|
| x| 3| 102| null|
| x| 4| 3000| 4|
| x| 5| 299| 5|
| x| 6| 100| null|
+---+----------+---------+--------------+
Then we replace each null with the last non-null value seen so far in the window:
df3=df2.withColumn("new_row_number", f.last(f.col("new_row_number"), ignorenulls=True).over(window))
df3.show()
+---+----------+---------+--------------+
| id|row_number|time_diff|new_row_number|
+---+----------+---------+--------------+
| x| 1| 9999| 1|
| x| 2| 120| 1|
| x| 3| 102| 1|
| x| 4| 3000| 4|
| x| 5| 299| 5|
| x| 6| 100| 5|
+---+----------+---------+--------------+
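The two steps can also be modelled in plain Python to see why the forward fill produces the group numbers. This is only a sketch, not Spark code: the function name `forward_fill_group_start` is made up, and time_diff is given as integers here, whereas the example dataframe stores it as strings. It relies on the fact that an ordered window without an explicit frame runs from the start of the partition up to the current row.

```python
def forward_fill_group_start(rows, threshold=160):
    """rows: list of (row_number, time_diff), already sorted by row_number.

    Step 1: mark a row with its own row_number when time_diff >= threshold,
            otherwise None (the when(...) without an otherwise).
    Step 2: replace each None with the most recent non-None marker
            (what last(..., ignorenulls=True) does over the ordered window).
    """
    filled = []
    last_seen = None
    for row_number, time_diff in rows:
        marker = row_number if time_diff >= threshold else None
        if marker is not None:
            last_seen = marker
        filled.append((row_number, time_diff, last_seen))
    return filled


rows = [(1, 9999), (2, 120), (3, 102), (4, 3000), (5, 299), (6, 100)]
result = forward_fill_group_start(rows)
for r in result:
    print(r)
```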
Hope it solves your question.
I have a dataframe in the form:
+---------+-------+------------+------------------+-------+-------+
| quarter | month | month_rank | unique_customers | units | sales |
+---------+-------+------------+------------------+-------+-------+
|       1 |     1 |          1 |               15 |    30 |  1000 |
|       1 |     2 |          2 |               20 |    35 |  1200 |
|       1 |     3 |          3 |               18 |    40 |  1500 |
|       2 |     4 |          1 |               10 |    25 |   800 |
|       2 |     5 |          2 |               25 |    50 |  2000 |
|       2 |     6 |          3 |               28 |    45 |  1800 |
...
I am trying to group on quarter and track the monthly sales in a columnar fashion such as the following:
+-------+-----------+----------------------+-----------+-----------+-----------+----------------------+-----------+-----------+-----------+----------------------+-----------+-----------+
|quarter|month_rank1|rank1_unique_customers|rank1_units|rank1_sales|month_rank2|rank2_unique_customers|rank2_units|rank2_sales|month_rank3|rank3_unique_customers|rank3_units|rank3_sales|
+-------+-----------+----------------------+-----------+-----------+-----------+----------------------+-----------+-----------+-----------+----------------------+-----------+-----------+
|      1|          1|                    15|         30|       1000|          2|                    20|         35|       1200|          3|                    18|         40|       1500|
|      2|          4|                    10|         25|        800|          5|                    25|         50|       2000|          6|                    28|         45|       1800|
+-------+-----------+----------------------+-----------+-----------+-----------+----------------------+-----------+-----------+-----------+----------------------+-----------+-----------+
Is this achievable with multiple pivots? I have had no luck creating multiple columns from a pivot. I am thinking I might be able to achieve this result with windowing, but if anyone has run into a similar problem, any suggestions would be greatly appreciated. Thank you!
Pivot on the month_rank column, then aggregate the other columns.
Example:
df = spark.createDataFrame(
    [(1, 1, 1, 15, 30, 1000), (1, 2, 2, 20, 35, 1200), (1, 3, 3, 18, 40, 1500),
     (2, 4, 1, 10, 25, 800), (2, 5, 2, 25, 50, 2000), (2, 6, 3, 28, 45, 1800)],
    ["quarter", "month", "month_rank", "unique_customers", "units", "sales"])
df.show()
#+-------+-----+----------+----------------+-----+-----+
#|quarter|month|month_rank|unique_customers|units|sales|
#+-------+-----+----------+----------------+-----+-----+
#| 1| 1| 1| 15| 30| 1000|
#| 1| 2| 2| 20| 35| 1200|
#| 1| 3| 3| 18| 40| 1500|
#| 2| 4| 1| 10| 25| 800|
#| 2| 5| 2| 25| 50| 2000|
#| 2| 6| 3| 28| 45| 1800|
#+-------+-----+----------+----------------+-----+-----+
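# Import only what is used; a wildcard import would shadow Python built-ins such as sum and max.
from pyspark.sql.functions import col, first

# One output column per (month_rank value, aggregate) pair.
df1 = df.\
    groupBy("quarter").\
    pivot("month_rank").\
    agg(first(col("month")), first(col("unique_customers")), first(col("units")), first(col("sales")))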
cols=["quarter","month_rank1","rank1_unique_customers","rank1_units","rank1_sales","month_rank2","rank2_unique_customers","rank2_units","rank2_sales","month_rank3","rank3_unique_customers","rank3_units","rank3_sales"]
df1.toDF(*cols).show()
#+-------+-----------+----------------------+-----------+-----------+-----------+----------------------+-----------+-----------+-----------+----------------------+-----------+-----------+
#|quarter|month_rank1|rank1_unique_customers|rank1_units|rank1_sales|month_rank2|rank2_unique_customers|rank2_units|rank2_sales|month_rank3|rank3_unique_customers|rank3_units|rank3_sales|
#+-------+-----------+----------------------+-----------+-----------+-----------+----------------------+-----------+-----------+-----------+----------------------+-----------+-----------+
#| 1| 1| 15| 30| 1000| 2| 20| 35| 1200| 3| 18| 40| 1500|
#| 2| 4| 10| 25| 800| 5| 25| 50| 2000| 6| 28| 45| 1800|
#+-------+-----------+----------------------+-----------+-----------+-----------+----------------------+-----------+-----------+-----------+----------------------+-----------+-----------+
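Because `toDF(*cols)` renames columns by position, the list has to follow the order of the pivoted columns: for each month_rank value in turn, the aggregates in the order they were passed to `agg`. Rather than typing the names out, the list could be generated. A minimal sketch, assuming ranks 1 to 3 as in the example (the helper name `pivot_column_names` is made up for illustration):

```python
def pivot_column_names(ranks, metrics, key="quarter"):
    """Build the renamed column list: the grouping key, then for each rank
    the month column followed by one column per metric."""
    names = [key]
    for rank in ranks:
        names.append(f"month_rank{rank}")
        names.extend(f"rank{rank}_{metric}" for metric in metrics)
    return names


cols = pivot_column_names([1, 2, 3], ["unique_customers", "units", "sales"])
print(cols)
```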
id pid tran1 tran2
1 1,2,3,4 5 3
2 2,4 10 6
3 3 15 9
4 4 20 12
I have the above data set.
I need to aggregate the tran1 and tran2 columns over all the elements in the pid column for a given id. For example, for id=1 I will be summing data from the records with id equal to 1, 2, 3 or 4.
The desired output is:
id pid tran1 tran2
1 1,2,3,4 50 30
2 2,4 30 18
3 3 15 9
4 4 20 12
scala> df.show
+---+-------+-----+-----+
| id| pid|tran1|tran2|
+---+-------+-----+-----+
| 1|1,2,3,4| 5| 3|
| 2| 2,4| 10| 6|
| 3| 3| 15| 9|
| 4| 4| 20| 12|
+---+-------+-----+-----+
scala> val df1 = df.withColumn("pid", explode(split(col("pid"), ",")))
scala> val df2 = df1.alias("df1").join(df.alias("df"), col("df1.pid") === col("df.id"),"left").select(col("df1.id"),col("df1.pid"),col("df.tran1"),col("df.tran2"))
scala> df2.show
+---+---+-----+-----+
| id|pid|tran1|tran2|
+---+---+-----+-----+
| 1| 1| 5| 3|
| 1| 2| 10| 6|
| 1| 3| 15| 9|
| 1| 4| 20| 12|
| 2| 2| 10| 6|
| 2| 4| 20| 12|
| 3| 3| 15| 9|
| 4| 4| 20| 12|
+---+---+-----+-----+
scala> df2.groupBy(col("id")).agg(concat_ws(",",collect_list(col("pid"))).alias("pid"), sum(col("tran1")).alias("tran1"), sum(col("tran2")).alias("tran2")).orderBy(col("id")).show(false)
+---+-------+-----+-----+
|id |pid |tran1|tran2|
+---+-------+-----+-----+
|1 |1,2,3,4|50.0 |30.0 |
|2 |2,4 |30.0 |18.0 |
|3 |3 |15.0 |9.0 |
|4 |4 |20.0 |12.0 |
+---+-------+-----+-----+
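The Scala session above comes without explanation. Its three steps (explode pid into one row per referenced id, join back to the original rows on that id, then sum per id) can be modelled in plain Python. This is only an illustration of the logic on the sample data, not Spark code, and the variable names are made up.

```python
rows = [
    (1, "1,2,3,4", 5, 3),
    (2, "2,4", 10, 6),
    (3, "3", 15, 9),
    (4, "4", 20, 12),
]

# Lookup from id to its own (tran1, tran2); plays the role of the joined copy of df.
by_id = {row_id: (tran1, tran2) for row_id, _, tran1, tran2 in rows}

result = []
for row_id, pid, _, _ in rows:
    # "explode": one referenced id per element of the comma-separated pid
    referenced = [int(p) for p in pid.split(",")]
    # "left join": referenced ids without a matching row add nothing to the sums
    matches = [by_id[p] for p in referenced if p in by_id]
    result.append((row_id, pid, sum(m[0] for m in matches), sum(m[1] for m in matches)))

for r in result:
    print(r)
```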
I have created two data frames by executing the commands below.
test1 = sc.parallelize([
("a",1,1),
("b",2,2),
("d",4,2),
("e",4,1),
("c",3,4)]).toDF(['SID','SSection','SRank'])
test1.show()
+---+--------+-----+
|SID|SSection|SRank|
+---+--------+-----+
| a| 1| 1|
| b| 2| 2|
| d| 4| 2|
| e| 4| 1|
| c| 3| 4|
+---+--------+-----+
test2=sc.parallelize([
("a",1,1),
("b",2,3),
("f",4,2),
("e",4,1),
("c",3,4)]).toDF(['SID','SSection','SRank'])
test2.show()
+---+--------+-----+
|SID|SSection|SRank|
+---+--------+-----+
| a| 1| 1|
| b| 2| 3|
| f| 4| 2|
| e| 4| 1|
| c| 3| 4|
+---+--------+-----+
Using the test1 and test2 data frames, I need to produce a new dataframe containing a result like the one below.
+---+--------+----------+----------+------------+
|SID|SSection|test1SRank|test2SRank|flag        |
+---+--------+----------+----------+------------+
|  a|       1|         1|         1|same_rank   |
|  b|       2|         2|         3|rank_changed|
|  d|       4|         2|         0|No_rank     |
|  e|       4|         1|         1|same_rank   |
|  c|       3|         4|         4|same_rank   |
|  f|       4|         0|         2|new_rank    |
+---+--------+----------+----------+------------+
I want to produce the above result by comparing the test1 and test2 data frames on the combination of the SID and SSection columns and then comparing the ranks.
For example:
1) SID (a) and SSection (1): the rank is 1 in test1 and 1 in test2, so the flag value should be same_rank.
2) SID (b) and SSection (2): the rank is 2 in test1 and 3 in test2, so the rank changed and the flag value should be rank_changed.
3) SID (d) and SSection (4): the rank is 2 in test1, but in test2 he lost his rank, so the flag value should be No_rank.
4) SID (f) and SSection (4): in test1 he did not perform well and has no rank, while in test2 he performed well and his rank is 2, so the flag value should be new_rank.
This should give you what you want:
from pyspark.sql import functions as f

# Join on both key columns so SSection is kept for rows that exist only in test2.
test3 = test1.withColumnRenamed('SRank', 'test1SRank')\
    .join(test2.withColumnRenamed('SRank', 'test2SRank'),
          on=['SID', 'SSection'], how='outer')\
    .fillna(0)
test3 = test3.withColumn('flag', f.expr("""
    case when test1SRank = 0 and test2SRank > 0 then 'new_rank'
         when test1SRank > 0 and test2SRank = 0 then 'No_rank'
         when test1SRank = test2SRank then 'same_rank'
         else 'rank_changed' end"""))
test3.orderBy('SID').show()
Explanation: Outer join the data frames so that every SID present in either one has both a test1 and a test2 rank. Then fill the nulls with 0 and derive the flag with a SQL CASE WHEN expression.
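The flag rules can be checked against the question's sample data without a Spark session. The sketch below models the outer join on the (SID, SSection) pair that the question asks for, using plain dictionaries. The function name `rank_flag` and the dictionary layout are made up for illustration.

```python
def rank_flag(test1_rank, test2_rank):
    """A rank of 0 stands for "no rank", as produced by fillna(0) after the outer join."""
    if test1_rank == 0 and test2_rank > 0:
        return "new_rank"
    if test1_rank > 0 and test2_rank == 0:
        return "No_rank"
    if test1_rank == test2_rank:
        return "same_rank"
    return "rank_changed"


# (SID, SSection) -> SRank, copied from the two sample data frames.
test1 = {("a", 1): 1, ("b", 2): 2, ("d", 4): 2, ("e", 4): 1, ("c", 3): 4}
test2 = {("a", 1): 1, ("b", 2): 3, ("f", 4): 2, ("e", 4): 1, ("c", 3): 4}

# Outer join: every key from either side, with a missing rank replaced by 0.
flags = {
    key: rank_flag(test1.get(key, 0), test2.get(key, 0))
    for key in sorted(set(test1) | set(test2))
}
for key, flag in flags.items():
    print(key, flag)
```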
I am trying to extract and split the data within a PySpark dataframe column and then aggregate it into new columns.
Input Table.
+---+-----------+
| id|description|
+---+-----------+
|  1|  3:2,3|2:1|
|  2|          2|
|  3|    2:12,16|
|  4|    3:2,4,6|
|  5|          2|
|  6|  2:3,7|2:3|
+---+-----------+
Desired Output.
+---+-----------+-------+-----------+
| id|description|sum_emp|org_changed|
+---+-----------+-------+-----------+
|  1|  3:2,3|2:1|      5|          3|
|  2|          2|      2|          0|
|  3|    2:12,16|      2|          2|
|  4|    3:2,4,6|      3|          3|
|  5|          2|      2|          0|
|  6|  2:3,7|2:3|      4|          3|
+---+-----------+-------+-----------+
The values before each ":" are to be summed. The values after each ":" are to be counted. The | marks the boundary between records (it can be ignored).
Some data points are as long as 2:3,4,5|3:4,6,3|4:3,7,8
Any help would be greatly appreciated.
Scenario explained:
Consider the 6th id as an example. The 6 refers to a biz unit id. The 'description' column describes the teams within that unit.
The values 2:3,7|2:3 mean the following:
1) First 2 followed by 3 and 7 = there are 2 people in the team, and one of them has previously spent 3 years and 7 years in other orgs (perhaps this is the second person's first company).
2) Second 2 followed by 3 = there are again 2 people in a sub team, and 1 person has spent 3 years in another org.
Desired output:
sum_emp = total number of employees in that biz unit.
org_changed = total number of organizations that people in that biz unit have changed.
First let's create our dataframe:
df = spark.createDataFrame(
sc.parallelize([[1,"3:2,3|2:1"],
[2,"2"],
[3,"2:12,16"],
[4,"3:2,4,6"],
[5,"2"],
[6,"2:3,7|2:3"]]),
["id","description"])
+---+-----------+
| id|description|
+---+-----------+
| 1| 3:2,3|2:1|
| 2| 2|
| 3| 2:12,16|
| 4| 3:2,4,6|
| 5| 2|
| 6| 2:3,7|2:3|
+---+-----------+
First we'll split the records and explode the resulting array so we only have one record per line:
import pyspark.sql.functions as psf
df = df.withColumn(
    "record",
    # Raw string, so the backslash reaches the regex engine and escapes the pipe.
    psf.explode(psf.split("description", r'\|'))
)
+---+-----------+-------+
| id|description| record|
+---+-----------+-------+
| 1| 3:2,3|2:1| 3:2,3|
| 1| 3:2,3|2:1| 2:1|
| 2| 2| 2|
| 3| 2:12,16|2:12,16|
| 4| 3:2,4,6|3:2,4,6|
| 5| 2| 2|
| 6| 2:3,7|2:3| 2:3,7|
| 6| 2:3,7|2:3| 2:3|
+---+-----------+-------+
Now we'll split records into the number of players and a list of years:
df = df.withColumn(
"record",
psf.split("record", ':')
).withColumn(
"nb_players",
psf.col("record")[0]
).withColumn(
"years",
psf.split(psf.col("record")[1], ',')
)
+---+-----------+----------+----------+---------+
| id|description| record|nb_players| years|
+---+-----------+----------+----------+---------+
| 1| 3:2,3|2:1| [3, 2,3]| 3| [2, 3]|
| 1| 3:2,3|2:1| [2, 1]| 2| [1]|
| 2| 2| [2]| 2| null|
| 3| 2:12,16|[2, 12,16]| 2| [12, 16]|
| 4| 3:2,4,6|[3, 2,4,6]| 3|[2, 4, 6]|
| 5| 2| [2]| 2| null|
| 6| 2:3,7|2:3| [2, 3,7]| 2| [3, 7]|
| 6| 2:3,7|2:3| [2, 3]| 2| [3]|
+---+-----------+----------+----------+---------+
Finally, we want to sum for each id the number of players and the length of years:
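# Keep the aggregated dataframe in its own variable; show() returns None,
# so chaining it onto the assignment would overwrite df with None.
result = df.withColumn(
    "years_size",
    # Rows where years is null fall through to 0.
    psf.when(psf.size("years") > 0, psf.size("years")).otherwise(0)
).groupby("id").agg(
    psf.first("description").alias("description"),
    # nb_players is still a string here, so the sum comes back as a double.
    psf.sum("nb_players").alias("sum_emp"),
    psf.sum("years_size").alias("org_changed")
).sort("id")
result.show()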
+---+-----------+-------+-----------+
| id|description|sum_emp|org_changed|
+---+-----------+-------+-----------+
| 1| 3:2,3|2:1| 5.0| 3|
| 2| 2| 2.0| 0|
| 3| 2:12,16| 2.0| 2|
| 4| 3:2,4,6| 3.0| 3|
| 5| 2| 2.0| 0|
| 6| 2:3,7|2:3| 4.0| 3|
+---+-----------+-------+-----------+
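The parsing rule itself (sum the number before each ":", count the comma-separated values after it, across all "|"-separated records) can be checked on the sample strings in plain Python. This is a sketch of the rule only, not a replacement for the Spark pipeline, and the function name `summarize` is made up for illustration.

```python
def summarize(description):
    """Return (sum_emp, org_changed) for one description string."""
    sum_emp = 0
    org_changed = 0
    for record in description.split("|"):
        # head is the team size; tail is the comma-separated list of years (may be empty)
        head, _, tail = record.partition(":")
        sum_emp += int(head)
        if tail:
            org_changed += len(tail.split(","))
    return sum_emp, org_changed


descriptions = {
    1: "3:2,3|2:1",
    2: "2",
    3: "2:12,16",
    4: "3:2,4,6",
    5: "2",
    6: "2:3,7|2:3",
}
result = {row_id: summarize(text) for row_id, text in descriptions.items()}
for row_id, totals in result.items():
    print(row_id, totals)
```

Unlike the Spark version, the sums here stay integers because the team size is converted with `int` before adding.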