How to convert a PySpark pipeline RDD (tuple inside tuple) into a DataFrame? - python-3.x

I have a PySpark pipeline RDD like below:
(1, ([1,2,3,4], [5,3,4,5]))
(2, ([1,2,4,5], [4,5,6,7]))
I want to generate a DataFrame like below:
Id sid cid
1 1 5
1 2 3
1 3 4
1 4 5
2 1 4
2 2 5
2 4 6
2 5 7
Please help me with this.

If you have an RDD like this one,
rdd = sc.parallelize([
    (1, ([1,2,3,4], [5,3,4,5])),
    (2, ([1,2,4,5], [4,5,6,7]))
])
I would just use RDDs:
rdd.flatMap(lambda rec:
    ((rec[0], sid, cid) for sid, cid in zip(rec[1][0], rec[1][1]))
).toDF(["id", "sid", "cid"]).show()
# +---+---+---+
# | id|sid|cid|
# +---+---+---+
# | 1| 1| 5|
# | 1| 2| 3|
# | 1| 3| 4|
# | 1| 4| 5|
# | 2| 1| 4|
# | 2| 2| 5|
# | 2| 4| 6|
# | 2| 5| 7|
# +---+---+---+
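If you prefer to stay in the DataFrame API, here is a minimal sketch of the same transformation using arrays_zip and explode (assuming Spark 2.4+; the intermediate names lists, sids, cids and pair are illustrative, not from the original post):
from pyspark.sql.functions import arrays_zip, col, explode

# pull the two inner lists out of the tuple, zip them element-wise,
# then explode one row per (sid, cid) pair
df = (rdd.toDF(["id", "lists"])
      .select("id",
              col("lists._1").alias("sids"),
              col("lists._2").alias("cids"))
      .select("id", explode(arrays_zip("sids", "cids")).alias("pair"))
      .select("id", col("pair.sids").alias("sid"), col("pair.cids").alias("cid")))
df.show()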

Related

How to group by consecutive 1s in a column in PySpark and keep groups with a specific size

I want to split my PySpark dataframe into groups with a monotonically increasing trend and keep the groups with size greater than 10.
Here is the part of the code I tried:
from pyspark.sql import functions as F, Window

df = df1.withColumn(
    "FLAG_INCREASE",
    F.when(
        F.col("x")
        > F.lag("x").over(Window.partitionBy("x1").orderBy("time")),
        1,
    ).otherwise(0),
)
I don't know how to group by consecutive ones in PySpark. If anyone has a better solution for this, please share.
In pandas we can do the same thing like this:
df = df1.groupby((df1['x'].diff() < 0).cumsum())
How do I convert this code to PySpark?
example dataframe:
x
0 1
1 2
2 2
3 2
4 3
5 3
6 4
7 5
8 4
9 3
10 2
11 1
12 2
13 3
14 4
15 5
16 5
17 6
expected output
group1:
x
0 1
1 2
2 2
3 2
4 3
5 3
6 4
7 5
group2:
x
0 1
1 2
2 3
3 4
4 5
5 5
6 6
I'll map out all the steps (and keep all columns in the output) to replicate (df1['x'].diff() < 0).cumsum(), which is easy to calculate using a lag.
However, it is important that your data has an ID column that records the order of the dataframe, because unlike pandas, Spark does not retain a dataframe's sorting (due to its distributed nature). For this example, I've assumed that your data has an ID column named idx, which is the index you printed in your example input.
# input data
data_sdf.show(5)
# +---+---+
# |idx|val|
# +---+---+
# | 0| 1|
# | 1| 2|
# | 2| 2|
# | 3| 2|
# | 4| 3|
# +---+---+
# only showing top 5 rows
# imports used in this answer (it aliases functions as `func` and Window as `wd`)
import sys
from pyspark.sql import functions as func
from pyspark.sql.window import Window as wd

# calculating the group column
data_sdf. \
    withColumn('diff_with_prevval',
               func.col('val') - func.lag('val').over(wd.orderBy('idx'))
               ). \
    withColumn('diff_lt_0',
               func.coalesce((func.col('diff_with_prevval') < 0).cast('int'), func.lit(0))
               ). \
    withColumn('diff_lt_0_cumsum',
               func.sum('diff_lt_0').over(wd.orderBy('idx').rowsBetween(-sys.maxsize, 0))
               ). \
    show()
# +---+---+-----------------+---------+----------------+
# |idx|val|diff_with_prevval|diff_lt_0|diff_lt_0_cumsum|
# +---+---+-----------------+---------+----------------+
# | 0| 1| null| 0| 0|
# | 1| 2| 1| 0| 0|
# | 2| 2| 0| 0| 0|
# | 3| 2| 0| 0| 0|
# | 4| 3| 1| 0| 0|
# | 5| 3| 0| 0| 0|
# | 6| 4| 1| 0| 0|
# | 7| 5| 1| 0| 0|
# | 8| 4| -1| 1| 1|
# | 9| 3| -1| 1| 2|
# | 10| 2| -1| 1| 3|
# | 11| 1| -1| 1| 4|
# | 12| 2| 1| 0| 4|
# | 13| 3| 1| 0| 4|
# | 14| 4| 1| 0| 4|
# | 15| 5| 1| 0| 4|
# | 16| 5| 0| 0| 4|
# | 17| 6| 1| 0| 4|
# +---+---+-----------------+---------+----------------+
You can now use the diff_lt_0_cumsum column in your groupBy() for further use.
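To finish the original requirement of keeping only groups with more than 10 rows, here is a minimal sketch (assuming the chained result above is stored in a variable, here called grouped_sdf):
# compute each group's size with a window over the group id, then keep rows of large groups only
result_sdf = grouped_sdf. \
    withColumn('group_size', func.count('*').over(wd.partitionBy('diff_lt_0_cumsum'))). \
    filter(func.col('group_size') > 10). \
    drop('group_size')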

Loop 3 times and add a new value each time to a new column in a Spark DF

I want to create 3 rows for every row in a PySpark DF. I want to add a new column called loopvar = (val1, val2, val3). A different value must be added in each loop. Any idea how to do it?
Original:
a b c
1 2 3
1 2 3
Condition 1: loop = 1 and b is not null, then loopvar = val1
Condition 2: loop = 2 and b is not null, then loopvar = val2
Condition 3: loop = 3 and c is not null, then loopvar = val3
Output :
a b c loopvar
1 2 3 val1
1 2 3 val1
1 2 3 val2
1 2 3 val2
1 2 3 val3
1 2 3 val3
Use a crossJoin:
df = spark.createDataFrame([[1,2,3], [1,2,3]]).toDF('a','b','c')
df.show()
+---+---+---+
| a| b| c|
+---+---+---+
| 1| 2| 3|
| 1| 2| 3|
+---+---+---+
df2 = spark.createDataFrame([['val1'], ['val2'], ['val3']]).toDF('loopvar')
df2.show()
+-------+
|loopvar|
+-------+
| val1|
| val2|
| val3|
+-------+
df3 = df.crossJoin(df2)
df3.show()
+---+---+---+-------+
| a| b| c|loopvar|
+---+---+---+-------+
| 1| 2| 3| val1|
| 1| 2| 3| val2|
| 1| 2| 3| val3|
| 1| 2| 3| val1|
| 1| 2| 3| val2|
| 1| 2| 3| val3|
+---+---+---+-------+
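The question's conditions also check that b and c are not null. A sketch of layering those checks on top of the crossJoin result (the exact mapping of conditions to values is an assumption based on the question's wording):
from pyspark.sql import functions as F

# keep val1/val2 rows only when b is not null, and val3 rows only when c is not null
df4 = df3.filter(
    (F.col('loopvar').isin('val1', 'val2') & F.col('b').isNotNull())
    | ((F.col('loopvar') == 'val3') & F.col('c').isNotNull())
)
df4.show()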

Keep track of the previous row values with additional condition using pyspark

I'm using PySpark to generate a dataframe where I need to update the 'amt' column with the previous row's 'amt' value, but only when amt = 0.
For example, below is my dataframe
+---+-----+
| id|amt |
+---+-----+
| 1| 5|
| 2| 0|
| 3| 0|
| 4| 6|
| 5| 0|
| 6| 3|
+---+-----+
Now, I want the following DF to be created: whenever amt = 0, the modi_amt column will contain the previous row's non-zero value; otherwise, no change.
+---+-----+----------+
| id|amt |modi_amt |
+---+-----+----------+
| 1| 5| 5|
| 2| 0| 5|
| 3| 0| 5|
| 4| 6| 6|
| 5| 0| 6|
| 6| 3| 3|
+---+-----+----------+
I'm able to get the previous row's value but need help for the rows where multiple 0 amt values appear consecutively (for example, id = 2, 3).
The code I'm using:
from pyspark.sql import functions as F
from pyspark.sql.functions import when
from pyspark.sql.window import Window

my_window = Window.partitionBy().orderBy("id")
DF = DF.withColumn("prev_amt", F.lag(DF.amt).over(my_window))
DF = DF.withColumn("modi_amt", when(DF.amt == 0, DF.prev_amt).otherwise(DF.amt)).drop('prev_amt')
I'm getting the below DF
+---+-----+----------+
| id|amt |modi_amt |
+---+-----+----------+
| 1| 5| 5|
| 2| 0| 5|
| 3| 0| 0|
| 4| 6| 6|
| 5| 0| 6|
| 6| 3| 3|
+---+-----+----------+
Basically, id 3 should also have modi_amt = 5.
I've used the approach below to get the output and it's working fine:
from pyspark.sql import functions as F
from pyspark.sql.functions import when, lit, last
from pyspark.sql.window import Window

my_window = Window.partitionBy().orderBy("id")
# this will hold the previous row's amt value
DF = DF.withColumn("prev_amt", F.lag(DF.amt).over(my_window))
# this will replace an amt of 0 with the previous row's value, but not for consecutive rows having 0 amt
DF = DF.withColumn("amt_adjusted", when(DF.amt == 0, DF.prev_amt).otherwise(DF.amt))
# set null for the rows where both amt and amt_adjusted are 0 (logic for consecutive rows having 0 amt)
DF = DF.withColumn('zeroNonZero', when((DF.amt == 0) & (DF.amt_adjusted == 0), lit(None)).otherwise(DF.amt_adjusted))
# replace all null values with the previous non-zero amt row value
DF = DF.withColumn('modi_amt', last("zeroNonZero", ignorenulls=True).over(Window.orderBy("id").rowsBetween(Window.unboundedPreceding, 0)))
Is there any other better approach?
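A shorter alternative (a sketch, not from the original post) is to null out the zeros first and then forward-fill in a single pass with last(..., ignorenulls=True):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.orderBy("id").rowsBetween(Window.unboundedPreceding, 0)
# keep amt where it is non-zero, otherwise null, then carry the last non-null value forward
DF = DF.withColumn(
    "modi_amt",
    F.last(F.when(F.col("amt") != 0, F.col("amt")), ignorenulls=True).over(w)
)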

Spark generate occurrence matrix

I have the input transactions shown below:
apples,mangos,eggs
milk,oranges,eggs
milk, cereals
mango,apples
I have to generate a Spark dataframe of the co-occurrence matrix like this:
apple mango milk cereals eggs
apple 2 2 0 0 1
mango 2 2 0 0 1
milk 0 0 2 1 1
cereals 0 0 1 1 0
eggs 1 1 1 0 2
Apples and mangos are bought together twice, so matrix[apple][mango] = 2.
I am stuck on how to implement this. Any suggestions would be a great help. I am using PySpark.
If data looks like this:
df = spark.createDataFrame(
    ["apples,mangos,eggs", "milk,oranges,eggs", "milk,cereals", "mangos,apples"],
    "string"
).toDF("basket")
Import the required functions:
from pyspark.sql.functions import split, explode, monotonically_increasing_id
Split and explode:
long = (df
    .withColumn("id", monotonically_increasing_id())
    .select("id", explode(split("basket", ","))))
Self-join and crosstab:
long.withColumnRenamed("col", "col_").join(long, ["id"]).stat.crosstab("col_", "col").show()
# +--------+------+-------+----+------+----+-------+
# |col__col|apples|cereals|eggs|mangos|milk|oranges|
# +--------+------+-------+----+------+----+-------+
# | cereals| 0| 1| 0| 0| 1| 0|
# | eggs| 1| 0| 2| 1| 1| 1|
# | milk| 0| 1| 1| 0| 2| 1|
# | mangos| 2| 0| 1| 2| 0| 0|
# | apples| 2| 0| 1| 2| 0| 0|
# | oranges| 0| 0| 1| 0| 1| 1|
# +--------+------+-------+----+------+----+-------+
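If you want the matrix rows in a stable order (the answer above just calls show() on the chained result), a small optional tidy-up is to capture the crosstab in a variable (matrix here is just an illustrative name) and sort by the label column:
# same self-join and crosstab as above, with the rows sorted for readability
matrix = (long.withColumnRenamed("col", "col_")
          .join(long, ["id"])
          .stat.crosstab("col_", "col")
          .orderBy("col__col"))
matrix.show()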

How to compare two dataframes and add new flag column in pyspark?

I have created two dataframes by executing the commands below.
test1 = sc.parallelize([
    ("a",1,1),
    ("b",2,2),
    ("d",4,2),
    ("e",4,1),
    ("c",3,4)]).toDF(['SID','SSection','SRank'])
test1.show()
+---+--------+-----+
|SID|SSection|SRank|
+---+--------+-----+
| a| 1| 1|
| b| 2| 2|
| d| 4| 2|
| e| 4| 1|
| c| 3| 4|
+---+--------+-----+
test2 = sc.parallelize([
    ("a",1,1),
    ("b",2,3),
    ("f",4,2),
    ("e",4,1),
    ("c",3,4)]).toDF(['SID','SSection','SRank'])
test2.show()
+---+--------+-----+
|SID|SSection|SRank|
+---+--------+-----+
| a| 1| 1|
| b| 2| 3|
| f| 4| 2|
| e| 4| 1|
| c| 3| 4|
+---+--------+-----+
Using the test1 and test2 dataframes, I need to produce a new dataframe which should contain the result below.
+---+--------+----------+------------+------------+
|SID|SSection|test1SRank|test2SRank | flag |
+---+--------+----------+------------+------------+
| a| 1| 1 | 1 | same_rank |
| b| 2| 2 | 3 |rank_changed|
| d| 4| 2 | 0 |No_rank |
| e| 4| 1 | 1 |same_rank |
| c| 3| 4 | 4 |same_rank |
| f| 4| 0 | 2 |new_rank |
+---+--------+----------+------------+------------+
I want to produce the above result by comparing the test1 and test2 dataframes on the combination of columns SID and SSection, and by comparing the ranks.
For example:
1) SID (a) and SSection (1): the test1 rank is 1 and the test2 rank is 1, so my flag value should be same_rank.
2) SID (b) and SSection (2): the test1 rank is 2 and the test2 rank is 3; the rank changed, so my flag value should be rank_changed.
3) SID (d) and SSection (4): the test1 rank is 2 and in test2 he lost his rank, so my flag value should be No_rank.
4) SID (f) and SSection (4): in test1 he did not perform well so he has no rank, and in test2 he performed well and his rank is 2, so my flag value should be new_rank.
This should give you what you want:
from pyspark.sql import functions as f
test3 = test1.withColumnRenamed('SRank', 'test1SRank')\
    .join(test2.drop('SSection')\
          .withColumnRenamed('SRank', 'test2SRank'), on='SID', how='outer')\
    .fillna(0)
test3 = test3.withColumn('flag', f.expr("case when test1SRank=0 and test2SRank>0 then 'new_rank' \
                                               when test1SRank>0 and test2SRank=0 then 'No_rank' \
                                               when test1SRank=test2SRank then 'same_rank' \
                                               else 'rank_changed' end"))
test3.orderBy('SID').show()
Explanation: Outer join the two dataframes so you have test1 and test2 ranks for all SIDs. Then fill the nulls with 0 and apply the flag logic with a SQL CASE WHEN expression.
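The same flag logic can also be written with when/otherwise instead of a SQL expression, if you prefer to stay in the column API (a sketch; the behaviour should match the expr-based version above):
test3 = test3.withColumn(
    'flag',
    f.when((f.col('test1SRank') == 0) & (f.col('test2SRank') > 0), 'new_rank')
     .when((f.col('test1SRank') > 0) & (f.col('test2SRank') == 0), 'No_rank')
     .when(f.col('test1SRank') == f.col('test2SRank'), 'same_rank')
     .otherwise('rank_changed')
)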
