Mapping column from arrays in Pyspark - apache-spark

I'm new to working with Pyspark df when there are arrays stored in columns and looking for some help in trying to map a column based on 2 PySpark Dataframes with one being a reference df.
Reference Dataframe (Number of Subgroups varies for each Group):
| Group | Subgroup | Size | Type |
| ---- | -------- | ------------------| --------------- |
|A | A1 |['Small','Medium'] | ['A','B'] |
|A | A2 |['Small','Medium'] | ['C','D'] |
|B | B1 |['Small'] | ['A','B','C','D']|
Source Dataframe:
| ID | Size | Type |
| ---- | -------- | ---------|
|ID_001 | 'Small' |'A' |
|ID_002 | 'Medium' |'B' |
|ID_003 | 'Small' |'D' |
In the result, each ID belongs to every Group, but is exclusive for its' subgroups based on the reference df with the result looking something like this:
| ID | Size | Type | A_Subgroup | B_Subgroup |
| ---- | -------- | ---------| ---------- | ------------- |
|ID_001 | 'Small' |'A' | 'A1' | 'B1' |
|ID_002 | 'Medium' |'B' | 'A1' | Null |
|ID_003 | 'Small' |'D' | 'A2' | 'B1' |

You can do a join using array_contains conditions, and pivot the result:
import pyspark.sql.functions as F
result = source.alias('source').join(
ref.alias('ref'),
F.expr("""
array_contains(ref.Size, source.Size) and
array_contains(ref.Type, source.Type)
"""),
'left'
).groupBy(
'ID', source['Size'], source['Type']
).pivot('Group').agg(F.first('Subgroup'))
result.show()
+------+------+----+---+----+
| ID| Size|Type| A| B|
+------+------+----+---+----+
|ID_003| Small| D| A2| B1|
|ID_002|Medium| B| A1|null|
|ID_001| Small| A| A1| B1|
+------+------+----+---+----+

Related

pyspark detect change of categorical variable

I have a spark dataframe consisting of two columns.
+-----------------------+-----------+
| Metric|Recipe_name|
+-----------------------+-----------+
| 100. | A |
| 200. | A |
| 300. | A |
| 10. | A |
| 20. | A |
| 10. | B |
| 20. | B |
| 10. | A |
| 20. | A |
| .. | .. |
| .. | .. |
| 10. | B |
The dataframe is time ordered ( you can imagine there is a increasing timestamp column ). I need to add a column 'Cycles'. There are two scenarios when I say a new cycle begins :
If the same recipe is running lets say recipe 'A', and the value of Metric decreases (with respect to the last row) then a new cycle begins.
Lets say we switch from current recipe 'A' to second recipe 'B' and switch back to recipe 'A' we say a new cycle for recipe 'A' has begun.
So in the end i would like to have a column 'Cycle' which looks like this :
+-----------------------+-----------+-----------+
| Metric|Recipe_name| Cycle|
+-----------------------+-----------+-----------+
| 100. | A | 0 |
| 200. | A | 0 |
| 300. | A | 0 |
| 10. | A | 1 |
| 20. | A | 1 |
| 10. | B | 0 |
| 20. | B | 0 |
| 10. | A | 2 |
| 20. | A | 2 |
| .. | .. | 2 |
| .. | .. | 2 |
| 10. | B | 1 |
So it means recipe A has cycle 0 then metric decreases and cycle changes to 1.
Then a new recipe starts B so it has a new cycle 0.
Then again we get back to recipe A we say a new cycle begins for recipe A and with respect to last cycle number it has cycle 2 ( and similarly for recipe B).
In total there are 200 recipes.
Thanks for the help.
Replace my order column to your ordering column. Compare your condition by using lag function where the Recipe_name column is being partitioned.
w = Window.partitionBy('Recipe_name').orderBy('order')
df.withColumn('Cycle', when(col('Metric') < lag('Metric', 1, 0).over(w), 1).otherwise(0)) \
.withColumn('Cycle', sum('Cycle').over(w)) \
.orderBy('order') \
.show()
+------+-----------+-----+
|Metric|Recipe_name|Cycle|
+------+-----------+-----+
| 100| A| 0|
| 200| A| 0|
| 300| A| 0|
| 10| A| 1|
| 20| A| 1|
| 10| B| 0|
| 20| B| 0|
| 10| A| 2|
| 20| A| 2|
| 10| B| 1|
+------+-----------+-----+

Append a monotonically increasing id column that increases on column value match

I am ingesting a dataframe and I want to append a monotonically increasing column that increases whenever another column matches a certain value. For example I have the following table
+------+-------+
| Col1 | Col2 |
+------+-------+
| B | 543 |
| A | 1231 |
| B | 14234 |
| B | 34234 |
| B | 3434 |
| A | 43242 |
| B | 43242 |
| B | 56453 |
+------+-------+
I would like to append a column that increases in value whenever "A" in col1 is present. So the result would look like
+------+-------+------+
| Col1 | Col2 | Col3 |
+------+-------+------+
| B | 543 | 0 |
| A | 1231 | 1 |
| B | 14234 | 1 |
| B | 34234 | 1 |
| B | 3434 | 1 |
| A | 43242 | 2 |
| B | 43242 | 2 |
| B | 56453 | 2 |
+------+-------+------+
Keeping the initial order is important.
I tried zippering but that doesn't seem to produce the right result. Splitting it up into individual seqs manually and doing it that way is not going to be performant enough (think 100+ GB tables).
I looked into trying this with a map function that would keep a counter somewhere but couldn't get that to work.
Any advice or pointer in the right direction would be greatly appreciated.
spark does not provide any default functions to achieve this kind of functionality
I would do like to do most probably in this way
//inputDF contains Col1 | Col2
val df = inputDF.select("Col1").distinct.rdd.zipWithIndex().toDF("Col1","Col2")
val finalDF = inputDF.join(df,df("Col1") === inputDF("Col1"),"left").select(inputDF("*"),"Col3")
but the problem here I can see is (join which will result in the shuffle).
you can also check other autoincrement API's here.
Use window and sum over the window of the value 1 when Col1 = A.
import pyspark.sql.functions as f
from pyspark.sql import Window
w = Window.partitionBy().rowsBetween(Window.unboundedPreceding, Window.currentRow)
df.withColumn('Col3', f.sum(f.when(f.col('Col1') == f.lit('A'), 1).otherwise(0)).over(w)).show()
+----+-----+----+
|Col1| Col2|Col3|
+----+-----+----+
| B| 543| 0|
| A| 1231| 1|
| B|14234| 1|
| B|34234| 1|
| B| 3434| 1|
| A|43242| 2|
| B|43242| 2|
| B|56453| 2|
+----+-----+----+

Combine dataframes columns consisting of multiple values - Spark

I have two Spark dataframes that share the same ID column:
df1:
+------+---------+---------+
|ID | Name1 | Name2 |
+------+---------+---------+
| 1 | A | B |
| 2 | C | D |
| 3 | E | F |
+------+---------+---------+
df2:
+------+-------+
|ID | key |
+------+-------+
| 1 | w |
| 1 | x |
| 2 | y |
| 3 | z |
+------+-------+
Now, I want to create a new column in df1 that contains all key values denoted in df2. So, I aim for the result:
+------+---------+---------+---------+
|ID | Name1 | Name2 | keys |
+------+---------+---------+---------+
| 1 | A | B | w,x |
| 2 | C | D | y |
| 3 | E | F | z |
+------+---------+---------+---------+
Ultimately, I want to find a solution for an arbitrary amount of keys.
My attempt in PySpark:
def get_keys(id):
x = df2.where(df2.ID == id).select('key')
return x
df_keys = df1.withColumn("keys", get_keys(col('ID')))
In the above code, x is a dataframe. Since the second argument of the .withColumn function needs to be an Column type variable, I am not sure how to mutate x correctly.
You are looking for collect_list function.
from pyspark.sql.functions import collect_list
df3 = df1.join(df2, df1.ID == df2.ID).drop(df2.ID)
df3.groupBy('ID','Name1','Name2').agg(collect_list('key').alias('keys')).show()
#+---+-----+-----+------+
#| ID|Name1|Name2| keys|
#+---+-----+-----+------+
#| 1| A| B|[w, x]|
#| 3| C| F| [z]|
#| 2| B| D| [y]|
#+---+-----+-----+------+
If you want only unique keys you can use collect_set

Possibility to use lead and lag functions together with groupBy in Spark

I'm interesting is there a way to use lead\lag to count something like this
First step: i have a dataframe
+----+-----------+------+
| id | timestamp | sess |
+----+-----------+------+
| xx | 1 | A |
+----+-----------+------+
| yy | 2 | A |
+----+-----------+------+
| zz | 1 | B |
+----+-----------+------+
| yy | 3 | B |
+----+-----------+------+
| tt | 4 | B |
+----+-----------+------+
And i want to collect id's that is previous to particular id partitioning by session_id
+----+---------+
| id | id_list |
+----+---------+
| yy | [xx,zz] |
+----+---------+
| xx | [] |
+----+---------+
| zz | [] |
+----+---------+
| tt | [yy] |
+----+---------+
You can create a window over the column sess and lag the IDs as you mentioned in the question. Then you can use groupBy with the aggregate function collect_list to get the output.
import org.apache.spark.sql.expressions.Window
val w = Window.partitionBy($"sess").orderBy($"timestamp")
val df1 = df.withColumn("lagged", lag($"id", 1).over(w))
df1.select("id", "lagged").groupBy($"id").agg(collect_list($"lagged").as("id_list")).show
//+---+--------------------+
//| id| id_list|
//+---+--------------------+
//| tt| [yy]|
//| xx| []|
//| zz| []|
//| yy| [zz, xx]|
//+---+--------------------+

Randomly Split DataFrame by Unique Values in One Column

I have a pyspark DataFrame like the following:
+--------+--------+-----------+
| col1 | col2 | groupId |
+--------+--------+-----------+
| val11 | val21 | 0 |
| val12 | val22 | 1 |
| val13 | val23 | 2 |
| val14 | val24 | 0 |
| val15 | val25 | 1 |
| val16 | val26 | 1 |
+--------+--------+-----------+
Each row has a groupId and multiple rows can have the same groupId.
I want to randomly split this data into two datasets. But all the data having a particular groupId must be in one of the splits.
This means that if d1.groupId = d2.groupId, then d1 and d2 are in the same split.
For example:
# Split 1:
+--------+--------+-----------+
| col1 | col2 | groupId |
+--------+--------+-----------+
| val11 | val21 | 0 |
| val13 | val23 | 2 |
| val14 | val24 | 0 |
+--------+--------+-----------+
# Split 2:
+--------+--------+-----------+
| col1 | col2 | groupId |
+--------+--------+-----------+
| val12 | val22 | 1 |
| val15 | val25 | 1 |
| val16 | val26 | 1 |
+--------+--------+-----------+
What is the good way to do it on PySpark? Can I use the randomSplit method somehow?
You can use randomSplit to split just the distinct groupIds, and then use the results to split the source DataFrame using join.
For example:
split1, split2 = df.select("groupId").distinct().randomSplit(weights=[0.5, 0.5], seed=0)
split1.show()
#+-------+
#|groupId|
#+-------+
#| 1|
#+-------+
split2.show()
#+-------+
#|groupId|
#+-------+
#| 0|
#| 2|
#+-------+
Now join these back to the original DataFrame:
df1 = df.join(split1, on="groupId", how="inner")
df2 = df.join(split2, on="groupId", how="inner")
df1.show()
3+-------+-----+-----+
#|groupId| col1| col2|
#+-------+-----+-----+
#| 1|val12|val22|
#| 1|val15|val25|
#| 1|val16|val26|
#+-------+-----+-----+
df2.show()
#+-------+-----+-----+
#|groupId| col1| col2|
#+-------+-----+-----+
#| 0|val11|val21|
#| 0|val14|val24|
#| 2|val13|val23|
#+-------+-----+-----+

Resources