Pyspark agg function to "explode" rows into columns - apache-spark

Basically, I have a dataframe that looks like this:
+----+-------+------+------+
| id | index | col1 | col2 |
+----+-------+------+------+
| 1 | a | a11 | a12 |
+----+-------+------+------+
| 1 | b | b11 | b12 |
+----+-------+------+------+
| 2 | a | a21 | a22 |
+----+-------+------+------+
| 2 | b | b21 | b22 |
+----+-------+------+------+
and my desired output is this:
+----+--------+--------+--------+--------+
| id | col1_a | col1_b | col2_a | col2_b |
+----+--------+--------+--------+--------+
| 1 | a11 | b11 | a12 | b12 |
+----+--------+--------+--------+--------+
| 2 | a21 | b21 | a22 | b22 |
+----+--------+--------+--------+--------+
So basically I want to "explode" the index column into new columns after I group by id. By the way, every id appears the same number of times and has the same set of index values. I'm using PySpark.

Using pivot you can achieve the desired output:
from pyspark.sql import functions as F
df = spark.createDataFrame([[1,"a","a11","a12"],[1,"b","b11","b12"],[2,"a","a21","a22"],[2,"b","b21","b22"]],["id","index","col1","col2"])
df.show()
+---+-----+----+----+
| id|index|col1|col2|
+---+-----+----+----+
| 1| a| a11| a12|
| 1| b| b11| b12|
| 2| a| a21| a22|
| 2| b| b21| b22|
+---+-----+----+----+
Using pivot:
df3 = df.groupBy("id").pivot("index").agg(F.first(F.col("col1")), F.first(F.col("col2")))
collist = ["id", "col1_a", "col2_a", "col1_b", "col2_b"]
Rename the columns:
df3.toDF(*collist).show()
+---+------+------+------+------+
| id|col1_a|col2_a|col1_b|col2_b|
+---+------+------+------+------+
| 1| a11| a12| b11| b12|
| 2| a21| a22| b21| b22|
+---+------+------+------+------+
Note: rearrange the columns based on your requirements.
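If you'd rather avoid renaming by position, you can alias each aggregation so the pivot produces predictable column names; a small sketch, assuming Spark's default naming for pivoted aggregations (pivot value, underscore, aggregation alias):
from pyspark.sql import functions as F

# Aliased aggregations yield columns named a_col1, a_col2, b_col1, b_col2,
# which can then be renamed explicitly instead of by position.
df4 = df.groupBy("id").pivot("index").agg(
    F.first("col1").alias("col1"),
    F.first("col2").alias("col2"),
)
df4.select(
    "id",
    F.col("a_col1").alias("col1_a"),
    F.col("b_col1").alias("col1_b"),
    F.col("a_col2").alias("col2_a"),
    F.col("b_col2").alias("col2_b"),
).show()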

Related

Append a monotonically increasing id column that increases on column value match

I am ingesting a dataframe and I want to append a monotonically increasing column that increases whenever another column matches a certain value. For example, I have the following table:
+------+-------+
| Col1 | Col2 |
+------+-------+
| B | 543 |
| A | 1231 |
| B | 14234 |
| B | 34234 |
| B | 3434 |
| A | 43242 |
| B | 43242 |
| B | 56453 |
+------+-------+
I would like to append a column that increases in value whenever "A" is present in Col1, so the result would look like:
+------+-------+------+
| Col1 | Col2 | Col3 |
+------+-------+------+
| B | 543 | 0 |
| A | 1231 | 1 |
| B | 14234 | 1 |
| B | 34234 | 1 |
| B | 3434 | 1 |
| A | 43242 | 2 |
| B | 43242 | 2 |
| B | 56453 | 2 |
+------+-------+------+
Keeping the initial order is important.
I tried zippering but that doesn't seem to produce the right result. Splitting it up into individual seqs manually and doing it that way is not going to be performant enough (think 100+ GB tables).
I looked into trying this with a map function that would keep a counter somewhere but couldn't get that to work.
Any advice or pointer in the right direction would be greatly appreciated.
Spark does not provide any built-in function to achieve this kind of functionality.
I would most probably do it this way:
// inputDF contains Col1 | Col2
val df = inputDF.select("Col1").distinct.rdd.zipWithIndex()
  .map { case (row, idx) => (row.getString(0), idx) }
  .toDF("Col1", "Col3")
val finalDF = inputDF.join(df, df("Col1") === inputDF("Col1"), "left")
  .select(inputDF("*"), df("Col3"))
But the problem I can see here is that the join will result in a shuffle.
You can also check Spark's other auto-increment APIs, such as monotonically_increasing_id.
Use a window and take a cumulative sum over it of the value 1 whenever Col1 = 'A'.
import pyspark.sql.functions as f
from pyspark.sql import Window
w = Window.partitionBy().rowsBetween(Window.unboundedPreceding, Window.currentRow)
df.withColumn('Col3', f.sum(f.when(f.col('Col1') == f.lit('A'), 1).otherwise(0)).over(w)).show()
+----+-----+----+
|Col1| Col2|Col3|
+----+-----+----+
| B| 543| 0|
| A| 1231| 1|
| B|14234| 1|
| B|34234| 1|
| B| 3434| 1|
| A|43242| 2|
| B|43242| 2|
| B|56453| 2|
+----+-----+----+
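The window above has no partitionBy key and no orderBy, so it moves everything to a single partition and relies on the order in which the rows arrive. If you want to pin down the original ingestion order explicitly, one option is to capture an ordering column first; a sketch, assuming monotonically_increasing_id reflects the original row order of the ingested DataFrame:
import pyspark.sql.functions as f
from pyspark.sql import Window

# Capture the ingestion order, run the cumulative sum ordered by it, then drop the helper column.
df_ordered = df.withColumn('row_order', f.monotonically_increasing_id())
w = Window.orderBy('row_order').rowsBetween(Window.unboundedPreceding, Window.currentRow)
df_ordered.withColumn(
    'Col3',
    f.sum(f.when(f.col('Col1') == 'A', 1).otherwise(0)).over(w)
).drop('row_order').show()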

Combine dataframes columns consisting of multiple values - Spark

I have two Spark dataframes that share the same ID column:
df1:
+------+---------+---------+
|ID | Name1 | Name2 |
+------+---------+---------+
| 1 | A | B |
| 2 | C | D |
| 3 | E | F |
+------+---------+---------+
df2:
+------+-------+
|ID | key |
+------+-------+
| 1 | w |
| 1 | x |
| 2 | y |
| 3 | z |
+------+-------+
Now, I want to create a new column in df1 that contains all key values denoted in df2. So, I aim for the result:
+------+---------+---------+---------+
|ID | Name1 | Name2 | keys |
+------+---------+---------+---------+
| 1 | A | B | w,x |
| 2 | C | D | y |
| 3 | E | F | z |
+------+---------+---------+---------+
Ultimately, I want a solution that works for an arbitrary number of keys.
My attempt in PySpark:
def get_keys(id):
    x = df2.where(df2.ID == id).select('key')
    return x

df_keys = df1.withColumn("keys", get_keys(col('ID')))
In the above code, x is a DataFrame. Since the second argument of withColumn needs to be a Column, I am not sure how to convert x correctly.
You are looking for the collect_list function.
from pyspark.sql.functions import collect_list
df3 = df1.join(df2, df1.ID == df2.ID).drop(df2.ID)
df3.groupBy('ID','Name1','Name2').agg(collect_list('key').alias('keys')).show()
#+---+-----+-----+------+
#| ID|Name1|Name2| keys|
#+---+-----+-----+------+
#| 1| A| B|[w, x]|
#| 3| E| F| [z]|
#| 2| C| D| [y]|
#+---+-----+-----+------+
If you want only unique keys, you can use collect_set instead.
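If you want the keys as a single comma-separated string, as in the desired output above, you can wrap the collected list in concat_ws; a small sketch building on the joined df3 from above:
from pyspark.sql.functions import collect_list, concat_ws

# concat_ws joins the collected array of keys into one comma-separated string, e.g. "w,x".
df3.groupBy('ID', 'Name1', 'Name2') \
   .agg(concat_ws(',', collect_list('key')).alias('keys')) \
   .show()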

I have a table, take the table as a dataframe, required answer is in Spark Scala

I have a table; take the table as a dataframe.
id | Formula | Step | Value |
1 | A*(B+C) | A | 5 |
1 | A*(B+C) | B | 6 |
1 | A*(B+C) | C | 7 |
2 | A/B | A | 12 |
2 | A/B | B | 6 |
Expected result dataframe (for id 1, A*(B+C) = 5*(6+7) = 65; for id 2, A/B = 12/6 = 2).
Solution required using Spark and Scala:
id | Formula | Value |
1 | A*(B+C) | 65 |
2 | A/B | 2 |
scala> val df = Seq((1,"A*(B+C)","A",5),(1,"A*(B+C)","B",6),(1,"A*(B+C)","C",7),(2,"A/B","A",12),(2,"A/B","B",6)).toDF("ID","Formula","Step","Value")
df: org.apache.spark.sql.DataFrame = [ID: int, Formula: string ... 2 more fields]
scala> df.show
+---+-------+----+-----+
| ID|Formula|Step|Value|
+---+-------+----+-----+
| 1|A*(B+C)| A| 5|
| 1|A*(B+C)| B| 6|
| 1|A*(B+C)| C| 7|
| 2| A/B| A| 12|
| 2| A/B| B| 6|
+---+-------+----+-----+
I want the answer like this:
id | Formula | Value |
1 | A*(B+C) | 65 |
2 | A/B | 2 |
You can group by Formula and collect the Step and Value columns as key-value pairs:
scala> df.groupBy($"Formula").agg(collect_list(map($"Step",$"Value")) as "map").show(false)
+-------+---------------------------------------+
|Formula|map |
+-------+---------------------------------------+
|A*(B+C)|[Map(A -> 5), Map(B -> 6), Map(C -> 7)]|
|A/B |[Map(A -> 12), Map(B -> 6)] |
+-------+---------------------------------------+
Now you can write a UDF that substitutes the variable values from the collected maps into Formula and evaluates the result. Note that collect_list(map(...)) produces an array of single-entry maps, so the UDF receives a Seq[Map[String, Int]]:
val dfMap = df.groupBy($"Formula").agg(collect_list(map($"Step", $"Value")) as "map")
val evalUDF = udf((valueMaps: Seq[Map[String, Int]], formula: String) => {
  ...
})
val output = dfMap.withColumn("Value", evalUDF($"map", $"Formula"))

Randomly Split DataFrame by Unique Values in One Column

I have a pyspark DataFrame like the following:
+--------+--------+-----------+
| col1 | col2 | groupId |
+--------+--------+-----------+
| val11 | val21 | 0 |
| val12 | val22 | 1 |
| val13 | val23 | 2 |
| val14 | val24 | 0 |
| val15 | val25 | 1 |
| val16 | val26 | 1 |
+--------+--------+-----------+
Each row has a groupId and multiple rows can have the same groupId.
I want to randomly split this data into two datasets. But all the data having a particular groupId must be in one of the splits.
This means that if d1.groupId = d2.groupId, then d1 and d2 are in the same split.
For example:
# Split 1:
+--------+--------+-----------+
| col1 | col2 | groupId |
+--------+--------+-----------+
| val11 | val21 | 0 |
| val13 | val23 | 2 |
| val14 | val24 | 0 |
+--------+--------+-----------+
# Split 2:
+--------+--------+-----------+
| col1 | col2 | groupId |
+--------+--------+-----------+
| val12 | val22 | 1 |
| val15 | val25 | 1 |
| val16 | val26 | 1 |
+--------+--------+-----------+
What is a good way to do this in PySpark? Can I use the randomSplit method somehow?
You can use randomSplit to split just the distinct groupIds, and then use the results to split the source DataFrame using join.
For example:
split1, split2 = df.select("groupId").distinct().randomSplit(weights=[0.5, 0.5], seed=0)
split1.show()
#+-------+
#|groupId|
#+-------+
#| 1|
#+-------+
split2.show()
#+-------+
#|groupId|
#+-------+
#| 0|
#| 2|
#+-------+
Now join these back to the original DataFrame:
df1 = df.join(split1, on="groupId", how="inner")
df2 = df.join(split2, on="groupId", how="inner")
df1.show()
#+-------+-----+-----+
#|groupId| col1| col2|
#+-------+-----+-----+
#| 1|val12|val22|
#| 1|val15|val25|
#| 1|val16|val26|
#+-------+-----+-----+
df2.show()
#+-------+-----+-----+
#|groupId| col1| col2|
#+-------+-----+-----+
#| 0|val11|val21|
#| 0|val14|val24|
#| 2|val13|val23|
#+-------+-----+-----+
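Since the DataFrames of distinct groupIds are typically small, you may also want to broadcast them in the joins so the large source DataFrame is not shuffled; a sketch using the broadcast hint:
from pyspark.sql.functions import broadcast

# Broadcasting the small splits of distinct groupIds avoids shuffling the big DataFrame.
df1 = df.join(broadcast(split1), on="groupId", how="inner")
df2 = df.join(broadcast(split2), on="groupId", how="inner")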

Spark: How do I explode data and also add the column name, in PySpark or Scala Spark?

Spark: I want to explode multiple columns and consolidate them into a single column, with the source column name recorded as a separate column on each row.
Input data:
+-----------+-----------+-----------+
| ASMT_ID | WORKER | LABOR |
+-----------+-----------+-----------+
| 1 | A1,A2,A3| B1,B2 |
+-----------+-----------+-----------+
| 2 | A1,A4 | B1 |
+-----------+-----------+-----------+
Expected Output:
+-----------+-----------+-----------+
| ASMT_ID |WRK_CODE |WRK_DETL |
+-----------+-----------+-----------+
| 1 | A1 | WORKER |
+-----------+-----------+-----------+
| 1 | A2 | WORKER |
+-----------+-----------+-----------+
| 1 | A3 | WORKER |
+-----------+-----------+-----------+
| 1 | B1 | LABOR |
+-----------+-----------+-----------+
| 1 | B2 | LABOR |
+-----------+-----------+-----------+
| 2 | A1 | WORKER |
+-----------+-----------+-----------+
| 2 | A4 | WORKER |
+-----------+-----------+-----------+
| 2 | B1 | LABOR |
+-----------+-----------+-----------+
Probably not the cleanest approach, but a couple of explodes and a unionAll is all you need:
import org.apache.spark.sql.functions._
df1.show
+-------+--------+-----+
|ASMT_ID| WORKER|LABOR|
+-------+--------+-----+
| 1|A1,A2,A3|B1,B2|
| 2| A1,A4| B1|
+-------+--------+-----+
df1.cache
val workers = df1.drop("LABOR")
.withColumn("WRK_CODE" , explode(split($"WORKER" , ",") ) )
.withColumn("WRK_DETL", lit("WORKER"))
.drop("WORKER")
val labors = df1.drop("WORKER")
.withColumn("WRK_CODE" , explode(split($"LABOR", ",") ) )
.withColumn("WRK_DETL", lit("LABOR") )
.drop("LABOR")
workers.unionAll(labors).orderBy($"ASMT_ID".asc , $"WRK_CODE".asc).show
+-------+--------+--------+
|ASMT_ID|WRK_CODE|WRK_DETL|
+-------+--------+--------+
| 1| A1| WORKER|
| 1| A2| WORKER|
| 1| A3| WORKER|
| 1| B1| LABOR|
| 1| B2| LABOR|
| 2| A1| WORKER|
| 2| A4| WORKER|
| 2| B1| LABOR|
+-------+--------+--------+
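Since the question also asks for PySpark, here is a sketch of the same approach translated to the Python DataFrame API (assuming the same df1; union matches columns by position, and both halves here end up as ASMT_ID, WRK_CODE, WRK_DETL):
from pyspark.sql import functions as F

# Explode each comma-separated column separately, tag it with its source column name,
# then union the two halves back together.
workers = (df1.drop("LABOR")
    .withColumn("WRK_CODE", F.explode(F.split(F.col("WORKER"), ",")))
    .withColumn("WRK_DETL", F.lit("WORKER"))
    .drop("WORKER"))

labors = (df1.drop("WORKER")
    .withColumn("WRK_CODE", F.explode(F.split(F.col("LABOR"), ",")))
    .withColumn("WRK_DETL", F.lit("LABOR"))
    .drop("LABOR"))

workers.union(labors).orderBy("ASMT_ID", "WRK_CODE").show()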
