pyspark: Flattening of records coming from an input file

I have an input CSV file like the one below:
plant_id,system1_id,system2_id,system3_id
A1,s1-111,s2-111,s3-111
A2,s1-222,s2-222,s3-222
A3,s1-333,s2-333,s3-333
I want to flatten the records like this:
plant_id  system_id  system_name
A1        s1-111     system1
A1        s2-111     system2
A1        s3-111     system3
A2        s1-222     system1
A2        s2-222     system2
A2        s3-222     system3
A3        s1-333     system1
A3        s2-333     system2
A3        s3-333     system3
Currently I am able to achieve this by creating a transposed PySpark DataFrame for each system column and then unioning all of the DataFrames at the end, but that requires a long piece of code. Is there a way to achieve it in a few lines of code?

Use stack:
df2 = df.selectExpr(
    'plant_id',
    """stack(
        3,
        system1_id, 'system1_id', system2_id, 'system2_id', system3_id, 'system3_id')
        as (system_id, system_name)"""
)
df2.show()
+--------+---------+-----------+
|plant_id|system_id|system_name|
+--------+---------+-----------+
| A1| s1-111| system1_id|
| A1| s2-111| system2_id|
| A1| s3-111| system3_id|
| A2| s1-222| system1_id|
| A2| s2-222| system2_id|
| A2| s3-222| system3_id|
| A3| s1-333| system1_id|
| A3| s2-333| system2_id|
| A3| s3-333| system3_id|
+--------+---------+-----------+
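If the number of system columns grows, the stack expression can also be built programmatically. A minimal sketch, assuming every system column ends in _id and that the desired system_name is the column name without that suffix:

from pyspark.sql import functions as F

# Build the stack() arguments from every column except plant_id
# (assumption: all of them are named like 'systemN_id').
system_cols = [c for c in df.columns if c != 'plant_id']
pairs = ", ".join(f"{c}, '{c[:-3]}'" for c in system_cols)   # drop the '_id' suffix for system_name
df2 = df.select(
    'plant_id',
    F.expr(f"stack({len(system_cols)}, {pairs}) as (system_id, system_name)")
)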

1. Preparing the sample input data
from pyspark.sql import functions as F
sampleData = (('A1','s1-111','s2-111','s3-111'),
              ('A2','s1-222','s2-222','s3-222'),
              ('A3','s1-333','s2-333','s3-333'))
2. Creating the list of input data columns
columns = ['plant_id','system1_id','system2_id','system3_id']
3. Creating the Spark DataFrame
df = spark.createDataFrame(data=sampleData, schema=columns)
df.show()
+--------+----------+----------+----------+
|plant_id|system1_id|system2_id|system3_id|
+--------+----------+----------+----------+
| A1| s1-111| s2-111| s3-111|
| A2| s1-222| s2-222| s3-222|
| A3| s1-333| s2-333| s3-333|
+--------+----------+----------+----------+
4. We are using the stack() function to separate multiple columns into rows. Here is the stack function syntax: stack(n, expr1, ..., exprk) - Separates expr1, ..., exprk into n rows.
finalDF = df.select(
    'plant_id',
    F.expr("stack(3, system1_id, 'system1_id', system2_id, 'system2_id', system3_id, 'system3_id') as (system_id, system_name)")
)
finalDF.show()
+--------+---------+-----------+
|plant_id|system_id|system_name|
+--------+---------+-----------+
| A1| s1-111| system1_id|
| A1| s2-111| system2_id|
| A1| s3-111| system3_id|
| A2| s1-222| system1_id|
| A2| s2-222| system2_id|
| A2| s3-222| system3_id|
| A3| s1-333| system1_id|
| A3| s2-333| system2_id|
| A3| s3-333| system3_id|
+--------+---------+-----------+
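As a side note, on Spark 3.4 or later the built-in DataFrame.unpivot (also exposed as melt) should produce the same reshaping without writing the stack expression by hand. A sketch, assuming that API version:

# Spark 3.4+ only: unpivot the three system columns into rows.
# system_name will hold the original column names (system1_id, ...).
df2 = df.unpivot(
    ids=['plant_id'],
    values=['system1_id', 'system2_id', 'system3_id'],
    variableColumnName='system_name',
    valueColumnName='system_id'
)
df2.show()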

Related

Matching up arrays in PySpark

I am trying to manipulate two dataframes using PySpark as part of an AWS Glue job.
df1:
item  tag
1     AB
2     CD
3     EF
4     QQ
df2:
key1  key2  tags
A1    B1    [AB]
A1    B2    [AB, CD, EF]
A2    B1    [CD, EF]
A2    B3    [AB, EF, ZZ]
I would like to match up the array in df2 with the tag in df1, in the following way:
item  key1  key2  tag
1     A1    B1    AB
1     A1    B2    AB
2     A1    B2    CD
2     A2    B1    CD
3     A1    B2    EF
3     A2    B1    EF
3     A2    B3    EF
So, the tag in df1 is used to expand the row based on the tag entries in df2. For example, item 1's tag "AB" occurs in the tags array in df2 for the first two rows.
Also note how 4 gets ignored as the tag QQ does not exist in any array in df2.
I know this is going to be an inner join, but I am not sure how to match up df1.tag with df2.tags to pull in key1 and key2.
Any assistance would be greatly appreciated.
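For reference, a minimal sketch that recreates the two sample dataframes from the question (assuming an active SparkSession named spark), so the answers below can be run as-is:

df1 = spark.createDataFrame(
    [(1, 'AB'), (2, 'CD'), (3, 'EF'), (4, 'QQ')],
    ['item', 'tag']
)
df2 = spark.createDataFrame(
    [('A1', 'B1', ['AB']),
     ('A1', 'B2', ['AB', 'CD', 'EF']),
     ('A2', 'B1', ['CD', 'EF']),
     ('A2', 'B3', ['AB', 'EF', 'ZZ'])],
    ['key1', 'key2', 'tags']
)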
You can do a join using an array_contains condition:
import pyspark.sql.functions as F
result = (df1.join(df2, F.array_contains(df2.tags, df1.tag))
          .select('item', 'key1', 'key2', 'tag')
          .orderBy('item', 'key1', 'key2'))
result.show()
+----+----+----+---+
|item|key1|key2|tag|
+----+----+----+---+
| 1| A1| B1| AB|
| 1| A1| B2| AB|
| 1| A2| B3| AB|
| 2| A1| B2| CD|
| 2| A2| B1| CD|
| 3| A1| B2| EF|
| 3| A2| B1| EF|
| 3| A2| B3| EF|
+----+----+----+---+
Alternatively, explode the tags array and join on the exploded tag column:
import pyspark.sql.functions as F

df = df1.join(
    df2.select('key1', 'key2', F.explode('tags').alias('tag')),
    'tag',
    'inner'
)
df.show()
# +---+----+----+----+
# |tag|item|key1|key2|
# +---+----+----+----+
# | EF| 3| A1| B2|
# | EF| 3| A2| B1|
# | EF| 3| A2| B3|
# | AB| 1| A1| B1|
# | AB| 1| A1| B2|
# | AB| 1| A2| B3|
# | CD| 2| A1| B2|
# | CD| 2| A2| B1|
# +---+----+----+----+

Calculate Spark column value depending on another row value on the same column

I'm working on Apache Spark 2.3.0 (Cloudera) and I have an issue processing a DataFrame.
I've got this input dataframe:
+---+---+----+
| id| d1| d2 |
+---+---+----+
| 1| | 2.0|
| 2| |-4.0|
| 3| | 6.0|
| 4|3.0| |
+---+---+----+
And I need this output:
+---+---+----+----+
| id| d1| d2 | r |
+---+---+----+----+
| 1| | 2.0| 7.0|
| 2| |-4.0| 5.0|
| 3| | 6.0| 9.0|
| 4|3.0| | 3.0|
+---+---+----+----+
That is, from an iterative perspective: take the row with the biggest id (4) and put its d1 value in the r column, then take the next row (3) and put r[4] + d2[3] in its r column, and so on.
Is it possible to do something like that in Spark? Because I need a value computed from one row to calculate the value for another row.
How about this? The important bit is sum($"r1").over(Window.orderBy($"id".desc)), which calculates a cumulative sum of the column. Other than that, I'm creating a couple of helper columns to get the max id and get the ordering right.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val result = df
  .withColumn("max_id", max($"id").over(Window.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)))
  .withColumn("r1", when($"id" === $"max_id", $"d1").otherwise($"d2"))
  .withColumn("r", sum($"r1").over(Window.orderBy($"id".desc)))
  .drop($"max_id").drop($"r1")
  .orderBy($"id")
result.show
+---+----+----+---+
| id| d1| d2| r|
+---+----+----+---+
| 1|null| 2.0|7.0|
| 2|null|-4.0|5.0|
| 3|null| 6.0|9.0|
| 4| 3.0|null|3.0|
+---+----+----+---+
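For PySpark users, a sketch of the same approach (helper columns plus a cumulative sum over a descending-id window), assuming the input dataframe is called df:

from pyspark.sql import functions as F, Window

# Frame over all rows, used to compute the global max id.
w_all = Window.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
# Cumulative frame ordered by descending id.
w_cum = Window.orderBy(F.col("id").desc())

result = (df
    .withColumn("max_id", F.max("id").over(w_all))
    .withColumn("r1", F.when(F.col("id") == F.col("max_id"), F.col("d1")).otherwise(F.col("d2")))
    .withColumn("r", F.sum("r1").over(w_cum))
    .drop("max_id", "r1")
    .orderBy("id"))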

Case-wise mapping from multiple columns to fill values in another column in a pyspark dataframe

I have a data frame with multiple columns:
+-----------+-----------+-----------+
| col1| col2| col3|
+-----------+-----------+-----------+
| s1| c1| p3|
| s2| c1| p3|
| s1| c3| p3|
| s3| c4| p4|
| s4| c5| p4|
| s2| c6| p4|
+-----------+-----------+-----------+
Now what I want to achieve is to create a new column from a mapping of multiple columns using, let's say, a dict (since the number of unique values is large, individual when/case statements would be tedious).
The idea is to first map the values of col1; then, if there are remaining null values in the new column, to map them from col2; then again, if there are more null values, to map them from col3; and finally to replace any remaining null values with a string literal:
col1_map = {'s1' : 'apple', 's3' : 'orange'}
col2_map = {'c1' : 'potato', 'c6' : 'tomato'}
col3_map = {'p3' : 'ball', 'p4' : 'bat'}
The final output would look like this:
+-----------+-----------+-----------+-----------+
| col1| col2| col3| col4|
+-----------+-----------+-----------+-----------+
| s1| c1| p3| apple|
| s2| c1| p3| potato|
| s1| c3| p3| apple|
| s3| c4| p4| orange|
| s4| c5| p4| bat|
| s2| c6| p4| tomato|
+-----------+-----------+-----------+-----------+
My approach so far is to create the new column like this:
from itertools import chain
from pyspark.sql.functions import create_map, lit
mapping_expr = create_map([lit(x) for x in chain(*col1_map.items())])
df = df.withColumn('col4', mapping_expr[df['col1']])
This fills col4 from the mapping of col1. However, my issue is that if I repeat this for col2 and there is already a mapped value from col1 in col4, the new mapping will replace it. I do not want that.
Does anyone have any suggestion to maintain this order of addition of values in the new column?
You did it almost right; you just need to apply the mapping expressions in succession.
from pyspark.sql.functions import col, create_map, lit, when
from itertools import chain
values = [('s1','c1','p3'),('s2','c1','p3'),('s1','c3','p3'),('s3','c4','p4'),('s4','c5','p4'),('s2','c6','p4')]
df = sqlContext.createDataFrame(values,['col1','col2','col3'])
df.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| s1| c1| p3|
| s2| c1| p3|
| s1| c3| p3|
| s3| c4| p4|
| s4| c5| p4|
| s2| c6| p4|
+----+----+----+
The dictionaries, as provided by you, and creating their mapping expressions:
col1_map = {'s1' : 'apple', 's3' : 'orange'}
col2_map = {'c1' : 'potato', 'c6' : 'tomato'}
col3_map = {'p3' : 'ball', 'p4' : 'bat'}
#Applying the mapping of dictionary.
mapping_expr1 = create_map([lit(x) for x in chain(*col1_map.items())])
mapping_expr2 = create_map([lit(x) for x in chain(*col2_map.items())])
mapping_expr3 = create_map([lit(x) for x in chain(*col3_map.items())])
Finally, applying the mappings in succession. All I am doing in addition is checking whether, after handling col1/col2, we still have nulls, which can be done with the isNull() function.
df=df.withColumn('col4', mapping_expr1.getItem(col('col1')))
df=df.withColumn('col4', when(col('col4').isNull(),mapping_expr2.getItem(col('col2'))).otherwise(col('col4')))
df=df.withColumn('col4', when(col('col4').isNull(),mapping_expr3.getItem(col('col3'))).otherwise(col('col4')))
df.show()
+----+----+----+------+
|col1|col2|col3| col4|
+----+----+----+------+
| s1| c1| p3| apple|
| s2| c1| p3|potato|
| s1| c3| p3| apple|
| s3| c4| p4|orange|
| s4| c5| p4| bat|
| s2| c6| p4|tomato|
+----+----+----+------+
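As a variation, F.coalesce can express the same "first non-null mapping wins" logic in one step and also cover the fallback string literal mentioned in the question. A sketch reusing the mapping expressions above (the 'unknown' default is only an illustrative placeholder):

from pyspark.sql.functions import coalesce, col, lit

df = df.withColumn(
    'col4',
    coalesce(
        mapping_expr1.getItem(col('col1')),   # first try col1's mapping
        mapping_expr2.getItem(col('col2')),   # then col2's
        mapping_expr3.getItem(col('col3')),   # then col3's
        lit('unknown')                        # assumed fallback literal
    )
)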

Spark: How do I explode data and also add the column name in PySpark or Scala Spark?

Spark: I want to explode multiple columns and consolidate them into a single column, with the source column name in a separate column.
Input data:
+-----------+-----------+-----------+
| ASMT_ID | WORKER | LABOR |
+-----------+-----------+-----------+
| 1 | A1,A2,A3| B1,B2 |
+-----------+-----------+-----------+
| 2 | A1,A4 | B1 |
+-----------+-----------+-----------+
Expected Output:
+-----------+-----------+-----------+
| ASMT_ID |WRK_CODE |WRK_DETL |
+-----------+-----------+-----------+
| 1 | A1 | WORKER |
+-----------+-----------+-----------+
| 1 | A2 | WORKER |
+-----------+-----------+-----------+
| 1 | A3 | WORKER |
+-----------+-----------+-----------+
| 1 | B1 | LABOR |
+-----------+-----------+-----------+
| 1 | B2 | LABOR |
+-----------+-----------+-----------+
| 2 | A1 | WORKER |
+-----------+-----------+-----------+
| 2 | A4 | WORKER |
+-----------+-----------+-----------+
| 2 | B1 | LABOR |
+-----------+-----------+-----------+
Probably not the neatest way, but a couple of explodes and a unionAll is all you need.
import org.apache.spark.sql.functions._
df1.show
+-------+--------+-----+
|ASMT_ID| WORKER|LABOR|
+-------+--------+-----+
| 1|A1,A2,A3|B1,B2|
| 2| A1,A4| B1|
+-------+--------+-----+
df1.cache
val workers = df1.drop("LABOR")
  .withColumn("WRK_CODE", explode(split($"WORKER", ",")))
  .withColumn("WRK_DETL", lit("WORKER"))
  .drop("WORKER")
val labors = df1.drop("WORKER")
  .withColumn("WRK_CODE", explode(split($"LABOR", ",")))
  .withColumn("WRK_DETL", lit("LABOR"))
  .drop("LABOR")
workers.unionAll(labors).orderBy($"ASMT_ID".asc, $"WRK_CODE".asc).show
+-------+--------+--------+
|ASMT_ID|WRK_CODE|WRK_DETL|
+-------+--------+--------+
| 1| A1| WORKER|
| 1| A2| WORKER|
| 1| A3| WORKER|
| 1| B1| LABOR|
| 1| B2| LABOR|
| 2| A1| WORKER|
| 2| A4| WORKER|
| 2| B1| LABOR|
+-------+--------+--------+
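For a PySpark take on the same problem, stack() plus a single explode avoids the union entirely. A minimal sketch, assuming the WORKER and LABOR columns are comma-separated strings as shown:

from pyspark.sql import functions as F

result = (df1
    .selectExpr("ASMT_ID",
                "stack(2, WORKER, 'WORKER', LABOR, 'LABOR') as (codes, WRK_DETL)")
    .withColumn("WRK_CODE", F.explode(F.split("codes", ",")))
    .select("ASMT_ID", "WRK_CODE", "WRK_DETL"))
result.show()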

Find the top n elements for attribute combination in data frame spark [duplicate]

This question already has an answer here: Spark sql top n per group (1 answer). Closed 5 years ago.
I have a data frame like below.
scala> ds.show
+----+----------+----------+-----+
| key|attribute1|attribute2|value|
+----+----------+----------+-----+
|mac1| A1| B1| 10|
|mac2| A2| B1| 10|
|mac3| A2| B1| 10|
|mac1| A1| B2| 10|
|mac1| A1| B2| 10|
|mac3| A1| B1| 10|
|mac2| A2| B1| 10|
+----+----------+----------+-----+
For each value in attribute1, I want to find the top N keys and the aggregated value for that key.
Output:
The aggregated value per key for attribute1 will be:
+----+----------+-----+
| key|attribute1|value|
+----+----------+-----+
|mac1| A1| 30|
|mac2| A2| 20|
|mac3| A2| 10|
|mac3| A1| 10|
+----+----------+-----+
Now if N = 1, the output will be A1 - (mac1, 30) and A2 - (mac2, 20).
How can I achieve this with DataFrame/Dataset?
I want to achieve this for all the attributes. In the above example I want to find it for attribute1 and attribute2 as well.
Given the input dataframe as
+----+----------+----------+-----+
|key |attribute1|attribute2|value|
+----+----------+----------+-----+
|mac1|A1 |B1 |10 |
|mac2|A2 |B1 |10 |
|mac3|A2 |B1 |10 |
|mac1|A1 |B2 |10 |
|mac1|A1 |B2 |10 |
|mac3|A1 |B1 |10 |
|mac2|A2 |B1 |10 |
+----+----------+----------+-----+
and doing aggregation on the above input dataframe as
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val groupeddf = df.groupBy("key", "attribute1").agg(sum("value").as("value"))
should give you
+----+----------+-----+
|key |attribute1|value|
+----+----------+-----+
|mac1|A1 |30.0 |
|mac3|A1 |10.0 |
|mac3|A2 |10.0 |
|mac2|A2 |20.0 |
+----+----------+-----+
Now you can use a Window function to generate ranks for each row in the grouped data and filter the rows with rank <= N:
val N = 1
val windowSpec = Window.partitionBy("attribute1").orderBy($"value".desc)
groupeddf.withColumn("rank", rank().over(windowSpec))
  .filter($"rank" <= N)
  .drop("rank")
which should give you the dataframe you desire.
+----+----------+-----+
|key |attribute1|value|
+----+----------+-----+
|mac2|A2 |20.0 |
|mac1|A1 |30.0 |
+----+----------+-----+
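Since the question also asks about attribute2, a small PySpark sketch that parameterizes the same aggregate-then-rank pattern over any attribute column (assuming the dataframe from the question is called ds):

from pyspark.sql import functions as F, Window

def top_n_per_attribute(df, attr_col, n=1):
    # Aggregate value per (key, attribute), then rank keys within each attribute value.
    grouped = df.groupBy("key", attr_col).agg(F.sum("value").alias("value"))
    w = Window.partitionBy(attr_col).orderBy(F.col("value").desc())
    return (grouped
            .withColumn("rank", F.rank().over(w))
            .filter(F.col("rank") <= n)
            .drop("rank"))

top_attr1 = top_n_per_attribute(ds, "attribute1")
top_attr2 = top_n_per_attribute(ds, "attribute2")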
