Matching up arrays in PySpark - apache-spark

I am trying to manipulate two dataframes using PySpark as part of an AWS Glue job.
df1:

item  tag
1     AB
2     CD
3     EF
4     QQ

df2:

key1  key2  tags
A1    B1    [AB]
A1    B2    [AB, CD, EF]
A2    B1    [CD, EF]
A2    B3    [AB, EF, ZZ]
I would like to match up the array in df2 with the tag in df1, in the following way:
item  key1  key2  tag
1     A1    B1    AB
1     A1    B2    AB
2     A1    B2    CD
2     A2    B1    CD
3     A1    B2    EF
3     A2    B1    EF
3     A2    B3    EF
So, each tag in df1 is used to expand that row into one row per df2 entry whose tags array contains it. For example, item 1's tag "AB" occurs in the tags array of the first two rows of df2.
Also note how item 4 is dropped, as its tag QQ does not exist in any array in df2.
I know this is going to be an inner join, but I am not sure how to match up df1.tag with the df2.tags arrays to pull in key1 and key2.
Any assistance would be greatly appreciated.

You can do a join using an array_contains condition:
import pyspark.sql.functions as F
result = (df1.join(df2, F.array_contains(df2.tags, df1.tag))
          .select('item', 'key1', 'key2', 'tag')
          .orderBy('item', 'key1', 'key2')
          )
result.show()
+----+----+----+---+
|item|key1|key2|tag|
+----+----+----+---+
| 1| A1| B1| AB|
| 1| A1| B2| AB|
| 1| A2| B3| AB|
| 2| A1| B2| CD|
| 2| A2| B1| CD|
| 3| A1| B2| EF|
| 3| A2| B1| EF|
| 3| A2| B3| EF|
+----+----+----+---+
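(Note that this result also contains a (1, A2, B3, AB) row that the expected output in the question omits: AB appears in the [AB, EF, ZZ] array as well, so it matches too.)
For reference, here is a minimal sketch of how the two sample dataframes above could be built; the SparkSession setup and variable names are assumptions, not part of the original post:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# df1: one row per (item, tag)
df1 = spark.createDataFrame(
    [(1, 'AB'), (2, 'CD'), (3, 'EF'), (4, 'QQ')],
    ['item', 'tag'])

# df2: one row per (key1, key2), with an array of tags
df2 = spark.createDataFrame(
    [('A1', 'B1', ['AB']),
     ('A1', 'B2', ['AB', 'CD', 'EF']),
     ('A2', 'B1', ['CD', 'EF']),
     ('A2', 'B3', ['AB', 'EF', 'ZZ'])],
    ['key1', 'key2', 'tags'])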

import pyspark.sql.functions as F
df = df1.join(
    df2.select('key1', 'key2', F.explode('tags').alias('tag')),
    'tag',
    'inner'
)
df.show()
# +---+----+----+----+
# |tag|item|key1|key2|
# +---+----+----+----+
# | EF| 3| A1| B2|
# | EF| 3| A2| B1|
# | EF| 3| A2| B3|
# | AB| 1| A1| B1|
# | AB| 1| A1| B2|
# | AB| 1| A2| B3|
# | CD| 2| A1| B2|
# | CD| 2| A2| B1|
# +---+----+----+----+
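If you want the column order and sorting shown in the question, a small follow-up select/orderBy (my addition, not part of the original answer) gets you there:

(df.select('item', 'key1', 'key2', 'tag')
   .orderBy('item', 'key1', 'key2')
   .show())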

Related

Pyspark : concat two spark df side ways without join efficiently

Hi, I have a sparse dataframe that was loaded with the mergeSchema option.
DF

name  A1    A2    B1    B2    .....  partitioned_name
A     1     1     null  null         partition_a
B     2     2     null  null         partition_a
A     null  null  3     4            partition_b
B     null  null  3     4            partition_b

to

DF

name  A1  A2  B1  B2  .....
A     1   1   3   4
B     2   2   3   4

Any ideas for doing this efficiently without a join (and without dropping to RDDs, because the data is huge)? I was thinking of something like pandas concat(axis=1), since all the tables are sorted.
If that pattern repeats and you don't mind hardcoding the column names:
import pyspark.sql.functions as F

df = spark.createDataFrame(
    [
        ('A', '1', '1', 'null', 'null', 'partition_a'),
        ('B', '2', '2', 'null', 'null', 'partition_a'),
        ('A', 'null', 'null', '3', '4', 'partition_b'),
        ('B', 'null', 'null', '3', '4', 'partition_b')
    ],
    ['name', 'A1', 'A2', 'B1', 'B2', 'partitioned_name']
)\
    .withColumn('A1', F.col('A1').cast('integer'))\
    .withColumn('A2', F.col('A2').cast('integer'))\
    .withColumn('B1', F.col('B1').cast('integer'))\
    .withColumn('B2', F.col('B2').cast('integer'))
df.show()

cols_to_agg = [col for col in df.columns if col not in ["name", "partitioned_name"]]

df\
    .groupby('name')\
    .agg(F.sum('A1').alias('A1'),
         F.sum('A2').alias('A2'),
         F.sum('B1').alias('B1'),
         F.sum('B2').alias('B2'))\
    .show()
# +----+----+----+----+----+----------------+
# |name| A1| A2| B1| B2|partitioned_name|
# +----+----+----+----+----+----------------+
# | A| 1| 1|null|null| partition_a|
# | B| 2| 2|null|null| partition_a|
# | A|null|null| 3| 4| partition_b|
# | B|null|null| 3| 4| partition_b|
# +----+----+----+----+----+----------------+
# +----+---+---+---+---+
# |name| A1| A2| B1| B2|
# +----+---+---+---+---+
# | A| 1| 1| 3| 4|
# | B| 2| 2| 3| 4|
# +----+---+---+---+---+
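The cols_to_agg list defined above is never actually used; a sketch (my own generalization) lets it drive the aggregation instead of hardcoding each column, assuming, as in the sample data, that each name has at most one non-null value per column so F.sum simply picks it up:

df\
    .groupby('name')\
    .agg(*[F.sum(c).alias(c) for c in cols_to_agg])\
    .show()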

pyspark : Flattening of records coming from input file

I have an input csv file like below:

plant_id, system1_id, system2_id, system3_id
A1,       s1-111,     s2-111,     s3-111
A2,       s1-222,     s2-222,     s3-222
A3,       s1-333,     s2-333,     s3-333
I want to flatten the records like below:

plant_id  system_id  system_name
A1        s1-111     system1
A1        s2-111     system2
A1        s3-111     system3
A2        s1-222     system1
A2        s2-222     system2
A2        s3-222     system3
A3        s1-333     system1
A3        s2-333     system2
A3        s3-333     system3
Currently I am able to achieve this by creating a transposed PySpark df for each system column and then unioning all the dfs at the end, but that requires a long piece of code. Is there a way to achieve it in a few lines of code?
Use stack:
df2 = df.selectExpr(
    'plant_id',
    """stack(
        3,
        system1_id, 'system1_id', system2_id, 'system2_id', system3_id, 'system3_id')
        as (system_id, system_name)"""
)
df2.show()
+--------+---------+-----------+
|plant_id|system_id|system_name|
+--------+---------+-----------+
| A1| s1-111| system1_id|
| A1| s2-111| system2_id|
| A1| s3-111| system3_id|
| A2| s1-222| system1_id|
| A2| s2-222| system2_id|
| A2| s3-222| system3_id|
| A3| s1-333| system1_id|
| A3| s2-333| system2_id|
| A3| s3-333| system3_id|
+--------+---------+-----------+
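Note that the question asks for system_name values like system1 rather than system1_id; if you want those, a small variation (my own tweak, not in the original answer) is to change the string literals inside stack:

df2 = df.selectExpr(
    'plant_id',
    """stack(
        3,
        system1_id, 'system1', system2_id, 'system2', system3_id, 'system3')
        as (system_id, system_name)"""
)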
1. Preparing the sample input data
from pyspark.sql import functions as F

sampleData = (('A1', 's1-111', 's2-111', 's3-111'),
              ('A2', 's1-222', 's2-222', 's3-222'),
              ('A3', 's1-333', 's2-333', 's3-333')
              )
2. Creating the list of input data columns
columns = ['plant_id','system1_id','system2_id','system3_id']
3. Creating the Spark DataFrame
df = spark.createDataFrame(data=sampleData, schema=columns)
df.show()
+--------+----------+----------+----------+
|plant_id|system1_id|system2_id|system3_id|
+--------+----------+----------+----------+
| A1| s1-111| s2-111| s3-111|
| A2| s1-222| s2-222| s3-222|
| A3| s1-333| s2-333| s3-333|
+--------+----------+----------+----------+
4. We are using the stack() function to separate multiple columns into rows. Here is the stack function syntax: stack(n, expr1, ..., exprk) - Separates expr1, ..., exprk into n rows.
finalDF = df.select('plant_id',F.expr("stack(3,system1_id, 'system1_id', system2_id, 'system2_id', system3_id, 'system3_id') as (system_id, system_name)"))
finalDF.show()
+--------+---------+-----------+
|plant_id|system_id|system_name|
+--------+---------+-----------+
| A1| s1-111| system1_id|
| A1| s2-111| system2_id|
| A1| s3-111| system3_id|
| A2| s1-222| system1_id|
| A2| s2-222| system2_id|
| A2| s3-222| system3_id|
| A3| s1-333| system1_id|
| A3| s2-333| system2_id|
| A3| s3-333| system3_id|
+--------+---------+-----------+
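If the number of system columns can vary, a sketch (my own generalization, not from the original answer) builds the stack expression from the DataFrame's columns:

sys_cols = [c for c in df.columns if c != 'plant_id']

# stack(n, col1, 'col1', col2, 'col2', ...) as (system_id, system_name)
stack_expr = "stack({n}, {args}) as (system_id, system_name)".format(
    n=len(sys_cols),
    args=", ".join("{c}, '{c}'".format(c=c) for c in sys_cols))

finalDF = df.select('plant_id', F.expr(stack_expr))
finalDF.show()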

How to create column from an expression

The doc says:
# 2. Create from an expression
df.colName + 1
1 / df.colName
Can anyone explain the meaning and usage of the code?
It means that an arithmetic operation on an existing Column creates a new Column object:
df = spark.createDataFrame([[1], [2]], ['a'])
df.show()
+---+
| a|
+---+
| 1|
| 2|
+---+
df.a
# Column<b'a'>
df.a + 1
# Column<b'(a + 1)'>
1 / df.a
# Column<b'(1 / a)'>
df.a, df.a + 1 and 1 / df.a are all Column objects. What you probably want to ask is how to attach such a column to the DataFrame, for which you can use select:
df.select('a', (df.a + 1).alias('b')).show()
+---+---+
| a| b|
+---+---+
| 1| 2|
| 2| 3|
+---+---+
Or withColumn:
df.withColumn('b', df.a + 1).show()
+---+---+
| a| b|
+---+---+
| 1| 2|
| 2| 3|
+---+---+
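For completeness, the same column expressions can also be written with pyspark.sql.functions.col or as SQL strings via selectExpr; a short sketch of equivalents (my addition, not in the original answer):

import pyspark.sql.functions as F

df.withColumn('b', F.col('a') + 1).show()   # same as df.a + 1
df.selectExpr('a', 'a + 1 as b').show()     # SQL-string form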

Saving iteratively to a new DataFrame in Pyspark

I'm performing computations based on 3 different PySpark DataFrames.
The script works in the sense that it performs the computation as it should; however, I struggle to work properly with the results of that computation.
import sys
import numpy as np
from pyspark import SparkConf, SparkContext, SQLContext
from pyspark.sql import functions as F
from pyspark.sql.types import FloatType

sc = SparkContext("local")
sqlContext = SQLContext(sc)
# Dummy Data
df = sqlContext.createDataFrame([[0,1,0,0,0],[1,1,0,0,1],[0,0,1,0,1],[1,0,1,1,0],[1,1,0,0,0]], ['p1', 'p2', 'p3', 'p4', 'p5'])
df.show()
+---+---+---+---+---+
| p1| p2| p3| p4| p5|
+---+---+---+---+---+
| 0| 1| 0| 0| 0|
| 1| 1| 0| 0| 1|
| 0| 0| 1| 0| 1|
| 1| 0| 1| 1| 0|
| 1| 1| 0| 0| 0|
+---+---+---+---+---+
# Values
values = sqlContext.createDataFrame([(0,1,'p1'),(None,1,'p2'),(0,0,'p3'),(None,0, 'p4'),(1,None,'p5')], ('f1', 'f2','index'))
values.show()
+----+----+-----+
| f1| f2|index|
+----+----+-----+
| 0| 1| p1|
|null| 1| p2|
| 0| 0| p3|
|null| 0| p4|
| 1|null| p5|
+----+----+-----+
# Weights
weights = sqlContext.createDataFrame([(4,3,'p1'),(None,1,'p2'),(2,2,'p3'),(None, 3, 'p4'),(3,None,'p5')], ('f1', 'f2','index'))
weights.show()
+----+----+-----+
| f1| f2|index|
+----+----+-----+
| 4| 3| p1|
|null| 1| p2|
| 2| 2| p3|
|null| 3| p4|
| 3|null| p5|
+----+----+-----+
# Function: it sums the vector W for the values of Row equal to the values of V and then divides by the length of W.
# If there are no similarities between Row and V, it outputs 0.
def W_sum(row, v, w):
    if len(w[row == v]) > 0:
        return float(np.sum(w[row == v]) / len(w))
    else:
        return 0.0
For each of the columns and for each row in Data, the above function is applied.
# We iterate over the columns of Values (except the last one, called index)
for val in values.columns[:-1]:
    # we filter the data to work only with the columns that are defined for the selected Value
    defined_col = [i[0] for i in values.where(F.col(val) >= 0).select(values.index).collect()]
    # we select only the useful columns
    df_select = df.select(defined_col)
    # we retrieve the reference values and weights
    V = np.array(values.where(values.index.isin(defined_col)).select(val).collect()).flatten()
    W = np.array(weights.where(weights.index.isin(defined_col)).select(val).collect()).flatten()
    W_sum_udf = F.udf(lambda row: W_sum(row, V, W), FloatType())
    df_select.withColumn(val, W_sum_udf(F.array(*(F.col(x) for x in df_select.columns))))
This gives :
+---+---+---+---+---+---+
| p1| p2| p3| p4| p5| f1|
+---+---+---+---+---+---+
| 0| 1| 0| 0| 0|2.0|
| 1| 1| 0| 0| 1|1.0|
| 0| 0| 1| 0| 1|2.0|
| 1| 0| 1| 1| 0|0.0|
| 1| 1| 0| 0| 0|0.0|
+---+---+---+---+---+---+
It added the column to the sliced DataFrame as I asked it to. The problem is that I would rather collect the data into a new one that I could access at the end to consult the results.
Is it possible to grow (somewhat efficiently) a DataFrame in PySpark as I would with pandas?
Edit to make my goal clearer:
Ideally I would get a DataFrame with the just the computed columns, like this:
+---+---+
| f1| f2|
+---+---+
|2.0|1.0|
|1.0|2.0|
|2.0|0.0|
|0.0|0.0|
|0.0|2.0|
+---+---+
There are some issues with your question...
First, the result of the last line of your for loop is never assigned to anything, so whatever it computes is simply discarded on each iteration (what does it produce?).
Assuming that your last line is actually something like
new_df = df_select.withColumn(val, W_sum_udf(F.array(*(F.col(x) for x in df_select.columns))))
then your problem becomes clearer. Since
values.columns[:-1]
# ['f1', 'f2']
the result of the whole loop would be just
+---+---+---+---+---+
| p1| p2| p3| p4| f2|
+---+---+---+---+---+
| 0| 1| 0| 0|1.0|
| 1| 1| 0| 0|2.0|
| 0| 0| 1| 0|0.0|
| 1| 0| 1| 1|0.0|
| 1| 1| 0| 0|2.0|
+---+---+---+---+---+
i.e. with only the f2 column included (naturally, since the f1 results are simply overwritten).
Now, as I said, assuming that the situation is like this, and that your problem is actually how to have both columns f1 & f2 together rather than in different dataframes, you can just drop the df_select intermediate and append columns to your initial df, possibly dropping the unwanted ones afterwards:
init_cols = df.columns
init_cols
# ['p1', 'p2', 'p3', 'p4', 'p5']
new_df = df
for val in values.columns[:-1]:
    # we filter the data to work only with the columns that are defined for the selected Value
    defined_col = [i[0] for i in values.where(F.col(val) >= 0).select(values.index).collect()]
    # we retrieve the reference values and weights
    V = np.array(values.where(values.index.isin(defined_col)).select(val).collect()).flatten()
    W = np.array(weights.where(weights.index.isin(defined_col)).select(val).collect()).flatten()
    W_sum_udf = F.udf(lambda row: W_sum(row, V, W), FloatType())
    new_df = new_df.withColumn(val, W_sum_udf(F.array(*(F.col(x) for x in defined_col))))  # change here

# drop the initial columns:
for i in init_cols:
    new_df = new_df.drop(i)
The resulting new_df will be:
+---+---+
| f1| f2|
+---+---+
|2.0|1.0|
|1.0|2.0|
|2.0|0.0|
|0.0|0.0|
|0.0|2.0|
+---+---+
UPDATE (after comment): To force the division in your W_sum function to be a float, use:
from __future__ import division
new_df now will be:
+---------+----+
| f1| f2|
+---------+----+
| 2.0| 1.5|
|1.6666666|2.25|
|2.3333333|0.75|
| 0.0|0.75|
|0.6666667|2.25|
+---------+----+
with f2 exactly as it should be according to your comment.
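The __future__ import is only needed on Python 2; an alternative sketch (my own, not from the original answer) is to cast before dividing inside W_sum itself, which leaves the rest of the code unchanged:

def W_sum(row, v, w):
    # sum of the weights where the row values match the reference values,
    # cast to float before dividing so Python 2 performs true division
    matched = w[row == v]
    if len(matched) > 0:
        return float(np.sum(matched)) / float(len(w))
    else:
        return 0.0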

PySpark : change column names of a df based on relations defined in another df

I have two Spark data-frames loaded from csv, of the form:
mapping_fields (the df with mapped names):

new_name  old_name
A         aa
B         bb
C         cc

and

aa  bb  cc  dd
1   2   3   43
12  21  4   37

to be transformed into:

A   B   C   D
1   2   3
12  21  4

As dd didn't have any mapping in the original table, the D column should have all null values.
How can I do this without converting the mapping_df into a dictionary and checking individually for mapped names? (this would mean I have to collect the mapping_fields and check, which kind of contradicts my use-case of distributedly handling all the datasets)
Thanks!
With melt borrowed from here you could:
from pyspark.sql import functions as f

mapping_fields = spark.createDataFrame(
    [("A", "aa"), ("B", "bb"), ("C", "cc")],
    ("new_name", "old_name"))

df = spark.createDataFrame(
    [(1, 2, 3, 43), (12, 21, 4, 37)],
    ("aa", "bb", "cc", "dd"))

(melt(df.withColumn("id", f.monotonically_increasing_id()),
      id_vars=["id"], value_vars=df.columns, var_name="old_name")
    .join(mapping_fields, ["old_name"], "left_outer")
    .withColumn("value", f.when(f.col("new_name").isNotNull(), f.col("value")))
    .withColumn("new_name", f.coalesce("new_name", f.upper(f.col("old_name"))))
    .groupBy("id")
    .pivot("new_name")
    .agg(f.first("value"))
    .drop("id")
    .show())
+---+---+---+----+
| A| B| C| DD|
+---+---+---+----+
| 1| 2| 3|null|
| 12| 21| 4|null|
+---+---+---+----+
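The melt helper is only linked, not defined, above; a commonly used implementation (an assumption on my part about which version is meant) looks like this:

from pyspark.sql.functions import array, col, explode, lit, struct

def melt(df, id_vars, value_vars, var_name="variable", value_name="value"):
    # one (variable, value) struct per value column
    vars_and_vals = array(*(
        struct(lit(c).alias(var_name), col(c).alias(value_name))
        for c in value_vars))
    # explode to long format, then flatten the struct back into columns
    tmp = df.withColumn("_vars_and_vals", explode(vars_and_vals))
    cols = id_vars + [
        col("_vars_and_vals")[x].alias(x) for x in [var_name, value_name]]
    return tmp.select(*cols)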
but nothing in your description justifies this. Because the number of columns is fairly limited, I'd rather:
mapping = dict(
    mapping_fields
    .filter(f.col("old_name").isin(df.columns))
    .select("old_name", "new_name").collect())

df.select([
    (f.lit(None).cast(t) if c not in mapping else f.col(c)).alias(mapping.get(c, c.upper()))
    for (c, t) in df.dtypes]).show()
+---+---+---+----+
| A| B| C| DD|
+---+---+---+----+
| 1| 2| 3|null|
| 12| 21| 4|null|
+---+---+---+----+
At the end of the day you should use distributed processing when it provides performance or scalability improvements. Here it would do the opposite and make your code overly complicated.
To ignore no-matches:
(melt(df.withColumn("id", f.monotonically_increasing_id()),
      id_vars=["id"], value_vars=df.columns, var_name="old_name")
    .join(mapping_fields, ["old_name"])
    .groupBy("id")
    .pivot("new_name")
    .agg(f.first("value"))
    .drop("id")
    .show())
or
df.select([
    f.col(c).alias(mapping.get(c))
    for (c, t) in df.dtypes if c in mapping])
I tried with a simple for loop; hope this helps too.
from pyspark.sql import functions as F
l1 = [('A','aa'),('B','bb'),('C','cc')]
l2 = [(1,2,3,43),(12,21,4,37)]
df1 = spark.createDataFrame(l1,['new_name','old_name'])
df2 = spark.createDataFrame(l2,['aa','bb','cc','dd'])
>>> df1.show()
+--------+--------+
|new_name|old_name|
+--------+--------+
| A| aa|
| B| bb|
| C| cc|
+--------+--------+
>>> df2.show()
+---+---+---+---+
| aa| bb| cc| dd|
+---+---+---+---+
| 1| 2| 3| 43|
| 12| 21| 4| 37|
+---+---+---+---+
When you need the missing columns filled with null values:
>>> cols = df2.columns
>>> for i in cols:
...     val = df1.where(df1['old_name'] == i).first()
...     if val is not None:
...         df2 = df2.withColumnRenamed(i, val['new_name'])
...     else:
...         df2 = df2.withColumn(i, F.lit(None))
>>> df2.show()
+---+---+---+----+
| A| B| C| dd|
+---+---+---+----+
| 1| 2| 3|null|
| 12| 21| 4|null|
+---+---+---+----+
When we need only the mapped columns, change the else part to:
else:
    df2 = df2.drop(i)
>>> df2.show()
+---+---+---+
| A| B| C|
+---+---+---+
| 1| 2| 3|
| 12| 21| 4|
+---+---+---+
This will transform the original df2 dataframe though.
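If you'd rather leave df2 bound to the original data, a sketch of a variant (my own, using a hypothetical df_renamed name) that accumulates the changes in a separate variable:

df_renamed = df2
for i in df2.columns:
    val = df1.where(df1['old_name'] == i).first()
    if val is not None:
        # rename mapped columns
        df_renamed = df_renamed.withColumnRenamed(i, val['new_name'])
    else:
        # keep unmapped columns as nulls (or use .drop(i) to remove them)
        df_renamed = df_renamed.withColumn(i, F.lit(None))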
