Dynamically select the columns in a Spark dataframe - apache-spark

I have data like in the dataframe below. As you can see, there are columns "2019" and "2019_p", "2020" and "2020_p", "2021" and "2021_p".
I want to build the final columns dynamically: if "2019" is null, take the value of "2019_p"; if "2020" is null, take the value of "2020_p"; and the same for "2021", etc.
I want to select the columns dynamically without hardcoding the column names.
How do I achieve this?
I need output like this:

You can simplify ZygD's approach to a plain list comprehension with coalesce (no regex needed).
# assumes: from pyspark.sql import functions as func
# the following list can also be derived from the source dataframe, e.g.
# [k for k in data_sdf.columns if k.startswith('20') and not k.endswith('_p')]
year_cols = ['2019', '2020', '2021']

data_sdf. \
    select('id', 'type',
           *[func.coalesce(c, c + '_p').alias(c) for c in year_cols]
           ). \
    show()
# +---+----+----+----+----+
# | id|type|2019|2020|2021|
# +---+----+----+----+----+
# | 1| A| 50| 65| 40|
# | 1| B| 25| 75| 75|
# +---+----+----+----+----+
where the list comprehension would yield the following
[func.coalesce(c, c+'_p').alias(c) for c in year_cols]
# [Column<'coalesce(2019, 2019_p) AS `2019`'>,
# Column<'coalesce(2020, 2020_p) AS `2020`'>,
# Column<'coalesce(2021, 2021_p) AS `2021`'>]
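If you don't want to hardcode year_cols at all, the same idea can be written fully dynamically (a sketch assuming every fallback column is just the base column name plus an '_p' suffix):
# derive the (column, fallback) pairs straight from the schema
base_cols = [c for c in data_sdf.columns if c + '_p' in data_sdf.columns]
other_cols = [c for c in data_sdf.columns if c not in base_cols and not c.endswith('_p')]

data_sdf. \
    select(*other_cols,
           *[func.coalesce(c, c + '_p').alias(c) for c in base_cols]). \
    show()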

Input:
from pyspark.sql import functions as F
df = spark.createDataFrame(
    [(1, 'A', 50, None, 40, None, 65, None),
     (1, 'B', None, 75, None, 25, None, 75)],
    ['Id', 'Type', '2019', '2020', '2021', '2019_p', '2020_p', '2021_p'])
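For reference, df.show() on this input prints roughly the following (nulls may render as NULL on newer Spark versions):
df.show()
# +---+----+----+----+----+------+------+------+
# | Id|Type|2019|2020|2021|2019_p|2020_p|2021_p|
# +---+----+----+----+----+------+------+------+
# |  1|   A|  50|null|  40|  null|    65|  null|
# |  1|   B|null|  75|null|    25|  null|    75|
# +---+----+----+----+----+------+------+------+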
One way could be this - using df.colRegex:
cols = list({c[:4] for c in df.columns if c not in ['Id', 'Type']})
df = df.select(
    'Id', 'Type',
    *[F.coalesce(*df.select(df.colRegex(f'`^{c}.*`')).columns).alias(c) for c in cols]
)
df.show()
# +---+----+----+----+----+
# | Id|Type|2020|2019|2021|
# +---+----+----+----+----+
# | 1| A| 65| 50| 40|
# | 1| B| 75| 25| 75|
# +---+----+----+----+----+
Also possible using startswith:
cols = list({c[:4] for c in df.columns if c not in ['Id', 'Type']})
df = df.select(
    'Id', 'Type',
    *[F.coalesce(*[x for x in df.columns if x.startswith(c)]).alias(c) for c in cols]
)

If you need a one-liner, build a dictionary pairing each year column with its _p counterpart (via colRegex) and use the key/value pairs in coalesce (this assumes from pyspark.sql.functions import coalesce):
df.select('Id', 'Type', *[coalesce(k, v).alias(k) for k, v in dict(zip(df.select(df.colRegex("`\\d{4}`")).columns, df.select(df.colRegex("`.*\\_\\D$`")).columns)).items()]).show()
+---+----+----+----+----+
| Id|Type|2019|2020|2021|
+---+----+----+----+----+
| 1| A| 50| 65| 40|
| 1| B| 25| 75| 75|
+---+----+----+----+----+
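Spread over a few lines, the same pairing is easier to read (assuming, as here, that the year columns and their _p counterparts appear in matching order):
from pyspark.sql.functions import coalesce

year_cols = df.select(df.colRegex("`\\d{4}`")).columns         # ['2019', '2020', '2021']
fallback_cols = df.select(df.colRegex("`.*\\_\\D$`")).columns  # ['2019_p', '2020_p', '2021_p']

df.select(
    'Id', 'Type',
    *[coalesce(k, v).alias(k) for k, v in zip(year_cols, fallback_cols)]
).show()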

Related

Choose the column having more data

Using PySpark, I have to select whichever of two columns has more data (more non-null values) and keep it in my DataFrame.
For example, we have two columns A and B:
In the example, column B has more values, so I will keep it in my DataFrame for transformations. Similarly, I would take A if A had more values. I think we can do it using if/else conditions, but I'm not able to get the correct logic.
You could first aggregate the columns (count the non-null values in each). This gives you a single row, which you can extract as a dictionary using .head().asDict(). Then use Python's max(your_dict, key=your_dict.get) to get the dict's key having the max value (i.e. the name of the column with the most values), and select just that column.
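As a tiny plain-Python illustration of the max-on-a-dict step (hypothetical counts):
val_cnt = {'A': 3, 'B': 5}                 # hypothetical non-null counts per column
best_col = max(val_cnt, key=val_cnt.get)   # key with the largest value
print(best_col)                            # 'B'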
Example input:
from pyspark.sql import functions as F
df = spark.createDataFrame([(1, 7), (2, 4), (3, 7), (None, 8), (None, 4)], ['A', 'B'])
df.show()
# +----+---+
# | A| B|
# +----+---+
# | 1| 7|
# | 2| 4|
# | 3| 7|
# |null| 8|
# |null| 4|
# +----+---+
Scalable script using Python's built-in max:
val_cnt = df.agg(*[F.count(c).alias(c) for c in {'A', 'B'}]).head().asDict()
df = df.select(max(val_cnt, key=val_cnt.get))
df.show()
# +---+
# | B|
# +---+
# | 7|
# | 4|
# | 7|
# | 8|
# | 4|
# +---+
Script for just 2 columns (A and B):
head = df.agg(*[F.count(c).alias(c) for c in {'A', 'B'}]).head()
df = df.select('B' if head.B > head.A else 'A')
df.show()
# +---+
# | B|
# +---+
# | 7|
# | 4|
# | 7|
# | 8|
# | 4|
# +---+
Scalable script adjustable to more columns, without built-in max:
val_cnt = df.agg(*[F.count(c).alias(c) for c in {'A', 'B'}]).head().asDict()
key, val = '', -1
for k, v in val_cnt.items():
    if v > val:
        key, val = k, v
df = df.select(key)
df.show()
# +---+
# | B|
# +---+
# | 7|
# | 4|
# | 7|
# | 8|
# | 4|
# +---+
Create a dataframe with the data:
df = spark.createDataFrame(data=[(1,7),(2,4),(3,7),(4,8),(5,0),(6,0),(None,3),(None,5),(None,8),(None,4)],schema = ['A','B'])
Define a condition that picks the larger of A and B for each row:
from pyspark.sql.functions import *
import pyspark.sql.functions as fx
condition = fx.when((fx.col('A').isNotNull() & (fx.col('A')>fx.col('B'))),fx.col('A')).otherwise(fx.col('B'))
df_1 = df.withColumn('max_value_among_A_and_B',condition)
Print the dataframe
df_1.show()
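For the data above, df_1.show() should print roughly the following (null rendering varies by Spark version):
+----+---+-----------------------+
|   A|  B|max_value_among_A_and_B|
+----+---+-----------------------+
|   1|  7|                      7|
|   2|  4|                      4|
|   3|  7|                      7|
|   4|  8|                      8|
|   5|  0|                      5|
|   6|  0|                      6|
|null|  3|                      3|
|null|  5|                      5|
|null|  8|                      8|
|null|  4|                      4|
+----+---+-----------------------+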
Or, if you want to pick a whole column just based on its count of values, you can try this:
from pyspark.sql.functions import *
import pyspark.sql.functions as fx
df = spark.createDataFrame(data=[(1,7),(2,4),(3,7),(4,8),(5,0),(6,0),(None,3),(None,5),(None,8),(None,4)], schema=['A','B'])

# count() alone counts every row (including nulls), so count the non-null values explicitly
if df.filter(col('A').isNotNull()).count() > df.filter(col('B').isNotNull()).count():
    pickcolumn = 'A'
else:
    pickcolumn = 'B'

df_1 = df.withColumn('NewColumn', col(pickcolumn)).drop('A', 'B')
df_1.show()
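With the sample data above, B has 10 non-null values versus 6 for A, so the output should look roughly like this:
+---------+
|NewColumn|
+---------+
|        7|
|        4|
|        7|
|        8|
|        0|
|        0|
|        3|
|        5|
|        8|
|        4|
+---------+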

Drop a column with same name using column index in pyspark

This is my dataframe. I'm trying to drop the duplicate columns (those sharing the same name) using their index:
df = spark.createDataFrame([(1,2,3,4,5)],['c','b','a','a','b'])
df.show()
Output:
+---+---+---+---+---+
| c| b| a| a| b|
+---+---+---+---+---+
| 1| 2| 3| 4| 5|
+---+---+---+---+---+
I got the indices of the columns:
col_dict = {x: col for x, col in enumerate(df.columns)}
col_dict
Output:
{0: 'c', 1: 'b', 2: 'a', 3: 'a', 4: 'b'}
Now I need to drop the duplicated columns, keeping only one of each name.
There is no method for dropping columns by index. One way to achieve this is to rename the duplicate columns and then drop them.
Here is an example you can adapt:
df_cols = df.columns

# get the indices of the duplicated columns (the first occurrence of each)
duplicate_col_index = list(set([df_cols.index(c) for c in df_cols if df_cols.count(c) == 2]))

# rename them by adding the suffix '_duplicated'
for i in duplicate_col_index:
    df_cols[i] = df_cols[i] + '_duplicated'

# rename the columns in the DF (positionally)
df = df.toDF(*df_cols)

# remove the flagged columns
cols_to_remove = [c for c in df_cols if '_duplicated' in c]
df.drop(*cols_to_remove).show()
+---+---+---+
| c| a| b|
+---+---+---+
| 1| 4| 5|
+---+---+---+
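To handle a name that repeats more than twice (a generalization beyond this example, so treat it as a sketch), one option is to rename every occurrence except the last one before dropping:
# keeps only the last occurrence of each duplicated name, however often it repeats
df_cols = df.columns
new_names = []
for i, c in enumerate(df_cols):
    last_idx = len(df_cols) - 1 - df_cols[::-1].index(c)  # position of the last occurrence of c
    if df_cols.count(c) > 1 and i != last_idx:
        new_names.append(f'{c}_duplicated_{i}')           # unique throwaway name
    else:
        new_names.append(c)

cols_to_remove = [c for c in new_names if '_duplicated_' in c]
df.toDF(*new_names).drop(*cols_to_remove).show()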

spark join by different matching levels

I have two spark dataframes:
df1 = sc.parallelize([
    ['a', '1', 'value1'],
    ['b', '1', 'value2'],
    ['c', '2', 'value3'],
    ['d', '4', 'value4'],
    ['e', '2', 'value5'],
    ['f', '4', 'value6']
]).toDF(('id1', 'id2', 'v1'))

df2 = sc.parallelize([
    ['a', '1', 1],
    ['b', '1', 1],
    ['y', '2', 4],
    ['z', '2', 4]
]).toDF(('id1', 'id2', 'v2'))
Each of them has the fields id1 and id2 (and may contain many more id columns).
At first, I need to match df1 with df2 by id1.
Then, I need to match all unmatched records from both dataframes by id2, etc.
My way is:
def joinA(df1, df2, field):
    from pyspark.sql.functions import lit
    L = 'L_'
    R = 'R_'
    Lfield = L + field
    Rfield = R + field
    # taking the field names
    df1n = df1.schema.names
    df2n = df2.schema.names
    newL = [L + fld for fld in df1n]
    newR = [R + fld for fld in df2n]
    # drop duplicates by the input field
    df1 = df1.toDF(*newL).dropDuplicates([Lfield])
    df2 = df2.toDF(*newR).dropDuplicates([Rfield])
    # matching records
    df_full = df1.join(df2, df1[Lfield] == df2[Rfield], how='outer').cache()
    # unmatched records from df1
    df_left = df_full.where(df2[Rfield].isNull()).select(newL).toDF(*df1n)
    # unmatched records from df2
    df_right = df_full.where(df1[Lfield].isNull()).select(newR).toDF(*df2n)
    # matched records, with the match level added
    df_inner = df_full.where(
        (~df1[Lfield].isNull()) & (~df2[Rfield].isNull())
    ).withColumn('matchlevel', lit(field))
    return df_left, df_inner, df_right

first_l, first_i, first_r = joinA(df1, df2, 'id1')
second_l, second_i, second_r = joinA(first_l, first_r, 'id2')
result = first_i.union(second_i)
Is there a way to make it easier?
Or some standard tools for this job?
Thank you,
Maks
I have another way to do it, but I am not sure it is better than your solution:
from pyspark.sql import functions as F

id_cols = [cols for cols in df1.columns if cols != 'v1']

df1 = df1.withColumn("get_v2", F.lit(None))
df1 = df1.withColumn("match_level", F.lit(None))

for col in id_cols:
    new_df1 = df1.join(
        df2.select(col, "v2"),
        on=((df1[col] == df2[col]) & df1['get_v2'].isNull()),
        how='left'
    )
    new_df1 = new_df1.withColumn(
        "get_v2",
        F.coalesce(df1.get_v2, df2.v2)
    ).drop(df2[col]).drop(df2.v2)
    new_df1 = new_df1.withColumn(
        "match_level",
        F.when(F.col("get_v2").isNotNull(), F.coalesce(F.col("match_level"), F.lit(col)))
    )
    df1 = new_df1
df1.show()
+---+---+---+------+------+-----------+
|id1|id2|id3| v1|get_v2|match_level|
+---+---+---+------+------+-----------+
| f| 4| 1|value6| 3| id3|
| d| 4| 1|value4| 3| id3|
| c| 2| 1|value3| 4| id2|
| c| 2| 1|value3| 4| id2|
| e| 2| 1|value5| 4| id2|
| e| 2| 1|value5| 4| id2|
| b| 1| 1|value2| 1| id1|
| a| 1| 1|value1| 1| id1|
+---+---+---+------+------+-----------+
This will result in N joins, where N is the number of id columns you have.
EDIT: added match_level!

Explode 2 columns (2 lists) in the same time in pyspark [duplicate]

I have a dataframe which has one row, and several columns. Some of the columns are single values, and others are lists. All list columns are the same length. I want to split each list column into a separate row, while keeping any non-list column as is.
Sample DF:
from pyspark import Row
from pyspark.sql import SQLContext
from pyspark.sql.functions import explode
sqlc = SQLContext(sc)
df = sqlc.createDataFrame([Row(a=1, b=[1,2,3],c=[7,8,9], d='foo')])
# +---+---------+---------+---+
# | a| b| c| d|
# +---+---------+---------+---+
# | 1|[1, 2, 3]|[7, 8, 9]|foo|
# +---+---------+---------+---+
What I want:
+---+---+----+------+
| a| b| c | d |
+---+---+----+------+
| 1| 1| 7 | foo |
| 1| 2| 8 | foo |
| 1| 3| 9 | foo |
+---+---+----+------+
If I only had one list column, this would be easy by just doing an explode:
df_exploded = df.withColumn('b', explode('b'))
# >>> df_exploded.show()
# +---+---+---------+---+
# | a| b| c| d|
# +---+---+---------+---+
# | 1| 1|[7, 8, 9]|foo|
# | 1| 2|[7, 8, 9]|foo|
# | 1| 3|[7, 8, 9]|foo|
# +---+---+---------+---+
However, if I try to also explode the c column, I end up with a dataframe with a length the square of what I want:
df_exploded_again = df_exploded.withColumn('c', explode('c'))
# >>> df_exploded_again.show()
# +---+---+---+---+
# | a| b| c| d|
# +---+---+---+---+
# | 1| 1| 7|foo|
# | 1| 1| 8|foo|
# | 1| 1| 9|foo|
# | 1| 2| 7|foo|
# | 1| 2| 8|foo|
# | 1| 2| 9|foo|
# | 1| 3| 7|foo|
# | 1| 3| 8|foo|
# | 1| 3| 9|foo|
# +---+---+---+---+
What I want is, for each list column, to take the nth element of the array and add it to a new row. I've tried mapping an explode across all columns in the dataframe, but that doesn't seem to work either:
df_split = df.rdd.map(lambda col: df.withColumn(col, explode(col))).toDF()
Spark >= 2.4
You can replace the zip_ udf with the arrays_zip function:
from pyspark.sql.functions import arrays_zip, col, explode
(df
    .withColumn("tmp", arrays_zip("b", "c"))
    .withColumn("tmp", explode("tmp"))
    .select("a", col("tmp.b"), col("tmp.c"), "d"))
Spark < 2.4
With DataFrames and UDF:
from pyspark.sql.types import ArrayType, StructType, StructField, IntegerType
from pyspark.sql.functions import col, udf, explode
zip_ = udf(
    lambda x, y: list(zip(x, y)),
    ArrayType(StructType([
        # adjust the types to reflect your data types
        StructField("first", IntegerType()),
        StructField("second", IntegerType())
    ]))
)

(df
    .withColumn("tmp", zip_("b", "c"))
    # UDF output cannot be directly passed to explode
    .withColumn("tmp", explode("tmp"))
    .select("a", col("tmp.first").alias("b"), col("tmp.second").alias("c"), "d"))
With RDDs:
(df
    .rdd
    .flatMap(lambda row: [(row.a, b, c, row.d) for b, c in zip(row.b, row.c)])
    .toDF(["a", "b", "c", "d"]))
Both solutions are inefficient due to Python communication overhead. If data size is fixed you can do something like this:
from functools import reduce
from pyspark.sql import DataFrame

# length of the arrays
n = 3

# for legacy Python you'll need a separate function
# in place of the method accessor
reduce(
    DataFrame.unionAll,
    (df.select("a", col("b").getItem(i), col("c").getItem(i), "d")
     for i in range(n))
).toDF("a", "b", "c", "d")
or even:
from pyspark.sql.functions import array, struct

# SQL-level zip of arrays of known size, followed by explode
tmp = explode(array(*[
    struct(col("b").getItem(i).alias("b"), col("c").getItem(i).alias("c"))
    for i in range(n)
]))

(df
    .withColumn("tmp", tmp)
    .select("a", col("tmp").getItem("b"), col("tmp").getItem("c"), "d"))
This should be significantly faster compared to UDF or RDD. Generalized to support an arbitrary number of columns:
# this uses keyword-only arguments
# if you use legacy Python you'll have to change the signature,
# but the body of the function can stay the same
def zip_and_explode(*colnames, n):
    return explode(array(*[
        struct(*[col(c).getItem(i).alias(c) for c in colnames])
        for i in range(n)
    ]))

df.withColumn("tmp", zip_and_explode("b", "c", n=3))
You'd need to use flatMap, not map, since you want to make multiple output rows out of each input row.
from pyspark.sql import Row

def dualExplode(r):
    rowDict = r.asDict()
    bList = rowDict.pop('b')
    cList = rowDict.pop('c')
    for b, c in zip(bList, cList):
        newDict = dict(rowDict)
        newDict['b'] = b
        newDict['c'] = c
        yield Row(**newDict)

df_split = sqlContext.createDataFrame(df.rdd.flatMap(dualExplode))
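As a follow-up sketch, selecting the columns explicitly avoids relying on the field order that Row(**kwargs) produces (it can differ across Spark versions):
df_split.select("a", "b", "c", "d").show()
# +---+---+---+---+
# |  a|  b|  c|  d|
# +---+---+---+---+
# |  1|  1|  7|foo|
# |  1|  2|  8|foo|
# |  1|  3|  9|foo|
# +---+---+---+---+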
One liner (for Spark >= 2.4.0):
df.withColumn("bc", arrays_zip("b", "c")) \
    .select("a", explode("bc").alias("tbc")) \
    .select("a", col("tbc.b"), col("tbc.c")).show()
Imports required:
from pyspark.sql.functions import arrays_zip, col, explode
Steps -
Create a column bc which is an arrays_zip of columns b and c
Explode bc to get a struct tbc
Select the required columns a, b and c (all exploded as required).
Output:
> df.withColumn("bc", arrays_zip("b","c")).select("a", explode("bc").alias("tbc")).select("a", "tbc.b", col("tbc.c")).show()
+---+---+---+
| a| b| c|
+---+---+---+
| 1| 1| 7|
| 1| 2| 8|
| 1| 3| 9|
+---+---+---+

