Pyspark : Pass dynamic Column in UDF - python-3.x

I'm trying to pass the columns of a list to a UDF one by one in a for loop, but I get an error saying the data frame cannot find col_name. The list list_col currently holds two columns, but it can change, so I want code that works for any list of columns. The code concatenates one row of a column at a time; each row value is a struct (a list inside a list), and every null has to be replaced with a space.
list_col=['pcxreport','crosslinediscount']
def struct_generater12(row):
    list3 = []
    main_str = ''
    if(row is None):
        list3.append(' ')
    else:
        for i in row:
            temp = ''
            if(i is None):
                temp += ' '
            else:
                for j in i:
                    if (j is None):
                        temp += ' '
                    else:
                        temp += str(j)
            list3.append(temp)
    for k in list3:
        main_str += k
    return main_str

A = udf(struct_generater12, returnType=StringType())
# z = addlinterestdetail_FDF1.withColumn("Concated_pcxreport",A(addlinterestdetail_FDF1.pcxreport))

for i in range(0, len(list_col)-1):
    struct_col = 'Concate_'
    struct_col += list_col[i]
    col_name = list_col[i]
    z = addlinterestdetail_FDF1.withColumn(struct_col, A(addlinterestdetail_FDF1.col_name))
    struct_col = ''
    z.show()

addlinterestdetail_FDF1.col_name implies the column is literally named "col_name"; you're not accessing the string contained in the variable col_name.
When calling a UDF on a column, you can either use its string name directly: A(col_name)
or use the pyspark.sql function col:
import pyspark.sql.functions as psf
z = addlinterestdetail_FDF1.withColumn(struct_col,A(psf.col(col_name)))
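For example, the loop could then look like this (a minimal sketch, keeping the variable names from the question):
import pyspark.sql.functions as psf

z = addlinterestdetail_FDF1
for col_name in list_col:
    # pass the column by name via psf.col instead of attribute access
    z = z.withColumn('Concate_' + col_name, A(psf.col(col_name)))
z.show()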
You should consider using pyspark sql functions for concatenation instead of writing a UDF. First let's create a sample dataframe with nested structures:
import json
j = {'pcxreport':{'a': 'a', 'b': 'b'}, 'crosslinediscount':{'c': 'c', 'd': None, 'e': 'e'}}
jsonRDD = sc.parallelize([json.dumps(j)])
df = spark.read.json(jsonRDD)
df.printSchema()
df.show()
root
|-- crosslinediscount: struct (nullable = true)
| |-- c: string (nullable = true)
| |-- d: string (nullable = true)
| |-- e: string (nullable = true)
|-- pcxreport: struct (nullable = true)
| |-- a: string (nullable = true)
| |-- b: string (nullable = true)
+-----------------+---------+
|crosslinediscount|pcxreport|
+-----------------+---------+
| [c,null,e]| [a,b]|
+-----------------+---------+
We'll write a dictionary with nested column names:
list_col = ['pcxreport', 'crosslinediscount']
list_subcols = dict()
for c in list_col:
    list_subcols[c] = df.select(c + '.*').columns
Now we can "flatten" the StructType, replace None with ' ', and concatenate:
import itertools
import pyspark.sql.functions as psf
df.select([c + '.*' for c in list_col])\
    .na.fill({c: ' ' for c in list(itertools.chain.from_iterable(list_subcols.values()))})\
    .select([psf.concat(*sc).alias(c) for c, sc in list_subcols.items()])\
    .show()
+---------+-----------------+
|pcxreport|crosslinediscount|
+---------+-----------------+
| ab| c e|
+---------+-----------------+
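If you prefer to keep the original columns and add the concatenated ones alongside them (closer to the withColumn approach in the question), a minimal sketch reusing list_subcols could be:
concat_cols = [
    psf.concat(
        *[psf.coalesce(psf.col(c + '.' + sub), psf.lit(' ')) for sub in subcols]
    ).alias('Concate_' + c)
    for c, subcols in list_subcols.items()
]
df2 = df.select('*', *concat_cols)
df2.show()
Here coalesce replaces a null sub-field with a space, which plays the same role as na.fill above.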

Related

Reorder PySpark dataframe columns on specific sort logic

I have a PySpark dataframe with the below column order. I need to order it as per the 'branch'. How do I do it? df.select(sorted(df.columns)) doesn't seem to work the way I want.
Existing column order:
store_id,
store_name,
month_1_branch_A_profit,
month_1_branch_B_profit,
month_1_branch_C_profit,
month_1_branch_D_profit,
month_2_branch_A_profit,
month_2_branch_B_profit,
month_2_branch_C_profit,
month_2_branch_D_profit,
.
.
month_12_branch_A_profit,
month_12_branch_B_profit,
month_12_branch_C_profit,
month_12_branch_D_profit
Desired column order:
store_id,
store_name,
month_1_branch_A_profit,
month_2_branch_A_profit,
month_3_branch_A_profit,
month_4_branch_A_profit,
.
.
month_12_branch_A_profit,
month_1_branch_B_profit,
month_2_branch_B_profit,
month_3_branch_B_profit,
.
.
month_12_branch_B_profit,
..
You could manually build your list of columns.
col_fmt = 'month_{}_branch_{}_profit'
cols = ['store_id', 'store_name']
for branch in ['A', 'B', 'C', 'D']:
    for i in range(1, 13):
        cols.append(col_fmt.format(i, branch))
df.select(cols)
Alternatively, I'd recommend building a better dataframe that takes advantage of array + struct/map datatypes. E.g.
months - array (size 12)
  - branches: map<string, struct>
    - key: string (branch name)
    - value: struct
      - profit: float
This way, arrays would already be "sorted". Map order doesn't really matter, and it makes SQL queries specific to certain months and branches easier to read (and probably faster with predicate pushdowns)
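As a rough illustration, here is a minimal sketch of how such a nested layout could be built from the wide month_X_branch_Y_profit columns in the question (column names assumed as above):
import pyspark.sql.functions as F

# one array entry per month; each entry holds a map of branch name -> struct(profit)
months = F.array(*[
    F.struct(
        F.create_map(*[
            part
            for b in ['A', 'B', 'C', 'D']
            for part in (F.lit(b),
                         F.struct(F.col(f'month_{m}_branch_{b}_profit').alias('profit')))
        ]).alias('branches')
    )
    for m in range(1, 13)
])
df_nested = df.select('store_id', 'store_name', months.alias('months'))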
You may need to use some Python code. In the following script I split the column names on the underscore _ and then sort them by element [3] (branch name) and element [1] (month value).
Input df:
cols = ['store_id',
        'store_name',
        'month_1_branch_A_profit',
        'month_1_branch_B_profit',
        'month_1_branch_C_profit',
        'month_1_branch_D_profit',
        'month_2_branch_A_profit',
        'month_2_branch_B_profit',
        'month_2_branch_C_profit',
        'month_2_branch_D_profit',
        'month_12_branch_A_profit',
        'month_12_branch_B_profit',
        'month_12_branch_C_profit',
        'month_12_branch_D_profit']
df = spark.createDataFrame([], ','.join([f'{c} int' for c in cols]))
Script:
branch_cols = [c for c in df.columns if c not in {'store_id', 'store_name'}]
d = {tuple(c.split('_')): c for c in branch_cols}

df = df.select(
    'store_id', 'store_name',
    *[d[c] for c in sorted(d, key=lambda x: f'{x[3]}_{int(x[1]):02}')]
)
df.printSchema()
# root
# |-- store_id: integer (nullable = true)
# |-- store_name: integer (nullable = true)
# |-- month_1_branch_A_profit: integer (nullable = true)
# |-- month_2_branch_A_profit: integer (nullable = true)
# |-- month_12_branch_A_profit: integer (nullable = true)
# |-- month_1_branch_B_profit: integer (nullable = true)
# |-- month_2_branch_B_profit: integer (nullable = true)
# |-- month_12_branch_B_profit: integer (nullable = true)
# |-- month_1_branch_C_profit: integer (nullable = true)
# |-- month_2_branch_C_profit: integer (nullable = true)
# |-- month_12_branch_C_profit: integer (nullable = true)
# |-- month_1_branch_D_profit: integer (nullable = true)
# |-- month_2_branch_D_profit: integer (nullable = true)
# |-- month_12_branch_D_profit: integer (nullable = true)

How to convert all int dtypes to double simultaneously on PySpark

here's my dataset
DataFrame[column1: double, column2: double, column3: int, column4: int, column5: int, ... , column300: int]
What I want is
DataFrame[column1: double, column2: double, column3: double, column4: double, column5: double, ... , column300: double]
What I did
dataset.withColumn("column3", datalabel.column3.cast(DoubleType()))
This is too manual. Can you show me how to do it for all the int columns at once?
You can use list comprehensions to construct the converted field list.
import pyspark.sql.functions as F
...
cols = [F.col(field[0]).cast('double') if field[1] == 'int' else F.col(field[0]) for field in df.dtypes]
df = df.select(cols)
df.printSchema()
You first need to filter out the int column types from your schema.
Then, in conjunction with reduce, you can iterate over the DataFrame and cast them to the type of your choice.
reduce is a very useful function for handling this kind of iterative use case in Spark in general.
Data Preparation
import pandas as pd
import pyspark.sql.functions as F
from functools import reduce
from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType

sql = SparkSession.builder.getOrCreate()  # SparkSession, named `sql` in this answer

df = pd.DataFrame({
    'id': [f'id{i}' for i in range(0, 10)],
    'col1': [i for i in range(80, 90)],
    'col2': [i for i in range(5, 15)],
    'col3': [6, 7, 5, 3, 4, 2, 9, 12, 4, 10]
})

sparkDF = sql.createDataFrame(df)
sparkDF.printSchema()
root
|-- id: string (nullable = true)
|-- col1: long (nullable = true)
|-- col2: long (nullable = true)
|-- col3: long (nullable = true)
Identification
sparkDF.dtypes
## [('id', 'string'), ('col1', 'bigint'), ('col2', 'bigint'), ('col3', 'bigint')]
long_double_list = [ col for col,dtyp in sparkDF.dtypes if dtyp == 'bigint' ]
long_double_list
## ['col1', 'col2', 'col3']
Reduce
sparkDF = reduce(
    lambda df, c: df.withColumn(c, F.col(c).cast(DoubleType())),
    long_double_list,
    sparkDF
)
sparkDF.printSchema()
root
|-- id: string (nullable = true)
|-- col1: double (nullable = true)
|-- col2: double (nullable = true)
|-- col3: double (nullable = true)
VectorAssembler converts integer values to floating point values in multiple columns. You can separate a vector column into columns and rename the columns as below.
import numpy as np
import pandas as pd
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.functions import vector_to_array
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
spark = SparkSession.builder.getOrCreate()
# Create an input DataFrame
N_COLUMNS = 3
pdf_int = pd.DataFrame(
    data={f"column{x}": [x, x*2] for x in range(1, N_COLUMNS+1)},
    dtype=np.int64)
pdf_double = pd.DataFrame(
    data={f"column{x}": [x+0.5, x*2+0.5] for x in range(N_COLUMNS+1, N_COLUMNS*2+1)},
    dtype=np.float64)
pdf_input = pd.concat([pdf_int, pdf_double], axis=1)
df_input = spark.createDataFrame(pdf_input)
col_names = df_input.columns
df_input.show()
# Convert all integer columns to double and leave double values unchanged
df_output = VectorAssembler().setInputCols(col_names).setOutputCol("vector") \
    .transform(df_input).select("vector") \
    .withColumn("array", vector_to_array("vector")).select("array") \
    .select([col("array")[i] for i in range(len(col_names))]) \
    .toDF(*col_names)
df_output.show()
print(type(df_output))
# Verify all values are equal except types between the input and output
pdf_output = df_output.toPandas()
assert pdf_input.astype("float64").equals(pdf_output)
assert df_input.schema != df_output.schema

Modify nested property inside Struct column with PySpark

I want to modify/filter on a property inside a struct.
Let's say I have a dataframe with the following column :
#+------------------------------------------+
#| arrayCol |
#+------------------------------------------+
#| {"a" : "some_value", "b" : [1, 2, 3]} |
#+------------------------------------------+
Schema:
struct<a:string, b:array<int>>
I want to filter out some values in the 'b' property where the value inside the array == 1.
The result desired is the following :
#+------------------------------------------+
#| arrayCol |
#+------------------------------------------+
#| {"a" : "some_value", "b" : [2, 3]} |
#+------------------------------------------+
Is it possible to do this without extracting the property, filtering the values, and rebuilding another struct?
Update:
For Spark 3.1+, withField can be used to update the struct column without having to recreate the whole struct. In your case, you can update the field b using the filter function to filter the array values, like this:
import pyspark.sql.functions as F
df1 = df.withColumn(
    'arrayCol',
    F.col('arrayCol').withField('b', F.filter(F.col("arrayCol.b"), lambda x: x != 1))
)
df1.show()
#+--------------------+
#| arrayCol|
#+--------------------+
#|{some_value, [2, 3]}|
#+--------------------+
For older versions, Spark doesn’t support adding/updating fields in nested structures. To update a struct column, you'll need to create a new struct using the existing fields and the updated ones:
import pyspark.sql.functions as F
df1 = df.withColumn(
    "arrayCol",
    F.struct(
        F.col("arrayCol.a").alias("a"),
        F.expr("filter(arrayCol.b, x -> x != 1)").alias("b")
    )
)
One way would be to define a UDF:
Example:
import ast
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import StringType, MapType

def remove_value(col):
    col["b"] = str([x for x in ast.literal_eval(col["b"]) if x != 1])
    return col

if __name__ == "__main__":
    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [
            {
                "arrayCol": {
                    "a": "some_value",
                    "b": "[1, 2, 3]",
                },
            },
        ]
    )
    remove_value_udf = spark.udf.register(
        "remove_value_udf", remove_value, MapType(StringType(), StringType())
    )
    df = df.withColumn(
        "result",
        remove_value_udf(F.col("arrayCol")),
    )
Result:
root
|-- arrayCol: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
|-- result: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
+---------------------------------+------------------------------+
|arrayCol |result |
+---------------------------------+------------------------------+
|{a -> some_value, b -> [1, 2, 3]}|{a -> some_value, b -> [2, 3]}|
+---------------------------------+------------------------------+

How to split dataframe into multiple dataframes by their column datatypes using SparkSQL?

Below is the sample dataframe; I want to split it into multiple dataframes or RDDs based on the column datatypes:
ID:Int
Name:String
Joining_Date: Date
I have 100+ columns in my data frame. Is there any inbuilt method to achieve this?
As far as I know there is no built-in functionality to achieve that; nevertheless, here is a way to separate one dataframe into multiple dataframes based on the column type.
First let's create some data:
from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, StringType, LongType, DateType
df = spark.createDataFrame([
    (0, 11, "t1", "s1", "2019-10-01"),
    (0, 22, "t2", "s2", "2019-02-11"),
    (1, 23, "t3", "s3", "2018-01-10"),
    (1, 24, "t4", "s4", "2019-10-01")], ["i1", "i2", "s1", "s2", "date"])
df = df.withColumn("date", col("date").cast("date"))
# df.printSchema()
# root
# |-- i1: long (nullable = true)
# |-- i2: long (nullable = true)
# |-- s1: string (nullable = true)
# |-- s2: string (nullable = true)
# |-- date: date (nullable = true)
Then we will group the columns of the previous dataframe into a dictionary where the key will be the column type and the value a list of the columns of that type:
d = {}
# group cols into a dict by type
for c in df.schema:
    key = c.dataType
    if key not in d.keys():
        d[key] = [c.name]
    else:
        d[key].append(c.name)
d
# {DateType: ['date'], StringType: ['s1', 's2'], LongType: ['i1', 'i2']}
Then we iterate through the keys (column types) and generate the schema, along with a corresponding empty dataframe, for each item of the dictionary:
type_dfs = {}
# create schema for each type
# create schema for each type
for k in d.keys():
    schema = StructType([
        StructField(cname, k) for cname in d[k]
    ])
    # finally create an empty df with that schema
    type_dfs[str(k)] = spark.createDataFrame(sc.emptyRDD(), schema)
type_dfs
# {'DateType': DataFrame[date: date],
# 'StringType': DataFrame[s1: string, s2: string],
# 'LongType': DataFrame[i1: bigint, i2: bigint]}
Finally we can use the generated dataframes by accessing each item of the type_dfs:
type_dfs['StringType'].printSchema()
# root
# |-- s1: string (nullable = true)
# |-- s2: string (nullable = true)
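Note that these dataframes are empty. If you also want the data from the original dataframe split by type, a minimal sketch (reusing the d dictionary built above) is to select the grouped columns directly:
# populate one dataframe per type with the corresponding columns of df
type_dfs_with_data = {str(k): df.select(cols) for k, cols in d.items()}
for name, tdf in type_dfs_with_data.items():
    print(name)
    tdf.show()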

How to add column to exploded struct in Spark?

Say I have the following data:
{"id":1, "payload":[{"foo":1, "lol":2},{"foo":2, "lol":2}]}
I would like to explode the payload and add a column to it, like this:
df = df.select('id', F.explode('payload').alias('data'))
df = df.withColumn('data.bar', F.col('data.foo') * 2)
However this results in a dataframe with three columns:
id
data
data.bar
I expected the data.bar to be part of the data struct...
How can I add a column to the exploded struct, instead of adding a top-level column?
df = df.withColumn('data', f.struct(
    df['data']['foo'].alias('foo'),
    (df['data']['foo'] * 2).alias('bar')
))
This will result in:
root
|-- id: long (nullable = true)
|-- data: struct (nullable = false)
| |-- foo: long (nullable = true)
| |-- bar: long (nullable = true)
UPDATE:
from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, StringType

def func(x):
    tmp = x.asDict()
    tmp['foo'] = tmp.get('foo', 0) * 100
    res = list(zip(*tmp.items()))  # materialize so it can be indexed in Python 3
    return Row(*res[0])(*res[1])

df = df.withColumn('data', f.UserDefinedFunction(func, StructType(
    [StructField('foo', StringType()), StructField('lol', StringType())]))(df['data']))
P.S.
Spark generally does not support in-place operations, so whenever you want to modify a column "in place", you actually have to replace it.
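As a side note, on Spark 3.1+ the same result can be obtained more directly with Column.withField (a minimal sketch along the lines of the original example):
import pyspark.sql.functions as F

df = df.select('id', F.explode('payload').alias('data'))
# add 'bar' inside the 'data' struct instead of as a top-level column
df = df.withColumn('data', F.col('data').withField('bar', F.col('data.foo') * 2))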
