Modify nested property inside Struct column with PySpark - apache-spark

I want to modify/filter on a property inside a struct.
Let's say I have a dataframe with the following column:
#+------------------------------------------+
#| arrayCol |
#+------------------------------------------+
#| {"a" : "some_value", "b" : [1, 2, 3]} |
#+------------------------------------------+
Schema:
struct<a:string, b:array<int>>
I want to filter out values in the 'b' property where the value inside the array == 1.
The desired result is the following:
#+------------------------------------------+
#| arrayCol |
#+------------------------------------------+
#| {"a" : "some_value", "b" : [2, 3]} |
#+------------------------------------------+
Is it possible to do this without extracting the property, filtering the values, and re-building another struct?

Update:
For Spark 3.1+, withField can be used to update the struct column without having to recreate the whole struct. In your case, you can update the field b using the filter function to filter the array values, like this:
import pyspark.sql.functions as F

df1 = df.withColumn(
    'arrayCol',
    F.col('arrayCol').withField('b', F.filter(F.col("arrayCol.b"), lambda x: x != 1))
)
df1.show()
#+--------------------+
#|            arrayCol|
#+--------------------+
#|{some_value, [2, 3]}|
#+--------------------+
For older versions, Spark doesn’t support adding/updating fields in nested structures. To update a struct column, you'll need to create a new struct using the existing fields and the updated ones:
import pyspark.sql.functions as F

df1 = df.withColumn(
    "arrayCol",
    F.struct(
        F.col("arrayCol.a").alias("a"),
        F.expr("filter(arrayCol.b, x -> x != 1)").alias("b")
    )
)

One way would be to define a UDF:
Example:
import ast
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import StringType, MapType
def remove_value(col):
    col["b"] = str([x for x in ast.literal_eval(col["b"]) if x != 1])
    return col

if __name__ == "__main__":
    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [
            {
                "arrayCol": {
                    "a": "some_value",
                    "b": "[1, 2, 3]",
                },
            },
        ]
    )
    remove_value_udf = spark.udf.register(
        "remove_value_udf", remove_value, MapType(StringType(), StringType())
    )
    df = df.withColumn(
        "result",
        remove_value_udf(F.col("arrayCol")),
    )
    df.printSchema()
    df.show(truncate=False)
Result:
root
|-- arrayCol: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
|-- result: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
+---------------------------------+------------------------------+
|arrayCol                         |result                        |
+---------------------------------+------------------------------+
|{a -> some_value, b -> [1, 2, 3]}|{a -> some_value, b -> [2, 3]}|
+---------------------------------+------------------------------+
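For reference, if arrayCol actually has the struct<a:string, b:array<int>> schema from the question (rather than the string-valued map used above), a UDF can also return a struct directly. A minimal sketch, assuming that schema:
import pyspark.sql.functions as F
from pyspark.sql.types import StructType, StructField, StringType, ArrayType, IntegerType

result_schema = StructType([
    StructField("a", StringType()),
    StructField("b", ArrayType(IntegerType())),
])

@F.udf(returnType=result_schema)
def remove_ones(s):
    # s is a Row with fields a and b; return a tuple matching result_schema
    return (s["a"], [x for x in s["b"] if x != 1])

df = df.withColumn("result", remove_ones(F.col("arrayCol")))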

Related

In spark dataframe for map column how to update values with a constant for all keys

I have a Spark dataframe with two columns, of type Integer and Map, and I wanted to know the best way to update the values for all the keys of the map column.
With the help of a UDF, I am able to update the values:
def modifyValues = (map_data: Map[String, Int]) => {
  val divideWith = 10
  map_data.mapValues(_ / divideWith)
}
val modifyMapValues = udf(modifyValues)
df.withColumn("updatedValues", modifyMapValues($"data_map"))
scala> dF.printSchema()
root
|-- id: integer (nullable = true)
|-- data_map: map (nullable = true)
| |-- key: string
| |-- value: integer (valueContainsNull = true)
Sample data:
val ds = Seq(
  (1, Map("foo" -> 100, "bar" -> 200)),
  (2, Map("foo" -> 200)),
  (3, Map("bar" -> 200))
).toDF("id", "data_map")
Expected output:
+---+----------------------+
|id |data_map              |
+---+----------------------+
|1  |[foo -> 10, bar -> 20]|
|2  |[foo -> 20]           |
|3  |[bar -> 20]           |
+---+----------------------+
I wanted to know: is there any way to do this without a UDF?
One possible way to do it (without a UDF) is this:
extract the keys using map_keys into an array
extract the values using map_values into an array
transform the extracted values using TRANSFORM (available since Spark 2.4)
re-create the map using map_from_arrays
import org.apache.spark.sql.functions.{expr, map_from_arrays, map_values, map_keys}
ds
  .withColumn("values", map_values($"data_map"))
  .withColumn("keys", map_keys($"data_map"))
  .withColumn("values_transformed", expr("TRANSFORM(values, v -> v/10)"))
  .withColumn("data_map_transformed", map_from_arrays($"keys", $"values_transformed"))
import pyspark.sql.functions as sp
from pyspark.sql.types import StringType, FloatType, MapType
Add a new key with any value:
my_update_udf = sp.udf(lambda x: {**x, 'new_key': 77.0}, MapType(StringType(), FloatType()))
sdf = sdf.withColumn('updated', my_update_udf(sp.col('to_be_updated')))
Update value for all/one key(s):
my_update_udf = sp.udf(lambda x: {k: v/77 for k, v in x.items() if v is not None and k == 'my_key'}, MapType(StringType(), FloatType()))
sdf = sdf.withColumn('updated', my_update_udf(sp.col('to_be_updated')))
There is another way available in Spark 3:
Seq(
  Map("keyToUpdate" -> 11, "someOtherKey" -> 12),
  Map("keyToUpdate" -> 21, "someOtherKey" -> 22)
).toDF("mapColumn")
  .withColumn(
    "mapColumn",
    map_concat(
      map(lit("keyToUpdate"), col("mapColumn.keyToUpdate") * 10), // <- transformation
      map_filter(col("mapColumn"), (k, _) => k =!= "keyToUpdate")
    )
  )
  .show(false)
output:
+----------------------------------------+
|mapColumn                               |
+----------------------------------------+
|{someOtherKey -> 12, keyToUpdate -> 110}|
|{someOtherKey -> 22, keyToUpdate -> 210}|
+----------------------------------------+
map_filter(expr, func) - Filters entries in a map using the function
map_concat(map, ...) - Returns the union of all the given maps
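A PySpark sketch of the same approach, assuming Spark 3.1+ so that map_filter is available in the Python API, and the same mapColumn name:
import pyspark.sql.functions as F

df.withColumn(
    "mapColumn",
    F.map_concat(
        F.create_map(F.lit("keyToUpdate"), F.col("mapColumn.keyToUpdate") * 10),  # <- transformation
        F.map_filter("mapColumn", lambda k, _: k != "keyToUpdate"),
    )
)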

Using flatMap / reduce: dealing with rows containing a list of rows

I have a dataframe containing an array of rows in each row
I want to aggregate all the inner rows into one dataframe
Below is what I have / achieved:
This
df.select('*').take(1)
Gives me this:
[
Row(
body=[
Row(a=1, b=1),
Row(a=2, b=2)
]
)
]
So doing this:
df.rdd.flatMap(lambda x: x).collect()
I get this:
[[
Row(a=1, b=1)
Row(a=2, b=2)
]]
So I am forced to do this:
df.rdd.flatMap(lambda x: x).flatMap(lambda x: x)
So I can achieve the below:
[
Row(a=1, b=1)
Row(a=2, b=2)
]
Using the result above, I can finally convert it to a dataframe and save it somewhere, which is what I want. But calling flatMap twice doesn't look right.
I tried to do the same using reduce, as in the following code:
flatRdd = df.rdd.flatMap(lambda x: x)
dfMerged = reduce(DataFrame.unionByName, [flatRdd])
The second argument of reduce must be iterable, so I was forced to add [flatRdd]. Sadly it gives me this:
[[
Row(a=1, b=1)
Row(a=2, b=2)
]]
There is certainly a better way to achieve what I want.
IIUC, you can explode and then flatten the resulting Rows using the .* syntax.
Suppose you start with the following DataFrame:
df.show()
#+----------------+
#|            body|
#+----------------+
#|[[1, 1], [2, 2]]|
#+----------------+
with the schema:
df.printSchema()
#root
# |-- body: array (nullable = true)
# | |-- element: struct (containsNull = true)
# | | |-- a: long (nullable = true)
# | | |-- b: long (nullable = true)
You can first explode the body column:
from pyspark.sql.functions import explode
df = df.select(explode("body").alias("exploded"))
df.show()
#+--------+
#|exploded|
#+--------+
#| [1, 1]|
#| [2, 2]|
#+--------+
Now flatten the exploded column:
df = df.select("exploded.*")
df.show()
#+---+---+
#| a| b|
#+---+---+
#| 1| 1|
#| 2| 2|
#+---+---+
Now if you were to call collect, you'd get the desired output:
print(df.collect())
#[Row(a=1, b=1), Row(a=2, b=2)]
See also:
Querying Spark SQL DataFrame with complex types
You don't need to run flatMap() on the Row object; just refer to it directly with the key:
>>> df.rdd.flatMap(lambda x: x.body).collect()
[Row(a=1, b=1), Row(a=2, b=2)]
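Since the goal was to convert the result back into a dataframe and save it, the flattened RDD of Rows can also be turned into a DataFrame directly; a minimal sketch, assuming an active SparkSession:
flat_df = df.rdd.flatMap(lambda x: x.body).toDF()
# flat_df now has columns a and b, matching the explode/select approach above
flat_df.show()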

How to use Spark SQL SPLIT function to pass input to Spark SQL IN parameter [duplicate]

I have a dataframe with two columns (one string and one array of strings):
root
|-- user: string (nullable = true)
|-- users: array (nullable = true)
| |-- element: string (containsNull = true)
How can I filter the dataframe so that the resulting dataframe only contains rows where user is in users?
Quick and simple:
import org.apache.spark.sql.functions.expr
df.where(expr("array_contains(users, user)"))
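The PySpark equivalent is essentially the same expression (a sketch, assuming the same user and users column names):
import pyspark.sql.functions as F

df.where(F.expr("array_contains(users, user)"))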
Sure, it's possible and not so hard. To achieve this you may use a UDF:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
val df = sc.parallelize(Array(
  ("1", Array("1", "2", "3")),
  ("2", Array("1", "2", "2", "3")),
  ("3", Array("1", "2"))
)).toDF("user", "users")
val inArray = udf((id: String, array: scala.collection.mutable.WrappedArray[String]) => array.contains(id), BooleanType)
df.where(inArray($"user", $"users")).show()
The output is:
+----+------------+
|user|       users|
+----+------------+
|   1|   [1, 2, 3]|
|   2|[1, 2, 2, 3]|
+----+------------+

How to use UDF to return multiple columns?

Is it possible to create a UDF which would return the set of columns?
I.e. having a data frame as follows:
| Feature1 | Feature2 | Feature 3 |
| 1.3 | 3.4 | 4.5 |
Now I would like to extract a new feature, which can be described as a vector of let's say two elements (e.g. as seen in a linear regression - slope and offset). Desired dataset shall look as follows:
| Feature1 | Feature2 | Feature 3 | Slope | Offset |
| 1.3 | 3.4 | 4.5 | 0.5 | 3 |
Is it possible to create multiple columns with a single UDF, or do I need to follow the rule: "single column per single UDF"?
Struct method
You can define the udf function as
def myFunc: (String => (String, String)) = { s => (s.toLowerCase, s.toUpperCase)}
import org.apache.spark.sql.functions.udf
val myUDF = udf(myFunc)
and use .* as
val newDF = df.withColumn("newCol", myUDF(df("Feature2"))).select("Feature1", "Feature2", "Feature 3", "newCol.*")
A Tuple2 is returned here for testing purposes (higher-arity tuples can be used depending on how many columns are required); the udf function's return value is treated as a struct column. You can then use .* to select all of its elements as separate columns and finally rename them.
You should have output as
+--------+--------+---------+---+---+
|Feature1|Feature2|Feature 3|_1 |_2 |
+--------+--------+---------+---+---+
|1.3     |3.4     |4.5      |3.4|3.4|
+--------+--------+---------+---+---+
You can then rename _1 and _2, for example with withColumnRenamed or by selecting them with alias.
Array method
udf function should return an array
def myFunc: (String => Array[String]) = { s => Array(s.toLowerCase, s.toUpperCase)}
import org.apache.spark.sql.functions.udf
val myUDF = udf(myFunc)
And then you can select the elements of the array and use alias to rename them:
val newDF = df.withColumn("newCol", myUDF(df("Feature2"))).select($"Feature1", $"Feature2", $"Feature 3", $"newCol"(0).as("Slope"), $"newCol"(1).as("Offset"))
You should have
+--------+--------+---------+-----+------+
|Feature1|Feature2|Feature 3|Slope|Offset|
+--------+--------+---------+-----+------+
|1.3     |3.4     |4.5      |3.4  |3.4   |
+--------+--------+---------+-----+------+
Also, you can return a case class:
case class NewFeatures(slope: Double, offset: Int)

val getNewFeatures = udf { s: String =>
  NewFeatures(???, ???)
}

df
  .withColumn("newF", getNewFeatures($"Feature1"))
  .select($"Feature1", $"Feature2", $"Feature3", $"newF.slope", $"newF.offset")
The answers above are missing an explanation of how to assign the multiple values in the case class to several columns in the dataframe.
So, in summary, here is a complete example in Scala:
import org.apache.spark.sql.functions.udf
val df = Seq((1L, 3.0, "a"), (2L, -1.0, "b"), (3L, 0.0, "c")).toDF("x", "y", "z")
case class Foobar(foo: Double, bar: Double)
val foobarUdf = udf((x: Long, y: Double, z: String) =>
  Foobar(x * y, z.head.toInt * y))
val df1 = df
  .withColumn("foo", foobarUdf($"x", $"y", $"z").getField("foo"))
  .withColumn("bar", foobarUdf($"x", $"y", $"z").getField("bar"))
If you check the schema of the df1 dataframe, you'll get
scala> df1.printSchema
root
|-- x: long (nullable = false)
|-- y: double (nullable = false)
|-- z: string (nullable = true)
|-- foo: double (nullable = true)
|-- bar: double (nullable = true)

Pyspark : Pass dynamic Column in UDF

I am trying to send the columns of a list to a UDF one by one using a for loop, but I am getting an error, i.e. the dataframe cannot find col_name. Currently the list list_col has two columns, but it can change, so I want to write code that works for any list of columns. In this code I am concatenating one row of a column at a time; the row value is in struct format, i.e. a list inside a list. For every null I have to insert a space.
list_col=['pcxreport','crosslinediscount']
def struct_generater12(row):
    list3 = []
    main_str = ''
    if(row is None):
        list3.append(' ')
    else:
        for i in row:
            temp = ''
            if(i is None):
                temp += ' '
            else:
                for j in i:
                    if (j is None):
                        temp += ' '
                    else:
                        temp += str(j)
            list3.append(temp)
    for k in list3:
        main_str += k
    return main_str

A = udf(struct_generater12, returnType=StringType())
# z = addlinterestdetail_FDF1.withColumn("Concated_pcxreport",A(addlinterestdetail_FDF1.pcxreport))
for i in range(0, len(list_col)-1):
    struct_col = 'Concate_'
    struct_col += list_col[i]
    col_name = list_col[i]
    z = addlinterestdetail_FDF1.withColumn(struct_col, A(addlinterestdetail_FDF1.col_name))
    struct_col = ''
z.show()
addlinterestdetail_FDF1.col_name implies the column is literally named "col_name"; you're not accessing the string contained in the variable col_name.
When calling a UDF on a column, you can
use its string name directly: A(col_name)
or use pyspark sql function col:
import pyspark.sql.functions as psf
z = addlinterestdetail_FDF1.withColumn(struct_col,A(psf.col(col_name)))
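Applying that to the original loop, a minimal sketch (reusing the asker's addlinterestdetail_FDF1, list_col, and UDF A, and accumulating the new columns into z instead of overwriting it on each iteration):
import pyspark.sql.functions as psf

z = addlinterestdetail_FDF1
for col_name in list_col:
    # pass the column by name instead of df.col_name
    z = z.withColumn('Concate_' + col_name, A(psf.col(col_name)))
z.show()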
You should consider using pyspark sql functions for concatenation instead of writing a UDF. First let's create a sample dataframe with nested structures:
import json
j = {'pcxreport':{'a': 'a', 'b': 'b'}, 'crosslinediscount':{'c': 'c', 'd': None, 'e': 'e'}}
jsonRDD = sc.parallelize([json.dumps(j)])
df = spark.read.json(jsonRDD)
df.printSchema()
df.show()
root
|-- crosslinediscount: struct (nullable = true)
| |-- c: string (nullable = true)
| |-- d: string (nullable = true)
| |-- e: string (nullable = true)
|-- pcxreport: struct (nullable = true)
| |-- a: string (nullable = true)
| |-- b: string (nullable = true)
+-----------------+---------+
|crosslinediscount|pcxreport|
+-----------------+---------+
|       [c,null,e]|    [a,b]|
+-----------------+---------+
We'll write a dictionary with nested column names:
list_col=['pcxreport','crosslinediscount']
list_subcols = dict()
for c in list_col:
    list_subcols[c] = df.select(c+'.*').columns
Now we can "flatten" the StructType, replace None with ' ', and concatenate:
import itertools
import pyspark.sql.functions as psf
df.select([c + '.*' for c in list_col])\
    .na.fill({c: ' ' for c in list(itertools.chain.from_iterable(list_subcols.values()))})\
    .select([psf.concat(*sc).alias(c) for c, sc in list_subcols.items()])\
    .show()
+---------+-----------------+
|pcxreport|crosslinediscount|
+---------+-----------------+
|       ab|              c e|
+---------+-----------------+
