Unable to replace a particular value from an array column in Pyspark - python-3.x

I have a column in my DF whose data type is:
testcolumn:array
--element: struct
-----id:integer
-----configName: string
-----desc:string
-----configparam:array
--------element:map
-------------key:string
-------------value:string
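For reference, here is a sketch of this type expressed with PySpark schema objects (an assumption based on the printed tree above):
from pyspark.sql.types import (ArrayType, IntegerType, MapType,
                               StringType, StructField, StructType)

# Sketch of the testcolumn type shown above (assumed from the printed tree).
testcolumn_type = ArrayType(StructType([
    StructField("id", IntegerType()),
    StructField("configName", StringType()),
    StructField("desc", StringType()),
    StructField("configparam", ArrayType(MapType(StringType(), StringType()))),
]))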
testcolumn
Row1:
[{"id":1,"configName":"test1","desc":"Ram1","configparam":[{"removeit":"[]"}]},
{"id":2,"configName":"test2","desc":"Ram2","configparam":[{"removeit":"[]"}]},
{"id":3,"configName":"test3","desc":"Ram1","configparam":[{"paramId":"4","paramvalue":"200"}]}]
Row2:
[{"id":11,"configName":"test11","desc":"Ram11","configparam":[{"removeit":"[]"}]},
{"id":33,"configName":"test33","desc":"Ram33","configparam":[{"paramId":"43","paramvalue":"300"}]},
{"id":6,"configName":"test26","desc":"Ram26","configparam":[{"removeit":"[]"}]},
{"id":93,"configName":"test93","desc":"Ram93","configparam":[{"paramId":"93","paramvalue":"3009"}]}
]
I want to replace "configparam":[{"removeit":"[]"}] with "configparam":[] wherever it occurs.
Expected output:
outputcolumn
Row1:
[{"id":1,"configName":"test1","desc":"Ram1","configparam":[]},
{"id":2,"configName":"test2","desc":"Ram2","configparam":[]},
{"id":3,"configName":"test3","desc":"Ram1","configparam":[{"paramId":"4","paramvalue":"200"}]}]
Row2:
[{"id":11,"configName":"test11","desc":"Ram11","configparam":[]},
{"id":33,"configName":"test33","desc":"Ram33","configparam":[{"paramId":"43","paramvalue":"300"}]},
{"id":6,"configName":"test26","desc":"Ram26","configparam":[]},
{"id":93,"configName":"test93","desc":"Ram93","configparam":[{"paramId":"93","paramvalue":"3009"}]}
]
I have tried this code, but it is not giving me the output:
test=df.withColumn('outputcolumn',F.expr("translate"(testcolumn,x-> replace(x,':[{"removeit":"[]"}]','[]')))
It would be really great if someone could help me.

You have to perform a chain of explode, filter and groupBy operations to achieve this.
First, explode the array/struct/map columns to reach the nested columns:
df = df.withColumn("id", F.col("testcolumn")["id"])
df = df.withColumn("configName", F.col("testcolumn")["configName"])
df = df.withColumn("desc", F.col("testcolumn")["desc"])
df = df.withColumn("configparam_exploded", F.explode(F.col("testcolumn")["configparam"]))
df = df.select(df.columns + [F.explode(F.col("configparam_exploded"))])
+-----------------------------------------------------+---+----------+----+---------------------------------+----------+-----+
|testcolumn |id |configName|desc|configparam_exploded |key |value|
+-----------------------------------------------------+---+----------+----+---------------------------------+----------+-----+
|{1, test1, Ram1, [{removeit -> []}]} |1 |test1 |Ram1|{removeit -> []} |removeit |[] |
|{2, test2, Ram2, [{removeit -> []}]} |2 |test2 |Ram2|{removeit -> []} |removeit |[] |
|{3, test3, Ram1, [{paramId -> 4, paramvalue -> 200}]}|3 |test3 |Ram1|{paramId -> 4, paramvalue -> 200}|paramId |4 |
|{3, test3, Ram1, [{paramId -> 4, paramvalue -> 200}]}|3 |test3 |Ram1|{paramId -> 4, paramvalue -> 200}|paramvalue|200 |
+-----------------------------------------------------+---+----------+----+---------------------------------+----------+-----+
Then, filter data as required:
df = df.filter((F.col("key") != "removeit") | (F.col("value") != "[]"))
+-----------------------------------------------------+---+----------+----+---------------------------------+----------+-----+
|testcolumn |id |configName|desc|configparam_exploded |key |value|
+-----------------------------------------------------+---+----------+----+---------------------------------+----------+-----+
|{3, test3, Ram1, [{paramId -> 4, paramvalue -> 200}]}|3 |test3 |Ram1|{paramId -> 4, paramvalue -> 200}|paramId |4 |
|{3, test3, Ram1, [{paramId -> 4, paramvalue -> 200}]}|3 |test3 |Ram1|{paramId -> 4, paramvalue -> 200}|paramvalue|200 |
+-----------------------------------------------------+---+----------+----+---------------------------------+----------+-----+
Finally, groupBy the individual columns to restore the original packing:
df = df.withColumn("configparam_map", F.map_from_entries(F.array(F.struct("key", "value"))))
df = df.groupBy(["id", "configName", "desc"]).agg(F.collect_list("configparam_map").alias("configparam"))
df = df.withColumn("testcolumn", F.struct("id", "configName", "desc", "configparam"))
df = df.drop("id", "configName", "desc", "configparam")
+-------------------------------------------------------+
|testcolumn |
+-------------------------------------------------------+
|{3, test3, Ram1, [{paramId -> 4}, {paramvalue -> 200}]}|
+-------------------------------------------------------+
Sample dataset used to reproduce the problem:
from pyspark.sql import Row
from pyspark.sql.types import (ArrayType, IntegerType, MapType,
                               StringType, StructField, StructType)

schema = StructType([
    StructField('testcolumn', StructType([
        StructField('id', IntegerType(), True),
        StructField('configName', StringType(), True),
        StructField('desc', StringType(), True),
        StructField('configparam', ArrayType(MapType(StringType(), StringType(), True), True), True)
    ]), True)
])
data = [
    Row(Row(1, "test1", "Ram1", [{"removeit": "[]"}])),
    Row(Row(2, "test2", "Ram2", [{"removeit": "[]"}])),
    Row(Row(3, "test3", "Ram1", [{"paramId": "4", "paramvalue": "200"}]))
]
df = spark.createDataFrame(data=data, schema=schema)

Your testcolumn is an array of struct so you cannot do a string operation as it is.
You can do something like this. This will empty configparam completely when it contains a key "removeit".
Example:
"configparam":[{"removeit":[], "otherparam": "value"}] -> "configparam": []
Spark 3.1.0+
# Keep only the maps whose keys do not include 'removeit'.
array_has_remove = lambda y: ~F.array_contains(F.map_keys(y), 'removeit')

df = (df.withColumn('outputcolumn',
                    F.transform('testcolumn',
                                lambda x: x.withField('configparam',
                                                      F.filter(x['configparam'], array_has_remove)))))
Ref:
withField, filter, array_contains, map_keys
Spark < 3.1.0
I tried doing this without explode, but it is complex. If you don't like this complexity, you can try using explode and aggregate instead (a sketch of that alternative follows the result below).
df = (
    # extract configparam to a column for easier access.
    df.withColumn('configparam', F.expr('transform(testcolumn, x -> x.configparam)'))
    # Return an empty array if there is a "removeit", otherwise return the original object.
    .withColumn('configparam',
                F.expr('''transform(configparam, x ->
                            case when array_contains(map_keys(x[0]), "removeit") then array()
                            else x end)'''))
    # Patch the transformed configparam back onto the rest of testcolumn.
    .withColumn('outputcolumn',
                F.expr('transform(testcolumn, (x, i) -> struct(x.id, x.configName, x.desc, configparam[i] as configparam))'))
    .drop('configparam')
)
Result
Row(testcolumn=[Row(id=1, configName='test1', desc='Ram1', configparam=[{'removeit': '[]'}]), Row(id=2, configName='test2', desc='Ram2', configparam=[{'removeit': '[]'}]), Row(id=3, configName='test3', desc='Ram1', configparam=[{'paramId': '4', 'paramvalue': '200'}])],
outputcolumn=[Row(id=1, configName='test1', desc='Ram1', configparam=[]), Row(id=2, configName='test2', desc='Ram2', configparam=[]), Row(id=3, configName='test3', desc='Ram1', configparam=[{'paramId': '4', 'paramvalue': '200'}])])
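For completeness, here is a minimal sketch of the explode-and-aggregate alternative mentioned above, assuming the testcolumn schema from the question; note that collect_list does not guarantee element order, so the rebuilt array may come back reordered.
import pyspark.sql.functions as F

# Tag each original row so the exploded structs can be grouped back together.
exploded = (df
    .withColumn('_row_id', F.monotonically_increasing_id())
    .withColumn('elem', F.explode('testcolumn')))

# Drop the "removeit" entries from configparam and rebuild each struct.
cleaned = exploded.withColumn(
    'elem',
    F.struct(
        F.col('elem.id').alias('id'),
        F.col('elem.configName').alias('configName'),
        F.col('elem.desc').alias('desc'),
        F.expr("filter(elem.configparam, m -> NOT array_contains(map_keys(m), 'removeit'))").alias('configparam')))

# Collect the cleaned structs back into an array per original row.
result = cleaned.groupBy('_row_id').agg(F.collect_list('elem').alias('outputcolumn'))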

Related

Modify nested property inside Struct column with PySpark

I want to modify/filter on a property inside a struct.
Let's say I have a dataframe with the following column:
#+------------------------------------------+
#| arrayCol |
#+------------------------------------------+
#| {"a" : "some_value", "b" : [1, 2, 3]} |
#+------------------------------------------+
Schema:
struct<a:string, b:array<int>>
I want to filter out some values in the 'b' property when the value inside the array == 1.
The result desired is the following :
#+------------------------------------------+
#| arrayCol |
#+------------------------------------------+
#| {"a" : "some_value", "b" : [2, 3]} |
#+------------------------------------------+
Is it possible to do this without extracting the property, filtering the values, and rebuilding another struct?
Update:
For Spark 3.1+, withField can be used to update the struct column without having to recreate the whole struct. In your case, you can update the field b using the filter function to filter the array values like this:
import pyspark.sql.functions as F
df1 = df.withColumn(
    'arrayCol',
    F.col('arrayCol').withField('b', F.filter(F.col("arrayCol.b"), lambda x: x != 1))
)
df1.show()
#+--------------------+
#| arrayCol|
#+--------------------+
#|{some_value, [2, 3]}|
#+--------------------+
For older versions, Spark doesn’t support adding/updating fields in nested structures. To update a struct column, you'll need to create a new struct using the existing fields and the updated ones:
import pyspark.sql.functions as F
df1 = df.withColumn(
    "arrayCol",
    F.struct(
        F.col("arrayCol.a").alias("a"),
        F.expr("filter(arrayCol.b, x -> x != 1)").alias("b")
    )
)
One way would be to define a UDF:
Example:
import ast
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import StringType, MapType
def remove_value(col):
    col["b"] = str([x for x in ast.literal_eval(col["b"]) if x != 1])
    return col

if __name__ == "__main__":
    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [
            {
                "arrayCol": {
                    "a": "some_value",
                    "b": "[1, 2, 3]",
                },
            },
        ]
    )
    remove_value_udf = spark.udf.register(
        "remove_value_udf", remove_value, MapType(StringType(), StringType())
    )
    df = df.withColumn(
        "result",
        remove_value_udf(F.col("arrayCol")),
    )
Result:
root
|-- arrayCol: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
|-- result: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
+---------------------------------+------------------------------+
|arrayCol |result |
+---------------------------------+------------------------------+
|{a -> some_value, b -> [1, 2, 3]}|{a -> some_value, b -> [2, 3]}|
+---------------------------------+------------------------------+

In spark dataframe for map column how to update values with a constant for all keys

I have a Spark dataframe with two columns of type Integer and Map. I wanted to know the best way to update the values for all the keys of the map column.
With the help of a UDF, I am able to update the values:
def modifyValues = (map_data: Map[String, Int]) => {
  val divideWith = 10
  map_data.mapValues( _ / divideWith)
}
val modifyMapValues = udf(modifyValues)
df.withColumn("updatedValues", modifyMapValues($"data_map"))
scala> dF.printSchema()
root
|-- id: integer (nullable = true)
|-- data_map: map (nullable = true)
| |-- key: string
| |-- value: integer (valueContainsNull = true)
Sample data:
val ds = Seq(
(1, Map("foo" -> 100, "bar" -> 200)),
(2, Map("foo" -> 200)),
(3, Map("bar" -> 200))
).toDF("id", "data_map")
Expected output:
+---+----------------------+
|id |data_map              |
+---+----------------------+
|1  |[foo -> 10, bar -> 20]|
|2  |[foo -> 20]           |
|3  |[bar -> 20]           |
+---+----------------------+
I wanted to know, is there any way to do this without a UDF?
One possible way to do it (without a UDF) is this:
extract keys using map_keys to an array
extract values using map_values to an array
transform extracted values using TRANSFORM (available since Spark 2.4)
create back the map using map_from_arrays
import org.apache.spark.sql.functions.{expr, map_from_arrays, map_values, map_keys}

ds
  .withColumn("values", map_values($"data_map"))
  .withColumn("keys", map_keys($"data_map"))
  .withColumn("values_transformed", expr("TRANSFORM(values, v -> v/10)"))
  .withColumn("data_map_transformed", map_from_arrays($"keys", $"values_transformed"))
import pyspark.sql.functions as sp
from pyspark.sql.types import StringType, FloatType, MapType
Add a new key with any value:
my_update_udf = sp.udf(lambda x: {**x, **{'new_key':77}}, MapType(StringType(), FloatType()))
sdf = sdf.withColumn('updated', my_update_udf(sp.col('to_be_updated')))
Update value for all/one key(s):
my_update_udf = sp.udf(lambda x: {k: v/77 for k, v in x.items() if v is not None and k == 'my_key'}, MapType(StringType(), FloatType()))
sdf = sdf.withColumn('updated', my_update_udf(sp.col('to_be_updated')))
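If the intent is instead to update one key while keeping the remaining entries, a variant of the same UDF could look like this (a sketch; 'my_key', sdf and to_be_updated are the placeholder names used above):
# Sketch: divide only 'my_key' by 77 and pass every other entry through unchanged.
my_update_udf = sp.udf(
    lambda x: {k: (v / 77 if k == 'my_key' and v is not None else v) for k, v in x.items()},
    MapType(StringType(), FloatType()))
sdf = sdf.withColumn('updated', my_update_udf(sp.col('to_be_updated')))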
There is another way available in Spark 3:
Seq(
  Map("keyToUpdate" -> 11, "someOtherKey" -> 12),
  Map("keyToUpdate" -> 21, "someOtherKey" -> 22)
).toDF("mapColumn")
  .withColumn(
    "mapColumn",
    map_concat(
      map(lit("keyToUpdate"), col("mapColumn.keyToUpdate") * 10), // <- transformation
      map_filter(col("mapColumn"), (k, _) => k =!= "keyToUpdate")
    )
  )
  .show(false)
output:
+----------------------------------------+
|mapColumn |
+----------------------------------------+
|{someOtherKey -> 12, keyToUpdate -> 110}|
|{someOtherKey -> 22, keyToUpdate -> 210}|
+----------------------------------------+
map_filter(expr, func) - Filters entries in a map using the function
map_concat(map, ...) - Returns the union of all the given maps
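For updating the values of all keys directly (the original ask), Spark 3.0+ also has the transform_values higher-order function; a minimal PySpark sketch, assuming a DataFrame sdf with the same data_map column:
import pyspark.sql.functions as F

# Sketch: divide every map value by 10 via transform_values (Spark 3.0+ SQL function).
sdf = sdf.withColumn('data_map', F.expr('transform_values(data_map, (k, v) -> cast(v / 10 as int))'))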

Spark Aggregating multiple columns (possible to array) from join output

I have the datasets below:
Table1
Table2
Now I would like to get the dataset below. I've tried a left outer join on Table1.id == Table2.departmentid, but I am not getting the desired output.
Later, I need to use this table to get several counts and convert the data into XML. I will be doing this conversion using map.
Any help would be appreciated.
Joining alone is not enough to get the desired output. You are probably missing something, and the last element of each nested array might be departmentid. Assuming the last element of the nested array is departmentid, I've generated the output in the following way:
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.collect_list
case class department(id: Integer, deptname: String)
case class employee(employeid:Integer, empname:String, departmentid:Integer)
val spark = SparkSession.builder().getOrCreate()
import spark.implicits._
val department_df = Seq(department(1, "physics"),
                        department(2, "computer")).toDF()
val emplyoee_df = Seq(employee(1, "A", 1),
                      employee(2, "B", 1),
                      employee(3, "C", 2),
                      employee(4, "D", 2)).toDF()
val result = department_df.join(emplyoee_df, department_df("id") === emplyoee_df("departmentid"), "left").
  selectExpr("id", "deptname", "employeid", "empname").
  rdd.map {
    case Row(id: Integer, deptname: String, employeid: Integer, empname: String) =>
      (id, deptname, Array(employeid.toString, empname, id.toString))
  }.toDF("id", "deptname", "arrayemp").
  groupBy("id", "deptname").
  agg(collect_list("arrayemp").as("emplist")).
  orderBy("id", "deptname")
The output looks like this:
result.show(false)
+---+--------+----------------------+
|id |deptname|emplist |
+---+--------+----------------------+
|1 |physics |[[2, B, 1], [1, A, 1]]|
|2 |computer|[[4, D, 2], [3, C, 2]]|
+---+--------+----------------------+
Explanation: If I break down the last dataframe transformation into multiple steps, it'll probably make it clear how the output is generated.
left outer join between department_df and employee_df
val df1 = department_df.join(emplyoee_df, department_df("id") === emplyoee_df("departmentid"), "left").
selectExpr("id", "deptname", "employeid", "empname")
df1.show()
+---+--------+---------+-------+
| id|deptname|employeid|empname|
+---+--------+---------+-------+
| 1| physics| 2| B|
| 1| physics| 1| A|
| 2|computer| 4| D|
| 2|computer| 3| C|
+---+--------+---------+-------+
Creating an array using some of the columns' values from the df1 dataframe:
val df2 = df1.rdd.map {
case Row(id:Integer, deptname:String, employeid:Integer, empname:String) => (id, deptname, Array(employeid.toString, empname, id.toString))
}.toDF("id", "deptname", "arrayemp")
df2.show()
+---+--------+---------+
| id|deptname| arrayemp|
+---+--------+---------+
| 1| physics|[2, B, 1]|
| 1| physics|[1, A, 1]|
| 2|computer|[4, D, 2]|
| 2|computer|[3, C, 2]|
+---+--------+---------+
Creating a new list aggregating multiple arrays using the df2 dataframe:
val result = df2.groupBy("id", "deptname").
agg(collect_list("arrayemp").as("emplist")).
orderBy("id", "deptname")
result.show(false)
+---+--------+----------------------+
|id |deptname|emplist |
+---+--------+----------------------+
|1 |physics |[[2, B, 1], [1, A, 1]]|
|2 |computer|[[4, D, 2], [3, C, 2]]|
+---+--------+----------------------+
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
import spark.implicits._
val df = spark.sparkContext.parallelize(Seq(
(1,"Physics"),
(2,"Computer"),
(3,"Maths")
)).toDF("ID","Dept")
val schema = List(
StructField("EMPID", IntegerType, true),
StructField("EMPNAME", StringType, true),
StructField("DeptID", IntegerType, true)
)
val data = Seq(
Row(1,"A",1),
Row(2,"B",1),
Row(3,"C",2),
Row(4,"D",2) ,
Row(5,"E",null)
)
val df_emp = spark.createDataFrame(
spark.sparkContext.parallelize(data),
StructType(schema)
)
val newdf = df_emp
  .withColumn("CONC", array($"EMPID", $"EMPNAME", $"DeptID"))
  .groupBy($"DeptID")
  .agg(expr("collect_list(CONC) as emplist"))

df.join(newdf, df.col("ID") === df_emp.col("DeptID")).select($"ID", $"Dept", $"emplist").show()
+---+--------+--------------------+
| ID|    Dept|             emplist|
+---+--------+--------------------+
|  1| Physics|[[1, A, 1], [2, B...|
|  2|Computer|[[3, C, 2], [4, D...|
+---+--------+--------------------+

How to remove words that have less than three letters in PySpark?

I have a 'text' column in which arrays of tokens are stored. How do I filter all these arrays so that only tokens that are at least three letters long remain?
from pyspark.sql.functions import regexp_replace, col
from pyspark.sql.session import SparkSession
spark = SparkSession.builder.getOrCreate()
columns = ['id', 'text']
vals = [
(1, ['I', 'am', 'good']),
(2, ['You', 'are', 'ok']),
]
df = spark.createDataFrame(vals, columns)
df.show()
# Had tried this but have TypeError: Column is not iterable
# df_clean = df.select('id', regexp_replace('text', [len(word) >= 3 for word
# in col('text')], ''))
# df_clean.show()
I expect to see:
id | text
1 | [good]
2 | [You, are]
This does it. You can decide whether to exclude rows or not; I added an extra column and filtered it out, but the options are yours:
from pyspark.sql import functions as f
columns = ['id', 'text']
vals = [
(1, ['I', 'am', 'good']),
(2, ['You', 'are', 'ok']),
(3, ['ok'])
]
df = spark.createDataFrame(vals, columns)
#df.show()
df2 = df.withColumn("text_left_over", f.expr("filter(text, x -> not(length(x) < 3))"))
df2.show()
# This is the actual piece of logic you are looking for.
df3 = df.withColumn("text_left_over", f.expr("filter(text, x -> not(length(x) < 3))")).where(f.size(f.col("text_left_over")) > 0).drop("text")
df3.show()
returns:
+---+--------------+--------------+
| id| text|text_left_over|
+---+--------------+--------------+
| 1| [I, am, good]| [good]|
| 2|[You, are, ok]| [You, are]|
| 3| [ok]| []|
+---+--------------+--------------+
+---+--------------+
| id|text_left_over|
+---+--------------+
| 1| [good]|
| 2| [You, are]|
+---+--------------+
This is the solution using a UDF:
from pyspark.sql.functions import col, udf
from pyspark.sql.types import ArrayType, StringType

filter_length_udf = udf(lambda row: [x for x in row if len(x) >= 3], ArrayType(StringType()))
df_final_words = df_stemmed.withColumn('words_filtered', filter_length_udf(col('words')))

Spark GroupBy Aggregate functions

case class Step(Id: Long,
                stepNum: Long,
                stepId: Int,
                stepTime: java.sql.Timestamp)
I have a Dataset[Step] and I want to perform a groupBy operation on the "Id" col.
My output should look like Dataset[(Long, List[Step])]. How do I do this?
Let's say the variable "inquiryStepMap" is of type Dataset[Step]; then we can do this with RDDs as follows:
val inquiryStepGrouped: RDD[(Long, Iterable[Step])] = inquiryStepMap.rdd.groupBy(x => x.Id)
It seems you need groupByKey:
Sample:
import java.sql.Timestamp
// Note: this deprecated Timestamp constructor interprets the first argument as (year - 1900),
// which is why the timestamps in the output below show the year 3917.
val t = new Timestamp(2017, 5, 1, 0, 0, 0, 0)
val ds = Seq(Step(1L, 21L, 1, t), Step(1L, 20L, 2, t), Step(2L, 10L, 3, t)).toDS()
groupByKey and then mapGroups:
ds.groupByKey(_.Id).mapGroups((Id, Vals) => (Id, Vals.toList))
// res18: org.apache.spark.sql.Dataset[(Long, List[Step])] = [_1: bigint, _2: array<struct<Id:bigint,stepNum:bigint,stepId:int,stepTime:timestamp>>]
And the result looks like:
ds.groupByKey(_.Id).mapGroups((Id, Vals) => (Id, Vals.toList)).show()
+---+--------------------+
| _1| _2|
+---+--------------------+
| 1|[[1,21,1,3917-06-...|
| 2|[[2,10,3,3917-06-...|
+---+--------------------+
