Find array intersection for each row in Pyspark - apache-spark

I have dataframe:
df_example = pd.DataFrame({'user1': ['u1', 'u1', 'u1', 'u5', 'u5', 'u5', 'u7','u7','u6'],
'user2': ['u2', 'u3', 'u4', 'u2', 'u4','u6','u8','u3','u6']})
sdf = spark.createDataFrame(df_example)
userreposts_gr = sdf.groupby('user1').agg(F.collect_list('user2').alias('all_user2'))
userreposts_gr.show()
+-----+------------+
|user1| all_user2|
+-----+------------+
| u1|[u4, u2, u3]|
| u7| [u8, u3]|
| u5|[u4, u2, u6]|
| u6| [u6]|
+-----+------------+
I want, for each user1, to compute the intersection of all_user2 with every other user's array, and to create a new column holding the user with the largest intersection together with its size.
+-----+------------+------------------------------+
|user1|all_user2 |new_col |
+-----+------------+------------------------------+
|u1 |[u2, u3, u4]|{max_count -> 2, user -> 'u5'}|
|u5 |[u2, u4, u6]|{max_count -> 2, user -> 'u1'}|
|u7 |[u8, u3] |{max_count -> 1, user -> 'u1'}|
|u6 |[u6] |{max_count -> 1, user -> 'u5'}|
+-----+------------+------------------------------+

Step 1: Cross join to get every pair of user1 values
Step 2: Compute the length of the intersected arrays
Step 3: Rank by that length and keep the row with the largest value for each user
from pyspark.sql import functions as func
from pyspark.sql.window import Window

output = userreposts_gr \
    .selectExpr('user1', 'all_user2 AS arr1') \
    .crossJoin(userreposts_gr.selectExpr('user1 AS user2', 'all_user2 AS arr2')) \
    .withColumn('intersection', func.size(func.array_intersect('arr1', 'arr2')))
output = output \
    .filter(func.col('user1') != func.col('user2')) \
    .withColumn('ranking', func.rank().over(Window.partitionBy('user1').orderBy(func.desc('intersection'))))
output = output \
    .filter(func.col('ranking') == 1) \
    .withColumn('new_col', func.create_map(func.lit('max_count'), func.col('intersection'), func.lit('user'), func.col('user2')))
output = output.selectExpr('user1', 'arr1 AS all_user2', 'new_col')
output.show(100, False)
+-----+------------+----------------------------+
|user1|all_user2 |new_col |
+-----+------------+----------------------------+
|u1 |[u2, u3, u4]|{max_count -> 2, user -> u5}|
|u5 |[u2, u4, u6]|{max_count -> 2, user -> u1}|
|u6 |[u6] |{max_count -> 1, user -> u5}|
|u7 |[u8, u3] |{max_count -> 1, user -> u1}|
+-----+------------+----------------------------+
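One caveat (my note, not part of the answer): rank() keeps every user2 that ties for the largest intersection, so a user1 can come back with several rows. If exactly one row per user1 is wanted, row_number() with an explicit tie-breaker can be swapped into the ranking step; a minimal sketch, where pairs stands for the cross-joined DataFrame with the intersection column from the first step:

# Sketch only: `pairs` is assumed to hold user1, user2, arr1 and the
# 'intersection' column computed above.
best_match = pairs \
    .filter(func.col('user1') != func.col('user2')) \
    .withColumn(
        'rn',
        func.row_number().over(
            Window.partitionBy('user1')
                  .orderBy(func.desc('intersection'), 'user2')  # deterministic tie-break
        )
    ) \
    .filter(func.col('rn') == 1)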

Related

PySpark: Convert Map Column Keys Using Dictionary

I have a PySpark DataFrame with a map column as below:
root
|-- id: long (nullable = true)
|-- map_col: map (nullable = true)
| |-- key: string
| |-- value: double (valueContainsNull = true)
The map_col has keys which need to be converted based on a dictionary. For example, the dictionary might be:
mapping = {'a': '1', 'b': '2', 'c': '5', 'd': '8' }
So, the DataFrame needs to change from:
[Row(id=123, map_col={'a': 0.0, 'b': -42.19}),
Row(id=456, map_col={'a': 13.25, 'c': -19.6, 'd': 15.6})]
to the following:
[Row(id=123, map_col={'1': 0.0, '2': -42.19}),
Row(id=456, map_col={'1': 13.25, '5': -19.6, '8': 15.6})]
I see that transform_keys is an option if I could write-out the dictionary, but it's too large and dynamically-generated earlier in the workflow. I think an explode/pivot could also work, but seems non-performant?
Any ideas?
Edit: Added a bit to show that size of map in map_col is not uniform.
An approach using an RDD transformation:
def updateKey(theDict, mapDict):
    """
    Update theDict's keys using mapDict; keys with no mapping are kept as-is.
    """
    updDict = []
    for k, v in theDict.items():
        updDict.append((mapDict[k] if k in mapDict else k, v))
    return dict(updDict)

data_sdf.rdd. \
    map(lambda r: (r[0], r[1], updateKey(r[1], mapping))). \
    toDF(['id', 'map_col', 'new_map_col']). \
    show(truncate=False)
# +---+-----------------------------------+-----------------------------------+
# |id |map_col |new_map_col |
# +---+-----------------------------------+-----------------------------------+
# |123|{a -> 0.0, b -> -42.19, e -> 12.12}|{1 -> 0.0, 2 -> -42.19, e -> 12.12}|
# |456|{a -> 13.25, c -> -19.6, d -> 15.6}|{8 -> 15.6, 1 -> 13.25, 5 -> -19.6}|
# +---+-----------------------------------+-----------------------------------+
P.S., I added a new key within the map_col's first row to show what happens if no mapping is available
transform_keys can use a lambda, as shown in the example; it is not limited to an expr. However, the lambda or Python callable has to be built from functions defined in pyspark.sql.functions, Column methods, or a Scala UDF, so a Python UDF that refers to the mapping dictionary object isn't currently possible with this mechanism. We can, however, apply the mapping with the when function by unrolling the mapping's key-value pairs into chained when conditions. The example below illustrates the idea:
from typing import Dict, Callable
from functools import reduce
from pyspark.sql.functions import Column, when, transform_keys
from pyspark.sql import SparkSession
def apply_mapping(mapping: Dict[str, str]) -> Callable[[Column, Column], Column]:
    def convert_mapping_into_when_conditions(key: Column, _: Column) -> Column:
        initial_key, initial_value = mapping.popitem()
        initial_condition = when(key == initial_key, initial_value)
        return reduce(lambda x, y: x.when(key == y[0], y[1]), mapping.items(), initial_condition)

    return convert_mapping_into_when_conditions


if __name__ == "__main__":
    spark = SparkSession \
        .builder \
        .appName("Temp") \
        .getOrCreate()
    df = spark.createDataFrame([(1, {"foo": -2.0, "bar": 2.0})], ("id", "data"))
    mapping = {'foo': 'a', 'bar': 'b'}
    df.select(
        transform_keys("data", apply_mapping(mapping)).alias("data_transformed")
    ).show(truncate=False)
The output of the above is:
+---------------------+
|data_transformed |
+---------------------+
|{b -> 2.0, a -> -2.0}|
+---------------------+
which demonstrates the defined mapping (foo -> a, bar -> b) was successfully applied to the column. The apply_mapping function should be generic enough to copy and utilize in your own pipeline.
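As a minimal usage sketch against the question's own map_col (my illustration, assuming the df and mapping from the question): note that apply_mapping pops an item from the dict it receives, so passing a copy keeps the original mapping intact.

mapping = {'a': '1', 'b': '2', 'c': '5', 'd': '8'}
df.select(
    "id",
    # pass a copy, because apply_mapping mutates the dict via popitem()
    transform_keys("map_col", apply_mapping(dict(mapping))).alias("map_col")
).show(truncate=False)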
Another way:
Use itertools to build a literal map column and look each key up inside PySpark's transform_keys; upper is wrapped around the lookup just in case, although it is a no-op for the digit values here. Code below:
from itertools import chain
from pyspark.sql.functions import create_map, lit, transform_keys, upper

m_expr1 = create_map([lit(x) for x in chain(*mapping.items())])
new = df.withColumn('new_map_col', transform_keys("map_col", lambda k, _: upper(m_expr1[k])))
new.show(truncate=False)
+---+-----------------------------------+-----------------------------------+
|id |map_col |new_map_col |
+---+-----------------------------------+-----------------------------------+
|123|{a -> 0.0, b -> -42.19} |{1 -> 0.0, 2 -> -42.19} |
|456|{a -> 13.25, c -> -19.6, d -> 15.6}|{1 -> 13.25, 5 -> -19.6, 8 -> 15.6}|
+---+-----------------------------------+-----------------------------------+
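One caveat with this lookup (my note, not from the answer): a key absent from mapping makes m_expr1[k] evaluate to null, and Spark rejects null map keys, so a coalesce fallback can keep the original key instead; a sketch under the same setup:

from pyspark.sql.functions import coalesce

# Fall back to the original key when the mapping has no entry for it,
# instead of producing a null key (which transform_keys would reject).
new = df.withColumn(
    'new_map_col',
    transform_keys("map_col", lambda k, _: coalesce(upper(m_expr1[k]), k))
)
new.show(truncate=False)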

Unable to replace a particular value from an array column in Pyspark

I have a column in my DF where data type is :
testcolumn:array
--element: struct
-----id:integer
-----configName: string
-----desc:string
-----configparam:array
--------element:map
-------------key:string
-------------value:string
testcolumn
Row1:
[{"id":1,"configName":"test1","desc":"Ram1","configparam":[{"removeit":"[]"}]},
{"id":2,"configName":"test2","desc":"Ram2","configparam":[{"removeit":"[]"}]},
{"id":3,"configName":"test3","desc":"Ram1","configparam":[{"paramId":"4","paramvalue":"200"}]}]
Row2:
[{"id":11,"configName":"test11","desc":"Ram11","configparam":[{"removeit":"[]"}]},
{"id":33,"configName":"test33","desc":"Ram33","configparam":[{"paramId":"43","paramvalue":"300"}]},
{"id":6,"configName":"test26","desc":"Ram26","configparam":[{"removeit":"[]"}]},
{"id":93,"configName":"test93","desc":"Ram93","configparam":[{"paramId":"93","paramvalue":"3009"}]}
]
I want to replace configparam wherever it is "configparam":[{"removeit":"[]"}] with "configparam":[]
expecting output:
outputcolumn
Row1:
[{"id":1,"configName":"test1","desc":"Ram1","configparam":[]},
{"id":2,"configName":"test2","desc":"Ram2","configparam":[]},
{"id":3,"configName":"test3","desc":"Ram1","configparam":[{"paramId":"4","paramvalue":"200"}]}]
Row2:
[{"id":11,"configName":"test11","desc":"Ram11","configparam":[]},
{"id":33,"configName":"test33","desc":"Ram33","configparam":[{"paramId":"43","paramvalue":"300"}]},
{"id":6,"configName":"test26","desc":"Ram26","configparam":[]},
{"id":93,"configName":"test93","desc":"Ram93","configparam":[{"paramId":"93","paramvalue":"3009"}]}
]
I have tried this code but it is not giving me output:
test=df.withColumn('outputcolumn',F.expr("translate"(testcolumn,x-> replace(x,':[{"removeit":"[]"}]','[]')))
It would be really great if someone could help me.
You have to perform a chain of explode, filter and groupBy operations to achieve this.
First, explode the array/struct/map columns to reach the nested columns:
df = df.withColumn("id", F.col("testcolumn")["id"])
df = df.withColumn("configName", F.col("testcolumn")["configName"])
df = df.withColumn("desc", F.col("testcolumn")["desc"])
df = df.withColumn("configparam_exploded", F.explode(F.col("testcolumn")["configparam"]))
df = df.select(df.columns + [F.explode(F.col("configparam_exploded"))])
+-----------------------------------------------------+---+----------+----+---------------------------------+----------+-----+
|testcolumn |id |configName|desc|configparam_exploded |key |value|
+-----------------------------------------------------+---+----------+----+---------------------------------+----------+-----+
|{1, test1, Ram1, [{removeit -> []}]} |1 |test1 |Ram1|{removeit -> []} |removeit |[] |
|{2, test2, Ram2, [{removeit -> []}]} |2 |test2 |Ram2|{removeit -> []} |removeit |[] |
|{3, test3, Ram1, [{paramId -> 4, paramvalue -> 200}]}|3 |test3 |Ram1|{paramId -> 4, paramvalue -> 200}|paramId |4 |
|{3, test3, Ram1, [{paramId -> 4, paramvalue -> 200}]}|3 |test3 |Ram1|{paramId -> 4, paramvalue -> 200}|paramvalue|200 |
+-----------------------------------------------------+---+----------+----+---------------------------------+----------+-----+
Then, filter data as required:
df = df.filter((F.col("key") != "removeit") | (F.col("value") != "[]"))
+-----------------------------------------------------+---+----------+----+---------------------------------+----------+-----+
|testcolumn |id |configName|desc|configparam_exploded |key |value|
+-----------------------------------------------------+---+----------+----+---------------------------------+----------+-----+
|{3, test3, Ram1, [{paramId -> 4, paramvalue -> 200}]}|3 |test3 |Ram1|{paramId -> 4, paramvalue -> 200}|paramId |4 |
|{3, test3, Ram1, [{paramId -> 4, paramvalue -> 200}]}|3 |test3 |Ram1|{paramId -> 4, paramvalue -> 200}|paramvalue|200 |
+-----------------------------------------------------+---+----------+----+---------------------------------+----------+-----+
Finally, groupBy all individual columns back to original packing:
df = df.withColumn("configparam_map", F.map_from_entries(F.array(F.struct("key", "value"))))
df = df.groupBy(["id", "configName", "desc"]).agg(F.collect_list("configparam_map").alias("configparam"))
df = df.withColumn("testcolumn", F.struct("id", "configName", "desc", "configparam"))
df = df.drop("id", "configName", "desc", "configparam")
+-------------------------------------------------------+
|testcolumn |
+-------------------------------------------------------+
|{3, test3, Ram1, [{paramId -> 4}, {paramvalue -> 200}]}|
+-------------------------------------------------------+
Sample dataset used to reproduce the problem:
schema = StructType([StructField('testcolumn', StructType([StructField('id', IntegerType(), True), StructField('configName', StringType(), True), StructField('desc', StringType(), True), StructField('configparam', ArrayType(MapType(StringType(), StringType(), True), True), True)]), True)])
data = [
Row(Row(1, "test1", "Ram1", [{"removeit":"[]"}])),
Row(Row(2, "test2", "Ram2", [{"removeit":"[]"}])),
Row(Row(3, "test3", "Ram1", [{"paramId":"4","paramvalue":"200"}]))
]
df = spark.createDataFrame(data = data, schema = schema)
Your testcolumn is an array of structs, so you cannot run a string operation on it as it is.
You can do something like this. It will empty configparam completely when it contains a key "removeit".
Example:
"configparam":[{"removeit":[], "otherparam": "value"}] -> "configparam": []
Spark 3.1.0+
# True when the map contains no 'removeit' key, i.e. the entry should be kept.
array_has_remove = lambda y: ~F.array_contains(F.map_keys(y), 'removeit')
df = (df.withColumn(
    'outputcolumn',
    F.transform(
        'testcolumn',
        lambda x: x.withField('configparam', F.filter(x['configparam'], array_has_remove))
    )
))
Ref:
withField, filter, array_contains, map_keys
Spark < 3.1.0
I tried it without explode, but it is complex. If you don't like the complexity, you can try using explode and aggregate instead.
df = (
    # extract configparam to a column for easier access.
    df.withColumn('configparam', F.expr('transform(testcolumn, x -> x.configparam)'))
    # Return an empty array if there is a "removeit" key, otherwise return the original object.
    .withColumn('configparam', F.expr("""
        transform(configparam, x ->
            case when array_contains(map_keys(x[0]), "removeit") then array()
            else x end)
    """))
    # Patch the transformed configparam back onto the rest of testcolumn
    .withColumn('outputcolumn', F.expr(
        'transform(testcolumn, (x, i) -> struct(x.id, x.configName, x.desc, configparam[i] as configparam))'
    ))
    .drop('configparam')
)
Result
Row(testcolumn=[Row(id=1, configName='test1', desc='Ram1', configparam=[{'removeit': '[]'}]), Row(id=2, configName='test2', desc='Ram2', configparam=[{'removeit': '[]'}]), Row(id=3, configName='test3', desc='Ram1', configparam=[{'paramId': '4', 'paramvalue': '200'}])],
outputcolumn=[Row(id=1, configName='test1', desc='Ram1', configparam=[]), Row(id=2, configName='test2', desc='Ram2', configparam=[]), Row(id=3, configName='test3', desc='Ram1', configparam=[{'paramId': '4', 'paramvalue': '200'}])])

How to create a map column with rolling window aggregates per each key

Problem description
I need help with a pyspark.sql function that will create a new variable aggregating records over a specified Window() into a map of key-value pairs.
Reproducible Data
df = spark.createDataFrame(
[
('AK', "2022-05-02", 1651449600, 'US', 3),
('AK', "2022-05-03", 1651536000, 'ON', 1),
('AK', "2022-05-04", 1651622400, 'CO', 1),
('AK', "2022-05-06", 1651795200, 'AK', 1),
('AK', "2022-05-06", 1651795200, 'US', 5)
],
["state", "ds", "ds_num", "region", "count"]
)
df.show()
# +-----+----------+----------+------+-----+
# |state| ds| ds_num|region|count|
# +-----+----------+----------+------+-----+
# | AK|2022-05-02|1651449600| US| 3|
# | AK|2022-05-03|1651536000| ON| 1|
# | AK|2022-05-04|1651622400| CO| 1|
# | AK|2022-05-06|1651795200| AK| 1|
# | AK|2022-05-06|1651795200| US| 5|
# +-----+----------+----------+------+-----+
Partial solutions
Sets of regions over a window frame:
import pyspark.sql.functions as F
from pyspark.sql.window import Window
days = lambda i: i * 86400
df.withColumn('regions_4W',
F.collect_set('region').over(
Window().partitionBy('state').orderBy('ds_num').rangeBetween(-days(27),0)))\
.sort('ds')\
.show()
# +-----+----------+----------+------+-----+----------------+
# |state| ds| ds_num|region|count| regions_4W|
# +-----+----------+----------+------+-----+----------------+
# | AK|2022-05-02|1651449600| US| 3| [US]|
# | AK|2022-05-03|1651536000| ON| 1| [US, ON]|
# | AK|2022-05-04|1651622400| CO| 1| [CO, US, ON]|
# | AK|2022-05-06|1651795200| AK| 1|[CO, US, ON, AK]|
# | AK|2022-05-06|1651795200| US| 5|[CO, US, ON, AK]|
# +-----+----------+----------+------+-----+----------------+
Map of counts per each state and ds
df\
.groupby('state', 'ds', 'ds_num')\
.agg(F.map_from_entries(F.collect_list(F.struct("region", "count"))).alias("count_rolling_4W"))\
.sort('ds')\
.show()
# +-----+----------+----------+------------------+
# |state| ds| ds_num| count_rolling_4W|
# +-----+----------+----------+------------------+
# | AK|2022-05-02|1651449600| {US -> 3}|
# | AK|2022-05-03|1651536000| {ON -> 1}|
# | AK|2022-05-04|1651622400| {CO -> 1}|
# | AK|2022-05-06|1651795200|{AK -> 1, US -> 5}|
# +-----+----------+----------+------------------+
Desired Output
What I need is a map aggregating the data for each key present in the specified window:
+-----+----------+----------+------------------------------------+
|state| ds| ds_num| count_rolling_4W|
+-----+----------+----------+------------------------------------+
| AK|2022-05-02|1651449600| {US -> 3}|
| AK|2022-05-03|1651536000| {US -> 3, ON -> 1}|
| AK|2022-05-04|1651622400| {US -> 3, ON -> 1, CO -> 1}|
| AK|2022-05-06|1651795200|{US -> 8, ON -> 1, CO -> 1, AK -> 1}|
+-----+----------+----------+------------------------------------+
You can use higher order functions transform and aggregate like this:
from pyspark.sql import Window, functions as F
w = Window.partitionBy('state').orderBy('ds_num').rowsBetween(-days(27), 0)
df1 = (df.withColumn('regions', F.collect_set('region').over(w))
       .withColumn('counts', F.collect_list(F.struct('region', 'count')).over(w))
       .withColumn('counts',
                   F.transform(
                       'regions',
                       lambda x: F.aggregate(
                           F.filter('counts', lambda y: y['region'] == x),
                           F.lit(0),
                           lambda acc, v: acc + v['count']
                       )
                   ))
       .withColumn('count_rolling_4W', F.map_from_arrays('regions', 'counts'))
       .drop('counts', 'regions')
       )
df1.show(truncate=False)
#+-----+----------+----------+------+-----+------------------------------------+
# |state|ds |ds_num |region|count|count_rolling_4W |
# +-----+----------+----------+------+-----+------------------------------------+
# |AK |2022-05-02|1651449600|US |3 |{US -> 3} |
# |AK |2022-05-03|1651536000|ON |1 |{US -> 3, ON -> 1} |
# |AK |2022-05-04|1651622400|CO |1 |{CO -> 1, US -> 3, ON -> 1} |
# |AK |2022-05-06|1651795200|AK |1 |{CO -> 1, US -> 3, ON -> 1, AK -> 1}|
# |AK |2022-05-06|1651795200|US |5 |{CO -> 1, US -> 8, ON -> 1, AK -> 1}|
# +-----+----------+----------+------+-----+------------------------------------+
Great question. This method will use 2 windows and 2 higher order functions (aggregate and map_from_entries)
from pyspark.sql import functions as F, Window
w1 = Window.partitionBy('state', 'region').orderBy('ds')
w2 = Window.partitionBy('state').orderBy('ds')
df = df.withColumn('acc_count', F.sum('count').over(w1))
df = df.withColumn('maps', F.collect_set(F.struct('region', 'acc_count')).over(w2))
df = df.groupBy('state', 'ds', 'ds_num').agg(
F.aggregate(
F.first('maps'),
F.create_map(F.first('region'), F.first('acc_count')),
lambda m, x: F.map_zip_with(m, F.map_from_entries(F.array(x)), lambda k, v1, v2: F.greatest(v2, v1))
).alias('count_rolling_4W')
)
df.show(truncate=0)
# +-----+----------+----------+------------------------------------+
# |state|ds |ds_num |count_rolling_4W |
# +-----+----------+----------+------------------------------------+
# |AK |2022-05-02|1651449600|{US -> 3} |
# |AK |2022-05-03|1651536000|{ON -> 1, US -> 3} |
# |AK |2022-05-04|1651622400|{CO -> 1, US -> 3, ON -> 1} |
# |AK |2022-05-06|1651795200|{AK -> 1, US -> 8, ON -> 1, CO -> 1}|
# +-----+----------+----------+------------------------------------+
Assuming that the state, ds, ds_num and region columns in your source dataframe are unique (they can be seen as primary key), this snippet would do the work:
import pyspark.sql.functions as F
from pyspark.sql.window import Window
days = lambda i: i * 86400
df.alias('a').join(df.alias('b'), 'state') \
.where((F.col('a.ds_num') - F.col('b.ds_num')).between(0, days(27))) \
.select('state', 'a.ds', 'a.ds_num', 'b.region', 'b.count') \
.dropDuplicates() \
.groupBy('state', 'ds', 'ds_num', 'region').sum('count') \
.groupBy('state', 'ds', 'ds_num') \
.agg(F.map_from_entries(F.collect_list(F.struct("region", "sum(count)"))).alias("count_rolling_4W")) \
.orderBy('a.ds') \
.show(truncate=False)
Results:
+-----+----------+----------+------------------------------------+
|state|ds |ds_num |count_rolling_4W |
+-----+----------+----------+------------------------------------+
|AK |2022-05-02|1651449600|{US -> 3} |
|AK |2022-05-03|1651536000|{US -> 3, ON -> 1} |
|AK |2022-05-04|1651622400|{US -> 3, ON -> 1, CO -> 1} |
|AK |2022-05-06|1651795200|{US -> 8, ON -> 1, CO -> 1, AK -> 1}|
+-----+----------+----------+------------------------------------+
It may seem complex, but it's just the windowing rewritten as a self-join for better control over the results.

In spark dataframe for map column how to update values with a constant for all keys

I have a Spark dataframe with two columns of type Integer and Map, and I wanted to know the best way to update the values for all the keys of the map column.
With the help of a UDF, I am able to update the values:
def modifyValues = (map_data: Map[String, Int]) => {
  val divideWith = 10
  map_data.mapValues( _ / divideWith)
}
val modifyMapValues = udf(modifyValues)
df.withColumn("updatedValues", modifyMapValues($"data_map"))
scala> dF.printSchema()
root
|-- id: integer (nullable = true)
|-- data_map: map (nullable = true)
| |-- key: string
| |-- value: integer (valueContainsNull = true)
Sample data:
val ds = Seq(
(1, Map("foo" -> 100, "bar" -> 200)),
(2, Map("foo" -> 200)),
(3, Map("bar" -> 200))
).toDF("id", "data_map")
Expected output:
+---+-----------------------+
|id |data_map               |
+---+-----------------------+
|1  |[foo -> 10, bar -> 20] |
|2  |[foo -> 20]            |
|3  |[bar -> 20]            |
+---+-----------------------+
I wanted to know, is there any way to do this without a UDF?
One possible way to do it (without a UDF) is this one:
extract the keys to an array using map_keys
extract the values to an array using map_values
transform the extracted values using TRANSFORM (available since Spark 2.4)
build the map again using map_from_arrays
import org.apache.spark.sql.functions.{expr, map_from_arrays, map_values, map_keys}
ds
.withColumn("values", map_values($"data_map"))
.withColumn("keys", map_keys($"data_map"))
.withColumn("values_transformed", expr("TRANSFORM(values, v -> v/10)"))
.withColumn("data_map_transformed", map_from_arrays($"keys", $"values_transformed"))
import pyspark.sql.functions as sp
from pyspark.sql.types import StringType, FloatType, MapType
Add a new key with any value:
my_update_udf = sp.udf(lambda x: {**x, **{'new_key':77}}, MapType(StringType(), FloatType()))
sdf = sdf.withColumn('updated', my_update_udf(sp.col('to_be_updated')))
Update value for all/one key(s):
my_update_udf = sp.udf(lambda x: {k: v/77 for k, v in x.items() if v is not None and k == 'my_key'}, MapType(StringType(), FloatType()))
sdf = sdf.withColumn('updated', my_update_udf(sp.col('to_be_updated')))
There is another way available in Spark 3:
Seq(
Map("keyToUpdate" -> 11, "someOtherKey" -> 12),
Map("keyToUpdate" -> 21, "someOtherKey" -> 22)
).toDF("mapColumn")
.withColumn(
"mapColumn",
map_concat(
map(lit("keyToUpdate"), col("mapColumn.keyToUpdate") * 10), // <- transformation
map_filter(col("mapColumn"), (k, _) => k =!= "keyToUpdate")
)
)
.show(false)
output:
+----------------------------------------+
|mapColumn |
+----------------------------------------+
|{someOtherKey -> 12, keyToUpdate -> 110}|
|{someOtherKey -> 22, keyToUpdate -> 210}|
+----------------------------------------+
map_filter(expr, func) - Filters entries in a map using the function
map_concat(map, ...) - Returns the union of all the given maps
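A rough PySpark translation of the same idea (my sketch, assuming Spark 3.1+ where map_filter is exposed in pyspark.sql.functions, and a DataFrame df with the same mapColumn):

import pyspark.sql.functions as F

df.withColumn(
    "mapColumn",
    F.map_concat(
        # rebuild the updated entry...
        F.create_map(F.lit("keyToUpdate"), F.col("mapColumn")["keyToUpdate"] * 10),
        # ...and keep every other entry untouched
        F.map_filter("mapColumn", lambda k, _: k != F.lit("keyToUpdate")),
    )
).show(truncate=False)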

Spark GroupBy Aggregate functions

case class Step (Id : Long,
stepNum : Long,
stepId : Int,
stepTime: java.sql.Timestamp
)
I have a Dataset[Step] and I want to perform a groupBy operation on the "Id" col.
My output should look like Dataset[(Long, List[Step])]. How do I do this?
Let's say the variable "inquiryStepMap" is of type Dataset[Step]; then we can do this with RDDs as follows:
val inquiryStepGrouped: RDD[(Long, Iterable[Step])] = inquiryStepMap.rdd.groupBy(x => x.Id)
It seems you need groupByKey:
Sample:
import java.sql.Timestamp
val t = new Timestamp(2017, 5, 1, 0, 0, 0, 0) // deprecated constructor: the year argument is added to 1900, hence "3917" in the output below
val ds = Seq(Step(1L, 21L, 1, t), Step(1L, 20L, 2, t), Step(2L, 10L, 3, t)).toDS()
groupByKey and then mapGroups:
ds.groupByKey(_.Id).mapGroups((Id, Vals) => (Id, Vals.toList))
// res18: org.apache.spark.sql.Dataset[(Long, List[Step])] = [_1: bigint, _2: array<struct<Id:bigint,stepNum:bigint,stepId:int,stepTime:timestamp>>]
And the result looks like:
ds.groupByKey(_.Id).mapGroups((Id, Vals) => (Id, Vals.toList)).show()
+---+--------------------+
| _1| _2|
+---+--------------------+
| 1|[[1,21,1,3917-06-...|
| 2|[[2,10,3,3917-06-...|
+---+--------------------+
