I have a PySpark DataFrame with a map column as below:
root
|-- id: long (nullable = true)
|-- map_col: map (nullable = true)
| |-- key: string
| |-- value: double (valueContainsNull = true)
The map_col has keys which need to be converted based on a dictionary. For example, the dictionary might be:
mapping = {'a': '1', 'b': '2', 'c': '5', 'd': '8' }
So, the DataFrame needs to change from:
[Row(id=123, map_col={'a': 0.0, 'b': -42.19}),
Row(id=456, map_col={'a': 13.25, 'c': -19.6, 'd': 15.6})]
to the following:
[Row(id=123, map_col={'1': 0.0, '2': -42.19}),
Row(id=456, map_col={'1': 13.25, '5': -19.6, '8': 15.6})]
I see that transform_keys is an option if I could write-out the dictionary, but it's too large and dynamically-generated earlier in the workflow. I think an explode/pivot could also work, but seems non-performant?
Any ideas?
Edit: Added a bit to show that size of map in map_col is not uniform.
an approach using RDD transformation.
def updateKey(theDict, mapDict):
"""
update theDict's key using mapDict
"""
updDict = []
for item in theDict.items():
updDict.append((mapDict[item[0]] if item[0] in mapDict.keys() else item[0], item[1]))
return dict(updDict)
data_sdf.rdd. \
map(lambda r: (r[0], r[1], updateKey(r[1], mapping))). \
toDF(['id', 'map_col', 'new_map_col']). \
show(truncate=False)
# +---+-----------------------------------+-----------------------------------+
# |id |map_col |new_map_col |
# +---+-----------------------------------+-----------------------------------+
# |123|{a -> 0.0, b -> -42.19, e -> 12.12}|{1 -> 0.0, 2 -> -42.19, e -> 12.12}|
# |456|{a -> 13.25, c -> -19.6, d -> 15.6}|{8 -> 15.6, 1 -> 13.25, 5 -> -19.6}|
# +---+-----------------------------------+-----------------------------------+
P.S., I added a new key within the map_col's first row to show what happens if no mapping is available
transform_keys can use a lambda, as shown in the example, it's not just limited to an expr. However, the lambda or Python callable will need to utilize a function either defined in pyspark.sql.functions, a Column method, or a Scala UDF, so using a Python UDF which refers to the mapping dictionary object isn't currently possible with this mechanism. However, we can make use of the when function to apply the mapping, by unrolling the key-value pairs in the mapping into chained when conditions. See the below example to illustrate the idea:
from typing import Dict, Callable
from functools import reduce
from pyspark.sql.functions import Column, when, transform_keys
from pyspark.sql import SparkSession
def apply_mapping(mapping: Dict[str, str]) -> Callable[[Column, Column], Column]:
def convert_mapping_into_when_conditions(key: Column, _: Column) -> Column:
initial_key, initial_value = mapping.popitem()
initial_condition = when(key == initial_key, initial_value)
return reduce(lambda x, y: x.when(key == y[0], y[1]), mapping.items(), initial_condition)
return convert_mapping_into_when_conditions
if __name__ == "__main__":
spark = SparkSession\
.builder\
.appName("Temp")\
.getOrCreate()
df = spark.createDataFrame([(1, {"foo": -2.0, "bar": 2.0})], ("id", "data"))
mapping = {'foo': 'a', 'bar': 'b'}
df.select(transform_keys(
"data", apply_mapping(mapping)).alias("data_transformed")
).show(truncate=False)
The output of the above is:
+---------------------+
|data_transformed |
+---------------------+
|{b -> 2.0, a -> -2.0}|
+---------------------+
which demonstrates the defined mapping (foo -> a, bar -> b) was successfully applied to the column. The apply_mapping function should be generic enough to copy and utilize in your own pipeline.
Another way:
Use itertools to create an expression to inject into pysparks transform_keys function. Used upper just in case. Code below
from itertools import chain
m_expr1 = create_map([lit(x) for x in chain(*mapping.items())])
new =df.withColumn('new_map_col',transform_keys("map_col", lambda k, _: upper(m_expr1[k])))
new.show(truncate=False)
+---+-----------------------------------+-----------------------------------+
|id |map_col |new_map_col |
+---+-----------------------------------+-----------------------------------+
|123|{a -> 0.0, b -> -42.19} |{1 -> 0.0, 2 -> -42.19} |
|456|{a -> 13.25, c -> -19.6, d -> 15.6}|{1 -> 13.25, 5 -> -19.6, 8 -> 15.6}|
+---+-----------------------------------+-----------------------------------+
Related
I want to modify/filter on a property inside a struct.
Let's say I have a dataframe with the following column :
#+------------------------------------------+
#| arrayCol |
#+------------------------------------------+
#| {"a" : "some_value", "b" : [1, 2, 3]} |
#+------------------------------------------+
Schema:
struct<a:string, b:array<int>>
I want to filter out some values in 'b' property when value inside the array == 1
The result desired is the following :
#+------------------------------------------+
#| arrayCol |
#+------------------------------------------+
#| {"a" : "some_value", "b" : [2, 3]} |
#+------------------------------------------+
Is it possible to do it without extracting the property, filter the values, and re-build another struct ?
Update:
For spark 3.1+, withField can be used to update the struct column without having to recreate all the struct. In your case, you can update the field b using filter function to filter the array values like this:
import pyspark.sql.functions as F
df1 = df.withColumn(
'arrayCol',
F.col('arrayCol').withField('b', F.filter(F.col("arrayCol.b"), lambda x: x != 1))
)
df1.show()
#+--------------------+
#| arrayCol|
#+--------------------+
#|{some_value, [2, 3]}|
#+--------------------+
For older versions, Spark doesn’t support adding/updating fields in nested structures. To update a struct column, you'll need to create a new struct using the existing fields and the updated ones:
import pyspark.sql.functions as F
df1 = df.withColumn(
"arrayCol",
F.struct(
F.col("arrayCol.a").alias("a"),
F.expr("filter(arrayCol.b, x -> x != 1)").alias("b")
)
)
One way would be to define a UDF:
Example:
import ast
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import StringType, MapType
def remove_value(col):
col["b"] = str([x for x in ast.literal_eval(col["b"]) if x != 1])
return col
if __name__ == "__main__":
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
[
{
"arrayCol": {
"a": "some_value",
"b": "[1, 2, 3]",
},
},
]
)
remove_value_udf = spark.udf.register(
"remove_value_udf", remove_value, MapType(StringType(), StringType())
)
df = df.withColumn(
"result",
remove_value_udf(F.col("arrayCol")),
)
Result:
root
|-- arrayCol: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
|-- result: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
+---------------------------------+------------------------------+
|arrayCol |result |
+---------------------------------+------------------------------+
|{a -> some_value, b -> [1, 2, 3]}|{a -> some_value, b -> [2, 3]}|
+---------------------------------+------------------------------+
I have two columns with type Map[String, Integer], I want to use withColumn to add a column to represent equality of the two maps.
I try df.withColumn("euqal", col("A") === col("B")), it throws exception EqualTo does not support ordering on type map<string,int>
We can do this using UDF:
import spark.implicits._
import org.apache.spark.sql.functions._
//UDF
val mapEqual = udf((col1: Map[StringType, IntegerType], col2: Map[StringType, IntegerType]) => {
col1.sameElements(col2)
})
//sample DF
var df = List((Map("1" -> 1), Map("1" -> 1)), (Map("2" -> 1), Map("2" -> 10)))
.toDF("col1", "col2")
df.withColumn("equal", mapEqual('col1, 'col2)).show(false)
+--------+---------+-----+
|col1 |col2 |equal|
+--------+---------+-----+
|[1 -> 1]|[1 -> 1] |true |
|[2 -> 1]|[2 -> 10]|false|
+--------+---------+-----+
I have spark dataframe with two columns of type Integer and Map, I wanted to know best way to update the values for all the keys for map column.
With help of UDF, I am able to update the values
def modifyValues = (map_data: Map[String, Int]) => {
val divideWith = 10
map_data.mapValues( _ / divideWith)
}
val modifyMapValues = udf(modifyValues)
df.withColumn("updatedValues", modifyMapValues($"data_map"))
scala> dF.printSchema()
root
|-- id: integer (nullable = true)
|-- data_map: map (nullable = true)
| |-- key: string
| |-- value: integer (valueContainsNull = true)
Sample data:
>val ds = Seq(
(1, Map("foo" -> 100, "bar" -> 200)),
(2, Map("foo" -> 200)),
(3, Map("bar" -> 200))
).toDF("id", "data_map")
Expected output:
+---+-----------------------+
|id |data_map |
+---+-----------------------+
|1 |[foo -> 10, bar -> 20] |
|2 |[foo -> 20] |
|3 |[bar -> 1] |
+---+-----------------------+
Wanted to know, is there anyway to do this without UDF?
One possible way how to do it (without UDF) is this one:
extract keys using map_keys to an array
extract values using map_values to an array
transform extracted values using TRANSFORM (available since Spark 2.4)
create back the map using map_from_arrays
import org.apache.spark.sql.functions.{expr, map_from_arrays, map_values, map_keys}
ds
.withColumn("values", map_values($"data_map"))
.withColumn("keys", map_keys($"data_map"))
.withColumn("values_transformed", expr("TRANSFORM(values, v -> v/10)"))
.withColumn("data_map_transformed", map_from_arrays($"keys", $"values_transformed"))
import pyspark.sql.functions as sp
from pyspark.sql.types import StringType, FloatType, MapType
Add a new key with any value:
my_update_udf = sp.udf(lambda x: {**x, **{'new_key':77}}, MapType(StringType(), FloatType()))
sdf = sdf.withColumn('updated', my_update_udf(sp.col('to_be_updated')))
Update value for all/one key(s):
my_update_udf = sp.udf(lambda x: {k:v/77) for k,v in x.items() if v!=None and k=='my_key'}, MapType(StringType(), FloatType()))
sdf = sdf.withColumn('updated', my_update_udf(sp.col('to_be_updated')))
There is another way available in Spark 3:
Seq(
Map("keyToUpdate" -> 11, "someOtherKey" -> 12),
Map("keyToUpdate" -> 21, "someOtherKey" -> 22)
).toDF("mapColumn")
.withColumn(
"mapColumn",
map_concat(
map(lit("keyToUpdate"), col("mapColumn.keyToUpdate") * 10), // <- transformation
map_filter(col("mapColumn"), (k, _) => k =!= "keyToUpdate")
)
)
.show(false)
output:
+----------------------------------------+
|mapColumn |
+----------------------------------------+
|{someOtherKey -> 12, keyToUpdate -> 110}|
|{someOtherKey -> 22, keyToUpdate -> 210}|
+----------------------------------------+
map_filter(expr, func) - Filters entries in a map using the function
map_concat(map, ...) - Returns the union of all the given maps
I have spark DataFrame having 3 columns(id: Int, x_axis: Array[Int], y_axis: Array[Int]) with some sample data below:
want to get basic statistics of y_axis column for each row in dataframe. Output would be something like:
I have tried explode and then describe, but could not figure out expected output.
Any help or reference is much apprecieated
As you suggest, you could explode the Y column and then use a window over id to compute all the statistics you are interested in. Nevertheless, you want to re aggregate your data afterwards so you would generate a huge intermediate result for nothing.
Spark does not have a lot of predefined functions for arrays. Therefore the easiest way to achieve what you want is probably a UDF:
val extractFeatures = udf( (x : Seq[Int]) => {
val mean = x.sum.toDouble/x.size
val variance = x.map(i=> i*i).sum.toDouble/x.size - mean*mean
val std = scala.math.sqrt(variance)
Map("count" -> x.size.toDouble,
"mean" -> mean,
"std" -> std,
"min" -> x.min.toDouble,
"max" -> x.max.toDouble)
})
val df = sc
.parallelize(Seq((1,Seq(1,2,3,4,5)), (2,Seq(1,2,1,4))))
.toDF("id", "y")
.withColumn("described_y", extractFeatures('y))
.show(false)
+---+---------------+---------------------------------------------------------------------------------------------+
|id |y |described_y |
+---+---------------+---------------------------------------------------------------------------------------------+
|1 |[1, 2, 3, 4, 5]|Map(count -> 5.0, mean -> 3.0, min -> 1.0, std -> 1.4142135623730951, max -> 5.0, var -> 2.0)|
|2 |[1, 2, 1, 4] |Map(count -> 4.0, mean -> 2.0, min -> 1.0, std -> 1.224744871391589, max -> 4.0, var -> 1.5) |
+---+---------------+---------------------------------------------------------------------------------------------+
And btw, the stddev you calculated is actually the variance. You need to take the square root to get the standard deviation.
Although I am having an answer for what I want to achieve, the problem is that it's way to slow. The data set is not very large. It's 50GB in total but the affected part is probably just between 5 to 10GB of data. However, the following is what I require, but it's way to slow And by slow I mean it was running for an hour and it didn't terminate.
df_ = spark.createDataFrame([
('1', 'hello how are are you today'),
('1', 'hello how are you'),
('2', 'hello are you here'),
('2', 'how is it'),
('3', 'hello how are you'),
('3', 'hello how are you'),
('4', 'hello how is it you today')
], schema=['label', 'text'])
tokenizer = Tokenizer(inputCol='text', outputCol='tokens')
tokens = tokenizer.transform(df_)
token_counts.groupby('label')\
.agg(F.collect_list(F.struct(F.col('token'), F.col('count'))).alias('text'))\
.show(truncate=False)
Which gives me the token count for each label:
+-----+----------------------------------------------------------------+
|label|text |
+-----+----------------------------------------------------------------+
|3 |[[are,2], [how,2], [hello,2], [you,2]] |
|1 |[[today,1], [how,2], [are,3], [you,2], [hello,2]] |
|4 |[[hello,1], [how,1], [is,1], [today,1], [you,1], [it,1]] |
|2 |[[hello,1], [are,1], [you,1], [here,1], [is,1], [how,1], [it,1]]|
+-----+----------------------------------------------------------------+
However, I think the call to explode() is way too expensive for this.
I don't know but it might be faster to count the tokens in each "dokument" and later merge it in a groupBy():
df_.select(['label'] + [udf_get_tokens(F.col('text')).alias('text')])\
.rdd.map(lambda x: (x[0], list(Counter(x[1]).items()))) \
.toDF(schema=['label', 'text'])\
.show()
Gives the counts:
+-----+--------------------+
|label| text|
+-----+--------------------+
| 1|[[are,2], [hello,...|
| 1|[[are,1], [hello,...|
| 2|[[are,1], [hello,...|
| 2|[[how,1], [it,1],...|
| 3|[[are,1], [hello,...|
| 3|[[are,1], [hello,...|
| 4|[[you,1], [today,...|
+-----+--------------------+
Is there a way to merge those token counts in a more efficient way?
If groups defined by id are largish the obvious target for improvement is shuffle size. Instead of shuffling text, shuffle labels. First vectorize input
from pyspark.ml.feature import CountVectorizer
from pyspark.ml import Pipeline
pipeline_model = Pipeline(stages=[
Tokenizer(inputCol='text', outputCol='tokens'),
CountVectorizer(inputCol='tokens', outputCol='vectors')
]).fit(df_)
df_vec = pipeline_model.transform(df_).select("label", "vectors")
Then aggregate:
from pyspark.ml.linalg import SparseVector, DenseVector
from collections import defaultdict
def seq_func(acc, v):
if isinstance(v, SparseVector):
for i in v.indices:
acc[int(i)] += v[int(i)]
if isinstance(v, DenseVector):
for i in len(v):
acc[int(i)] += v[int(i)]
return acc
def comb_func(acc1, acc2):
for k, v in acc2.items():
acc1[k] += v
return acc1
aggregated = rdd.aggregateByKey(defaultdict(int), seq_func, comb_func)
And map back to the required output:
vocabulary = pipeline_model.stages[-1].vocabulary
def f(x, vocabulary=vocabulary):
# For list of tuples use [(vocabulary[i], float(v)) for i, v in x.items()]
return {vocabulary[i]: float(v) for i, v in x.items()}
aggregated.mapValues(f).toDF(["id", "text"]).show(truncate=False)
# +---+-------------------------------------------------------------------------------------+
# |id |text |
# +---+-------------------------------------------------------------------------------------+
# |4 |[how -> 1.0, today -> 1.0, is -> 1.0, it -> 1.0, hello -> 1.0, you -> 1.0] |
# |3 |[how -> 2.0, hello -> 2.0, are -> 2.0, you -> 2.0] |
# |1 |[how -> 2.0, hello -> 2.0, are -> 3.0, you -> 2.0, today -> 1.0] |
# |2 |[here -> 1.0, how -> 1.0, are -> 1.0, is -> 1.0, it -> 1.0, hello -> 1.0, you -> 1.0]|
# +---+-------------------------------------------------------------------------------------+
This worth trying only if text part is considerably large - otherwise all required transformations between DataFrame and Python objects might be more expensive than collecting_list.