spark DataFrame using explode and describe functions - apache-spark

I have a Spark DataFrame with 3 columns (id: Int, x_axis: Array[Int], y_axis: Array[Int]) and some sample data.
I want to get basic statistics of the y_axis column (count, mean, std, min, max) for each row of the DataFrame.
I have tried explode and then describe, but could not figure out how to get the expected output.
Any help or reference is much appreciated.

As you suggest, you could explode the y column and then use a window over id to compute all the statistics you are interested in. However, since you want to re-aggregate your data afterwards, you would generate a huge intermediate result for nothing.
Spark does not have many predefined functions for arrays, so the easiest way to achieve what you want is probably a UDF:
val extractFeatures = udf((x: Seq[Int]) => {
  val mean = x.sum.toDouble / x.size
  val variance = x.map(i => i * i).sum.toDouble / x.size - mean * mean
  val std = scala.math.sqrt(variance)
  Map("count" -> x.size.toDouble,
      "mean" -> mean,
      "std" -> std,
      "var" -> variance,
      "min" -> x.min.toDouble,
      "max" -> x.max.toDouble)
})

val df = sc
  .parallelize(Seq((1, Seq(1, 2, 3, 4, 5)), (2, Seq(1, 2, 1, 4))))
  .toDF("id", "y")
  .withColumn("described_y", extractFeatures('y))

df.show(false)
+---+---------------+---------------------------------------------------------------------------------------------+
|id |y |described_y |
+---+---------------+---------------------------------------------------------------------------------------------+
|1 |[1, 2, 3, 4, 5]|Map(count -> 5.0, mean -> 3.0, min -> 1.0, std -> 1.4142135623730951, max -> 5.0, var -> 2.0)|
|2 |[1, 2, 1, 4] |Map(count -> 4.0, mean -> 2.0, min -> 1.0, std -> 1.224744871391589, max -> 4.0, var -> 1.5) |
+---+---------------+---------------------------------------------------------------------------------------------+
And btw, the stddev you calculated is actually the variance. You need to take the square root to get the standard deviation.
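As a quick sanity check with the first row [1, 2, 3, 4, 5]: mean = 15/5 = 3, mean of squares = (1 + 4 + 9 + 16 + 25)/5 = 11, so the variance is 11 - 9 = 2 and the standard deviation is sqrt(2) ≈ 1.4142, matching the described_y output above.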

Related

PySpark: Convert Map Column Keys Using Dictionary

I have a PySpark DataFrame with a map column as below:
root
|-- id: long (nullable = true)
|-- map_col: map (nullable = true)
| |-- key: string
| |-- value: double (valueContainsNull = true)
The map_col has keys which need to be converted based on a dictionary. For example, the dictionary might be:
mapping = {'a': '1', 'b': '2', 'c': '5', 'd': '8' }
So, the DataFrame needs to change from:
[Row(id=123, map_col={'a': 0.0, 'b': -42.19}),
Row(id=456, map_col={'a': 13.25, 'c': -19.6, 'd': 15.6})]
to the following:
[Row(id=123, map_col={'1': 0.0, '2': -42.19}),
Row(id=456, map_col={'1': 13.25, '5': -19.6, '8': 15.6})]
I see that transform_keys is an option if I could write out the dictionary, but it's too large and dynamically generated earlier in the workflow. I think an explode/pivot could also work, but that seems non-performant?
Any ideas?
Edit: Added a bit to show that the size of the map in map_col is not uniform.
An approach using an RDD transformation:
def updateKey(theDict, mapDict):
    """
    update theDict's key using mapDict
    """
    updDict = []
    for item in theDict.items():
        updDict.append((mapDict[item[0]] if item[0] in mapDict.keys() else item[0], item[1]))
    return dict(updDict)

data_sdf.rdd. \
    map(lambda r: (r[0], r[1], updateKey(r[1], mapping))). \
    toDF(['id', 'map_col', 'new_map_col']). \
    show(truncate=False)
# +---+-----------------------------------+-----------------------------------+
# |id |map_col |new_map_col |
# +---+-----------------------------------+-----------------------------------+
# |123|{a -> 0.0, b -> -42.19, e -> 12.12}|{1 -> 0.0, 2 -> -42.19, e -> 12.12}|
# |456|{a -> 13.25, c -> -19.6, d -> 15.6}|{8 -> 15.6, 1 -> 13.25, 5 -> -19.6}|
# +---+-----------------------------------+-----------------------------------+
P.S., I added a new key within the map_col's first row to show what happens if no mapping is available
transform_keys can use a lambda, as shown in the example; it's not limited to an expr. However, the lambda or Python callable needs to use a function defined in pyspark.sql.functions, a Column method, or a Scala UDF, so a Python UDF that refers to the mapping dictionary object isn't currently possible with this mechanism. We can, however, make use of the when function to apply the mapping, by unrolling the key-value pairs of the mapping into chained when conditions. See the example below to illustrate the idea:
from typing import Dict, Callable
from functools import reduce
from pyspark.sql.functions import Column, when, transform_keys
from pyspark.sql import SparkSession


def apply_mapping(mapping: Dict[str, str]) -> Callable[[Column, Column], Column]:
    def convert_mapping_into_when_conditions(key: Column, _: Column) -> Column:
        initial_key, initial_value = mapping.popitem()
        initial_condition = when(key == initial_key, initial_value)
        return reduce(lambda x, y: x.when(key == y[0], y[1]), mapping.items(), initial_condition)

    return convert_mapping_into_when_conditions


if __name__ == "__main__":
    spark = SparkSession \
        .builder \
        .appName("Temp") \
        .getOrCreate()

    df = spark.createDataFrame([(1, {"foo": -2.0, "bar": 2.0})], ("id", "data"))
    mapping = {'foo': 'a', 'bar': 'b'}
    df.select(
        transform_keys("data", apply_mapping(mapping)).alias("data_transformed")
    ).show(truncate=False)
The output of the above is:
+---------------------+
|data_transformed |
+---------------------+
|{b -> 2.0, a -> -2.0}|
+---------------------+
which demonstrates the defined mapping (foo -> a, bar -> b) was successfully applied to the column. The apply_mapping function should be generic enough to copy and utilize in your own pipeline.
Another way:
Use itertools to create a map expression and inject it into PySpark's transform_keys function. upper is used just in case. Code below:
from itertools import chain
from pyspark.sql.functions import create_map, lit, transform_keys, upper

m_expr1 = create_map([lit(x) for x in chain(*mapping.items())])

new = df.withColumn('new_map_col', transform_keys("map_col", lambda k, _: upper(m_expr1[k])))
new.show(truncate=False)
+---+-----------------------------------+-----------------------------------+
|id |map_col |new_map_col |
+---+-----------------------------------+-----------------------------------+
|123|{a -> 0.0, b -> -42.19} |{1 -> 0.0, 2 -> -42.19} |
|456|{a -> 13.25, c -> -19.6, d -> 15.6}|{1 -> 13.25, 5 -> -19.6, 8 -> 15.6}|
+---+-----------------------------------+-----------------------------------+
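If map_col can also contain keys that are missing from mapping (as in the first answer's example with key e), upper(m_expr1[k]) yields null for them. A small hedged tweak is to fall back to the original key with coalesce:
from pyspark.sql.functions import coalesce

# sketch: keep the original key when the mapping has no entry for it
new = df.withColumn(
    'new_map_col',
    transform_keys("map_col", lambda k, _: coalesce(upper(m_expr1[k]), k))
)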

Calculate a sequence of Markov chain values

I have a Spark question. For the input, for each entity k I have a sequence of probabilities p_i with an associated value v_i; for example, the data can look like this:
entity | Probability | value
A | 0.8 | 10
A | 0.6 | 15
A | 0.3 | 20
B | 0.8 | 10
Then, for entity A, I'm expecting the avg value to be 0.8*10 + (1-0.8)*0.6*15 + (1-0.8)*(1-0.6)*0.3*20 + (1-0.8)*(1-0.6)*(1-0.3)*MAX_VALUE_DEFINED.
How could I achieve this in Spark using a DataFrame aggregation? I found it challenging given the complexity of grouping by entity and computing the sequence of results.
You can use a UDF to perform such custom calculations. The idea is to use collect_list to group all probabilities and values of an entity in one place so you can loop through them. However, collect_list does not respect the order of your records, which might lead to a wrong calculation. One way to fix that is to generate an ID for each row using monotonically_increasing_id and sort the collected arrays on it.
import pyspark.sql.functions as F

@F.pandas_udf('double')
def markov_udf(values):
    def markov(lst):
        # you can implement your Markov logic here
        s = 0
        for i, prob, val in lst:
            s += prob
        return s
    return values.apply(markov)

(df
    .withColumn('id', F.monotonically_increasing_id())
    .groupBy('entity')
    .agg(F.array_sort(F.collect_list(F.array('id', 'probability', 'value'))).alias('values'))
    .withColumn('markov', markov_udf('values'))
    .show(10, False)
)
+------+------------------------------------------------------+------+
|entity|values |markov|
+------+------------------------------------------------------+------+
|B |[[3.0, 0.8, 10.0]] |0.8 |
|A |[[0.0, 0.8, 10.0], [1.0, 0.6, 15.0], [2.0, 0.3, 20.0]]|1.7 |
+------+------------------------------------------------------+------+
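The markov placeholder above only sums the probabilities (hence 1.7 for A). As a hedged sketch, the expected-value formula from the question could be implemented inside it like this, assuming MAX_VALUE_DEFINED is a constant you supply (20 here purely for illustration):
MAX_VALUE_DEFINED = 20  # hypothetical; the question leaves it unspecified

def markov(lst):
    expected, remaining = 0.0, 1.0
    # lst is already sorted by the generated id, so rows are processed in input order
    for _, prob, val in lst:
        expected += remaining * prob * val
        remaining *= 1 - prob
    return expected + remaining * MAX_VALUE_DEFINED  # 11.4 for entity A with the sample data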
There may be a better solution, but I think this does what you needed.
from pyspark.sql import functions as F, Window as W

df = spark.createDataFrame(
    [('A', 0.8, 10),
     ('A', 0.6, 15),
     ('A', 0.3, 20),
     ('B', 0.8, 10)],
    ['entity', 'Probability', 'value']
)

w_desc = W.partitionBy('entity').orderBy(F.desc('value'))
w_asc = W.partitionBy('entity').orderBy('value')
df = df.withColumn('_ent_max_val', F.max('value').over(w_desc))
df = df.withColumn('_prob2', 1 - F.col('Probability'))
df = df.withColumn('_cum_prob2', F.product('_prob2').over(w_asc) / F.col('_prob2'))

df = (df.groupBy('entity')
        .agg(F.round((F.max('_ent_max_val') * F.product('_prob2')
                      + F.sum(F.col('_cum_prob2') * F.col('Probability') * F.col('value'))
                     ), 2).alias('mean_value'))
)
df.show()
df.show()
# +------+----------+
# |entity|mean_value|
# +------+----------+
# | A| 11.4|
# | B| 10.0|
# +------+----------+
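As a rough check for entity A (this answer substitutes the entity's own maximum value, 20, for MAX_VALUE_DEFINED): 0.8*10 + 0.2*0.6*15 + 0.2*0.4*0.3*20 + 0.2*0.4*0.7*20 = 8 + 1.8 + 0.48 + 1.12 = 11.4, which matches the output above.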

Pyspark: How to filter a Dataframe on a MapType column? (as in the style of isin() )

When I want to filter a DataFrame on a MapType column in the style of isin(), what would be the best strategy?
So basically I want to get all rows of a DataFrame where the contents of a MapType column match one of the entries in a list of MapType "instances". It could also be a join on that column, but all the methods I tried so far fail because EqualTo does not support ordering on type map.
Apart from the straightforward methods of using isin() or join(), I also came up with the idea of dumping the map to JSON using to_json() and then filtering on the JSON strings, but this seems to order the keys randomly, so the string comparison isn't reliable either.
Is there something easy that I'm missing? How would you recommend tackling this?
Example df:
+----+---------------------------------------------------------+
|key |metric |
+----+---------------------------------------------------------+
|123k|Map(metric1 -> 1.3, metric2 -> 6.3, metric3 -> 7.6) |
|d23d|Map(metric1 -> 1.5, metric2 -> 2.0, metric3 -> 2.2) |
|as3d|Map(metric1 -> 2.2, metric2 -> 4.3, metric3 -> 9.0) |
+----+---------------------------------------------------------+
filter ( pseudocode ):
df.where(metric.isin([
Map(metric1 -> 1.3, metric2 -> 6.3, metric3 -> 7.6),
Map(metric1 -> 1.5, metric2 -> 2.0, metric3 -> 2.2)
]))
Desired output:
+----+---------------------------------------------------------+
|key |metric |
+----+---------------------------------------------------------+
|123k|Map(metric1 -> 1.3, metric2 -> 6.3, metric3 -> 7.6) |
|d23d|Map(metric1 -> 1.5, metric2 -> 2.0, metric3 -> 2.2) |
+----+---------------------------------------------------------+
Comparing two map columns in Spark is not that obvious. For each key in the first map you need to check whether you have the same value in the second one, and vice versa for the keys.
It might be simpler to use a UDF, since in Python you can check dict equality:
from itertools import chain

from pyspark.sql import functions as F
from pyspark.sql.types import BooleanType

map_equals = F.udf(lambda x, y: x == y, BooleanType())

# create map1 literal to filter with
map1 = F.create_map(*[
    F.lit(x) for x in chain(*{"metric1": 1.3, "metric2": 6.3, "metric3": 7.6}.items())
])

df1 = df.filter(map_equals("metric", map1))
df1.show(truncate=False)
#+----+------------------------------------------------+
#|key |metric |
#+----+------------------------------------------------+
#|123k|[metric1 -> 1.3, metric2 -> 6.3, metric3 -> 7.6]|
#+----+------------------------------------------------+
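To get the isin()-style behaviour over several candidate maps, one hedged option is to OR the UDF comparison over a list of map literals (another_map_literal below is a hypothetical second literal built the same way as map1):
from functools import reduce

candidate_maps = [map1, another_map_literal]
isin_condition = reduce(lambda a, b: a | b,
                        [map_equals("metric", m) for m in candidate_maps])
df_filtered = df.filter(isin_condition)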
Another way is to add the map literal you want to filter with as a column and check whether, for every key in metric, you get the same value from that literal map.
Here's an example using transform on the map keys array with array_min to create the filter expression (if array_min returns true, all the values are equal):
filter_map_literal = F.create_map(*[
F.lit(x) for x in chain(*{"metric1": 1.3, "metric2": 6.3, "metric3": 7.6}.items())
])
df1 = df.withColumn("filter_map", filter_map_literal).filter(
F.array_min(F.expr("""transform(map_keys(metric),
x -> if(filter_map[x] = metric[x], true, false)
)""")
)
).drop("filter_map")
Not the most elegant way to compare map equality: you can collect the map keys, compare the value of each key in both maps, and make sure that all the values are the same. I think it's better to construct a filter DataFrame and do a semi join rather than passing the maps using isin():
Sample df and filter df:
df.show(truncate=False)
+----+------------------------------------------------+
|key |metric |
+----+------------------------------------------------+
|123k|[metric1 -> 1.3, metric2 -> 6.3, metric3 -> 7.6]|
|d23d|[metric1 -> 1.5, metric2 -> 2.0, metric3 -> 2.2]|
|as3d|[metric1 -> 2.2, metric2 -> 4.3, metric3 -> 9.0]|
+----+------------------------------------------------+
filter_df = df.select('metric').limit(2)
filter_df.show(truncate=False)
+------------------------------------------------+
|metric |
+------------------------------------------------+
|[metric1 -> 1.3, metric2 -> 6.3, metric3 -> 7.6]|
|[metric1 -> 1.5, metric2 -> 2.0, metric3 -> 2.2]|
+------------------------------------------------+
Filtering method:
import pyspark.sql.functions as F
result = df.alias('df').join(
filter_df.alias('filter_df'),
F.expr("""
aggregate(
transform(
concat(map_keys(df.metric), map_keys(filter_df.metric)),
x -> filter_df.metric[x] = df.metric[x]
),
true,
(acc, x) -> acc and x
)"""),
'left_semi'
)
result.show(truncate=False)
+----+------------------------------------------------+
|key |metric |
+----+------------------------------------------------+
|123k|[metric1 -> 1.3, metric2 -> 6.3, metric3 -> 7.6]|
|d23d|[metric1 -> 1.5, metric2 -> 2.0, metric3 -> 2.2]|
+----+------------------------------------------------+

Counting words after grouping records (Part 2)

Although I have an answer for what I want to achieve, the problem is that it's way too slow. The data set is not very large: it's 50 GB in total, but the affected part is probably just 5 to 10 GB of data. However, the following is what I require, and by slow I mean it was running for an hour and did not terminate.
from pyspark.ml.feature import Tokenizer
import pyspark.sql.functions as F

df_ = spark.createDataFrame([
    ('1', 'hello how are are you today'),
    ('1', 'hello how are you'),
    ('2', 'hello are you here'),
    ('2', 'how is it'),
    ('3', 'hello how are you'),
    ('3', 'hello how are you'),
    ('4', 'hello how is it you today')
], schema=['label', 'text'])

tokenizer = Tokenizer(inputCol='text', outputCol='tokens')
tokens = tokenizer.transform(df_)

# token_counts is not shown in the question; it is presumably built by exploding
# the tokens and counting per (label, token), e.g.:
# token_counts = tokens.select('label', F.explode('tokens').alias('token')).groupby('label', 'token').count()

token_counts.groupby('label')\
    .agg(F.collect_list(F.struct(F.col('token'), F.col('count'))).alias('text'))\
    .show(truncate=False)
Which gives me the token count for each label:
+-----+----------------------------------------------------------------+
|label|text |
+-----+----------------------------------------------------------------+
|3 |[[are,2], [how,2], [hello,2], [you,2]] |
|1 |[[today,1], [how,2], [are,3], [you,2], [hello,2]] |
|4 |[[hello,1], [how,1], [is,1], [today,1], [you,1], [it,1]] |
|2 |[[hello,1], [are,1], [you,1], [here,1], [is,1], [how,1], [it,1]]|
+-----+----------------------------------------------------------------+
However, I think the call to explode() is way too expensive for this.
I don't know, but it might be faster to count the tokens in each document and merge the counts later in a groupBy():
from collections import Counter

# udf_get_tokens is assumed to be a UDF that splits the text into a list of tokens
df_.select(['label'] + [udf_get_tokens(F.col('text')).alias('text')])\
    .rdd.map(lambda x: (x[0], list(Counter(x[1]).items()))) \
    .toDF(schema=['label', 'text'])\
    .show()
Gives the counts:
+-----+--------------------+
|label| text|
+-----+--------------------+
| 1|[[are,2], [hello,...|
| 1|[[are,1], [hello,...|
| 2|[[are,1], [hello,...|
| 2|[[how,1], [it,1],...|
| 3|[[are,1], [hello,...|
| 3|[[are,1], [hello,...|
| 4|[[you,1], [today,...|
+-----+--------------------+
Is there a way to merge those token counts in a more efficient way?
If the groups defined by label are largish, the obvious target for improvement is shuffle size: instead of shuffling the raw text, shuffle compact count vectors keyed by label. First vectorize the input:
from pyspark.ml.feature import CountVectorizer, Tokenizer
from pyspark.ml import Pipeline

pipeline_model = Pipeline(stages=[
    Tokenizer(inputCol='text', outputCol='tokens'),
    CountVectorizer(inputCol='tokens', outputCol='vectors')
]).fit(df_)

df_vec = pipeline_model.transform(df_).select("label", "vectors")
Then aggregate:
from pyspark.ml.linalg import SparseVector, DenseVector
from collections import defaultdict

def seq_func(acc, v):
    if isinstance(v, SparseVector):
        for i in v.indices:
            acc[int(i)] += v[int(i)]
    if isinstance(v, DenseVector):
        for i in range(len(v)):
            acc[int(i)] += v[int(i)]
    return acc

def comb_func(acc1, acc2):
    for k, v in acc2.items():
        acc1[k] += v
    return acc1

# build the (label, vector) pair RDD to aggregate over
rdd = df_vec.rdd.map(lambda r: (r['label'], r['vectors']))
aggregated = rdd.aggregateByKey(defaultdict(int), seq_func, comb_func)
And map back to the required output:
vocabulary = pipeline_model.stages[-1].vocabulary

def f(x, vocabulary=vocabulary):
    # For a list of tuples use [(vocabulary[i], float(v)) for i, v in x.items()]
    return {vocabulary[i]: float(v) for i, v in x.items()}

aggregated.mapValues(f).toDF(["id", "text"]).show(truncate=False)
# +---+-------------------------------------------------------------------------------------+
# |id |text |
# +---+-------------------------------------------------------------------------------------+
# |4 |[how -> 1.0, today -> 1.0, is -> 1.0, it -> 1.0, hello -> 1.0, you -> 1.0] |
# |3 |[how -> 2.0, hello -> 2.0, are -> 2.0, you -> 2.0] |
# |1 |[how -> 2.0, hello -> 2.0, are -> 3.0, you -> 2.0, today -> 1.0] |
# |2 |[here -> 1.0, how -> 1.0, are -> 1.0, is -> 1.0, it -> 1.0, hello -> 1.0, you -> 1.0]|
# +---+-------------------------------------------------------------------------------------+
This is worth trying only if the text part is considerably large; otherwise all the required transformations between DataFrame and Python objects might be more expensive than collect_list.

Spark Multiple Joins

Using the Spark context, I would like to perform multiple joins between RDDs, where the number of RDDs to be joined should be dynamic.
I would like the result to be unfolded, for example:
val rdd1 = sc.parallelize(List((1,1.0),(11,11.0), (111,111.0)))
val rdd2 = sc.parallelize(List((1,2.0),(11,12.0), (111,112.0)))
val rdd3 = sc.parallelize(List((1,3.0),(11,13.0), (111,113.0)))
val rdd4 = sc.parallelize(List((1,4.0),(11,14.0), (111,114.0)))
val rdd11 = rdd1.join(rdd2).join(rdd3).join(rdd4)
.foreach(println)
generates the following output:
(11,(((11.0,12.0),13.0),14.0))
(111,(((111.0,112.0),113.0),114.0))
(1,(((1.0,2.0),3.0),4.0))
I would like to:
1. Unfold the values, e.g. the first line should read (11, 11.0, 12.0, 13.0, 14.0).
2. Do it dynamically, so that it works for any number of RDDs to be joined.
Any ideas would be appreciated,
Eli.
Instead of using join, I would use union followed by groupByKey to achieve what you desire.
Here is what I would do -
val emptyRdd = sc.emptyRDD[(Int, Double)]
val listRdds = List(rdd1, rdd2, rdd3, rdd4) // satisfy your dynamic number of rdds
val unioned = listRdds.fold(emptyRdd)(_.union(_))
val grouped = unioned.groupByKey
grouped.collect().foreach(println(_))
This yields the result:
(1,CompactBuffer(1.0, 2.0, 3.0, 4.0))
(11,CompactBuffer(11.0, 12.0, 13.0, 14.0))
(111,CompactBuffer(111.0, 112.0, 113.0, 114.0))
Updated:
If you would still like to use join, this is how to do it with a somewhat more involved foldLeft -
val joined = listRdds match {
  case head :: tail =>
    tail.foldLeft(head.mapValues(Array(_)))(_.join(_).mapValues {
      case (arr: Array[Double], d: Double) => arr :+ d
    })
  case Nil => sc.emptyRDD[(Int, Array[Double])]
}
And joined.collect will yield
res14: Array[(Int, Array[Double])] = Array((1,Array(1.0, 2.0, 3.0, 4.0)), (11,Array(11.0, 12.0, 13.0, 14.0)), (111,Array(111.0, 112.0, 113.0, 114.0)))
Others with this problem may find groupWith helpful. From the docs:
>>> w = sc.parallelize([("a", 5), ("b", 6)])
>>> x = sc.parallelize([("a", 1), ("b", 4)])
>>> y = sc.parallelize([("a", 2)])
>>> z = sc.parallelize([("b", 42)])
>>> [(x, tuple(map(list, y))) for x, y in sorted(list(w.groupWith(x, y, z).collect()))]
[('a', ([5], [1], [2], [])), ('b', ([6], [4], [], [42]))]
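Applied to the RDDs from the question, the same idea in PySpark might look like this (a sketch; it relies on each key having exactly one value per RDD, which is what makes the flattening produce the unfolded tuples):
rdd1 = sc.parallelize([(1, 1.0), (11, 11.0), (111, 111.0)])
rdd2 = sc.parallelize([(1, 2.0), (11, 12.0), (111, 112.0)])
rdd3 = sc.parallelize([(1, 3.0), (11, 13.0), (111, 113.0)])
rdd4 = sc.parallelize([(1, 4.0), (11, 14.0), (111, 114.0)])

# groupWith cogroups all four RDDs by key; flatten the per-RDD iterables into one tuple
unfolded = (rdd1.groupWith(rdd2, rdd3, rdd4)
                .map(lambda kv: (kv[0],) + tuple(v for it in kv[1] for v in it)))
# unfolded.collect() includes e.g. (11, 11.0, 12.0, 13.0, 14.0)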
