Although I have an answer for what I want to achieve, the problem is that it's way too slow. The data set is not very large: it's 50 GB in total, but the affected part is probably just 5 to 10 GB of data. The following does what I require, but by slow I mean it ran for an hour without terminating.
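from pyspark.ml.feature import Tokenizer
import pyspark.sql.functions as F
df_ = spark.createDataFrame([
    ('1', 'hello how are are you today'),
    ('1', 'hello how are you'),
    ('2', 'hello are you here'),
    ('2', 'how is it'),
    ('3', 'hello how are you'),
    ('3', 'hello how are you'),
    ('4', 'hello how is it you today')
], schema=['label', 'text'])
tokenizer = Tokenizer(inputCol='text', outputCol='tokens')
tokens = tokenizer.transform(df_)
# Assumed reconstruction of the step missing here: one row per token, counted per label
token_counts = tokens.select('label', F.explode('tokens').alias('token'))\
    .groupby('label', 'token').count()
token_counts.groupby('label')\
    .agg(F.collect_list(F.struct(F.col('token'), F.col('count'))).alias('text'))\
    .show(truncate=False)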
Which gives me the token count for each label:
+-----+----------------------------------------------------------------+
|label|text |
+-----+----------------------------------------------------------------+
|3 |[[are,2], [how,2], [hello,2], [you,2]] |
|1 |[[today,1], [how,2], [are,3], [you,2], [hello,2]] |
|4 |[[hello,1], [how,1], [is,1], [today,1], [you,1], [it,1]] |
|2 |[[hello,1], [are,1], [you,1], [here,1], [is,1], [how,1], [it,1]]|
+-----+----------------------------------------------------------------+
However, I think the call to explode() is way too expensive for this.
I don't know, but it might be faster to count the tokens in each "document" first and merge the counts later in a groupBy():
from collections import Counter
# udf_get_tokens is a tokenizing UDF whose definition is not shown here
df_.select(['label'] + [udf_get_tokens(F.col('text')).alias('text')])\
    .rdd.map(lambda x: (x[0], list(Counter(x[1]).items())))\
    .toDF(schema=['label', 'text'])\
    .show()
Gives the counts:
+-----+--------------------+
|label| text|
+-----+--------------------+
| 1|[[are,2], [hello,...|
| 1|[[are,1], [hello,...|
| 2|[[are,1], [hello,...|
| 2|[[how,1], [it,1],...|
| 3|[[are,1], [hello,...|
| 3|[[are,1], [hello,...|
| 4|[[you,1], [today,...|
+-----+--------------------+
Is there a way to merge those token counts in a more efficient way?
If the groups defined by id are largish, the obvious target for improvement is shuffle size. Instead of shuffling text, shuffle labels. First vectorize the input:
from pyspark.ml.feature import CountVectorizer, Tokenizer
from pyspark.ml import Pipeline
pipeline_model = Pipeline(stages=[
    Tokenizer(inputCol='text', outputCol='tokens'),
    CountVectorizer(inputCol='tokens', outputCol='vectors')
]).fit(df_)
df_vec = pipeline_model.transform(df_).select("label", "vectors")
Then aggregate:
from pyspark.ml.linalg import SparseVector, DenseVector
from collections import defaultdict
def seq_func(acc, v):
    # Add one count vector into the running accumulator for its label
    if isinstance(v, SparseVector):
        for i in v.indices:
            acc[int(i)] += v[int(i)]
    if isinstance(v, DenseVector):
        for i in range(len(v)):
            acc[int(i)] += v[int(i)]
    return acc
def comb_func(acc1, acc2):
    # Merge two partial accumulators
    for k, v in acc2.items():
        acc1[k] += v
    return acc1
# aggregateByKey needs (key, value) pairs
rdd = df_vec.rdd.map(lambda row: (row.label, row.vectors))
aggregated = rdd.aggregateByKey(defaultdict(int), seq_func, comb_func)
And map back to the required output:
vocabulary = pipeline_model.stages[-1].vocabulary
def f(x, vocabulary=vocabulary):
    # For list of tuples use [(vocabulary[i], float(v)) for i, v in x.items()]
    return {vocabulary[i]: float(v) for i, v in x.items()}
aggregated.mapValues(f).toDF(["id", "text"]).show(truncate=False)
# +---+-------------------------------------------------------------------------------------+
# |id |text |
# +---+-------------------------------------------------------------------------------------+
# |4 |[how -> 1.0, today -> 1.0, is -> 1.0, it -> 1.0, hello -> 1.0, you -> 1.0] |
# |3 |[how -> 2.0, hello -> 2.0, are -> 2.0, you -> 2.0] |
# |1 |[how -> 2.0, hello -> 2.0, are -> 3.0, you -> 2.0, today -> 1.0] |
# |2 |[here -> 1.0, how -> 1.0, are -> 1.0, is -> 1.0, it -> 1.0, hello -> 1.0, you -> 1.0]|
# +---+-------------------------------------------------------------------------------------+
This is worth trying only if the text part is considerably large; otherwise all the required transformations between DataFrame and Python objects might be more expensive than collect_list.
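The idea from the question, counting tokens per document and merging the counts per label afterwards, can be sketched in plain Python without Spark. This only illustrates the merge logic on the sample data above; `count_tokens` and `merge_counts` are hypothetical helper names, and whitespace splitting stands in for the Tokenizer.

```python
from collections import Counter, defaultdict

rows = [
    ("1", "hello how are are you today"),
    ("1", "hello how are you"),
    ("2", "hello are you here"),
    ("2", "how is it"),
    ("3", "hello how are you"),
    ("3", "hello how are you"),
    ("4", "hello how is it you today"),
]

def count_tokens(text):
    """Count tokens of a single document (assumption: lowercase + whitespace split)."""
    return Counter(text.lower().split())

def merge_counts(rows):
    """Merge the per-document counters of all rows sharing a label."""
    merged = defaultdict(Counter)
    for label, text in rows:
        merged[label].update(count_tokens(text))
    return dict(merged)

merged = merge_counts(rows)
print(sorted(merged["1"].items()))
# [('are', 3), ('hello', 2), ('how', 2), ('today', 1), ('you', 2)]
```

Merging small per-document counters is the same role that seq_func and comb_func play in the aggregateByKey call above.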
Related
I have a PySpark DataFrame with a map column as below:
root
|-- id: long (nullable = true)
|-- map_col: map (nullable = true)
| |-- key: string
| |-- value: double (valueContainsNull = true)
The map_col has keys which need to be converted based on a dictionary. For example, the dictionary might be:
mapping = {'a': '1', 'b': '2', 'c': '5', 'd': '8' }
So, the DataFrame needs to change from:
[Row(id=123, map_col={'a': 0.0, 'b': -42.19}),
Row(id=456, map_col={'a': 13.25, 'c': -19.6, 'd': 15.6})]
to the following:
[Row(id=123, map_col={'1': 0.0, '2': -42.19}),
Row(id=456, map_col={'1': 13.25, '5': -19.6, '8': 15.6})]
I see that transform_keys is an option if I could write-out the dictionary, but it's too large and dynamically-generated earlier in the workflow. I think an explode/pivot could also work, but seems non-performant?
Any ideas?
Edit: Added a bit to show that size of map in map_col is not uniform.
An approach using an RDD transformation:
def updateKey(theDict, mapDict):
    """
    Update theDict's keys using mapDict; keys without a mapping are kept as-is.
    """
    updDict = []
    for item in theDict.items():
        updDict.append((mapDict[item[0]] if item[0] in mapDict.keys() else item[0], item[1]))
    return dict(updDict)
data_sdf.rdd. \
    map(lambda r: (r[0], r[1], updateKey(r[1], mapping))). \
    toDF(['id', 'map_col', 'new_map_col']). \
    show(truncate=False)
# +---+-----------------------------------+-----------------------------------+
# |id |map_col |new_map_col |
# +---+-----------------------------------+-----------------------------------+
# |123|{a -> 0.0, b -> -42.19, e -> 12.12}|{1 -> 0.0, 2 -> -42.19, e -> 12.12}|
# |456|{a -> 13.25, c -> -19.6, d -> 15.6}|{8 -> 15.6, 1 -> 13.25, 5 -> -19.6}|
# +---+-----------------------------------+-----------------------------------+
P.S., I added a new key within the map_col's first row to show what happens if no mapping is available
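The key lookup in updateKey can also be written with dict.get, which falls back to the original key when no mapping exists. A minimal plain-Python sketch (the name `remap_keys` is made up for illustration):

```python
def remap_keys(the_dict, map_dict):
    """Return a copy of the_dict with keys translated through map_dict.

    Keys that have no entry in map_dict are kept unchanged.
    """
    return {map_dict.get(key, key): value for key, value in the_dict.items()}

mapping = {'a': '1', 'b': '2', 'c': '5', 'd': '8'}
row = {'a': 0.0, 'b': -42.19, 'e': 12.12}
print(remap_keys(row, mapping))  # {'1': 0.0, '2': -42.19, 'e': 12.12}
```

A function like this could stand in for updateKey inside the rdd.map call above.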
transform_keys can take a lambda, as shown in the example; it's not limited to an expr. However, the lambda or Python callable needs to use a function defined in pyspark.sql.functions, a Column method, or a Scala UDF, so a Python UDF that refers to the mapping dictionary object isn't currently possible with this mechanism. We can instead use the when function to apply the mapping, by unrolling the key-value pairs in the mapping into chained when conditions. The example below illustrates the idea:
from functools import reduce
from typing import Callable, Dict
from pyspark.sql import Column, SparkSession
from pyspark.sql.functions import transform_keys, when
def apply_mapping(mapping: Dict[str, str]) -> Callable[[Column, Column], Column]:
    def convert_mapping_into_when_conditions(key: Column, _: Column) -> Column:
        # Unpack instead of calling popitem() so the caller's dictionary is not mutated
        (initial_key, initial_value), *remaining = mapping.items()
        initial_condition = when(key == initial_key, initial_value)
        return reduce(lambda x, y: x.when(key == y[0], y[1]), remaining, initial_condition)
    return convert_mapping_into_when_conditions
if __name__ == "__main__":
    spark = SparkSession\
        .builder\
        .appName("Temp")\
        .getOrCreate()
    df = spark.createDataFrame([(1, {"foo": -2.0, "bar": 2.0})], ("id", "data"))
    mapping = {'foo': 'a', 'bar': 'b'}
    df.select(transform_keys(
        "data", apply_mapping(mapping)).alias("data_transformed")
    ).show(truncate=False)
The output of the above is:
+---------------------+
|data_transformed |
+---------------------+
|{b -> 2.0, a -> -2.0}|
+---------------------+
which demonstrates the defined mapping (foo -> a, bar -> b) was successfully applied to the column. The apply_mapping function should be generic enough to copy and utilize in your own pipeline.
Another way:
Use itertools to build a map expression to inject into PySpark's transform_keys function. I used upper just in case. Code below:
from itertools import chain
from pyspark.sql.functions import create_map, lit, transform_keys, upper
m_expr1 = create_map([lit(x) for x in chain(*mapping.items())])
new = df.withColumn('new_map_col', transform_keys("map_col", lambda k, _: upper(m_expr1[k])))
new.show(truncate=False)
+---+-----------------------------------+-----------------------------------+
|id |map_col |new_map_col |
+---+-----------------------------------+-----------------------------------+
|123|{a -> 0.0, b -> -42.19} |{1 -> 0.0, 2 -> -42.19} |
|456|{a -> 13.25, c -> -19.6, d -> 15.6}|{1 -> 13.25, 5 -> -19.6, 8 -> 15.6}|
+---+-----------------------------------+-----------------------------------+
I have a Spark question. For each entity k in the input I have a sequence of probabilities p_i, each with an associated value v_i. For example, the data can look like this:
entity | Probability | value
A | 0.8 | 10
A | 0.6 | 15
A | 0.3 | 20
B | 0.8 | 10
Then, for entity A, I'm expecting the avg value to be 0.8*10 + (1-0.8)*0.6*15 + (1-0.8)*(1-0.6)*0.3*20 + (1-0.8)*(1-0.6)*(1-0.3)*MAX_VALUE_DEFINED.
How could I achieve this in Spark using DataFrame agg func? I found it's challenging given the complexity to groupBy entity and compute the sequence of results.
You can use a UDF to perform such custom calculations. The idea is to use collect_list to gather all probabilities and values of an entity into one place so you can loop through them. However, collect_list does not respect the order of your records and might therefore lead to the wrong calculation. One way to fix this is to generate an ID for each row using monotonically_increasing_id:
import pyspark.sql.functions as F
@F.pandas_udf('double')
def markov_udf(values):
    def markov(lst):
        # you can implement your markov logic here
        s = 0
        for i, prob, val in lst:
            s += prob
        return s
    return values.apply(markov)
(df
    .withColumn('id', F.monotonically_increasing_id())
    .groupBy('entity')
    .agg(F.array_sort(F.collect_list(F.array('id', 'probability', 'value'))).alias('values'))
    .withColumn('markov', markov_udf('values'))
    .show(10, False)
)
+------+------------------------------------------------------+------+
|entity|values |markov|
+------+------------------------------------------------------+------+
|B |[[3.0, 0.8, 10.0]] |0.8 |
|A |[[0.0, 0.8, 10.0], [1.0, 0.6, 15.0], [2.0, 0.3, 20.0]]|1.7 |
+------+------------------------------------------------------+------+
There may be a better solution, but I think this does what you needed.
from pyspark.sql import functions as F, Window as W
df = spark.createDataFrame(
    [('A', 0.8, 10),
     ('A', 0.6, 15),
     ('A', 0.3, 20),
     ('B', 0.8, 10)],
    ['entity', 'Probability', 'value']
)
w_desc = W.partitionBy('entity').orderBy(F.desc('value'))
w_asc = W.partitionBy('entity').orderBy('value')
df = df.withColumn('_ent_max_val', F.max('value').over(w_desc))
df = df.withColumn('_prob2', 1 - F.col('Probability'))
df = df.withColumn('_cum_prob2', F.product('_prob2').over(w_asc) / F.col('_prob2'))
df = (df.groupBy('entity')
    .agg(F.round((F.max('_ent_max_val') * F.product('_prob2')
                  + F.sum(F.col('_cum_prob2') * F.col('Probability') * F.col('value'))
                  ), 2).alias('mean_value'))
)
df.show()
# +------+----------+
# |entity|mean_value|
# +------+----------+
# | A| 11.4|
# | B| 10.0|
# +------+----------+
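To sanity-check these numbers, the formula from the question can be evaluated directly in plain Python. This sketch assumes, as the code above does, that MAX_VALUE_DEFINED is the largest value of the entity and that rows are processed in the order given; `expected_value` is a hypothetical helper.

```python
def expected_value(rows, max_value):
    """rows: (probability, value) pairs in order; max_value: fallback if no row is taken."""
    total = 0.0
    remaining = 1.0  # probability that none of the earlier rows was taken
    for prob, value in rows:
        total += remaining * prob * value
        remaining *= 1.0 - prob
    return total + remaining * max_value

a = [(0.8, 10), (0.6, 15), (0.3, 20)]
b = [(0.8, 10)]
print(round(expected_value(a, max(v for _, v in a)), 2))  # 11.4
print(round(expected_value(b, max(v for _, v in b)), 2))  # 10.0
```

The running `remaining` factor plays the role of the cumulative product of (1 - Probability) in the window version.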
I want to drop rows in a PySpark DataFrame where a certain column contains an empty map. How do I do this? I can't seem to declare a typed empty MapType against which to compare my column. I have seen that in Scala, you can use typedLit, but there seems to be no such equivalent in PySpark. I have also tried using lit(...) and casting to a struct<string,int> but I have found no acceptable argument for lit() (tried using None which returns null and {} which is an error).
I'm sure this is trivial but I haven't seen any docs on this!
Here is a solution using PySpark's built-in size function:
from pyspark.sql.functions import col, size
df = spark.createDataFrame(
    [(1, {1:'A'} ),
     (2, {2:'B'} ),
     (3, {3:'C'} ),
     (4, {}),
     (5, None)]
).toDF("id", "map")
df.printSchema()
# root
# |-- id: long (nullable = true)
# |-- map: map (nullable = true)
# | |-- key: long
# | |-- value: string (valueContainsNull = true)
df.withColumn("is_empty", size(col("map")) <= 0).show()
# +---+--------+--------+
# | id| map|is_empty|
# +---+--------+--------+
# | 1|[1 -> A]| false|
# | 2|[2 -> B]| false|
# | 3|[3 -> C]| false|
# | 4| []| true|
# | 5| null| true|
# +---+--------+--------+
Note that the condition is size <= 0 because for null the function returns -1 if the spark.sql.legacy.sizeOfNull setting is true; otherwise it returns null. Here you can find more details.
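Why the comparison is `<= 0` rather than `== 0` can be modelled in plain Python. This is only a sketch of the semantics described above, not Spark code; `spark_size` and `is_empty` are hypothetical helpers, and None stands in for SQL null.

```python
def spark_size(m, legacy_size_of_null=True):
    """Model of size(): -1 for null in legacy mode, otherwise null (None)."""
    if m is None:
        return -1 if legacy_size_of_null else None
    return len(m)

def is_empty(m, legacy_size_of_null=True):
    """Model of size(col) <= 0; comparing null with a number stays null (None)."""
    n = spark_size(m, legacy_size_of_null)
    return None if n is None else n <= 0

maps = [{1: 'A'}, {}, None]
print([is_empty(m) for m in maps])         # [False, True, True]
print([is_empty(m, False) for m in maps])  # [False, True, None]
```

With the non-legacy setting the null row yields null rather than true, so an explicit isNull check may be needed to catch it.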
Generic solution: comparing Map column and literal Map
For a more generic solution we can use the built-in function size in combination with a UDF which appends the string key + value of each item to a sorted list (thank you @jxc for pointing out the problem with the previous version). The hypothesis here is that two maps are equal when:
they have the same size
the string representation of key + value is identical between the items of the maps
The literal map is created from an arbitrary python dictionary combining keys and values via map_from_arrays:
from pyspark.sql.functions import udf, lit, size, when, map_from_arrays, array
df = spark.createDataFrame([
    [1, {}],
    [2, {1:'A', 2:'B', 3:'C'}],
    [3, {1:'A', 2:'B'}]
]).toDF("key", "map")
# Named literal_dict to avoid shadowing the built-in dict
literal_dict = { 1:'A' , 2:'B' }
map_keys_ = array([lit(k) for k in literal_dict.keys()])
map_values_ = array([lit(v) for v in literal_dict.values()])
tmp_map = map_from_arrays(map_keys_, map_values_)
to_strlist_udf = udf(lambda d: sorted([str(k) + str(d[k]) for k in d.keys()]))
def map_equals(m1, m2):
    return when(
        (size(m1) == size(m2)) &
        (to_strlist_udf(m1) == to_strlist_udf(m2)), True
    ).otherwise(False)
df = df.withColumn("equals", map_equals(df["map"], tmp_map))
df.show(10, False)
# +---+------------------------+------+
# |key|map |equals|
# +---+------------------------+------+
# |1 |[] |false |
# |2 |[1 -> A, 2 -> B, 3 -> C]|false |
# |3 |[1 -> A, 2 -> B] |true |
# +---+------------------------+------+
Note: As you can see the pyspark == operator works pretty well for array comparison as well.
Here is a use case I am looking to solve with PySpark. I currently have a DataFrame with word tokens and want to build a vocabulary and then replace each word with its index in the vocabulary. Here is my DataFrame:
>>> wordDataFrame.show(10, False)
+---+-------------------------------------------------+
|id |words |
+---+-------------------------------------------------+
|0 |[hi, i, heard, about, spark] |
|1 |[i, wish, java, could, use, case, spark, classes]|
+---+-------------------------------------------------+
When I use the CountVectorizer
from pyspark.ml.feature import CountVectorizer
cv = CountVectorizer(binary=True)\
    .setInputCol("words")\
    .setOutputCol("countVec")\
    .setToLowercase(True)\
    .setMinTF(1)\
    .setMinDF(1)
fittedCV = cv.fit(wordDataFrame)
fittedCV.transform(wordDataFrame).show(2, False)
+---+-------------------------------------------------+---------------------------------------------------------+
|id |words |features |
+---+-------------------------------------------------+---------------------------------------------------------+
|0 |[hi, i, heard, about, spark] |(11,[0,1,6,8,9],[1.0,1.0,1.0,1.0,1.0]) |
|1 |[i, wish, java, could, use, case, spark, classes]|(11,[0,1,2,3,4,5,7,10],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])|
+---+-------------------------------------------------+---------------------------------------------------------+
Next here is how my vocabulary looks like
>>> from pprint import pprint
>>> pprint(dict([(i, x) for i,x in enumerate(fittedCV.vocabulary)]))
{0: 'i',
1: 'spark',
2: 'wish',
3: 'use',
4: 'case',
5: 'java',
6: 'hi',
7: 'could',
8: 'about',
9: 'heard',
10: 'classes'}
What I am looking for is this
[hi, i , heard, about, spark] -> [6, 0, 9, 8, 1] instead of [0,1,6,8,9]
Basically, I want to maintain the order of the tokens. I tried looking in the documentation, but it looks like all of the vectorizers lose position. For my case I need to maintain position, as this feature will go into an LSTM layer further downstream.
I recently had a use case similar to yours. I ended up using StringIndexer:
l = [
    (0, ["hi", "i", "heard", "about", "spark"]),
    (1, ["i", "wish", "java", "could", "use", "case", "spark", "classes"])
]
wordDataFrame = spark.createDataFrame(l, ['id', 'words'])
wordDataFrame.show()
+---+--------------------+
| id| words|
+---+--------------------+
| 0|[hi, i, heard, ab...|
| 1|[i, wish, java, c...|
+---+--------------------+
import pyspark.sql.functions as F
from pyspark.ml.feature import StringIndexer
# define indexer
indexer = StringIndexer(inputCol="word_strings", outputCol="word_index")
# use explode to map col<array<string>> => col<string>
# fit indexer on col<string>
indexer = indexer.fit(
    wordDataFrame
    .select(F.explode(F.col("words")).alias("word_strings"))
)
print(indexer.labels)
['i', 'spark', 'heard', 'classes', 'java', 'could', 'use', 'hi', 'case', 'about', 'wish']
indexedWordDataFrame = (
    indexer
    .transform(
        # use explode to map col<array<string>> => col<string>
        # use indexer to transform col<string> => col<double>
        wordDataFrame
        .withColumn("word_strings", F.explode(F.col("words")))
    )
    # use groupby + collect_list to map col<double> => col<array<double>>
    .groupby("id", "words")
    .agg(F.collect_list("word_index").alias("word_index_array"))
)
indexedWordDataFrame.orderBy("id").show()
+---+--------------------+--------------------+
| id| words| word_index_array|
+---+--------------------+--------------------+
| 0|[hi, i, heard, ab...|[7.0, 0.0, 2.0, 9...|
| 1|[i, wish, java, c...|[0.0, 10.0, 4.0, ...|
+---+--------------------+--------------------+
HTH
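For illustration, the order-preserving lookup the question asks for can be sketched in plain Python, using the vocabulary printed in the question. This is not a distributed solution; `word_to_index` and `encode` are names chosen for this example.

```python
# Vocabulary in index order, copied from the question's fittedCV.vocabulary output
vocabulary = ['i', 'spark', 'wish', 'use', 'case', 'java',
              'hi', 'could', 'about', 'heard', 'classes']
word_to_index = {word: index for index, word in enumerate(vocabulary)}

def encode(words):
    """Replace each token by its vocabulary index, keeping the token order."""
    return [word_to_index[word] for word in words]

print(encode(['hi', 'i', 'heard', 'about', 'spark']))  # [6, 0, 9, 8, 1]
```

The same dictionary lookup could be applied per row, for example inside a UDF, when the vocabulary is small enough to ship to the executors.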
I have a Spark DataFrame with 3 columns (id: Int, x_axis: Array[Int], y_axis: Array[Int]) with some sample data below:
I want to get basic statistics of the y_axis column for each row in the DataFrame. The output would be something like:
I have tried explode and then describe, but could not figure out how to get the expected output.
Any help or reference is much appreciated.
As you suggest, you could explode the Y column and then use a window over id to compute all the statistics you are interested in. Nevertheless, you want to re-aggregate your data afterwards, so you would generate a huge intermediate result for nothing.
Spark does not have a lot of predefined functions for arrays. Therefore the easiest way to achieve what you want is probably a UDF:
val extractFeatures = udf( (x : Seq[Int]) => {
  val mean = x.sum.toDouble/x.size
  // population variance: E[x^2] - mean^2
  val variance = x.map(i=> i*i).sum.toDouble/x.size - mean*mean
  val std = scala.math.sqrt(variance)
  Map("count" -> x.size.toDouble,
      "mean" -> mean,
      "var" -> variance,
      "std" -> std,
      "min" -> x.min.toDouble,
      "max" -> x.max.toDouble)
})
val df = sc
  .parallelize(Seq((1,Seq(1,2,3,4,5)), (2,Seq(1,2,1,4))))
  .toDF("id", "y")
  .withColumn("described_y", extractFeatures('y))
  .show(false)
+---+---------------+---------------------------------------------------------------------------------------------+
|id |y |described_y |
+---+---------------+---------------------------------------------------------------------------------------------+
|1 |[1, 2, 3, 4, 5]|Map(count -> 5.0, mean -> 3.0, min -> 1.0, std -> 1.4142135623730951, max -> 5.0, var -> 2.0)|
|2 |[1, 2, 1, 4] |Map(count -> 4.0, mean -> 2.0, min -> 1.0, std -> 1.224744871391589, max -> 4.0, var -> 1.5) |
+---+---------------+---------------------------------------------------------------------------------------------+
And btw, the stddev you calculated is actually the variance. You need to take the square root to get the standard deviation.
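For readers working in Python, the same per-array statistics can be sketched as a plain function. It follows the same formulas as the UDF above, with the population variance computed as E[x^2] - mean^2; `describe_array` is a hypothetical name.

```python
import math

def describe_array(xs):
    """Basic statistics of a non-empty list of numbers (population variance)."""
    n = len(xs)
    mean = sum(xs) / n
    variance = sum(x * x for x in xs) / n - mean * mean
    return {
        "count": float(n),
        "mean": mean,
        "var": variance,
        "std": math.sqrt(variance),  # standard deviation is the square root of the variance
        "min": float(min(xs)),
        "max": float(max(xs)),
    }

stats = describe_array([1, 2, 1, 4])
print(stats["mean"], stats["var"])  # 2.0 1.5
```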