Calculate a sequence of Markov chain values - apache-spark

I have a Spark question, so for the input for each entity k I have a sequence of probability p_i with a value associated v_i, for example the data can look like this
entity | Probability | value
A | 0.8 | 10
A | 0.6 | 15
A | 0.3 | 20
B | 0.8 | 10
Then, for entity A, I'm expecting the avg value to be 0.8*10 + (1-0.8)*0.6*15 + (1-0.8)*(1-0.6)*0.3*20 + (1-0.8)*(1-0.6)*(1-0.3)*MAX_VALUE_DEFINED.
How could I achieve this in Spark using DataFrame agg func? I found it's challenging given the complexity to groupBy entity and compute the sequence of results.

You can use UDF to perform such custom calculations. The idea is using collect_list to group all probab and values of A into one place so you can loop through it. However, collect_list does not respect the order of your records, therefore might lead to the wrong calculation. One way to fix it is generating ID for each row using monotonically_increasing_id
import pyspark.sql.functions as F
#F.pandas_udf('double')
def markov_udf(values):
def markov(lst):
# you can implement your markov logic here
s = 0
for i, prob, val in lst:
s += prob
return s
return values.apply(markov)
(df
.withColumn('id', F.monotonically_increasing_id())
.groupBy('entity')
.agg(F.array_sort(F.collect_list(F.array('id', 'probability', 'value'))).alias('values'))
.withColumn('markov', markov_udf('values'))
.show(10, False)
)
+------+------------------------------------------------------+------+
|entity|values |markov|
+------+------------------------------------------------------+------+
|B |[[3.0, 0.8, 10.0]] |0.8 |
|A |[[0.0, 0.8, 10.0], [1.0, 0.6, 15.0], [2.0, 0.3, 20.0]]|1.7 |
+------+------------------------------------------------------+------+

There may be a better solution, but I think this does what you needed.
from pyspark.sql import functions as F, Window as W
df = spark.createDataFrame(
[('A', 0.8, 10),
('A', 0.6, 15),
('A', 0.3, 20),
('B', 0.8, 10)],
['entity', 'Probability', 'value']
)
w_desc = W.partitionBy('entity').orderBy(F.desc('value'))
w_asc = W.partitionBy('entity').orderBy('value')
df = df.withColumn('_ent_max_val', F.max('value').over(w_desc))
df = df.withColumn('_prob2', 1 - F.col('Probability'))
df = df.withColumn('_cum_prob2', F.product('_prob2').over(w_asc) / F.col('_prob2'))
df = (df.groupBy('entity')
.agg(F.round((F.max('_ent_max_val') * F.product('_prob2')
+ F.sum(F.col('_cum_prob2') * F.col('Probability') * F.col('value'))
),2).alias('mean_value'))
)
df.show()
# +------+----------+
# |entity|mean_value|
# +------+----------+
# | A| 11.4|
# | B| 10.0|
# +------+----------+

Related

Cosine Similarity rows in a dataframe of pandas

I have a CSV file which have content as belows and I want to calculate the cosine similarity from one the remaining ID in the CSV file.
I have load it into a dataframe of pandas as follows:
old_df['Vector']=old_df.apply(lambda row:
np.array(np.matrix(row.Vector)).ravel(), axis = 1)
l=[]
for a in old_df['Vector']:
l.append(a)
A=np.array(l)
similarities = cosine_similarity(A)
The output looks fine. However, i do not know how to find which the GUID (or ID)similar to other GUID (or ID), and I only want to get the top k have highest similar score.
Could you pls help me to solve this issue.
Thank you.
|Index | GUID | Vector |
|-------|-------|---------------------------------------|
|36099 | b770 |[-0.04870541 -0.02133574 0.03180726] |
|36098 | 808f |[ 0.0732905 -0.05331331 0.06378368] |
|36097 | b111 |[ 0.01994788 0.00417582 -0.09615131] |
|36096 | b6b5 |[0.025697 -0.08277534 -0.0124591] |
|36083 | 9b07 |[ 0.025697 -0.08277534 -0.0124591] |
|36082 | b9ed |[-0.00952298 0.06188576 -0.02636449] |
|36081 | a5b6 |[0.00432161 0.02264584 -0.0341924] |
|36080 | 9891 |[ 0.08732156 0.00649456 -0.02014138] |
|36079 | ba40 |[0.05407356 -0.09085554 -0.07671648] |
|36078 | 9dff |[-0.09859556 0.04498474 -0.01839088] |
|36077 | a423 |[-0.06124249 0.06774347 -0.05234318] |
|36076 | 81c4 |[0.07278682 -0.10460124 -0.06572364] |
|36075 | 9f88 |[0.09830415 0.05489364 -0.03916228] |
|36074 | adb8 |[0.03149953 -0.00486591 0.01380711] |
|36073 | 9765 |[0.00673934 0.0513557 -0.09584251] |
|36072 | aff4 |[-0.00097896 0.0022945 0.01643319] |
Example code to get top k cosine similarities and they corresponding GUID and row ID:
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
data = {"GUID": ["b770", "808f", "b111"], "Vector": [[-0.1, -0.2, 0.3], [0.1, -0.2, -0.3], [-0.1, 0.2, -0.3]]}
df = pd.DataFrame(data)
print("Data: \n{}\n".format(df))
vectors = []
for v in df['Vector']:
vectors.append(v)
vectors_num = len(vectors)
A=np.array(vectors)
# Get similarities matrix
similarities = cosine_similarity(A)
similarities[np.tril_indices(vectors_num)] = -2
print("Similarities: \n{}\n".format(similarities))
k = 2
if k > vectors_num:
K = vectors_num
# Get top k similarities and pair GUID in ascending order
top_k_indexes = np.unravel_index(np.argsort(similarities.ravel())[-k:], similarities.shape)
top_k_similarities = similarities[top_k_indexes]
top_k_pair_GUID = []
for indexes in top_k_indexes:
pair_GUID = (df.iloc[indexes[0]]["GUID"], df.iloc[indexes[1]]["GUID"])
top_k_pair_GUID.append(pair_GUID)
print("top_k_indexes: \n{}\ntop_k_pair_GUID: \n{}\ntop_k_similarities: \n{}".format(top_k_indexes, top_k_pair_GUID, top_k_similarities))
Outputs:
Data:
GUID Vector
0 b770 [-0.1, -0.2, 0.3]
1 808f [0.1, -0.2, -0.3]
2 b111 [-0.1, 0.2, -0.3]
Similarities:
[[-2. -0.42857143 -0.85714286]
[-2. -2. 0.28571429]
[-2. -2. -2. ]]
top_k_indexes:
(array([0, 1], dtype=int64), array([1, 2], dtype=int64))
top_k_pair_GUID:
[('b770', '808f'), ('808f', 'b111')]
top_k_similarities:
[-0.42857143 0.28571429]

How to concat two ArrayType(StringType()) columns element-wise in Pyspark?

I have two ArrayType(StringType()) columns in a spark dataframe, and I want to concatenate the two columns element-wise:
input:
+-------------+-------------+
|col1 |col2 |
+-------------+-------------+
|['a','b'] |['c','d'] |
|['a','b','c']|['e','f','g']|
+-------------+-------------+
output:
+-------------+-------------+----------------+
|col1 |col2 |col3 |
+-------------+-------------+----------------+
|['a','b'] |['c','d'] |['ac', 'bd'] |
|['a','b','c']|['e','f','g']|['ae','bf','cg']|
+-------------+----------- -+----------------+
Thanks.
For Spark 2.4+, you can use zip_with function:
zip_with(left, right, func) - Merges the two given arrays,
element-wise, into a single array using function
df.withColumn("col3", expr("zip_with(col1, col2, (x, y) -> concat(x, y))")).show()
#+------+------+--------+
#| col1| col2| col3|
#+------+------+--------+
#|[a, b]|[c, d]|[ac, bd]|
#+------+------+--------+
Another way using transform function like this:
df.withColumn("col3", expr("transform(col1, (x, i) -> concat(x, col2[i]))"))
The transform function takes as parameters the first array column col1, iterates over its elements and applies a lambda function (x, i) -> concat(x, col2[i]) where x the actual element and i its index used to get the corresponding element from array col2.
Here is an alternative answer that can be used for the updated non-original question. Uses array and array_except to demonstrate the use of such methods. The accepted answer is more elegant.
from pyspark.sql.functions import *
from pyspark.sql.types import *
# Arbitrary max number of elements to apply array over, need not broadcast such a small amount of data afaik.
max_entries = 5
# Gen in this case numeric data, etc. 3 rows with 2 arrays of varying length,but per row constant length.
dfA = spark.createDataFrame([ ( list([x,x+1,4, x+100]), 4, list([x+100,x+200,999,x+500]) ) for x in range(3)], ['array1', 'value1', 'array2'] ).withColumn("s",size(col("array1")))
dfB = spark.createDataFrame([ ( list([x,x+1]), 4, list([x+100,x+200]) ) for x in range(5)], ['array1', 'value1', 'array2'] ).withColumn("s",size(col("array1")))
df = dfA.union(dfB)
# concat the array elements which are variable in size up to a max amount.
df2 = df.select(( [concat(col("array1")[i], col("array2")[i]) for i in range(max_entries)]))
df3 = df2.withColumn("res", array(df2.schema.names))
# Get results but strip out null entires from array.
df3.select(array_except(df3.res, array(lit(None)))).show(truncate=False)
Could not get the s value of column to be passed into range.
This returns:
+------------------------------+
|array_except(res, array(NULL))|
+------------------------------+
|[0100, 1200, 4999, 100500] |
|[1101, 2201, 4999, 101501] |
|[2102, 3202, 4999, 102502] |
|[0100, 1200] |
|[1101, 2201] |
|[2102, 3202] |
|[3103, 4203] |
|[4104, 5204] |
+------------------------------+
It wouldn't really scale, but you could get the 0th and 1st entries in each array and then say col3 is a[0] + b[0] and then a[1] + b[1].
Make all 4 entries separate values and then output them combined.
Here is a generic answer. Just look at res for the result. 2 equally sized arrays, thus n elements for both.
from pyspark.sql.functions import *
from pyspark.sql.types import *
# Gen in this case numeric data, etc. 3 rows with 2 arrays of varying length, but both the same length as in your example
df = spark.createDataFrame([ ( list([x,x+1,4, x+100]), 4, list([x+100,x+200,999,x+500]) ) for x in range(3)], ['array1', 'value1', 'array2'] )
num_array_elements = len(df.select("array1").first()[0])
# concat
df2 = df.select(([ concat(col("array1")[i], col("array2")[i]) for i in range(num_array_elements)]))
df2.withColumn("res", array(df2.schema.names)).show(truncate=False)
returns:

Counting words after grouping records (Part 2)

Although I am having an answer for what I want to achieve, the problem is that it's way to slow. The data set is not very large. It's 50GB in total but the affected part is probably just between 5 to 10GB of data. However, the following is what I require, but it's way to slow And by slow I mean it was running for an hour and it didn't terminate.
df_ = spark.createDataFrame([
('1', 'hello how are are you today'),
('1', 'hello how are you'),
('2', 'hello are you here'),
('2', 'how is it'),
('3', 'hello how are you'),
('3', 'hello how are you'),
('4', 'hello how is it you today')
], schema=['label', 'text'])
tokenizer = Tokenizer(inputCol='text', outputCol='tokens')
tokens = tokenizer.transform(df_)
token_counts.groupby('label')\
.agg(F.collect_list(F.struct(F.col('token'), F.col('count'))).alias('text'))\
.show(truncate=False)
Which gives me the token count for each label:
+-----+----------------------------------------------------------------+
|label|text |
+-----+----------------------------------------------------------------+
|3 |[[are,2], [how,2], [hello,2], [you,2]] |
|1 |[[today,1], [how,2], [are,3], [you,2], [hello,2]] |
|4 |[[hello,1], [how,1], [is,1], [today,1], [you,1], [it,1]] |
|2 |[[hello,1], [are,1], [you,1], [here,1], [is,1], [how,1], [it,1]]|
+-----+----------------------------------------------------------------+
However, I think the call to explode() is way too expensive for this.
I don't know but it might be faster to count the tokens in each "dokument" and later merge it in a groupBy():
df_.select(['label'] + [udf_get_tokens(F.col('text')).alias('text')])\
.rdd.map(lambda x: (x[0], list(Counter(x[1]).items()))) \
.toDF(schema=['label', 'text'])\
.show()
Gives the counts:
+-----+--------------------+
|label| text|
+-----+--------------------+
| 1|[[are,2], [hello,...|
| 1|[[are,1], [hello,...|
| 2|[[are,1], [hello,...|
| 2|[[how,1], [it,1],...|
| 3|[[are,1], [hello,...|
| 3|[[are,1], [hello,...|
| 4|[[you,1], [today,...|
+-----+--------------------+
Is there a way to merge those token counts in a more efficient way?
If groups defined by id are largish the obvious target for improvement is shuffle size. Instead of shuffling text, shuffle labels. First vectorize input
from pyspark.ml.feature import CountVectorizer
from pyspark.ml import Pipeline
pipeline_model = Pipeline(stages=[
Tokenizer(inputCol='text', outputCol='tokens'),
CountVectorizer(inputCol='tokens', outputCol='vectors')
]).fit(df_)
df_vec = pipeline_model.transform(df_).select("label", "vectors")
Then aggregate:
from pyspark.ml.linalg import SparseVector, DenseVector
from collections import defaultdict
def seq_func(acc, v):
if isinstance(v, SparseVector):
for i in v.indices:
acc[int(i)] += v[int(i)]
if isinstance(v, DenseVector):
for i in len(v):
acc[int(i)] += v[int(i)]
return acc
def comb_func(acc1, acc2):
for k, v in acc2.items():
acc1[k] += v
return acc1
aggregated = rdd.aggregateByKey(defaultdict(int), seq_func, comb_func)
And map back to the required output:
vocabulary = pipeline_model.stages[-1].vocabulary
def f(x, vocabulary=vocabulary):
# For list of tuples use [(vocabulary[i], float(v)) for i, v in x.items()]
return {vocabulary[i]: float(v) for i, v in x.items()}
aggregated.mapValues(f).toDF(["id", "text"]).show(truncate=False)
# +---+-------------------------------------------------------------------------------------+
# |id |text |
# +---+-------------------------------------------------------------------------------------+
# |4 |[how -> 1.0, today -> 1.0, is -> 1.0, it -> 1.0, hello -> 1.0, you -> 1.0] |
# |3 |[how -> 2.0, hello -> 2.0, are -> 2.0, you -> 2.0] |
# |1 |[how -> 2.0, hello -> 2.0, are -> 3.0, you -> 2.0, today -> 1.0] |
# |2 |[here -> 1.0, how -> 1.0, are -> 1.0, is -> 1.0, it -> 1.0, hello -> 1.0, you -> 1.0]|
# +---+-------------------------------------------------------------------------------------+
This worth trying only if text part is considerably large - otherwise all required transformations between DataFrame and Python objects might be more expensive than collecting_list.

How to find count of Null and Nan values for each column in a PySpark dataframe efficiently?

import numpy as np
data = [
(1, 1, None),
(1, 2, float(5)),
(1, 3, np.nan),
(1, 4, None),
(1, 5, float(10)),
(1, 6, float("nan")),
(1, 6, float("nan")),
]
df = spark.createDataFrame(data, ("session", "timestamp1", "id2"))
Expected output
dataframe with count of nan/null for each column
Note:
The previous questions I found in stack overflow only checks for null & not nan.
That's why I have created a new question.
I know I can use isnull() function in Spark to find number of Null values in Spark column but how to find Nan values in Spark dataframe?
You can use method shown here and replace isNull with isnan:
from pyspark.sql.functions import isnan, when, count, col
df.select([count(when(isnan(c), c)).alias(c) for c in df.columns]).show()
+-------+----------+---+
|session|timestamp1|id2|
+-------+----------+---+
| 0| 0| 3|
+-------+----------+---+
or
df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns]).show()
+-------+----------+---+
|session|timestamp1|id2|
+-------+----------+---+
| 0| 0| 5|
+-------+----------+---+
For null values in the dataframe of pyspark
Dict_Null = {col:df.filter(df[col].isNull()).count() for col in df.columns}
Dict_Null
# The output in dict where key is column name and value is null values in that column
{'#': 0,
'Name': 0,
'Type 1': 0,
'Type 2': 386,
'Total': 0,
'HP': 0,
'Attack': 0,
'Defense': 0,
'Sp_Atk': 0,
'Sp_Def': 0,
'Speed': 0,
'Generation': 0,
'Legendary': 0}
To make sure it does not fail for string, date and timestamp columns:
import pyspark.sql.functions as F
def count_missings(spark_df,sort=True):
"""
Counts number of nulls and nans in each column
"""
df = spark_df.select([F.count(F.when(F.isnan(c) | F.isnull(c), c)).alias(c) for (c,c_type) in spark_df.dtypes if c_type not in ('timestamp', 'string', 'date')]).toPandas()
if len(df) == 0:
print("There are no any missing values!")
return None
if sort:
return df.rename(index={0: 'count'}).T.sort_values("count",ascending=False)
return df
If you want to see the columns sorted based on the number of nans and nulls in descending:
count_missings(spark_df)
# | Col_A | 10 |
# | Col_C | 2 |
# | Col_B | 1 |
If you don't want ordering and see them as a single row:
count_missings(spark_df, False)
# | Col_A | Col_B | Col_C |
# | 10 | 1 | 2 |
An alternative to the already provided ways is to simply filter on the column like so
import pyspark.sql.functions as F
df = df.where(F.col('columnNameHere').isNull())
This has the added benefit that you don't have to add another column to do the filtering and it's quick on larger data sets.
Here is my one liner.
Here 'c' is the name of the column
from pyspark.sql.functions import isnan, when, count, col, isNull
df.select('c').withColumn('isNull_c',F.col('c').isNull()).where('isNull_c = True').count()
I prefer this solution:
df = spark.table(selected_table).filter(condition)
counter = df.count()
df = df.select([(counter - count(c)).alias(c) for c in df.columns])
Use the following code to identify the null values in every columns using pyspark.
def check_nulls(dataframe):
'''
Check null values and return the null values in pandas Dataframe
INPUT: Spark Dataframe
OUTPUT: Null values
'''
# Create pandas dataframe
nulls_check = pd.DataFrame(dataframe.select([count(when(isnull(c), c)).alias(c) for c in dataframe.columns]).collect(),
columns = dataframe.columns).transpose()
nulls_check.columns = ['Null Values']
return nulls_check
#Check null values
null_df = check_nulls(raw_df)
null_df
from pyspark.sql import DataFrame
import pyspark.sql.functions as fn
# compatiable with fn.isnan. Sourced from
# https://github.com/apache/spark/blob/13fd272cd3/python/pyspark/sql/functions.py#L4818-L4836
NUMERIC_DTYPES = (
'decimal',
'double',
'float',
'int',
'bigint',
'smallilnt',
'tinyint',
)
def count_nulls(df: DataFrame) -> DataFrame:
isnan_compat_cols = {c for (c, t) in df.dtypes if any(t.startswith(num_dtype) for num_dtype in NUMERIC_DTYPES)}
return df.select(
[fn.count(fn.when(fn.isnan(c) | fn.isnull(c), c)).alias(c) for c in isnan_compat_cols]
+ [fn.count(fn.when(fn.isnull(c), c)).alias(c) for c in set(df.columns) - isnan_compat_cols]
)
Builds off of gench and user8183279's answers, but checks via only isnull for columns where isnan is not possible, rather than just ignoring them.
The source code of pyspark.sql.functions seemed to have the only documentation I could really find enumerating these names — if others know of some public docs I'd be delighted.
if you are writing spark sql, then the following will also work to find null value and count subsequently.
spark.sql('select * from table where isNULL(column_value)')
Yet another alternative (improved upon Vamsi Krishna's solutions above):
def check_for_null_or_nan(df):
null_or_nan = lambda x: isnan(x) | isnull(x)
func = lambda x: df.filter(null_or_nan(x)).count()
print(*[f'{i} has {func(i)} nans/nulls' for i in df.columns if func(i)!=0],sep='\n')
check_for_null_or_nan(df)
id2 has 5 nans/nulls
Here is a readable solution because code is for people as much as computers ;-)
df.selectExpr('sum(int(isnull(<col_name>) or isnan(<col_name>))) as null_or_nan_count'))

pyspark corr for each group in DF (more than 5K columns)

I have a DF with 100 million rows and 5000+ columns. I am trying to find the corr between colx and remaining 5000+ columns.
aggList1 = [mean(col).alias(col + '_m') for col in df.columns] #exclude keys
df21= df.groupBy('key1', 'key2', 'key3', 'key4').agg(*aggList1)
df = df.join(broadcast(df21),['key1', 'key2', 'key3', 'key4']))
df= df.select([func.round((func.col(colmd) - func.col(colmd + '_m')), 8).alias(colmd)\
for colmd in all5Kcolumns])
aggCols= [corr(colx, col).alias(col) for col in colsall5K]
df2 = df.groupBy('key1', 'key2', 'key3').agg(*aggCols)
Right now it is not working because of spark 64KB codegen issue (even spark 2.2). So i am looping for each 300 columns and merging all at the end. But it is taking more than 30 hours in a cluster with 40 nodes (10 core each and each node with 100GB). Any help to tune this?
Below things already tried
- Re partition DF to 10,000
- Checkpoint in each loop
- cache in each loop
You can try with a bit of NumPy and RDDs. First a bunch of imports:
from operator import itemgetter
import numpy as np
from pyspark.statcounter import StatCounter
Let's define a few variables:
keys = ["key1", "key2", "key3"] # list of key column names
xs = ["x1", "x2", "x3"] # list of column names to compare
y = "y" # name of the reference column
And some helpers:
def as_pair(keys, y, xs):
""" Given key names, y name, and xs names
return a tuple of key, array-of-values"""
key = itemgetter(*keys)
value = itemgetter(y, * xs) # Python 3 syntax
def as_pair_(row):
return key(row), np.array(value(row))
return as_pair_
def init(x):
""" Init function for combineByKey
Initialize new StatCounter and merge first value"""
return StatCounter().merge(x)
def center(means):
"""Center a row value given a
dictionary of mean arrays
"""
def center_(row):
key, value = row
return key, value - means[key]
return center_
def prod(arr):
return arr[0] * arr[1:]
def corr(stddev_prods):
"""Scale the row to get 1 stddev
given a dictionary of stddevs
"""
def corr_(row):
key, value = row
return key, value / stddev_prods[key]
return corr_
and convert DataFrame to RDD of pairs:
pairs = df.rdd.map(as_pair(keys, y, xs))
Next let's compute statistics per group:
stats = (pairs
.combineByKey(init, StatCounter.merge, StatCounter.mergeStats)
.collectAsMap())
means = {k: v.mean() for k, v in stats.items()}
Note: With 5000 features and 7000 group there should no issue with keeping this structure in memory. With larger datasets you may have to use RDD and join but this will be slower.
Center the data:
centered = pairs.map(center(means))
Compute covariance:
covariance = (centered
.mapValues(prod)
.combineByKey(init, StatCounter.merge, StatCounter.mergeStats)
.mapValues(StatCounter.mean))
And finally correlation:
stddev_prods = {k: prod(v.stdev()) for k, v in stats.items()}
correlations = covariance.map(corr(stddev_prods))
Example data:
df = sc.parallelize([
("a", "b", "c", 0.5, 0.5, 0.3, 1.0),
("a", "b", "c", 0.8, 0.8, 0.9, -2.0),
("a", "b", "c", 1.5, 1.5, 2.9, 3.6),
("d", "e", "f", -3.0, 4.0, 5.0, -10.0),
("d", "e", "f", 15.0, -1.0, -5.0, 10.0),
]).toDF(["key1", "key2", "key3", "y", "x1", "x2", "x3"])
Results with DataFrame:
df.groupBy(*keys).agg(*[corr(y, x) for x in xs]).show()
+----+----+----+-----------+------------------+------------------+
|key1|key2|key3|corr(y, x1)| corr(y, x2)| corr(y, x3)|
+----+----+----+-----------+------------------+------------------+
| d| e| f| -1.0| -1.0| 1.0|
| a| b| c| 1.0|0.9972300220940342|0.6513360726920862|
+----+----+----+-----------+------------------+------------------+
and the method provided above:
correlations.collect()
[(('a', 'b', 'c'), array([ 1. , 0.99723002, 0.65133607])),
(('d', 'e', 'f'), array([-1., -1., 1.]))]
This solution, while a bit involved, is quite elastic and can be easily adjusted to handle different data distributions. It should be also possible to given further boost with JIT.

Resources