T-table Lookup Function PySpark: Chauvenet's criterion - Outlier Detection - apache-spark

I have price data for a large number of product groups, with varying numbers of observations per product. I am attempting to detect outliers using a t-distribution, whose density approaches the normal asymptotically as the number of observations in each group gets large.
The data looks something like the following:
x = (spark
     .sparkContext
     .parallelize([
         ('A', 1.0, 30, 1.2, 0.3),
         ('A', 1.5, 30, 1.2, 0.3),
         ('B', 10.0, 50, 15.0, 5.0),
         ('B', 15.0, 50, 15.0, 5.0),
         ('B', 5.0, 50, 15.0, 5.0),
     ])
     .toDF(['name', 'price', 'count', 'mean', 'standardDev']))
x.show()
+----+-----+-----+----+-----------+
|name|price|count|mean|standardDev|
+----+-----+-----+----+-----------+
| A| 1.0| 30| 1.2| 0.3|
| A| 1.5| 30| 1.2| 0.3|
| B| 10.0| 50|15.0| 5.0|
| B| 15.0| 50|15.0| 5.0|
| B| 5.0| 50|15.0| 5.0|
+----+-----+-----+----+-----------+
My ideal would be:
x = x.withColumn('pvalue', f.MagicT(col('price'), col('mean'), col('standardDev'), col('count')))
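There is no t-distribution lookup among the built-in pyspark.sql.functions, so one option is a plain Python UDF around scipy.stats.t (a sketch: it assumes SciPy is available on the executors, and the name t_pvalue is illustrative, not a Spark API). The two-sided tail probability uses count - 1 degrees of freedom, and Chauvenet's criterion then flags a row when that probability falls below 1 / (2 * count):
import pyspark.sql.functions as f
from pyspark.sql.types import DoubleType
from scipy import stats  # assumption: SciPy is installed on the executors

@f.udf(returnType=DoubleType())
def t_pvalue(price, mean, std, n):
    # Two-sided tail probability of the standardized price under a
    # t-distribution with n - 1 degrees of freedom.
    if std is None or std == 0 or n is None or n < 2:
        return None
    t_stat = (price - mean) / std
    return float(2.0 * stats.t.sf(abs(t_stat), df=n - 1))

x = x.withColumn('pvalue', t_pvalue(f.col('price'), f.col('mean'),
                                    f.col('standardDev'), f.col('count')))

# Chauvenet's criterion: flag an observation when count * P(|T| > t) < 0.5,
# i.e. when pvalue < 1 / (2 * count).
x = x.withColumn('is_outlier', f.col('pvalue') < 1.0 / (2.0 * f.col('count')))
For large groups the t tail probability converges to the normal one, which matches the asymptotic behaviour described above.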

Related

How to randomize different numbers for subgroup of rows pyspark

I have a PySpark dataframe. I need to assign a random value taken from a list to every row that meets a given condition. I did:
df = df.withColumn('rand_col', f.when(f.col('condition_col') == condition, random.choice(my_list)))
but the effect is that it picks only one random value and assigns it to all matching rows:
How can I randomize separately for each row?
You can:
- use rand and floor from pyspark.sql.functions to create a random index column for indexing into my_list
- create a column in which the my_list values are repeated as an array on every row
- index into that column using f.col
It would look something like this:
import pyspark.sql.functions as f
my_list = [1, 2, 30]
df = spark.createDataFrame(
    [
        (1, 0),
        (2, 1),
        (3, 1),
        (4, 0),
        (5, 1),
        (6, 1),
        (7, 0),
    ],
    ["id", "condition"]
)
df = df.withColumn('rand_index', f.when(f.col('condition') == 1, f.floor(f.rand() * len(my_list))))\
       .withColumn('my_list', f.array([f.lit(x) for x in my_list]))\
       .withColumn('rand_value', f.when(f.col('condition') == 1, f.col("my_list")[f.col("rand_index")]))
df.show()
+---+---------+----------+----------+----------+
| id|condition|rand_index| my_list|rand_value|
+---+---------+----------+----------+----------+
| 1| 0| null|[1, 2, 30]| null|
| 2| 1| 0|[1, 2, 30]| 1|
| 3| 1| 2|[1, 2, 30]| 30|
| 4| 0| null|[1, 2, 30]| null|
| 5| 1| 1|[1, 2, 30]| 2|
| 6| 1| 2|[1, 2, 30]| 30|
| 7| 0| null|[1, 2, 30]| null|
+---+---------+----------+----------+----------+
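If you are on Spark 2.4 or later, a shorter variant is possible (a sketch using the built-in shuffle function, which permutes an array independently for each row, so it is non-deterministic in the same way rand is):
import pyspark.sql.functions as f

# shuffle() produces a fresh random permutation of the array per row;
# taking element 0 therefore picks a random value from my_list.
df = df.withColumn(
    'rand_value',
    f.when(f.col('condition') == 1,
           f.shuffle(f.array([f.lit(x) for x in my_list]))[0])
)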

How to fill up null values in Spark Dataframe based on other columns' value?

Given this dataframe:
+-----+-----+----+
|num_a|num_b| sum|
+-----+-----+----+
| 1| 1| 2|
| 12| 15| 27|
| 56| 11|null|
| 79| 3| 82|
| 111| 114| 225|
+-----+-----+----+
How would you fill null values in the sum column when the value can be derived from the other columns? In this example, 56 + 11 would be the value.
I've tried df.fillna with a UDF, but that doesn't seem to work, as it was just getting the column name, not the actual value. I want to compute the value only for the rows with missing values, so creating a new column would not be a viable option.
If your requirement is a UDF, it can be done as follows:
import pyspark.sql.functions as F
from pyspark.sql.types import LongType
df = spark.createDataFrame(
    [(1, 2, 3),
     (12, 15, 27),
     (56, 11, None),
     (79, 3, 82)],
    ["num_a", "num_b", "sum"]
)

@F.udf(returnType=LongType())
def fill_with_sum(num_a, num_b, sum):
    # Keep the existing value; compute num_a + num_b only when it is missing.
    return sum if sum is not None else (num_a + num_b)

df = df.withColumn("sum", fill_with_sum(F.col("num_a"), F.col("num_b"), F.col("sum")))
df.show()
[Out]:
+-----+-----+---+
|num_a|num_b|sum|
+-----+-----+---+
| 1| 2| 3|
| 12| 15| 27|
| 56| 11| 67|
| 79| 3| 82|
+-----+-----+---+
You can use the coalesce function. Check this sample code:
import pyspark.sql.functions as f
df = spark.createDataFrame(
    [(1, 2, 3),
     (12, 15, 27),
     (56, 11, None),
     (79, 3, 82)],
    ["num_a", "num_b", "sum"]
)
df.withColumn("sum", f.coalesce(f.col("sum"), f.col("num_a") + f.col("num_b"))).show()
Output is:
+-----+-----+---+
|num_a|num_b|sum|
+-----+-----+---+
| 1| 2| 3|
| 12| 15| 27|
| 56| 11| 67|
| 79| 3| 82|
+-----+-----+---+
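For completeness, the same fill can be written with when/otherwise instead of coalesce (a minimal sketch, equivalent in behavior):
import pyspark.sql.functions as f

# Only compute num_a + num_b where sum is null; otherwise keep the value.
df.withColumn(
    "sum",
    f.when(f.col("sum").isNull(), f.col("num_a") + f.col("num_b")).otherwise(f.col("sum"))
).show()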

Descriptive statistics on values outside of group for each group

I have a Spark DataFrame like this:
edited: each name can appear multiple times, in any org.
df = sqlContext.createDataFrame(
    [
        ('org_1', 'a', 1),
        ('org_1', 'a', 2),
        ('org_1', 'a', 3),
        ('org_1', 'b', 4),
        ('org_1', 'c', 5),
        ('org_2', 'a', 7),
        ('org_2', 'd', 4),
        ('org_2', 'e', 5),
        ('org_2', 'e', 10)
    ],
    ["org", "name", "value"]
)
For each org and name, I would like to calculate the mean, stddev, and count of values from all the other names within the same org (i.e., excluding that name). E.g., for org_1, name b: mean = (1 + 2 + 3 + 5) / 4 = 2.75.
The DataFrame has ~450 million rows. I cannot use a vectorized pandas_udf because my Spark version is 2.2. There is also a spark.driver.maxResultSize constraint of 4.0 GB.
I tried this in pandas (filter rows within groups and take mean/std/count) on a DataFrame with only two columns (name and value), but I haven't figured out how to do this with two levels of grouping columns (org and name).
def stats_fun(x):
    return pd.Series({'data_mean': x['value'].mean(),
                      'data_std': x['value'].std(),
                      'data_n': x['value'].count(),
                      'anti_grp_mean': df[df['name'] != x.name]['value'].mean(),
                      'anti_grp_std': df[df['name'] != x.name]['value'].std(),
                      'anti_grp_n': df[df['name'] != x.name]['value'].count()})

df.groupby('name').apply(stats_fun)
Is there a similar UDF I can define in Spark? (This function would have to take in multiple columns.) Otherwise, what is a more efficient way to do this?
A simple UDF can also work.
import pyspark.sql.functions as F
import numpy as np
from pyspark.sql.types import *
df = sqlContext.createDataFrame(
    [
        ('org_1', 'a', 1),
        ('org_1', 'a', 2),
        ('org_1', 'a', 3),
        ('org_1', 'b', 4),
        ('org_1', 'c', 5),
        ('org_2', 'a', 7),
        ('org_2', 'd', 4),
        ('org_2', 'e', 5),
        ('org_2', 'e', 10)
    ],
    ["org", "name", "value"]
)
+-----+----+-----+
| org|name|value|
+-----+----+-----+
|org_1| a| 1|
|org_1| a| 2|
|org_1| a| 3|
|org_1| b| 4|
|org_1| c| 5|
|org_2| a| 7|
|org_2| d| 4|
|org_2| e| 5|
|org_2| e| 10|
+-----+----+-----+
After applying groupby and collecting all elements into lists, we apply a UDF to find the statistics. After that, the resulting column is exploded and split into multiple columns.
def _find_stats(a, b):
    # Pair names with values once; materialize the pairs so they can be scanned repeatedly.
    pairs = list(zip(a, b))
    stats = []
    for name in a:
        to_cal = [v for k, v in pairs if k != name]
        stats.append((name,
                      float(np.mean(to_cal)),
                      float(np.std(to_cal)),
                      len(to_cal)))
    return stats

find_stats = F.udf(_find_stats, ArrayType(ArrayType(StringType())))
cols = ['name', 'mean', 'stddev', 'count']
splits = [F.udf(lambda val: val[0], StringType()),
          F.udf(lambda val: val[1], StringType()),
          F.udf(lambda val: val[2], StringType()),
          F.udf(lambda val: val[3], StringType())]

df = df.groupby('org').agg(F.collect_list('name').alias('name'),
                           F.collect_list('value').alias('value'))\
       .withColumn('statistics', find_stats(F.col('name'), F.col('value')))\
       .drop('name').drop('value')\
       .select('org', F.explode('statistics').alias('statistics'))\
       .select(['org'] + [split_('statistics').alias(col_name)
                          for split_, col_name in zip(splits, cols)])\
       .dropDuplicates()
df.show()
df.show()
+-----+----+-----------------+------------------+-----+
| org|name| mean| stddev|count|
+-----+----+-----------------+------------------+-----+
|org_1| c| 2.5| 1.118033988749895| 4|
|org_2| e| 5.5| 1.5| 2|
|org_2| a|6.333333333333333|2.6246692913372702| 3|
|org_2| d|7.333333333333333|2.0548046676563256| 3|
|org_1| a| 4.5| 0.5| 2|
|org_1| b| 2.75| 1.479019945774904| 4|
+-----+----+-----------------+------------------+-----+
If you also want the 'value' column, you can add it to the tuple in the UDF and add one more split UDF.
Also, since there will be duplicates in the dataframe due to the repetition of names, you can remove them using dropDuplicates.
Here is a way to do this using only DataFrame functions.
Just join your DataFrame to itself on the org column and use a where clause to specify that the name column should be different. Then we select the distinct rows of ('l.org', 'l.name', 'r.name', 'r.value') - essentially, we ignore the l.value column because we want to avoid double counting for the same (org, name) pair.
For example, this is how you could collect the other values for each ('org', 'name') pair:
import pyspark.sql.functions as f
df.alias('l').join(df.alias('r'), on='org')\
    .where('l.name != r.name')\
    .select('l.org', 'l.name', 'r.name', 'r.value')\
    .distinct()\
    .groupBy('l.org', 'l.name')\
    .agg(f.collect_list('r.value').alias('other_values'))\
    .show()
#+-----+----+------------+
#| org|name|other_values|
#+-----+----+------------+
#|org_1| a| [4, 5]|
#|org_1| b|[1, 2, 3, 5]|
#|org_1| c|[1, 2, 3, 4]|
#|org_2| a| [4, 5, 10]|
#|org_2| d| [7, 5, 10]|
#|org_2| e| [7, 4]|
#+-----+----+------------+
For the descriptive stats, you can use the mean, stddev, and count functions from pyspark.sql.functions:
df.alias('l').join(df.alias('r'), on='org')\
    .where('l.name != r.name')\
    .select('l.org', 'l.name', 'r.name', 'r.value')\
    .distinct()\
    .groupBy('l.org', 'l.name')\
    .agg(
        f.mean('r.value').alias('mean'),
        f.stddev('r.value').alias('stddev'),
        f.count('r.value').alias('count')
    )\
    .show()
#+-----+----+-----------------+------------------+-----+
#| org|name| mean| stddev|count|
#+-----+----+-----------------+------------------+-----+
#|org_1| a| 4.5|0.7071067811865476| 2|
#|org_1| b| 2.75| 1.707825127659933| 4|
#|org_1| c| 2.5|1.2909944487358056| 4|
#|org_2| a|6.333333333333333|3.2145502536643185| 3|
#|org_2| d|7.333333333333333|2.5166114784235836| 3|
#|org_2| e| 5.5|2.1213203435596424| 2|
#+-----+----+-----------------+------------------+-----+
Note that pyspark.sql.functions.stddev() returns the unbiased sample standard deviation. If you wanted the population standard deviation, use pyspark.sql.functions.stddev_pop():
df.alias('l').join(df.alias('r'), on='org')\
    .where('l.name != r.name')\
    .select('l.org', 'l.name', 'r.name', 'r.value')\
    .distinct()\
    .groupBy('l.org', 'l.name')\
    .agg(
        f.mean('r.value').alias('mean'),
        f.stddev_pop('r.value').alias('stddev'),
        f.count('r.value').alias('count')
    )\
    .show()
#+-----+----+-----------------+------------------+-----+
#| org|name| mean| stddev|count|
#+-----+----+-----------------+------------------+-----+
#|org_1| a| 4.5| 0.5| 2|
#|org_1| b| 2.75| 1.479019945774904| 4|
#|org_1| c| 2.5| 1.118033988749895| 4|
#|org_2| a|6.333333333333333|2.6246692913372702| 3|
#|org_2| d|7.333333333333333|2.0548046676563256| 3|
#|org_2| e| 5.5| 1.5| 2|
#+-----+----+-----------------+------------------+-----+
EDIT
As @NaomiHuang mentioned in the comments, you could also reduce l to the distinct org/name pairs before doing the join:
df.select('org', 'name')\
    .distinct()\
    .alias('l')\
    .join(df.alias('r'), on='org')\
    .where('l.name != r.name')\
    .groupBy('l.org', 'l.name')\
    .agg(f.collect_list('r.value').alias('other_values'))\
    .show()
#+-----+----+------------+
#| org|name|other_values|
#+-----+----+------------+
#|org_1| a| [5, 4]|
#|org_1| b|[5, 1, 2, 3]|
#|org_1| c|[1, 2, 3, 4]|
#|org_2| a| [4, 5, 10]|
#|org_2| d| [7, 5, 10]|
#|org_2| e| [7, 4]|
#+-----+----+------------+
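With ~450 million rows, even the reduced self-join can be costly. A join-free alternative (a sketch, not from the original answers: it derives the leave-one-name-out statistics from per-group sums, sums of squares, and counts) aggregates once per (org, name) and once per org, then subtracts:
import pyspark.sql.functions as f

# Per-(org, name) partial aggregates: sum, sum of squares, and row count.
per_name = df.groupBy('org', 'name').agg(
    f.sum('value').alias('grp_sum'),
    f.sum(f.col('value') * f.col('value')).alias('grp_sum_sq'),
    f.count('value').alias('grp_n'),
)

# Per-org totals, derived from the partial aggregates (this is tiny).
per_org = per_name.groupBy('org').agg(
    f.sum('grp_sum').alias('org_sum'),
    f.sum('grp_sum_sq').alias('org_sum_sq'),
    f.sum('grp_n').alias('org_n'),
)

# Leave-one-name-out statistics:
#   count = org_n - grp_n
#   mean  = (org_sum - grp_sum) / count
#   var   = (org_sum_sq - grp_sum_sq) / count - mean^2  (population variance)
stats = (
    per_name.join(per_org, on='org')
    .withColumn('count', f.col('org_n') - f.col('grp_n'))
    .withColumn('mean', (f.col('org_sum') - f.col('grp_sum')) / f.col('count'))
    .withColumn('stddev', f.sqrt(
        (f.col('org_sum_sq') - f.col('grp_sum_sq')) / f.col('count') - f.col('mean') ** 2))
    .select('org', 'name', 'mean', 'stddev', 'count')
)
stats.show()
The stddev computed this way is the population standard deviation, matching np.std and stddev_pop in the answers above.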

How do I get Pyspark to aggregate sets at two levels?

I need to aggregate rows in a DataFrame by collecting the values in a certain column in each group into a set. pyspark.sql.functions.collect_set does exactly what I need.
However, I need to do this for two columns in turn, because I need to group the input by one column, divide each group into subgroups by another column, and do some aggregation on each subgroup. I don't see how to get collect_set to create a set for each group.
Example:
df = spark.createDataFrame([('a', 'x', 11, 22), ('a', 'y', 33, 44), ('b', 'x', 55, 66), ('b', 'y', 77, 88),('a','x',12,23),('a','y',34,45),('b','x',56,67),('b','y',78,89)], ('col1', 'col2', 'col3', 'col4'))
df.show()
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| a| x| 11| 22|
| a| y| 33| 44|
| b| x| 55| 66|
| b| y| 77| 88|
| a| x| 12| 23|
| a| y| 34| 45|
| b| x| 56| 67|
| b| y| 78| 89|
+----+----+----+----+
g1 = df.groupBy('col1', 'col2').agg(collect_set('col3'),collect_set('col4'))
g1.show()
+----+----+-----------------+-----------------+
|col1|col2|collect_set(col3)|collect_set(col4)|
+----+----+-----------------+-----------------+
| a| x| [12, 11]| [22, 23]|
| b| y| [78, 77]| [88, 89]|
| a| y| [33, 34]| [45, 44]|
| b| x| [56, 55]| [66, 67]|
+----+----+-----------------+-----------------+
g2 = g1.groupBy('col1').agg(collect_set('collect_set(col3)'),collect_set('collect_set(col4)'),count('col2'))
g2.show(truncate=False)
+----+--------------------------------------------+--------------------------------------------+-----------+
|col1|collect_set(collect_set(col3)) |collect_set(collect_set(col4)) |count(col2)|
+----+--------------------------------------------+--------------------------------------------+-----------+
|b |[WrappedArray(56, 55), WrappedArray(78, 77)]|[WrappedArray(66, 67), WrappedArray(88, 89)]|2 |
|a |[WrappedArray(33, 34), WrappedArray(12, 11)]|[WrappedArray(22, 23), WrappedArray(45, 44)]|2 |
+----+--------------------------------------------+--------------------------------------------+-----------+
I'd like the result to look more like
+----+----------------+----------------+-----------+
|col1| ...col3... | ...col4... |count(col2)|
+----+----------------+----------------+-----------+
|b |[56, 55, 78, 77]|[66, 67, 88, 89]|2 |
|a |[33, 34, 12, 11]|[22, 23, 45, 44]|2 |
+----+----------------+----------------+-----------+
but I don't see an aggregate function to take the union of two or more sets, or a pyspark operation to flatten the "array of arrays" structure that shows up in g2.
Does pyspark provide a simple way to accomplish this? Or is there a totally different approach I should be taking?
In PySpark 2.4+, you can use the built-in flatten function.
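For example (a sketch assuming Spark 2.4+, where flatten and array_distinct are available), the second-level aggregation can collapse the nested arrays directly:
import pyspark.sql.functions as F

# Starting from g1 as defined in the question: collect the per-subgroup sets,
# flatten the resulting array of arrays, and deduplicate.
g2 = (
    g1.groupBy('col1')
      .agg(
          F.array_distinct(F.flatten(F.collect_set('collect_set(col3)'))).alias('col3_vals'),
          F.array_distinct(F.flatten(F.collect_set('collect_set(col4)'))).alias('col4_vals'),
          F.count('col2').alias('num_grps'),
      )
)
g2.show(truncate=False)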
You can flatten the columns with a UDF afterwards:
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, IntegerType

flatten = udf(lambda l: [x for i in l for x in i], ArrayType(IntegerType()))
I took the liberty of renaming the columns of g2 to col3 and col4 to save typing. This gives:
g3 = g2.withColumn('col3flat', flatten('col3'))
>>> g3.show()
+----+--------------------+--------------------+-----+----------------+
|col1| col3| col4|count| col3flat|
+----+--------------------+--------------------+-----+----------------+
| b|[[78, 77], [56, 55]]|[[66, 67], [88, 89]]| 2|[78, 77, 56, 55]|
| a|[[12, 11], [33, 34]]|[[22, 23], [45, 44]]| 2|[12, 11, 33, 34]|
+----+--------------------+--------------------+-----+----------------+
You can accomplish the same with
from pyspark.sql.functions import collect_set, countDistinct
(
    df
    .groupby('col1')
    .agg(
        collect_set('col3').alias('col3_vals'),
        collect_set('col4').alias('col4_vals'),
        countDistinct('col2').alias('num_grps')
    )
    .show(truncate=False)
)
+----+----------------+----------------+--------+
|col1|col3_vals |col4_vals |num_grps|
+----+----------------+----------------+--------+
|b |[78, 56, 55, 77]|[66, 88, 67, 89]|2 |
|a |[33, 12, 34, 11]|[45, 22, 44, 23]|2 |
+----+----------------+----------------+--------+

How to bin in PySpark?

For example, I'd like to classify a DataFrame of people into the following 4 bins according to age.
age_bins = [0, 6, 18, 60, np.Inf]
age_labels = ['infant', 'minor', 'adult', 'senior']
I would use pandas.cut() to do this in pandas. How do I do this in PySpark?
You can use the Bucketizer feature transformer from Spark's ml library.
values = [("a", 23), ("b", 45), ("c", 10), ("d", 60), ("e", 56), ("f", 2), ("g", 25), ("h", 40), ("j", 33)]
df = spark.createDataFrame(values, ["name", "ages"])
from pyspark.ml.feature import Bucketizer
bucketizer = Bucketizer(splits=[0, 6, 18, 60, float('Inf')], inputCol="ages", outputCol="buckets")
df_buck = bucketizer.setHandleInvalid("keep").transform(df)
df_buck.show()
Output:
+----+----+-------+
|name|ages|buckets|
+----+----+-------+
| a| 23| 2.0|
| b| 45| 2.0|
| c| 10| 1.0|
| d| 60| 3.0|
| e| 56| 2.0|
| f| 2| 0.0|
| g| 25| 2.0|
| h| 40| 2.0|
| j| 33| 2.0|
+----+----+-------+
If you want names for each bucket, you can use a UDF to create a new column with the bucket names:
from pyspark.sql.functions import udf
from pyspark.sql.types import *
t = {0.0:"infant", 1.0: "minor", 2.0:"adult", 3.0: "senior"}
udf_foo = udf(lambda x: t[x], StringType())
df_buck.withColumn("age_bucket", udf_foo("buckets")).show()
Output:
+----+----+-------+----------+
|name|ages|buckets|age_bucket|
+----+----+-------+----------+
| a| 23| 2.0| adult|
| b| 45| 2.0| adult|
| c| 10| 1.0| minor|
| d| 60| 3.0| senior|
| e| 56| 2.0| adult|
| f| 2| 0.0| infant|
| g| 25| 2.0| adult|
| h| 40| 2.0| adult|
| j| 33| 2.0| adult|
+----+----+-------+----------+
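Alternatively (a sketch assuming Spark 2.4+ for element_at), you can avoid the labeling UDF by turning the dict into a literal map column:
import pyspark.sql.functions as F
from itertools import chain

# Reuses the `t` dict and `df_buck` from above: create_map builds a literal
# map<double,string> column, and element_at looks up each bucket id in it.
label_map = F.create_map([F.lit(x) for x in chain(*t.items())])
df_buck.withColumn("age_bucket", F.element_at(label_map, F.col("buckets"))).show()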
You could also write a PySpark UDF:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def categorizer(age):
    if age < 6:
        return "infant"
    elif age < 18:
        return "minor"
    elif age < 60:
        return "adult"
    else:
        return "senior"

Then:
bucket_udf = udf(categorizer, StringType())
bucketed = df.withColumn("bucket", bucket_udf("age"))
In my case I had to randomly bucket a string column, which required some extra steps:
from pyspark.sql.types import LongType, IntegerType
import pyspark.sql.functions as F

buckets_number = 4  # number of buckets desired

df.withColumn("sub", F.substring(F.md5('my_col'), 0, 16)) \
  .withColumn("translate", F.translate("sub", "abcdefghijklmnopqrstuvwxyz", "01234567890123456789012345").cast(LongType())) \
  .select("my_col",
          (F.col("translate") % buckets_number).cast(IntegerType()).alias("bucket_my_col"))
1. hash the column with MD5
2. take a 16-character substring of the result (otherwise the number in the following steps would be too big)
3. translate the letters generated by MD5 into digits
4. apply the modulo function based on the number of desired buckets
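A shorter route to the same kind of deterministic pseudo-random bucketing (a sketch using the built-in hash function, available since Spark 2.0) skips the MD5/translate round trip:
import pyspark.sql.functions as F

buckets_number = 4  # hypothetical, same meaning as above

# F.hash returns a 32-bit integer hash, so abs(hash) % n yields a bucket id
# in [0, n - 1] for each value of my_col.
df.withColumn("bucket_my_col", (F.abs(F.hash("my_col")) % buckets_number)).show()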
If you know the bin width, you can use division with a cast. The result is then multiplied by the bin width to get the lower bound of the bin as a label.
from pyspark.sql.functions import col
from pyspark.sql.types import IntegerType

def categorize(df, bin_width):
    df = df.withColumn('bucket', (col('value') / bin_width).cast(IntegerType()) * bin_width)
    return df
values = [("a", 23), ("b", 45), ("e", 56), ("f", 2)]
df = spark.createDataFrame(values, ["name", "value"])
categorize(df, bin_width=10).show()
Output:
+----+-----+------+
|name|value|bucket|
+----+-----+------+
|   a|   23|    20|
|   b|   45|    40|
|   e|   56|    50|
|   f|    2|     0|
+----+-----+------+
Notice that it also works for floating point attributes:
values = [("a", .23), ("b", .45), ("e", .56), ("f", .02)]
df = spark.createDataFrame(values, ["name", "value"])
categorize(df, bin_width=.10).show()
Output:
+----+-----+------+
|name|value|bucket|
+----+-----+------+
| a| 0.23| 0.2|
| b| 0.45| 0.4|
| e| 0.56| 0.5|
| f| 0.02| 0.0|
+----+-----+------+
