I have a Spark DataFrame like this:
edited: each name can appear multiple times, in any org.
df = sqlContext.createDataFrame(
[
('org_1', 'a', 1),
('org_1', 'a', 2),
('org_1', 'a', 3),
('org_1', 'b', 4),
('org_1', 'c', 5),
('org_2', 'a', 7),
('org_2', 'd', 4),
('org_2', 'e', 5),
('org_2', 'e', 10)
],
["org", "name", "value"]
)
I would like to calculate for each org and name: the mean, stddev and count of values from the rest of the names excluding that name within each org. E.g. For org_1, name b, mean = (1+2+3+5)/4
The DataFrame has ~450 million rows. I cannot use vectorized pandas_UDF because my Spark version is 2.2. There is also a constraint of spark.driver.maxResultSize of 4.0 GB.
I tried this on Pandas (filter rows within groups and take mean/std/count) on a DataFrame with only two columns (name and value). I haven't figured out how to do this with two levels of grouped columns (org and name).
def stats_fun(x):
return pd.Series({'data_mean': x['value'].mean(),
'data_std': x['value'].std(),
'data_n': x['value'].count(),
'anti_grp_mean': df[df['name'] != x.name]['value'].mean(),
'anti_grp_std': df[df['name'] != x.name]['value'].std(),
'anti_grp_n': df[df['name'] != x.name]['value'].count()
})
df.groupby('name').apply(stats_fun)
Is there a similar UDF function I can define on Spark? (This function would have to take in multiple columns). Otherwise, what is a more efficient way to do this?
A simple UDF can also work.
import pyspark.sql.functions as F
import numpy as np
from pyspark.sql.types import *
df = sql.createDataFrame(
[
('org_1', 'a', 1),
('org_1', 'a', 2),
('org_1', 'a', 3),
('org_1', 'b', 4),
('org_1', 'c', 5),
('org_2', 'a', 7),
('org_2', 'd', 4),
('org_2', 'e', 5),
('org_2', 'e', 10)
],
["org", "name", "value"]
)
+-----+----+-----+
| org|name|value|
+-----+----+-----+
|org_1| a| 1|
|org_1| a| 2|
|org_1| a| 3|
|org_1| b| 4|
|org_1| c| 5|
|org_2| a| 7|
|org_2| d| 4|
|org_2| e| 5|
|org_2| e| 10|
+-----+----+-----+
After applying groupby and collecting all elements in list, we apply udf to find statistics. After that, columns are exploded and split into multiple columns.
def _find_stats(a,b):
dict_ = zip(a,b)
stats = []
for name in a:
to_cal = [v for k,v in dict_ if k != name]
stats.append((name,float(np.mean(to_cal))\
,float(np.std(to_cal))\
,len(to_cal)))
print stats
return stats
find_stats = F.udf(_find_stats,ArrayType(ArrayType(StringType())))
cols = ['name', 'mean', 'stddev', 'count']
splits = [F.udf(lambda val:val[0],StringType()),\
F.udf(lambda val:val[1],StringType()),\
F.udf(lambda val:val[2],StringType()),\
F.udf(lambda val:val[3],StringType())]
df = df.groupby('org').agg(*[F.collect_list('name').alias('name'), F.collect_list('value').alias('value')])\
.withColumn('statistics', find_stats(F.col('name'), F.col('value')))\
.drop('name').drop('value')\
.select('org', F.explode('statistics').alias('statistics'))\
.select(['org']+[split_('statistics').alias(col_name) for split_,col_name in zip(splits,cols)])\
.dropDuplicates()
df.show()
+-----+----+-----------------+------------------+-----+
| org|name| mean| stddev|count|
+-----+----+-----------------+------------------+-----+
|org_1| c| 2.5| 1.118033988749895| 4|
|org_2| e| 5.5| 1.5| 2|
|org_2| a|6.333333333333333|2.6246692913372702| 3|
|org_2| d|7.333333333333333|2.0548046676563256| 3|
|org_1| a| 4.5| 0.5| 2|
|org_1| b| 2.75| 1.479019945774904| 4|
+-----+----+-----------------+------------------+-----+
If you also want 'value' column, you can insert that in the tuple in udf function and add one split udf.
Also, since there will be duplicates in the dataframe due to repetition of names, you can remove them using dropDuplicates.
Here is a way to do this using only DataFrame functions.
Just join your DataFrame to itself on the org column and use a where clause to specify that the name column should be different. Then we select the distinct rows of ('l.org', 'l.name', 'r.name', 'r.value') - essentially, we ignore the l.value column because we want to avoid double counting for the same (org, name) pair.
For example, this is how you could collect the other values for each ('org', 'name') pair:
import pyspark.sql.functions as f
df.alias('l').join(df.alias('r'), on='org')\
.where('l.name != r.name')\
.select('l.org', 'l.name', 'r.name', 'r.value')\
.distinct()\
.groupBy('l.org', 'l.name')\
.agg(f.collect_list('r.value').alias('other_values'))\
.show()
#+-----+----+------------+
#| org|name|other_values|
#+-----+----+------------+
#|org_1| a| [4, 5]|
#|org_1| b|[1, 2, 3, 5]|
#|org_1| c|[1, 2, 3, 4]|
#|org_2| a| [4, 5, 10]|
#|org_2| d| [7, 5, 10]|
#|org_2| e| [7, 4]|
#+-----+----+------------+
For the descriptive stats, you can use the mean, stddev, and count functions from pyspark.sql.functions:
df.alias('l').join(df.alias('r'), on='org')\
.where('l.name != r.name')\
.select('l.org', 'l.name', 'r.name', 'r.value')\
.distinct()\
.groupBy('l.org', 'l.name')\
.agg(
f.mean('r.value').alias('mean'),
f.stddev('r.value').alias('stddev'),
f.count('r.value').alias('count')
)\
.show()
#+-----+----+-----------------+------------------+-----+
#| org|name| mean| stddev|count|
#+-----+----+-----------------+------------------+-----+
#|org_1| a| 4.5|0.7071067811865476| 2|
#|org_1| b| 2.75| 1.707825127659933| 4|
#|org_1| c| 2.5|1.2909944487358056| 4|
#|org_2| a|6.333333333333333|3.2145502536643185| 3|
#|org_2| d|7.333333333333333|2.5166114784235836| 3|
#|org_2| e| 5.5|2.1213203435596424| 2|
#+-----+----+-----------------+------------------+-----+
Note that pyspark.sql.functions.stddev() returns the unbiased sample standard deviation. If you wanted the population standard deviation, use pyspark.sql.functions.stddev_pop():
df.alias('l').join(df.alias('r'), on='org')\
.where('l.name != r.name')\
.groupBy('l.org', 'l.name')\
.agg(
f.mean('r.value').alias('mean'),
f.stddev_pop('r.value').alias('stddev'),
f.count('r.value').alias('count')
)\
.show()
#+-----+----+-----------------+------------------+-----+
#| org|name| mean| stddev|count|
#+-----+----+-----------------+------------------+-----+
#|org_1| a| 4.5| 0.5| 2|
#|org_1| b| 2.75| 1.479019945774904| 4|
#|org_1| c| 2.5| 1.118033988749895| 4|
#|org_2| a|6.333333333333333|2.6246692913372702| 3|
#|org_2| d|7.333333333333333|2.0548046676563256| 3|
#|org_2| e| 5.5| 1.5| 2|
#+-----+----+-----------------+------------------+-----+
EDIT
As #NaomiHuang mentioned in the comments, you could also reduce l to the distinct org/name pairs before doing the join:
df.select('org', 'name')\
.distinct()\
.alias('l')\
.join(df.alias('r'), on='org')\
.where('l.name != r.name')\
.groupBy('l.org', 'l.name')\
.agg(f.collect_list('r.value').alias('other_values'))\
.show()
#+-----+----+------------+
#| org|name|other_values|
#+-----+----+------------+
#|org_1| a| [5, 4]|
#|org_1| b|[5, 1, 2, 3]|
#|org_1| c|[1, 2, 3, 4]|
#|org_2| a| [4, 5, 10]|
#|org_2| d| [7, 5, 10]|
#|org_2| e| [7, 4]|
#+-----+----+------------+
Related
I have a pyspark dataframe. I need to randomize values taken from list for all rows within given condition. I did:
df = df.withColumn('rand_col', f.when(f.col('condition_col') == condition, random.choice(my_list)))
but the effect is, that it randomizes only one value and assigns it to all rows:
How can I randomize separately for each row?
You can:
use rand and floor from pyspark.sql.functions to create a random indexing column to index into your my_list
create a column in which the my_list value is repeated
index into that column using f.col
It would look something like this:
import pyspark.sql.functions as f
my_list = [1, 2, 30]
df = spark.createDataFrame(
[
(1, 0),
(2, 1),
(3, 1),
(4, 0),
(5, 1),
(6, 1),
(7, 0),
],
["id", "condition"]
)
df = df.withColumn('rand_index', f.when(f.col('condition') == 1, f.floor(f.rand() * len(my_list))))\
.withColumn('my_list', f.array([f.lit(x) for x in my_list]))\
.withColumn('rand_value', f.when(f.col('condition') == 1, f.col("my_list")[f.col("rand_index")]))
df.show()
+---+---------+----------+----------+----------+
| id|condition|rand_index| my_list|rand_value|
+---+---------+----------+----------+----------+
| 1| 0| null|[1, 2, 30]| null|
| 2| 1| 0|[1, 2, 30]| 1|
| 3| 1| 2|[1, 2, 30]| 30|
| 4| 0| null|[1, 2, 30]| null|
| 5| 1| 1|[1, 2, 30]| 2|
| 6| 1| 2|[1, 2, 30]| 30|
| 7| 0| null|[1, 2, 30]| null|
+---+---------+----------+----------+----------+
I want to group on multiple columns and then aggregate various columns by user-defined-functions (udf) that calculates mode for each of the columns. I demonstrate my problem by this sample code:
import pandas as pd
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType, IntegerType
df = pd.DataFrame(columns=['A', 'B', 'C', 'D'])
df["A"] = ["Mon", "Mon", "Mon", "Fri", "Fri", "Fri", "Fri"]
df["B"] = ["Feb", "Feb", "Feb", "May", "May", "May", "May"]
df["C"] = ["x", "y", "y", "m", "n", "r", "r"]
df["D"] = [3, 3, 5, 1, 1, 1, 9]
df_sdf = spark.createDataFrame(df)
df_sdf.show()
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
|Mon|Feb| x| 3|
|Mon|Feb| y| 3|
|Mon|Feb| y| 5|
|Fri|May| m| 1|
|Fri|May| n| 1|
|Fri|May| r| 1|
|Fri|May| r| 9|
+---+---+---+---+
# Custom mode function to get mode value for string list and integer list
def custom_mode(lst): return(max(lst, key=lst.count))
custom_mode_str = udf(custom_mode, StringType())
custom_mode_int = udf(custom_mode, IntegerType())
grp_columns = ["A", "B"]
df_sdf.groupBy(grp_columns).agg(custom_mode_str(col("C")).alias("C"), custom_mode_int(col("D")).alias("D")).distinct().show()
However, I am getting the following error on last line of above code:
AnalysisException: expression '`C`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;;
The expected output for this code is:
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
|Mon|Feb| y| 3|
|Fri|May| r| 1|
+---+---+---+---+
I searched a lot but couldn't find something similar to this problem in pyspark. Thanks for your time.
Your UDF requires a list but you're providing a spark dataframe's column. You can pass a list to the function which will generate your desired result.
sdf.groupBy(['A', 'B']). \
agg(custom_mode_str(func.collect_list('C')).alias('C'),
custom_mode_int(func.collect_list('D')).alias('D')
). \
show()
# +---+---+---+---+
# | A| B| C| D|
# +---+---+---+---+
# |Mon|Feb| y| 3|
# |Fri|May| r| 1|
# +---+---+---+---+
The collect_list() is the key here as it will generate a list which will work with your UDF. See collection outputs below.
sdf.groupBy(['A', 'B']). \
agg(func.collect_list('C').alias('C_collected'),
func.collect_list('D').alias('D_collected')
). \
show()
# +---+---+------------+------------+
# | A| B| C_collected| D_collected|
# +---+---+------------+------------+
# |Mon|Feb| [x, y, y]| [3, 3, 5]|
# |Fri|May|[m, n, r, r]|[1, 1, 1, 9]|
# +---+---+------------+------------+
Having the following DataFrame I want to apply the corr function over the following DF;
val sampleColumns = Seq("group", "id", "count1", "count2", "orderCount")
val sampleSet = Seq(
("group1", "id1", 1, 1, 6),
("group1", "id2", 2, 2, 5),
("group1", "id3", 3, 3, 4),
("group2", "id4", 4, 4, 3),
("group2", "id5", 5, 5, 2),
("group2", "id6", 6, 6, 1)
)
val initialSet = sparkSession
.createDataFrame(sampleSet)
.toDF(sampleColumns: _*)
----- .show()
+------+---+------+------+----------+
| group| id|count1|count2|orderCount|
+------+---+------+------+----------+
|group1|id1| 1| 1| 6|
|group1|id2| 2| 2| 5|
|group1|id3| 3| 3| 4|
|group2|id4| 4| 4| 3|
|group2|id5| 5| 5| 2|
|group2|id6| 6| 6| 1|
+------+---+------+------+----------+
val initialSetWindow = Window
.partitionBy("group")
.orderBy("orderCountSum")
.rowsBetween(Window.unboundedPreceding, Window.currentRow)
val groupedSet = initialSet
.groupBy(
"group"
).agg(
sum("count1").as("count1Sum"),
sum("count2").as("count2Sum"),
sum("orderCount").as("orderCountSum")
)
.withColumn("cf", corr("count1Sum", "count2Sum").over(initialSetWindow))
----- .show()
+------+---------+---------+-------------+----+
| group|count1Sum|count2Sum|orderCountSum| cf|
+------+---------+---------+-------------+----+
|group1| 6| 6| 15|null|
|group2| 15| 15| 6|null|
+------+---------+---------+-------------+----+
When trying to apply the corr function, some of the resulting values in cf are null for some reason:
The question is, how can I apply corr to each of the rows within their subgroup (Window)? Would like to obtain the corr value per Row and subgroup (group1 and group2).
Hi Stack Overflow community. I am new to spark/pyspark and I have this question.
Say I have two data frames (df2 being the interest data set with a lot of records and df1 is a new update). I want to join the two data frames on multiple columns (if possible) and get the updated information from df1 when there is a key match otherwise keep the df2 information as it is.
here is my sample data set and my expected output (df30)
df1 = spark.createDataFrame([("a", 4, 'x'), ("b", 3, 'y'), ("c", 4, 'z'), ("d", 4, 'l')], ["C1", "C2", "C3"])
df2 = spark.createDataFrame([("a", 4, 5), ("f", 3, 4), ("b", 3, 6), ("c", 4, 7), ("d", 4, 8)], ["C1", "C2","C3"])
df1_s = df1.select([col(c).alias('s_' + c) for c in df1.columns])
You can use left join on your list of columns and using a list comprehension over the remaining columns select updates using the coalesce function:
from pyspark.sql import functions as F
join_columns = ["C1"]
result = df2.alias("df2").join(
df1.alias("df1"),
join_columns,
"left"
).select(
*join_columns,
*[
F.coalesce(f"df1.{c}", f"df2.{c}").alias(c)
for c in df1.columns if c not in join_columns
]
)
result.show()
#+---+---+---+
#| C1| C2| C3|
#+---+---+---+
#| f| 3| 4|
#| d| 4| l|
#| c| 4| z|
#| b| 3| y|
#| a| 4| x|
#+---+---+---+
I have a dataframe:
+------------+------------+-------------+
| id| column1| column2|
+------------+------------+-------------+
| 1| 1| 5|
| 1| 2| 5|
| 1| 3| 5|
| 2| 1| 15|
| 2| 2| 5|
| 2| 6| 5|
+------------+------------+-------------+
How to get the maximum value of column 1? And how to get the sum of the values in column 2?
To get this result:
+------------+------------+-------------+
| id| column1| column2|
+------------+------------+-------------+
| 1| 3| 15|
| 2| 6| 25|
+------------+------------+-------------+
Use .groupBy and agg (max(column1),sum(column2)) for this case
#sample data
df=spark.createDataFrame([(1,1,5),(1,2,5),(1,3,5),(2,1,15),(2,2,5),(2,6,5)],["id","column1","column2"])
from pyspark.sql.functions import *
df.groupBy("id").\
agg(max("column1").alias("column1"),sum("column2").alias("column2")).\
show()
#+---+-------+-------+
#| id|column1|column2|
#+---+-------+-------+
#| 1| 3| 15|
#| 2| 6| 25|
#+---+-------+-------+
If you are familiar with sql then below is the sql version using group by , max and sum functions
import spark.implicits._
import org.apache.spark.sql.functions._
val input = Seq(
(1, 1, 5),
(1, 2, 5),
(1, 3, 5),
(2, 1, 15),
(2, 2, 5),
(2, 6, 5)
).toDF("id", "col1", "col2").createTempView("mytable")
spark.sql("select id,max(col1),sum(col2) from mytable group by id").show
Result :
+---+---------+---------+
| id|max(col1)|sum(col2)|
+---+---------+---------+
| 1| 3| 15|
| 2| 6| 25|
+---+---------+---------+
All you need is groupBy to group corresponding values of id and use aggregate functions sum and max with agg
The functions come from org.apache.spark.sql.functions._ package.
import spark.implicits._
import org.apache.spark.sql.functions._
val input = Seq(
(1, 1, 5),
(1, 2, 5),
(1, 3, 5),
(2, 1, 15),
(2, 2, 5),
(2, 6, 5)
).toDF("id", "col1", "col2")
val result = input
.groupBy("id")
.agg(max(col("col1")),sum(col("col2")))
.show()