PySpark: Different Aggregate Alias for each group - apache-spark

Is it possible in Spark to do a groupBy and aggregate where the alias for the aggregate function is different for each group? For example, if I were doing a groupBy and AVG, I would want each group to have a different column name, such as "group_1_avg" for group 1 and "group_2_avg" for group 2, with the idea that the final result will be a list of columns group_1_avg, group_2_avg, etc.
I realize I could probably skip this, aggregate everything under one name, and pivot it, but I am trying to avoid pivot because of how expensive it is for my data.
Things I've tried:
frame = frame.groupBy("Item", "Group", "Level").agg(F.avg("val").alias("AVG"))
frame = frame.withColumn("Columns", F.concat(F.col("Group"), F.lit("_"), F.col("Level"), F.lit("_"), F.lit("AVG")))
frame = frame.groupBy("Item").pivot("Columns").agg(F.first("AVG"))
This works and does what I need, but the pivot becomes too expensive at the scale of my data, so I am looking for an alternative solution.
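One variation I have not benchmarked yet is handing pivot the explicit list of output columns, which as far as I understand lets Spark skip the extra job that computes the distinct pivot values (the value list below is only illustrative):
# Assumption: the distinct Group/Level combinations are known up front.
pivot_values = ["A_S1_AVG", "A_S2_AVG"]  # illustrative; in practice this would be ~20K values
frame = frame.groupBy("Item").pivot("Columns", pivot_values).agg(F.first("AVG"))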
Thank you for your time.
Input Format
Item Group Level val
W1   A     S1    40
W1   A     S1    40
W1   A     S2    25
W2   A     S1    50
W2   A     S1    50
Expected Output:
Item A_S1_AVG A_S2_AVG
W1   40       25.0
W2   50       null

For a large dataset, can you envision your data in this format instead? (Rather than constructing 20K columns(!), you could have 20K rows.)
+----+-----------+----+
|Item|Group_Level|Mean|
+----+-----------+----+
| W1| A_S1|40.0|
| W1| A_S2|25.0|
| W2| A_S1|50.0|
+----+-----------+----+
If so,
from pyspark.sql import functions as F
from pyspark.sql import types as T

df = spark.createDataFrame([('W1', 'A', 'S1', 40),
                            ('W1', 'A', 'S1', 40),
                            ('W1', 'A', 'S2', 25),
                            ('W2', 'A', 'S1', 50),
                            ('W2', 'A', 'S1', 50)], ["Item", "Group", "Level", "Val"])

@F.udf(T.MapType(T.StringType(), T.FloatType()))
def create_group_scores(data):
    data_map = {}
    mean_map = {}
    for datum in data:
        key = f"{datum.Group}_{datum.Level}"
        if key in data_map:
            data_map[key].append(datum.Val)
        else:
            data_map[key] = [datum.Val]
    for key in data_map:
        mean_map[key] = sum(data_map[key]) / len(data_map[key])
    return mean_map

item_groups = df.groupBy("Item").agg(F.collect_list(F.struct("Group", "Level", "Val")).alias("group_level_val")) \
                .withColumn("group_scores", create_group_scores("group_level_val"))
item_groups = item_groups.select("Item", F.explode_outer("group_scores").alias("Group_Level", "Mean"))
item_groups.show()
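A pure DataFrame alternative (a sketch, reusing the same df as above) reaches the same long format without a Python UDF, which usually scales better:
# Build the composite key with concat_ws and let the built-in avg do the aggregation.
item_groups_df = (df.groupBy("Item", F.concat_ws("_", "Group", "Level").alias("Group_Level"))
                    .agg(F.avg("Val").alias("Mean")))
item_groups_df.show()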

Related

Building derived column using Spark transformations

I have a table of records as stated below.
Id Indicator Date
1 R 2018-01-20
1 R 2018-10-21
1 P 2019-01-22
2 R 2018-02-28
2 P 2018-05-22
2 P 2019-03-05
I need to pick the Ids that have more than one R indicator in the last one year and derive a new column called Marked_Flag as Y, otherwise N. So the expected output should look like below:
Id Marked_Flag
1 Y
2 N
What I have done so far: I took the records into a Dataset and then built another Dataset from that. The code looks like below.
Dataset<Row> getIndicators = spark.sql("select id, count(indicator) as indi_count from source where indicator = 'R' group by id");
Dataset<Row> getFlag = spark.sql("select id, case when indi_count > 1 then 'Y' else 'N' end as Marked_Flag from getIndicators");
But my lead wants this to be done using a single dataset and Spark transformations. I am pretty new to Spark; any guidance or code snippet in this regard would be very helpful.
Try out the following. Note that I am using a PySpark DataFrame here.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([
[1, "R", "2018-01-20"],
[1, "R", "2018-10-21"],
[1, "P", "2019-01-22"],
[2, "R", "2018-02-28"],
[2, "P", "2018-05-22"],
[2, "P", "2019-03-05"]], ["Id", "Indicator","Date"])
gr = df.filter(F.col("Indicator")=="R").groupBy("Id").agg(F.count("Indicator"))
gr = gr.withColumn("Marked_Flag", F.when(F.col("count(Indicator)") > 1, "Y").otherwise('N')).drop("count(Indicator)")
gr.show()
# +---+-----------+
# | Id|Marked_Flag|
# +---+-----------+
# | 1| Y|
# | 2| N|
# +---+-----------+
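If keeping it to a single Dataset matters (as the lead asked), the same logic can also be expressed in one SQL statement. This is a sketch that assumes the input is registered as a temp view named source; the one-year date filter from the question would be an additional WHERE clause:
# Count only the 'R' indicators per id and derive the flag in the same query.
df.createOrReplaceTempView("source")
flagged = spark.sql("""
    SELECT id,
           CASE WHEN SUM(CASE WHEN indicator = 'R' THEN 1 ELSE 0 END) > 1
                THEN 'Y' ELSE 'N' END AS Marked_Flag
    FROM source
    GROUP BY id
""")
flagged.show()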

Spark - How to join current and previous records in a DataFrame and assign an original field to all such occurrences

I need to scan through a Hive table and add values from the first record in a sequence to all linked records.
The logic would be:-
Find the first record (where previous_id is blank).
Find the next record (current_id = previous_id).
Repeat until there are no more linked records.
Add columns from original record to all linked records.
Output results to a Hive table.
Example Source Data:-
current_id previous_id start_date
---------- ----------- ----------
100                    01/01/2001
200        100         02/02/2002
300        200         03/03/2003
Example Output Data:-
current_id start_date
---------- ----------
100 01/01/2001
200 01/01/2001
300 01/01/2001
I can achieve this by creating two DataFrames from the source table and performing multiple joins. However, this approach does not seem ideal as data has to be cached to avoid re-querying the source data with each iteration.
Any suggestions on how to approach this problem?
I think you can accomplish this using the GraphFrames connected components algorithm.
It will help you avoid writing the checkpointing and looping logic yourself. Essentially, you create a graph from the current_id and previous_id pairs and use GraphFrames to compute the connected component for each vertex. The resulting DataFrame can then be joined back to the original DataFrame to get the start_date.
from graphframes import *
sc.setCheckpointDir("/tmp/chk")
input = spark.createDataFrame([
(100, None, "2001-01-01"),
(200, 100, "2002-02-02"),
(300, 200, "2003-03-03"),
(400, None, "2004-04-04"),
(500, 400, "2005-05-05"),
(600, 500, "2006-06-06"),
(700, 300, "2007-07-07")
], ["current_id", "previous_id", "start_date"])
input.show()
vertices = input.select(input.current_id.alias("id"))
edges = input.select(input.current_id.alias("src"), input.previous_id.alias("dst"))
graph = GraphFrame(vertices, edges)
result = graph.connectedComponents()
result.join(input.filter(input.previous_id.isNull()), result.component == input.current_id)\
    .select(result.id.alias("current_id"), input.start_date)\
    .orderBy("current_id")\
    .show()
Results in the following output:
+----------+----------+
|current_id|start_date|
+----------+----------+
| 100|2001-01-01|
| 200|2001-01-01|
| 300|2001-01-01|
| 400|2004-04-04|
| 500|2004-04-04|
| 600|2004-04-04|
| 700|2001-01-01|
+----------+----------+
Here is an approach that I am not sure sits well with Spark.
There is no grouping id/key in the data.
I am not sure how Catalyst would be able to optimize this; I will look at it at a later point in time. Memory errors may occur if the data is too large.
I have made the data more complicated, and it does work. Here goes:
# No grouping key evident, more a linked list with asc current_ids.
# Added more complexity to the example.
# Questions open on performance at scale. Interested to see how well Catalyst handles this.
# Need really some grouping id/key in the data.
from pyspark.sql import functions as f
from functools import reduce
from pyspark.sql import DataFrame
from pyspark.sql.functions import col
# Started from dataframe.
# Some more realistic data? At least more complex.
columns = ['current_id', 'previous_id', 'start_date']
vals = [
(100, None, '2001/01/01'),
(200, 100, '2002/02/02'),
(300, 200, '2003/03/03'),
(400, None, '2005/01/01'),
(500, 400, '2006/02/02'),
(600, 300, '2007/02/02'),
(700, 600, '2008/02/02'),
(800, None, '2009/02/02'),
(900, 800, '2010/02/02')
]
df = spark.createDataFrame(vals, columns)
df.createOrReplaceTempView("trans")
# Starting data. The null / None entries.
df2 = spark.sql("""
select *
from trans
where previous_id is null
""")
df2.cache()
df2.createOrReplaceTempView("trans_0")
# Loop through the stuff based on traversing the list elements until exhaustion of data, and, write to dynamically named TempViews.
# May need to checkpoint? Depends on depth of chain of linked items.
# Spark not well suited to this type of processing.
dfX_cnt = 1
cnt = 1
while (dfX_cnt != 0):
    tabname_prev = 'trans_' + str(cnt - 1)
    tabname = 'trans_' + str(cnt)
    query = "select t2.current_id, t2.previous_id, t1.start_date from {} t1, trans t2 where t1.current_id = t2.previous_id".format(tabname_prev)
    dfX = spark.sql(query)
    dfX.cache()
    dfX_cnt = dfX.count()
    if (dfX_cnt != 0):
        #print('Looping for dynamic creation of TempViews')
        dfX.createOrReplaceTempView(tabname)
        cnt = cnt + 1
# Reduce the TempViews all to one DF. Can reduce an array of DF's as well, but could not find my notes here in this regard.
# Will memory errors occur?
from pyspark.sql.types import *
fields = [StructField('current_id', LongType(), False),
StructField('previous_id', LongType(), True),
StructField('start_date', StringType(), False)]
schema = StructType(fields)
dfZ = spark.createDataFrame(sc.emptyRDD(), schema)
for i in range(0, cnt, 1):
    tabname = 'trans_' + str(i)
    query = "select * from {}".format(tabname)
    df = spark.sql(query)
    dfZ = dfZ.union(df)
# Show final results.
dfZ.select('current_id', 'start_date').sort(col('current_id')).show()
returns:
+----------+----------+
|current_id|start_date|
+----------+----------+
| 100|2001/01/01|
| 200|2001/01/01|
| 300|2001/01/01|
| 400|2005/01/01|
| 500|2005/01/01|
| 600|2001/01/01|
| 700|2001/01/01|
| 800|2009/02/02|
| 900|2009/02/02|
+----------+----------+
Thanks for the suggestions posted here. After trying various approaches I have gone with the following solution which works for multiple iterations (e.g. 20 loops), and does not cause any memory issues.
The "Physical Plan" is still huge, but caching means most of the steps are skipped, keeping performance acceptable.
input = spark.createDataFrame([
(100, None, '2001/01/01'),
(200, 100, '2002/02/02'),
(300, 200, '2003/03/03'),
(400, None, '2005/01/01'),
(500, 400, '2006/02/02'),
(600, 300, '2007/02/02'),
(700, 600, '2008/02/02'),
(800, None, '2009/02/02'),
(900, 800, '2010/02/02')
], ["current_id", "previous_id", "start_date"])
input.createOrReplaceTempView("input")
cur = spark.sql("select * from input where previous_id is null")
nxt = spark.sql("select * from input where previous_id is not null")
cur.cache()
nxt.cache()
cur.createOrReplaceTempView("cur0")
nxt.createOrReplaceTempView("nxt")
i = 1
while True:
    spark.sql("set table_name=cur" + str(i - 1))
    cur = spark.sql(
        """
        SELECT nxt.current_id as current_id,
               nxt.previous_id as previous_id,
               cur.start_date as start_date
        FROM ${table_name} cur,
             nxt nxt
        WHERE cur.current_id = nxt.previous_id
        """).cache()
    cur.createOrReplaceTempView("cur" + str(i))
    i = i + 1
    if cur.count() == 0:
        break
for x in range(0, i):
    spark.sql("set table_name=cur" + str(x))
    cur = spark.sql("select * from ${table_name}")
    if x == 0:
        out = cur
    else:
        out = out.union(cur)
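A further tweak worth trying if the plan keeps growing (my suggestion, not part of the original post) is to truncate the lineage with an eager checkpoint every few iterations instead of relying on cache() alone:
# Hypothetical helper: checkpoint() materializes the DataFrame and cuts its
# lineage, so the physical plan stops accumulating. Needs a checkpoint dir.
spark.sparkContext.setCheckpointDir("/tmp/chk")

def cache_or_checkpoint(df, iteration, every=5):
    return df.checkpoint() if iteration % every == 0 else df.cache()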

How to aggregate string to dictionary like results in pyspark?

I have a dataframe and I want to aggregate it to a daily level.
data = [
(125, '2012-10-10','good'),
(20, '2012-10-10','good'),
(40, '2012-10-10','bad'),
(60, '2012-10-10','NA')]
df = spark.createDataFrame(data, ["temperature", "date","performance"])
I can aggregate numerical values using Spark built-in functions like max, min, and avg. How can I aggregate strings?
I expect something like:
date       max_temp min_temp performance_frequency
2012-10-10 125      20       "good": 2, "bad": 1, "NA": 1
We can use MapType and a UDF with Counter to return the value counts:
from pyspark.sql import functions as F
from pyspark.sql.types import MapType,StringType,IntegerType
from collections import Counter
data = [(125, '2012-10-10','good'),(20, '2012-10-10','good'),(40, '2012-10-10','bad'),(60, '2012-10-10','NA')]
df = spark.createDataFrame(data, ["temperature", "date","performance"])
udf1 = F.udf(lambda x: dict(Counter(x)),MapType(StringType(),IntegerType()))
df.groupby('date').agg(F.min('temperature'),F.max('temperature'),udf1(F.collect_list('performance')).alias('performance_frequency')).show(1,False)
+----------+----------------+----------------+---------------------------------+
|date |min(temperature)|max(temperature)|performance_frequency |
+----------+----------------+----------------+---------------------------------+
|2012-10-10|20 |125 |Map(NA -> 1, bad -> 1, good -> 2)|
+----------+----------------+----------------+---------------------------------+
df.groupby('date').agg(F.min('temperature'),F.max('temperature'),udf1(F.collect_list('performance')).alias('performance_frequency')).collect()
[Row(date='2012-10-10', min(temperature)=20, max(temperature)=125, performance_frequency={'bad': 1, 'good': 2, 'NA': 1})]
Hope this helps!
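On Spark 2.4+ the same result can also be obtained without a Python UDF, which tends to be faster. Here is a sketch using map_from_entries on a per-value count (column names taken from the question):
from pyspark.sql import functions as F

# Count each performance value per date, then fold the (value, count) pairs into a map.
counts = df.groupBy('date', 'performance').agg(F.count('*').alias('cnt'))
freq = counts.groupBy('date').agg(
    F.map_from_entries(F.collect_list(F.struct('performance', 'cnt'))).alias('performance_frequency'))
temps = df.groupBy('date').agg(F.max('temperature').alias('max_temp'),
                               F.min('temperature').alias('min_temp'))
temps.join(freq, 'date').show(truncate=False)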

pyspark convert transactions into a list of list

I want to use PrefixSpan sequence mining in pyspark. The format of data that I need to have is the following:
[[['a', 'b'], ['c']], [['a'], ['c', 'b'], ['a', 'b']], [['a', 'b'], ['e']], [['f']]]
where the innermost elements are productIds, then there are orders (each containing a list of products), and then there are clients (each containing a list of orders).
My data has transactional format:
clientId orderId product
where orderId has multiple rows for separate products and clientId has multiple rows for separate orders.
Sample data:
test = sc.parallelize([[u'1', u'100', u'a'],
[u'1', u'100', u'a'],
[u'1', u'101', u'b'],
[u'2', u'102', u'c'],
[u'3', u'103', u'b'],
[u'3', u'103', u'c'],
[u'4', u'104', u'a'],
[u'4', u'105', u'b']]
)
My solution so far:
1. Group products in orders:
order_prod = test.map(lambda x: [x[1],([x[2]])])
order_prod = order_prod.reduceByKey(lambda a,b: a + b)
order_prod.collect()
which results in:
[(u'102', [u'c']),
(u'103', [u'b', u'c']),
(u'100', [u'a', u'a']),
(u'104', [u'a']),
(u'101', [u'b']),
(u'105', [u'b'])]
2. Group orders in customers:
client_order = test.map(lambda x: [x[0],[(x[1])]])
df_co = sqlContext.createDataFrame(client_order)
df_co = df_co.distinct()
client_order = df_co.rdd.map(list)
client_order = client_order.reduceByKey(lambda a,b: a + b)
client_order.collect()
which results in:
[(u'4', [u'105', u'104']),
(u'3', [u'103']),
(u'2', [u'102']),
(u'1', [u'100', u'101'])]
Then I want to have a list like this:
[[[u'a', u'a'],[u'b']], [[u'c']], [[u'b', u'c']], [[u'a'],[u'b']]]
Here is a solution using a PySpark DataFrame (note that I use PySpark 2.1). First, you have to transform the RDD to a DataFrame.
df = test.toDF(['clientId', 'orderId', 'product'])
And this is the snippet to group the DataFrame. The basic idea is to group by clientId and orderId first and aggregate the product column together, then group again by clientId only.
import pyspark.sql.functions as func
df_group = df.groupby(['clientId', 'orderId']).agg(func.collect_list('product').alias('product_list'))
df_group_2 = df_group[['clientId', 'product_list']].\
groupby('clientId').\
agg(func.collect_list('product_list').alias('product_list_group')).\
sort('clientId', ascending=True)
df_group_2.rdd.map(lambda x: x.product_list_group).collect() # collect output here
Result is the following:
[[['a', 'a'], ['b']], [['c']], [['b', 'c']], [['b'], ['a']]]
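To actually run the sequence mining on this, a possible follow-up (assuming Spark 2.4+, where pyspark.ml.fpm.PrefixSpan is available) is to keep the nested lists in a DataFrame column named sequence:
from pyspark.ml.fpm import PrefixSpan

# PrefixSpan expects a DataFrame with an array<array<item>> column, named "sequence" by default.
seqs = df_group_2.select(func.col('product_list_group').alias('sequence'))
prefix_span = PrefixSpan(minSupport=0.5, maxPatternLength=5)
prefix_span.findFrequentSequentialPatterns(seqs).show(truncate=False)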

pyspark corr for each group in DF (more than 5K columns)

I have a DataFrame with 100 million rows and 5000+ columns. I am trying to find the correlation between colx and the remaining 5000+ columns.
aggList1 = [func.mean(c).alias(c + '_m') for c in df.columns]  # exclude keys
df21 = df.groupBy('key1', 'key2', 'key3', 'key4').agg(*aggList1)
df = df.join(func.broadcast(df21), ['key1', 'key2', 'key3', 'key4'])
df = df.select([func.round(func.col(colmd) - func.col(colmd + '_m'), 8).alias(colmd)
                for colmd in all5Kcolumns])
aggCols = [func.corr(colx, c).alias(c) for c in colsall5K]
df2 = df.groupBy('key1', 'key2', 'key3').agg(*aggCols)
Right now it is not working because of the Spark 64KB codegen issue (even on Spark 2.2). So I am looping over batches of 300 columns and merging everything at the end, but that takes more than 30 hours on a cluster with 40 nodes (10 cores each and 100GB per node). Any help to tune this?
Things already tried:
- Repartition the DF to 10,000 partitions
- Checkpoint in each loop
- Cache in each loop
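One more thing that sometimes sidesteps the 64KB codegen errors (I have not verified it at this scale) is disabling whole-stage code generation, at the cost of slower per-stage execution:
# Workaround for "grows beyond 64 KB" codegen errors; trades speed for stability.
spark.conf.set("spark.sql.codegen.wholeStage", "false")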
You can try a bit of NumPy and RDDs. First, a bunch of imports:
from operator import itemgetter
import numpy as np
from pyspark.statcounter import StatCounter
Let's define a few variables:
keys = ["key1", "key2", "key3"] # list of key column names
xs = ["x1", "x2", "x3"] # list of column names to compare
y = "y" # name of the reference column
And some helpers:
def as_pair(keys, y, xs):
    """Given key names, y name, and xs names
    return a tuple of key, array-of-values"""
    key = itemgetter(*keys)
    value = itemgetter(y, *xs)  # Python 3 syntax

    def as_pair_(row):
        return key(row), np.array(value(row))
    return as_pair_

def init(x):
    """Init function for combineByKey:
    initialize a new StatCounter and merge the first value"""
    return StatCounter().merge(x)

def center(means):
    """Center a row value given a
    dictionary of mean arrays
    """
    def center_(row):
        key, value = row
        return key, value - means[key]
    return center_

def prod(arr):
    return arr[0] * arr[1:]

def corr(stddev_prods):
    """Scale the row to get 1 stddev
    given a dictionary of stddevs
    """
    def corr_(row):
        key, value = row
        return key, value / stddev_prods[key]
    return corr_
and convert DataFrame to RDD of pairs:
pairs = df.rdd.map(as_pair(keys, y, xs))
Next let's compute statistics per group:
stats = (pairs
.combineByKey(init, StatCounter.merge, StatCounter.mergeStats)
.collectAsMap())
means = {k: v.mean() for k, v in stats.items()}
Note: with 5000 features and 7000 groups there should be no issue keeping this structure in memory. With larger datasets you may have to use an RDD and join, but this will be slower.
Center the data:
centered = pairs.map(center(means))
Compute covariance:
covariance = (centered
.mapValues(prod)
.combineByKey(init, StatCounter.merge, StatCounter.mergeStats)
.mapValues(StatCounter.mean))
And finally correlation:
stddev_prods = {k: prod(v.stdev()) for k, v in stats.items()}
correlations = covariance.map(corr(stddev_prods))
Example data:
df = sc.parallelize([
("a", "b", "c", 0.5, 0.5, 0.3, 1.0),
("a", "b", "c", 0.8, 0.8, 0.9, -2.0),
("a", "b", "c", 1.5, 1.5, 2.9, 3.6),
("d", "e", "f", -3.0, 4.0, 5.0, -10.0),
("d", "e", "f", 15.0, -1.0, -5.0, 10.0),
]).toDF(["key1", "key2", "key3", "y", "x1", "x2", "x3"])
Results with DataFrame:
from pyspark.sql.functions import corr as sql_corr  # the SQL corr, not the local helper defined above
df.groupBy(*keys).agg(*[sql_corr(y, x) for x in xs]).show()
+----+----+----+-----------+------------------+------------------+
|key1|key2|key3|corr(y, x1)| corr(y, x2)| corr(y, x3)|
+----+----+----+-----------+------------------+------------------+
| d| e| f| -1.0| -1.0| 1.0|
| a| b| c| 1.0|0.9972300220940342|0.6513360726920862|
+----+----+----+-----------+------------------+------------------+
and the method provided above:
correlations.collect()
[(('a', 'b', 'c'), array([ 1. , 0.99723002, 0.65133607])),
(('d', 'e', 'f'), array([-1., -1., 1.]))]
This solution, while a bit involved, is quite elastic and can easily be adjusted to handle different data distributions. It should also be possible to get a further boost with JIT.
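If the final result is needed as a DataFrame rather than an RDD of (key, array) pairs, a small follow-up sketch (reusing the keys and xs variables from above) could be:
# Flatten each (key_tuple, correlation_array) pair into a flat row and name
# the columns after the grouping keys and the compared columns.
corr_rows = correlations.map(lambda kv: tuple(kv[0]) + tuple(float(v) for v in kv[1]))
corr_df = corr_rows.toDF(keys + xs)
corr_df.show()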
