Convert groupByKey to reduceByKey in PySpark - apache-spark

How do I convert groupByKey to reduceByKey in PySpark? I have attached a snippet; it computes a correlation for each region and dept combination across weeks. I used groupByKey, but it is very slow and fails with shuffle errors (I have 10-20 GB of data and each group is 2-3 GB). Please help me rewrite this using reduceByKey.
Data set
region dept week val1 val2
US CS 1 1 2
US CS 2 1.5 2
US CS 3 1 2
US ELE 1 1.1 2
US ELE 2 2.1 2
US ELE 3 1 2
UE CS 1 2 2
output
region dept corr
US CS 0.5
US ELE 0.6
UE CS .3333
Code
import pandas as pd
from scipy.stats import pearsonr
from pyspark.sql import Row

def testFunction(key, value):
    inputpdDF = []
    finalRDD = []
    for val in value:
        keysValue = val.asDict().keys()
        inputpdDF.append(dict([(keyRDD, val[keyRDD]) for keyRDD in keysValue]))
    pdDF = pd.DataFrame(inputpdDF, columns=keysValue)
    corr = pearsonr(pdDF['val1'].astype(float), pdDF['val2'].astype(float))[0]
    corrDict = {"region": key.region, "dept": key.dept, "corr": corr}
    finalRDD.append(Row(**corrDict))
    return finalRDD

resRDD = df.select(["region", "dept", "week", "val1", "val2"])\
    .rdd\
    .map(lambda r: (Row(region=r.region, dept=r.dept), r))\
    .groupByKey()\
    .flatMap(lambda KeyValue: testFunction(KeyValue[0], list(KeyValue[1])))

Try:
>>> from pyspark.sql.functions import corr
>>> df.groupBy("region", "dept").agg(corr("val1", "val2"))
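For completeness, a slightly fuller sketch of the same DataFrame approach (the alias and the show() call are my additions, not part of the original answer); it reproduces the region/dept/corr layout requested above:

from pyspark.sql.functions import corr

resDF = (df
    .groupBy("region", "dept")
    .agg(corr("val1", "val2").alias("corr")))  # Pearson correlation per group
resDF.show()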

Related

Pandas - Sorting by 2 columns and comparing the values in the other column

Please consider this data frame:
pd.DataFrame({
'REGION':['US','US','CAN','CAN', 'EU','EU','EU'],
'ROLE': ['mgr','dir','mgr','dir','mgr','dir','CEO'],
'SALARY' : [4,5,3.7,6,4.1,5.5,8],
'other_columns':['random_val1','random_val2','random_val3','random_val4','random_val5','random_val6','random_val7']
})
In this data frame, we have three regions, and in each region multiple employee roles. The salary column contains the salary for that role in that region. Assume that all salary numbers are in the same currency.
Now, I would like to make sure that for any ROLE, the salary in CAN region must be at least as much as that in the US - and the salary in EU must be at least as much as that in CAN.
How do I solve it so that I get the following data frame?
pd.DataFrame({
'REGION':['US','US','CAN','CAN', 'EU','EU','EU'],
'ROLE': ['mgr','dir','mgr','dir','mgr','dir','CEO'],
'SALARY' : [4,5,4,6,4.1,6,8],
'other_columns':['random_val1','random_val2','random_val3','random_val4','random_val5','random_val6','random_val7']
})
Please note that this is a sample data frame - in the real data frame, I have a few additional columns that I would like to keep unchanged.
Thanks!
A solution using groupby and cummax. I like this method because you can extend the number of regions you need to support relatively easily, by adding additional regions to the custom sorting order.
df = pd.DataFrame({
'REGION':['US','US','CAN','CAN', 'EU','EU','EU'],
'ROLE': ['mgr','dir','mgr','dir','mgr','dir','CEO'],
'SALARY' : [4,5,3.7,6,4.1,5.5,8],
'other_columns':['random_val1','random_val2','random_val3','random_val4','random_val5','random_val6','random_val7']})
# Replace the region with a categorical variable to ensure sorting order is US, CAN, EU
df["REGION"] = pd.Categorical(df["REGION"], ["US", "CAN", "EU"])
df = df.sort_values(["ROLE", "REGION"])
df = df.groupby("ROLE").apply(lambda x: x.assign(SALARY=x["SALARY"].cummax()))
# if you need your data in the original order again
df = df.sort_index()
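To illustrate that extension point: supporting a hypothetical fourth region (say APAC, which is not in the original data) only requires extending the category list that defines the ordering; everything else stays the same.

# Hypothetical: enforce US <= CAN <= EU <= APAC by extending the custom order
df["REGION"] = pd.Categorical(df["REGION"], ["US", "CAN", "EU", "APAC"])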
import pandas as pd
data = pd.DataFrame({
'region':['US','US','CAN','CAN', 'EU','EU','EU'],
'role': ['mgr','dir','mgr','dir','mgr','dir','CEO'],
'salary' : [4,5,3.7,6,4.1,5.5,8],
'other_columns':['random_val1','random_val2','random_val3','random_val4','random_val5','random_val6','random_val7']})
pt = pd.pivot_table(data, values=['salary'], index=['role'], columns=['region'])
df = pt['salary'].fillna(0)
df['CAN'] = df.apply(lambda x: max(x['US'], x['CAN']), axis=1)
df['EU'] = df.apply(lambda x: max(x['CAN'], x['EU']), axis=1)
data['salary'] = data.apply(lambda x: df[x['region']][x['role']], axis=1)
print(data)
A solution by mapping and slicing a MultiIndex; to set the values I use Series.clip:
df = df.set_index(['REGION','ROLE'])
df1 = df.copy()
us = df.loc['US', 'SALARY']
can = df.loc['CAN', 'SALARY']
eu = df.loc['EU', 'SALARY']
df.loc['CAN', 'SALARY'] = can.clip(lower=can.index.map(us)).to_numpy()
df.loc['EU', 'SALARY'] = eu.clip(lower=eu.index.map(can)).to_numpy()
df = df.fillna(df1).reset_index()
print (df)
REGION ROLE SALARY other_columns
0 US mgr 4.0 random_val1
1 US dir 5.0 random_val2
2 CAN mgr 4.0 random_val3
3 CAN dir 6.0 random_val4
4 EU mgr 4.1 random_val5
5 EU dir 6.0 random_val6
6 EU CEO 8.0 random_val7
Another solution with pivoting and unpivoting:
df1 = df.pivot(index='ROLE', columns='REGION', values='SALARY')
df1['CAN'] = df1[['CAN','US']].max(axis=1)
df1['EU'] = df1[['CAN','EU']].max(axis=1)
df = df.join(df1.stack().rename('new'), on=['ROLE','REGION'])
df['SALARY'] = df.pop('new')
print (df)
REGION ROLE SALARY other_columns
0 US mgr 4.0 random_val1
1 US dir 5.0 random_val2
2 CAN mgr 4.0 random_val3
3 CAN dir 6.0 random_val4
4 EU mgr 4.1 random_val5
5 EU dir 6.0 random_val6
6 EU CEO 8.0 random_val7

featuretools: manual derivation of the features generated by dfs?

Code example:
import featuretools as ft
es = ft.demo.load_mock_customer(return_entityset=True)
# Normalized one more time
es = es.normalize_entity(
new_entity_id="device",
base_entity_id="sessions",
index="device",
)
feature_matrix, feature_defs = ft.dfs(
entityset=es,
target_entity="customers",
agg_primitives=["std",],
groupby_trans_primitives=['cum_count'],
max_depth=2
)
I'd like to look deeper into the STD(sessions.CUM_COUNT(device) by customer_id) feature.
I tried to generate this feature manually but got a different result:
import pandas as pd

df = ft.demo.load_mock_customer(return_single_table=True)
a = df.groupby("customer_id")['device'].cumcount()
a.name = "cumcount_device"
a = pd.concat([df, a], axis=1)
b = a.groupby("customer_id")['cumcount_device'].std()
>>> b
customer_id
1 36.517
2 26.991
3 26.991
4 31.610
5 22.949
Name: cumcount_device, dtype: float64
What am I missing?
Thanks for the question. The calculation needs to be based on the data frame from sessions.
df = es['sessions'].df
cumcount = df['device'].groupby(df['customer_id']).cumcount()
std = cumcount.groupby(df['customer_id']).std()
std.round(3).loc[feature_matrix.index]
customer_id
5 1.871
4 2.449
1 2.449
3 1.871
2 2.160
dtype: float64
You should get the same output as in DFS.
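To cross-check, you can also pull the feature straight out of the DFS output and compare; the column name below is the feature name quoted in the question:

# Compare the manual calculation with the column DFS produced
dfs_values = feature_matrix["STD(sessions.CUM_COUNT(device) by customer_id)"]
print(dfs_values.round(3))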

Aggregating using custom function and several columns in pandas

Suppose I have the following data frame:
group num value
a 3 20
a 5 5
b 5 10
b 10 5
b 2 25
Now, I want to compute the weighted average of value, weighted by num, grouping by the group column. Using tidyverse packages in R, this is straightforward:
> library(tidyverse)
> df <- tribble(
~group , ~num , ~value,
"a" , 3 , 20,
"a" , 5 , 5,
"b" , 5 , 10,
"b" , 10 , 5,
"b" , 2 , 25
)
> df %>%
group_by(group) %>%
summarise(new_value = sum(num * value) / sum(num))
# A tibble: 2 x 2
group new_value
<chr> <dbl>
1 a 10.6
2 b 8.82
Using pandas in Python, I can do all the intermediate computation beforehand, then use sum() to sum up the variables, and then perform the division using transform(), like this:
import pandas as pd
from io import StringIO
data = StringIO(
"""
group,num,value
a,3,20
a,5,5
b,5,10
b,10,5
b,2,25
""")
df = pd.read_csv(data)
df["tmp_value"] = df["num"] * df["value"]
df = df.groupby(["group"]) \
[["num", "tmp_value"]] \
.sum() \
.transform(lambda x : x["tmp_value"] / x["num"], axis="columns")
print(df)
# group
# a 10.625000
# b 8.823529
# dtype: float64
Note that we explicitly need to first subset the columns of interest ([["num", "tmp_value"]]), compute the sum (sum()), and then do the average/division using transform(). In R, we write this in just one simple step, which is much more compact and readable, IMHO.
Now, how can I achieve that elegance using pandas? In other words, can it be cleaner, more elegant, and above all as easy to read as it is in R?
@an_drade - There has been a very similar Stack Overflow question that provides the solution:
Pandas DataFrame aggregate function using multiple columns
The solution to your question is based on the above post, creating a Python function:
df=pd.DataFrame([['a',3,20],['a',5,5],['b',5,10],['b',10,5],['b',2,25]],columns=['group','num','value'])
def wavg(group):
    weights = group['num']   # num acts as the weight
    values = group['value']  # value is the quantity being averaged
    return (weights * values).sum() / weights.sum()
final=df.groupby("group").apply(wavg)
group
a 10.625000
b 8.823529
dtype: float64
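A slightly more compact variant of the same computation (my suggestion, not part of the original answer) leans on numpy's weighted average:

import numpy as np

# num acts as the weight, value is the quantity being averaged
final = df.groupby("group").apply(lambda g: np.average(g["value"], weights=g["num"]))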
This is the "R way" you wanted:
>>> from datar import f
>>> from datar.tibble import tribble
>>> from datar.dplyr import group_by, summarise
>>> from datar.base import sum
>>> # or if you are lazy:
>>> # from datar.all import *
>>>
>>> df = tribble(
... f.group , f.num , f.value,
... "a" , 3 , 20,
... "a" , 5 , 5,
... "b" , 5 , 10,
... "b" , 10 , 5,
... "b" , 2 , 25
... )
>>> df >> \
... group_by(f.group) >> \
... summarise(new_value = sum(f.num * f.value) / sum(f.num))
group new_value
<object> <float64>
0 a 10.625000
1 b 8.823529
I am the author of the datar package. Please feel free to submit issues if you have any questions about using it.

How to merge two dataframes and return data from another column in new column only if there is match?

I have two dfs that look like this:
df1:
id
1
2
df2:
id value
2 a
3 b
How do I merge these two dataframes and only return the data from the value column in a new column if there is a match?
new_merged_df
id value new_value
1
2 a a
3 b
You can try this using @JJFord3's setup:
import pandas
df1 = pandas.DataFrame(index=[1,2])
df2 = pandas.DataFrame({'value' : ['a','b']},index=[2,3])
#Use isin to create new_value
df2['new_value'] = df2['value'].where(df2.index.isin(df1.index))
#Use reindex with union to rebuild dataframe with both indexes
df2.reindex(df1.index.union(df2.index))
Output:
value new_value
1 NaN NaN
2 a a
3 b NaN
import pandas
df1 = pandas.DataFrame(index=[1,2])
df2 = pandas.DataFrame({'value' : ['a','b']},index=[2,3])
new_merged_df_outer = df1.merge(df2,how='outer',left_index=True,right_index=True)
new_merged_df_inner = df1.merge(df2,how='inner',left_index=True,right_index=True)
new_merged_df_inner = new_merged_df_inner.rename(columns={'value':'new_value'})
new_merged_df = new_merged_df_outer.merge(new_merged_df_inner,how='left',left_index=True,right_index=True)
First, create an outer merge to keep all indexes.
Then create an inner merge to only get the overlap.
Then merge the inner merge back to the outer merge to get the desired column setup.
You can use a full outer join.
Let's model your data with case classes:
case class MyClass1(id: String)
case class MyClass2(id: String, value: String)
// this one for the result type
case class MyClass3(id: String, value: Option[String] = None, value2: Option[String] = None)
Creating some inputs:
val input1: Dataset[MyClass1] = ...
val input2: Dataset[MyClass2] = ...
Joining your data:
import spark.implicits._
val joined = input1.as("1").joinWith(input2.as("2"), $"1.id" === $"2.id", "full_outer")
joined map {
case (left, null) if left != null => MyClass3(left.id)
case (null, right) if right != null => MyClass3(right.id, Some(right.value))
case (left, right) => MyClass3(left.id, Some(right.value), Some(right.value))
}
DataFrame.merge has an indicator parameter which:
"If True, adds a column to output DataFrame called "_merge" with information on the source of each row."
This can be used to check if there is a match.
import pandas as pd
df1 = pd.DataFrame(index=[1,2])
df2 = pd.DataFrame({'value' : ['a','b']},index=[2,3])
# creates a new column `_merge` with values `right_only`, `left_only` or `both`
merged = df1.merge(df2, how='outer', right_index=True, left_index=True, indicator=True)
merged['new_value'] = merged.loc[(merged['_merge'] == 'both'), 'value']
merged = merged.drop('_merge', axis=1)
Use merge and isin:
df = df1.merge(df2,on='id',how='outer')
id_value = df2.loc[df2['id'].isin(df1.id.tolist()),'id'].unique()
mask = df['id'].isin(id_value)
df.loc[mask,'new_value'] = df.loc[mask,'value']
# alternative df['new_value'] = np.where(mask, df['value'], np.nan)
print(df)
id value new_value
0 1 NaN NaN
1 2 a a
2 3 b NaN

How to check NULL values while comparing 2 text files using spark data frames

The below code fails to capture records with 'null' values. In df1 below, the record with NO 5 has a null value in the NAME field.
Per my required OutputDF below, record No. 5 should appear as shown. But after running the code below, this record does not make it into the final output: records with 'null' values are not coming through. Apart from that, everything else is fine.
df1
NO DEPT NAME SAL
1 IT RAM 1000
2 IT SRI 600
3 HR GOPI 1500
5 HW 700
df2
NO DEPT NAME SAL
1 IT RAM 1000
2 IT SRI 900
4 MT SUMP 1200
5 HW MAHI 700
OutputDF
NO DEPT NAME SAL FLAG
1 IT RAM 1000 SAME
2 IT SRI 900 UPDATE
4 MT SUMP 1200 INSERT
3 HR GOPI 1500 DELETE
5 HW MAHI 700 UPDATE
from pyspark.shell import spark
from pyspark.sql import DataFrame
import pyspark.sql.functions as F
sc = spark.sparkContext
filedf1 = spark.read.option("header","true").option("delimiter", ",").csv("C:\\files\\file1.csv")
filedf2 = spark.read.option("header","true").option("delimiter", ",").csv("C:\\files\\file2.csv")
filedf1.createOrReplaceTempView("table1")
filedf2.createOrReplaceTempView("table2")
df1 = spark.sql( "select * from table1" )
df2 = spark.sql( "select * from table2" )
#DELETE
df_d = df1.join(df2, df1.NO == df2.NO, 'left').filter(F.isnull(df2.NO)).select(df1.NO,df1.DEPT,df1.NAME,df1.SAL, F.lit('DELETE').alias('FLAG'))
print("df_d left:",df_d.show())
#INSERT
df_i = df1.join(df2, df1.NO == df2.NO, 'right').filter(F.isnull(df1.NO)).select(df2.NO,df2.DEPT,df2.NAME,df2.SAL, F.lit('INSERT').alias('FLAG'))
print("df_i right:",df_i.show())
#SAME
df_s = df1.join(df2, df1.NO == df2.NO, 'inner').filter(F.concat(df2.NO,df2.DEPT,df2.NAME,df2.SAL) == F.concat(df1.NO,df1.DEPT,df1.NAME,df1.SAL)).select(df1.NO,df1.DEPT,df1.NAME,df1.SAL, F.lit('SAME').alias('FLAG'))
print("df_s inner:",df_s.show())
#UPDATE
df_u = df1.join(df2, df1.NO == df2.NO, 'inner').filter(F.concat(df2.NO,df2.DEPT,df2.NAME,df2.SAL) != F.concat(df1.NO,df1.DEPT,df1.NAME,df1.SAL)).select(df2.NO,df2.DEPT,df2.NAME,df2.SAL, F.lit('UPDATE').alias('FLAG'))
print("df_u inner:",df_u.show())
df = df_d.union(df_i).union(df_s).union(df_u)
df.show()
Here I'm comparing df1 and df2: if a new record is found in df2, I flag it as INSERT; if a record is the same in both dfs, SAME; if the record is in df1 but not in df2, DELETE; and if the record exists in both dfs but with different values, I take the df2 values and flag it as UPDATE.
There are two issues with the code:
The result of F.concat with a null argument is null, so this part of the code filters out row NO 5:
.filter(F.concat(df2.NO, df2.NAME, df2.SAL) != F.concat(df1.NO, df1.NAME, df1.SAL))
You are only selecting columns from df2. That is fine in the example above, but if df2 has a null then the resulting dataframe will contain that null.
You can try concatenating with a udf as below:
from pyspark.sql.functions import udf, struct, coalesce
from pyspark.sql.types import StringType

def concat_cols(row):
    # join the string form of every non-null value in the row
    concat_row = ''.join([str(col) for col in row if col is not None])
    return concat_row

udf_concat_cols = udf(concat_cols, StringType())
The function concat_cols can be broken down into two parts:
''.join([mylist]) is a string method: it joins everything in the list using the given delimiter, in this case an empty string.
[str(col) for col in row if col is not None] is a list comprehension, and it does what it reads: for each column in the row, if the column is not None, append str(col) to the list.
The list comprehension is just a more pythonic way of doing this:
mylist = []
for col in row:
    if col is not None:
        mylist.append(str(col))
You can replace your update code as:
df_u = (df1
.join(df2, df1.NO == df2.NO, 'inner')
.filter(udf_concat_cols(struct(df1.NO, df1.NAME, df1.SAL)) != udf_concat_cols(struct(df2.NO, df2.NAME, df2.SAL)))
.select(coalesce(df1.NO, df2.NO),
coalesce(df1.NAME, df2.NAME),
coalesce(df1.SAL, df2.SAL),
F.lit('UPDATE').alias('FLAG')))
You should do something similar for your #SAME flag and break the line for readability.
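For reference, a sketch of what that #SAME version could look like under the same assumptions (it simply mirrors the UPDATE block above and is not part of the original answer):

df_s = (df1
    .join(df2, df1.NO == df2.NO, 'inner')
    .filter(udf_concat_cols(struct(df1.NO, df1.NAME, df1.SAL)) ==
            udf_concat_cols(struct(df2.NO, df2.NAME, df2.SAL)))
    .select(df1.NO, df1.DEPT, df1.NAME, df1.SAL, F.lit('SAME').alias('FLAG')))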
Update:
If df2 always has the correct (updated) values, there is no need to coalesce.
The code for this instance would be:
df_u = (df1
.join(df2, df1.NO == df2.NO, 'inner')
.filter(udf_concat_cols(struct(df1.NO, df1.NAME, df1.SAL)) != udf_concat_cols(struct(df2.NO, df2.NAME, df2.SAL)))
.select(df2.NO,
df2.NAME,
df2.SAL,
F.lit('UPDATE').alias('FLAG')))
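As an aside (not part of the original answer), newer Spark versions also provide Column.eqNullSafe, a built-in null-safe equality test that avoids the Python UDF entirely; a minimal sketch under the same df1/df2 assumptions:

# A row is an UPDATE if any column differs, treating null = null as equal
df_u = (df1
    .join(df2, df1.NO == df2.NO, 'inner')
    .filter(~(df1.DEPT.eqNullSafe(df2.DEPT) &
              df1.NAME.eqNullSafe(df2.NAME) &
              df1.SAL.eqNullSafe(df2.SAL)))
    .select(df2.NO, df2.DEPT, df2.NAME, df2.SAL, F.lit('UPDATE').alias('FLAG')))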
