pyspark calculate custom metric on grouped data - apache-spark

I have a large dataframe (40 billion+ rows) which can be grouped by a key. I want to apply a custom calculation to a few fields of each group and derive a single value per group. E.g., the dataframe below has a group_key, and I want to derive a single value from col1 and col2 for each group_key. The calculation involves complex looping over col1 and col2. I have written a pyspark UDF which works fine and returns the desired output for each group.
However, the row-by-row UDF is slow even for a few thousand rows, and it definitely does not scale to 40 billion rows.
group_key col1 col2
123 a 5
123 a 6
123 b 6
123 cd 3
123 d 2
123 ab 9
456 d 4
456 ad 6
456 ce 7
456 a 4
456 s 3
desired output
group_key output
123 9.2
456 7.3
I know a plain pyspark UDF is the worst choice since it is a row-by-row operation. For a vectorized/batch operation I tried a pandas_udf (PandasUDFType.GROUPED_MAP) on each group, but if I use a for loop inside output_calc_udf it throws an error. My logic needs to traverse each row of the dataframe passed to the pandas_udf, build a list from it, and then traverse that list several times, popping elements until it is empty. It's quite complex. As I understand it, I cannot really loop inside a vectorized operation? How else can I execute the UDF for 40 billion rows?
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, DecimalType

schema = StructType([StructField("output", DecimalType(), True)])

@F.pandas_udf(schema, F.PandasUDFType.GROUPED_MAP)
def output_calc_udf(pdf):
    pdf = pdf[['output']]
    return pdf

df_prob_new2.groupBy("group_key").apply(output_calc_udf).show(truncate=False)
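For what it's worth, a GROUPED_MAP pandas_udf hands each group to the function as an ordinary pandas DataFrame, so plain Python loops over that group's rows are allowed inside the function; only the per-group transfer is Arrow-based. A minimal sketch of what that could look like, where the while-loop body is a placeholder for the real pop-until-empty logic and the schema/types are assumptions (only group_key, col1 and col2 come from the question):
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, LongType, DoubleType

# One output row per group: the group key and the derived value.
out_schema = StructType([
    StructField("group_key", LongType(), True),
    StructField("output", DoubleType(), True),
])

@F.pandas_udf(out_schema, F.PandasUDFType.GROUPED_MAP)
def output_calc_udf(pdf):
    # pdf holds ALL rows of one group_key as a plain pandas DataFrame,
    # so ordinary Python looping is fine here.
    items = list(zip(pdf["col1"], pdf["col2"]))
    result = 0.0
    while items:                      # placeholder for the real traversal logic
        _, col2 = items.pop()
        result += float(col2) / 10.0  # hypothetical calculation
    return pd.DataFrame({"group_key": [pdf["group_key"].iloc[0]],
                         "output": [result]})

df_prob_new2.groupBy("group_key").apply(output_calc_udf).show(truncate=False)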

Related

Sampling a dataframe according to some rules: balancing a multilabel dataset

I have a dataframe like this:
df = pd.DataFrame({'id': [10, 20, 30, 40],
                   'text': ['some text', 'another text', 'random stuff', 'my cat is a god'],
                   'A': [0, 0, 1, 1],
                   'B': [1, 1, 0, 0],
                   'C': [0, 0, 0, 1],
                   'D': [1, 0, 1, 0]})
Here I have columns from A to D, but my real dataframe has 100 columns with values of 0 and 1, and about 100k records.
For example, column A is related to the 3rd and 4th rows of text because it is labeled 1. In the same way, A is not related to the 1st and 2nd rows of text because it is labeled 0.
What I need to do is sample this dataframe so that I end up with the same, or about the same, number of records per feature.
In this case, feature C has only one occurrence, so I need to filter all the other columns so that I have one text with A, one text with B, one text with C, etc.
Ideally I could set, for example, n=100, meaning I want to sample so that I have 100 records for each feature.
This is a multilabel training dataset and it is highly unbalanced; I am looking for the best way to balance it for a machine learning task.
Important: I don't want to exclude the 0 values. I just want to have ABOUT the same number of 1s and 0s per column.
For example, with a final dataset of 1k records, I would like to keep all columns from A to the final column, each with about the same number of 1s and 0s. To accomplish this I only need to randomly discard text/id rows.
The approach I was trying was to look at the feature with the lowest count of 1s and then use that value as a threshold.
Edit 1: One possible way I thought of is to use:
df.sum(axis=0, skipna=True)
Then I can use the column with the lowest sum as a threshold to filter the text column, but I don't know how to do this filtering step.
Thanks
The exact output you expect is unclear, but assuming you want one random row per letter with a 1, you could reshape (while dropping the 0s) and use GroupBy.sample:
(df
.set_index(['id', 'text'])
.replace(0, float('nan'))
.stack()
.groupby(level=-1).sample(n=1)
.reset_index()
)
NB. you can rename the columns if needed
output:
id text level_2 0
0 30 random stuff A 1.0
1 20 another text B 1.0
2 40 my cat is a god C 1.0
3 30 random stuff D 1.0
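If you want up to n rows per label instead of exactly one (the n=100 idea from the question), the same reshape can be sampled per label with a cap. A rough sketch under that assumption, reusing the df from the question; labels with fewer than n rows are simply kept whole:
n = 100  # desired rows per label, as in the question

long_df = (df
    .set_index(['id', 'text'])
    .replace(0, float('nan'))
    .stack()
    .reset_index()
    .rename(columns={'level_2': 'label', 0: 'value'}))

# sample at most n rows per label
sampled = long_df.groupby('label', group_keys=False).apply(
    lambda g: g.sample(n=min(n, len(g))))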

How to call a function with pandas apply on all rows (axis=1) but only for some specific rows of a dataframe?

I have a function which sends automated messages to clients, and takes as input all the columns from a dataframe like the one below.
name    phone    status   date
name_1  phone_1  sending  today
name_2  phone_2  sending  yesterday
I iterate through the dataframe with pandas apply (axis=1) and use the values of each row's columns as inputs to my function. After sending, it changes the status to "sent". The thing is, I only want to send to the clients whose date is "today". With pandas.apply(axis=1) this is perfectly doable, but in order to slice out the clients with the "today" value, I need to:
create a new dataframe with today's values,
remove it from the original, and then
reappend it to the original.
I thought about running through the whole dataframe and ignoring the rows whose date is different from "today", but if my dataframe keeps growing, I'm afraid the whole process will become slower.
I saw examples of this being done with mask, although usually people use only one column, and I need more than one. Is there any way to do this with pandas apply?
Thank you.
I think you can use .loc to filter the data and apply func to it.
In [13]: df = pd.DataFrame(np.random.rand(5,5))
In [14]: df
Out[14]:
0 1 2 3 4
0 0.085870 0.013683 0.221890 0.533393 0.622122
1 0.191646 0.331533 0.259235 0.847078 0.649680
2 0.334781 0.521263 0.402030 0.973504 0.903314
3 0.189793 0.251130 0.983956 0.536816 0.703726
4 0.902107 0.226398 0.596697 0.489761 0.535270
If we want to double the values of the rows where the value in the first column is > 0.3:
In [16]: df.loc[df[0] > 0.3]
Out[16]:
0 1 2 3 4
2 0.334781 0.521263 0.402030 0.973504 0.903314
4 0.902107 0.226398 0.596697 0.489761 0.535270
In [18]: df.loc[df[0] > 0.3] = df.loc[df[0] > 0.3].apply(lambda x: x*2, axis=1)
In [19]: df
Out[19]:
0 1 2 3 4
0 0.085870 0.013683 0.221890 0.533393 0.622122
1 0.191646 0.331533 0.259235 0.847078 0.649680
2 0.669563 1.042527 0.804061 1.947008 1.806628
3 0.189793 0.251130 0.983956 0.536816 0.703726
4 1.804213 0.452797 1.193394 0.979522 1.070540
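Applied to the question's scenario, the same pattern would look roughly like this; send_message is a hypothetical stand-in for the real messaging function, and the column names follow the example table above:
import pandas as pd

def send_message(row):
    # ... call the messaging API with row['name'] and row['phone'] ...
    return "sent"

df = pd.DataFrame({
    "name":   ["name_1", "name_2"],
    "phone":  ["phone_1", "phone_2"],
    "status": ["sending", "sending"],
    "date":   ["today", "yesterday"],
})

mask = df["date"] == "today"
# apply only to the filtered rows and write the new status back in place
df.loc[mask, "status"] = df.loc[mask].apply(send_message, axis=1)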

How can I get the operator type and apply a formula to a pandas dataframe

I have a python string like this: str1='PRODUCT1_PRD/2+PRODUCT2_NON-PROD-PRODUCT3_NON-PROD/2'
Here I want to extract the operators dynamically and, based on each operator, perform the corresponding operation. I have a pandas dataframe df like this:
PRODUCT PRD NON-PROD
PRODUCT1 3 5
PRODUCT2 4 6
PRODUCT3 5 8
As output I want a variable var1 = (3/2)+6-(8/2) = 3.5 after applying the above formula. How can I do this in the most efficient way?
One thing to note: I have multiple formulas like the one above, all inside a list of strings, so I have to apply all of them one by one.
First create a MultiIndex Series with DataFrame.set_index and DataFrame.stack, then join the index values with _ via map:
s = df.set_index('PRODUCT').stack()
s.index = s.index.map('_'.join)
print (s)
PRODUCT1_PRD 3
PRODUCT1_NON-PROD 5
PRODUCT2_PRD 4
PRODUCT2_NON-PROD 6
PRODUCT3_PRD 5
PRODUCT3_NON-PROD 8
dtype: int64
Then replace the keys in the string with the Series values and call pandas.eval:
str1='PRODUCT1_PRD/2+PRODUCT2_NON-PROD-PRODUCT3_NON-PROD/2'
for k, v in s.items():
    str1 = str1.replace(k, str(v))
print (str1)
3/2+6-8/2
print (pd.eval(str1))
3.5
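Since the question mentions a whole list of such formulas, the replace-and-eval step can be wrapped in a small helper and applied to each formula in turn. A sketch, assuming the same df and s as above and a hypothetical formulas list (sorting the keys by length is only a safety net so a shorter key can never clobber a longer one it is a prefix of):
import pandas as pd

# hypothetical list of formulas, as described in the question
formulas = [
    "PRODUCT1_PRD/2+PRODUCT2_NON-PROD-PRODUCT3_NON-PROD/2",
    "PRODUCT1_NON-PROD+PRODUCT2_PRD",
]

def evaluate(formula, values):
    # replace longer keys first, then evaluate the resulting arithmetic string
    for k in sorted(values.index, key=len, reverse=True):
        formula = formula.replace(k, str(values[k]))
    return pd.eval(formula)

results = [evaluate(f, s) for f in formulas]
print(results)  # the first result matches the question's expected 3.5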

compare values of two dataframes based on certain filter conditions and then get count

I am new to Spark. I am writing pyspark code where I have two dataframes such that:
DATAFRAME-1:
NAME BATCH MARKS
A 1 44
B 15 50
C 45 99
D 2 18
DATAFRAME-2:
NAME MARKS
A 36
B 100
C 23
D 67
I want my output as a comparison between these two dataframes such that I can store the counts as my variables.
for instance,
improvedStudents = 1 (since D belongs to batch 1-15 and has improved his score)
badPerformance = 2 (A and B performed badly since they belong to batch 1-15 and their marks are lower than before)
neutralPerformance = 1 (C, because even though his marks went down, he belongs to batch 45, which we don't want to consider)
This is just an example out of a complex problem I'm trying to solve.
Thanks
If the data is as in your example, why don't you just join them and create a new column for every metric that you have:
val df = df1.withColumnRenamed("MARKS", "PRE_MARKS")
  .join(df2.withColumnRenamed("MARKS", "POST_MARKS"), Seq("NAME"))
  .withColumn("Evaluation",
    when(col("BATCH") > 15, lit("neutral"))
      .when(col("PRE_MARKS") gt col("POST_MARKS"), lit("bad"))
      .when(col("POST_MARKS") gt col("PRE_MARKS"), lit("improved"))
      .otherwise(lit("neutral")))
  .groupBy("Evaluation")
  .count
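Since the question is tagged pyspark, a rough Python equivalent of the Scala answer above could look like this; the column names come from the example, and the per-category counts are then pulled into plain variables as the question asks (a sketch, not tested against the real data):
from pyspark.sql import functions as F

result = (
    df1.withColumnRenamed("MARKS", "PRE_MARKS")
       .join(df2.withColumnRenamed("MARKS", "POST_MARKS"), ["NAME"])
       .withColumn(
           "Evaluation",
           F.when(F.col("BATCH") > 15, F.lit("neutral"))
            .when(F.col("PRE_MARKS") > F.col("POST_MARKS"), F.lit("bad"))
            .when(F.col("POST_MARKS") > F.col("PRE_MARKS"), F.lit("improved"))
            .otherwise(F.lit("neutral")),
       )
       .groupBy("Evaluation")
       .count()
)

counts = {row["Evaluation"]: row["count"] for row in result.collect()}
improvedStudents = counts.get("improved", 0)
badPerformance = counts.get("bad", 0)
neutralPerformance = counts.get("neutral", 0)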

How to snip part of a row's data and only leave the first 3 digits in Python

0 546/001441
1 540/001495
2 544/000796
3 544/000797
4 544/000798
I have a column in my dataframe that I've provided above. It can have any number of rows depending on the data being crunched. It is one of many columns, but its first three digits match another column's data. I need to cut off everything after the first 3 digits in order to append it to another dataframe based on the matching values. Any ideas how to get only the first 3 digits and drop the rest?
So far I've got the whole column singled out as a Series and also as a numpy.array in order to try to convert it to a str instead of an object.
I know this is getting me closer to an answer, but I can't seem to figure out how to cut out the unnecessary values:
testcut=dfwhynot[0][:3]
This cuts the string where I need it, but how do I do this for the whole column?
Assuming the name of your column is col, you can
# Split the column into two
df['col'] = df['col'].apply(lambda row: row.split('/'))
df[['col1', 'col2']] = pd.DataFrame(df['col'].tolist(), index=df.index)
col1 col2
0 546 001441
1 540 001495
2 544 000796
3 544 000797
4 544 000798
then get the minimal element of each col1 group
df.groupby('col1').min().reset_index()
resulting in
col1 col2
0 540 001495
1 544 000796
2 546 001441
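As a side note, if the goal is only to keep the first three characters of each value, pandas' string accessor does this in one vectorized step without apply; a small sketch assuming col still holds the original strings:
df['col1'] = df['col'].str[:3]                          # keep only the first 3 characters
# or split on '/' directly into two columns:
df[['col1', 'col2']] = df['col'].str.split('/', expand=True)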
