Loop through grouped pandas dataframe and perform some operations - pandas-groupby

I'm trying to perform an action on grouped data in Pandas. For each group defined by the columns "atable" and "column", I want to loop through the rows and check whether the sum of "value" where Include is "Yes" equals the sum of "value" where Include is "No", but only if the group contains both "Yes" and "No" in Include. If the condition is not met, I want to print an error with the group details. My data looks like this:
df1 = pd.DataFrame({
    'atable': ['Users', 'Users', 'Users', 'Users', 'Locks'],
    'column': ['col_1', 'col_1', 'col_1', 'col_a', 'col'],
    'Include': ['No', 'Yes', 'Yes', 'Yes', 'Yes'],
    'value': [3, 2, 1, 1, 1],
})
df1
  Include atable column  value
0      No  Users  col_1      3
1     Yes  Users  col_1      2
2     Yes  Users  col_1      1
3     Yes  Users  col_a      1
4     Yes  Locks    col      1
I tried the code below, but it also flags groups that do not have both "Yes" and "No" in the Include column:
grouped = df1.groupby(["atable", "column"])
for index, rows in grouped:
    if (([rows['Include'].isin(["Yes", "No"])])) and (rows[rows['Include'] == 'Yes']['value'].sum() != rows[rows['Include'] == 'No']["value"].sum()):
        print("error", index)
Output:
error ('Locks', 'col')
error ('Users', 'col_a')
I don't want my code to report an error for index 3 and 4, since those rows only have "Yes" in the Include column.

This worked:
grouped = df1.groupby(["atable", "column"])
for index, rows in grouped:
    if (rows[rows['Include'] == 'Yes']['value'].sum() != rows[rows['Include'] == 'No']["value"].sum()) and (rows[rows['Include'] == 'Yes']['value'].sum() != 0) and (rows[rows['Include'] == 'No']['value'].sum() != 0):
        print("error", index)

Related

Python - Filter out rows from dataframe based on match on columns from another dataframe

I have the two dataframes as below:
Based on the values in df2, if a df1 row matches ALL the conditions of a df2 row, then remove it from df1.
Expected output:
If a column value is NULL, then consider it to match ALL values; otherwise it is a regular match.
i.e. the 1st row (from df2) only has a product value (the other columns are null), so the filter should match all values of book and business and ONLY product = Forex; so the "tr4" row is matched and hence removed.
The 2nd row (from df2) has to match book = b2, all business values (since NULL) and product = Swap; no rows match all of this (AND) condition, so nothing is removed.
The result can be in place or a new df. How can this be done?
to_drop = set()
for i in range(len(df2)):
    for j in range(len(df1)):
        # "[NULL]" in df2 matches any value in df1
        if ((df2['book'][i] == "[NULL]" or df2['book'][i] == df1['book'][j])
                and (df2['business'][i] == "[NULL]" or df2['business'][i] == df1['business'][j])
                and (df2['product'][i] == "[NULL]" or df2['product'][i] == df1['product'][j])):
            to_drop.add(j)
df1 = df1.drop(list(to_drop))
df1.reset_index(drop=True, inplace=True)
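A vectorized sketch of the same matching rule, built as a boolean mask instead of a nested loop (this assumes the same df1/df2 column names and that missing values in df2 are the literal string "[NULL]"):

import pandas as pd

mask = pd.Series(False, index=df1.index)
for _, rule in df2.iterrows():
    cond = pd.Series(True, index=df1.index)
    for col in ['book', 'business', 'product']:
        # "[NULL]" matches everything, so only filter on non-null rule values
        if rule[col] != "[NULL]":
            cond &= df1[col].eq(rule[col])
    mask |= cond

df1 = df1[~mask].reset_index(drop=True)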

pyspark df filter records based on integer

Input:
Expected Output:
These columns are dynamic; there can be 'n' columns. I need to filter records by all columns which are not equal to 0. The column datatype is decimal.
from pyspark.sql import functions as F

for column in df.columns:
    count = df.filter(F.col(column) != 0).count()
    if count > 0:
        # do some function
        ...
    else:
        # do some other function
        ...
The main problems I face:
Is there any way to filter all columns at the same time without a for loop?
NULL records are not being filtered out.
How can this be done efficiently in PySpark?
You can create a filter expression dynamically for all the columns and then perform your filter.
filter_condition = ''
for column_name in df.schema.names:
    if filter_condition != '':
        filter_condition += ' & '
    filter_condition += f"{column_name} <> 0"

display(df.filter(filter_condition))
This one uses simple SQL syntax.
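As a sketch of the same idea without building a SQL string, the condition can also be assembled from Column expressions (assuming df and the usual pyspark.sql.functions import are in scope); note that rows with NULL in any column are dropped as well, since a NULL comparison never evaluates to true:

from functools import reduce
from pyspark.sql import functions as F

# Combine "column != 0" for every column with logical AND
condition = reduce(lambda acc, c: acc & (F.col(c) != 0), df.columns, F.lit(True))
df.filter(condition).show()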

Condition using df index range

Given two dataframes, I want to add a region column to df1 using the IP ranges stated in df2:
df1 has ip_address
df2 has ip_from, ip_to, region
Can you make a conditional statement using indexes?
If df1.ip[0] falls within a specific IP range, add the region to df1?
I am guessing there should be a loop in the if statement so that it can loop through df2 to see where the IP ranges are and grab the region.
I know that adding each condition manually, as below, will work,
but is there a way to make the condition loop by index?
region = []
for row in df1['ip']:
    if row > 15:
        region.append('D')
    elif row > 10:
        region.append('C')
    elif row > 5:
        region.append('B')
    else:
        region.append('A')
df1['region'] = region
To make it iterate through rows, can it be done this way?
region = []
# For each row in the column,
for row in df1['ip']:
    if (row >= df2.loc[row, 'ip_from']) and (row <= df2.loc[row, 'ip_to']):
        region.append(df2.loc[row, 'region'])
You can use a list of regions whose indices match the IP address values:
df1 = {'ip': [13, 4, 7, 2], 'region': []}
df2 = list('AAAAAABBBBBCCCCCDDDDD')
for i in range(len(df1['ip'])):
    df1['region'].append(df2[df1['ip'][i]])
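If df2 really holds ip_from/ip_to ranges, a range lookup with a pandas IntervalIndex is a sketch closer to the original question (this assumes df2's ip_from/ip_to columns and df1's ip column are numeric and the ranges do not overlap):

import pandas as pd

intervals = pd.IntervalIndex.from_arrays(df2['ip_from'], df2['ip_to'], closed='both')
idx = intervals.get_indexer(df1['ip'])  # -1 where an ip falls in no range
df1['region'] = [df2['region'].iloc[i] if i >= 0 else None for i in idx]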

How to drop single-valued columns efficiently from a dataframe

How do I efficiently drop all columns which have a single value from a dataframe?
I found two ways:
This method ignores nulls and only considers other values; I need to consider nulls in my case:
# apply countDistinct on each column
col_counts = partsDF.agg(*(countDistinct(col(c)).alias(c) for c in partsDF.columns)).collect()[0].asDict()
This method takes too long:
col_counts = {c: partsDF.select(c).distinct().count() for c in partsDF.columns}
# select the cols with count=1 in an array
cols_to_drop = [col for col in partsDF.columns if col_counts[col] == 1]
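A sketch that keeps the single-pass agg approach but also counts NULL as a distinct value, by replacing nulls with a sentinel string before counting (assumes partsDF as above; '__NULL__' is just an illustrative placeholder):

from pyspark.sql import functions as F

col_counts = partsDF.agg(*(
    F.countDistinct(F.coalesce(F.col(c).cast("string"), F.lit("__NULL__"))).alias(c)
    for c in partsDF.columns
)).collect()[0].asDict()

cols_to_drop = [c for c in partsDF.columns if col_counts[c] == 1]
partsDF = partsDF.drop(*cols_to_drop)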

return columns where dataframes differ in values

I have two dataframes like the df1 and df2 examples below. I would like to compare values between the dataframes and return the columns where they have different values. So in the example below it would return column B. Any tips are greatly appreciated.
df1
A B C
1 2 3
1 1 1
df2
A B C
1 1 3
1 1 1
Comparing dataframes using != or ne() returns a boolean dataframe, on which you can look for any True values using any(). This returns a boolean Series which you can index with itself.
s = (df1 != df2).any()
s[s].index
Index(['B'], dtype='object')
In your example, using eq with all:
df1.eq(df2).all().loc[lambda x : ~x].index
Out[720]: Index(['B'], dtype='object')
