Pandas groupby: coordinates of current group - python-3.x

Suppose I have a data frame
import pandas as pd
df = pd.DataFrame({'group':['A','A','B','B','C','C'],'score':[1,2,3,4,5,6]})
At first, say, I want to compute the groups' sums of scores. I usually do
def group_func(x):
    d = {}
    d['sum_scores'] = x['score'].sum()
    return pd.Series(d)

df.groupby('group').apply(group_func).reset_index()
Now suppose I want to modify group_func but this modification requires that I know the group identity of the current input x. I tried x['group'] and x[group].iloc[0] within the function's definition and neither worked.
Is there a way for the function group_func(x) to know the defining coordinates of the current input x?
In this toy example, say, I just want to get:
pd.DataFrame({'group':['A','B','C'],'sum_scores':[3,7,11],'name_of_group':['A','B','C']})
where obviously the last column just repeats the first one. I'd like to know how to make this last column using a function like group_func(x). Like: as group_func processes the x that corresponds to group 'A' and generates the value 3 for sum_scores, how do I extract the current identity 'A' within the local scope of group_func?

Just add .name
def group_func(x):
    d = {}
    d['sum_scores'] = x['score'].sum()
    d['group_name'] = x.name  # d['group_name'] = x['group'].iloc[0]
    return pd.Series(d)

df.groupby('group').apply(group_func)
Out[63]:
       sum_scores group_name
group
A               3          A
B               7          B
C              11          C
As for fixing your own attempt, see the line marked with the comment above: x[group] just needs quotes, i.e. x['group'].iloc[0].
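For a reduction like this you can also skip apply entirely; a minimal sketch using named aggregation (available since pandas 0.25), where name_of_group is only there to mirror the toy example:
import pandas as pd

df = pd.DataFrame({'group': ['A', 'A', 'B', 'B', 'C', 'C'],
                   'score': [1, 2, 3, 4, 5, 6]})

# one row per group; the group key comes back as an ordinary column
out = df.groupby('group').agg(sum_scores=('score', 'sum')).reset_index()
out['name_of_group'] = out['group']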

Related

Iterate and return based on priority - Python 3

I have a df with multiple rows. What I need is to check for a specific value in a column and return the record if there is a match. I have a set of rules, which take priority based on order.
My Sample df:
file_name fil_name
0 02qbhIPSYiHmV_sample_file-MR-job1 02qbhIPSYiHmV
1 02qbhIPSYiHmV_sample_file-MC-job2 02qbhIPSYiHmV
2 02qbhIPSYiHmV_sample_file-job3 02qbhIPSYiHmV
For me, MC takes first priority. If MC is present in the file_name value, take that record. If MC is not there, then take the record that has MR in it. If there is no MC or MR, then just take whatever is there, in my case just the third row.
I came up with a function like this,
def choose_best_record(df_t):
    file_names = df_t['file_name']
    for idx, fn in enumerate(file_names):
        lw_fn = fn.lower()
        if '-mc-' in lw_fn:
            get_mc_row = df_t.iloc[idx:idx+1]
            print("Returning MC row")
            return get_mc_row
        else:
            if '-mr-' in lw_fn:
                get_mr_row = df_t.iloc[idx:idx+1]
                print('Returning MR row')
                return get_mr_row
            else:
                normal_row = df_t.iloc[idx:idx+1]
                print('Returning normal row')
                return normal_row
However, this does not behave the way I want. I need MC (row index 1); instead, it returns the MR row.
If my rows happen to be ordered so that the MC row comes first, then it works. How can I change my function to work based on the priority I need for my output?
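A sketch of one way to honor the priority order: test each rule against the whole group before falling through to the next rule, instead of returning from inside the first loop iteration. Only the function name and the file_name column come from the question; the rest is an assumption about what the "best" record means here:
def choose_best_record(df_t):
    lower_names = df_t['file_name'].str.lower()
    # rule 1: MC takes top priority, anywhere in the group
    mc_mask = lower_names.str.contains('-mc-')
    if mc_mask.any():
        return df_t[mc_mask].head(1)
    # rule 2: otherwise fall back to MR
    mr_mask = lower_names.str.contains('-mr-')
    if mr_mask.any():
        return df_t[mr_mask].head(1)
    # rule 3: otherwise take whatever comes first
    return df_t.head(1)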

Look up a number inside a list within a pandas cell, and return corresponding string value from a second DF

(I've edited the first column name in the labels_df for clarity)
I have two DataFrames, train_df and labels_df. train_df has integers that map to attribute names in the labels_df. I would like to look up each number within a given train_df cell and return in the adjacent cell, the corresponding attribute name from the labels_df.
So for example, the first observation in train_df has attribute_ids of 147, 616 and 813, which map (in the labels_df) to culture::french, tag::dogs, tag::men. And I would like to place those strings inside one cell on the same row as the corresponding integers.
I've tried variations of the function below but fear I am wayyy off:
def my_mapping(df1, df2):
    tags = df1['attribute_ids']
    for i in tags.iteritems():
        df1['new_col'] = df2.iloc[i]
    return df1
The data are originally from two csv files:
train.csv
labels.csv
I tried this from #Danny:
sample_train_df['attribute_ids'].apply(
    lambda x: [sample_labels_df[sample_labels_df['attribute_name'] == i]['attribute_id_num']
               for i in x])
*please note - I am running the above code on samples of each DF due to run times on the original DFs.
which returned:
I hope this is what you are looking for. I am sure there's a much more efficient way using lookup.
df['new_col'] = df['attribute_ids'].apply(lambda x: [labels_df[labels_df['attribute_id'] == i]['attribute_name'] for i in x])
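One caveat with the line above: the inner boolean selection returns a one-element Series, not the string itself, so each list ends up holding Series objects. A hedged tweak of the same idea that pulls out the plain values with .iloc[0]:
df['new_col'] = df['attribute_ids'].apply(
    lambda x: [labels_df.loc[labels_df['attribute_id'] == i, 'attribute_name'].iloc[0]
               for i in x])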
This is super ugly and one day, hopefully sooner rather than later, I'll be able to accomplish this task in an elegant fashion; until then, this is what got me the result I need.
1. Split train_df['attribute_ids'] into their own cells/columns:
helper_df = train_df['attribute_ids'].str.split(expand=True)
2. Combine train_df with the helper_df so I keep the id column (they are photo ids):
train_df2 = pd.concat([train_df, helper_df], axis=1)
3. Drop the original attribute_ids column:
train_df2.drop(columns='attribute_ids', inplace=True)
4. Rename the new columns (the result has to be assigned back, otherwise the rename is lost):
train_df2 = train_df2.rename(columns={0: 'attr1', 1: 'attr2', 2: 'attr3', 3: 'attr4',
                                      4: 'attr5', 5: 'attr6', 6: 'attr7', 7: 'attr8',
                                      8: 'attr9', 9: 'attr10', 10: 'attr11'})
5. Convert the labels_df into a dictionary:
def create_file_mapping(df):
    mapping = dict()
    for i in range(len(df)):
        name, tags = df['attribute_id_num'][i], df['attribute_name'][i]
        mapping[str(name)] = tags
    return mapping

my_map = create_file_mapping(labels_df)
6. Map and replace the tag numbers with their corresponding tag names:
train_df3 = train_df2.applymap(lambda s: my_map.get(s) if s in my_map else s)
7. Create a new column with each observation's tags as one concatenated string:
train_df3['new_col'] = train_df3[train_df3.columns[0:10]].apply(lambda x: ','.join(x.astype(str)), axis=1)
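For what it's worth, steps 5-7 hint at a shorter route: build the dictionary once (my_map from step 5) and apply it straight to the un-split attribute_ids strings. A sketch, assuming attribute_ids holds space-separated ids and every id appears in labels_df:
# map each id to its name and join, all in one pass
train_df['new_col'] = train_df['attribute_ids'].apply(
    lambda s: ','.join(my_map[i] for i in s.split()))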

Question on calculating incoming data from file

When reading a data file, I need to calculate running totals of the different items by adding them up line by line. For example:
Fruit,Number
banana,25
apple,12
kiwi,29
apple,44
apple,81
kiwi,3
banana,109
kiwi,113
kiwi,68
We would need to add a third variable, which is a running total per fruit, and a fourth, a running total of all fruits.
So the output should be like following:
Fruit,Number,TotalFruit,TotalAllFruits
banana,25,25,25
apple,12,12,37
kiwi,29,29,66
apple,44,56,110
apple,81,137,191
kiwi,3,32,194
banana,109,134,303
kiwi,113,145,416
kiwi,68,213,484
I was able to get the first 2 columns printed, but I'm having problems with the last 2 columns:
import sys
import re

f1 = open("SampleInput.csv", "r")
f2 = open('SampleOutput.csv', 'a')
sys.stdout = f2

print("Fruit,Number,TotalFruit,TotalAllFruits")
for line1 in f1:
    fruit_list = line1.split(',')
    exec("%s = %d" % (fruit_list[1], 0))
    print(fruit_list[0] + ',' + fruit_list[1])
I am just learning python, so I want to apologize in advance if I am missing something very simple.
You need to declare a 2d array to keep the values read from the input file. During the loop, read the values accumulated from previous lines, then calculate the values for the current line. Print the 2d array once all input lines have been read.
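A plain-Python sketch of that idea, keeping a dict of running totals per fruit (a dict rather than a literal 2d array) plus a grand total; file and column names follow the question:
totals = {}       # running total per fruit
grand_total = 0   # running total across all fruits

with open("SampleInput.csv") as f_in, open("SampleOutput.csv", "w") as f_out:
    next(f_in)  # skip the "Fruit,Number" header
    f_out.write("Fruit,Number,TotalFruit,TotalAllFruits\n")
    for line in f_in:
        fruit, number = line.strip().split(',')
        number = int(number)
        totals[fruit] = totals.get(fruit, 0) + number
        grand_total += number
        f_out.write(f"{fruit},{number},{totals[fruit]},{grand_total}\n")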
I would recommend using the pandas library, as it makes this kind of processing easier:
import pandas as pd

df1 = pd.read_csv("SampleInput.csv", sep=",")
df2 = pd.DataFrame()

for index, row in df1.iterrows():
    pass  # change this loop to whatever per-row logic you need

df2['Totalsum'] = df1['TotalFruit'] + df1['TotalAllFruits']
df2['Fruit'] = df1['Fruit']
df2.to_csv('SampleOutput.csv', sep=",")
df2 format:
Fruit | Totalsum
------|---------
Name  | Sum
Feel free to change the number of columns to your needs and add your custom logic.
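If pandas is the route taken, note that the two extra columns in the expected output are just cumulative sums, one per fruit and one overall, so a groupby sketch likely gets there directly:
import pandas as pd

df = pd.read_csv("SampleInput.csv")
df['TotalFruit'] = df.groupby('Fruit')['Number'].cumsum()   # running total per fruit
df['TotalAllFruits'] = df['Number'].cumsum()                # running total overall
df.to_csv("SampleOutput.csv", index=False)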

How to populate a dataframe column based on the value of another column

Suppose I have 3 dataframe variables: res_df_union is the main dataframe and df_res and df_vacant are subdataframes created from res_df_union. They all share 2 columns called uniqueid and vacant_has_res. My goal is to compare the uniqueid column values in df_res and df_vacant, and if they match, to assign vacant_has_res in res_df_union with the value of 1.
*Note: I am using geoPandas (gpd Dataframe) instead of just pandas because I am working with spatial data but the concept is the same.
res_df_union = gpd.read_file(union, layer=union_feat)

df_parc_res = res_df_union[res_df_union.Parc_Res == 1]
unq_id_res = df_parc_res.uniqueid.unique()

df_parc_vacant = res_df_union[res_df_union.Parc_Vacant == 1]
unq_id_vac = df_parc_vacant.uniqueid.unique()

vacant_res_ids = []
for id_a in unq_id_vac:
    for id_b in unq_id_res:
        if id_a == id_b:
            vacant_res_ids.append(id_a)
The code up to this point works. I have a list of uniqueid's that match. Now I just want to look for those unique id's in res_df_union and then assign res_df_union['vacant_has_res'] = 1. When I run the following, it either causes my IDE to crash, or never finishes running (after several hours). What am I doing wrong and is there a more efficient way to do this?
def u_row(row, id_val):
    if row['uniqueid'] == id_val:
        return 1

for item in res_df_union['uniqueid']:
    if item in vacant_res_ids:
        res_df_union['Has_Res_Association'] = res_df_union.apply(lambda row: u_row(row, item), axis=1)
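The apply inside the loop is the likely culprit: it recomputes a whole column over the entire frame once per matching id, so the cost grows with rows times matches. A vectorized sketch of the same assignment, with a set intersection standing in for the nested loops as well:
# ids present in both subsets, no nested loops needed
vacant_res_ids = set(unq_id_vac) & set(unq_id_res)

# flag every matching row in a single pass
res_df_union['vacant_has_res'] = res_df_union['uniqueid'].isin(vacant_res_ids).astype(int)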

Replacing the 0's in a column with mode corresponding to another column python

The above data frame has 9 products of 3 different classes. The attributes of these products are quality and taste.
For some of the products an attribute reads 0, which is wrong; it has to be replaced by the mode of its class, as in the figure below.
I have grouped the data to get the mode per class:
data.groupby('class')[['quality', 'taste']].agg(lambda x: x.value_counts().index[0])
But kindly help me to replace the 0s with the mode corresponding to its class
You can use transform with replace:
data[['quality', 'taste']] = data.groupby('class')[['quality', 'taste']].transform(lambda x: x.replace(0, x.value_counts().index[0]))
Or use custom function:
def f(x):
    a = x.value_counts().index[0]
    m = x == 0
    x[m] = a
    return x

data[['quality', 'taste']] = data.groupby('class')[['quality', 'taste']].transform(f)
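Since the original frame isn't shown, a minimal made-up reproduction to sanity-check the transform (values are invented; the 0s mark the bad readings):
import pandas as pd

data = pd.DataFrame({'class':   [1, 1, 1, 2, 2, 2, 3, 3, 3],
                     'quality': [5, 0, 5, 7, 7, 0, 9, 9, 0],
                     'taste':   [0, 3, 3, 4, 0, 4, 8, 0, 8]})

# each 0 becomes the most frequent value of its class and column
data[['quality', 'taste']] = (data.groupby('class')[['quality', 'taste']]
                              .transform(lambda x: x.replace(0, x.value_counts().index[0])))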
