Pandas - Get the first n rows if a column has a specific value - python-3.x

I have a DataFrame with 5 columns, including User and MP.
I need to extract a sample of n rows for each User, where n is a percentage of that User's rows (if a User has 1000 entries and n is 5, select the first 50 rows, then move on to the next User). After that I have to add all the samples to a new DataFrame. Also, if a User has multiple values in the column MP, the percentage should be split between them: for example, if the User has 2 values in the column MP, select 2.5% for one value and 2.5% for the other.
Somehow my logic isn't that good (I started with the first step, without adding the logic for multiple MPs):
df = pd.read_excel("Results/fullData.xlsx")
dfSample = pd.DataFrame()
uniqueValues = df['User'].unique()
print(uniqueValues)
n = 5
for u in uniqueValues:
    sm = df["User"].str.count(u).sum()
    print(sm)
    for u in df['User']:
        sample = df.head(int(sm*(n/100)))
        #print(sample)
        dfSample.append(sample)
print(dfSample)
dfSample.to_excel('testFinal.xlsx')

Check the example below. It is intentionally verbose for the sake of understanding. The column that solves the problem is "ROW_PERC": you can filter on it according to the requirement (e.g. 50% or 25% of rows) for each USR/MP.
import pandas as pd

df = pd.DataFrame({'USR':[1,1,1,1,2,2,2,2],'MP':['A','A','A','A','B','B','A','A'],"COL1":[1,2,3,4,5,6,7,8]})
# Position (rank) of each row within its USR/MP group
df['USR_MP_RANK'] = df.groupby(['USR','MP']).rank()
# Size of each USR/MP group (the highest rank within it)
df['USR_MP_RANK_MAX'] = df.groupby(['USR','MP'])['USR_MP_RANK'].transform('max')
# Relative position of the row within its group, in the range (0, 1]
df['ROW_PERC'] = df['USR_MP_RANK']/df['USR_MP_RANK_MAX']
df
Output:

   USR MP  COL1  USR_MP_RANK  USR_MP_RANK_MAX  ROW_PERC
0    1  A     1          1.0              4.0      0.25
1    1  A     2          2.0              4.0      0.50
2    1  A     3          3.0              4.0      0.75
3    1  A     4          4.0              4.0      1.00
4    2  B     5          1.0              2.0      0.50
5    2  B     6          2.0              2.0      1.00
6    2  A     7          1.0              2.0      0.50
7    2  A     8          2.0              2.0      1.00
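To connect this back to the original question: once ROW_PERC exists, the per-group sampling is a single filter. A minimal sketch, assuming n = 5 means "keep the first 5% of rows within each USR/MP group" (adjust the threshold, e.g. to 0.025, if the percentage must instead be split across a user's MP values):

# Keep only rows whose relative position within their USR/MP group falls in the first 5%.
# (On the tiny example above every group is too small for 5% to select anything;
#  this is meant for the real data, where each User has many rows.)
dfSample = df[df['ROW_PERC'] <= 0.05]
dfSample.to_excel('testFinal.xlsx')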

Related

How to calculate a new column from a large Dataframe using Dictionary in a custom function?

I have a dataframe df with 700 million rows and three columns in the following format
key_x key_y num
0 1 1 111.111
1 1 2 222.222
2 1 3 333.333
:
I have a dictionary dict in which all the values appearing in key_x and key_y are stored as keys.
I need to create a new column such that, for each row in df:
df['result'] = df['num'] / (dict[key_x] * dict[key_y])
My current approach is to vectorize it as follows:
def find_res(key_x, key_y, num):
    return num / (dict[key_x] * dict[key_y])
df["result"] = np.vectorize(find_res)(df["key_x"], df["key_y"], df["num"])
However, this approach is too slow. I have around 500 GB of RAM, so memory is not an issue. Is there a more efficient way to perform the same operation?
You can use map:
df['result'] = df['num'] / (df['key_x'].map(your_dict) * df['key_y'].map(your_dict) )
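A minimal, self-contained sketch of that map() call, using a made-up dictionary and a few rows shaped like the example above:

import pandas as pd

lookup = {1: 2.0, 2: 4.0, 3: 5.0}   # hypothetical dictionary keyed by key_x / key_y values
df = pd.DataFrame({'key_x': [1, 1, 1],
                   'key_y': [1, 2, 3],
                   'num': [111.111, 222.222, 333.333]})

# map() translates every key to its dictionary value as a vectorized Series operation,
# so no Python function is called once per row.
df['result'] = df['num'] / (df['key_x'].map(lookup) * df['key_y'].map(lookup))
print(df)

On 700 million rows this stays inside pandas/NumPy the whole time, which is what makes it much faster than np.vectorize.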

Separating data based on cell value

I have data as below
Account-Num Date Dr Cr
123 29-04-2020 100
123 28-04-2020 50
258 28-04-2020 75
258 29-04-2020 30
How do I separate the data for each account number and save it to a new sheet or file?
I have tried and came up with the following code:
import pandas as pd
soa = pd.read_excel('ubl.xlsx')
acc = '218851993'
df2 = soa.where(soa['ACCT_NO']== acc)
df2.to_csv('C:/Users/user/Desktop/mcb/D/HBL/UBL/' + acc + '.csv',index=False)
but it is generating the following error:
AttributeError: 'function' object has no attribute 'to_csv'
You can use a pivot table.
In the rows put all your dates. In the columns, put the account numbers. You can then add the DR and CR columns to your values, making sure you sum them.
This will then aggregate all information, per date, for each account number.
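A rough sketch of that pivot table, assuming columns named Date, Dr and Cr as in the sample data above and the ACCT_NO column from the posted code:

import pandas as pd

soa = pd.read_excel('ubl.xlsx')
# Dates as rows, account numbers as columns, Dr/Cr amounts summed per cell.
summary = soa.pivot_table(index='Date', columns='ACCT_NO',
                          values=['Dr', 'Cr'], aggfunc='sum')
summary.to_excel('per_account_summary.xlsx')   # hypothetical output file name

If separate files per account are still required, iterating over soa.groupby('ACCT_NO') and calling to_csv on each group would achieve that directly.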

Normalize Column Values by Monthly Averages with added Group dimension

Initial Note
I already have this running, but it takes a very long time to execute. My DataFrame is around 500 MB. I am hoping to hear some feedback on how to execute this as quickly as possible.
Problem Statement
I want to normalize the DataFrame columns by the mean of the column's values during each month. An added complexity is that I have a column named group which denotes a different sensor in which the parameter (column) was measured. Therefore, the analysis needs to iterate around group and each month.
DF example
X Y Z group
2019-02-01 09:30:07 1 2 1 'grp1'
2019-02-01 09:30:23 2 4 3 'grp2'
2019-02-01 09:30:38 3 6 5 'grp1'
...
Code (Functional, but slow)
This is the code that I used. Coding annotations provide descriptions of most lines. I recognize that the three for loops are causing this runtime issue, but I do not see a way around them. Does anyone know of one?
# Get mean monthly values for each group
mean_per_month_unit = process_df.groupby('group').resample('M', how='mean')
# Store the monthly dates created in last line into a list called month_dates
month_dates = mean_per_month_unit.index.get_level_values(1)
# Place date on multiIndex columns. future note: use df[DATE, COL_NAME][UNIT] to access mean value
mean_per_month_unit = mean_per_month_unit.unstack().swaplevel(0,1,1).sort_index(axis=1)
divide_df = pd.DataFrame().reindex_like(df)
process_cols.remove('group')
for grp in group_list:
    print(grp)
    # Iterate through month
    for mnth in month_dates:
        # Make mask where month and group
        mask = (df.index.month == mnth.month) & (df['group'] == grp)
        for col in process_cols:
            # Set values of divide_df
            divide_df.iloc[mask.tolist(), divide_df.columns.get_loc(col)] = mean_per_month_unit[mnth, col][grp]
# Divide process_df with divide_df
final_df = process_df / divide_df.values
EDIT: Example data
Here is the data in CSV format.
EDIT2: Current code (according to current answer)
def normalize_df(df):
    df['month'] = df.index.month
    print(df['month'])
    df['year'] = df.index.year
    print(df['year'])

    def find_norm(x, df_col_list):  # x is a row in dataframe, col_list is the list of columns to normalize
        agg = df.groupby(by=['group', 'month', 'year'], as_index=True).mean()
        print("###################", x.name, x['month'])
        for column in df_col_list:  # iterate over col list, find mean from aggregations, and divide the value by the mean
            print(column)
            mean_col = agg.loc[(x['group'], x['month'], x['year']), column]
            print(mean_col)
            col_name = "norm" + str(column)
            x[col_name] = x[column] / mean_col  # norm
        return x

    normalize_cols = df.columns.tolist()
    normalize_cols.remove('group')
    #normalize_cols.remove('mode')
    df2 = df.apply(find_norm, df_col_list = normalize_cols, axis=1)
The code runs perfectly for one iteration and then it fails with the error:
KeyError: ('month', 'occurred at index 2019-02-01 11:30:17')
As I said, it runs correctly once. However, it iterates over the same row again and then fails. I see according to df.apply() documentation that the first row always runs twice. I'm just not sure why this fails on the second time through.
Assuming that the requirement is to normalize each column by its group's monthly mean, here is another approach:
Create new columns month and year from the index. df.index.month can be used for this, provided the index is of type DatetimeIndex:
type(df.index) # df is the original dataframe
#pandas.core.indexes.datetimes.DatetimeIndex
df['month'] = df.index.month
df['year'] = df.index.year # added year assuming the grouping occurs per grp per month per year. No need to add this column if year is not to be considered.
Now, group over (grp, month, year) and aggregate to find the mean of every column. (Year is added assuming the grouping occurs per grp, per month, per year; there is no need to add this column if year is not to be considered.)
agg = df.groupby(by=['grp', 'month', 'year'], as_index=True).mean()
Use a function to calculate the normalized values and use apply() over the original dataframe
def find_norm(x, df_col_list):  # x is a row in dataframe, col_list is the list of columns to normalize
    for column in df_col_list:  # iterate over col list, find mean from aggregations, and divide the value by the mean.
        mean_col = agg.loc[(str(x['grp']), x['month'], x['year']), column]
        col_name = "norm" + str(column)
        x[col_name] = x[column] / mean_col  # norm
    return x
df2 = df.apply(find_norm, df_col_list = ['A','B','C'], axis=1)
#df2 will now have 3 additional columns - normA, normB, normC
df2:
A B C grp month year normA normB normC
2019-02-01 09:30:07 1 2 3 1 2 2019 0.666667 0.8 1.5
2019-03-02 09:30:07 2 3 4 1 3 2019 1.000000 1.0 1.0
2019-02-01 09:40:07 2 3 1 2 2 2019 1.000000 1.0 1.0
2019-02-01 09:38:07 2 3 1 1 2 2019 1.333333 1.2 0.5
Alternatively, for step 3, one can join the agg and df dataframes and find the norm; a vectorised sketch along those lines appears after the code below.
Hope this helps!
Here is how the code would look:
# Step 1
df['month'] = df.index.month
df['year'] = df.index.year # added year assuming the grouping occurs per grp per month per year
# Step 2
agg = df.groupby(by=['grp', 'month', 'year'], as_index=True).mean()
# Step 3
def find_norm(x, df_col_list):  # x is a row in dataframe, col_list is the list of columns to normalize
    for column in df_col_list:  # iterate over col list, find mean from aggregations, and divide the value by the mean.
        mean_col = agg.loc[(str(x['grp']), x['month'], x['year']), column]
        col_name = "norm" + str(column)
        x[col_name] = x[column] / mean_col  # norm
    return x
df2 = df.apply(find_norm, df_col_list = ['A','B','C'], axis=1)
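As mentioned above, the means in agg can also be brought back onto df instead of applying find_norm row by row. A closely related, fully vectorised variant uses groupby().transform('mean') to obtain the per-row group means directly; this is a sketch, with column names 'A', 'B', 'C' taken from the example above:

# Per-row group means, aligned with df's own index
cols = ['A', 'B', 'C']
means = df.groupby(['grp', 'month', 'year'])[cols].transform('mean')
for column in cols:
    df['norm' + column] = df[column] / means[column]

This should produce the same normA/normB/normC columns as the apply() version while staying inside pandas' vectorised operations, which matters for a 500 MB frame.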

calculate percentage of occurrences in column pandas

I have a column with thousands of rows. I want to select the most significant ones. Let's say I want to select all the rows that would represent 90% of my sample. How would I do that?
I have a dataframe with 2 columns: one for product_id and one showing whether it was purchased or not (the value is 0 or 1).
product_id purchased
a 1
b 0
c 0
d 1
a 1
. .
. .
With df['product_id'].value_counts() I can get all my product_ids ranked by number of occurrences.
Let's say now I want to get the number of product_ids that I should consider in my future analysis, such that they represent 90% of the total occurrences.
Is there a way to do that?
If you want all product_id values whose cumulative share of counts is under 0.9, use:
s = df['product_id'].value_counts(normalize=True).cumsum()
df1 = df[df['product_id'].isin(s.index[s < 0.9])]
Or if you want all rows sorted by counts and then take 90% of them:
s1 = df['product_id'].map(df['product_id'].value_counts()).sort_values(ascending=False)
df2 = df.loc[s1.index[:int(len(df) * 0.9)]]
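A small, self-contained illustration of the first approach, with made-up product_ids so the numbers are easy to follow:

import pandas as pd

df = pd.DataFrame({'product_id': ['a', 'a', 'a', 'b', 'b', 'c'],
                   'purchased':  [1, 0, 1, 0, 1, 0]})

# Share of each product_id, accumulated from the most frequent downwards:
# a -> 0.50, b -> 0.83, c -> 1.00
s = df['product_id'].value_counts(normalize=True).cumsum()

# With the 0.9 threshold this keeps the rows for 'a' and 'b' and drops 'c'.
df1 = df[df['product_id'].isin(s.index[s < 0.9])]
print(df1)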

pandas - grouping values by pair of columns and pivoting

I've been struggling to think what to do here; pivoting and melting and whatnot don't seem to be working out. I was trying to join the names of the to/from destinations together and then re-order the combined names, but it was a total mess.
My data concerns flows from one location to another, it's in the format:
pd.DataFrame(columns=['from_location','to_location','flow'],data =[['a','b',1],['b','a',3]])
from_location to_location flow
0 a b 1
1 b a 3
but my output needs to be the format:
pd.DataFrame(columns=['connection','flow','back flow','net'],data =[['a -> b',1,3,2]])
connection flow back flow net
0 a -> b 1 3 2
Any nice built in functions that can rearrange things like this? I'm not even sure what keywords to search by
Use:
import numpy as np

#df = df.sort_values(['from_location','to_location'])
df1 = pd.DataFrame(np.sort(df[['from_location','to_location']], axis=1),
                   columns=list('ab'), index=df.index)
s = df1['a'] + ' -> ' + df1['b']
df2 = df.groupby(s)['flow'].agg(['first','last']).assign(net=lambda x: x['last'] - x['first'])
print (df2)
        first  last  net
a -> b      1     3    2
Explanation:
If necessary, first use sort_values, in case some paired rows are swapped.
Sort the two location columns within each row with numpy.sort and join them together with a separator.
Then group by the joined values and aggregate with agg using first and last.
Last, if you need the difference, subtract the columns and add the new column with assign.
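If the result should match the column names asked for in the question, a final cosmetic step is just a rename plus reset_index:

out = (df2.rename(columns={'first': 'flow', 'last': 'back flow'})
          .rename_axis('connection')
          .reset_index())
#   connection  flow  back flow  net
# 0     a -> b     1          3    2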
