How to get multiple aggregations in a dataframe? cumsum and count columns - python-3.x

I need one column that aggregates using the count() function and another that uses the cumsum() function in a dataframe.
I would like to group only once, and the cumsum should be grouped by Site just like the count. How can I do this?
# I get the count by grouping Site and Arrived
df_arrived_gby = df.groupby(['Site', 'Arrived']).size().reset_index(name='Count_X')
# I do the cumsum, but it should be grouped by Site and Arrived same as above
# How can I do this?
df_arrived_gby['Cumsum_X'] = df_arrived_gby['Count_X'].cumsum()
print(df_arrived_gby)
Data example (the cumsum is not grouped by Site, so it keeps accumulating across the other sites):
    Site     Arrived  Count  Cumsum
198    T  30/06/2020    146   22368
199    T  31/05/2020     76   22444
200    V  05/01/2020     77   22521
201    V  05/02/2020     57   22578

First you need to get the values from the Count_X column, then you can cumsum():
df_arrived_gby['Cumsum_X'] = df_arrived_gby.Count_X.values.cumsum()
Let me know if that helps

I was able to do it using groupby on a new dataframe column, as shown below:
df_arrived_gby['Cumsum'] = df_arrived_gby.groupby(['Site'])['Count_X'].apply(lambda x: x.cumsum())
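For reference, a minimal self-contained sketch of the grouped cumsum (the data here is made up to mirror the question, and GroupBy.cumsum replaces the lambda):
import pandas as pd

df = pd.DataFrame({'Site': ['T', 'T', 'T', 'V', 'V'],
                   'Arrived': ['30/06/2020', '30/06/2020', '31/05/2020',
                               '05/01/2020', '05/02/2020']})
df_arrived_gby = df.groupby(['Site', 'Arrived']).size().reset_index(name='Count_X')
# the cumulative sum now restarts for each Site
df_arrived_gby['Cumsum_X'] = df_arrived_gby.groupby('Site')['Count_X'].cumsum()
print(df_arrived_gby)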

Related

Create a column by Groupby and filter in python

I have a data frame with vendor, bill amount, and payment type.
I want to add a column containing the sum of late payments per Vendor.
Is it possible to write one line of code to get this output?
df['Paid Late by Vendor']=
You can use a combination of groupby.transform and bfill(), and assign back to a new column using assign:
df = df.assign(late_payments=df[df['Payment'].eq('Delay')].groupby('Vendor')['Amount'].transform('sum')).bfill()
Prints:
  Vendor Payment  Amount  late_payments
0      A  Ontime      91           78.0
1      A  Ontime      90           78.0
2      A   Delay      78           78.0
3      B  Ontime      58          166.0
4      B   Delay      77          166.0
5      B  Ontime      96          166.0
6      B   Delay      89          166.0
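A variant that avoids back-filling the whole frame (a hedged sketch, assuming the same 'Vendor', 'Payment' and 'Amount' columns):
late = df['Amount'].where(df['Payment'].eq('Delay'))
df['late_payments'] = late.groupby(df['Vendor']).transform('sum')
Series.where keeps the amount only on 'Delay' rows; transform('sum') then broadcasts each vendor's late total to every row, ignoring the NaNs.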
Let's define the dataframe:
data = pd.DataFrame({'Vendor': ['A', 'A', 'B', 'B'],
                     'Payment': ['Ontime', 'Delay', 'Ontime', 'Delay'],
                     'Paid Late by Vendor': [20, 21, 19, 18]})
To get the results you want, you need to create a separate dataframe with the grouped values and then combine it with the original.
Since you want a value for late payments only, filter the data to the wanted records before grouping.
reset_index() is used to turn the index back into a column (in this case the column we grouped on, Vendor):
groupedLateData = data[data['Payment']=='Delay'].groupby('Vendor')["Paid Late by Vendor"].sum().reset_index()
Then we merge the resulting dataframe with the original on the Vendor column
pd.merge(data, groupedLateData, on='Vendor')
and this would be the result:
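  Vendor Payment  Paid Late by Vendor_x  Paid Late by Vendor_y
0      A  Ontime                     20                     21
1      A   Delay                     21                     21
2      B  Ontime                     19                     18
3      B   Delay                     18                     18
(merge keeps both columns, adding the default _x/_y suffixes to disambiguate them)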

Separating data based on cell value

I have data as below
Account-Num Date Dr Cr
123 29-04-2020 100
123 28-04-2020 50
258 28-04-2020 75
258 29-04-2020 30
How do I separate the data for each account number and save it to a new sheet or file?
I have tried and came up with the following code:
import pandas as pd
soa = pd.read_excel('ubl.xlsx')
acc = '218851993'
df2 = soa.where(soa['ACCT_NO']== acc)
df2.to_csv('C:/Users/user/Desktop/mcb/D/HBL/UBL/' + acc + '.csv',index=False)
but it is generating the following error:
AttributeError: 'function' object has no attribute 'to_csv'
You can use a pivot table.
In the rows put all your dates. In the columns, put the account numbers. You can then add the DR and CR columns to your values, making sure you sum them.
This will then aggregate all information, per date, for each account number.
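A hedged sketch of that pivot (the column names 'ACCT_NO', 'Date', 'Dr' and 'Cr' are taken from the question and may need adjusting to your file):
import pandas as pd

soa = pd.read_excel('ubl.xlsx')
# dates as rows, account numbers as columns, Dr/Cr summed per cell
pivot = soa.pivot_table(index='Date', columns='ACCT_NO',
                        values=['Dr', 'Cr'], aggfunc='sum')
pivot.to_excel('per_account_summary.xlsx')  # hypothetical output path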

Grouping by substring of a column value in Pandas

While grouping a pandas dataframe, I've found a problem in the data that keeps it from grouping effectively, and now my grouping looks like this -
challenge                     count        mean
['acsc1', '[object Object]']      1    0.000000
['acsc1', 'undefined']            1    0.000000
['acsc1', 'wind-for']            99  379.284146
['acsc1']                        47   19.340045
['acsc10', 'wind-for']           73  370.148354
['acsc10']                       22  143.580856
How can I group the rows starting with acsc1 into one row (summing the other column values), acsc10 into another row, and so on? The desired result should look something like -
challenge  category  count    mean
acsc1      wind-for    148  398.62
acsc10     wind-for     95  513.72
But I know the category column might be a stretch with the noise in this column.
This should get you the result you requested initially (without the category column)
df.groupby(df.challenge.apply(lambda x: x.split(",")[0].strip("[']"))).sum().reset_index()
Output
  challenge  count        mean
0     acsc1    148  398.624191
1    acsc10     95  513.729210
We can do:
s = pd.DataFrame(df['challenge'].tolist(), index=df.index,
                 columns=['challenge', 'cate'])
d = {'cate': 'last', 'count': 'sum', 'mean': 'sum'}
df = pd.concat([df.drop(columns='challenge'), s], axis=1).\
    groupby('challenge').agg(d).reset_index()
Update: to fix the string-type lists, convert them with ast.literal_eval first:
import ast
df.challenge=df.challenge.apply(ast.literal_eval)
df.groupby(df.challenge.str[0]).sum()
           count        mean
challenge
acsc1        148  398.624191
acsc10        95  513.729210
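A minimal end-to-end sketch of that fix (the sample rows are made up to mirror the question):
import ast
import pandas as pd

df = pd.DataFrame({'challenge': ["['acsc1', 'wind-for']", "['acsc1']",
                                 "['acsc10', 'wind-for']"],
                   'count': [99, 47, 73],
                   'mean': [379.28, 19.34, 370.15]})
# turn the string-encoded lists into real Python lists
df['challenge'] = df['challenge'].apply(ast.literal_eval)
# group on the first list element and sum the numeric columns
print(df.groupby(df['challenge'].str[0])[['count', 'mean']].sum())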

Slicing specific rows of a column in pandas Dataframe

In the following data frame in Pandas, I want to extract the rows whose dates fall between '03/01' and '06/01'. I don't want to use the index at all, as my input would be a start and an end date. How could I do so?
       A    B
0  01/01   56
1  02/01   54
2  03/01   66
3  04/01   77
4  05/01   66
5  06/01   72
6  07/01  132
7  08/01  127
First create a list of the dates you need using date_range. I'm adding the year 2000 since you need to supply a year for this to work; I'm then cutting it off to get the desired strings. In real life you might want to pay attention to the actual year because of things like leap days.
date_start = '03/01'
date_end = '06/01'
dates = [x.strftime('%m/%d')
         for x in pd.date_range('2000/{}'.format(date_start),
                                '2000/{}'.format(date_end), freq='D')]
dates is now equal to:
['03/01',
'03/02',
'03/03',
'03/04',
.....
'05/29',
'05/30',
'05/31',
'06/01']
Then simply use isin and you are done:
df = df.loc[df.A.isin(dates)]
df
If your column is already a datetime column, I guess you can skip the strftime part in the list comprehension to get the right result.
You are welcome to use boolean masking, i.e.:
df[(df.A >= start_date) & (df.A <= end_date)]
Inside the brackets is a boolean array of True and False values. Only the rows that fulfill your condition (those that evaluate to True) are returned. This is a great tool to have, and it works well with pandas and numpy.
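A runnable sketch of that mask on the question's sample frame (string comparison works here because the zero-padded 'MM/DD' strings sort lexicographically):
import pandas as pd

df = pd.DataFrame({'A': ['01/01', '02/01', '03/01', '04/01',
                         '05/01', '06/01', '07/01', '08/01'],
                   'B': [56, 54, 66, 77, 66, 72, 132, 127]})
start_date, end_date = '03/01', '06/01'
print(df[(df.A >= start_date) & (df.A <= end_date)])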

Reorder Rows into Columns in Pandas (Python 3, Pandas)

Right now, my code takes scraped web data from a file (BigramCounter.txt), and then finds all the bigrams within that file so that the data looks like this:
Counter({('the', 'first'): 45, ('on', 'purchases'): 42, ('cash', 'back'): 39})
After this, I try to feed it into a pandas DataFrame where it spits this df out:
    the         on  cash
  first  purchases  back
0    45         42    39
This is very close to what I need, but not quite. First off, the DataFrame ignores my attempt to name the columns. Furthermore, I was hoping for something formatted more like this, where it's two columns and the words are not split between cells:
Words         Frequency
the first            45
on purchases         42
cash back            39
For reference, here is my code. I think I may need to reorder an axis somewhere, but I'm not sure how. Any ideas?
import re
from collections import Counter
import pandas as pd

main_c = Counter()
words = re.findall(r'\w+', open('BigramCounter.txt', encoding='utf-8').read())
bigrams = Counter(zip(words,words[1:]))
main_c.update(bigrams) #at this point it looks like Counter({('the', 'first'): 45, etc...})
comm = [[k,v] for k,v in main_c]
frame = pd.DataFrame(comm)
frame.columns = ['Word', 'Frequency']
frame2 = frame.unstack()
frame2.to_csv('text.csv')
I think I see what you're going for, and there are many ways to get there. You were really close. My first inclination would be to use a series, especially since you'd (presumably) just be getting rid of the df index when you write to csv, but it doesn't make a huge difference.
frequencies = [[" ".join(k), v] for k,v in main_c.items()]
pd.DataFrame(frequencies, columns=['Word', 'Frequency'])
           Word  Frequency
0     the first         45
1     cash back         39
2  on purchases         42
If, as I suspect, you want Word to be the index, add frame.set_index('Word'):
              Frequency
Word
the first            45
cash back            39
on purchases         42
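The Series route mentioned above would be a sketch along these lines (reusing main_c from the question):
import pandas as pd

# join each bigram tuple into one string and use it as the index
s = pd.Series({' '.join(k): v for k, v in main_c.items()}, name='Frequency')
s.index.name = 'Word'
s.to_csv('text.csv')  # writes Word,Frequency rows without the extra integer index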