Separating data based on cell value - Excel

I have data as below:
Account-Num  Date        Dr   Cr
123          29-04-2020  100
123          28-04-2020  50
258          28-04-2020  75
258          29-04-2020  30
How do I separate the data for each account number and save it to a new sheet or file?
I have tried and came up with the following code:
import pandas as pd
soa = pd.read_excel('ubl.xlsx')
acc = '218851993'
df2 = soa.where(soa['ACCT_NO']== acc)
df2.to_csv('C:/Users/user/Desktop/mcb/D/HBL/UBL/' + acc + '.csv',index=False)
but it is generating the following error:
AttributeError: 'function' object has no attribute 'to_csv'
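For reference, a groupby-based sketch that writes one CSV per account, assuming the ACCT_NO column and path from the snippet above:
import pandas as pd

soa = pd.read_excel('ubl.xlsx')
# groupby yields one (account number, sub-frame) pair per account
for acc, group in soa.groupby('ACCT_NO'):
    group.to_csv('C:/Users/user/Desktop/mcb/D/HBL/UBL/{}.csv'.format(acc), index=False)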

You can use a pivot table.
In the rows, put all your dates. In the columns, put the account numbers. You can then add the Dr and Cr columns to your values, making sure you sum them.
This will then aggregate all information, per date, for each account number.
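A minimal pivot_table sketch of that idea, assuming the column names from the question's code:
import pandas as pd

soa = pd.read_excel('ubl.xlsx')
# dates as rows, one column per account, Dr and Cr summed per date
pivot = soa.pivot_table(index='Date', columns='ACCT_NO', values=['Dr', 'Cr'], aggfunc='sum')
print(pivot)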

Related

Pandas - Get the first n rows if a column has a specific value

I have a DataFrame that has 5 columns including User and MP.
I need to extract a sample of n rows for each User, n being a percentage based on the User (if a User has 1000 entries and n is 5, select the first 50 rows and go to the next User). After that I have to add all the samples to a new DataFrame. Also, if a User has multiple values in the column MP, for example 2 values, select 2.5% for one value and 2.5% for the other.
Somehow my logic isn't that good (I started with the first step, without adding the logic for multiple MPs):
import pandas as pd

df = pd.read_excel("Results/fullData.xlsx")
dfSample = pd.DataFrame()
uniqueValues = df['User'].unique()
print(uniqueValues)
n = 5
for u in uniqueValues:
    sm = df["User"].str.count(u).sum()
    print(sm)
    for u in df['User']:
        sample = df.head(int(sm*(n/100)))
        #print(sample)
        dfSample.append(sample)
print(dfSample)
dfSample.to_excel('testFinal.xlsx')
Check the example below. It is intentionally verbose for understanding. The column that solves the problem is "ROW_PERC". You can filter on it based on the share of rows (50%, 25%, and so on) required for each USR/MP.
import pandas as pd

df = pd.DataFrame({'USR': [1, 1, 1, 1, 2, 2, 2, 2],
                   'MP': ['A', 'A', 'A', 'A', 'B', 'B', 'A', 'A'],
                   'COL1': [1, 2, 3, 4, 5, 6, 7, 8]})
# rank of each row within its USR/MP group
df['USR_MP_RANK'] = df.groupby(['USR', 'MP']).rank()
# size of each group, broadcast back onto every row
df['USR_MP_RANK_MAX'] = df.groupby(['USR', 'MP'])['USR_MP_RANK'].transform('max')
# relative position of the row within its group, in (0, 1]
df['ROW_PERC'] = df['USR_MP_RANK'] / df['USR_MP_RANK_MAX']
df
Output:

   USR MP  COL1  USR_MP_RANK  USR_MP_RANK_MAX  ROW_PERC
0    1  A     1          1.0              4.0      0.25
1    1  A     2          2.0              4.0      0.50
2    1  A     3          3.0              4.0      0.75
3    1  A     4          4.0              4.0      1.00
4    2  B     5          1.0              2.0      0.50
5    2  B     6          2.0              2.0      1.00
6    2  A     7          1.0              2.0      0.50
7    2  A     8          2.0              2.0      1.00
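To keep, say, the first 50% of rows from each USR/MP group, you can then filter on that column:
# keep rows in the first half of their USR/MP group
sample = df[df['ROW_PERC'] <= 0.5]
print(sample)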

How to get multiple aggregation in a dataframe? cumsum and count columns

I need one column that aggregates using the count() function and another that uses the cumsum() function in a DataFrame.
I would like to group only once, and the cumsum should be grouped by Site, just like the count. How can I do this?
#I get the count by grouping site and arrived
df_arrived_gby = df.groupby(['Site','Arrived']).size().reset_index(name='Count_X')
#I do the cumsum but it should be groupby Site and Arrived same as above
#How can I do this?
df_arrived_gby['Cumsum_X'] = df_arrived_gby['Count_X'].cumsum()
print(df_arrived_gby)
Data example (it is not grouped by Site, so it continues adding across the others):
     Site  Arrived     Count  Cumsum
198  T     30/06/2020  146    22368
199  T     31/05/2020  76     22444
200  V     05/01/2020  77     22521
201  V     05/02/2020  57     22578
First you need to get the values from the Count_X column, then you can cumsum():
df_arrived_gby['Cumsum_X'] = df_arrived_gby.Count_X.values.cumsum()
Let me know if that helps
I was able to do it using groupby on a new DataFrame column, as shown below:
df_arrived_gby['Cumsum'] = df_arrived_gby.groupby(['Site'])['Count_X'].apply(lambda x: x.cumsum())
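Note that the lambda isn't strictly needed; cumsum can be called on the groupby directly. A self-contained sketch with made-up numbers:
import pandas as pd

df_arrived_gby = pd.DataFrame({'Site': ['T', 'T', 'V', 'V'],
                               'Arrived': ['31/05/2020', '30/06/2020', '05/01/2020', '05/02/2020'],
                               'Count_X': [76, 146, 77, 57]})
# the running total restarts for each Site instead of spanning the whole column
df_arrived_gby['Cumsum_X'] = df_arrived_gby.groupby('Site')['Count_X'].cumsum()
print(df_arrived_gby)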

pandas is not summing a numeric column

I have read into a DataFrame an Excel spreadsheet with column names such as Gross, Fee, Net, etc. When I invoke the sum method on the resulting DataFrame, I saw that it was not summing the Fee column because several rows had string data in that column. So I first loop through each row testing that column to see if it contains a string and if it does, I replace it with a 0. The DataFrame sum method still does not sum the Fee column. Yet when I write out the resulting DataFrame to a new Excel spreadsheet and read it back in and apply the sum method to the resulting DataFrame, it does sum the Fee column. Can anyone explain this? Here is the code and the printed output:
import pandas as pd

pp = pd.read_excel('pp.xlsx')
# get rid of any strings in column 'Fee':
for i in range(pp.shape[0]):
    if isinstance(pp.loc[i, 'Fee'], str):
        pp.loc[i, 'Fee'] = 0
pd.to_numeric(pp['Fee'])  # added this but it makes no difference
# the Fee column is still not summed:
print(pp.sum(numeric_only=True))
print('\nSecond Spreadsheet\n')
# write out the DataFrame to an Excel spreadsheet:
with pd.ExcelWriter('pp2.xlsx') as writer:
    pp.to_excel(writer, sheet_name='PP')
# now read the spreadsheet back into another DataFrame:
pp2 = pd.read_excel('pp2.xlsx')
# the Fee column is summed:
print(pp2.sum(numeric_only=True))
Prints:
Gross                           8677.90
Net                             8572.43
Address Status                     0.00
Shipping and Handling Amount       0.00
Insurance Amount                   0.00
Sales Tax                          0.00
etc.

Second Spreadsheet

Unnamed: 0                    277885.00
Gross                           8677.90
Fee                             -105.47
Net                             8572.43
Address Status                     0.00
Shipping and Handling Amount       0.00
Insurance Amount                   0.00
Sales Tax                          0.00
etc.
Try using pd.to_numeric.
Ex:
pp = pd.read_excel('pp.xlsx')
print(pd.to_numeric(pp['Fee'], errors='coerce').dropna().sum())
The problem here is that the Fee column isn't numeric. So you need to convert it to a numeric field, save that updated field in the existing dataframe, and then compute the sum.
So that would be:
df = df.assign(Fee=pd.to_numeric(df['Fee'], errors='coerce'))
print(df.sum())
After a quick analysis, what I can see is that you are replacing the strings with an integer, while the remaining values of the 'Fee' column are floats, which means the dtype of that column is object. When you do pp.sum(numeric_only=True), it ignores the object column because of the numeric_only condition. Convert your column to float64 with pp['Fee'] = pd.to_numeric(pp['Fee']) and it should work for you.
The reason it works the second time is that Excel does the data conversion for you, so when you read the file back, the column already has a numeric dtype.
Everyone who has responded should get partial credit for telling me about pd.to_numeric. But they were all missing one piece. It is not sufficient to say pd.to_numeric(pp['Fee']). That returns the column converted to numeric but does not update the original DataFrame, so when I do pp.sum(), nothing in pp has been modified. You need:
pp['Fee'] = pd.to_numeric(pp['Fee'])
pp.sum()
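Putting the thread together, a minimal end-to-end version might look like this, using errors='coerce' to turn leftover strings into NaN and then filling with 0, which is what the original loop was trying to do:
import pandas as pd

pp = pd.read_excel('pp.xlsx')
# coerce non-numeric entries to NaN, replace them with 0, and assign back
pp['Fee'] = pd.to_numeric(pp['Fee'], errors='coerce').fillna(0)
print(pp.sum(numeric_only=True))  # 'Fee' is now included in the sum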

What is the simplest way to use Python to group based on a combination of columns (4 columns) and sum the amount (1 column) column?

Using Python 3.6 and Pandas 0.23.0 to automate accounting.
I want to group 4 columns based on certain combined values (63 different combinations) and then sum the 5th column. Then I want to reduce the output for those 63 different values to a two-column output: Combination, Amount.
The 63 combinations will always be the same.
For example:
There are columns A, B, C, D, E.
Column A can have 3 values:
Ebay
Amazon
Shopify
Column B can have 5 values:
Sale
Refund
etc.
Column C can have 8 values:
StorePrice
StoreFee
Tax
TaxRefund
etc.
Column D can have 30 values:
SoldAmount
TaxAmount
PromotionAmount
RefundAmount
OtherAmount
etc.
Column E can have a numerical value:
-1,000,000 - 1,000,000
NOTE: The number of unique combinations is 63 for our purposes. Refunds can't be Promotions, etc.
I need to find the sum of Column E for each combination.
For perspective, this is typically done with a Pivot Table in Excel, except I have to do it manually, so that is 63 different sorts. For example, I will group by Ebay, Sale, StorePrice, SoldAmount to get the summed amount of all Ebay sales over a period.
I thought about storing a list of the 63 combinations in my code and then looping through the .txt file, doing a "sum for (w, x, y, z)" sort of thing. Here is where I started and then got stuck:
import pandas as pd
data = pd.read_csv('/Users/XXX/Desktop/statement.txt', sep='\t', header=0)
df = pd.DataFrame(data)
test3 = df.groupby(['Column A','Column B', 'Column A', 'Column D']).sum()
This gets me close, but I'm stuck.
What is the simplest way to solve this problem? Any help is appreciated!
Your arg list should instead be this:
df.groupby(['Column A','Column B', 'Column C', 'Column D']).sum()
If you told us the actual and expected result, we'd be in a better position to help you.
https://stackoverflow.com/help/mcve
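For the two-column Combination/Amount output described in the question, a sketch along these lines might work (column names assumed from the question):
import pandas as pd

df = pd.read_csv('/Users/XXX/Desktop/statement.txt', sep='\t', header=0)
key_cols = ['Column A', 'Column B', 'Column C', 'Column D']
# sum Column E for every observed combination of the four category columns
summed = df.groupby(key_cols)['Column E'].sum().reset_index()
# collapse the four keys into a single Combination label
summed['Combination'] = summed[key_cols].astype(str).agg('-'.join, axis=1)
print(summed[['Combination', 'Column E']])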

Slicing specific rows of a column in a pandas DataFrame

In the following DataFrame in Pandas, I want to extract the rows with dates between '03/01' and '06/01'. I don't want to use the index at all, as my input would be a start and an end date. How could I do so?
   A      B
0  01/01  56
1  02/01  54
2  03/01  66
3  04/01  77
4  05/01  66
5  06/01  72
6  07/01  132
7  08/01  127
First create a list of the dates you need using date_range. I'm adding the year 2000 since you need to supply a year for this to work; I'm then cutting it off to get the desired strings. In real life you might want to pay attention to the actual year, due to things like leap days.
import pandas as pd

date_start = '03/01'
date_end = '06/01'
dates = [x.strftime('%m/%d') for x in pd.date_range('2000/{}'.format(date_start),
                                                    '2000/{}'.format(date_end), freq='D')]
dates is now equal to:
['03/01',
'03/02',
'03/03',
'03/04',
.....
'05/29',
'05/30',
'05/31',
'06/01']
Then simply use the isin method and you are done:
df = df.loc[df.A.isin(dates)]
df
If your column is a datetime column, I guess you can skip the strftime part in the list comprehension to get the right result.
You are welcome to use boolean masking, i.e.:
df[(df.A >= start_date) & (df.A <= end_date)]
Inside the brackets is a boolean array of True and False values. Only the rows that fulfill your given condition (evaluate to True) are returned. This is a great tool to have, and it works well with pandas and numpy.
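A quick runnable version of that mask on the sample data from the question:
import pandas as pd

df = pd.DataFrame({'A': ['01/01', '02/01', '03/01', '04/01', '05/01', '06/01', '07/01', '08/01'],
                   'B': [56, 54, 66, 77, 66, 72, 132, 127]})
start_date, end_date = '03/01', '06/01'
# zero-padded mm/dd strings compare correctly as plain strings
print(df[(df.A >= start_date) & (df.A <= end_date)])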
