pandas is not summing a numeric column - excel

I have read an Excel spreadsheet into a DataFrame; it has column names such as Gross, Fee, Net, etc. When I invoke the sum method on the DataFrame, I see that it is not summing the Fee column because several rows have string data in that column. So I first loop through each row, testing that column for a string and replacing any string with a 0. The DataFrame sum method still does not sum the Fee column. Yet when I write the resulting DataFrame out to a new Excel spreadsheet, read it back in, and apply the sum method to the new DataFrame, it does sum the Fee column. Can anyone explain this? Here is the code and the printed output:
import pandas as pd

pp = pd.read_excel('pp.xlsx')
# get rid of any strings in column 'Fee':
for i in range(pp.shape[0]):
    if isinstance(pp.loc[i, 'Fee'], str):
        pp.loc[i, 'Fee'] = 0
pd.to_numeric(pp['Fee'])  # added this but it makes no difference
# the Fee column is still not summed:
print(pp.sum(numeric_only=True))
print('\nSecond Spreadsheet\n')
# write the DataFrame out to an Excel spreadsheet:
with pd.ExcelWriter('pp2.xlsx') as writer:
    pp.to_excel(writer, sheet_name='PP')
# now read the spreadsheet back into another DataFrame:
pp2 = pd.read_excel('pp2.xlsx')
# the Fee column is summed:
print(pp2.sum(numeric_only=True))
Prints:
Gross                           8677.90
Net                             8572.43
Address Status                     0.00
Shipping and Handling Amount       0.00
Insurance Amount                   0.00
Sales Tax                          0.00
etc.

Second Spreadsheet

Unnamed: 0                    277885.00
Gross                           8677.90
Fee                             -105.47
Net                             8572.43
Address Status                     0.00
Shipping and Handling Amount       0.00
Insurance Amount                   0.00
Sales Tax                          0.00
etc.

Try using pd.to_numeric
Ex:
pp = pd.read_excel('pp.xlsx')
print(pd.to_numeric(pp['Fee'], errors='coerce').dropna().sum())

The problem here is that the Fee column isn't numeric. So you need to convert it to a numeric type, save the converted column back into the existing DataFrame, and then compute the sum.
So that would be:
df = df.assign(Fee=pd.to_numeric(df['Fee'], errors='coerce'))
print(df.sum())

After a quick analysis, what I can see is that you are replacing the strings with an integer, so the 'Fee' column ends up holding a mix of floats and integers, which means the dtype of that column is object. When you do pp.sum(numeric_only=True), it ignores the object column because of the numeric_only condition. Convert the column to float64, as in pp['Fee'] = pd.to_numeric(pp['Fee']), and it should work for you.
The reason it works the second time is that Excel does the data conversion for you, so when you read the file back in, the column already has a numeric dtype.
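To see this concretely, here is a minimal demonstration with made-up data (not the asker's spreadsheet): a column mixing numbers and strings has dtype object, and sum(numeric_only=True) skips it until the column is converted and assigned back.

import pandas as pd

# made-up data: 'Fee' mixes a float and a string, so its dtype is object
pp = pd.DataFrame({'Gross': [100.0, 200.0], 'Fee': [-5.0, 'refund']})
print(pp.dtypes)                  # Fee: object
print(pp.sum(numeric_only=True))  # only Gross is summed

# coerce strings to NaN, replace them with 0, and assign back
pp['Fee'] = pd.to_numeric(pp['Fee'], errors='coerce').fillna(0)
print(pp.sum(numeric_only=True))  # now Fee is float64 and is summed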

Everyone who has responded should get partial credit for telling me about pd.to_numeric. But they were all missing one piece: it is not sufficient to call pd.to_numeric(pp['Fee']). That returns the column converted to numeric but does not update the original DataFrame, so when I do pp.sum(), nothing in pp has been modified. You need:
pp['Fee'] = pd.to_numeric(pp['Fee'])
pp.sum()

Related

Converting simple returns to monthly log returns

I have a pandas DataFrame with simple daily returns. I need to convert them to monthly log returns and add a column to the current DataFrame. I have to use np.log to compute the monthly return, but I can only compute the daily log return. Below is my code.
df['return_monthly'] = np.log(df['Simple Daily Returns'] + 1)
The code only produces daily log returns. Are there any particular methods I should be using in the above code to get monthly returns?
Please see my input pandas DataFrame; the third column in the Excel screenshot is the expected output.
The question is a little confusing, but it seems like you want to group the rows by month. This can be done with pandas.resample if you have a datetime index, pandas.groupby, or pandas.pivot.
Here is a simple implementation; let us know if this isn't what you're looking for. Note that your values are less than 1, so the log is negative; you can adjust as needed. I aggregated the months with sum, but there are many other aggregation functions, such as mean(), median(), size(), and more. See the pandas documentation for a full list of aggregation functions.
import pandas as pd
import numpy as np

# create dataframe with 1220 values that match your dataset
df = pd.DataFrame({
    'Date': pd.date_range(start='1/1/2019', end='5/4/2022', freq='1D'),
    'Return': np.random.uniform(low=1e-6, high=1.0, size=1220)  # avoid log 0, which returns NaN
}).set_index('Date')  # set the index to the date so we can use resample

df['Log_return'] = np.log(df['Return'])  # daily log returns
print(df.resample('M').sum())  # aggregate each month with sum
Return Log_return
Date
2019-01-31 14.604863 -33.950987
2019-02-28 13.118111 -32.025086
2019-03-31 14.541947 -32.962914
2019-04-30 14.212689 -33.684422
2019-05-31 14.154918 -33.347081
2019-06-30 10.710209 -43.474120
2019-07-31 12.358001 -43.051723
2019-08-31 17.932673 -30.328784
...

Sampling a dataframe according to some rules: balancing a multilabel dataset

I have a dataframe like this:
df = pd.DataFrame({'id': [10, 20, 30, 40],
                   'text': ['some text', 'another text', 'random stuff', 'my cat is a god'],
                   'A': [0, 0, 1, 1],
                   'B': [1, 1, 0, 0],
                   'C': [0, 0, 0, 1],
                   'D': [1, 0, 1, 0]})
Here I have columns from A to D, but my real dataframe has 100 columns with values of 0 and 1. This real dataframe has 100k records.
For example, column A is related to the 3rd and 4th rows of text, because they are labeled 1. In the same way, A is not related to the 1st and 2nd rows of text because they are labeled 0.
What I need to do is sample this dataframe in a way that gives me the same, or about the same, number of each feature.
In this case, the feature C has only one occurrence, so I need to filter all the other columns in a way that I have one text with A, one text with B, one text with C, etc.
Ideally I could set, for example, n=100, meaning I want to sample in a way that I have 100 records with each of the features.
This dataset is for multilabel training and is highly unbalanced; I am looking for the best way to balance it for a machine learning task.
Important: I don't want to exclude the 0 features. I just want to have ABOUT the same number of 1s and 0s in each column.
For example, with a final dataset of 1k records, I would like to have all columns from A to the final column with about the same numbers of 1s and 0s. To accomplish this I will need to randomly discard rows (text and id) only.
The approach I was trying was to look at the feature with the lowest 1 and 0 counts and then use that value as a threshold.
Edit 1: One possible way I thought of is to use:
df.sum(axis=0, skipna=True)
Then I can use the column with the lowest sum as a threshold to filter the text column. I don't know how to do this filtering step; a sketch of the idea is shown below.
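A minimal sketch of that filtering step (the set of label columns and the per-label sampling policy are assumptions, not from the question):

import pandas as pd

# assume the label columns are everything except 'id' and 'text'
label_cols = [c for c in df.columns if c not in ('id', 'text')]
counts = df[label_cols].sum(axis=0, skipna=True)
n = int(counts.min())  # the rarest label's count becomes the per-label quota

# for each label, sample up to n rows where that label is 1, then drop duplicates
sampled = pd.concat(
    [df[df[c] == 1].sample(n=min(n, int((df[c] == 1).sum())), random_state=0)
     for c in label_cols]
).drop_duplicates(subset='id')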
Thanks
The exact output you expect is unclear, but assuming you want to get 1 random row per letter with a 1, you could reshape (while dropping the 0s) and use GroupBy.sample:
(df
 .set_index(['id', 'text'])
 .replace(0, float('nan'))
 .stack()
 .groupby(level=-1).sample(n=1)
 .reset_index()
)
NB: you can rename the columns if needed; see the sketch after the output.
output:
id text level_2 0
0 30 random stuff A 1.0
1 20 another text B 1.0
2 40 my cat is a god C 1.0
3 30 random stuff D 1.0
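A sketch of that rename, assigning the pipeline to a variable and renaming the default column names (level_2 and 0) visible in the output; the new names are illustrative:

out = (df
       .set_index(['id', 'text'])
       .replace(0, float('nan'))
       .stack()
       .groupby(level=-1).sample(n=1)
       .reset_index()
       .rename(columns={'level_2': 'label', 0: 'value'})  # illustrative names
       )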

Arithmetic operations for groups within a dataframe

I have loaded multiple CSVs (time series) to create one dataframe. This dataframe contains data for multiple stocks. Now I want to calculate the 1 month return for all the data points.
There are 172 data points for each stock, i.e. from index 0 to 171. The time series for the next stock starts from index 0 again.
When I try to calculate the 1 month return, it is calculated correctly for all data points except index 0 of each new stock, because the difference is taken with index 171 of the previous stock.
I want the return to be calculated per stock name, so I tried a for loop, but it doesn't seem to work.
e.g. In the attached image (highlighted), the 1 month return for company ITC is calculated using SHREECEM's data. I expect the first value of 1Mreturn for SHREECEM to be NaN.
Using groupby instead of a for loop you can get the result you want:
Mreturn_function = lambda df: df['mean_price'].diff(periods=1)/df['mean_price'].shift(1)*100
gw_stocks.groupby('CompanyName').apply(Mreturn_function)
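For reference, a minimal sketch with made-up data showing the group-boundary behavior; pct_change within a group is equivalent to the diff/shift formula above and leaves NaN at each company's first row:

import pandas as pd

# made-up data: two companies back to back in one dataframe
df = pd.DataFrame({
    'CompanyName': ['ITC', 'ITC', 'ITC', 'SHREECEM', 'SHREECEM'],
    'mean_price': [100.0, 110.0, 99.0, 2000.0, 2100.0],
})

# pct_change per group: the first row of each company is NaN, not a
# difference against the previous company's last price
df['1Mreturn'] = df.groupby('CompanyName')['mean_price'].pct_change() * 100
print(df)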

How to convert data frame for time series analysis in Python?

I have a dataset of around 13000 rows and 2 columns (text and date) covering a two-year period. One of the columns is the date in yyyy-mm-dd format. I want to perform a time series analysis where the x axis would be the date (each day) and the y axis would be the frequency of text on the corresponding date.
I think if I create a new data frame with unique dates and the number of texts on each corresponding date, that would solve my problem.
Sample data
How can I create a new column with the frequency of text for each day? For example:
Thanks in Advance!
Depending on the task you are trying to solve, I can see two options for this dataset.
Either, as you show in your example, count the number of occurrences of the text field on each day, independently of the value of the text field.
Or, count the number of occurrences of each unique value of the text field on each day. You will then have one column for each possible value of the text field, which may make more sense if the values are purely categorical.
First things to do:
import pandas as pd

df = pd.DataFrame(data={'Date': ['2018-01-01', '2018-01-01', '2018-01-01', '2018-01-02', '2018-01-03'],
                        'Text': ['A', 'B', 'C', 'A', 'A']})
df['Date'] = pd.to_datetime(df['Date'])  # convert to datetime type if not already done
Date Text
0 2018-01-01 A
1 2018-01-01 B
2 2018-01-01 C
3 2018-01-02 A
4 2018-01-03 A
Then for option one :
df = df.groupby('Date').count()
Text
Date
2018-01-01 3
2018-01-02 1
2018-01-03 1
For option two :
df[df['Text'].unique()] = pd.get_dummies(df['Text'])
df = df.drop('Text', axis=1)
df = df.groupby('Date').sum()
A B C
Date
2018-01-01 1 1 1
2018-01-02 1 0 0
2018-01-03 1 0 0
The get_dummies function will create one column per possible value of the Text field. Each column is then a boolean indicator for each row of the dataframe, telling us which value of the Text field occurred in that row. We can then simply do a sum aggregation, grouping by the Date field.
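As an aside (not part of the original answer), pd.crosstab produces the same per-day counts in one step, starting from the original df before the dummy columns were added:

# cross-tabulate dates against text values; equivalent to the get_dummies approach
counts = pd.crosstab(df['Date'], df['Text'])
print(counts)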
If you are not familiar with groupby and aggregation operations, I recommend reading the pandas groupby user guide first.

Summary statistics for each group and transpose using pandas

I have a dataframe like as shown below
import numpy as np
import pandas as pd

df = pd.DataFrame({'person_id': [11, 11, 11, 11, 11, 11, 11, 11, 12, 12, 12],
                   'time': [0, 0, 0, 1, 2, 3, 4, 4, 0, 0, 1],
                   'value': [101, 102, np.nan, 120, 143, 153, 160, 170, 96, 97, 99]})
What I would like to do is
a) Get the summary statistics for each subject for each time point (ex: 0hr, 1hr, 2hr etc)
b) Please note that NA rows shouldn't be counted as separate record/row during computing mean
I was trying the below:
for i in df['person_id'].unique():
    df[df['person_id'].isin([i])].time.unique()

val_mean = df.groupby(['person_id', 'time'])['value'].mean()
val_stddev = df['value'].std()
But I couldn't get the expected output.
I expect my output to be like that shown below, with one row for each time point (ex: 0hr, 1hr, 2hr, 3hr, etc.). Please note that NA rows shouldn't be counted as a separate record/row when computing the mean.
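A minimal sketch of one way to get there, assuming the desired table is the mean and standard deviation per time point spread across columns (groupby skips NaN by default, so the NA row is excluded from the mean):

# mean and std of value per person per time point; NaN is skipped by default
stats = df.groupby(['person_id', 'time'])['value'].agg(['mean', 'std'])

# pivot so each time point (0hr, 1hr, ...) becomes a column
wide = stats.unstack('time')
print(wide)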
