How to apply a formula in pandas - python-3.x

I am trying to apply a formula to a column but am not able to.
I have data in a dataframe:
Date 2018-04-16 00:00:00
Quantity 8317.000
Total Value (Lacs) 259962.50
I want to apply a formula to the Total Value (Lacs) column.
The formula is: [Total Value (Lacs) multiplied by 100000] divided by [Quantity (000’s) multiplied by 100], using pandas.
I have tried something like:
a = df['Total Value (Lacs)']
b = df['Quantity']
c = (a * 100000 / b * 100)
print (c)
or
df['Price'] = ((df['Total Value (Lacs)']) * 100000 / (df['Quantity']) * 100)
print (df)
error:
TypeError: unsupported operand type(s) for /: 'str' and 'str'
Edit
I have tried the code below:
df['Price'] = float((float(df['Total Value (Lacs)'])) * 100000 / float((df['Quantity'])) * 100)
but I am getting the wrong value:
price 312567632.6
expecting
price 31256.76326

Edit 1
The TypeError means that you've tried to apply the operator / to two strings. There's no such operator defined for the str type in Python, so you should convert your data to some numeric type, float in your case.
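For example, a minimal sketch of the failure and the fix (sample values taken from your data):
import pandas as pd
a = pd.Series(['259962.50'])  # strings, as in your dataframe
b = pd.Series(['8317.000'])
# a / b  # raises TypeError: unsupported operand type(s) for /: 'str' and 'str'
a = pd.to_numeric(a)  # dtype is now float64
b = pd.to_numeric(b)
print(a / b)  # 0    31.256763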
I didn't understand exactly what your data looks like. But if it's like this:
df
Out:
Date Quantity Total Value (Lacs)
2018-04-16 00:00:00 8317.000 259962.50
2018-04-17 00:00:00 7823.000 234004.50
You can convert all the columns to the correct numeric type (I suppose that the Date column is the index):
df_float = df.apply(pd.to_numeric)
df_float.dtypes
Out:
Quantity float64
Total Value (Lacs) float64
dtype: object
After all, you can just deal with columns. Note the parentheses around the denominator: your original a * 100000 / b * 100 evaluates left to right as (a * 100000 / b) * 100, which is 10,000 times the value you expect; that is exactly the wrong value from your edit:
df['Price'] = (df_float['Total Value (Lacs)'] * 100000
               / (df_float['Quantity'] * 100))
df['Price']
Out:
2018-04-16 00:00:00 31256.763256
2018-04-17 00:00:00 29912.373770
Another approach is define the function and apply it to each row with pd.DataFrame.apply:
def get_price(row):
    try:
        price = (float(row['Total Value (Lacs)']) * 100000
                 / (float(row['Quantity']) * 100))
    except (TypeError, ValueError):  # bad data in this row that can't convert to float
        price = None
    return price
df['Price'] = df.apply(get_price, axis=1)
df['Price']
Out:
2018-04-16 00:00:00 31256.763256
2018-04-17 00:00:00 29912.373770
axis=1 means "apply to each row".
If you have transposed data, as in your example, you should transpose it back or apply the function to each column using axis=0, as sketched below.
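A sketch of both options, reusing get_price from above and assuming the transposed two-record layout:
import pandas as pd
# fields as rows, one observation per column
df = pd.DataFrame({0: ['2018-04-16 00:00:00', '8317.000', '259962.50'],
                   1: ['2018-04-17 00:00:00', '7823.000', '234004.50']},
                  index=['Date', 'Quantity', 'Total Value (Lacs)'])
prices_rowwise = df.T.apply(get_price, axis=1)  # transpose, then work row-wise
prices_colwise = df.apply(get_price, axis=0)    # or apply to each column as-is
# both give 31256.763256 and 29912.373770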
Edit 2:
Looks like your data is a single pd.Series, not a DataFrame. So if you select an entry with data['Quantity'], you'll get something like 8317.000 of type str. A Series has no row-wise apply (Series.apply works element by element), so get_price can't be used here directly. In that case you may act this way:
index_to_convert = ['Quantity', 'Total Value (Lacs)']
data[index_to_convert] = pd.to_numeric(data[index_to_convert])
so that only the numeric entries are converted. Then just do the formula:
data['Price'] = (data['Total Value (Lacs)'] * 100000
                 / (data['Quantity'] * 100))
data
Out:
Date 2018-04-16 00:00:00
Quantity 8317.0
Total Value (Lacs) 259962.5
Price 31256.763256
But in most cases this solution is not so handy. I strongly advise converting your data to a DataFrame and dealing with that, because a DataFrame provides more flexibility and capability.
The conversion:
df = data.to_frame().T.set_index('Date')
There are three consecutive actions:
Convert your data into a DataFrame
Transpose it (so the fields become columns)
Set "Date" as the index column
Results:
df
Out:
Quantity Total Value (Lacs)
Date
2018-04-16 00:00:00 8317.00 259962.50
After the previous steps you can apply the Edit 1 code to your data. This also works when there is more than one record in your data.
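Putting it all together (a sketch, assuming the single-record Series from your post):
import pandas as pd
data = pd.Series({'Date': '2018-04-16 00:00:00',
                  'Quantity': '8317.000',
                  'Total Value (Lacs)': '259962.50'})
df = data.to_frame().T.set_index('Date')  # Series -> one-row DataFrame
df = df.apply(pd.to_numeric)              # both remaining columns to float64
df['Price'] = (df['Total Value (Lacs)'] * 100000
               / (df['Quantity'] * 100))
print(df['Price'])  # 2018-04-16 00:00:00    31256.763256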
One more thing:
If your data has more than one value for each index entry, i.e. multiple quantities etc.:
data
Out:
Date 2018-04-16 00:00:00
Quantity 8317.00
Total Value (Lacs) 259962.50
Date 2018-04-17 00:00:00
Quantity 6434.00
Total Value (Lacs) 230002.50
You can also convert it into a pd.DataFrame, step by step.
Group your data by index entries and aggregate each group into a list:
data.groupby(level=0).apply(list)
Out:
Date [2018-04-16 00:00:00, 2018-04-17 00:00:00]
Quantity [8317.00, 6434.00]
Total Value (Lacs) [259962.50, 230002.50]
Then apply pd.Series to each row:
data.groupby(level=0).apply(list).apply(pd.Series)
Out:
0 1
Date 2018-04-16 00:00:00 2018-04-17 00:00:00
Quantity 8317.00 6434.00
Total Value (Lacs) 259962.50 230002.50
Transpose the returned DataFrame and set the 'Date' column as the index:
data.groupby(level=0).apply(list).apply(pd.Series).T.set_index('Date')
Out:
Quantity Total Value (Lacs)
Date
2018-04-16 00:00:00 8317.00 259962.50
2018-04-17 00:00:00 6434.00 230002.50
Apply the solution from Edit 1.
Hope it helps!

You are getting this error because the data extracted from the dataframe are strings, as shown in your error; you will need to convert the strings into floats.
Work with the dataframe's underlying values instead of the string columns directly. You can achieve that by:
values = df.values
Then you can extract the values from this array.
Alternatively, after extracting data from the dataframe, convert it to float by using:
b = float(df['Quantity'])

Use this:
df['price'] = (df['Total Value (Lacs)'].apply(pd.to_numeric) * 100000
               / (df['Quantity'].apply(pd.to_numeric) * 100))
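A slightly more direct variant converts each column once with pd.to_numeric instead of element-wise apply, keeping the denominator parenthesized as the formula requires (a sketch):
df['price'] = (pd.to_numeric(df['Total Value (Lacs)']) * 100000
               / (pd.to_numeric(df['Quantity']) * 100))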

Related

How to convert data frame for time series analysis in Python?

I have a dataset of around 13000 rows and 2 columns (text and date) covering a two-year period. One of the columns is a date in yyyy-mm-dd format. I want to perform time series analysis where the x axis would be the date (each day) and the y axis would be the frequency of text on the corresponding date.
I think if I create a new data frame with unique dates and the number of texts on each corresponding date, that would solve my problem.
How can I create a new column with the frequency of text each day?
Thanks in advance!
Depending on the task you are trying to solve, I can see two options for this dataset.
Either, as you show in your example, count the number of occurrences of the text field in each day, independently of the value of the text field.
Or count the number of occurrences of each unique value of the text field in each day. You will then have one column for each possible value of the text field, which may make more sense if the values are purely categorical.
First things to do:
import pandas as pd
df = pd.DataFrame(data={'Date':['2018-01-01','2018-01-01','2018-01-01', '2018-01-02', '2018-01-03'], 'Text':['A','B','C','A','A']})
df['Date'] = pd.to_datetime(df['Date']) #convert to datetime type if not already done
Date Text
0 2018-01-01 A
1 2018-01-01 B
2 2018-01-01 C
3 2018-01-02 A
4 2018-01-03 A
Then for option one:
df = df.groupby('Date').count()
Text
Date
2018-01-01 3
2018-01-02 1
2018-01-03 1
For option two:
df[df['Text'].unique()] = pd.get_dummies(df['Text'])
df = df.drop('Text', axis=1)
df = df.groupby('Date').sum()
A B C
Date
2018-01-01 1 1 1
2018-01-02 1 0 0
2018-01-03 1 0 0
The get_dummies function will create one column per possible value of the Text field. Each column is then a boolean indicator for each row of the dataframe, telling us which value of the Text field occurred in that row. We can then simply make a sum aggregation with a groupby on the Date field.
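An equivalent route for option two that skips the dummy columns, assuming the original df with its Text column still intact (a sketch):
counts = df.groupby('Date')['Text'].value_counts().unstack(fill_value=0)
# same table: one row per date, one column per unique Text value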
If you are not familiar with the use of groupby and aggregation operations, I recommend that you read this guide first.

Getting another column where one column of data has its maximum value in Python?

I have a data frame
date object price
190403 a 1
190405 a 23
190506 b -4
190507 d 56
I want to get the date of the row having the maximum price, i.e. 190507.
Expected output:
190507
For scalar output (there is always exactly one max date value here), use Series.idxmax after converting date to the index with DataFrame.set_index:
df.set_index('date')['price'].idxmax()
If you want all max values as a Series, use boolean indexing to compare all values with the max; DataFrame.loc then also filters the date column:
df.loc[df['price'].eq(df['price'].max()), 'date']
You can subset the df so that you only have the row where the price column is at its maximum, and then choose the date column:
df[df.price==df.price.max()].date
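A quick runnable check of all three variants on the sample data (a sketch):
import pandas as pd
df = pd.DataFrame({'date': [190403, 190405, 190506, 190507],
                   'object': ['a', 'a', 'b', 'd'],
                   'price': [1, 23, -4, 56]})
print(df.set_index('date')['price'].idxmax())             # 190507
print(df.loc[df['price'].eq(df['price'].max()), 'date'])  # 3    190507
print(df[df.price == df.price.max()].date)                # 3    190507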

How can I loop over rows in my DataFrame, calculate a value and put that value in a new column with this lambda function

./test.csv looks like:
price datetime
1 100 2019-10-10
2 150 2019-11-10
...
import pandas as pd
import datetime as date
import datetime as time
from datetime import datetime
from datetime import timedelta
csv_df = pd.read_csv('./test.csv')
today = datetime.today()
csv_df['datetime'] = csv_df['expiration_date'].apply(lambda x: pd.to_datetime(x)) #convert `expiration_date` to datetime Series
def days_until_exp(expiration_date, today):
    diff = (expiration_date - today)
    return [diff]
csv_df['days_until_expiration'] = csv_df['datetime'].apply(lambda x: days_until_exp(csv_df['datetime'], today))
I am trying to iterate over a specific column in my DataFrame labeled csv_df['datetime'], which in each cell has just one value, a date, and do a calculation defined by diff.
Then I want the single value diff to be put into the new Series csv_df['days_until_expiration'].
The problem is, it's calculating values for every row (673 rows) and putting all those values in a list in each row of csv_df['days_until_expiration']. I realize it may be due to the brackets around [diff], but without them I get an error.
In Excel, I would just do something like =SUM(datetime - price) and click and drag down the rows to have it populate a new column. However, I want to do this in Pandas as it's part of a bigger application.
csv_df['datetime'] is a Series, so the x in apply is each cell of the Series. You call apply with a lambda and days_until_exp(), but you aren't passing x to it; you pass the whole Series instead. Therefore, the result is wrong.
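If you do want to keep apply, the fix is to use the x that the lambda receives (a sketch):
# x is a single Timestamp from the column, not the whole Series
csv_df['days_until_expiration'] = csv_df['datetime'].apply(
    lambda x: (x - today).days)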
Anyway, without your sample data, I guess that you want the difference between csv_df['datetime'] and today() (or its sum). To do this, you don't need apply; just do a direct vectorized operation on the Series.
I made a two-column dataframe as a sample:
csv_df:
datetime days_until_expiration
0 2019-09-01 NaN
1 2019-09-02 NaN
2 2019-09-03 NaN
The following returns a Series of deltas between csv_df['datetime'] and today(); I guess you want this:
td = datetime.today()  # datetime imported via from datetime import datetime, as in your code
csv_df['days_until_expiration'] = (csv_df['datetime'] - td).dt.days
csv_df:
datetime days_until_expiration
0 2019-09-01 115
1 2019-09-02 116
2 2019-09-03 117
OR:
To find the sum of all deltas and assign that same sum value to every row of csv_df['days_until_expiration']:
csv_df['days_until_expiration'] = (csv_df['datetime'] - td).dt.days.sum()
csv_df:
datetime days_until_expiration
0 2019-09-01 348
1 2019-09-02 348
2 2019-09-03 348

Why is call to sum() on a data frame generating wrong numbers?

I want to sum the numerical values in each row (Store A to Store D) for the month of June and place them in an appended column 'Sum'. But the results are huge sum values, which are wrong. How do I get the correct sum?
This code was run using Python 3.6:
import pandas as pd
import numpy as np
data = np.array([
    ['', 'week', 'storeA', 'storeB', 'storeC', 'storeD'],
    [0, "2014-05-04", 2643, 8257, 3893, 6231],
    [1, "2014-05-11", 6444, 5736, 5634, 7092],
    [2, "2014-05-18", 9646, 2552, 4253, 5447],
    [3, "2014-05-25", 5960, 10740, 8264, 6063],
    [4, "2014-06-04", 5960, 10740, 8264, 6063],
    [5, "2014-06-12", 7412, 7374, 3208, 3985]
])
df = pd.DataFrame(data=data[1:, 1:],
                  index=data[1:, 0],
                  columns=data[0, 1:])
print(df)
# get rows of table which match Year,Month for last month
df2 = df[df['week'].str.contains("2014-06")].copy()
print(df2)
# generate col summing up each row
col_list = list(df2)
print(col_list)
col_list.remove('week')
print(col_list)
df2['Sum'] = df2[col_list].sum(axis=1)
print(df2)
Output of Sum column for rows 4 and 5:
Row4 - 5.960107e+16
Row5 - 7.412737e+15
Use astype to convert those strings to ints, and sum works properly:
df2['Sum'] = df2[col_list].astype(int).sum(axis=1)
Output:
week storeA storeB storeC storeD Sum
4 2014-06-04 5960 10740 8264 6063 31027
5 2014-06-12 7412 7374 3208 3985 21979
What was happening: you were summing (concatenating) strings.
Because of the way your array is defined, with mixed strings and objects, everything is coerced to string. Take a look at this:
df.dtypes
week object
storeA object
storeB object
storeC object
storeD object
dtype: object
You have columns of strings, and sum on string dataframes results in concatenation.
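You can see the concatenation directly; the digits of row 4 even reappear in the "sum" you reported (a minimal demo):
import pandas as pd
row4 = pd.Series(['5960', '10740', '8264', '6063'])
print(row4.sum())  # '59601074082646063', i.e. your 5.960107e+16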
The solution is to convert these to integers first -
df2[col_list] = df2[col_list].astype(int)
Your code then works.
df2[col_list].sum(axis=1)
4 31027
5 21979
dtype: int64
Alternatively, declare data as an object array -
data = np.array([[...], [...], ...], dtype=object)
df = pd.DataFrame(data=data[1:,1:], index=data[1:,0], columns=data[0,1:])
Next, perform a soft conversion using infer_objects (new in v0.22):
df = df.infer_objects()
df.dtypes
week object
storeA int64
storeB int64
storeC int64
storeD int64
dtype: object
Works like a charm.

Efficient way of converting String column to Date in Pandas (in Python), but without Timestamp

I have a DataFrame which contains two String columns, df['month'] and df['year']. I want to create a new column df['date'] by combining the month and the year column. I have done that successfully using the structure below -
df['date'] = pd.to_datetime((df['month'] + df['year']), format='%m%Y')
whereby for df['month'] = '08' and df['year'] = '1968'
we get df['date'] = 1968-08-01
This is exactly what I wanted.
Problem at hand: My DataFrame has more than 200,000 rows, and I notice that sometimes I also get a Timestamp like the one below for a few rows, which I want to avoid -
1972-03-01 00:00:00
I solved this issue by using the .dt accessor, which can be used to manipulate the Series, whereby I explicitly extracted only the date using the code below -
df['date'] = pd.to_datetime((df['month'] + df['year']), format='%m%Y') #Line 1
df['date'] = df['date'].dt.date #Line 2
The problem was solved, just that Line 2 took 5 times longer than Line 1.
Question: Is there any way to tweak Line 1 into giving just the dates and not the Timestamp? I am sure this simple problem cannot have such an inefficient solution. Can I solve this issue in a more time- and resource-efficient manner?
AFAIK we don't have a date dtype in Pandas; we only have datetime, so we will always have a time part.
Even though Pandas shows 1968-08-01, it has a time part: 00:00:00.
Demo:
In [32]: df = pd.DataFrame(pd.to_datetime(['1968-08-01', '2017-08-01']), columns=['Date'])
In [33]: df
Out[33]:
Date
0 1968-08-01
1 2017-08-01
In [34]: df['Date'].dt.time
Out[34]:
0 00:00:00
1 00:00:00
Name: Date, dtype: object
And if you want to have a string representation, there is a faster way:
df['date'] = df['year'].astype(str) + '-' + df['month'].astype(str) + '-01'
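If you already have a datetime column and only need date-only strings, dt.strftime is another option (a sketch; it is usually slower than the plain string concatenation above):
df['date_str'] = df['Date'].dt.strftime('%Y-%m-%d')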
UPDATE: be aware that .dt.date will give you an object column holding Python date objects, not a datetime64 column:
In [53]: df.dtypes
Out[53]:
Date datetime64[ns]
dtype: object
In [54]: df['new'] = df['Date'].dt.date
In [55]: df
Out[55]:
Date new
0 1968-08-01 1968-08-01
1 2017-08-01 2017-08-01
In [56]: df.dtypes
Out[56]:
Date datetime64[ns]
new object # <--- NOTE !!!
dtype: object
