Converting timedeltas to integers for consecutive time points in pandas

Suppose I have the dataframe
import pandas as pd
df = pd.DataFrame({"Time": ['2010-01-01', '2010-01-02', '2010-01-03', '2010-01-04']})
print(df)
Time
0 2010-01-01
1 2010-01-02
2 2010-01-03
3 2010-01-04
If I want to calculate the time elapsed since the earliest time point for each time in the dataframe, I can use the apply function like this:
df['Time'] = pd.to_datetime(df['Time'])
df.sort_values('Time', inplace=True)
df['Time'] = df['Time'].apply(lambda x: (x - df['Time'].iloc[0]).days)
print(df)
Time
0 0
1 1
2 2
3 3
Is there a function in Pandas that does this already?

I would recommend not using apply; a vectorized subtraction does the same thing:
(df.Time-df.Time.iloc[0]).dt.days
0 0
1 1
2 2
3 3
Name: Time, dtype: int64
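If you would rather not rely on the frame being sorted first, subtracting the column minimum gives the same result. A minimal sketch on the same toy data:
import pandas as pd
df = pd.DataFrame({"Time": ['2010-01-01', '2010-01-02', '2010-01-03', '2010-01-04']})
df['Time'] = pd.to_datetime(df['Time'])
# subtract the smallest timestamp; no prior sort_values needed
print((df['Time'] - df['Time'].min()).dt.days)
0    0
1    1
2    2
3    3
Name: Time, dtype: int64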

Related

Time series resampling with column of type object

Good evening,
I want to resample an irregular time series that has a column of type object, but it does not work.
Here is my sample data:
Actual start date Ingredients NumberShortage
2002-01-01 LEVOBUNOLOL HYDROCHLORIDE 1
2006-07-30 LEVETIRACETAM 1
2008-03-19 FLAVOXATE HYDROCHLORIDE 1
2010-01-01 LEVOTHYROXINE SODIUM 1
2011-04-01 BIMATOPROST 1
I tried to resample my data frame daily, but it does not work. My code is as follows:
df3 = df1.resample('D', on='Actual start date').sum()
and here is what it gives:
Actual start date NumberShortage
2002-01-01 1
2002-01-02 0
2002-01-03 0
2002-01-04 0
2002-01-05 0
and what I want as a result:
Actual start date Ingredients NumberShortage
2002-01-01 LEVOBUNOLOL HYDROCHLORIDE 1
2002-01-02 NAN 0
2002-01-03 NAN 0
2002-01-04 NAN 0
2002-01-05 NAN 0
Any ideas?
details on the data
So I start from an Excel file that contains several attributes (the file can be downloaded from https://www.drugshortagescanada.ca/search?perform=0) before it becomes a csv file, then I group by 'Actual start date' and 'Ingredients' to obtain 'NumberShortage',
and here is the source code:
import pandas as pd
df = pd.read_excel("Data/Data.xlsx")
df = df.dropna(how='any')
df = df.groupby(['Actual start date','Ingredients']).size().reset_index(name='NumberShortage')
Finally, after applying your source code, I get an error.
and here is the sample excel file :
Brand name Company Name Ingredients Actual start date
ACETAMINOPHEN PHARMASCIENCE INC ACETAMINOPHEN CODEINE 2017-03-23
PMS-METHYLPHENIDATE ER PHARMASCIENCE INC METHYLPHENIDATE 2017-03-28
You rather need to reindex, using date_range as the source of new dates and the time series as a temporary index:
df['Actual start date'] = pd.to_datetime(df['Actual start date'])
(df
 .set_index('Actual start date')
 .reindex(pd.date_range(df['Actual start date'].min(),
                        df['Actual start date'].max(), freq='D'))
 .fillna({'NumberShortage': 0}, downcast='infer')
 .reset_index()
)
output:
index Ingredients NumberShortage
0 2002-01-01 LEVOBUNOLOL HYDROCHLORIDE 1
1 2002-01-02 NaN 0
2 2002-01-03 NaN 0
3 2002-01-04 NaN 0
4 2002-01-05 NaN 0
... ... ... ...
3373 2011-03-28 NaN 0
3374 2011-03-29 NaN 0
3375 2011-03-30 NaN 0
3376 2011-03-31 NaN 0
3377 2011-04-01 BIMATOPROST 1
[3378 rows x 3 columns]
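One cosmetic follow-up: after reset_index, the reindexed dates end up under a generic index column, as in the output above. Naming the new DatetimeIndex before resetting restores the original header. A small sketch of that variant (using an explicit astype instead of downcast, which is deprecated on pandas 2.x):
out = (df
       .set_index('Actual start date')
       .reindex(pd.date_range(df['Actual start date'].min(),
                              df['Actual start date'].max(), freq='D'))
       .rename_axis('Actual start date')  # name the new DatetimeIndex
       .reset_index())
out['NumberShortage'] = out['NumberShortage'].fillna(0).astype(int)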

How to change the format for values in a dataframe?

I need to change the format of the values in a column in a dataframe. If I have a dataframe in this format:
df =
sector funding_total_usd
1 NaN
2 10,00,000
3 3,90,000
4 34,06,159
5 2,17,50,000
6 20,00,000
How do I change it to this format:
df =
sector funding_total_usd
1 NaN
2 10000.00
3 3900.00
4 34061.59
5 217500.00
6 20000.00
This is my code:
for row in df['funding_total_usd']:
    dt1 = row.replace(',', '')
    print(dt1)
This is the error that I got: "AttributeError: 'float' object has no attribute 'replace'"
I really need your help with how to do this.
Here's the way to get the decimal places:
import pandas as pd
import numpy as np
df = pd.DataFrame({'funding_total_usd': [np.nan, 1000000, 390000, 3406159, 21750000, 2000000]})
print(df)
df['funding_total_usd'] /= 100
print(df)
funding_total_usd
0 NaN
1 1000000.0
2 390000.0
3 3406159.0
4 21750000.0
5 2000000.0
funding_total_usd
0 NaN
1 10000.00
2 3900.00
3 34061.59
4 217500.00
5 20000.00
To solve your comma problem, run this once before you print; it formats all floats with two decimal places and no thousands separators.
pd.options.display.float_format = '{:.2f}'.format
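Note that the answer above starts from numeric data. If your column actually holds strings such as '10,00,000' with NaN mixed in (the NaN floats are what trigger the AttributeError in your loop), a vectorized conversion avoids the loop entirely. A minimal sketch, assuming every non-null entry is a comma-grouped string:
import pandas as pd
import numpy as np
df = pd.DataFrame({'funding_total_usd': [np.nan, '10,00,000', '3,90,000',
                                         '34,06,159', '2,17,50,000', '20,00,000']})
# .str methods skip NaN, so no AttributeError; strip commas, cast, rescale
df['funding_total_usd'] = (df['funding_total_usd']
                           .str.replace(',', '', regex=False)
                           .astype(float) / 100)
print(df)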

Format the output in Pandas

I have an Excel sheet, which I read in:
import pandas as pd
import numpy as np
excel_file = 'test.xlsx'
df = pd.read_excel(excel_file, sheet_name=0)
print(df)
it shows:
name value
0 a 10000.000000
1 b 20000.000000
2 c 30000.000000
3 d 40000.000000
4 e 50000.000000
5 f 1.142857
6 g 1.285714
How can I format the number output to something like %.2f? Can I format it directly when printing, for example by adding something like %.2f to print(df), or must I first modify the content, commit it back to df, and then print again?
UPDATE:
I tried the answer below; df['value'].apply("{0:.2f}".format) doesn't work:
import pandas as pd
import numpy as np
excel_file = 'test.xlsx'
df = pd.read_excel(excel_file, sheet_name=0)
print(df['value'])
df['value'].apply("{0:.2f}".format)
print(df['value'])
print(df)
it shows:
0 10000.000000
1 20000.000000
2 30000.000000
3 40000.000000
4 50000.000000
5 1.142857
6 1.285714
Name: value, dtype: float64
0 10000.000000
1 20000.000000
2 30000.000000
3 40000.000000
4 50000.000000
5 1.142857
6 1.285714
Name: value, dtype: float64
name value
0 a 10000.000000
1 b 20000.000000
2 c 30000.000000
3 d 40000.000000
4 e 50000.000000
5 f 1.142857
6 g 1.285714
pd.set_option('display.float_format', lambda x: '%.2f' % x) works:
0 10000.00
1 20000.00
2 30000.00
3 40000.00
4 50000.00
5 1.14
6 1.29
Name: value, dtype: float64
name value
0 a 10000.00
1 b 20000.00
2 c 30000.00
3 d 40000.00
4 e 50000.00
5 f 1.14
6 g 1.29
You can change the float_format of pandas via pandas set_option, like this:
pd.set_option('display.float_format', lambda x: '%.2f' % x)
If you want to change the formatting of just one column, you can use apply and change the formatting like this:
df['age'].apply("{0:.2f}".format)
# Output
0 10000.00
1 20000.00
2 30000.00
3 40000.00
4 50000.00
5 1.14
6 1.29
Name: age, dtype: object
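A note on the UPDATE above: apply returns a new Series rather than modifying the column in place, so calling it without assignment has no visible effect. A minimal sketch using the question's value column:
# assign the formatted Series back, otherwise the result is discarded
df['value'] = df['value'].apply("{0:.2f}".format)
print(df)
Be aware this turns the column into strings (dtype object, as the output above shows); if you only need prettier display, the set_option route keeps the underlying floats numeric.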
The default display precision is 6. You can override this with pandas.set_option:
pd.set_option('display.precision', 2)

subselect columns pandas with count

I have a table below
I am trying to create an additional column that counts how many of Std_1, Std_2 and Std_3 are greater than the row's mean value.
For example, for the ACCMGR row, only Std_2 is greater than the average, so the new column should be 1.
Not sure how to do it.
You need to be a bit careful with how you specify the axes, but you can just use .gt + .mean + .sum
Sample Data
import pandas as pd
import numpy as np
df = pd.DataFrame({'APPL': ['ACCMGR', 'ACCOUNTS', 'ADVISOR', 'AUTH', 'TEST'],
                   'Std_1': [106.875, 121.703, np.NaN, 116.8585, 1],
                   'Std_2': [130.1899, 113.4927, np.NaN, 112.4486, 4],
                   'Std_3': [107.186, 114.5418, np.NaN, 115.2699, np.NaN]})
Code
df = df.set_index('APPL')
df['cts'] = df.gt(df.mean(axis=1), axis=0).sum(axis=1)
df = df.reset_index()
Output:
APPL Std_1 Std_2 Std_3 cts
0 ACCMGR 106.8750 130.1899 107.1860 1
1 ACCOUNTS 121.7030 113.4927 114.5418 1
2 ADVISOR NaN NaN NaN 0
3 AUTH 116.8585 112.4486 115.2699 2
4 TEST 1.0000 4.0000 NaN 1
Consider the dataframe:
quantity price
0 6 1.45
1 3 1.85
2 2 2.25
Apply a lambda function along axis=1: for each row, take the values greater than the row mean and return the column position of the first one:
df.apply(lambda x: df.columns.get_loc(x[x > np.mean(x)].index[0]), axis=1)
Out:
quantity price > than mean
0 6 1.45 0
1 3 1.85 0
2 2 2.25 1
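If you want the count itself from a row-wise lambda in the same spirit, a sketch on the same toy frame:
# per row: how many values exceed the row mean (Series.mean skips NaN)
df.apply(lambda x: (x > x.mean()).sum(), axis=1)
0    1
1    1
2    1
dtype: int64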

Element-wise Maximum of Two DataFrames Ignoring NaNs

I have two dataframes (df1 and df2) that each have the same rows and columns. I would like to take the maximum of these two dataframes, element-by-element. In addition, the result of any element-wise maximum with a number and NaN should be the number. The approach I have implemented so far seems inefficient:
def element_max(df1, df2):
    import pandas as pd
    cond = df1 >= df2
    res = pd.DataFrame(index=df1.index, columns=df1.columns)
    res[(df1==df1)&(df2==df2)&(cond)] = df1[(df1==df1)&(df2==df2)&(cond)]
    res[(df1==df1)&(df2==df2)&(~cond)] = df2[(df1==df1)&(df2==df2)&(~cond)]
    res[(df1==df1)&(df2!=df2)&(~cond)] = df1[(df1==df1)&(df2!=df2)]
    res[(df1!=df1)&(df2==df2)&(~cond)] = df2[(df1!=df1)&(df2==df2)]
    return res
Any other ideas? Thank you for your time.
A more readable way to do this in recent versions of pandas is concat-and-max, grouping on the index level:
import numpy as np
import pandas as pd
A = pd.DataFrame([[1., 2., 3.]])
B = pd.DataFrame([[3., np.nan, 1.]])
pd.concat([A, B]).groupby(level=0).max()
#
# 0 1 2
# 0 3.0 2.0 3.0
#
You can use where to test your df against another df: where the condition is True, the values from df are returned; where it is False, the values from df1 are returned. Additionally, where there are NaN values in df1, an additional call to fillna(df) will use the values from df to fill those NaN and return the desired df:
In [178]:
df = pd.DataFrame(np.random.randn(5,3))
df.iloc[1,2] = np.NaN
print(df)
df1 = pd.DataFrame(np.random.randn(5,3))
df1.iloc[0,0] = np.NaN
print(df1)
0 1 2
0 2.671118 1.412880 1.666041
1 -0.281660 1.187589 NaN
2 -0.067425 0.850808 1.461418
3 -0.447670 0.307405 1.038676
4 -0.130232 -0.171420 1.192321
0 1 2
0 NaN -0.244273 -1.963712
1 -0.043011 -1.588891 0.784695
2 1.094911 0.894044 -0.320710
3 -1.537153 0.558547 -0.317115
4 -1.713988 -0.736463 -1.030797
In [179]:
df.where(df > df1, df1).fillna(df)
Out[179]:
0 1 2
0 2.671118 1.412880 1.666041
1 -0.043011 1.187589 0.784695
2 1.094911 0.894044 1.461418
3 -0.447670 0.558547 1.038676
4 -0.130232 -0.171420 1.192321
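Another option worth knowing: NumPy's fmax ufunc takes the element-wise maximum and prefers the non-NaN operand, which is exactly the NaN behaviour asked for. A minimal sketch (the result is NaN only where both inputs are NaN):
import numpy as np
import pandas as pd
df1 = pd.DataFrame([[1., np.nan], [5., 2.]])
df2 = pd.DataFrame([[3., 4.], [np.nan, 1.]])
# ufuncs broadcast over DataFrames and return a DataFrame
print(np.fmax(df1, df2))
     0    1
0  3.0  4.0
1  5.0  2.0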
