This may be a silly question, but I've searched for examples of manipulating dates in a DataFrame with pandas. What confuses me is that my dates have this format:
Time A B C D
1.000347257 626.9966431 0 0 -99.98999786
1.001041651 626.9967651 0 0 -99.98999786
1.001736164 627.0130005 0 0 -99.98999786
1.002430558 627.0130005 0 0 -99.98999786
1.003124952 627.0455933 0 0 -99.98999786
1.003819466 627.0618286 0 0 -99.98999786
...
1.998263836 627.7052002 0.3417936265 0.2321419418 0.07069379836
1.998958349 627.7216187 0.3260073066 0.2284916639 0.073251158
1.999652743 627.6726074 0.3180454969 0.2164463699 0.07418025285
2.000347137 627.7371826 0.3161731362 0.2277853489 0.07479456067
2.001041651 627.7365723 0.301556468 0.2394933105 0.07920494676
2.001736164 627.7686157 0.3718534708 0.2506033182 0.07810453326
...
366.996887 625.413574 3.168393 2.114161 2.119713
366.997559 625.413391 3.163851 2.104703 2.117746
366.998261 625.461792 3.184296 2.113827 2.117964
366.998962 625.449463 3.163331 2.117869 2.116489
366.999664 625.510681 3.166895 2.126145 2.110077
This is an extract of the file where the data is stored. Is there a way to convert this format to something like 2010-10-23 using the datetime library? The year here is 2011, but it is not specified in the data.
Thank you!
I looked into the pandas documentation and, though I don't understand it very well, it worked. The time was in decimal format, counted in days. So I passed the unit and used a timestamp origin to declare the year that I already knew:
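# unit='D' reads the numbers as (fractional) days;
# origin anchors day zero at the start of the known year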
df['Time'] = pd.to_datetime(
df['Time'], unit='D', origin=pd.Timestamp('2011-01-01')
)
With this, the result is what I wanted, and it runs through 366 days, as shown below:
Time A B C D
2011-01-02 00:00:30.003004800 626.996643 0.000000 0.000000 -99.989998
2011-01-02 00:01:29.998646400 626.996765 0.000000 0.000000 -99.989998
2011-01-02 00:02:30.004569600 627.013000 0.000000 0.000000 -99.989998
2011-01-02 00:03:30.000211200 627.013000 0.000000 0.000000 -99.989998
2011-01-02 00:04:29.995852800 627.045593 0.000000 0.000000 -99.989998
... ... ... ... ...
2012-01-02 23:55:31.054080000 625.413574 2.706322 2.086675 2.094654
2012-01-02 23:56:29.063040000 625.413391 2.738388 2.082261 2.092784
2012-01-02 23:57:29.707200000 625.461792 2.762815 2.097127 2.091273
2012-01-02 23:58:30.351360000 625.449463 2.698989 2.105750 2.090060
2012-01-02 23:59:30.995520000 625.510681 2.751848 2.109448 2.090664
Your Time column seems to hold day fractions. If you know the year, you can convert it to a datetime column using
# 1 - convert the year to nanoseconds since the epoch
# 2 - add the day fraction, after you convert that to nanoseconds as well
# 3 - convert the resulting nanoseconds since the epoch to datetime
year = '2011'
df['datetime'] = pd.to_datetime(pd.to_datetime(year).value + df['Time']*86400*1e9)
which will give you e.g.
df
Time A B C D datetime
0 1.000347 626.996643 0 0 -99.989998 2011-01-02 00:00:30.003004928
1 1.001042 626.996765 0 0 -99.989998 2011-01-02 00:01:29.998646272
2 1.001736 627.013000 0 0 -99.989998 2011-01-02 00:02:30.004569600
3 1.002431 627.013000 0 0 -99.989998 2011-01-02 00:03:30.000211200
4 1.003125 627.045593 0 0 -99.989998 2011-01-02 00:04:29.995852800
5 1.003819 627.061829 0 0 -99.989998 2011-01-02 00:05:30.001862400
You can cast the column to datetime with pd.to_datetime(); note that without a unit or origin, the floats are read as nanoseconds since the Unix epoch:
df.Time = pd.to_datetime(df.Time)
df.head(2)
Time A B C D
0 1970-01-01 00:00:01.000347257 626.996643 0 0 -99.989998
1 1970-01-01 00:00:01.001041651 626.996765 0 0 -99.989998
Good evening,
I want to resample an irregular time series that has a column of type object, but it does not work.
Here is my sample data:
Actual start date Ingredients NumberShortage
2002-01-01 LEVOBUNOLOL HYDROCHLORIDE 1
2006-07-30 LEVETIRACETAM 1
2008-03-19 FLAVOXATE HYDROCHLORIDE 1
2010-01-01 LEVOTHYROXINE SODIUM 1
2011-04-01 BIMATOPROST 1
I tried to resample my data frame daily, but it does not work. My code is as follows:
df3 = df1.resample('D', on='Actual start date').sum()
and here is what it gives:
Actual start date NumberShortage
2002-01-01 1
2002-01-02 0
2002-01-03 0
2002-01-04 0
2002-01-05 0
and what I want as a result:
Actual start date Ingredients NumberShortage
2002-01-01 LEVOBUNOLOL HYDROCHLORIDE 1
2002-01-02 NaN 0
2002-01-03 NaN 0
2002-01-04 NaN 0
2002-01-05 NaN 0
Any ideas?
Details on the data:
I start from an Excel file containing several attributes (it can be downloaded from https://www.drugshortagescanada.ca/search?perform=0) and convert it to a CSV file; then I group by 'Actual start date' and 'Ingredients' to obtain 'NumberShortage'.
and here is the source code:
import pandas as pd
df = pd.read_excel("Data/Data.xlsx")
df = df.dropna(how='any')
df = df.groupby(['Actual start date','Ingredients']).size().reset_index(name='NumberShortage')
Finally, after applying your code, here is the error it gives me:
And here is the sample Excel file:
Brand name Company Name Ingredients Actual start date
ACETAMINOPHEN PHARMASCIENCE INC ACETAMINOPHEN CODEINE 2017-03-23
PMS-METHYLPHENIDATE ER PHARMASCIENCE INC METHYLPHENIDATE 2017-03-28
Resampling aggregates within each daily bin, which drops the non-numeric Ingredients column. You rather need to reindex, using date_range as the source of new dates and the dates as a temporary index:
df['Actual start date'] = pd.to_datetime(df['Actual start date'])
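# temporarily index by date, reindex onto a complete daily range, then fill the missing counts with 0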
(df
.set_index('Actual start date')
.reindex(pd.date_range(df['Actual start date'].min(),
df['Actual start date'].max(), freq='D'))
.fillna({'NumberShortage': 0}, downcast='infer')
.reset_index()
)
output:
index Ingredients NumberShortage
0 2002-01-01 LEVOBUNOLOL HYDROCHLORIDE 1
1 2002-01-02 NaN 0
2 2002-01-03 NaN 0
3 2002-01-04 NaN 0
4 2002-01-05 NaN 0
... ... ... ...
3373 2011-03-28 NaN 0
3374 2011-03-29 NaN 0
3375 2011-03-30 NaN 0
3376 2011-03-31 NaN 0
3377 2011-04-01 BIMATOPROST 1
[3378 rows x 3 columns]
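One caveat: reindex requires a unique index, so this works as long as each 'Actual start date' appears only once after the groupby; if several ingredients shared a start date, pandas would refuse to reindex on the duplicated axis.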
I have the following dataframe:
0 1
0 0.224960 -1.376689
1 0.059706 -1.330823
2 -0.133850 -1.251549
3 -0.234644 -1.190972
4 -0.281469 -1.156635
... ... ...
295 0.655912 -1.040209
296 0.618599 -1.068238
297 0.594964 -1.109484
298 0.578758 -1.151496
299 0.570207 -1.179523
I added the index as a column and then generated fake timestamps from this column like this:
df['timestamp'] = df.index
# convert the column (plain integers) to datetime type;
# without a unit, the integers are read as nanoseconds since the epoch
df['timestamp'] = pd.to_datetime(df['timestamp'])
# set timestamp as the dataframe index
df = df.set_index('timestamp')
df
The result is:
0 1
timestamp
1970-01-01 00:00:00.000000000 0.224960 -1.376689
1970-01-01 00:00:00.000000001 0.059706 -1.330823
1970-01-01 00:00:00.000000002 -0.133850 -1.251549
1970-01-01 00:00:00.000000003 -0.234644 -1.190972
1970-01-01 00:00:00.000000004 -0.281469 -1.156635
... ... ...
1970-01-01 00:00:00.000000295 0.655912 -1.040209
1970-01-01 00:00:00.000000296 0.618599 -1.068238
1970-01-01 00:00:00.000000297 0.594964 -1.109484
1970-01-01 00:00:00.000000298 0.578758 -1.151496
1970-01-01 00:00:00.000000299 0.570207 -1.179523
I want the timestamps to be 1970-01-01 00:00:00, then 1970-01-01 00:00:01, and so on.
This will be the correct answer:
d = { i:"1970-01-01 00:00:00.{:0>9}".format(i) for i in df.index}
df.index = pd.Series(df.index).replace(d)
Tested with the following:
tem = pd.DataFrame({'0':[1,2,3,4,5],'1':[3,4,5,6,7]},columns=['0','1'])
d = {0:"asdf",1:"asdf",2:"sdfs",3:"sdfs",4:"sdfs"}
tem.index = pd.Series(tem.index).replace(d)
tem prints:
0 1
0 1 3
1 2 4
2 3 5
3 4 6
4 5 7
d prints:
{0: 'asdf', 1: 'asdf', 2: 'sdfs', 3: 'sdfs', 4: 'sdfs'}
The resulting tem prints:
0 1
asdf 1 3
asdf 2 4
sdfs 3 5
sdfs 4 6
sdfs 5 7
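If the goal is literally whole-second steps (1970-01-01 00:00:00, then 00:00:01, ...), a minimal sketch, assuming the integer index should be read as seconds rather than nanoseconds:
import pandas as pd

# reinterpret the 0, 1, 2, ... index as seconds since the epoch
df.index = pd.to_datetime(df.index, unit='s')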
I'm working with a dataframe that has one messy date column with irregular formats, i.e.:
date
0 19.01.01
1 19.02.01
2 1991/01/01
3 1996-01-01
4 1996-06-30
5 1995-12-31
6 1997-01-01
Is it possible to convert it to the standard format XXXX-XX-XX, which represents year-month-day? Thank you. The desired result is:
date
0 2019-01-01
1 2019-02-01
2 1991-01-01
3 1996-01-01
4 1996-06-30
5 1995-12-31
6 1997-01-01
Use pd.to_datetime with yearfirst=True
Ex:
df = pd.DataFrame({"date": ['19.01.01', '19.02.01', '1991/01/01', '1996-01-01', '1996-06-30', '1995-12-31', '1997-01-01']})
df['date'] = pd.to_datetime(df['date'], yearfirst=True).dt.strftime("%Y-%m-%d")
print(df)
Output:
date
0 2019-01-01
1 2019-02-01
2 1991-01-01
3 1996-01-01
4 1996-06-30
5 1995-12-31
6 1997-01-01
It depends on the format; the most general solution is to specify each format and use Series.combine_first:
date1 = pd.to_datetime(df['date'], format='%y.%m.%d', errors='coerce')
date2 = pd.to_datetime(df['date'], format='%Y/%m/%d', errors='coerce')
date3 = pd.to_datetime(df['date'], format='%Y-%m-%d', errors='coerce')
df['date'] = date1.combine_first(date2).combine_first(date3)
print (df)
date
0 2019-01-01
1 2019-02-01
2 1991-01-01
3 1996-01-01
4 1996-06-30
5 1995-12-31
6 1997-01-01
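On pandas 2.x there is also format='mixed', which infers the format for each element individually; a minimal sketch, assuming every row is unambiguous under year-first parsing:
pd.to_datetime(df['date'], format='mixed', yearfirst=True)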
Try the following to normalize the separators (note the dot must go in a character class, or be escaped, so it doesn't match any character):
df['date'] = df['date'].replace(r'[/.]', '-', regex=True)
Then use pd.to_datetime():
pd.to_datetime(df['date'])
Output:
0 2001-01-19
1 2001-02-19
2 1991-01-01
3 1996-01-01
4 1996-06-30
5 1995-12-31
6 1997-01-01
Name: date, dtype: datetime64[ns]
I have a dataframe containing two columns: id and val.
df = pd.DataFrame ({'id': [1,1,1,2,2,2,3,3,3,3], 'val' : np.random.randn(10)})
id val
0 1 2.644347
1 1 0.378770
2 1 -2.107230
3 2 -0.043051
4 2 0.115948
5 2 0.054485
6 3 0.574845
7 3 -0.228612
8 3 -2.648036
9 3 0.569929
And I want to apply a custom function to every val according to id. Let's say I want to apply min-max scaling. This is how I would do it using a for loop:
df['scaled'] = 0
ids = df.id.drop_duplicates()
for i in range(len(ids)):
    df1 = df[df.id == ids.iloc[i]]
    df1['scaled'] = (df1.val - df1.val.min()) / (df1.val.max() - df1.val.min())
    df.loc[df.id == ids.iloc[i], 'scaled'] = df1['scaled']
And the result is:
id val scaled
0 1 0.457713 1.000000
1 1 -0.464513 0.000000
2 1 0.216352 0.738285
3 2 0.633652 0.990656
4 2 -1.099065 0.000000
5 2 0.649995 1.000000
6 3 -0.251099 0.306631
7 3 -1.003295 0.081387
8 3 2.064389 1.000000
9 3 -1.275086 0.000000
How can I do this faster without a loop?
You can do this with groupby:
In [6]: def minmaxscale(s): return (s - s.min()) / (s.max() - s.min())
In [7]: df.groupby('id')['val'].apply(minmaxscale)
Out[7]:
0 0.000000
1 1.000000
2 0.654490
3 1.000000
4 0.524256
5 0.000000
6 0.000000
7 0.100238
8 0.014697
9 1.000000
Name: val, dtype: float64
(Note that np.ptp(), peak-to-peak, can be used in place of s.max() - s.min().)
This applies the function minmaxscale() to each smaller-sized Series of val, grouped by id.
Taking the first group, for example:
In [11]: s = df[df.id == 1]['val']
In [12]: s
Out[12]:
0 0.002722
1 0.656233
2 0.430438
Name: val, dtype: float64
In [13]: s.max() - s.min()
Out[13]: 0.6535106879021447
In [14]: (s - s.min()) / (s.max() - s.min())
Out[14]:
0 0.00000
1 1.00000
2 0.65449
Name: val, dtype: float64
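An equivalent spelling uses transform, which returns a result aligned on the original index, so it can be assigned straight back as a column; a minimal sketch reusing minmaxscale() from above:
df['scaled'] = df.groupby('id')['val'].transform(minmaxscale)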
A solution with sklearn's MinMaxScaler:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df['new']=np.concatenate([scaler.fit_transform(x.values.reshape(-1,1)) for y, x in df.groupby('id').val])
df
Out[271]:
id val scaled new
0 1 0.457713 1.000000 1.000000
1 1 -0.464513 0.000000 0.000000
2 1 0.216352 0.738285 0.738284
3 2 0.633652 0.990656 0.990656
4 2 -1.099065 0.000000 0.000000
5 2 0.649995 1.000000 1.000000
6 3 -0.251099 0.306631 0.306631
7 3 -1.003295 0.081387 0.081387
8 3 2.064389 1.000000 1.000000
9 3 -1.275086 0.000000 0.000000
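Note that this relies on df already being sorted by id: df.groupby('id').val yields the groups in sorted key order and the per-group arrays are simply concatenated, so with unsorted ids the scaled values would land on the wrong rows.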
I have a DF like this:
ID Time
1 20:29
1 20:45
1 23:16
2 11:00
2 13:00
3 01:00
I want to create a new column that puts a 1 next to the largest time value within each ID grouping like so:
ID Time Value
1 20:29 0
1 20:45 0
1 23:16 1
2 11:00 0
2 13:00 1
3 01:00 1
I know the answer involves a groupby mechanism and have been fiddling around with something like:
df.groupby('ID')['Time'].max() = 1
The idea is to write an anonymous function that operates on each of your groups and feed this to your groupby using apply:
df['Value'] = df.groupby('ID', as_index=False).apply(lambda x: x.Time == max(x.Time)).values
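Since the comparison yields booleans, append .astype(int) if you want the 0/1 flags shown in the question.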
Assuming that your 'Time' column is already datetime64, you want to group by the 'ID' column and then call transform to apply a lambda, creating a series whose index is aligned with your original df:
In [92]:
df['Value'] = df.groupby('ID')['Time'].transform(lambda x: (x == x.max())).dt.nanosecond
df
Out[92]:
ID Time Value
0 1 2015-11-20 20:29:00 0
1 1 2015-11-20 20:45:00 0
2 1 2015-11-20 23:16:00 1
3 2 2015-11-20 11:00:00 0
4 2 2015-11-20 13:00:00 1
5 3 2015-11-20 01:00:00 1
The dt.nanosecond call is needed because, for some reason, the dtype returned is datetime rather than boolean:
In [93]:
df.groupby('ID')['Time'].transform(lambda x: (x == x.max()))
Out[93]:
0 1970-01-01 00:00:00.000000000
1 1970-01-01 00:00:00.000000000
2 1970-01-01 00:00:00.000000001
3 1970-01-01 00:00:00.000000000
4 1970-01-01 00:00:00.000000001
5 1970-01-01 00:00:00.000000001
Name: Time, dtype: datetime64[ns]
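On recent pandas versions the comparison inside transform comes back as a plain boolean, so a minimal sketch without the nanosecond workaround (still assuming 'Time' is datetime64, as above):
df['Value'] = (df['Time'] == df.groupby('ID')['Time'].transform('max')).astype(int)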