I'm trying to get dummy variables for holidays in a dataset. I have a couple of date ranges (created with pd.date_range()) containing holidays, and a dataframe to which I would like to append a dummy indicating whether the datetime of that row falls in one of the specified holiday ranges.
Small example:
import pandas as pd
import numpy as np

ChristmasBreak = list(pd.date_range('2014-12-20', '2015-01-04').date)
dates = pd.date_range('2015-01-03', '2015-01-06', freq='H')
d = {'Date': dates, 'Number': np.random.rand(len(dates))}
df = pd.DataFrame(data=d)
df.set_index('Date', inplace=True)
for i, row in df.iterrows():
    if i in ChristmasBreak:
        df.loc[i, 'Christmas'] = 1
The if condition is never True, so matching the dates doesn't work. Is there any way to do this? Alternative methods to get dummies for this case are welcome as well!
First, don't use iterrows, because it is really slow.
Better is to use dt.date with Series.isin, and last convert the boolean mask to integers - True values become 1:
df = pd.DataFrame(data=d)
df['Christmas'] = df['Date'].dt.date.isin(ChristmasBreak).astype(int)
Or use Series.between:
df['Christmas'] = df['Date'].between('2014-12-20', '2015-01-04').astype(int)
If you want to compare against a DatetimeIndex:
df = pd.DataFrame(data=d)
df.set_index('Date', inplace=True)
df['Christmas'] = np.isin(df.index.date, ChristmasBreak).astype(int)  # index.date is a NumPy array, so use np.isin
df['Christmas'] = ((df.index > '2014-12-20') & (df.index < '2015-01-04')).astype(int)
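Note that between is inclusive of both endpoints by default, while the comparisons above are strict; a sketch of the inclusive equivalent on the index:
# inclusive on both ends, matching the default behaviour of between
df['Christmas'] = ((df.index >= '2014-12-20') & (df.index <= '2015-01-04')).astype(int)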
Sample:
ChristmasBreak = pd.date_range('2014-12-20','2015-01-04').date
dates = pd.date_range('2014-12-19 20:00', '2014-12-20 05:00', freq='H')
d = {'Date': dates, 'Number': np.random.randint(10, size=len(dates))}
df = pd.DataFrame(data=d)
df['Christmas'] = df['Date'].dt.date.isin(ChristmasBreak).astype(int)
print (df)
Date Number Christmas
0 2014-12-19 20:00:00 6 0
1 2014-12-19 21:00:00 7 0
2 2014-12-19 22:00:00 0 0
3 2014-12-19 23:00:00 9 0
4 2014-12-20 00:00:00 1 1
5 2014-12-20 01:00:00 3 1
6 2014-12-20 02:00:00 1 1
7 2014-12-20 03:00:00 8 1
8 2014-12-20 04:00:00 2 1
9 2014-12-20 05:00:00 1 1
This should do what you want:
df['Christmas'] = df.index.isin(ChristmasBreak).astype(int)
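One caveat: because ChristmasBreak holds plain dates, an hourly DatetimeIndex only matches at the midnight timestamps. A sketch that normalizes the index first:
# normalize hourly timestamps to midnight so every hour of a holiday
# day is flagged, not only the 00:00 rows
df['Christmas'] = df.index.normalize().isin(ChristmasBreak).astype(int)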
I have a dataframe in the format below. I am looking to get the minimum time value of each column and save it in a list, while excluding the specific time value 00:00:00 from being the minimum of any column.
df =
10.0.0.155 192.168.1.240 192.168.0.242
0 19:48:46 16:23:40 20:14:07
1 20:15:46 16:23:39 20:14:09
2 19:49:37 16:23:20 00:00:00
3 20:15:08 00:00:00 00:00:00
4 19:48:46 00:00:00 00:00:00
5 19:47:30 00:00:00 00:00:00
6 19:49:13 00:00:00 00:00:00
7 20:15:50 00:00:00 00:00:00
8 19:45:34 00:00:00 00:00:00
9 19:45:33 00:00:00 00:00:00
I tried to use the code below, but it doesn't work:
minValues = []
for column in df:
    #print(df[column])
    if "00:00:00" in df[column]:
        minValues.append(df[column].nlargest(2).iloc[-1])
    else:
        minValues.append(df[column].min())
print(df)
print(minValues)
The idea is to replace 0 with missing values and then get the minimal timedeltas:
df1 = df.astype(str).apply(pd.to_timedelta)
s1 = df1.mask(df1.eq(pd.Timedelta(0))).min()
print (s1)
10.0.0.155 0 days 19:45:33
192.168.1.240 0 days 16:23:20
192.168.0.242 0 days 20:14:07
dtype: timedelta64[ns]
Or get the minimal datetimes and last convert the output to HH:MM:SS strings:
df1 = df.astype(str).apply(pd.to_datetime)
s2 = df1.mask(df1.eq(pd.to_datetime("00:00:00"))).min().dt.strftime('%H:%M:%S')
print (s2)
10.0.0.155 19:45:33
192.168.1.240 16:23:20
192.168.0.242 20:14:07
dtype: object
Or convert to time objects:
df1 = df.astype(str).apply(pd.to_datetime)
s3 = df1.mask(df1.eq(pd.to_datetime("00:00:00"))).min().dt.time
print (s3)
10.0.0.155 19:45:33
192.168.1.240 16:23:20
192.168.0.242 20:14:07
dtype: object
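For reference, a minimal sketch that rebuilds the sample frame from the question, so the snippets above are reproducible:
import pandas as pd

df = pd.DataFrame({
    '10.0.0.155':    ['19:48:46', '20:15:46', '19:49:37', '20:15:08', '19:48:46',
                      '19:47:30', '19:49:13', '20:15:50', '19:45:34', '19:45:33'],
    '192.168.1.240': ['16:23:40', '16:23:39', '16:23:20'] + ['00:00:00'] * 7,
    '192.168.0.242': ['20:14:07', '20:14:09'] + ['00:00:00'] * 8,
})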
Please help me with the following example.
I have a DataFrame:
data = {'Сlient': ['1', '2', '3', '3', '3', '4'],
        'date1': ['2019-11-07', '2019-11-08', '2019-11-08', '2019-11-08', '2019-11-08', '2019-11-11'],
        'date2': ['2019-11-01', '2019-11-02', '2019-11-06', '2019-11-07', '2019-11-10', '2019-11-15']}
df = pd.DataFrame(data)
I need to create a column that, for each client, takes the maximum value from that client's group of date2 values which is still less than the client's date1 value.
For example, for client 3 I need to get 2019-11-07.
Can this be done with a lambda function?
First use boolean indexing with Series.lt to filter rows where date2 is less than date1, then get the index of the maximum date2 value per client with DataFrameGroupBy.idxmax and select those rows with loc:
df[['date1','date2']] = df[['date1','date2']].apply(pd.to_datetime)
df1 = df.loc[df[df['date2'].lt(df['date1'])].groupby('Сlient')['date2'].idxmax()]
print (df1)
Сlient date1 date2
0 1 2019-11-07 2019-11-01
1 2 2019-11-08 2019-11-02
3 3 2019-11-08 2019-11-07
Another solution with filtering by DataFrame.query, sorting by DataFrame.sort_values and removing duplicates by DataFrame.drop_duplicates:
df1 = (df.query('date2 < date1')
.sort_values(['Сlient','date2'], ascending=[True, False])
.drop_duplicates('Сlient'))
print (df1)
Сlient date1 date2
0 1 2019-11-07 2019-11-01
1 2 2019-11-08 2019-11-02
3 3 2019-11-08 2019-11-07
EDIT:
Then the last step is to use Series.map:
df['date2'] = df['Сlient'].map(df1.set_index('Сlient')['date2'])
print (df)
Сlient date1 date2
0 1 2019-11-07 2019-11-01
1 2 2019-11-08 2019-11-02
2 3 2019-11-08 2019-11-07
3 3 2019-11-08 2019-11-07
4 3 2019-11-08 2019-11-07
5 4 2019-11-11 NaT
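As a hedged alternative, the filter and map steps can be collapsed into one chain with where plus a groupby transform; the date2_max column name here is just illustrative:
# mask out date2 values that are not strictly before date1, then take the
# per-client maximum of what remains ('max' skips NaT)
valid = df['date2'].where(df['date2'].lt(df['date1']))
df['date2_max'] = valid.groupby(df['Сlient']).transform('max')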
I'm working with a dataframe that has one messy date column with irregular formats, i.e.:
date
0 19.01.01
1 19.02.01
2 1991/01/01
3 1996-01-01
4 1996-06-30
5 1995-12-31
6 1997-01-01
Is it possible to convert it to the standard YYYY-MM-DD (year-month-day) format? Thank you. The desired output is:
date
0 2019-01-01
1 2019-02-01
2 1991-01-01
3 1996-01-01
4 1996-06-30
5 1995-12-31
6 1997-01-01
Use pd.to_datetime with yearfirst=True
Ex:
df = pd.DataFrame({"date": ['19.01.01', '19.02.01', '1991/01/01', '1996-01-01', '1996-06-30', '1995-12-31', '1997-01-01']})
df['date'] = pd.to_datetime(df['date'], yearfirst=True).dt.strftime("%Y-%m-%d")
print(df)
Output:
date
0 2019-01-01
1 2019-02-01
2 1991-01-01
3 1996-01-01
4 1996-06-30
5 1995-12-31
6 1997-01-01
It depends on the formats; the most general solution is to specify each format separately, parse with errors='coerce', and combine the results with Series.combine_first:
date1 = pd.to_datetime(df['date'], format='%y.%m.%d', errors='coerce')
date2 = pd.to_datetime(df['date'], format='%Y/%m/%d', errors='coerce')
date3 = pd.to_datetime(df['date'], format='%Y-%m-%d', errors='coerce')
df['date'] = date1.combine_first(date2).combine_first(date3)
print (df)
date
0 2019-01-01
1 2019-02-01
2 1991-01-01
3 1996-01-01
4 1996-06-30
5 1995-12-31
6 1997-01-01
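A small hedged follow-up: rows matching none of the listed formats stay NaT thanks to errors='coerce', so stragglers are easy to inspect:
# any dates that none of the three formats could parse remain NaT
print (df[df['date'].isna()])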
Try replacing the separators first (the dot must be escaped in the regex, otherwise it matches any character):
df['date'] = df['date'].replace(r'/|\.', '-', regex=True)
Then use pd.to_datetime() (note: without yearfirst=True, the ambiguous two-digit dates parse day-first, as the output below shows):
pd.to_datetime(df['date'])
Output:
0 2001-01-19
1 2001-02-19
2 1991-01-01
3 1996-01-01
4 1996-06-30
5 1995-12-31
6 1997-01-01
Name: date, dtype: datetime64[ns]
I need to find the next month and current month datetimes in Python.
I have a dataframe which has a date column, and I need to filter the values based on today's date:
If today's date is < 15, I need all rows with Month values starting from the current month (2019-08-01 00:00:00).
If today's date is >= 15, I need all rows with Month values starting from the next month (2019-09-01 00:00:00).
Dataframe:
PC GEO Month Values
A IN 2019-08-01 00:00:00 1
B IN 2019-08-02 00:00:00 1
C IN 2019-09-14 00:00:00 1
D IN 2019-10-01 00:00:00 1
E IN 2019-07-01 00:00:00 1
if today's date is < 15
PC GEO Month Values
A IN 2019-08-01 00:00:00 1
B IN 2019-08-02 00:00:00 1
C IN 2019-09-14 00:00:00 1
D IN 2019-10-01 00:00:00 1
if today's date is >= 15
PC GEO Month Values
C IN 2019-09-14 00:00:00 1
D IN 2019-10-01 00:00:00 1
I am getting today's day of the month as below:
from datetime import date

dat = date.today()
dat = dat.strftime("%d")
dat = int(dat)
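(As an aside, a shorter equivalent sketch, since date objects expose the day directly:
dat = date.today().day  # already an int, no strftime round-trip needed
)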
Use:
#convert column to datetimes
df['Month'] = pd.to_datetime(df['Month'])
#get today timestamp
dat = pd.to_datetime('now')
print (dat)
2019-08-27 13:40:54.272257
#convert datetime to month periods
per = df['Month'].dt.to_period('m')
#convert timestamp to period
today_per = dat.to_period('m')
#compare day and filter
if dat.day < 15:
    df = df[per >= today_per]
    print (df)
else:
    df = df[per > today_per]
    print (df)
PC GEO Month Values
2 C IN 2019-09-14 1
3 D IN 2019-10-01 1
Test with values <15:
df['Month'] = pd.to_datetime(df['Month'])
dat = pd.to_datetime('2019-08-02')
print (dat)
2019-08-02 00:00:00
per = df['Month'].dt.to_period('m')
today_per = dat.to_period('m')
if dat.day < 15:
    df = df[per >= today_per]
else:
    df = df[per > today_per]

print (df)
  PC GEO      Month  Values
0  A  IN 2019-08-01       1
1  B  IN 2019-08-02       1
2  C  IN 2019-09-14       1
3  D  IN 2019-10-01       1
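A hedged aside: the if/else can also be folded into period arithmetic, since adding an integer to a monthly Period shifts it by that many months:
# shift the cutoff to next month when today is the 15th or later
cutoff = today_per + (0 if dat.day < 15 else 1)
df = df[per >= cutoff]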
I have a pandas dataframe that has some data values by hour (which is also the index of this lookup dataframe). The dataframe looks like this:
In [1] print (df_lookup)
Out[1] 0 1.109248
1 1.102435
2 1.085014
3 1.073487
4 1.079385
5 1.088759
6 1.044708
7 0.902482
8 0.852348
9 0.995912
10 1.031643
11 1.023458
12 1.006961
...
23 0.889541
I want to multiply the values from this lookup dataframe to create a column of another dataframe, which has datetime as index.
The dataframe looks like this:
In [2] print (df)
Out[2]
Date_Label ID data-1 data-2 data-3
2015-08-09 00:00:00 1 2513.0 2502 NaN
2015-08-09 00:00:00 1 2113.0 2102 NaN
2015-08-09 01:00:00 2 2006.0 1988 NaN
2015-08-09 02:00:00 3 2016.0 2003 NaN
...
2018-07-19 23:00:00 33 3216.0 333 NaN
I want to calculate the data-3 column from data-2 column, where the weight given to 'data-2' column depends on corresponding value in df_lookup. I get the desired values by looping over the index as follows, but that is too slow:
for idx in df.index:
    df.loc[idx, 'data-3'] = df.loc[idx, 'data-2'] * df_lookup.at[idx.hour]
Is there a faster way someone could suggest?
Using .loc
df['data-2']*df_lookup.loc[df.index.hour].values
Out[275]:
Date_Label
2015-08-09 00:00:00 2775.338496
2015-08-09 00:00:00 2331.639296
2015-08-09 01:00:00 2191.640780
2015-08-09 02:00:00 2173.283042
Name: data-2, dtype: float64
Assign it back to create the new column:
df['data-3'] = df['data-2']*df_lookup.loc[df.index.hour].values
I'd probably try doing a join.
# Fix column name
df_lookup.columns = ['multiplier']
# Get hour index
df['hour'] = df.index.hour
# Join
df = df.join(df_lookup, how='left', on=['hour'])
df['data-3'] = df['data-2'] * df['multiplier']
df = df.drop(['multiplier', 'hour'], axis=1)
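A hedged variant that avoids the temporary columns, assuming df_lookup is a single-column frame keyed by hours 0-23:
# squeeze the one-column lookup frame into a Series keyed by hour,
# then translate each row's hour into its weight via Index.map
hour_weights = df_lookup.squeeze()
df['data-3'] = df['data-2'] * df.index.hour.map(hour_weights)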