I have a situation where the month and day are swapped for a few dates in my dataframe. For example, here is the input:
df['work_date'].head(15)
0 2018-01-01
1 2018-02-01
2 2018-03-01
3 2018-04-01
4 2018-05-01
5 2018-06-01
6 2018-07-01
7 2018-08-01
8 2018-09-01
9 2018-10-01
10 2018-11-01
11 2018-12-01
12 2018-01-13
13 2018-01-14
14 2018-01-15
The date is stored as a string. As you can see, the date is in the format yyyy-dd-mm up to the 12th of each month and then becomes yyyy-mm-dd. The dataframe consists of 3 years' worth of data and this pattern repeats across all months of all years.
My expected output is to standardize the date to the format yyyy-mm-dd like below.
0 2018-01-01
1 2018-01-02
2 2018-01-03
3 2018-01-04
4 2018-01-05
5 2018-01-06
6 2018-01-07
7 2018-01-08
8 2018-01-09
9 2018-01-10
10 2018-01-11
11 2018-01-12
12 2018-01-13
13 2018-01-14
14 2018-01-15
Below is the code that I wrote and it gets the job done. Basically, I split the date string and do some string manipulation. However, as you can see, it's not too pretty. I am checking to see whether there is a more elegant solution than df.apply and the loops.
def func(x):
    d = x.split('-')
    # if both fields could be a month, the string is in yyyy-dd-mm order: swap day and month
    if int(d[1]) <= 12 and int(d[2]) <= 12:
        return '-'.join([d[0], d[2], d[1]])
    return x

df['work_date'] = df['work_date'].apply(func)
You could just rebuild the column, based on the fact that the dates are in order, each date appears exactly once, and all days are included consecutively:
df['Date'] = pd.date_range(df['work_date'].min(), '2018-01-15', freq='1D')
# you can pass df['work_date'].min() OR df['work_date'].max() OR a string. It really depends on what format your minimum and your maximum are in
df
Out[1]:
work_date Date
0 2018-01-01 2018-01-01
1 2018-02-01 2018-01-02
2 2018-03-01 2018-01-03
3 2018-04-01 2018-01-04
4 2018-05-01 2018-01-05
5 2018-06-01 2018-01-06
6 2018-07-01 2018-01-07
7 2018-08-01 2018-01-08
8 2018-09-01 2018-01-09
9 2018-10-01 2018-01-10
10 2018-11-01 2018-01-11
11 2018-12-01 2018-01-12
12 2018-01-13 2018-01-13
13 2018-01-14 2018-01-14
14 2018-01-15 2018-01-15
To make this more dynamic, you could also use try / except, swapping day and month in whichever endpoint fails to parse:
minn = df['work_date'].min()
maxx = df['work_date'].max()
try:
    df['Date'] = pd.date_range(minn, maxx, freq='1D')
except ValueError:
    try:
        s = maxx.split('-')
        df['Date'] = pd.date_range(minn, f'{s[0]}-{s[2]}-{s[1]}', freq='1D')
    except ValueError:
        s = minn.split('-')
        df['Date'] = pd.date_range(f'{s[0]}-{s[2]}-{s[1]}', maxx, freq='1D')
df
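If the dates are not guaranteed to be consecutive, a vectorized version of the asker's swap logic is another option (a sketch, assuming the strings are always zero-padded yyyy-xx-xx):
parts = df['work_date'].str.split('-', expand=True)
# if both trailing fields could be a month, the string is in yyyy-dd-mm order, so swap them
swap = parts[1].astype(int).le(12) & parts[2].astype(int).le(12)
df['work_date'] = (parts[0] + '-' + parts[2] + '-' + parts[1]).where(swap, df['work_date'])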
Related
I have a dataframe of start dates and closed dates of cases. I want to count how many cases are still open at the start of each case.
caseNo startDate closedDate
1 2019-01-01 2019-01-03
2 2019-01-02 2019-01-10
3 2019-01-03 2019-01-04
4 2019-01-05 2019-01-10
5 2019-01-06 2019-01-10
6 2019-01-07 2019-01-12
7 2019-01-11 2019-01-15
Output will be:
caseNo startDate closedDate numCases
1 2019-01-01 2019-01-03 0
2 2019-01-02 2019-01-10 1
3 2019-01-03 2019-01-04 1
4 2019-01-05 2019-01-10 1
5 2019-01-06 2019-01-10 2
6 2019-01-07 2019-01-12 3
7 2019-01-11 2019-01-15 1
For example, when case 6 starts, cases 2, 4 and 5 have still not been closed, so there are 3 cases outstanding.
Also, the dates are actually datetimes rather than just date. I have only included the date here for brevity.
A solution in numba should increase performance (best to test on real data):
import numpy as np
from numba import jit

@jit(nopython=True)
def nb_func(x, y):
    res = np.empty(x.size, dtype=np.int64)
    for i in range(x.size):
        # count earlier cases whose closedDate is after this case's startDate
        res[i] = np.sum(x[:i] > y[i])
    return res

df['case'] = nb_func(df['closedDate'].to_numpy(), df['startDate'].to_numpy())
print(df)
caseNo startDate closedDate case
0 1 2019-01-01 2019-01-03 0
1 2 2019-01-02 2019-01-10 1
2 3 2019-01-03 2019-01-04 1
3 4 2019-01-05 2019-01-10 1
4 5 2019-01-06 2019-01-10 2
5 6 2019-01-07 2019-01-12 3
6 7 2019-01-11 2019-01-15 1
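Note that both date columns need to be real datetimes rather than strings for these comparisons to order correctly; a minimal prep sketch:
df['startDate'] = pd.to_datetime(df['startDate'])
df['closedDate'] = pd.to_datetime(df['closedDate'])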
Alternatively, with a plain loop:
res = []
temp = pd.to_datetime(df['closedDate'])
for i, row in df.iterrows():
    # count earlier cases whose closedDate is after this row's startDate
    temp_res = np.sum(pd.to_datetime(row['startDate']) < temp.iloc[:i])
    res.append(temp_res)
Then you can add the result as a df column:
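df['numCases'] = res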
I'm totally new to time series analysis and I'm trying to work through examples available online. This is what I have currently:
# Time based features
data = pd.read_csv('Train_SU63ISt.csv')
data['Datetime'] = pd.to_datetime(data['Datetime'],format='%d-%m-%Y %H:%M')
data['Hour'] = data['Datetime'].dt.hour
data['Minute'] = data['Datetime'].dt.minute
data.head()
ID Datetime Count Hour Minute
0 0 2012-08-25 00:00:00 8 0 0
1 1 2012-08-25 01:00:00 2 1 0
2 2 2012-08-25 02:00:00 6 2 0
3 3 2012-08-25 03:00:00 2 3 0
4 4 2012-08-25 04:00:00 2 4 0
What I'm looking for is something like this:
ID Datetime Count Hour Minute 4-Hour-window
0 0 2012-08-25 00:00:00 20 4 0 00:00:00 - 04:00:00
1 1 2012-08-25 04:00:00 22 8 0 04:00:00 - 08:00:00
2 2 2012-08-25 08:00:00 18 12 0 08:00:00 - 12:00:00
3 3 2012-08-25 12:00:00 16 16 0 12:00:00 - 16:00:00
4 4 2012-08-25 16:00:00 18 20 0 16:00:00 - 20:00:00
5 5 2012-08-25 20:00:00 14 24 0 20:00:00 - 00:00:00
6 6 2012-08-25 00:00:00 20 4 0 00:00:00 - 04:00:00
7 7 2012-08-26 04:00:00 24 8 0 04:00:00 - 08:00:00
8 8 2012-08-26 08:00:00 20 12 0 08:00:00 - 12:00:00
9 9 2012-08-26 12:00:00 10 16 0 12:00:00 - 16:00:00
10 10 2012-08-26 16:00:00 18 20 0 16:00:00 - 20:00:00
11 11 2012-08-26 20:00:00 14 24 0 20:00:00 - 00:00:00
I think what you are looking for is the resample function, see here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.resample.html
Something like this should work (not tested):
sampled_data = data.resample(
'4H',
kind='timestamp',
on='Datetime',
label='left'
).sum()
The function is very similar to groupby: it groups the data into chunks based on the column specified in on=; in this case we chunk the timestamps into 4-hour windows.
Finally, you need to apply some kind of aggregation, in this case sum(), to collapse all elements of each group into a single value per time chunk.
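If you also want the textual window column from the expected output, one possible follow-up (a sketch; it assumes the resampled index holds the left edge of each 4-hour bin):
start = sampled_data.index
end = start + pd.Timedelta(hours=4)
sampled_data['4-Hour-window'] = start.strftime('%H:%M:%S') + ' - ' + end.strftime('%H:%M:%S')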
I have a list of data with the total number of orders per day and I would like to calculate the average number of orders per day of the week. For example, the average number of orders on Mondays.
0 2018-01-01 00:00:00 3162
1 2018-01-02 00:00:00 1146
2 2018-01-03 00:00:00 396
3 2018-01-04 00:00:00 848
4 2018-01-05 00:00:00 1624
5 2018-01-06 00:00:00 3052
6 2018-01-07 00:00:00 3674
7 2018-01-08 00:00:00 1768
8 2018-01-09 00:00:00 1190
9 2018-01-10 00:00:00 382
10 2018-01-11 00:00:00 3170
1. Make sure your date column is in datetime format (it looks like it already is)
2. Add a column converting the date to the day of the week
3. Group by the day of the week and take the average

df['Date'] = pd.to_datetime(df['Date'])      # Step 1
df['DayofWeek'] = df['Date'].dt.day_name()   # Step 2
df.groupby(['DayofWeek']).mean()             # Step 3
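To keep the weekdays in calendar order rather than alphabetical, one possible variant (assuming the order counts live in a column named 'Orders'):
days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
df.groupby('DayofWeek')['Orders'].mean().reindex(days)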
I want to create another column in the dataframe that holds a difference value. The difference is calculated by subtracting across rows of different columns, but only within rows sharing the same date value.
I tried looking for various stackoverflow links but didn't find the answer.
The difference should be the value of the ATA of the 2nd row minus the ATD of the 1st row, and so on, within each unique date. For example, the ATA of 1st January cannot be subtracted from the ATD of 2nd January.
For example:-
The difference column's first value should be NaN.
The second value should be 50 mins (17:13:00 - 16:23:00).
But the ATD of 02-01-2019 should not be subtracted from the ATA of 01-01-2019.
You want to apply a shift grouped by Date and then subtract the shifted values from ATD:
>>> df = pd.DataFrame({'ATA': range(0, 365), 'ATD': range(10, 375), 'Date': pd.date_range(start="2018-01-01", end="2018-12-31")})
>>> df['ATD'] = df['ATD'] / 6.0
>>> df = pd.concat([df, df, df, df])
>>> df['shifted_ATA'] = df.groupby('Date')['ATA'].transform('shift')
>>> df['result'] = df['ATD'] - df['shifted_ATA']
>>> df = df.sort_values(by='Date', ascending=True)
>>> df.head(20)
ATA ATD Date shifted_ATA result
0 0 1.666667 2018-01-01 NaN NaN
0 0 1.666667 2018-01-01 0.0 1.666667
0 0 1.666667 2018-01-01 0.0 1.666667
0 0 1.666667 2018-01-01 0.0 1.666667
1 1 1.833333 2018-01-02 NaN NaN
1 1 1.833333 2018-01-02 1.0 0.833333
1 1 1.833333 2018-01-02 1.0 0.833333
1 1 1.833333 2018-01-02 1.0 0.833333
2 2 2.000000 2018-01-03 2.0 0.000000
2 2 2.000000 2018-01-03 NaN NaN
2 2 2.000000 2018-01-03 2.0 0.000000
2 2 2.000000 2018-01-03 2.0 0.000000
3 3 2.166667 2018-01-04 3.0 -0.833333
3 3 2.166667 2018-01-04 3.0 -0.833333
3 3 2.166667 2018-01-04 NaN NaN
3 3 2.166667 2018-01-04 3.0 -0.833333
4 4 2.333333 2018-01-05 4.0 -1.666667
4 4 2.333333 2018-01-05 4.0 -1.666667
4 4 2.333333 2018-01-05 4.0 -1.666667
4 4 2.333333 2018-01-05 NaN NaN
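Applied to the asker's columns, the same idea might look like this (a sketch, assuming ATA and ATD hold parseable timestamps and that, per the 50-minute example, each row's ATA is reduced by the previous row's ATD within the same date):
df['Difference'] = pd.to_datetime(df['ATA']) - pd.to_datetime(df['ATD']).groupby(df['Date']).shift()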
I am working with this dataset:
TPdata:
id Tp1 Sp2 time
A 1 7 08:00:00
B 2 8 09:00:00
C 3 9 18:30:00
D 4 10 20:00:00
E 5 11 08:00:00
F 6 12 09:00:00
I would like to change the entries 08:00:00 in column time to 'early'. I thought this would work, but it doesn't:
TPdata$time[TPdata$time == 08:00:00] <- "early"
Can anyone help?