My dataset df looks like this:
DateTimeVal Open
2017-01-01 17:00:00 5.1532
2017-01-01 17:01:00 5.3522
2017-01-01 17:02:00 5.4535
2017-01-01 17:03:00 5.3567
2017-01-01 17:04:00 5.1512
....
It is minute-based data.
The time values currently start at 17:00:00, increment by one minute, and end at 16:59:00 the next day. There are 1440 rows, so I can confirm it is 24 hours of minute-based data. I want to change only the time values so they start at 00:00:00 and run through 23:59:00.
My new df should look like this:
DateTimeVal Open
2017-01-01 00:00:00 5.1532
2017-01-01 00:01:00 5.3522
2017-01-01 00:02:00 5.4535
2017-01-01 00:03:00 5.3567
2017-01-01 00:04:00 5.1512
....
Here, nothing changed except the time part.
What did I do?
My logic was to remove the time and then populate the rows with new time values.
Here is what I did:
pd.DatetimeIndex(df['DateTimeVal'].astype(str).str.rsplit(' ', 1).str[0], dayfirst=True)
But I do not know how to add the new Time data. Could you please help?
How about subtracting 17 hours from your DateTimeVal:
df['DateTimeVal'] -= pd.Timedelta(hours=17)
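A minimal sketch of that fix on a few sample rows shaped like the question's data:

```python
import pandas as pd

df = pd.DataFrame({
    'DateTimeVal': pd.to_datetime(['2017-01-01 17:00:00',
                                   '2017-01-01 17:01:00',
                                   '2017-01-02 16:59:00']),
    'Open': [5.1532, 5.3522, 5.1512],
})

# Shift every timestamp back 17 hours: 17:00:00 becomes 00:00:00,
# and the next day's 16:59:00 becomes 23:59:00
df['DateTimeVal'] -= pd.Timedelta(hours=17)
```

The subtraction also rewinds the date for the post-midnight rows, which is exactly what keeps each 24-hour session on a single calendar day.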
I have two dataframes: one with a metric summed for the whole month, and another with the metric as of the last day of the month. The former (monthly_profit) looks like this:
profit
yyyy_mm_dd
2018-01-01 8797234233.0
2018-02-01 3464234233.0
2018-03-01 5676234233.0
...
2019-10-01 4368234233.0
While the latter (monthly_employees) looks like this:
employees
yyyy_mm_dd
2018-01-31 924358
2018-02-28 974652
2018-03-31 146975
...
2019-10-31 255589
I want to get profit per employee, so I've done this:
profit_per_employee = (monthly_profit['profit']/monthly_employees['employees'])*100
This is the output that I get:
yyyy_mm_dd
2018-01-01 NaN
2018-01-31 NaN
2018-02-01 NaN
2018-02-28 NaN
How could I fix this? The reason that one dataframe is the last day of the month and the other is the first day of the month is due to rolling vs non-rolling data.
monthly_profit is the result of grouping and summing daily profit data:
monthly_profit = df.groupby(['yyyy_mm_dd'])[['profit']].sum()
monthly_profit = monthly_profit.resample('MS').sum()
monthly_employees is a running total, so I need to take the current value on the last day of each month:
monthly_employees = df.groupby(['yyyy_mm_dd'])[['employees']].sum()
monthly_employees = monthly_employees.groupby([monthly_employees.index.year, monthly_employees.index.month]).tail(1)
Change MS to M to resample to month ends, so that both DatetimeIndex objects match:
monthly_profit = monthly_profit.resample('M').sum()
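A small sketch of why that alignment matters, with toy numbers (not the real data); note that recent pandas versions spell the month-end alias 'ME' instead of 'M':

```python
import pandas as pd

profit = pd.Series([100.0, 200.0],
                   index=pd.to_datetime(['2018-01-01', '2018-02-01']),
                   name='profit')
employees = pd.Series([10, 20],
                      index=pd.to_datetime(['2018-01-31', '2018-02-28']),
                      name='employees')

# Division on misaligned indexes yields all-NaN; re-anchor profit to
# month-end dates so the two indexes line up before dividing
profit = profit.resample('M').sum()
ratio = profit / employees
```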
My dataset looks like this:
# df1 - minute based dataset
date Open
2018-01-01 00:00:00 1.0536
2018-01-01 00:01:00 1.0527
2018-01-01 00:02:00 1.0558
2018-01-01 00:03:00 1.0534
2018-01-01 00:04:00 1.0524
...
What I want to do is get the value at 05:00:00 daily and create a new column called, OpenVal_5AM and put that corresponding value on that column. The new df will look like this:
# df2 - minute based dataset with 05:00:00 Open value
date Open OpenVal_5AM
2018-01-01 00:00:00 1.0536 1.0133
2018-01-01 00:01:00 1.0527 1.0133
2018-01-01 00:02:00 1.0558 1.0133
2018-01-01 00:03:00 1.0534 1.0133
2018-01-01 00:04:00 1.0524 1.0133
...
Since this is minute-based data, the new column OpenVal_5AM will contain 1440 identical data points for each day: we are just grabbing the value at one point in time per day and broadcasting it into a new column.
What did I do?
I used this step:
df['OpenVal_5AM'] = df.groupby(df.date.dt.date,sort=False).Open.dt.hour.between(5, 5)
That is the closest I could get, but it does not work.
Here's my suggestion:
df['OpenVal_5AM'] = df.apply(lambda r: df.Open.loc[r.name.replace(hour=5, minute=0)], axis=1)
Disclaimer: I didn't test it with a huge dataset; so I don't know how it'll perform in that situation.
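An alternative, vectorized sketch that avoids the per-row .loc lookups (toy values assumed): mask everything except the 05:00:00 rows, then broadcast each day's single value with a groupby transform.

```python
import numpy as np
import pandas as pd

# Two full days of minute data, indexed by timestamp
rng = pd.date_range('2018-01-01', periods=2 * 1440, freq='T')
df = pd.DataFrame({'Open': np.arange(len(rng), dtype=float)}, index=rng)

# Keep only each day's 05:00:00 value, NaN elsewhere...
is_5am = (df.index.hour == 5) & (df.index.minute == 0)
# ...then spread that value across all 1440 rows of its day
# ('first' returns the first non-null value per group)
df['OpenVal_5AM'] = (df['Open'].where(is_5am)
                       .groupby(df.index.date)
                       .transform('first'))
```

This does one grouped pass over the frame instead of one index lookup per row, which should matter on a full year of minute data.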
I need help generating a minute-based time range for a pre-defined date range.
The date-range endpoints will change, so I should be able to update them.
I also want to exclude Friday and Saturday from the generated data.
What did I do?
I successfully generated the date-range by doing this:
pd.date_range(start='1/1/2017', end='8/06/2019', freq='T')
But how do I exclude Friday and Saturday from the generated minute data?
Once this is done, I want to put the result in a column called 'TIME_MIN' and assign it to a df.
Could you please help?
You can exclude Friday and Saturday using:
df = pd.DataFrame({
'time': pd.date_range(start='1/1/2017', end='8/06/2019', freq='T')
})
df.loc[~df['time'].dt.day_name().isin(['Friday', 'Saturday'])]
Output:
time
0 2017-01-01 00:00:00
1 2017-01-01 00:01:00
2 2017-01-01 00:02:00
3 2017-01-01 00:03:00
4 2017-01-01 00:04:00
5 2017-01-01 00:05:00
6 2017-01-01 00:06:00
7 2017-01-01 00:07:00
...
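Putting it together with the TIME_MIN column the question asks for (date range shortened here for illustration); day_name() is the current spelling of the deprecated weekday_name accessor:

```python
import pandas as pd

df = pd.DataFrame({
    'TIME_MIN': pd.date_range(start='1/1/2017', end='1/14/2017', freq='T')
})
# Drop every minute that falls on a Friday or Saturday
df = df.loc[~df['TIME_MIN'].dt.day_name().isin(['Friday', 'Saturday'])]
```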
I have a huge DataFrame that looks as follows (this is just an example to illustrate the problem):
id timestamp target_time interval
1 08:00:00 10:20:00 (10-11]
1 08:30:00 10:21:00 (10-11]
1 09:10:00 11:30:00 (11-12]
2 09:15:00 10:15:00 (10-11]
2 09:35:00 10:11:00 (10-11]
3 09:45:00 11:12:00 (11-12]
...
I would like to create a series looking as follows:
interval timestamp unique_ids
(10-11] 08:00:00 1
08:30:00 1
09:15:00 1
09:35:00 1
(11-12] 09:10:00 1
09:45:00 2
The objective is to count, for each time interval, how many unique ids had their corresponding target_time within the interval at their timestamp. Note that the target_time for each id can change at different timestamps. For instance, for the id 1 the interval is (10-11] from 08:00:00 to 08:30:00, but then it changes to (11-12] at 09:10:00. Therefore, at 09:15:00 I do not want to count the id 1 in the resulting Series.
I tried a groupby -> expanding -> np.unique approach, but it does not produce the result I want:
df.set_index('timestamp').groupby('interval').id.expanding().apply(lambda x: np.unique(x).shape[0])
interval timestamp unique_ids
(10-11] 08:00:00 1
08:30:00 1
09:15:00 2
09:35:00 2
(11-12] 09:10:00 1
09:45:00 2
Any hint on how I can approach this problem? I want to use pandas routines as much as possible in order to reduce computation time, since the DataFrame has 1453076 rows...
Many thanks in advance!
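One way to sketch the intended logic (plain Python, not vectorized, so it may be slow at 1.4M rows): walk the rows in timestamp order, remember each id's most recent interval, and count how many ids are currently in the row's interval. Sample data taken from the question:

```python
import pandas as pd

df = pd.DataFrame({
    'id':        [1, 1, 1, 2, 2, 3],
    'timestamp': ['08:00:00', '08:30:00', '09:10:00',
                  '09:15:00', '09:35:00', '09:45:00'],
    'interval':  ['(10-11]', '(10-11]', '(11-12]',
                  '(10-11]', '(10-11]', '(11-12]'],
})

df = df.sort_values('timestamp')
current = {}   # id -> most recent interval seen so far
counts = []
for row in df.itertuples():
    current[row.id] = row.interval          # id moves to this interval
    counts.append(sum(v == row.interval for v in current.values()))
df['unique_ids'] = counts
result = df.set_index(['interval', 'timestamp'])['unique_ids'].sort_index()
```

At 09:15:00, id 1 has already moved to (11-12], so the (10-11] count stays at 1, matching the desired output rather than the expanding/np.unique result.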
I have a dataset where the transaction date is stored as YYYY-MM-DD 00:00:00 and the transaction time is stored as 1900-01-01 HH:MM:SS
I need to truncate these timestamps and then either leave as is or convert to a singular timestamp. I've tried several methods and all continue to return the full timestamp. Thoughts?
Use split and pd.to_datetime:
df = pd.DataFrame({'TransDate':['2015-01-01 00:00:00','2015-01-02 00:00:00','2015-01-03 00:00:00'],
'TransTime':['1900-01-01 07:00:00','1900-01-01 08:30:00','1900-01-01 09:45:15']})
df['Date'] = (pd.to_datetime(df['TransDate'].str.split().str[0] +
' ' +
df['TransTime'].str.split().str[1]))
Output:
TransDate TransTime Date
0 2015-01-01 00:00:00 1900-01-01 07:00:00 2015-01-01 07:00:00
1 2015-01-02 00:00:00 1900-01-01 08:30:00 2015-01-02 08:30:00
2 2015-01-03 00:00:00 1900-01-01 09:45:15 2015-01-03 09:45:15
print(df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
TransDate 3 non-null object
TransTime 3 non-null object
Date 3 non-null datetime64[ns]
dtypes: datetime64[ns](1), object(2)
memory usage: 152.0+ bytes
None
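An arithmetic variant also works, and avoids string splitting entirely (a sketch, assuming the same column names): take the date part of TransDate and add TransTime's offset from its 1900-01-01 anchor.

```python
import pandas as pd

df = pd.DataFrame({'TransDate': ['2015-01-01 00:00:00', '2015-01-02 00:00:00'],
                   'TransTime': ['1900-01-01 07:00:00', '1900-01-01 08:30:00']})

# Truncate TransDate to midnight, then add the time-of-day as a Timedelta
dates = pd.to_datetime(df['TransDate']).dt.normalize()
offsets = pd.to_datetime(df['TransTime']) - pd.Timestamp('1900-01-01')
df['Date'] = dates + offsets
```

This keeps everything in datetime dtypes, which is handy if the columns are already parsed rather than stored as strings.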