Copy a single row value and apply it as a column using Pandas - python-3.x

My dataset looks like this:
# df1 - minute based dataset
date Open
2018-01-01 00:00:00 1.0536
2018-01-01 00:01:00 1.0527
2018-01-01 00:02:00 1.0558
2018-01-01 00:03:00 1.0534
2018-01-01 00:04:00 1.0524
...
What I want to do is get the value at 05:00:00 each day, create a new column called OpenVal_5AM, and put that day's corresponding value in the column. The new df will look like this:
# df2 - minute based dataset with 05:00:00 Open value
date Open OpenVal_5AM
2018-01-01 00:00:00 1.0536 1.0133
2018-01-01 00:01:00 1.0527 1.0133
2018-01-01 00:02:00 1.0558 1.0133
2018-01-01 00:03:00 1.0534 1.0133
2018-01-01 00:04:00 1.0524 1.0133
...
Since this is minute-based data, the new OpenVal_5AM column will hold 1440 identical data points for each day. That is because we are just grabbing the value at one point in time per day and broadcasting it into a new column.
What did I do?
I used this step:
df['OpenVal_5AM'] = df.groupby(df.date.dt.date,sort=False).Open.dt.hour.between(5, 5)
That's the closest I could come but it does not work.

Here's my suggestion:
df['OpenVal_5AM'] = df.apply(lambda r: df.Open.loc[r.name.replace(hour=5, minute=0)], axis=1)
This assumes date is the DatetimeIndex, so r.name is each row's timestamp and replace(hour=5, minute=0) points it at that day's 05:00:00 row.
Disclaimer: I didn't test it with a huge dataset, so I don't know how it will perform in that situation.
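For larger frames, a vectorized alternative is to collect each day's 05:00 row once and map it back by date. A minimal sketch, assuming the column names from the question (the random Open values are only illustrative):

```python
import numpy as np
import pandas as pd

# Two days of minute-based data with a 'date' column, as in the question.
rng = np.random.default_rng(0)
idx = pd.date_range('2018-01-01', periods=2 * 1440, freq='min')
df = pd.DataFrame({'date': idx, 'Open': rng.random(len(idx))})

# Grab the single 05:00:00 row of each day...
five_am = df[df['date'].dt.hour.eq(5) & df['date'].dt.minute.eq(0)]

# ...then broadcast it across every row of that day via a date -> value map.
mapping = dict(zip(five_am['date'].dt.date, five_am['Open']))
df['OpenVal_5AM'] = df['date'].dt.date.map(mapping)
```

This avoids the per-row `.apply` lookup, at the cost of one extra pass to build the mapping.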

Related

Dividing two dataframes gives NaN

I have two dataframes, one with a metric as of the last day of the month. The other contains a metric summed for the whole month. The former (monthly_profit) looks like this:
profit
yyyy_mm_dd
2018-01-01 8797234233.0
2018-02-01 3464234233.0
2018-03-01 5676234233.0
...
2019-10-01 4368234233.0
While the latter (monthly_employees) looks like this:
employees
yyyy_mm_dd
2018-01-31 924358
2018-02-28 974652
2018-03-31 146975
...
2019-10-31 255589
I want to get profit per employee, so I've done this:
profit_per_employee = (monthly_profit['profit']/monthly_employees['employees'])*100
This is the output that I get:
yyyy_mm_dd
2018-01-01 NaN
2018-01-31 NaN
2018-02-01 NaN
2018-02-28 NaN
How could I fix this? The reason that one dataframe is the last day of the month and the other is the first day of the month is due to rolling vs non-rolling data.
monthly_profit is the result of grouping and summing daily profit data:
monthly_profit = df.groupby(['yyyy_mm_dd'])[['profit']].sum()
monthly_profit = monthly_profit.resample('MS').sum()
While monthly_employees is a running total, so I need to take the current value for the last day of each month:
monthly_employees = df.groupby(['yyyy_mm_dd'])[['employees']].sum()
monthly_employees = monthly_employees.groupby([monthly_employees.index.year, monthly_employees.index.month]).tail(1)
Change MS to M to resample to month end instead of month start, so that both DatetimeIndexes match and the division can align:
monthly_profit = monthly_profit.resample('M').sum()
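The NaN output comes from index alignment: dividing two Series aligns them on their indexes, and month-start vs month-end dates never match. A tiny sketch (the numbers are made up) showing the problem and the fix:

```python
import pandas as pd

# Month-start profit vs month-end headcount, mimicking the question.
profit = pd.Series([100.0, 200.0],
                   index=pd.to_datetime(['2018-01-01', '2018-02-01']))
employees = pd.Series([10, 20],
                      index=pd.to_datetime(['2018-01-31', '2018-02-28']))

# Division aligns on the index; no dates match, so every value is NaN.
misaligned = profit / employees

# Resampling profit to month end ('M') makes the indexes line up.
aligned = profit.resample('M').sum() / employees
```

(Recent pandas versions prefer the alias 'ME' over 'M' for month end, but 'M' still works.)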

How to generate Fixed Minute based DateTime using Pandas

I need help generating Minute based time-range for a pre-defined Date.
The Date range values will change so I should be able to update it.
I also want to exclude Friday and Saturday from the generated data.
What did I do?
I successfully generated the date-range by doing this:
pd.date_range(start='1/1/2017', end='8/06/2019', freq='T')
But how do I add Minute data and exclude Friday and Saturday?
Once this is done, I want to create a column called 'TIME_MIN' and assign it to a df.
Could you please help?
You can exclude Friday and Saturday by filtering on the day name (dt.day_name(); the older dt.weekday_name accessor was removed in pandas 0.25):
df = pd.DataFrame({
'time': pd.date_range(start='1/1/2017', end='8/06/2019', freq='T')
})
df = df.loc[~df['time'].dt.day_name().isin(['Friday', 'Saturday'])]
Output:
time
0 2017-01-01 00:00:00
1 2017-01-01 00:01:00
2 2017-01-01 00:02:00
3 2017-01-01 00:03:00
4 2017-01-01 00:04:00
5 2017-01-01 00:05:00
6 2017-01-01 00:06:00
7 2017-01-01 00:07:00
...
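To also cover the 'TIME_MIN' column from the question, the same filter can be applied with the range stored directly under that name. A short sketch over one week only, so it runs quickly (the real range would end at 8/06/2019):

```python
import pandas as pd

# Minute range for one week, stored under the column name from the question.
rng = pd.date_range(start='1/1/2017', end='1/8/2017', freq='min')
df = pd.DataFrame({'TIME_MIN': rng})

# Drop every minute that falls on a Friday or Saturday.
df = df[~df['TIME_MIN'].dt.day_name().isin(['Friday', 'Saturday'])]
```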

How to join Minute based time-range with Date using Pandas?

My dataset df looks like this:
DateTimeVal Open
2017-01-01 17:00:00 5.1532
2017-01-01 17:01:00 5.3522
2017-01-01 17:02:00 5.4535
2017-01-01 17:03:00 5.3567
2017-01-01 17:04:00 5.1512
....
It is minute-based data.
The Time value starts at 17:00:00, but I want it to start at 00:00:00 and run, minute by minute, up to 23:59:00.
The current Time starts at 17:00:00, increments by one minute, and ends at 16:59:00 the next day. There are 1440 rows per day, so I can confirm it is minute-based 24-hour data.
My new df should looks like this:
DateTimeVal Open
2017-01-01 00:00:00 5.1532
2017-01-01 00:01:00 5.3522
2017-01-01 00:02:00 5.4535
2017-01-01 00:03:00 5.3567
2017-01-01 00:04:00 5.1512
....
Here, we did not change anything except the Time part.
What did I do?
My logic was to remove the Time part and then populate it with the new Time.
Here is what I did:
pd.DatetimeIndex(df['DateTimeVal'].astype(str).str.rsplit(' ', 1).str[0], dayfirst=True)
But I do not know how to add the new Time data. Could you please help?
How about subtracting 17 hours from your DateTimeVal:
df['DateTimeVal'] -= pd.Timedelta(hours=17)
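A minimal sketch of that shift on a tiny frame (the Open values come from the question's sample); subtracting 17 hours maps 17:00:00 back to 00:00:00 of the same calendar day for the first rows, and the overnight rows roll back accordingly:

```python
import pandas as pd

# Three minutes of the question's data, starting at 17:00.
df = pd.DataFrame({
    'DateTimeVal': pd.date_range('2017-01-01 17:00', periods=3, freq='min'),
    'Open': [5.1532, 5.3522, 5.4535],
})

# Shift the whole datetime column back by 17 hours.
df['DateTimeVal'] -= pd.Timedelta(hours=17)
```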

Adding Minutes to Pandas DatetimeIndex

I have an index that contain dates.
DatetimeIndex(['2004-01-02', '2004-01-05', '2004-01-06', '2004-01-07',
'2004-01-08', '2004-01-09', '2004-01-12', '2004-01-13',
'2004-01-14', '2004-01-15',
...
'2015-12-17', '2015-12-18', '2015-12-21', '2015-12-22',
'2015-12-23', '2015-12-24', '2015-12-28', '2015-12-29',
'2015-12-30', '2015-12-31'],
dtype='datetime64[ns]', length=3021, freq=None)
Now for each day I would like to generate every minute (24*60=1440 minutes) within each day and make an index with all days and minutes.
The result should look like:
['2004-01-02 00:00:00', '2004-01-02 00:01:00', ..., '2004-01-02 23:59:00',
'2004-01-03 00:00:00', '2004-01-03 00:01:00', ..., '2004-01-03 23:59:00',
...
'2015-12-31 00:00:00', '2015-12-31 00:01:00', ..., '2015-12-31 23:59:00']
Is there a smart trick for this?
You should be able to use .asfreq() here:
>>> import pandas as pd
>>> days = pd.date_range(start='2018-01-01', periods=10)
>>> df = pd.DataFrame(list(range(len(days))), index=days)
>>> df.asfreq('min')
0
2018-01-01 00:00:00 0.0
2018-01-01 00:01:00 NaN
2018-01-01 00:02:00 NaN
2018-01-01 00:03:00 NaN
2018-01-01 00:04:00 NaN
2018-01-01 00:05:00 NaN
2018-01-01 00:06:00 NaN
# ...
>>> df.shape
(10, 1)
>>> df.asfreq('min').shape
(12961, 1)
If that doesn't work for some reason, you might also want to have a look at pd.MultiIndex.from_product(), then pd.to_datetime() on the combined result.
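A sketch of that pd.MultiIndex.from_product() route; note one difference from .asfreq(): it expands only the days actually present in the index, so gaps such as skipped weekends stay unfilled (a two-day sample with a gap is used here for illustration):

```python
import pandas as pd

# Two non-consecutive days, mimicking the gapped index from the question.
days = pd.DatetimeIndex(['2004-01-02', '2004-01-05'])

# The 1440 minute offsets of a single day: 0 min, 1 min, ..., 1439 min.
minutes = pd.timedelta_range(start='0min', periods=1440, freq='min')

# Cross the two and add each offset to its day to get one long index.
full = pd.DatetimeIndex(
    [day + minute for day, minute in pd.MultiIndex.from_product([days, minutes])]
)
```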

How to get all indexes which had a particular value in last row of a Pandas DataFrame?

For a sample DataFrame like,
>>> import pandas as pd
>>> index = pd.date_range(start='1/1/2018', periods=6, freq='15T')
>>> data = ['ON_PEAK', 'OFF_PEAK', 'ON_PEAK', 'ON_PEAK', 'OFF_PEAK', 'OFF_PEAK']
>>> df = pd.DataFrame(data, index=index, columns=['tou'])
>>> df
tou
2018-01-01 00:00:00 ON_PEAK
2018-01-01 00:15:00 OFF_PEAK
2018-01-01 00:30:00 ON_PEAK
2018-01-01 00:45:00 ON_PEAK
2018-01-01 01:00:00 OFF_PEAK
2018-01-01 01:15:00 OFF_PEAK
How do I get all indexes for which the tou value is not ON_PEAK but the row before them is ON_PEAK, i.e. the output would be:
['2018-01-01 00:15:00', '2018-01-01 01:00:00']
Or, if it's easier, get all rows with ON_PEAK plus the first row after them, i.e.:
['2018-01-01 00:00:00', '2018-01-01 00:15:00', '2018-01-01 00:30:00', '2018-01-01 00:45:00', '2018-01-01 01:00:00']
You need to find rows where tou is not ON_PEAK and the previous tou, obtained with pandas.shift(), is ON_PEAK. Note that positive values in shift give the nth previous value and negative values give the nth next value in the dataframe.
df.loc[(df['tou']!='ON_PEAK') & (df['tou'].shift(1)=='ON_PEAK')]
Output:
tou
2018-01-01 00:15:00 OFF_PEAK
2018-01-01 01:00:00 OFF_PEAK
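The question's second formulation (all ON_PEAK rows plus the first row after each of them) can be sketched with the same shift() idea, OR-ing the ON_PEAK mask with a shifted copy of itself:

```python
import pandas as pd

# The sample frame from the question.
index = pd.date_range(start='1/1/2018', periods=6, freq='15min')
data = ['ON_PEAK', 'OFF_PEAK', 'ON_PEAK', 'ON_PEAK', 'OFF_PEAK', 'OFF_PEAK']
df = pd.DataFrame(data, index=index, columns=['tou'])

# True on ON_PEAK rows; the shifted mask is True on the row right after one.
mask = df['tou'] == 'ON_PEAK'
result = df.loc[mask | mask.shift(1, fill_value=False)]
```

This keeps the first five rows of the sample, matching the second expected output above.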
