How to spread time values across date pairs using Pandas

I am trying to figure out how to spread time values across pairs of dates. My date list looks like this:
date_list = ['2017-01-07',
'2017-01-08',
'2017-01-04',
'2017-01-05',
'2017-01-03',
'2017-01-04'
.... ]
Here, as you can see, the dates come in consecutive pairs. For example:
'2017-01-07' and '2017-01-08', or '2017-01-04' and '2017-01-05', etc.
Basically, the two dates in each pair are one day apart.
I also have a time value:
time_list = [
datetime.time(23, 0),
datetime.time(0, 0),
datetime.time(1, 0),
.... ]
What I am looking to do is spread the times from 23 to 1, or basically from 11 PM to 1 AM, across each date pair ('2017-01-07' and '2017-01-08', '2017-01-04' and '2017-01-05', etc.), preserving the original order of date_list with the corresponding time_list.
So the new df will look like this:
DateTimeList
2017-01-07 23:00:00
2017-01-08 00:00:00
2017-01-08 01:00:00
2017-01-04 23:00:00
2017-01-05 00:00:00
2017-01-05 01:00:00
2017-01-03 23:00:00
2017-01-04 00:00:00
2017-01-04 01:00:00
What did I do?
I selected the times in between using:
time = df.between_time('23:00:00', '01:00:00')
and then time[time.index.normalize().isin(date_list)]
However, this does not work: it does not spread the time_list across the two dates of a pair after midnight; it places the entire range from 23:00 to 01:00 on a single day. It also sorts the data.
What I want is to spread the time values across each date pair while preserving the original order of date_list with the corresponding time_list. Can you please help me solve it?

How about using datetime.datetime.combine() with some modulo logic?
import datetime

def combine_pairs(date_list, time_list):
    for i, x in enumerate(date_list):
        dt = datetime.date.fromisoformat(x)  # parse the ISO date string (Python 3.7+)
        if not i % 2:
            # first date of the pair gets the 23:00 entry
            yield datetime.datetime.combine(dt, time_list[0])
        else:
            # second date of the pair gets the 00:00 and 01:00 entries
            yield datetime.datetime.combine(dt, time_list[1])
            yield datetime.datetime.combine(dt, time_list[2])
Demo:
>>> from pprint import pprint
>>> date_list = ['2017-01-07',
... '2017-01-08',
... '2017-01-04',
... '2017-01-05',
... '2017-01-03',
... '2017-01-04',]
>>> time_list = [
... datetime.time(23, 0),
... datetime.time(0, 0),
... datetime.time(1, 0),]
>>> pprint(list(combine_pairs(date_list, time_list)))
[datetime.datetime(2017, 1, 7, 23, 0),
datetime.datetime(2017, 1, 8, 0, 0),
datetime.datetime(2017, 1, 8, 1, 0),
datetime.datetime(2017, 1, 4, 23, 0),
datetime.datetime(2017, 1, 5, 0, 0),
datetime.datetime(2017, 1, 5, 1, 0),
datetime.datetime(2017, 1, 3, 23, 0),
datetime.datetime(2017, 1, 4, 0, 0),
datetime.datetime(2017, 1, 4, 1, 0)]
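If a DataFrame like the desired DateTimeList above is needed, the generator output can be wrapped directly. A minimal sketch (the column name is taken from the question):
import pandas as pd

# collect the combined datetimes into a one-column DataFrame
df_out = pd.DataFrame({'DateTimeList': list(combine_pairs(date_list, time_list))})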

Related

Pandas qcut ValueError: Input array must be 1 dimensional

I was trying to categorize my values into 10 bins when I ran into this error. How can I avoid it and bin the values smoothly?
Attached are samples of the data and code.
Data
JPM
2008-01-02 NaN
2008-01-03 NaN
2008-01-04 NaN
2008-01-07 NaN
2008-01-08 NaN
... ...
2009-12-24 -0.054014
2009-12-28 0.002679
2009-12-29 -0.030015
2009-12-30 -0.019058
2009-12-31 -0.010090
505 rows × 1 columns
Code
group_names = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
discretized_roc = pd.qcut(df, 10, labels=group_names)
Pass the column JPM, and for integer indicators of the bins use labels=False:
discretized_roc = pd.qcut(df['JPM'], 10, labels=False)
If you need to select the first column by position rather than by label, use DataFrame.iloc:
discretized_roc = pd.qcut(df.iloc[:, 0], 10, labels=False)
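For context, the error occurs because a DataFrame is two-dimensional while pd.qcut expects a one-dimensional input. A minimal sketch with stand-in random data (an assumption, not the question's actual returns):
import numpy as np
import pandas as pd

df = pd.DataFrame({'JPM': np.random.randn(505)})  # stand-in for the returns column

# pd.qcut(df, 10)  # raises ValueError: Input array must be 1 dimensional
discretized_roc = pd.qcut(df['JPM'], 10, labels=False)  # a Series is 1-D, so this works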

Edit multiple values with df.at()

Why does
>>> offset = 2
>>> data = {'Value': [7, 9, 21, 22, 23, 100]}
>>> df = pd.DataFrame(data=data)
>>> df.at[:offset, "Value"] = 99
>>> df
Value
0 99
1 99
2 99
3 22
4 23
5 100
change values at indices [0, 1, 2]? I would expect only [0, 1] to be changed, to conform with regular slicing.
Like when I do
>>> arr = [0, 1, 2, 3, 4]
>>> arr[0:2]
[0, 1]
.at behaves like .loc, in that it selects rows/columns by label, and label slicing in pandas is inclusive of the endpoint. Note that .iloc, which slices on integer positions, behaves the way you would expect. See this good answer for the motivation.
Also note that the pandas documentation suggests using .at only when getting or setting a single value; for slices, use .loc.
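A minimal sketch contrasting the two indexers on the question's data:
import pandas as pd

offset = 2
df = pd.DataFrame({'Value': [7, 9, 21, 22, 23, 100]})

# label-based slicing includes the endpoint: rows 0, 1, and 2 change
df.loc[:offset, 'Value'] = 99

# position-based slicing excludes the endpoint, like a regular list slice:
df2 = pd.DataFrame({'Value': [7, 9, 21, 22, 23, 100]})
df2.iloc[:offset, df2.columns.get_loc('Value')] = 99  # only rows 0 and 1 change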
On line 4, when you write :offset with offset = 2, label slicing selects all rows from 0 to 2 inclusive. If you want to change only the third row (label 2), use the slice 2:2.

Get sum of group subset using pandas groupby

I have a dataframe as shown. Using Python, I want to get the sum of 'Value' for each 'Id' group up to the first occurrence of 'Stage' 12.
df = pd.DataFrame({'Id': [1, 1, 1, 2, 2, 2, 2],
                   'Date': ['2020-04-23', '2020-04-25', '2020-04-28',
                            '2020-04-20', '2020-05-01', '2020-05-05', '2020-05-12'],
                   'Stage': [11, 12, 15, 11, 14, 12, 12],
                   'Value': [5, 4, 6, 12, 2, 8, 3]})
Id Date Stage Value
1 2020-04-23 11 5
1 2020-04-25 12 4
1 2020-04-28 15 6
2 2020-04-20 11 12
2 2020-05-01 14 2
2 2020-05-05 12 8
2 2020-05-12 12 3
My desired output:
Id Value
1 9
2 22
I would be very thankful if someone could help.
Let us use groupby with transform('idxmax') to filter the dataframe, then do another round of groupby:
idx = df['Stage'].eq(12).groupby(df['Id']).transform('idxmax')
output = df[df.index <= idx].groupby('Id')['Value'].sum().reset_index()
Detail
The transform with idxmax returns, for every row of a group, the index of the group's first match with 12; we then filter the df to the rows whose index is at most that value, i.e. the data up to where the first 12 shows up.
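An alternative sketch using a cumulative count of the 12s (like the idxmax approach, it assumes every Id group contains at least one 12):
import pandas as pd

df = pd.DataFrame({'Id': [1, 1, 1, 2, 2, 2, 2],
                   'Date': ['2020-04-23', '2020-04-25', '2020-04-28',
                            '2020-04-20', '2020-05-01', '2020-05-05', '2020-05-12'],
                   'Stage': [11, 12, 15, 11, 14, 12, 12],
                   'Value': [5, 4, 6, 12, 2, 8, 3]})

# keep rows before the first 12 (seen == 0) plus the first 12 itself
is12 = df['Stage'].eq(12)
seen = is12.groupby(df['Id']).cumsum()
keep = seen.eq(0) | (is12 & seen.eq(1))
print(df[keep].groupby('Id')['Value'].sum().reset_index())
#    Id  Value
# 0   1      9
# 1   2     22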

Pandas resample fill NaN

I have this df:
Timestamp List Power Energy Status
0 2020-01-01 01:05:50 [5, 5, 5] 7000 15000 online
1 2020-01-01 01:06:20 [6, 6, 6] 7500 16000 online
2 2020-01-01 01:08:30 [0, 0, 0] 5 0 offline
...
Now I want to resample it, using .resample as follows:
df2 = df.set_index('Timestamp').resample('min').?
I want the df in 1-minute intervals. For each interval I want to aggregate the rows as follows:
List: if Status is online, the last entry of the interval, else [0, 0, 0];
Power: if Status is online, the mean value of the interval, else 0;
Energy: if Status is online, the last entry of the interval, else 0;
Status: the last status of the interval.
How do I fill the NaN values that .resample outputs when an interval has no data in df? E.g., if an interval has no data, the row should be filled with Power = 0, Energy = 0, Status = offline, ...
I tried something like this:
df2 = df.set_index('Timestamp').resample('T').agg({'List': 'last',
                                                   'Power': 'mean',
                                                   'Energy': 'last',
                                                   'Status': 'last'})
and got:
Timestamp List Power Energy Status
0 2020-01-01 01:05 [5, 5, 5] (average of the interval) 15000 online
1 2020-01-01 01:06 [6, 6, 6] (average of the interval) 16000 online
2 2020-01-01 01:07 NaN NaN NaN NaN
3 2020-01-01 01:08 [0, 0, 0] 5 0 offline
Expected outcome:
Timestamp List Power Energy Status
0 2020-01-01 01:05 [5, 5, 5] (average of the interval) 15000 online
1 2020-01-01 01:06 [6, 6, 6] (average of the interval) 16000 online
2 2020-01-01 01:07 [0, 0, 0] 0 0 offline
3 2020-01-01 01:08 [0, 0, 0] 5 0 offline
There is no way to pass a fillna rule that handles each column's NA values separately during .resample().agg(), as can be seen in the docs: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.agg.html
In your case even interpolation does not work, so handle each column's NA values manually.
First, let's initialize your sample frame:
import pandas as pd

data = {"Timestamp": {"0": "2020-01-01 01:05:50",
                      "1": "2020-01-01 01:06:20",
                      "2": "2020-01-01 01:08:30"},
        "List": {"0": [5, 5, 5],
                 "1": [6, 6, 6],
                 "2": [0, 0, 0]},
        "Power": {"0": 7000,
                  "1": 7500,
                  "2": 5},
        "Energy": {"0": 15000,
                   "1": 16000,
                   "2": 0},
        "Status": {"0": "online",
                   "1": "online",
                   "2": "offline"},
        }
df = pd.DataFrame(data)
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
df = df.set_index('Timestamp').resample('T').agg({'List': 'last',
                                                  'Power': 'mean',
                                                  'Energy': 'last',
                                                  'Status': 'last'})
Now we can replace the NA values in each column separately:
df["List"] = df["List"].fillna("[0, 0, 0]")  # fillna cannot take a list value, so a string placeholder is used
df["Status"] = df["Status"].fillna('offline')
df = df.fillna(0)  # remaining numeric columns: Power and Energy
or, more conveniently, pass a dict of per-column fill values:
values = {
'List': '[0, 0, 0]',
'Status': 'offline',
'Power': 0,
'Energy': 0
}
df = df.fillna(value=values)
                          List   Power   Energy   Status
Timestamp
2020-01-01 01:05:00  [5, 5, 5]  7000.0  15000.0   online
2020-01-01 01:06:00  [6, 6, 6]  7500.0  16000.0   online
2020-01-01 01:07:00  [0, 0, 0]     0.0      0.0  offline
2020-01-01 01:08:00  [0, 0, 0]     5.0      0.0  offline

Extract Day of Week More Pythonically

I have a df with fields year, month, day, formatted as integers. I have used the following to extract the day of the week.
How can I do this more pythonically?
### First Attempt - Succeeds
lst = []
for i in zip(df['day'], df['month'], df['year']):
    lst.append(calendar.weekday(i[2], i[1], i[0]))
df['weekday'] = lst
### Second Attempt -- Fails
df['weekday'] = df.apply(lambda x: calendar.weekday(x.year, x.month, x.day))
AttributeError: ("'Series' object has no attribute 'year'", 'occurred at index cons_conf')
Try pd.to_datetime and the .dt accessor:
import pandas as pd
data = pd.DataFrame({'year': [2018, 2018, 2018], 'month': [12, 12, 12], 'day': [1, 2, 3]})
data['weekday'] = pd.to_datetime(data[['year', 'month', 'day']]).dt.weekday
print(data)
Giving:
year month day weekday
0 2018 12 1 5
1 2018 12 2 6
2 2018 12 3 0
Note that weekday is zero-indexed (Monday = 0).
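As an aside, the question's second attempt fails because DataFrame.apply defaults to axis=0 and passes each column to the lambda. A minimal sketch of the row-wise variant:
import calendar
import pandas as pd

data = pd.DataFrame({'year': [2018, 2018, 2018], 'month': [12, 12, 12], 'day': [1, 2, 3]})

# axis=1 passes each row as a Series, so x.year, x.month, x.day resolve to the row values
data['weekday'] = data.apply(lambda x: calendar.weekday(x.year, x.month, x.day), axis=1)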
