I have data like the following:
df.index value
2019-02-28 00:00:00 101
2019-02-28 00:10:00 97
2019-02-28 00:20:00 97
2019-02-28 00:30:00 96
2019-02-28 00:40:00 110
2019-02-28 00:50:00 117
2019-02-28 01:00:00 121
2019-02-28 01:10:00 114
2019-02-28 01:20:00 112
2019-02-28 01:30:00 103
2019-02-28 01:40:00 104
2019-02-28 01:50:00 105
2019-02-28 02:00:00 106
2019-02-28 02:10:00 104
2019-02-28 02:20:00 103
2019-02-28 02:30:00 97
2019-02-28 02:40:00 101
2019-02-28 02:50:00 103
2019-02-28 03:00:00 102
2019-02-28 03:10:00 101
Is there a method to help me build a new dataframe resampled to 15 min, where the full hours and the 30-minute marks take the value directly from the data above, the 15-minute marks take the average of the :10 and :20 values, and similarly the 45-minute marks take the average of the :40 and :50 values? Additionally, how can I check whether the dataframe starts on a full hour?
The part of the code I tried to use is
df_15=pd.date_range(start=df.index[0], end=df.index[-1], freq='15T')
df_15=df_15.to_frame(index=False)
for row in range(0, len(df_15+1), 6):
    mean = df.iloc[row]
    df_mean = pd.concat([df_mean, mean])
    mean = (df.iloc[row+1] + df.iloc[row+2]) / 2
    df_mean = pd.concat([df_mean, mean])
    mean = df.iloc[row+3]
    df_mean = pd.concat([df_mean, mean])
    mean = (df.iloc[row+4] + df.iloc[row+5]) / 2
    df_mean = pd.concat([df_mean, mean])
but I get an error
TypeError: Addition/subtraction of integers and integer-arrays with DatetimeArray is no longer supported. Instead of adding/subtracting n, use n * obj.freq
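For reference, the pattern repeats every hour, so it can be produced without an explicit loop. A minimal sketch (assuming df has a regular 10-minute DatetimeIndex as above): shift the index forward by 5 minutes so that each 15-minute bin contains exactly the rows that should be combined, then take the mean:

import pandas as pd

# rebuild the sample data from the question
idx = pd.date_range('2019-02-28 00:00', periods=20, freq='10min')
df = pd.DataFrame({'value': [101, 97, 97, 96, 110, 117, 121, 114, 112, 103,
                             104, 105, 106, 104, 103, 97, 101, 103, 102, 101]},
                  index=idx)

# after the 5-minute shift, :00 sits alone in the [:00, :15) bin,
# :10 and :20 fall together into [:15, :30), :30 sits alone in
# [:30, :45), and :40 and :50 fall together into [:45, :00)
df_15 = df.shift(freq='5min').resample('15min').mean()

# check whether the dataframe starts on a full hour
starts_on_hour = df.index[0].minute == 0 and df.index[0].second == 0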
I recently started working with timeseries data and I want to find the start and end times of values in a column that exceed a defined threshold of 150.
Datetime Value
0 11/30/2022 0:00 100
1 11/30/2022 0:01 110
2 11/30/2022 0:02 105
3 11/30/2022 0:03 105
4 11/30/2022 0:04 155
5 11/30/2022 0:05 160
6 11/30/2022 0:06 160
7 11/30/2022 0:07 160
8 11/30/2022 0:08 160
9 11/30/2022 0:09 165
10 11/30/2022 0:10 165
11 11/30/2022 0:11 160
12 11/30/2022 0:12 160
13 11/30/2022 0:13 150
14 11/30/2022 0:14 120
15 11/30/2022 0:15 110
16 11/30/2022 0:16 115
17 11/30/2022 0:17 115
18 11/30/2022 0:18 130
19 11/30/2022 0:19 145
20 11/30/2022 0:20 150
21 11/30/2022 0:21 155
22 11/30/2022 0:22 155
23 11/30/2022 0:23 155
24 11/30/2022 0:24 155
25 11/30/2022 0:25 155
26 11/30/2022 0:26 140
27 11/30/2022 0:27 130
28 11/30/2022 0:28 120
I want to get an output in the form of a dataframe having multiple start and end times along with the duration in seconds:
Start_Time End_Time Duration
0 2022-11-30 00:04:00 2022-11-30 00:13:00 540.0
1 2022-11-30 00:20:00 2022-11-30 00:25:00 300.0
I can compute the duration using df['Duration'] = (df['End_Time']-df['Start_Time']).dt.total_seconds() however I cannot get those start and end times. Can someone please help me out with this?
First, creating your dataframe:
import pandas as pd

start = pd.to_datetime('30-11-2022, 0:0:0', format="%d-%m-%Y, %H:%M:%S")
df = pd.DataFrame({
    "StartTime": pd.date_range(start=start, freq="1s", periods=29),
    "Value": [100,110,105,105,155,160,160,160,160,165,165,160,160,150,120,110,115,115,130,145,150,155,155,155,155,155,140,130,120]})
# EndTime is referenced below; assume each sample covers a 1-second window
df["EndTime"] = df["StartTime"] + pd.Timedelta(seconds=1)
Next, let's filter to only the values you want:
df = df.loc[df.Value >= 150].copy()  # >=, so the boundary value 150 is kept, as in the expected output
Then, let's combine the consecutive windows by shifting the start and end columns, then checking if they line up with each other:
merge_endtimes = df.EndTime[df.StartTime.shift(-1) != df.EndTime].reset_index(drop=True)
merge_starttimes = df.StartTime[df.EndTime.shift(1) != df.StartTime].reset_index(drop=True)
merged_df = pd.concat([merge_starttimes, merge_endtimes], axis=1)
merged_df['Duration'] = merged_df['EndTime'] - merged_df['StartTime']
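To express Duration in seconds, as in the expected output, the question's own formula applies:

merged_df['Duration'] = (merged_df['EndTime'] - merged_df['StartTime']).dt.total_seconds()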
Code
s = df['Value'].ge(150)
grouper = s.ne(s.shift(1)).cumsum()
df[s].groupby(grouper)['Datetime'].agg([min, max])
output:
min max
Value
2 2022-11-30 00:00:04 2022-11-30 00:00:13
4 2022-11-30 00:00:20 2022-11-30 00:00:25
change index & columns of output
(df[s].groupby(grouper)['Datetime'].agg([min, max])
.set_axis(['start_time', 'end_time'], axis=1)
.reset_index(drop=True))
result:
start_time end_time
0 2022-11-30 00:00:04 2022-11-30 00:00:13
1 2022-11-30 00:00:20 2022-11-30 00:00:25
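The Duration column from the expected output can then be added with the formula from the question:

out = (df[s].groupby(grouper)['Datetime'].agg([min, max])
       .set_axis(['Start_Time', 'End_Time'], axis=1)
       .reset_index(drop=True))
out['Duration'] = (out['End_Time'] - out['Start_Time']).dt.total_seconds()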
I have a set of 5 time series dataframes with a resolution of 15 mins, but they do not end on the same date and time. However, the starting date and time are the same, so I would prefer to clip them so that they are of the same length.
And then I would like to reshape the data to see a weekly or 14-day pattern.
I think what you mean by clipping is to resample the dates and take the last value (correct me if I'm wrong).
To resample, you can use the .resample() method from pandas (set your timestamp column as the index before using this method), followed by .last() to take the last value.
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({
... 'Timestamp':pd.date_range('2021-06-21 23:07', periods = 200, freq='15min'),
... 'Speed':range(200)
... })
>>> print(df)
Timestamp Speed
0 2021-06-21 23:07:00 0
1 2021-06-21 23:22:00 1
2 2021-06-21 23:37:00 2
3 2021-06-21 23:52:00 3
4 2021-06-22 00:07:00 4
.. ... ...
195 2021-06-23 23:52:00 195
196 2021-06-24 00:07:00 196
197 2021-06-24 00:22:00 197
198 2021-06-24 00:37:00 198
199 2021-06-24 00:52:00 199
[200 rows x 2 columns]
>>> df_grouped = df.set_index('Timestamp').resample('1H').last()
>>> print(df_grouped.head(10))
Speed
Timestamp
2021-06-21 23:00:00 3
2021-06-22 00:00:00 7
2021-06-22 01:00:00 11
2021-06-22 02:00:00 15
2021-06-22 03:00:00 19
2021-06-22 04:00:00 23
2021-06-22 05:00:00 27
2021-06-22 06:00:00 31
2021-06-22 07:00:00 35
2021-06-22 08:00:00 39
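If "clipping" instead means truncating all five frames so they cover exactly the same span, a minimal sketch (the five frames here are hypothetical stand-ins, assuming a shared start date and frequency):

import numpy as np
import pandas as pd

# hypothetical stand-ins for the five 15-minute series with different end times
dfs = [pd.DataFrame({'Speed': np.arange(n)},
                    index=pd.date_range('2021-06-21 23:07', periods=n, freq='15min'))
       for n in (200, 210, 195, 205, 220)]

common_end = min(d.index.max() for d in dfs)     # earliest final timestamp
dfs_clipped = [d.loc[:common_end] for d in dfs]  # all frames now end together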
I have two dfs as shown below.
df1:
Date t_factor
2020-02-01 5
2020-02-02 23
2020-02-03 14
2020-02-04 23
2020-02-05 23
2020-02-06 23
2020-02-07 30
2020-02-08 29
2020-02-09 100
2020-02-10 38
2020-02-11 38
2020-02-12 38
2020-02-13 70
2020-02-14 70
2020-02-15 38
2020-02-16 38
2020-02-17 70
2020-02-18 70
2020-02-19 38
2020-02-20 38
2020-02-21 70
2020-02-22 70
2020-02-23 38
2020-02-24 38
2020-02-25 70
2020-02-26 70
2020-02-27 70
df2:
From to plan score
2020-02-03 2020-02-05 start 20
2020-02-07 2020-02-08 foundation 25
2020-02-10 2020-02-12 learn 10
2020-02-14 2020-02-16 practice 20
2020-02-15 2020-02-21 exam 30
2020-02-20 2020-02-23 test 10
From the above, I would like to append the plan column to df1, based on the From and to date values in df2 and the Date value in df1.
Expected output:
output_df
Date t_factor plan
2020-02-01 5 NaN
2020-02-02 23 NaN
2020-02-03 14 start
2020-02-04 23 start
2020-02-05 23 start
2020-02-06 23 NaN
2020-02-07 30 foundation
2020-02-08 29 foundation
2020-02-09 100 NaN
2020-02-10 38 learn
2020-02-11 38 learn
2020-02-12 38 learn
2020-02-13 70 NaN
2020-02-14 70 practice
2020-02-15 38 NaN
2020-02-16 38 NaN
2020-02-17 70 exam
2020-02-18 70 exam
2020-02-19 38 exam
2020-02-20 38 NaN
2020-02-21 70 NaN
2020-02-22 70 test
2020-02-23 38 test
2020-02-24 38 NaN
2020-02-25 70 NaN
2020-02-26 70 NaN
2020-02-27 70 NaN
Note:
If there is any overlapping date, then keep plan as NaN for that date.
Example:
2020-02-14 to 2020-02-16 plan is practice.
And 2020-02-15 to 2020-02-21 plan is exam.
So there is an overlap on 2020-02-15 and 2020-02-16.
Hence plan should be NaN for that date range.
I would like to implement a function like the stub shown below:
def append_plan(df1, df2):
    ...
    return output_df
Use: (this solution applies if the From and to dates in dataframe df2 overlap and we need to choose the values from the column plan with respect to the earliest date possible)
d1 = df1.sort_values('Date')
d2 = df2.sort_values('From')
df = pd.merge_asof(d1, d2[['From', 'plan']], left_on='Date', right_on='From')
df = pd.merge_asof(df, d2[['to', 'plan']], left_on='Date', right_on='to',
                   direction='forward', suffixes=['', '_r']).drop(columns=['From', 'to'])
df['plan'] = df['plan'].mask(df['plan'].ne(df.pop('plan_r')))
Details:
Use pd.merge_asof to perform an asof merge of the dataframes d1 and d2 on the corresponding columns Date and From, with the default direction='backward', to create a new merged dataframe df. Then use pd.merge_asof again to asof-merge df and d2 on the corresponding columns Date and to, with direction='forward'.
print(df)
Date t_factor plan plan_r
0 2020-02-01 5 NaN start
1 2020-02-02 23 NaN start
2 2020-02-03 14 start start
3 2020-02-04 23 start start
4 2020-02-05 23 start start
5 2020-02-06 23 start foundation
6 2020-02-07 30 foundation foundation
7 2020-02-08 29 foundation foundation
8 2020-02-09 100 foundation learn
9 2020-02-10 38 learn learn
10 2020-02-11 38 learn learn
11 2020-02-12 38 learn learn
12 2020-02-13 70 learn practice
13 2020-02-14 70 practice practice
14 2020-02-15 38 exam practice
15 2020-02-16 38 exam practice
16 2020-02-17 70 exam exam
17 2020-02-18 70 exam exam
18 2020-02-19 38 exam exam
19 2020-02-20 38 test exam
20 2020-02-21 70 test exam
21 2020-02-22 70 test test
22 2020-02-23 38 test test
23 2020-02-24 38 test NaN
24 2020-02-25 70 test NaN
25 2020-02-26 70 test NaN
26 2020-02-27 70 test NaN
Use Series.ne + Series.mask to mask the values in column plan where plan is not equal to plan_r.
print(df)
Date t_factor plan
0 2020-02-01 5 NaN
1 2020-02-02 23 NaN
2 2020-02-03 14 start
3 2020-02-04 23 start
4 2020-02-05 23 start
5 2020-02-06 23 NaN
6 2020-02-07 30 foundation
7 2020-02-08 29 foundation
8 2020-02-09 100 NaN
9 2020-02-10 38 learn
10 2020-02-11 38 learn
11 2020-02-12 38 learn
12 2020-02-13 70 NaN
13 2020-02-14 70 practice
14 2020-02-15 38 NaN
15 2020-02-16 38 NaN
16 2020-02-17 70 exam
17 2020-02-18 70 exam
18 2020-02-19 38 exam
19 2020-02-20 38 NaN
20 2020-02-21 70 NaN
21 2020-02-22 70 test
22 2020-02-23 38 test
23 2020-02-24 38 NaN
24 2020-02-25 70 NaN
25 2020-02-26 70 NaN
26 2020-02-27 70 NaN
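This logic can be wrapped into the function signature requested in the question (the name append_plan is only illustrative):

import pandas as pd

def append_plan(df1, df2):
    d1 = df1.sort_values('Date')
    d2 = df2.sort_values('From')
    # backward asof merge on From, then forward asof merge on to
    out = pd.merge_asof(d1, d2[['From', 'plan']], left_on='Date', right_on='From')
    out = pd.merge_asof(out, d2[['to', 'plan']], left_on='Date', right_on='to',
                        direction='forward', suffixes=['', '_r'])
    # a date inside exactly one range gets the same plan from both merges;
    # gaps and overlapping dates disagree and are masked to NaN
    out['plan'] = out['plan'].mask(out['plan'].ne(out.pop('plan_r')))
    return out.drop(columns=['From', 'to'])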
Using pd.to_datetime, convert the date-like columns to pandas datetime series:
df1['Date'] = pd.to_datetime(df1['Date'])
df2[['From', 'to']] = df2[['From', 'to']].apply(pd.to_datetime)
Create a pd.IntervalIndex from the columns From and to of df2, then use Series.map on the column Date of df1 to map it to the column plan from df2 (after setting idx as the index of df2):
idx = pd.IntervalIndex.from_arrays(df2['From'], df2['to'], closed='both')
df1['plan'] = df1['Date'].map(df2.set_index(idx)['plan'])
Result:
Date t_factor plan
0 2020-02-01 5 NaN
1 2020-02-02 23 NaN
2 2020-02-03 14 start
3 2020-02-04 23 start
4 2020-02-05 23 start
5 2020-02-06 23 NaN
6 2020-02-07 30 foundation
7 2020-02-08 29 foundation
8 2020-02-09 100 NaN
9 2020-02-10 38 learn
10 2020-02-11 38 learn
11 2020-02-12 38 learn
12 2020-02-13 70 NaN
13 2020-02-14 70 practice
14 2020-02-15 38 practice
15 2020-02-16 38 practice
16 2020-02-17 70 exam
17 2020-02-18 70 exam
18 2020-02-19 38 NaN
19 2020-02-20 38 test
20 2020-02-21 70 test
21 2020-02-22 70 test
22 2020-02-23 38 test
23 2020-02-24 38 NaN
24 2020-02-25 70 NaN
25 2020-02-26 70 NaN
26 2020-02-27 70 NaN
I have timeseries data of ice thickness. The plot is only useful for winter months, and there is no interest in seeing big gaps during summer months. Would it be possible to skip the summer months (say April to October) on the x-axis and show a smaller area with a different color, labeled Summer?
Let's take this data:
import numpy as np
import pandas as pd

n_samples = 20
index = pd.date_range(start='1/1/2018', periods=n_samples, freq='M')
values = np.random.randint(0, 100, size=n_samples)
data = pd.Series(values, index=index)
print(data)
2018-01-31 58
2018-02-28 93
2018-03-31 15
2018-04-30 87
2018-05-31 51
2018-06-30 67
2018-07-31 22
2018-08-31 66
2018-09-30 55
2018-10-31 73
2018-11-30 70
2018-12-31 61
2019-01-31 95
2019-02-28 97
2019-03-31 31
2019-04-30 50
2019-05-31 75
2019-06-30 80
2019-07-31 84
2019-08-31 19
Freq: M, dtype: int64
You can filter out the data that is not in the range of months: take the index of the Series, take the month, check whether it is in the range, and take the negation (with ~):
filtered1 = data[~data.index.month.isin(range(4,10))]
print(filtered1)
2018-01-31 58
2018-02-28 93
2018-03-31 15
2018-10-31 73
2018-11-30 70
2018-12-31 61
2019-01-31 95
2019-02-28 97
2019-03-31 31
If you plot that,
filtered1.plot()
the line is drawn straight across the summer gap, so you need to set the frequency, in this case monthly (M), so that the missing months appear as a break:
filtered1.asfreq('M').plot()
Additionally, you could use filters like:
filtered2 = data[data.index.month.isin([1,2,3,11,12])]
filtered3 = data[~ data.index.month.isin([4,5,6,7,8,9,10])]
if you need to keep or filter specific months.
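As for marking the skipped months, one option is to shade that span and label it. A sketch using plain matplotlib (note this shades the gap on a continuous axis rather than removing it, and the dates are the ones from this sample data):

import matplotlib.pyplot as plt
import pandas as pd

s = filtered1.asfreq('M')
fig, ax = plt.subplots()
ax.plot(s.index, s.values)
# shade the filtered-out summer window of 2018
ax.axvspan(pd.Timestamp('2018-04-01'), pd.Timestamp('2018-10-01'),
           color='orange', alpha=0.3, label='Summer')
ax.legend()
plt.show()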
I just started using pandas. I wanted to import an Excel file with 31 rows and 11 columns, but in the output only some of the columns are displayed, the middle columns are represented by "...", and in the first column, 'EST', the first few elements are displayed with a trailing "00:00:00".
Code
import pandas as pd
df = pd.read_excel("C:\\Users\daryl\PycharmProjects\pandas\Book1.xlsx")
print(df)
Output
C:\Users\daryl\AppData\Local\Programs\Python\Python37\python.exe "C:/Users/daryl/PycharmProjects/pandas/1. Introduction.py"
EST Temperature ... Events WindDirDegrees
0 2016-01-01 00:00:00 38 ... NaN 281
1 2016-02-01 00:00:00 36 ... NaN 275
2 2016-03-01 00:00:00 40 ... NaN 277
3 2016-04-01 00:00:00 25 ... NaN 345
4 2016-05-01 00:00:00 20 ... NaN 333
5 2016-06-01 00:00:00 33 ... NaN 259
6 2016-07-01 00:00:00 39 ... NaN 293
7 2016-08-01 00:00:00 39 ... NaN 79
8 2016-09-01 00:00:00 44 ... Rain 76
9 2016-10-01 00:00:00 50 ... Rain 109
10 2016-11-01 00:00:00 33 ... NaN 289
11 2016-12-01 00:00:00 35 ... NaN 235
12 1-13-2016 26 ... NaN 284
13 1-14-2016 30 ... NaN 266
14 1-15-2016 43 ... NaN 101
15 1-16-2016 47 ... Rain 340
16 1-17-2016 36 ... Fog-Snow 345
17 1-18-2016 25 ... Snow 293
18 1/19/2016 22 ... NaN 293
19 1-20-2016 32 ... NaN 302
20 1-21-2016 31 ... NaN 312
21 1-22-2016 26 ... Snow 34
22 1-23-2016 26 ... Fog-Snow 42
23 1-24-2016 28 ... Snow 327
24 1-25-2016 34 ... NaN 286
25 1-26-2016 43 ... NaN 244
26 1-27-2016 41 ... Rain 311
27 1-28-2016 37 ... NaN 234
28 1-29-2016 36 ... NaN 298
29 1-30-2016 34 ... NaN 257
30 1-31-2016 46 ... NaN 241
[31 rows x 11 columns]
Process finished with exit code 0
To answer your question about the display of only a few columns and "..." :
All of the columns have been properly ingested, but your screen / the console is not wide enough to output all of the columns at once in a "print" fashion. This is normal/expected behavior.
Pandas is not a spreadsheet visualization tool like Excel. There are tools that display dataframes in a spreadsheet-like grid (the variable explorer in Spyder, for example), but that is outside pandas itself.
If you want to make sure all of the columns are there, try using list(df) or print(list(df)).
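If you do want print(df) to show every column, pandas' display options can be widened (these are standard pandas settings):

import pandas as pd

pd.set_option('display.max_columns', None)  # never collapse columns into "..."
pd.set_option('display.width', None)        # auto-detect the console width
print(df)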
To answer your question about the EST format:
It looks like you have some data cleaning to do, which is typical work in data science. I am not sure how best to do this (I haven't worked much with dates/datetimes yet), but here is what I see:
The first few rows are formatted YYYY-MM-DD and carry timestamps as well, likely formatted HH:MM:SS
On index row 18, there are / instead of - in the date
The remaining rows are formatted M-DD-YYYY
There's an option documented for read_excel (and read_csv) that may take care of those automatically: parse_dates. Turning it on for that column, e.g. pd.read_excel('file location', parse_dates=['EST']), enables the date parser for the EST column and may solve your problem.
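A minimal sketch of that call, using the path from the question:

import pandas as pd

df = pd.read_excel(r"C:\Users\daryl\PycharmProjects\pandas\Book1.xlsx",
                   parse_dates=['EST'])
print(df.dtypes)  # EST should show as datetime64[ns] if parsing succeeded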
Hope this helps! This is my first answer, so anyone who sees it, feel free to edit and improve it.