How to merge timestamps that are only a few seconds apart [Pandas] - python-3.x

I have this dataframe with shape 22341x3:
tID DateTime
0 1 2020-04-04 10:15:40
1 2 2020-04-04 10:15:56
2 2 2020-04-04 11:07:11
3 3 2020-04-04 11:08:14
4 3 2020-04-04 11:18:46
5 4 2020-04-04 11:23:56
6 5 2020-04-04 11:24:14
7 6 2020-04-04 11:29:12
8 7 2020-04-04 11:29:23
9 8 2020-04-04 11:34:23
Now I have to create a column called merged_timestamp that merges all the timestamps that are only a few seconds apart and gives them a new number: mtID
So for example: if we take 2020-04-04 10:15:40 as a reference, the timestamps counted as a few seconds apart would be those whose seconds fall between 40 and 44. They can differ from the reference by hours and minutes, but their seconds should only be a few apart in order to get merged.
Any help would be appreciated.
EDIT: I tried doing dfd.resample('5s')[0:5] where dfd is my dataframe. It gives me this error TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'RangeIndex'

resample works on the index, so make the datetime your index:
df.index = pd.to_datetime(df['DateTime'])
then you can resample with:
df.resample('5s').count()
or some other aggregation; I'm not sure exactly what you are trying to do. You could then drop the rows you're not interested in.
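A minimal sketch of that suggestion, assuming the frame is called df as in the question; the 5-second bin width and the way mtID is handed out are assumptions for illustration rather than anything the question pins down:
import pandas as pd

# parse the timestamps and make them the index so resample works
df["DateTime"] = pd.to_datetime(df["DateTime"])
df = df.set_index("DateTime")

# bucket rows into 5-second bins and count how many land in each bin
print(df.resample("5s").count())

# one possible way to assign a merged ID (mtID) per 5-second bin
df["mtID"] = df.index.floor("5s").factorize()[0] + 1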

Related

Computing 10d rolling average on a descending date column in pandas [duplicate]

Suppose I have a time series:
In [138]: rng = pd.date_range('1/10/2011', periods=10, freq='D')
In [139]: ts = pd.Series(np.arange(len(rng)), index=rng)
In [140]: ts
Out[140]:
2011-01-10 0
2011-01-11 1
2011-01-12 2
2011-01-13 3
2011-01-14 4
2011-01-15 5
2011-01-16 6
2011-01-17 7
2011-01-18 8
2011-01-19 9
Freq: D, dtype: int64
If I use one of the rolling_* functions, for instance rolling_sum, I can get the behavior I want for backward looking rolling calculations:
In [157]: pd.rolling_sum(ts, window=3, min_periods=0)
Out[157]:
2011-01-10 0
2011-01-11 1
2011-01-12 3
2011-01-13 6
2011-01-14 9
2011-01-15 12
2011-01-16 15
2011-01-17 18
2011-01-18 21
2011-01-19 24
Freq: D, dtype: float64
But what if I want to do a forward-looking sum? I've tried something like this:
In [161]: pd.rolling_sum(ts.shift(-2, freq='D'), window=3, min_periods=0)
Out[161]:
2011-01-08 0
2011-01-09 1
2011-01-10 3
2011-01-11 6
2011-01-12 9
2011-01-13 12
2011-01-14 15
2011-01-15 18
2011-01-16 21
2011-01-17 24
Freq: D, dtype: float64
But that's not exactly the behavior I want. What I am looking for as an output is:
2011-01-10 3
2011-01-11 6
2011-01-12 9
2011-01-13 12
2011-01-14 15
2011-01-15 18
2011-01-16 21
2011-01-17 24
2011-01-18 17
2011-01-19 9
i.e. I want the sum of the "current" day plus the next two days. My current solution is not sufficient because I care about what happens at the edges. I know I could solve this manually by setting up two additional columns that are shifted by 1 and 2 days respectively and then summing the three columns, but there has to be a more elegant solution.
Why not just do it on the reversed Series (and reverse the answer):
In [11]: pd.rolling_sum(ts[::-1], window=3, min_periods=0)[::-1]
Out[11]:
2011-01-10 3
2011-01-11 6
2011-01-12 9
2011-01-13 12
2011-01-14 15
2011-01-15 18
2011-01-16 21
2011-01-17 24
2011-01-18 17
2011-01-19 9
Freq: D, dtype: float64
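For newer pandas versions, where pd.rolling_sum no longer exists, here is a sketch of the same reversed-series trick with the Series.rolling API (np.arange is used only to reproduce the 0-9 values shown above):
import numpy as np
import pandas as pd

rng = pd.date_range("1/10/2011", periods=10, freq="D")
ts = pd.Series(np.arange(len(rng)), index=rng)

# reverse, take a trailing 3-period sum, then reverse back to get a forward-looking sum
forward_sum = ts[::-1].rolling(window=3, min_periods=1).sum()[::-1]
print(forward_sum)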
I struggled with this then found an easy way using shift.
If you want a rolling sum for the next 10 periods, try:
df['NewCol'] = df['OtherCol'].shift(-10).rolling(10, min_periods = 0).sum()
We use shift so that "OtherCol" shows up 10 rows ahead of where it normally would be, then we do a rolling sum over the previous 10 rows. Because we shifted, the previous 10 rows are actually the future 10 rows of the unshifted column. :)
Pandas recently added a new feature which enables you to implement forward looking rolling. You have to upgrade to pandas 1.1.0 to get the new feature.
indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=3)
ts.rolling(window=indexer, min_periods=1).sum()
Maybe you can try the bottleneck module. When ts is large, bottleneck is much faster than pandas.
import bottleneck as bn
result = bn.move_sum(ts[::-1], window=3, min_count=1)[::-1]
And bottleneck has other rolling functions, such as move_max, move_argmin, move_rank.
Try this one for a rolling window of 3:
window = 3
ts.rolling(window).sum().shift(-window + 1)

Need help moving data from one column to another

New to python and pandas and trying to figure this out.
I'm dealing with a data set that's pretty messy. There are 500 rows and 9 columns. In a few instances, data that should be in column 9 has been indexed into column 8, along with the column 8 data.
... Col 8 Col 9
0 2 weeks No. 13
1 1 week No. 2
2 12 weeks, No 1
3 15 weeks No. 8
4 7 weeks, No. 1
How can I separate the data and move it to the proper column?
I applied a split(), but don't know how to move it over.
I'm thinking I need to use the apply(), but not sure on how.
Any suggestions?
You can split() with expand=True, then fillna() to fill the missing values:
df[['Col 8', 'Col 9']] = df['Col 8'].str.split(',', expand=True).fillna({1: df['Col 9']})
# Col 8 Col 9
# 0 2 weeks No. 13
# 1 1 week No. 2
# 2 12 weeks No 1
# 3 15 weeks No. 8
# 4 7 weeks No. 1
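A self-contained sketch of the same idea, using made-up rows that mirror the question's example (the intermediate parts frame and the str.strip() calls are just one way to tidy the split pieces):
import pandas as pd

df = pd.DataFrame({
    "Col 8": ["2 weeks", "1 week", "12 weeks, No 1", "15 weeks", "7 weeks, No. 1"],
    "Col 9": ["No. 13", "No. 2", None, "No. 8", None],
})

# split the misplaced values out of Col 8, then keep the existing Col 9
# values wherever nothing was split off
parts = df["Col 8"].str.split(",", expand=True)
df["Col 8"] = parts[0].str.strip()
df["Col 9"] = parts[1].str.strip().fillna(df["Col 9"])
print(df)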

How to group by an Attribute and calculate time between consecutive tickets for that Attribute

So, I am working with a Dataframe where there are around 20 columns, but only two columns are really of importance.
Index  ID        Date
1      01-40-50  2021-12-01 16:54:00
2      01-10     2021-10-11 13:28:00
3      03-48-58  2021-11-05 16:54:00
4      01-40-50  2021-12-06 19:34:00
5      03-48-58  2021-12-09 12:14:00
6      01-10     2021-08-06 19:34:00
7      03-48-58  2021-10-01 11:44:00
There are 90 different IDs and a few thousand rows in total. What I want to do is:
Group the entries by the IDs
Order those ID rows by the Date
Then calculate the difference between one timestamp and the next
And create a column that holds those differences (to then visualize them for the 90 different IDs)
While I thought it would be an easy thing to use the function groupby, I am having quite a bit of trouble. Would appreciate any input as to how to start this! Thank you!
You can do it this way:
>>> df.groupby("ID")["Date"].apply(lambda x: x.sort_values().diff())
ID Index
01-10 6 NaT
2 65 days 17:54:00
01-40-50 1 NaT
4 5 days 02:40:00
03-48-58 7 NaT
3 35 days 05:10:00
5 33 days 19:20:00
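If the differences should also end up as a column of the original frame (step 4 in the question), a sketch along the same lines (the column name TimeBetween is a hypothetical choice):
df["Date"] = pd.to_datetime(df["Date"])
df = df.sort_values(["ID", "Date"])
df["TimeBetween"] = df.groupby("ID")["Date"].diff()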

Mark sudden changes in prices in a dataframe time series and color them

I have a Pandas dataframe of prices for different months and years (a time series), 80 columns. I want to be able to detect significant changes in prices either up or down and color them differently in a dataframe. Is that possible, and what would be the best approach?
Jan-2001 Feb-2001 Jan-2002 Feb-2002 ....
100 30 10 ...
110 25 1 ...
40 5 50
70 11 4
120 35 2
Here in the first column 40 and 70 should be marked, in the second column 5 and 11 should be marked, in the third column not really sure but probably 1, 50, 4, 2...
Your question involves two problems as I see it.
Printing the highlighting depends on the output you're trying to reach, be it STDOUT, a file, or some specific program.
Identification of outliers based on the column data. It's hard to tell whether you want it based on the entire dataset or on the previous data in the column, like a rolling outlier, i.e. the preceding data is used to decide whether the next value is out of line.
In the instance below I provide a method that goes at the data with standard deviation / z-scoring based on the mean of the entire column. You will have to tweak the > and < comparisons to get to your desired state; there are many intricacies to this concept, and I would suggest taking a look at a few resources on the subject.
For your data:
Jan-2001,Feb-2001,Jan-2002
100,30,10
110,25,1
40,5,50
70,11,4
120,35,20000
I am aware of methods to highlight, but not in the terminal. The styling approach at https://pandas.pydata.org/pandas-docs/stable/style.html works in a few environments (for example Jupyter/HTML output).
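As a rough sketch of that styling route (this assumes a Jupyter/HTML context, and the 1.5-standard-deviation threshold is an arbitrary illustration, not something from the question):
def highlight_outliers(col):
    # colour cells whose z-score within their column exceeds 1.5 in magnitude
    z = (col - col.mean()) / col.std(ddof=0)
    return ["background-color: yellow" if abs(v) > 1.5 else "" for v in z]

df.style.apply(highlight_outliers, axis=0)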
To get at the original ask, identifying outliers in your data, you could use something like the code below, based on standard deviation and z-score.
Sample Code:
import pandas as pd

df = pd.read_csv("full.txt")
original = df.columns
print(df)
for col in original:
    # keep the column's std and z-score alongside the data
    # (the _std columns match the full dataframe shown in Output 3)
    col_std = col + "_std"
    col_zscore = col + "_zscore"
    df[col_std] = df[col].std(ddof=0)
    df[col_zscore] = (df[col] - df[col].mean()) / df[col].std(ddof=0)
    print(df[col].loc[(df[col_zscore] > 1.5) | (df[col_zscore] < -0.5)])
print(df)
Output 1: # prints the original dataframe
Jan-2001 Feb-2001 Jan-2002
100 30 10
110 25 1
40 5 50
70 11 4
120 35 20000
Output 2: # Identifies the outliers
2 40
3 70
Name: Jan-2001, dtype: int64
2 5
3 11
Name: Feb-2001, dtype: int64
0 10
1 1
3 4
4 20000
Name: Jan-2002, dtype: int64
Output 3: # Prints the full dataframe created, with zscore of each item based on the column
Jan-2001 Feb-2001 Jan-2002 Jan-2001_std Jan-2001_zscore \
0 100 30 10 32.710854 0.410152
1 110 25 1 32.710854 0.751945
2 40 5 50 32.710854 -1.640606
3 70 11 4 32.710854 -0.615227
4 120 35 2 32.710854 1.093737
Feb-2001_std Feb-2001_zscore Jan-2002_std Jan-2002_zscore
0 12.735776 0.772524 20.755722 -0.183145
1 12.735776 0.333590 20.755722 -0.667942
2 12.735776 -1.422147 20.755722 1.971507
3 12.735776 -0.895426 20.755722 -0.506343
4 12.735776 1.211459 20.755722 -0.614076
Resources for zscore are here:
https://statistics.laerd.com/statistical-guides/standard-score-2.php

Why are there many "NaN" values in the index after importing a MultiIndex DataFrame from an Excel file?

I have an Excel file that looks like this in Excel:
2016-1-1 2016-1-2 2016-1-3 2016-1-4
300100 am 1 3 5 1
pm 3 2 4 5
300200 am 2 5 2 6
pm 5 1 3 7
300300 am 1 6 3 2
pm 3 7 2 3
300400 am 3 1 1 3
pm 2 5 5 2
300500 am 1 6 6 1
pm 5 7 7 5
But after I imported it with pd.read_excel and printed it, it was displayed like this in Python:
2016-1-1 2016-1-2 2016-1-3 2016-1-4
300100 am 1 3 5 1
NaN pm 3 2 4 5
300200 am 2 5 2 6
NaN pm 5 1 3 7
300300 am 1 6 3 2
NaN pm 3 7 2 3
300400 am 3 1 1 3
NaN pm 2 5 5 2
300500 am 1 6 6 1
NaN pm 5 7 7 5
How can I solve this to make the Dataframe look like the format in Excel, without so many "NaN"? Thanks!
Most of the time when Excel looks like what you have in your example, it does actually have blanks where those spaces are. But, the cells are merged, so it looks pretty. When you import it into pandas, it reads them as empty or NaN.
To fix it, forward-fill the empty cells, then set them as the index.
df = df.ffill()
Without access to the Excel files or knowledge of the versions it's impossible to be sure, but it just looks like you have a column of numbers (the first column) with every other row blank. Pandas expects uniformly filled columns, so while in Excel you have a sort of "structure" of the information for both AM and PM for each first-column number (id?), Pandas just sees two rows, one with an invalid first column. Depending on how you actually want to access this data, an easy fix would be to replace every NaN with the number directly above it, so each row contains either the AM or PM information for the "id". Another fix would be to change your column structure to have 2016-1-1-am and 2016-1-1-pm fields.
You're looking for the fillna method:
df = df.fillna('')
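Putting the forward-fill idea together, a minimal sketch; the file name "data.xlsx" and the labels "ID"/"Session" are assumptions for illustration:
import pandas as pd

# read the sheet; the first two columns hold the merged ID cells and the am/pm label
df = pd.read_excel("data.xlsx")
df = df.rename(columns={df.columns[0]: "ID", df.columns[1]: "Session"})

# forward-fill the IDs that the merged cells left blank, then build the MultiIndex
df["ID"] = df["ID"].ffill()
df = df.set_index(["ID", "Session"])
print(df)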
