How to group by an Attribute and calculate time between consecutive tickets for that Attribute - python-3.x

So, I am working with a Dataframe where there are around 20 columns, but only two columns are really of importance.
Index  ID        Date
1      01-40-50  2021-12-01 16:54:00
2      01-10     2021-10-11 13:28:00
3      03-48-58  2021-11-05 16:54:00
4      01-40-50  2021-12-06 19:34:00
5      03-48-58  2021-12-09 12:14:00
6      01-10     2021-08-06 19:34:00
7      03-48-58  2021-10-01 11:44:00
There are 90 different IDs and a few thousand rows in total. What I want to do is:
Group the entries by ID
Sort each ID's rows by Date
Calculate the difference between consecutive timestamps
Create a column holding those differences (to then visualize them for the 90 different IDs)
I thought this would be a simple job for groupby, but I am having quite a bit of trouble. I would appreciate any input on how to start this. Thank you!

You can do it this way:
>>> df.groupby("ID")["Date"].apply(lambda x: x.sort_values().diff())
ID        Index
01-10     6                  NaT
          2      65 days 17:54:00
01-40-50  1                  NaT
          4       5 days 02:40:00
03-48-58  7                  NaT
          3      35 days 05:10:00
          5      33 days 19:20:00
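For completeness, here is a self-contained sketch of the whole pipeline (sample data taken from the question's table): sort by date within each ID, then write the consecutive differences back as a column.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": ["01-40-50", "01-10", "03-48-58", "01-40-50",
           "03-48-58", "01-10", "03-48-58"],
    "Date": pd.to_datetime([
        "2021-12-01 16:54:00", "2021-10-11 13:28:00",
        "2021-11-05 16:54:00", "2021-12-06 19:34:00",
        "2021-12-09 12:14:00", "2021-08-06 19:34:00",
        "2021-10-01 11:44:00",
    ]),
})

# Sort so that diff() runs in chronological order within each ID,
# then take consecutive differences per group.
df = df.sort_values(["ID", "Date"])
df["Time_difference"] = df.groupby("ID")["Date"].diff()
print(df)
```

The first row of each ID is NaT, since there is no earlier ticket to compare against.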

Related

How to find again the index after pivoting dataframe?

I created a dataframe from a CSV file containing data on the number of deaths by year (running from 1946 to 2021) and month (within year):
dataD = pd.read_csv('MY_FILE.csv', sep=',')
First rows (out of 902...) of output are :
dataD
Year Month Deaths
0 2021 2 55500
1 2021 1 65400
2 2020 12 62800
3 2020 11 64700
4 2020 10 56900
As expected, the dataframe contains an index numbered 0,1,2, ... and so on.
Now, I pivot this dataframe in order to have only 1 row by year and months in column, using the following code:
dataDW = dataD.pivot(index='Year', columns='Month', values='Deaths')
The first rows of the result are now:
Month 1 2 3 4 5 6 7 8 9 10 11 12
Year
1946 70900.0 53958.0 57287.0 45376.0 42591.0 37721.0 37587.0 34880.0 35188.0 37842.0 42954.0 49596.0
1947 60453.0 56891.0 56442.0 45121.0 42605.0 37894.0 38364.0 36763.0 35768.0 40488.0 41361.0 46007.0
1948 46161.0 45412.0 51983.0 43829.0 42003.0 37084.0 39069.0 35272.0 35314.0 39588.0 43596.0 53899.0
1949 87861.0 58592.0 52772.0 44154.0 41896.0 39141.0 40042.0 37372.0 36267.0 40534.0 47049.0 47918.0
1950 51927.0 47749.0 50439.0 47248.0 45515.0 40095.0 39798.0 38124.0 37075.0 42232.0 44418.0 49860.0
My question is:
What do I have to change in the pivoting code above in order to get the index 0,1,2,... back when I output the pivoted file? I think I need to specify index=*** to make the pivot instruction run, but afterwards I would like to recover an index "as usual" (if I can say), exactly like in my first dataframe dataD.
Any possibility?
You can reset_index() after pivoting:
dataDW = dataD.pivot(index='Year', columns='Month', values='Deaths').reset_index()
This would give you the following:
Month Year 1 2 3 4 5 6 7 8 9 10 11 12
0 1946 70900.0 53958.0 57287.0 45376.0 42591.0 37721.0 37587.0 34880.0 35188.0 37842.0 42954.0 49596.0
1 1947 60453.0 56891.0 56442.0 45121.0 42605.0 37894.0 38364.0 36763.0 35768.0 40488.0 41361.0 46007.0
2 1948 46161.0 45412.0 51983.0 43829.0 42003.0 37084.0 39069.0 35272.0 35314.0 39588.0 43596.0 53899.0
3 1949 87861.0 58592.0 52772.0 44154.0 41896.0 39141.0 40042.0 37372.0 36267.0 40534.0 47049.0 47918.0
4 1950 51927.0 47749.0 50439.0 47248.0 45515.0 40095.0 39798.0 38124.0 37075.0 42232.0 44418.0 49860.0
Note that the "Month" here might look like the index name but is actually df.columns.name. You can unset it if preferred:
df.columns.name = None
Which then gives you:
Year 1 2 3 4 5 6 7 8 9 10 11 12
0 1946 70900.0 53958.0 57287.0 45376.0 42591.0 37721.0 37587.0 34880.0 35188.0 37842.0 42954.0 49596.0
1 1947 60453.0 56891.0 56442.0 45121.0 42605.0 37894.0 38364.0 36763.0 35768.0 40488.0 41361.0 46007.0
2 1948 46161.0 45412.0 51983.0 43829.0 42003.0 37084.0 39069.0 35272.0 35314.0 39588.0 43596.0 53899.0
3 1949 87861.0 58592.0 52772.0 44154.0 41896.0 39141.0 40042.0 37372.0 36267.0 40534.0 47049.0 47918.0
4 1950 51927.0 47749.0 50439.0 47248.0 45515.0 40095.0 39798.0 38124.0 37075.0 42232.0 44418.0 49860.0
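As an aside, the two steps can be chained in one expression with rename_axis. A minimal sketch, using made-up data shaped like the question's (the numbers here are illustrative, not the real CSV):

```python
import pandas as pd

dataD = pd.DataFrame({
    "Year": [2021, 2021, 2020],
    "Month": [1, 2, 12],
    "Deaths": [65400, 55500, 62800],
})

# reset_index restores the 0,1,2,... RangeIndex; rename_axis(columns=None)
# clears the leftover "Month" columns name in the same chain.
dataDW = (dataD.pivot(index="Year", columns="Month", values="Deaths")
               .reset_index()
               .rename_axis(columns=None))
print(dataDW)
```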

Plot many plots for each unique Value of one column

So, I am working with a Dataframe where there are around 20 columns, but only three columns are really of importance.
Index  ID        Date                 Time_difference
1      01-40-50  2021-12-01 16:54:00  0 days 00:12:00
2      01-10     2021-10-11 13:28:00  2 days 00:26:00
3      03-48-58  2021-11-05 16:54:00  2 days 00:26:00
4      01-40-50  2021-12-06 19:34:00  7 days 00:26:00
5      03-48-58  2021-12-09 12:14:00  1 days 00:26:00
6      01-10     2021-08-06 19:34:00  0 days 00:26:00
7      03-48-58  2021-10-01 11:44:00  0 days 02:21:00
There are 90 unique IDs and a few thousand rows in total. What I want to do is:
Create a plot for each unique ID
Each plot with a y-axis of 'Time_difference' and an x-axis of 'Date'
Each plot with a trendline
Optimally, one plot that shows the average of all the others
I would appreciate any input on how to start this. Thank you!
For future documentation, I solved it as follows:
First, convert the timedelta to hours:
df['hour_difference'] = df['Time_difference'].dt.days * 24 + df['Time_difference'].dt.seconds / 3600
Then create a list of all unique IDs:
id_list = df['ID'].unique()
And last, the for-loop for the plotting:
import matplotlib.pyplot as plt

for i in id_list:
    df.loc[df['ID'] == i].plot(x='Date', y='hour_difference', figsize=(15, 4))
    plt.title(i, fontsize=18)  # title each figure with the ID
    plt.xlabel('Date', fontsize=12)  # label the x-axis
    plt.ylabel('Hour difference', fontsize=12)  # label the y-axis
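The question also asked for a trendline, which the loop above does not draw. A minimal sketch, assuming an hour_difference column as computed earlier (the tiny DataFrame here is made up for illustration), is to fit a linear trend per ID with numpy.polyfit on the date axis converted to numbers:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs in scripts
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": ["01-40-50", "01-40-50", "01-40-50"],
    "Date": pd.to_datetime(["2021-12-01", "2021-12-06", "2021-12-09"]),
    "hour_difference": [12.0, 120.0, 60.0],
})

for i, grp in df.sort_values("Date").groupby("ID"):
    # Convert dates to matplotlib's float day numbers for the fit.
    x = matplotlib.dates.date2num(grp["Date"].dt.to_pydatetime())
    slope, intercept = np.polyfit(x, grp["hour_difference"], deg=1)
    ax = grp.plot(x="Date", y="hour_difference", figsize=(15, 4), title=i)
    ax.plot(grp["Date"], slope * x + intercept, linestyle="--", label="trend")
    ax.legend()
plt.close("all")
```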

How to merge timestamps that are only few seconds apart [Pandas]

I have this dataframe with shape 22341x3:
tID DateTime
0 1 2020-04-04 10:15:40
1 2 2020-04-04 10:15:56
2 2 2020-04-04 11:07:11
3 3 2020-04-04 11:08:14
4 3 2020-04-04 11:18:46
5 4 2020-04-04 11:23:56
6 5 2020-04-04 11:24:14
7 6 2020-04-04 11:29:12
8 7 2020-04-04 11:29:23
9 8 2020-04-04 11:34:23
Now I have to create a column called merged_timestamp that merges all the timestamps that are only a few seconds apart and gives them a new number: mtID
So, for example, if we take 2020-04-04 10:15:40 as a reference, "a few seconds apart" could mean timestamps from 40 seconds up to 44 seconds. Their hours and minutes can differ a lot from the reference, but their seconds should be only a few apart for the rows to be merged.
Any help would be appreciated.
EDIT: I tried doing dfd.resample('5s')[0:5] where dfd is my dataframe. It gives me this error TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'RangeIndex'
resample works on the index, so make the datetime your index;
df.index = pd.to_datetime(df['DateTime'])
then you can resample with;
df.resample('5s').count()
or some other aggregation; I'm not sure exactly what you are trying to do. You could then drop the rows you're not interested in.
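If the goal is to give one mtID to each run of timestamps that are close together, a common sketch is diff() plus cumsum() rather than resample. The 20-second threshold below is purely illustrative, and the sample rows are a subset of the question's data:

```python
import pandas as pd

df = pd.DataFrame({
    "tID": [1, 2, 2, 3],
    "DateTime": pd.to_datetime([
        "2020-04-04 10:15:40", "2020-04-04 10:15:56",
        "2020-04-04 11:07:11", "2020-04-04 11:08:14",
    ]),
})

# A new group starts wherever the gap to the previous row exceeds the
# threshold; cumsum turns those breakpoints into running group numbers.
threshold = pd.Timedelta(seconds=20)  # illustrative choice
df = df.sort_values("DateTime")
df["mtID"] = (df["DateTime"].diff() > threshold).cumsum() + 1
print(df)
```

Here the first two rows (16 seconds apart) share mtID 1, while the later rows each start a new group.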

Function in Excel to average numbers under weeks column and show them in month

In the attached image, I would like to average the numbers under each week and show the result by month. For example:
Week no.: 1  2  3  4  5  6  7  8  9  10
Numbers:  12 15 16 5  8  9  45 78 8  96
Show the average of the weeks in each month.
Thanks for the help.
Cheers
Hosein
The following formula will give you the average for each month for a given row:
=AVERAGEIFS($A$4:$V$4,$A$1:$V$1,">="&A9,$A$1:$V$1,"<="&EOMONTH(A9,0))
I'm hoping it's a good starting point.

Difference between the last and the first item of a criteria in a list

I have the following Excel spreadsheet
A B C D
1 Product ID Time of Event
2 27152 01.04.2017 08:45:00 27152 70 Min.
3 27152 01.04.2017 09:00:00 29297 108 Min.
4 27152 01.04.2017 09:55:00 28802 28 Min.
5 29297 02.04.2017 11:02:00
6 29297 02.04.2017 12:50:00
7 28802 18.04.2017 11:48:00
8 28802 18.04.2017 12:00:00
9 28802 18.04.2017 12:13:00
10 28802 18.04.2017 12:16:00
In Column A you can find different Product IDs.
In Column B the time when an event happens in the Product ID.
Each event is listed in the table; therefore, a ProductID can appear
several times in Column A.
In Column D I want to show now the difference in minutes between
the first and the last event which happens in a product ID.
D2 = 9:55:00 - 8:45:00 = 70 Min.
D3 = 12:50:00 - 11:02:00 = 108 Min.
D4 = 12:16:00 - 11:48:00 = 28 Min.
Therefore, I would need something like a DIFFERENCE-IF-Formula.
One of my ideas so far was going by the LARGE and SMALL function.
=LARGE(B2:B4;1)-SMALL(B2:B4;1)
However, this way I would have to find each array (B2:B4, B5:B6, B7:B10) separately; therefore, I would prefer to have the ProductID as a criterion in the formula.
Summarized:
Do you have any idea how I could calculate the difference in minutes between the last and the first event of a certain ProductID in the list?
I would prefer to avoid any kind of array formula.
=ROUND(MMULT(AGGREGATE({14,15},6,B$2:B$10/(A$2:A$10=C2),1),{1;-1})*1440,1)&" Min"
and copied down.
I've a feeling the separators for horizontal and vertical arrays in German versions of Excel are the period (.) and semicolon (;) respectively, so I believe you'll need:
=RUNDEN(MMULT(AGGREGAT({14.15};6;B$2:B$10/(A$2:A$10=C2);1);{1;-1})*1440;1)&" Min"
though please let me know if that doesn't give the required results.
Regards
With some conditions:
1. assuming that you split column B into two columns (date and time)
2. the times are in ascending order
A B C
Product ID Time of Event TIMES
27152 01.04.2017 8:45:00
27152 01.04.2017 9:00:00
27152 01.04.2017 9:55:00
29297 02.04.2017 11:02:00
29297 02.04.2017 12:50:00
28802 18.04.2017 11:48:00
28802 18.04.2017 12:00:00
28802 18.04.2017 12:13:00
28802 18.04.2017 12:16:00
This will work without using an array formula:
=(INDEX($C$2:$C$10,SUMPRODUCT(MAX(ROW($A$2:$A$10)*(D2=$A$2:$A$10))-1))-INDEX($C$2:$C$10,MATCH(D2,$A$2:$A$10,0)))*1440
Convert time into minutes
=(time*1440)
Look for first matching value
=INDEX($C$2:$C$10,MATCH(D2,$A$2:$A$10,0))
Look for last matching value
=INDEX($C$2:$C$10,SUMPRODUCT(MAX(ROW($A$2:$A$10)*(D2=$A$2:$A$10))-1))
NOTE: If the last value is SMALLER than the first value, you will receive an error.
