Databricks SQL LAG() Dealing with Duplicates

I need help with a query and cannot work out a good approach to it.
My data has the following shape:
PK | ticket_number | timestamp           | previous_state      | current_state
1  | 1             | 2022-05-01 11:55:00 |                     | Under Consideration
2  | 1             | 2022-05-01 12:00:00 | Under Consideration | Backlog
3  | 1             | 2022-05-01 12:00:00 | Backlog             | Development
4  | 1             | 2022-05-05 13:00:00 | Development         | Review
5  | 1             | 2022-05-01 13:05:00 | Review              | Development
6  | 1             | 2022-05-05 13:10:00 | Development         | Done
I want to calculate how long the ticket spent in each state. For that I need to match each row with the timestamp of the event that preceded it. In my data, two state changes can share the same timestamp, so ordering by timestamp alone does not give a unique order and LAG may pick the wrong row. The correct order of events is still recoverable, because each row also records its previous state.
So, how can I use the previous_state column to get the correct order of events for duplicated timestamps, and from that the previous timestamp?
The desired output would be this:
PK | ticket_number | timestamp           | previous_state      | current_state       | previous_pk | previous_timestamp
1  | 1             | 2022-05-01 11:55:00 |                     | Under Consideration | NULL        | NULL
2  | 1             | 2022-05-01 12:00:00 | Under Consideration | Backlog             | 1           | 2022-05-01 11:55:00
3  | 1             | 2022-05-01 12:00:00 | Backlog             | Development         | 2           | 2022-05-01 12:00:00
4  | 1             | 2022-05-05 13:00:00 | Development         | Review              | 3           | 2022-05-01 12:00:00
5  | 1             | 2022-05-01 13:05:00 | Review              | Development         | 4           | 2022-05-05 13:00:00
6  | 1             | 2022-05-05 13:10:00 | Development         | Done                | 5           | 2022-05-01 13:05:00
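
One possible way to express the idea of using previous_state as the link, sketched here in pandas rather than Databricks SQL, is a self-join: each row is matched to rows of the same ticket whose current_state equals its previous_state and whose timestamp is not later, and the latest such match is kept. This is only an illustration under assumptions, not a confirmed answer to the question. It uses the first four rows of the sample only, because PK 5 carries a date (2022-05-01) earlier than PK 4 (2022-05-05), so a timestamp-based match would not reproduce that part of the desired output. The names `candidates`, `matches`, `latest`, and `duration` are made up for the example.

```python
import pandas as pd

# First four rows of the sample data (PK 5 and 6 are left out, see the note above).
df = pd.DataFrame({
    "PK": [1, 2, 3, 4],
    "ticket_number": [1, 1, 1, 1],
    "timestamp": pd.to_datetime([
        "2022-05-01 11:55:00", "2022-05-01 12:00:00",
        "2022-05-01 12:00:00", "2022-05-05 13:00:00",
    ]),
    "previous_state": [None, "Under Consideration", "Backlog", "Development"],
    "current_state": ["Under Consideration", "Backlog", "Development", "Review"],
})

# Every row is a potential predecessor: its current_state must equal
# the previous_state of the row that follows it.
candidates = df[["PK", "ticket_number", "timestamp", "current_state"]].rename(
    columns={
        "PK": "previous_pk",
        "timestamp": "previous_timestamp",
        "current_state": "previous_state",
    }
)

# Self-join on ticket and state, then keep only predecessors that are
# a different row and not later in time.
matches = df.merge(candidates, on=["ticket_number", "previous_state"])
matches = matches[
    (matches["previous_timestamp"] <= matches["timestamp"])
    & (matches["previous_pk"] != matches["PK"])
]

# If several predecessors qualify, keep the most recent one per row.
latest = (
    matches.sort_values(["PK", "previous_timestamp", "previous_pk"])
    .drop_duplicates("PK", keep="last")[["PK", "previous_pk", "previous_timestamp"]]
)

# Left join so that the first event of a ticket keeps NULL predecessors.
result = df.merge(latest, on="PK", how="left")
result["previous_pk"] = result["previous_pk"].astype("Int64")
result["duration"] = result["timestamp"] - result["previous_timestamp"]
print(result)
```

The same join condition could in principle be written as a SQL self-join. If two candidate predecessors share both the state and the timestamp, the sketch still has to break the tie arbitrarily (here by the larger PK).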

Related

Spotfire calculate difference with respect to previous row value

I have the data below. I created the column "Difference in values" manually: it is each value minus the value at the previous time for the same name, for example the value at 08:15 AM minus the value at 08:00 AM, which is 2 in the second row, and so on for all values of Tushar and Lohit respectively. How can I do this calculation in Spotfire? I believe the OVER and Previous functions can help, but I am unable to find anything on this. Please help.
Name Time Values Difference in values
Tushar 08:00 AM 2 0
Tushar 08:15 AM 4 2
Tushar 08:30 AM 5 1
Tushar 08:45 AM 6 1
Tushar 09:00 AM 7 1
Lohit 08:00 AM 2 0
Lohit 08:15 AM 4 2
Lohit 08:30 AM 5 1
Lohit 08:45 AM 6 1

This should work
SN([Values] - Max([Values]) over (Intersect(Previous([Time]),[Name])),0)
Max(...) is there only because an aggregation is required; it looks at just the previous Time row for each value of Name, so Min would work just as well.
SN(...) sets the result to 0 when it is empty (as in the first row of each Name).
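
For comparison only, the same per-name "difference to the previous row" can be sketched in pandas. This is not Spotfire syntax; it assumes the rows are already ordered by Time within each Name, as in the sample.

```python
import pandas as pd

# Sample data from the question, already ordered by Time within each Name.
df = pd.DataFrame({
    "Name": ["Tushar"] * 5 + ["Lohit"] * 4,
    "Time": ["08:00 AM", "08:15 AM", "08:30 AM", "08:45 AM", "09:00 AM",
             "08:00 AM", "08:15 AM", "08:30 AM", "08:45 AM"],
    "Values": [2, 4, 5, 6, 7, 2, 4, 5, 6],
})

# diff() subtracts the previous row within each Name group;
# the first row of each group has no predecessor, so fill it with 0.
df["Difference in values"] = (
    df.groupby("Name")["Values"].diff().fillna(0).astype(int)
)
print(df)
```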

How to group by an Attribute and calculate time between consecutive tickets for that Attribute

So, I am working with a DataFrame that has around 20 columns, but only two of them really matter.
Index | ID       | Date
1     | 01-40-50 | 2021-12-01 16:54:00
2     | 01-10    | 2021-10-11 13:28:00
3     | 03-48-58 | 2021-11-05 16:54:00
4     | 01-40-50 | 2021-12-06 19:34:00
5     | 03-48-58 | 2021-12-09 12:14:00
6     | 01-10    | 2021-08-06 19:34:00
7     | 03-48-58 | 2021-10-01 11:44:00
There are 90 different IDs and a few thousand rows in total. What I want to do is:
Group the entries by ID
Order the rows of each ID by Date
Calculate the difference between each timestamp and the previous one
Create a column that holds those differences (to then visualize them for the 90 different IDs)
I thought this would be easy with groupby, but I am having quite a bit of trouble. I would appreciate any input on how to start. Thank you!

You can do it this way:
>>> df.groupby("ID")["Date"].apply(lambda x: x.sort_values().diff())
ID Index
01-10 6 NaT
2 65 days 17:54:00
01-40-50 1 NaT
4 5 days 02:40:00
03-48-58 7 NaT
3 35 days 05:10:00
5 33 days 19:20:00
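
To store the result as a column, as the question asks, one option is to sort first and then take the per-group difference. The sketch below rebuilds the sample data and assumes Date has already been parsed as a datetime; the column name `time_since_previous` is made up for the example.

```python
import pandas as pd

# Sample data from the question; "Index" is used as the DataFrame index.
df = pd.DataFrame(
    {
        "ID": ["01-40-50", "01-10", "03-48-58", "01-40-50",
               "03-48-58", "01-10", "03-48-58"],
        "Date": pd.to_datetime([
            "2021-12-01 16:54:00", "2021-10-11 13:28:00", "2021-11-05 16:54:00",
            "2021-12-06 19:34:00", "2021-12-09 12:14:00", "2021-08-06 19:34:00",
            "2021-10-01 11:44:00",
        ]),
    },
    index=pd.Index([1, 2, 3, 4, 5, 6, 7], name="Index"),
)

# Order rows by ID and Date, then subtract the previous Date within each ID.
df = df.sort_values(["ID", "Date"])
df["time_since_previous"] = df.groupby("ID")["Date"].diff()
print(df)
```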

How to make conditional comparisons between a value and a date difference (DATETIME - DATETIME)?

Motivation: I want to find users who performed the action within 5 days of their first login.
Here is the sample data:
ID DATE_LOGIN DATE_ACTION
1 2019-01-01 2019-01-03
2 2019-01-05 2019-01-06
3 2019-01-19 2019-01-25
Here is the expected result:
ID DATE_LOGIN DATE_ACTION
1 2019-01-01 2019-01-03
2 2019-01-05 2019-01-06
This is my attempt so far:
df['date_diff'] = pd.to_datetime(df['DATE_ACTION']) - pd.to_datetime(df['DATE_LOGIN'])
ID DATE_LOGIN DATE_ACTION date_diff
1 2019-01-01 2019-01-03 2 days
2 2019-01-05 2019-01-06 1 days
3 2019-01-10 2019-01-25 15 days
df[df['date_diff'] <= 5]
However, I get this error:
TypeError: Invalid comparison between dtype=timedelta64[ns] and int

You are comparing timedeltas with an integer, which raises the error. You can convert the timedeltas to a number of days with Series.dt.days and then compare numerically:
df[df['date_diff'].dt.days <= 5]
Or you can compare with a Timedelta:
df[df['date_diff'] <= pd.Timedelta(5, unit='d')]
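
Put together as a runnable sketch on the sample data from the question (the variable names `by_days` and `by_timedelta` are made up for the example):

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "DATE_LOGIN": ["2019-01-01", "2019-01-05", "2019-01-19"],
    "DATE_ACTION": ["2019-01-03", "2019-01-06", "2019-01-25"],
})

# Subtracting two datetime columns gives a timedelta64 column.
df["date_diff"] = pd.to_datetime(df["DATE_ACTION"]) - pd.to_datetime(df["DATE_LOGIN"])

# Option 1: compare the whole-day component as an integer.
by_days = df[df["date_diff"].dt.days <= 5]

# Option 2: compare timedelta with timedelta.
by_timedelta = df[df["date_diff"] <= pd.Timedelta(days=5)]

print(by_days)
print(by_timedelta)
```

The two filters agree here because the sample contains dates only. With times of day they can differ: .dt.days drops the fractional part, so a difference of 5 days 12 hours passes the first filter but not the second.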

Groupby expanding count - elements changing of group at different time stamps

I have a huge DataFrame that looks as follows (this is just an example to illustrate the problem):
id timestamp target_time interval
1 08:00:00 10:20:00 (10-11]
1 08:30:00 10:21:00 (10-11]
1 09:10:00 11:30:00 (11-12]
2 09:15:00 10:15:00 (10-11]
2 09:35:00 10:11:00 (10-11]
3 09:45:00 11:12:00 (11-12]
...
I would like to create a series looking as follows:
interval timestamp unique_ids
(10-11] 08:00:00 1
08:30:00 1
09:15:00 1
09:35:00 1
(11-12] 09:10:00 1
09:45:00 2
The objective is to count, for each time interval, how many unique ids had their corresponding target_time within the interval at their timestamp. Note that the target_time for each id can change at different timestamps. For instance, for the id 1 the interval is (10-11] from 08:00:00 to 08:30:00, but then it changes to (11-12] at 09:10:00. Therefore, at 09:15:00 I do not want to count the id 1 in the resulting Series.
I tried a groupby -> expand -> np.unique approach, but it does not provide the result that I want:
df.set_index('timestamp').groupby('interval').id.expanding().apply(lambda x: np.unique(x).shape[0])
interval timestamp unique_ids
(10-11] 08:00:00 1
08:30:00 1
09:15:00 2
09:35:00 2
(11-12] 09:10:00 1
09:45:00 2
Any hints on how I can approach this problem? I want to use pandas routines as much as possible to reduce computation time, since the DataFrame has 1453076 rows...
Many thanks in advance!
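
One way to get the desired counts, sketched under assumptions, is a single pass over the rows in timestamp order that remembers which interval each id currently belongs to. This is a plain Python loop rather than a vectorized pandas routine, so it may or may not be fast enough for the full DataFrame; the function name `running_unique_ids` is made up, and target_time is left out because only the interval column is needed.

```python
from collections import Counter

import pandas as pd

# Sample data from the question (target_time omitted; interval is enough).
df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 3],
    "timestamp": ["08:00:00", "08:30:00", "09:10:00",
                  "09:15:00", "09:35:00", "09:45:00"],
    "interval": ["(10-11]", "(10-11]", "(11-12]",
                 "(10-11]", "(10-11]", "(11-12]"],
})


def running_unique_ids(frame):
    """Count, at each row, how many ids currently sit in that row's interval."""
    ordered = frame.sort_values("timestamp", kind="stable")
    current = {}         # id -> interval the id belongs to right now
    members = Counter()  # interval -> number of ids currently in it
    counts = []
    for row in ordered.itertuples(index=False):
        old = current.get(row.id)
        if old != row.interval:
            if old is not None:
                members[old] -= 1  # the id leaves its old interval
            members[row.interval] += 1
            current[row.id] = row.interval
        counts.append(members[row.interval])
    out = ordered.assign(unique_ids=counts)
    return out.set_index(["interval", "timestamp"])["unique_ids"].sort_index()


result = running_unique_ids(df)
print(result)
```

On the sample this moves id 1 out of (10-11] at 09:10:00, so it is no longer counted there at 09:15:00.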

Check Multiple Columns for the highest value

Say I have a table like this, holding the scores of 3 different teams for the week.
Day Team1 Team2 Team3
Mon 5 2 2
Tue 0 7 7
Wed 6 3 2
Thu 0 0 1
Fri 13 6 5
I want a formula that finds the highest score for each day and marks it with a 1 in an identical table, marking the other teams with 0.
If two values tie for the highest, I want both to be marked 1.
There will never be a day with all 0s.
Using the data from the table above, my other table would look like this:
Day Team1 Team2 Team3
Mon 1 0 0
Tue 0 1 1
Wed 1 0 0
Thu 0 0 1
Fri 1 0 0
I have a working formula:
=IF(AND(B2>=$C2,B2>=$D2,B2>=$E2),1,0)
I was hoping there is a better way to write it, so that I can drag it across the teams and have it still work.
If I drag my formula now, I have to update it for each column, and sometimes I might have 20+ teams.
Any advice is appreciated.

Use MAX():
=IF(B2=MAX($B2:$D2),1,0)
Then copy/drag over and down.
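
For readers who keep the scores in pandas instead of a spreadsheet, the same "equal to the row maximum" rule can be sketched as follows. This is an analogue of the formula, not Excel syntax; the names `scores` and `marks` are made up for the example.

```python
import pandas as pd

# Score table from the question, with Day as the index.
scores = pd.DataFrame(
    {
        "Team1": [5, 0, 6, 0, 13],
        "Team2": [2, 7, 3, 0, 6],
        "Team3": [2, 7, 2, 1, 5],
    },
    index=pd.Index(["Mon", "Tue", "Wed", "Thu", "Fri"], name="Day"),
)

# Compare every cell with its row maximum; ties all become 1.
marks = scores.eq(scores.max(axis=1), axis=0).astype(int)
print(marks)
```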
