Pyspark window function with condition

Suppose I have a DataFrame of events with the time difference (in seconds) between consecutive rows. The main rule is that an event counts towards a visit only if it is within 5 minutes of the previous or the next event:
+--------+-------------------+--------+
|userid |eventtime |timeDiff|
+--------+-------------------+--------+
|37397e29|2017-06-04 03:00:00|60 |
|37397e29|2017-06-04 03:01:00|60 |
|37397e29|2017-06-04 03:02:00|60 |
|37397e29|2017-06-04 03:03:00|180 |
|37397e29|2017-06-04 03:06:00|60 |
|37397e29|2017-06-04 03:07:00|420 |
|37397e29|2017-06-04 03:14:00|60 |
|37397e29|2017-06-04 03:15:00|1140 |
|37397e29|2017-06-04 03:34:00|540 |
|37397e29|2017-06-04 03:53:00|540 |
+--------+-------------------+--------+
The challenge is to group the events into visits and report, for each visit, the start_time and the end_time, where end_time is the latest eventtime that still satisfies the 5-minute condition. The output should look like this table:
+--------+-------------------+--------------------+-----------+
|userid |start_time |end_time |events |
+--------+-------------------+--------------------+-----------+
|37397e29|2017-06-04 03:00:00|2017-06-04 03:07:00 |6 |
|37397e29|2017-06-04 03:14:00|2017-06-04 03:15:00 |2 |
+--------+-------------------+--------------------+-----------+
So far I have used window lag functions and some conditions, but I do not know where to go from here:
%spark.pyspark
from pyspark.sql import functions as F
from pyspark.sql import Window as W
from pyspark.sql.functions import col
# One window per user and reference number, ordered by event time.
windowSpec = W.partitionBy(result_poi["userid"], result_poi["unique_reference_number"]).orderBy(result_poi["eventtime"])
# Event time of the following row, e.g. 3:03pm when the current row is 3:00pm.
nextEventTime = F.lead(col("eventtime"), 1).over(windowSpec)
# Event time of the preceding row, e.g. 3:00pm when the current row is 3:03pm.
previousEventTime = F.lag(col("eventtime"), 1).over(windowSpec)
# Seconds until the next event and since the previous event; 0 when there is no such row.
nextTimeDiff = F.coalesce(F.unix_timestamp(nextEventTime) - F.unix_timestamp("eventtime"), F.lit(0))
previousTimeDiff = F.coalesce(F.unix_timestamp("eventtime") - F.unix_timestamp(previousEventTime), F.lit(0))
# Check whether the event is less than 5 minutes (300 seconds) away from the next or the previous event.
validation = F.coalesce((nextTimeDiff < 300) | (previousTimeDiff < 300), F.lit(False))
# Change True to 1 and False to 0.
visitCheck = validation.cast("int")
result_poi.withColumn("visit_check", visitCheck).withColumn("nextTimeDiff", nextTimeDiff).select("userid", "eventtime", "nextTimeDiff", "visit_check").orderBy("eventtime")
My questions: Is this a viable approach and, if so, how can I "go forward" and find the maximum eventtime that fulfills the 5-minute condition? As far as I know, that would mean iterating through the values of a Spark SQL Column. Is that possible, and wouldn't it be too expensive? Is there another way to achieve this result?
Result of the solution suggested by @Aku:
+--------+--------+---------------------+---------------------+------+
|userid |subgroup|start_time |end_time |events|
+--------+--------+---------------------+---------------------+------+
|37397e29|0 |2017-06-04 03:00:00.0|2017-06-04 03:06:00.0|5 |
|37397e29|1 |2017-06-04 03:07:00.0|2017-06-04 03:14:00.0|2 |
|37397e29|2 |2017-06-04 03:15:00.0|2017-06-04 03:15:00.0|1 |
|37397e29|3 |2017-06-04 03:34:00.0|2017-06-04 03:43:00.0|2 |
+--------+--------+---------------------+---------------------+------+
It doesn't give the expected result. 3:07 - 3:14 and 03:34 - 03:43 are counted as ranges within 5 minutes, which they should not be. Also, 3:07 should be the end_time of the first row, as it is within 5 minutes of the previous event at 3:06.

You'll need one extra window function and a groupBy to achieve this.
What we want is for every row with timeDiff greater than 300 to be the end of a group, with the following row starting a new one. Aku's solution should work, except that its indicators mark the start of a group instead of the end. To change this, take the cumulative sum up to row n-1 instead of row n (n being the current row):
from pyspark.sql.window import Window
import pyspark.sql.functions as func
w = Window.partitionBy("userid").orderBy("eventtime")
# 1 for every row that closes a group, 0 otherwise.
DF = DF.withColumn("indicator", (DF.timeDiff > 300).cast("int"))
# Running sum of the indicators over the preceding rows only.
DF = DF.withColumn("subgroup", func.sum("indicator").over(w) - func.col("indicator"))
DF = DF.groupBy("subgroup").agg(
    func.min("eventtime").alias("start_time"),
    func.max("eventtime").alias("end_time"),
    func.count("*").alias("events")
)
+--------+-------------------+-------------------+------+
|subgroup| start_time| end_time|events|
+--------+-------------------+-------------------+------+
| 0|2017-06-04 03:00:00|2017-06-04 03:07:00| 6|
| 1|2017-06-04 03:14:00|2017-06-04 03:15:00| 2|
| 2|2017-06-04 03:34:00|2017-06-04 03:34:00| 1|
| 3|2017-06-04 03:53:00|2017-06-04 03:53:00| 1|
+--------+-------------------+-------------------+------+
It seems that you also filter out lines with only one event, hence:
DF = DF.filter("events != 1")
+--------+-------------------+-------------------+------+
|subgroup| start_time| end_time|events|
+--------+-------------------+-------------------+------+
| 0|2017-06-04 03:00:00|2017-06-04 03:07:00| 6|
| 1|2017-06-04 03:14:00|2017-06-04 03:15:00| 2|
+--------+-------------------+-------------------+------+
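The "cumulative sum up to n-1" step can be checked without Spark. The following is only a plain-Python sketch of that arithmetic, not the PySpark code itself; the helper name assign_subgroups and the threshold parameter are made up for illustration, and the input is the timeDiff column from the question.

```python
from itertools import accumulate


def assign_subgroups(time_diffs, threshold=300):
    """Return one subgroup id per row (hypothetical helper).

    A row whose timeDiff exceeds the threshold closes its group, so its id is
    the running sum of the indicators of the rows before it, i.e. the
    cumulative sum minus the row's own indicator.
    """
    indicators = [int(diff > threshold) for diff in time_diffs]
    running = accumulate(indicators)
    return [total - ind for total, ind in zip(running, indicators)]


# timeDiff column from the question, in eventtime order
time_diffs = [60, 60, 60, 180, 60, 420, 60, 1140, 540, 540]
subgroups = assign_subgroups(time_diffs)
print(subgroups)  # [0, 0, 0, 0, 0, 0, 1, 1, 2, 3]
```

The first six rows share subgroup 0 and the next two share subgroup 1, which matches the event counts of 6 and 2 in the tables above.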

So if I understand this correctly, you essentially want to end each group when timeDiff > 300? This seems relatively straightforward with rolling window functions.
First, some imports:
from pyspark.sql.window import Window
import pyspark.sql.functions as func
Then set up the window; I assumed you would partition by userid:
w = Window.partitionBy("userid").orderBy("eventtime")
Then figure out which subgroup each observation falls into, by first marking the first member of each group and then taking the cumulative sum of that column:
indicator = (func.col("timeDiff") > 300).cast("integer")
subgroup = func.sum(indicator).over(w).alias("subgroup")
Then some aggregation functions and you should be done:
DF = DF.select("*", subgroup)\
    .groupBy("subgroup")\
    .agg(
        func.min("eventtime").alias("start_time"),
        func.max("eventtime").alias("end_time"),
        func.count(func.lit(1)).alias("events")
    )

One approach is to group the DataFrame based on your timeline criterion.
You can create a DataFrame containing the rows that break the 5-minute timeline.
Those rows are the criteria for grouping the records, and they set the start time and end time of each group.
Then find the count and the maximum timestamp (end time) for each group.

Related

Using multiple parent IDs for cutoff times in deep feature synthesis

My data looks like: People <-- Events <-- Activities. The parent is People, whose only variable is person_id. Events and Activities both have a time index, along with event_id and activity_id respectively, and each has a few features.
Members of the 'People' entity visit places at many different times. I am trying to generate deep features for people. If People is something like [1, 2, 3], how do I pass cutoff times that create deep features for pairs like (person, cutoff time): [1, January 2], [1, January 3]?
If I have only 3 People, it seems like I can't pass a cutoff_time dataframe that has 10 rows (for example, person 1 with 10 possible cutoff times). Trying this gives me the error "Duplicated rows in cutoff time dataframe", despite dropping duplicates from my cutoff_times dataframe.
Must I include a time index in the People entity? This would leave my parent entity with the same person appearing multiple times in the index, although with different time index values. My instinct is that the People entity should not include any datetime column. I would like to pass the cutoff times to the DFS function.
My cutoff_times df.head() looks like this, and has multiple instances of some person_id values:
+-------------------------------------------+
| person_id time label |
+-------------------------------------------+
| 0 f_GZSVLYU 2019-12-06 0.0 |
| 1 f_ATBJEQS 2019-12-06 1.0 |
| 2 f_GLFYVAY 2019-12-06 0.5 |
| 3 f_DIHPTPA 2019-12-06 0.5 |
| 4 f_GZSVLYU 2019-12-02 1.0 |
+-------------------------------------------+
The Parent People Entity is like this:
+-------------------+
| person_id |
+-------------------+
| 0 f_GZSVLYU |
| 1 f_ATBJEQS |
| 2 f_GLFYVAY |
| 3 f_DIHPTPA |
| 4 f_DVOYHRQ |
+-------------------+
How can I make featuretools understand what I'm trying to do?
The error is 'Duplicated rows in cutoff time dataframe.' I have explored my cutoff_times df and there are no duplicate rows. person_id, time, and label values each occur multiple times, but no two rows are identical. Could the duplicates the error refers to be somewhere else in the EntitySet?

The answer is that one row of the cutoff_df had the same ID and time as another row, but with a different label. That's a problem.
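To find such rows, it can help to look for repeated (person_id, time) pairs while ignoring the label. Below is a minimal plain-Python sketch of that check; the helper name conflicting_cutoffs and the sample rows are hypothetical, and only the column layout (person_id, time, label) comes from the question.

```python
from collections import defaultdict


def conflicting_cutoffs(rows):
    """Return {(person_id, time): [labels]} for keys that occur more than once."""
    labels_by_key = defaultdict(list)
    for person_id, time, label in rows:
        labels_by_key[(person_id, time)].append(label)
    return {key: labels for key, labels in labels_by_key.items() if len(labels) > 1}


# Hypothetical cutoff rows in the (person_id, time, label) layout of the question.
cutoff_rows = [
    ("f_GZSVLYU", "2019-12-06", 0.0),
    ("f_ATBJEQS", "2019-12-06", 1.0),
    ("f_GZSVLYU", "2019-12-06", 1.0),  # same id and time as the first row, different label
    ("f_GZSVLYU", "2019-12-02", 1.0),  # same id, different time: not a conflict
]
conflicts = conflicting_cutoffs(cutoff_rows)
print(conflicts)
```

With a pandas DataFrame, cutoff_times.duplicated(subset=["person_id", "time"], keep=False) should flag the same rows. Dropping fully identical rows does not remove them, because their labels differ.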

Extract a substring new column based on a substring based on conditions ideally with Pandas

I have a data set (Excel) with hundreds of entries. Most of the information is in one string column. The pieces of information are separated by '_' and were typed in by humans, so it is not possible to work with index positions.
To create a usable data basis, I need to extract information from this column into another column.
The search pattern '*v*' alone is not enough, but combined with the condition that the first character has to be a digit, it works.
I tried to get it to work with iterrows, iteritems, str.strip, str.extract and many more, but the best result I got was with a for loop.
pattern = '_*v*_'
test = []
for i in df['col']:
    # Split the string into substrings
    i = i.split('_')
    for c in i:
        if c.find('x') == 1:
            if c[0].isdigit():
                # print(c)
                test.append(c)
            else:
                # To be able to fix a few rows manually
                test.append(0)
[4]: test =[22v3, 33v55, 4v2]
#Input
+-----------+-----------+
| col | targetcol |
+-----------+-----------+
| as_22v3 | |
| 33v55_bdd | |
| Ave_4v2 | |
+-----------+-----------+
#Output
+-----------+-----------+
| col | targetcol |
+-----------+-----------+
| as_22v3 | 22v3 |
| 33v55_bdd | 33v55 |
| Ave_4v2 | 4v2 |
+-----------+-----------+
My code does work, but only for the first few rows. It stops after 36 values and I can't figure out why. There is no error message, except of course that the list cannot be assigned to a DataFrame column because it does not have the same length.

pandas.Series.str.extract should help:
>>> import pandas as pd
>>> df = pd.DataFrame({
...     'col': ['as_22v3', '33v55_bdd', 'Ave_4v2']
... })
>>> df['col'].str.extract(r'(\d+v+\d+)')
       0
0   22v3
1  33v55
2    4v2
To store the result in a new column:
df['targetcol'] = df['col'].str.extract(r'(\d+v+\d+)')
EDIT
df = pd.DataFrame({
'col': ['as_22v3', '33v55_bdd', 'Ave_4v2', '_22 v3', 'space 2,2v3', '2.v3',
'2.111v999', 'asd.123v77', '1 v7', '123 v 8135']
})
pattern = r'(\d+(\,[0-9]+)?(\s+)?v\d+)'
df['result'] = df['col'].str.extract(pattern)[0]
col result
0 as_22v3 22v3
1 33v55_bdd 33v55
2 Ave_4v2 4v2
3 _22 v3 22 v3
4 space 2,2v3 2,2v3
5 2.v3 NaN
6 2.111v999 111v999
7 asd.123v77 123v77
8 1 v7 1 v7
9 123 v 8135 NaN
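The pattern from the EDIT can also be tried with the standard re module. This is only a sketch under one assumption: that taking the first search match of the first capture group gives the same string as str.extract(pattern)[0]. The helper name extract_code is made up for illustration.

```python
import re

# Same pattern as in the EDIT above: digits, an optional ",digits" part,
# optional whitespace, then "v" followed directly by digits.
PATTERN = re.compile(r'(\d+(\,[0-9]+)?(\s+)?v\d+)')


def extract_code(text):
    """Return the first match of PATTERN in text, or None if there is none."""
    match = PATTERN.search(text)
    return match.group(1) if match else None


samples = ['as_22v3', '33v55_bdd', 'Ave_4v2', '_22 v3', '2.v3', '123 v 8135']
results = {s: extract_code(s) for s in samples}
for sample, result in results.items():
    print(f'{sample!r} -> {result!r}')
```

Strings such as '2.v3' and '123 v 8135' give None here because the pattern requires a digit immediately after the "v" and, before it, only digits, an optional comma part and optional whitespace.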

You say it stops after 36 values, and that you are processing an Excel file? One thing you could try is to save the data set as a .csv file and read it in with the pd.read_csv function. Excel files sometimes contain extra characters that are not easily visible.

pyspark - weighted moving average through uneven period lengths

I am trying to calculate a weighted (based on duration) moving average of a dataframe with unevenly spaced timestamp records.
Below is an example df.
+-----+-------------------+
|value| date|
+-----+-------------------+
| 9.0|2017-03-15 11:42:00|
| 7.0|2017-03-16 13:02:00|
| 7.0|2017-03-16 19:02:00|
| 7.0|2017-03-16 21:38:00|
| 7.0|2017-03-16 21:58:00|
| 6.0|2017-03-18 10:07:00|
| 22.0|2017-03-18 12:21:00|
| 21.0|2017-03-20 23:21:00|
| 19.0|2017-03-21 10:21:00|
| 17.0|2017-03-04 11:01:00|
| 16.0|2017-03-09 18:41:00|
+-----+-------------------+
I have tried to use rangeBetween, but I think it only computes a simple average.
Then I tried the pyspark.sql.functions.window method with w = window('date', '7 days', '5 minutes') and a udf to calculate the weighted average, but I haven't even been able to calculate a simple average because it took forever to compute.
from pyspark.sql import Window
from pyspark.sql.functions import avg, window
w = window('date', '7 days', '5 minutes')
win = Window.partitionBy(w).orderBy(df['date'].asc())
new_df = df.withColumn('average', avg('value').over(win))
I was also advised to transform the table to an evenly spaced time period.
Which one do you advise and why, and how should I approach window sliding and filling?
I am a newbie in pyspark.
Thanks
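For reference, one possible reading of "weighted by duration" is that each value holds from its own timestamp until the next record, so its weight is the number of seconds in between. The plain-Python sketch below shows only that weighting step for the records of a single window. It assumes this interpretation, leaves the window sliding itself out, and uses a made-up helper name and sample values.

```python
from datetime import datetime


def duration_weighted_average(records):
    """Average of the values, each weighted by the seconds until the next record.

    Assumption: a value holds from its own timestamp until the next one, so the
    last record has no known duration and gets zero weight.
    """
    ordered = sorted(records, key=lambda record: record[0])
    weighted_sum = 0.0
    total_seconds = 0.0
    for (start, value), (end, _) in zip(ordered, ordered[1:]):
        seconds = (end - start).total_seconds()
        weighted_sum += value * seconds
        total_seconds += seconds
    return weighted_sum / total_seconds if total_seconds else None


# Hypothetical records: 10.0 holds for 10 minutes, 20.0 holds for 30 minutes.
records = [
    (datetime(2017, 3, 16, 12, 0), 10.0),
    (datetime(2017, 3, 16, 12, 10), 20.0),
    (datetime(2017, 3, 16, 12, 40), 0.0),
]
average = duration_weighted_average(records)
print(average)  # 17.5
```

A simple average of the first two values would be 15.0. The weighted result is pulled towards 20.0 because that value was in effect three times as long.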

Pyspark - combine 2 rows 2 one, every 2 rows

I have a pyspark dataframe like the one in the picture below. I would like to group every 2 rows, in such a way that:
the first row is built from rows 1 and 2 of that user, and
the second row is built from rows 2 and 3, etc.
Something like this:
---CustomerID--previous_stockcodes----stock_codes-----
Prices and quantities are not used; the previous basket and the current basket are put into one row. For example, the first row of CustomerID 12347 would be:
12347----[85116, 22375, 71...]-----[84625A, 84625C, ...]
I have written loops to do that, but that is really inefficient and slow. I wonder if I can do something like that efficiently using pyspark, but I am having trouble figuring it out. Thanks a lot in advance.

You can get the next row by using the lead function provided by spark-sql.
lead is a window function.
Syntax: lead(column_name, int_value, default_value) over (partition by column_name order by column_name)
int_value is the number of rows to look ahead from the current row.
default_value is the value returned when there is no such following row.
>>> input_df.show()
+----------+---------+----------------+
|customerID|invoiceNo| stockCode_list|
+----------+---------+----------------+
| 12347| 537626| [85116, 22375]|
| 12347| 542237|[84625A, 84625C]|
| 12347| 549222| [22376, 22374]|
| 12347| 556201| [23084, 23162]|
| 12348| 539318| [84992, 22951]|
| 12348| 541998| [21980, 21985]|
| 12348| 548955| [23077, 23078]|
+----------+---------+----------------+
>>> from pyspark.sql.window import Window
>>> from pyspark.sql.functions import lead,col
>>> win_func = Window.partitionBy("customerID").orderBy("invoiceNo")
>>> new_col = lead("stockCode_list",1,None).over(win_func)
>>> req_df = input_df.select(col("customerID"),col("invoiceNo"),col("stockCode_list"),new_col.alias("req_col"))
>>> req_df.orderBy("customerID","invoiceNo").show()
+----------+---------+----------------+----------------+
|customerID|invoiceNo| stockCode_list| req_col|
+----------+---------+----------------+----------------+
| 12347| 537626| [85116, 22375]|[84625A, 84625C]|
| 12347| 542237|[84625A, 84625C]| [22376, 22374]|
| 12347| 549222| [22376, 22374]| [23084, 23162]|
| 12347| 556201| [23084, 23162]| null|
| 12348| 539318| [84992, 22951]| [21980, 21985]|
| 12348| 541998| [21980, 21985]| [23077, 23078]|
| 12348| 548955| [23077, 23078]| null|
+----------+---------+----------------+----------------+
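The question asks for (previous basket, current basket) pairs, which is the same pairing seen from the later row. The following plain-Python sketch shows that pairing on the sample rows above; it is not Spark code, and the helper name pair_with_previous is made up for illustration.

```python
from itertools import groupby
from operator import itemgetter


def pair_with_previous(rows):
    """Return (customer_id, previous_codes, current_codes) for consecutive invoices.

    rows: iterable of (customer_id, invoice_no, stock_codes) tuples.
    The first invoice of each customer has no previous basket and is skipped.
    """
    ordered = sorted(rows, key=itemgetter(0, 1))
    pairs = []
    for customer_id, group in groupby(ordered, key=itemgetter(0)):
        invoices = list(group)
        for previous, current in zip(invoices, invoices[1:]):
            pairs.append((customer_id, previous[2], current[2]))
    return pairs


# Sample rows from input_df above.
rows = [
    (12347, 537626, ["85116", "22375"]),
    (12347, 542237, ["84625A", "84625C"]),
    (12347, 549222, ["22376", "22374"]),
    (12347, 556201, ["23084", "23162"]),
    (12348, 539318, ["84992", "22951"]),
    (12348, 541998, ["21980", "21985"]),
    (12348, 548955, ["23077", "23078"]),
]
pairs = pair_with_previous(rows)
for pair in pairs:
    print(pair)
```

In PySpark, the same shape should follow from using lag instead of lead over the same window and dropping the rows where the lagged column is null.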

Distribution of time periods over rows with certain status (column value)

I have a Pyspark dataframe containing logs, with each row corresponding to the state of the system at the time it is logged, and a group number. I would like to find the lengths of the time periods for which each group is in an unhealthy state.
For example, if this were my table:
TIMESTAMP | STATUS_CODE | GROUP_NUMBER
--------------------------------------
02:03:11 | healthy | 000001
02:03:04 | healthy | 000001
02:03:03 | unhealthy | 000001
02:03:00 | unhealthy | 000001
02:02:58 | healthy | 000008
02:02:57 | healthy | 000008
02:02:55 | unhealthy | 000001
02:02:54 | healthy | 000001
02:02:50 | healthy | 000007
02:02:48 | healthy | 000004
I would want to return Group 000001 having an unhealthy time period of 9 seconds (from 02:02:55 to 02:03:04).
Other groups could also have unhealthy time periods, and I would want to return those as well.
Due to the possibility of consecutive rows with the same status, and since rows of different groups are interspersed, I am struggling to find a way to do this efficiently.
I cannot convert the Pyspark dataframe to a Pandas dataframe, as it is much too large.
How can I efficiently determine the lengths of these time periods?
Thanks so much!

The pyspark with spark-sql solution would look like this.
First we create the sample data set. In addition, we generate a row_number field, partitioned by group and ordered by timestamp. Then we register the resulting dataframe as a table, say table1.
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number
from pyspark.sql.functions import unix_timestamp
df = spark.createDataFrame([
    ('2017-01-01 02:03:11','healthy','000001'),
    ('2017-01-01 02:03:04','healthy','000001'),
    ('2017-01-01 02:03:03','unhealthy','000001'),
    ('2017-01-01 02:03:00','unhealthy','000001'),
    ('2017-01-01 02:02:58','healthy','000008'),
    ('2017-01-01 02:02:57','healthy','000008'),
    ('2017-01-01 02:02:55','unhealthy','000001'),
    ('2017-01-01 02:02:54','healthy','000001'),
    ('2017-01-01 02:02:50','healthy','000007'),
    ('2017-01-01 02:02:48','healthy','000004')
],['timestamp','state','group_id'])
# Number the rows of each group in timestamp order.
df = df.withColumn('rownum', row_number().over(Window.partitionBy(df.group_id).orderBy(unix_timestamp(df.timestamp))))
# createOrReplaceTempView replaces the deprecated registerTempTable.
df.createOrReplaceTempView("table1")
Once the dataframe is registered as a table (table1), the required data can be computed as below using spark-sql:
>>> spark.sql("""
... SELECT t1.group_id,sum((t2.timestamp_value - t1.timestamp_value)) as duration
... FROM
... (SELECT unix_timestamp(timestamp) as timestamp_value,group_id,rownum FROM table1 WHERE state = 'unhealthy') t1
... LEFT JOIN
... (SELECT unix_timestamp(timestamp) as timestamp_value,group_id,rownum FROM table1) t2
... ON t1.group_id = t2.group_id
... AND t1.rownum = t2.rownum - 1
... group by t1.group_id
... """).show()
+--------+--------+
|group_id|duration|
+--------+--------+
| 000001| 9|
+--------+--------+
The sample data set had unhealthy data for group_id 000001 only, but this solution also works when other group_ids have unhealthy states.
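The logic of the join can be checked on the sample rows without Spark. The plain-Python sketch below roughly mirrors the query: every unhealthy row contributes the seconds until the next row of the same group. Unlike the SQL, it leaves out groups whose total is zero. The helper name unhealthy_seconds is made up for illustration.

```python
from collections import defaultdict
from datetime import datetime


def unhealthy_seconds(rows):
    """Return {group_id: total seconds spent unhealthy}.

    Each unhealthy row contributes the time until the next row of the same
    group; an unhealthy row with no following row contributes nothing.
    """
    by_group = defaultdict(list)
    for timestamp, state, group_id in rows:
        parsed = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
        by_group[group_id].append((parsed, state))
    durations = {}
    for group_id, events in by_group.items():
        events.sort(key=lambda event: event[0])
        total = 0
        for (start, state), (end, _) in zip(events, events[1:]):
            if state == "unhealthy":
                total += int((end - start).total_seconds())
        if total:
            durations[group_id] = total
    return durations


# Sample rows from the answer above.
rows = [
    ("2017-01-01 02:03:11", "healthy", "000001"),
    ("2017-01-01 02:03:04", "healthy", "000001"),
    ("2017-01-01 02:03:03", "unhealthy", "000001"),
    ("2017-01-01 02:03:00", "unhealthy", "000001"),
    ("2017-01-01 02:02:58", "healthy", "000008"),
    ("2017-01-01 02:02:57", "healthy", "000008"),
    ("2017-01-01 02:02:55", "unhealthy", "000001"),
    ("2017-01-01 02:02:54", "healthy", "000001"),
    ("2017-01-01 02:02:50", "healthy", "000007"),
    ("2017-01-01 02:02:48", "healthy", "000004"),
]
durations = unhealthy_seconds(rows)
print(durations)  # {'000001': 9}
```

The 9 seconds are the sum of 5 (02:02:55 to 02:03:00), 3 (02:03:00 to 02:03:03) and 1 (02:03:03 to 02:03:04).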

One straightforward way (maybe not optimal) is:
Map to [K,V] with GROUP_NUMBER as the key K.
Use repartitionAndSortWithinPartitions, so that all data for every single group ends up in the same partition, sorted by TIMESTAMP. A detailed explanation of how it works is in this answer: Pyspark: Using repartitionAndSortWithinPartitions with multiple sort Critiria
Finally, use mapPartitions to get an iterator over the sorted data in a single partition, so you can easily find the answer you need. (Explanation of mapPartitions: How does the pyspark mapPartitions function work?)
