Distribution of time periods over rows with certain status (column value) - apache-spark

I have a Pyspark dataframe containing logs, with each row corresponding to the state of the system at the time it is logged, and a group number. I would like to find the lengths of the time periods for which each group is in an unhealthy state.
For example, if this were my table:
TIMESTAMP | STATUS_CODE | GROUP_NUMBER
--------------------------------------
02:03:11 | healthy | 000001
02:03:04 | healthy | 000001
02:03:03 | unhealthy | 000001
02:03:00 | unhealthy | 000001
02:02:58 | healthy | 000008
02:02:57 | healthy | 000008
02:02:55 | unhealthy | 000001
02:02:54 | healthy | 000001
02:02:50 | healthy | 000007
02:02:48 | healthy | 000004
I would want to return Group 000001 having an unhealthy time period of 9 seconds (from 02:02:55 to 02:03:04).
Other groups could also have unhealthy time periods, and I would want to return those as well.
Due to the possibility of consecutive rows with the same status, and since rows of different groups are interspersed, I am struggling to find a way to do this efficiently.
I cannot convert the Pyspark dataframe to a Pandas dataframe, as it is much too large.
How can I efficiently determine the lengths of these time periods?
Thanks so much!

A PySpark plus Spark SQL solution would look like this.
First we create the sample dataset. In addition to the data we generate a row_number field, partitioned by group and ordered by the timestamp. Then we register the generated dataframe as a table, say table1.
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number
from pyspark.sql.functions import unix_timestamp
df = spark.createDataFrame([
('2017-01-01 02:03:11','healthy','000001'),
('2017-01-01 02:03:04','healthy','000001'),
('2017-01-01 02:03:03','unhealthy','000001'),
('2017-01-01 02:03:00','unhealthy','000001'),
('2017-01-01 02:02:58','healthy','000008'),
('2017-01-01 02:02:57','healthy','000008'),
('2017-01-01 02:02:55','unhealthy','000001'),
('2017-01-01 02:02:54','healthy','000001'),
('2017-01-01 02:02:50','healthy','000007'),
('2017-01-01 02:02:48','healthy','000004')
],['timestamp','state','group_id'])
df = df.withColumn('rownum', row_number().over(Window.partitionBy(df.group_id).orderBy(unix_timestamp(df.timestamp))))
df.createOrReplaceTempView("table1")  # registerTempTable is deprecated in Spark 2.x+
Once the dataframe is registered as a table (table1), the required data can be computed as below using Spark SQL:
>>> spark.sql("""
... SELECT t1.group_id,sum((t2.timestamp_value - t1.timestamp_value)) as duration
... FROM
... (SELECT unix_timestamp(timestamp) as timestamp_value,group_id,rownum FROM table1 WHERE state = 'unhealthy') t1
... LEFT JOIN
... (SELECT unix_timestamp(timestamp) as timestamp_value,group_id,rownum FROM table1) t2
... ON t1.group_id = t2.group_id
... AND t1.rownum = t2.rownum - 1
... group by t1.group_id
... """).show()
+--------+--------+
|group_id|duration|
+--------+--------+
| 000001| 9|
+--------+--------+
The sample dataset had unhealthy data for group_id 000001 only, but this solution also works when other group_ids have unhealthy states.
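For reference, the same logic can also be written with the DataFrame API instead of SQL (an equivalent sketch, reusing the df defined above): for every unhealthy row, take the gap to the next row of the same group, then sum those gaps per group.
from pyspark.sql.window import Window
from pyspark.sql.functions import col, lead, unix_timestamp, sum as sum_

w = Window.partitionBy('group_id').orderBy(unix_timestamp('timestamp'))
durations = (df
    .withColumn('ts', unix_timestamp('timestamp'))
    .withColumn('next_ts', lead('ts').over(w))       # timestamp of the next row in the group
    .where(col('state') == 'unhealthy')
    .groupBy('group_id')
    .agg(sum_(col('next_ts') - col('ts')).alias('duration')))
durations.show()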

One straightforward way (maybe not optimal) is:
Map to [K,V] with GROUP_NUMBER as the Key K
Use repartitionAndSortWithinPartitions, so that all data for every single group ends up in the same partition, sorted by TIMESTAMP. A detailed explanation of how it works is in this answer: Pyspark: Using repartitionAndSortWithinPartitions with multiple sort Critiria
And finally use mapPartitions to get an iterator over the sorted data within a single partition, so you can easily find the answer you need. (Explanation of mapPartitions: How does the pyspark mapPartitions function work?)
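A rough sketch of that approach (untested; it assumes the question's columns TIMESTAMP, STATUS_CODE and GROUP_NUMBER, with TIMESTAMP holding full date-time strings that unix_timestamp can parse, and the partition count is purely illustrative):
from pyspark.rdd import portable_hash
from pyspark.sql.functions import unix_timestamp

# Key each row by (group, timestamp in seconds); the value is the status.
kv = (df
    .withColumn('ts', unix_timestamp('TIMESTAMP'))
    .rdd
    .map(lambda r: ((r['GROUP_NUMBER'], r['ts']), r['STATUS_CODE'])))

num_partitions = 8  # illustrative

# Every row of a group lands in the same partition, sorted by (group, timestamp).
sorted_kv = kv.repartitionAndSortWithinPartitions(
    numPartitions=num_partitions,
    partitionFunc=lambda key: portable_hash(key[0]))  # partition on the group only

def unhealthy_durations(rows):
    # rows: iterator over ((group, ts), status), sorted within this partition
    prev = {}
    totals = {}
    for (group, ts), status in rows:
        last = prev.get(group)
        if last is not None and last[1] == 'unhealthy':
            totals[group] = totals.get(group, 0) + (ts - last[0])
        prev[group] = (ts, status)
    return iter(totals.items())

result = sorted_kv.mapPartitions(unhealthy_durations)
print(result.collect())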

Related

Reading list of queries from a file and executing them in pyspark

I want to read a list of queries stored in a text file (csv or any delimiter-separated format) and execute them one by one in pyspark. I am very new to Spark and wanted to know if there is any related Spark API which I can use for doing this.
sample data
C1 | C2 | C3
1 | 2 | 3
0 | 0 | 0
sample queries text file
select * from sample_data_table where C1 = 0,
select * from sample_data_table where C1 != 0
output
df1 ==> C1 | C2 | C3
0 | 0 | 0
df2 ==> C1 | C2 | C3
1 | 2 | 3
You can get the desired result by reading the file as a dataframe and passing each query to the spark.sql() method.
from pyspark.sql import *
from pyspark.sql.functions import *
spark = SparkSession.builder.master('local[*]').getOrCreate()
df = spark.read.text("file.txt")
# To have single query per row
rows = df.select(explode(split('value', ','))).collect()
'''
Using collect() is not recommended since it brings all the data to the driver;
if the driver does not have enough memory we get OOM errors.
We could do something like df.foreach(...) instead, but in this case we need the
SparkSession inside the loop body: code inside foreach is shipped to the executors,
which cannot access the SparkSession, so it would fail.
That's why collect() is used here. Alternatives are worth checking.
'''
for sql in rows:
    spark.sql(sql[0]).show()
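Since the query file is small anyway, one alternative (a sketch, assuming file.txt is readable from the driver) is to skip Spark for the file itself and read it with plain Python:
# Read the comma-separated queries on the driver, no collect() needed
with open("file.txt") as f:
    queries = [q.strip() for q in f.read().split(",") if q.strip()]

for q in queries:
    spark.sql(q).show()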

Using multiple parent IDs for cutoff times in deep feature synthesis

My data looks like: People <-- Events <-- Activities. The parent is People, whose only variable is person_id. Events and Activities both have a time index, along with event_id and activity_id respectively, both of which have a few features.
Members of the 'People' entity visit places at all different times. I am trying to generate deep features for people. If People is something like [1, 2, 3], how do I pass cutoff times that create deep features for pairs like (person, cutoff time): [1, January 2], [1, January 3]?
If I have only 3 People, it seems like I can't pass a cutoff_time dataframe that has 10 rows (for example, person 1 with 10 possible cutoff times). Trying this gives me the error "Duplicated rows in cutoff time dataframe", despite dropping duplicates from my cutoff_times dataframe.
Must I include a time index in the People entity? That would leave my parent entity with the same person appearing multiple times in the index, although with different time indexes. My instinct is that the People entity should not include any datetime column. I would like to give cutoff times to the DFS function.
My cutoff_times df.head looks like this, and has multiple instances of some people_id:
+---+-----------+------------+-------+
|   | person_id | time       | label |
+---+-----------+------------+-------+
| 0 | f_GZSVLYU | 2019-12-06 | 0.0   |
| 1 | f_ATBJEQS | 2019-12-06 | 1.0   |
| 2 | f_GLFYVAY | 2019-12-06 | 0.5   |
| 3 | f_DIHPTPA | 2019-12-06 | 0.5   |
| 4 | f_GZSVLYU | 2019-12-02 | 1.0   |
+---+-----------+------------+-------+
The Parent People Entity is like this:
+---+-----------+
|   | person_id |
+---+-----------+
| 0 | f_GZSVLYU |
| 1 | f_ATBJEQS |
| 2 | f_GLFYVAY |
| 3 | f_DIHPTPA |
| 4 | f_DVOYHRQ |
+---+-----------+
How can I make featuretools understand what I'm trying to do?
'Duplicated rows in cutoff time dataframe.' I have explored my cutoff_times df and there are no duplicate rows. person_id, times, and labels all have multiple occurrences each, but no two rows are the same. Could the duplicates the error is referring to be somewhere else in the EntitySet?
The answer: rows of the cutoff_df had the same ID and time but different labels. That's the problem.
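A quick way to surface those rows (a sketch, assuming the cutoff dataframe is called cutoff_times and has the person_id and time columns shown above):
# Rows sharing the same (person_id, time) pair, regardless of label
dupes = cutoff_times[cutoff_times.duplicated(subset=['person_id', 'time'], keep=False)]
print(dupes.sort_values(['person_id', 'time']))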

How to cycle a Pandas dataframe grouping by hierarchical multiindex from top to bottom and store results

I'm trying to create a forecasting process using hierarchical time series. My problem is that I can't find a way to create a for loop that hierarchically extracts daily time series from a pandas dataframe grouping the sum of quantities by date. The resulting daily time series should be passed to a function inside the loop, and the results stored in some other object.
Dataset
The initial dataset is a table that represents the daily sales data of 3 hierarchical levels: city, shop, product. The initial table has this structure:
+============+============+============+============+==========+
| Id_Level_1 | Id_Level_2 | Id_Level_3 | Date | Quantity |
+============+============+============+============+==========+
| Rome | Shop1 | Prod1 | 01/01/2015 | 50 |
+------------+------------+------------+------------+----------+
| Rome | Shop1 | Prod1 | 02/01/2015 | 25 |
+------------+------------+------------+------------+----------+
| Rome | Shop1 | Prod1 | 03/01/2015 | 73 |
+------------+------------+------------+------------+----------+
| Rome | Shop1 | Prod1 | 04/01/2015 | 62 |
+------------+------------+------------+------------+----------+
| ... | ... | ... | ... | ... |
+------------+------------+------------+------------+----------+
| Milan | Shop3 | Prod9 | 31/12/2018 | 185 |
+------------+------------+------------+------------+----------+
| Milan | Shop3 | Prod9 | 31/12/2018 | 147 |
+------------+------------+------------+------------+----------+
| Milan | Shop3 | Prod9 | 31/12/2018 | 206 |
+------------+------------+------------+------------+----------+
Each City (Id_Level_1) has many Shops (Id_Level_2), and each one has some Products (Id_Level_3). Each shop has a different mix of products (maybe shop1 and shop3 have product7, which is not available in other shops). All data are daily and the measure of interest is the quantity.
Hierarchical Index (MultiIndex)
I need to create a tree structure (hierarchical structure) to extract a time series for each "node" of the structure. I call a "node" a combination of the hierarchical keys, i.e. "Rome" and "Milan" are nodes of Level 1, while "Rome|Shop1" and "Milan|Shop9" are nodes of Level 2. In particular, I need this at Level 3, because each product (Id_Level_3) has different sales in each shop of each city. The hierarchy is strict.
Nodes of level 3 are "Rome, Shop1, Prod1", "Rome, Shop1, Prod2", "Rome, Shop2, Prod1", and so on. The key of the nodes is logically the concatenation of the ids.
For each node, the time series is composed by two columns: Date and Quantity.
# MultiIndex dataframe
import pandas as pd
idx = pd.IndexSlice
Liv_Labels = ['Id_Level_1', 'Id_Level_2', 'Id_Level_3', 'Date']
df.set_index(Liv_Labels, drop=False, inplace=True)
Then I need to extract the aggregated time series in order, while keeping the hierarchical nodes.
Level 0:
Level_0 = df.groupby(level=['Date'])['Quantity'].sum()
Level 1:
# Node Level 1 "Rome"
Level_1['Rome'] = df.loc[idx[['Rome'],:,:]].groupby(level=['Date']).sum()
# Node Level 1 "Milan"
Level_1['Milan'] = df.loc[idx[['Milan'],:,:]].groupby(level=['Date']).sum()
Level 2:
# Node Level 2 "Rome, Shop1"
Level_2['Rome', 'Shop1'] = df.loc[idx[['Rome'],['Shop1'],:]].groupby(level=['Date']).sum()
... repeat for each level 2 node ...
# Node Level 2 "Milan, Shop9"
Level_2['Milan', 'Shop9'] = df.loc[idx[['Milan'],['Shop9'],:]].groupby(level=['Date']).sum()
Attempts
I already tried creating dictionaries and a MultiIndex, but my problem is that I can't get a proper "node" to use inside the loop. I can't even extract the unique node keys for each level, so I can't collect a specific node's time series.
# Number of hierarchical levels
Level_Num = 3
# Get level labels
Level_Labels = ['Id_Level_'+str(n) for n in range(1, Level_Num+1)] + ['Date']
# Initialize dictionary
TimeSeries = {}
# Get Level 0 time series
TimeSeries["Level_0"] = df.groupby(level=['Date'])['Quantity'].sum()
# Get the other levels' time series, from 1 to Level_Num
for i in range(1, Level_Num+1):
    TimeSeries["Level_"+str(i)] = df.groupby(level=Level_Labels[0:i]+['Date'])['Quantity'].sum()
Desired result
I would like a loop that cycles through my dataset and performs these actions:
Creates a structure of all the unique node keys
Extracts each node's time series (Quantity summed by Date)
Stores each time series in a structure for later use
Thanks in advance for any suggestion! Best regards.
FR
I'm currently working on a switch dataset that I pulled from an SQL database, where each port on each switch has a data frame holding a time series. To access the time series for a specific port, I represented the switches by their IP addresses and the port numbers on each switch, and to make sure I don't re-query what I already queried before, I used the .unique() method to get the unique values of each.
I set my index to the IP and port levels and accessed the port information like so:
def yield_df(df):
    for ip in df.index.get_level_values('ip').unique():
        for port in df.loc[ip].index.get_level_values('port').unique():
            yield df.loc[ip].loc[port]
Then I cycled the port data frames with a for loop like so:
for port_df in yield_df(adb_df):
    ...  # work with each port's time series here
I'm sure there are faster ways to carry out these procedures in pandas, but I hope this helps you start solving your problem.
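For example, a minimal sketch (untested, and assuming df is the flat table from the Dataset section, before set_index, with the columns Id_Level_1, Id_Level_2, Id_Level_3, Date and Quantity) that collects one aggregated daily series per node at every level:
import pandas as pd

# df is the flat table with columns: Id_Level_1, Id_Level_2, Id_Level_3, Date, Quantity
level_cols = ['Id_Level_1', 'Id_Level_2', 'Id_Level_3']

TimeSeries = {}
# Level 0: total quantity per day
TimeSeries[('Level_0',)] = df.groupby('Date')['Quantity'].sum()

# Levels 1..3: one series per node, keyed by ('Level_i', id_1, ..., id_i)
for i in range(1, len(level_cols) + 1):
    for node_key, node_df in df.groupby(level_cols[:i]):
        if not isinstance(node_key, tuple):
            node_key = (node_key,)  # single-key groupby yields a scalar key
        TimeSeries[('Level_%d' % i,) + node_key] = node_df.groupby('Date')['Quantity'].sum()

# Example access: the daily series of node "Rome, Shop1"
# TimeSeries[('Level_2', 'Rome', 'Shop1')]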

pyspark - weighted moving average through uneven period lengths

I am trying to calculate a duration-weighted moving average over a dataframe with unevenly spaced timestamp records.
Below is an example df.
+-----+-------------------+
|value| date|
+-----+-------------------+
| 9.0|2017-03-15 11:42:00|
| 7.0|2017-03-16 13:02:00|
| 7.0|2017-03-16 19:02:00|
| 7.0|2017-03-16 21:38:00|
| 7.0|2017-03-16 21:58:00|
| 6.0|2017-03-18 10:07:00|
| 22.0|2017-03-18 12:21:00|
| 21.0|2017-03-20 23:21:00|
| 19.0|2017-03-21 10:21:00|
| 17.0|2017-03-04 11:01:00|
| 16.0|2017-03-09 18:41:00|
+-----+-------------------+
I have tried to use rangeBetween, but I think it only computes a simple average.
Then I tried the pyspark.sql.functions.window method with w = window('date','7 days','5 minutes') and a UDF for the weighted average, but I haven't even been able to compute a simple average because it took forever.
w = window('date','7 days','5 minutes')
win = Window.partitionBy(w).orderBy(df['date'].asc())
new_df = df.withColumn('average',avg('value').over(win))
I was also advised to resample the table into evenly spaced time periods.
Which approach do you advise and why, and how should I handle the window sliding and filling?
I am a newbie in pyspark.
Thanks
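One way to express a duration-weighted average with rangeBetween (a rough, untested sketch: each value is weighted by the seconds until the next record, and the window looks back 7 days; there is no partition column in this example, so everything ends up in one partition):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Seconds since epoch, so rangeBetween can use a time range
df = df.withColumn('ts', F.unix_timestamp('date'))

# Duration each value is "in effect": seconds until the next record (0 for the last one)
order_w = Window.orderBy('ts')
df = df.withColumn('duration',
                   F.coalesce(F.lead('ts').over(order_w) - F.col('ts'), F.lit(0)))

# 7-day look-back window; weighted average = sum(value * duration) / sum(duration)
seven_days = 7 * 24 * 3600
range_w = Window.orderBy('ts').rangeBetween(-seven_days, 0)
df = df.withColumn('weighted_avg',
                   F.sum(F.col('value') * F.col('duration')).over(range_w)
                   / F.sum('duration').over(range_w))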

Pyspark window function with condition

Suppose I have a DataFrame of events with the time difference between each row; the main rule is that a visit is counted only if the event is within 5 minutes of the previous or next event:
+--------+-------------------+--------+
|userid |eventtime |timeDiff|
+--------+-------------------+--------+
|37397e29|2017-06-04 03:00:00|60 |
|37397e29|2017-06-04 03:01:00|60 |
|37397e29|2017-06-04 03:02:00|60 |
|37397e29|2017-06-04 03:03:00|180 |
|37397e29|2017-06-04 03:06:00|60 |
|37397e29|2017-06-04 03:07:00|420 |
|37397e29|2017-06-04 03:14:00|60 |
|37397e29|2017-06-04 03:15:00|1140 |
|37397e29|2017-06-04 03:34:00|540 |
|37397e29|2017-06-04 03:53:00|540 |
+--------+-------------------+--------+
The challenge is to group the events into visits, where end_time is the latest eventtime that still satisfies the condition of being within 5 minutes. The output should look like this table:
+--------+-------------------+--------------------+-----------+
|userid |start_time |end_time |events |
+--------+-------------------+--------------------+-----------+
|37397e29|2017-06-04 03:00:00|2017-06-04 03:07:00 |6 |
|37397e29|2017-06-04 03:14:00|2017-06-04 03:15:00 |2 |
+--------+-------------------+--------------------+-----------+
So far I have used window lag functions and some conditions, however, I do not know where to go from here:
%spark.pyspark
from pyspark.sql import functions as F
from pyspark.sql import Window as W
from pyspark.sql.functions import col
windowSpec = W.partitionBy(result_poi["userid"], result_poi["unique_reference_number"]).orderBy(result_poi["eventtime"])
windowSpecDesc = W.partitionBy(result_poi["userid"], result_poi["unique_reference_number"]).orderBy(result_poi["eventtime"].desc())
# The following row's eventtime (lag by -1 behaves like lead), e.g. 3:00pm -> 3:03pm
nextEventTime = F.lag(col("eventtime"), -1).over(windowSpec)
# The previous row's eventtime
previousEventTime = F.lag(col("eventtime"), 1).over(windowSpec)
diffEventTime = nextEventTime - col("eventtime")
nextTimeDiff = F.coalesce((F.unix_timestamp(nextEventTime)
                           - F.unix_timestamp('eventtime')), F.lit(0))
previousTimeDiff = F.coalesce((F.unix_timestamp('eventtime')
                               - F.unix_timestamp(previousEventTime)), F.lit(0))
# Check if the next POI is equal to the current POI and has a time difference of less than 5 minutes.
validation = F.coalesce(( (nextTimeDiff < 300) | (previousTimeDiff < 300) ), F.lit(False))
# Change True to 1
visitCheck = F.coalesce((validation == True).cast("int"), F.lit(1))
result_poi.withColumn("visit_check", visitCheck).withColumn("nextTimeDiff", nextTimeDiff).select("userid", "eventtime", "nextTimeDiff", "visit_check").orderBy("eventtime")
My questions: Is this a viable approach, and if so, how can I "go forward" and look at the maximum eventtime that fulfils the 5-minute condition? Is it even possible to iterate through the values of a Spark SQL Column, and wouldn't that be too expensive? Is there another way to achieve this result?
Result of Solution suggested by #Aku:
+--------+--------+---------------------+---------------------+------+
|userid  |subgroup|start_time           |end_time             |events|
+--------+--------+---------------------+---------------------+------+
|37397e29|0       |2017-06-04 03:00:00.0|2017-06-04 03:06:00.0|5     |
|37397e29|1       |2017-06-04 03:07:00.0|2017-06-04 03:14:00.0|2     |
|37397e29|2       |2017-06-04 03:15:00.0|2017-06-04 03:15:00.0|1     |
|37397e29|3       |2017-06-04 03:34:00.0|2017-06-04 03:43:00.0|2     |
+--------+--------+---------------------+---------------------+------+
It doesn't give the expected result. 3:07 - 3:14 and 3:34 - 3:43 are being counted as ranges within 5 minutes; it shouldn't be like that. Also, 3:07 should be the end_time of the first row, as it is within 5 minutes of the previous row at 3:06.
You'll need one extra window function and a groupby to achieve this.
What we want is for every line with timeDiff greater than 300 to be the end of a group and the start of a new one. Aku's solution should work, except that the indicators mark the start of a group instead of the end. To change this you'll have to take the cumulative sum up to n-1 instead of n (n being your current line):
from pyspark.sql import functions as func
from pyspark.sql.window import Window

w = Window.partitionBy("userid").orderBy("eventtime")
DF = DF.withColumn("indicator", (DF.timeDiff > 300).cast("int"))
DF = DF.withColumn("subgroup", func.sum("indicator").over(w) - func.col("indicator"))
DF = DF.groupBy("subgroup").agg(
    func.min("eventtime").alias("start_time"),
    func.max("eventtime").alias("end_time"),
    func.count("*").alias("events")
)
+--------+-------------------+-------------------+------+
|subgroup| start_time| end_time|events|
+--------+-------------------+-------------------+------+
| 0|2017-06-04 03:00:00|2017-06-04 03:07:00| 6|
| 1|2017-06-04 03:14:00|2017-06-04 03:15:00| 2|
| 2|2017-06-04 03:34:00|2017-06-04 03:34:00| 1|
| 3|2017-06-04 03:53:00|2017-06-04 03:53:00| 1|
+--------+-------------------+-------------------+------+
It seems that you also filter out lines with only one event, hence:
DF = DF.filter("events != 1")
+--------+-------------------+-------------------+------+
|subgroup| start_time| end_time|events|
+--------+-------------------+-------------------+------+
| 0|2017-06-04 03:00:00|2017-06-04 03:07:00| 6|
| 1|2017-06-04 03:14:00|2017-06-04 03:15:00| 2|
+--------+-------------------+-------------------+------+
So if I understand this correctly, you essentially want to end each group when TimeDiff > 300? This seems relatively straightforward with rolling window functions:
First some imports
from pyspark.sql.window import Window
import pyspark.sql.functions as func
Then set up the window; I assumed you would partition by userid:
w = Window.partitionBy("userid").orderBy("eventtime")
Then figure out which subgroup each observation falls into, by first marking the first member of each group and then summing that column:
indicator = (func.col("timeDiff") > 300).cast("integer")
subgroup = func.sum(indicator).over(w).alias("subgroup")
Then some aggregation functions and you should be done
DF = DF.select("*", subgroup)\
       .groupBy("subgroup")\
       .agg(
           func.min("eventtime").alias("start_time"),
           func.max("eventtime").alias("end_time"),
           func.count(func.lit(1)).alias("events")
       )
One approach can be to group the dataframe based on your timeline criteria.
You can create a dataframe with the rows that break the 5-minute timeline.
Those rows are the criteria for grouping the records, and
they set the start time and end time of each group.
Then find the count and the max timestamp (end time) for each group.

Resources