Fill missing timestamp with multiple categories using pyspark - apache-spark

I'm trying to fill missing timestamps using PySpark in AWS Glue.
My raw data's date column has a format like 20220202, and I want to convert 20220202 to 2022-02-02.
There are 5 columns:
(1) 'date' is the date column (like 20220202).
(2) 'sku' is categorical data like A, B, C... It has 25 different values and each sku has its own timestamps.
(3) 'unitprice' is numeric data and each sku has a different unitprice. For example, if sku A has unitprice 30 and 300 rows in the dataframe, all 300 rows have the same unitprice; sku B has a different unitprice.
(4) 'trand_item' is categorical data. It's a kind of metadata of the sku, like color, and follows the same condition as (3).
(5) 'target' is numeric data and each row has a different value.
When filling the missing timestamps, I want to add one row per missing day, keep each sku's 'unitprice' and 'trand_item' values, and fill 'target' with 0 in the newly added rows.
So I used code like this:
from pyspark.sql.functions import col, to_date
from pyspark.sql.types import StringType

sparkDF = sparkDF.select('date', 'sku', 'unitprice', 'trand_item', 'target')
sparkDF = sparkDF.withColumn("date", sparkDF["date"].cast(StringType()))
sparkDF = sparkDF.withColumn("date", to_date(col("date"), "yyyyMMdd"))  # "MM" is month; "mm" would be minutes
In the data there is an 'sku' column. It is categorical with 25 different values like A, B, C..., and each value has its own timestamps; the starting dates differ, but the ending date is the same.
sparkDF = sparkDF.dropDuplicates(['date', 'sku'])
sparkDF = sparkDF.sort("sku", "date")
Each sku (there are 25 skus in the data) has its own timestamps with missing dates, so I want to fill them.
How can I handle this?
<sample data>
date sku unitprice trand_item target
2018-01-01 A 10 Black 3
2018-02-01 A 10 Black 7
2018-04-01 A 10 Black 13
2017-08-01 B 20 White 4
2017-10-01 B 20 White 17
2017-11-01 B 20 White 9
<output i want>
date sku unitprice trand_item target
2018-01-01 A 10 Black 3
2018-02-01 A 10 Black 7
2018-03-01 A 10 Black 0
2018-04-01 A 10 Black 13
2017-08-01 B 20 White 4
2017-09-01 B 20 White 0
2017-10-01 B 20 White 17
2017-11-01 B 20 White 9

Your input:
data = [('2018-01-01','A',10,'Black',3),
        ('2018-02-01','A',10,'Black',7),
        ('2018-04-01','A',10,'Black',13),
        ('2017-08-01','B',20,'White',4),
        ('2017-10-01','B',20,'White',17),
        ('2017-11-01','B',20,'White',9)]
cols = ['date', 'sku', 'unitprice', 'trand_item', 'target']
df = sqlContext.createDataFrame(data, cols)
Inspired by the amazing solution from @blackbishop to PySpark generate missing dates and fill data with previous value:
from pyspark.sql import functions as F
from pyspark.sql import Window

df = df.withColumn("date", F.to_date(F.col("date"), "yyyy-dd-MM"))

dates_range = df.groupBy("sku").agg(
    F.date_trunc("dd", F.max(F.col("date"))).alias("max_date"),
    F.date_trunc("dd", F.min(F.col("date"))).alias("min_date")
).select(
    "sku",
    F.expr("sequence(min_date, max_date, interval 1 day)").alias("date")
).withColumn(
    "date", F.explode("date")
).withColumn(
    "date",
    F.date_format("date", "yyyy-MM-dd")
)

w = Window.partitionBy("sku").orderBy("date")
result = dates_range \
    .join(df, ["sku", "date"], "left") \
    .select(
        "sku", "date",
        *[F.last(F.col(c), ignorenulls=True).over(w).alias(c)
          for c in df.columns if c not in ("sku", "date", "target")],
        "target"
    ) \
    .fillna(0, subset=['target'])
result.show()
+---+----------+---------+----------+------+
|sku| date|unitprice|trand_item|target|
+---+----------+---------+----------+------+
| A|2018-01-01| 10| Black| 3|
| A|2018-01-02| 10| Black| 7|
| A|2018-01-03| 10| Black| 0|
| A|2018-01-04| 10| Black| 13|
| B|2017-01-08| 20| White| 4|
| B|2017-01-09| 20| White| 0|
| B|2017-01-10| 20| White| 17|
| B|2017-01-11| 20| White| 9|
+---+----------+---------+----------+------+
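To plug this back into the original Glue dataframe, where 'date' arrives as 20220202-style values, a minimal adaptation (my sketch; it assumes the column comes in as a string or integer and keeps the date type on both sides of the join) is to parse with the yyyyMMdd pattern before building the per-sku calendar:
from pyspark.sql import functions as F
from pyspark.sql import Window

# parse 20220202-style values into a proper date column
sparkDF = sparkDF.withColumn("date", F.to_date(F.col("date").cast("string"), "yyyyMMdd"))

# per-sku calendar of days between each sku's first and last date
dates_range = sparkDF.groupBy("sku").agg(
    F.min("date").alias("min_date"),
    F.max("date").alias("max_date")
).select(
    "sku",
    F.explode(F.expr("sequence(min_date, max_date, interval 1 day)")).alias("date")
)

# carry each sku's last known unitprice/trand_item forward, fill target with 0
w = Window.partitionBy("sku").orderBy("date")
result = dates_range.join(sparkDF, ["sku", "date"], "left") \
    .select(
        "sku", "date",
        *[F.last(c, ignorenulls=True).over(w).alias(c)
          for c in sparkDF.columns if c not in ("sku", "date", "target")],
        "target"
    ) \
    .fillna(0, subset=["target"])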

Related

transition matrix from pyspark dataframe

I have two columns (such as):
from to
1    2
1    3
2    4
4    2
4    2
4    3
3    3
And I want to create a transition matrix (where each column sums to 1):
1. 2. 3. 4.
1. 0 0 0 0
2. 0.5* 0 0 2/3
3. 0.5 0.5 1 1/3
4. 0 0.5 0 0
where 1 -> 2 would be: (the number of times 1 (in 'from') appears next to 2 (in 'to')) / (the total number of times 1 points to any value).
You can create this kind of transition matrix using a window and pivot.
First some dummy data:
import pandas as pd
import numpy as np
np.random.seed(42)
x = np.random.randint(1,5,100)
y = np.random.randint(1,5,100)
df = spark.createDataFrame(pd.DataFrame({'from': x, 'to': y}))
df.show()
+----+---+
|from| to|
+----+---+
| 3| 3|
| 4| 2|
| 1| 2|
...
To create a pct column, first group the data by unique combinations of from/to and get the counts. With that aggregated dataframe, create a new column, pct, which uses the Window to find the total number of records for each from group and uses that total as the denominator.
Lastly, pivot the table to make the to values the columns and the pct data the values of the matrix.
from pyspark.sql import functions as F, Window
w = Window().partitionBy('from')
grp = df.groupBy('from', 'to').count().withColumn('pct', F.col('count') / F.sum('count').over(w))
res = grp.groupBy('from').pivot('to').agg(F.round(F.first('pct'), 2))
res.show()
+----+----+----+----+----+
|from| 1| 2| 3| 4|
+----+----+----+----+----+
| 1| 0.2| 0.2|0.25|0.35|
| 2|0.27|0.31|0.19|0.23|
| 3|0.46|0.17|0.21|0.17|
| 4|0.13|0.13| 0.5|0.23|
+----+----+----+----+----+
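One caveat worth adding (my note, not part of the answer above): with sparse real data some from/to combinations may never occur, which leaves nulls after the pivot. Filling them with 0 reproduces the zeros shown in the desired matrix:
res = grp.groupBy('from').pivot('to').agg(F.round(F.first('pct'), 2)).fillna(0)  # fill never-seen transitions with 0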

How to get max closed date and status for the below input dataset in spark dataframe?

I have the below Spark dataframe and I need to check whether each job is closed or not. Each job can have sub-jobs, and a job is considered closed once all of its sub-jobs are closed.
Can you please advise how to achieve this in PySpark?
For example: input df
JobNum CloseDt ClosedFlg
12 N
12-01 2012-01-01 Y
12-02 2012-02-01 Y
13 2013-01-01 Y
14
14-01 2015-01-02 Y
14-02 N
Output_df:
JobNum IsClosedFlg Max_ClosedDt
12 Y 2012-02-01
13 Y 2013-01-01
14 N
You can assign a row number partitioned by the job number and ordered by the sub-job number in descending order, and filter the rows with row number = 1. (The parent row, whose sub-job part is null, sorts last because desc() places nulls last by default, so row number 1 is the highest sub-job whenever one exists.)
from pyspark.sql import functions as F, Window

df2 = df.withColumn(
    'rn',
    F.row_number().over(
        Window.partitionBy(F.split('JobNum', '-')[0])
              .orderBy(F.split('JobNum', '-')[1].desc())
    )
).filter('rn = 1').select(
    F.split('JobNum', '-')[0].alias('JobNum'),
    F.col('ClosedFlg').alias('IsClosedFlg'),
    F.col('CloseDt').alias('Max_ClosedDt')
)
df2.show()
+------+-----------+------------+
|JobNum|IsClosedFlg|Max_ClosedDt|
+------+-----------+------------+
| 12| Y| 2012-02-01|
| 13| Y| 2013-01-01|
| 14| N| null|
+------+-----------+------------+

how I can make a column pair with respect of a group?

I have a dataframe and an id column as a group. For each id I want to pair its elements in the following way:
title id
sal 1
summer 1
fada 1
row 2
winter 2
gole 2
jack 3
noway 3
output
title id pair
sal 1 None
summer 1 summer,sal
fada 1 fada,summer
row 2 None
winter 2 winter, row
gole 2 gole,winter
jack 3 None
noway 3 noway,jack
As you can see in the output, within each id group we pair each element with the element above it. Since the first element of a group does not have a pair, I put None. I should also mention that this can be done in pandas with the following code, but I need PySpark code since my data is big.
df=data.assign(pair=data.groupby('id')['title'].apply(lambda x: x.str.cat(x.shift(1),sep=',')))
I can't emphasise enough that a Spark dataframe is an unordered collection of rows, so saying something like "the element above it" is undefined without a column to order by. You can fake an ordering using F.monotonically_increasing_id(), but I'm not sure if that's what you wanted.
from pyspark.sql import functions as F, Window

w = Window.partitionBy('id').orderBy(F.monotonically_increasing_id())
df2 = df.withColumn(
    'pair',
    F.when(
        F.lag('title').over(w).isNotNull(),
        F.concat_ws(',', 'title', F.lag('title').over(w))
    )
)
df2.show()
+------+---+-----------+
| title| id| pair|
+------+---+-----------+
| sal| 1| null|
|summer| 1| summer,sal|
| fada| 1|fada,summer|
| jack| 3| null|
| noway| 3| noway,jack|
| row| 2| null|
|winter| 2| winter,row|
| gole| 2|gole,winter|
+------+---+-----------+
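If the source data does carry a real ordering column (here I assume a hypothetical column called 'seq'), it is safer to order the window by it instead of monotonically_increasing_id(); the rest of the pattern stays the same:
w = Window.partitionBy('id').orderBy('seq')  # 'seq' is a hypothetical explicit ordering column
df2 = df.withColumn(
    'pair',
    F.when(
        F.lag('title').over(w).isNotNull(),
        F.concat_ws(',', 'title', F.lag('title').over(w))
    )
)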

number of zero days in a row field

I have a Spark dataframe like the input below. It has a date column "dates" and an int column "qty". I would like to create a new column "daysout" that has the difference in days between the current date value and the first consecutive date where qty=0. I've provided example input and output below. Any tips are greatly appreciated.
input df:
dates qty
2020-04-01 1
2020-04-02 0
2020-04-03 0
2020-04-04 3
2020-04-05 0
2020-04-06 7
output:
dates qty daysout
2020-04-01 1 0
2020-04-02 0 0
2020-04-03 0 1
2020-04-04 3 2
2020-04-05 0 0
2020-04-06 7 1
Here is a possible approach: flag the rows where the current qty is 0 and the lagged qty is not 0, take a running sum of that flag over a window, use that sum as the partition for assigning a row number, and subtract 1 to get the desired result:
import pyspark.sql.functions as F
from pyspark.sql import Window

w = Window().partitionBy().orderBy(F.col("dates"))
w1 = F.sum(F.when((F.col("qty") == 0) & (F.lag("qty").over(w) != 0), 1).otherwise(0)).over(w)
w2 = Window.partitionBy(w1).orderBy('dates')
df.withColumn("daysout", F.row_number().over(w2) - 1).show()
+----------+---+-------+
| dates|qty|daysout|
+----------+---+-------+
|2020-04-01| 1| 0|
|2020-04-02| 0| 0|
|2020-04-03| 0| 1|
|2020-04-04| 3| 2|
|2020-04-05| 0| 0|
|2020-04-06| 7| 1|
+----------+---+-------+
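As a small variation (my assumption, not part of the answer above): if the dates column can have calendar gaps, a datediff against the first date of each w1 group counts actual days rather than rows; on the sample data it gives the same result because the dates are consecutive:
df.withColumn("daysout", F.datediff(F.col("dates"), F.min("dates").over(w2))).show()  # the running min over the ordered window is the group's first date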

how to calculate aggregations on a window when sensor readings are not sent if they haven't changed since last event?

How can I calculate aggregations on a window, from a sensor when new events are only sent if the sensor value has changed since the last event? The sensor readings are taken at fixed times, e.g. every 5 seconds, but are only forwarded if the reading changes since the last reading.
So, if I would like to create an average of signal_strength for each device:
eventsDF = ...
avgSignalDF = eventsDF.groupBy("deviceId").avg("signal_strength")
For example, events sent by the device for a one minute window:
event_time device_id signal_strength
12:00:00 1 5
12:00:05 1 4
12:00:30 1 5
12:00:45 1 6
12:00:55 1 5
The same dataset with the events that aren't actually sent filled in:
event_time device_id signal_strength
12:00:00 1 5
12:00:05 1 4
12:00:10 1 4
12:00:15 1 4
12:00:20 1 4
12:00:25 1 4
12:00:30 1 5
12:00:35 1 5
12:00:40 1 5
12:00:45 1 6
12:00:50 1 6
12:00:55 1 5
The signal_strength sum is 57 and the avg is 57/12
How can this missing data be inferred by spark structured streaming and the average calculated from the inferred values?
Note: I have used average as an example of an aggregation, but the solution needs to work for any aggregation function.
EDITED:
I have modified the logic to compute the average only from the filtered dataframe, so that it addresses the gaps.
//input structure
case class StreamInput(event_time: Long, device_id: Int, signal_strength: Int)
//columns for which we want to maintain state
case class StreamState(prevSum: Int, prevRowCount: Int, prevTime: Long, prevSignalStrength: Int, currentTime: Long, totalRow: Int, totalSum: Int, avg: Double)
//final result structure
case class StreamResult(event_time: Long, device_id: Int, signal_strength: Int, avg: Double)

import org.apache.spark.sql.Dataset
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}
import spark.implicits._ // encoders for the case classes (assumes a SparkSession named spark)

val filteredDF: Dataset[StreamInput] = ??? //get input (filtered rows only)
val interval = 5 // event_time interval in seconds

// use .mapGroupsWithState to maintain state for the running sum and total row count so far;
// you need to set the timeout threshold to indicate how long you wish to maintain the state
val avgDF = filteredDF.groupByKey(_.device_id)
  .mapGroupsWithState[StreamState, StreamResult](GroupStateTimeout.NoTimeout()) {
    (id: Int, eventIter: Iterator[StreamInput], state: GroupState[StreamState]) =>
      val events = eventIter.toSeq
      val updatedSession = if (state.exists) {
        //if state exists, update it with the new values
        val existingState = state.get
        val prevTime = existingState.currentTime
        val currentTime = events.map(x => x.event_time).last
        val currentRowCount = (currentTime - prevTime) / interval
        val rowCount = existingState.totalRow + currentRowCount.toInt
        val currentSignalStrength = events.map(x => x.signal_strength).last
        val total_signal_strength = currentSignalStrength +
          (existingState.prevSignalStrength * (currentRowCount - 1)) +
          existingState.totalSum
        StreamState(
          existingState.totalSum,
          existingState.totalRow,
          prevTime,
          currentSignalStrength,
          currentTime,
          rowCount,
          total_signal_strength.toInt,
          total_signal_strength / rowCount.toDouble
        )
      } else {
        // if there is no earlier state
        val runningSum = events.map(x => x.signal_strength).sum
        val size = events.size.toDouble
        val currentTime = events.map(x => x.event_time).last
        StreamState(0, 1, 0, runningSum, currentTime, 1, runningSum, runningSum / size)
      }
      //save the updated state
      state.update(updatedSession)
      StreamResult(
        events.map(x => x.event_time).last,
        id,
        events.map(x => x.signal_strength).last,
        updatedSession.avg
      )
  }

val result = avgDF
  .writeStream
  .outputMode(OutputMode.Update())
  .format("console")
  .start()
The idea is to calculate two new columns:
totalRowCount: the running total of the number of rows that would be present if nothing had been filtered out.
total_signal_strength: the running total of signal_strength so far (this INCLUDES the missed rows too).
It is calculated as:
total_signal_strength =
  current row's signal_strength +
  previous row's signal_strength * (rowCount - 1) +
  // rowCount is the number of interval steps between the previous and current event_time, so (rowCount - 1) is the number of missed rows
  previous total_signal_strength
format of the intermediate state:
+----------+---------+---------------+---------------------+--------+
|event_time|device_id|signal_strength|total_signal_strength|rowCount|
+----------+---------+---------------+---------------------+--------+
| 0| 1| 5| 5| 1|
| 5| 1| 4| 9| 2|
| 30| 1| 5| 30| 7|
| 45| 1| 6| 46| 10|
| 55| 1| 5| 57| 12|
+----------+---------+---------------+---------------------+--------+
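As a quick check of the formula against this table, take the event at time 30: the previous state (from time 5) has total_signal_strength 9, rowCount 2 and previous signal_strength 4, and the gap is (30 - 5) / 5 = 5 steps, so total_signal_strength = 5 + 4 * (5 - 1) + 9 = 30 and rowCount = 2 + 5 = 7, giving avg = 30 / 7 ≈ 4.286, which matches the final output below.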
final output:
+----------+---------+---------------+-----------------+
|event_time|device_id|signal_strength| avg|
+----------+---------+---------------+-----------------+
| 0| 1| 5| 5.0|
| 5| 1| 4| 4.5|
| 30| 1| 5|4.285714285714286|
| 45| 1| 6| 4.6|
| 55| 1| 5| 4.75|
+----------+---------+---------------+-----------------+
Mathematically, this is equivalent to a duration-weighted average:
avg = sum(signal_strength * duration) / 60
The challenge here is to get the duration of each signal. One option is, for each micro-batch, to collect the result on the driver, after which it is a plain statistics problem: to get the duration, left-shift the start times by one and subtract, something like this (pseudocode):
window.start.leftShift(1) - window.start
which would give you:
event_time device_id signal_strength duration
12:00:00 1 5 5(5-0)
12:00:05 1 4 25(30-5)
12:00:30 1 5 15(45-30)
12:00:45 1 6 10(55-45)
12:00:55 1 5 5 (60-55)
(5*5+4*25+5*15+6*10+5*5)/60=57/12
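For a single collected micro-batch (or a static test dataframe), a rough PySpark sketch of this duration idea could look like the following; the 'events' dataframe, the seconds-based event_time values and the 60-second window length are all assumptions on my part:
from pyspark.sql import functions as F, Window

w = Window.partitionBy("device_id").orderBy("event_time")
weighted = (
    events
    # duration = gap to the next event, or to the end of the 60-second window for the last event
    .withColumn("next_time", F.lead("event_time").over(w))
    .withColumn("duration", F.coalesce(F.col("next_time"), F.lit(60)) - F.col("event_time"))
    .groupBy("device_id")
    .agg((F.sum(F.col("signal_strength") * F.col("duration")) / F.lit(60)).alias("avg_signal"))
)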
As of Spark Structured Streaming 2.3.2, you would need to write your own custom sink to collect each micro-batch's result on the driver and do that math there.
