I have a Spark data frame that looks like this (simplifying timestamp and id column values for clarity):
| Timestamp | id | status |
--------------------------------
| 1 | 1 | pending |
| 2 | 2 | pending |
| 3 | 1 | in-progress |
| 4 | 1 | in-progress |
| 5 | 3 | in-progress |
| 6 | 1 | pending |
| 7 | 4 | closed |
| 8 | 1 | pending |
| 9 | 1 | in-progress |
It's a time series of status events. What I'd like to end up with is only the rows representing a status change. In that sense, the problem can be seen as one of removing redundant rows: e.g. the entries at times 4 and 8 (both for id = 1) should be dropped, as they do not represent a change of status for that id.
For the above set of rows, this would give (order being unimportant):
| Timestamp | id | status |
--------------------------------
| 1 | 1 | pending |
| 2 | 2 | pending |
| 3 | 1 | in-progress |
| 5 | 3 | in-progress |
| 6 | 1 | pending |
| 7 | 4 | closed |
| 9 | 1 | in-progress |
My original plan was to partition by id and status, order by timestamp, and pick the first row from each partition; however, this would give:
| Timestamp | id | status |
--------------------------------
| 1 | 1 | pending |
| 2 | 2 | pending |
| 3 | 1 | in-progress |
| 5 | 3 | in-progress |
| 7 | 4 | closed |
i.e. it loses repeated status changes.
Any pointers appreciated, I'm new to data frames and may be missing a trick or two.
Using the lag window function should do the trick:
case class Event(timestamp: Int, id: Int, status: String)

val events = sqlContext.createDataFrame(sc.parallelize(
  Event(1, 1, "pending") :: Event(2, 2, "pending") ::
  Event(3, 1, "in-progress") :: Event(4, 1, "in-progress") ::
  Event(5, 3, "in-progress") :: Event(6, 1, "pending") ::
  Event(7, 4, "closed") :: Event(8, 1, "pending") ::
  Event(9, 1, "in-progress") :: Nil
))

events.registerTempTable("events")

val query = """SELECT timestamp, id, status FROM (
    SELECT timestamp, id, status, lag(status) OVER (
      PARTITION BY id ORDER BY timestamp
    ) AS prev_status FROM events) tmp
  WHERE prev_status IS NULL OR prev_status != status
  ORDER BY timestamp, id"""

sqlContext.sql(query).show
Inner query
SELECT timestamp, id, status, lag(status) OVER (
  PARTITION BY id ORDER BY timestamp
) AS prev_status FROM events
produces the table below, where prev_status is the previous value of status for the same id, ordered by timestamp:
+---------+--+-----------+-----------+
|timestamp|id|     status|prev_status|
+---------+--+-----------+-----------+
|        1| 1|    pending|       null|
|        3| 1|in-progress|    pending|
|        4| 1|in-progress|in-progress|
|        6| 1|    pending|in-progress|
|        8| 1|    pending|    pending|
|        9| 1|in-progress|    pending|
|        2| 2|    pending|       null|
|        5| 3|in-progress|       null|
|        7| 4|     closed|       null|
+---------+--+-----------+-----------+
Outer query
SELECT timestamp, id, status FROM (...)
WHERE prev_status IS NULL OR prev_status != status
ORDER BY timestamp, id
simply keeps the rows where prev_status is NULL (the first row for a given id) or where prev_status differs from status (i.e. there was a status change between consecutive events for that id). The ORDER BY is added just to make visual inspection easier.
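For completeness, the same filter can also be expressed with the DataFrame API instead of SQL; here is a minimal PySpark sketch of the same idea, assuming a DataFrame named events with the timestamp, id and status columns shown above:
from pyspark.sql import Window
from pyspark.sql import functions as F

# Previous status per id, ordered by timestamp.
w = Window.partitionBy("id").orderBy("timestamp")

changes = (events
    .withColumn("prev_status", F.lag("status").over(w))
    # Keep the first row per id and every row whose status differs from the previous one.
    .where(F.col("prev_status").isNull() | (F.col("prev_status") != F.col("status")))
    .drop("prev_status"))

changes.orderBy("timestamp", "id").show()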
Related
I would like to collapse the rows in a dataframe based on an ID column and count the number of records per ID using window functions. Doing this, I would like to avoid partitioning the window by ID, because this would result in a very large number of partitions.
I have a dataframe of the form
+----+-----------+-----------+-----------+
| ID | timestamp | metadata1 | metadata2 |
+----+-----------+-----------+-----------+
| 1 | 09:00 | ABC | apple |
| 1 | 08:00 | NULL | NULL |
| 1 | 18:00 | XYZ | apple |
| 2 | 07:00 | NULL | banana |
| 5 | 23:00 | ABC | cherry |
+----+-----------+-----------+-----------+
where I would like to keep only the records with the most recent timestamp per ID, such that I have
+----+-----------+-----------+-----------+-------+
| ID | timestamp | metadata1 | metadata2 | count |
+----+-----------+-----------+-----------+-------+
| 1 | 18:00 | XYZ | apple | 3 |
| 2 | 07:00 | NULL | banana | 1 |
| 5 | 23:00 | ABC | cherry | 1 |
+----+-----------+-----------+-----------+-------+
I have tried:
import sys
from pyspark.sql import Window
from pyspark.sql.functions import asc, desc, first, count, col, row_number

window = Window.orderBy([asc('ID'), desc('timestamp')])
window_count = Window.orderBy([asc('ID'), desc('timestamp')]).rowsBetween(-sys.maxsize, sys.maxsize)

columns_metadata = ['metadata1', 'metadata2']

df = df.select(
    *(first(col_name, ignorenulls=True).over(window).alias(col_name) for col_name in columns_metadata),
    count(col('ID')).over(window_count).alias('count')
)
df = df.withColumn("row_tmp", row_number().over(window)).filter(col('row_tmp') == 1).drop(col('row_tmp'))
which is in part based on How to select the first row of each group?
Without using pyspark.sql.Window.partitionBy, this does not give the desired output.
I only read that you wanted to avoid partitioning by ID after I posted this; it is the only approach I could think of.
Your dataframe:
df = sqlContext.createDataFrame(
    [
        ('1', '09:00', 'ABC', 'apple'),
        ('1', '08:00', '', ''),
        ('1', '18:00', 'XYZ', 'apple'),
        ('2', '07:00', '', 'banana'),
        ('5', '23:00', 'ABC', 'cherry'),
    ],
    ['ID', 'timestamp', 'metadata1', 'metadata2']
)
We can use rank with a window partitioned by ID and ordered by timestamp in descending order:
from pyspark.sql.window import Window
import pyspark.sql.functions as F

w1 = Window.partitionBy(df['ID']).orderBy(F.desc('timestamp'))
w2 = Window.partitionBy(df['ID'])

df\
    .withColumn("rank", F.rank().over(w1))\
    .withColumn("count", F.count('ID').over(w2))\
    .filter(F.col('rank') == 1)\
    .select('ID', 'timestamp', 'metadata1', 'metadata2', 'count')\
    .show()
+---+---------+---------+---------+-----+
| ID|timestamp|metadata1|metadata2|count|
+---+---------+---------+---------+-----+
|  1|    18:00|      XYZ|    apple|    3|
|  2|    07:00|         |   banana|    1|
|  5|    23:00|      ABC|   cherry|    1|
+---+---------+---------+---------+-----+
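If avoiding a window partitioned by ID is the main concern, an alternative sketch is to drop window functions entirely and use a plain aggregation: the max of a struct whose first field is timestamp picks the most recent row per ID, and the same groupBy gives the count (this assumes the timestamp strings sort correctly, as the HH:MM values above do):
from pyspark.sql import functions as F

# max() of a struct compares its fields left to right, so putting timestamp
# first selects the row with the latest timestamp for each ID.
latest = (df
    .groupBy('ID')
    .agg(F.max(F.struct('timestamp', 'metadata1', 'metadata2')).alias('latest'),
         F.count('ID').alias('count'))
    .select('ID', 'latest.timestamp', 'latest.metadata1', 'latest.metadata2', 'count'))

latest.show()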
I have a spark dataframe consisting of two columns.
+-----------------------+-----------+
| Metric|Recipe_name|
+-----------------------+-----------+
| 100. | A |
| 200. | A |
| 300. | A |
| 10. | A |
| 20. | A |
| 10. | B |
| 20. | B |
| 10. | A |
| 20. | A |
| .. | .. |
| .. | .. |
| 10. | B |
The dataframe is time ordered (you can imagine there is an increasing timestamp column). I need to add a column 'Cycle'. There are two scenarios in which a new cycle begins:
If the same recipe keeps running, let's say recipe 'A', and the value of Metric decreases with respect to the previous row, then a new cycle begins.
If we switch from the current recipe 'A' to a second recipe 'B' and then switch back to recipe 'A', a new cycle for recipe 'A' has begun.
So in the end I would like to have a column 'Cycle' which looks like this:
+-----------------------+-----------+-----------+
| Metric|Recipe_name| Cycle|
+-----------------------+-----------+-----------+
| 100. | A | 0 |
| 200. | A | 0 |
| 300. | A | 0 |
| 10. | A | 1 |
| 20. | A | 1 |
| 10. | B | 0 |
| 20. | B | 0 |
| 10. | A | 2 |
| 20. | A | 2 |
| .. | .. | 2 |
| .. | .. | 2 |
| 10. | B | 1 |
So recipe A starts with cycle 0; then the Metric decreases and the cycle changes to 1.
Then a new recipe, B, starts, so it gets a new cycle 0.
Then we get back to recipe A, so a new cycle begins for recipe A, and relative to its last cycle number it is cycle 2 (and similarly for recipe B).
In total there are 200 recipes.
Thanks for the help.
Replace my order column with your ordering column. The condition is checked with the lag function over a window partitioned by the Recipe_name column.
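For a quick test, a hypothetical DataFrame matching the sample data (with an explicit order column standing in for the increasing timestamp mentioned in the question) could look like this:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical test data; 'order' stands in for the increasing timestamp.
df = spark.createDataFrame(
    [(1, 100, 'A'), (2, 200, 'A'), (3, 300, 'A'), (4, 10, 'A'), (5, 20, 'A'),
     (6, 10, 'B'), (7, 20, 'B'), (8, 10, 'A'), (9, 20, 'A'), (10, 10, 'B')],
    ['order', 'Metric', 'Recipe_name']
)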
from pyspark.sql import Window
from pyspark.sql.functions import col, lag, when, sum

w = Window.partitionBy('Recipe_name').orderBy('order')

df.withColumn('Cycle', when(col('Metric') < lag('Metric', 1, 0).over(w), 1).otherwise(0)) \
  .withColumn('Cycle', sum('Cycle').over(w)) \
  .orderBy('order') \
  .show()
+------+-----------+-----+
|Metric|Recipe_name|Cycle|
+------+-----------+-----+
|   100|          A|    0|
|   200|          A|    0|
|   300|          A|    0|
|    10|          A|    1|
|    20|          A|    1|
|    10|          B|    0|
|    20|          B|    0|
|    10|          A|    2|
|    20|          A|    2|
|    10|          B|    1|
+------+-----------+-----+
I am ingesting a dataframe and I want to append a monotonically increasing column that increases whenever another column matches a certain value. For example I have the following table
+------+-------+
| Col1 | Col2 |
+------+-------+
| B | 543 |
| A | 1231 |
| B | 14234 |
| B | 34234 |
| B | 3434 |
| A | 43242 |
| B | 43242 |
| B | 56453 |
+------+-------+
I would like to append a column that increases in value whenever "A" in col1 is present. So the result would look like
+------+-------+------+
| Col1 | Col2 | Col3 |
+------+-------+------+
| B | 543 | 0 |
| A | 1231 | 1 |
| B | 14234 | 1 |
| B | 34234 | 1 |
| B | 3434 | 1 |
| A | 43242 | 2 |
| B | 43242 | 2 |
| B | 56453 | 2 |
+------+-------+------+
Keeping the initial order is important.
I tried zippering but that doesn't seem to produce the right result. Splitting it up into individual seqs manually and doing it that way is not going to be performant enough (think 100+ GB tables).
I looked into trying this with a map function that would keep a counter somewhere but couldn't get that to work.
Any advice or pointer in the right direction would be greatly appreciated.
Spark does not provide any built-in function to achieve this kind of functionality directly.
I would most probably do it this way:
// inputDF contains Col1 | Col2; spark.implicits._ is needed for .toDF on an RDD
val indexDF = inputDF.select("Col1").distinct.rdd
  .map(_.getString(0)).zipWithIndex().toDF("Col1", "Col3")
val finalDF = inputDF.join(indexDF, Seq("Col1"), "left")
but the problem I can see here is the join, which will result in a shuffle.
You can also check other auto-increment APIs here.
Use a window and sum the value 1 over the window whenever Col1 = A.
import pyspark.sql.functions as f
from pyspark.sql import Window
w = Window.partitionBy().rowsBetween(Window.unboundedPreceding, Window.currentRow)
df.withColumn('Col3', f.sum(f.when(f.col('Col1') == f.lit('A'), 1).otherwise(0)).over(w)).show()
+----+-----+----+
|Col1| Col2|Col3|
+----+-----+----+
|   B|  543|   0|
|   A| 1231|   1|
|   B|14234|   1|
|   B|34234|   1|
|   B| 3434|   1|
|   A|43242|   2|
|   B|43242|   2|
|   B|56453|   2|
+----+-----+----+
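One caveat: with no partitioning and no ordering, the window above relies on the DataFrame's current row order. If a column encoding the original order exists (a hypothetical order_col here), ordering the window on it makes the running sum deterministic; a sketch:
import pyspark.sql.functions as f
from pyspark.sql import Window

# Hypothetical 'order_col' encodes the original row order.
w = Window.orderBy('order_col').rowsBetween(Window.unboundedPreceding, Window.currentRow)
df.withColumn('Col3', f.sum(f.when(f.col('Col1') == 'A', 1).otherwise(0)).over(w)).show()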
I am trying to solve a problem with PySpark.
I have a dataset such as:
Condition | Date
0 | 2019/01/10
1 | 2019/01/11
0 | 2019/01/15
1 | 2019/01/16
1 | 2019/01/19
0 | 2019/01/23
0 | 2019/01/25
1 | 2019/01/29
1 | 2019/01/30
For each row, I would like to get the most recent previous value of the Date column at which Condition == 1 was met.
The desired output would be something like:
Condition | Date | Lag
0 | 2019/01/10 | NaN
1 | 2019/01/11 | NaN
0 | 2019/01/15 | 2019/01/11
1 | 2019/01/16 | 2019/01/11
1 | 2019/01/19 | 2019/01/16
0 | 2019/01/23 | 2019/01/19
0 | 2019/01/25 | 2019/01/19
1 | 2019/01/29 | 2019/01/19
1 | 2019/01/30 | 2019/01/29
How can I do that?
Please do keep in mind it's a very large dataset, which I will have to partition and group by a UUID, so the solution has to be reasonably performant.
Thank you,
Here is a solution with PySpark. The logic is the same as in Gordon Linoff's SQL answer.
w = Window.orderBy("Date").rowsBetween(Window.unboundedPreceding, Window.currentRow - 1)
df.withColumn("Lag", max(when(col("Condition") == lit(1), col("Date"))).over(w)).show()
Gives:
+---------+----------+----------+
|Condition|      Date|       Lag|
+---------+----------+----------+
|        0|2019/01/10|      null|
|        1|2019/01/11|      null|
|        0|2019/01/15|2019/01/11|
|        1|2019/01/16|2019/01/11|
|        1|2019/01/19|2019/01/16|
|        0|2019/01/23|2019/01/19|
|        0|2019/01/25|2019/01/19|
|        1|2019/01/29|2019/01/19|
|        1|2019/01/30|2019/01/29|
+---------+----------+----------+
In SQL, you can use a conditional running max():
select t.*,
       max(case when condition = 1 then date end) over (
           order by date
           rows between unbounded preceding and 1 preceding
       ) as prev_condition_1_date
from t;
I like to use SQL to solve that:
from pyspark.sql.functions import expr

display(
    df.withColumn(
        'lag',
        expr('max(case when Condition == 1 then Date end) over (order by Date rows between unbounded preceding and 1 preceding)')
    )
)
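Since the question mentions partitioning by a UUID, the window can presumably be partitioned as well so that the conditional running max is computed per UUID; a sketch, assuming a hypothetical uuid column:
from pyspark.sql import Window
from pyspark.sql import functions as F

# Hypothetical 'uuid' column; the running max is now computed per UUID.
w = Window.partitionBy('uuid').orderBy('Date').rowsBetween(Window.unboundedPreceding, -1)
df_with_lag = df.withColumn('Lag', F.max(F.when(F.col('Condition') == 1, F.col('Date'))).over(w))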
When I was using TiDB, I found it strange when I made two transactions run at the same time. I was expecting to get the same value 2 every time, like MySQL does, but all I got was 0, 2, 0, 2, 0, 2...
For both databases, tx_isolation is set to 'read-committed', so it seems reasonable that the SELECT statement should return 2, as the other transaction has already committed.
Here's the test code:
import mysql.connector

for i in range(10):
    conn1 = mysql.connector.connect(host='',
                                    port=4000,
                                    user='',
                                    password='',
                                    database='',
                                    charset='utf8')
    conn2 = mysql.connector.connect(host='',
                                    port=4000,
                                    user='',
                                    password='',
                                    database='',
                                    charset='utf8')
    cur1 = conn1.cursor()
    cur2 = conn2.cursor()
    conn1.start_transaction()
    conn2.start_transaction()
    cur2.execute("update t set b=%d where a=1" % 2)
    conn2.commit()
    cur1.execute("select b from t where a=1")
    a = cur1.fetchone()
    print(a)
    cur1.execute("update t set b=%d where a=1" % 0)
    conn1.commit()
    cur1.close()
    cur2.close()
    conn1.close()
    conn2.close()
The table t is created like this:
CREATE TABLE `t` (
  `a` int(11) NOT NULL AUTO_INCREMENT,
  `b` int(11) DEFAULT NULL,
  PRIMARY KEY (`a`)
)
and (1,0) is inserted initially.
First of all:
TiDB (in its latest version) only supports the SNAPSHOT transaction isolation level, so a transaction can only see data that was committed before it started. TiDB will also skip an UPDATE that writes the value a row already had at the start of the transaction, unlike MySQL, SQL Server, etc.
MySQL, when using the READ COMMITTED isolation level, reads committed data, so it will see data committed by other transactions.
So, for your code snippet:
TiDB round 1 workflow:
T1                  T2
+--------------------+
| transaction start  |
| (b = 0)            |
+---------+----------+
          |
          |
          |         +------------------------------+
          | <-------+ update `b` to 2, and commit  |
          |         +------------------------------+
          |
          |
+---------+-------------+
| select b should be 0, |
| since tidb will only  |
| get the data before   |
| transaction committed |
+---------+-------------+
          |
          v
+------------------------------+
| update value to 0            |
| (since 0 is equal to the     |
| transaction started value,   |
| tidb will ignore this update)|
+------------------------------+
          |
          |
          v
+--------------------------+
| so finally `b` will be 2 |
+--------------------------+
TiDB round 2 workflow:
T1                  T2
+--------------------+
| transaction start  |
| (b = 2)            |
+---------+----------+
          |
          |
          |         +------------------------------+
          | <-------+ update `b` to 2, and commit  |
          |         +------------------------------+
          |
          |
+---------+-------------+
| select b should be 2, |
| since tidb will only  |
| get the data before   |
| transaction committed |
+---------+-------------+
          |
          v
+------------------------------+
| update value to 0            |
| (since 0 is not equal to 2,  |
| tidb will apply this update) |
+------------------------------+
          |
          |
          v
+--------------------------+
| so finally `b` will be 0 |
+--------------------------+
So TiDB will output something like:
0, 2, 0, 2, 0, 2...
MySQL workflow:
T1                  T2
+--------------------+
| transaction start  |
| (b = 0)            |
+---------+----------+
          |
          |
          |         +------------------------------+
          | <-------+ update `b` to 2, and commit  |
          |         +------------------------------+
          |
          |
          v
+-------------------------------------------+
| select b should be 2,                     |
| since use READ COMMITTED isolation level, |
| it will read committed data.              |
+---------------------+---------------------+
                      |
                      |
                      v
            +---------+----------+
            | update value to 0  |
            +--------------------+
                      |
                      |
                      v
         +--------------------------+
         | so finally `b` will be 0 |
         +--------------------------+
so MySQL will consistently output:
2, 2, 2, 2...
One last thing:
I think it is very strange that TiDB skips an update that writes the same value inside a transaction, while an update to a different value succeeds; for example, if we updated b to a different value in the loop, we would always see the latest change to b.
So it would probably be better to keep the behavior consistent between same-value and different-value updates.
I have created an issue for this:
https://github.com/pingcap/tidb/issues/7644
References:
https://github.com/pingcap/docs/blob/master/sql/transaction-isolation.md