Running sum between two timestamps in pyspark - apache-spark

I have data in the below format:
+---------------------+----+----+---------+------+
| date_time           | id | cm | p_count | bcm  |
+---------------------+----+----+---------+------+
| 2018-02-01 04:38:00 | v1 | c1 | 1       | null |
| 2018-02-01 05:37:07 | v1 | c1 | 1       | null |
| 2018-02-01 11:19:38 | v1 | c1 | 1       | null |
| 2018-02-01 12:09:19 | v1 | c1 | 1       | c1   |
| 2018-02-01 14:05:10 | v2 | c2 | 1       | c2   |
+---------------------+----+----+---------+------+
I need to find the rolling sum of the p_count column between two date_time values, partitioned by id.
The logic for the start_date_time and end_date_time of the rolling-sum window is:
start_date_time = min(date_time) group by (id, cm)
end_date_time = bcm == cm ? date_time : null
In this case start_date_time = 2018-02-01 04:38:00 and end_date_time = 2018-02-01 12:09:19.
Output should look like :
+---------------------+----+----+---------+------+-------------+
| date_time           | id | cm | p_count | bcm  | p_sum_count |
+---------------------+----+----+---------+------+-------------+
| 2018-02-01 04:38:00 | v1 | c1 | 1       | null | 1           |
| 2018-02-01 05:37:07 | v1 | c1 | 1       | null | 2           |
| 2018-02-01 11:19:38 | v1 | c1 | 1       | null | 3           |
| 2018-02-01 12:09:19 | v1 | c1 | 1       | c1   | 4           |
| 2018-02-01 14:05:10 | v2 | c2 | 1       | c2   | 1           |
+---------------------+----+----+---------+------+-------------+

var input = sqlContext.createDataFrame(Seq(
  ("2018-02-01 04:38:00", "v1", "c1", 1, null),
  ("2018-02-01 05:37:07", "v1", "c1", 1, null),
  ("2018-02-01 11:19:38", "v1", "c1", 1, null),
  ("2018-02-01 12:09:19", "v1", "c1", 1, "c1"),
  ("2018-02-01 14:05:10", "v2", "c2", 1, "c2")
)).toDF("date_time", "id", "cm", "p_count", "bcm")
input.show()
Results:
+-------------------+---+---+-------+----+
|          date_time| id| cm|p_count| bcm|
+-------------------+---+---+-------+----+
|2018-02-01 04:38:00| v1| c1|      1|null|
|2018-02-01 05:37:07| v1| c1|      1|null|
|2018-02-01 11:19:38| v1| c1|      1|null|
|2018-02-01 12:09:19| v1| c1|      1|  c1|
|2018-02-01 14:05:10| v2| c2|      1|  c2|
+-------------------+---+---+-------+----+
Next, the code:
input.createOrReplaceTempView("input_Table")
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
//val results = spark.sqlContext.sql("SELECT sum(p_count) from input_Table tbl GROUP BY tbl.cm")
val results = sqlContext.sql("select *, " +
  "SUM(p_count) over ( order by id rows between unbounded preceding and current row ) cumulative_Sum " +
  "from input_Table ")
results.show()
Results:
+-------------------+---+---+-------+----+--------------+
| date_time| id| cm|p_count| bcm|cumulative_Sum|
+-------------------+---+---+-------+----+--------------+
|2018-02-01 04:38:00| v1| c1| 1|null| 1|
|2018-02-01 05:37:07| v1| c1| 1|null| 2|
|2018-02-01 11:19:38| v1| c1| 1|null| 3|
|2018-02-01 12:09:19| v1| c1| 1| c1| 4|
|2018-02-01 14:05:10| v2| c2| 1| c2| 5|
+-------------------+---+---+-------+----+--------------+
You need to partition by (id, cm) while windowing and add your logic to get the expected results, for example:
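A minimal PySpark sketch (assuming the input DataFrame above is available as input_df; partitioning by id and cm is taken from the start/end logic stated in the question):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Restart the running sum for each (id, cm) group and accumulate in time order.
w = (Window.partitionBy("id", "cm")
           .orderBy("date_time")
           .rowsBetween(Window.unboundedPreceding, Window.currentRow))

input_df.withColumn("p_sum_count", F.sum("p_count").over(w)).show()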
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
Logically a Windowed Aggregate Function is newly calculated for each row within the PARTITION based on all ROWS between a starting row and an ending row.
Starting and ending rows might be fixed or relative to the current row based on the following keywords:
UNBOUNDED PRECEDING, all rows before the current row -> fixed
UNBOUNDED FOLLOWING, all rows after the current row -> fixed
x PRECEDING, x rows before the current row -> relative
y FOLLOWING, y rows after the current row -> relative
Possible kinds of calculation include:
Both starting and ending row are fixed, the window consists of all rows of a partition, e.g. a Group Sum, i.e. aggregate plus detail rows
One end is fixed, the other relative to current row, the number of rows increases or decreases, e.g. a Running Total, Remaining Sum
Starting and ending row are relative to current row, the number of rows within a window is fixed, e.g. a Moving Average over n rows
So SUM(x) OVER (ORDER BY col ROWS UNBOUNDED PRECEDING) results in a Cumulative Sum or Running Total
11 -> 11
2 -> 11 + 2 = 13
3 -> 13 + 3 (or 11+2+3) = 16
44 -> 16 + 44 (or 11+2+3+44) = 60
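To illustrate the relative frames listed above, a small PySpark sketch (the DataFrame df and the columns col and x are placeholders, not from the question):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Running total: fixed start (UNBOUNDED PRECEDING), end follows the current row.
running = Window.orderBy("col").rowsBetween(Window.unboundedPreceding, Window.currentRow)
# Moving average over 3 rows: both frame ends are relative to the current row.
moving = Window.orderBy("col").rowsBetween(-2, Window.currentRow)

df.select("col",
          F.sum("x").over(running).alias("running_total"),
          F.avg("x").over(moving).alias("moving_avg_3")).show()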
What is ROWS UNBOUNDED PRECEDING used for in Teradata?

Related

Collapse DataFrame using Window functions

I would like to collapse the rows in a dataframe based on an ID column and count the number of records per ID using window functions. Doing this, I would like to avoid partitioning the window by ID, because this would result in a very large number of partitions.
I have a dataframe of the form
+----+-----------+-----------+-----------+
| ID | timestamp | metadata1 | metadata2 |
+----+-----------+-----------+-----------+
| 1  | 09:00     | ABC       | apple     |
| 1  | 08:00     | NULL      | NULL      |
| 1  | 18:00     | XYZ       | apple     |
| 2  | 07:00     | NULL      | banana    |
| 5  | 23:00     | ABC       | cherry    |
+----+-----------+-----------+-----------+
where I would like to keep only the records with the most recent timestamp per ID, such that I have
+----+-----------+-----------+-----------+-------+
| ID | timestamp | metadata1 | metadata2 | count |
+----+-----------+-----------+-----------+-------+
| 1  | 18:00     | XYZ       | apple     | 3     |
| 2  | 07:00     | NULL      | banana    | 1     |
| 5  | 23:00     | ABC       | cherry    | 1     |
+----+-----------+-----------+-----------+-------+
I have tried:
import sys
from pyspark.sql import Window
from pyspark.sql.functions import asc, desc, first, count, col, row_number

window = Window.orderBy([asc('ID'), desc('timestamp')])
window_count = Window.orderBy([asc('ID'), desc('timestamp')]).rowsBetween(-sys.maxsize, sys.maxsize)
columns_metadata = ['metadata1', 'metadata2']

df = df.select(
    *(first(col_name, ignorenulls=True).over(window).alias(col_name) for col_name in columns_metadata),
    count(col('ID')).over(window_count).alias('count')
)
df = df.withColumn("row_tmp", row_number().over(window)).filter(col('row_tmp') == 1).drop(col('row_tmp'))
which is in part based on How to select the first row of each group?
Without the use of pyspark.sql.Window.partitionBy, however, this does not give the desired output.
I only read that you wanted to do this without partitioning by ID after I posted it; this approach is the only one I could think of.
Your dataframe:
df = sqlContext.createDataFrame(
    [
        ('1', '09:00', 'ABC', 'apple'),
        ('1', '08:00', '', ''),
        ('1', '18:00', 'XYZ', 'apple'),
        ('2', '07:00', '', 'banana'),
        ('5', '23:00', 'ABC', 'cherry'),
    ],
    ['ID', 'timestamp', 'metadata1', 'metadata2']
)
We can use rank, partitioning by ID and ordering by timestamp descending:
from pyspark.sql.window import Window
import pyspark.sql.functions as F

w1 = Window.partitionBy('ID').orderBy(F.desc('timestamp'))
w2 = Window.partitionBy('ID')

df\
    .withColumn("rank", F.rank().over(w1))\
    .withColumn("count", F.count('ID').over(w2))\
    .filter(F.col('rank') == 1)\
    .select('ID', 'timestamp', 'metadata1', 'metadata2', 'count')\
    .show()
+---+---------+---------+---------+-----+
| ID|timestamp|metadata1|metadata2|count|
+---+---------+---------+---------+-----+
| 1| 18:00| XYZ| apple| 3|
| 2| 07:00| | banana| 1|
| 5| 23:00| ABC| cherry| 1|
+---+---------+---------+---------+-----+
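For completeness, one way to avoid a window partitioned by ID altogether is a plain groupBy with a struct-max aggregation; this is only a sketch under the assumption that taking all metadata from the single most recent row is acceptable (it is not part of the original answer):
import pyspark.sql.functions as F

# max over a struct orders lexicographically by its fields, so putting
# timestamp first picks the most recent row per ID.
result = (df.groupBy('ID')
            .agg(F.max(F.struct('timestamp', 'metadata1', 'metadata2')).alias('latest'),
                 F.count('*').alias('count'))
            .select('ID', 'latest.timestamp', 'latest.metadata1', 'latest.metadata2', 'count'))
result.show()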

Mapping column from arrays in Pyspark

I'm new to working with PySpark DataFrames that have arrays stored in columns, and I'm looking for some help mapping a column based on two PySpark DataFrames, one of which is a reference df.
Reference Dataframe (Number of Subgroups varies for each Group):
| Group | Subgroup | Size               | Type              |
| ----- | -------- | ------------------ | ----------------- |
| A     | A1       | ['Small','Medium'] | ['A','B']         |
| A     | A2       | ['Small','Medium'] | ['C','D']         |
| B     | B1       | ['Small']          | ['A','B','C','D'] |
Source Dataframe:
| ID     | Size     | Type |
| ------ | -------- | ---- |
| ID_001 | 'Small'  | 'A'  |
| ID_002 | 'Medium' | 'B'  |
| ID_003 | 'Small'  | 'D'  |
In the result, each ID belongs to every Group, but is exclusive to its subgroups based on the reference df, with the result looking something like this:
| ID     | Size     | Type | A_Subgroup | B_Subgroup |
| ------ | -------- | ---- | ---------- | ---------- |
| ID_001 | 'Small'  | 'A'  | 'A1'       | 'B1'       |
| ID_002 | 'Medium' | 'B'  | 'A1'       | Null       |
| ID_003 | 'Small'  | 'D'  | 'A2'       | 'B1'       |
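For reference, a minimal sketch constructing the two sample DataFrames (the names ref and source are assumptions that match the answer code below):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Reference DataFrame: Size and Type are array columns.
ref = spark.createDataFrame(
    [('A', 'A1', ['Small', 'Medium'], ['A', 'B']),
     ('A', 'A2', ['Small', 'Medium'], ['C', 'D']),
     ('B', 'B1', ['Small'], ['A', 'B', 'C', 'D'])],
    ['Group', 'Subgroup', 'Size', 'Type']
)

# Source DataFrame: scalar Size and Type values.
source = spark.createDataFrame(
    [('ID_001', 'Small', 'A'),
     ('ID_002', 'Medium', 'B'),
     ('ID_003', 'Small', 'D')],
    ['ID', 'Size', 'Type']
)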
You can do a join using array_contains conditions, and pivot the result:
import pyspark.sql.functions as F
result = source.alias('source').join(
    ref.alias('ref'),
    F.expr("""
        array_contains(ref.Size, source.Size) and
        array_contains(ref.Type, source.Type)
    """),
    'left'
).groupBy(
    'ID', source['Size'], source['Type']
).pivot('Group').agg(F.first('Subgroup'))
result.show()
+------+------+----+---+----+
| ID| Size|Type| A| B|
+------+------+----+---+----+
|ID_003| Small| D| A2| B1|
|ID_002|Medium| B| A1|null|
|ID_001| Small| A| A1| B1|
+------+------+----+---+----+

Append a monotonically increasing id column that increases on column value match

I am ingesting a dataframe and I want to append a monotonically increasing column that increases whenever another column matches a certain value. For example, I have the following table:
+------+-------+
| Col1 | Col2 |
+------+-------+
| B | 543 |
| A | 1231 |
| B | 14234 |
| B | 34234 |
| B | 3434 |
| A | 43242 |
| B | 43242 |
| B | 56453 |
+------+-------+
I would like to append a column that increases in value whenever "A" appears in Col1. So the result would look like:
+------+-------+------+
| Col1 | Col2 | Col3 |
+------+-------+------+
| B | 543 | 0 |
| A | 1231 | 1 |
| B | 14234 | 1 |
| B | 34234 | 1 |
| B | 3434 | 1 |
| A | 43242 | 2 |
| B | 43242 | 2 |
| B | 56453 | 2 |
+------+-------+------+
Keeping the initial order is important.
I tried zippering but that doesn't seem to produce the right result. Splitting it up into individual seqs manually and doing it that way is not going to be performant enough (think 100+ GB tables).
I looked into trying this with a map function that would keep a counter somewhere but couldn't get that to work.
Any advice or pointer in the right direction would be greatly appreciated.
Spark does not provide any default function to achieve this kind of functionality directly.
I would most probably do it this way:
//inputDF contains Col1 | Col2
val df = inputDF.select("Col1").distinct.rdd.zipWithIndex()
  .map { case (row, idx) => (row.getString(0), idx) }.toDF("Col1", "Col3")
val finalDF = inputDF.join(df, df("Col1") === inputDF("Col1"), "left").select(inputDF("*"), df("Col3"))
but the problem I can see here is the join, which will result in a shuffle.
You can also check other auto-increment APIs here.
Use a window and sum, over the window, a value of 1 whenever Col1 = 'A'.
import pyspark.sql.functions as f
from pyspark.sql import Window

# A single unordered partition: the running sum follows the DataFrame's current row order.
w = Window.partitionBy().rowsBetween(Window.unboundedPreceding, Window.currentRow)
df.withColumn('Col3', f.sum(f.when(f.col('Col1') == f.lit('A'), 1).otherwise(0)).over(w)).show()
+----+-----+----+
|Col1| Col2|Col3|
+----+-----+----+
| B| 543| 0|
| A| 1231| 1|
| B|14234| 1|
| B|34234| 1|
| B| 3434| 1|
| A|43242| 2|
| B|43242| 2|
| B|56453| 2|
+----+-----+----+
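If the initial order needs to be made explicit (the window above has no ORDER BY, so it relies on the DataFrame's current row order), a possible variation, sketched here as an assumption rather than as part of the original answers, is to capture the ingest order with monotonically_increasing_id and order the window by it:
import pyspark.sql.functions as f
from pyspark.sql import Window

# Record the current row order explicitly, then order the window by it.
df2 = df.withColumn('order_id', f.monotonically_increasing_id())
w = Window.orderBy('order_id').rowsBetween(Window.unboundedPreceding, Window.currentRow)
df2.withColumn('Col3', f.sum(f.when(f.col('Col1') == 'A', 1).otherwise(0)).over(w)) \
   .drop('order_id').show()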

Pyspark agg function to "explode" rows into columns

Basically, I have a dataframe that looks like this:
+----+-------+------+------+
| id | index | col1 | col2 |
+----+-------+------+------+
| 1 | a | a11 | a12 |
+----+-------+------+------+
| 1 | b | b11 | b12 |
+----+-------+------+------+
| 2 | a | a21 | a22 |
+----+-------+------+------+
| 2 | b | b21 | b22 |
+----+-------+------+------+
and my desired output is this:
+----+--------+--------+--------+--------+
| id | col1_a | col1_b | col2_a | col2_b |
+----+--------+--------+--------+--------+
| 1 | a11 | b11 | a12 | b12 |
+----+--------+--------+--------+--------+
| 2 | a21 | b21 | a22 | b22 |
+----+--------+--------+--------+--------+
So basically I want to "explode" the index column into new columns after I groupby id. Btw, the id counts are the same and each id has the same set of index values. I'm using pyspark.
Using pivot, you can achieve the desired output.
from pyspark.sql import functions as F
df = spark.createDataFrame([[1,"a","a11","a12"],[1,"b","b11","b12"],[2,"a","a21","a22"],[2,"b","b21","b22"]],["id","index","col1","col2"])
df.show()
+---+-----+----+----+
| id|index|col1|col2|
+---+-----+----+----+
| 1| a| a11| a12|
| 1| b| b11| b12|
| 2| a| a21| a22|
| 2| b| b21| b22|
+---+-----+----+----+
Using pivot:
df3 = df.groupBy("id").pivot("index").agg(F.first(F.col("col1")), F.first(F.col("col2")))
Rename the columns:
collist = ["id", "col1_a", "col2_a", "col1_b", "col2_b"]
df3.toDF(*collist).show()
+---+------+------+------+------+
| id|col1_a|col2_a|col1_b|col2_b|
+---+------+------+------+------+
| 1| a11| a12| b11| b12|
| 2| a21| a22| b21| b22|
+---+------+------+------+------+
Note: rearrange the columns based on your requirements.
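As a variation (a sketch, not part of the original answer), aliasing the aggregations gives deterministic pivoted column names, which can then be renamed and reordered explicitly instead of relying on positional toDF:
df4 = df.groupBy("id").pivot("index").agg(
    F.first("col1").alias("col1"),  # pivoted as a_col1, b_col1
    F.first("col2").alias("col2"),  # pivoted as a_col2, b_col2
)
df4.select(
    "id",
    F.col("a_col1").alias("col1_a"), F.col("b_col1").alias("col1_b"),
    F.col("a_col2").alias("col2_a"), F.col("b_col2").alias("col2_b"),
).show()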

Efficiently finding all relevant sub-ranges for big data tables in Hive/Spark

Following this question, I would like to ask:
I have 2 tables:
The first table - MajorRange
row | From | To | Group ....
-----|--------|---------|---------
1 | 1200 | 1500 | A
2 | 2200 | 2700 | B
3 | 1700 | 1900 | C
4 | 2100 | 2150 | D
...
The second table - SubRange
row | From | To | Group ....
-----|--------|---------|---------
1 | 1208 | 1300 | E
2 | 1400 | 1600 | F
3 | 1700 | 2100 | G
4 | 2100 | 2500 | H
...
The output table should contain all the SubRange groups that overlap the MajorRange groups. In the following example, the result table is:
row | Major | Sub |
-----|--------|------|-
1 | A | E |
2 | A | F |
3 | B | H |
4 | C | G |
5 | D | H |
In case there is no overlap between the ranges, the Major group will not appear.
Both tables are big data tables. How can I do this using Hive/Spark in the most efficient way?
With Spark, maybe a non-equi join like this?
import org.apache.spark.sql.functions.monotonically_increasing_id

val join_expr = major_range("From") < sub_range("To") && major_range("To") > sub_range("From")

(major_range.join(sub_range, join_expr)
  .select(
    monotonically_increasing_id().as("row"),
    major_range("Group").as("Major"),
    sub_range("Group").as("Sub")
  )
).show
+---+-----+---+
|row|Major|Sub|
+---+-----+---+
| 0| A| E|
| 1| A| F|
| 2| B| H|
| 3| C| G|
| 4| D| H|
+---+-----+---+
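A sketch of the same overlap condition in plain SQL, which also runs on Hive (assuming the two tables are registered as major_range and sub_range, e.g. as Hive tables or temp views):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# From, To and Group clash with SQL keywords, hence the backticks.
spark.sql("""
    SELECT m.`Group` AS Major, s.`Group` AS Sub
    FROM major_range m
    JOIN sub_range s
      ON m.`From` < s.`To` AND m.`To` > s.`From`
""").show()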
