SparkSQL : How to sum two time-series data-sets with different timestamps - apache-spark

I have two data-sets of time-series data, and I need to sum them up, probably using some sort of windowing approach.
The timestamps are different for the two datasets.
The result should be the sum of the "value" fields from both datasets that fall within each window of the result dataset.
Are there any built-in functions in Spark to do this easily? Otherwise, how can I achieve this in the best possible way?
DataSet-1
raw_data_field_id | date_time_epoch | value
-------------------+-----------------+-----------
23 | 1528766100068 | 131
23 | 1528765200058 | 130.60001
23 | 1528764300049 | 130.3
23 | 1528763400063 | 130
23 | 1528762500059 | 129.60001
23 | 1528761600050 | 129.3
23 | 1528760700051 | 128.89999
23 | 1528759800047 | 128.60001
DataSet-2
raw_data_field_id | date_time_epoch | value
-------------------+-----------------+-----------
24 | 1528766100000 | 41
24 | 1528765200000 | 60
24 | 1528764300000 | 30.03
24 | 1528763400000 | 43
24 | 1528762500000 | 34.01
24 | 1528761600000 | 29.36
24 | 1528760700000 | 48.99
24 | 1528759800000 | 28.01

Here is an example:
scala> d1.show
+-----------------+--------------------+---------+
|raw_data_field_id| date_time_epoch| value|
+-----------------+--------------------+---------+
| 23|2018-06-12 01:15:...| 131.0|
| 23|2018-06-12 01:00:...|130.60001|
| 23|2018-06-12 00:45:...| 130.3|
| 23|2018-06-12 00:30:...| 130.0|
| 23|2018-06-12 00:15:...|129.60001|
| 23|2018-06-12 00:00:...| 129.3|
| 23|2018-06-11 23:45:...|128.89999|
| 23|2018-06-11 23:30:...|128.60001|
+-----------------+--------------------+---------+
scala> d2.show
+-----------------+--------------------+-----+
|raw_data_field_id| date_time_epoch|value|
+-----------------+--------------------+-----+
| 24|2018-06-12 01:15:...| 41.0|
| 24|2018-06-12 01:00:...| 60.0|
| 24|2018-06-12 00:45:...|30.03|
| 24|2018-06-12 00:30:...| 43.0|
| 24|2018-06-12 00:15:...|34.01|
| 24|2018-06-12 00:00:...|29.36|
| 24|2018-06-11 23:45:...|48.99|
| 24|2018-06-11 23:30:...|28.01|
+-----------------+--------------------+-----+
scala> d1.unionAll(d2).show
+-----------------+--------------------+---------+
|raw_data_field_id| date_time_epoch| value|
+-----------------+--------------------+---------+
| 23|2018-06-12 01:15:...| 131.0|
| 23|2018-06-12 01:00:...|130.60001|
| 23|2018-06-12 00:45:...| 130.3|
| 23|2018-06-12 00:30:...| 130.0|
| 23|2018-06-12 00:15:...|129.60001|
| 23|2018-06-12 00:00:...| 129.3|
| 23|2018-06-11 23:45:...|128.89999|
| 23|2018-06-11 23:30:...|128.60001|
| 24|2018-06-12 01:15:...| 41.0|
| 24|2018-06-12 01:00:...| 60.0|
| 24|2018-06-12 00:45:...| 30.03|
| 24|2018-06-12 00:30:...| 43.0|
| 24|2018-06-12 00:15:...| 34.01|
| 24|2018-06-12 00:00:...| 29.36|
| 24|2018-06-11 23:45:...| 48.99|
| 24|2018-06-11 23:30:...| 28.01|
+-----------------+--------------------+---------+
import org.apache.spark.sql.functions.{window, avg}
val df = d1.union(d2)
val avg_df = df.groupBy(window($"date_time_epoch", "15 minutes")).agg(avg($"value"))
avg_df.show
+--------------------+-----------------+
| window| avg(value)|
+--------------------+-----------------+
|[2018-06-11 23:45...| 88.944995|
|[2018-06-12 00:30...| 86.5|
|[2018-06-12 01:15...| 86.0|
|[2018-06-11 23:30...| 78.305005|
|[2018-06-12 00:00...|79.33000000000001|
|[2018-06-12 00:45...| 80.165|
|[2018-06-12 00:15...| 81.805005|
|[2018-06-12 01:00...| 95.300005|
+--------------------+-----------------+
avg_df.sort("window.start").select("window.start","window.end","avg(value)").show(truncate = false)
+-------------------+-------------------+-----------------+
|start |end |avg(value) |
+-------------------+-------------------+-----------------+
|2018-06-11 23:30:00|2018-06-11 23:45:00|78.305005 |
|2018-06-11 23:45:00|2018-06-12 00:00:00|88.944995 |
|2018-06-12 00:00:00|2018-06-12 00:15:00|79.33000000000001|
|2018-06-12 00:15:00|2018-06-12 00:30:00|81.805005 |
|2018-06-12 00:30:00|2018-06-12 00:45:00|86.5 |
|2018-06-12 00:45:00|2018-06-12 01:00:00|80.165 |
|2018-06-12 01:00:00|2018-06-12 01:15:00|95.300005 |
|2018-06-12 01:15:00|2018-06-12 01:30:00|86.0 |
+-------------------+-------------------+-----------------+
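Note that the example above averages the values; since the question asks for the sum, sum can be swapped in for avg in the same windowed aggregation. A rough PySpark sketch of that variant (assuming d1/d2 still carry date_time_epoch as epoch milliseconds and need converting to a timestamp first):
from pyspark.sql import functions as F
# Union the two series and turn epoch milliseconds into a timestamp column
df = d1.union(d2).withColumn(
    "ts", (F.col("date_time_epoch") / 1000).cast("timestamp")
)
# Sum the values of both series falling inside each 15-minute window
sum_df = (
    df.groupBy(F.window(F.col("ts"), "15 minutes"))
      .agg(F.sum("value").alias("sum_value"))
      .select("window.start", "window.end", "sum_value")
      .orderBy("window.start")
)
sum_df.show(truncate=False)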

Related

Create a dataframe based on X days backward observation

Considering that I have the following DF:
|-----------------|
|Date | Cod |
|-----------------|
|2022-08-01 | A |
|2022-08-02 | A |
|2022-08-03 | A |
|2022-08-04 | A |
|2022-08-05 | A |
|2022-08-01 | B |
|2022-08-02 | B |
|2022-08-03 | B |
|2022-08-04 | B |
|2022-08-05 | B |
|-----------------|
And considering that I have a backward observation of 2 days, how can I generate the following output DF?
|------------------------------|
|RefDate | Date | Cod
|------------------------------|
|2022-08-03 | 2022-08-01 | A |
|2022-08-03 | 2022-08-02 | A |
|2022-08-03 | 2022-08-03 | A |
|2022-08-04 | 2022-08-02 | A |
|2022-08-04 | 2022-08-03 | A |
|2022-08-04 | 2022-08-04 | A |
|2022-08-05 | 2022-08-03 | A |
|2022-08-05 | 2022-08-04 | A |
|2022-08-05 | 2022-08-05 | A |
|2022-08-03 | 2022-08-01 | B |
|2022-08-03 | 2022-08-02 | B |
|2022-08-03 | 2022-08-03 | B |
|2022-08-04 | 2022-08-02 | B |
|2022-08-04 | 2022-08-03 | B |
|2022-08-04 | 2022-08-04 | B |
|2022-08-05 | 2022-08-03 | B |
|2022-08-05 | 2022-08-04 | B |
|2022-08-05 | 2022-08-05 | B |
|------------------------------|
I know that I can use loops to generate this output DF, but loops don't perform well, since I can't cache the DF in memory (my original DF has approx. 6 billion rows). So, what is the best way to get this output?
MVCE:
from pyspark.sql.types import StructType, StructField, StringType
data_1 = [
    ("2022-08-01", "A"),
    ("2022-08-02", "A"),
    ("2022-08-03", "A"),
    ("2022-08-04", "A"),
    ("2022-08-05", "A"),
    ("2022-08-01", "B"),
    ("2022-08-02", "B"),
    ("2022-08-03", "B"),
    ("2022-08-04", "B"),
    ("2022-08-05", "B")
]
schema_1 = StructType([
    StructField("Date", StringType(), True),
    StructField("Cod", StringType(), True)
])
df_1 = spark.createDataFrame(data=data_1, schema=schema_1)
You could try a self join. My thought is that if your cluster and session are configured optimally, it should work with 6B rows.
from pyspark.sql import functions as func
# data_sdf is the input dataframe (df_1 from the MVCE above)
data_sdf.alias('a'). \
    join(data_sdf.alias('b'),
         [func.col('a.cod') == func.col('b.cod'),
          func.datediff(func.col('a.date'), func.col('b.date')).between(0, 2)],
         'inner'
         ). \
    drop(func.col('a.cod')). \
    selectExpr('cod', 'a.date as ref_date', 'b.date as date'). \
    show()
# +---+----------+----------+
# |cod| ref_date| date|
# +---+----------+----------+
# | B|2022-08-01|2022-08-01|
# | B|2022-08-02|2022-08-01|
# | B|2022-08-02|2022-08-02|
# | B|2022-08-03|2022-08-01|
# | B|2022-08-03|2022-08-02|
# | B|2022-08-03|2022-08-03|
# | B|2022-08-04|2022-08-02|
# | B|2022-08-04|2022-08-03|
# | B|2022-08-04|2022-08-04|
# | B|2022-08-05|2022-08-03|
# | B|2022-08-05|2022-08-04|
# | B|2022-08-05|2022-08-05|
# | A|2022-08-01|2022-08-01|
# | A|2022-08-02|2022-08-01|
# | A|2022-08-02|2022-08-02|
# | A|2022-08-03|2022-08-01|
# | A|2022-08-03|2022-08-02|
# | A|2022-08-03|2022-08-03|
# | A|2022-08-04|2022-08-02|
# | A|2022-08-04|2022-08-03|
# +---+----------+----------+
# only showing top 20 rows
This will generate records for the initial 2 dates as well, which can be discarded.
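If, as in the desired output (which starts at 2022-08-03), only reference dates with a full 2-day look-back should survive, one way to discard those initial rows is to compare each ref_date against the earliest date per cod. A sketch under the assumption that joined_sdf holds the result of the self join above (and that data_sdf is the df_1 from the MVCE):
from pyspark.sql import functions as func
from pyspark.sql.window import Window
# Keep only reference dates that sit at least 2 days after the earliest
# date of their 'cod' group, i.e. dates with a complete look-back window.
w = Window.partitionBy('cod')
trimmed_sdf = (
    joined_sdf
    .withColumn('min_date', func.min('ref_date').over(w))
    .filter(func.datediff('ref_date', 'min_date') >= 2)
    .drop('min_date')
)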

Replace accounting notation for negative number with minus value

I have a dataframe which contains negative numbers in accountancy notation, i.e.:
df.select('sales').distinct().show()
+------------+
| sales |
+------------+
| 18 |
| 3 |
| 10 |
| (5)|
| 4 |
| 40 |
| 0 |
| 8 |
| 16 |
| (2)|
| 2 |
| (1)|
| 14 |
| (3)|
| 9 |
| 19 |
| (6)|
| 1 |
| (9)|
| (4)|
+------------+
only showing top 20 rows
The numbers wrapped in () are negative. How can I replace them with minus values instead, i.e. (5) becomes -5, and so on?
Here is what I have tried:
import pyspark.sql.functions as sf
sales = (
    df
    .select('sales')
    .withColumn('sales_new',
                sf.when(sf.col('sales').substr(1, 1) == '(',
                        sf.concat(sf.lit('-'), sf.col('sales').substr(2, 3)))
                .otherwise(sf.col('sales')))
)
sales.show(20, False)
+---------+---------+
|sales |sales_new|
+---------+---------+
| 151 | 151 |
| 134 | 134 |
| 151 | 151 |
|(151) |-151 |
|(134) |-134 |
|(151) |-151 |
| 151 | 151 |
| 50 | 50 |
| 101 | 101 |
| 134 | 134 |
|(134) |-134 |
| 46 | 46 |
| 151 | 151 |
| 134 | 134 |
| 185 | 185 |
| 84 | 84 |
| 188 | 188 |
|(94) |-94) |
| 38 | 38 |
| 21 | 21 |
+---------+---------+
The issue is that the length of sales can vary, so hardcoding a value into substring() won't work in some cases.
I have tried using regexp_replace, but I get an error:
PatternSyntaxException: Unclosed group near index 1
sales = (
    df
    .select('sales')
    .withColumn('sales_new', regexp_replace(sf.col('sales'), '(', ''))
)
This can be solved with a case statement and regular expression together:
import pyspark.sql.functions as sf
from pyspark.sql.functions import regexp_replace
sales = (
    df
    .select('sales')
    .withColumn('sales_new',
                sf.when(sf.col('sales').substr(1, 1) == '(',
                        sf.concat(sf.lit('-'), regexp_replace(sf.col('sales'), r'\(|\)', '')))
                .otherwise(sf.col('sales')))
)
sales.show(20, False)
+---------+---------+
|sales |sales_new|
+---------+---------+
|151 |151 |
|134 |134 |
|151 |151 |
|(151) |-151 |
|(134) |-134 |
|(151) |-151 |
|151 |151 |
|50 |50 |
|101 |101 |
|134 |134 |
|(134) |-134 |
|46 |46 |
|151 |151 |
|134 |134 |
|185 |185 |
|84 |84 |
|188 |188 |
|(94) |-94 |
|38 |38 |
|21 |21 |
+---------+---------+
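As a side note, the stripping and the sign can also be handled in a single regexp_replace with a capture group, and the result cast straight to a numeric type. A minimal sketch (column names as in the question; the trim is an assumption in case the values carry padding spaces):
from pyspark.sql import functions as sf
# "(94)" becomes "-94" via the capture group; plain numbers pass through unchanged
sales = df.select('sales').withColumn(
    'sales_new',
    sf.regexp_replace(sf.trim(sf.col('sales')), r'^\((\d+)\)$', '-$1').cast('double')
)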
You can slice the string from the second character to the second-to-last character, negate it, and then convert it to float, for example:
def convert(number):
    # Plain numbers convert directly; values like "(5)" raise ValueError,
    # so the parentheses are stripped and the sign negated instead.
    try:
        number = float(number)
    except ValueError:
        number = -float(number[1:-1])
    return number
You can iterate through all the elements and apply this function.
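Since this is a Spark dataframe rather than a plain Python collection, one way to apply the function above across the column is to wrap it in a UDF. A small sketch, assuming df and the sales column from the question:
from pyspark.sql import functions as sf
from pyspark.sql.types import DoubleType
# Register convert() as a UDF returning a double and apply it to the column
convert_udf = sf.udf(convert, DoubleType())
df = df.withColumn('sales_new', convert_udf(sf.col('sales')))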

How to pivot pyspark dataframe rows to columns

I have a simple pyspark dataframe like this:
-------------------------------------------------------
|timestamp_local |timestamp_utc|device|key |value|
-------------------------------------------------------
|2020-11-20 | 2020-11-20 | J3X |Position| SEP |
|2020-11-20 | 2020-11-20 | J3X |Soll | 333 |
|2020-11-20 | 2020-11-20 | J3X |Ist | 444 |
|2020-11-21 | 2020-11-21 | J3X |Position| SOP |
|2020-11-21 | 2020-11-21 | J3X |Soll | 100 |
|2020-11-21 | 2020-11-21 | J3X |Ist | 200 |
-------------------------------------------------------
I want to use the pivot function, but I am not sure if it's correct.
import pyspark.sql.functions as f
result_df = raw_df.groupBy('timestamp_local', 'timestamp_utc', 'device').pivot('key').agg(f.first('value'))
Desired output:
---------------------------------------------------------------
| timestamp_local|timestamp_utc|device|Position | Soll | Ist
---------------------------------------------------------------
|2020-11-20 | 2020-11-20 | J3X | SEP | 333 | 444
|2020-11-21 | 2020-11-21 | J3X | SOP | 100 | 200
---------------------------------------------------------------
Any suggestions on how to do it?
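For reference, a self-contained sketch that rebuilds the sample dataframe and runs the pivot attempted above, just to make it reproducible (all names follow the question):
import pyspark.sql.functions as f
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
raw_df = spark.createDataFrame(
    [
        ("2020-11-20", "2020-11-20", "J3X", "Position", "SEP"),
        ("2020-11-20", "2020-11-20", "J3X", "Soll", "333"),
        ("2020-11-20", "2020-11-20", "J3X", "Ist", "444"),
        ("2020-11-21", "2020-11-21", "J3X", "Position", "SOP"),
        ("2020-11-21", "2020-11-21", "J3X", "Soll", "100"),
        ("2020-11-21", "2020-11-21", "J3X", "Ist", "200"),
    ],
    ["timestamp_local", "timestamp_utc", "device", "key", "value"],
)
# Pivot the 'key' values into columns, keeping the first 'value' per cell
result_df = (
    raw_df
    .groupBy("timestamp_local", "timestamp_utc", "device")
    .pivot("key")
    .agg(f.first("value"))
)
result_df.show()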

SparkSql : Efficient way to do a left-outer join keeping the boundary of the right dataset

I need to join two time-series datasets (left & right).
I must consider all records from the left dataset even if there is no match in the right dataset (I can use a left-outer join for this).
But at the same time I must keep the starting and ending boundaries of the right dataset.
left dataset :
+-----------+-------+
| Timestamp | L_val |
+-----------+-------+
| … | … |
+-----------+-------+
| … | … |
+-----------+-------+
| 10001 | 346 |
+-----------+-------+
| 10002 | 987 |
+-----------+-------+
| 10003 | 788 |
+-----------+-------+
| 10004 | 567 |
+-----------+-------+
| 10005 | 665 |
+-----------+-------+
| 10006 | 654 |
+-----------+-------+
| 10007 | 345 |
+-----------+-------+
| 10008 | 565 |
+-----------+-------+
| 10009 | 567 |
+-----------+-------+
| …. | …. |
+-----------+-------+
| … | … |
+-----------+-------+
| | |
+-----------+-------+
right dataset:
+-----------+-------+
| Timestamp | R_val |
+-----------+-------+
| 10004 | 345 |
+-----------+-------+
| 10005 | 654 |
+-----------+-------+
| 10007 | 65 |
+-----------+-------+
| 10008 | 234 |
+-----------+-------+
required-joined-dataset:
+-----------+-------+-------+
| Timestamp | L_val | R_val |
+-----------+-------+-------+
| 10004 | 567 | 345 |
+-----------+-------+-------+
| 10005 | 665 | 654 |
+-----------+-------+-------+
| 10006 | 654 | |
+-----------+-------+-------+
| 10007 | 345 | 65 |
+-----------+-------+-------+
| 10008 | 565 | 234 |
+-----------+-------+-------+
scala> df_L.show(false)
+---------+-----+
|Timestamp|L_val|
+---------+-----+
|10001 |346 |
|10002 |987 |
|10003 |788 |
|10004 |567 |
|10005 |665 |
|10006 |654 |
|10007 |345 |
|10008 |565 |
|10009 |567 |
+---------+-----+
scala> df_R.show(false)
+---------+-----+
|Timestamp|R_val|
+---------+-----+
|10004 |345 |
|10005 |654 |
|10007 |65 |
|10008 |234 |
+---------+-----+
scala> val minTime = df_R.select(min("Timestamp")).rdd.collect.map(r => r(0)).mkString.toLong
minTime: Long = 10004
scala> val maxTime = df_R.select(max("Timestamp")).rdd.collect.map(r => r(0)).mkString.toLong
maxTime: Long = 10008
scala> df_L.alias("L").join(df_R.alias("R"), List("Timestamp"), "left").filter(col("L.Timestamp") >= minTime && col("L.Timestamp") <= maxTime ).na.fill("").show(false)
+---------+-----+-----+
|Timestamp|L_val|R_val|
+---------+-----+-----+
|10004 |567 |345 |
|10005 |665 |654 |
|10006 |654 | |
|10007 |345 |65 |
|10008 |565 |234 |
+---------+-----+-----+
OR
// More efficient: filter the left dataframe first, then join the result with the right dataset
scala> df_L.alias("L").filter(col("L.Timestamp") >= minTime && col("L.Timestamp") <= maxTime ).join(df_R.alias("R"), List("Timestamp"), "left").show(false)
+---------+-----+-----+
|Timestamp|L_val|R_val|
+---------+-----+-----+
|10004 |567 |345 |
|10005 |665 |654 |
|10006 |654 |null |
|10007 |345 |65 |
|10008 |565 |234 |
+---------+-----+-----+
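For comparison, a rough PySpark equivalent of the same approach (dataframe and column names follow the question); here the right dataset's boundaries are collected in a single aggregation instead of two separate collects:
from pyspark.sql import functions as F
# Collect the right dataset's time boundaries in one pass
bounds = df_R.agg(
    F.min("Timestamp").alias("min_ts"),
    F.max("Timestamp").alias("max_ts"),
).first()
# Trim the left dataset to those boundaries, then left-join the right dataset
joined = (
    df_L.filter(F.col("Timestamp").between(bounds["min_ts"], bounds["max_ts"]))
        .join(df_R, on="Timestamp", how="left")
)
joined.show(truncate=False)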

Count specific value for IDs in two dataframes

I have two dataframes
df1
+----+-------+
| | Key |
|----+-------|
| 0 | 30 |
| 1 | 31 |
| 2 | 32 |
| 3 | 33 |
| 4 | 34 |
| 5 | 35 |
+----+-------+
df2
+----+-------+--------+
| | Key | Test |
|----+-------+--------|
| 0 | 30 | Test4 |
| 1 | 30 | Test5 |
| 2 | 30 | Test6 |
| 3 | 31 | Test4 |
| 4 | 31 | Test5 |
| 5 | 31 | Test6 |
| 6 | 32 | Test3 |
| 7 | 33 | Test3 |
| 8 | 33 | Test3 |
| 9 | 34 | Test1 |
| 10 | 34 | Test1 |
| 11 | 34 | Test2 |
| 12 | 34 | Test3 |
| 13 | 34 | Test3 |
| 14 | 34 | Test3 |
| 15 | 35 | Test3 |
| 16 | 35 | Test3 |
| 17 | 35 | Test3 |
| 18 | 35 | Test3 |
| 19 | 35 | Test3 |
+----+-------+--------+
I want to count how many times each Test is listed for each Key.
+----+-------+-------+-------+-------+-------+-------+-------+
| | Key | Test1 | Test2 | Test3 | Test4 | Test5 | Test6 |
|----+-------|-------|-------|-------|-------|-------|-------|
| 0 | 30 | | | | 1 | 1 | 1 |
| 1 | 31 | | | | 1 | 1 | 1 |
| 2 | 32 | | | 1 | | | |
| 3 | 33 | | | 2 | | | |
| 4 | 34 | 2 | 1 | 3 | | | |
| 5 | 35 | | | 5 | | | |
+----+-------+-------+-------+-------+-------+-------+-------+
What I've tried
Using join and groupby, I first got the count for each Key, regardless of Test.
result_df = df1.join(df2.groupby('Key').size().rename('Count'), on='Key')
+----+-------+---------+
| | Key | Count |
|----+-------+---------|
| 0 | 30 | 3 |
| 1 | 31 | 3 |
| 2 | 32 | 1 |
| 3 | 33 | 2 |
| 4 | 34 | 6 |
| 5 | 35 | 5 |
+----+-------+---------+
I then tried grouping by both Key and Test:
result_df = df1.join(df2.groupby(['Key', 'Test']).size().rename('Count'), on='Key')
but this returns an error
ValueError: len(left_on) must equal the number of levels in the index of "right"
Check with crosstab
pd.crosstab(df2.Key,df2.Test).reindex(df1.Key).replace({0:''})
Here is another solution with groupby & pivot. With this solution you don't need df1 at all.
import numpy as np
import pandas as pd
# | create some dummy data
tests = ['Test' + str(i) for i in range(1, 7)]
df = pd.DataFrame({'Test': np.random.choice(tests, size=100),
                   'Key': np.random.randint(30, 35, size=100)})
df['Count Variable'] = 1
# | group & count aggregation (reset_index so 'Key' and 'Test' are columns again before pivoting)
df = df.groupby(['Key', 'Test']).count().reset_index()
df = df.pivot(index="Key", columns="Test", values="Count Variable").reset_index()
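If the final table should list every Key from df1 and show blanks rather than zeros (as in the desired output), the crosstab idea and the groupby/pivot idea can be combined; a sketch assuming df1/df2 from the question:
import pandas as pd
# Count Test occurrences per Key, spread the Tests into columns, then align
# the result to df1's Key list and blank out the zero counts.
counts = (
    df2.groupby(['Key', 'Test'])
       .size()
       .unstack(fill_value=0)
       .reindex(df1['Key'], fill_value=0)
       .replace({0: ''})
       .reset_index()
)
print(counts)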
