How to pivot pyspark dataframe rows to columns - apache-spark

I have a simple pyspark dataframe like this:
-----------------------------------------------------------
|timestamp_local |timestamp_utc |device |key      |value |
-----------------------------------------------------------
|2020-11-20      |2020-11-20    |J3X    |Position |SEP   |
|2020-11-20      |2020-11-20    |J3X    |Soll     |333   |
|2020-11-20      |2020-11-20    |J3X    |Ist      |444   |
|2020-11-21      |2020-11-21    |J3X    |Position |SOP   |
|2020-11-21      |2020-11-21    |J3X    |Soll     |100   |
|2020-11-21      |2020-11-21    |J3X    |Ist      |200   |
-----------------------------------------------------------
I want to use the pivot function but I am not sure if my approach is correct.
import pyspark.sql.functions as f
result_df = raw_df.groupBy('timestamp_local', 'timestamp_utc', 'device').pivot('key').agg(f.first('value'))
Desired output:
-----------------------------------------------------------
|timestamp_local |timestamp_utc |device |Position |Soll |Ist |
-----------------------------------------------------------
|2020-11-20      |2020-11-20    |J3X    |SEP      |333  |444 |
|2020-11-21      |2020-11-21    |J3X    |SOP      |100  |200 |
-----------------------------------------------------------
Any suggestions on how to do it?
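If it helps, here is a minimal, self-contained sketch of exactly that approach, recreating the sample data from the question. Note that pivot orders the new columns alphabetically, so a final select restores the desired column order:
import pyspark.sql.functions as f
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Recreate the sample data from the question
raw_df = spark.createDataFrame(
    [
        ("2020-11-20", "2020-11-20", "J3X", "Position", "SEP"),
        ("2020-11-20", "2020-11-20", "J3X", "Soll", "333"),
        ("2020-11-20", "2020-11-20", "J3X", "Ist", "444"),
        ("2020-11-21", "2020-11-21", "J3X", "Position", "SOP"),
        ("2020-11-21", "2020-11-21", "J3X", "Soll", "100"),
        ("2020-11-21", "2020-11-21", "J3X", "Ist", "200"),
    ],
    ["timestamp_local", "timestamp_utc", "device", "key", "value"],
)

# Pivot the distinct values of `key` into columns; first() picks the single
# value present for each (timestamp_local, timestamp_utc, device, key) group
result_df = (
    raw_df
    .groupBy("timestamp_local", "timestamp_utc", "device")
    .pivot("key")
    .agg(f.first("value"))
    .select("timestamp_local", "timestamp_utc", "device", "Position", "Soll", "Ist")
)
result_df.show()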

Related

Create a dataframe based on X days backward observation

Considering that I have the following DF:
|-----------------|
|Date | Cod |
|-----------------|
|2022-08-01 | A |
|2022-08-02 | A |
|2022-08-03 | A |
|2022-08-04 | A |
|2022-08-05 | A |
|2022-08-01 | B |
|2022-08-02 | B |
|2022-08-03 | B |
|2022-08-04 | B |
|2022-08-05 | B |
|-----------------|
And considering that I have a backward observation of 2 days, how can I generate the following output DF?
|------------------------------|
|RefDate | Date | Cod
|------------------------------|
|2022-08-03 | 2022-08-01 | A |
|2022-08-03 | 2022-08-02 | A |
|2022-08-03 | 2022-08-03 | A |
|2022-08-04 | 2022-08-02 | A |
|2022-08-04 | 2022-08-03 | A |
|2022-08-04 | 2022-08-04 | A |
|2022-08-05 | 2022-08-03 | A |
|2022-08-05 | 2022-08-04 | A |
|2022-08-05 | 2022-08-05 | A |
|2022-08-03 | 2022-08-01 | B |
|2022-08-03 | 2022-08-02 | B |
|2022-08-03 | 2022-08-03 | B |
|2022-08-04 | 2022-08-02 | B |
|2022-08-04 | 2022-08-03 | B |
|2022-08-04 | 2022-08-04 | B |
|2022-08-05 | 2022-08-03 | B |
|2022-08-05 | 2022-08-04 | B |
|2022-08-05 | 2022-08-05 | B |
|------------------------------|
I know that I can use loops to generate this output DF, but loops don't perform well since I can't cache the DF in memory (my original DF has approx. 6 billion rows). So, what is the best way to get this output?
MVCE:
from pyspark.sql.types import StructType, StructField, StringType

data_1 = [
    ("2022-08-01", "A"),
    ("2022-08-02", "A"),
    ("2022-08-03", "A"),
    ("2022-08-04", "A"),
    ("2022-08-05", "A"),
    ("2022-08-01", "B"),
    ("2022-08-02", "B"),
    ("2022-08-03", "B"),
    ("2022-08-04", "B"),
    ("2022-08-05", "B"),
]

schema_1 = StructType([
    StructField("Date", StringType(), True),
    StructField("Cod", StringType(), True),
])

df_1 = spark.createDataFrame(data=data_1, schema=schema_1)
You could try a self join. If your cluster and session are configured optimally, it should work even with 6B rows.
import pyspark.sql.functions as func

data_sdf.alias('a'). \
    join(data_sdf.alias('b'),
         [func.col('a.cod') == func.col('b.cod'),
          func.datediff(func.col('a.date'), func.col('b.date')).between(0, 2)],
         'inner'
         ). \
    drop(func.col('a.cod')). \
    selectExpr('cod', 'a.date as ref_date', 'b.date as date'). \
    show()
# +---+----------+----------+
# |cod| ref_date| date|
# +---+----------+----------+
# | B|2022-08-01|2022-08-01|
# | B|2022-08-02|2022-08-01|
# | B|2022-08-02|2022-08-02|
# | B|2022-08-03|2022-08-01|
# | B|2022-08-03|2022-08-02|
# | B|2022-08-03|2022-08-03|
# | B|2022-08-04|2022-08-02|
# | B|2022-08-04|2022-08-03|
# | B|2022-08-04|2022-08-04|
# | B|2022-08-05|2022-08-03|
# | B|2022-08-05|2022-08-04|
# | B|2022-08-05|2022-08-05|
# | A|2022-08-01|2022-08-01|
# | A|2022-08-02|2022-08-01|
# | A|2022-08-02|2022-08-02|
# | A|2022-08-03|2022-08-01|
# | A|2022-08-03|2022-08-02|
# | A|2022-08-03|2022-08-03|
# | A|2022-08-04|2022-08-02|
# | A|2022-08-04|2022-08-03|
# +---+----------+----------+
# only showing top 20 rows
This will generate records for the initial 2 dates as well, which can be discarded, for example with a filter like the one sketched below.
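One way to drop those partial windows is to keep only ref_dates that have a full 2-day lookback per cod. A sketch, where result_sdf is a hypothetical name standing for the self-joined dataframe built above (before the .show()) and min_dates is just a helper:
import pyspark.sql.functions as func

# Assumption: result_sdf = the self-joined dataframe above with columns cod, ref_date, date.
# min_dates = earliest observed date per cod
min_dates = data_sdf.groupBy('cod').agg(func.min('date').alias('min_date'))

# Keep only ref_dates that are at least 2 days past the earliest date for that cod
full_window_sdf = (
    result_sdf
    .join(min_dates, 'cod')
    .filter(func.col('ref_date') >= func.date_add(func.col('min_date'), 2))
    .drop('min_date')
)
full_window_sdf.show()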

Replace accounting notation for negative numbers with minus values

I have a dataframe which contains negative numbers in accountancy notation, i.e.:
df.select('sales').distinct().show()
+------------+
| sales |
+------------+
| 18 |
| 3 |
| 10 |
| (5)|
| 4 |
| 40 |
| 0 |
| 8 |
| 16 |
| (2)|
| 2 |
| (1)|
| 14 |
| (3)|
| 9 |
| 19 |
| (6)|
| 1 |
| (9)|
| (4)|
+------------+
only showing top 20 rows
The numbers wrapped in () are negative. How can I replace them with minus values instead, i.e. (5) becomes -5 and so on?
Here is what I have tried:
import pyspark.sql.functions as sf

sales = (
    df
    .select('sales')
    .withColumn('sales_new',
                sf.when(sf.col('sales').substr(1, 1) == '(',
                        sf.concat(sf.lit('-'), sf.col('sales').substr(2, 3)))
                .otherwise(sf.col('sales')))
)
sales.show(20,False)
+---------+---------+
|sales |sales_new|
+---------+---------+
| 151 | 151 |
| 134 | 134 |
| 151 | 151 |
|(151) |-151 |
|(134) |-134 |
|(151) |-151 |
| 151 | 151 |
| 50 | 50 |
| 101 | 101 |
| 134 | 134 |
|(134) |-134 |
| 46 | 46 |
| 151 | 151 |
| 134 | 134 |
| 185 | 185 |
| 84 | 84 |
| 188 | 188 |
|(94) |-94) |
| 38 | 38 |
| 21 | 21 |
+---------+---------+
The issue is that the length of sales can vary so hardcoding a value into the substring() won't work in some cases.
I have tried using regexp_replace but get an error that:
PatternSyntaxException: Unclosed group near index 1
sales = (
    df
    .select('sales')
    .withColumn('sales_new', regexp_replace(sf.col('sales'), '(', ''))
)
This can be solved with a case (when/otherwise) expression and a regular expression together; the parentheses have to be escaped in the regex pattern, which is what caused the PatternSyntaxException above:
import pyspark.sql.functions as sf
from pyspark.sql.functions import regexp_replace

sales = (
    df
    .select('sales')
    .withColumn('sales_new',
                sf.when(sf.col('sales').substr(1, 1) == '(',
                        sf.concat(sf.lit('-'), regexp_replace(sf.col('sales'), r'\(|\)', '')))
                .otherwise(sf.col('sales')))
)
sales.show(20,False)
+---------+---------+
|sales |sales_new|
+---------+---------+
|151 |151 |
|134 |134 |
|151 |151 |
|(151) |-151 |
|(134) |-134 |
|(151) |-151 |
|151 |151 |
|50 |50 |
|101 |101 |
|134 |134 |
|(134) |-134 |
|46 |46 |
|151 |151 |
|134 |134 |
|185 |185 |
|84 |84 |
|188 |188 |
|(94) |-94 |
|38 |38 |
|21 |21 |
+---------+---------+
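Note that sales_new is still a string column after this transformation; if a numeric type is needed downstream, it can be cast afterwards. A small sketch, assuming all cleaned values parse as integers:
from pyspark.sql.types import IntegerType

# Cast the cleaned string column to an integer column
sales = sales.withColumn('sales_new', sf.col('sales_new').cast(IntegerType()))
sales.printSchema()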
You can slice the string from the second character to the second-to-last character, negate the result, and convert it to float, for example:
def convert(number):
    try:
        number = float(number)
    except ValueError:
        # strip the surrounding parentheses and negate
        number = -float(number[1:-1])
    return number
You can apply this function to every element, for example by wrapping it in a UDF as sketched below.
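A minimal sketch of that, using the convert function above and a hypothetical convert_udf wrapper (df and the sales column are taken from the question):
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

# Wrap the plain-Python converter as a Spark UDF returning a float
convert_udf = udf(convert, FloatType())

sales = df.withColumn('sales_new', convert_udf(df['sales']))
sales.show(20, False)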

SparkSQL: Efficient way to do a left-outer join keeping the boundary of the right dataset

I need to join two time-series datasets (left & right).
I must consider all records from the left dataset even if there is no match in the right dataset (I can use a left-outer join for this).
But at the same time I must keep the starting and ending boundaries of the right dataset.
left dataset :
+-----------+-------+
| Timestamp | L_val |
+-----------+-------+
| … | … |
+-----------+-------+
| … | … |
+-----------+-------+
| 10001 | 346 |
+-----------+-------+
| 10002 | 987 |
+-----------+-------+
| 10003 | 788 |
+-----------+-------+
| 10004 | 567 |
+-----------+-------+
| 10005 | 665 |
+-----------+-------+
| 10006 | 654 |
+-----------+-------+
| 10007 | 345 |
+-----------+-------+
| 10008 | 565 |
+-----------+-------+
| 10009 | 567 |
+-----------+-------+
| …. | …. |
+-----------+-------+
| … | … |
+-----------+-------+
| | |
+-----------+-------+
right dataset:
+-----------+-------+
| Timestamp | R_val |
+-----------+-------+
| 10004 | 345 |
+-----------+-------+
| 10005 | 654 |
+-----------+-------+
| 10007 | 65 |
+-----------+-------+
| 10008 | 234 |
+-----------+-------+
required-joined-dataset:
+-----------+-------+-------+
| Timestamp | L_val | R_val |
+-----------+-------+-------+
| 10004 | 567 | 345 |
+-----------+-------+-------+
| 10005 | 665 | 654 |
+-----------+-------+-------+
| 10006 | 654 | |
+-----------+-------+-------+
| 10007 | 345 | 65 |
+-----------+-------+-------+
| 10008 | 565 | 234 |
+-----------+-------+-------+
scala> df_L.show(false)
+---------+-----+
|Timestamp|L_val|
+---------+-----+
|10001 |346 |
|10002 |987 |
|10003 |788 |
|10004 |567 |
|10005 |665 |
|10006 |654 |
|10007 |345 |
|10008 |565 |
|10009 |567 |
+---------+-----+
scala> df_R.show(false)
+---------+-----+
|Timestamp|R_val|
+---------+-----+
|10004 |345 |
|10005 |654 |
|10007 |65 |
|10008 |234 |
+---------+-----+
scala> val minTime = df_R.select(min("Timestamp")).rdd.collect.map(r => r(0)).mkString.toLong
minTime: Long = 10004
scala> val maxTime = df_R.select(max("Timestamp")).rdd.collect.map(r => r(0)).mkString.toLong
maxTime: Long = 10008
scala> df_L.alias("L").join(df_R.alias("R"), List("Timestamp"), "left").filter(col("L.Timestamp") >= minTime && col("L.Timestamp") <= maxTime ).na.fill("").show(false)
+---------+-----+-----+
|Timestamp|L_val|R_val|
+---------+-----+-----+
|10004 |567 |345 |
|10005 |665 |654 |
|10006 |654 | |
|10007 |345 |65 |
|10008 |565 |234 |
+---------+-----+-----+
OR
// More efficient: filter the left dataframe first and then join with the right
scala> df_L.alias("L").filter(col("L.Timestamp") >= minTime && col("L.Timestamp") <= maxTime ).join(df_R.alias("R"), List("Timestamp"), "left").show(false)
+---------+-----+-----+
|Timestamp|L_val|R_val|
+---------+-----+-----+
|10004 |567 |345 |
|10005 |665 |654 |
|10006 |654 |null |
|10007 |345 |65 |
|10008 |565 |234 |
+---------+-----+-----+
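For reference, the same idea in PySpark, computing both boundaries in a single aggregation so only one job is triggered (a sketch, assuming df_L and df_R are the equivalent PySpark DataFrames):
import pyspark.sql.functions as F

# Collect the right dataset's time boundaries in one pass
bounds = df_R.agg(
    F.min("Timestamp").alias("min_ts"),
    F.max("Timestamp").alias("max_ts"),
).first()

# Restrict the left dataset to the right dataset's time range, then left join
result = (
    df_L
    .filter(F.col("Timestamp").between(bounds["min_ts"], bounds["max_ts"]))
    .join(df_R, ["Timestamp"], "left")
)
result.show()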

How to get unique IDs and totals with an Excel Pivot Table

I have this table in columns A to F
A | B | C | D | E | F |
---------------------------------------------------------------------------------
File1 | | | | |
Record | ID | ABKs | MNKs | Date | Seg
NewRecord | 972567676 | 34305 | 72358 | 3/4/2019 22:13 | 21
NewRecord | 685206308 | 8198 | 27174 | 3/4/2019 22:16 | 61
NewRecord | 300264531 | 393064 | 10093118 | 3/4/2019 22:18 | 238
NewRecord | 300264531 | 431153 | 10055124 | 3/4/2019 22:22 | 232
NewRecord | 300264531 | 394506 | 10091831 | 3/4/2019 22:25 | 239
File2 | | | | |
Record | ID | ABKs | MNKs | Date | Seg
NewRecord | 300264531 | 494676 | 9992073 | 3/4/2019 22:29 | 307
NewRecord | 300264531 | 480117 | 10005787 | 3/4/2019 22:35 | 326
NewRecord | 300264531 | 500751 | 9986284 | 3/4/2019 22:53 | 74
NewRecord | 300264531 | 174754 | 10312153 | 3/4/2019 22:54 | 19
File3 | | | | |
Record | ID | ABKs | MNKs | Date | Seg
NewRecord | 725372898 | 734660 | 9751476 | 3/4/2019 23:04 | 79
NewRecord | 725372898 | 1307025 | 9178944 | 3/4/2019 23:05 | 256
NewRecord | 725372898 | 530935 | 9955441 | 3/4/2019 23:18 | 41
NewRecord | 725372898 | 564462 | 9921311 | 3/4/2019 23:19 | 713
File4 | | | | |
Record | ID | ABKs | MNKs | Date | Seg
NewRecord | 941774677 | 441381 | 10044548 | 3/4/2019 23:46 | 1196
NewRecord | 941774677 | 336354 | 7138685 | 3/5/2019 0:06 | 839
File5 | | | | |
Record | ID | ABKs | MNKs | Date | Seg
NewRecord | 1303422672 | 1947830 | 8538344 | 3/5/2019 0:30 | 126
NewRecord | 1303422672 | 939494 | 2130082 | 3/5/2019 0:33 | 107
I want to create a pivot table that shows the unique IDs with their totals (my expected pivot table format), but with my current pivot table setup I'm getting a different output.
Is it possible to get the desired table with a pivot table, and what do I need to do to get that result?
Just drag the {Date} pivot field from "Values" to "Rows" in the PivotTable.
It will affect your column sequence.

SparkSQL: How to sum two time-series data-sets with different timestamps

I have two data-sets of time-series data. I need to sum up these two data-sets, probably using some sort of windowing approach.
Timestamps are different for the two datasets.
The result would be the sum of the "value" fields from both datasets that fall within each window of the result dataset.
Are there any built-in functions in Spark to do this easily? Or else, how can I achieve this in the best possible way?
DataSet-1
raw_data_field_id | date_time_epoch | value
-------------------+-----------------+-----------
23 | 1528766100068 | 131
23 | 1528765200058 | 130.60001
23 | 1528764300049 | 130.3
23 | 1528763400063 | 130
23 | 1528762500059 | 129.60001
23 | 1528761600050 | 129.3
23 | 1528760700051 | 128.89999
23 | 1528759800047 | 128.60001
DataSet-2
raw_data_field_id | date_time_epoch | value
-------------------+-----------------+-----------
24 | 1528766100000 | 41
24 | 1528765200000 | 60
24 | 1528764300000 | 30.03
24 | 1528763400000 | 43
24 | 1528762500000 | 34.01
24 | 1528761600000 | 29.36
24 | 1528760700000 | 48.99
24 | 1528759800000 | 28.01
Here is an example:
scala> d1.show
+-----------------+--------------------+---------+
|raw_data_field_id| date_time_epoch| value|
+-----------------+--------------------+---------+
| 23|2018-06-12 01:15:...| 131.0|
| 23|2018-06-12 01:00:...|130.60001|
| 23|2018-06-12 00:45:...| 130.3|
| 23|2018-06-12 00:30:...| 130.0|
| 23|2018-06-12 00:15:...|129.60001|
| 23|2018-06-12 00:00:...| 129.3|
| 23|2018-06-11 23:45:...|128.89999|
| 23|2018-06-11 23:30:...|128.60001|
+-----------------+--------------------+---------+
scala> d2.show
+-----------------+--------------------+-----+
|raw_data_field_id| date_time_epoch|value|
+-----------------+--------------------+-----+
| 24|2018-06-12 01:15:...| 41.0|
| 24|2018-06-12 01:00:...| 60.0|
| 24|2018-06-12 00:45:...|30.03|
| 24|2018-06-12 00:30:...| 43.0|
| 24|2018-06-12 00:15:...|34.01|
| 24|2018-06-12 00:00:...|29.36|
| 24|2018-06-11 23:45:...|48.99|
| 24|2018-06-11 23:30:...|28.01|
+-----------------+--------------------+-----+
scala> d1.unionAll(d2).show
+-----------------+--------------------+---------+
|raw_data_field_id| date_time_epoch| value|
+-----------------+--------------------+---------+
| 23|2018-06-12 01:15:...| 131.0|
| 23|2018-06-12 01:00:...|130.60001|
| 23|2018-06-12 00:45:...| 130.3|
| 23|2018-06-12 00:30:...| 130.0|
| 23|2018-06-12 00:15:...|129.60001|
| 23|2018-06-12 00:00:...| 129.3|
| 23|2018-06-11 23:45:...|128.89999|
| 23|2018-06-11 23:30:...|128.60001|
| 24|2018-06-12 01:15:...| 41.0|
| 24|2018-06-12 01:00:...| 60.0|
| 24|2018-06-12 00:45:...| 30.03|
| 24|2018-06-12 00:30:...| 43.0|
| 24|2018-06-12 00:15:...| 34.01|
| 24|2018-06-12 00:00:...| 29.36|
| 24|2018-06-11 23:45:...| 48.99|
| 24|2018-06-11 23:30:...| 28.01|
+-----------------+--------------------+---------+
import org.apache.spark.sql.functions.{avg, window}
val df = d1.union(d2)
val avg_df = df.groupBy(window($"date_time_epoch", "15 minutes")).agg(avg($"value"))
avg_df.show
+--------------------+-----------------+
| window| avg(value)|
+--------------------+-----------------+
|[2018-06-11 23:45...| 88.944995|
|[2018-06-12 00:30...| 86.5|
|[2018-06-12 01:15...| 86.0|
|[2018-06-11 23:30...| 78.305005|
|[2018-06-12 00:00...|79.33000000000001|
|[2018-06-12 00:45...| 80.165|
|[2018-06-12 00:15...| 81.805005|
|[2018-06-12 01:00...| 95.300005|
+--------------------+-----------------+
avg_df.sort("window.start").select("window.start","window.end","avg(value)").show(truncate = false)
+-------------------+-------------------+-----------------+
|start |end |avg(value) |
+-------------------+-------------------+-----------------+
|2018-06-11 23:30:00|2018-06-11 23:45:00|78.305005 |
|2018-06-11 23:45:00|2018-06-12 00:00:00|88.944995 |
|2018-06-12 00:00:00|2018-06-12 00:15:00|79.33000000000001|
|2018-06-12 00:15:00|2018-06-12 00:30:00|81.805005 |
|2018-06-12 00:30:00|2018-06-12 00:45:00|86.5 |
|2018-06-12 00:45:00|2018-06-12 01:00:00|80.165 |
|2018-06-12 01:00:00|2018-06-12 01:15:00|95.300005 |
|2018-06-12 01:15:00|2018-06-12 01:30:00|86.0 |
+-------------------+-------------------+-----------------+
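The example above averages the values per window; since the question asks for a sum, the same grouping works with sum instead of avg. For reference, a PySpark sketch of that variant (assuming d1 and d2 are the equivalent PySpark DataFrames and date_time_epoch has already been converted to a timestamp, as in the Scala example above):
import pyspark.sql.functions as F

# Union the two datasets, bucket rows into 15-minute windows, and sum the values
sum_df = (
    d1.union(d2)
    .groupBy(F.window(F.col("date_time_epoch"), "15 minutes"))
    .agg(F.sum("value").alias("sum_value"))
    .orderBy("window.start")
    .select("window.start", "window.end", "sum_value")
)
sum_df.show(truncate=False)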
