I have a PySpark dataframe df:
| STORE | COL_APPLE_BB | COL_APPLE_NONBB | COL_PEAR_BB | COL_PEAR_NONBB | COL_ORANGE_BB | COL_ORANGE_NONBB | COL_GRAPE_BB | COL_GRAPE_NONBB |
|-------|--------------|-----------------|-------------|----------------|---------------|------------------|--------------|-----------------|
| 1     | 28           | 24              | 24          | 32             | 26            | 54               | 60           | 36              |
| 2     | 19           | 12              | 24          | 13             | 10            | 24               | 29           | 10              |
I have another PySpark dataframe df2:
| STORE | PDT | FRUIT  | TYPE  |
|-------|-----|--------|-------|
| 1     | 1   | APPLE  | BB    |
| 1     | 2   | ORANGE | NONBB |
| 1     | 3   | PEAR   | BB    |
| 1     | 4   | GRAPE  | BB    |
| 1     | 5   | APPLE  | BB    |
| 1     | 6   | ORANGE | BB    |
| 2     | 1   | PEAR   | NONBB |
| 2     | 2   | ORANGE | NONBB |
| 2     | 3   | APPLE  | NONBB |
Expected df2 with a COL_VALUE column for the respective store, fruit, and type:
| STORE | PDT | FRUIT  | TYPE  | COL_VALUE |
|-------|-----|--------|-------|-----------|
| 1     | 1   | APPLE  | BB    | 28        |
| 1     | 2   | ORANGE | NONBB | 54        |
| 1     | 3   | PEAR   | BB    | 24        |
| 1     | 4   | GRAPE  | BB    | 60        |
| 1     | 5   | APPLE  | BB    | 28        |
| 1     | 6   | ORANGE | BB    | 26        |
| 2     | 1   | PEAR   | NONBB | 13        |
| 2     | 2   | ORANGE | NONBB | 24        |
| 2     | 3   | APPLE  | NONBB | 12        |
from pyspark.sql.functions import *
df = spark.createDataFrame(
[
(1, 28, 24, 24, 32, 26, 54, 60, 36),
(2, 19, 12, 24, 13, 10, 24, 29, 10)
],
["STORE", "COL_APPLE_BB", "COL_APPLE_NONBB", "COL_PEAR_BB", "COL_PEAR_NONBB", "COL_ORANGE_BB", "COL_ORANGE_NONBB", "COL_GRAPE_BB","COL_GRAPE_NONBB"]
)
df2 = spark.createDataFrame(
[
(1, 1, "APPLE", "BB"),
(1, 2, "ORANGE", "NONBB"),
(1, 3, "PEAR", "BB"),
(1, 4, "GRAPE", "BB"),
(1, 5, "APPLE", "BB"),
(1, 6, "ORANGE", "BB"),
(2, 1, "PEAR", "NONBB"),
(2, 2, "ORANGE", "NONBB"),
(2, 3, "APPLE", "NONBB")
],
["STORE", "PDT", "FRUIT", "TYPE"]
)
unPivot_df = df.select("STORE", expr("""stack(8,
    'APPLE_BB',     COL_APPLE_BB,
    'APPLE_NONBB',  COL_APPLE_NONBB,
    'PEAR_BB',      COL_PEAR_BB,
    'PEAR_NONBB',   COL_PEAR_NONBB,
    'ORANGE_BB',    COL_ORANGE_BB,
    'ORANGE_NONBB', COL_ORANGE_NONBB,
    'GRAPE_BB',     COL_GRAPE_BB,
    'GRAPE_NONBB',  COL_GRAPE_NONBB) as (Appended, COL_VALUE)"""))
df2 = df2.withColumn("Appended",concat_ws('_',col("FRUIT"),col("TYPE")))
df2 = df2.join(unPivot_df,['STORE',"Appended"],"left")
df2.show()
+-----+------------+---+------+-----+---------+
|STORE| Appended|PDT| FRUIT| TYPE|COL_VALUE|
+-----+------------+---+------+-----+---------+
| 1|ORANGE_NONBB| 2|ORANGE|NONBB| 54|
| 1| PEAR_BB| 3| PEAR| BB| 24|
| 1| GRAPE_BB| 4| GRAPE| BB| 60|
| 1| APPLE_BB| 1| APPLE| BB| 28|
| 2|ORANGE_NONBB| 2|ORANGE|NONBB| 24|
| 2| APPLE_NONBB| 3| APPLE|NONBB| 12|
| 1| ORANGE_BB| 6|ORANGE| BB| 26|
| 1| APPLE_BB| 5| APPLE| BB| 28|
| 2| PEAR_NONBB| 1| PEAR|NONBB| 13|
+-----+------------+---+------+-----+---------+
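If you don't need the helper column afterwards, an optional cleanup (a small sketch, nothing required by the approach) is to drop Appended and sort back into the PDT order shown in the expected output:
# Optional: drop the helper column and restore the expected ordering.
df2 = df2.drop("Appended").orderBy("STORE", "PDT")
df2.show()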
If you have Spark 3.2 or higher, you could use something like:
data = data.melt(
id_vars=['STORE'],
value_vars=data.columns[1:],
var_name="variable",
value_name="value"
)
to get a "long" form of the dataset, and then use regex_extract twice to get the required information from the variable column.
For earlier versions of Spark, use the following:
def process_row(row):
    output = []
    for index, key in enumerate(row.asDict()):
        if key == "STORE":
            # Remember the store id (assumes STORE is the first column).
            store = row[key]
        else:
            # Column names look like COL_APPLE_BB -> prefix, fruit, type.
            _, fruit, type = key.split("_")
            output.append((store, index, fruit, type, row[key]))
    return output

data = data.rdd.flatMap(process_row).toDF(
    schema=["STORE", "PDT", "FRUIT", "TYPE", "COLUMN_VALUE"]
)
As an alternative to melt, you can use stack in earlier Spark versions:
df = spark.createDataFrame(
[
(1, 28, 24),
(2, 19, 12),
],
["STORE", "COL_APPLE_BB", "COL_APPLE_NONBB"]
)
df2 = spark.createDataFrame(
[
(1, 1, "APPLE", "BB"),
(1, 2, "ORANGE", "NONBB"),
(1, 2, "APPLE", "NONBB"),
(2, 3, "APPLE", "NONBB")
],
["STORE", "PDT", "FRUIT", "TYPE"]
)
Create a column that matches the COL_FRUIT_TYPE names in df:
from pyspark.sql import functions as F

df3 = df2.withColumn("fruit_type", F.concat(F.lit("COL_"), F.col("FRUIT"), F.lit("_"), F.col("TYPE")))
df3.show(10, False)
gives:
+-----+---+------+-----+----------------+
|STORE|PDT|FRUIT |TYPE |fruit_type |
+-----+---+------+-----+----------------+
|1 |1 |APPLE |BB |COL_APPLE_BB |
|1 |2 |ORANGE|NONBB|COL_ORANGE_NONBB|
|1 |2 |APPLE |NONBB|COL_APPLE_NONBB |
+-----+---+------+-----+----------------+
Then "unpivot" the first df:
from pyspark.sql.functions import expr
unpivotExpr = "stack({}, {}) as (fruit_type, COL_VALUE)".format(len(df.columns) - 1, ','.join( [("'{}', {}".format(c, str(c))) for c in df.columns[1:]] ) )
print(unpivotExpr)
unPivotDF = df.select("STORE", expr(unpivotExpr)) \
.where("STORE is not null")
unPivotDF.show(truncate=False)
The stack function takes as arguments: first, the number of "columns" it will be "unpivoting" (here len(df.columns) - 1, since we skip the STORE column); then, for simple column/value pairs, a list of those pairs in the form 'col_name', value. The
"'{}', {}".format(c, c) for c in df.columns[1:] part takes the columns of df, skipping the first one (STORE), and returns a pair such as 'COL_APPLE_BB', COL_APPLE_BB for each remaining column. These pairs are then joined into a comma-separated string (", ".join()) which replaces the placeholder {}.
An example of how the stack function is usually called:
"stack(2, 'COL_APPLE_BB', COL_APPLE_BB, 'COL_APPLE_NONBB', COL_APPLE_NONBB) as (fruit_type, COL_VALUE)"
The unPivotDF.show(truncate=False) outputs:
+-----+---------------+---------+
|STORE|fruit_type |COL_VALUE|
+-----+---------------+---------+
|1 |COL_APPLE_BB |28 |
|1 |COL_APPLE_NONBB|24 |
|2 |COL_APPLE_BB |19 |
|2 |COL_APPLE_NONBB|12 |
+-----+---------------+---------+
and join the two:
df3.join(unPivotDF, ["fruit_type", "STORE"], "left")\
.select("STORE", "PDT", "FRUIT", "TYPE", "COL_VALUE").show(40, False)
result:
+-----+---+------+-----+---------+
|STORE|PDT|FRUIT |TYPE |COL_VALUE|
+-----+---+------+-----+---------+
|1 |2 |ORANGE|NONBB|null |
|1 |2 |APPLE |NONBB|24 |
|1 |1 |APPLE |BB |28 |
|2 |3 |APPLE |NONBB|12 |
+-----+---+------+-----+---------+
The drawback is that you need to enumerate the column names in stack; if I figure out a way to do this automatically, I will update the answer.
EDIT: I have updated the use of the stack function so that it can derive the columns by itself.
Related
I have a dataframe looking like this:
| id | device   | x  | y  | z  | timestamp                    |
|----|----------|----|----|----|------------------------------|
| 1  | device_1 | 22 | 8  | 23 | 2020-10-30T16:00:00.000+0000 |
| 1  | device_1 | 21 | 88 | 65 | 2020-10-30T16:01:00.000+0000 |
| 1  | device_1 | 33 | 34 | 64 | 2020-10-30T16:02:00.000+0000 |
| 2  | device_2 | 12 | 6  | 97 | 2019-11-30T13:00:00.000+0000 |
| 2  | device_2 | 44 | 77 | 13 | 2019-11-30T13:00:00.000+0000 |
| 1  | device_1 | 22 | 11 | 30 | 2022-10-30T08:00:00.000+0000 |
| 1  | device_1 | 22 | 11 | 30 | 2022-10-30T08:01:00.000+0000 |
The data represents events for an "id" at a certain point in time. I would like to see the development of values over a period of time, to plot a time series for instance.
I'm thinking of adding a column 'duration' which is 0 for the first entry and then the difference to the next entry related to the same id on the same day (there might be multiple different event streams for the same id on separate days).
I would ideally want a dataframe looking something like this:
| id | device   | x  | y  | z  | timestamp                    | duration     |
|----|----------|----|----|----|------------------------------|--------------|
| 1  | device_1 | 22 | 8  | 23 | 2020-10-30T16:00:00.000+0000 | 00:00:00.000 |
| 1  | device_1 | 21 | 88 | 65 | 2020-10-30T16:01:00.000+0000 | 00:01:00.000 |
| 1  | device_1 | 33 | 34 | 64 | 2020-10-30T16:02:00.000+0000 | 00:02:00.000 |
| 2  | device_2 | 12 | 6  | 97 | 2019-11-30T13:00:00.000+0000 | 00:00:00.000 |
| 2  | device_2 | 44 | 77 | 13 | 2019-11-30T13:00:30.000+0000 | 00:00:30.000 |
| 1  | device_1 | 22 | 11 | 30 | 2022-10-30T08:00:00.000+0000 | 00:00:00.000 |
| 1  | device_1 | 22 | 11 | 30 | 2022-10-30T08:01:00.000+0000 | 00:01:00.000 |
I have no idea where to begin in order to achieve this so either a good explanation or a code example would be very helpful!
Any other suggestions on how to be able to plot development over time (in general not related to a specific date or time of the day) based on this dataframe are also very welcome.
Note: It has to be in PySpark (not pandas) since the dataset is extremely large.
You will need to use window functions (functions that work inside partitions created with an over clause). The code below does the same thing as the other answer, but I wanted to show a more streamlined version, fully in PySpark, as opposed to PySpark + SQL with subqueries.
Initially, the "duration" column will be of interval type, so it is up to you to transform it into whatever data type you need. I have just extracted the interval using regexp_extract, which stores it as a string.
Input (I assume your "timestamp" column is of type timestamp):
from pyspark.sql import functions as F, Window as W
df = spark.createDataFrame(
[(1, 'device_1', 22, 8, 23, '2020-10-30T16:00:00.000+0000'),
(1, 'device_1', 21, 88, 65, '2020-10-30T16:01:00.000+0000'),
(1, 'device_1', 33, 34, 64, '2020-10-30T16:02:00.000+0000'),
(2, 'device_2', 12, 6, 97, '2019-11-30T13:00:00.000+0000') ,
(2, 'device_2', 44, 77, 13, '2019-11-30T13:00:30.000+0000'),
(1, 'device_1', 22, 11, 30, '2022-10-30T08:00:00.000+0000'),
(1, 'device_1', 22, 11, 30, '2022-10-30T08:01:00.000+0000')],
["id", "device", "x", "y", "z", "timestamp"]
).withColumn("timestamp", F.to_timestamp("timestamp"))
Script:
w = W.partitionBy('id', F.to_date('timestamp')).orderBy('timestamp')
df = df.withColumn('duration', F.col('timestamp') - F.min('timestamp').over(w))
df = df.withColumn('duration', F.regexp_extract('duration', r'\d\d:\d\d:\d\d', 0))
df.show(truncate=0)
# +---+--------+---+---+---+-------------------+--------+
# |id |device |x |y |z |timestamp |duration|
# +---+--------+---+---+---+-------------------+--------+
# |1 |device_1|22 |8 |23 |2020-10-30 16:00:00|00:00:00|
# |1 |device_1|21 |88 |65 |2020-10-30 16:01:00|00:01:00|
# |1 |device_1|33 |34 |64 |2020-10-30 16:02:00|00:02:00|
# |1 |device_1|22 |11 |30 |2022-10-30 08:00:00|00:00:00|
# |1 |device_1|22 |11 |30 |2022-10-30 08:01:00|00:01:00|
# |2 |device_2|12 |6 |97 |2019-11-30 13:00:00|00:00:00|
# |2 |device_2|44 |77 |13 |2019-11-30 13:00:30|00:00:30|
# +---+--------+---+---+---+-------------------+--------+
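If a numeric duration is more convenient than the extracted string, one possible alternative (a sketch using the same window w as above) is to subtract Unix timestamps, which gives the elapsed seconds as a long:
# Elapsed seconds since the first event of the (id, day) partition.
df = df.withColumn(
    "duration_seconds",
    F.unix_timestamp("timestamp") - F.unix_timestamp(F.min("timestamp").over(w))
)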
Your problem can be resolved using window functions as below:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
[
(1,'device_1',22,8,23,'2020-10-30T16:00:00.000+0000'),
(1,'device_1',21,88,65,'2020-10-30T16:01:00.000+0000'),
(1,'device_1',33,34,64,'2020-10-30T16:02:00.000+0000'),
(2,'device_2',12,6,97,'2019-11-30T13:00:00.000+0000') ,
(2,'device_2',44,77,13,'2019-11-30T13:00:00.000+0000'),
(1,'device_1',22,11,30,'2022-10-30T08:00:00.000+0000'),
(1,'device_1',22,11,30,'2022-10-30T08:01:00.000+0000')
],
("id", "device_name", "x","y","z","timestmp"))
df.show(5, False)
+---+-----------+---+---+---+----------------------------+
|id |device_name|x |y |z |timestmp |
+---+-----------+---+---+---+----------------------------+
|1 |device_1 |22 |8 |23 |2020-10-30T16:00:00.000+0000|
|1 |device_1 |21 |88 |65 |2020-10-30T16:01:00.000+0000|
|1 |device_1 |33 |34 |64 |2020-10-30T16:02:00.000+0000|
|2 |device_2 |12 |6 |97 |2019-11-30T13:00:00.000+0000|
|2 |device_2 |44 |77 |13 |2019-11-30T13:00:00.000+0000|
+---+-----------+---+---+---+----------------------------+
from pyspark.sql.functions import *
df_1 = df.withColumn("timestmp_t", to_timestamp(col("timestmp")))
df_1 = df_1.withColumn("date_t", to_date(substring(col("timestmp"), 1, 10)))
df_1.show(5)
+---+-----------+---+---+---+--------------------+-------------------+----------+
| id|device_name| x| y| z| timestmp| timestmp_t| date_t|
+---+-----------+---+---+---+--------------------+-------------------+----------+
| 1| device_1| 22| 8| 23|2020-10-30T16:00:...|2020-10-30 16:00:00|2020-10-30|
| 1| device_1| 21| 88| 65|2020-10-30T16:01:...|2020-10-30 16:01:00|2020-10-30|
| 1| device_1| 33| 34| 64|2020-10-30T16:02:...|2020-10-30 16:02:00|2020-10-30|
| 2| device_2| 12| 6| 97|2019-11-30T13:00:...|2019-11-30 13:00:00|2019-11-30|
| 2| device_2| 44| 77| 13|2019-11-30T13:00:...|2019-11-30 13:00:00|2019-11-30|
+---+-----------+---+---+---+--------------------+-------------------+----------+
df_1.createOrReplaceTempView("tmp_table")
spark.sql("""
select t.*, (timestmp_t - min) as duration from (
SELECT id, device_name, date_t, timestmp_t, MIN(timestmp_t) OVER (PARTITION BY id, date_t ORDER BY timestmp_t) AS min
FROM tmp_table) as t
""").show(5, False)
+---+-----------+----------+-------------------+-------------------+-----------------------------------+
|id |device_name|date_t |timestmp_t |min |duration |
+---+-----------+----------+-------------------+-------------------+-----------------------------------+
|1 |device_1 |2020-10-30|2020-10-30 16:00:00|2020-10-30 16:00:00|INTERVAL '0 00:00:00' DAY TO SECOND|
|1 |device_1 |2020-10-30|2020-10-30 16:01:00|2020-10-30 16:00:00|INTERVAL '0 00:01:00' DAY TO SECOND|
|1 |device_1 |2020-10-30|2020-10-30 16:02:00|2020-10-30 16:00:00|INTERVAL '0 00:02:00' DAY TO SECOND|
|1 |device_1 |2022-10-30|2022-10-30 08:00:00|2022-10-30 08:00:00|INTERVAL '0 00:00:00' DAY TO SECOND|
|1 |device_1 |2022-10-30|2022-10-30 08:01:00|2022-10-30 08:00:00|INTERVAL '0 00:01:00' DAY TO SECOND|
+---+-----------+----------+-------------------+-------------------+-----------------------------------+
I am having trouble using the agg function and renaming the results properly. So far I have made a table in the following format:
| sheet | equipment | chamber | time | value1 | value2 |
|-------|-----------|---------|------|--------|--------|
| a     | E1        | C1      | 1    | 11     | 21     |
| a     | E1        | C1      | 2    | 12     | 22     |
| a     | E1        | C1      | 3    | 13     | 23     |
| b     | E1        | C1      | 1    | 14     | 24     |
| b     | E1        | C1      | 2    | 15     | 25     |
| b     | E1        | C1      | 3    | 16     | 26     |
I would like to create a statistical table like this:
| sheet | E1_C1_value1_mean | E1_C1_value1_min | E1_C1_value1_max | E1_C1_value2_mean | E1_C1_value2_min | E1_C1_value2_max |
|-------|-------------------|------------------|------------------|-------------------|------------------|------------------|
| a     | 12                | 11               | 13               | 22                | 21               | 23               |
| b     | 15                | 14               | 16               | 25                | 24               | 26               |
I would like to groupBy "sheet", "equipment", and "chamber" and aggregate the mean, min, and max values.
I also need to rename the columns by the rule: equipment + chamber + aggregation function.
There are multiple "equipment" and "chamber" names.
Since pivot in Spark only accepts a single column, you have to concatenate the columns that you want to pivot on:
df = spark.createDataFrame(
[
('a', 'E1', 'C1', 1, 11, 21),
('a', 'E1', 'C1', 2, 12, 22),
('a', 'E1', 'C1', 3, 13, 23),
('b', 'E1', 'C1', 1, 14, 24),
('b', 'E1', 'C1', 2, 15, 25),
('b', 'E1', 'C1', 3, 16, 26),
],
schema=['sheet', 'equipment', 'chamber', 'time', 'value1', 'value2']
)
df.printSchema()
df.show(10, False)
+-----+---------+-------+----+------+------+
|sheet|equipment|chamber|time|value1|value2|
+-----+---------+-------+----+------+------+
|a |E1 |C1 |1 |11 |21 |
|a |E1 |C1 |2 |12 |22 |
|a |E1 |C1 |3 |13 |23 |
|b |E1 |C1 |1 |14 |24 |
|b |E1 |C1 |2 |15 |25 |
|b |E1 |C1 |3 |16 |26 |
+-----+---------+-------+----+------+------+
Assuming there are lots of columns that you want to aggregate, you can build the aggregation list in a loop to avoid bulky code:
from pyspark.sql import functions as func

aggregation = []
for col in df.columns[-2:]:
    aggregation += [
        func.min(col).alias(f"{col}_min"),
        func.max(col).alias(f"{col}_max"),
        func.avg(col).alias(f"{col}_mean"),
    ]
df.withColumn('new_col', func.concat_ws('_', func.col('equipment'), func.col('chamber')))\
.groupby('sheet')\
.pivot('new_col')\
.agg(*aggregation)\
.orderBy('sheet')\
.show(100, False)
+-----+----------------+----------------+-----------------+----------------+----------------+-----------------+
|sheet|E1_C1_value1_min|E1_C1_value1_max|E1_C1_value1_mean|E1_C1_value2_min|E1_C1_value2_max|E1_C1_value2_mean|
+-----+----------------+----------------+-----------------+----------------+----------------+-----------------+
|a |11 |13 |12.0 |21 |23 |22.0 |
|b |14 |16 |15.0 |24 |26 |25.0 |
+-----+----------------+----------------+-----------------+----------------+----------------+-----------------+
First, create a single column out of the columns that you want to pivot on.
Then, pivot and aggregate as usual.
Input dataframe:
from pyspark.sql import functions as F
df = spark.createDataFrame(
[('a', 'E1', 'C1', 1, 11, 21),
('a', 'E1', 'C1', 2, 12, 22),
('a', 'E1', 'C1', 3, 13, 23),
('b', 'E1', 'C1', 1, 14, 24),
('b', 'E1', 'C1', 2, 15, 25),
('b', 'E1', 'C1', 3, 16, 26)],
['sheet', 'equipment', 'chamber', 'time', 'value1', 'value2'])
Script:
df = df.withColumn('_temp', F.concat_ws('_', 'equipment', 'chamber'))
df = (df
.groupBy('sheet')
.pivot('_temp')
.agg(
F.mean('value1').alias('value1_mean'),
F.min('value1').alias('value1_min'),
F.max('value1').alias('value1_max'),
F.mean('value2').alias('value2_mean'),
F.min('value2').alias('value2_min'),
F.max('value2').alias('value2_max'),
)
)
df.show()
# +-----+-----------------+----------------+----------------+-----------------+----------------+----------------+
# |sheet|E1_C1_value1_mean|E1_C1_value1_min|E1_C1_value1_max|E1_C1_value2_mean|E1_C1_value2_min|E1_C1_value2_max|
# +-----+-----------------+----------------+----------------+-----------------+----------------+----------------+
# | b| 15.0| 14| 16| 25.0| 24| 26|
# | a| 12.0| 11| 13| 22.0| 21| 23|
# +-----+-----------------+----------------+----------------+-----------------+----------------+----------------+
I have a Spark dataframe like this:
+------+---------+---------+---------+---------+
| name | metric1 | metric2 | metric3 | metric4 |
+------+---------+---------+---------+---------+
| a | 1 | 2 | 3 | 4 |
| b | 1 | 2 | 3 | 4 |
| c | 3 | 1 | 5 | 4 |
| a | 3 | 3 | 3 | 3 |
+------+---------+---------+---------+---------+
For any duplicate names that appear, I want to replace the multiple rows with a single row containing nulls, so the desired output is:
+------+---------+---------+---------+---------+
| name | metric1 | metric2 | metric3 | metric4 |
+------+---------+---------+---------+---------+
| a | null | null | null | null |
| b | 1 | 2 | 3 | 4 |
| c | 3 | 1 | 5 | 4 |
+------+---------+---------+---------+---------+
The following works:
import org.apache.spark.sql.functions._
val df = Seq(
("a", 1, 2, 3, 4), ("b", 1, 2, 3, 4), ("c", 3, 1, 5, 4), ("a", 3, 3, 3, 3)
).toDF("name", "metric1", "metric2", "metric3", "metric4")
val newDf = df
.groupBy(col("name"))
.agg(
min(col("metric1")).as("metric1"),
min(col("metric2")).as("metric2"),
min(col("metric3")).as("metric3"),
min(col("metric4")).as("metric4"),
count(col("name")).as("NumRecords")
)
.withColumn("metric1", when(col("NumRecords") !== 1, lit(null)).otherwise(col("metric1")))
.withColumn("metric2", when(col("NumRecords") !== 1, lit(null)).otherwise(col("metric2")))
.withColumn("metric3", when(col("NumRecords") !== 1, lit(null)).otherwise(col("metric3")))
.withColumn("metric4", when(col("NumRecords") !== 1, lit(null)).otherwise(col("metric4")))
.drop("NumRecords")
but surely there has got to be a better way...
scala> val df = Seq(("a", 1, 2, 3, 4), ("b", 1, 2, 3, 4), ("c", 3, 1, 5, 4), ("a", 3, 3, 3, 3)).toDF("name", "metric1", "metric2", "metric3", "metric4")
scala> val newDf = df.groupBy(col("name")).agg(min(col("metric1")).as("metric1"),min(col("metric2")).as("metric2"),min(col("metric3")).as("metric3"),min(col("metric4")).as("metric4"),count(col("name")).as("NumRecords"))
scala> val colArr2 = df.columns.diff(Array("name"))
scala> val reqDF = colArr2.foldLeft(newDf){
(df,colName)=>
df.withColumn(colName,when(col("NumRecords") =!= "1",lit(null)).otherwise(col(colName)))
}.drop("NumRecords")
scala> reqDF.show
+----+-------+-------+-------+-------+
|name|metric1|metric2|metric3|metric4|
+----+-------+-------+-------+-------+
| c| 3| 1| 5| 4|
| b| 1| 2| 3| 4|
| a| null| null| null| null|
+----+-------+-------+-------+-------+
Please try the approach above.
Below is the sales data available to calculate max_price.
Logic for max_price: max(last 3 weeks' price). For the first 3 weeks, where the last weeks' data is not available, the max price will be max of (week 1, week 2, week 3), i.e. in the example below max of (rank 5, 6, 7).
How can the same be implemented using window functions in Spark?
Here is a solution using a PySpark Window with lead and a UDF.
Please note that I changed the rank 5, 6, 7 prices to 1, 2, 3 to differentiate them from the other values and to show that this logic picks what you explained.
from pyspark.sql.functions import array, coalesce, col, lead, udf
from pyspark.sql.types import IntegerType
from pyspark.sql.window import Window

max_price_udf = udf(lambda prices_list: max(prices_list), IntegerType())
df = spark.createDataFrame([(1, 5, 2019, 1, 20), (2, 4, 2019, 2, 18),
                            (3, 3, 2019, 3, 21), (4, 2, 2019, 4, 20),
                            (5, 1, 2019, 5, 1), (6, 52, 2018, 6, 2),
                            (7, 51, 2018, 7, 3)], ["product_id", "week", "year", "rank", "price"])
window = Window.orderBy(col("year").desc(), col("week").desc())
df = df.withColumn("prices_list", array([coalesce(lead(col("price"), x, None).over(window),
                                                  lead(col("price"), x - 3, None).over(window))
                                         for x in range(1, 4)]))
df = df.withColumn("max_price", max_price_udf(col("prices_list")))
df.show()
df.show()
which results
+----------+----+----+----+-----+------------+---------+
|product_id|week|year|rank|price| prices_list|max_price|
+----------+----+----+----+-----+------------+---------+
| 1| 5|2019| 1| 20|[18, 21, 20]| 21|
| 2| 4|2019| 2| 18| [21, 20, 1]| 21|
| 3| 3|2019| 3| 21| [20, 1, 2]| 20|
| 4| 2|2019| 4| 20| [1, 2, 3]| 3|
| 5| 1|2019| 5| 1| [2, 3, 1]| 3|
| 6| 52|2018| 6| 2| [3, 1, 2]| 3|
| 7| 51|2018| 7| 3| [1, 2, 3]| 3|
+----------+----+----+----+-----+------------+---------+
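As a side note, on Spark 2.4 or later you could likely replace the Python UDF with the built-in array_max on the same prices_list column, avoiding the UDF serialization overhead; a minimal sketch under that assumption:
from pyspark.sql.functions import array_max, col

# array_max returns the largest element of the prices_list array column.
df = df.withColumn("max_price", array_max(col("prices_list")))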
Here is the solution in Scala:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

var df = Seq((1, 5, 2019, 1, 20), (2, 4, 2019, 2, 18),
  (3, 3, 2019, 3, 21), (4, 2, 2019, 4, 20),
  (5, 1, 2019, 5, 1), (6, 52, 2018, 6, 2),
  (7, 51, 2018, 7, 3)).toDF("product_id", "week", "year", "rank", "price")
val window = Window.orderBy($"year".desc, $"week".desc)
df = df.withColumn("max_price", greatest((for (x <- 1 to 3) yield coalesce(lead(col("price"), x, null).over(window), lead(col("price"), x - 3, null).over(window))): _*))
df.show()
You can use SQL window functions combined with greatest(). When the window has fewer than 3 following rows, you need to consider the current row and even prior rows, which is why lag1_price and lag2_price are calculated in the inner sub-query. In the outer query, you check the count_row value against 2, 1 and 0 and pass the appropriate combination of the lag1, lag2 and current prices to greatest() to get the maximum value.
Check this out:
val df = Seq((1, 5, 2019,1,20),(2, 4, 2019,2,18),
(3, 3, 2019,3,21),(4, 2, 2019,4,20),
(5, 1, 2019,5,1),(6, 52, 2018,6,2),
(7, 51, 2018,7,3)).toDF("product_id", "week", "year","rank","price")
df.createOrReplaceTempView("sales")
val df2 = spark.sql("""
select product_id, week, year, price,
count(*) over(order by year desc, week desc rows between 1 following and 3 following ) as count_row,
lag(price) over(order by year desc, week desc ) as lag1_price,
sum(price) over(order by year desc, week desc rows between 2 preceding and 2 preceding ) as lag2_price,
max(price) over(order by year desc, week desc rows between 1 following and 3 following ) as max_price1 from sales
""")
df2.show(false)
df2.createOrReplaceTempView("sales_inner")
spark.sql("""
select product_id, week, year, price,
case
when count_row=2 then greatest(price,max_price1)
when count_row=1 then greatest(price,lag1_price,max_price1)
when count_row=0 then greatest(price,lag1_price,lag2_price)
else max_price1
end as max_price
from sales_inner
""").show(false)
Results:
+----------+----+----+-----+---------+----------+----------+----------+
|product_id|week|year|price|count_row|lag1_price|lag2_price|max_price1|
+----------+----+----+-----+---------+----------+----------+----------+
|1 |5 |2019|20 |3 |null |null |21 |
|2 |4 |2019|18 |3 |20 |null |21 |
|3 |3 |2019|21 |3 |18 |20 |20 |
|4 |2 |2019|20 |3 |21 |18 |3 |
|5 |1 |2019|1 |2 |20 |21 |3 |
|6 |52 |2018|2 |1 |1 |20 |3 |
|7 |51 |2018|3 |0 |2 |1 |null |
+----------+----+----+-----+---------+----------+----------+----------+
+----------+----+----+-----+---------+
|product_id|week|year|price|max_price|
+----------+----+----+-----+---------+
|1 |5 |2019|20 |21 |
|2 |4 |2019|18 |21 |
|3 |3 |2019|21 |20 |
|4 |2 |2019|20 |3 |
|5 |1 |2019|1 |3 |
|6 |52 |2018|2 |3 |
|7 |51 |2018|3 |3 |
+----------+----+----+-----+---------+
I have a database with visit times stored as timestamps, like this:
ID, time
1, 1493596800
1, 1493596900
1, 1493432800
2, 1493596800
2, 1493596850
2, 1493432800
I use Spark SQL and I need to get the longest sequence of consecutive dates for each ID, like:
ID, longest_seq (days)
1, 2
2, 5
3, 1
I tried to adapt this answer, Detect consecutive dates ranges using SQL, to my case, but I didn't manage to get what I expect.
SELECT ID, MIN (d), MAX(d)
FROM (
SELECT ID, cast(from_utc_timestamp(cast(time as timestamp), 'CEST') as date) AS d,
ROW_NUMBER() OVER(
PARTITION BY ID ORDER BY cast(from_utc_timestamp(cast(time as timestamp), 'CEST')
as date)) rn
FROM purchase
where ID is not null
GROUP BY ID, cast(from_utc_timestamp(cast(time as timestamp), 'CEST') as date)
)
GROUP BY ID, rn
ORDER BY ID
If someone has a clue on how to fix this request, or what's wrong with it, I would appreciate the help.
Thanks
[EDIT] A more explicit input/output:
ID, time
1, 1
1, 2
1, 3
2, 1
2, 3
2, 4
2, 5
2, 10
2, 11
3, 1
3, 4
3, 9
3, 11
The result would be :
ID, MaxSeq (in days)
1,3
2,3
3,1
All the visits are timestamps, but I need consecutive days, so each visit is counted once per day.
My answer below is adapted from https://dzone.com/articles/how-to-find-the-longest-consecutive-series-of-even for use in Spark SQL. You'll have to wrap the SQL queries with:
spark.sql("""
SQL_QUERY
""")
So, for the first query:
CREATE TABLE intermediate_1 AS
SELECT
id,
time,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY time) AS rn,
time - ROW_NUMBER() OVER (PARTITION BY id ORDER BY time) AS grp
FROM purchase
This will give you:
id, time, rn, grp
1, 1, 1, 0
1, 2, 2, 0
1, 3, 3, 0
2, 1, 1, 0
2, 3, 2, 1
2, 4, 3, 1
2, 5, 4, 1
2, 10, 5, 5
2, 11, 6, 5
3, 1, 1, 0
3, 4, 2, 2
3, 9, 3, 6
3, 11, 4, 7
We can see that the consecutive rows have the same grp value. Then we will use GROUP BY and COUNT to get the number of consecutive time.
CREATE TABLE intermediate_2 AS
SELECT
id,
grp,
COUNT(*) AS num_consecutive
FROM intermediate_1
GROUP BY id, grp
This will return:
id, grp, num_consecutive
1, 0, 3
2, 0, 1
2, 1, 3
2, 5, 2
3, 0, 1
3, 2, 1
3, 6, 1
3, 7, 1
Now we just use MAX and GROUP BY to get the max number of consecutive time.
CREATE TABLE final AS
SELECT
id,
MAX(num_consecutive) as max_consecutive
FROM intermediate_2
GROUP BY id
Which will give you:
id, max_consecutive
1, 3
2, 3
3, 1
Hope this helps!
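If your real time column holds epoch seconds, as in the original question, a hedged preprocessing sketch (assuming a purchase DataFrame registered as the purchase view) to collapse the visits to one integer day per ID before running the queries above could look like:
from pyspark.sql import functions as F

# Hypothetical preprocessing: convert epoch seconds to an integer day number
# and keep one row per (ID, day) so each day is counted once.
purchase = (purchase
    .withColumn("time", F.datediff(F.to_date(F.from_unixtime("time")),
                                   F.to_date(F.lit("1970-01-01"))))
    .select("ID", "time")
    .distinct())
purchase.createOrReplaceTempView("purchase")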
That's the case for my beloved window aggregate functions!
I think the following example could help you out (at least to get started).
The following is the dataset I use. I translated your time (in longs) to a numeric time denoting the day (and to avoid messing around with timestamps in Spark SQL, which could make the solution harder to comprehend... possibly).
In the visits dataset below, the time column represents days, so values one apart represent consecutive days.
scala> visits.show
+---+----+
| ID|time|
+---+----+
| 1| 1|
| 1| 1|
| 1| 2|
| 1| 3|
| 1| 3|
| 1| 3|
| 2| 1|
| 3| 1|
| 3| 2|
| 3| 2|
+---+----+
Let's define the window specification to group id rows together.
import org.apache.spark.sql.expressions.Window
val idsSortedByTime = Window.
partitionBy("id").
orderBy("time")
With that you rank the rows and count rows with the same rank.
val answer = visits.
select($"id", $"time", rank over idsSortedByTime as "rank").
groupBy("id", "time", "rank").
agg(count("*") as "count")
scala> answer.show
+---+----+----+-----+
| id|time|rank|count|
+---+----+----+-----+
| 1| 1| 1| 2|
| 1| 2| 3| 1|
| 1| 3| 4| 3|
| 3| 1| 1| 1|
| 3| 2| 2| 2|
| 2| 1| 1| 1|
+---+----+----+-----+
That appears to be (very close to?) a solution. You seem done!
Using spark.sql and intermediate tables:
scala> val df = Seq((1, 1),(1, 2),(1, 3),(2, 1),(2, 3),(2, 4),(2, 5),(2, 10),(2, 11),(3, 1),(3, 4),(3, 9),(3, 11)).toDF("id","time")
df: org.apache.spark.sql.DataFrame = [id: int, time: int]
scala> df.createOrReplaceTempView("tb1")
scala> spark.sql(""" with tb2(select id,time, time-row_number() over(partition by id order by time) rw1 from tb1), tb3(select id,count(rw1) rw2 from tb2 group by id,rw1) select id, rw2 from tb3 where (id,rw2) in (select id,max(rw2) from tb3 group by id) group by id, rw2 """).show(false)
+---+---+
|id |rw2|
+---+---+
|1 |3 |
|3 |1 |
|2 |3 |
+---+---+
scala>
Solution using the DataFrame API:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._
val df1 = Seq((1, 1),(1, 2),(1, 3),(2, 1),(2, 3),(2, 4),(2, 5),(2, 10),(2, 11),(3, 1),(3, 4),(3, 9),(3, 11)).toDF("ID","time")
df1.show(false)
df1.printSchema()
val w = Window.partitionBy("ID").orderBy("time")
val df2 = df1.withColumn("rank", col("time") - row_number().over(w))
.groupBy("ID", "rank")
.agg(count("rank").alias("count"))
.groupBy("ID")
.agg(max("count").alias("time"))
.orderBy("ID")
df2.show(false)
Console output:
+---+----+
|ID |time|
+---+----+
|1 |1 |
|1 |2 |
|1 |3 |
|2 |1 |
|2 |3 |
|2 |4 |
|2 |5 |
|2 |10 |
|2 |11 |
|3 |1 |
|3 |4 |
|3 |9 |
|3 |11 |
+---+----+
root
|-- ID: integer (nullable = false)
|-- time: integer (nullable = false)
+---+----+
|ID |time|
+---+----+
|1 |3 |
|2 |3 |
|3 |1 |
+---+----+