Below is the sales data available to calculate max_price.
Logic for max_price:
max(prices of the last 3 weeks)
For the first 3 weeks, where the previous weeks' data is not available, max_price is
max(week 1, week 2, week 3),
which in the example below is max(rank 5, 6, 7).
How can I implement this using a window function in Spark?
Here is a solution using a PySpark Window with lead and a udf.
Note that I changed the prices for ranks 5, 6 and 7 to 1, 2 and 3 so that they stand out from the other values and make it clear that this logic picks exactly what you described.
from pyspark.sql.functions import array, coalesce, col, lead, udf
from pyspark.sql.types import IntegerType
from pyspark.sql.window import Window

max_price_udf = udf(lambda prices_list: max(prices_list), IntegerType())
df = spark.createDataFrame([(1, 5, 2019, 1, 20), (2, 4, 2019, 2, 18),
                            (3, 3, 2019, 3, 21), (4, 2, 2019, 4, 20),
                            (5, 1, 2019, 5, 1), (6, 52, 2018, 6, 2),
                            (7, 51, 2018, 7, 3)], ["product_id", "week", "year", "rank", "price"])
window = Window.orderBy(col("year").desc(), col("week").desc())
# Collect the prices of the next 3 rows (the previous 3 weeks); for the oldest rows,
# fall back to lead(x - 3), i.e. the first 3 weeks of data.
df = df.withColumn("prices_list", array([coalesce(lead(col("price"), x, None).over(window),
                                                  lead(col("price"), x - 3, None).over(window))
                                         for x in range(1, 4)]))
df = df.withColumn("max_price", max_price_udf(col("prices_list")))
df.show()
which results in:
+----------+----+----+----+-----+------------+---------+
|product_id|week|year|rank|price| prices_list|max_price|
+----------+----+----+----+-----+------------+---------+
| 1| 5|2019| 1| 20|[18, 21, 20]| 21|
| 2| 4|2019| 2| 18| [21, 20, 1]| 21|
| 3| 3|2019| 3| 21| [20, 1, 2]| 20|
| 4| 2|2019| 4| 20| [1, 2, 3]| 3|
| 5| 1|2019| 5| 1| [2, 3, 1]| 3|
| 6| 52|2018| 6| 2| [3, 1, 2]| 3|
| 7| 51|2018| 7| 3| [1, 2, 3]| 3|
+----------+----+----+----+-----+------------+---------+
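If you would rather avoid the Python UDF, the same pick-and-fallback logic can be written with the built-in greatest() (mirroring the Scala version below). A minimal sketch, reusing the df and window defined above and writing to a hypothetical max_price_no_udf column:
from pyspark.sql.functions import coalesce, col, greatest, lead

# greatest() over the three lead/fallback columns replaces the array + UDF combination.
df = df.withColumn("max_price_no_udf",
                   greatest(*[coalesce(lead(col("price"), x).over(window),
                                       lead(col("price"), x - 3).over(window))
                              for x in range(1, 4)]))
df.show()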
Here is the solution in Scala:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._
var df = Seq((1, 5, 2019, 1, 20), (2, 4, 2019, 2, 18),
  (3, 3, 2019, 3, 21), (4, 2, 2019, 4, 20),
  (5, 1, 2019, 5, 1), (6, 52, 2018, 6, 2),
  (7, 51, 2018, 7, 3)).toDF("product_id", "week", "year", "rank", "price")
val window = Window.orderBy($"year".desc, $"week".desc)
// greatest() over the three lead/fallback columns replaces the array + UDF combination
df = df.withColumn("max_price", greatest(
  (for (x <- 1 to 3) yield coalesce(lead(col("price"), x, null).over(window),
                                    lead(col("price"), x - 3, null).over(window))): _*))
df.show()
You can use SQL window functions combined with greatest(). When the window has fewer than 3 following rows, you have to fall back on the current row and even prior rows, which is why lag1_price and lag2_price are calculated in the inner sub-query. In the outer query you check count_row (the number of following rows: 2, 1 or 0) and call greatest() with the corresponding combination of the current price, lag1_price, lag2_price and max_price1 to get the maximum.
Check this out:
val df = Seq((1, 5, 2019,1,20),(2, 4, 2019,2,18),
(3, 3, 2019,3,21),(4, 2, 2019,4,20),
(5, 1, 2019,5,1),(6, 52, 2018,6,2),
(7, 51, 2018,7,3)).toDF("product_id", "week", "year","rank","price")
df.createOrReplaceTempView("sales")
val df2 = spark.sql("""
select product_id, week, year, price,
count(*) over(order by year desc, week desc rows between 1 following and 3 following ) as count_row,
lag(price) over(order by year desc, week desc ) as lag1_price,
sum(price) over(order by year desc, week desc rows between 2 preceding and 2 preceding ) as lag2_price,
max(price) over(order by year desc, week desc rows between 1 following and 3 following ) as max_price1 from sales
""")
df2.show(false)
df2.createOrReplaceTempView("sales_inner")
spark.sql("""
select product_id, week, year, price,
case
when count_row=2 then greatest(price,max_price1)
when count_row=1 then greatest(price,lag1_price,max_price1)
when count_row=0 then greatest(price,lag1_price,lag2_price)
else max_price1
end as max_price
from sales_inner
""").show(false)
Results:
+----------+----+----+-----+---------+----------+----------+----------+
|product_id|week|year|price|count_row|lag1_price|lag2_price|max_price1|
+----------+----+----+-----+---------+----------+----------+----------+
|1 |5 |2019|20 |3 |null |null |21 |
|2 |4 |2019|18 |3 |20 |null |21 |
|3 |3 |2019|21 |3 |18 |20 |20 |
|4 |2 |2019|20 |3 |21 |18 |3 |
|5 |1 |2019|1 |2 |20 |21 |3 |
|6 |52 |2018|2 |1 |1 |20 |3 |
|7 |51 |2018|3 |0 |2 |1 |null |
+----------+----+----+-----+---------+----------+----------+----------+
+----------+----+----+-----+---------+
|product_id|week|year|price|max_price|
+----------+----+----+-----+---------+
|1 |5 |2019|20 |21 |
|2 |4 |2019|18 |21 |
|3 |3 |2019|21 |20 |
|4 |2 |2019|20 |3 |
|5 |1 |2019|1 |3 |
|6 |52 |2018|2 |3 |
|7 |51 |2018|3 |3 |
+----------+----+----+-----+---------+
Related
This question already has answers here:
Find maximum row per group in Spark DataFrame
I have a dataframe:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

data = [('I ran home', 3, 1, 10),
        ('I went home', 3, 1, 11),
        ('I looked at the cat', 4, 2, 19),
        ('The cat looked at me', 5, 3, 20),
        ('I ran home', 3, 4, 10),
        ('I went homes', 3, 4, 12)]
schema = StructType([
    StructField("text", StringType(), True),
    StructField("word_count", IntegerType(), True),
    StructField("group", IntegerType(), True),
    StructField("len_text", IntegerType(), True)])
df = spark.createDataFrame(data=data, schema=schema)
df.show(truncate=False)
+--------------------+----------+-----+--------+
|text |word_count|group|len_text|
+--------------------+----------+-----+--------+
|I ran home |3 |1 |10 |
|I went home |3 |1 |11 |
|I looked at the cat |4 |2 |19 |
|The cat looked at me|5 |3 |20 |
|I ran home |3 |4 |10 |
|I went homes |3 |4 |12 |
+--------------------+----------+-----+--------+
I want to filter rows with two conditions: where the values in the word_count column are the same and the value in the len_text column is greater than that of the next row, keep the row with the greater value. In pandas I can do this with idxmax():
df1 = df.loc[df.groupby('group')['len_text'].idxmax()]
Is there an analogue for PySpark? I want this result:
+--------------------+----------+-----+--------+
|text |word_count|group|len_text|
+--------------------+----------+-----+--------+
|I went home |3 |1 |11 |
|I looked at the cat |4 |2 |19 |
|The cat looked at me|5 |3 |20 |
|I went homes |3 |4 |12 |
+--------------------+----------+-----+--------+
You can use window functions, e.g. row_number:
from pyspark.sql import functions as F, Window as W
w = W.partitionBy('group').orderBy(F.desc('len_text'))
df = df.withColumn('_rn', F.row_number().over(w))
df = df.filter('_rn=1').drop('_rn')
df.show()
# +--------------------+----------+-----+--------+
# | text|word_count|group|len_text|
# +--------------------+----------+-----+--------+
# | I went home| 3| 1| 11|
# | I looked at the cat| 4| 2| 19|
# |The cat looked at me| 5| 3| 20|
# | I went homes| 3| 4| 12|
# +--------------------+----------+-----+--------+
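If you are on Spark 3.3+ and only ever need one row per group, the built-in max_by aggregate is another option. A sketch only (not part of the answer above), assuming ties on len_text don't matter:
import pyspark.sql.functions as F

# For each group, pick the values associated with the largest len_text.
result = (df.groupBy("group")
            .agg(F.max_by("text", "len_text").alias("text"),
                 F.max_by("word_count", "len_text").alias("word_count"),
                 F.max("len_text").alias("len_text"))
            .select("text", "word_count", "group", "len_text"))
result.show(truncate=False)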
I want to compare two dataframes that have the same schema, and have a primary key column.
For each primary key, if any other column differs (there could be multiple differing columns, so I need some dynamic way to scan all other columns), I want to output the column name and the values from both dataframes.
Also, I want to output a result row if a primary key doesn't exist in the other dataframe (so a "full outer join" will be used). Here is an example:
dataframe1:
+-----------+------+------+
|primary_key|book |number|
+-----------+------+------+
|1 |book1 | 1 |
|2 |book2 | 2 |
|3 |book3 | 3 |
|4 |book4 | 4 |
+-----------+------+------+
dataframe2:
+-----------+------+------+
|primary_key|book |number|
+-----------+------+------+
|1 |book1 | 1 |
|2 |book8 | 8 |
|3 |book3 | 7 |
|5 |book5 | 5 |
+-----------+------+------+
The result would be:
+-----------+----------------+----------+----------+
|primary_key|diff_column_name|dataframe1|dataframe2|
+-----------+----------------+----------+----------+
|2          |book            |book2     |book8     |
|2          |number          |2         |8         |
|3          |number          |3         |7         |
|4          |book            |book4     |null      |
|4          |number          |4         |null      |
|5          |book            |null      |book5     |
|5          |number          |null      |5         |
+-----------+----------------+----------+----------+
I know the first step is to join both dataframes on the primary key:
// joining the two DFs on primary_key
val result = df1.as("l")
.join(df2.as("r"), "primary_key", "fullouter")
But I am not sure how to proceed. Can someone give me some advice? Thanks.
Data:
val df1 = Seq(
(1, "book1", 1), (2, "book2", 2), (3, "book3", 3), (4, "book4", 4)
).toDF("primary_key", "book", "number")
val df2 = Seq(
(1, "book1", 1), (2, "book8", 8), (3, "book3", 7), (5, "book5", 5)
).toDF("primary_key", "book", "number")
Imports:
import org.apache.spark.sql.functions._
import spark.implicits._
Define list of columns:
val cols = Seq("book", "number")
Join as you do right now:
val joined = df1.as("l").join(df2.as("r"), Seq("primary_key"), "fullouter")
Define:
val comp = explode(array(cols.map(c => struct(
lit(c).alias("diff_column_name"),
// Value left
col(s"l.${c}").cast("string").alias("dataframe1"),
// Value right
col(s"r.${c}").cast("string").alias("dataframe2"),
// Differs
not(col(s"l.${c}") <=> col(s"r.${c}")).alias("diff")
)): _*))
Select and filter:
joined
.withColumn("comp", comp)
.select($"primary_key", $"comp.*")
  // Keep only the mismatches and drop the helper diff column
.where($"diff").drop("diff")
.orderBy("primary_key").show
// +-----------+----------------+----------+----------+
// |primary_key|diff_column_name|dataframe1|dataframe2|
// +-----------+----------------+----------+----------+
// | 2| book| book2| book8|
// | 2| number| 2| 8|
// | 3| number| 3| 7|
// | 4| book| book4| null|
// | 4| number| 4| null|
// | 5| book| null| book5|
// | 5| number| null| 5|
// +-----------+----------------+----------+----------+
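For completeness, roughly the same approach in PySpark; a sketch assuming df1 and df2 are the PySpark equivalents of the DataFrames above:
import pyspark.sql.functions as F

cols = ["book", "number"]
joined = df1.alias("l").join(df2.alias("r"), ["primary_key"], "fullouter")

# One struct per compared column, then explode into one row per (key, column) pair.
comp = F.explode(F.array(*[
    F.struct(F.lit(c).alias("diff_column_name"),
             F.col(f"l.{c}").cast("string").alias("dataframe1"),
             F.col(f"r.{c}").cast("string").alias("dataframe2"),
             (~F.col(f"l.{c}").eqNullSafe(F.col(f"r.{c}"))).alias("diff"))
    for c in cols]))

(joined
    .withColumn("comp", comp)
    .select("primary_key", "comp.*")
    .where(F.col("diff")).drop("diff")
    .orderBy("primary_key")
    .show())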
I have a pyspark.rdd.PipelinedRDD (Rdd1).
When I do Rdd1.collect(), it gives a result like below.
[(10, {3: 3.616726727464709, 4: 2.9996439803387602, 5: 1.6767412921625855}),
(1, {3: 2.016527311459324, 4: -1.5271512313750577, 5: 1.9665475696370045}),
(2, {3: 6.230272144805092, 4: 4.033642544526678, 5: 3.1517805604906313}),
(3, {3: -0.3924680103722977, 4: 2.9757316477407443, 5: -1.5689126834176417})]
Now I want to convert the pyspark.rdd.PipelinedRDD to a DataFrame without using the collect() method.
My final DataFrame should be like below; df.show() should look like:
+----------+-------+-------------------+
|CId |IID |Score |
+----------+-------+-------------------+
|10 |4 |2.9996439803387602 |
|10 |5 |1.6767412921625855 |
|10 |3 |3.616726727464709 |
|1 |4 |-1.5271512313750577|
|1 |5 |1.9665475696370045 |
|1 |3 |2.016527311459324 |
|2 |4 |4.033642544526678 |
|2 |5 |3.1517805604906313 |
|2 |3 |6.230272144805092 |
|3 |4 |2.9757316477407443 |
|3 |5 |-1.5689126834176417|
|3 |3 |-0.3924680103722977|
+----------+-------+-------------------+
I can achieve this by applying collect(), iterating, and finally building a DataFrame,
but now I want to convert the pyspark.rdd.PipelinedRDD to a DataFrame without using any collect() call.
Please let me know how to achieve this.
You want to do two things here:
1. flatten your data
2. put it into a dataframe
One way to do it is as follows:
First, let us flatten the dictionary:
rdd2 = Rdd1.flatMapValues(lambda x : [ (k, x[k]) for k in x.keys()])
When collecting the data, you get something like this:
[(10, (3, 3.616726727464709)), (10, (4, 2.9996439803387602)), ...
Then we can format the data and turn it into a dataframe:
rdd2.map(lambda x : (x[0], x[1][0], x[1][1]))\
.toDF(("CId", "IID", "Score"))\
.show()
which gives you this:
+---+---+-------------------+
|CId|IID| Score|
+---+---+-------------------+
| 10| 3| 3.616726727464709|
| 10| 4| 2.9996439803387602|
| 10| 5| 1.6767412921625855|
| 1| 3| 2.016527311459324|
| 1| 4|-1.5271512313750577|
| 1| 5| 1.9665475696370045|
| 2| 3| 6.230272144805092|
| 2| 4| 4.033642544526678|
| 2| 5| 3.1517805604906313|
| 3| 3|-0.3924680103722977|
| 3| 4| 2.9757316477407443|
| 3| 5|-1.5689126834176417|
+---+---+-------------------+
There is an even easier and more elegant solution that avoids the Python lambda expressions used in @oli's answer; it relies on Spark DataFrame's explode, which fits your requirement perfectly. It should be faster too, because there is no need to use Python lambdas twice. See below:
from pyspark.sql.functions import explode
# dummy data
data = [(10, {3: 3.616726727464709, 4: 2.9996439803387602, 5: 1.6767412921625855}),
(1, {3: 2.016527311459324, 4: -1.5271512313750577, 5: 1.9665475696370045}),
(2, {3: 6.230272144805092, 4: 4.033642544526678, 5: 3.1517805604906313}),
(3, {3: -0.3924680103722977, 4: 2.9757316477407443, 5: -1.5689126834176417})]
# create your rdd
rdd = sc.parallelize(data)
# convert to spark data frame
df = rdd.toDF(["CId", "Values"])
# use explode
df.select("CId", explode("Values").alias("IID", "Score")).show()
+---+---+-------------------+
|CId|IID| Score|
+---+---+-------------------+
| 10| 3| 3.616726727464709|
| 10| 4| 2.9996439803387602|
| 10| 5| 1.6767412921625855|
| 1| 3| 2.016527311459324|
| 1| 4|-1.5271512313750577|
| 1| 5| 1.9665475696370045|
| 2| 3| 6.230272144805092|
| 2| 4| 4.033642544526678|
| 2| 5| 3.1517805604906313|
| 3| 3|-0.3924680103722977|
| 3| 4| 2.9757316477407443|
| 3| 5|-1.5689126834176417|
+---+---+-------------------+
This is how you can do it with Scala:
val Rdd1 = spark.sparkContext.parallelize(Seq(
(10, Map(3 -> 3.616726727464709, 4 -> 2.9996439803387602, 5 -> 1.6767412921625855)),
(1, Map(3 -> 2.016527311459324, 4 -> -1.5271512313750577, 5 -> 1.9665475696370045)),
(2, Map(3 -> 6.230272144805092, 4 -> 4.033642544526678, 5 -> 3.1517805604906313)),
(3, Map(3 -> -0.3924680103722977, 4 -> 2.9757316477407443, 5 -> -1.5689126834176417))
))
val x = Rdd1.flatMap(x => (x._2.map(y => (x._1, y._1, y._2))))
.toDF("CId", "IId", "score")
Output:
+---+---+-------------------+
|CId|IId|score |
+---+---+-------------------+
|10 |3 |3.616726727464709 |
|10 |4 |2.9996439803387602 |
|10 |5 |1.6767412921625855 |
|1 |3 |2.016527311459324 |
|1 |4 |-1.5271512313750577|
|1 |5 |1.9665475696370045 |
|2 |3 |6.230272144805092 |
|2 |4 |4.033642544526678 |
|2 |5 |3.1517805604906313 |
|3 |3 |-0.3924680103722977|
|3 |4 |2.9757316477407443 |
|3 |5 |-1.5689126834176417|
+---+---+-------------------+
Hopefully you can convert this to PySpark.
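For reference, a direct PySpark equivalent of that flatMap; a small sketch assuming Rdd1 and a SparkSession as in the question:
# Flatten each (CId, {IID: Score}) pair into (CId, IID, Score) tuples, then build a DataFrame.
df = (Rdd1
      .flatMap(lambda kv: [(kv[0], iid, score) for iid, score in kv[1].items()])
      .toDF(["CId", "IID", "Score"]))
df.show(truncate=False)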
Ensure a Spark session is created first:
from pyspark import SparkContext
from pyspark.sql import SparkSession

sc = SparkContext()
spark = SparkSession(sc)
I found this answer when I was trying to solve this exact issue: 'PipelinedRDD' object has no attribute 'toDF' in PySpark.
We have a dataframe like below:
+------+--------------------+
| Flag | value|
+------+--------------------+
|1 |5 |
|1 |4 |
|1 |3 |
|1 |5 |
|1 |6 |
|1 |4 |
|1 |7 |
|1 |5 |
|1 |2 |
|1 |3 |
|1 |2 |
|1 |6 |
|1 |9 |
+------+--------------------+
After a normal cumulative sum we get this:
+------+--------------------+----------+
| Flag | value|cumsum |
+------+--------------------+----------+
|1 |5 |5 |
|1 |4 |9 |
|1 |3 |12 |
|1 |5 |17 |
|1 |6 |23 |
|1 |4 |27 |
|1 |7 |34 |
|1 |5 |39 |
|1 |2 |41 |
|1 |3 |44 |
|1 |2 |46 |
|1 |6 |52 |
|1 |9 |61 |
+------+--------------------+----------+
Now what we want is for the cumulative sum to reset when a specific condition is met, e.g. when it crosses 20.
Below is the expected output:
+------+--------------------+----------+---------+
| Flag | value|cumsum |expected |
+------+--------------------+----------+---------+
|1 |5 |5 |5 |
|1 |4 |9 |9 |
|1 |3 |12 |12 |
|1 |5 |17 |17 |
|1 |6 |23 |23 |
|1 |4 |27 |4 | <-----reset
|1 |7 |34 |11 |
|1 |5 |39 |16 |
|1 |2 |41 |18 |
|1 |3 |44 |21 |
|1 |2 |46 |2 | <-----reset
|1 |6 |52 |8 |
|1 |9 |61 |17 |
+------+--------------------+----------+---------+
This is how we are calculating the cumulative sum (ordering by monotonically_increasing_id to keep the input row order):
win_counter = Window.partitionBy("flag").orderBy(F.monotonically_increasing_id())
df_partitioned = df_partitioned.withColumn("cumsum", F.sum(F.col("value")).over(win_counter))
There are two ways I've found to solve it without udf:
Dataframe
from pyspark.sql.window import Window
import pyspark.sql.functions as f
df = spark.createDataFrame([
(1, 5), (1, 4), (1, 3), (1, 5), (1, 6), (1, 4),
(1, 7), (1, 5), (1, 2), (1, 3), (1, 2), (1, 6), (1, 9)
], schema='Flag int, value int')
w = (Window
.partitionBy('flag')
.orderBy(f.monotonically_increasing_id())
.rowsBetween(Window.unboundedPreceding, Window.currentRow))
df = df.withColumn('values', f.collect_list('value').over(w))
# Fold over the collected prefix: keep adding while the running total is below 20, otherwise restart from the current element.
expr = "AGGREGATE(values, 0, (acc, el) -> IF(acc < 20, acc + el, el))"
df = df.select('Flag', 'value', f.expr(expr).alias('cumsum'))
df.show(truncate=False)
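If you are on Spark 3.1+, the same fold can also be written with the Python higher-order-function API instead of the SQL expr string; a sketch that replaces the two steps above, assuming the original df and the window w:
import pyspark.sql.functions as F

# collect_list gathers the running prefix of values; aggregate folds it with the same reset rule.
result = (df
    .withColumn('values', F.collect_list('value').over(w))
    .select('Flag', 'value',
            F.aggregate('values', F.lit(0),
                        lambda acc, el: F.when(acc < 20, acc + el).otherwise(el)).alias('cumsum')))
result.show(truncate=False)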
RDD
df = spark.createDataFrame([
(1, 5), (1, 4), (1, 3), (1, 5), (1, 6), (1, 4),
(1, 7), (1, 5), (1, 2), (1, 3), (1, 2), (1, 6), (1, 9)
], schema='Flag int, value int')
def cumsum_by_flag(rows):
cumsum, reset = 0, False
for row in rows:
if reset:
cumsum = row.value
reset = False
else:
cumsum += row.value
reset = cumsum > 20
yield row.value, cumsum
def unpack(value):
flag = value[0]
value, cumsum = value[1]
return flag, value, cumsum
rdd = df.rdd.keyBy(lambda row: row.Flag)
rdd = (rdd
.groupByKey()
.flatMapValues(cumsum_by_flag)
.map(unpack))
df = rdd.toDF('Flag int, value int, cumsum int')
df.show(truncate=False)
Output:
+----+-----+------+
|Flag|value|cumsum|
+----+-----+------+
|1 |5 |5 |
|1 |4 |9 |
|1 |3 |12 |
|1 |5 |17 |
|1 |6 |23 |
|1 |4 |4 |
|1 |7 |11 |
|1 |5 |16 |
|1 |2 |18 |
|1 |3 |21 |
|1 |2 |2 |
|1 |6 |8 |
|1 |9 |17 |
+----+-----+------+
It's probably best to do this with a pandas_udf here.
import math
import pandas as pd
import pyspark.sql.functions as F
from pyspark.sql.functions import pandas_udf, PandasUDFType
pdf = pd.DataFrame({'flag': [1]*13, 'id': range(13), 'value': [5, 4, 3, 5, 6, 4, 7, 5, 2, 3, 2, 6, 9]})
df = spark.createDataFrame(pdf)
df = df.withColumn('cumsum', F.lit(math.inf))
@pandas_udf(df.schema, PandasUDFType.GROUPED_MAP)
def _calc_cumsum(pdf):
pdf.sort_values(by=['id'], inplace=True, ascending=True)
cumsums = []
prev = None
reset = False
for v in pdf['value'].values:
if prev is None:
cumsums.append(v)
prev = v
else:
prev = prev + v if not reset else v
cumsums.append(prev)
reset = True if prev >= 20 else False
pdf['cumsum'] = cumsums
return pdf
df = df.groupby('flag').apply(_calc_cumsum)
df.show()
The results:
+----+---+-----+------+
|flag| id|value|cumsum|
+----+---+-----+------+
| 1| 0| 5| 5.0|
| 1| 1| 4| 9.0|
| 1| 2| 3| 12.0|
| 1| 3| 5| 17.0|
| 1| 4| 6| 23.0|
| 1| 5| 4| 4.0|
| 1| 6| 7| 11.0|
| 1| 7| 5| 16.0|
| 1| 8| 2| 18.0|
| 1| 9| 3| 21.0|
| 1| 10| 2| 2.0|
| 1| 11| 6| 8.0|
| 1| 12| 9| 17.0|
+----+---+-----+------+
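On Spark 3.x the same grouped logic is usually written with applyInPandas instead of the GROUPED_MAP decorator. A minimal sketch, assuming _calc_cumsum is the plain (undecorated) pandas function from above and that the output types in the schema string fit your data:
# applyInPandas takes the plain pandas function plus an explicit output schema.
df = (spark.createDataFrame(pdf)
      .groupby('flag')
      .applyInPandas(_calc_cumsum, schema='flag long, id long, value long, cumsum long'))
df.show()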
I have a database of visit times stored as timestamps, like this:
ID, time
1, 1493596800
1, 1493596900
1, 1493432800
2, 1493596800
2, 1493596850
2, 1493432800
I use Spark SQL and I need to get the longest sequence of consecutive dates for each ID, like:
ID, longest_seq (days)
1, 2
2, 5
3, 1
I tried to adapt this answer, Detect consecutive dates ranges using SQL, to my case, but I didn't manage to get what I expected.
SELECT ID, MIN (d), MAX(d)
FROM (
SELECT ID, cast(from_utc_timestamp(cast(time as timestamp), 'CEST') as date) AS d,
ROW_NUMBER() OVER(
PARTITION BY ID ORDER BY cast(from_utc_timestamp(cast(time as timestamp), 'CEST')
as date)) rn
FROM purchase
where ID is not null
GROUP BY ID, cast(from_utc_timestamp(cast(time as timestamp), 'CEST') as date)
)
GROUP BY ID, rn
ORDER BY ID
If someone has a clue on how to fix this query, or an idea of what's wrong with it, I would appreciate the help. Thanks.
[EDIT] A more explicit input/output:
ID, time
1, 1
1, 2
1, 3
2, 1
2, 3
2, 4
2, 5
2, 10
2, 11
3, 1
3, 4
3, 9
3, 11
The result would be :
ID, MaxSeq (in days)
1,3
2,3
3,1
All the visits are timestamps, but I need consecutive days, so each day with at least one visit is counted once.
My answer below is adapted from https://dzone.com/articles/how-to-find-the-longest-consecutive-series-of-even for use in Spark SQL. You'll have to wrap the SQL queries with:
spark.sql("""
SQL_QUERY
""")
So, for the first query:
CREATE TABLE intermediate_1 AS
SELECT
id,
time,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY time) AS rn,
time - ROW_NUMBER() OVER (PARTITION BY id ORDER BY time) AS grp
FROM purchase
This will give you:
id, time, rn, grp
1, 1, 1, 0
1, 2, 2, 0
1, 3, 3, 0
2, 1, 1, 0
2, 3, 2, 1
2, 4, 3, 1
2, 5, 4, 1
2, 10, 5, 5
2, 11, 6, 5
3, 1, 1, 0
3, 4, 2, 2
3, 9, 3, 6
3, 11, 4, 7
We can see that consecutive rows have the same grp value. Then we use GROUP BY and COUNT to get the length of each consecutive run.
CREATE TABLE intermediate_2 AS
SELECT
id,
grp,
COUNT(*) AS num_consecutive
FROM intermediate_1
GROUP BY id, grp
This will return:
id, grp, num_consecutive
1, 0, 3
2, 0, 1
2, 1, 3
2, 5, 2
3, 0, 1
3, 2, 1
3, 6, 1
3, 7, 1
Now we just use MAX and GROUP BY to get the maximum number of consecutive days.
CREATE TABLE final AS
SELECT
id,
MAX(num_consecutive) as max_consecutive
FROM intermediate_2
GROUP BY id
Which will give you:
id, max_consecutive
1, 3
2, 3
3, 1
Hope this helps!
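If you prefer not to create persistent tables, the same three steps can be chained through temporary views from PySpark; a sketch, assuming the purchase table/view from the question exists:
# Register each intermediate result as a temp view instead of a physical table.
spark.sql("""
    SELECT id, time,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY time) AS rn,
           time - ROW_NUMBER() OVER (PARTITION BY id ORDER BY time) AS grp
    FROM purchase
""").createOrReplaceTempView("intermediate_1")

spark.sql("""
    SELECT id, grp, COUNT(*) AS num_consecutive
    FROM intermediate_1
    GROUP BY id, grp
""").createOrReplaceTempView("intermediate_2")

spark.sql("""
    SELECT id, MAX(num_consecutive) AS max_consecutive
    FROM intermediate_2
    GROUP BY id
""").show()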
That's the case for my beloved window aggregate functions!
I think the following example could help you out (at least to get started).
The following is the dataset I use. I translated your time (in longs) into a numeric day number, to avoid messing around with timestamps in Spark SQL, which could make the solution harder to comprehend... possibly.
In the visit dataset below, the time column represents day numbers, so consecutive values denote consecutive days.
scala> visits.show
+---+----+
| ID|time|
+---+----+
| 1| 1|
| 1| 1|
| 1| 2|
| 1| 3|
| 1| 3|
| 1| 3|
| 2| 1|
| 3| 1|
| 3| 2|
| 3| 2|
+---+----+
Let's define the window specification to group id rows together.
import org.apache.spark.sql.expressions.Window
val idsSortedByTime = Window.
partitionBy("id").
orderBy("time")
With that you rank the rows and count rows with the same rank.
val answer = visits.
select($"id", $"time", rank over idsSortedByTime as "rank").
groupBy("id", "time", "rank").
agg(count("*") as "count")
scala> answer.show
+---+----+----+-----+
| id|time|rank|count|
+---+----+----+-----+
| 1| 1| 1| 2|
| 1| 2| 3| 1|
| 1| 3| 4| 3|
| 3| 1| 1| 1|
| 3| 2| 2| 2|
| 2| 1| 1| 1|
+---+----+----+-----+
That appears to be (very close to?) a solution. You seem done!
Using spark.sql with intermediate tables defined in a WITH clause:
scala> val df = Seq((1, 1),(1, 2),(1, 3),(2, 1),(2, 3),(2, 4),(2, 5),(2, 10),(2, 11),(3, 1),(3, 4),(3, 9),(3, 11)).toDF("id","time")
df: org.apache.spark.sql.DataFrame = [id: int, time: int]
scala> df.createOrReplaceTempView("tb1")
scala> spark.sql(""" with tb2(select id,time, time-row_number() over(partition by id order by time) rw1 from tb1), tb3(select id,count(rw1) rw2 from tb2 group by id,rw1) select id, rw2 from tb3 where (id,rw2) in (select id,max(rw2) from tb3 group by id) group by id, rw2 """).show(false)
+---+---+
|id |rw2|
+---+---+
|1 |3 |
|3 |1 |
|2 |3 |
+---+---+
Solution using DataFrame API:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._
val df1 = Seq((1, 1),(1, 2),(1, 3),(2, 1),(2, 3),(2, 4),(2, 5),(2, 10),(2, 11),(3, 1),(3, 4),(3, 9),(3, 11)).toDF("ID","time")
df1.show(false)
df1.printSchema()
val w = Window.partitionBy("ID").orderBy("time")
val df2 = df1.withColumn("rank", col("time") - row_number().over(w))
.groupBy("ID", "rank")
.agg(count("rank").alias("count"))
.groupBy("ID")
.agg(max("count").alias("time"))
.orderBy("ID")
df2.show(false)
Console output:
+---+----+
|ID |time|
+---+----+
|1 |1 |
|1 |2 |
|1 |3 |
|2 |1 |
|2 |3 |
|2 |4 |
|2 |5 |
|2 |10 |
|2 |11 |
|3 |1 |
|3 |4 |
|3 |9 |
|3 |11 |
+---+----+
root
|-- ID: integer (nullable = false)
|-- time: integer (nullable = false)
+---+----+
|ID |time|
+---+----+
|1 |3 |
|2 |3 |
|3 |1 |
+---+----+
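For completeness, the same gaps-and-islands idea in the PySpark DataFrame API; a sketch over the simplified day-number input (not real timestamps):
from pyspark.sql import Window
import pyspark.sql.functions as F

data = [(1, 1), (1, 2), (1, 3), (2, 1), (2, 3), (2, 4), (2, 5), (2, 10), (2, 11),
        (3, 1), (3, 4), (3, 9), (3, 11)]
df = spark.createDataFrame(data, ["ID", "time"])

w = Window.partitionBy("ID").orderBy("time")

# time - row_number() is constant within a run of consecutive days (gaps-and-islands).
result = (df
    .withColumn("grp", F.col("time") - F.row_number().over(w))
    .groupBy("ID", "grp").agg(F.count("time").alias("run_length"))
    .groupBy("ID").agg(F.max("run_length").alias("longest_seq"))
    .orderBy("ID"))
result.show()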