Redshift: pivot two fields of a row into columns

I have a query which I can pivot:
SELECT *
FROM
(
    SELECT
        CASE fg.date
            WHEN '2022-03-08' - 30 THEN 'ret30'
            WHEN '2022-03-08' - 7  THEN 'ret7'
            WHEN '2022-03-08' - 1  THEN 'ret1'
        END::varchar(max) AS retention,
        platform_id,
        fg.id    AS count,
        fg_1d.id AS count_2
    FROM schema.A AS fg
    LEFT JOIN (
        SELECT id
        FROM schema.A
        WHERE "date" = '2022-03-08'
    ) AS fg_1d ON fg.id = fg_1d.id
    LEFT JOIN schema.dim_B b ON fg.b_id = b.id
    WHERE (fg.date = '2022-03-08' - 30 OR fg.date = '2022-03-08' - 7 OR fg.date = '2022-03-08' - 1)
) PIVOT (COUNT(DISTINCT count) FOR retention IN ('ret30', 'ret7', 'ret1'));
Data without pivot is:
retention | platform_id | count
ret1      | 1           | 10
ret7      | 1           | 8
ret30     | 1           | 6
ret1      | 2           | 14
ret7      | 2           | 2
ret30     | 2           | 4
ret1      | 3           | 11
ret7      | 3           | 9
ret30     | 3           | 7
Data already pivoted:
platform_id | ret1 | ret7 | ret30
1           | 10   | 8    | 6
2           | 14   | 2    | 4
3           | 11   | 9    | 7
Now I would like to add another metric:
Data without pivot is:
ret1 | 1 |10 |12
ret7 | 1 |8 |7
ret30| 1 |6 |6
ret1 | 2 |14 |12
ret7 | 2 |2 |7
ret30| 2 |4 |6
ret1 | 3 |11 |12
ret7 | 3 |9 |7
ret30| 3 |7 |6
Data with pivot should be:
platform_id | ret1 | ret7 | ret30 | ret1_2 | ret7_2 | ret30_2
1           | 10   | 8    | 6     | 12     | 7      | 6
2           | 14   | 2    | 4     | 12     | 7      | 6
3           | 11   | 9    | 7     | 12     | 7      | 6
I tried without success:
SELECT *
FROM
(
    SELECT
        CASE fg.date
            WHEN '2022-03-08' - 30 THEN 'ret30'
            WHEN '2022-03-08' - 7  THEN 'ret7'
            WHEN '2022-03-08' - 1  THEN 'ret1'
        END::varchar(max) AS retention,
        platform_id,
        fg.id    AS count,
        fg_1d.id AS count_2
    FROM schema.A AS fg
    LEFT JOIN (
        SELECT id
        FROM schema.A
        WHERE "date" = '2022-03-08'
    ) AS fg_1d ON fg.id = fg_1d.id
    LEFT JOIN schema.dim_B b ON fg.b_id = b.id
    WHERE (fg.date = '2022-03-08' - 30 OR fg.date = '2022-03-08' - 7 OR fg.date = '2022-03-08' - 1)
) PIVOT (COUNT(DISTINCT count), COUNT(DISTINCT count_2) FOR retention IN ('ret30', 'ret7', 'ret1'));
And also:
SELECT *
FROM
(
    SELECT
        CASE fg.date
            WHEN '2022-03-08' - 30 THEN 'ret30'
            WHEN '2022-03-08' - 7  THEN 'ret7'
            WHEN '2022-03-08' - 1  THEN 'ret1'
        END::varchar(max) AS retention,
        platform_id,
        fg.id    AS count,
        fg_1d.id AS count_2
    FROM schema.A AS fg
    LEFT JOIN (
        SELECT id
        FROM schema.A
        WHERE "date" = '2022-03-08'
    ) AS fg_1d ON fg.id = fg_1d.id
    LEFT JOIN schema.dim_B b ON fg.b_id = b.id
    WHERE (fg.date = '2022-03-08' - 30 OR fg.date = '2022-03-08' - 7 OR fg.date = '2022-03-08' - 1)
) PIVOT (COUNT(DISTINCT count) FOR retention IN ('ret30', 'ret7', 'ret1')),
  PIVOT (COUNT(DISTINCT count_2) FOR retention IN ('ret30', 'ret7', 'ret1'));
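As far as I know, Redshift's PIVOT clause accepts only a single aggregate expression, which is why the first attempt fails, and I am not aware of a way to chain two PIVOT clauses as in the second attempt. One workaround that handles any number of metrics is to skip PIVOT and pivot manually with conditional aggregation. A rough sketch reusing the inner query from above (the "unpivoted" alias is mine, and the double quotes around "count" are only there because it is a reserved word):
SELECT
    platform_id,
    COUNT(DISTINCT CASE WHEN retention = 'ret1'  THEN "count" END) AS ret1,
    COUNT(DISTINCT CASE WHEN retention = 'ret7'  THEN "count" END) AS ret7,
    COUNT(DISTINCT CASE WHEN retention = 'ret30' THEN "count" END) AS ret30,
    COUNT(DISTINCT CASE WHEN retention = 'ret1'  THEN count_2 END) AS ret1_2,
    COUNT(DISTINCT CASE WHEN retention = 'ret7'  THEN count_2 END) AS ret7_2,
    COUNT(DISTINCT CASE WHEN retention = 'ret30' THEN count_2 END) AS ret30_2
FROM
(
    SELECT
        CASE fg.date
            WHEN '2022-03-08' - 30 THEN 'ret30'
            WHEN '2022-03-08' - 7  THEN 'ret7'
            WHEN '2022-03-08' - 1  THEN 'ret1'
        END::varchar(max) AS retention,
        platform_id,
        fg.id    AS "count",
        fg_1d.id AS count_2
    FROM schema.A AS fg
    LEFT JOIN (
        SELECT id
        FROM schema.A
        WHERE "date" = '2022-03-08'
    ) AS fg_1d ON fg.id = fg_1d.id
    LEFT JOIN schema.dim_B b ON fg.b_id = b.id
    WHERE (fg.date = '2022-03-08' - 30 OR fg.date = '2022-03-08' - 7 OR fg.date = '2022-03-08' - 1)
) AS unpivoted
GROUP BY platform_id;
Each CASE passes the id through only for its retention bucket, so COUNT(DISTINCT ...) per bucket reproduces the pivoted columns, and grouping by platform_id yields one row per platform with all six metrics.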

Related

Get value for latest record in case of multiple records for the same group

I have a dataset that will have multiple records for an id column, grouped on other columns. For this dataset, I want to derive a new column only for the latest record of each group. I was using a CASE statement to derive the new column and a UNION to get the value for the latest record. I would like to avoid using UNION, as it is an expensive operation in Spark SQL.
Input:
person_id order_id order_ts order_amt
1 1 2020-01-01 10:10:10 10
1 2 2020-01-01 10:15:15 15
2 3 2020-01-01 10:10:10 0
2 4 2020-01-01 10:15:15 15
From the above input, person_id 1 has two orders (1, 2) and person_id 2 has two orders (3, 4). I want to derive a column only for the latest order of a given person.
Expected Output:
person_id order_id order_ts order_amt valid_order
1 1 2020-01-01 10:10:10 10 N
1 2 2020-01-01 10:15:15 15 Y
2 3 2020-01-01 10:10:10 0 N
2 4 2020-01-01 10:15:15 15 Y
I tried the query below, which uses UNION, to get the output:
select person_id, order_id, order_ts, order_amt, valid_order
from
(
    select *, row_number() over(partition by order_id order by derive_order) as rnk
    from
    (
        select person_id, order_id, order_ts, order_amt, 'N' as valid_order, 'before' as derive_order
        from test_table
        UNION
        select person_id, order_id, order_ts, order_amt,
               case when order_amt is not null and order_amt > 0 then 'Y' else 'N' end as valid_order,
               'after' as derive_order
        from
        (
            select *, row_number() over(partition by person_id order by order_ts desc) as rnk
            from test_table
        ) where rnk = 1
    ) final
) where rnk = 1 order by person_id, order_id;
I also got the same output using a combination of left outer join and inner join.
Join Query:
select final.person_id, final.order_id, final.order_ts, final.order_amt,
       case when final.valid_order is null then 'N' else final.valid_order end as valid_order
from
(
    select c.person_id, c.order_id, c.order_ts, c.order_amt, d.valid_order
    from test_table c
    left outer join
    (
        select a.*, case when a.order_amt is not null and a.order_amt > 0 then 'Y' else 'N' end as valid_order
        from test_table a
        inner join
        (
            select person_id, max(order_id) as order_id from test_table group by 1
        ) b on a.person_id = b.person_id and a.order_id = b.order_id
    ) d on c.order_id = d.order_id
) final order by person_id, order_id;
Our input dataset will have around 20 million records. Is there a better-optimized way to get the same output than the above queries?
Any help would be appreciated.
Check if this helps:
import spark.implicits._

val data =
  """
    |person_id | order_id | order_ts            | order_amt
    | 1        | 1        | 2020-01-01 10:10:10 | 10
    | 1        | 2        | 2020-01-01 10:15:15 | 15
    | 2        | 3        | 2020-01-01 10:10:10 | 0
    | 2        | 4        | 2020-01-01 10:15:15 | 15
  """.stripMargin
val stringDS = data.split(System.lineSeparator())
  .map(_.split("\\|").map(_.replaceAll("""^[ \t]+|[ \t]+$""", "")).mkString(","))
  .toSeq.toDS()
val df = spark.read
  .option("sep", ",")
  .option("inferSchema", "true")
  .option("header", "true")
  .option("nullValue", "null")
  .csv(stringDS)
df.printSchema()
df.show(false)
/**
* root
* |-- person_id: integer (nullable = true)
* |-- order_id: integer (nullable = true)
* |-- order_ts: timestamp (nullable = true)
* |-- order_amt: integer (nullable = true)
*
* +---------+--------+-------------------+---------+
* |person_id|order_id|order_ts |order_amt|
* +---------+--------+-------------------+---------+
* |1 |1 |2020-01-01 10:10:10|10 |
* |1 |2 |2020-01-01 10:15:15|15 |
* |2 |3 |2020-01-01 10:10:10|0 |
* |2 |4 |2020-01-01 10:15:15|15 |
* +---------+--------+-------------------+---------+
*/
Using the Spark DSL
df.withColumn("latest", max($"order_ts").over(Window.partitionBy("person_id")))
.withColumn("valid_order", when(unix_timestamp($"latest") - unix_timestamp($"order_ts") =!= 0, lit("N"))
.otherwise(lit("Y"))
)
.show(false)
/**
* +---------+--------+-------------------+---------+-------------------+-----------+
* |person_id|order_id|order_ts |order_amt|latest |valid_order|
* +---------+--------+-------------------+---------+-------------------+-----------+
* |2 |3 |2020-01-01 10:10:10|0 |2020-01-01 10:15:15|N |
* |2 |4 |2020-01-01 10:15:15|15 |2020-01-01 10:15:15|Y |
* |1 |1 |2020-01-01 10:10:10|10 |2020-01-01 10:15:15|N |
* |1 |2 |2020-01-01 10:15:15|15 |2020-01-01 10:15:15|Y |
* +---------+--------+-------------------+---------+-------------------+-----------+
*/
Using Spark SQL
// Spark SQL
df.createOrReplaceTempView("order_table")
spark.sql(
"""
|select person_id, order_id, order_ts, order_amt, latest,
| case when (unix_timestamp(latest) - unix_timestamp(order_ts) != 0) then 'N' else 'Y' end as valid_order
| from
| (select person_id, order_id, order_ts, order_amt, max(order_ts) over (partition by person_id) as latest FROM order_table) a
""".stripMargin)
.show(false)
/**
* +---------+--------+-------------------+---------+-------------------+-----------+
* |person_id|order_id|order_ts |order_amt|latest |valid_order|
* +---------+--------+-------------------+---------+-------------------+-----------+
* |2 |3 |2020-01-01 10:10:10|0 |2020-01-01 10:15:15|N |
* |2 |4 |2020-01-01 10:15:15|15 |2020-01-01 10:15:15|Y |
* |1 |1 |2020-01-01 10:10:10|10 |2020-01-01 10:15:15|N |
* |1 |2 |2020-01-01 10:15:15|15 |2020-01-01 10:15:15|Y |
* +---------+--------+-------------------+---------+-------------------+-----------+
*/
It can be done without joins or a union. Also, the condition a.order_amt is not null and a.order_amt > 0 is redundant, because if the amount is > 0 it is already NOT NULL.
select person_id, order_id, order_ts, order_amt,
       case when rn = 1 and order_amt > 0 then 'Y' else 'N' end as valid_order
from
(
    select person_id, order_id, order_ts, order_amt,
           row_number() over(partition by person_id order by order_ts desc) as rn
    from test_table a
) s

PySpark: timeslice and split rows in a dataframe at 5-minute intervals on a specific condition

I have a dataframe with the following columns:
+-----+----------+--------------------------+-----------+
|id | sourceid | timestamp | indicator |
+-----+----------+--------------------------+-----------+
| 0 | 128 | 2019-12-03 12:00:00.0 | 0 |
| 1 | 128 | 2019-12-03 12:30:00.0 | 1 |
| 2 | 128 | 2019-12-03 12:37:00.0 | 0 |
| 3 | 128 | 2019-12-03 13:15:00.0 | 1 |
| 4 | 128 | 2019-12-03 13:17:00.0 | 0 |
+-----+----------+--------------------------+-----------+
I am trying to split the timestamp column into rows at 5-minute intervals for indicator values which are not 0.
Explanation:
The first entry is at timestamp = 2019-12-03 12:00:00.0 with indicator = 0, so do nothing.
Moving on to the next entry, timestamp = 2019-12-03 12:30:00.0 with indicator = 1, I want to split the timestamp into rows at 5-minute intervals until we reach the next entry, which is timestamp = 2019-12-03 12:37:00.0 with indicator = 0.
If there is a case where timestamp = 2019-12-03 13:15:00.0 with indicator = 1 and the next timestamp = 2019-12-03 13:17:00.0 with indicator = 0, I'd also like to split the row, treating both times as indicator 1, since 13:17:00.0 falls between 13:15:00.0 and 13:20:00.0, as shown below.
How can I achieve this with PySpark?
Expected Output:
+-----+----------+--------------------------+-------------+
|id | sourceid | timestamp | indicator |
+-----+----------+--------------------------+-------------+
| 1 | 128 | 2019-12-03 12:30:00.0 | 1 |
| 1 | 128 | 2019-12-03 12:35:00.0 | 1 |
| 4 | 128 | 2019-12-03 13:15:00.0 | 1 |
| 4 | 128 | 2019-12-03 13:20:00.0 | 1 |
+-----+----------+--------------------------+-------------+
IIUC, you can filter rows based on the indicators of the current and the next rows, and then use array + explode to create the new rows (for testing purposes, I added some more rows to your original example):
from pyspark.sql import Window, functions as F
w1 = Window.partitionBy('sourceid').orderBy('timestamp')
# add a flag to check if the next indicator is '0'
df1 = df.withColumn('next_indicator_is_0', F.lead('indicator').over(w1) == 0)
df1.show(truncate=False)
+---+--------+---------------------+---------+-------------------+
|id |sourceid|timestamp |indicator|next_indicator_is_0|
+---+--------+---------------------+---------+-------------------+
|0 |128 |2019-12-03 12:00:00.0|0 |false |
|1 |128 |2019-12-03 12:30:00.0|1 |true |
|2 |128 |2019-12-03 12:37:00.0|0 |false |
|3 |128 |2019-12-03 13:12:00.0|1 |false |
|4 |128 |2019-12-03 13:15:00.0|1 |true |
|5 |128 |2019-12-03 13:17:00.0|0 |false |
|6 |128 |2019-12-03 13:20:00.0|1 |null |
+---+--------+---------------------+---------+-------------------+
df1.filter("indicator = 1 AND next_indicator_is_0") \
.withColumn('timestamp', F.expr("explode(array(`timestamp`, `timestamp` + interval 5 minutes))")) \
.drop('next_indicator_is_0') \
.show(truncate=False)
+---+--------+---------------------+---------+
|id |sourceid|timestamp |indicator|
+---+--------+---------------------+---------+
|1 |128 |2019-12-03 12:30:00.0|1 |
|1 |128 |2019-12-03 12:35:00 |1 |
|4 |128 |2019-12-03 13:15:00.0|1 |
|4 |128 |2019-12-03 13:20:00 |1 |
+---+--------+---------------------+---------+
Note: you can reset id column by using F.row_number().over(w1) or F.monotonically_increasing_id() based on your requirements.

Spark SQL: is there a way to get a sliding window whose size depends on a time duration instead of a number of items?

I have a Spark Dataset of events, indexed by a timestamp. What I would like to do is enrich each entry with additional information: the number of events occurring in the five minutes (300 seconds) following this event. So if the initial data consists of two columns, event_id and timestamp, I want to build a third column, counter, like below:
event_id timestamp counter
0 0 4
1 100 3
2 150 2
3 250 1
4 275 0
5 600 2
6 610 1
7 750 1
8 950 2
9 1100 1
10 1200 0
I know that using Spark I can use windows to count future events within a window of fixed size in terms of the number of events.
val window = Window.orderBy('timestamp).rowsBetween(0, 300)
myDataset.withColumn("count_future_events", sum(lit(1)).over(window))
But this is not useful, because with a fixed number of rows the result is obviously always the same.
I wish something like this existed:
val window = Window.orderBy('timestamp).rowsBetween('timestamp, 'timestamp + 300) // 300 seconds here
But this does not compile.
Is there any way to achieve what I want?
import org.apache.spark.sql.expressions.Window
val w = Window.orderBy("timestamp").rangeBetween(0, 300)
df.withColumn("counter", sum(lit(1)).over(w) - 1).show(false)
You can simply use the rangeBetween for the Window. The result is then:
+--------+---------+-------+
|event_id|timestamp|counter|
+--------+---------+-------+
|0 |0 |4 |
|1 |100 |3 |
|2 |150 |2 |
|3 |250 |1 |
|4 |275 |0 |
|5 |600 |2 |
|6 |610 |1 |
|7 |750 |1 |
|8 |950 |2 |
|9 |1100 |1 |
|10 |1200 |0 |
+--------+---------+-------+

Running sum between two timestamps in PySpark

I have data in the below format:
+---------------------+----+----+---------+----------+
| date_time | id | cm | p_count | bcm |
+---------------------+----+----+---------+----------+
| 2018-02-01 04:38:00 | v1 | c1 | 1 | null |
| 2018-02-01 05:37:07 | v1 | c1 | 1 | null |
| 2018-02-01 11:19:38 | v1 | c1 | 1 | null |
| 2018-02-01 12:09:19 | v1 | c1 | 1 | c1 |
| 2018-02-01 14:05:10 | v2 | c2 | 1 | c2 |
+---------------------+----+----+---------+----------+
I need to find the rolling sum of the p_count column between two date_time values, partitioned by id.
The logic for start_date_time and end_date_time of the rolling sum window is below:
start_date_time = min(date_time) group by (id, cm)
end_date_time = bcm == cm ? date_time : null
In this case start_date_time = 2018-02-01 04:38:00 and end_date_time = 2018-02-01 12:09:19.
The output should look like:
+---------------------+----+----+---------+----------+-------------+
| date_time | id | cm | p_count | bcm | p_sum_count |
+---------------------+----+----+---------+----------+-------------+
| 2018-02-01 04:38:00 | v1 | c1 | 1 | null |1 |
| 2018-02-01 05:37:07 | v1 | c1 | 1 | null |2 |
| 2018-02-01 11:19:38 | v1 | c1 | 1 | null |3 |
| 2018-02-01 12:09:19 | v1 | c1 | 1 | c1 |4 |
| 2018-02-01 14:05:10 | v2 | c2 | 1 | c2 |1 |
+---------------------+----+----+---------+----------+-------------+
var input = sqlContext.createDataFrame(Seq(
("2018-02-01 04:38:00", "v1", "c1",1,null),
("2018-02-01 05:37:07", "v1", "c1",1,null),
("2018-02-01 11:19:38", "v1", "c1",1,null),
("2018-02-01 12:09:19", "v1", "c1",1,"c1"),
("2018-02-01 14:05:10", "v2", "c2",1,"c2")
)).toDF("date_time","id","cm","p_count" ,"bcm")
input.show()
Results:
+---------------------+----+----+---------+----------+-------------+
| date_time | id | cm | p_count | bcm | p_sum_count |
+---------------------+----+----+---------+----------+-------------+
| 2018-02-01 04:38:00 | v1 | c1 | 1 | null |1 |
| 2018-02-01 05:37:07 | v1 | c1 | 1 | null |2 |
| 2018-02-01 11:19:38 | v1 | c1 | 1 | null |3 |
| 2018-02-01 12:09:19 | v1 | c1 | 1 | c1 |4 |
| 2018-02-01 14:05:10 | v2 | c2 | 1 | c2 |1 |
+---------------------+----+----+---------+----------+-------------+
Next Code:
input.createOrReplaceTempView("input_Table");
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
//val results = spark.sqlContext.sql("SELECT sum(p_count) from input_Table tbl GROUP BY tbl.cm")
val results = sqlContext.sql("select *, " +
"SUM(p_count) over ( order by id rows between unbounded preceding and current row ) cumulative_Sum " +
"from input_Table ").show
Results:
+-------------------+---+---+-------+----+--------------+
| date_time| id| cm|p_count| bcm|cumulative_Sum|
+-------------------+---+---+-------+----+--------------+
|2018-02-01 04:38:00| v1| c1| 1|null| 1|
|2018-02-01 05:37:07| v1| c1| 1|null| 2|
|2018-02-01 11:19:38| v1| c1| 1|null| 3|
|2018-02-01 12:09:19| v1| c1| 1| c1| 4|
|2018-02-01 14:05:10| v2| c2| 1| c2| 5|
+-------------------+---+---+-------+----+--------------+
You need to partition by id while windowing and add your logic to get the expected results, for example along the lines of the sketch below.
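A minimal sketch of that idea in Spark SQL, assuming the sum should restart for every (id, cm) pair and run in date_time order (note this does not cap the window at end_date_time; any rows after the bcm = cm row would keep accumulating and may need extra handling):
select date_time, id, cm, p_count, bcm,
       SUM(p_count) over (
           partition by id, cm
           order by date_time
           rows between unbounded preceding and current row
       ) as p_sum_count
from input_Table
Run against the input_Table registered above, this gives p_sum_count = 1, 2, 3, 4 for id v1 and 1 for id v2, matching the expected output.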
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
Logically a Windowed Aggregate Function is newly calculated for each row within the PARTITION based on all ROWS between a starting row and an ending row.
Starting and ending rows might be fixed or relative to the current row based on the following keywords:
UNBOUNDED PRECEDING, all rows before the current row -> fixed
UNBOUNDED FOLLOWING, all rows after the current row -> fixed
x PRECEDING, x rows before the current row -> relative
y FOLLOWING, y rows after the current row -> relative
Possible kinds of calculation include:
Both starting and ending row are fixed, the window consists of all rows of a partition, e.g. a Group Sum, i.e. aggregate plus detail rows
One end is fixed, the other relative to current row, the number of rows increases or decreases, e.g. a Running Total, Remaining Sum
Starting and ending row are relative to current row, the number of rows within a window is fixed, e.g. a Moving Average over n rows
So SUM(x) OVER (ORDER BY col ROWS UNBOUNDED PRECEDING) results in a Cumulative Sum or Running Total
11 -> 11
2 -> 11 + 2 = 13
3 -> 13 + 3 (or 11+2+3) = 16
44 -> 16 + 44 (or 11+2+3+44) = 60

How to operate on a global variable row by row, sequentially, in a Spark SQL DataFrame on a Spark cluster?

I have a dataset which looks like this:
+-------+------+-------+
|groupid|rownum|column2|
+-------+------+-------+
| 1 | 1 | 7 |
| 1 | 2 | 9 |
| 1 | 3 | 8 |
| 1 | 4 | 5 |
| 1 | 5 | 1 |
| 1 | 6 | 0 |
| 1 | 7 | 15 |
| 1 | 8 | 1 |
| 1 | 9 | 13 |
| 1 | 10 | 20 |
| 2 | 1 | 8 |
| 2 | 2 | 1 |
| 2 | 3 | 4 |
| 2 | 4 | 2 |
| 2 | 5 | 19 |
| 2 | 6 | 11 |
| 2 | 7 | 5 |
| 2 | 8 | 6 |
| 2 | 9 | 15 |
| 2 | 10 | 8 |
... and there are more rows.
I want to add a new column, "column3": consecutive rows whose column2 values are less than 10 are assigned the same number, such as 1. When a value larger than 10 appears in column2, that row is dropped, and the column3 value of the following rows increases by 1. For example, when groupid = 1, the column3 value from rownum 1 to 6 will be 1 and rownum 7 will be dropped; the column3 value of rownum 8 will be 2 and rownum 9 and 10 will be dropped. After the procedure, the table will look like this:
+-------+------+-------+-------+
|groupid|rownum|column2|column3|
+-------+------+-------+-------+
| 1 | 1 | 7 | 1 |
| 1 | 2 | 9 | 1 |
| 1 | 3 | 8 | 1 |
| 1 | 4 | 5 | 1 |
| 1 | 5 | 1 | 1 |
| 1 | 6 | 0 | 1 |
| 1 | 7 | 15 | drop | this row will be dropped (it will not actually exist)
| 1 | 8 | 1 | 2 |
| 1 | 9 | 13 | drop | like above
| 1 | 10 | 20 | drop | like above
| 2 | 1 | 8 | 1 |
| 2 | 2 | 1 | 1 |
| 2 | 3 | 4 | 1 |
| 2 | 4 | 2 | 1 |
| 2 | 5 | 19 | drop | ...
| 2 | 6 | 11 | drop | ...
| 2 | 7 | 5 | 2 |
| 2 | 8 | 6 | 2 |
| 2 | 9 | 15 | drop | ...
| 2 | 10 | 8 | 3 |
In our project, the dataset is expressed as a DataFrame in Spark SQL.
I tried to solve this problem with a UDF in this way:
var last_rowNum: Int = 1
var column3_Num: Int = 1

def assign_column3_Num(rowNum: Int): Int = {
  if (rowNum == 1) { // do nothing, just assign 1
    column3_Num = 1
    last_rowNum = 1
    return column3_Num
  }
  /*** if the difference between rownums is 1, they get the same column3
   * value; if not, column3_Num++, so they are different
   */
  if (rowNum - last_rowNum == 1) {
    last_rowNum = rowNum
    return column3_Num
  } else {
    column3_Num += 1
    last_rowNum = rowNum
    return column3_Num
  }
}
import org.apache.spark.sql.functions.col

val assignColumn3Num = spark.sqlContext.udf.register("assign_column3_Num", assign_column3_Num _)

df.filter("column2 < 10")                                  // drop the larger rows
  .withColumn("column3", assignColumn3Num(col("rownum")))  // add column3
As you can see, I use global variables. However, this only works in Spark local[1] mode. If I use local[8] or yarn-client, the result will be totally wrong! This is because of Spark's execution model: the tasks operate on the global variables without distinguishing groupid or order!
So the question is: how can I assign the right numbers when Spark runs on a cluster?
Should I use a UDF, a UDAF, RDDs, or something else?
Thank you!
You can achieve your requirement by defining a udf function as below (comments are given for clarity)
import org.apache.spark.sql.functions._
def createNewCol = udf((rownum: collection.mutable.WrappedArray[Int], column2: collection.mutable.WrappedArray[Int]) => { // udf function
  var value = 1                                       // value for column3
  var previousValue = 0                               // value for checking the condition
  var arrayBuffer = Array.empty[(Int, Int, Int)]      // initialization of the array to be returned
  for ((a, b) <- rownum.zip(column2)) {               // zipping the collected lists and looping
    if (b > 10 && previousValue < 10)                 // checking the condition for column3
      value = value + 1                               // adding 1 for column3
    arrayBuffer = arrayBuffer ++ Array((a, b, value)) // adding the values
    previousValue = b
  }
  arrayBuffer
})
Now, to utilize the algorithm defined in the udf function and get the desired result, you need to collect the values of rownum and column2, grouping them by groupid and sorting them by rownum, and then call the udf function. The next steps are to explode and select the necessary columns (commented for clarity):
df.orderBy("rownum").groupBy("groupid").agg(collect_list("rownum").as("rownum"), collect_list("column2").as("column2")) //collecting in order for generating values for column3
.withColumn("new", createNewCol(col("rownum"), col("column2"))) //calling udf function and storing the array of struct(rownum, column2, column3) in new column
.drop("rownum", "column2") //droping unnecessary columns
.withColumn("new", explode(col("new"))) //exploding the new column array so that each row can have struct(rownum, column2, column3)
.select(col("groupid"), col("new._1").as("rownum"), col("new._2").as("column2"), col("new._3").as("column3")) //selecting as separate columns
.filter(col("column2") < 10) // filtering the rows with column2 greater than 10
.show(false)
You should have your desired output as
+-------+------+-------+-------+
|groupid|rownum|column2|column3|
+-------+------+-------+-------+
|1 |1 |7 |1 |
|1 |2 |9 |1 |
|1 |3 |8 |1 |
|1 |4 |5 |1 |
|1 |5 |1 |1 |
|1 |6 |0 |1 |
|1 |8 |1 |2 |
|2 |1 |8 |1 |
|2 |2 |1 |1 |
|2 |3 |4 |1 |
|2 |4 |2 |1 |
|2 |7 |5 |2 |
|2 |8 |6 |2 |
|2 |10 |8 |3 |
+-------+------+-------+-------+
