This is my first post, so let me know if I need to give more details.
I am trying to create a boolean column, "immediate", that is true when at least one of the columns has some data in it. If all of them are null then the column should be false. I am using when() / .otherwise() in Spark, but I'm not getting the result I would expect.
Below is the code I'm using:
val evaluation = evaluation_raw
  .withColumn("immediate",
    when(col("intended_outcome_review").isNull
      && col("outcome").isNull
      && col("impact").isNull
      && col("impact_self").isNull
      && col("next_step").isNull,
      lit(false))
    .otherwise(lit(true)))
  .select(
    col("id"),
    col("intended_outcome_review"),
    col("outcome"),
    col("impact"),
    col("impact_self"),
    col("next_step"),
    col("immediate"))
Desired outcome:
+--------+------------------------+-------------+-------+------------+----------+----------+
|id |intended_outcome_review |outcome |impact |impact_self |next_step |immediate |
+--------+------------------------+-------------+-------+------------+----------+----------+
|1568 |null |null |4 |3 |null |true |
|1569 |null |null |null |null |null |false |
|1570 |null |null |null |null |null |false |
|1571 |1 |improved coms|3 |3 |email prof|true |
+--------+------------------------+-------------+-------+------------+----------+----------+
Actual outcome:
+--------+------------------------+-------------+-------+------------+----------+----------+
|id |intended_outcome_review |outcome |impact |impact_self |next_step |immediate |
+--------+------------------------+-------------+-------+------------+----------+----------+
|1568 |null |null |4 |3 |null |true |
|1569 |null |null |null |null |null |true |
|1570 |null |null |null |null |null |false |
|1571 |1 |improved coms|3 |3 |email prof|true |
+--------+------------------------+-------------+-------+------------+----------+----------+
If anyone knows what I may be doing wrong please let me know.
Thanks!
You can use a trick: cast column.isNotNull() to int and sum them. If the sum is above 0, at least one column has data, so immediate is true.
from pyspark.sql import functions as F

df = df.withColumn(
    'immediate',
    (
        F.col('intended_outcome_review').isNotNull().cast('int') +
        F.col('outcome').isNotNull().cast('int') +
        F.col('impact').isNotNull().cast('int') +
        F.col('impact_self').isNotNull().cast('int') +
        F.col('next_step').isNotNull().cast('int')
    ) != 0
)
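Since the question itself is written in Scala, a minimal Scala sketch of the same isNotNull-and-sum trick might look like this (assuming the same evaluation_raw DataFrame and column names as in the question):

import org.apache.spark.sql.functions.{col, lit}

// immediate is true when at least one of the checked columns is non-null
val evaluation = evaluation_raw.withColumn(
  "immediate",
  (col("intended_outcome_review").isNotNull.cast("int") +
    col("outcome").isNotNull.cast("int") +
    col("impact").isNotNull.cast("int") +
    col("impact_self").isNotNull.cast("int") +
    col("next_step").isNotNull.cast("int")) > lit(0)
)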
Turns out some of the columns are converted from null to "" when other parts of the form are filled out.
The answer below handles both empty strings and null values:
.withColumn("immediate",
when((col("intended_outcome_review").isNull || col("intended_outcome_review") ==="")
&& (col("outcome").isNull || col("outcome") === "")
&& (col("impact").isNull || col("outcome") === "")
&& (col("impact_self").isNull || col("impact_self") === "")
&& (col("next_step").isNull || col("next_step") === ""),
lit(false))
.otherwise(lit(true)))
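If the list of columns grows, the same null-or-empty check can also be built from a list of column names and reduced into a single condition. A sketch along those lines (not part of the original answer; it assumes the same DataFrame and columns):

import org.apache.spark.sql.functions.col

// A column counts as "filled" when it is neither null nor an empty string.
val checked = Seq("intended_outcome_review", "outcome", "impact", "impact_self", "next_step")

val anyFilled = checked
  .map(c => col(c).isNotNull && col(c) =!= "")
  .reduce(_ || _)

val evaluation = evaluation_raw.withColumn("immediate", anyFilled)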
Related
I have a data frame like this:
+-----------------------------------------+-----------------------+----------+---------------------------------------+
|id |timestamp |segment_id|centroid |
+-----------------------------------------+-----------------------+----------+---------------------------------------+
|01-9-0b18bb1e-eea1-4862-bb07-71fdde01b848|2022-03-12 15:21:53.529|null |null |
|01-9-0b18bb1e-eea1-4862-bb07-71fdde01b848|2022-03-12 15:21:54.528|null |null |
|01-9-0b18bb1e-eea1-4862-bb07-71fdde01b848|2022-03-12 15:21:55.527|null |null |
|01-9-0b18bb1e-eea1-4862-bb07-71fdde01b848|2022-03-12 15:21:56.527|null |null |
|01-9-0b18bb1e-eea1-4862-bb07-71fdde01b848|2022-03-12 15:21:57.526|null |null |
|01-9-0b18bb1e-eea1-4862-bb07-71fdde01b848|2022-03-12 15:21:58.525|null |null |
|01-9-0b18bb1e-eea1-4862-bb07-71fdde01b848|2022-03-12 15:21:59.525|null |null |
|01-9-0b18bb1e-eea1-4862-bb07-71fdde01b848|2022-03-12 15:22:00.524|null |null |
|01-9-0b18bb1e-eea1-4862-bb07-71fdde01b848|2022-03-12 15:22:01.524|null |null |
|01-9-0b18bb1e-eea1-4862-bb07-71fdde01b848|2022-03-12 15:22:02.523|null |null |
|01-9-0b18bb1e-eea1-4862-bb07-71fdde01b848|2022-03-12 15:22:03.523|null |null |
|01-9-0b18bb1e-eea1-4862-bb07-71fdde01b848|2022-03-12 15:22:04.522|366 |[42.52251447741435, -83.74453303719517]|
|01-9-0b18bb1e-eea1-4862-bb07-71fdde01b848|2022-03-12 15:22:05.522|null |null |
|01-9-0b18bb1e-eea1-4862-bb07-71fdde01b848|2022-03-12 15:22:06.522|null |null |
|01-9-0b18bb1e-eea1-4862-bb07-71fdde01b848|2022-03-12 15:22:07.521|null |null |
|01-9-0b18bb1e-eea1-4862-bb07-71fdde01b848|2022-03-12 15:22:08.521|null |null |
|01-9-0b18bb1e-eea1-4862-bb07-71fdde01b848|2022-03-12 15:22:09.520|null |null |
|01-9-0b18bb1e-eea1-4862-bb07-71fdde01b848|2022-03-12 15:22:10.519|null |null |
|01-9-0b18bb1e-eea1-4862-bb07-71fdde01b848|2022-03-12 15:22:11.520|null |null |
+-----------------------------------------+-----------------------+----------+---------------------------------------+
I am trying to filter it on a non-null segment_id with the following code:
df.where("segment_id is not null").show(truncate=False)
What results is the following (correct) row:
+-----------------------------------------+-----------------------+----------+--------+
|id |timestamp |segment_id|centroid|
+-----------------------------------------+-----------------------+----------+--------+
|01-9-0b18bb1e-eea1-4862-bb07-71fdde01b848|2022-03-12 15:22:04.522|null |null |
+-----------------------------------------+-----------------------+----------+--------+
So it seems like it picks the right row, but somehow segment_id and centroid show up as null. The only time it works is when the non-null value happens to be in the first row of the dataframe.
I have the below data in a dataframe:
+----------+--------------+-------------------+---------------+
|id |mid |ppp |qq |
+----------+--------------+-------------------+---------------+
|A |4 |[{P}] |null |
|B |4 |[{P}] |null |
|A |4 |null |[{P}] |
|A |4 |null |[{Q}] |
|C |4 |null |[{Q}] |
|D |4 |null |[{Q}] |
|A |4 |null |[{R}] |
+----------+--------------+-------------------+---------------+
I have the below code:
String[] array = {"id", "mid", "ppp", "qq"};
List<String> columnNames = Arrays.asList(array);
Column[] aggColumns = columnNames
    .stream()
    .filter(field -> !field.equals("id") && !field.equals("mid"))
    .map(column -> flatten(when(size(collect_list(column)).equalTo(0), null)
        .otherwise(collect_list(column)))
        .as(column))
    .collect(Collectors.toList())
    .toArray(new Column[0]);
Dataset<Row> output = df
    .groupBy(functions.col("id"), functions.col("mid"))
    .agg(aggColumns[0], Arrays.copyOfRange(aggColumns, 1, aggColumns.length));
The above code groups by id and mid, and then collect_list collects the elements of ppp and qq into arrays in both columns.
Output :
+----------+--------------+-------------------+----------------+
|id        |mid           |ppp                |qq              |
+----------+--------------+-------------------+----------------+
|A         |4             |[[P]]              |[[R], [P], [Q]] |
|B         |4             |null               |[[Q]]           |
|C         |4             |[[P]]              |null            |
|D         |4             |null               |[[Q]]           |
+----------+--------------+-------------------+----------------+
The code works fine, exactly as required: if collect_list creates an empty list, I replace it with null.
Is there a way to avoid calling collect_list twice (once in when and once in otherwise) and still achieve the same result, i.e. replace an empty list produced by collect_list with null?
Of course you can do that: just call size on the array and set it to null if it is 0, something like
df
  .groupBy()
  .agg(
    collect_list($"mycol").as("arr_mycol")
  )
  // set empty arrays to null
  .withColumn("arr_mycol", when(size($"arr_mycol") > 0, $"arr_mycol"))
Current DF (filtered on a single userId; flag is 1 when RealLoss is <= 0, i.e. a win, and -1 when RealLoss is > 0, i.e. a loss):
display(df):
+------+----------+---------+----+
| user|Date |RealLoss |flag|
+------+----------+---------+----+
|100364|2019-02-01| -16.5| 1|
|100364|2019-02-02| 73.5| -1|
|100364|2019-02-03| 31| -1|
|100364|2019-02-09| -5.2| 1|
|100364|2019-02-10| -34.5| 1|
|100364|2019-02-13| -8.1| 1|
|100364|2019-02-18| 5.68| -1|
|100364|2019-02-19| 5.76| -1|
|100364|2019-02-20| 9.12| -1|
|100364|2019-02-26| 9.4| -1|
|100364|2019-02-27| -30.6| 1|
+------+----------+---------+----+
The desired outcome df should show the number of days since the last win ('RecencyLastWin') and since the last loss ('RecencyLastLoss'):
display(df):
+------+----------+---------+----+--------------+---------------+
| user|Date |RealLoss |flag|RecencyLastWin|RecencyLastLoss|
+------+----------+---------+----+--------------+---------------+
|100364|2019-02-01| -16.5| 1| null| null|
|100364|2019-02-02| 73.5| -1| 1| null|
|100364|2019-02-03| 31| -1| 2| 1|
|100364|2019-02-09| -5.2| 1| 8| 6|
|100364|2019-02-10| -34.5| 1| 1| 7|
|100364|2019-02-13| -8.1| 1| 1| 10|
|100364|2019-02-18| 5.68| -1| 5| 15|
|100364|2019-02-19| 5.76| -1| 6| 1|
|100364|2019-02-20| 9.12| -1| 7| 1|
|100364|2019-02-26| 9.4| -1| 13| 6|
|100364|2019-02-27| -30.6| 1| 14| 1|
+------+----------+---------+----+--------------+---------------+
My approach was the following:
from pyspark.sql.window import Window
w = Window.partitionBy("userId", 'PlayerSiteCode').orderBy("EventDate")
last_positive = check.filter('flag = "1"').withColumn('last_positive_day' , F.lag('EventDate').over(w))
last_negative = check.filter('flag = "-1"').withColumn('last_negative_day' , F.lag('EventDate').over(w))
finalcheck = check.join(last_positive.select('userId', 'PlayerSiteCode', 'EventDate', 'last_positive_day'), ['userId', 'PlayerSiteCode', 'EventDate'], how = 'left')\
.join(last_negative.select('userId', 'PlayerSiteCode', 'EventDate', 'last_negative_day'), ['userId', 'PlayerSiteCode', 'EventDate'], how = 'left')\
.withColumn('previous_date_played' , F.lag('EventDate').over(w))\
.withColumn('last_positive_day_count', F.datediff(F.col('EventDate'), F.col('last_positive_day')))\
.withColumn('last_negative_day_count', F.datediff(F.col('EventDate'), F.col('last_negative_day')))
Then I tried to add more columns (multiple attempts...), but failed to return exactly what I want:
finalcheck = finalcheck.withColumn('previous_last_pos' , F.last('last_positive_day_count', True).over(w2))\
.withColumn('previous_last_neg' , F.last('last_negative_day_count', True).over(w2))\
.withColumn('previous_last_pos_date' , F.last('last_positive_day', True).over(w2))\
.withColumn('previous_last_neg_date' , F.last('last_negative_day', True).over(w2))\
.withColumn('recency_last_positive' , F.datediff(F.col('EventDate'), F.col('previous_last_pos_date')))\
.withColumn('day_since_last_negative_v1' , F.datediff(F.col('EventDate'), F.col('previous_last_neg_date')))\
.withColumn('days_off' , F.datediff(F.col('EventDate'), F.col('previous_date_played')))\
.withColumn('recency_last_negative' , F.when((F.col('day_since_last_negative_v1').isNull()), F.col('days_off')).otherwise(F.col('day_since_last_negative_v1')))\
.withColumn('recency_last_negative_v2' , F.when((F.col('last_negative_day').isNull()), F.col('days_off')).otherwise(F.col('day_since_last_negative_v1')))\
.withColumn('recency_last_positive_v2' , F.when((F.col('last_positive_day').isNull()), F.col('days_off')).otherwise(F.col('recency_last_positive')))
Any suggestion/tips?
(I found a similar question but didn't figure out how to apply it to my specific case):
How to calculate days between when last condition was met?
Here is my try.
There are two parts to the calculation. The first is that while the wins or losses keep going, the date differences should be summed. To achieve this, I marked consecutive losses and consecutive wins with 1, and split the rows into partition groups by cumulatively summing that marker up to the current row. Then I can calculate the cumulative days from the last loss or win once the consecutive run ends.
The second is that when the result switches between win and loss, I simply take the date difference between the last match and this one, which is just the difference between the current and previous dates.
Finally, merge those results into a column.
from pyspark.sql.functions import lag, col, sum, when, expr, coalesce
from pyspark.sql import Window
w1 = Window.orderBy('Date')
w2 = Window.partitionBy('groupLossCheck').orderBy('Date')
w3 = Window.partitionBy('groupWinCheck').orderBy('Date')
df2 = df.withColumn('lastFlag', lag('flag', 1).over(w1)) \
.withColumn('lastDate', lag('Date', 1).over(w1)) \
.withColumn('dateDiff', expr('datediff(Date, lastDate)')) \
.withColumn('consecutiveLoss', expr('if(flag = 1 or lastFlag = 1, 0, 1)')) \
.withColumn('consecutiveWin' , expr('if(flag = -1 or lastFlag = -1, 0, 1)')) \
.withColumn('groupLossCheck', sum('consecutiveLoss').over(w1)) \
.withColumn('groupWinCheck' , sum('consecutiveWin' ).over(w1)) \
.withColumn('daysLastLoss', sum(when((col('consecutiveLoss') == 0) & (col('groupLossCheck') != 0), col('dateDiff'))).over(w2)) \
.withColumn('daysLastwin' , sum(when((col('consecutiveWin' ) == 0) & (col('groupWinCheck' ) != 0), col('dateDiff'))).over(w3)) \
.withColumn('lastLoss', expr('if(lastFlag = -1, dateDiff, null)')) \
.withColumn('lastWin' , expr('if(lastFlag = 1, dateDiff, null)')) \
.withColumn('RecencyLastLoss', coalesce('lastLoss', 'daysLastLoss')) \
.withColumn('RecencyLastWin', coalesce('lastWin' , 'daysLastwin' )) \
.orderBy('Date')
df2.show(11, False)
+------+----------+--------+----+--------+----------+--------+---------------+--------------+--------------+-------------+------------+-----------+--------+-------+---------------+--------------+
|user |Date |RealLoss|flag|lastFlag|lastDate |dateDiff|consecutiveLoss|consecutiveWin|groupLossCheck|groupWinCheck|daysLastLoss|daysLastwin|lastLoss|lastWin|RecencyLastLoss|RecencyLastWin|
+------+----------+--------+----+--------+----------+--------+---------------+--------------+--------------+-------------+------------+-----------+--------+-------+---------------+--------------+
|100364|2019-02-01|-16.5 |1 |null |null |null |0 |1 |0 |1 |null |null |null |null |null |null |
|100364|2019-02-02|73.5 |-1 |1 |2019-02-01|1 |0 |0 |0 |1 |null |1 |null |1 |null |1 |
|100364|2019-02-03|31.0 |-1 |-1 |2019-02-02|1 |1 |0 |1 |1 |null |2 |1 |null |1 |2 |
|100364|2019-02-09|-5.2 |1 |-1 |2019-02-03|6 |0 |0 |1 |1 |6 |8 |6 |null |6 |8 |
|100364|2019-02-10|-34.5 |1 |1 |2019-02-09|1 |0 |1 |1 |2 |7 |null |null |1 |7 |1 |
|100364|2019-02-13|-8.1 |1 |1 |2019-02-10|3 |0 |1 |1 |3 |10 |null |null |3 |10 |3 |
|100364|2019-02-18|5.68 |-1 |1 |2019-02-13|5 |0 |0 |1 |3 |15 |5 |null |5 |15 |5 |
|100364|2019-02-19|5.76 |-1 |-1 |2019-02-18|1 |1 |0 |2 |3 |null |6 |1 |null |1 |6 |
|100364|2019-02-20|9.12 |-1 |-1 |2019-02-19|1 |1 |0 |3 |3 |null |7 |1 |null |1 |7 |
|100364|2019-02-26|9.4 |-1 |-1 |2019-02-20|6 |1 |0 |4 |3 |null |13 |6 |null |6 |13 |
|100364|2019-02-27|-30.6 |1 |-1 |2019-02-26|1 |0 |0 |4 |3 |1 |14 |1 |null |1 |14 |
+------+----------+--------+----+--------+----------+--------+---------------+--------------+--------------+-------------+------------+-----------+--------+-------+---------------+--------------+
df2.select(*df.columns, 'RecencyLastLoss', 'RecencyLastWin').show(11, False)
+------+----------+--------+----+---------------+--------------+
|user |Date |RealLoss|flag|RecencyLastLoss|RecencyLastWin|
+------+----------+--------+----+---------------+--------------+
|100364|2019-02-01|-16.5 |1 |null |null |
|100364|2019-02-02|73.5 |-1 |null |1 |
|100364|2019-02-03|31.0 |-1 |1 |2 |
|100364|2019-02-09|-5.2 |1 |6 |8 |
|100364|2019-02-10|-34.5 |1 |7 |1 |
|100364|2019-02-13|-8.1 |1 |10 |3 |
|100364|2019-02-18|5.68 |-1 |15 |5 |
|100364|2019-02-19|5.76 |-1 |1 |6 |
|100364|2019-02-20|9.12 |-1 |1 |7 |
|100364|2019-02-26|9.4 |-1 |6 |13 |
|100364|2019-02-27|-30.6 |1 |1 |14 |
+------+----------+--------+----+---------------+--------------+
I am creating a dataframe as per the given schema; after that I want to create a new dataframe by reordering the columns of the existing dataframe.
Is it possible to re-order the columns of a Spark dataframe?
object Demo extends Context {
  def main(args: Array[String]): Unit = {
    val emp = Seq((1,"Smith",-1,"2018","10","M",3000),
      (2,"Rose",1,"2010","20","M",4000),
      (3,"Williams",1,"2010","10","M",1000),
      (4,"Jones",2,"2005","10","F",2000),
      (5,"Brown",2,"2010","40","",-1),
      (6,"Brown",2,"2010","50","",-1)
    )
    val empColumns = Seq("emp_id","name","superior_emp_id","year_joined",
      "emp_dept_id","gender","salary")
    import sparkSession.sqlContext.implicits._
    val empDF = emp.toDF(empColumns: _*)
    empDF.show(false)
  }
}
Current DF:
+------+--------+---------------+-----------+-----------+------+------+
|emp_id|name |superior_emp_id|year_joined|emp_dept_id|gender|salary|
+------+--------+---------------+-----------+-----------+------+------+
|1 |Smith |-1 |2018 |10 |M |3000 |
|2 |Rose |1 |2010 |20 |M |4000 |
|3 |Williams|1 |2010 |10 |M |1000 |
|4 |Jones |2 |2005 |10 |F |2000 |
|5 |Brown |2 |2010 |40 | |-1 |
|6 |Brown |2 |2010 |50 | |-1 |
+------+--------+---------------+-----------+-----------+------+------+
I want the output as the following df, where the gender and salary columns are re-ordered:
New DF:
+------+--------+------+------+---------------+-----------+-----------+
|emp_id|name |gender|salary|superior_emp_id|year_joined|emp_dept_id|
+------+--------+------+------+---------------+-----------+-----------+
|1 |Smith |M |3000 |-1 |2018 |10 |
|2 |Rose |M |4000 |1 |2010 |20 |
|3 |Williams|M |1000 |1 |2010 |10 |
|4 |Jones |F |2000 |2 |2005 |10 |
|5 |Brown | |-1 |2 |2010 |40 |
|6 |Brown | |-1 |2 |2010 |50 |
+------+--------+------+------+---------------+-----------+-----------+
Just use select() to re-order the columns:
df = df.select('emp_id','name','gender','salary','superior_emp_id','year_joined','emp_dept_id')
The columns will appear according to the order of your select() arguments.
Scala way of doing it:
import org.apache.spark.sql.functions.col

//Order the column names as you want
val columns = Array("emp_id","name","gender","salary","superior_emp_id","year_joined","emp_dept_id")
  .map(col)
//Pass it to select
df.select(columns: _*)
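If you only want to move a few columns to the front and keep the rest in their existing order, the new ordering can also be derived from the dataframe's columns. A sketch using the empDF from the question:

import org.apache.spark.sql.functions.col

// Columns to pull to the front; everything else keeps its original relative order.
val front = Seq("emp_id", "name", "gender", "salary")
val reordered = front ++ empDF.columns.filterNot(c => front.contains(c))

empDF.select(reordered.map(col): _*).show(false)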
I have the below dataframe,
|id |lat |lng |timestamp |
+-----+---------+-----------+-------------------+
|user1|3.1357369|101.6863713|2017-11-06 19:33:16|
|user1|3.1360323|101.6874385|2017-11-06 21:10:25|
|user1|3.1363076|101.6902847|2017-11-07 01:39:07|
|user1|3.1357369|101.6863713|2017-11-07 01:39:07|
|user1|3.1357369|101.6863713|2017-11-07 04:16:30|
|user1|3.1357409|101.6860155|2017-11-07 05:05:03|
|user1|3.1357369|101.6863713|2017-11-07 05:05:03|
|user1|3.1357369|101.6863713|2017-11-07 06:13:07|
|user1|3.1360323|101.6874385|2017-11-07 06:13:07|
+-----+---------+-----------+-------------------+
And I can find the count (number of occurrences), precount (previous count) and pretsp (previous timestamp) using windows partitioned by id and by (id, timestamp).
val specDevicePartiton = Window.partitionBy("id").orderBy("timestamp")
val specDevicePartitonTimeStamp = Window.partitionBy("id", "timestamp").orderBy("timestamp")
val userProfileDF = deviceDF.withColumn("prelatitude", lag(deviceDF("lat"), 1).over(specDevicePartiton))
.withColumn("prelongitude", lag(deviceDF("lng"), 1).over(specDevicePartiton))
.withColumn("pretimestamp", lag(deviceDF("timestamp"), 1).over(specDevicePartiton))
.withColumn("pretsp", when((col("timestamp") === col("pretimestamp")), first(col("pretimestamp"))
.over(specDevicePartitonTimeStamp)).otherwise(col("pretimestamp")))
.withColumn("count", count("timestamp").over(specDevicePartitonTimeStamp))
.withColumn("previousCount", lag(col("count"), 1).over(specDevicePartiton))
.withColumn("precount", when((col("timestamp") === col("pretimestamp")), first(col("previousCount"))
.over(specDevicePartitonTimeStamp)).otherwise(col("previousCount")))
.withColumn("preFirstLat", when((col("precount").>(1)) && (col("count") === 1), first(col("lat")).over(specDevicePartitonPreTimeStamp.rowsBetween(-2, -1))))
.withColumn("preFirstLng", when((col("precount").>(1)) && (col("count") === 1), first(col("lng")).over(specDevicePartitonPreTimeStamp)))
.drop("prelatitude", "prelongitude", "nxtlatitude", "nxtlongitude", "pretimestamp")
You can find below the output dataframe,
|id |lat |lng |timestamp |pretsp |count|precount|preFirstLat|preFirstLng|
+-----+---------+-----------+-------------------+-------------------+-----+--------+-----------+-----------+
|user1|3.1357369|101.6863713|2017-11-06 19:33:16|2017-11-06 18:44:12|1 |1 |null |null |
|user1|3.1360323|101.6874385|2017-11-06 21:10:25|2017-11-06 19:33:16|1 |1 |null |null |
|user1|3.1357369|101.6863713|2017-11-07 01:39:07|2017-11-06 21:10:25|2 |1 |null |null |
|user1|3.1363076|101.6902847|2017-11-07 01:39:07|2017-11-06 21:10:25|2 |1 |null |null |
|user1|3.1357369|101.6863713|2017-11-07 04:16:30|2017-11-07 01:39:07|1 |2 |3.1357369 |101.686727 |
|user1|3.1357369|101.6863713|2017-11-07 05:05:03|2017-11-07 04:16:30|2 |1 |null |null |
|user1|3.1357409|101.6860155|2017-11-07 05:05:03|2017-11-07 04:16:30|2 |1 |null |null |
|user1|3.1360323|101.6874385|2017-11-07 06:13:07|2017-11-07 05:05:03|2 |2 |null |null |
|user1|3.1357369|101.6863713|2017-11-07 06:13:07|2017-11-07 05:05:03|2 |2 |null |null |
+-----+---------+-----------+-------------------+-------------------+-----+--------+-----------+-----------+
I want to find the first and last values from the previous timestamp group for the current row. The expected output would be like this:
|id |lat |lng |timestamp |pretsp |count|precount|preFirstLat|preFirstLng|
+-----+---------+-----------+-------------------+-------------------+-----+--------+-----------+-----------+
|user1|3.1357369|101.6863713|2017-11-06 19:33:16|2017-11-06 18:44:12|1 |null |null |null|
|user1|3.1360323|101.6874385|2017-11-06 21:10:25|2017-11-06 19:33:16|1 |1 |3.1357369 |101.6863713|
|user1|3.1357369|101.6863713|2017-11-07 01:39:07|2017-11-06 21:10:25|2 |1 |3.1360323 |101.6874385|
|user1|3.1363076|101.6902847|2017-11-07 01:39:07|2017-11-06 21:10:25|2 |1 |3.1360323 |101.6874385|
|user1|3.1357369|101.6863713|2017-11-07 04:16:30|2017-11-07 01:39:07|1 |2 |3.1357369 |101.686727 |
|user1|3.1357369|101.6863713|2017-11-07 05:05:03|2017-11-07 04:16:30|2 |1 |3.1357369 |101.6863713|
|user1|3.1357409|101.6860155|2017-11-07 05:05:03|2017-11-07 04:16:30|2 |1 |3.1357369 |101.6863713|
|user1|3.1360323|101.6874385|2017-11-07 06:13:07|2017-11-07 05:05:03|2 |2 |3.1357369 |101.6863713|
|user1|3.1357369|101.6863713|2017-11-07 06:13:07|2017-11-07 05:05:03|2 |2 |3.1357369 |101.6863713|
+-----+---------+-----------+-------------------+-------------------+-----+--------+-----------+-----------+
Logic:
Find the first lat and lng values from the previous rows for the current row, where the previous rows are the ones with the preceding timestamp (which may contain several rows with different lat and lng values).
Example: check the rows with timestamp = 2017-11-07 04:16:30 and 2017-11-07 05:05:03 (and so on) in the above output.
I have tried rowsBetween(start, end) with a dynamic range, using precount as the start and -1 as the end, but I do not know how to achieve this.
I have to do the same to figure out the last value; if I find a solution for the first value, I think the same approach will work for the last value.
Here is the simple example,
Id   val1  date        prefirstVal  preLastVal
1    10    2017-11-06  null         null
1    20    2017-11-07  10           10
1    25    2017-11-07  10           10
**1  30    2017-11-08  20           25**
1    35    2017-11-09  30           30
1    40    2017-11-09  30           30
Here, in the highlighted row (2017-11-08), prefirstVal 20 comes from the first entry of 2017-11-07 and preLastVal 25 comes from the last entry of 2017-11-07.
Thanks,
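For the simplified Id/val1/date example above, one possible way to get prefirstVal and preLastVal (a hedged sketch, not a tested answer; it assumes a DataFrame df with exactly those columns) is to compute the first and last value per date group and then lag that summary by one date:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{first, lag, last}

// First and last val1 within each (Id, date) group.
// Note: first/last in a groupBy depend on row order; this assumes the rows within a date
// already arrive in the desired order (otherwise an explicit ordering column is needed).
val perDate = df
  .groupBy("Id", "date")
  .agg(first("val1").as("firstVal"), last("val1").as("lastVal"))

// Shift each date's summary to the following date for the same Id.
val w = Window.partitionBy("Id").orderBy("date")
val shifted = perDate
  .withColumn("prefirstVal", lag("firstVal", 1).over(w))
  .withColumn("preLastVal", lag("lastVal", 1).over(w))
  .select("Id", "date", "prefirstVal", "preLastVal")

// Join back so every row of a date sees the previous date's first/last values.
val result = df.join(shifted, Seq("Id", "date"), "left")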