Write unique values in Spark while keeping old values - apache-spark

I have a Spark job which is scheduled to run periodically.
When I write the result DataFrame to a data target (S3, HDFS, DB...), I want what Spark writes to be free of duplicates for a specific column.
EXAMPLE:
Let's say that MY_ID is the unique column.
1st execution:
--------------
|MY_ID|MY_VAL|
--------------
| 1 | 5 |
| 2 | 9 |
| 3 | 6 |
--------------
2nd execution:
--------------
|MY_ID|MY_VAL|
--------------
| 2 | 9 |
| 3 | 2 |
| 4 | 4 |
--------------
What I am expecting to find in the Data Target after the 2 executions is something like this:
--------------
|MY_ID|MY_VAL|
--------------
| 1 | 5 |
| 2 | 9 |
| 3 | 6 |
| 4 | 4 |
--------------
The expected output is the result of the first execution with the results of the second execution appended. When a value for MY_ID already exists, the old one is kept and the result of the new execution is discarded (here the 2nd execution wants to write MY_VAL 2 for MY_ID 3; since that record already exists from the 1st execution, the new record is discarded).
So the distinct() function is not enough to guarantee this condition. The uniqueness of the MY_ID column must be preserved in the written output as well.
Is there any solution that can guarantee this property at a reasonable computational cost? (It is basically the same idea as UNIQUE in relational databases.)

You can do a full outer join on the first and second iterations.
val joined = firstIteration.join(secondIteration, Seq("MY_ID"), "fullouter")
scala> joined.show
+-----+------+------+
|MY_ID|MY_VAL|MY_VAL|
+-----+------+------+
| 1| 5| null|
| 3| 6| 2|
| 4| null| 4|
| 2| 9| 9|
+-----+------+------+
From the resulting table, if firstIteration's MY_VAL has a value, use it as is. If it is null (which indicates the key occurs only in the second iteration), use the value from secondIteration's MY_VAL.
scala> joined.withColumn("result", when(firstIteration.col("MY_VAL").isNull, secondIteration.col("MY_VAL"))
         .otherwise(firstIteration.col("MY_VAL")))
         .drop("MY_VAL")
         .show
+-----+------+
|MY_ID|result|
+-----+------+
| 1| 5|
| 3| 6|
| 4| 4|
| 2| 9|
+-----+------+
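A slightly more compact variant of the same join (just a sketch, assuming the firstIteration and secondIteration DataFrames from above) is to use coalesce, which prefers the old value whenever both executions have one:
import org.apache.spark.sql.functions.{coalesce, col}

// old wins when both executions contain the same MY_ID
val merged = firstIteration.as("old")
  .join(secondIteration.as("new"), Seq("MY_ID"), "fullouter")
  .select(col("MY_ID"), coalesce(col("old.MY_VAL"), col("new.MY_VAL")).as("MY_VAL"))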

Not sure whether you are using Scala or Python, but have a look at the dropDuplicates function, which allows you to specify one or more columns:
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset
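For example, a minimal sketch (assuming both executions are available as DataFrames with the same schema; the names previouslyWritten and newResults are just placeholders): union them and drop duplicates on MY_ID. Note that dropDuplicates keeps an arbitrary row per key, so on its own it does not guarantee that the old value wins; combine it with the join approach above if that matters.
// previouslyWritten: what is already in the target, newResults: the current run
val deduplicated = previouslyWritten.union(newResults).dropDuplicates("MY_ID")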

Related

Aggregation and feature extraction at the same time using pyspark

I have this dataset:
+---------+------+------------------+--------------------+-------------+
| LCLid|season| sum(KWH/hh)| avg(KWH/hh)|Acorn_grouped|
+---------+------+------------------+--------------------+-------------+
|MAC000023|autumn|4067.4269999000007| 0.31550007755972703| 4|
|MAC000128|spring| 961.2639999999982| 0.10876487893188484| 2|
|MAC000012|summer| 121.7360000000022|0.027548314098212765| 0|
|MAC000053|autumn| 2289.498000000006| 0.17883908764255632| 2|
|MAC000121|spring| 1893.635999900008| 0.21543071671217384| 1|
For every consumer ID we have the sum and average consumption for every season; Acorn_grouped is fixed for each consumer.
I want to aggregate by the ID and at the same time extract those new features, with rounded numbers, to finally have this data:
+---------+-------------+-------------------+------------------+------------------+------------------
| LCLid|Acorn_grouped|autumn_avg(KWH/hh) |autumn_sum(KWH/hh)|autumn_max(KWH/hh)|spring_avg(KWH/hh)
+---------+-------------+-------------------+------------------+------------------+-----------------
|MAC000023| 4| | | |
|MAC000128| 2| | | |
|MAC000012| 0| | | |
|MAC000053| 2| | | |
|MAC000121| 1| | | |
You can do a pivot:
import pyspark.sql.functions as F

result = df.groupBy('LCLid', 'Acorn_grouped') \
           .pivot('season') \
           .agg(
               F.round(F.first('sum(KWH/hh)')).alias('sum(KWH/hh)'),
               F.round(F.first('avg(KWH/hh)')).alias('avg(KWH/hh)')
           ).fillna(0)  # replace nulls with zero - you can skip this if you want to keep nulls

Difference between explode and explode_outer

What is the difference between explode and explode_outer? The documentation for both functions is the same and also the examples for both functions are identical:
SELECT explode(array(10, 20));
10
20
and
SELECT explode_outer(array(10, 20));
10
20
The Spark source suggests that there is a difference between the two functions
expression[Explode]("explode"),
expressionGeneratorOuter[Explode]("explode_outer")
but what is the effect of expressionGeneratorOuter compared to expression?
explode creates a row for each element in the array or map column and skips rows whose array or map is null or empty, whereas explode_outer also produces a row (with a null value) for a null or empty array or map.
For example, for the following dataframe-
id | name | likes
_______________________________
1 | Luke | [baseball, soccer]
2 | Lucy | null
explode gives the following output-
id | name | likes
_______________________________
1 | Luke | baseball
1 | Luke | soccer
Whereas explode_outer gives the following output-
id | name | likes
_______________________________
1 | Luke | baseball
1 | Luke | soccer
2 | Lucy | null
SELECT explode(col1) from values (array(10,20)), (null)
returns
+---+
|col|
+---+
| 10|
| 20|
+---+
while
SELECT explode_outer(col1) from values (array(10,20)), (null)
returns
+----+
| col|
+----+
| 10|
| 20|
|null|
+----+
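The same behaviour can be reproduced with the DataFrame API. A quick sketch in spark-shell (where spark.implicits._ is already in scope), using the example data above:
import org.apache.spark.sql.functions.{explode, explode_outer}

val people = Seq(
  (1, "Luke", Seq("baseball", "soccer")),
  (2, "Lucy", null)
).toDF("id", "name", "likes")

people.select($"id", $"name", explode($"likes")).show()        // Lucy's row is dropped
people.select($"id", $"name", explode_outer($"likes")).show()  // Lucy's row is kept with a null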

pyspark detect change of categorical variable

I have a spark dataframe consisting of two columns.
+-----------------------+-----------+
| Metric|Recipe_name|
+-----------------------+-----------+
| 100. | A |
| 200. | A |
| 300. | A |
| 10. | A |
| 20. | A |
| 10. | B |
| 20. | B |
| 10. | A |
| 20. | A |
| .. | .. |
| .. | .. |
| 10. | B |
The dataframe is time ordered (you can imagine there is an increasing timestamp column). I need to add a column 'Cycle'. There are two scenarios in which a new cycle begins:
If the same recipe is running, let's say recipe 'A', and the value of Metric decreases with respect to the last row, then a new cycle begins.
If we switch from the current recipe 'A' to a second recipe 'B' and later switch back to recipe 'A', a new cycle for recipe 'A' begins.
So in the end I would like to have a column 'Cycle' which looks like this:
+-----------------------+-----------+-----------+
| Metric|Recipe_name| Cycle|
+-----------------------+-----------+-----------+
| 100. | A | 0 |
| 200. | A | 0 |
| 300. | A | 0 |
| 10. | A | 1 |
| 20. | A | 1 |
| 10. | B | 0 |
| 20. | B | 0 |
| 10. | A | 2 |
| 20. | A | 2 |
| .. | .. | 2 |
| .. | .. | 2 |
| 10. | B | 1 |
So recipe A has cycle 0, then the metric decreases and the cycle changes to 1.
Then a new recipe B starts, so it gets a new cycle 0.
When we get back to recipe A, a new cycle begins for recipe A, and with respect to its last cycle number it has cycle 2 (and similarly for recipe B).
In total there are 200 recipes.
Thanks for the help.
Replace my order column with your ordering column. The condition is checked with the lag function over a window partitioned by Recipe_name, and the resulting flags are turned into cycle numbers with a running sum.
from pyspark.sql import Window, functions as F

w = Window.partitionBy('Recipe_name').orderBy('order')

# flag the start of each new cycle, then turn the flags into cycle ids with a running sum
df.withColumn('Cycle', F.when(F.col('Metric') < F.lag('Metric', 1, 0).over(w), 1).otherwise(0)) \
  .withColumn('Cycle', F.sum('Cycle').over(w)) \
  .orderBy('order') \
  .show()
+------+-----------+-----+
|Metric|Recipe_name|Cycle|
+------+-----------+-----+
| 100| A| 0|
| 200| A| 0|
| 300| A| 0|
| 10| A| 1|
| 20| A| 1|
| 10| B| 0|
| 20| B| 0|
| 10| A| 2|
| 20| A| 2|
| 10| B| 1|
+------+-----------+-----+

Calculate Spark column value depending on another row value on the same column

I'm working on Apache Spark 2.3.0 (Cloudera 4) and I have an issue processing a DataFrame.
I've got this input dataframe:
+---+---+----+
| id| d1| d2 |
+---+---+----+
| 1| | 2.0|
| 2| |-4.0|
| 3| | 6.0|
| 4|3.0| |
+---+---+----+
And I need this output:
+---+---+----+----+
| id| d1| d2 | r |
+---+---+----+----+
| 1| | 2.0| 7.0|
| 2| |-4.0| 5.0|
| 3| | 6.0| 9.0|
| 4|3.0| | 3.0|
+---+---+----+----+
That is, from an iterative perspective: take the row with the biggest id (4) and put its d1 value in the r column, then take the next row (3) and put r[4] + d2[3] in its r column, and so on.
Is it possible to do something like that in Spark? I need a value computed for one row in order to calculate the value of another row.
How about this? The important bit is sum($"r1").over(Window.orderBy($"id".desc)), which calculates a cumulative sum of a column. Other than that, I'm creating a couple of helper columns to get the max id and get the ordering right.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{max, sum, when}

val result = df
  .withColumn("max_id", max($"id").over(Window.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)))
  .withColumn("r1", when($"id" === $"max_id", $"d1").otherwise($"d2"))
  .withColumn("r", sum($"r1").over(Window.orderBy($"id".desc)))
  .drop($"max_id").drop($"r1")
  .orderBy($"id")
result.show
+---+----+----+---+
| id| d1| d2| r|
+---+----+----+---+
| 1|null| 2.0|7.0|
| 2|null|-4.0|5.0|
| 3|null| 6.0|9.0|
| 4| 3.0|null|3.0|
+---+----+----+---+
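If, as in the sample data, exactly one of d1 and d2 is populated per row, the helper columns can be dropped in favour of coalesce. This is just a sketch of that simplification (it relies on that property of the data), not part of the original answer:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{coalesce, sum}

// cumulative sum from the highest id downwards, taking whichever of d1/d2 is set
val result2 = df
  .withColumn("r", sum(coalesce($"d1", $"d2")).over(Window.orderBy($"id".desc)))
  .orderBy($"id")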

How to count rows matching interrelated conditions?

I have table that looks like this:
TripID | Name | State
1 | John | OH
2 | John | OH
3 | John | CA
4 | John | OH
1 | Mike | CA
2 | Mike | CA
3 | Mike | OH
I'd like to count the people who travelled to OH first followed by CA.
In the above case it would be John only, so the answer should be 1.
How can I impose a certain order in the SQL filtering to get this result?
I may have misunderstood your question, but if you're asking about:
how many people travelled to OH first and then to CA
then a sketch of the query could be as follows:
scala> trips.show
+------+----+-----+
|tripid|name|state|
+------+----+-----+
| 1|John| OH|
| 2|John| OH|
| 3|John| CA|
| 4|John| OH|
| 1|Mike| CA|
| 2|Mike| CA|
| 3|Mike| OH|
+------+----+-----+
scala> trips.orderBy("name", "tripid").groupBy("name").agg(collect_list("state")).show
+----+-------------------+
|name|collect_list(state)|
+----+-------------------+
|John| [OH, OH, CA, OH]|
|Mike| [CA, CA, OH]|
+----+-------------------+
As I see it, you have three options:
(hard) Write a user-defined aggregate function (UDAF) that would do the aggregation (and would replace collect_list with an itinerary containing distinct states).
(easier) Write a user-defined function (UDF) that would do a similar job to the UDAF above (but after collect_list has collected the values).
(easy) Use built-in functions (like explode and/or window functions).
Let's do the easy solution (not necessarily the most efficient one!).
It turns out that the earlier groupBy is not really necessary; you can handle it using window aggregation alone (used twice).
import org.apache.spark.sql.expressions.Window
val byName = Window.partitionBy("name").orderBy("tripid")
val distinctStates = trips.withColumn("rank", rank over byName).dropDuplicates("name", "state").orderBy("name", "rank")
scala> distinctStates.show
+------+----+-----+----+
|tripid|name|state|rank|
+------+----+-----+----+
| 1|John| OH| 1|
| 3|John| CA| 3|
| 1|Mike| CA| 1|
| 3|Mike| OH| 3|
+------+----+-----+----+
// rank again but this time use the pre-calculated distinctStates dataset
val distinctStatesRanked = distinctStates.withColumn("rank", rank over byName).orderBy("name", "rank")
scala> distinctStatesRanked.show
+------+----+-----+----+
|tripid|name|state|rank|
+------+----+-----+----+
| 1|John| OH| 1|
| 3|John| CA| 2|
| 1|Mike| CA| 1|
| 3|Mike| OH| 2|
+------+----+-----+----+
val left = distinctStatesRanked.filter($"state" === "OH").filter($"rank" === 1)
val right = distinctStatesRanked.filter($"state" === "CA").filter($"rank" === 2)
scala> left.join(right, "name").show
+----+------+-----+----+------+-----+----+
|name|tripid|state|rank|tripid|state|rank|
+----+------+-----+----+------+-----+----+
|John| 1| OH| 1| 3| CA| 2|
+----+------+-----+----+------+-----+----+
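To get the requested count, finish with an action over the join; for the sample data it returns 1 (only John):
scala> left.join(right, "name").count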
