Spark Structured Streaming watermark has no effect

I do window-based aggregation with a watermark, but every time all of the data gets aggregated.
Relevant code:
val jsonDF = spark.readStream.format("json").schema(schema).load("data-source")
val result = jsonDF.withWatermark("dateTime", "10 minutes")
  .groupBy(window($"dateTime", "10 minutes", "5 minutes"), $"location")
  .sum("value")
val query = result.writeStream.outputMode("complete").format("console")
  .queryName("location-query").start()
Once the query has started, I begin putting files into the "data-source" directory.
The current time is 2022-12-29T10:44:30. The file contents:
[
  {"location": "A", "value": 2, "dateTime": "2022-12-01T15:44:21"},
  {"location": "B", "value": 3, "dateTime": "2022-12-28T16:44:21"},
  {"location": "A", "value": 7, "dateTime": "2022-12-29T10:44:21"}
]
result:
+--------------------+--------+----------+
| window|location|sum(value)|
+--------------------+--------+----------+
|{2022-12-28 16:40...| B| 3|
|{2022-12-29 10:35...| A| 7|
|{2022-12-01 15:35...| A| 2|
|{2022-12-01 15:40...| A| 2|
|{2022-12-28 16:35...| B| 3|
|{2022-12-29 10:40...| A| 7|
+--------------------+--------+----------+
expected result:
+--------------------+--------+----------+
| window|location|sum(value)|
+--------------------+--------+----------+
|{2022-12-29 10:35...| A| 7|
|{2022-12-29 10:40...| A| 7|
+--------------------+--------+----------+
As you can see, even very old data from 2022-12-01 is also aggregated.
Even if I wait some time, say 20 minutes, and then add new files with older dates, everything still gets aggregated.

OK, the problem was the output mode:
outputMode("complete")
In complete mode Spark can't drop the state, because it has to write the whole result every time, which makes sense.
Changing to append or update mode solved the issue (a sketch follows below).
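For completeness, here is a minimal PySpark sketch of the same pipeline in update mode (the schema string, app name and use of PySpark rather than the original Scala are my own illustrative assumptions). With update (or append) mode the watermark lets Spark drop state for windows that end more than 10 minutes before the maximum event time seen, so very late rows such as the 2022-12-01 record are ignored:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("watermark-demo").getOrCreate()

# illustrative schema matching the JSON files above
schema = "location STRING, value LONG, dateTime TIMESTAMP"

jsonDF = spark.readStream.format("json").schema(schema).load("data-source")

result = (jsonDF
    .withWatermark("dateTime", "10 minutes")
    .groupBy(F.window("dateTime", "10 minutes", "5 minutes"), "location")
    .sum("value"))

# "update" emits only changed windows and allows state older than the
# watermark to be dropped, unlike "complete" which must keep everything
query = (result.writeStream
    .outputMode("update")
    .format("console")
    .queryName("location-query")
    .start())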

Related

Spark add duplicate only when one column is same and other is different

I have data like this
[
{"uuid":"fdkhflds","key": "A", "id": "1"},
{"uuid":"ieuieiue","key": "A", "id": "2"},
{"uuid":"qwtriqrr","key": "A", "id": "3"},
{"uuid":"dhgfsddd","key": "A", "id": "1"},
{"uuid":"sdjhfdjh","key": "E", "id": "4"}
]
I want to add a flag to those rows where the key is the same but the id is different.
Expected output:
[
{"uuid":"fdkhflds","key": "A", "id": "1","de_dupe_required": 0},
{"uuid":"ieuieiue","key": "A", "id": "2","de_dupe_required": 1},
{"uuid":"qwtriqrr","key": "A", "id": "3","de_dupe_required": 1},
{"uuid":"dhgfsddd","key": "A", "id": "1","de_dupe_required": 0},
{"uuid":"sdjhfdjh","key": "E", "id": "4","de_dupe_required": 0}
]
Explanation:
Since the first and fourth records have the same key and id, no flag is needed.
Since the fifth record shares neither key nor id with any other record, no flag is needed for it either.
Since the second and third records have the same key but different ids, the flag should be 1.
You could achieve this with pyspark.sql.Window by generating a rank() for the keys ordered by the id, then marking de_dupe_required wherever the rank() is not 1.
from pyspark.sql import functions as F, Window

# rank rows within each key by id; rows with the lowest id get rank 1
window_spec = Window.partitionBy("key").orderBy("id")

df = (df
    .withColumn("dupe_rank", F.rank().over(window_spec))
    .withColumn("de_dupe_required",
                F.when(F.col("dupe_rank") == 1, F.lit(0)).otherwise(F.lit(1)))
    .drop("dupe_rank"))
df.show()
Output is:
+--------+---+---+----------------+
| uuid|key| id|de_dupe_required|
+--------+---+---+----------------+
|fdkhflds| A| 1| 0|
|dhgfsddd| A| 1| 0|
|ieuieiue| A| 2| 1|
|qwtriqrr| A| 3| 1|
|sdjhfdjh| E| 4| 0|
+--------+---+---+----------------+
Note that this will still work if there are combinations like two (A, 3) rows (as noted by thebluephantom): since we order by id, the rank will be greater than 1 for these rows.
Output with two (A, 3) rows:
+--------+---+---+----------------+
| uuid|key| id|de_dupe_required|
+--------+---+---+----------------+
|fdkhflds| A| 1| 0|
|dhgfsddd| A| 1| 0|
|ieuieiue| A| 2| 1|
|qwtriqrr| A| 3| 1|
|qwtriqrr| A| 3| 1|
|sdjhfdjh| E| 4| 0|
+--------+---+---+----------------+
The question is vague. This is my solution, which allows for two (A, 3) rows being possible and therefore does not behave as per the first answer.
%python
from pyspark.sql.functions import col, lit

df = spark.createDataFrame(
    [
        ("A", 1, "xyz"),
        ("A", 2, "xyz"),
        ("A", 3, "xyz"),
        ("A", 3, "xyz"),
        ("A", 1, "xyz"),
        ("E", 4, "xyz"),
        ("A", 9, "xyz")
    ],
    ["c1", "c2", "c3"]
)

# (c1, c2) combinations that occur exactly once
df2 = df.groupBy("c1", "c2").count().filter(col('count') == 1)
# keys that have only one such unique combination
df3 = df2.groupBy("c1").count().filter(col('count') == 1)
# unique combinations belonging to keys with more than one unique combination -> flag 1
df4 = df2.join(df3, df3.c1 == df2.c1, "leftanti").select("c1", "c2", lit(1)).toDF("c1", "c2", "ddr")
# all remaining rows -> flag 0
dfA = df.select("c1", "c2")
dfB = df4.select("c1", "c2")
df5 = dfA.exceptAll(dfB)
res = df4.withColumn("ddr", lit(1)).unionAll(df5.withColumn("ddr", lit(0)))
res.show()
returns:
+---+---+---+
| c1| c2|ddr|
+---+---+---+
| A| 2| 1|
| A| 9| 1|
| A| 1| 0|
| A| 1| 0|
| A| 3| 0|
| A| 3| 0|
| E| 4| 0|
+---+---+---+
It's about the algorithm; you can do the rest. It needs to be a step-wise approach.
You can do this using the count window function.
Using your input data, I've added a new row for id=3. AFAIU, in this case id=3 should also be marked 0, as there are now two occurrences of it.
from pyspark.sql import functions as func
from pyspark.sql.window import Window as wd

data_sdf. \
    withColumn('num_key_occurs', func.count('key').over(wd.partitionBy('key'))). \
    withColumn('num_id_occurs_inkey', func.count('id').over(wd.partitionBy('key', 'id'))). \
    withColumn('samekey_diffid',
               ((func.col('num_key_occurs') > 1) & (func.col('num_id_occurs_inkey') == 1)).cast('int')
               ). \
    show()
# +---+---+--------+--------------+-------------------+--------------+
# | id|key| uuid|num_key_occurs|num_id_occurs_inkey|samekey_diffid|
# +---+---+--------+--------------+-------------------+--------------+
# | 4| E|sdjhfdjh| 1| 1| 0|
# | 1| A|fdkhflds| 5| 2| 0|
# | 1| A|dhgfsddd| 5| 2| 0|
# | 2| A|ieuieiue| 5| 1| 1|
# | 3| A|qwtriqrr| 5| 2| 0|
# | 3| A|blahbleh| 5| 2| 0|
# +---+---+--------+--------------+-------------------+--------------+
Feel free to drop the count columns at the end.
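For instance, a small sketch of that last step, assuming the chained result above is assigned to a variable (the name flagged_sdf is hypothetical) instead of being shown directly:
flagged_sdf = data_sdf \
    .withColumn('num_key_occurs', func.count('key').over(wd.partitionBy('key'))) \
    .withColumn('num_id_occurs_inkey', func.count('id').over(wd.partitionBy('key', 'id'))) \
    .withColumn('samekey_diffid',
                ((func.col('num_key_occurs') > 1) & (func.col('num_id_occurs_inkey') == 1)).cast('int'))

# drop the helper count columns, keeping only the flag
flagged_sdf.drop('num_key_occurs', 'num_id_occurs_inkey').show()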

Pyspark udf to detect "Actors"

I have a matrix (dataframe). I want to find all the rows where the row and column intersect with a '1' (the 'character' row value matches the column name).
Example: Sam is an actor (he has a '1' in the 'actor' column, and 'actor' as the row's 'character' value). This is a row I would want returned.
df = spark.createDataFrame(
[
("actor", "sam", "1", "0", "0", "0", "0"),
("villan", "jack", "0", "0", "0", "0", "0"),
("actress", "rose", "0", "0", "0", "1", "0"),
("comedian", "mike", "0", "1", "1", "0", "1"),
("musician", "young", "1", "1", "1", "1", "0")
],
["character", "name", "actor", "villan", "comedian", "actress", "musician"]
)
+---------+-----+-----+------+--------+-------+--------+
|character| name|actor|villan|comedian|actress|musician|
+---------+-----+-----+------+--------+-------+--------+
| actor| sam| 1| 0| 0| 0| 0|
| villan| jack| 0| 0| 0| 0| 0|
| actress| rose| 0| 0| 0| 1| 0|
| comedian| mike| 0| 1| 1| 0| 1|
| musician|young| 1| 1| 1| 1| 0|
+---------+-----+-----+------+--------+-------+--------+
# imports needed for the udf
import pyspark.sql.functions as f
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# create function
def myMatch(needle, haystack):
    return haystack[needle]

# create udf
matched = udf(myMatch, StringType())  # your existing data is strings

# apply udf
df.select(
    df.name,
    # shortcut to add all columns to a struct so it can be passed to the udf
    matched(df.character, f.struct(*[df[col] for col in df.columns]))
        .alias("IsPlayingCharacter")
).show()
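If only the matching rows are wanted (as the question asks), a possible follow-up, assuming the select above is assigned to a dataframe instead of being shown (the name result_df is hypothetical):
# hypothetical: result_df is the dataframe produced by the select above
result_df.filter(f.col("IsPlayingCharacter") == "1").show()
# with the sample data this keeps sam (actor), rose (actress) and mike (comedian)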

Spark: how to process certain column content individually in dataframe?

The data structure is like this:
+---+----+----------------+
| id|name|            data|
+---+----+----------------+
|001| aaa|true,false,false|
|002| bbb|  true,true,true|
|003| ccc| false,true,true|
+---+----+----------------+
I want to map the results in data to their names by their corresponding order in the mapping table. In detail, the first step is to get the position of each false in data, and then get the name at that position in the mapping table.
For example, the first record has two false values at indices 2 and 3, so the mapping result is code2,code3. In the second record all values are true, so the mapping result is an empty string.
The mapping table: ("code1", "code2", "code3")
The expected result:
+---+----+-----------+
| id|name|       data|
+---+----+-----------+
|001| aaa|code2,code3|
|002| bbb|           |
|003| ccc|      code1|
+---+----+-----------+
Is it possible to achieve this in the dataframe?
If you are using Spark 3+, you can use the filter and transform functions as follows:
val df = Seq(
  ("001", "aaa", "true,false,false"),
  ("002", "bbb", "true,true,true"),
  ("003", "ccc", "false,true,true")
).toDF("id", "name", "data")

val cols = Seq("col1", "col2", "col3")

val dfNew = df.withColumn("data", split($"data", ","))
  .withColumn("mapping", arrays_zip($"data", typedLit(cols)))
  .withColumn("new1", filter($"mapping", (c: Column) => c.getField("data") === "false"))
  .withColumn("data", transform($"new1", (c: Column) => c.getField("1")))
  .drop("new1", "mapping")

dfNew.show(false)
Output:
+---+----+------------+
|id |name|data |
+---+----+------------+
|001|aaa |[col2, col3]|
|002|bbb |[] |
|003|ccc |[col1] |
+---+----+------------+
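If you also need the comma-separated string shown in the expected result rather than an array, one option is array_join. Here is a small PySpark sketch with a made-up stand-in dataframe (the same function exists in the Scala API):
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# illustrative stand-in for one row of the array-typed result above
demo = spark.createDataFrame([("001", "aaa", ["col2", "col3"])], ["id", "name", "data"])
demo.withColumn("data", F.array_join("data", ",")).show()
# the data column is now the string "col2,col3"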
The following should work, but be aware that it features a posexplode (explode an array with positional value), which can be a costly operation, especially if you have a huge dataset.
val df = Seq(
  ("001", "aaa", "true,false,false"),
  ("002", "bbb", "true,true,true"),
  ("003", "ccc", "false,true,true")
).toDF("id", "name", "data")

val codes = Seq(
  (0, "code1"),
  (1, "code2"),
  (2, "code3")
).toDF("code_id", "codes")

val df1 = df.select($"*", posexplode(split($"data", ",")))
  .join(codes, $"pos" === $"code_id")
  .withColumn("codes", when($"col" === "false", $"codes").otherwise(null))
//+---+----+----------------+---+-----+-------+-----+
//| id|name| data|pos| col|code_id|codes|
//+---+----+----------------+---+-----+-------+-----+
//|001| aaa|true,false,false| 0| true| 0| null|
//|001| aaa|true,false,false| 1|false| 1|code2|
//|001| aaa|true,false,false| 2|false| 2|code3|
//|002| bbb| true,true,true| 0| true| 0| null|
//|002| bbb| true,true,true| 1| true| 1| null|
//|002| bbb| true,true,true| 2| true| 2| null|
//|003| ccc| false,true,true| 0|false| 0|code1|
//|003| ccc| false,true,true| 1| true| 1| null|
//|003| ccc| false,true,true| 2| true| 2| null|
//+---+----+----------------+---+-----+-------+-----+
val finalDf = df1.groupBy($"id", $"name").agg(concat_ws(",", collect_list($"codes")).as("data"))
//+---+----+-----------+
//| id|name| data|
//+---+----+-----------+
//|002| bbb| |
//|001| aaa|code2,code3|
//|003| ccc| code1|
//+---+----+-----------+

Apache Spark: Get the first and last row of each partition

I would like to get the first and last row of each partition in spark (I'm using pyspark). How do I go about this?
In my code I repartition my dataset based on a key column using:
mydf.repartition(keyColumn).sortWithinPartitions(sortKey)
Is there a way to get the first row and last row for each partition?
Thanks
I would highly advise against working with partitions directly. Spark does a lot of DAG optimisation, so when you try executing specific functionality on each partition, all your assumptions about the partitions and their distribution might be completely false.
You do, however, seem to have a keyColumn and a sortKey, so I'd suggest doing the following:
import pyspark
import pyspark.sql.functions as f
w_asc = pyspark.sql.Window.partitionBy(keyColumn).orderBy(f.asc(sortKey))
w_desc = pyspark.sql.Window.partitionBy(keyColumn).orderBy(f.desc(sortKey))
res_df = mydf. \
withColumn("rn_asc", f.row_number().over(w_asc)). \
withColumn("rn_desc", f.row_number().over(w_desc)). \
where("rn_asc = 1 or rn_desc = 1")
The resulting dataframe will have 2 additional columns, where rn_asc=1 indicates the first row and rn_desc=1 indicates the last row.
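A possible follow-up sketch (the position column name is my own, not part of the answer) that labels each edge row and drops the helper columns; note that a key with a single row is both first and last and is labelled "first" here:
# assuming res_df, f and the window columns from above
labeled_df = res_df \
    .withColumn("position", f.when(f.col("rn_asc") == 1, f.lit("first")).otherwise(f.lit("last"))) \
    .drop("rn_asc", "rn_desc")
labeled_df.show()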
Scala: I think repartition is not by some key column; it requires an integer for how many partitions you want to set. I made a way to select the first and last row by using Spark's Window function.
First, this is my test data.
+---+-----+
| id|value|
+---+-----+
| 1| 1|
| 1| 2|
| 1| 3|
| 1| 4|
| 2| 1|
| 2| 2|
| 2| 3|
| 3| 1|
| 3| 3|
| 3| 5|
+---+-----+
Then, I use the Window function twice, because I cannot easily identify the last row directly, but the reverse order makes it easy.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val a = Window.partitionBy("id").orderBy("value")
val d = Window.partitionBy("id").orderBy(col("value").desc)
val df = spark.read.option("header", "true").csv("test.csv")

df.withColumn("marker", when(rank.over(a) === 1, "Y").otherwise("N"))
  .withColumn("marker", when(rank.over(d) === 1, "Y").otherwise(col("marker")))
  .filter(col("marker") === "Y")
  .drop("marker").show
The final result is then,
+---+-----+
| id|value|
+---+-----+
| 3| 5|
| 3| 1|
| 1| 4|
| 1| 1|
| 2| 3|
| 2| 1|
+---+-----+
Here is another approach using mapPartitions from the RDD API. We iterate over the elements of each partition until we reach the end. I would expect this iteration to be very fast since we simply pass over all the elements of the partition except the two edges. Here is the code:
df = spark.createDataFrame([
["Tom", "a"],
["Dick", "b"],
["Harry", "c"],
["Elvis", "d"],
["Elton", "e"],
["Sandra", "f"]
], ["name", "toy"])
def get_first_last(it):
    # handle empty partitions gracefully
    try:
        first = last = next(it)
    except StopIteration:
        return []
    for last in it:
        pass
    # Attention: if first equals last by reference return only one!
    if first is last:
        return [first]
    return [first, last]
# coalesce here is just for demonstration
first_last_rdd = df.coalesce(2).rdd.mapPartitions(get_first_last)
spark.createDataFrame(first_last_rdd, ["name", "toy"]).show()
# +------+---+
# | name|toy|
# +------+---+
# | Tom| a|
# | Harry| c|
# | Elvis| d|
# |Sandra| f|
# +------+---+
PS: Odd positions will contain the first element of each partition and the even ones the last item. Also note that the number of results will be (numPartitions * 2) - numPartitionsWithOneItem, which I expect to be relatively small, so you shouldn't worry about the cost of the extra createDataFrame call.

How can Spark write larger files without additional data?

I use Spark on EMR to process data and write it to S3. The data are partitioned by date. In the case where we re-process the same date's data, I use a custom-made function that compares the dataframe currently being processed with the data already in S3. Both datasets are merged so that no data is lost.
My issue is that between the first write and the second write of the same data, the total size of the data in S3 is different.
The first write results in 200 files of variable sizes (20-100KB) for a total of 74MB. The second write results in 200 files of fixed sizes (about 430KB each) for a total of 84MB.
I compared the data from the two writes by importing them into dataframes; the number of rows is similar and the data are the same (I used df1.exceptAll(df2)).
Why is there a difference in file sizing between first and second writes?
Where could this additional 10MB come from?
I do not use any repartition / coalesce.
Thanks in advance.
Maybe for some reason there are duplicates in the second df and your validation doesn't handle that scenario. In that case, you'll need to do the same verification, but inverting your df's.
Sample:
import spark.implicits._
val df1 = Seq(
(1,2,3),
(4,5,6)
).toDF("col_a", "col_b", "col_c")
val df2 = Seq(
(1,2,3),
(4,5,6),
(4,5,6)
).toDF("col_a", "col_b", "col_c")
df1.show()
df2.show()
// output:
+-----+-----+-----+
|col_a|col_b|col_c|
+-----+-----+-----+
| 1| 2| 3|
| 4| 5| 6|
+-----+-----+-----+
+-----+-----+-----+
|col_a|col_b|col_c|
+-----+-----+-----+
| 1| 2| 3|
| 4| 5| 6|
| 4| 5| 6|
+-----+-----+-----+
exceptAll validations:
df1.exceptAll(df2).show()
+-----+-----+-----+
|col_a|col_b|col_c|
+-----+-----+-----+
+-----+-----+-----+
df2.exceptAll(df1).show()
+-----+-----+-----+
|col_a|col_b|col_c|
+-----+-----+-----+
| 4| 5| 6|
+-----+-----+-----+
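A related sanity check (a small PySpark sketch, assuming the two S3 writes are loaded back as df1 and df2) is to count fully duplicated rows in each dataset directly:
# hypothetical: df1 and df2 are the first and second writes read back from S3
df1.groupBy(df1.columns).count().filter("count > 1").show()
df2.groupBy(df2.columns).count().filter("count > 1").show()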
