I have to process files that arrive daily. The records have the primary key (date, client_id, operation_id), so I created a stream that appends only new data to a Delta table:
operations\
.repartition('date')\
.writeStream\
.outputMode('append')\
.trigger(once=True)\
.option("checkpointLocation", "/mnt/sandbox/operations/_chk")\
.format('delta')\
.partitionBy('date')\
.start('/mnt/sandbox/operations')
This works fine, but I need to summarize this information grouped by (date, client_id), so I created another stream from the operations table to a new table:
summarized= spark.readStream.format('delta').load('/mnt/sandbox/operations')
summarized= summarized.groupBy('client_id','date').agg(<a lot of aggs>)
summarized.repartition('date')\
.writeStream\
.outputMode('complete')\
.trigger(once=True)\
.option("checkpointLocation", "/mnt/sandbox/summarized/_chk")\
.format('delta')\
.partitionBy('date')\
.start('/mnt/sandbox/summarized')
This works, but every time new data arrives in the operations table, Spark recalculates the whole summarized table. I tried using append mode on the second stream, but it requires a watermark, and the date column is DateType.
Is there a way to calculate only the new aggregates based on the group keys and append them to summarized?
You need to use Spark Structured Streaming window operations.
When you use windowed operations, rows are bucketed according to windowDuration and slideDuration. windowDuration is the length of each window, and slideDuration is how far the window slides between consecutive windows.
If you group by window(), you get a resulting window column alongside the other columns you group by, such as client_id.
For example:
from pyspark.sql.functions import window
windowDuration = "10 minutes"
slideDuration = "5 minutes"
summarized = before_summary.groupBy(before_summary.client_id,
window(before_summary.date, windowDuration, slideDuration)
).agg(<a lot of aggs>).orderBy('window')
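To make the bucketing concrete, here is a minimal plain-Python sketch (not Spark code) of how one event time maps to sliding windows. The helper name `assign_windows` is hypothetical, and the sketch assumes windows are aligned to the Unix epoch.

```python
from datetime import datetime, timedelta

def assign_windows(ts, window, slide):
    """Return the [start, end) windows of length `window`, spaced `slide` apart, that contain ts."""
    epoch = datetime(1970, 1, 1)  # assumption: window boundaries are aligned to the epoch
    # Start of the latest window that begins at or before ts
    last_start = ts - (ts - epoch) % slide
    windows = []
    start = last_start
    # A window contains ts when start <= ts < start + window
    while start > ts - window:
        windows.append((start, start + window))
        start -= slide
    return sorted(windows)

event = datetime(2021, 4, 19, 12, 7)
for start, end in assign_windows(event, timedelta(minutes=10), timedelta(minutes=5)):
    print(start.time(), "->", end.time())
```

With a 10-minute window and a 5-minute slide, each event falls into two overlapping windows, so it contributes to two aggregate rows.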
Related
In Spark Structured Streaming I am using foreachBatch() to manually maintain a sliding window consisting of the last 200000 entries. With every micro-batch I receive about 50 rows. On this sliding window I manually calculate my desired metrics, such as min, max, etc.
Spark also provides a sliding window function, but I have two problems with it:
The interval at which the sliding window is updated can only be configured as a time period; there seems to be no way to force an update with every incoming micro-batch. Is there a possibility that I am not seeing?
The bigger problem: It seems I can only do aggregations using grouping like:
val windowedCounts = words.groupBy(
window($"timestamp", "10 minutes", "5 minutes"),
$"word"
).count()
But I do not want to group over multiple sliding windows. I need something like the existing foreachBatch() that allows me to access not only the current batch but also/or the current sliding window. Is there something like that?
Thank you for your time!
You can probably use the flatMapGroupsWithState feature to achieve this.
Basically, you can keep only the information you need from previous batches in an internal state, update it with each batch, and use it in the next batch.
You can refer to the links below:
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#arbitrary-stateful-operations
https://jaceklaskowski.gitbooks.io/spark-structured-streaming/content/spark-sql-streaming-demo-arbitrary-stateful-streaming-aggregation-flatMapGroupsWithState.html
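The idea of carrying state from one batch to the next can be sketched in plain Python. This is not the flatMapGroupsWithState API itself, only an illustration of the logic; the class name `SlidingWindowState` and the capacity of 5 are made up for the example.

```python
from collections import deque

class SlidingWindowState:
    """Hypothetical helper: keeps the most recent `capacity` values across micro-batches."""

    def __init__(self, capacity):
        # A bounded deque evicts the oldest entries automatically
        self.values = deque(maxlen=capacity)

    def update(self, batch):
        """Add one micro-batch to the state and return metrics over the whole window."""
        self.values.extend(batch)
        if not self.values:
            return None  # nothing seen yet
        return {"count": len(self.values), "min": min(self.values), "max": max(self.values)}

state = SlidingWindowState(capacity=5)
print(state.update([3, 9, 4]))  # metrics over the first batch only
print(state.update([7, 1, 8]))  # metrics over the last 5 values seen so far
```

Each call sees the current batch plus whatever earlier batches left in the state, which is the access pattern the question asks for.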
Version: DBR 8.4 | Spark 3.1.2
I'm trying to get the top 500 rows per partition, but I can see from the query plan that it is sorting the entire data set (50K rows per partition) before eventually filtering to the rows I care about.
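from pyspark.sql import Window
from pyspark.sql import functions as F

max_rank = 500
# Parentheses let the chained calls continue across lines
ranking_order = (Window
    .partitionBy(['category', 'id'])
    .orderBy(F.col('primary').desc(), F.col('secondary'))
)
df_ranked = (df
    .withColumn('rank', F.row_number().over(ranking_order))
    .where(F.col('rank') <= max_rank)
)
df_ranked.explain()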
I read elsewhere that expressions such as df.orderBy(desc("value")).limit(n) are optimized by the query planner to use TakeOrderedAndProject and avoid sorting the entire table. Is there a similar approach I can use here to trigger an optimization and avoid fully sorting all partitions?
For context, right now my query is taking 3.5 hours on a beefy 4 worker x 40 core cluster, and the shuffle write time surrounding this query (including some projections not listed above) appears to be my bottleneck, so I'm trying to cut down the amount of data as early as possible.
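For illustration only, this is the kind of bounded selection I am hoping for, written as a plain-Python sketch rather than Spark code: per group, keep only the best n rows instead of sorting every row. The function name `top_n_per_group` and the sample rows are hypothetical.

```python
import heapq
from collections import defaultdict

def top_n_per_group(rows, n):
    """Return the top n rows per (category, id), ordered by primary desc, then secondary asc."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["category"], row["id"])].append(row)
    # nsmallest selects n rows per group without fully sorting the group
    return {
        key: heapq.nsmallest(n, members, key=lambda r: (-r["primary"], r["secondary"]))
        for key, members in groups.items()
    }

rows = [
    {"category": "a", "id": 1, "primary": 5, "secondary": 2},
    {"category": "a", "id": 1, "primary": 9, "secondary": 1},
    {"category": "a", "id": 1, "primary": 5, "secondary": 1},
    {"category": "b", "id": 2, "primary": 3, "secondary": 0},
]
top = top_n_per_group(rows, 2)
print(top[("a", 1)])
```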
The following function is supposed to join two DataFrames and return the number of checkouts per location. It is based on the Seattle Public Library data set.
def topKCheckoutLocations(checkoutDF: DataFrame, libraryInventoryDF: DataFrame, k: Int): DataFrame = {
checkoutDF
.join(libraryInventoryDF, "ItemType")
.groupBy("ItemBarCode", "ItemLocation") //grouping by ItemBarCode and ItemLocation
.agg(count("ItemBarCode")) //counting number of ItemBarCode for each ItemLocation
.withColumnRenamed("count(ItemBarCode)", "NumCheckoutItemsAtLocation")
.select($"ItemLocation", $"NumCheckoutItemsAtLocation")
}
When I run this, it takes ages to finish (40+ minutes), and I'm pretty sure it is not supposed to take more than a couple of minutes. Can I change the order of the calls to decrease computation time?
As the computation never finished, I never got to check whether the output is correct. I assume it is.
The checkoutDF has 3 million rows.
To improve Spark job performance:
Select only the required columns from each dataset before the join to decrease the data size.
Partition both datasets by the join column ("ItemType") to reduce shuffling during the join.
I am fetching data from Kafka topics and storing it in Delta Lake (Parquet) format. I want to find the number of messages fetched on a particular day.
My thought process: read the directory where the data is stored in Parquet format using Spark and apply a count on the files with ".parquet" for a particular day. This returns a count, but I am not sure whether that's the correct way.
Is this approach correct? Are there any other ways to count the number of messages fetched from a Kafka topic for a particular day (or duration)?
A message consumed from a topic has not only a key and a value but also other information, such as a timestamp, which can be used to track the consumer flow.
Timestamp
The timestamp is set by either the broker or the producer, depending on the topic configuration. If the topic's configured timestamp type is CREATE_TIME, the broker uses the timestamp in the producer record, whereas if the topic is configured with LOG_APPEND_TIME, the broker overwrites the timestamp with its local time while appending the record.
So if you keep the timestamp wherever you store the messages, you can easily track the message rate per day or per hour.
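As a rough sketch of that per-day count in plain Python, assuming the stored timestamps are epoch milliseconds and that days are counted in UTC (the function name `messages_per_day` and the sample values are made up for the example):

```python
from collections import Counter
from datetime import datetime, timezone

def messages_per_day(timestamps_ms):
    """Count messages per UTC calendar day from epoch-millisecond timestamps."""
    return Counter(
        datetime.fromtimestamp(ts / 1000, tz=timezone.utc).date().isoformat()
        for ts in timestamps_ms
    )

def to_ms(dt):
    """Convert a timezone-aware datetime to epoch milliseconds."""
    return int(dt.timestamp() * 1000)

# Hypothetical sample: two messages on 2021-04-19 and one on 2021-04-20
sample = [
    to_ms(datetime(2021, 4, 19, 8, 30, tzinfo=timezone.utc)),
    to_ms(datetime(2021, 4, 19, 23, 59, tzinfo=timezone.utc)),
    to_ms(datetime(2021, 4, 20, 0, 1, tzinfo=timezone.utc)),
]
print(messages_per_day(sample))
```

The same grouping can be done on the stored table by truncating the timestamp column to a date and counting per date.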
Alternatively, you can use a Kafka dashboard such as Confluent Control Center (licensed), Grafana (free), or any other tool to track the message flow.
In our case, while consuming and storing or processing messages, we also route the metadata of each message to Elasticsearch and visualize it through Kibana.
You can make use of the "time travel" capabilities that Delta Lake offers.
In your case you can do
// define location of delta table
val deltaPath = "file:///tmp/delta/table"
// travel back in time to the start and end of the day using the option 'timestampAsOf'
val countStart = spark.read.format("delta").option("timestampAsOf", "2021-04-19 00:00:00").load(deltaPath).count()
val countEnd = spark.read.format("delta").option("timestampAsOf", "2021-04-19 23:59:59").load(deltaPath).count()
// print out the number of messages stored in Delta Table within one day
println(countEnd - countStart)
See documentation on Query an older snapshot of a table (time travel).
Another way to retrieve this information, without counting the rows of two versions, is to use the Delta table history. This has several advantages: you don't read the whole dataset, and you can take updates and deletes into account as well, for example if you're doing a MERGE operation. (That is not possible by comparing .count on different versions, because an update replaces the existing value and a delete removes the row.)
For example, for plain appends, the following code counts all inserted rows written by normal append operations (for other operations, such as MERGE/UPDATE/DELETE, we may need to look at other metrics):
from delta.tables import *
df = DeltaTable.forName(spark, "ml_versioning.airbnb").history()\
.filter("timestamp > 'begin_of_day' and timestamp < 'end_of_day'")\
.selectExpr("cast(nvl(element_at(operationMetrics, 'numOutputRows'), '0') as long) as rows")\
.groupBy().sum()
First of all, I'm pretty new to spark, so apologies if I'm missing the obvious!
I'm developing a POC using Spark, which consumes a stream of data from Apache Kafka. My first goal was general moving averages, which was simple using the 'window' function in Spark to calculate some averages based on some keys.
My next goal is to calculate the 'delta' since the last window. So, if I have a parameter called 'noise', the 'window' function calculates avg(noise). But I also want to include the delta of avg(noise) between the current window and the previous window.
I tried using the lag function, but it does not appear to be supported:
Non-time-based windows are not supported on streaming DataFrames/Datasets
My question is: does Spark Structured Streaming provide some way to calculate this out of the box? I've contemplated using MapGroupsWithStateFunction, which I think might work, but if there is a built-in approach, that is obviously preferable.
My bit of code for this is:
WindowSpec w = Window.partitionBy(functions.col("window.start"), functions.col("keyName")).orderBy(functions.col("window.start"));
Dataset<Row> outputDS = inputDataset.withWatermark("timeStamp", "1 days")
.groupBy(functions.col("keyName"), functions.window(functions.col("timeStamp"), "1 hours", "1 hours"))
.avg("noise").withColumn("delta", functions.lag("avg(noise)", 1).over(w));
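To clarify what I mean by 'delta', here is the logic as a plain-Python sketch (not Spark code): keep the previous window's average per key as state and subtract it from the current one. The function name `window_deltas` and the sample values are hypothetical.

```python
def window_deltas(window_averages):
    """Take (key, window_start, avg) tuples in any order and return
    (key, window_start, avg, delta) tuples, where delta is None for each key's first window."""
    previous = {}  # per-key state: average of the last window seen
    out = []
    # Process windows in time order within each key
    for key, start, avg in sorted(window_averages, key=lambda r: (r[0], r[1])):
        prev = previous.get(key)
        out.append((key, start, avg, None if prev is None else avg - prev))
        previous[key] = avg
    return out

sample = [
    ("s1", "10:00", 4.0),
    ("s1", "11:00", 6.5),
    ("s2", "10:00", 1.0),
    ("s1", "12:00", 5.0),
]
for row in window_deltas(sample):
    print(row)
```

The state here is kept per key only, which is roughly what I would expect a MapGroupsWithStateFunction-based solution to hold.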