We have sensors starting and running for a random duration multiple times a day. The data from the sensors is sent to a Kafka topic, consumed by the Spark Structured Streaming API, and stored in a Delta Lake table. Now we have to identify and store sessions for each sensor in a separate Delta Lake table partitioned by device_id and sensor_id.
I tried Spark Structured Streaming with watermarking, but it didn't do much good.
from pyspark.sql import functions as F

stream2 = (spark.readStream.format('delta')
    .load('<FIRST_DELTA_LAKE_TABLE>')
    .select('device_id', 'json', 'time')
    .withWatermark('time', '10 minutes')
    .groupBy('device_id')
    .agg(F.min('time').alias('min_time'), F.max('time').alias('max_time'))
    .writeStream
    .format("delta"))
stream2.start("<SESSIONS_TABLE>")
The idea was to have a second table identifying the sessions from the incoming data and saving the start time and end time for each session and device. The streaming job runs, but nothing gets written to the sessions Delta table.
Any help on this will be appreciated.
By default, when you're writing a stream, it uses the append mode (see the doc). And in this mode, when you're using watermarks, the data is output only after the watermark is crossed, so there will be at least a 10-minute delay until you start to see data in the output.
But I think that the primary problem is that you're running a "global" aggregation, without a defined window or anything like that. Usually, for session detection, people use flatMapGroupsWithState, along the lines of what's described in the following blog post.
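I can't reproduce the blog post's exact code here, but a minimal sketch of that approach in Scala could look like the following. It assumes each record carries a device_id and an event-time column time, and closes a session after a hypothetical 10-minute gap with no new data for the device; <CHECKPOINT_PATH> is a placeholder you would fill in.

import java.sql.Timestamp
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}
import spark.implicits._

case class SensorEvent(device_id: String, time: Timestamp)
case class SessionState(start: Timestamp, end: Timestamp)
case class Session(device_id: String, min_time: Timestamp, max_time: Timestamp)

def updateSession(
    deviceId: String,
    events: Iterator[SensorEvent],
    state: GroupState[SessionState]): Iterator[Session] = {
  if (state.hasTimedOut) {
    // No new data for this device within the gap: close the session and emit it.
    val s = state.get
    state.remove()
    Iterator(Session(deviceId, s.start, s.end))
  } else {
    // Extend the current session with the min/max times seen so far.
    val times = events.map(_.time).toSeq
    val old = state.getOption
    val start = (times ++ old.map(_.start)).minBy(_.getTime)
    val end = (times ++ old.map(_.end)).maxBy(_.getTime)
    state.update(SessionState(start, end))
    state.setTimeoutDuration("10 minutes")  // keep the session open while data keeps arriving
    Iterator.empty
  }
}

val sessions = spark.readStream.format("delta")
  .load("<FIRST_DELTA_LAKE_TABLE>")
  .selectExpr("device_id", "time")
  .as[SensorEvent]
  .groupByKey(_.device_id)
  .flatMapGroupsWithState(OutputMode.Append, GroupStateTimeout.ProcessingTimeTimeout)(updateSession)

sessions.writeStream
  .format("delta")
  .outputMode("append")
  .option("checkpointLocation", "<CHECKPOINT_PATH>")
  .start("<SESSIONS_TABLE>")

With ProcessingTimeTimeout the session is closed based on wall-clock gaps; an event-time based variant would use GroupStateTimeout.EventTimeTimeout together with the watermark instead.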
Related
I want to consume a Kafka topic as a batch: read the topic once an hour and pick up only the latest hour's data.
val readStream = existingSparkSession
.read
.format("kafka")
.option("kafka.bootstrap.servers", hostAddress)
.option("subscribe", "kafka.raw")
.load()
But this always reads the first 20 rows, and those rows start from the very beginning of the topic, so it never picks up the latest data.
How can I read the latest rows on an hourly basis using Scala and Spark?
If you read Kafka messages in batch mode, you need to take care of the bookkeeping of which data is new and which is not yourself. Remember that Spark will not commit any messages back to Kafka, so every time you restart the batch job it will read from the beginning (or based on the setting startingOffsets, which defaults to earliest for batch queries).
For your scenario where you want to run the job once every hour and only process the new data that arrived in Kafka during the previous hour, you can make use of the writeStream trigger option Trigger.Once for streaming queries.
There is a blog post from Databricks that nicely explains why a streaming query with Trigger.Once should be preferred over a batch query.
The main point being:
"When you’re running a batch job that performs incremental updates, you generally have to deal with figuring out what data is new, what you should process, and what you should not. Structured Streaming already does all this for you."
Make sure that you also set the option "checkpointLocation" in your writeStream. In the end, you can have a simple cron job that submits your streaming job once an hour.
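Putting that together, a minimal sketch (reusing existingSparkSession, hostAddress and the topic from the question; the Parquet sink, output path and checkpoint path are just illustrative placeholders):

import org.apache.spark.sql.streaming.Trigger

val query = existingSparkSession
  .readStream                                              // readStream instead of read
  .format("kafka")
  .option("kafka.bootstrap.servers", hostAddress)
  .option("subscribe", "kafka.raw")
  .load()
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("parquet")                                       // illustrative sink
  .option("path", "/data/kafka_raw")                       // hypothetical output path
  .option("checkpointLocation", "/checkpoints/kafka_raw")  // tracks consumed offsets across runs
  .trigger(Trigger.Once())
  .start()

query.awaitTermination()  // the query stops by itself once the backlog is processed

Each hourly run then picks up only the offsets that have not been processed by a previous run.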
We have some data (millions of rows) in Hive tables which comes in every day. The next day, once the overnight ingestion is complete, different applications query us for the data (using SQL).
We take this SQL and make a call on Spark:
spark.sqlContext.sql(statement) // hive-metastore integration is enabled
This is causing too much memory usage on the Spark driver. Can we use Spark Streaming (or Structured Streaming) to stream the results in a piped fashion rather than collecting everything on the driver and then sending it to the clients?
We don't want to send out the data as soon as it comes in (as in typical streaming apps), but want to stream data to clients when they ask (pull) for it.
IIUC..
Spark Streaming is mainly designed to process streaming data by converting it into micro-batches of milliseconds to seconds.
You can look at streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) => ... }, which provides very good functionality for Spark to write the streaming processed output to a sink in a micro-batch manner.
Nevertheless, Spark Structured Streaming doesn't have a standard JDBC source defined to read from.
Work out an option to store the Hive underlying files directly in a compressed and structured manner and transfer them as-is, rather than selecting through spark.sql, if every client needs the same/similar data; or partition them based on the WHERE condition of the spark.sql query and transfer only the needed files.
Source:
Structured Streaming queries are processed using a micro-batch processing engine, which processes data streams as a series of small batch jobs thereby achieving end-to-end latencies as low as 100 milliseconds and exactly-once fault-tolerance guarantees.
ForeachBatch:
foreachBatch(...) allows you to specify a function that is executed on the output data of every micro-batch of a streaming query. Since Spark 2.4, this is supported in Scala, Java and Python. It takes two parameters: a DataFrame or Dataset that has the output data of a micro-batch and the unique ID of the micro-batch.
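As a rough sketch of how that could be used to serve results for clients to pull (the JDBC target and connection details below are hypothetical placeholders, and streamingDF is the streaming DataFrame from the snippet above):

import org.apache.spark.sql.DataFrame

def writeBatch(batchDF: DataFrame, batchId: Long): Unit = {
  // Persist every micro-batch to a table that clients can query later.
  batchDF.write
    .mode("append")
    .format("jdbc")
    .option("url", "jdbc:postgresql://dbhost:5432/results_db")
    .option("dbtable", "query_results")
    .option("user", "spark")
    .option("password", "***")
    .save()
}

streamingDF.writeStream
  .foreachBatch(writeBatch _)
  .option("checkpointLocation", "/checkpoints/query_results")  // hypothetical path
  .start()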
For a Spark Structured Streaming job, one input comes from a Kafka topic while the second input is a file (which is refreshed every 5 minutes by a Python API). I need to join these two inputs and write the result to a Kafka topic.
The issue I am facing is that when the second input file is being refreshed and the Spark streaming job is reading the file at the same time, I get the error below:
File file:/home/hduser/code/new/collect_ip1/part-00163-55e17a3c-f524-4dac-89a4-b9e12f1a79df-c000.csv does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by recreating the Dataset/DataFrame involved.
Any help will be appreciated.
Use HBase as your store for the static data. It is more work for sure, but it allows for concurrent updating.
Where I work, all Spark Streaming uses HBase for data lookups. Far faster. What if you have 100M customers for a micro-batch of 10k records? I know it was a lot of work initially.
See https://medium.com/@anchitsharma1994/hbase-lookup-in-spark-streaming-acafe28cb0dc
If you have a small static ref table, then a static join is fine, but you also have updating, which causes issues.
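Not the exact code from the linked post, but a rough sketch of an HBase lookup per partition of each micro-batch (kafkaEvents, the Event case class, and the "ref_table"/"cf"/"col" names are all hypothetical):

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
import org.apache.hadoop.hbase.util.Bytes
import spark.implicits._

case class Event(deviceId: String, payload: String)

// kafkaEvents: Dataset[Event] parsed from the Kafka stream (assumed to exist)
val enriched = kafkaEvents.mapPartitions { events =>
  // One HBase connection per partition, opened on the executor.
  val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
  val table = conn.getTable(TableName.valueOf("ref_table"))
  val out = events.map { e =>
    val res = table.get(new Get(Bytes.toBytes(e.deviceId)))
    val ref = Option(res.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col")))
      .map(Bytes.toString).getOrElse("")
    (e.deviceId, e.payload, ref)
  }.toList  // drain the iterator before closing the connection
  table.close()
  conn.close()
  out.iterator
}

Because the lookup goes to HBase at processing time, the reference data can be updated concurrently without the stale-file problem of a refreshed static file.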
How do I set up a Spark job to pick up a Kafka topic from a specific offset based on a timestamp? Let's say that I need to get all data from a Kafka topic starting 6 hours ago.
Kafka does not work that way. You are treating Kafka as something you can query by a parameter other than the offset; besides, keep in mind that a topic can have more than one partition, and each partition has its own offsets. Maybe you can use another relational store to map offset/partition to timestamp, which is a bit risky. Thinking of an Akka Streams Kafka consumer, for example, each of your requests by timestamp would have to be sent via another topic to activate your consumers (each of them with one or more partitions assigned), which query for the specific offset, produce, and merge. With Spark, you can adjust your consumer strategies for each job, but the process should be the same.
Another thing: if your Kafka cluster recovers, it's possible that you would need to read the whole topic to update your (timestamp/offset) pairs. All of this may sound a little bit weird, and maybe it would be better to store your topic in Cassandra (for example) so you can query it later.
The answers provided here seem to be dated. As per the latest API documentation for Spark 3.x.x given here, Structured Streaming Kafka Integration, there are quite a few flexible ways in which messages can be retrieved from Kafka within a specified window.
Here is example code for the batch API that gets messages from all partitions that fall within the window specified via startingTimestamp and endingTimestamp, given in epoch time with millisecond precision.
val df = spark
.read
.format("kafka")
.option("kafka.bootstrap.servers", bootstrapServers)
.option("subscribe", topic)
.option("startingTimestamp", 1650418006000)
.option("endingTimestamp", 1650418008000)
.load()
Kafka is append-only log storage. You can start consuming from a particular offset in a partition, given that you know the offset. Consumption is super fast; you can have a design where you start from the smallest offset and only start applying your logic once you have come across a message (which could have a timestamp field to check for).
I am interested in using Spark Structured Streaming for real-time data processing, using data from, for example, the last ~24 hours of running; however, I am not able to find a correct solution for this problem.
Some useful information about the entire situation:
Data is constantly flowing in as input for Spark, so the stream is active 24/7.
Spark does some actions and then writes some data to files (e.g. Parquet).
Watermarking is used to reduce state size.
Someone wants to work on only the most recent data returned from Spark Structured Streaming (for example, all data from the last 24 hours) to get a quick view of what happened recently and for further, very specific analysis.
From what I understand, watermarking helps manage state size so Spark does not hold the entire data about the state. This is a good thing and solves one problem with running 24/7.
The other problem is the output data. Currently Spark appends the data and nothing else, which makes it grow bigger and bigger. Using the memory sink for testing creates memory problems. I didn't try it with the file sink because it creates one file for each record (ugh), so there's a risk of using up all available inodes in the system extremely quickly. I can create one file per window with the file sink.
So my question is:
Is it possible to force Spark Structured Streaming to delete output data after some amount of time, when it is no longer needed? I want to keep output data only from, for example, the last 24 hours. Is there any built-in solution, or do I need to do it on my own? If I did it on my own, wouldn't the checkpoint data and Spark metadata get corrupted?
Using watermarking will allow us to keep only the last 24 hours of state:
df.withWatermark("timestamp", "24 hours") // where timestamp is your event-time field
From the Spark doc:
"In Spark 2.1, we have introduced watermarking, which lets the engine automatically track the current event time in the data and attempt to clean up old state accordingly."
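For reference, the watermark is usually paired with a windowed aggregation so that state older than the watermark can be dropped; the column names and durations below are just illustrative:

import org.apache.spark.sql.functions.{col, window}

val windowed = df
  .withWatermark("timestamp", "24 hours")
  .groupBy(window(col("timestamp"), "1 hour"), col("device_id"))
  .count()

In append mode, each window is emitted once the watermark passes its end, and the corresponding state is removed.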