How to handle deletes in Spark Streaming - apache-spark

In our Spark Streaming pipeline, upstream deletion of PII data happens every 2 weeks, and the source location is year/month/day partitioned. This deletion process removes data across partitions, flooding our downstream streaming job with loads of data and impairing it.
We are already setting ignoreChanges: true. Is there anything that could be done to fix this issue?
Thanks
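Assuming the source is a Delta table (which the ignoreChanges option implies), here is a minimal PySpark sketch of how those source options are typically set. The paths are hypothetical, and note that ignoreDeletes only suppresses deletes that fall on partition boundaries, so it may not fully cover cross-partition PII deletes:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pii-delete-tolerant-stream").getOrCreate()

# Delta streaming source with delete/change tolerance (paths are hypothetical).
stream_df = (
    spark.readStream
    .format("delta")
    .option("ignoreDeletes", "true")   # skip deletes that align with partition boundaries
    .option("ignoreChanges", "true")   # re-emits rewritten files; downstream may see duplicates
    .load("/mnt/source/year_month_day_partitioned_table")
)

query = (
    stream_df.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/pii_stream")
    .start("/mnt/sink/downstream_table")
)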

Related

Spark Structured Streaming - How to ignore checkpoint?

I'm reading messages from a Kafka stream using micro-batching (readStream), processing them, and writing the results to another Kafka topic via writeStream. The job (streaming query) is designed to run "forever", processing micro-batches of 10 seconds (of processing time). The checkpointDirectory option is set, since Spark requires checkpointing.
However, when I try to submit another query with the same source stream (same topic etc.) but a possibly different processing algorithm, Spark finishes the previously running query and creates a new one with the same ID (so it starts from the very same offset on which the previous job "finished").
How do I tell Spark that the second job is different from the first one, so there is no need to restore from the checkpoint (i.e. the intended behaviour is to create a completely new streaming query not connected to the previous one, and keep the previous one running)?
You can make the two streaming queries independent by setting the checkpointLocation option in their respective writeStream calls. You should not set the checkpoint location centrally in the SparkSession.
That way, they can run independently and will not interfere with each other.
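As a minimal PySpark sketch of that answer (broker address, topic names and checkpoint paths are hypothetical), two queries over the same source stay independent as long as each writeStream call gets its own checkpointLocation:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("independent-queries").getOrCreate()

source = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "input-topic")
    .load()
)

# First query with its own checkpoint location.
query_a = (
    source.selectExpr("key", "value")
    .writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("topic", "output-topic-a")
    .option("checkpointLocation", "/checkpoints/query_a")
    .trigger(processingTime="10 seconds")
    .start()
)

# Second query: different processing logic would go in the select; a separate
# checkpoint location means it does not resume from query_a's offsets.
query_b = (
    source.selectExpr("key", "value")
    .writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("topic", "output-topic-b")
    .option("checkpointLocation", "/checkpoints/query_b")
    .trigger(processingTime="10 seconds")
    .start()
)

spark.streams.awaitAnyTermination()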

Do Spark Streaming receivers continue pulling data for every block interval during the current micro-batch?

For every spark.streaming.blockInterval (say, 1 minute), receivers listen to the streaming source for data. Suppose the current micro-batch is taking an unnaturally long time to complete (intentionally, say 20 minutes). During this micro-batch, would the receivers still listen to the streaming source and store the data in Spark memory?
The current pipeline runs in Azure Databricks using Spark Structured Streaming.
Can anyone help me understand this?
In the above scenario, Spark will continue to consume/pull data from Kafka, micro-batches will continue to pile up, and this will eventually cause out-of-memory (OOM) issues.
To avoid this scenario, enable the backpressure setting:
spark.streaming.backpressure.enabled=true
For more details on the Spark backpressure feature, see https://spark.apache.org/docs/latest/streaming-programming-guide.html
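A minimal sketch of setting that flag when building the session. Note this config belongs to the receiver-based DStream API; in Structured Streaming the equivalent control is a per-source rate limit such as maxOffsetsPerTrigger. The initialRate value below is an illustrative assumption:

from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = (
    SparkConf()
    .set("spark.streaming.backpressure.enabled", "true")     # let Spark throttle receiver ingestion
    .set("spark.streaming.backpressure.initialRate", "1000")  # optional initial cap (assumed value)
)

spark = SparkSession.builder.config(conf=conf).appName("backpressure-demo").getOrCreate()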

Spark Structured Streaming - Streaming data joined with static data which will be refreshed every 5 mins

For a Spark Structured Streaming job, one input comes from a Kafka topic while the second input is a file (which is refreshed every 5 mins by a Python API). I need to join these 2 inputs and write to a Kafka topic.
The issue I am facing is that when the second input file is being refreshed and the Spark streaming job is reading it at the same time, I get the error below:
File file:/home/hduser/code/new/collect_ip1/part-00163-55e17a3c-f524-4dac-89a4-b9e12f1a79df-c000.csv does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by recreating the Dataset/DataFrame involved.
Any help will be appreciated.
Use HBase as your store for the static data. It is more work for sure, but it allows for concurrent updating.
Where I work, all Spark Streaming uses HBase for data lookup. It is far faster. What if you have 100M customers for a micro-batch of 10K records? I know it was a lot of work initially.
See https://medium.com/#anchitsharma1994/hbase-lookup-in-spark-streaming-acafe28cb0dc
If you have a small static reference table, then a static join is fine, but you also have updating, which causes issues.
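Another commonly suggested workaround (not the HBase approach above, and not from the original answer) is to re-read the static file inside foreachBatch so every micro-batch joins against a fresh copy. A minimal PySpark sketch, where the broker, topic names, join key and CSV layout are assumptions:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stream-static-refresh").getOrCreate()

stream_df = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    .load()
    .selectExpr("CAST(key AS STRING) AS ip", "CAST(value AS STRING) AS payload")
)

def join_with_latest_static(batch_df, batch_id):
    # Re-read the reference data on every micro-batch so refreshes are picked up.
    static_df = spark.read.csv("/home/hduser/code/new/collect_ip1", header=True)
    joined = batch_df.join(static_df, on="ip", how="left")  # "ip" join key is an assumption
    (joined.selectExpr("ip AS key", "to_json(struct(*)) AS value")
        .write
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("topic", "enriched-events")
        .save())

query = (
    stream_df.writeStream
    .foreachBatch(join_with_latest_static)
    .option("checkpointLocation", "/checkpoints/stream_static_join")
    .start()
)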

S3 based streaming solution using apache spark or flink

We have batch pipelines writing files (mostly CSV) into an S3 bucket. Some of these pipelines write every minute and some every 5 mins. Currently, we have a batch application which runs every hour processing these files.
Business wants the data to be available every 5 mins. Instead of running batch jobs every 5 mins, we decided to use Apache Spark Structured Streaming and process the data in real time. My question is: how easy/difficult is it to productionise this solution?
My only worry is that if the checkpoint location gets corrupted, deleting the checkpoint directory will cause re-processing of data from the last 1 year. Has anyone productionised a solution on S3 using Spark Structured Streaming, or do you think Flink is better for this use case?
If you think there is a better architecture/pattern for this problem, kindly point me in the right direction.
PS: We already thought of putting these files into Kafka and ruled it out due to bandwidth availability and the large size of the files.
I found a way to do this, though not the most effective way. Since we have already productionised a Kafka-based solution before, we could push an event into Kafka using S3 event notifications and Lambda. The event will contain only metadata such as the file location and size.
This will make the Spark program a bit more challenging, as the file will be read and processed inside an executor, which effectively does not utilise distributed processing. Alternatively, read the file in an executor and bring the data back to the driver to make use of Spark's distributed processing. This will require the Spark app to be planned a lot better in terms of memory, because input file sizes change a lot.
https://databricks.com/blog/2019/05/10/how-tilting-point-does-streaming-ingestion-into-delta-lake.html
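For reference, a minimal PySpark sketch of the plain file-source approach the question describes, reading new CSV files from S3 with a per-trigger file cap and a durable checkpoint. The bucket names, schema and 5-minute trigger are assumptions:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("s3-file-stream").getOrCreate()

# File streaming sources require an explicit schema (columns here are assumed).
schema = StructType([
    StructField("event_time", TimestampType()),
    StructField("payload", StringType()),
])

stream_df = (
    spark.readStream
    .schema(schema)
    .option("maxFilesPerTrigger", 100)   # bound the size of each micro-batch
    .csv("s3a://ingest-bucket/incoming/")
)

query = (
    stream_df.writeStream
    .format("parquet")
    .option("path", "s3a://processed-bucket/output/")
    .option("checkpointLocation", "s3a://processed-bucket/checkpoints/s3_file_stream")
    .trigger(processingTime="5 minutes")
    .start()
)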

How does Spark Structured Streaming handle backpressure?

I'm analyzing the backpressure feature in Spark Structured Streaming. Does anyone know the details? Is it possible to tune the processing of incoming records in code?
Thanks
If you mean dynamically changing the size of each internal batch in Structured Streaming, then no. There are no receiver-based sources in Structured Streaming, so that is simply not necessary. From another point of view, Structured Streaming cannot do real backpressure because, for example, Spark cannot tell other applications to slow down the speed at which they push data into Kafka.
Generally, Structured Streaming will try to process data as fast as possible by default. There are options in each source to control the processing rate, such as maxFilesPerTrigger in the file source and maxOffsetsPerTrigger in the Kafka source. Read the following links for more details:
http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#input-sources
http://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html
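A minimal PySpark sketch of the Kafka-source rate limit mentioned above (the broker, topic and 10,000-offset cap are illustrative assumptions):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rate-limited-kafka-stream").getOrCreate()

stream_df = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "input-topic")
    .option("maxOffsetsPerTrigger", 10000)  # cap the offsets consumed per micro-batch
    .load()
)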
Handling backpressure is needed only in push-based mechanisms. Kafka consumers are pull-based: Spark will pull the next batch of records only when the current batch has finished processing and saving. If processing and saving are delayed in Spark, it won't pull a new batch of records, so there is no need for backpressure handling.
maxOffsetsPerTrigger can change the number of records processed per Spark batch, and backpressure.enabled changes the rate of receiving, but that's not the same as backpressure, where you go and tell the source to slow down.
