S3-based streaming solution using Apache Spark or Flink

We have batch pipelines writing files (mostly CSV) into an S3 bucket. Some of these pipelines write every minute and some every 5 minutes. Currently, we have a batch application which runs every hour processing these files.
Business wants data to be available every 5 minutes. Instead of running batch jobs every 5 minutes, we decided to use Spark Structured Streaming and process the data in near real time. My question is: how easy or difficult is it to productionise this solution?
My only worry is that if the checkpoint location gets corrupted, deleting the checkpoint directory will reprocess data from the last year. Has anyone productionised a solution on S3 using Spark Structured Streaming, or do you think Flink is better for this use case?
If you think there is a better architecture/pattern for this problem, kindly point me in the right direction.
PS: We already considered putting these files into Kafka and ruled it out due to bandwidth constraints and the large size of the files.

I found a way to do this, though not the most effective one. Since we have already productionised a Kafka-based solution before, we could push an event into Kafka using S3 event notifications and Lambda. The event would contain only metadata, such as the file location and size.
This makes the Spark program a bit more challenging, because the file will be read and processed inside a single executor, which effectively does not utilise Spark's distributed processing. Alternatively, read the file and bring the data back to the driver so it can be redistributed across the cluster. This requires the Spark app to be planned much more carefully in terms of memory, because input file sizes vary a lot.
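As a sketch, the metadata-only event such a Lambda could publish might look like this (the field names and helper are illustrative, not from any actual implementation):

```python
import json

def build_file_event(bucket: str, key: str, size_bytes: int) -> str:
    """Build the small metadata-only message pushed to Kafka; the Spark
    job later fetches the actual file from S3 using this path."""
    event = {
        "s3_path": f"s3://{bucket}/{key}",
        "size_bytes": size_bytes,
    }
    return json.dumps(event)
```

Keeping only the path and size in Kafka sidesteps the bandwidth and message-size concerns mentioned above, at the cost of a second hop to S3 inside the Spark job.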
https://databricks.com/blog/2019/05/10/how-tilting-point-does-streaming-ingestion-into-delta-lake.html

Related

How to get Spark Streaming to continue where a Spark batch job left off

I have monthly directories of parquet files (~10TB each directory). Files are being atomically written to this directory every minute or so. When we get to a new month, a new directory is created and data is written there. Once data is written, it cannot be moved.
I can easily run batch queries on this data using Spark (batch mode). I can also easily run Spark streaming queries.
I am wondering how I can reconcile the two modes: batch and stream.
For example: let's say I run a batch query on the data, get the results, and do something with them. I can then checkpoint this DataFrame. Now let's say I want to start a streaming job that only processes files that are new relative to what the batch job handled, i.e. only files not processed in the batch job should now be processed.
Is this possible with Spark Streaming? If I start a Spark Streaming job and use the same checkpoint that the batch job used, will it proceed as I want it to?
Or, with the batch job, do I need to keep track of which files were processed and then somehow pass this to Spark Streaming so it knows not to process them?
This seems like a pretty common problem, so I am asking here to see what some other big data software developers have done.
I apologize for not having any code to post in this question, but I hope my explanation is all it takes for someone to see a potential solution. If needed, I can come up with some snippets.
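If the shared checkpoint does not carry over, one fallback is for the batch job to keep a ledger of processed files and for the streaming side to filter against it. A minimal sketch of that bookkeeping, with hypothetical names:

```python
def unprocessed(all_files, processed):
    """Return the files the streaming job still has to handle, given the
    list the batch job already processed (output sorted for determinism)."""
    done = set(processed)
    return sorted(f for f in all_files if f not in done)
```

The real version would persist the ledger somewhere durable (e.g. S3 or a database), but the core handoff is just this set difference.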

Ignoring preexisting data in Spark Structured Streaming

I have a Spark stream that processes files under an S3 prefix. The issue is that there are many TBs of data already in this prefix, so the EMR cluster underneath Spark gets throttled trying to process it all when the stream is turned on.
What I want is to ignore all files before a certain date, and then have the stream run normally. Is there a recommended way to do this?
I think I found what I need.
val df = spark.readStream
  .schema(testSchema)
  .option("maxFileAge", "1w")
  .parquet("s3://bucket/prefix")
This ignores everything older than a week.
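Note that, per the Spark documentation, maxFileAge is measured against the timestamp of the newest file in the directory, not the current wall clock. A small sketch of that filtering rule (plain Python, illustrative only):

```python
def files_within_age(files, max_age_seconds):
    """files: list of (path, mtime_seconds) pairs. Mimics Spark's maxFileAge
    rule: a file is kept if it is no older than max_age_seconds relative to
    the newest file seen, not relative to the current clock."""
    if not files:
        return []
    latest = max(mtime for _, mtime in files)
    return [path for path, mtime in files if latest - mtime <= max_age_seconds]
```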

Spark structured streaming job parameters for writing .compact files

I'm currently streaming from a file source, but every time a .compact file needs to be written, there's a big latency spike (~5 minutes; the .compact files are about 2.7GB). This is kind of aggravating because I'm trying to keep a rolling window's latency below a threshold and throwing an extra five minutes on there once every, say, half-hour messes with that.
Are there any spark parameters for tweaking .compact file writes? This system seems very lightly documented.
It looks like you ran into a reported bug: SPARK-30462 - "Structured Streaming _spark_metadata fills up Spark Driver memory when having lots of objects", which was fixed in Spark 3.1.
Before that version there are no configurations to prevent the compact file from growing incrementally; the compaction also uses quite a lot of driver memory, which makes it slow.
Here is the description from the Structured Streaming release notes:
Streamline the logic on file stream source and sink metadata log (SPARK-30462)
Before this change, whenever the metadata was needed in FileStreamSource/Sink, all entries in the metadata log were deserialized into the Spark driver’s memory. With this change, Spark will read and process the metadata log in a streamlined fashion whenever possible.
No sooner do I throw in the towel than an answer appears. According to Jacek Laskowski's book on Spark Structured Streaming: https://jaceklaskowski.gitbooks.io/spark-structured-streaming/content/spark-sql-streaming-properties.html
There is a parameter spark.sql.streaming.fileSource.log.compactInterval that controls this interval. If anyone knows of any other parameters that control this behaviour, though, please let me know!
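For intuition: compaction fires on a fixed cadence of batches. Spark's CompactibleFileStreamLog treats batch N as a compaction batch when N + 1 is a multiple of the interval, sketched below; a larger interval means fewer latency spikes, but each compaction then has more deltas to merge:

```python
def is_compaction_batch(batch_id: int, compact_interval: int) -> bool:
    """Batch N triggers a metadata-log compaction when (N + 1) is a
    multiple of the configured compact interval (default 10)."""
    return (batch_id + 1) % compact_interval == 0
```

With the default interval of 10, batches 9, 19, 29, ... write the .compact file; raising spark.sql.streaming.fileSource.log.compactInterval spreads those spikes further apart.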

What is the best strategy to consume 500 Kafka topics and write Parquet?

I have 500 Kafka topics. Some of them carry TBs of data; some only MBs per hour.
I want to buffer them for five minutes and write them as Parquet files to a specific location on S3. The location changes every 5 minutes based on the topic name, time, and a random id, and whenever the location changes I send a message to a status topic. You can think of it as a transactional write operation happening every 5 minutes.
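The per-window location logic can be sketched like this (the path layout, names, and random-id scheme are illustrative, not prescribed):

```python
import uuid
from datetime import datetime

def output_location(base: str, topic: str, ts: datetime,
                    window_minutes: int = 5) -> str:
    """Build a 5-minute-aligned S3 prefix for a topic; the random id keeps
    a re-run of the same window from colliding with an earlier write."""
    floored = ts.replace(minute=ts.minute - ts.minute % window_minutes,
                         second=0, microsecond=0)
    return f"{base}/{topic}/{floored:%Y%m%d_%H%M}/{uuid.uuid4().hex}"
```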
I am trying to find the best solution for my problem with Spark or Flink.
For Spark, I could use Structured Streaming to consume from Kafka, with write logic per topic. If I have one application per topic, I will waste a lot of resources on application masters (drivers); if I have one application consuming all topics, some of the topics are sparse and the app will be highly coupled.
I am also open to suggestions other than Spark or Flink. What is my best option?
I have solved a similar problem with a Kafka S3 sink, where I dump files using a custom hourly S3 sink. It flushes a plain-JSON file to S3 every 500 lines and also pushes the file name to another topic, "s3-file-names".
A Spark Streaming job consumes these file names, groups them under the same partition, performs the JSON-to-Parquet conversion, and saves the result back to S3. My job currently refreshes the S3 location every minute, as I have configured the S3 flush file size (i.e. 500 lines) accordingly.

Kafka, Spark, large CSV file (4 GB)

I am developing an integration channel with Kafka and Spark which will handle both batch and streaming processing.
For batch processing, I receive huge CSV files (4 GB).
I'm considering two solutions:
1. Send the whole file to the file system and send a message to Kafka with the file address; the Spark job then reads the file from the FS and works on it.
2. Cut the file into unit messages before Kafka (with Apache NiFi) and have the Spark job treat the batch as a stream.
What do you think is the best solution ?
Thanks
If you're writing code to place the file on the file system, you can use that same code to submit the Spark job to the job tracker. The job tracker becomes the task queue and processes your submitted files as Spark jobs.
This would be a more simplistic way of implementing #1 but it has drawbacks. The main drawback being that you have to tune resource allocation to make sure you don't under allocate for cases if your data set is extremely large. If you over allocate resources for the job, then your task queue potentially grows while tasks are waiting for resources. The advantage is that there aren't very many moving parts to maintain and troubleshoot.
Using nifi to cut a large file down and having spark handle the pieces as a stream would probably make it easier to utilize the cluster resources more effectively. If your cluster is servicing random jobs on top of this data ingestion, this might be the better way to go. The drawbacks here might be that you need to do extra work to process all parts of a single file in one transactional context, you may have to do a few extra things to make sure you aren't going to lose the data delivered by Kafka, etc.
If this is for a batch operation, maybe method 2 would be considered overkill. The setup seems pretty complex for reading a CSV file even if it is a potentially really large file. If you had a problem with the velocity of the CSV file, a number of ever-changing sources for the CSV, or a high error rate then NiFi would make a lot of sense.
It's hard to suggest the best solution. If it were me, I'd start with the variation of #1 to make it work first. Then make it work better by introducing more system parts, depending on how your approach performs and how acceptably it handles anomalies in the input files. You may find that your biggest problem is trying to identify errors in input files during a large-scale ingestion.
