Kafka, Spark large CSV file (4 GB) - apache-spark

I am developing an integration channel with Kafka and Spark that will handle both batch and streaming processing.
For batch processing, I receive huge CSV files (4 GB).
I'm considering two solutions:
1. Write the whole file to the file system and send a message to Kafka with the file's address; the Spark job then reads the file from the FS and processes it.
2. Split the file into individual messages before Kafka (with Apache NiFi) and have the Spark job treat the batch as a stream.
What do you think is the best solution?
Thanks

If you're writing code to place the file on the file system, you can use that same code to submit the Spark job to the job tracker. The job tracker becomes the task queue and processes your submitted files as Spark jobs.
This would be a simpler way of implementing #1, but it has drawbacks. The main one is that you have to tune resource allocation to make sure you don't under-allocate in case your data set is extremely large. If you over-allocate resources for a job, the task queue can grow while tasks wait for resources. The advantage is that there aren't many moving parts to maintain and troubleshoot.
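For illustration, here is a minimal sketch of that variation of #1, assuming the code that lands the file also submits this job with the file path as an argument; all paths, options and the output location are placeholders, not anything from the original post:

```scala
import org.apache.spark.sql.SparkSession

object CsvBatchJob {
  def main(args: Array[String]): Unit = {
    // the code that placed the file on the FS submits this job with the path
    val inputPath = args(0) // e.g. hdfs:///ingest/incoming/big-file.csv

    val spark = SparkSession.builder()
      .appName("csv-batch-ingest")
      .getOrCreate()

    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true") // better: supply an explicit schema for a 4 GB file
      .csv(inputPath)

    // ...whatever transformation the batch needs...
    df.write.mode("overwrite").parquet("hdfs:///ingest/processed/big-file")

    spark.stop()
  }
}
```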
Using NiFi to cut a large file down and having Spark handle the pieces as a stream would probably make it easier to use the cluster resources effectively. If your cluster is servicing random jobs on top of this data ingestion, this might be the better way to go. The drawbacks are that you need extra work to process all parts of a single file in one transactional context, and you may have to take a few extra steps to make sure you don't lose the data delivered by Kafka.
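For comparison, a rough sketch of option #2, assuming NiFi publishes one CSV record per Kafka message; the topic, brokers, schema and checkpoint location are made up, and from_csv needs Spark 3.0+:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

object CsvStreamJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("csv-stream-ingest").getOrCreate()

    // assumed layout of each CSV record
    val schema = StructType(Seq(
      StructField("id", StringType),
      StructField("amount", DoubleType)
    ))

    val lines = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "csv-records")
      .load()
      .selectExpr("CAST(value AS STRING) AS line")

    // parse each Kafka message (one CSV line) against the assumed schema
    val parsed = lines
      .select(from_csv(col("line"), schema, Map.empty[String, String]).as("rec"))
      .select("rec.*")

    parsed.writeStream
      .format("parquet")
      .option("path", "hdfs:///ingest/streamed")
      .option("checkpointLocation", "hdfs:///ingest/checkpoints/csv-stream")
      .start()
      .awaitTermination()
  }
}
```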
If this is for a batch operation, maybe method 2 would be considered overkill. The setup seems pretty complex for reading a CSV file even if it is a potentially really large file. If you had a problem with the velocity of the CSV file, a number of ever-changing sources for the CSV, or a high error rate then NiFi would make a lot of sense.
It's hard to suggest the best solution. If it were me, I'd start with the variation of #1 and make it work first. Then make it work better by introducing more system parts, depending on how your approach performs and whether it handles anomalies in the input file with an acceptable level of accuracy. You may find that your biggest problem is identifying errors in input files during a large-scale ingestion.

Related

HBase batch loading with speed control because of a slow consumer

We need to load a big chunk of data from HBase using Spark.
We then put it into Kafka, where it is read by a consumer. But the consumer is too slow,
and at the same time Kafka does not have enough memory to keep the whole scan result.
Our key contains ...yyyy.MM.dd, and we currently load 30 days in one Spark job using a filter operator.
But we can't split the job into many jobs (30 jobs, each filtering one day), because then each job would have to scan all of HBase, and that would make the overall scan too slow.
We currently launch the Spark job with 100 threads, but we can't slow it down by setting fewer threads (for example 7). Kafka is used by third-party developers, which sometimes makes it too busy to accept any data. So we need to control the HBase scan speed, checking all the time whether Kafka has room to store our data.
We tried saving the scan result somewhere before loading it into Kafka, for example as ORC files in HDFS, but the scan produces many little files. Grouping them by memory is a problem (or is there a way? If you know one, please tell me how), and storing little files in HDFS is bad. Merging such files is a very expensive operation and takes so much time that it makes the total time too slow.
Suggested solutions:
Maybe it is possible to store the scan result in HDFS with Spark (by setting some special flag in the filter operator) and then run 30 Spark jobs that select data from the saved result and push each result to Kafka when possible (a rough sketch of this idea follows the question below).
Maybe there is an existing mechanism in Spark to stop and resume launched jobs.
Maybe there is an existing mechanism in Spark to split the result into batches (without control to stop and resume loading).
Maybe there is an existing mechanism in Spark to split the result into batches (with control to stop and resume loading based on an external condition).
Maybe when Kafka throws an exception (because there is no room to store data), there is some backpressure mechanism in Spark that will pause the scan for a while when exceptions appear during execution (but I guess retries of a failed operator are limited; is it possible to set it to retry forever, if that is a real solution?). It would be better, though, to keep some free room in Kafka and not wait until it is overloaded.
Could we use a PageFilter in HBase (though I guess it is hard to implement), or are there other solution variants? I also guess that there would be too many objects in memory to use PageFilter.
P.S.
This https://github.com/hortonworks-spark/shc/issues/108 will not help; we already use a filter.
Any ideas would be helpful
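No answer is attached here, but purely as a hypothetical sketch of the first suggestion above (stage the scan in HDFS partitioned by day, then push one day at a time to Kafka so the rate can be throttled), it might look like this; the day column, row-key column, paths, brokers and topic are all invented:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object StagedHBaseToKafka {
  // `scanned` is the DataFrame your existing HBase scan already produces,
  // assumed to carry a `day` column derived from the ...yyyy.MM.dd part of the key
  def stage(scanned: DataFrame): Unit = {
    scanned
      .repartition(scanned.col("day"))   // one task per day keeps the number of small files down
      .write.partitionBy("day")
      .orc("hdfs:///staging/hbase_scan")
  }

  // push a single day, so each step is small enough for Kafka to absorb;
  // the caller decides when Kafka has room for the next one
  def pushDay(spark: SparkSession, day: String): Unit = {
    spark.read.orc(s"hdfs:///staging/hbase_scan/day=$day")
      .selectExpr("CAST(rowkey AS STRING) AS key", "to_json(struct(*)) AS value")
      .write
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("topic", "hbase-export")
      .save()
  }
}
```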

Simultaneously download and process data in Apache Spark

I have multiple files which ideally would all be inputs to the map-reduce job. But each takes some time to download, so I was wondering if I can begin in-memory processing on the ones already downloaded while the others are still downloading.
As I see it, this would pipeline the whole process. How do I achieve this with Apache Spark, and would it lead to huge differences in mapper sizes, causing me to shuffle and shard? Are there any other problems you see with this approach?
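One hedged way to sketch that pipelining idea: start the downloads as futures and trigger a Spark job on each file as soon as it lands, then union the per-file results at the end. downloadFile, the URLs and the per-file aggregation are placeholders, and a FAIR scheduler pool would help the concurrent jobs share the cluster:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

object PipelinedIngest {
  // placeholder: fetches a remote file and returns its local/HDFS path
  def downloadFile(url: String): String = ???

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("pipelined-ingest").getOrCreate()
    val urls = Seq("https://example.com/a.csv", "https://example.com/b.csv")

    // each future downloads one file and immediately materialises its result,
    // so the work on early files runs while later downloads are still in flight
    val partials: Seq[Future[DataFrame]] = urls.map { url =>
      Future {
        val path = downloadFile(url)                      // blocks until this file is available
        val df = spark.read.option("header", "true").csv(path)
          .groupBy("some_key").count()                    // assumed per-file aggregation
        df.cache()
        df.count()                                        // force the job to run now
        df
      }
    }

    val merged = Await.result(Future.sequence(partials), Duration.Inf)
      .reduce(_ union _)

    merged.write.parquet("hdfs:///out/merged")
    spark.stop()
  }
}
```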

Spark structured streaming job parameters for writing .compact files

I'm currently streaming from a file source, but every time a .compact file needs to be written, there's a big latency spike (~5 minutes; the .compact files are about 2.7GB). This is kind of aggravating because I'm trying to keep a rolling window's latency below a threshold and throwing an extra five minutes on there once every, say, half-hour messes with that.
Are there any spark parameters for tweaking .compact file writes? This system seems very lightly documented.
It looks like you ran into a reported bug: SPARK-30462 - Structured Streaming _spark_metadata fills up Spark Driver memory when having lots of objects, which was fixed in Spark version 3.1.
Before that version there is no configuration that prevents the compact file from growing incrementally while using quite a lot of memory, which makes the compaction slow.
Here is the description from the release notes on Structured Streaming:
Streamline the logic on file stream source and sink metadata log (SPARK-30462)
Before this change, whenever the metadata was needed in FileStreamSource/Sink, all entries in the metadata log were deserialized into the Spark driver’s memory. With this change, Spark will read and process the metadata log in a streamlined fashion whenever possible.
No sooner do I throw in the towel than an answer appears. According to Jacek Laskowski's book on Spark here: https://jaceklaskowski.gitbooks.io/spark-structured-streaming/content/spark-sql-streaming-properties.html
There is a parameter spark.sql.streaming.fileSource.log.compactInterval that controls this interval. If anyone knows of any other parameters that control this behaviour, though, please let me know!
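For what it's worth, a minimal sketch of where that setting would go, assuming the property applies to your Spark version; the interval value here is arbitrary:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("file-source-stream")
  // compact the file-source metadata log every 30 micro-batches instead of the default (10)
  .config("spark.sql.streaming.fileSource.log.compactInterval", "30")
  .getOrCreate()
```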

Best way to return data back to driver from spark workers

We're facing performance issues running a big data task on a single machine. The task is by design memory and compute intensive: it runs an optimization algorithm (branch and bound) on huge data sets. A single EC2 c5.24xlarge machine (96 vCPU / 192 GiB) now takes over 2 days to complete the task. We have tried to optimize the code (multiple threads, more memory, parallelism, an optimized algorithm), but there seems to be a limit to how much we can achieve, and it doesn't look like a scalable option as the data set is growing and we're adding more use cases.
We're thinking of splitting this into smaller tasks and having them executed by multiple workers in a Spark cluster. The output of the task is a single gzipped JSON (2-20 MB in size), and by distributing it I want each worker to build smaller JSONs or RDD chunks which could later be merged on the driver side.
Is this doable? Is there a limit on how much data each worker can send back to the driver? Or is it better to store each worker's output somewhere like S3 and then merge on the driver side? What are the pros and cons of each approach?
I would suggest not collecting data from the executors to the driver and combining it there. It makes the driver a bottleneck, and you will run into frequent resource-related issues. The best option is to let the executors process the data and produce the JSONs. You can leave the output JSONs as is, which also helps if you need to reprocess them again. If you are worried about the size of these JSONs, or want to combine them to send to another process, you can use a file utility to combine them, or you can use dataframe.coalesce(1) to generate one output file.
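A minimal sketch of that suggestion, with the input, paths and partition count as placeholders: the executors write the JSON output, and only coalesce to a single file if a downstream process really needs one.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("distributed-branch-and-bound").getOrCreate()

val results = spark.read.parquet("s3a://bucket/input") // assumed input location
  .repartition(200)                                    // spread the work across the cluster
// ...each partition would run its share of the optimisation here...

// Option A: leave one JSON part-file per task (easy to reprocess later)
results.write.mode("overwrite").json("s3a://bucket/output/parts")

// Option B: a single gzipped JSON file written by one task (only for modest result sizes)
results.coalesce(1)
  .write.mode("overwrite")
  .option("compression", "gzip")
  .json("s3a://bucket/output/single")
```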

S3 based streaming solution using apache spark or flink

We have batch pipelines writing files (mostly CSV) into an S3 bucket. Some of these pipelines write every minute and some every 5 minutes. Currently, we have a batch application which runs every hour processing these files.
The business wants data to be available every 5 minutes. Instead of running batch jobs every 5 minutes, we decided to use Apache Spark Structured Streaming and process the data in real time. My question is: how easy or difficult is it to productionise this solution?
My only worry is that if the checkpoint location gets corrupted, deleting the checkpoint directory will reprocess data from the last year. Has anyone productionised a solution on S3 using Spark Structured Streaming, or do you think Flink is better for this use case?
If you think there is a better architecture/pattern for this problem, kindly point me in the right direction.
PS: We already thought of putting these files into Kafka and ruled that out because of the available bandwidth and the large size of the files.
I found a way to do this, though not the most effective one. Since we had already productionised a Kafka-based solution before, we could push an event into Kafka using S3 event notifications and a Lambda. The event contains only metadata such as the file location and size.
This makes the Spark program a bit more challenging, as the file would be read and processed inside a single executor, which doesn't really utilise distributed processing. Or else, read it in the executor and bring the data back to the driver to utilise Spark's distributed processing. This will require the Spark app to be planned a lot better in terms of memory, because input file sizes vary a lot.
https://databricks.com/blog/2019/05/10/how-tilting-point-does-streaming-ingestion-into-delta-lake.html
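For completeness, one way this metadata-driven approach could be wired up so the heavy reads stay distributed: consume the small metadata events from Kafka, collect only the S3 keys to the driver in each micro-batch, and let spark.read load the files across the executors. The topic, buckets, event layout and output path are all assumptions:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

object S3MetadataStream {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("s3-metadata-stream").getOrCreate()

    // each Kafka message is a small JSON event carrying the S3 object key
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "s3-file-events")
      .load()
      .selectExpr("CAST(value AS STRING) AS json")
      .select(get_json_object(col("json"), "$.key").as("key"))

    val processBatch: (DataFrame, Long) => Unit = (batch, _) => {
      // the metadata batch is tiny, so collecting the keys to the driver is cheap
      val paths = batch.select("key").collect().map(r => s"s3a://data-bucket/${r.getString(0)}")
      if (paths.nonEmpty) {
        // the actual CSV files are read and processed by the executors
        spark.read.option("header", "true").csv(paths: _*)
          .write.mode("append").parquet("s3a://bucket/processed")
      }
    }

    events.writeStream
      .foreachBatch(processBatch)
      .option("checkpointLocation", "s3a://bucket/checkpoints/s3-metadata-stream")
      .start()
      .awaitTermination()
  }
}
```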
