HBase batch loading with speed control because of a slow consumer - apache-spark

We need to load a large amount of data from HBase using Spark.
Then we put it into Kafka, where it is read by a consumer. But the consumer is too slow.
At the same time, Kafka does not have enough capacity to hold the whole scan result.
Our key contains ...yyyy.MM.dd, and we currently load 30 days in one Spark job, using a filter operator.
But we can't split the job into many jobs (30 jobs, each filtering one day), because then each job would have to scan all of HBase, which would make the total scan too slow.
We currently launch the Spark job with 100 threads, but we can't simply slow it down by using fewer threads (for example, 7 threads), because Kafka is also used by third-party developers, which sometimes makes Kafka too busy to accept any data. So we need to control the HBase scan speed, continuously checking whether Kafka has room to store our data.
We tried saving the scan result somewhere before loading it into Kafka, for example in ORC files in HDFS, but the scan result produces many small files. It is a problem to group them by size (or is there a way? If you know, please tell me how), and storing small files in HDFS is bad. Merging such files is a very expensive operation and takes so long that the total time becomes too slow.
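On the question of grouping many small files by size: one simple possibility is a greedy pass over a file listing that closes a group once a target size would be exceeded. The sketch below is plain Python, not Spark or HDFS code; the file names, sizes, and the `group_by_size` helper are hypothetical and only illustrate the idea.

```python
from typing import Iterable, List, Tuple


def group_by_size(files: Iterable[Tuple[str, int]], target_bytes: int) -> List[List[str]]:
    """Greedily pack (path, size) pairs into groups of roughly target_bytes."""
    groups: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for path, size in files:
        # Close the current group if adding this file would exceed the target.
        if current and current_size + size > target_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        groups.append(current)
    return groups


# Hypothetical listing: (path, size in bytes).
listing = [("part-0.orc", 40), ("part-1.orc", 30), ("part-2.orc", 50), ("part-3.orc", 20)]
print(group_by_size(listing, target_bytes=100))
```

Each resulting group could then be handled as one unit (for example, read and sent together), which avoids physically merging the files first. A file larger than the target simply ends up in a group of its own.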
Suggested solutions:
Maybe it is possible to store the scan result in HDFS with Spark, by setting some special flag in the filter operator, and then run 30 Spark jobs that select data from the saved result and put each result into Kafka when possible.
Maybe there is an existing mechanism in Spark to stop and resume launched jobs.
Maybe there is an existing mechanism in Spark to split the result into batches (without control over stopping and resuming the load).
Maybe there is an existing mechanism in Spark to split the result into batches (with control over stopping and resuming the load based on an external condition).
Maybe, when Kafka throws an exception (saying there is no room to store data), there is some backpressure mechanism in Spark that pauses the scan for a while when exceptions appear during execution (but I guess the number of retries for an operator is limited; is it possible to make it retry forever, if that is a real solution?). It would be better, though, to keep some free space in Kafka rather than wait until it is overloaded.
Use PageFilter in HBase (but I guess it is hard to implement), or are there other solution variants? I also guess there would be too many objects in memory to use PageFilter.
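The "batches with stop/continue by an external condition" idea from the list above can be sketched independently of Spark. This is a minimal illustration in plain Python, not an existing Spark mechanism: `has_capacity` and `send` are hypothetical callbacks standing in for "does Kafka have room?" and "publish this batch". Because the rows are pulled lazily from an iterator, pausing the loop also pauses reading from the source.

```python
import time
from itertools import islice
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batches(rows: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Lazily cut an iterable into lists of at most batch_size items."""
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def throttled_send(
    rows: Iterable[T],
    batch_size: int,
    has_capacity: Callable[[], bool],   # hypothetical: external "is there room?" check
    send: Callable[[List[T]], None],    # hypothetical: publishes one batch
    poll_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Send rows batch by batch, waiting while the downstream has no room."""
    sent = 0
    for batch in batches(rows, batch_size):
        while not has_capacity():
            sleep(poll_seconds)  # back off, then check the condition again
        send(batch)
        sent += len(batch)
    return sent


# Example with a downstream that always has room.
delivered: List[List[int]] = []
print(throttled_send(range(5), 2, lambda: True, delivered.append), delivered)
```

In a Spark job the same loop shape would have to run inside each task (per partition), and the capacity check would need to be something every task can query; both of those are assumptions, not something the question confirms.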
P.S.
This https://github.com/hortonworks-spark/shc/issues/108 will not help; we already use a filter.
Any ideas would be helpful.

Related

How to speed up recovery partition on S3 with Spark?

I am using Spark 3.0 on EMR to write some data to S3 with daily partitioning (the data goes back ~5 years), in this way:
writer.option("path", somepath).saveAsTable("my_schema.my_table")
Due to the large number of partitions, the process takes a very long time just to "recover partitions", even though all tasks seem to have completed before that. Is there any way to reduce this intermediate time?
In the above code, you haven't specified the write mode. The default write mode is ErrorIfExists. This could cause overhead by checking whether the target exists while writing.
We could also use dynamic partition overwrite, which could optimize the large volume of writes, as discussed here.
Here is a sample snippet:
// needed for SaveMode.Overwrite below
import org.apache.spark.sql.SaveMode
// while building the session
sparkSession.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
...
...
// while writing
yourDataFrame.write
  .option("path", somepath)
  .partitionBy("date")
  .mode(SaveMode.Overwrite)
  .saveAsTable("my_schema.my_table")
If this is a one-time write rather than a repeated process, the use case may not need dynamic partition overwrite. It is useful for leaving already written partitions untouched, which makes the operation idempotent and brings performance benefits.
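To make the difference between the two overwrite modes concrete, here is a toy model in plain Python (not Spark code). It treats a table as a mapping from partition value to rows; the `overwrite` helper and the sample data are hypothetical. In "static" mode an overwrite drops every existing partition first, while in "dynamic" mode only the partitions present in the incoming data are replaced.

```python
from typing import Dict, List

Table = Dict[str, List[str]]  # partition value -> rows in that partition


def overwrite(table: Table, incoming: Table, mode: str) -> Table:
    """Toy model of an overwrite in 'static' or 'dynamic' partition mode."""
    if mode == "static":
        # Static: all existing partitions are dropped, only incoming ones remain.
        return {p: list(rows) for p, rows in incoming.items()}
    if mode == "dynamic":
        # Dynamic: untouched partitions survive, incoming ones are replaced.
        result = {p: list(rows) for p, rows in table.items()}
        for p, rows in incoming.items():
            result[p] = list(rows)
        return result
    raise ValueError(f"unknown mode: {mode}")


existing = {"2021-01-01": ["old-a"], "2021-01-02": ["old-b"]}
new_data = {"2021-01-02": ["new-b"]}
print(overwrite(existing, new_data, "static"))
print(overwrite(existing, new_data, "dynamic"))
```

This only models the semantics of the setting. It is worth verifying on your Spark version that the write path you use actually honors it.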

Kafka, Spark large CSV file (4 GB)

I am developing an integration pipeline with Kafka and Spark, which will process both batches and streams.
For batch processing, the input is huge CSV files (4 GB).
I'm considering two solutions:
1. Send the whole file to the file system and send a message to Kafka with the file address; the Spark job will then read the file from the FS and process it.
2. Cut the file into unit messages before Kafka (with Apache NiFi) and treat the batch as a stream in the Spark job.
Which do you think is the best solution?
Thanks
If you're writing code to place the file on the file system, you can use that same code to submit the Spark job to the job tracker. The job tracker becomes the task queue and processes your submitted files as Spark jobs.
This would be a simpler way of implementing #1, but it has drawbacks. The main one is that you have to tune resource allocation to make sure you don't under-allocate when your data set is extremely large. If you over-allocate resources for the job, then your task queue potentially grows while tasks wait for resources. The advantage is that there aren't many moving parts to maintain and troubleshoot.
Using NiFi to cut a large file down and having Spark handle the pieces as a stream would probably make it easier to use the cluster resources effectively. If your cluster is servicing random jobs on top of this data ingestion, this might be the better way to go. The drawbacks here might be that you need to do extra work to process all parts of a single file in one transactional context, that you may have to do a few extra things to make sure you don't lose the data delivered by Kafka, and so on.
If this is for a batch operation, maybe method 2 would be considered overkill. The setup seems pretty complex for reading a CSV file even if it is a potentially really large file. If you had a problem with the velocity of the CSV file, a number of ever-changing sources for the CSV, or a high error rate then NiFi would make a lot of sense.
It's hard to suggest the best solution. If it were me, I'd start with the variation of #1 to make it work first. Then you make it work better by introducing more system parts depending on how your approach performs with an acceptable level of accuracy in handling anomalies in the input file. You may find that your biggest problem is trying to identify errors in input files during a large scale ingestion.
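For a feel of what "cutting the file into unit messages" (option 2) involves, here is a minimal sketch in plain Python that reads a CSV lazily and yields fixed-size batches of records, so the whole 4 GB file never has to sit in memory. It is not NiFi or Kafka code; `csv_batches` and the sample data are hypothetical, and each yielded batch stands in for one message.

```python
import csv
import io
from itertools import islice
from typing import Dict, Iterator, List, TextIO


def csv_batches(handle: TextIO, batch_size: int) -> Iterator[List[Dict[str, str]]]:
    """Yield lists of at most batch_size rows, reading the file lazily."""
    reader = csv.DictReader(handle)  # first line is taken as the header
    while True:
        batch = list(islice(reader, batch_size))
        if not batch:
            return
        yield batch


# Small in-memory stand-in for a large CSV file.
sample = io.StringIO("id,name\n1,a\n2,b\n3,c\n")
for batch in csv_batches(sample, batch_size=2):
    print(batch)
```

The transactional concern mentioned above shows up here too: once the file is split, something has to track which batches belong to the same file and whether all of them arrived.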

How to do Incremental MapReduce in Apache Spark

In CouchDB and system designs like Incoop, there's a concept called "Incremental MapReduce" where results from previous executions of a MapReduce algorithm are saved and used to skip over sections of input data that haven't been changed.
Say I have 1 million rows divided into 20 partitions. If I run a simple MapReduce over this data, I could cache/store the result of reducing each separate partition, before they're combined and reduced again to produce the final result. If I only change data in the 19th partition then I only need to run the map & reduce steps on the changed section of the data, and then combine the new result with the saved reduce results from the unchanged partitions to get an updated result. Using this sort of caching I'd be able to skip almost 95% of the work for re-running a MapReduce job on this hypothetical dataset.
Is there any good way to apply this pattern to Spark? I know I could write my own tool for splitting up input data into partitions, checking if I've already processed those partitions before, loading them from a cache if I have, and then running the final reduce to join all the partitions together. However, I suspect that there's an easier way to approach this.
I've experimented with checkpointing in Spark Streaming, and that is able to store results between restarts, which is almost what I'm looking for, but I want to do this outside of a streaming job.
RDD caching/persisting/checkpointing almost looks like something I could build off of - it makes it easy to keep intermediate computations around and reference them later, but I think cached RDDs are always removed once the SparkContext is stopped, even if they're persisted to disk. So caching wouldn't work for storing results between restarts. Also, I'm not sure if/how checkpointed RDDs are supposed to be loaded when a new SparkContext is started... They seem to be stored under a UUID in the checkpoint directory that's specific to a single instance of the SparkContext.
Both use cases suggested by the article (incremental logs processing and incremental query processing) can be generally solved by Spark Streaming.
The idea is that you have incremental updates coming in using the DStream abstraction. Then you can process the new data and join it with the previous calculation, either using time-window-based processing or using arbitrary stateful operations as part of Structured Streaming. The results of the calculation can later be dumped to some sort of external sink, like a database or file system, or they can be exposed as an SQL table.
If you're not building an online data processing system, regular Spark can be used as well. It's just a matter of how incremental updates get into the process, and how intermediate state is saved. For example, incremental updates can appear under some path on a distributed file system, while intermediate state containing previous computation joined with new data computation can be dumped, again, to the same file system.
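The "save intermediate state between runs" idea can be sketched outside Spark. The example below is plain Python under stated assumptions: partitions are lists of integers, the per-partition work is a sum of squares, and partial results are cached on disk under a hash of the partition's content. The `fingerprint` and `incremental_sum` helpers are hypothetical names, not a Spark API.

```python
import hashlib
import json
import os
import tempfile
from typing import List, Sequence, Tuple


def fingerprint(partition: Sequence[int]) -> str:
    """Content hash used as the cache key for one partition."""
    return hashlib.sha256(json.dumps(list(partition)).encode("utf-8")).hexdigest()


def incremental_sum(partitions: List[Sequence[int]], cache_dir: str) -> Tuple[int, int]:
    """Return (total, number of partitions recomputed in this run)."""
    os.makedirs(cache_dir, exist_ok=True)
    partials, recomputed = [], 0
    for part in partitions:
        path = os.path.join(cache_dir, fingerprint(part) + ".json")
        if os.path.exists(path):
            with open(path) as f:      # unchanged partition: reuse saved partial
                partial = json.load(f)
        else:
            partial = sum(x * x for x in part)  # map + per-partition reduce
            with open(path, "w") as f:
                json.dump(partial, f)
            recomputed += 1
        partials.append(partial)
    return sum(partials), recomputed   # final reduce over the partials


with tempfile.TemporaryDirectory() as cache:
    print(incremental_sum([[1, 2], [3, 4], [5]], cache))  # first run
    print(incremental_sum([[1, 2], [3, 5], [5]], cache))  # one partition changed
```

Because the cache lives on a file system rather than in a SparkContext, it survives restarts. The trade-off is that hashing a partition still requires reading it, so this saves computation but not input I/O unless a cheaper change marker (such as a modification time) is available.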

Spark-Streaming Kafka Direct Streaming API & Parallelism

I understand the automatic mapping that exists between a Kafka partition, a Spark RDD partition, and ultimately a Spark task. However, in order to properly size my executors (in number of cores), and therefore ultimately my nodes and cluster, I need to understand something that seems to be glossed over in the documentation.
In Spark Streaming, how exactly do data consumption, data processing, and task allocation work together? In other words:
Does the Spark task corresponding to a Kafka partition both read and process the data?
The rationale behind this question is that in the previous, receiver-based API, a task was dedicated to receiving the data, meaning a number of task slots on your executors were reserved for data ingestion and the others were available for processing. This had an impact on how you size your executors in terms of cores.
Take for example the advice on how to launch Spark Streaming with --master local. Everyone will tell you that for Spark Streaming one should use at least local[2], because one of the cores will be dedicated to running the long receiving task that never ends, and the other core will do the data processing.
So if the answer is that in this case the task does both the reading and the processing at once, then the follow-up question is: is that really smart? This sounds like something that should be asynchronous. We want to be able to fetch while we process, so that the data is already there for the next round of processing. However, if there is only one core (or, more precisely, only one task) to both read the data and process it, how can both be done in parallel, and how does that make things faster in general?
My original understanding was that things would remain more or less the same, in the sense that a task would be launched to read but the processing would be done in another task. That would mean that, if the processing task is not done yet, we can still keep reading, up to a certain memory limit.
Can someone outline clearly what exactly is going on here?
EDIT1
We don't even need this memory limit control. Just being able to fetch while the processing is going on, and stopping right there, would be enough. In other words, the two processes should be asynchronous, and the limit is simply to be one step ahead. If this is somehow not happening, I find it extremely strange that Spark would implement something that hurts performance like this.
Does a corresponding Spark task to a Kafka partition both read and process the data altogether?
The relationship is very close to what you describe, if by "task" we mean the part of the graph that reads from Kafka up until a shuffle operation. The flow of execution is as follows:
The driver reads offsets from all Kafka topics and partitions.
The driver assigns each executor a topic and partition to be read and processed.
Unless there is a shuffle boundary operation, it is likely that Spark will optimize the entire execution of the partition on the same executor.
This means that a single executor will read a given TopicPartition and process the entire execution graph on it, unless we need to shuffle. Since a Kafka partition maps to a partition inside the RDD, we get that guarantee.
Structured Streaming takes this even further. In Structured Streaming, there is stickiness between the TopicPartition and the worker/executor. Meaning, if a given worker was assigned a TopicPartition it is likely to continue processing it for the entire lifetime of the application.
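The flow described above (driver plans offsets, each task both reads and processes its own partition) can be modelled in a few lines of plain Python. This is a simplified illustration, not Spark internals: the `OffsetRange` tuple is loosely modelled on the offset-range idea in the Kafka integration, and `plan_batch`, `run_task`, the in-memory `log`, and the thread pool standing in for executor cores are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, NamedTuple


class OffsetRange(NamedTuple):
    topic: str
    partition: int
    from_offset: int   # inclusive
    until_offset: int  # exclusive


def plan_batch(topic: str, consumed: Dict[int, int], latest: Dict[int, int]) -> List[OffsetRange]:
    """Driver side: decide which offsets each task will cover in this batch."""
    ranges = []
    for partition in sorted(latest):
        start = consumed.get(partition, 0)
        if latest[partition] > start:
            ranges.append(OffsetRange(topic, partition, start, latest[partition]))
    return ranges


def run_task(log: Dict[int, List[str]], rng: OffsetRange) -> List[str]:
    """Executor side: one task reads its offset range, then processes it."""
    records = log[rng.partition][rng.from_offset:rng.until_offset]  # read
    return [record.upper() for record in records]                    # process


# Hypothetical in-memory "topic" with two partitions.
log = {0: ["a", "b", "c"], 1: ["d", "e"]}
ranges = plan_batch("events", consumed={0: 1}, latest={0: 3, 1: 2})
with ThreadPoolExecutor(max_workers=2) as pool:  # two "cores"
    results = list(pool.map(lambda r: run_task(log, r), ranges))
print(results)
```

In this model, parallelism comes from running one task per partition side by side, rather than from overlapping reading and processing inside a single task. That is why no core has to be reserved for a long-running receiver.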

Read from Hbase + Convert to DF + Run SQLs

Edit
My use case is a Spark Streaming app (Spark 2.1.1 + Kafka 0.10.2.1), in which I read from Kafka and, for each message/trigger, need to pull data from HBase. After the pull, I need to run some SQL statements on the data received from HBase.
Naturally, I intend to push the processing (the read from HBase and the SQL execution) to the worker nodes to achieve parallelism.
So far, my attempts to convert the data from HBase to a data frame (so that I can run SQL statements) are failing. Someone else mentioned that this is not "allowed" since that part is running on executors. However, it is my conscious choice to run those pieces on worker nodes.
Is that sound thinking? If not, why not?
What's the recommendation on that, or on the overall idea?
For every streamed record, reading from HBase and running SQL seems like "too much happening in a streaming app".
Anyway, you can create a connection to HBase for every partition, get the records, and then compare. Not sure about SQL. If it's just another read for every streaming record, again handle it at the partition level in Spark.
But the above approach will be time-consuming; just make sure you finish everything before the next batch starts.
You also mentioned converting "HBase to dataframe" and "parallel". These seem to pull in opposite directions, because you start with a dataframe (maybe reading from HBase once) and then you parallelize. Hope I cleared up some of your doubts.
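The "one connection per partition" advice can be sketched as a function that takes an iterator of keys and yields enriched records, which is the shape a per-partition operation such as `mapPartitions` expects. The code below is plain Python with a stand-in connection class; `FakeHBaseConnection` and `enrich_partition` are hypothetical and do not reflect any real HBase client API.

```python
from typing import Callable, Dict, Iterable, Iterator, Optional, Tuple


class FakeHBaseConnection:
    """Hypothetical stand-in for an HBase client connection."""

    opened = 0  # counts how many connections were created

    def __init__(self, table: Dict[str, int]) -> None:
        FakeHBaseConnection.opened += 1
        self._table = table
        self.closed = False

    def get(self, key: str) -> Optional[int]:
        return self._table.get(key)

    def close(self) -> None:
        self.closed = True


def enrich_partition(
    keys: Iterable[str],
    connect: Callable[[], FakeHBaseConnection],
) -> Iterator[Tuple[str, Optional[int]]]:
    """Open one connection for the whole partition, look up every key, then close."""
    conn = connect()
    try:
        for key in keys:
            yield key, conn.get(key)
    finally:
        conn.close()


table = {"a": 1, "b": 2}
partition_keys = ["a", "b", "x"]
print(list(enrich_partition(partition_keys, lambda: FakeHBaseConnection(table))))
print(FakeHBaseConnection.opened)  # connections opened so far
```

The point of the design is that connection setup cost is paid once per partition instead of once per record. The looked-up records come back as ordinary values on the executor, not as a DataFrame.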
