Spark Streaming directory fills disk - apache-spark

I have a streaming job that is intended to run continuously. Its single step uses mapWithState and therefore requires checkpointing to be configured. I set it up with a local directory, as the job is only running on a single node at this stage.
I'm observing that the checkpoint directory grows quickly and continuously. Over the course of a few days it grows to over a million files and exhausts the inodes on the disk.
Questions:
Is this expected behavior?
Assuming not, how can I isolate what might be causing the snapshots not to be pruned?

The error was that checkpointing had been enabled via sparkContext.setCheckpointDir(checkpointDir) rather than sparkStreamingContext.checkpoint(checkpointDir).
The former was enough to make Spark run the stateful stream instead of complaining that checkpointing was not enabled, but the streaming checkpoint logic (which prunes old snapshots) was never invoked, because sparkStreamingContext.checkpointDir was null.
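For reference, a minimal sketch of the corrected setup; the batch interval and the checkpoint path are placeholders:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("stateful-stream")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Wrong (what this job was doing): only sets the RDD checkpoint directory
    // on the underlying SparkContext, so the streaming checkpoint logic that
    // prunes old mapWithState snapshots never runs.
    // ssc.sparkContext.setCheckpointDir("/tmp/checkpoints")

    // Right: enable checkpointing on the StreamingContext itself, so streaming
    // checkpoint data is written and old files are cleaned up.
    ssc.checkpoint("/tmp/checkpoints")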

Related

HBase batch loading with speed control because of a slow consumer

We need to load a large portion of data from HBase using Spark.
We then put it into Kafka and read it with a consumer. But the consumer is too slow,
and at the same time Kafka does not have enough capacity to hold the whole scan result.
Our row key contains ...yyyy.MM.dd, and we currently load 30 days in one Spark job, using the filter operator.
But we can't split the job into many jobs (30 jobs, each filtering one day), because each job would then have to scan all of HBase, which would make the overall scan too slow.
Right now we launch the Spark job with 100 threads, but we can't slow it down by using fewer threads (for example 7), because Kafka is shared with third-party developers, which sometimes makes it too busy to accept any data. So we need to control the HBase scan speed, continuously checking whether Kafka has room to store our data.
We tried saving the scan result somewhere before loading it into Kafka, for example as ORC files in HDFS, but the scan produces many small files, it is a problem to group them by size in memory (or is there a way? if you know one, please tell me how), and storing lots of small files in HDFS is bad. Merging such files is also a very expensive operation that takes so long it makes the total time too slow.
Suggested solutions:
Maybe Spark could store the scan result in HDFS (by setting some special flag on the filter operator), and we could then run 30 Spark jobs that select data from the saved result and push each result to Kafka when possible; a rough sketch of this idea follows the question.
Maybe there is an existing mechanism in Spark to stop and resume launched jobs.
Maybe there is an existing mechanism in Spark to split the result into batches (without control to stop and resume loading).
Maybe there is an existing mechanism in Spark to split the result into batches (with control to stop and resume loading based on an external condition).
Maybe when Kafka throws an exception (because there is no room to store data), there is some backpressure mechanism in Spark that pauses the scan for a while when such exceptions appear during execution. (I guess retries of the failing operator are limited; is it possible to make it retry forever, if that is a real solution?) But it would be better to keep some free space in Kafka rather than wait until it is overloaded.
Maybe we could use PageFilter in HBase (though I guess that would be hard to implement), or some other variant? I also suspect there would be too many objects in memory for PageFilter to work.
P.S.
This https://github.com/hortonworks-spark/shc/issues/108 will not help; we already use filter.
Any ideas would be helpful.
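A rough sketch of what I have in mind for the first option; scanDf, the row-key format, the paths and column names are placeholders, not our real connector setup:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.regexp_extract

    val spark = SparkSession.builder().appName("hbase-staging-sketch").getOrCreate()
    import spark.implicits._

    // Placeholder standing in for the filtered HBase scan result (e.g. loaded
    // via the SHC connector mentioned above).
    val scanDf = Seq(("prefix-2018.01.01-k1", "v1"), ("prefix-2018.01.02-k2", "v2"))
      .toDF("rowkey", "value")

    // 1. Stage the single full scan in HDFS, partitioned by day; repartition()
    //    keeps each day down to a handful of files instead of thousands of
    //    tiny ORC files.
    scanDf
      .withColumn("day", regexp_extract($"rowkey", """\d{4}\.\d{2}\.\d{2}""", 0))
      .repartition(30, $"day")
      .write.partitionBy("day").orc("hdfs:///staging/hbase-scan")

    // 2. Later, 30 small jobs (or one loop) read a single day at a time and
    //    push it to Kafka only when the consumer has caught up; partition
    //    pruning means each job reads only its own day's files.
    val oneDay = spark.read.orc("hdfs:///staging/hbase-scan")
      .where($"day" === "2018.01.01")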

Differences between persist(DISK_ONLY) vs manually saving to HDFS and reading back

This answer clearly explains RDD persist() and cache() and the need for it - (Why) do we need to call cache or persist on a RDD
So, I understand that calling someRdd.persist(DISK_ONLY) is lazy, but someRdd.saveAsTextFile("path") is eager.
But other than this (and also disregarding the manual cleanup of the text files stored in HDFS), is there any other difference (performance or otherwise) between using persist to cache the RDD to disk versus manually writing to and reading from disk?
Is there a reason to prefer one over the other?
More context: I came across code in our production application that manually writes to HDFS and reads the data back. I've just started learning Spark and was wondering whether this could be replaced with persist(DISK_ONLY). Note that the saved RDD in HDFS is deleted before every new run of the job and the stored data is not used for anything else between runs.
There are at least these differences:
Writing to HDFS has the replication overhead, while caching is written locally on the executor (or to a second replica if DISK_ONLY_2 is chosen).
Writing to HDFS is persistent, while cached data might get lost if/when an executor is killed for any reason. You already mentioned the benefit of writing to HDFS when the entire application goes down.
Caching does not change the partitioning, but reading from HDFS might/will result in different partitioning from the originally written DataFrame/RDD. For example, small partitions (files) will be aggregated and large files will be split.
I usually prefer to cache small/medium data sets that are expensive to evaluate, and write larger data sets to HDFS.
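To make the comparison concrete, here is a minimal sketch of both approaches; the RDD contents and the HDFS path are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    val sc = new SparkContext(new SparkConf().setAppName("persist-vs-hdfs"))

    // Stand-in for an RDD that is expensive to compute.
    val someRdd = sc.parallelize(1 to 1000000).map(i => s"record-$i")

    // Option 1: cache on executor-local disk. Lazy, not replicated, and the
    // blocks are lost (and recomputed) if the executor dies.
    someRdd.persist(StorageLevel.DISK_ONLY)
    someRdd.count()   // an action is needed to actually materialize the cache

    // Option 2: write to HDFS (eager, replicated, survives the application)
    // and read it back; note that `reloaded` may be partitioned differently.
    someRdd.saveAsTextFile("hdfs:///tmp/someRdd")
    val reloaded = sc.textFile("hdfs:///tmp/someRdd")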

Spark-Streaming Kafka Direct Streaming API & Parallelism

I understand the automatic mapping that exists between a Kafka partition, a Spark RDD partition and, ultimately, a Spark task. However, in order to properly size my executors (in number of cores), and therefore ultimately my nodes and cluster, I need to understand something that seems to be glossed over in the documentation.
In Spark Streaming, how exactly do data consumption, data processing and task allocation interact? In other words:
Does the Spark task corresponding to a Kafka partition both read and process the data?
The rationale behind this question is that in the previous, receiver-based API, a task was dedicated to receiving the data, meaning a number of task slots in your executors were reserved for data ingestion and the others were there for processing. This had an impact on how you sized your executors in terms of cores.
Take for example the advice on how to launch Spark Streaming with --master local. Everyone would tell you that for Spark Streaming you should use at least local[2], because one core will be dedicated to running the long receiving task that never ends, and the other core will do the data processing.
So if the answer is that in this case a single task does both the reading and the processing, the question that follows is: is that really smart? I mean, this sounds like it should be asynchronous. We want to be able to fetch while we process, so that by the next processing cycle the data is already there. But if there is only one core to both read the data and process it, how can both be done in parallel, and how does that make things faster in general?
My original understanding was that things would have remained somewhat the same, in the sense that a task would be launched to read, but the processing would be done in another task. That would mean that, if the processing task is not done yet, we could still keep reading, up to a certain memory limit.
Can someone outline clearly what exactly is going on here?
EDIT1
We don't even need this memory-limit control. Just the mere ability to fetch while the processing is going on, and to stop right there. In other words, the two processes should be asynchronous, with the limit simply being one step ahead. To me, if somehow this is not happening, it is extremely strange that Spark would implement something that breaks performance like that.
Does the Spark task corresponding to a Kafka partition both read and process the data?
The relationship is very close to what you describe, if by "task" we mean the part of the graph that reads from Kafka up until a shuffle operation. The flow of execution is as follows:
The driver reads the offsets from all Kafka topics and partitions.
The driver assigns each executor a topic and partition to be read and processed.
Unless there is a shuffle boundary operation, Spark will likely optimize the entire execution of the partition to run on the same executor.
This means that a single executor will read a given TopicPartition and process the entire execution graph on it, unless we need to shuffle. Since a Kafka partition maps to a partition inside the RDD, we get that guarantee.
Structured Streaming takes this even further. In Structured Streaming, there is stickiness between the TopicPartition and the worker/executor. Meaning, if a given worker was assigned a TopicPartition it is likely to continue processing it for the entire lifetime of the application.
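For illustration, a minimal direct-stream sketch showing the 1:1 mapping; the broker address, topic, group id and batch interval are placeholders:

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

    val ssc = new StreamingContext(new SparkConf().setAppName("direct-stream-sketch"), Seconds(5))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",           // placeholder broker
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "example-group",                     // placeholder group
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    // One RDD partition per Kafka TopicPartition; the same task that reads a
    // partition also runs the narrow transformations on it, up to any shuffle.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("events"), kafkaParams)
    )

    stream
      .map(record => record.value().length)   // narrow: stays in the reading task
      .foreachRDD { rdd =>
        println(s"Kafka partitions consumed this batch: ${rdd.getNumPartitions}")
      }

    ssc.start()
    ssc.awaitTermination()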

How can Spark in yarn recover from an executor loss with RDD persisted on disk

I prepare an RDD and compute it for a few hours. I use YARN. Sometimes executors get lost and Spark (1.6) goes haywire because it is missing source data.
It seems that persist(DISK) could help me in this situation.
But I wonder: since persisting to disk uses a non-DFS location to store the data, how can remote executors read it? Or is the computation stuck until YARN can schedule an executor onto that particular node?
Maybe I use the wrong mechanism and rdd.checkpoint(hdfs://) is more appropriate here?
So I've chosen checkpointing after all.
As I understand it, caching only aims to speed up further iterations, not to provide reliability. For example, with the property spark.dynamicAllocation.cachedExecutorIdleTimeout the data can even be removed at some point.
Based on the documentation
http://spark.apache.org/docs/latest/job-scheduling.html#graceful-decommission-of-executors after an executor is removed, its cached data can no longer be reached, even though the "spark shuffle service" is still available on the host (it serves a different purpose).
Checkpointing seems to work fine.
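A minimal sketch of what that looks like; the checkpoint path and the RDD are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("rdd-checkpoint-sketch"))

    // The checkpoint directory must be on a reliable, shared filesystem (HDFS)
    // so that the data survives executor loss and is readable from any node.
    sc.setCheckpointDir("hdfs:///user/me/checkpoints")

    // Stand-in for the RDD that takes hours to compute.
    val expensive = sc.parallelize(1 to 1000000).map(i => i.toLong * i)

    expensive.cache()        // optional, avoids computing the lineage twice
    expensive.checkpoint()   // marks the RDD; data is written on the next action
    expensive.count()        // triggers the job and materializes the checkpoint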

Where does Spark actually persist RDDs on disk?

I am using persist() with different storage levels, but I found no difference in performance between MEMORY_ONLY and DISK_ONLY.
I think there might be something wrong with my code... Where can I find the persisted RDDs on disk, so that I can make sure they were actually persisted?
As per the doc:
spark.local.dir (by default /tmp)
Directory to use for "scratch" space in Spark, including map output files and RDDs that get stored on disk. This should be on a fast, local disk in your system. It can also be a comma-separated list of multiple directories on different disks. NOTE: In Spark 1.0 and later this will be overridden by SPARK_LOCAL_DIRS (Standalone, Mesos) or LOCAL_DIRS (YARN) environment variables set by the cluster manager.
Two possible reasons for your observation:
RDDs are persisted in a lazy fashion; therefore, to make persistence happen you should call an action (e.g. count()) on the RDD after you call persist().
Even if you make sure the persist() happens, the actual data may not be written to disk: the write returns as soon as the data is in the OS buffer cache, so when you read it back shortly afterwards, it is simply served from that cache.
So, did the persist actually happen?
Did you clear the Linux buffer cache on each node after persisting the RDD as DISK_ONLY, before operating on it and measuring performance?
So what I suggest you do is:
persist the RDD as DISK_ONLY and invoke an action (e.g. count()) to make it persist;
sleep the application for a few seconds, and during this period clear the buffer cache on all the worker nodes:
sync && echo 3 > /proc/sys/vm/drop_caches
resume your procedure and measure the performance of the persisted RDD.
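Putting that together, a small sketch to confirm that blocks actually land on disk; the scratch directory and the RDD are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    val conf = new SparkConf()
      .setAppName("disk-only-check")
      .setMaster("local[*]")                          // quick local test
      .set("spark.local.dir", "/tmp/spark-scratch")   // where the blocks should appear

    val sc = new SparkContext(conf)
    val rdd = sc.parallelize(1 to 1000000).map(_.toString)

    rdd.persist(StorageLevel.DISK_ONLY)
    rdd.count()   // persist is lazy; an action is required to write the blocks

    // While the application is running, the persisted partitions appear as
    // rdd_<id>_<partition> files under /tmp/spark-scratch/**/blockmgr-*/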
