How to identify that accumulated metadata is a problem in Spark? - apache-spark

I saw that we can trigger automatic cleanup of accumulated metadata in Spark by setting spark.cleaner.ttl or by writing intermediate DataFrames to disk. But how do I know that accumulated metadata is actually a problem that I must handle in my jobs?
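There is no single metric that flags this, but one rough way to watch lineage/plan metadata growth is to track how the lineage debug string of an iteratively transformed DataFrame grows across iterations, alongside driver heap usage in the Spark UI. A minimal sketch (the loop, column names, and app name are made up for illustration):
import org.apache.spark.sql.SparkSession

object LineageGrowthCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("lineage-growth-check").getOrCreate()
    import spark.implicits._

    var df = Seq(1, 2, 3).toDF("value")

    // Simulate an iterative job that keeps extending the logical plan.
    for (i <- 1 to 50) {
      df = df.withColumn(s"step_$i", $"value" + i)
      // The size of the debug string is a rough proxy for how much
      // lineage metadata the driver is accumulating for this DataFrame.
      println(s"iteration $i: lineage string length = ${df.rdd.toDebugString.length}")
    }

    spark.stop()
  }
}
If this length (and driver memory) keeps growing while per-iteration time keeps climbing, truncating the lineage (for example by checkpointing or by writing intermediate results to disk) is likely worth it.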

Related

How to speed up recovery partition on S3 with Spark?

I am using Spark 3.0 on EMR to write some data to S3 with daily partitioning (the data goes back ~5 years), in this way:
writer.option("path", somepath).saveAsTable("my_schema.my_table")
Due to the large number of partitions, the process takes a very long time just to "recover partitions", even though all the tasks appear to have completed already. Is there any way to reduce this intermediate time?
In the above code, you haven't mentioned the write mode. The default write mode is ErrorIfExists, which adds overhead by checking whether the table already exists while writing.
We could also use dynamic partition overwrite, which can optimize the huge volume of writes, as discussed here.
Here is a sample snippet:
// while building the session
sparkSession.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
...
...
// while writing
yourDataFrame.write
  .option("path", somepath)
  .partitionBy("date")
  .mode(SaveMode.Overwrite)
  .saveAsTable("my_schema.my_table")
If you are only writing once and it's not a repeated process, the use case may not need dynamic partition overwrite. Dynamic partition overwrite is useful for repeated writes: only the partitions present in the new data are overwritten and the rest are left untouched, so the operation is idempotent and comes with performance benefits.

Spark's amnesia of parquet partitions when cached in-memory (native spark cache)

I am working on some batch processing with Spark, reading data from a partitioned parquet file which is around 2 TB. Right now, I am caching the whole file in memory, since I need to avoid reading the same parquet file multiple times (given the way we are analyzing the data).
Until some time back, the code was working fine. Recently, we added use cases which need to work on selected partitions only (like the average of a metric for the last 2 years, where the complete data spans 6+ years).
When we started measuring execution times, we observed that the use case which works on a subset of the partitioned data takes about as long as the use case which works on the complete data.
So, my question is: does Spark's in-memory caching honor the partitions of a parquet file, i.e. will Spark hold the partition information even after caching the data in memory?
Note: Since this is really a general question about Spark's processing style, I didn't add any metrics or code.
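One way to check this empirically (the path and the partition column year below are hypothetical) is to compare the physical plan of a filtered read against the raw parquet source with the plan of the same filter over the cached DataFrame:
import org.apache.spark.sql.SparkSession

object CachedPruningCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("cached-pruning-check").getOrCreate()

    // Hypothetical parquet dataset partitioned by year.
    val df = spark.read.parquet("s3://some-bucket/events")
    df.cache()
    df.count() // materialize the cache

    // Uncached: the file scan should show PartitionFilters with the year predicate.
    spark.read.parquet("s3://some-bucket/events").where("year >= 2019").explain()

    // Cached: the scan is replaced by an InMemoryTableScan over the cached
    // blocks, and the filter is applied on top of it.
    df.where("year >= 2019").explain()

    spark.stop()
  }
}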

How to do Incremental MapReduce in Apache Spark

In CouchDB and system designs like Incoop, there's a concept called "Incremental MapReduce" where results from previous executions of a MapReduce algorithm are saved and used to skip over sections of input data that haven't been changed.
Say I have 1 million rows divided into 20 partitions. If I run a simple MapReduce over this data, I could cache/store the result of reducing each separate partition before they're combined and reduced again to produce the final result. If I only change data in the 19th partition, then I only need to run the map & reduce steps on the changed section of the data, and then combine the new result with the saved reduce results from the unchanged partitions to get an updated result. Using this sort of caching, I'd be able to skip almost 95% of the work when re-running a MapReduce job on this hypothetical dataset.
Is there any good way to apply this pattern to Spark? I know I could write my own tool for splitting the input data into partitions, checking whether I've already processed each partition, loading it from a cache if I have, and then running the final reduce to join all the partitions together. However, I suspect there's an easier way to approach this.
I've experimented with checkpointing in Spark Streaming, and that is able to store results between restarts, which is almost what I'm looking for, but I want to do this outside of a streaming job.
RDD caching/persisting/checkpointing almost looks like something I could build off of - it makes it easy to keep intermediate computations around and reference them later, but I think cached RDDs are always removed once the SparkContext is stopped, even if they're persisted to disk. So caching wouldn't work for storing results between restarts. Also, I'm not sure if/how checkpointed RDDs are supposed to be loaded when a new SparkContext is started... They seem to be stored under a UUID in the checkpoint directory that's specific to a single instance of the SparkContext.
Both use cases suggested by the article (incremental log processing and incremental query processing) can generally be solved by Spark Streaming.
The idea is that incremental updates come in through the DStream abstraction. You can then process the new data and join it with previous calculations, either using time-window-based processing or using arbitrary stateful operations in Structured Streaming. The results of the calculation can later be dumped to some external sink like a database or file system, or they can be exposed as a SQL table.
If you're not building an online data processing system, regular batch Spark can be used as well. It's just a matter of how incremental updates get into the process and how intermediate state is saved. For example, incremental updates can appear under some path on a distributed file system, while the intermediate state containing the previous computation, joined with the computation over the new data, can be dumped, again, to the same file system.
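As a rough sketch of that last idea (all paths, the key/value columns, and the sum aggregation are made up for illustration), a batch job can persist per-partition partial reduces and then combine them with partials saved by earlier runs:
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.functions.sum

object IncrementalAggregate {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("incremental-aggregate").getOrCreate()

    // 1. Recompute partial aggregates only for the input partitions that changed.
    val changed = spark.read.parquet("s3://bucket/input/date=2024-01-19")
    changed.groupBy("key").agg(sum("value").as("partial_sum"))
      .write.mode(SaveMode.Overwrite)
      .parquet("s3://bucket/state/date=2024-01-19")

    // 2. The final result is a cheap re-reduce over all saved partials,
    //    most of which were produced by earlier runs.
    spark.read.parquet("s3://bucket/state")
      .groupBy("key").agg(sum("partial_sum").as("total"))
      .write.mode(SaveMode.Overwrite)
      .parquet("s3://bucket/output")

    spark.stop()
  }
}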

How to checkpoint a RDD without saving all of its data?

I am running a series of jobs, and an intermediate RDD is used in all of them, so I have cached that intermediate RDD. After some iterations it slows down, so I added RDD checkpointing after caching to break the lineage that is no longer required. In the Spark UI I can confirm that checkpointing is done correctly, but it also takes time because it writes each RDD to the local file system. What is an effective way to break unnecessary lineage without saving the actual RDD data?
The exact point of checkpointing is to save all of the data. That is what enables breaking the lineage and "forgetting" about the past. Without saving the data, breaking the lineage is simply not possible.
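If the main goal is just truncating the lineage and you can live without the fault-tolerance guarantees of reliable checkpointing, localCheckpoint() (available on RDDs and Datasets) materializes the data to executor storage instead of the checkpoint directory, which is usually faster; the data still has to be stored somewhere, and it is lost if an executor dies. A minimal sketch (the loop and sizes are made up for illustration):
import org.apache.spark.sql.SparkSession

object LocalCheckpointExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("local-checkpoint-example").getOrCreate()
    val sc = spark.sparkContext

    var rdd = sc.parallelize(1 to 1000000)
    for (i <- 1 to 20) {
      rdd = rdd.map(_ + 1)
      if (i % 5 == 0) {
        // Truncates the lineage by persisting the blocks on the executors
        // (MEMORY_AND_DISK by default) rather than in the checkpoint directory.
        rdd.localCheckpoint()
        rdd.count() // an action is needed to actually materialize it
      }
    }
    println(rdd.count())

    spark.stop()
  }
}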

spark streaming failed batches

I see some failed batches in my Spark Streaming application because of memory-related issues like
Could not compute split, block input-0-1464774108087 not found
and I was wondering if there is a way to reprocess those batches on the side without messing with the currently running application. Just in general, it does not have to be the same exact exception.
Thanks in advance
Pradeep
This may happen when your data ingestion rate into Spark is higher than what the allocated memory can hold. You can try changing the StorageLevel to MEMORY_AND_DISK_SER so that when memory runs low, Spark can spill data to disk. This should prevent the error.
Also, I don't think this error means that any data was lost during processing; rather, the input block that was added by your block manager timed out before processing started.
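For a receiver-based input, the storage level is usually passed when the input DStream is created; a minimal sketch with a socket source (the host, port, and batch interval are placeholders):
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SerializedSpillStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("serialized-spill-stream")
    val ssc = new StreamingContext(conf, Seconds(10))

    // MEMORY_AND_DISK_SER lets received blocks spill to disk in serialized
    // form instead of being dropped when memory runs low.
    val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER)

    lines.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}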
Check similar question on Spark User list.
Edit:
Data is not lost, it was just not present where the task was expecting it to be. As per Spark docs:
You can mark an RDD to be persisted using the persist() or cache()
methods on it. The first time it is computed in an action, it will be
kept in memory on the nodes. Spark’s cache is fault-tolerant – if any
partition of an RDD is lost, it will automatically be recomputed using
the transformations that originally created it.
