I am doing batch processing with Spark, reading data from a partitioned Parquet file of around 2 TB. Right now I cache the whole file in memory, to avoid reading the same Parquet file multiple times (given the way we analyze the data).
Until recently the code worked fine. We have since added use cases that work only on selected partitions (for example, the average of a metric over the last 2 years, where the complete data spans 6+ years).
When we started measuring execution times, we observed that a use case working on a subset of the partitioned data takes about as long as a use case that needs the complete data.
So my question is whether Spark's in-memory caching honors the partitions of a Parquet file, i.e., does Spark keep the partition information after the data has been cached in memory?
Note: Since this is really a general question about Spark's processing style, I haven't added any metrics or code.


How to speed up partition recovery on S3 with Spark?

I am using Spark 3.0 on EMR to write data to S3 with daily partitioning (the data goes back ~5 years), like this:
writer.option("path", somepath).saveAsTable("my_schema.my_table")
Because of the large number of partitions, the process spends a very long time just on "recover partitions", even though all tasks appear to have completed by then. Is there any way to reduce this intermediate time?
The code above does not specify a write mode. The default write mode is ErrorIfExists, which could add overhead by checking whether the target already exists while writing.
You could also use dynamic partition overwrite, which could help with the large volume of writes, as discussed here.
Here is a sample snippet:
# while building the session
sparkSession.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
# while writing
.option("path", somepath)
If this is a one-time write rather than a repeated process, the use case may not need dynamic partition overwrite. Dynamic partition overwrite is useful because it replaces only the partitions being written and leaves already-written partitions untouched. That makes a rerun idempotent and brings performance benefits.
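To make the difference concrete, here is a toy model in plain Python (no Spark involved). It assumes a "table" is just a dict from partition key to rows. The function names are made up for illustration and only mimic the behaviour of the two values of spark.sql.sources.partitionOverwriteMode.

```python
# Toy model only: a "table" is a dict mapping a partition key (e.g. a date
# string) to the rows stored in that partition. These helpers are hypothetical
# and are not part of any Spark API.

def overwrite_static(table, incoming):
    """Static mode: the whole table is replaced by the incoming partitions."""
    return {key: list(rows) for key, rows in incoming.items()}


def overwrite_dynamic(table, incoming):
    """Dynamic mode: only partitions present in the incoming data are replaced."""
    result = {key: list(rows) for key, rows in table.items()}
    for key, rows in incoming.items():
        result[key] = list(rows)
    return result


existing = {"2020-01-01": ["a"], "2020-01-02": ["b"]}
rerun = {"2020-01-02": ["b2"]}  # a job that rewrites a single day

static_result = overwrite_static(existing, rerun)
dynamic_result = overwrite_dynamic(existing, rerun)

# Re-running the same write in dynamic mode leaves the table unchanged.
idempotent = overwrite_dynamic(dynamic_result, rerun) == dynamic_result
```

In this model the static overwrite loses the 2020-01-01 partition, while the dynamic overwrite keeps it and replaces only 2020-01-02.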

What is the difference between overwrite and append to Parquet?

What is the difference between append and overwrite when writing Parquet in Spark?
I'm processing a huge amount of data covering, say, 10 days. At present I process daily logs into Parquet files using the "append" mode, partitioning the data by date. The problem is that each day's data is also very large and takes a long time to process, with high CPU usage on the EMR cluster. This makes my job very slow and expensive. So I'm looking for a way to split the data further and then merge it back into the per-day data.
Please see the Spark SaveMode docs.

Keeping only a small window of output data in Spark Structured Streaming

I am interested in using Spark Structured Streaming for real-time processing of data from, for example, the last ~24 hours of running, but I have not been able to find the right solution for this problem.
Some useful information about the overall situation:
Data flows into Spark constantly, so the stream is active 24/7
Spark performs some operations and then writes data to files (e.g. Parquet)
Watermarking is used to reduce state size
Someone wants to work only on the most recent data produced by Spark Structured Streaming (for example, all data from the last 24 hours) to get a quick view of what happened recently and for further, very specific analysis.
From what I understand, watermarking helps manage state size, so Spark does not hold the entire history in state. This is a good thing and solves one problem with running 24/7.
The other problem is the output data. Currently Spark appends the data and nothing else, so the output grows bigger and bigger. Using the memory sink for testing creates memory problems. I didn't try the file sink because it creates one file for each record (ugh), so there's a risk of using up all available inodes in the system extremely quickly. I can create one file per window with the file sink.
So my question is:
Is it possible to force Spark Structured Streaming to delete output data after some amount of time, when it is no longer needed? I want to keep output data only from, for example, the last 24 hours. Is there any built-in solution or do I need to do it on my own? If I do it on my own, wouldn't the checkpoint data and Spark metadata get corrupted?
Using watermarking will allow us to keep only the last 24 hours of state:
df.withWatermark("timestamp", "24 hours") // where timestamp is your event-time field
From the Spark docs:
in Spark 2.1, we have introduced watermarking, which lets the engine
automatically track the current event time in the data and attempt to
clean up old state accordingly.
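As a rough illustration of what "clean up old state" means, here is a toy model in plain Python. It is not Spark's API. The class and method names are made up, and it assumes the watermark is simply the largest event time seen so far minus the allowed delay. Note that it models in-memory state only, not files already written by a sink.

```python
from datetime import datetime, timedelta


class WatermarkedState:
    """Toy model of event-time state with a watermark (hypothetical, not Spark)."""

    def __init__(self, delay):
        self.delay = delay
        self.max_event_time = None
        self.state = {}  # event time -> value

    def watermark(self):
        """Largest event time seen so far minus the allowed delay, or None."""
        if self.max_event_time is None:
            return None
        return self.max_event_time - self.delay

    def process_batch(self, events):
        # Late-data check uses the watermark computed from earlier batches.
        current = self.watermark()
        for event_time, value in events:
            if current is not None and event_time < current:
                continue  # too late: the event is dropped
            self.state[event_time] = value
            if self.max_event_time is None or event_time > self.max_event_time:
                self.max_event_time = event_time
        # After the batch, advance the watermark and evict state older than it.
        current = self.watermark()
        if current is not None:
            self.state = {t: v for t, v in self.state.items() if t >= current}


t0 = datetime(2021, 1, 1)
state = WatermarkedState(timedelta(hours=24))
state.process_batch([(t0, "a"), (t0 + timedelta(hours=30), "b")])
state.process_batch([(t0 + timedelta(hours=1), "late")])
kept = sorted(state.state.values())
```

After the first batch the watermark sits 24 hours behind the newest event, so the entry at t0 is evicted and the late event in the second batch is ignored.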

How to do Incremental MapReduce in Apache Spark

In CouchDB and system designs like Incoop, there's a concept called "Incremental MapReduce" where results from previous executions of a MapReduce algorithm are saved and used to skip over sections of input data that haven't been changed.
Say I have 1 million rows divided into 20 partitions. If I run a simple MapReduce over this data, I could cache/store the result of reducing each separate partition, before they're combined and reduced again to produce the final result. If I only change data in the 19th partition then I only need to run the map & reduce steps on the changed section of the data, and then combine the new result with the saved reduce results from the unchanged partitions to get an updated result. Using this sort of caching I'd be able to skip almost 95% of the work for re-running a MapReduce job on this hypothetical dataset.
Is there any good way to apply this pattern to Spark? I know I could write my own tool for splitting up input data into partitions, checking if I've already processed those partitions before, loading them from a cache if I have, and then running the final reduce to join all the partitions together. However, I suspect that there's an easier way to approach this.
I've experimented with checkpointing in Spark Streaming, and that is able to store results between restarts, which is almost what I'm looking for, but I want to do this outside of a streaming job.
RDD caching/persisting/checkpointing almost looks like something I could build off of - it makes it easy to keep intermediate computations around and reference them later, but I think cached RDDs are always removed once the SparkContext is stopped, even if they're persisted to disk. So caching wouldn't work for storing results between restarts. Also, I'm not sure if/how checkpointed RDDs are supposed to be loaded when a new SparkContext is started... They seem to be stored under a UUID in the checkpoint directory that's specific to a single instance of the SparkContext.
Both use cases suggested by the article (incremental logs processing and incremental query processing) can be generally solved by Spark Streaming.
The idea is that you have incremental updates coming in through the DStreams abstraction. You can then process the new data and join it with the previous calculation, either with time-window-based processing or with arbitrary stateful operations in Structured Streaming. The results of the calculation can later be dumped to an external sink such as a database or file system, or they can be exposed as an SQL table.
If you're not building an online data processing system, regular Spark can be used as well. It's just a matter of how incremental updates get into the process, and how intermediate state is saved. For example, incremental updates can appear under some path on a distributed file system, while intermediate state containing previous computation joined with new data computation can be dumped, again, to the same file system.
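A minimal sketch of that last idea in plain Python, without Spark: per-partition reduce results are dumped to a directory and reused on the next run if the partition's content has not changed. The function names, the use of a content hash as the cache key, and the choice of sum as the reduce step are all assumptions made for illustration.

```python
import hashlib
import json
import os
import tempfile


def partition_digest(rows):
    """Content hash used to detect whether a partition has changed."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def incremental_sum(partitions, cache_dir):
    """Sum all rows, recomputing only partitions whose content hash is new.

    Returns (total, number_of_partitions_recomputed).
    """
    os.makedirs(cache_dir, exist_ok=True)
    total, recomputed = 0, 0
    for rows in partitions:
        path = os.path.join(cache_dir, partition_digest(rows) + ".json")
        if os.path.exists(path):
            # Unchanged partition: load the saved partial result.
            with open(path) as f:
                partial = json.load(f)
        else:
            partial = sum(rows)  # the per-partition "reduce"
            with open(path, "w") as f:
                json.dump(partial, f)
            recomputed += 1
        total += partial  # final reduce over the partial results
    return total, recomputed


cache_dir = tempfile.mkdtemp()
data = [list(range(i * 5, i * 5 + 5)) for i in range(20)]  # 20 partitions
first = incremental_sum(data, cache_dir)   # cold run: every partition reduced
data[18][0] += 100                         # change one row in the 19th partition
second = incremental_sum(data, cache_dir)  # only the changed partition is reduced
```

Because the partial results live on disk rather than in a SparkContext, they survive restarts. In a real job the cache directory would sit on a distributed file system.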

Apache Spark node asking master for more data?

I'm trying to benchmark a few approaches to putting an image processing algorithm into Apache Spark. For one step in this algorithm, a computation on a pixel in the image will depend on an unknown amount of surrounding data, so we can't partition the image with guaranteed sufficient overlap a priori.
One solution to that problem I need to benchmark is for a worker node to ask the master node for more data when it encounters a pixel with insufficient surrounding data. I'm not convinced this is the way to do things, but I need to benchmark it anyway because of reasons.
Unfortunately, after a bunch of googling and reading docs I can't find any way for a processingFunc called as part of sc.parallelize(partitions).map(processingFunc) to query the master node for more data from a different partition mid-computation.
Does a way for a worker node to ask the master for more data exist in Spark, or will I need to hack something together that kind of goes around Spark?
The master node in Spark allocates resources to a job. Once the resources are allocated, the driver ships the complete code, with all its dependencies, to the executors.
The first step in any job is to load the data into the Spark cluster. You can read the data from any underlying data repository, such as a database, a file system, or web services.
Once the data is loaded it is wrapped in an RDD, which is partitioned across the nodes in the cluster and stored in the workers' (executors') memory. You can control the number of partitions through the RDD API, but you should do so only when you have a valid reason.
All operations are then performed on RDDs using the methods exposed by the RDD API. An RDD keeps track of its partitions and the data in them and, depending on the request, automatically queries the appropriate partition.
In a nutshell, you do not have to worry about how an RDD partitions the data, which partition stores which data, or how the partitions communicate with each other. If you do care, you can write your own custom partitioner that tells Spark how to partition your data.
Secondly, if your data cannot be partitioned, I do not think Spark is an ideal choice, because everything would be processed on a single machine, which is contrary to the idea of distributed computing.
I am not sure exactly what your use case is, but there are people who have been using Spark for image processing. See here for the comments from Databricks.
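As a sketch of what such a custom partitioner could look like for image data, here is a plain-Python function that maps a pixel coordinate to a tile index. The names tile_partitioner and partition_for are made up. The assumption is a pair RDD keyed by (x, y), in which case a function of this shape could be passed as the partitionFunc argument of PySpark's RDD.partitionBy(numPartitions, partitionFunc). It does not solve the unknown-overlap problem from the question. It only shows how to control which pixels land together.

```python
def tile_partitioner(tile_size, tiles_per_row):
    """Build a function mapping an (x, y) pixel key to a tile index.

    Hypothetical helper: pixels in the same tile_size x tile_size square
    get the same index, so they would end up in the same partition.
    """
    def partition_for(key):
        x, y = key
        return (y // tile_size) * tiles_per_row + (x // tile_size)
    return partition_for


# An 8x8 image split into 4x4 tiles gives 2 tiles per row and 4 partitions.
partition_for = tile_partitioner(tile_size=4, tiles_per_row=2)
assignments = {(x, y): partition_for((x, y)) for y in range(8) for x in range(8)}
```

Each of the four tiles receives 16 pixels here, and neighbouring pixels stay in the same partition except along tile borders.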
