How to speed up partition recovery on S3 with Spark? - apache-spark

I am using Spark 3.0 on EMR to write data to S3 with daily partitioning (the data goes back about 5 years), like this:
writer.option("path", somepath).saveAsTable("my_schema.my_table")
Because of the large number of partitions, the process spends a very long time just on "recover partitions", even though all tasks appear to have completed by then. Is there any way to reduce this intermediate time?

The code above does not set a write mode. The default write mode is ErrorIfExists, which can add overhead because Spark checks whether the target already exists before writing.
You could also use dynamic partition overwrite, which can help when writing a huge number of partitions.
Here is a sample snippet:
import org.apache.spark.sql.SaveMode

// while building the session
sparkSession.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
...
...
// while writing
yourDataFrame.write
  .option("path", somepath)
  .partitionBy("date")
  .mode(SaveMode.Overwrite)
  .saveAsTable("my_schema.my_table")
If this is a one-time write rather than a repeated process, you may not need dynamic partition overwrite. It is useful when you want to avoid overwriting partitions that have already been written: only the partitions present in the incoming data are replaced, which makes the operation idempotent and can improve performance.
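To make the difference concrete, here is a toy model in plain Python (not Spark code) of what the two overwrite modes do to a partitioned table. The `overwrite` function and the sample dictionaries are hypothetical names used only for illustration; the assumption is that a table can be pictured as a mapping from partition value to rows.

```python
def overwrite(table, incoming, mode):
    """Return the table contents after an overwrite.

    table, incoming: dict mapping partition value -> list of rows.
    mode: "static" drops every existing partition first;
          "dynamic" replaces only the partitions present in `incoming`.
    """
    if mode == "static":
        return {part: list(rows) for part, rows in incoming.items()}
    if mode == "dynamic":
        result = {part: list(rows) for part, rows in table.items()}
        for part, rows in incoming.items():
            result[part] = list(rows)
        return result
    raise ValueError(f"unknown mode: {mode}")


existing = {"2020-01-01": ["a"], "2020-01-02": ["b"]}
new_data = {"2020-01-02": ["b2"], "2020-01-03": ["c"]}

static_result = overwrite(existing, new_data, "static")
dynamic_result = overwrite(existing, new_data, "dynamic")
```

In this model the static result loses the 2020-01-01 partition, while the dynamic result keeps it and only replaces or adds the partitions that appear in `new_data`. Running the dynamic overwrite a second time with the same input leaves the table unchanged, which is the idempotence mentioned above.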

Related

HBase batch loading with speed control because of a slow consumer

We need to load a large amount of data from HBase using Spark.
We then put it into Kafka, where it is read by a consumer, but the consumer is too slow.
At the same time, Kafka does not have enough memory to hold the whole scan result.
Our key contains ...yyyy.MM.dd, and we currently load 30 days in one Spark job using a filter operator.
We can't split the job into many jobs (30 jobs, each filtering one day), because then each job would have to scan all of HBase, which would make the total scan too slow.
We currently launch the Spark job with 100 threads, and we can't slow it down simply by using fewer threads (for example, 7 threads), because Kafka is also used by third-party developers, which sometimes makes Kafka too busy to hold any of our data. So we need to control the HBase scan speed, continuously checking whether Kafka has memory available to store our data.
We tried saving the scan result somewhere before loading it into Kafka, for example as ORC files in HDFS, but the scan result produces many small files. Grouping them by size is a problem (if you know a way, please tell me how), and storing small files in HDFS is bad. Merging such files is a very expensive operation and takes so long that the total time becomes too slow.
Suggested solutions:
Maybe it is possible to store the scan result in HDFS with Spark, by setting some special flag in the filter operator, and then run 30 Spark jobs that select data from the saved result and put each result into Kafka when possible.
Maybe Spark has an existing mechanism to stop and resume launched jobs.
Maybe Spark has an existing mechanism to split the result into batches (without control over stopping and resuming the load).
Maybe Spark has an existing mechanism to split the result into batches (with control over stopping and resuming the load based on an external condition).
Maybe, when Kafka throws an exception (because there is no room to store data), Spark has a backpressure mechanism that pauses the scan for some time when exceptions appear during execution. (I guess the number of retries for restarting an operator is limited; is it possible to make it retry forever, if this is a real solution?) It would be better, though, to keep some free space in Kafka rather than wait until it is overloaded.
Use PageFilter in HBase (though I guess it is hard to implement), or some other solution? I also guess there are too many objects in memory to use PageFilter.
P.S
This https://github.com/hortonworks-spark/shc/issues/108 will not help; we already use a filter.
Any ideas would be helpful.

Memory Management Pyspark

1.) I understand that "Spark's operators spills data to disk if it does not fit memory allowing it to run well on any sized data".
If this is true, why do we ever get OOM (Out of Memory) errors?
2.) Increasing the number of executor cores increases parallelism. Would that also increase the chances of OOM, because the same memory is now divided into smaller parts for each core?
3.) Spark is much more susceptible to OOM than Hive because it performs operations in memory, whereas Hive repeatedly reads from and writes to disk. Is that correct?
There is one angle you need to consider here. You may run into memory problems if the data is not properly distributed. You need to distribute your data evenly (if possible) across the tasks so that you reduce shuffling as much as possible and let each task manage its own data. If you need to perform a join and the data is distributed randomly, every task (and therefore every executor) will have to:
See what data it has
Send data to other executors (and tasks) to provide the keys they need
Request the data that the task needs from the others
All that data exchange may cause network bottlenecks if you have a large dataset, and it also makes every task hold its own data in memory plus whatever has been sent to it and any temporary objects. All of that can blow up memory.
To prevent that situation you can:
Load the data already partitioned. By that I mean, if you are loading from a DB, try Spark's stride-based partitioning: see the partitionColumn, lowerBound and upperBound attributes. That way the dataframe gets a number of partitions that place the data on different tasks based on the criteria you need. If you are going to join two dataframes, partition both in a similar way so that their partitions are similar (if not the same), which helps avoid shuffling over the network.
When you define partitions, try to make the values as evenly distributed among tasks as possible.
Each partition should fit in memory. Spilling to disk is possible, but it slows down performance.
If you don't have a column that distributes the data evenly, try to create one with n distinct values, where n depends on the number of tasks you have.
If you are reading from a CSV, creating partitions is harder but still possible. You can either split the data (CSV) into multiple files and create multiple dataframes (performing a union after they are loaded), or read the big CSV and repartition on the column you need. That causes a shuffle as well, but it is done only once if you cache the repartitioned dataframe.
When reading from Parquet you may have multiple files, but if they are not evenly sized (because the process that generated them didn't do it well) you may still end up with OOM errors. To prevent that, you can load the data and repartition the dataframe too.
Another trick, valid for CSV, Parquet, ORC, etc., is to create a Hive table on top of the files and query it from Spark with a distribute by clause, so that Hive redistributes the data instead of Spark.
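The point about creating a synthetic column with n distinct values can be illustrated with a small plain-Python sketch (not Spark code). The row layout, the `partition_sizes` helper and the modulo stand-in for hash partitioning are all assumptions made for the example: 90% of the rows share one customer id, so partitioning on that id is badly skewed, while partitioning on a derived bucket column is even.

```python
from collections import Counter


def partition_sizes(rows, key_fn, num_partitions):
    """Count the rows that land in each partition for a given key function.

    `key % num_partitions` is a stand-in for hash partitioning.
    """
    counts = Counter(key_fn(row) % num_partitions for row in rows)
    return [counts.get(p, 0) for p in range(num_partitions)]


NUM_PARTITIONS = 4

# Hypothetical rows of (row_id, customer_id); customer 1 owns 90% of them.
rows = [(i, 1 if i % 10 else i) for i in range(1000)]

# Partitioning on the skewed business key.
by_customer = partition_sizes(rows, lambda r: r[1], NUM_PARTITIONS)

# Partitioning on a synthetic bucket column derived from the row id.
by_bucket = partition_sizes(rows, lambda r: r[0] % NUM_PARTITIONS, NUM_PARTITIONS)
```

Here `by_customer` puts 900 of the 1000 rows into a single partition, while `by_bucket` gives every partition 250 rows. The trade-off is that the synthetic column carries no business meaning, so it helps with even distribution but not with co-locating rows that share a join key.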
As for your question about Hive and Spark, I think you are right up to a point. Depending on the execution engine that Hive uses in your case (map/reduce, Tez, Hive on Spark, LLAP), you can see different behaviours. With map/reduce, since the operations are mostly disk based, the chance of an OOM is much lower than with Spark. From a memory point of view, map/reduce is not affected that much by a skewed data distribution. But (IMHO) your goal should always be to find the best data distribution for the Spark job you are running, and that will prevent the problem.
Another consideration is whether you are testing in a dev environment that doesn't have the same data as the prod environment. I suppose the data distribution should be similar, although volumes may differ a lot (I am talking from experience ;)). In that case, the Spark tuning parameters you pass to the spark-submit command may need to be different in prod. So you need to invest some time in finding the best approach in dev and then fine-tune in prod.
1.) The vast majority of OOM errors in Spark occur on the driver, not the executors. This is usually the result of running .collect or similar actions on a dataset that won't fit in driver memory.
2.) Spark does a lot of work under the hood to parallelize the work. When using the structured APIs (in contrast to RDDs), the chances of causing an OOM on an executor are really slim. Some combinations of cluster configuration and jobs can cause memory pressure that hurts performance and triggers a lot of garbage collection, so you need to address it, but Spark should be able to handle low memory without an explicit exception.
3.) Not really - as above, Spark should be able to recover from memory issues when using the structured APIs, although it may need intervention if you see heavy garbage collection and a performance impact.
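For question 2, a rough back-of-the-envelope sketch of how the memory available to each concurrent task shrinks as cores are added. It assumes Spark's unified memory manager with its defaults (300 MB of reserved memory and spark.memory.fraction = 0.6) and a hypothetical 8 GB executor heap; it ignores storage usage, off-heap memory and memory overhead, so treat the numbers as an upper bound rather than a prediction.

```python
RESERVED_MB = 300        # assumption: Spark's fixed reserved memory
MEMORY_FRACTION = 0.6    # assumption: default spark.memory.fraction


def per_task_share_mb(executor_heap_mb, executor_cores):
    """Rough upper bound on unified memory per concurrently running task."""
    unified_mb = (executor_heap_mb - RESERVED_MB) * MEMORY_FRACTION
    return unified_mb / executor_cores


# Hypothetical 8 GB executor heap with a growing number of cores.
shares = {cores: per_task_share_mb(8192, cores) for cores in (1, 2, 4, 8)}
```

With the same heap, going from 1 core to 8 cores cuts the per-task share to one eighth, which is why more cores per executor means more spilling and garbage collection pressure unless executor memory grows with them.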

Spark's amnesia of parquet partitions when cached in-memory (native spark cache)

I am working on some batch processing with Spark, reading data from a partitioned Parquet file of around 2 TB. Right now I cache the whole file in memory, because I want to avoid reading the same Parquet file multiple times (given the way we analyze the data).
Until some time ago, the code worked fine. Recently, we added use cases that need to work on selected partitions only (such as the average of a metric over the last 2 years, where the complete data spans 6+ years).
When we started measuring execution times, we observed that a use case that works on a subset of the partitioned data takes about as long as a use case that needs the complete data.
So my question is whether Spark's in-memory caching honors the partitions of a Parquet file, i.e., does Spark keep the partition information even after caching the data in memory?
Note: Since this is really a general question about Spark's processing style, I didn't add any metrics or code.

spark partitionBy out of memory failures

I have a Spark 2.2 job written in PySpark that's trying to read in 300BT of Parquet data from a Hive table, run it through a Python UDF, and then write it out.
The input is partitioned on about five keys and results in about 250k partitions.
I then want to write it out using the same partition scheme using the .partitionBy clause for the dataframe.
When I don't use a partitionBy clause the data writes out and the job does finish eventually. However with the partitionBy clause I continuously see out of memory failures on the spark UI.
Upon further investigation the source parquet data is about 800MB on disk (compressed using snappy), and each node has about 50G of memory available to it.
Examining the spark UI I see that the last step before writing out is doing a sort. I believe this sort is the cause of all my issues.
When reading in a dataframe of partitioned data, is there any way to preserve knowledge of this partitioning so spark doesn't run an unnecessary sort before writing it out?
I'm trying to avoid adding a shuffle step by repartitioning, since that could equally cause further delays.
Ultimately I can rewrite to read one partition at a time, but I think that's not a good solution and that spark should already be able to handle this use case.
I'm running with about 1500 executors across 150 nodes on ec2 r3.8xlarge.
I've tried smaller executor configs and larger ones and always hit the same out of memory issues.

How to do Incremental MapReduce in Apache Spark

In CouchDB and system designs like Incoop, there's a concept called "Incremental MapReduce" where results from previous executions of a MapReduce algorithm are saved and used to skip over sections of input data that haven't been changed.
Say I have 1 million rows divided into 20 partitions. If I run a simple MapReduce over this data, I could cache/store the result of reducing each separate partition, before they're combined and reduced again to produce the final result. If I only change data in the 19th partition, then I only need to run the map & reduce steps on the changed section of the data, and then combine the new result with the saved reduce results from the unchanged partitions to get an updated result. Using this sort of caching I'd be able to skip almost 95% of the work for re-running a MapReduce job on this hypothetical dataset.
Is there any good way to apply this pattern to Spark? I know I could write my own tool for splitting up input data into partitions, checking if I've already processed those partitions before, loading them from a cache if I have, and then running the final reduce to join all the partitions together. However, I suspect that there's an easier way to approach this.
I've experimented with checkpointing in Spark Streaming, and that is able to store results between restarts, which is almost what I'm looking for, but I want to do this outside of a streaming job.
RDD caching/persisting/checkpointing almost looks like something I could build off of - it makes it easy to keep intermediate computations around and reference them later, but I think cached RDDs are always removed once the SparkContext is stopped, even if they're persisted to disk. So caching wouldn't work for storing results between restarts. Also, I'm not sure if/how checkpointed RDDs are supposed to be loaded when a new SparkContext is started... They seem to be stored under a UUID in the checkpoint directory that's specific to a single instance of the SparkContext.
Both use cases suggested by the article (incremental log processing and incremental query processing) can generally be solved with Spark Streaming.
The idea is that incremental updates come in through the DStreams abstraction. You can then process the new data and join it with the previous calculation, using either time-window-based processing or arbitrary stateful operations as part of Structured Streaming. The results of the calculation can later be dumped to some sort of external sink such as a database or file system, or they can be exposed as an SQL table.
If you're not building an online data processing system, regular Spark can be used as well. It's just a matter of how incremental updates get into the process, and how intermediate state is saved. For example, incremental updates can appear under some path on a distributed file system, while intermediate state containing previous computation joined with new data computation can be dumped, again, to the same file system.
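As a minimal sketch of that last idea, here is a plain-Python illustration (not Spark code) of saving per-partition partial results and reusing them for unchanged partitions. The `fingerprint` and `incremental_sum` helpers, the JSON state files and the use of a content hash to detect changes are all assumptions made for the example; in a real Spark job the state would live on a distributed file system rather than a local temporary directory.

```python
import hashlib
import json
import os
import tempfile


def fingerprint(partition):
    """Stable content hash used to detect whether a partition changed."""
    payload = json.dumps(partition, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def incremental_sum(partitions, state_dir):
    """Sum all partitions, reusing saved partial sums for unchanged ones.

    Returns (total, number_of_partitions_recomputed).
    """
    os.makedirs(state_dir, exist_ok=True)
    partials, recomputed = [], 0
    for partition in partitions:
        path = os.path.join(state_dir, fingerprint(partition) + ".json")
        if os.path.exists(path):
            with open(path) as f:
                partial = json.load(f)      # reuse the saved partial result
        else:
            partial = sum(partition)        # per-partition map + reduce step
            with open(path, "w") as f:
                json.dump(partial, f)
            recomputed += 1
        partials.append(partial)
    return sum(partials), recomputed        # final combine over the partials


state_dir = tempfile.mkdtemp()
data = [list(range(i * 5, i * 5 + 5)) for i in range(20)]  # 20 partitions

first_total, first_recomputed = incremental_sum(data, state_dir)
data[18][0] += 100  # change only the 19th partition
second_total, second_recomputed = incremental_sum(data, state_dir)
```

The first run computes all 20 partial sums; the second run recomputes only the changed partition and reads the other 19 from the saved state, which matches the "skip almost 95% of the work" scenario from the question. Because the state is keyed by content rather than by a SparkContext-specific identifier, it survives restarts.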
