Breaking lineage of an RDD without relying on HDFS

I'm running a Spark application on Amazon spot instances and exporting the results to Parquet files on S3. The tasks are memory intensive, so I have to run the initial calculations using a large number of partitions (hundreds of thousands). At the end, I would like to coalesce them into a few large partitions and save those as big Parquet files. This is where I get into trouble:
- If I use .coalesce(), which is a narrow transformation, the entire lineage that precedes the coalesce is executed on a small number of partitions, which causes OOMs.
- If I use .repartition(), I rely on HDFS for the shuffle files. This is a problem on spot instances, which may be decommissioned, leaving corrupt/missing HDFS blocks.
- Checkpointing also relies on HDFS, so I can't use that.
- Converting to a DataFrame and back (rdd.toDF.rdd) didn't actually break the lineage. Am I missing something?
To conclude, I'm looking for a way to coalesce to a smaller number of partitions only to persist the data on S3. I would like the calculation itself to run on the original partitions.


Spark SQL data storage life cycle

I recently had an issue with one of my Spark jobs: I was reading a Hive table with several billion records, and the job failed because of high disk utilization. After adding an AWS EBS volume, the job ran without any issues. Although that resolved the problem, I still have a few doubts. I tried doing some research but couldn't find any clear answers, so my questions are:
When Spark SQL reads a Hive table, where is the data stored for initial processing, and what is the entire life cycle of the data in terms of storage if I don't explicitly specify anything? And how does adding EBS volumes solve the issue?
Spark will read the data and, if it does not fit in memory, spill it to disk.
A few things to note:
Data in memory is compressed. From what I read, you gain about 20% (e.g. a 100MB file will take only 80MB of memory).
Ingestion starts as soon as you call read(); it is not part of the DAG. You can limit how much you ingest in the SQL query itself. The read operation is done by the executors. This example should give you a hint:
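A minimal PySpark sketch of limiting ingestion in the query itself. The helper name read_filtered, the table name events, and the column event_date are hypothetical; spark is assumed to be an existing SparkSession (with Hive support enabled if the table lives in Hive).

```python
def read_filtered(spark, table, predicate):
    """Return a DataFrame restricted by `predicate` in the SQL query itself.

    `spark` is assumed to be an existing SparkSession. `table` and
    `predicate` are hypothetical, caller-supplied strings.
    """
    # The WHERE clause is part of the query, so only matching rows are requested.
    query = f"SELECT * FROM {table} WHERE {predicate}"
    return spark.sql(query)
```

It could be called as read_filtered(spark, "events", "event_date = '2020-01-01'"), assuming such a table and column exist.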
In recent versions of Spark, you can push down the filter (for example, if you filter right after the ingestion, Spark will know and optimize the ingestion). I think this works only for CSV, Avro, and Parquet. For databases (including Hive), the previous example is what I'd recommend.
Storage MUST be seen/accessible from the executors, so if you have EBS volumes, make sure they are accessible from the cluster nodes where the executors/workers are running, not just the node where the driver is running.
Initially the data is in the table location in HDFS/S3/etc. Spark spills data to local storage if it does not fit in memory.
Does my data need to fit in memory to use Spark?
No. Spark's operators spill data to disk if it does not fit in memory,
allowing it to run well on any sized data. Likewise, cached datasets
that do not fit in memory are either spilled to disk or recomputed on
the fly when needed, as determined by the RDD's storage level.
Whenever Spark reads data from Hive tables, it stores it in an RDD. One point I want to make clear here is that Hive is just a warehouse, a layer on top of HDFS. When Spark interacts with Hive, Hive gives Spark the HDFS location where the data lives.
Thus, when Spark reads a file from HDFS, it creates a single partition for a single input split. The input split is set by the Hadoop InputFormat used to read the file. For example, if you use textFile(), it would be TextInputFormat in Hadoop, which returns a single partition for a single HDFS block (note: the split between partitions is done on a line boundary, not the exact block boundary), unless you have a compressed file format like Avro/Parquet.
If you manually add rdd.repartition(x), it performs a shuffle of the data from the N partitions you have in the RDD to the x partitions you want, and the partitioning is done on a round-robin basis.
If you have a 10GB uncompressed text file stored on HDFS with the HDFS block size set to 256MB, it is stored in 40 blocks, which means that the RDD you read from this file has 40 partitions. When you call repartition(1000), your RDD is marked to be repartitioned, but it is actually shuffled to 1000 partitions only when you execute an action on top of this RDD (lazy execution).
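The arithmetic behind the 40 partitions, as a small sketch. It assumes one input split (and so one partition) per HDFS block, as described above; hdfs_block_count is a hypothetical helper, not a Spark or Hadoop API.

```python
import math


def hdfs_block_count(file_size_mb, block_size_mb):
    """Number of HDFS blocks needed to store a file of the given size."""
    return math.ceil(file_size_mb / block_size_mb)


file_size_mb = 10 * 1024  # 10GB expressed in MB

print(hdfs_block_count(file_size_mb, 256))  # 40
print(hdfs_block_count(file_size_mb, 128))  # 80
```

With a 128MB block size the same file would be stored in 80 blocks, so the initial partition count follows the block size rather than the number of cores.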
From here it is up to Spark how it processes the data. Because Spark evaluates lazily, it prepares a DAG for optimal processing before doing any work. One more point: Spark needs configuration for driver memory, number of cores, number of executors, etc., and if the configuration is inappropriate, the job will fail.
Once it has prepared the DAG, it starts processing the data. It divides your job into stages, and stages into tasks. Each task runs on a specific executor and may involve shuffling and partitioning. So in your case, when processing billions of records, your configuration may not be adequate. One more point: when we say Spark loads the data into an RDD/DataFrame, the storage is managed by Spark, and there are options to keep the data in memory only, on disk, or both (ref: storage_spark).
Hive --> HDFS --> Spark --> RDD (storage depends, as evaluation is lazy).

Is Spark partition size equal to the HDFS block size, or does it depend on the number of cores available on all executors?

I am looking into Spark partitioning and I see different answers to this question.
Is the Spark partition size equal to the HDFS block size, or does it depend on the number of cores available on all executors? And does performance improve by repartitioning the data when it is skewed? (I assume the data related to the same join key is shuffled back to a single executor during the join.) Please help me understand this. Thanks!
It really depends on where you are reading your data from. If you are reading from HDFS, then one block will be one partition. But if you are reading a parquet file, then one parquet file is one partition as it is not splittable. So partitions are created from the block count in the case of HDFS, and from the file count in the case of parquet.
Regarding skewed data: the more data one partition has, the longer its task takes to finish. The other tasks finish quickly because they have less data, so the resources are not utilized properly. Therefore, it is better to repartition skewed data so that all executors share the work evenly.
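A pure-Python model of why skew hurts, not Spark code. It compares rows per partition when rows are placed by key against a round-robin placement that ignores the key. The data set and both helper names are hypothetical, and key % num_partitions stands in for a real hash partitioner.

```python
from collections import Counter


def partition_sizes_by_key(keys, num_partitions):
    """Rows per partition when each row is placed by its key (key % num_partitions)."""
    counts = Counter(key % num_partitions for key in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


def partition_sizes_round_robin(keys, num_partitions):
    """Rows per partition when rows are dealt out one by one, ignoring the key."""
    counts = Counter(i % num_partitions for i in range(len(keys)))
    return [counts.get(p, 0) for p in range(num_partitions)]


# Hypothetical data: key 0 is "hot" (900 rows), keys 1..100 have one row each.
keys = [0] * 900 + list(range(1, 101))

print(partition_sizes_by_key(keys, 4))       # [925, 25, 25, 25]
print(partition_sizes_round_robin(keys, 4))  # [250, 250, 250, 250]
```

The stage finishes only when the largest partition is done, so the first layout is bounded by the 925-row task. Note that a join on the key still needs all rows for one key in the same partition, so the even layout applies to work done before such a join.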

How to process data in parallel but write results in a single file in Spark

I have a Spark job that:
- Reads data from HDFS
- Does some intensive transformation without shuffling and aggregation (only map operations)
- Writes results back to HDFS
Let's say I have 10GB of raw data (40 blocks = 40 input partitions), which results in 100MB of processed data. To avoid generating many small files in HDFS, I use coalesce(1) in order to write a single file with the results.
Doing so, I get only 1 task running (because of coalesce(1) and the absence of shuffling), which processes all 10GB in a single thread.
Is there a way to do the actual intensive processing in 40 parallel tasks, reduce the number of partitions right before writing to disk, and avoid a data shuffle?
I have an idea that might work: cache the DataFrame in memory after all processing (do a count to force Spark to cache the data), then apply coalesce(1) and write the DataFrame to disk.
The documentation clearly warns about this behavior and provides the solution:
However, if you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1). To avoid this, you can call repartition. This will add a shuffle step, but means the current upstream partitions will be executed in parallel (per whatever the current partitioning is).
So instead of coalesce(1) you can try repartition(1).
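A minimal sketch of the repartition approach from the quoted documentation. The helper name write_single_file and the output path are hypothetical, df is assumed to be a Spark DataFrame, and Parquet is only an assumed output format since the question does not name one.

```python
def write_single_file(df, path):
    """Run upstream work in parallel, then shuffle into one partition and write.

    `df` is assumed to be a Spark DataFrame and `path` a hypothetical
    output location. repartition(1) adds a shuffle, so the map-only
    transformations before it keep their original partitioning.
    """
    # Parquet is an assumption here; any DataFrameWriter format works the same way.
    df.repartition(1).write.parquet(path)
```

The cost is a shuffle of the processed output (100MB in the question), which is small compared to running the whole 10GB through a single task.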

spark partitionBy out of memory failures

I have a Spark 2.2 job written in pyspark that's trying to read in 300BT of Parquet data in a hive table, run it through a python udf, and then write it out.
The input is partitioned on about five keys and results in about 250k partitions.
I then want to write it out using the same partition scheme using the .partitionBy clause for the dataframe.
When I don't use a partitionBy clause the data writes out and the job does finish eventually. However with the partitionBy clause I continuously see out of memory failures on the spark UI.
Upon further investigation the source parquet data is about 800MB on disk (compressed using snappy), and each node has about 50G of memory available to it.
Examining the spark UI I see that the last step before writing out is doing a sort. I believe this sort is the cause of all my issues.
When reading in a dataframe of partitioned data, is there any way to preserve knowledge of this partitioning so spark doesn't run an unnecessary sort before writing it out?
I'm trying to avoid adding a shuffle step by repartitioning, since that could equally cause further delays.
Ultimately I can rewrite to read one partition at a time, but I think that's not a good solution and that spark should already be able to handle this use case.
I'm running with about 1500 executors across 150 nodes on ec2 r3.8xlarge.
I've tried smaller executor configs and larger ones and always hit the same out of memory issues.

Spark Streaming: avoid small files in HDFS

I have a Spark Streaming application that writes its output to HDFS.
What precautions and strategies can I take to ensure that this process does not generate too many small files and create memory pressure on the HDFS NameNode?
Does Apache Spark provide any pre-built solutions to avoid small files in HDFS?
No. Spark does not provide any such solution.
What you can do:
Increase the batch interval. This does not guarantee anything, but it raises the chance of larger files. The tradeoff is that streaming will have higher latency.
Manually manage it. For example, on each batch you could calculate the size of the RDD and accumulate RDDs until they satisfy your size requirement. Then you union the RDDs and write to disk. This will unpredictably increase latency, but will guarantee efficient space usage.
Another solution is to run a separate Spark application that re-aggregates the small files every hour/day/week, etc.
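The manual approach above (accumulate batches until they reach a target size, then union and write) can be modeled in plain Python. This is only a sketch of the bookkeeping: BatchBuffer is a hypothetical class, each batch is a list of records, and sizes are supplied by the caller. In a real job the batches would be RDDs and the flush step would union them and write one output.

```python
class BatchBuffer:
    """Collect micro-batches until their combined size reaches a threshold."""

    def __init__(self, min_bytes):
        self.min_bytes = min_bytes
        self.batches = []
        self.buffered_bytes = 0

    def add(self, batch, size_bytes):
        """Buffer a batch; return the merged records once the threshold is met, else None."""
        self.batches.append(batch)
        self.buffered_bytes += size_bytes
        if self.buffered_bytes < self.min_bytes:
            return None
        # Threshold reached: merge everything buffered so far and reset.
        merged = [record for b in self.batches for record in b]
        self.batches, self.buffered_bytes = [], 0
        return merged


buffer = BatchBuffer(min_bytes=100)
print(buffer.add(["a", "b"], 40))  # None
print(buffer.add(["c"], 30))       # None
print(buffer.add(["d", "e"], 50))  # ['a', 'b', 'c', 'd', 'e']
```

Records from the first two batches are not written until the third arrives, which is the unpredictable latency mentioned above.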
I know this question is old, but this may be useful for someone in the future.
Another option is to use coalesce with a smaller number of partitions. coalesce merges partitions together and creates larger partitions. This can increase the processing time of the streaming batch because of the reduction in number of partitions during the write, but it will help in reducing the number of files.
This reduces parallelism, so having too few partitions can cause issues for the streaming job. You will have to test different partition counts for coalesce to find which value works best in your case.
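A hedged sketch of the coalesce approach. It assumes each batch is available as a Spark DataFrame and is written as Parquet in append mode; target_file_count and write_coalesced are hypothetical helpers, and the size estimate must come from the caller.

```python
import math


def target_file_count(total_bytes, target_file_bytes):
    """Number of output files needed so each is roughly target_file_bytes."""
    return max(1, math.ceil(total_bytes / target_file_bytes))


def write_coalesced(df, path, num_files):
    """Merge partitions without a shuffle, then write one file per partition.

    `df` is assumed to be a Spark DataFrame; `path` is a hypothetical location.
    """
    df.coalesce(num_files).write.mode("append").parquet(path)


# A 300MB batch with a 128MB target file size needs 3 files.
print(target_file_count(300 * 1024**2, 128 * 1024**2))  # 3
```

Starting from a size-based count like this and then adjusting it under load is one way to run the testing suggested above.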
You can reduce the number of part files.
By default, Spark writes 200 part files after a shuffle, because spark.sql.shuffle.partitions defaults to 200. You can decrease the number of part files.
