How to process data in parallel but write results in a single file in Spark - apache-spark

I have a Spark job that:
Reads data from HDFS
Does some intensive transformation without shuffling or aggregation (map-only operations)
Writes the results back to HDFS
Let's say I have 10 GB of raw data (40 blocks = 40 input partitions), which results in 100 MB of processed data. To avoid generating many small files in HDFS, I use coalesce(1) in order to write a single file with the results.
Doing so, I get only one task running (because of coalesce(1) and the absence of shuffling), which processes all 10 GB in a single thread.
Is there a way to do the actual intensive processing in 40 parallel tasks and reduce the number of partitions right before writing to disk, without a data shuffle?
I have an idea that might work: cache the DataFrame in memory after all the processing (do a count to force Spark to cache the data), then call coalesce(1) and write the DataFrame to disk.
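A rough sketch of that idea (hypothetical paths; the upper() call just stands in for the heavy map-only work): cache, force materialization with count(), then coalesce(1) for the write.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("coalesce-idea").getOrCreate()

val processed = spark.read.text("hdfs:///data/raw")        // 40 input partitions
  .selectExpr("upper(value) as value")                     // stand-in for the intensive map-only transformation
  .cache()

processed.count()                                          // materializes the cache using 40 parallel tasks

processed.coalesce(1)                                      // single output partition for the write
  .write.mode("overwrite").text("hdfs:///data/processed")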

The documentation clearly warns about this behavior and provides the solution:
However, if you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1). To avoid this, you can call repartition. This will add a shuffle step, but means the current upstream partitions will be executed in parallel (per whatever the current partitioning is).
So instead of
coalesce(1)
you can try
repartition(1)
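A minimal sketch of the difference (hypothetical paths, assuming a SparkSession named spark): repartition(1) adds a shuffle step, so the 40 upstream map tasks still run in parallel and only the final write happens on a single partition.

val result = spark.read.text("hdfs:///data/raw")
  .selectExpr("upper(value) as value")   // heavy map-only transformation

result
  .repartition(1)                        // shuffle step; upstream work stays parallel across 40 tasks
  .write.mode("overwrite").text("hdfs:///data/processed")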

Related

Repartitioning of large dataset in Spark

I have a 20 TB file and I want to repartition it in Spark with each partition = 128 MB.
That works out to n = 20 TB / 128 MB = 156250 partitions.
I believe 156250 is a very big number for
df.repartition(156250)
How should I approach repartitioning this?
Or should I increase the block size from 128 MB to, let's say, 128 GB?
But 128 GB per task will blow up the executor.
Please help me with this.
Divide and conquer it. You don't need to load the whole dataset in one place, because that would cost you a huge amount of resources and also put pressure on the network due to the shuffle exchange.
The block size you are referring to here is an HDFS concept related to storing data by breaking it into chunks (128 MB by default) and replicating them for fault tolerance. If you store your 20 TB file on HDFS, it will automatically be broken into 20 TB / 128 MB = 156250 chunks for storage.
Coming to the Spark DataFrame repartition, firstly it is a transformation rather than an action (more information on the differences between the two: https://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-operations). This means that merely calling this function on the DataFrame does nothing unless the DataFrame is eventually used in some action.
Further, the repartition value lets you define the parallelism level of the operations involving the DataFrame, and it should mostly be thought of in those terms rather than in terms of the amount of data processed per executor. The aim should be to maximize parallelism given the available resources rather than to process a certain amount of data per executor. The only exception to this rule is when the executor either needs to keep all of this data in memory or needs to collect some information from the data that is proportional to its size. And the same applies to any executor task running on 128 GB of data.
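A hedged sizing sketch along these lines (all numbers and paths are assumptions, assuming a SparkSession named spark): derive the partition count from the parallelism you actually have rather than from 20 TB / 128 MB.

// cluster resources you actually have (assumed values)
val numExecutors     = 100
val coresPerExecutor = 5
val targetPartitions = numExecutors * coresPerExecutor * 3   // a few tasks per core

val df = spark.read.parquet("hdfs:///big/dataset")           // hypothetical input path
df.repartition(targetPartitions)
  .write.mode("overwrite").parquet("hdfs:///big/dataset_repartitioned")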

Spark executor out of memory on join

Hi, I am using Spark MLlib and doing an approxSimilarityJoin between a 1M-row dataset and a 1k-row dataset.
When I do it, I broadcast the 1k one.
What I see is that the job stops making progress at the second-to-last task.
All the executors are dead but one, which keeps running for a very long time until it reaches an out-of-memory error.
I checked Ganglia and it shows memory rising until it reaches the limit,
while the disk space keeps going down until it finishes.
The action I call is a write, but it does the same with count.
Now I wonder: is it possible that all the partitions in the cluster converge onto only one node, creating this bottleneck?
Here is my code snippet:
var dfW = cookesWb.withColumn("n", monotonically_increasing_id())
var bunchDf = dfW.filter(col("n").geq(0) && col("n").lt(1000000))

bunchDf.repartition(3000)

model
  .approxSimilarityJoin(bunchDf, broadcast(cookesNextLimited), 80, "EuclideanDistance")
  .withColumn("min_distance",
    min(col("EuclideanDistance")).over(Window.partitionBy(col("datasetA.uid"))))
  .filter(col("EuclideanDistance") === col("min_distance"))
  .select(
    col("datasetA.uid").alias("weboId"),
    col("datasetB.nextploraId").alias("nextId"),
    col("EuclideanDistance"))
  .write.format("parquet").mode("overwrite").save("approxJoin.parquet")
I'll try to answer as best as I can.
In Spark there are operations called shuffle operations, and they do pretty much what you suspected: after some calculations they move data across the cluster so that all the rows that belong together end up on the same node.
If you think about it, there is no other way for those operations to work: the rows that need to be combined have to be brought to a single node in the end, and if the keys are badly skewed that can mean nearly all of the data ends up on one node.
An example, for a join operation:
you have two partitions on two different nodes,
partition 1:
s, 1
partition 2:
s, k
and you want to join by s.
If you don't get both rows onto a single machine, it is impossible to determine that they need to be joined.
It is the same with count, reduce and many other operations.
You can read up on shuffle operations, or ask me if you want more clarification.
A possible solution for you is:
instead of keeping the data only in memory, you can use something like:
dfW.persist(StorageLevel.MEMORY_AND_DISK_SER)
There are other options for persist, but what it basically does is save the partitions not only in memory but on disk as well, in serialized form to save space.
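A small sketch of that suggestion applied to the snippet above (reusing the question's bunchDf, everything else assumed): persist the repartitioned DataFrame to memory and disk in serialized form before the join, and note that repartition returns a new DataFrame rather than modifying bunchDf in place.

import org.apache.spark.storage.StorageLevel

val bunchPersisted = bunchDf
  .repartition(3000)                           // repartition returns a new DataFrame; keep the result
  .persist(StorageLevel.MEMORY_AND_DISK_SER)   // partitions that don't fit in memory spill to local disk

bunchPersisted.count()                         // materializes the persisted partitions

Then pass bunchPersisted to approxSimilarityJoin instead of bunchDf.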

Breaking lineage of an RDD without relying on HDFS

I'm running a Spark application on Amazon spot instances. In the end, I'm exporting my results to Parquet files on S3. The tasks are memory intensive, so I have to run the initial calculations using a large number of partitions (hundreds of thousands). In the end, I would like to coalesce to a few large partitions and save them to big Parquet files. And this is where I get into trouble:
- If I'm using .coalesce(), which is a narrow transformation, the entire lineage that precedes the coalesce will be executed on a small number of partitions, which will cause OOMs.
- If I'm using .repartition(), I rely on HDFS for the shuffle files.
This is a problem when using spot instances, which may be decommissioned, leaving corrupt/missing HDFS blocks.
- checkpointing also relies on HDFS so I can't use that.
- converting to a Dataframe and back didn't actually break the lineage (rdd.toDF.rdd, am I missing something?).
To conclude, I'm looking for a way to coalesce to a smaller amount of partitions only to persist the data on S3 - I would like for the calculation to happen using the original partitions.
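For reference, a sketch of the two options described above (hypothetical DataFrame, partition counts and S3 paths, assuming a SparkSession named spark), showing where each one hurts:

val heavy = spark.read.parquet("s3a://bucket/input")
  .repartition(200000)        // very wide stage for the memory-intensive work
  .selectExpr("*")            // stand-in for the expensive computation

// Option A: coalesce is narrow, so the expensive upstream work itself
// runs on only a few tasks and risks OOM.
heavy.coalesce(16).write.parquet("s3a://bucket/output-a")

// Option B: repartition keeps the upstream stage parallel, but the added
// shuffle relies on the shuffle files surviving until the write finishes.
heavy.repartition(16).write.parquet("s3a://bucket/output-b")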

Spark pulling data into RDD or dataframe or dataset

I'm trying to put into simple terms when Spark pulls data through the driver, and when Spark doesn't need to pull data through the driver.
I have 3 questions -
Let's say you have a 20 TB flat file stored in HDFS and from a driver program you pull it into a DataFrame or an RDD, using one of the respective libraries' out-of-the-box functions (sc.textFile(path) or sc.textFile(path).toDF, etc.). Will it cause the driver program to OOM if the driver runs with only 32 GB of memory? Or at least cause swapping on the driver JVM? Or will Spark and Hadoop be smart enough to distribute the data from HDFS into the Spark executors to build the DataFrame/RDD without going through the driver?
The exact same question as 1 except from an external RDBMS?
The exact same question as 1, except from a specific node's file system (just a Unix file system, a 20 TB file, but not HDFS)?
Regarding 1
Spark operates with distributed data structures like RDD and Dataset (and DataFrame before 2.0). Here are the facts you should know about these data structures to get the answer to your question:
- All the transformation operations (map, filter, etc.) are lazy. This means that no reading is performed unless you require a concrete result of your operations (like reduce, fold, or saving the result to some file).
- When processing a file on HDFS, Spark operates with file partitions. A partition is the minimal logical batch of data that can be processed. Normally one partition equals one HDFS block, and the total number of partitions can never be less than the number of blocks in the file. The common (and default) HDFS block size is 128 MB.
- All actual computations (including reading from HDFS) in RDD and Dataset are performed inside executors, never on the driver. The driver creates a DAG and a logical plan of execution and assigns tasks to executors for further processing.
- Each executor runs its previously assigned tasks against a particular partition of data. So normally, if you allocate only one core to your executor, it will process no more than 128 MB (the default HDFS block size) of data at a time.
So basically, when you invoke sc.textFile, no actual reading happens. All the facts above explain why an OOM doesn't occur on the driver even while processing 20 TB of data.
There are some special cases, such as join operations. But even in this case, all executors flush their intermediate results to local disk for further processing.
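A minimal illustration of this laziness (hypothetical path, assuming an existing SparkContext sc): nothing is read when textFile is called; the file is scanned partition by partition inside the executors only when an action runs, and the driver never holds the 20 TB.

val lines   = sc.textFile("hdfs:///data/huge_20tb_file")   // lazy: only a plan, no reading yet
val lengths = lines.map(_.length)                          // still lazy
val total   = lengths.count()                              // action: triggers distributed reading on executors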
Regarding 2
In the case of JDBC, you can decide how many partitions you will have for your table, and choose an appropriate partition key in your table that will split the data into partitions evenly. It is up to you how much data is loaded into memory at the same time.
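A hedged sketch of such a partitioned JDBC read (connection details, table name and bounds are all assumptions): partitionColumn together with lowerBound/upperBound/numPartitions splits the table scan into parallel queries, so each executor reads only its slice and nothing is funneled through the driver.

val jdbcDf = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/mydb")   // hypothetical connection
  .option("dbtable", "big_table")
  .option("user", "reader")
  .option("password", "secret")
  .option("partitionColumn", "id")                       // numeric column to split on
  .option("lowerBound", "1")
  .option("upperBound", "100000000")
  .option("numPartitions", "200")                        // number of parallel partitions / queries
  .load()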
Regarding 3
The block size of a local file is controlled by the fs.local.block.size property (32 MB by default, I believe). So it is basically the same as case 1 (an HDFS file), except for the fact that you will read all the data from one machine and one physical disk drive, which is extremely inefficient in the case of a 20 TB file.

How does HashPartitioner work in Spark?

Say I have lots of data in a couple of S3 files, about 5 GB each, which I read in using sc.textFile.
I need to join the data from the two files, so I opt to use the HashPartitioner technique, and I set a partition count of 20. The job, submitted to 8 worker nodes, fails without any meaningful messages. Now I am thinking maybe I need to pick a proper number of partitions.
Obviously, the idea is for Spark to partition all the data based on a chosen key. In order to load it into 20 partitions, I imagine Spark will have to read through every line of data, compute its hash, and load it into the memory of the matching partition, which resides on one of the 8 worker nodes. If there is enough collective memory in the worker nodes, I assume this goes smoothly. At the end of the read, all the data is in the proper partition, in the right node's memory. Am I right so far?
However, if the total memory cannot fit all the data, I imagine Spark will work on certain partitions first. And after processing these first partitions, it flushes them and reads from the source files again, loading the remaining data into new partitions. This would mean reading the same file as many times as necessary to process all partitions with the available memory. Is this also correct?
Should I calculate the number of partitions so that at least one full partition fits into a single node's memory? Are there other guidelines to follow?
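For what it's worth, a sketch of the pattern under discussion (hypothetical S3 paths and key parsing, assuming an existing SparkContext sc): partition both pair RDDs with the same HashPartitioner before the join, so rows with equal keys land in the same partition and the join itself needs no extra shuffle.

import org.apache.spark.HashPartitioner

val partitioner = new HashPartitioner(20)

val left = sc.textFile("s3a://bucket/file1")
  .map(line => (line.split(",")(0), line))    // (key, full row); key extraction is an assumption
  .partitionBy(partitioner)

val right = sc.textFile("s3a://bucket/file2")
  .map(line => (line.split(",")(0), line))
  .partitionBy(partitioner)

val joined = left.join(right)                 // co-partitioned, so no additional shuffle here
joined.saveAsTextFile("s3a://bucket/joined")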
