How to write a large file to disk on each worker - apache-spark

Newbie to spark. I have a large data file that needs to be writen to disk on each worker before my application runs. RDD.mapPartitions seems like what I should be using, but I'm not sure if each worker contains only a single partition and how to create an RDD guaranteed to have a partition for each worker.

A few things:
RDD.saveAsTextFile will write to disk from each worker, all you need to do is make sure that you have the right number of partitions (you probebly want to set the number of partitions to the number of cores you have available for workers in the cluster) for example:
val files = sc.textFile("file:...")
val prt = files.repartition(5)
prt.saveAsTextFile("file:...")
Also you should note that RDD.mapPartition executes the map operation for all the element in the partition, map is a transformation which means it is expected transform the data and is lazy.

Related

Does Spark distributes dataframe across nodes internally?

I am trying to use Spark for processing csv file on cluster. I want to understand if I need to explicitly read the file on each of the worker nodes to do the processing in parallel or will the driver node read the file and distribute the data across cluster for processing internally? (I am working with Spark 2.3.2 and Python)
I know RDD's can be parallelized using SparkContext.parallelize() but what in case of Spark DataFrames?
if __name__=="__main__":
spark=SparkSession.builder.appName('myApp').getOrCreate()
df=spark.read.csv('dataFile.csv',header=True)
df=df.filter("date>'2010-12-01' AND date<='2010-12-02' AND town=='Madrid'")
So if I am running the above code on cluster, will the entire operation be done by driver node or will it distribute df across cluster and each worker perform processing on its data partition?
To be strict, if you run the above code it will not read or process any data. DataFrames are basically an abstraction implemented on top of RDDs. As with RDDs, you have to distinguish transformations and actions. As your code only consists of one filter(...) transformation, noting will happen in terms of readind or processing of data. Spark will only create the DataFrame which is an execution plan. You have to perform an action like count() or write.csv(...) to actually trigger processing of the CSV file.
If you do so, the data will then be read and processed by 1..n worker nodes. It is never read or processed by the driver node. How many or your worker nodes are actually involved depends -- in your code -- on the number of partitions of your source file. Each partition of the source file can be processed in parallel by one worker node. In your example it is probably a single CSV file, so when you call df.rdd.getNumPartitions() after you read the file, it should return 1. Hence, only one worker node will read the data. The same is true if you check the number of partitions after your filter(...) operation.
Here are two ways of how the processing of your single CSV file can be parallelized:
You can manually repartition your source DataFrame by calling df.repartition(n) with n the number of partitions you want to have. But -- and this is a significant but -- this means that all data is potentially send over the network (aka shuffle)!
You perform aggregations or joins on the DataFrame. These operations have to trigger a shuffle. Spark then uses the number of partitions specified in spark.sql.shuffle.partitions(default: 200) to partition the resulting DataFrame.

How to process data in parallel but write results in a single file in Spark

I have a Spark job that:
Reads data from hdfs
Does some intensive transformation without shuffling and aggregation (only map operations)
Writes results back to hdfs
Let's say I have 10GB of raw data (40 blocks = 40 input partitions), which results in 100MB of processed data. To avoid generating many small files in hdfs I use "coalesce(1)" statement in order to write single file with results.
Doing so I get only 1 task running (because of "coalesce(1)" and absence of shuffling), which processes all 10GB in a single thread.
Is there a way to do actual intensive processing in 40 parallel tasks and reduce number of partitions right before writing to disk and avoid data shuffle?
I have an idea that might work - to cache dataframe in memory after all processing (do a count to force Spark to cache the data) and then put "coalesce(1)" and write dataframe to disk
The documentation clearly warns about this behavior and provides the solution:
However, if you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1). To avoid this, you can call repartition. This will add a shuffle step, but means the current upstream partitions will be executed in parallel (per whatever the current partitioning is).
So instead
coalesce(1)
you can try
repartition(1)

Spark pulling data into RDD or dataframe or dataset

I'm trying to put into simple terms when spark pulls data through the driver, and then when spark doesn't need to pull data through the driver.
I have 3 questions -
Let's day you have a 20 TB flat file file stored in HDFS and from a driver program you pull it into a data frame or an RDD, using one of the respective libraries' out of the box functions (sc.textfile(path) or sc.textfile(path).toDF, etc). Will it cause the driver program to have OOM if the driver is run with only 32 gb memory? Or at least have swaps on the driver Jim? Or will spark and hadoop be smart enough to distribute the data from HDFS into a spark executor to make a dataframe/RDD without going through the driver?
The exact same question as 1 except from an external RDBMS?
The exact same question as 1 except from a specific nodes file system (just Unix file system, a 20 TB file but not HDFS)?
Regarding 1
Spark operates with distributed data structure like RDD and Dataset (and Dataframe before 2.0). Here are the facts that you should know about this data structures to get the answer to your question:
All the transformation operations like (map, filter, etc.) are lazy.
This means that no reading will be performed unless you require a
concrete result of your operations (like reduce, fold or save the
result to some file).
When processing a file on HDFS Spark operates
with file partitions. Partition is a minimal logical batch of data
the can be processed. Normally one partition equals to one HDFS
block and the total number of partitions can never be less then
number of blocks in a file. The common (and default one) HDFS block size is 128Mb
All actual computations (including reading from the HDFS) in RDD and
Dataset are performed inside of executors and never on driver. Driver
creates a DAG and logical plan of execution and assigns tasks to
executors for further processing.
Each executor runs the previously
assigned task against a particular partition of data. So normally if you allocate only one core to your executor it would process no more than 128Mb (default HDFS block size) of data at the same time.
So basically when you invoke sc.textFile no actual reading happens. All mentioned facts explain why OOM doesn't occur while processing even 20 Tb of data.
There are some special cases like i.e. join operations. But even in this case all executors flush their intermediate results to local disk for further processing.
Regarding 2
In case of JDBC you can decide how many partitions will you have for your table. And choose the appropriate partition key in your table that will split the data into partitions properly. It's up to you how many data will be loaded into a memory at the same time.
Regarding 3
The block size of the local file is controlled by the fs.local.block.size property (I guess 32Mb by default). So it is basically the same as 1 (HDFS file) except the fact that you will read all data from one machine and one physical disk drive (which is extremely inefficient in case of 20TB file).

Data distribution in Apache Spark

I'm new to spark and have general question.As far as I know the whole file must be available on all worker nodes to be processed.If so, how do they know which partition should read?Driver controls the partitions but how does driver tell them to read what partition?
Each RDD is divided into multiple partition. To compute each partition, Spark will generate a task and assign it to a worker node. When the driver sends a task to the worker, it also specifies the PartitionID of that task.
The worker then executes the task by chaining the RDD's iterator all the way back to the InputRDD and pass along the PartitionID. The InputRDD determines which part of the input corresponding to the specified partition id and return the data.
rddIter.next -> parentRDDIter.next -> grandParentRDDIter.next -> ... -> InputRDDIter.next
Spark tries to read data into an RDD from the nodes that are close to it. Since Spark usually accesses distributed partitioned data, to optimize transformation operations it creates partitions to hold the data chunks.
https://github.com/jaceklaskowski/mastering-apache-spark-book

how does hashpartitioner work in spark?

Say I have lots data in a couple of s3 files, about 5 GB each, which I read in using sc.textFile
I need to join the data from the two files, therefore, I opt to use the HashPartitioner technique, and I set a partition count of 20. The submitted job to 8 worker nodes fails without any meaningful messages. Now I am thinking maybe I need to pick a proper number of partitions.
Obviously, the idea for spark to partition up all the data based on a chosen key. In order to load them up into 20 partitions, I imagine spark will have to read thru every line of data, compute its hash, and load into the memory of the matching partition, which resides in one of the 8 worker nodes. If there is enough collective memory in the worker nodes, I assume this goes smoothly. At the end of the read, all the data is in the proper partition, in the right node's memory. Am I right so far?
However, if the total memory can not fit all the data, I imagine Spark will work on certain partitions first. And after processing these first partitions, it flushes the original partitions and reads from the source files again, loading remaining data into new partitions. This would mean reading the same file as many time as necessary to process all partitions using available memory. Is this also correct?
Should I should calculate the number of partitions so that at least one full partition would fit into a single node's memory. Are there other guidelines to follow?

Resources