Spark behavior on native file system - apache-spark

We are experimenting to run Spark in our project without Hadoop and no distributed storage like HDFS. Spark is installed on a single node with 10 Cores and 16GB RAM and this node is not part of any cluster. Assuming Spark driver takes 2 cores and the rest of them are consumed by executors(2 each) at the time of execution.
If we process a big CSV file (of size 1 GB) stored in local disk in Spark as RDD and repartition it to 4 different partitions, will executors process each partition in parallel?
What would executors do if we don't repartition the RDD to 4 diff partitions?
Do we loose the power of distributed computing and parallelism if dont use HDFS?

Spark caps the maximum size of a partition at 2G, so you should be able to process the entire data with minimal partitioning and quicker processing time. You can set spark.executor.cores to 8 so as to utilize all you resources.
Ideally, you should set the number of partitions depending on the size of your data, and you are better off setting the number of partitions as a multiple of cores/executors.
To answer your question, setting number of partitions to 4 in your case will probably result in each partition being sent to an executor. So yes, each partition will be processed in parallel.
If you don't repartition, then Spark will do it for you depending on the data and split the load between the executors.
Spark works perfectly fine without Hadoop. You might see a negligible performance drop since your files are on the local filesystem and not on HDFS, but for a file of size 1GB it really doesn't matter.

Related

Spark configuration based on my data size

I know there's a way to configure a Spark Application based in your cluster resources ("Executor memory" and "number of Executor" and "executor cores") I'm wondering if exist a way to do it considering the data input size?
What would happen if data input size does not fit into all partitions?
Example:
Data input size = 200GB
Number of partitions in cluster = 100
Size of partitions = 128MB
Total size that partitions could handle = 100 * 128MB = 128GB
What about the rest of the data (72GB)?
I guess Spark will wait to have free the resources free due to is designed to process batches of data Is this a correct assumption?
Thank in advance
I recommend for best performance, don't set spark.executor.cores. You want one executor per worker. Also, use ~70% of the executor memory in spark.executor.memory. Finally- if you want real-time application statistics to influence the number of partitions, use Spark 3, since it will come with Adaptive Query Execution (AQE). With AQE, Spark will dynamically coalesce shuffle partitions. SO you set it to an arbitrarily-large number of partitions, such as:
spark.sql.shuffle.partitions=<number of cores * 50>
Then just let AQE do its thing. You can read more about it here:
https://www.databricks.com/blog/2020/05/29/adaptive-query-execution-speeding-up-spark-sql-at-runtime.html
There are 2 aspects to your question. The first is regarding storage of this data, & the second is regarding data execution.
With regards to storage, when you say Size of partitions = 128MB, I assume you use HDFS to store this data & 128M is your default block size. HDFS itself internally decides how to split this 200GB file & store in chunks not exceeding 128M. And your HDFS cluster should have more than 200GB * replication factor of combined storage to persist this data.
Coming to the Spark execution part of the question, once you define spark.default.parallelism=100, it means that Spark will use this value as the default level of parallelism while performing certain operations (like join etc). Please note that the amount of data being processed by each executor is not affected by the block size (128M) in any way. Which means each executor task will work on 200G/100 = 2G of data (provided the executor memory is sufficient for the required operation being performed). In case there isn't enough capacity in the spark cluster to run 100 executors in parallel, then it will launch as many executors it can in batches as and when resources are available.

Why is spark dataframe repartition faster than coalesce when reducing number of partitions?

I have a df with 100 partitions, and before saving to HDFS as .parquet I want to reduce the number of partitions because the parquet files would be too small (<1MB).
I've added coalesce before writing:
df.coalesce(3).write.mode("append").parquet(OUTPUT_LOC)
It works but slows down the process from 2-3s per file to 10-20s per file.
When I try repartition:
df.repartition(3).write.mode("append").parquet(OUTPUT_LOC)
The process does not slow down at all, 2-3s per file.
Why? Shouldn't coalesce always be faster when reducing the number of partitions because it avoids a full shuffle?
Background:
I'm importing files from local storage to spark cluster and saving the resulting dataframes as a parquet file. Each file is approx 100-200MB.
Files are located on the "spark-driver" machine, I'm running spark-submit in client deploy mode.
I'm reading files one by one in driver:
data = read_lines(file_name)
rdd = sc.parallelize(data,100)
rdd2 = rdd.flatMap(lambda j: myfunc(j))
df = rdd2.toDF(mySchema)
df.repartition(3).write.mode("append").parquet(OUTPUT_LOC)
Spark version is 3.1.1
Spark/HDFS cluster has 5 workers with 8CPU,32GB RAM
Each executor has 4cores and 15GB RAM, that makes 10 executors total.
EDIT:
When I use coalesce(1) I get spark.rpc.message.maxSize limit breached error, but not when I use repartition(1). Could that be a clue?
Attaching DAG visualizations .. Looks like WholeStageCodegen part is taking too long on coalesce DAGs?
This can happen sometimes if your data is not evenly distributed and when you do coalesce it tries to reduce the partitions by combining the small partitions in order to reduce full shuffle but there could still be some data skew in one of the partition and that single partition would be taking the most of the time.
While you do repartition the data gets distributed almost evenly on all the partitions as it does full shuffle and all the tasks would almost get completed in the same time.
You could use the spark UI to see why when you are doing coalesce what is happening in terms of tasks and do you see any single task running long.

Apache Spark: How many partitions can a executor hold in spark.? How are the partitions distributed (mechanism) among the executors?

I am interested to know the following nitty gritties of spark parallelism and Partitioning
How many partitions can a executor hold in spark?
How are the partitions distributed (mechanism) among the executors?
How to set the size of the partition. Would like to know the relevant the config parameter.
Does executor store all the partitions in memory? If not when spilled to disk does it spill entire partition to disk or a part of partition to disk?
5 When there are 2 cores per executor but there are 5 partition in that executor then
Not quite the right way to look at it. An Executor holds nothing, it just does work.
A Partition is processed by a Core that has been assigned to an Executor. An Executor typically has 1 core but can have more than 1 such Core.
An App has Actions that translate to 1 or more Jobs.
A Job has Stages (based on Shuffle Boundaries).
Stages have Tasks, the number of these depends on number of Partitions.
Parallel processing of the Partitions depends on number of Cores allocated to Executors.
Spark is scalable in terms of Cores, Memory and Disk. The latter two in relation to your questions means that if the Partitions cannot all fit into Memory on the Worker for your Job, then that Partition or more will spill in its entirety to Disk.

How Apache Spark partitions data of a big file [duplicate]

This question already has answers here:
How does Spark partition(ing) work on files in HDFS?
(4 answers)
Closed 4 years ago.
Let's say I have a cluster of 4 nodes each having 1 core. I have a 600 Petabytes size big file which I want to process through Spark. File could be stored in HDFS.
I think that way to determine no. of partitions is file size / total no. of cores in the cluster. If that is the case indeed, I will have 4 partitions(600/4) so each partition will be of 125 PB size.
But I think 125 PB is too big a size for partition so is my thinking correct related to deducing no. of partitions.
PS: I have just started with Apache Spark. So, apologies if this is a naive question.
As you are storing your data on HDFS, it will be partitioned already in 64 MB or 128 MB blocks as per your HDFS configuration. (Lets assume 128 MB Blocks.)
So 600 petabytes will result in 4687500000 blocks of 128 MB each. (600 petabytes/128 MB)
Now when you run your Spark job, each executor will read few blocks of data (number of blocks will be equal to the number of cores in executor) and process them in parallel.
Basically, each core will process 1 partition. So the more cores you give to an executor the more data it can process, but at the same time you will need to allocate more memory to executor to handle the size of data loaded in memory.
It is advised to have moderate size executors. Having too many small executors will cause a lot of data shuffle.
Now coming to your scenario, if you have a 4 node cluster with 1 core each. You will have 3 executors running on them at max as 1 core will be taken for spark driver.
So to process the data, you will be able to process 3 partitions in parallel.
so it will take your job 4687500000/3 = 1562500000 iteration to process the whole data.
Hope that helps!
Cheers!
To answer your question, if you have stored file in HDFS it is already partitioned based on your HDFS configuration i.e. if block size is 64MB, your total file will be divided in such blocks and spread across Hadoop cluster. Spark will generate tasks according to your num.executors configuration to decide how many parallel tasks can be executed. Expect no_of_hdfs_blocks=no_of_total_tasks.
Next what matters is how you are processing logic on this data, are you doing any shuffling of data, something similar to repartition(*) which will move the data around the cluster and change partition number to be processed by your spark job.
HTH!

What performance parameters to set for spark scala code to run on yarn using spark-submit?

My use case is to merge two tables where one table contains 30 million records with 200 cols and another table contains 1 million records with 200 cols.I am using broadcast join for small table.I am loading both the tables as data-frames from hive managed tables on HDFS.
I need the values to set for driver memory and executor memory and other parameters along with it for this use case.
I have this hardware configurations for my yarn cluster :
Spark Version 2.0.0
Hdp version 2.5.3.0-37
1) yarn clients 20
2) Max. virtual cores allocated for a container (yarn.scheduler.maximum.allocation-vcores) 19
3) Max. Memory allocated for a yarn container 216gb
4) Cluster Memory Available 3.1 TB available
Any other info you need I can provide for this cluster.
I have to decrease the time to complete this process.
I have been using some configurations but I think its wrong, it took me 4.5 mins to complete it but I think spark has capability to decrease this time.
There are mainly two things to look at when you want to speed up your spark application.
Caching/persistance:
This is not a direct way to speed up the processing. This will be useful when you have multiple actions(reduce, join etc) and you want to avoid the re-computation of the RDDs in the case of failures and hence decrease the application run duration.
Increasing the parallelism:
This is the actual solution to speed up your Spark application. This can be achieved by increasing the number of partitions. Depending on the use case, you might have to increase the partitions
Whenever you create your dataframes/rdds: This is the better way to increase the partitions as you don't have to trigger a costly shuffle operation to increase the partitions.
By calling repartition: This will trigger a shuffle operation.
Note: Once you increase the number of partitions, then increase the executors(may be very large number of small containers with few vcores and few GBs of memory
Increasing the parallelism inside each executor
By adding more cores to each executor, you can increase the parallelism at the partition level. This will also speed up the processing.
To have a better understanding of configurations please refer this post

Resources