Spark Executors RAM and File Size - apache-spark

I am reading text files totaling 8.2 GB (all files in a folder) with the wholeTextFiles method.
The job that reads the files got 3 executors, each with 4 cores and 4 GB memory, as shown in the picture.
Though the job page shows 3 executors, only 2 executors are actually working on the data (I can tell from the stderr logs, which print the files being read). The 3rd executor has no trace of processing any files.
There are 2 partitions from the wholeTextFiles API.
The 2 executors had 4 GB each, 8 GB of memory in total, but my files add up to 8.2 GB.
Can anyone explain how 2 executors with 8 GB of RAM in total can handle 8.2 GB of files?
My job completed successfully.

In the Spark doc for the function wholeTextFiles:
Each file is read as a single record and returned in a key-value pair, where the key is the path of each file, the value is the content of each file.
So an RDD record is an entire file's content, and the number of partitions equals the number of files.
To get multiple partitions you can use the function textFile.
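A minimal PySpark sketch of that difference (the folder path is a hypothetical placeholder):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# wholeTextFiles: each file becomes a single (path, content) record, so the
# partitioning follows the files rather than the HDFS blocks.
whole = sc.wholeTextFiles("hdfs:///data/input_folder")   # hypothetical path
print(whole.getNumPartitions())

# textFile: records are individual lines and the input is split into many
# partitions (roughly one per HDFS block), regardless of the file count.
lines = sc.textFile("hdfs:///data/input_folder")
print(lines.getNumPartitions())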

Each and every executor has a memory overhead [which is 10% of the allocated memory, with a minimum of 384 MB].
You can see the actual allocated memory from YARN Running Jobs.
Also, there is container memory allocation [with min and max limits].
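As a rough back-of-the-envelope check for the 4 GB executors above (a sketch assuming the default 10% overhead factor with the 384 MB floor):

# YARN container sizing for a 4 GB executor, assuming default settings:
# container = spark.executor.memory + overhead,
# where overhead = max(0.10 * executor memory, 384 MB).
executor_memory_mb = 4 * 1024                       # spark.executor.memory = 4g
overhead_mb = max(0.10 * executor_memory_mb, 384)   # ~410 MB here
container_mb = executor_memory_mb + overhead_mb     # ~4.4 GB requested from YARN
print(overhead_mb, container_mb)
# YARN then rounds the request up to a multiple of
# yarn.scheduler.minimum-allocation-mb, within the container min/max limits.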

Related

Can Spark manage partitions larger than the executor size?

Question:
Spark seems to be able to manage partitions that are bigger than the executor size. How does it do that?
What I have tried so far:
I picked up a CSV with: size on disk - 12.3 GB, size in memory deserialized - 3.6 GB, size in memory serialized - 1964.9 MB. I got the in-memory sizes by caching the data both deserialized and serialized; 12.3 GB is the size of the file on disk.
To check whether Spark can handle partitions larger than the executor size, I created a cluster with just one executor and spark.executor.memory equal to 500 MB. I also set the executor cores (spark.executor.cores) to 2 and increased spark.sql.files.maxPartitionBytes to 13 GB, and I switched off dynamic allocation and adaptive query execution for good measure. The entire session configuration is:
from pyspark.sql import SparkSession

spark = SparkSession.builder.\
    config("spark.dynamicAllocation.enabled", False).\
    config("spark.executor.cores", "2").\
    config("spark.executor.instances", "1").\
    config("spark.executor.memory", "500m").\
    config("spark.sql.adaptive.enabled", False).\
    config("spark.sql.files.maxPartitionBytes", "13g").\
    getOrCreate()
I read the CSV and checked the number of partitions it is read in with df.rdd.getNumPartitions(). Output = 2. This is confirmed later on by the number of tasks as well.
Then I run df.persist(StorageLevel.DISK_ONLY); df.count() (with StorageLevel imported from pyspark).
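A minimal sketch of those two steps (the CSV path is a hypothetical placeholder):

from pyspark import StorageLevel

df = spark.read.option("header", True).csv("/data/large_file.csv")   # hypothetical path
print(df.rdd.getNumPartitions())     # 2 with spark.sql.files.maxPartitionBytes = 13g

df.persist(StorageLevel.DISK_ONLY)   # disk-only cache, nothing kept in executor memory
df.count()                           # action that forces the full read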
Following are the observations I made:
No caching happens until the data for one batch of tasks (equal to the number of CPU cores, if you have set 1 CPU core per task) is read in completely. I conclude this because no entry shows up in the Storage tab of the web UI.
Each partition here ends up being around 6 GB on disk, which should, at a minimum, be around 1964.9 MB / 2 (= size in memory serialized / 2) in memory, i.e. around 880 MB. There is no spill. Below is the relevant snapshot of the web UI from when around 11 GB of the data had been read in. You can see that Input is almost 11 GB, and at this point there was nothing in the Storage tab.
Questions:
Since the memory per executor is 300 MB (execution + storage) + 200 MB (user memory), how is Spark able to manage ~880 MB partitions, and two of them in parallel at that (one per core)?
The data read in does not show up in the Storage tab, is not (and cannot be) in the executor memory, and there is no spill either. Where exactly is that read-in data?
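For reference, a rough sketch of the per-executor arithmetic, assuming Spark's default unified memory model (the fixed 300 MB reserved memory and spark.memory.fraction = 0.6):

# Default unified memory model for a 500 MB executor (a sketch, defaults assumed).
executor_memory_mb = 500
reserved_mb = 300                             # fixed reserved memory
usable_mb = executor_memory_mb - reserved_mb  # 200 MB
unified_mb = usable_mb * 0.6                  # ~120 MB shared by execution + storage
user_mb = usable_mb * 0.4                     # ~80 MB for user data structures
print(unified_mb, user_mb)
# Either way, a ~880 MB partition is far larger than any of these regions,
# which is exactly what the questions above are about.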
Attaching a screenshot of the web UI after job completion, in case that might be useful.
Attaching a screenshot of the Executors tab, in case that might be useful.

Number of executors and executor memory to process a 100 GB file in Spark

I have a 100 GB CSV file in HDFS and a cluster of 10 nodes with 15 cores and 64 GB RAM per node. I could not find an article on configuring the number of executors and the executor memory based on file size. Can someone help me find optimal values for these parameters based on the cluster size and the input file size?
There is no direct correlation between input file size and Spark cluster configuration. Generally, a well-distributed configuration (e.g. take the number of cores per executor as 5 and derive the rest of the settings from that) works well for most cases.
On the file side: make sure it's splittable (CSV is splittable in raw form and in only a few other formats). If it's splittable and on HDFS, then the number of partitions follows from the HDFS block size.
E.g. if the block size is 128 MB, the number of possible partitions for 100 GB is about 800 (this is approximate; the actual formula is more involved).
In your case the number of usable cores is 14 * 10 = 140, so only 140 parts of your file will be processed in parallel at any given time.
So the more cores you have, the more parallelism you get.
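A back-of-the-envelope sketch of that arithmetic for this cluster (10 nodes, 15 cores and 64 GB per node), assuming the usual rule of thumb of 5 cores per executor and leaving 1 core and ~1 GB per node for the OS/daemons:

# Cluster from the question: 10 nodes, 15 cores and 64 GB RAM per node.
nodes, cores_per_node, ram_per_node_gb = 10, 15, 64

usable_cores_per_node = cores_per_node - 1            # leave 1 core for OS/daemons
total_usable_cores = usable_cores_per_node * nodes    # 14 * 10 = 140

cores_per_executor = 5                                 # rule of thumb from the answer
executors_per_node = usable_cores_per_node // cores_per_executor    # 2
total_executors = executors_per_node * nodes - 1       # keep 1 slot for the YARN AM -> 19

memory_per_executor_gb = (ram_per_node_gb - 1) / executors_per_node  # ~31 GB per executor
heap_per_executor_gb = memory_per_executor_gb / 1.10                 # ~28 GB after ~10% overhead

# Input side: 100 GB on HDFS with a 128 MB block size -> ~800 partitions,
# of which at most total_usable_cores run in parallel at any one time.
partitions = (100 * 1024) // 128                        # 800
print(total_executors, round(heap_per_executor_gb), partitions)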

Spark executors require driver memory

I have 2 Spark applications. The first reads CSV files and translates them to Parquet (simple read - filter - write). The second reads the Parquet files, computes statistics, then writes the result to CSV files. I had to allocate more driver memory to make them run; otherwise they crash with an out-of-memory error.
I noticed that when I reduce the executors and cores to 1 and 1, I don't have to give more driver memory. It looks as if managing multiple executors (in my case, 10 executors with 5 cores each) requires driver memory. If I set up 10 executors with 1 core, or 1 executor with 5 cores, it will crash, for example during the Parquet read.
What is the correct explanation?

Spark only running one executor for big gz files

I have input source files (compressed .gz) which I need to process using Spark. Each file is 5 GB (compressed gz) and there are around 11-12 files.
But when I give this source as input, Spark launches just one executor. I understand that this may be due to the non-splittable nature of the files, but even when I use a high-RAM box, e.g. c3.8xlarge, it still does not use more executors. The executor memory being assigned is 45 GB and the executor cores 31.
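For illustration, a minimal PySpark sketch (the path is a hypothetical placeholder, and an existing SparkSession named spark is assumed) of why a .gz input ends up with so little parallelism, plus a commonly used follow-up of repartitioning after the read, which is not from the question itself:

sc = spark.sparkContext

# gzip is not splittable, so each .gz file becomes exactly one partition:
# ~11-12 partitions here, no matter how large the executor is.
rdd = sc.textFile("s3://bucket/input/*.gz")   # hypothetical path
print(rdd.getNumPartitions())                 # roughly the number of .gz files

# A common follow-up (a sketch, not the poster's code): repartition after the
# read so that later stages can run with more tasks in parallel.
rdd = rdd.repartition(200)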

Counting line number of a huge folder failed in Spark

Update:
The folder size is 2 TB! Now my question is how to handle such a large amount of data with Spark?
I have online storage with a huge folder (at least 200 GB; I do not know the exact size).
I am counting the lines (that contain a keyword) across all files inside the huge folder.
spark.sparkContext.textFile("online/path").filter(x => x.contains("keyword")).count
But it always fails. I checked the Spark UI, which shows a total task count of 1,546,000, and my program fails after finishing around 110,000 tasks.
I tried to check the log file, but the log file itself is huge and gets stuck loading in my browser.
I also tried mapPartitions:
spark.sparkContext.textFile("online/path").mapPartitions(p => p.filter(x => x.contains("keyword"))).count()
No luck.
My config:
Driver memory: 16 GB
Executor memory: 16 GB
Executor number: 12
Executor core number: 10
My Spark cluster has 138 cores and 800 GB of memory.
With each task assigned to a ~128 MB partition and 10 cores per executor, I would expect this job to complete on your cluster. It may be that you simply have too many tasks, as each task comes with non-trivial overhead. To test this hypothesis, try reducing the number of partitions with coalesce, e.g.:
spark.sparkContext.textFile("online/path").coalesce(1000).filter(x => x.contains("keyword")).count
"textFile" has second parameter - "minPartitions", maybe, you can try it.
If files size is small, and file count is huge, other read method can be used "wholeTextFiles"
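For instance, in PySpark (reusing the same placeholder path as above, with an existing SparkSession named spark assumed):

sc = spark.sparkContext

# textFile takes a second argument, minPartitions, a hint for the minimum
# number of input splits to create.
lines = sc.textFile("online/path", 1000)

# If the folder instead holds a huge number of small files, wholeTextFiles
# reads each file as a single (path, content) record.
files = sc.wholeTextFiles("online/path")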
