Spark only running one executor for big gz files - apache-spark

I have input source files (compressed .gz) which I need to process using Spark. Each file is about 5 GB compressed, and there are around 11-12 files.
But when I give these as the input source, Spark launches just one executor. I understand this may be due to the non-splittable nature of gzip, but even when I use a high-RAM box, e.g. c3.8xlarge, it still does not use more executors. The executor memory being assigned is 45 GB and the executor cores 31.
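One common workaround, since gzip files are not splittable, is to repartition immediately after the read so that everything downstream runs in parallel. A minimal sketch (the path and partition count are placeholders, not taken from the question):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("gz-repartition"))

// Each non-splittable .gz file becomes a single partition on read,
// so ~11-12 files means ~11-12 tasks at most for the initial stage.
val raw = sc.textFile("s3://bucket/input/*.gz")

// Spread the decompressed data across the cluster before the heavy work.
val spread = raw.repartition(200)

spread.map(_.length).count()   // placeholder downstream work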

Related

Spark - 54 GB CSV file transform to single JSON in 16 GB RAM single machine

I want to take a CSV file and transform it into a single JSON file; I have written and verified the code. I have a 54 GB CSV file that I want to transform and export as a single JSON. I load the data into Spark and build the JSON using the Spark SQL collect_set(struct(...)) built-in functions.
I am running the Spark job from the Eclipse IDE on a single machine only. The machine has 16 GB RAM, an i5 processor, and a 600 GB HDD.
When I try to run the Spark program it throws java.lang.OutOfMemoryError (insufficient heap size). I tried increasing the spark.sql.shuffle.partitions value from 2000 to 20000, but the job still fails after loading, during the transformation, with the same error.
I don't want to split the single CSV into multiple parts; I want to process this single CSV. How can I achieve that? Need help. Thanks.
Spark Configuration:
import org.apache.spark.SparkConf

val conf = new SparkConf().setAppName("App10").setMaster("local[*]")
  // .set("spark.executor.memory", "200g")
  .set("spark.driver.memory", "12g")
  .set("spark.executor.cores", "4")
  .set("spark.driver.cores", "4")
  // .set("spark.testing.memory", "2147480000")
  .set("spark.sql.shuffle.partitions", "20000")
  .set("spark.driver.maxResultSize", "500g")
  .set("spark.memory.offHeap.enabled", "true")
  .set("spark.memory.offHeap.size", "200g")
A few observations from my side:
When you collect the data on the driver at the end, the driver needs enough memory to hold your complete JSON output. 12 GB is not sufficient for that, IMO.
The 200g executor memory setting is commented out, so how much was actually allocated? Executors also need enough memory to process/transform this heavy data. If the driver was allocated 12 GB and you have 16 GB in total, then only 1-2 GB are left for the executor, considering other applications running on the system. It's quite possible to hit OOM. I would recommend finding out whether it is the driver or the executor that is short on memory.
Most importantly, Spark is designed to process data in parallel on multiple machines to get maximum throughput. If you process this on a single machine / single executor / single core, you are not taking advantage of Spark at all.
I'm not sure why you want to produce a single file, but I would suggest revisiting your plan and processing the data in a way that lets Spark use its strengths. Hope this helps.
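As an illustration of that last point, one way to avoid collecting the whole result on the driver is to let Spark write the JSON output itself. A rough sketch, assuming a grouping column and input/output paths that are made up for the example (not taken from the question):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_set, struct}

val spark = SparkSession.builder().appName("CsvToJson").master("local[*]").getOrCreate()

// Hypothetical input path and columns.
val df = spark.read.option("header", "true").csv("/data/input.csv")

// Aggregate with collect_set(struct(...)) as described in the question,
// but write the result with Spark instead of collect()-ing it to the driver.
df.groupBy("key")
  .agg(collect_set(struct("colA", "colB")).as("items"))
  .write.json("/data/output_json")   // produces multiple part files, not one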

Spark executors require driver memory

I have 2 Spark applications. The first reads CSV files and translates them to Parquet (simple read - filter - write). The second reads the Parquet files, computes statistics, and then writes the result to CSV files. I had to allocate more driver memory to make them run; otherwise they crash with an out-of-memory error.
I noticed that when I reduce the executors and cores to 1 and 1, I don't have to give more driver memory. It looks as though managing multiple executors (in my case, 10 executors with 5 cores each) requires driver memory. If I set up 10 executors with 1 core, or 1 executor with 5 cores, it crashes, for example during the Parquet read.
What is the correct explanation?
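For context, a hypothetical reproduction of the setup being described (the actual job code and values are not shown in the question):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("csv-to-parquet")
  .config("spark.executor.instances", "10")
  .config("spark.executor.cores", "5")
  // Note: spark.driver.memory must be set before the driver JVM starts,
  // e.g. via spark-submit --driver-memory; setting it here has no effect
  // in client mode.
  .getOrCreate()

spark.read.option("header", "true").csv("/data/in")   // hypothetical path
  .filter("some_column IS NOT NULL")                   // hypothetical filter
  .write.parquet("/data/out")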

Bad read performance on Spark over HBase Hadoop

When reading 161,000 elements from HBase (462 MB based on the HDFS file size), Spark spends at least 6 seconds reading them.
HBase is configured to use a block cache. During the test (no other process is running at that moment), the block cache has a size of 470.1 MB (752.0 MB free).
All the elements are in the block cache.
The executor runs in a YARN container (yarn mode) with 1408 MB of memory.
Everything runs on a single node (including the master), an Amazon m4.large instance.
There are no other rows in the table, and a range scan is performed.
The RDD is initialized as a scan over the table (the original snippet is not reproduced here; see the sketch below these notes).
Executor logs show it took 8 seconds at debug logging level.
The job is executed via Spark JobServer.
Even a simple count on the RDD (no other operation) takes 5 seconds.
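A typical scan-based HBase read in Spark looks roughly like the following sketch (assuming TableInputFormat; the table name and configuration are placeholders, not the asker's actual code):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

// Placeholder table name; scan range attributes can also be set on hbaseConf.
val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table")

val hbaseRdd = sc.newAPIHadoopRDD(
  hbaseConf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])

// The "simple count" mentioned in the notes above:
hbaseRdd.count()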
I don't know what I can do based on these figures. Where does the executor spend its time? How can I identify the bottleneck?
Thank you very much,
Sébastien.

Counting line number of a huge folder failed in Spark

Update:
The folder size is 2 TB!!! Now my question is how to handle such a large amount of data with Spark.
I have an online storage with a huge folder (at least 200 GB; I do not know the exact size).
I am counting the number of lines, across all files inside the huge folder, that contain a keyword:
spark.sparkContext.textFile("online/path").filter(x => x.contains("keyword")).count
But it always fails. I checked the Spark UI, which shows a total of 1,546,000 tasks, and my program fails after finishing around 110,000 of them.
I tried to check the log file, but the log file itself is huge and my browser gets stuck trying to load it.
I also tried mapPartitions:
spark.sparkContext.textFile("online/path").mapPartitions(p => p.filter(x => x.contains("keyword"))).count()
No luck.
My config:
Driver memory: 16 GB
Executor memory: 16 GB
Executor number: 12
Executor cores: 10
My Spark cluster has 138 cores and 800 GB of memory.
With each task assigned a ~128 MB partition, and 10 cores per executor, I would expect this job to complete on your cluster. It may be that you simply have too many tasks, as each task comes with non-trivial overhead. To test this hypothesis, try reducing the number of partitions with coalesce, e.g.:
spark.sparkContext.textFile("online/path").coalesce(1000).filter(x => x.contains("keyword")).count
"textFile" has second parameter - "minPartitions", maybe, you can try it.
If files size is small, and file count is huge, other read method can be used "wholeTextFiles"
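A sketch of both suggestions (the path and partition count are placeholders):

val sc = spark.sparkContext

// textFile's second parameter is a *minimum* number of partitions:
sc.textFile("online/path", 2000)
  .filter(_.contains("keyword"))
  .count()

// For a huge number of small files, wholeTextFiles returns one
// (path, fileContent) record per file; split the content back into
// lines to reproduce the same count:
sc.wholeTextFiles("online/path")
  .flatMap { case (_, content) => content.split("\n") }
  .filter(_.contains("keyword"))
  .count()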

Spark Executors RAM and file Size

I am reading text files totalling 8.2 GB (all files in one folder) with the wholeTextFiles method.
The job that reads the files got 3 executors, each with 4 cores and 4 GB memory, as shown in the picture.
Although the job page shows 3 executors, only 2 executors are really working on the data (I can tell from the stderr logs, which print the files being read). The 3rd executor shows no trace of processing any files.
There are 2 partitions from the wholeTextFiles API.
The 2 executors had 4 GB each, 8 GB of memory in total, but my files add up to 8.2 GB.
Can anyone explain how the 2 executors with 8 GB of RAM in total can handle 8.2 GB of files?
My job completed successfully.
From the Spark doc of the function wholeTextFiles:
Each file is read as a single record and returned in a key-value pair, where the key is the path of each file, the value is the content of each file.
So an RDD record is an entire file's content, and the number of partitions equals the number of files.
To get more partitions, you can use the function textFile.
Each and every executor has a memory overhead (10% of the allocated memory, with a minimum of 384 MB).
You can see the actual allocated memory in the YARN running applications view.
Also, there is a container memory allocation with min and max limits.
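As a rough illustration of that overhead (the values are examples, not taken from the question): for a 4 GB executor on YARN, the container request is about 4 GB + max(0.10 × 4 GB, 384 MB) ≈ 4.4 GB. The overhead can also be set explicitly:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("overhead-example")
  .config("spark.executor.memory", "4g")
  // Spark 2.3+ key; older versions use spark.yarn.executor.memoryOverhead.
  .config("spark.executor.memoryOverhead", "512m")
  .getOrCreate()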
