Counting lines in a huge folder fails in Spark

The folder size is 2T!!! Now my question is how to handle such a large file with Spark?
I have an online storage location with a huge folder (at least 200 GB; I do not know the exact size).
I am counting the lines that contain a keyword across all files inside that folder.
spark.sparkContext.textFile("online/path").filter(x => x.contains("keyword")).count
But it always fails. The Spark UI shows that the total number of tasks is 1,546,000, and my program fails after finishing around 110,000 tasks.
I tried to check the log file, but the log file itself is huge and my browser got stuck loading it.
I also tried mapPartitions:
spark.sparkContext.textFile("online/path").mapPartitions(p => p.filter(x => x.contains("keyword"))).count()
No luck.
My config:
Driver Memory: 16G
Executor memory: 16G
Executor Number: 12
Executor Core number: 10
My spark cluster has 138 cores and 800G memory.

With each task assigned to a ~128MB partition and 10 cores per executor, I would expect this job to complete on your cluster. It may be that you simply have too many tasks, as each task comes with non-trivial overhead. To test this hypothesis, try reducing the number of partitions with coalesce, e.g.:
spark.sparkContext.textFile("online/path").coalesce(1000).filter(x => x.contains("keyword")).count
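As a rough sanity check on the "too many tasks" hypothesis, here is a back-of-the-envelope sketch in plain Python (no Spark needed). It assumes the "2T" in the question means 2 TiB and that ~128 MB partitions apply; the function name is made up for illustration.

```python
import math

def expected_partitions(total_bytes, partition_bytes=128 * 1024**2):
    """Partition count if the input were a few large, splittable files."""
    return math.ceil(total_bytes / partition_bytes)

folder_bytes = 2 * 1024**4     # assumption: "2T" means 2 TiB
observed_tasks = 1_546_000     # task count reported by the Spark UI

ideal = expected_partitions(folder_bytes)
avg_mb_per_task = folder_bytes / observed_tasks / 1024**2

print("partitions at 128 MB each:", ideal)
print("average MB per observed task:", round(avg_mb_per_task, 2))
```

If those assumptions hold, the observed task count is far above what 128 MB partitions would give, which would point to a very large number of small files. That is the situation where coalesce is most likely to help.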

"textFile" has second parameter - "minPartitions", maybe, you can try it.
If the individual files are small and the number of files is huge, another read method, "wholeTextFiles", can be used instead.


PySpark OOM for multiple data files

I want to process several independent CSV files of similar size (about 100 MB each) in parallel with PySpark.
I'm running PySpark on a single machine:
spark.driver.memory 20g
spark.executor.memory 2g
File content:
type (has the same value within each csv), timestamp, price
First I tested it on one csv (note I used 35 different window functions):
from pyspark.sql import Window
import pyspark.sql.functions as f
logData = spark.read.csv("TypeA.csv", header=False, schema=schema)
# Compute moving averages. I used 35 different moving averages (i is the index of the window).
w = (Window.partitionBy("type").orderBy(f.col("timestamp").cast("long")).rangeBetween(-24*7*3600 * i, 0))
logData = logData.withColumn("moving_avg", f.avg("price").over(w))
# Some other simple operations... no aggregation, no sort
This works great. However, I had two issues with scaling this job:
When I tried to increase the number of window functions to 50, the job OOMs. I am not sure why PySpark doesn't spill to disk in this case, since the window functions are independent of each other.
When I tried to run the job for 2 CSV files, it also OOMs. It is also not clear why it does not spill to disk, since the window functions are basically partitioned by CSV file, so they are independent.
The question is why PySpark doesn't spill to disk in these two cases to prevent the OOM, and how I can hint Spark to do so.
If your machine cannot run all of these at once, you can process them in sequence and write out the results for each batch of files before loading the next batch.
I'm not sure if this is what you mean, but you can try hinting Spark to write some of the data to disk instead of keeping it in RAM with:
Post an update if it helps.
In theory, you could process all these 600 files on a single machine. Spark should spill to disk when memory is not enough. But there are some points to consider:
The logic involves window aggregation, which results in a heavy shuffle operation, so you need to check whether the OOM happened in the map phase or the reduce phase. The map phase processes each partition of a file and writes its shuffle output to files. The reduce phase then needs to fetch that shuffle output from all map tasks. Obviously, in your case you can't keep all map tasks running at once.
So it's highly likely that the OOM happened in the map phase. If that is the case, it means the memory available per core is not enough to process a single partition of a file. Be aware that Spark only makes a rough estimate of memory usage and spills when it thinks it should. Because the estimate is not accurate, an OOM is still possible. You can tune the partition size with the config below:
spark.sql.files.maxPartitionBytes (default 128MB)
Usually, a 128 MB input needs about 2 GB of heap, which corresponds to about 4 GB of total executor memory, because
executor JVM heap execution memory (about 0.5 of total executor memory) =
(total executor memory - executor memoryOverhead (default 0.1 of executor memory)) * spark.memory.fraction (default 0.6)
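To make that arithmetic concrete, here is a small plain-Python sketch of the rule of thumb. It follows the simplified model in this answer (overhead as a fraction of total memory, then a 0.6 share of what remains). Spark's real memory accounting has more terms, and the function name is made up for illustration.

```python
def execution_heap_gb(total_executor_gb, overhead_fraction=0.1, memory_fraction=0.6):
    """Simplified estimate from the answer: (total - overhead) * fraction."""
    heap_gb = total_executor_gb * (1 - overhead_fraction)
    return heap_gb * memory_fraction

# 4 GB of total executor memory: roughly half is left for execution.
print(round(execution_heap_gb(4), 2))
# The 2g executor setting from the question, under the same model.
print(round(execution_heap_gb(2), 2))
```

Under this model, a 2 GB executor ends up with roughly half of the ~2 GB of heap that the rule of thumb asks for per 128 MB partition, which would be consistent with the OOM described in the question.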
You can post all your configs from the Spark UI for further investigation.

Spark Job Internals

I tried looking through various posts but did not find an answer. Let's say my Spark job has 1000 input partitions but I only have 8 executor cores. The job has 2 stages. Can someone help me understand exactly how Spark processes this? If you can help answer the questions below, I'd really appreciate it.
As there are only 8 executor cores, will Spark process Stage 1 of my job 8 partitions at a time?
If the above is true, after the first set of 8 partitions is processed, where is this data stored while Spark is running the second set of 8 partitions?
If I don't have any wide transformations, will this cause a spill to disk?
For a Spark job, what is the optimal file size? I mean, is Spark better at processing 1 MB files with 1000 Spark partitions, or, say, 10 MB files with 100 Spark partitions?
Sorry if these questions are vague. This is not a real use case, but as I am learning about Spark I am trying to understand the internal details of how the different partitions get processed.
Thank You!
Spark will run all tasks of the first stage before starting the second. This does not mean that it will start 8 partitions, wait for them all to complete, and then start another 8 partitions. Instead, each time an executor core finishes a partition, it starts another partition from the first stage until all partitions from the first stage have been started. Spark then waits until all tasks in the first stage are complete before starting the second stage.
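The "no fixed waves" behaviour can be illustrated with a toy simulation in plain Python. This is only a sketch of greedy slot filling, not Spark's actual scheduler (which also considers data locality, retries and so on); the function name and durations are made up for illustration.

```python
import heapq

def simulate_stage(task_durations, num_cores):
    """Each core takes the next pending task as soon as it becomes free."""
    free_at = [0.0] * num_cores   # time at which each core is next idle
    heapq.heapify(free_at)
    start_times = []
    for duration in task_durations:
        start = heapq.heappop(free_at)        # earliest idle core
        start_times.append(start)
        heapq.heappush(free_at, start + duration)
    stage_end = max(free_at)      # the next stage can only begin after this
    return start_times, stage_end

# One slow task (3s) and three fast ones (1s each) on 2 cores:
# the fast tasks run back to back on one core while the other is busy.
starts, end = simulate_stage([3.0, 1.0, 1.0, 1.0], num_cores=2)
print(starts, end)

# 1000 equal tasks on 8 cores.
print(simulate_stage([1.0] * 1000, num_cores=8)[1])
```

In the first example the third task starts as soon as the second finishes, without waiting for the slow first task. That is the difference from processing strictly 8 partitions at a time.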
The data is stored in executor memory or, if not enough memory is available, spilled to disk on the executor. Whether a spill happens depends on exactly how much memory is available and how much intermediate data is produced.
The optimal file size varies and is best measured, but here are some key factors to consider:
The total number of files limits total parallelism, so it should be greater than the number of cores.
The amount of memory used to process a partition should be less than the amount available to the executor (~4GB for AWS Glue).
There is overhead for each file read, so you don't want too many small files.
I would be inclined towards 10MB files or larger if you only have 8 cores.
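A toy cost model shows why per-file overhead pushes towards fewer, larger files. The overhead and throughput values below are purely hypothetical placeholders, not measurements, and the function name is made up for illustration; substitute numbers measured on your own cluster.

```python
import math

def estimated_stage_seconds(num_files, file_mb, cores,
                            per_task_overhead_s, mb_per_second):
    """Rough estimate: number of rounds of tasks times the cost of one task."""
    rounds = math.ceil(num_files / cores)
    per_task = per_task_overhead_s + file_mb / mb_per_second
    return rounds * per_task

# Hypothetical values: 0.5 s fixed overhead per task, 50 MB/s per core.
many_small = estimated_stage_seconds(1000, 1, 8, 0.5, 50)
fewer_large = estimated_stage_seconds(100, 10, 8, 0.5, 50)
print(many_small, fewer_large)
```

With these made-up numbers the same 1 GB of data is processed much faster as 100 files of 10 MB, because the fixed per-task cost is paid 100 times instead of 1000 times.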

How Apache Spark partitions data of a big file [duplicate]

Let's say I have a cluster of 4 nodes, each having 1 core. I have a big file of 600 petabytes which I want to process with Spark. The file could be stored in HDFS.
I think the way to determine the number of partitions is file size / total number of cores in the cluster. If that is indeed the case, I will have 4 partitions, so each partition will be 150 PB (600/4).
But I think 150 PB is too big for a partition, so is my reasoning about the number of partitions correct?
PS: I have just started with Apache Spark. So, apologies if this is a naive question.
As you are storing your data on HDFS, it will already be partitioned into 64 MB or 128 MB blocks, depending on your HDFS configuration. (Let's assume 128 MB blocks.)
So 600 petabytes will result in 4687500000 blocks of 128 MB each. (600 petabytes/128 MB)
Now when you run your Spark job, each executor will read a few blocks of data (the number of blocks will be equal to the number of cores in the executor) and process them in parallel.
Basically, each core will process 1 partition. So the more cores you give to an executor the more data it can process, but at the same time you will need to allocate more memory to executor to handle the size of data loaded in memory.
It is advised to have moderate size executors. Having too many small executors will cause a lot of data shuffle.
Now coming to your scenario: if you have a 4-node cluster with 1 core each, you will have at most 3 executors running on it, as 1 core will be taken by the Spark driver.
So you will be able to process 3 partitions in parallel.
So it will take your job 4687500000/3 = 1562500000 iterations to process the whole data set.
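The numbers above can be reproduced with a few lines of plain Python. This sketch assumes decimal units (1 PB = 10^15 bytes, 1 MB = 10^6 bytes), which is what yields the figures in this answer; the helper name is made up for illustration.

```python
def ceil_div(a, b):
    """Integer ceiling division, exact for very large numbers."""
    return -(-a // b)

PB = 10**15   # assumption: decimal units
MB = 10**6

file_bytes = 600 * PB
block_bytes = 128 * MB
parallel_tasks = 3    # 3 single-core executors, as assumed in the answer

num_blocks = ceil_div(file_bytes, block_bytes)
iterations = ceil_div(num_blocks, parallel_tasks)

print(num_blocks)
print(iterations)
```

The partition count comes from the block size, not from the number of cores. The cores only determine how many of those partitions are processed at the same time.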
Hope that helps!
To answer your question: if you have stored the file in HDFS, it is already partitioned based on your HDFS configuration, i.e. if the block size is 64 MB, your file will be divided into blocks of that size and spread across the Hadoop cluster. Spark will generate one task per block, and your num.executors configuration decides how many tasks can be executed in parallel. Expect no_of_hdfs_blocks=no_of_total_tasks.
What matters next is how your processing logic handles this data: whether you do any shuffling, such as a repartition(*), which moves the data around the cluster and changes the number of partitions processed by your Spark job.

Spark only running one executor for big gz files

I have input source files (compressed .gz) which I need to process using Spark. Each file is 5 GB (compressed) and there are around 11-12 files.
But when I give the source as input, Spark launches just one executor. I understand that this may be due to the non-splittable nature of gzip files, but even when I use a high-RAM box, e.g. c3.8xlarge, it still does not use more executors. The executor memory assigned is 45 GB and the number of executor cores is 31.

Spark Executors RAM and file Size

I am reading text files with a total size of 8.2 GB (all files in a folder) with the wholeTextFiles method.
The job that reads the files got 3 executors, each with 4 cores and 4 GB of memory, as shown in the picture.
Though the job page shows 3 executors, only 2 executors are really working on the data (I can tell from the stderr logs, which print the files being read). The 3rd executor has no trace of processing any files.
There are 2 partitions from the wholeTextFiles API.
The 2 executors had 4 GB each, 8 GB of memory in total. But my files total 8.2 GB.
Can anyone explain how 2 executors with 8 GB of RAM in total can hold 8.2 GB of files?
My job completed successfully.
From the Spark docs for the wholeTextFiles function:
Each file is read as a single record and returned in a key-value pair, where the key is the path of each file, the value is the content of each file.
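So each RDD record is the entire content of one file. Note that the number of partitions is not necessarily equal to the number of files: wholeTextFiles can pack several files into one partition, and the count is controlled by its minPartitions argument.
To get more partitions, with one record per line instead of per file, you can use the textFile function.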
Each executor has a memory overhead (10% of the allocated memory, with a minimum of 384 MB).
You can see the actually allocated memory in YARN's list of running jobs.
Also, there is a container memory allocation with minimum and maximum limits.
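As a small illustration of the overhead rule mentioned above (10% of the executor memory, but at least 384 MB), here is a plain-Python sketch. The function name is made up for illustration, and the actual container size YARN grants may additionally be rounded to its own allocation limits.

```python
def container_memory_mb(executor_memory_mb, overhead_fraction=0.10,
                        min_overhead_mb=384):
    """Executor memory plus overhead, where overhead = max(384 MB, 10%)."""
    overhead = max(min_overhead_mb, executor_memory_mb * overhead_fraction)
    return executor_memory_mb + overhead

# The 4 GB executors from the question: 10% is above the 384 MB floor.
print(container_memory_mb(4096))
# A 2 GB executor: 10% would be about 205 MB, so the 384 MB floor applies.
print(container_memory_mb(2048))
```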
