Spark 2.0.0: read many .gz files - apache-spark

I have more than 150,000 .csv.gz files, organised in several folders (on s3) that have the same prefix. The size of each file is approximately 550KB. My goal is to read all these files into one DataFrame, the total size is about 80GB.
I am using EMR 5.0.0 with a decent cluster: 3 instances of c4.8xlarge
(36 vCPU, 60 GiB memory, EBS Storage:100 GiB).
I am reading the files using a wildcard character in the path:
sc.textFile("s3://bucket/directory/prefix*/*.csv.gz")
Then I do some map operations and transform the RDD into a DataFrame by calling toDF("col1_name", "col2_name", "col3_name"). I then make a few calls to UDFs to create new columns.
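For reference, a rough PySpark sketch of a pipeline like the one described; the parsing logic, the column names, and the UDF are placeholders, not the asker's actual code:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Read every gzipped CSV matched by the wildcard (one partition per .gz file)
rdd = sc.textFile("s3://bucket/directory/prefix*/*.csv.gz")

# Placeholder parsing: split each CSV line into three fields
df = rdd.map(lambda line: line.split(",")) \
        .map(lambda f: (f[0], f[1], f[2])) \
        .toDF(["col1_name", "col2_name", "col3_name"])

# Placeholder UDF that derives a new column
upper_udf = udf(lambda s: s.upper() if s is not None else None, StringType())
df = df.withColumn("col1_upper", upper_udf(df["col1_name"]))

df.show()   # the action that actually triggers the whole job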
When I call df.show() the operation takes a very long time and never finishes.
Why is the process taking so long?
Is reading that large a number of .csv.gz files the problem?

.gz files are not splittable, so reading them will result in 150K partitions. Spark will not like that: it struggles with even several tens of thousands of partitions.
You might want to look into aws distcp or S3DistCp to copy the files to HDFS first, and then bundle them using an appropriate Hadoop InputFormat, such as CombineFileInputFormat, that gloms many files into one. Here is an older blog post with more ideas: http://inquidia.com/news-and-info/working-small-files-hadoop-part-3
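To see the scale of the problem (and one partial mitigation if the files must be read directly), a PySpark sketch; the path and the target partition count are placeholder values:

rdd = sc.textFile("s3://bucket/directory/prefix*/*.csv.gz")

# Each non-splittable .gz file becomes its own partition, so this prints
# roughly the number of input files (~150,000 in the question above)
print(rdd.getNumPartitions())

# coalesce shrinks the partition count without a full shuffle, which at
# least keeps the task count of downstream stages manageable
rdd = rdd.coalesce(1000)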

Related

Improving performance for Spark with a large number of small files?

I have millions of gzipped files to process and convert to Parquet. I'm running a simple Spark batch job on EMR to do the conversion, giving it a couple million files at a time to convert.
However, I've noticed that there is a big delay between when the job starts and when the files are listed and split up into a batch for the executors to convert. From what I have read and understood, the scheduler has to get the metadata for those files and schedule the tasks. However, this step takes 15-20 minutes to split a million files into tasks for a batch. Even though the actual work of listing the files and doing the conversion only takes 15 minutes on my cluster of instances, the overall job takes over 30 minutes. It appears that it takes a lot of time for the driver to index all the files and split them up into tasks. Is there any way to increase parallelism for this initial stage of indexing files and splitting up tasks for a batch?
I've tried tinkering with and increasing spark.driver.cores thinking that it would increase parallelism, but it doesn't seem to have an effect.
You can try setting the config below:
spark.conf.set("spark.default.parallelism", x)
where x = total_nodes_in_cluster * (total_cores_per_node - 1) * 5
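For example, a sketch with placeholder cluster numbers (the 3-node / 36-core shape is just an assumption for illustration, and the value is passed as a string):

# Placeholder cluster shape: 3 worker nodes with 36 cores each
total_nodes_in_cluster = 3
total_cores_per_node = 36

x = total_nodes_in_cluster * (total_cores_per_node - 1) * 5   # 525
spark.conf.set("spark.default.parallelism", str(x))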
This is a common problem with Spark (and other big data tools), as it uses only the driver to list all the files from the source (S3) and their paths.
Some more info here
I found this article really helpful for solving the issue.
Instead of using Spark to list the files and get their metadata, we can use PureTools to create a parallelized RDD of the files and pass that to Spark for processing.
S3 Specific Solution
If you do not want to install and set up the tools as in the guide above, you can also use an S3 manifest file to list all the files present in a bucket and iterate over them in parallel using RDDs.
Steps for S3 Manifest Solution
# Create an RDD from the list of file paths (e.g. taken from the S3 manifest)
pathRdd = sc.parallelize([file1, file2, file3, ......., file100])

# Create a function which reads the data of a file
def s3_path_to_data(path):
    # Get the data from S3
    # Return the data in whichever format you like, i.e. a string, a list of strings, etc.
    ...

# Call flatMap on the pathRdd
dataRdd = pathRdd.flatMap(s3_path_to_data)
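For illustration, a minimal sketch of what s3_path_to_data could look like using boto3; the bucket name, the assumption that each key points to a gzipped CSV, and the one-record-per-line return format are all placeholders, not part of the original answer:

import gzip
import io
import boto3

def s3_path_to_data(path):
    # path is assumed to be an object key like "directory/prefix1/part-0001.csv.gz";
    # "my-bucket" is a placeholder bucket name, and boto3 must be available on the executors
    s3 = boto3.client("s3")
    obj = s3.get_object(Bucket="my-bucket", Key=path)
    # Decompress the gzipped body and return one string per CSV line,
    # so flatMap yields one record per line
    with gzip.GzipFile(fileobj=io.BytesIO(obj["Body"].read())) as f:
        return f.read().decode("utf-8").splitlines()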
Details
Spark will create pathRdd with the default number of partitions, and then call the s3_path_to_data function on each partition's rows in parallel.
Partitions play an important role in Spark parallelism. For example:
If you have 4 executors but only 2 partitions, then only 2 executors will do the work.
You can play around with the number of partitions and the number of executors to achieve the best performance for your use case.
The following are some useful attributes for getting insight into your DataFrame or RDD, so you can fine-tune the Spark parameters.
rdd.getNumPartitions
rdd.partitions.length
rdd.partitions.size
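In PySpark, for instance, a minimal sketch of checking and controlling the partition count up front; the paths list and the counts 200 and 400 are arbitrary placeholders:

# Spread the list of paths over more partitions when creating the RDD
# (the second argument of parallelize is the number of slices)
pathRdd = sc.parallelize(paths, 200)
print(pathRdd.getNumPartitions())   # 200

# Or repartition an existing RDD before the expensive flatMap
dataRdd = pathRdd.repartition(400).flatMap(s3_path_to_data)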

Process multiple small files of total size 100GB in HDFS

I have a requirement in my project to process multiple .txt message files using PySpark. The files are moved from a local directory to an HDFS path (hdfs://messageDir/..) in batches, and for every batch I see a few thousand .txt files whose total size is around 100GB. Almost all of the files are less than 1 MB.
May I know how HDFS stores these files and performs splits? Because every file is less than 1 MB (less than the HDFS block size of 64/128 MB), I don't think any split would happen, but the files will be replicated and stored on 3 different data nodes.
When I use Spark to read all the files inside the HDFS directory (hdfs://messageDir/..) using wildcard matching like *.txt, as below:
rdd = sc.textFile('hdfs://messageDir/*.txt')
How does Spark read the files and perform partitioning, given that HDFS doesn't have any partitions for these small files?
What if my file volume increases over time and I get 1 TB of small files for every batch? Can someone tell me how this can be handled?
I think you are mixing things up a little.
You have files sitting in HDFS. Here, the block size is the important factor. Depending on your configuration, a block is normally 64 MB or 128 MB. Thus, each of your 1 MB files takes up a 64 MB block in HDFS. That is an awful lot of unused space. Can you concat these TXT files together? Otherwise you will run out of HDFS blocks really quickly. HDFS is not made to store a large number of small files.
Spark can read files from HDFS, local storage, MySQL, etc. It cannot control the storage principles used there. As Spark uses RDDs, the data is partitioned to get parts of it to the workers. The number of partitions can be checked and controlled (using repartition). For HDFS reads, this number is defined by the number of files and blocks.
Here is a nice explanation of how SparkContext.textFile() handles partitioning and splits on HDFS: How does Spark partition(ing) work on files in HDFS?
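For example, a small PySpark sketch; the path and the partition counts are placeholders, and minPartitions is only a suggested lower bound:

# Ask for at least 100 partitions; getNumPartitions shows what Spark actually decided
rdd = sc.textFile("hdfs://messageDir/*.txt", minPartitions=100)
print(rdd.getNumPartitions())

# The partition count can still be changed explicitly after reading, if needed
rdd = rdd.repartition(200)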
You can read small files from Spark; the problem is HDFS. The HDFS block size is usually really large (64 MB, 128 MB, or even bigger), so many small files create NameNode overhead.
If you want to produce bigger files, you need to optimize the reducer side. The number of files written is determined by how many reducers write. You can use the coalesce or repartition method to control it.
Another way is to add one more step that merges the files. I wrote a Spark application that does this with coalesce: I pass a target record count per file, the application gets the total number of records, and from that it estimates the coalesce factor.
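A minimal sketch of that merge step; the target record count and the input/output paths are placeholder values, not the actual application:

# Placeholder target: roughly one million records (lines) per output file
target_records_per_file = 1000000

df = spark.read.text("hdfs://messageDir/")   # one row per line of the small files
total_records = df.count()

# Estimate how many output files are needed, then coalesce before writing
num_files = max(1, int(total_records // target_records_per_file))
df.coalesce(num_files).write.mode("overwrite").text("hdfs://mergedDir/")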
You can also use Hive or other tools for this.

Spark partition by files

I have several thousand compressed CSV files in an S3 bucket, each approximately 30 MB in size (around 120-160 MB after decompression), which I want to process using Spark.
In my Spark job, I am doing simple filter and select queries on each row.
While partitioning, Spark divides each file into two or more parts and then creates tasks for each partition. Each task takes around 1 minute to complete just to process 125K records. I want to avoid this partitioning of a single file across many tasks.
Is there a way to fetch the files and partition the data such that each task works on one complete file, i.e. number of tasks = number of input files?
As well as playing with Spark options, you can configure the s3a filesystem client to tell Spark that the "block size" of a file in S3 is 128 MB. The default is 32 MB, which is close enough to your "approximately 30 MB" number that Spark could be splitting the files in two.
spark.hadoop.fs.s3a.block.size 134217728
Using the wholeTextFiles() operation is safer, though.
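A sketch of one way to apply that setting from PySpark; the app name is a placeholder, and spark.hadoop.* properties are forwarded to the Hadoop configuration used by the s3a client:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("read-compressed-csv")                         # placeholder name
         .config("spark.hadoop.fs.s3a.block.size", "134217728")  # 128 MB
         .getOrCreate())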

Read single parquet file in parallel in Spark?

We work with Spark 1.6 (and also Spark 2.1) and operate on Hive tables which are saved as parquet files. In certain cases we have only a few files (some around 10 MB in size).
For example, with two parquet files, reading those tables using sqlContext.table(tableName).rdd.count creates a Spark job with only 2 tasks, which takes quite some time (~12 s).
My question is: is it possible to read N files with more parallelism than only N tasks? Is there a way to speed up this Spark job without changing the number of files on the filesystem? Since HDFS is a distributed filesystem (and files are replicated), I can imagine that more than one machine could read (a part of) a file simultaneously.
Use several executors and/or several threads (with spark.task.cpus > 1).
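A minimal sketch of where such settings would go; the values are arbitrary examples, and whether they help depends on whether the reading code can actually use the extra cores per task:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.executor.instances", "4")  # several executors (placeholder count)
         .config("spark.task.cpus", "2")           # more than one core per task
         .getOrCreate())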

How the input data is split in Spark?

I'm coming from a Hadoop background. In Hadoop, if we have an input directory that contains lots of small files, each mapper task picks one file at a time and operates on that single file (we can change this behaviour and have each mapper pick more than one file, but that's not the default). I'd like to know how that works in Spark. Does each Spark task pick files one by one, or..?
Spark behaves the same way as Hadoop working with HDFS, as in fact Spark uses the same Hadoop InputFormats to read the data from HDFS.
But your statement is wrong. Hadoop will take files one by one only if each of your files is smaller than a block size or if all the files are text and compressed with non-splittable compression (like gzip-compressed CSV files).
So Spark would do the same: for each of the small input files it would create a separate "partition", and the first stage executed over your data would have the same number of tasks as the number of input files. This is why for small files it is recommended to use the wholeTextFiles function, as it creates far fewer partitions.
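For instance, a minimal PySpark sketch; the path and the minPartitions value are placeholders. wholeTextFiles returns (path, content) pairs, one per file, so a file is never split across tasks, and minPartitions is only a suggested minimum: many files can be packed into each partition.

# Each record is a (path, full_file_content) pair
rdd = sc.wholeTextFiles("hdfs:///input/small_files/", minPartitions=32)

# Example: split each file's content back into lines for downstream processing
lines = rdd.flatMap(lambda kv: kv[1].splitlines())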
