Putting many small files to HDFS to train/evaluate model - apache-spark

I want to extract the contents of some large tar.gz archives, which contain millions of small files, to HDFS. After the data has been uploaded, it should be possible to access individual files in the archive by their paths, and to list them. The most straightforward solution would be to write a small script that extracts these archives to some HDFS base folder. However, since HDFS is known not to deal particularly well with small files, I'm wondering how this solution can be improved. These are the potential approaches I found so far:
Sequence Files
Hadoop Archives
HBase
Ideally, I want the solution to play well with Spark, meaning that accessing the data with Spark should not be more complicated than it would be if the data were extracted to HDFS directly. What are your suggestions and experiences in this domain?

You can land the files into a landing zone and then process them into something useful.
zcat <infile> | hdfs dfs -put - /LandingData/
Then build a table on top of that 'landed' data. Use Hive or Spark.
Then write out a new table (in a new folder) using the format of Parquet or ORC.
Whenever you need to run analytics on the data, use this new table; it will perform well and avoid the small-file problem. This keeps the small-file problem to a one-time load.
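A minimal PySpark sketch of that compaction step, assuming the landed data is plain text and that the paths below are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact-landing-zone").getOrCreate()

# Read the raw data from the landing zone.
raw = spark.read.text("/LandingData/")

# Rewrite it as a small number of larger Parquet files in a new folder.
raw.coalesce(32).write.mode("overwrite").parquet("/Warehouse/landed_compacted/")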

Sequence files are a good way to handle the Hadoop small-files problem.
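For illustration, a rough PySpark sketch of packing extracted small files into a SequenceFile keyed by file path (all paths here are assumptions):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pack-sequence-file").getOrCreate()
sc = spark.sparkContext

# (path, content) pairs for every small file under the extracted base folder.
small_files = sc.wholeTextFiles("hdfs:///data/extracted/")

# Pack them into a SequenceFile; each record keeps the original path as its key.
small_files.saveAsSequenceFile("hdfs:///data/packed_seq/")

# Reading the packed data back gives the same (path, content) pairs.
packed = sc.sequenceFile("hdfs:///data/packed_seq/")
print(packed.keys().take(5))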

Related

Reading Millions of Small JSON Files from S3 Bucket in PySpark Very Slow

I have a folder (path = mnt/data/*.json) in s3 with millions of json files (each file is less than 10 KB). I run the following code:
df = (spark.read
      .option("multiline", True)
      .option("inferSchema", False)
      .json(path))
display(df)
The problem is that it is very slow. Spark creates a job for this with one task. The task appears to have no more executors running it, which usually signifies the completion of a job (right?), but for some reason the command cell in Databricks is still running. It's been stuck like this for 10 minutes. I feel something as simple as this should take no more than 5 minutes.
Notes to consider:
Since there are millions of json files, I can't say with confidence that they will have the same exact structure (there could be some discrepancies)
The json files were web-scraped from the same REST API
I read somewhere that inferSchema = False can help reduce runtime, which is why I used it
AWS s3 bucket is already mounted so there is absolutely no need to use boto3
Apache Spark is very good at handling large files, but when you have tens of thousands of small files (millions in your case) in a directory or distributed across several directories, that has a severe impact on processing time (potentially tens of minutes to hours), since Spark has to open and read each of these tiny files.
An ideal file size is between 128 MB and 1 GB on disk; anything smaller than 128 MB (see spark.sql.files.maxPartitionBytes) causes this tiny-files problem and becomes the bottleneck.
You can rewrite the data in Parquet format at an intermediate location, either as one large file using coalesce or as multiple even-sized files using repartition.
You can then read the data from this intermediate location for further processing; this should prevent the bottlenecks that come with the tiny-files problem.
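A minimal sketch of that intermediate rewrite, assuming an existing SparkSession and placeholder paths:

# Rewrite the many small JSON files as a handful of even-sized Parquet files.
df = spark.read.json("/mnt/data/*.json")
(df.repartition(64)          # or .coalesce(1) for a single large file
   .write.mode("overwrite")
   .parquet("/mnt/data_compacted/"))

# All further processing reads from the compacted location.
compacted = spark.read.parquet("/mnt/data_compacted/")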
My approach was very simple, thanks to Anand pointing out the "small file problem." My problem was that I could not efficiently read ~2 million JSON files, each ~10 KB in size, so there was no way to read them and then store them in Parquet format as an intermediate step. I was given an S3 bucket with raw JSON files scraped from the web.
In any case, Python's zipfile module came in handy. I used it to bundle multiple JSON files so that each archive was at least 128 MB and at most 1 GB. It worked pretty well!
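A rough sketch of that bundling step with zipfile; the source directory, naming scheme, and size threshold are assumptions:

import os
import zipfile

SRC_DIR = "raw_json/"            # directory holding the ~10 KB JSON files (assumed local copy)
TARGET_MIN = 128 * 1024 * 1024   # aim for at least 128 MB per archive

batch, batch_size, archive_id = [], 0, 0
for name in sorted(os.listdir(SRC_DIR)):
    path = os.path.join(SRC_DIR, name)
    batch.append(path)
    batch_size += os.path.getsize(path)
    if batch_size >= TARGET_MIN:
        with zipfile.ZipFile(f"bundle_{archive_id:05d}.zip", "w") as zf:
            for p in batch:
                zf.write(p, arcname=os.path.basename(p))
        batch, batch_size, archive_id = [], 0, archive_id + 1

# Any leftover files go into one final, smaller archive.
if batch:
    with zipfile.ZipFile(f"bundle_{archive_id:05d}.zip", "w") as zf:
        for p in batch:
            zf.write(p, arcname=os.path.basename(p))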
There is also another way to do this using AWS Glue, but of course that requires IAM role authorization and can be expensive; the advantage is that you can convert those files into Parquet directly.
zipfile solution: https://docs.python.org/3/library/zipfile.html
AWS Glue solution: https://medium.com/searce/convert-csv-json-files-to-apache-parquet-using-aws-glue-a760d177b45f
Really good blog posts explaining the small file problem:
https://mungingdata.com/apache-spark/compacting-files/
https://garrens.com/blog/2017/11/04/big-data-spark-and-its-small-files-problem/?unapproved=252&moderation-hash=5a657350c6169448d65209caa52d5d2c#comment-252

How to perform parallel processing of files (pdf,docs,txt, xls) in a directory on local folder on Desktop using PySpark?

I have about 9000 files in different subdirectories within a single directory on my local desktop; the total size of the directory is about 15 GB. I do not want to go through a plain Python programming approach, which is extremely time consuming. I would like to use some kind of distributed parallel processing for the task. I want to perform the following:
Ingest all these files in a parallel manner.
Extract text from these documents (I already have a tika based python script to extract the text from these files)
Store the filename and the content (text extracted) in a dataframe.
I have already done the above task using normal python script. But I want to use Spark/pySpark to perform the above tasks. I have never used Spark before so need some guidance on what could be the roadmap.
How do I pass these documents to Spark in parallel and then apply my extraction script to them? What approach can I take?
Spark is not optimal for PDF, XLS, and DOCX formats. These formats have their own type of compression and do not parallelize well; they need to be loaded entirely into memory to be decompressed.
The preferred formats are column-oriented formats such as Parquet or ORC, or flat files such as JSON, TXT, etc. These can be processed efficiently in parts without having to load the entire file into memory for decompression.
If you happen to have only text files, that have different structures or that are unstructured, then I recommend using spark's RDD API to read them:
sc.wholeTextFiles(input_directory)
This will load the content of every text file and append the name of the file to each record.
Otherwise, parallelizing in python using multiprocessing will be more efficient.
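For the text-file case, a minimal sketch of the wholeTextFiles approach, assuming plain-text inputs and a placeholder path:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("text-ingest").getOrCreate()
sc = spark.sparkContext

# Each record is a (file_path, file_content) pair.
files_rdd = sc.wholeTextFiles("file:///path/to/Desktop/documents")

# Store filename and content in a DataFrame, as the question asks.
df = files_rdd.toDF(["filename", "content"])
df.show(truncate=50)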

How to load lots of files into one RDD in Spark

I use the saveAsTextFile method to save an RDD, but it is not saved as a single file; instead the output directory contains many part files.
So, my question is: how do I reload these files into one RDD?
My guess is that you are trying to use Spark locally rather than in a distributed manner. When you use saveAsTextFile, it saves the data using Hadoop's file writer and creates one file per RDD partition. One thing you could do is coalesce the partitions to 1 before writing if you want a single file. But if you go up one folder, you will find that the folder's name is the one you saved to, so you can just call sc.textFile with that same path and it will pull everything back into partitions.
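A minimal sketch of both points, assuming rdd and sc already exist and /output is a placeholder path:

# Coalesce to a single partition before writing if one output file is required.
rdd.coalesce(1).saveAsTextFile("/output")

# To reload, point sc.textFile at the output folder; it picks up every part file.
reloaded = sc.textFile("/output")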
You know what? I just found a very elegant way:
Say your files are all in the /output directory; just use the following command to merge them into one, and then you can easily reload it as one RDD:
hadoop fs -getmerge /output /local/file/path
Not a big deal, I'm Leifeng.

Importing a large text file into Spark

I have a pipe delimited text file that is 360GB, compressed (gzip). The file is in an S3 bucket.
This is my first time using Spark. I understand that you can partition a file in order to allow multiple worker nodes to operate on the data which results in huge performance gains. However, I'm trying to find an efficient way to turn my one 360GB file into a partitioned file. Is there a way to use multiple spark worker nodes to work on my one, compressed file in order to partition it? Unfortunately, I have no control over the fact that I'm just getting one huge file. I could uncompress the file myself and break it into many files (say 360 1GB files), but I'll just be using one machine to do that and it will be pretty slow. I need to run some expensive transformations on the data using Spark so I think partitioning the file is necessary. I'm using Spark inside of Amazon Glue so I know that it can scale to a large number of machines. Also, I'm using python (pyspark).
Thanks.
If I'm not mistaken, Spark uses Hadoop's TextInputFormat when you read a file via SparkContext.textFile. If a compression codec is set, TextInputFormat determines whether the file is splittable by checking whether the codec is an instance of SplittableCompressionCodec.
I believe GZIP is not splittable, so Spark can only generate one partition to read the entire file.
What you could do is:
1. Add a repartition after SparkContext.textFile so that at least your subsequent transformations process parts of the data in parallel (see the sketch after this list).
2. Ask for multiple files instead of just a single GZIP file
3. Write an application that decompresses and splits the file into multiple output files before running your Spark application on it.
4. Write your own compression codec for GZIP (this is a little more complex).
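A minimal PySpark sketch of option 1; the bucket name, partition count, and delimiter handling are assumptions:

# The gzipped file is read into a single partition...
lines = sc.textFile("s3://your-bucket/huge-file.txt.gz")

# ...then redistributed so the expensive transformations run in parallel.
lines = lines.repartition(360)
rows = lines.map(lambda line: line.split("|"))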
Have a look at these links:
TextInputFormat
source code for TextInputFormat
GzipCodec
source code for GZIPCodec
These are in Java, but I'm sure there are equivalent Python/Scala versions of them.
First, I suggest using the ORC format with zlib compression; you get almost 70% compression, and in my experience ORC is among the most suitable file formats for fast data processing. So load your file and simply write it out in ORC format with a repartition:
df.repartition(500).write.format("orc").option("compression", "zlib").mode("overwrite").save("testoutput.orc")
One potential solution could be to use Amazon's S3DistCp as a step on your EMR cluster to copy the 360GB file in the HDFS file system available on the cluster (this requires Hadoop to be deployed on the EMR).
A nice thing about S3DistCp is that you can change the codec of the output file, and transform the original gzip file into a format which will allow Spark to spawn multiple partitions for its RDD.
However, I am not sure how long S3DistCp will take to perform the operation (it is a Hadoop MapReduce job over S3; it benefits from optimised S3 libraries when run from EMR, but I am concerned that Hadoop will face the same limitations as Spark when generating the map tasks).
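For illustration, an S3DistCp step along the lines described above might look like the following; the exact arguments are assumptions, so check the EMR documentation for your release:
s3-dist-cp --src s3://your-bucket/input/ --dest hdfs:///data/input/ --outputCodec=none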

Spark write.avro creates individual avro files

I have a spark-submit job that reads an input directory of JSON docs, does some processing on them using data frames, and then writes to an output directory. For some reason, though, it creates individual Avro, Parquet or JSON files when I use the df.save or df.write methods.
In fact, I even used the saveAsTable method and it did the same thing with parquet.gz files in the hive warehouse.
It seems to me that this is inefficient and negates the use of a container file format. Is this right? Or is this normal behavior and what I'm seeing just an abstraction in HDFS?
If I am right that this is bad, how do I write the data frame from many files into a single file?
As @zero323 said, this is normal behavior caused by having many workers writing in parallel (to support fault tolerance).
I would suggest writing all the records to a Parquet or Avro file (as Avro generic records) using something like this:
dataframe.write().mode(SaveMode.Append)
    .format(FILE_FORMAT)
    .partitionBy("parameter1", "parameter2")
    .save(path);
It won't write to a single file, but it will group similar Avro generic records into the same file (likely a smaller number of medium-sized files).
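If a single output file is genuinely required, one option is to coalesce to one partition before writing, at the cost of parallelism during the write. A minimal PySpark sketch, assuming the spark-avro package is on the classpath and the path is a placeholder:

# One partition means one worker writes one file; this does not scale to very large data.
df.coalesce(1).write.mode("overwrite").format("avro").save("/out/single_file/")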
