Save a large Spark Dataframe as a single json file in S3 - apache-spark

I'm trying to save a Spark DataFrame (of more than 20 GB) to a single JSON file in Amazon S3. My code to save the dataframe is like this:
dataframe.repartition(1).save("s3n://mybucket/testfile","json")
But I'm getting an error from S3: "Your proposed upload exceeds the maximum allowed size". I know that the maximum file size allowed by Amazon is 5 GB.
Is it possible to use S3 multipart upload with Spark, or is there another way to solve this?
Btw, I need the data in a single file because another user is going to download it afterwards.
*I'm using Apache Spark 1.3.1 on a 3-node cluster created with the spark-ec2 script.
Thanks a lot
JG

I would try separating the large dataframe into a series of smaller dataframes that you then append to the same target path:
df.write.mode('append').json(yourtargetpath)
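For example, here is a minimal PySpark sketch of that idea; the target path is a placeholder and randomSplit is my choice for carving up the large dataframe, not something the answer specifies. Note that each append adds more part files under the same prefix rather than literally growing one file.

chunks = dataframe.randomSplit([1.0] * 8)   # split into ~8 roughly equal dataframes
for chunk in chunks:
    # each pass appends additional part files under the same target prefix
    chunk.write.mode("append").json("s3n://mybucket/testfile")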

Try this
dataframe.write.format("org.apache.spark.sql.json").mode(SaveMode.Append).save("hdfs://localhost:9000/sampletext.txt");

I don't think s3a is production-ready in Spark yet.
I would say the design is not sound. repartition(1) is going to be terrible (what you are telling Spark is to merge all partitions into a single one).
I would suggest convincing the downstream consumer to download the contents from a folder rather than a single file.
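As a hedged illustration of that folder-based approach (the s3a path below is a placeholder), you skip repartition(1) entirely and let Spark write however many part files it naturally produces:

dataframe.write.mode("overwrite").json("s3a://mybucket/testfile/")
# the consumer then downloads every part-* object under the testfile/ prefix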

Related

PySpark S3 file read performance consideration

I am a newbie to PySpark.
I just want to understand how large the files I write into S3 should be so that Spark can read and process them.
I have around 400 to 500 GB of total data; I need to first upload it to S3 using some tool.
I'm just trying to understand how big each file should be in S3 so that Spark can read and process it efficiently.
And how will Spark distribute the S3 file data to multiple executors?
Any good reading links?
Thanks
Try 64-128MB, though it depends on the format.
Spark treats S3 data as location-independent, so it doesn't use locality in its placement decisions - it just picks whichever workers have capacity for extra work.

Reading Millions of Small JSON Files from S3 Bucket in PySpark Very Slow

I have a folder (path = mnt/data/*.json) in s3 with millions of json files (each file is less than 10 KB). I run the following code:
df = (spark.read
      .option("multiline", True)
      .option("inferSchema", False)
      .json(path))
display(df)
The problem is that it is very slow. Spark creates a job for this with one task. The task appears to have no more executors running it, which usually signifies the completion of a job (right?), but for some reason the command cell in Databricks is still running. It's been stuck like this for 10 minutes. I feel something as simple as this should take no more than 5 minutes.
Notes to consider:
Since there are millions of json files, I can't say with confidence that they will have the same exact structure (there could be some discrepancies)
The json files were web-scraped from the same REST API
I read somewhere that inferSchema = False can help reduce runtime, which is why I used it
AWS s3 bucket is already mounted so there is absolutely no need to use boto3
Apache Spark is very good at handling large files, but when you have tens of thousands of small files (millions in your case) in one directory or spread across several directories, that has a severe impact on processing time (potentially tens of minutes to hours), since Spark has to read each of these tiny files.
An ideal file size is between 128 MB and 1 GB on disk; anything smaller than 128 MB (the default value of spark.sql.files.maxPartitionBytes) triggers this tiny-files problem and becomes the bottleneck.
You can rewrite the data in Parquet format at an intermediate location, either as one large file using coalesce or as multiple even-sized files using repartition, as in the sketch below.
You can then read the data from this intermediate location for further processing, and this should prevent the bottlenecks that come with the tiny-files problem.
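A rough PySpark sketch of that compaction step, with placeholder paths and an assumed partition count of 64:

raw = spark.read.json("s3://mybucket/raw-json/")   # millions of tiny JSON files

# either one large file...
raw.coalesce(1).write.mode("overwrite").parquet("s3://mybucket/compacted/")

# ...or several even-sized files (64 is an assumed count; tune it to your data volume)
raw.repartition(64).write.mode("overwrite").parquet("s3://mybucket/compacted/")

compacted = spark.read.parquet("s3://mybucket/compacted/")  # continue processing from here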
My approach was very simple, thanks to Anand pointing out the "small file problem." My problem was that I could not efficiently extract ~2 million JSON files, each ~10 KB in size, so there was no way I was able to read them and then store them in Parquet format as an intermediate step. I was given an S3 bucket of raw JSON files scraped from the web.
At any rate, Python's zipfile module came in handy. I used it to bundle multiple JSON files together so that each archive was at least 128 MB and at most 1 GB. It worked pretty well!
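Roughly, the bundling step looked something like this sketch; the directory name, archive naming, and the exact handling of the 128 MB threshold are assumptions, not the original code:

import os
import zipfile

TARGET_BYTES = 128 * 1024 * 1024        # aim for at least ~128 MB per archive
src_dir = "raw_json"                    # small .json files staged locally
batch, batch_size, archive_no = [], 0, 0

for name in sorted(os.listdir(src_dir)):
    path = os.path.join(src_dir, name)
    batch.append(path)
    batch_size += os.path.getsize(path)
    if batch_size >= TARGET_BYTES:
        # write the accumulated files into one archive and start a new batch
        with zipfile.ZipFile(f"bundle_{archive_no:05d}.zip", "w") as zf:
            for p in batch:
                zf.write(p, arcname=os.path.basename(p))
        batch, batch_size, archive_no = [], 0, archive_no + 1

if batch:  # flush any leftover files into a final archive
    with zipfile.ZipFile(f"bundle_{archive_no:05d}.zip", "w") as zf:
        for p in batch:
            zf.write(p, arcname=os.path.basename(p))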
There is also another way to do this using AWS Glue, but of course that requires IAM role authorization and can be expensive; the advantage is that you can convert those files into Parquet directly.
zipfile solution: https://docs.python.org/3/library/zipfile.html
AWS Glue solution: https://medium.com/searce/convert-csv-json-files-to-apache-parquet-using-aws-glue-a760d177b45f
Really good blog posts explaining the small file problem:
https://mungingdata.com/apache-spark/compacting-files/
https://garrens.com/blog/2017/11/04/big-data-spark-and-its-small-files-problem/?unapproved=252&moderation-hash=5a657350c6169448d65209caa52d5d2c#comment-252

Why are there so many tasks when loading a CSV file from an S3 bucket?

I have a small Spark standalone cluster with dynamic resource allocation which uses AWS S3 as storage. I start Spark SQL and create a Hive external table loading data from a 779.3 KB CSV file in an S3 bucket. When I execute the SQL "select count(1) from sales;", there are exactly 798009 tasks in the Spark SQL job, roughly one task per byte. And "spark.default.parallelism" doesn't work. Is there any advice?
If you are using Hadoop 2.6 JARs then it's a bug in that version of s3a; if you are seeing it elsewhere then it may be a config problem.
Your file is being split into one partition per byte because the filesystem is saying "each partition is one byte long". Which means that FileSystem.getBlockSize() is returning the value "0" (cf. HADOOP-11584: s3a file block size set to 0 in getFileStatus).
For the s3a connector, make sure that you are using 2.7+ and then set fs.s3a.block.size to something like 33554432 (i.e. 32 MB), at which point your source file won't get split up at all.
If you can, go up to 2.8; we've done a lot of work speeding up both input and output, especially for columnar-format IO and its seek patterns.
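For the PySpark side, a hedged example of applying that setting; the app name and CSV path are placeholders:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("s3a-block-size-example")
         .config("spark.hadoop.fs.s3a.block.size", "33554432")  # report 32 MB blocks for s3a
         .getOrCreate())

df = spark.read.csv("s3a://mybucket/sales.csv", header=True)
print(df.rdd.getNumPartitions())  # should now be small, not one task per byte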
Try DF.repartition(1) before running the query. There must be too many partitions when you run this command.
use spark.sql.shuffle.partitions=2

Importing a large text file into Spark

I have a pipe delimited text file that is 360GB, compressed (gzip). The file is in an S3 bucket.
This is my first time using Spark. I understand that you can partition a file in order to allow multiple worker nodes to operate on the data which results in huge performance gains. However, I'm trying to find an efficient way to turn my one 360GB file into a partitioned file. Is there a way to use multiple spark worker nodes to work on my one, compressed file in order to partition it? Unfortunately, I have no control over the fact that I'm just getting one huge file. I could uncompress the file myself and break it into many files (say 360 1GB files), but I'll just be using one machine to do that and it will be pretty slow. I need to run some expensive transformations on the data using Spark so I think partitioning the file is necessary. I'm using Spark inside of Amazon Glue so I know that it can scale to a large number of machines. Also, I'm using python (pyspark).
Thanks.
If I'm not mistaken, Spark uses Hadoop's TextInputFormat if you read a file using SparkContext.textFile. If a compression codec is set, the TextInputFormat determines whether the file is splittable by checking if the codec is an instance of SplittableCompressionCodec.
I believe GZIP is not splittable, so Spark can only generate one partition to read the entire file.
What you could do is:
1. Add a repartition after SparkContext.textFile so that at least your subsequent transformations process parts of the data in parallel (see the sketch after this list).
2. Ask for multiple files instead of just a single GZIP file.
3. Write an application that decompresses and splits the file into multiple output files before running your Spark application on it.
4. Write your own compression codec for GZIP (this is a little more complex).
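A short PySpark sketch of option 1, with a placeholder path and an assumed partition count; the gzipped file is still read by a single task, but everything downstream of the repartition runs in parallel:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gzip-repartition-example").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("s3a://mybucket/huge_file.txt.gz")  # gzip => one partition
parts = lines.repartition(400)                          # fan out before the expensive work
rows = parts.map(lambda line: line.split("|"))          # pipe-delimited records, processed in parallel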
Have a look at these links:
TextInputFormat
source code for TextInputFormat
GzipCodec
source code for GZIPCodec
These are in Java, but I'm sure there are equivalent Python/Scala versions of them.
First, I suggest using the ORC format with zlib compression, so you get almost 70% compression, and as per my research ORC is the most suitable file format for fast data processing. So load your file and simply write it out in ORC format with a repartition:
df.repartition(500).write.format("orc").option("compression", "zlib").mode("overwrite").save("testoutput.orc")
One potential solution could be to use Amazon's S3DistCp as a step on your EMR cluster to copy the 360GB file in the HDFS file system available on the cluster (this requires Hadoop to be deployed on the EMR).
A nice thing about S3DistCp is that you can change the codec of the output file, and transform the original gzip file into a format which will allow Spark to spawn multiple partitions for its RDD.
However, I am not sure how long it will take S3DistCp to perform the operation (it is a Hadoop Map/Reduce job over S3; it benefits from optimised S3 libraries when run from EMR, but I am concerned that Hadoop will face the same limitations as Spark when generating the map tasks).

Library to process .rrd (round robin data) files using Spark

I have huge time-series data in .rrd (round robin database) format stored in S3. I am planning to use Apache Spark to run analysis on this data to get different performance metrics.
Currently I am downloading the .rrd files from S3 and processing them using the rrd4j library. I am going to do processing over longer terms, like a year or more, which involves processing hundreds of thousands of .rrd files. I want the Spark nodes to get the files directly from S3 and run the analysis.
How can I make Spark use rrd4j to read the .rrd files? Is there any library which helps me do that?
Is there any support in Spark for processing this kind of data?
The Spark part is rather easy: use either wholeTextFiles or binaryFiles on the SparkContext (see the docs). According to the documentation, rrd4j usually wants a path to construct an RRD, but with the RrdByteArrayBackend you could load the data in there - although that might be a problem, because most of that API is protected. You'll have to figure out a way to load an Array[Byte] into rrd4j.
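A minimal PySpark sketch of just the Spark side, assuming a placeholder bucket path; parse_rrd is a hypothetical stand-in for whatever rrd4j-based (or other) parsing you wire in:

rrd_files = spark.sparkContext.binaryFiles("s3a://mybucket/rrd/")  # RDD of (path, bytes) pairs

def parse_rrd(path_and_bytes):
    path, raw = path_and_bytes
    # hand `raw` (the file contents as bytes) to your RRD parser here
    return (path, len(raw))  # placeholder result

results = rrd_files.map(parse_rrd).collect()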
