Why does the Apache Spark save function create a folder containing multiple sub-files? - apache-spark

When saving a Spark dataframe, Spark writes multiple files inside a folder instead of only one file.
df.write.format("json") \
    .option("header", "true") \
    .save('data.json', mode='append')
When I run this code, data.json is a folder name instead of a file name.
I want to know: what are the advantages of that?

When you write a dataframe or RDD, Spark uses the Hadoop API underneath.
The actual result data is in the part- files, which are created one per partition of the dataframe. If you have n partitions, then n part files are created.
The main advantage of multiple part files is that multiple workers can write to the output in parallel.
Other files, like _SUCCESS, indicate that the write completed successfully, and the .crc files are checksums.
Hope this helps you.
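For illustration, here is a minimal PySpark sketch (the path and example dataframe are made up) showing the one-part-file-per-partition relationship:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical example: a dataframe repartitioned into 4 partitions.
df = spark.range(1000).repartition(4)
print(df.rdd.getNumPartitions())  # 4

# Writing it creates a folder containing 4 part- files
# (plus _SUCCESS and .crc files), one per partition.
df.write.mode("overwrite").json("/tmp/data.json")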

Related

How to reduce the number of partitions without changing the Spark code

I have a code zip file that I execute through spark-submit, and it produces 200 output files. Since it is a zip file and I cannot change the code,
how do I reduce the number of output files?
If you are using a config file and your code repartitions by reading the number of partitions from the config file dynamically, then you can just change the value in your config file; there is no need to change the zip file.
Another option is to pass --conf spark.sql.shuffle.partitions=<number of partitions> to spark-submit; your Spark job will then create the specified number of files.
NOTE: Setting this parameter can degrade performance, as it controls the number of shuffle partitions for the whole Spark program. It is only advisable if the Spark job is not processing millions of records.
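As a rough sketch of the first option (the config file name, its key, and the paths below are hypothetical), the job can read the partition count at runtime and repartition before writing, so only the config file ever changes:
import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical config file shipped next to the job, e.g. {"output_partitions": 20}
with open("job_config.json") as f:
    conf = json.load(f)

df = spark.read.parquet("input_path")  # placeholder input path
df.repartition(conf["output_partitions"]) \
  .write.mode("overwrite").parquet("output_path")  # writes 20 part- files

# Alternatively, pass --conf spark.sql.shuffle.partitions=20 to spark-submit, as described above.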

How to process multiple parquet files in parallel using Pyspark?

I am using Azure Databricks and I'm new to PySpark and big data.
Here is my problem:
I have several parquet files in a directory on azure databricks.
I want to read these files to a pyspark dataframe and use the drop duplicates method to remove duplicate rows - a QA check.
I then want to overwrite these files in the same directory after dropping the duplicates.
Currently, I am using a for loop to loop over each parquet file in the directory. However, this is an inefficient way of doing things. I was wondering if there is a way to process these parquet files in parallel to save computational time. If so, how do I need to change my code?
Here is the code:
for parquet_file_name in dir:
    df = spark.read.option("header", "true").option("inferschema", "false").parquet('{}/{}'.format(dir, parquet_file_name))
    df.dropDuplicates().write.mode('overwrite').parquet('{}/{}'.format(dir, parquet_file_name))
Any help here would be much appreciated.
Many thanks.
Rather than reading in one file at a time in a for loop, just read in the entire directory like so.
df = spark.read \
.option("header", "true") \
.option("inferschema", "false").parquet(dir)
df.dropDuplicates().write.mode('overwrite').parquet(dir)
The data will now be read all at once, as intended. If you want to change the number of files written out, use the coalesce command before the .write function, like so: df.dropDuplicates().coalesce(4).write.mode('overwrite').parquet(dir).

Writing a Spark dataframe to a .csv file in S3 and choosing a name in PySpark

I have a dataframe and I am going to write it to a .csv file in S3.
I use the following code:
df.coalesce(1).write.csv("dbfs:/mnt/mount1/2016//product_profit_weekly",mode='overwrite',header=True)
It puts a .csv file in the product_profit_weekly folder. At the moment the .csv file has a weird name in S3; is it possible to choose a file name when I write it?
All Spark dataframe writers (df.write.___) don't write to a single file; they write one chunk per partition. I imagine what you get is a directory called
dbfs:/mnt/mount1/2016//product_profit_weekly
and one file inside called
part-00000
In this case, you are doing something that could be quite inefficient and not very "sparky" -- you are coalescing all dataframe partitions to one, meaning that your task isn't actually executed in parallel!
Here's a different model. To take advantage of all of Spark's parallelization, DON'T coalesce; write in parallel to some directory.
If you have 100 partitions, you will get:
part-00000
part-00001
...
part-00099
If you need everything in one flat file, write a little function to merge it after the fact. You could either do this in Scala, or in bash with:
cat ${dir}/part-* > $flatFilePath
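For instance, if the output directory is visible as a local path (e.g. through a DBFS mount), a small merge helper might look like the sketch below; the function name and paths are hypothetical:
import glob
import shutil

def merge_part_files(part_dir, flat_file_path):
    # Concatenate all part-* files in part_dir into one flat file.
    with open(flat_file_path, "wb") as out:
        for part in sorted(glob.glob("{}/part-*".format(part_dir))):
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)

# Example (hypothetical locally mounted path):
# merge_part_files("/dbfs/mnt/output_dir", "/dbfs/mnt/output_flat.csv")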

Spark: difference when reading in .gz and .bz2

I normally read and write files in Spark using .gz, where the number of files should be the same as the number of RDD partitions, i.e. one giant .gz file is read into a single partition. However, if I read in one single .bz2 file, would I still get one single giant partition? Or will Spark automatically split one .bz2 into multiple partitions?
Also, how do I know how many partitions there would be while Hadoop reads it in from one bz2 file? Thanks!
However, if I read in one single .bz2 file, would I still get one single giant partition?
Or will Spark automatically split one .bz2 into multiple partitions?
If you specify n partitions to read a bzip2 file, Spark will spawn n tasks to read the file in parallel. The default value of n is set to sc.defaultParallelism. The number of partitions is the second argument in the call to textFile (docs).
one giant .gz file is read into a single partition.
Please note that you can always do a
sc.textFile(myGiantGzipFile).repartition(desiredNumberOfPartitions)
to get the desired number of partitions after the file has been read.
Also, how do I know how many partitions there would be while Hadoop reads it in from one bz2 file?
That would be yourRDD.partitions.size for the scala api or yourRDD.getNumPartitions() for the python api.
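Putting this together, a small PySpark sketch (assuming a SparkContext sc, as in a pyspark shell; the file names are just examples):
# bzip2 is splittable: ask for (at least) 8 partitions while reading.
rdd = sc.textFile("users.json.bz2", 8)
print(rdd.getNumPartitions())

# gzip is not splittable, so it lands in a single partition;
# repartition afterwards if you need more parallelism.
gz = sc.textFile("myGiantGzipFile.gz").repartition(8)
print(gz.getNumPartitions())  # 8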
I didn't know why my test program ran on one executor; after some tests, I think I get it. It was like this (reading the file through Spark SQL):
// Load a DataFrame of users. Each line in the file is a JSON
// document, representing one row.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val user = sqlContext.read.json("users.json.bz2")

What are the files generated by Spark when using "saveAsTextFile"?

When I run a Spark job and save the output as a text file using the "saveAsTextFile" method, as specified at https://spark.apache.org/docs/0.9.1/api/core/index.html#org.apache.spark.rdd.RDD,
here are the files that are created:
Is the .crc file a Cyclic Redundancy Check file, and so used to check that the content of each generated file is correct?
The _SUCCESS file is always empty; what does this signify?
The files that do not have an extension in the above screenshot contain the actual data from the RDD, but why are many files generated instead of just one?
Those are files generated by the underlying Hadoop API that Spark calls when you invoke saveAsTextFile().
part- files: These are your output data files.
You will have one part- file per partition in the RDD you called saveAsTextFile() on. Each of these files will be written out in parallel, up to a certain limit (typically, the number of cores on the workers in your cluster). This means you will write your output much faster than it would be written out if it were all put in a single file, assuming your storage layer can handle the bandwidth.
You can check the number of partitions in your RDD, which should tell you how many part- files to expect, as follows:
# PySpark
# Get the number of partitions of my_rdd.
my_rdd.getNumPartitions()
_SUCCESS file: The presence of an empty _SUCCESS file simply means that the operation completed normally.
.crc files: I have not seen the .crc files before, but yes, presumably they are checksums on the part- files.
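As a quick illustration (the output path is hypothetical and must not already exist), an RDD with three partitions produces three part- files alongside the marker and checksum files:
# An RDD with 3 partitions...
rdd = sc.parallelize(range(100), 3)

# ...written with saveAsTextFile produces a directory containing
# part-00000, part-00001, part-00002, plus _SUCCESS and .crc files.
rdd.saveAsTextFile("/tmp/save_as_text_file_demo")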
