When I run a Spark job and save the output as a text file using the method saveAsTextFile, as documented at https://spark.apache.org/docs/0.9.1/api/core/index.html#org.apache.spark.rdd.RDD ,
here are the files that are created:
Is the .crc file a Cyclic Redundancy Check file, and is it therefore used to check that the content of each generated file is correct?
The _SUCCESS file is always empty; what does this signify?
The files that have no extension in the above screenshot contain the actual data from the RDD, but why are many files generated instead of just one?
Those are files generated by the underlying Hadoop API that Spark calls when you invoke saveAsTextFile().
part- files: These are your output data files.
You will have one part- file per partition in the RDD you called saveAsTextFile() on. Each of these files will be written out in parallel, up to a certain limit (typically, the number of cores on the workers in your cluster). This means your output will be written much faster than it would be if it were all put in a single file, assuming your storage layer can handle the bandwidth.
You can check the number of partitions in your RDD, which should tell you how many part- files to expect, as follows:
# PySpark
# Get the number of partitions of my_rdd.
my_rdd._jrdd.splits().size()
_SUCCESS file: The presence of an empty _SUCCESS file simply means that the operation completed normally.
.crc files: I have not seen the .crc files before, but yes, presumably they are checksums for the part- files.
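If you really want a single output file, one option is to coalesce the RDD down to one partition before saving, at the cost of the parallel write described above. A minimal Scala sketch (the app name and paths are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("single-file-output"))
val myRdd = sc.textFile("hdfs:///input/data")  // any RDD will do here

// One partition means a single part-00000 file, but the write is no longer parallel.
myRdd.coalesce(1).saveAsTextFile("hdfs:///output/single-file-output")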
Related
I have code in a zip file that I execute through spark-submit, and it produces 200 output files. The question is: since it's a zip file and I can't change the code,
how do I reduce the number of output files?
If you are using a config file and your code does the repartitioning by reading the number of partitions from that config file dynamically, then you can simply change the value in your config file; there is no need to change the zip file.
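A rough sketch of that config-driven pattern in Scala (the config file name, property key, paths, and formats are illustrative, not taken from the original question):

import java.io.FileInputStream
import java.util.Properties
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("config-driven-partitions").getOrCreate()

// Hypothetical properties file shipped alongside the job, containing e.g.:
//   output.partitions=20
val props = new Properties()
props.load(new FileInputStream("job.properties"))
val numPartitions = props.getProperty("output.partitions", "200").toInt

// Repartition to the configured count just before writing; changing the property
// changes the number of output files without touching the packaged code.
val df = spark.read.parquet("hdfs:///input/path")
df.repartition(numPartitions).write.mode("overwrite").parquet("hdfs:///output/path")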
Another option would be passing --conf spark.sql.shuffle.partitions=<number of partitions> to your spark-submit; your Spark job will then create the specified number of files.
NOTE: Setting this parameter can degrade performance, as it controls the number of shuffle partitions for the whole Spark application. It is only advisable if your Spark job is not processing millions of records.
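For example, the spark-submit invocation might look like this (everything apart from the --conf flag stands in for whatever you already pass):

spark-submit --conf spark.sql.shuffle.partitions=50 <your existing spark-submit options and application>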
How to read files inside a directory in parallel using Spark
I use sc.textFile, but it reads each file sequentially.
You need to understand the difference between sc.textFile and sc.wholeTextFiles.
sc.textFile reads all the files inside the given directory and creates one partition per file; if there are 5 files, it creates 5 partitions.
sc.wholeTextFiles reads all the files inside the given directory and creates a PairRDD with the file path as the key and the file content as the value. The number of partitions depends on the number of executors.
Spark can read files inside a directory in parallel.
For that you need to use sc.wholeTextFiles.
It will read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. Each file is read as a single record and returned in a key-value pair, where the key is the path of the file and the value is its content.
For example, if you have the following files:
hdfs://a-hdfs-path/file1.txt
hdfs://a-hdfs-path/file2.txt
...
hdfs://a-hdfs-path/fileN.txt
Do val rdd = sparkContext.wholeTextFiles("hdfs://a-hdfs-path"),
then rdd contains
(a-hdfs-path/file1.txt, its content)
(a-hdfs-path/file2.txt, its content)
...
(a-hdfs-path/fileN.txt, its content)
Note - On some filesystems, .../path/* can be a more efficient way to read all files in a directory than .../path/ or .../path. Small files are preferred; large files are also allowed, but may cause poor performance.
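As a small self-contained Scala sketch of reading a directory this way and processing the files in parallel (the line-count logic is just an example):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("whole-text-files-example"))

// Each element is (filePath, fileContent); the files are spread across partitions
// and processed in parallel by the executors.
val filesRdd = sc.wholeTextFiles("hdfs://a-hdfs-path")
val lineCounts = filesRdd.mapValues(content => content.split("\n").length)
lineCounts.collect().foreach { case (path, count) => println(s"$path: $count lines") }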
I have a DataFrame and I am going to write it to a .csv file in S3.
I use the following code:
df.coalesce(1).write.csv("dbfs:/mnt/mount1/2016//product_profit_weekly",mode='overwrite',header=True)
It puts a .csv file in the product_profit_weekly folder. At the moment the .csv file has a weird name in S3; is it possible to choose a file name when I write it?
All Spark DataFrame writers (df.write.___) don't write to a single file; they write one chunk per partition. I imagine what you get is a directory called
dbfs:/mnt/mount1/2016//product_profit_weekly
and one file inside called
part-00000
In this case, you are doing something that could be quite inefficient and not very "sparky" -- you are coalescing all dataframe partitions to one, meaning that your task isn't actually executed in parallel!
Here's a different model: to take advantage of all of Spark's parallelization, DON'T coalesce, and instead write in parallel to some directory.
If you have 100 partitions, you will get:
part-00000
part-00001
...
part-00099
If you need everything in one flat file, write a little function to merge it after the fact. You could do this either in Scala, or in bash with:
cat ${dir}/part-* > $flatFilePath
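If you really need a specific file name, one common workaround (a sketch, not part of the original answer; the target name is illustrative) is to rename the single part file afterwards using the Hadoop FileSystem API:

import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("rename-output").getOrCreate()
val outputDir = "dbfs:/mnt/mount1/2016//product_profit_weekly"

// Locate the single part file written by coalesce(1) and rename it in place.
val fs = new Path(outputDir).getFileSystem(spark.sparkContext.hadoopConfiguration)
val partFile = fs.globStatus(new Path(outputDir + "/part-*"))(0).getPath
fs.rename(partFile, new Path(outputDir + "/product_profit_weekly.csv"))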
I have a job that needs to save the result in parquet/avro format from all the worker nodes. Can I do a separate parquet file for each of the individual partition and read all the resulting files as a single table? Or is there a better way of going about this?
Input is divided into 96 partitions and result needs to be saved on HDFS. When I tried to save it as a file it created over a million small files.
You can do a repartition (or coalesce if you always want fewer partitions) to the desired number of partitions just before you call write. Your data will then be written into the same number of files. When you want to read in the data, you simply point to the folder with the files rather than to a specific file. Like this:
sqlContext.read.parquet("s3://my-bucket/path/to/files/")
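For the write side, a minimal Scala sketch of that pattern (the paths are illustrative, and the DataFrame stands in for your job's result):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("repartition-before-write").getOrCreate()

// Repartition to a manageable number of files just before writing.
val result = spark.read.parquet("hdfs:///input/path")
result.repartition(96).write.mode("overwrite").parquet("hdfs:///output/results/")

// The whole folder can then be read back as a single table, as shown above.
val table = spark.read.parquet("hdfs:///output/results/")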
I'm coming from a Hadoop background. In Hadoop, if we have an input directory that contains lots of small files, each mapper task picks one file at a time and operates on that single file (we can change this behaviour and have each mapper pick more than one file, but that's not the default behaviour). I wonder how that works in Spark. Does each Spark task pick files one by one, or..?
Spark behaves the same way as Hadoop working with HDFS, as in fact Spark uses the same Hadoop InputFormats to read the data from HDFS.
But your statement is wrong. Hadoop will take files one by one only if each of your files is smaller than the block size, or if all the files are text compressed with a non-splittable codec (like gzip-compressed CSV files).
So Spark would do the same: for each of the small input files it would create a separate "partition", and the first stage executed over your data would have the same number of tasks as there are input files. This is why, for many small files, it is recommended to use the wholeTextFiles function, as it creates far fewer partitions.
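A quick way to see the difference in a spark-shell (the directory path is illustrative, and sc is the shell's built-in SparkContext):

// One partition per small file vs. far fewer partitions with wholeTextFiles.
val perFile = sc.textFile("hdfs:///data/many-small-files/").getNumPartitions
val combined = sc.wholeTextFiles("hdfs:///data/many-small-files/").getNumPartitions
println(s"textFile: $perFile partitions, wholeTextFiles: $combined partitions")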