How to read files inside a directory in parallel using Spark
I use sc.textFile, but it reads each file sequentially.
You need to understand the difference between sc.textFile and sc.wholeTextFiles.
sc.textFile reads all the files inside the given directory and creates one partition per file. If there are 5 files, it creates 5 partitions.
sc.wholeTextFiles reads all the files inside the given directory and creates a PairRDD with the file path as the key and the file content as the value. The number of partitions depends on the number of executors.
Spark can read files inside a directory in parallel.
For that you need to use sc.wholeTextFiles.
It will read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. Each file is read as a single record and returned in a key-value pair, where the key is the path of each file, the value is the content of each file.
For example, if you have the following files:
hdfs://a-hdfs-path/file1.txt
hdfs://a-hdfs-path/file2.txt
...
hdfs://a-hdfs-path/fileN.txt
Do val rdd = sparkContext.wholeTextFiles("hdfs://a-hdfs-path"),
then rdd contains
(a-hdfs-path/file1.txt, its content)
(a-hdfs-path/file2.txt, its content)
...
(a-hdfs-path/fileN.txt, its content)
Note - On some filesystems, .../path/* can be a more efficient way to read all files in a directory than .../path/ or .../path. Small files are preferred; large files are also allowed, but may cause bad performance.
Related
In a pyspark session when I do this:
df = spark.read.parquet(file)
df.write.csv('output')
it creates a directory called output with a bunch of files, one of which is a target csv file with unpredictable name, example: part-00006-80ba8022-33cb-4478-aab3-29f08efc160a-c000.csv
Is there a way to know what the output file name is after the .csv() call?
When you read a parquet file into a dataframe, it will have some number of partitions, since we are using distributed storage here. Similarly, when you save that dataframe as a csv file, it gets saved in a distributed manner, based on the number of partitions the dataframe had.
The path that you provide when writing the csv file becomes a folder with that name, and inside that folder you get multiple partition files. Each file holds some portion of the data, and when you combine all the partition files you get the entire content of the csv file.
Also, if you read that folder path back, you will see the entire content of the csv file. This is the default behaviour of how Spark and distributed computing work.
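There is no way to choose the name up front, but since Spark writes the csv part files with a predictable part-*.csv pattern into the output folder, you can look the name up after the write. A pure-Python sketch (the output folder below is simulated rather than produced by a real Spark job):

```python
import glob
import os
import tempfile

# Simulate the folder layout that df.write.csv('output') produces.
output = os.path.join(tempfile.mkdtemp(), "output")
os.makedirs(output)
open(os.path.join(output, "_SUCCESS"), "w").close()
part_name = "part-00006-80ba8022-33cb-4478-aab3-29f08efc160a-c000.csv"
with open(os.path.join(output, part_name), "w") as f:
    f.write("a,b\n1,2\n")

# After the write completes, find the csv file(s) by pattern.
csv_files = glob.glob(os.path.join(output, "part-*.csv"))
print(csv_files)
```

With coalesce(1) there will be exactly one match; otherwise you get one entry per partition.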
I would like to create a DataFrame from many small files located in the same directory. I plan to use read.csv from pyspark.sql. I've learned that in RDD world, textFile function is designed for reading small number of large files, whereas wholeTextFiles function is designed for reading a large number of small files (e.g. see this thread). Does read.csv use textFile or wholeTextFiles under the hood?
Yes, that's possible. Just give the path up to the parent directory, as in
df = spark.read.csv('path until the parent directory where the files are located')
And you should get all the files read into one dataframe. If the files don't have the same number of csv fields per row, the number of columns is taken from the file that has the maximum number of fields in a line.
I use the saveAsTextFile method to save an RDD, but it is not saved in a single file; instead there are many part files, as in the following picture.
So, my question is: how do I reload these files into one RDD?
My guess is you are trying to use Spark locally, rather than in a distributed manner. When you use saveAsTextFile, it is just saving these using Hadoop's file writer and creating one file per RDD partition. One thing you could do is coalesce the partitions to 1 before writing if you want a single file. But if you go up one folder, you will find that the folder's name is the one you saved to. So you can just sc.textFile using that same path and it will pull everything back into the partitions once again.
You know what? I just found it very elegant:
say your files are all in the /output directory; just use the following command to merge them into one, and then you can easily reload it as one RDD:
hadoop fs -getmerge /output /local/file/path
Not a big deal, I'm Leifeng.
I have a dataframe and I am going to write it to a .csv file in S3.
I use the following code:
df.coalesce(1).write.csv("dbfs:/mnt/mount1/2016//product_profit_weekly",mode='overwrite',header=True)
It puts a .csv file in the product_profit_weekly folder. At the moment the .csv file has a weird name in S3. Is it possible to choose a file name when I write it?
All Spark dataframe writers (df.write.___) don't write to a single file; they write one chunk per partition. I imagine what you get is a directory called
dbfs:/mnt/mount1/2016/product_profit_weekly
and one file inside called
part-00000
In this case, you are doing something that could be quite inefficient and not very "sparky": you are coalescing all dataframe partitions into one, meaning that your task isn't actually executed in parallel!
Here's a different model: to take advantage of all of Spark's parallelization, DON'T coalesce; write in parallel to some directory instead.
If you have 100 partitions, you will get:
part-00000
part-00001
...
part-00099
If you need everything in one flat file, write a little function to merge it after the fact. You could either do this in scala, or in bash with:
cat ${dir}/part-* > $flatFilePath
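The same merge can equally be a small Python helper instead of bash. A sketch, with the part files simulated rather than produced by a real job (merge_parts is a hypothetical name):

```python
import os
import tempfile

def merge_parts(part_dir, flat_file_path):
    """Concatenate every part-* file in part_dir into one flat file."""
    part_files = sorted(
        name for name in os.listdir(part_dir) if name.startswith("part-")
    )
    with open(flat_file_path, "w") as out:
        for name in part_files:
            with open(os.path.join(part_dir, name)) as part:
                out.write(part.read())

# Simulate a Spark output directory with two partitions.
d = tempfile.mkdtemp()
with open(os.path.join(d, "part-00000"), "w") as f:
    f.write("row1\n")
with open(os.path.join(d, "part-00001"), "w") as f:
    f.write("row2\n")

merged = os.path.join(d, "flat.txt")
merge_parts(d, merged)
print(open(merged).read())  # prints "row1" and "row2" on separate lines
```

Sorting the part file names keeps the rows in partition order, matching what the shell glob does.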
When I run a Spark job and save the output as a text file using the "saveAsTextFile" method, as specified at https://spark.apache.org/docs/0.9.1/api/core/index.html#org.apache.spark.rdd.RDD :
Here are the files that are created:
Is the .crc file a Cyclic Redundancy Check file, used to check that the content of each generated file is correct?
The _SUCCESS file is always empty, what does this signify ?
The files that do not have an extension in above screenshot contain the actual data from the RDD but why are many files generated instead of just one ?
Those are files generated by the underlying Hadoop API that Spark calls when you invoke saveAsTextFile().
part- files: These are your output data files.
You will have one part- file per partition in the RDD you called saveAsTextFile() on. Each of these files will be written out in parallel, up to a certain limit (typically, the number of cores on the workers in your cluster). This means your output will be written much faster than it would be if it were all put into a single file, assuming your storage layer can handle the bandwidth.
You can check the number of partitions in your RDD, which should tell you how many part- files to expect, as follows:
# PySpark
# Get the number of partitions of my_rdd.
my_rdd.getNumPartitions()
_SUCCESS file: The presence of an empty _SUCCESS file simply means that the operation completed normally.
.crc files: I have not seen the .crc files before, but yes, presumably they are checksums for the part- files.