Spark: reading many files with read.csv - apache-spark

I would like to create a DataFrame from many small files located in the same directory. I plan to use read.csv from pyspark.sql. I've learned that in the RDD world, the textFile function is designed for reading a small number of large files, whereas the wholeTextFiles function is designed for reading a large number of small files (e.g. see this thread). Does read.csv use textFile or wholeTextFiles under the hood?

Yes, that's possible. Just give the path up to the parent directory:
df = spark.read.csv('path until the parent directory where the files are located')
You should then get all the files read into one DataFrame. If the files don't all have the same number of CSV fields per line, the number of columns is taken from the file that has the maximum number of fields in a line.

Related

Find out target csv file name for a Spark DataFrame.write.csv() call

In a pyspark session when I do this:
df = spark.read.parquet(file)
df.write.csv('output')
it creates a directory called output with a bunch of files, one of which is the target CSV file with an unpredictable name, for example: part-00006-80ba8022-33cb-4478-aab3-29f08efc160a-c000.csv
Is there a way to know what the output file name is after the .csv() call?
When you read a Parquet file into a DataFrame, it is split into a number of partitions, because Spark works on distributed storage. Similarly, when you save that DataFrame as CSV, it is saved in a distributed manner, based on the number of partitions the DataFrame had.
The path that you provide when writing the CSV becomes a folder with that name, and Spark writes multiple part files inside it. Each file holds a portion of the data; combining all the part files gives you the entire content of the CSV output.
Also, if you read that folder path, you will see the entire content of the CSV data. This is the default behaviour of Spark and of distributed computing in general.
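To answer the "what is the file name" part directly, one option is to list the output directory after the write has finished. The sketch below is a minimal illustration that assumes the output directory is on the local filesystem of the driver (not S3 or HDFS) and that the files are uncompressed, so they end in .csv. The helper name list_csv_part_files is hypothetical, not part of Spark.

```python
import glob
import os


def list_csv_part_files(output_dir):
    """Return the sorted paths of the part-*.csv files in a local output directory.

    Assumption: output_dir was produced by df.write.csv(output_dir) on a
    local filesystem, so marker files such as _SUCCESS and hidden .crc
    files sit next to the part files and must be skipped.
    """
    pattern = os.path.join(output_dir, "part-*.csv")
    return sorted(glob.glob(pattern))
```

After df.write.csv('output') returns, calling list_csv_part_files('output') on the driver would give one path per non-empty partition that Spark wrote.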

PySpark: Writing input files to separate output files without repartitioning

I have a sequence of very large daily gzipped files. I'm trying to use PySpark to re-save all the files in S3 in Parquet format for later use.
If for a single file (for example, 2012-06-01) I do:
dataframe = spark.read.csv('s3://mybucket/input/20120601.gz', schema=my_schema, header=True)
dataframe.write.parquet('s3://mybucket/output/20120601')
it works, but since gzip isn't splittable it runs on a single host and I get no benefit of using the cluster.
I tried reading in a chunk of files at once, and using partitionBy to write the output to daily files like this (for example, reading in a month):
dataframe = spark.read.csv('s3://mybucket/input/201206*.gz', schema=my_schema, header=True)
dataframe.write.partitionBy('dayColumn').parquet('s3://mybucket/output/')
This time, individual files are read in different executors like I want, but the executors later die and the process fails. I believe that since the files are so large, and the partitionBy is somehow using unnecessary resources (a shuffle?), it's crashing the tasks.
I don't actually need to re-partition my dataframe since this is just a 1:1 mapping. Is there any way to make each individual task write to a separate, explicitly named parquet output file?
I was thinking something like
def write_file(date):
    # get input/output locations from date
    dataframe = spark.read.csv(input_location, schema=my_schema, header=True)
    dataframe.write.parquet(output_location)
spark.sparkContext.parallelize(my_dates).foreach(write_file)
except this doesn't work since you can't broadcast the spark session to the cluster. Any suggestions?
Writing input files to separate output files without repartitioning
TL;DR This is what your code is already doing.
partitionBy is causing an unnecessary shuffle
No. DataFrameWriter.partitionBy doesn't shuffle at all.
it works, but since gzip isn't splittable
You can:
Drop compression completely - Parquet uses internal compression.
Use splittable compression like bzip2.
Unpack the files to a temporary storage before submitting the job.
If you are concerned about resources used by partitionBy (it might open a larger number of files for each executor thread) you can actually shuffle to improve performance - DataFrame partitionBy to a single Parquet file (per partition). A single file is probably too much, but
dataframe \
.repartition(n, 'dayColumn', 'someOtherColumn') \
.write.partitionBy('dayColumn') \
.save(...)
where someOtherColumn can be chosen to get reasonable cardinality, should improve things.
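The write_file idea from the question fails because the SparkSession only exists on the driver. A driver-side variant is possible, though: loop over the dates on the driver and submit one job per file, optionally from several threads so that multiple single-task gzip reads run at the same time. The following is only a sketch under assumptions: the path templates mirror the s3://mybucket/... examples above, and the helper names (paths_for, convert_day, convert_all) and the max_workers value are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Assumed layout, copied from the example paths in the question.
INPUT_TEMPLATE = "s3://mybucket/input/{date}.gz"
OUTPUT_TEMPLATE = "s3://mybucket/output/{date}"


def paths_for(date):
    """Map a date string such as '20120601' to its (input, output) locations."""
    return INPUT_TEMPLATE.format(date=date), OUTPUT_TEMPLATE.format(date=date)


def convert_day(spark, date, schema):
    """Read one gzipped CSV file and write it to its own Parquet directory."""
    input_location, output_location = paths_for(date)
    dataframe = spark.read.csv(input_location, schema=schema, header=True)
    dataframe.write.parquet(output_location)
    return output_location


def convert_all(spark, dates, schema, max_workers=4):
    """Submit one Spark job per date from driver-side threads.

    The threads only submit jobs; the reading and writing still happens
    on the executors. Results come back in the same order as `dates`.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda d: convert_day(spark, d, schema), dates))
```

Usage would look like convert_all(spark, my_dates, my_schema). Each gzip file is still read by a single task, so the gain comes only from processing several files concurrently, and each output location is named explicitly without any partitionBy.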

Writing a sparkdataframe to a .csv file in S3 and choose a name in pyspark

I have a dataframe and I am going to write it to a .csv file in S3.
I use the following code:
df.coalesce(1).write.csv("dbfs:/mnt/mount1/2016//product_profit_weekly",mode='overwrite',header=True)
It puts a .csv file in the product_profit_weekly folder. At the moment the .csv file has a weird name in S3. Is it possible to choose a file name when I write it?
Spark DataFrame writers (df.write.___) don't write to a single file; they write one chunk per partition. I imagine what you get is a directory called
dbfs:/mnt/mount1/2016//product_profit_weekly
and one file inside called
part-00000
In this case, you are doing something that could be quite inefficient and not very "sparky": you are coalescing all DataFrame partitions into one, meaning that your task isn't actually executed in parallel!
Here's a different model. To take advantage of Spark's parallelization, DON'T coalesce; write in parallel to some directory instead.
If you have 100 partitions, you will get:
part-00000
part-00001
...
part-00099
If you need everything in one flat file, write a little function to merge the parts after the fact. You could do this either in Scala, or in bash with:
cat "${dir}"/part-* > "$flatFilePath"
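Since the question uses PySpark with header=True, a plain concatenation would repeat the header line once per part file. As a rough Python sketch of the "little function to merge it" idea, the helper below (the name merge_part_files and its parameters are hypothetical) concatenates the part files and keeps only the first header. It assumes the part files are uncompressed text on a local filesystem, not on S3 or DBFS directly.

```python
import glob
import os
import shutil


def merge_part_files(part_dir, dest_path, has_header=True):
    """Concatenate local part-* files into one file, keeping a single header.

    Assumptions: every part file is plain text, and when has_header is True
    each part file starts with the same header line.
    """
    part_paths = sorted(glob.glob(os.path.join(part_dir, "part-*")))
    header_written = False
    with open(dest_path, "w", newline="") as dest:
        for part_path in part_paths:
            with open(part_path, "r", newline="") as part:
                if has_header:
                    # Consume this file's header; emit it only once overall.
                    header = part.readline()
                    if header and not header_written:
                        dest.write(header)
                        header_written = True
                # Copy the remaining rows unchanged.
                shutil.copyfileobj(part, dest)
    return dest_path
```

For example, merge_part_files('product_profit_weekly', 'product_profit_weekly.csv') would produce a single file with a name of your choosing, while the Spark write itself stays parallel.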

Spark: difference when read in .gz and .bz2

I normally read and write files in Spark using .gz, where the number of files should be the same as the number of RDD partitions, i.e. one giant .gz file will be read into a single partition. However, if I read in one single .bz2, would I still get one single giant partition? Or will Spark automatically split one .bz2 into multiple partitions?
Also, how do I know how many partitions there will be when Hadoop reads one bz2 file? Thanks!
However, if I read in one single .bz2, would I still get one single giant partition?
Or will Spark support automatic split one .bz2 to multiple partitions?
If you specify n partitions to read a bzip2 file, Spark will spawn roughly n tasks (n is a minimum) to read the file in parallel. The default value of n is sc.defaultMinPartitions, which is min(sc.defaultParallelism, 2). The number of partitions is the second argument in the call to textFile (docs).
one giant .gz file will read in to a single partition.
Please note that you can always do a
sc.textFile(myGiantGzipFile).repartition(desiredNumberOfPartitions)
to get the desired number of partitions after the file has been read.
Also, how do I know how many partitions it would be while Hadoop read in it from one bz2 file.
That would be yourRDD.partitions.size for the scala api or yourRDD.getNumPartitions() for the python api.
I didn't know why my test program ran on one executor. After some tests I think I understand it, like this:
by pySpark
// Load a DataFrame of users. Each line in the file is a JSON
// document, representing one row.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val user = sqlContext.read.json("users.json.bz2")

Saving in parquet format from multiple spark workers

I have a job that needs to save the result in parquet/avro format from all the worker nodes. Can I write a separate parquet file for each individual partition and read all the resulting files as a single table? Or is there a better way of going about this?
Input is divided into 96 partitions and result needs to be saved on HDFS. When I tried to save it as a file it created over a million small files.
You can do a repartition (or coalesce if you always want fewer partitions) to the desired number of partitions just before you call write. Your data will then be written into the same number of files. When you want to read in the data, you simply point to the folder with the files rather than to a specific file. Like this:
sqlContext.read.parquet("s3://my-bucket/path/to/files/")
