How to load lots of files into one RDD in Spark - apache-spark

I use the saveAsTextFile method to save an RDD, but the result is not a single file; instead the output directory contains many part files (part-00000, part-00001, and so on).
So my question is: how do I reload these files into one RDD?

My guess is that you are using Spark locally rather than in a distributed manner. When you use saveAsTextFile, Spark saves the output with Hadoop's file writer, creating one file per RDD partition. If you want a single file, you can coalesce the RDD to one partition before writing. But if you go up one folder, you will find that the folder's name is the path you saved to, so you can simply call sc.textFile on that same path and it will pull everything back into partitions.
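A minimal sketch of both steps (the path is a hypothetical placeholder; rdd and sc are assumed to exist):

rdd.coalesce(1).saveAsTextFile("hdfs:///output/my_result")  # one part file inside the folder

# Later: pointing textFile at the same folder reads every part file
# back in, partitioned across the cluster again.
reloaded = sc.textFile("hdfs:///output/my_result")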

You know what? I just found a very elegant way:
Say your files are all in the /output directory; just use the following command to merge them into one local file, and then you can easily reload it as one RDD:
hadoop fs -getmerge /output /local/file/path
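The merged file then lives on the local filesystem, so it can be reloaded in one call (a sketch; the file:// scheme is an assumption about where you want to read from):

merged = sc.textFile("file:///local/file/path")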
Not a big deal, I'm Leifeng.

Related

Putting many small files to HDFS to train/evaluate model

I want to extract the contents of some large tar.gz archives that contain millions of small files to HDFS. After the data has been uploaded, it should be possible to access individual files in the archive by their paths, and to list them. The most straightforward solution would be to write a small script that extracts these archives to some HDFS base folder. However, since HDFS is known not to deal particularly well with small files, I'm wondering how this solution can be improved. These are the potential approaches I found so far:
Sequence Files
Hadoop Archives
HBase
Ideally, I want the solution to play well with Spark, meaning that accessing the data with Spark should not be more complicated than it would be if the data were extracted to HDFS directly. What are your suggestions and experiences in this domain?
You can land the files into a landing zone and then process them into something useful.
zcat <infile> | hdfs dfs -put - /LandingData/
Then build a table on top of that 'landed' data. Use Hive or Spark.
Then write out a new table (in a new folder) using the format of Parquet or ORC.
Whenever you need to run analytics on the data, use this new table; it will perform well and avoid the small-file issue, confining it to a one-time load.
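A rough PySpark sketch of that one-time conversion; the paths, header option, and table location are illustrative assumptions, not from the original answer:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the raw 'landed' text data in one pass.
landed = spark.read.option("header", "true").csv("/LandingData/")

# Rewrite it once as Parquet: fewer, larger, columnar files.
landed.write.mode("overwrite").parquet("/warehouse/landed_parquet")

# All later analytics hit the Parquet table, not the small files.
df = spark.read.parquet("/warehouse/landed_parquet")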
Sequence files are a good way to handle Hadoop's small-files problem.

When writing to hdfs, how do I overwrite only the necessary folders?

So, I have this folder, let's call it /data.
And it has partitions in it, e.g.:
/data/partition1, /data/partition2.
I read new data from kafka, and imagine I only need to update /data/partition2. I do:
dataFrame
.write
.mode(SaveMode.Overwrite)
.partitionBy("date", "key")
.option("header", "true")
.format(format)
.save("/data")
and it successfully updates /data/partition2, but /data/partition1 is gone... How can I force Spark's SaveMode.Overwrite not to touch HDFS partitions that don't need to be updated?
You are using SaveMode.Overwrite, which deletes previously existing directories. You should instead use SaveMode.Append.
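In PySpark terms (the question's snippet is Scala), the fix would look roughly like this; mode("append") is the Python spelling of SaveMode.Append:

dataFrame.write \
    .mode("append") \
    .partitionBy("date", "key") \
    .option("header", "true") \
    .format(format) \
    .save("/data")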
NOTE: The append operation is not without cost. When you call save using append mode, Spark needs to ensure uniqueness of the file names so that it won't overwrite an existing file by accident. The more files you already have in the directory, the longer the save operation takes. If you are talking about a handful of files, then it's a very cost-effective operation. But if you have many terabytes of data in thousands of files in the original directory (which was my case), you should use a different approach.

PySpark: Writing input files to separate output files without repartitioning

I have a sequence of very large daily gzipped files. I'm trying to use PySpark to re-save all the files in S3 in Parquet format for later use.
If for a single file (for example, 2012-06-01) I do:
dataframe = spark.read.csv('s3://mybucket/input/20120601.gz', schema=my_schema, header=True)
dataframe.write.parquet('s3://mybucket/output/20120601')
it works, but since gzip isn't splittable it runs on a single host and I get no benefit from using the cluster.
I tried reading in a chunk of files at once and using partitionBy to write the output to daily files, like this (for example, reading in a month):
dataframe = spark.read.csv('s3://mybucket/input/201206*.gz', schema=my_schema, header=True)
dataframe.write.partitionBy('dayColumn').parquet('s3://mybucket/output/')
This time, the individual files are read by different executors as I want, but the executors later die and the process fails. I believe that because the files are so large, and because partitionBy is somehow using unnecessary resources (a shuffle?), the tasks are crashing.
I don't actually need to re-partition my dataframe, since this is just a 1:1 mapping. Is there any way to make each individual task write to a separate, explicitly named Parquet output file?
I was thinking something like
def write_file(date):
    # get input/output locations from date
    dataframe = spark.read.csv(input_location, schema=my_schema, header=True)
    dataframe.write.parquet(output_location)

spark.sparkContext.parallelize(my_dates).foreach(write_file)
except this doesn't work since you can't broadcast the spark session to the cluster. Any suggestions?
Writing input files to separate output files without repartitioning
TL;DR This is what your code is already doing.
partitionBy is causing an unnecessary shuffle
No. DataFrameWriter.partitionBy doesn't shuffle at all.
it works, but since gzip isn't splittable
You can:
Drop compression completely - Parquet uses internal compression.
Use splittable compression like bzip2.
Unpack the files to a temporary storage before submitting the job.
If you are concerned about resources used by partitionBy (it might open a larger number of files for each executor thread), you can actually shuffle to improve performance - see DataFrame partitionBy to a single Parquet file (per partition). A single file per partition is probably too much, but
dataframe \
.repartition(n, 'dayColumn', 'someOtherColumn') \
.write.partitionBy('dayColumn') \
.save(...)
where someOtherColumn can be chosen to get reasonable cardinality, should improve things.

Writing a sparkdataframe to a .csv file in S3 and choose a name in pyspark

I have a dataframe and I am going to write it to a .csv file in S3.
I use the following code:
df.coalesce(1).write.csv("dbfs:/mnt/mount1/2016//product_profit_weekly",mode='overwrite',header=True)
It puts a .csv file in the product_profit_weekly folder. At the moment the .csv file has a weird name in S3. Is it possible to choose a file name when I write it?
Spark DataFrame writers (df.write.___) don't write to a single file; they write one chunk per partition. I imagine what you get is a directory called
dbfs:/mnt/mount1/2016//product_profit_weekly
and one file inside called
part-00000
In this case, you are doing something that can be quite inefficient and not very "sparky": you are coalescing all dataframe partitions into one, meaning that your task isn't actually executed in parallel!
Here's a different model. To take advantage of all of Spark's parallelization, DON'T coalesce; write in parallel to some directory.
If you have 100 partitions, you will get:
part-00000
part-00001
...
part-00099
If you need everything in one flat file, write a little function to merge it after the fact. You could either do this in Scala, or in bash with:
cat ${dir}/part-* > $flatFilePath
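If you really do need a specific file name, one workaround is to rename the single part file afterwards through Hadoop's FileSystem API. This is only a sketch: spark._jvm and spark._jsc are internal PySpark attributes, and the target name is an assumption.

out_dir = "dbfs:/mnt/mount1/2016//product_profit_weekly"

jvm = spark._jvm
hconf = spark._jsc.hadoopConfiguration()
Path = jvm.org.apache.hadoop.fs.Path

fs = Path(out_dir).getFileSystem(hconf)
part_file = fs.globStatus(Path(out_dir + "/part-*"))[0].getPath()   # the lone part file
fs.rename(part_file, Path(out_dir + "/product_profit_weekly.csv"))  # give it a readable name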

How the input data is split in Spark?

I'm coming from a Hadoop background. In Hadoop, if we have an input directory that contains lots of small files, each mapper task picks one file at a time and operates on that single file (we can change this behaviour and have each mapper pick more than one file, but that's not the default behaviour). I wonder how that works in Spark. Does each Spark task pick files one by one, or...?
Spark behaves the same way as Hadoop working with HDFS, as in fact Spark uses the same Hadoop InputFormats to read the data from HDFS.
But your statement is wrong. Hadoop will take files one by one only if each of your files is smaller than a block size or if all the files are text and compressed with non-splittable compression (like gzip-compressed CSV files).
So Spark would do the same: for each of the small input files it creates a separate partition, and the first stage executed over your data has as many tasks as there are input files. This is why for small files it is recommended to use the wholeTextFiles function, as it creates far fewer partitions.
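A quick sketch of the difference (the path is a placeholder):

# Roughly one partition per small file, mirroring Hadoop's per-file splits.
per_file = sc.textFile("hdfs:///data/many_small_files/")

# (path, content) pairs packed into far fewer partitions.
grouped = sc.wholeTextFiles("hdfs:///data/many_small_files/")

print(per_file.getNumPartitions(), grouped.getNumPartitions())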
