is it possible in spark to read large s3 csv files in parallel? - apache-spark

Typically Spark output is saved in multiple parts, allowing each worker to read a different file.
Is there a similar solution when working with a single file?
S3 provides the Select API, which should allow this kind of behaviour.
Spark appears to support this API (https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-s3select.html), but it appears to relate only to optimising queries, not to parallelising reads.

S3 Select is unrelated to your use case.
S3 Select: the SQL select and project are executed in the S3 store itself, so the client receives pre-filtered data. The result is returned as CSV with the header stripped, or as JSON. You cannot then have more than one worker target this. (You could try, but each worker would have to read in and discard all the data in the run-up to its offset, and predicting the ranges each worker can process is essentially impossible.)
You: have > 1 worker process different parts of a file which has been partitioned
Partitioning large files into smaller parts for parallel processing is exactly what Spark (and mapreduce, hive etc) do for any format where it makes sense.
CSV files are easily partitioned provided they are uncompressed or use a splittable compression format (snappy, but not gzip).
All that's needed is to tell Spark what the split threshold is. For S3A, set fs.s3a.block.size to a value it can split on; your queries against CSV, Avro, ORC, Parquet and similar will then be split up amongst workers.
Unless your workers are doing a lot of computation per row, there's a minimum block size below which this isn't even worth doing. Experiment.
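As a sketch of where that threshold lives (the values below are illustrative, not recommendations; tune them for your data), the settings can go in spark-defaults.conf or be passed with the spark.hadoop. prefix on the session builder:

```
# spark-defaults.conf -- illustrative values, tune for your workload
spark.hadoop.fs.s3a.block.size       134217728   # 128 MB: block size S3A reports, used for splits
spark.sql.files.maxPartitionBytes    134217728   # cap on bytes packed into one input partition
```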

Edit: this is now out of date, and depends on the type of CSV. Some CSVs allow newlines within column values; these are unsplittable. CSVs that guarantee a newline only ever marks a new row can be split.
FYI, CSVs are inherently single-threaded to read. There is no extra information in a CSV file that tells the reader where any row starts, short of reading the whole file from the start.
If you want multiple readers on the same file, use a format like Parquet, which has row groups whose start positions are explicitly recorded in the footer, so they can be read by independent readers. When Spark reads a Parquet file, it splits the row groups out into separate tasks. Ultimately, having appropriately sized files is very important for Spark performance.
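The row-boundary problem can be illustrated in plain Python (a sketch; the sample data is invented): a reader dropped at an arbitrary byte offset cannot tell a quoted newline inside a field from a record boundary, so it has to parse from the start.

```python
import csv
import io

# A CSV where one field legally contains an embedded newline (quoted).
data = 'id,comment\n1,"line one\nline two"\n2,plain\n'

# Naive splitting on newlines -- effectively what a worker starting
# mid-file would do -- miscounts the records.
naive_rows = data.rstrip('\n').split('\n')       # 4 "rows"

# A real CSV parser reading from the start finds the true 3 records
# (header plus two data rows).
true_rows = list(csv.reader(io.StringIO(data)))  # 3 rows
```

This is why Parquet's footer-indexed row groups, rather than any in-band marker, are what make independent readers possible.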

Related

S3 and Spark: File size and File format best practices

I need to read data (originating from a Redshift table with 5 columns; total table size is on the order of 500 GB to 1 TB) from S3 into Spark via PySpark for a daily batch job.
Are there any best practices around:
Preferred File Formats for how I store my data in S3? (does the format even matter?)
Optimal file size?
Any resources/links that can point me in the right direction would also work.
Thanks!
This blog post has some great info on the subject:
https://mapr.com/blog/tips-and-best-practices-to-take-advantage-of-spark-2-x/
Look at the section titled: Use the Best Data Store for Your Use Case
From personal experience, I prefer using parquet in most scenarios, because I’m usually writing the data out once, and then reading it many times (for analytics).
In terms of numbers of files, I like to have between 200 and 1,000. This allows clusters of all sizes to read and write in parallel, and allows my reading of the data to be efficient because with parquet I can zoom in on just the file I’m interested in. If you have too many files, there is a ton of overhead in Spark remembering all the file names and locations, and if you have too few files, it can’t parallelize your reads and writes effectively.
File size I have found to be less important than number of files, when using parquet.
EDIT:
Here’s a good section from that blog post that describes why I like to use parquet:
Apache Parquet gives the fastest read performance with Spark. Parquet arranges data in columns, putting related values in close proximity to each other to optimize query performance, minimize I/O, and facilitate compression. Parquet detects and encodes the same or similar data, using a technique that conserves resources. Parquet also stores column metadata and statistics, which can be pushed down to filter columns (discussed below). Spark 2.x has a vectorized Parquet reader that does decompression and decoding in column batches, providing ~ 10x faster read performance.
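One way to make the file-count advice above operational is a small helper that derives a repartition count from total data size and a target file size, clamped to the preferred 200 to 1,000 range. This is a hypothetical heuristic, not from the blog post; the 256 MB target is an assumption:

```python
def target_file_count(total_bytes, target_file_bytes=256 * 1024 ** 2,
                      lo=200, hi=1000):
    """Clamp (total size / target file size) into the preferred file-count range."""
    n = max(1, total_bytes // target_file_bytes)
    return max(lo, min(hi, n))

# e.g. ~600 GB of data aimed at ~256 MB files -> 2400, clamped down to 1000;
# the result would be passed to df.repartition(n) before the write.
```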

Pyspark SQL job slowed down by reading fixed width instead of parquet

I have a script that consists of several joins and a few other basic operations. When I was reading parquet format data the script would run and write the new parquet data quickly. I recently updated the script to accept fixed width data and determine the columns based on given specifications.
This has slowed down the script by a factor of 10.
I have tried playing with the spark configs and different partitions of the data but the runtime is still abysmal.
Parquet and ORC are formats optimized for columnar reading/writing, and therefore work well with Spark SQL DataFrames.
Plaintext CSV/TSV (and fixed-width) are much slower by design, as entire rows of data need to be parsed and extracted again and again during processing.
There's likely nothing you're doing wrong here. If you need fast processing speeds overall, you'll need a proper database that can optimize your queries, rather than processing raw files.
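To make the per-row cost concrete, here is a minimal sketch of fixed-width parsing in plain Python (the column spec and sample record are invented for illustration): every field of every row must be sliced and stripped, work that a columnar format does once at write time.

```python
# Hypothetical column spec: (name, start, end) character offsets.
SPEC = [("id", 0, 4), ("name", 4, 14), ("amount", 14, 22)]

def parse_fixed_width(line, spec=SPEC):
    """Slice one fixed-width record into a dict of stripped fields."""
    return {name: line[start:end].strip() for name, start, end in spec}

# Build a sample 22-character record matching the spec above.
record = "0001" + "Alice".ljust(10) + "19.99".rjust(8)
row = parse_fixed_width(record)
```

In Spark this slicing typically shows up as a chain of substring expressions applied to every row, which is exactly the repeated extraction the answer describes.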

Partitioning strategy in Parquet and Spark

I have a job that reads CSV files, converts them into data frames, and writes them out in Parquet. I am using append mode when writing the data. With this approach, each write generates a separate Parquet file. My questions are:
1) If a new file gets appended every time I write data to the Parquet schema, will it impact read performance (as the data is now distributed across partitioned Parquet files of varying length)?
2) Is there a way to generate the Parquet partitions purely based on the size of the data?
3) Do we need to think about a custom partitioning strategy to implement point 2?
I am using Spark 2.3
It will affect read performance if spark.sql.parquet.mergeSchema=true. In this case, Spark needs to visit each file and grab the schema from it. In other cases, I believe it does not affect read performance much.
There is no way to generate files purely based on data size. You may use repartition or coalesce; the latter will create uneven output files, but is more performant.
You also have the config spark.sql.files.maxRecordsPerFile, or the writer option maxRecordsPerFile, to prevent files growing too large, but usually that is not an issue.
Yes; I think Spark has no built-in API to evenly distribute by data size. Column statistics and SizeEstimator may help with this.
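As a sketch of the maxRecordsPerFile knob mentioned above (the threshold is illustrative; pick one that matches your row width), it can be set either session-wide or per write:

```
# Session-wide, in spark-defaults.conf:
spark.sql.files.maxRecordsPerFile  1000000

# Or per write, as a DataFrameWriter option:
#   df.write.option("maxRecordsPerFile", 1000000).parquet(path)
```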

PySpark: Writing input files to separate output files without repartitioning

I have a sequence of very large daily gzipped files. I'm trying to use PySpark to re-save all the files in S3 in Parquet format for later use.
If for a single file (for example, 2012-06-01) I do:
dataframe = spark.read.csv('s3://mybucket/input/20120601.gz', schema=my_schema, header=True)
dataframe.write.parquet('s3://mybucket/output/20120601')
it works, but since gzip isn't splittable it runs on a single host and I get no benefit of using the cluster.
I tried reading in a chunk of files at once, and using partitionBy to write the output to daily files like this (for example, reading in a month):
dataframe = spark.read.csv('s3://mybucket/input/201206*.gz', schema=my_schema, header=True)
dataframe.write.partitionBy('dayColumn').parquet('s3://mybucket/output/')
This time, individual files are read in different executors as I want, but the executors later die and the process fails. I believe that since the files are so large, and partitionBy is somehow using unnecessary resources (a shuffle?), it's crashing the tasks.
I don't actually need to re-partition my dataframe, since this is just a 1:1 mapping. Is there any way to make each individual task write to a separate, explicitly named Parquet output file?
I was thinking something like
def write_file(date):
    # get input/output locations from date
    dataframe = spark.read.csv(input_location, schema=my_schema, header=True)
    dataframe.write.parquet(output_location)

spark.sparkContext.parallelize(my_dates).foreach(write_file)
except this doesn't work since you can't broadcast the spark session to the cluster. Any suggestions?
Writing input files to separate output files without repartitioning
TL;DR This is what your code is already doing.
partitionBy is causing an unnecessary shuffle
No. DataFrameWriter.partitionBy doesn't shuffle at all.
it works, but since gzip isn't splittable
You can:
Drop compression completely - Parquet uses internal compression.
Use splittable compression like bzip2.
Unpack the files to a temporary storage before submitting the job.
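A minimal local sketch of the recompression route (the file names are made up, and the S3 copy steps are omitted): round-trip a gzipped CSV into bzip2, whose block structure Spark can split.

```shell
# Create a tiny CSV, gzip it, then recompress as splittable bzip2.
printf 'a,b\n1,2\n' > sample.csv
gzip -c sample.csv > sample.csv.gz                 # gzip: not splittable
gunzip -c sample.csv.gz | bzip2 > sample.csv.bz2   # bzip2: splittable
bzcat sample.csv.bz2                               # contents survive the round trip
```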
If you are concerned about resources used by partitionBy (it might open a large number of files for each executor thread), you can actually shuffle to improve performance: DataFrame partitionBy to a single Parquet file per partition. A single file is probably too much, but
dataframe \
.repartition(n, 'dayColumn', 'someOtherColumn') \
.write.partitionBy('dayColumn') \
.save(...)
where someOtherColumn can be chosen to get reasonable cardinality, should improve things.

Saving in parquet format from multiple spark workers

I have a job that needs to save the result in Parquet/Avro format from all the worker nodes. Can I write a separate Parquet file for each individual partition and read all the resulting files as a single table? Or is there a better way of going about this?
Input is divided into 96 partitions and result needs to be saved on HDFS. When I tried to save it as a file it created over a million small files.
You can repartition (or coalesce, if you always want fewer partitions) to the desired number of partitions just before you call write. Your data will then be written as that number of files. When you want to read the data back, simply point to the folder containing the files rather than to a specific file, like this:
sqlContext.read.parquet("s3://my-bucket/path/to/files/")
