Spark with Avro, Kryo and Parquet - apache-spark

I'm struggling to understand what exactly Avro, Kryo and Parquet do in the context of Spark. They all are related to serialization but I've seen them used together so they can't be doing the same thing.
Parquet describes its self as a columnar storage format and I kind of get that but when I'm saving a parquet file can Arvo or Kryo have anything to do with it? Or are they only relevant during the spark job, ie. for sending objects over the network during a shuffle or spilling to disk? How do Arvo and Kryo differ and what happens when you use them together?

Parquet works very well when you need to read only a few columns when querying your data. However if your schema has lots of columns (30+) and in your queries/jobs you need to read all of them then record based formats (like AVRO) will work better/faster.
Another limitation of Parquet is that it is essentially write-once format. So usually you need to collect data in some staging area and write it to a parquet file once a day (for example).
This is where you might want to use AVRO. E.g. you can collect AVRO-encoded records in a Kafka topic or local files and have a batch job that converts all of them to Parquet file at the end of the day. This is fairly easy to implement thanks to parquet-avro library that provides tools to convert between AVRO and Parquet formats automatically.
And of course you can use AVRO outside of Spark/BigData. It is fairly good serialization format similar to Google Protobuf or Apache Thrift.

This very good blog post explains the details for everything but Kryo.
http://grepalex.com/2014/05/13/parquet-file-format-and-object-model/
Kryo would be used for fast serialization not involving permanent storage, such as shuffle data and cached data, in memory or on disk as temp files.

Related

S3 and Spark: File size and File format best practices

I need to read data (originating from a RedShift table with 5 columns, total size of the table is on the order of 500gb - 1tb) from S3 into Spark via PySpark for a daily batch job.
Are there any best practices around:
Preferred File Formats for how I store my data in S3? (does the format even matter?)
Optimal file size?
Any resources/links that can point me in the right direction would also work.
Thanks!
This blog post has some great info on the subject:
https://mapr.com/blog/tips-and-best-practices-to-take-advantage-of-spark-2-x/
Look at the section titled: Use the Best Data Store for Your Use Case
From personal experience, I prefer using parquet in most scenarios, because I’m usually writing the data out once, and then reading it many times (for analytics).
In terms of numbers of files, I like to have between 200 and 1,000. This allows clusters of all sizes to read and write in parallel, and allows my reading of the data to be efficient because with parquet I can zoom in on just the file I’m interested in. If you have too many files, there is a ton of overhead in spark remembering all the file names and locations, and if you have too few files, it can’t parallelize your reads and writes effectively.
File size I have found to be less important than number of files, when using parquet.
EDIT:
Here’s a good section from that blog post that describes why I like to use parquet:
Apache Parquet gives the fastest read performance with Spark. Parquet arranges data in columns, putting related values in close proximity to each other to optimize query performance, minimize I/O, and facilitate compression. Parquet detects and encodes the same or similar data, using a technique that conserves resources. Parquet also stores column metadata and statistics, which can be pushed down to filter columns (discussed below). Spark 2.x has a vectorized Parquet reader that does decompression and decoding in column batches, providing ~ 10x faster read performance.

Spark: Avro vs Parquet performance

Now that Spark 2.4 has built-in support for Avro format, I'm considering changing the format of some of the data sets in my data lake - those that are usually queried/joined for entire rows rather than specific column aggregations - from Parquet to Avro.
However, most of the work on top of the data is done via Spark, and to my understanding, Spark's in-memory caching and computations are done on columnar-formatted data. Does Parquet offer a performance boost in this regard, while Avro would incur some sort of data "transformation" penalty? What other considerations should I be aware of in this regard?
Both formats shine under different constraints but have things like strong types with schemas and a binary encoding in common. In its basic form it boils down to this differentiation:
Avro is a row-wise format. From this it follows that you can append row-by-row to an existing file. These row-wise appends are then also immediately visible to all readers that work on these files. Avro is best when you have a process that writes into your data lake in a streaming (non-batch) fashion.
Parquet is a columnar format and its files are not appendable. This means that for new arriving records, you must always create new files. In exchange for this behaviour Parquet brings several benefits. Data is stored in a columnar fashion and compression and encoding (simple type-aware, low-cpu but highly effective compression) is applied to each column. Thus Parquet files will be much smaller than Avro files. Also Parquet writes out basic statistics that when you load data from it, you can push down parts of your selection to the I/O. Then only the necessary set of rows is loaded from disk. As Parquet is already in a columnar fashion and most in-memory structures will also be columnar, loading data from them is in general much faster.
As you already have your data and the ingestion process tuned to write Parquet files, it's probably best for you to stay with Parquet as long as data ingestion (latency) does not become a problem for you.
A typical usage is actually to have a mix of Parquet and Avro. Recent, freshly arrived data is stored as Avro files as this makes the data immediately available to the data lake. More historic data is transformed on e.g. a daily basis into Parquet files as they are smaller and most efficient to load but can only be written in batches. While working with this data, you would load both into Spark as a union of two tables. Thus you have the benefit of efficient reads with Parquet combined with the immediate availability of data with Avro. This pattern is often hidden by table formats like Uber's Hudi or Apache Iceberg (incubating) which was started by Netflix.

Partitioning strategy in Parquet and Spark

I have a job that reads csv files , converts it into data frames and writes in Parquet. I am using append mode while writing the data in Parquet. With this approach, in each write a separate Parquet file is getting generated. My questions are :
1) If every time I write the data to Parquet schema ,a new file gets
appended , will it impact read performance (as the data is now
distributed in varying length of partitioned Parquet files)
2) Is there a way to generate the Parquet partitions purely based on
the size of the data ?
3) Do we need to think to a custom partitioning strategy to implement
point 2?
I am using Spark 2.3
It will affect read performance if
spark.sql.parquet.mergeSchema=true.
In this case, Spark needs to visit each file and grab schema from
it.
In other cases, I believe it does not affect read performance much.
There is no way generate purely on data size. You may use
repartition or coalesce. Latter will created uneven output
files, but much performant.
Also, you have config spark.sql.files.maxRecordsPerFile or option
maxRecordsPerFile to prevent big size of files, but usually it is
not an issue.
Yes, I think Spark has not built in API to evenly distribute by data
size. There are Column
Statistics
and Size
Estimator may help with this.

Best file formats for S3 using Spark for ETL on EMR

We are planning to perform ETL processing using Spark with source data sitting on S3. The data volume for ETL processing is less than 100 million. What is the best format to store data in S3 in this scenario i.e. the best compression and file format (text, sequence, parquet etc.)
ORC or Parquet for queries, compressed with Snappy. Avro is another general purpose format, but way less efficient for SparkSQL queries as you have to scan a lot more data.
Important At the time of writing (June 2017), you cannot safely use S3 as a direct destination of spark RDD/dataframe queries (i.e. save()) calls. See Cloud Integration for an explanation. Write to HDFS then copy

Spark write.avro creates individual avro files

I have a spark-submit job I wrote that reads an in directory of json docs, does some processing on them using data frames, and then writes to an out directory. For some reason, though, it creates individual avro, parquet or json files when I use df.save or df.write methods.
In fact, I even used the saveAsTable method and it did the same thing with parquet.gz files in the hive warehouse.
It seems to me that this is inefficient and negates the use of a container file format. Is this right? Or is this normal behavior and what I'm seeing just an abstraction in HDFS?
If I am right that this is bad, how do I write the data frame from many files into a single file?
As #zero323 told its normal behavior due to many workers(to support fault tolerance).
I would suggest you to write all the records in parquet or avro file which has avro generic record using something like this
dataframe.write().mode(SaveMode.Append).
format(FILE_FORMAT).partitionBy("parameter1", "parameter2").save(path);
but it wont write in to single file but it will group similar kind of Avro Generic records to one file(may be less number of medium sized) files

Resources