Best file formats for S3 using Spark for ETL on EMR

We are planning to perform ETL processing using Spark with source data sitting on S3. The data volume for ETL processing is less than 100 million records. What is the best format to store data in S3 in this scenario, i.e. the best compression and file format (text, sequence, Parquet, etc.)?

ORC or Parquet for queries, compressed with Snappy. Avro is another general-purpose format, but far less efficient for Spark SQL queries, as you have to scan a lot more data.
Important: at the time of writing (June 2017), you cannot safely use S3 as a direct destination of Spark RDD/DataFrame save() calls. See Cloud Integration for an explanation. Write to HDFS, then copy to S3.
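A minimal sketch of that write-to-HDFS-then-copy pattern (the source path, output paths, and bucket names are placeholders; S3DistCp is EMR's bulk-copy tool):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("etl").getOrCreate()
val df = spark.read.json("s3a://source-bucket/raw/") // hypothetical source data

df.write
  .mode("overwrite")
  .option("compression", "snappy") // Snappy-compressed Parquet, per the advice above
  .parquet("hdfs:///tmp/etl-output")

// Copy to S3 out of band, e.g. with an S3DistCp step on EMR:
//   s3-dist-cp --src hdfs:///tmp/etl-output --dest s3://dest-bucket/etl-output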

Related

Would S3 Select speed up Spark analyses on Parquet files?

You can use S3 Select with Spark on Amazon EMR and with Databricks, but only for CSV and JSON files. I am guessing that S3 Select isn't offered for columnar file formats because it wouldn't help that much.
Let's say we have a data lake of people with first_name, last_name and country columns.
If the data is stored as CSV files and you run a query like peopleDF.select("first_name").distinct().count(), then S3 will transfer all the data for all the columns to the EC2 cluster to run the computation. This is really inefficient because we don't need all the last_name and country data to run this query.
If the data is stored as CSV files and you run the query with S3 Select, then S3 will only transfer the data in the first_name column to run the query.
spark
  .read
  .format("s3select")
  .schema(...)
  .options(...)
  .load("s3://bucket/filename")
  .select("first_name")
  .distinct()
  .count()
If the data is stored in a Parquet data lake and peopleDF.select("first_name").distinct().count() is run, then S3 will only transfer the data in the first_name column to the EC2 cluster. Parquet is a columnar file format, and this is one of its main advantages.
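For contrast, here is the equivalent plain Parquet read (the path is a placeholder); the columnar layout already limits what is read to the first_name column, without S3 Select:

spark
  .read
  .parquet("s3://bucket/people")
  .select("first_name")
  .distinct()
  .count()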
So based on my understanding, S3 Select wouldn't help speed up an analysis on a Parquet data lake because columnar file formats offer the S3 Select optimization out of the box.
I am not sure, though, because a coworker is certain I am wrong, and because S3 Select does support the Parquet file format. Can you please confirm that columnar file formats already provide the main optimization offered by S3 Select?
This is an interesting question. I don't have any real numbers, though I did write the S3 Select binding code in the hadoop-aws module. Amazon EMR has some values, as does Databricks.
For CSV I/O: yes, S3 Select will speed things up given aggressive filtering of the source data, e.g. many GB of data read but not much sent back. Why? Although the read is slower, you save on the limited bandwidth to your VM.
For Parquet, though, the workers split a large file into parts and schedule the work across them (assuming a splittable compression format like Snappy is used), so more than one worker can work on the same file. And they only read a fraction of the data (so the bandwidth benefit matters less), but they do seek around in that file (so you need to optimise the seek policy, or pay the cost of aborting and reopening HTTP connections).
I'm not convinced that Parquet reads done S3-side can beat a Spark cluster, if there's enough capacity in the cluster and you've tuned your s3a client settings (seek policy, thread pool size, HTTP pool size) for performance too.
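For reference, a sketch of how those s3a knobs can be set from Spark; the keys are the standard Hadoop s3a options, and the values here are illustrative rather than recommendations:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("s3a-tuned")
  // Seek policy: "random" suits columnar formats like Parquet/ORC
  .config("spark.hadoop.fs.s3a.experimental.input.fadvise", "random")
  // Thread pool size for the s3a client
  .config("spark.hadoop.fs.s3a.threads.max", "64")
  // HTTP connection pool size
  .config("spark.hadoop.fs.s3a.connection.maximum", "128")
  .getOrCreate()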
Like I said though: I'm not sure. Numbers are welcome.
Came across this Spark package for S3 Select on Parquet [1].
[1] https://github.com/minio/spark-select

Parquet with Athena vs Redshift

I hope someone out there can help me with this issue. I am currently working on a data pipeline project, and my current dilemma is whether to use Parquet with Athena or to store the data in Redshift.
Two scenarios:
First,
EVENTS --> STORE IT IN S3 AS JSON.GZ --> USE SPARK(EMR) TO CONVERT TO PARQUET --> STORE PARQUET BACK INTO S3 --> ATHENA FOR QUERY --> VIZ
Second,
EVENTS --> STORE IT IN S3 --> USE SPARK(EMR) TO STORE DATA INTO REDSHIFT
Issues with the second scenario:
Spark JDBC with Redshift is slow.
The Spark-Redshift repo by Databricks has a failing build and was last updated two years ago.
I am unable to find useful information on which method is better. Should I even use Redshift, or is Parquet good enough?
Also, it would be great if someone could tell me whether there are other methods for connecting Spark with Redshift, because I have only seen two solutions online: JDBC and Spark-Redshift (Databricks).
P.S. The pricing model is not a concern to me, and I'm dealing with millions of events.
Here are some ideas / recommendations:
Don't use JDBC.
Spark-Redshift works fine but is a complex solution.
You don't have to use Spark to convert to Parquet; there is also the option of using Hive. See https://docs.aws.amazon.com/athena/latest/ug/convert-to-columnar.html
Athena is great when used against Parquet, so you don't need to use Redshift at all.
If you want to use Redshift, then use Redshift Spectrum to set up a view against your Parquet tables, and then, if necessary, a CTAS within Redshift to bring the data in.
AWS Glue Crawler can be a great way to create the metadata needed to map the Parquet into Athena and Redshift Spectrum.
My proposed architecture:
EVENTS --> STORE IT IN S3 --> HIVE to convert to parquet --> Use directly in Athena
and/or
EVENTS --> STORE IT IN S3 --> HIVE to convert to parquet --> Use directly in Redshift using Redshift Spectrum
You MAY NOT need to convert to Parquet: if you use the right partitioning structure (S3 folders) and gzip the data, then Athena/Spectrum performance can be good enough without the complexity of conversion to Parquet. This is dependent on your use case (volumes of data and the types of query you need to run).
Which one to use depends on your data and access patterns. Athena directly uses S3 key structure to limit the amount of data to be scanned. Let's assume you have event type and time in events. The S3 keys could be e.g. yyyy/MM/dd/type/* or type/yyyy/MM/dd/*. The former key structure allows you to limit the amount of data to be scanned by date or date and type but not type alone. If you wanted to search only by type x but don't know the date, it would require a full bucket scan. The latter key schema would be the other way around. If you mostly need to access the data just one way (e.g. by time), Athena might be a good choice.
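A hedged sketch of producing the "type first" layout from Spark: partitionBy writes Hive-style key=value prefixes (type=click/year=2019/...), which Athena can also partition-prune on. The bucket and column names are placeholders, and df is assumed to be an existing DataFrame with those columns:

df.write
  .partitionBy("type", "year", "month", "day")
  .json("s3a://bucket/events/")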
On the other hand, Redshift is a PostgreSQL-based data warehouse which is much more complicated and flexible than Athena. The data partitioning plays a big role in terms of performance, but the schema can be designed in many ways to suit your use case. In my experience the best way to load data into Redshift is first to store it in S3 and then use COPY (https://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html). It is orders of magnitude faster than JDBC, which I found good only for testing with small amounts of data. This is also how Kinesis Firehose loads data into Redshift. If you don't want to implement the S3 copying yourself, Firehose provides an alternative for that.
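A sketch of that stage-to-S3-then-COPY pattern driven from Spark. The bucket, cluster endpoint, credentials, and IAM role are placeholders, and the Redshift JDBC driver is assumed to be on the classpath; only the COPY statement itself travels over JDBC, not the data:

import java.sql.DriverManager
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("redshift-load").getOrCreate()
val df = spark.read.json("s3a://source-bucket/events/") // hypothetical source

// Stage the data in S3 as gzipped CSV
df.write.mode("overwrite")
  .option("compression", "gzip")
  .csv("s3a://staging-bucket/events/")

// Issue the COPY over a plain JDBC connection; Redshift pulls from S3 in parallel
val conn = DriverManager.getConnection(
  "jdbc:redshift://example-cluster.abc.us-east-1.redshift.amazonaws.com:5439/dev",
  "user", "password")
try {
  conn.createStatement().execute(
    """COPY events
      |FROM 's3://staging-bucket/events/'
      |IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
      |CSV GZIP""".stripMargin)
} finally {
  conn.close()
}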
There are a few details missing from the question: how would you manage incremental upserts in the data pipeline?
If you have implemented a Slowly Changing Dimension (SCD type 1 or 2), it can't be managed using Parquet files, but it can easily be managed in Redshift.

Spark: Avro vs Parquet performance

Now that Spark 2.4 has built-in support for Avro format, I'm considering changing the format of some of the data sets in my data lake - those that are usually queried/joined for entire rows rather than specific column aggregations - from Parquet to Avro.
However, most of the work on top of the data is done via Spark, and to my understanding, Spark's in-memory caching and computations are done on columnar-formatted data. Does Parquet offer a performance boost in this regard, while Avro would incur some sort of data "transformation" penalty? What other considerations should I be aware of in this regard?
Both formats shine under different constraints but have things like strong types with schemas and a binary encoding in common. In its basic form it boils down to this differentiation:
Avro is a row-wise format. From this it follows that you can append row-by-row to an existing file. These row-wise appends are then also immediately visible to all readers that work on these files. Avro is best when you have a process that writes into your data lake in a streaming (non-batch) fashion.
Parquet is a columnar format and its files are not appendable. This means that for newly arriving records, you must always create new files. In exchange for this behaviour, Parquet brings several benefits. Data is stored in a columnar fashion, and compression and encoding (simple, type-aware, low-CPU but highly effective) are applied to each column. Thus Parquet files will be much smaller than Avro files. Parquet also writes out basic statistics, so when you load data, you can push down parts of your selection to the I/O layer; then only the necessary set of rows is loaded from disk. And as Parquet data is already laid out column-wise and most in-memory structures are also columnar, loading it is in general much faster.
As you already have your data and the ingestion process tuned to write Parquet files, it's probably best for you to stay with Parquet as long as data ingestion (latency) does not become a problem for you.
A typical usage is actually to have a mix of Parquet and Avro. Recent, freshly arrived data is stored as Avro files, as this makes the data immediately available to the data lake. More historic data is transformed on, e.g., a daily basis into Parquet files, as they are smaller and more efficient to load, but can only be written in batches. While working with this data, you would load both into Spark as a union of two tables, as sketched below. Thus you have the benefit of efficient reads with Parquet combined with the immediate availability of data with Avro. This pattern is often hidden by table formats like Uber's Hudi or Apache Iceberg (incubating), which was started by Netflix.
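A minimal sketch of that union, assuming an existing SparkSession (spark), placeholder paths, and Spark 2.4+ for the built-in Avro source:

val recent   = spark.read.format("avro").load("s3a://lake/events/recent/")
val historic = spark.read.parquet("s3a://lake/events/historic/")
val events   = recent.unionByName(historic) // query both as a single table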

Parquet vs Database

I am trying to understand which of the two options below would be the better choice, especially in a Spark environment:
Loading the Parquet files directly into a DataFrame and accessing the data (a 1 TB table)
Using a database to store and access the data.
I am working on a data pipeline design and trying to understand which of the above two options will result in a more optimized solution.
Loading the Parquet files directly into a DataFrame and accessing the data is more scalable than reading from an RDBMS like Oracle through a JDBC connector. I handle more than 10 TB of data, though I prefer the ORC format for better performance. I suggest reading the data directly from files; the reason is data locality: if you run your Spark executors on the same hosts where the HDFS data nodes are located, they can effectively read data into memory without network overhead. See https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-data-locality.html and How does Apache Spark know about HDFS data nodes? for more details.

Spark with Avro, Kryo and Parquet

I'm struggling to understand what exactly Avro, Kryo and Parquet do in the context of Spark. They are all related to serialization, but I've seen them used together, so they can't be doing the same thing.
Parquet describes itself as a columnar storage format, and I kind of get that, but when I'm saving a Parquet file, can Avro or Kryo have anything to do with it? Or are they only relevant during the Spark job, i.e. for sending objects over the network during a shuffle or spilling to disk? How do Avro and Kryo differ, and what happens when you use them together?
Parquet works very well when you need to read only a few columns when querying your data. However, if your schema has lots of columns (30+) and your queries/jobs need to read all of them, then record-based formats (like Avro) will work better/faster.
Another limitation of Parquet is that it is essentially a write-once format. So usually you need to collect data in some staging area and write it to a Parquet file once a day (for example).
This is where you might want to use Avro. E.g. you can collect Avro-encoded records in a Kafka topic or in local files and have a batch job that converts all of them to Parquet files at the end of the day, as sketched below. This is fairly easy to implement thanks to the parquet-avro library, which provides tools to convert between the Avro and Parquet formats automatically.
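A sketch of that end-of-day conversion job, done here with Spark's built-in Avro source (2.4+) rather than parquet-avro directly; the paths are placeholders and spark is an existing SparkSession:

spark.read.format("avro")
  .load("s3a://lake/staging/avro/2019-01-01/")
  .write.mode("overwrite")
  .parquet("s3a://lake/warehouse/events/date=2019-01-01/")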
And of course you can use Avro outside of Spark/big data. It is a fairly good serialization format, similar to Google Protobuf or Apache Thrift.
This very good blog post explains the details for everything but Kryo.
http://grepalex.com/2014/05/13/parquet-file-format-and-object-model/
Kryo would be used for fast serialization not involving permanent storage, such as shuffle data and cached data, in memory or on disk as temp files.
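A common way to switch Spark's internal (shuffle/cache) serialization to Kryo; the registered class is a hypothetical example:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("kryo-example")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("spark.kryo.classesToRegister", "com.example.MyRecord") // hypothetical class
  .getOrCreate()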
