RDD vs. Dataset for Spark production code

Is there any industry guideline on whether to write a Spark project with RDDs or with Datasets?
So far, what's obvious to me:
RDD, more type safety, less optimization (in the sense of Spark SQL)
Dataset, less type safety, more optimization
Which one is recommended in production code? There seems to be no such topic on Stack Overflow so far, even though Spark has been prevalent for the past few years.
I can already foresee that the majority of the community sides with Dataset :), so let me first quote a vote against it from this answer (and please do share opinions against it):
Personally, I find statically typed Dataset to be the least useful:
They don't provide the same range of optimizations as Dataset[Row] (although they share the storage format and some execution plan optimizations, they don't fully benefit from code generation or off-heap storage), nor access to all the analytical capabilities of the DataFrame.
They are not as flexible as RDDs, with only a small subset of types supported natively.
"Type safety" with Encoders is disputable when a Dataset is converted using the as method. Because the data shape is not encoded in a signature, the compiler can only verify the existence of an Encoder.

Here is an excerpt from "Spark: The Definitive Guide" to answer this:
When to Use the Low-Level APIs?
You should generally use the lower-level APIs in three situations:
You need some functionality that you cannot find in the higher-level APIs; for example, if you need very tight control over physical data placement across the cluster.
You need to maintain some legacy codebase written using RDDs.
You need to do some custom shared variable manipulation.
https://www.oreilly.com/library/view/spark-the-definitive/9781491912201/ch12.html
In other words: if you don't run into one of the situations above, it is generally better to use the higher-level APIs (Datasets/DataFrames).

RDD limitations:
i. No built-in optimization engine
RDDs have no provision for automatic optimization. They cannot make use of Spark's advanced optimizers such as the Catalyst optimizer and the Tungsten execution engine; each RDD has to be optimized manually.
Dataset and DataFrame overcome this limitation: both use Catalyst to generate optimized logical and physical query plans. The same optimizer serves the R, Java, Scala, and Python DataFrame/Dataset APIs, which gives both space and speed efficiency.
ii. Runtime type safety
RDDs offer no static typing or run-time type safety, so they do not let us check for errors at runtime.
Dataset provides compile-time type safety for building complex data workflows. Compile-time type safety means that if you try to add an element of the wrong type to a typed collection, you get a compile-time error. It helps detect errors at compile time and makes your code safer.
iii. Degradation when there is not enough memory
RDD performance degrades when there is not enough memory to store the RDD in memory or on disk; a storage problem arises when memory runs short. The partitions that overflow from RAM can be stored on disk and will provide the same level of performance. Increasing the size of RAM and disk makes it possible to overcome this issue.
iv. Performance limitation & Overhead of serialization & garbage collection
Since RDDs are in-memory JVM objects, they incur the overhead of garbage collection and Java serialization, which becomes expensive as the data grows.
The cost of garbage collection is proportional to the number of Java objects, so using data structures with fewer objects lowers the cost. Alternatively, we can persist the objects in serialized form.
v. Handling structured data
RDDs do not provide a schema view of the data and have no provision for handling structured data.
Dataset and DataFrame provide a schema view of the data: a distributed collection of data organized into named columns.
These limitations of RDDs in Apache Spark are the reason DataFrame and Dataset were introduced.
When to use Spark DataFrame/Dataset API and when to use plain RDD?
https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
https://dzone.com/articles/apache-spark-3-reasons-why-you-should-not-use-rdds#:~:text=Yes!,data%20analytics%2C%20and%20data%20science.
https://data-flair.training/blogs/apache-spark-rdd-limitations/

Related

What are the differences between GraphX's memory-based shuffle and Spark Core's shuffle

From the paper "GraphX: Graph Processing in a Distributed Dataflow Framework" (Gonzalez et al. 2014) I learned that GraphX modified Spark shuffle:
Memory-based Shuffle: Spark’s default shuffle implementation materializes the temporary data to disk. We modified the shuffle phase to materialize map outputs in memory and remove this temporary data using a timeout.
(The paper does not explain anything more on this point.)
It seems that this change aims at optimizing shuffles in the context of highly iterative graph processing algorithms.
How exactly does this "memory-based shuffle" work, how does it differ from Spark Core's shuffle, and what are the pros and cons: why is it well suited to GraphX use cases and not to other Spark jobs?
I failed to understand the big picture directly from GraphX/Spark sources and I also struggled to find the information elsewhere.
Apart from an ideal answer, comments with links to sources are welcome too.
I failed to understand the big picture directly from GraphX/Spark sources
Because it was never included in the mainstream distribution.
Back when the first GraphX version was developed, Spark used hash-based shuffle, which was rather inefficient. It was one of the main bottlenecks in Spark jobs, and there was significant research into developing alternative shuffle strategies.
Since GraphX algorithms are iterative and join-based, improving shuffle speed was an obvious path.
Since then, a pluggable shuffle manager has been introduced, as well as a new sort-based shuffle, which finally turned out to be fast enough to make both hash-based shuffle and the ongoing work on a generic memory-based shuffle obsolete.

Stateful udfs in spark sql, or how to obtain mapPartitions performance benefit in spark sql?

Using mapPartitions instead of map can give a significant performance boost in cases where the transformation requires creating or loading an expensive resource (e.g. authenticating to an external service or creating a db connection).
mapPartitions allows us to initialise the expensive resource once per partition, versus once per row as happens with the standard map.
But if I am using dataframes, the way I apply custom transformations is by specifying user defined functions that operate on a row-by-row basis, so I lose the ability I had with mapPartitions to perform the heavy lifting once per chunk.
Is there a workaround for this in spark-sql/dataframe?
To be more specific:
I need to perform feature extraction on a bunch of documents. I have a function that inputs a document and outputs a vector.
The computation itself involves initialising a connection to an external service. I don't want or need to initialise it per document. This has non trivial overhead at scale.
In general you have three options:
Convert the DataFrame to an RDD and apply mapPartitions directly. Since you use a Python udf, you already break certain optimizations and pay the serde cost, so using an RDD won't make it worse on average.
Lazily initialize the required resources (see also How to run a function on all Spark workers before processing data in PySpark?).
If the data can be serialized with Arrow, use a vectorized pandas_udf (Spark 2.3 and later). Unfortunately you cannot use it directly with VectorUDT, so you'd have to expand the vectors and collapse them later, which makes the size of the vector the limiting factor. You also have to be careful to keep the size of partitions under control.
Note that using UserDefinedFunctions might require promoting objects to non-deterministic variants.
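To make the first two options concrete, here is a minimal plain-Python sketch of both patterns. It does not use Spark at all: FeatureClient is a hypothetical stand-in for the expensive external connection, and the nested lists stand in for partitions. Treat it as an illustration of the shape of the code, not as the way any particular service should be wrapped.

```python
from typing import Iterable, Iterator, List, Optional


class FeatureClient:
    """Hypothetical stand-in for an expensive external-service connection."""

    instances_created = 0  # counts how often the "expensive" setup runs

    def __init__(self) -> None:
        FeatureClient.instances_created += 1

    def extract(self, document: str) -> List[float]:
        # Toy "feature vector": character count and word count.
        return [float(len(document)), float(len(document.split()))]


def extract_partition(documents: Iterable[str]) -> Iterator[List[float]]:
    """Option 1: partition-level function, one client per partition."""
    client = FeatureClient()  # created once for the whole iterator
    for document in documents:
        yield client.extract(document)


_client: Optional[FeatureClient] = None


def get_client() -> FeatureClient:
    """Option 2: lazy initialization, one client per process, reused afterwards."""
    global _client
    if _client is None:
        _client = FeatureClient()
    return _client


def extract_one(document: str) -> List[float]:
    """Row-level function (udf-shaped) that still avoids per-row setup."""
    return get_client().extract(document)


# Two made-up "partitions" of documents.
partitions = [["a b c", "hello world"], ["spark"]]
vectors = [list(extract_partition(p)) for p in partitions]
```

A function with the shape of extract_partition (iterator in, iterator out) is what RDD.mapPartitions expects, while extract_one is the shape you would wrap in a udf; in both cases the setup cost is paid once per partition or per worker process rather than once per document.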

Why aren't RDDs suitable for streaming tasks?

I'm using Spark extensively. The core of Spark is the RDD, and as shown in the RDD paper there are limitations when it comes to streaming applications. This is an exact quote from the RDD paper:
As discussed in the Introduction, RDDs are best suited for batch applications that apply the same operation to all elements of a dataset. In these cases, RDDs can efficiently remember each transformation as one step in a lineage graph and can recover lost partitions without having to log large amounts of data. RDDs would be less suitable for applications that make asynchronous fine-grained updates to shared state, such as a storage system for a web application or an incremental web crawler.
I don't quite understand why the RDD can't effectively manage state. How does Spark Streaming overcome these limitations?
I don't quite understand why the RDD can't effectively manage state.
It is not really about being able or not, but more about the cost. We have well-established mechanisms for handling fine-grained changes with write-ahead logging, but managing logs is just expensive. These have to be written to persistent storage, periodically merged, and they require expensive replaying in case of failure.
Compared to that, RDDs are an extremely lightweight solution. An RDD is just a small local data structure which has to remember only its lineage (ancestors and applied transformations).
It doesn't mean it is not possible to create an at least partially stateful system on top of Spark. Take a look at the Caffe-on-Spark architecture.
How does Spark Streaming overcome these limitations?
It doesn't, or to be more precise, it handles this problem externally, independent of the RDD abstraction. This includes using input and output operations with source-specific guarantees and fault-tolerant storage for handling received data.
It's explained elsewhere in the paper:
Existing abstractions for in-memory storage on clusters, such as distributed shared memory [24], key-value stores [25], databases, and Piccolo [27], offer an interface based on fine-grained updates to mutable state (e.g., cells in a table). With this interface, the only ways to provide fault tolerance are to replicate the data across machines or to log updates across machines. Both approaches are expensive for data-intensive workloads, as they require copying large amounts of data over the cluster network, whose bandwidth is far lower than that of RAM, and they incur substantial storage overhead.
In contrast to these systems, RDDs provide an interface based on coarse-grained transformations (e.g., map, filter and join) that apply the same operation to many data items. This allows them to efficiently provide fault tolerance by logging the transformations used to build a dataset (its lineage) rather than the actual data. If a partition of an RDD is lost, the RDD has enough information about how it was derived from other RDDs to recompute just that partition. Thus, lost data can be recovered, often quite quickly, without requiring costly replication.
As I interpret that, handling streaming applications would require the system to do lots of writing to individual cells, shoving data across the network, i/o, and other costly things. RDDs are meant to avoid all that stuff by primarily supporting functional-type operations that can be composed.
This is consistent with my recollection from about 9 months ago, when I did a Spark-based MOOC on edX (sadly I haven't touched it since). As I remember, Spark doesn't even bother to compute the results of maps on RDDs until the user actually calls for some output, and that way saves a ton of computation.
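The two ideas above (recording coarse-grained transformations as lineage, and deferring work until output is requested) can be sketched with a toy, single-machine class. ToyRDD and everything in it are made up for illustration and bear no relation to Spark's actual implementation; the point is only that nothing is logged or replicated, and a lost partition is rebuilt by replaying the recorded transformations.

```python
from typing import Any, Callable, List, Optional


class ToyRDD:
    """Toy imitation of lineage-based recovery; not Spark's implementation."""

    def __init__(
        self,
        source: Optional[List[List[Any]]] = None,
        parent: Optional["ToyRDD"] = None,
        transform: Optional[Callable[[List[Any]], List[Any]]] = None,
    ) -> None:
        self._source = source        # partitioned input, set only on the root
        self._parent = parent        # lineage: where this dataset came from
        self._transform = transform  # lineage: how it was derived
        self.computations = 0        # how many partitions this node has computed

    def map(self, f: Callable[[Any], Any]) -> "ToyRDD":
        # Only records the transformation; no data is touched here.
        return ToyRDD(parent=self, transform=lambda part: [f(x) for x in part])

    def filter(self, pred: Callable[[Any], bool]) -> "ToyRDD":
        return ToyRDD(parent=self, transform=lambda part: [x for x in part if pred(x)])

    def num_partitions(self) -> int:
        if self._parent is None:
            return len(self._source)
        return self._parent.num_partitions()

    def compute_partition(self, index: int) -> List[Any]:
        """Build (or rebuild) one partition by replaying its lineage."""
        self.computations += 1
        if self._parent is None:
            return list(self._source[index])
        return self._transform(self._parent.compute_partition(index))

    def collect(self) -> List[Any]:
        """The 'action': only now are the recorded transformations executed."""
        out: List[Any] = []
        for i in range(self.num_partitions()):
            out.extend(self.compute_partition(i))
        return out


base = ToyRDD(source=[[1, 2, 3], [4, 5, 6]])
even_squares = base.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
work_before_action = even_squares.computations  # still 0: nothing has run yet

result = even_squares.collect()
# Pretend partition 1 was lost: recompute just that partition from lineage.
recovered = even_squares.compute_partition(1)
```

Fine-grained updates to individual cells have no place in this model, because each node only knows "apply this function to all of my parent's data", which is exactly what keeps the bookkeeping so small.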

Which is efficient, Dataframe or RDD or hiveql?

I am a newbie to Apache Spark.
My job is to read two CSV files, select some specific columns from them, merge them, aggregate the result, and write it into a single CSV file.
For example,
CSV1
name,age,department_id
CSV2
department_id,department_name,location
I want to get a third CSV file with
name,age,department_name
I am loading both CSV files into DataFrames.
I am then able to get the third DataFrame using several DataFrame methods: join, select, filter, drop.
I am also able to do the same using several RDD.map() calls.
And I am also able to do the same by executing HiveQL using HiveContext.
I want to know which is the most efficient way if my CSV files are huge, and why.
This blog contains the benchmarks. DataFrames are much more efficient than RDDs.
https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
Here is the snippet from the blog:
At a high level, there are two kinds of optimizations. First, Catalyst applies logical optimizations such as predicate pushdown. The optimizer can push filter predicates down into the data source, enabling the physical execution to skip irrelevant data. In the case of Parquet files, entire blocks can be skipped and comparisons on strings can be turned into cheaper integer comparisons via dictionary encoding. In the case of relational databases, predicates are pushed down into the external databases to reduce the amount of data traffic.
Second, Catalyst compiles operations into physical plans for execution and generates JVM bytecode for those plans that is often more optimized than hand-written code. For example, it can choose intelligently between broadcast joins and shuffle joins to reduce network traffic. It can also perform lower level optimizations such as eliminating expensive object allocations and reducing virtual function calls. As a result, we expect performance improvements for existing Spark programs when they migrate to DataFrames.
Here is the performance benchmark https://databricks.com/wp-content/uploads/2015/02/Screen-Shot-2015-02-16-at-9.46.39-AM.png
Both DataFrames and Spark SQL queries are optimized using the Catalyst engine, so I would guess they will produce similar performance (assuming you are using version >= 1.3).
Both should be better than simple RDD operations, because for RDDs Spark doesn't have any knowledge about the types of your data, so it can't do any special optimizations.
The overall direction for Spark is to go with DataFrames, so that the query is optimized through Catalyst.
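One of the choices mentioned in the blog snippet is between broadcast joins and shuffle joins. As a rough illustration of the broadcast idea only, here is a plain-Python sketch (no Spark involved) applied to the two CSV layouts from the question: the small department table is turned into an in-memory lookup, and the large table is streamed past it once. The sample rows and the broadcast_join helper are made up, and the column names follow the question's layout with department spelled out in full.

```python
import csv
import io
from typing import Dict, Iterable, Iterator

# Made-up sample data in the shape described in the question.
employees_csv = """name,age,department_id
Ann,34,10
Bob,45,20
Cy,29,10
"""

departments_csv = """department_id,department_name,location
10,Engineering,Berlin
20,Sales,Paris
"""


def broadcast_join(
    large_rows: Iterable[Dict[str, str]],
    small_rows: Iterable[Dict[str, str]],
    key: str,
) -> Iterator[Dict[str, str]]:
    """Inner join that keeps the small side in memory and streams the large side."""
    lookup = {row[key]: row for row in small_rows}  # the "broadcast" table
    for row in large_rows:
        match = lookup.get(row[key])
        if match is not None:
            yield {**row, **match}


employees = csv.DictReader(io.StringIO(employees_csv))
departments = csv.DictReader(io.StringIO(departments_csv))

# Keep only the three columns wanted in the output file.
joined = [
    {"name": r["name"], "age": r["age"], "department_name": r["department_name"]}
    for r in broadcast_join(employees, departments, "department_id")
]
```

With DataFrames you do not write this by hand: you describe the join and let Catalyst decide whether the small side is worth broadcasting, which is part of why the DataFrame and SQL routes tend to beat hand-written RDD.map() code.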

Is Tachyon by default implemented by the RDD's in Apache Spark?

I'm trying to understand Spark's in-memory feature. In the process I came across Tachyon, which is basically an in-memory data layer that provides fault tolerance without replication by using lineage, and reduces recomputation by checkpointing the datasets. Where I got confused is that all these features are also achievable with Spark's standard RDD system. So I wonder: do RDDs use Tachyon behind the curtains to implement these features? If not, then what is the use of Tachyon when all of its job can be done by standard RDDs? Or am I making some mistake in relating the two? A detailed explanation or a link to one would be a great help. Thank you.
What is in the paper you linked does not reflect the reality of what is in Tachyon as a released open source project. Parts of that paper have only ever existed as research prototypes and have never been fully integrated into Spark/Tachyon.
When you persist data to the OFF_HEAP storage level via rdd.persist(StorageLevel.OFF_HEAP), it uses Tachyon to write that data into Tachyon's memory space as a file. This removes it from the Java heap, thus giving Spark more heap memory to work with.
It does not currently write the lineage information, so if your data is too large to fit into your configured Tachyon cluster's memory, portions of the RDD will be lost and your Spark jobs can fail.
