Partitioning with Spark Graphframes - apache-spark

I'm working with a largish (?) graph (60 million vertices and 9.5 billion edges) using Spark GraphFrames. The underlying data is not large: the vertices take about 500 MB on disk and the edges about 40 GB. My containers frequently shut down with Java heap out-of-memory errors, but I think the underlying problem is that the GraphFrame is constantly shuffling data around (I'm seeing shuffle read/write of up to 150 GB). Is there a way to efficiently partition a GraphFrame, or the underlying edges/vertices, to reduce shuffle?

TL;DR It is not possible to efficiently partition a GraphFrame.
GraphFrame algorithms can be separated into two categories:
Methods which delegate processing to their GraphX counterparts. GraphX supports a number of partitioning methods, but these are not exposed via the GraphFrame API. If you use one of these, it is probably better to use GraphX directly.
Unfortunately, development of GraphX has stopped almost completely, with only a handful of small fixes over the last two years, and overall performance is highly disappointing compared to both in-core and out-of-core libraries.
Methods which are implemented natively using Spark Datasets. Given the limited programming model and the single partitioning mode, these are deeply unfit for complex graph processing.
While relational columnar storage can be used for efficient graph processing, the naive iterative join approach employed by GraphFrames just doesn't scale (although it is OK for shallow traversals of one or two hops).
You can try to repartition the vertices and edges DataFrames by id and src respectively:
// Number of partitions; choose a value that suits your cluster
val nPart: Int = ???
GraphFrame(v.repartition(nPart, v("id")), e.repartition(nPart, e("src")))
which should help in some cases.
Overall, in its current (Dec 2016) state, Spark is not a good choice for intensive graph analytics.

Here's a partial solution / workaround: create a UDF that mimics one of the GraphX partition functions, use it to create a new column, and partition on that column.
from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType
num_parts = 256
# Python's hash() is stable for integer ids; string ids need a fixed PYTHONHASHSEED
random_vertex_cut = udf(lambda src, dst: abs(hash((src, dst))) % num_parts, IntegerType())
edge.withColumn("v_cut", random_vertex_cut(col("src"), col("dst"))).repartition(num_parts, "v_cut")
This approach can help somewhat, but not as much as GraphX.
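To see why the choice of partitioning key matters, here is a minimal plain-Python sketch (no Spark involved) that compares partitioning edges by src with a random-vertex-cut style hash of (src, dst). The graph, the helper names (stable_hash, largest_partition) and the use of crc32 as a hash are all assumptions made for illustration; GraphX and Spark use their own hash functions.

```python
from collections import Counter
import zlib


def stable_hash(*parts):
    """Deterministic hash of the given values (crc32 of their repr).

    An illustrative stand-in, not the hash Spark or GraphX uses.
    """
    return zlib.crc32(repr(parts).encode("utf-8"))


def partition_by_src(src, dst, num_parts):
    # All edges leaving the same vertex land in the same partition.
    return stable_hash(src) % num_parts


def random_vertex_cut(src, dst, num_parts):
    # Edges are assigned by the (src, dst) pair, so a hub's edges are scattered.
    return stable_hash(src, dst) % num_parts


def largest_partition(edges, partitioner, num_parts):
    """Return the number of edges in the fullest partition."""
    counts = Counter(partitioner(s, d, num_parts) for s, d in edges)
    return max(counts.values())


# Hypothetical skewed graph: vertex 0 is a hub with 10,000 out-edges,
# and 1,000 other vertices have a single out-edge each.
edges = [(0, d) for d in range(1, 10001)] + [(s, s + 1) for s in range(1, 1001)]
num_parts = 8

by_src = largest_partition(edges, partition_by_src, num_parts)
by_cut = largest_partition(edges, random_vertex_cut, num_parts)
print("largest partition when keyed by src:", by_src)
print("largest partition with random vertex cut:", by_cut)
```

Keyed by src, the hub's 10,000 edges all end up in one partition, while hashing the (src, dst) pair spreads them out. The trade-off is that the edges of a single vertex are then no longer co-located.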

Related

Suggestion for multiple joins in spark

Recently I got a requirement to perform combination joins.
I have to perform around 30 to 36 joins in Spark.
It was taking a long time to build the execution plan, so I cached the execution plan at intermediate stages using df.localCheckpoint().
Is this a good way to do it? Please share any thoughts.
Yes, it is fine.
This is mostly discussed for iterative ML algorithms, but can be equally applied for a Spark App with many steps - e.g. joins.
Quoting from https://medium.com/@adrianchang/apache-spark-checkpointing-ebd2ec065371:
Spark programs take a huge performance hit when fault tolerance occurs
as the entire set of transformations to a DataFrame or RDD have to be
recomputed when fault tolerance occurs or for each additional
transformation that is applied on top of an RDD or DataFrame.
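Note that localCheckpoint() is not "reliable": the checkpointed data is kept in executor storage rather than in a fault-tolerant file system, so it is lost if an executor fails.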
Caching is definitely a strategy to optimize performance. In general, assuming the data size and the resources of your Spark application remain unchanged, there are three points to consider when optimizing a join:
1. Data skew: Most of the time, when I try to find out why a join takes a long time, data skew turns out to be one of the reasons. In fact, not only joins but any transformation needs an even data distribution; otherwise a single partition holding most of the data makes the whole stage wait for one task. Make sure your data is well distributed.
2. Data broadcasting: A join normally shuffles data. In some cases, a relatively small DataFrame is used as a reference to filter the data in a very big DataFrame, and shuffling the big DataFrame is a very expensive operation. Instead, we can broadcast the small DataFrame to every node and avoid the costly shuffle.
3. Keep the joined data as lean as possible: As mentioned in point 2, a join normally shuffles data. Therefore, remove unnecessary rows and columns before joining to reduce the amount of data that has to move across the network during the shuffle.
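The data broadcasting point above can be illustrated with a minimal plain-Python sketch. It is not Spark code: the function name broadcast_join and the sample data are made up for illustration. It only shows the idea that every partition of the large side is joined against a full in-memory copy of the small side, so no rows of the large side need to move. In PySpark, the corresponding hint is pyspark.sql.functions.broadcast applied to the small DataFrame.

```python
def broadcast_join(large_partitions, small_table):
    """Inner-join each partition of the large side against an in-memory
    copy of the small side, leaving large-side rows where they are."""
    # The "broadcast" copy: every partition sees the same key -> value lookup.
    lookup = dict(small_table)
    joined = []
    for partition in large_partitions:
        # Rows are matched in place; nothing is exchanged between partitions.
        out = [(k, v, lookup[k]) for k, v in partition if k in lookup]
        joined.append(out)
    return joined


# Hypothetical data: two partitions of a large table and a small reference table.
large = [[(1, "a"), (2, "b")], [(3, "c"), (1, "d")]]
small = [(1, "x"), (3, "y")]

result = broadcast_join(large, small)
print(result)
```

The sketch assumes unique keys on the small side and that the small side fits in memory on every node, which is the same precondition a broadcast join has in Spark.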

RDD v.s. Dataset for Spark production code

Is there any industry guideline on whether to write a Spark project with RDD or Dataset?
So far what's obvious to me:
RDD, more type safety, less optimization (in the sense of Spark SQL)
Dataset, less type safety, more optimization
Which one is recommended in production code? I have not found such a topic on Stack Overflow so far, even though Spark has been prevalent for the past few years.
I can already foresee that the majority of the community sides with Dataset :), so let me first quote a dissenting opinion from this answer (and please do share arguments against it):
Personally, I find statically typed Dataset to be the least useful:
They don't provide the same range of optimizations as Dataset[Row] (although they share the storage format and some execution plan optimizations, they don't fully benefit from code generation or off-heap storage), nor access to all the analytical capabilities of the DataFrame.
They are not as flexible as RDDs, with only a small subset of types supported natively.
"Type safety" with Encoders is disputable when a Dataset is converted using the as method. Because the data shape is not encoded in a signature, the compiler can only verify the existence of an Encoder.
Here is an excerpt from "Spark: The Definitive Guide" to answer this:
When to Use the Low-Level APIs?
You should generally use the lower-level APIs in three situations:
You need some functionality that you cannot find in the higher-level APIs; for
example, if you need very tight control over physical data placement across the
cluster.
You need to maintain some legacy codebase written using RDDs.
You need to do some custom shared variable manipulation.
https://www.oreilly.com/library/view/spark-the-definitive/9781491912201/ch12.html
In other words: if you don't come across the situations above, it is generally better to use the higher-level APIs (Datasets/DataFrames).
RDD limitations:
i. No input optimization engine
RDDs have no provision for automatic optimization. They cannot make use of Spark's advanced optimizers, such as the Catalyst optimizer and the Tungsten execution engine; each RDD has to be optimized manually.
Dataset and DataFrame overcome this limitation: both use Catalyst to generate optimized logical and physical query plans. The same optimizer serves the R, Java, Scala, and Python DataFrame/Dataset APIs, which brings space and speed efficiency.
ii. Runtime type safety
There is no static typing or run-time type safety in RDDs. They do not allow us to check for errors at runtime.
Dataset provides compile-time type safety for building complex data workflows. Compile-time type safety means that if you try to add an element of a different type to a typed collection, you get a compile-time error. This helps detect errors at compile time and makes your code safer.
iii. Degradation when there is not enough memory
RDD performance degrades when there is not enough memory to store the RDD in memory or on disk. Partitions that overflow from RAM can be stored on disk, but they are then slower to access. Increasing the amount of RAM and disk space can overcome this issue.
iv. Performance limitation and overhead of serialization and garbage collection
Since RDDs are in-memory JVM objects, they incur the overhead of garbage collection and Java serialization, which becomes expensive as the data grows.
Since the cost of garbage collection is proportional to the number of Java objects, using data structures with fewer objects lowers the cost. Alternatively, we can persist the objects in serialized form.
v. Handling structured data
RDDs do not provide a schema view of the data and have no provision for handling structured data.
Dataset and DataFrame provide a schema view of the data: a distributed collection of data organized into named columns.
These limitations of RDDs in Apache Spark are the reason DataFrame and Dataset were introduced.
When to use Spark DataFrame/Dataset API and when to use plain RDD?
https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
https://dzone.com/articles/apache-spark-3-reasons-why-you-should-not-use-rdds#:~:text=Yes!,data%20analytics%2C%20and%20data%20science.
https://data-flair.training/blogs/apache-spark-rdd-limitations/

When should we go for Spark-sql and when should we go for Spark RDD

In which scenarios should we prefer Spark RDDs to write a solution, and in which scenarios should we go for Spark SQL? I know Spark SQL gives better performance and works best with structured and semi-structured data. But what other factors do we need to take into consideration when choosing between Spark RDD and Spark SQL?
I don't see many reasons to still use RDDs.
Assuming you are using a JVM-based language, you can use Dataset, which is a mix of Spark SQL and RDD (DataFrame == Dataset[Row]). According to the Spark documentation:
Dataset is a new interface added in Spark 1.6 that provides the benefits of RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL’s optimized execution engine.
The problem is that Python does not support Dataset, so you will have to use RDDs and lose the Spark SQL optimizations when you work with unstructured data.
I found DFs easier to use than DSs - the latter are still subject to development, imho. The comment on PySpark is indeed still relevant.
RDDs are still handy for zipWithIndex, to put ascending, contiguous sequence numbers on items.
DFs / DSs have a columnar store and have better Catalyst (optimizer) support.
Also, many things with RDDs are painful, like a JOIN requiring a key-value pair, or a multi-step join when you need to JOIN more than 2 tables. They are legacy. The problem is that the internet is full of legacy material and thus RDD jazz.
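The contiguous numbering that zipWithIndex provides can be sketched in plain Python as follows. This is an illustration of the general idea under stated assumptions (first count the elements per partition, then give each partition a starting offset), with partitions modeled as plain lists; the function name zip_with_index is made up and this is not Spark's implementation.

```python
from itertools import accumulate


def zip_with_index(partitions):
    """Pair every item with an ascending, contiguous index across partitions."""
    # Pass 1: count the items in each partition.
    sizes = [len(p) for p in partitions]
    # Starting index of each partition = total size of all partitions before it.
    offsets = [0] + list(accumulate(sizes))[:-1]
    # Pass 2: number the items within each partition, shifted by its offset.
    return [
        [(item, offset + i) for i, item in enumerate(p)]
        for p, offset in zip(partitions, offsets)
    ]


# Hypothetical data spread over four partitions, one of them empty.
partitions = [["a", "b"], [], ["c"], ["d", "e"]]
indexed = zip_with_index(partitions)
print(indexed)
```

The extra counting pass is why this kind of numbering needs to know the partition sizes before any index can be assigned.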
RDD
An RDD is a collection of data distributed across the cluster, and it handles both unstructured and structured data. It is typically processed with functional operations.
DF
Data frames are basically two-dimensional arrays of objects that organize the data in rows and columns, similar to relational tables in a database. A data frame handles only structured data.

Spark: query dataframe vs join

Spark 1.5.
There is a static dataset which may range from a few hundred MB to a few GB (I have discarded the option of broadcasting the dataset: too much memory needed).
I have a Spark Streaming input which I want to enrich with data from that static dataset, joining on a common key (I understand this can be done using transform over the DStream to apply RDD/PairRDD logic). Key cardinality is high, in the thousands.
These are the options I can see:
I can do the full join, which I guess would scale well in terms of memory; however, it could pose problems if too much data has to flow between nodes. I understand it may pay off to partition both the static and input RDDs by the same key.
Alternatively, I am considering just loading the data into a DataFrame and querying it for every input. Is this too much of a performance penalty? I think this would not be a proper way to use it unless the stream has low cardinality, right?
Are my assumptions correct? If so, would the full join with partitioning be the preferred option?
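The idea of partitioning both sides by the same key can be sketched in plain Python. This is only an illustration, not Spark code and not a recommendation between the two options: the function names, the integer keys and the modulo partitioner are assumptions. It shows that when both datasets use the same partitioner, matching keys end up at the same partition index, so each pair of partitions can be joined locally.

```python
def hash_partition(records, num_parts):
    """Place (key, value) records into num_parts buckets by key (integer keys assumed)."""
    parts = [[] for _ in range(num_parts)]
    for key, value in records:
        parts[key % num_parts].append((key, value))
    return parts


def copartitioned_join(left_parts, right_parts):
    """Inner-join two datasets that were partitioned with the same partitioner."""
    joined = []
    for left, right in zip(left_parts, right_parts):
        # Build a lookup for this partition of the right side only.
        index = {}
        for k, v in right:
            index.setdefault(k, []).append(v)
        # Matching keys can only live at the same partition index.
        joined.extend((k, lv, rv) for k, lv in left for rv in index.get(k, []))
    return joined


# Hypothetical static dataset and one micro-batch of streaming input.
static = [(k, "s%d" % k) for k in range(10)]
stream = [(3, "e1"), (7, "e2"), (3, "e3"), (42, "e4")]

num_parts = 4
result = copartitioned_join(
    hash_partition(stream, num_parts), hash_partition(static, num_parts)
)
print(sorted(result))
```

In this sketch the static side would be partitioned once and reused, and only each new batch of input would need to be partitioned before the local joins.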

Is there a way to force Spark Aggregate / Reduce to "bubble-up"?

I tried both aggregate and reduce in Spark on jobs that produce large datasets, and noticed that part of the reduction was executed in my driver. According to an MLlib blog post, they have managed to implement "bubbling": once the workers have reduced each task/partition, the reduction phase moves to a subset of workers, until eventually it is delegated back to the driver.
In my use case, I have 580 partitions that don't have too many entries in common, i.e. each partition is 2 GB, but all partitions aggregated are also 2 GB. As the reduction of the partitions is delegated to the driver, I get an OOME. Have I missed an API call that can do this, or is the best way to force this behaviour to apply incremental repartitioning?
Thanks
I think you are looking for rdd.treeAggregate, which applies the reducer in a multi-level fashion, reducing the amount of data passed to the driver for the final reduction.
It was moved from MLlib to Spark core in Spark 1.3.0. See SPARK-5430.
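The multi-level reduction can be sketched in plain Python as below. This is a simplified model and not Spark's implementation: the function tree_aggregate and its fan_in parameter are hypothetical (the PySpark call is rdd.treeAggregate(zeroValue, seqOp, combOp, depth=2), where depth is a different knob), and the zero value is assumed to be immutable.

```python
from functools import reduce
from operator import add


def tree_aggregate(partitions, zero, seq_op, comb_op, fan_in=2):
    """Reduce partitions locally, then merge partial results level by level.

    Returns (result, levels), where levels counts the intermediate merge
    rounds that happen before the final combine.
    """
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    # Step 1: every partition is reduced on its own ("on the workers").
    partials = []
    for partition in partitions:
        acc = zero
        for item in partition:
            acc = seq_op(acc, item)
        partials.append(acc)
    # Step 2: merge groups of partial results until only a few remain.
    levels = 0
    while len(partials) > fan_in:
        partials = [
            reduce(comb_op, partials[i:i + fan_in])
            for i in range(0, len(partials), fan_in)
        ]
        levels += 1
    # Step 3: the final combine ("on the driver") sees at most fan_in values.
    return reduce(comb_op, partials, zero), levels


# Hypothetical data: 8 partitions holding three copies of their own index.
partitions = [[i] * 3 for i in range(8)]
total, levels = tree_aggregate(partitions, 0, add, add, fan_in=2)
print(total, levels)
```

With a plain reduce, the final step would receive all 8 partial results at once; here it receives at most fan_in of them, which is the effect the question calls "bubbling up".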
