How to duplicate RDD into multiple RDDs? - apache-spark

Is it possible to duplicate an RDD into two or several RDDs?
I want to use the cassandra-spark driver to save an RDD into a Cassandra table and, in addition, keep going with more calculations (and eventually save the result to Cassandra as well).

RDDs are immutable and transformations on RDDs create new RDDs. Therefore, it's not necessary to create copies of an RDD to apply different operations.
You could save the base RDD to secondary storage and further apply operations to it.
This is perfectly OK:
val rdd = ???
val base = rdd.keyBy(...)
base.saveToCassandra(ks,table)
val processed = base.map(...).reduceByKey(...)
processed.saveToCassandra(ks,processedTable)
val analyzed = base.map(...).join(suspectsRDD).reduceByKey(...)
analyzed.saveAsTextFile("./path/to/save")
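For a concrete, self-contained illustration, here is a minimal sketch with made-up data, plain text output instead of Cassandra, and an optional cache() so the base RDD is not recomputed for each branch (the data and output paths are only placeholders):

// Runs in spark-shell; data and paths are illustrative only.
val base = sc.parallelize(Seq("a" -> 1, "b" -> 2, "a" -> 3)).cache()

// Branch 1: persist the base data as-is
base.saveAsTextFile("/tmp/base-out")

// Branch 2: keep computing on the very same RDD
val totals = base.reduceByKey(_ + _)
totals.saveAsTextFile("/tmp/totals-out")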

Related

Cost of transforming a dataframe to rdd in spark

I'm trying to fetch the number of partitions of a dataframe using this:
df.rdd.getNumPartitions.toString
But when I monitor the Spark log, I see that it spins up many stages and is a costly operation.
As per my understanding, a DataFrame adds a structural layer on top of an RDD via metadata. So how come stripping that away while converting to an RDD takes this much time?
A DataFrame is an optimized distributed tabular collection. Since it keeps a tabular format (similar to a SQL table), it can maintain metadata that allows Spark to perform some optimizations under the hood.
These optimizations are performed by components such as Catalyst and Tungsten.
An RDD does not maintain any schema; it is up to you to provide one if needed. So an RDD is not as highly optimized as a DataFrame (Catalyst is not involved at all).
Converting a DataFrame to an RDD forces Spark to loop over all the elements, converting them from the highly optimized Catalyst space to the Scala one.
Check the code from .rdd
lazy val rdd: RDD[T] = {
  val objectType = exprEnc.deserializer.dataType
  rddQueryExecution.toRdd.mapPartitions { rows =>
    rows.map(_.get(0, objectType).asInstanceOf[T])
  }
}

@transient private lazy val rddQueryExecution: QueryExecution = {
  val deserialized = CatalystSerde.deserialize[T](logicalPlan)
  sparkSession.sessionState.executePlan(deserialized)
}
So first, it executes the plan and retrieves the output as an RDD[InternalRow], which, as the name implies, is only for internal use and needs to be converted to RDD[Row].
Then it loops over all the rows, converting them. As you can see, it's not just removing the schema.
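As a side note, if all you need is the partition count, one way to avoid the per-row deserialization is to ask the underlying RDD[InternalRow] directly. This is only a sketch: queryExecution is a developer-facing API, and the example DataFrame is made up.

// Illustrative DataFrame; `spark` is the SparkSession available in spark-shell.
val df = spark.range(0, 1000).toDF("id")

// Goes through .rdd and therefore through the row-by-row conversion shown above.
println(df.rdd.getNumPartitions)

// Asks the internal RDD[InternalRow] directly and skips that conversion.
println(df.queryExecution.toRdd.getNumPartitions)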
Hope that answers your question.

Partition data when reading file in Spark

I am new to Spark. Consider the following code:
import org.apache.spark.HashPartitioner

val rdd = sc
.objectFile[(Int, Int)]("path")
.partitionBy(new HashPartitioner(sc.defaultParallelism))
.persist()
rdd.count()
Is each tuple read from the file sent directly to the partition specified by the hash partitioner? Or is the whole file first read into memory without considering the partitioner and then distributed according to it? To me, the former would be more efficient, since the data is shuffled once, while the latter needs two shuffles.
Please find the comments in the code
val rdd = sc
.objectFile[(Int, Int)]("path") // Loads the whole file with default minimum partitions and default partitioner
.partitionBy(new HashPartitioner(sc.defaultParallelism)) // Re-partitions the RDD using HashPartitioner
.persist()
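To see this for yourself, the RDD lineage makes the extra shuffle stage visible. A quick check, assuming a spark-shell session where sc is available and "path" is a placeholder:

import org.apache.spark.HashPartitioner

val rdd = sc
  .objectFile[(Int, Int)]("path")                          // read using the input splits
  .partitionBy(new HashPartitioner(sc.defaultParallelism)) // then shuffle by key hash
  .persist()

// The debug string shows a ShuffledRDD sitting on top of the RDD created by
// objectFile, i.e. the data is read first and only then redistributed by key.
println(rdd.toDebugString)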

Can I put back a partitioner to a PairRDD after transformations?

It seems that the partitioner of a pair RDD is reset to None after most transformations (e.g. values() or toDF()). However, my understanding is that the partitioning may not always be changed by these transformations.
Since cogroup (and maybe other operations) performs more efficiently when the inputs are known to be co-partitioned, I'm wondering if there's a way to tell Spark that the RDDs are still co-partitioned.
See the simple example below where I create two co-partitioned RDDs, cast them to DataFrames, and perform cogroup on the resulting RDDs. A similar example could be done with values() and then adding the right pairs back on.
Although this example is simple, my real case is that I may load two Parquet DataFrames with the same partitioning.
Is this possible and would it result in a performance benefit in this case?
from pyspark.sql import Row

data1 = [Row(a=1,b=2),Row(a=2,b=3)]
data2 = [Row(a=1,c=4),Row(a=2,c=5)]
rdd1 = sc.parallelize(data1)
rdd2 = sc.parallelize(data2)
rdd1 = rdd1.map(lambda x: (x.a,x)).partitionBy(2)
rdd2 = rdd2.map(lambda x: (x.a,x)).partitionBy(2)
print(rdd1.cogroup(rdd2).getNumPartitions()) #2 partitions
rdd3 = rdd1.toDF(["a","b"]).rdd
rdd4 = rdd2.toDF(["a","c"]).rdd
print(rdd3.cogroup(rdd4).getNumPartitions()) #4 partitions (2 empty)
In the Scala API a number of transformations (e.g. mapPartitions) take a preservesPartitioning = true option. Some of the Python RDD APIs retain that capability, but groupBy, for example, is a significant exception. As far as the DataFrame API goes, the partitioning scheme seems to be mostly outside of end-user control, even on the Scala side.
It is likely then that you would have to:
- restrict yourself to using RDDs, i.e. refrain from the DataFrame/Dataset approach
- be choosy about which RDD transformations you use: take a look at the ones that either retain the parent's partitioning scheme or accept preservesPartitioning = true
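As an illustration, here is a minimal Scala sketch (arbitrary data and partition counts) showing that mapValues keeps the partitioner, a plain map drops it, and mapPartitions lets you assert it explicitly:

import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(1 to 100).keyBy(_ % 10)
  .partitionBy(new HashPartitioner(4))

// mapValues cannot change the keys, so Spark keeps the partitioner
println(pairs.mapValues(_ * 2).partitioner)                   // Some(HashPartitioner@...)

// a plain map could change the keys, so the partitioner is dropped
println(pairs.map { case (k, v) => (k, v * 2) }.partitioner)  // None

// preservesPartitioning = true tells Spark the keys were left untouched
val kept = pairs.mapPartitions(
  iter => iter.map { case (k, v) => (k, v * 2) },
  preservesPartitioning = true)
println(kept.partitioner)                                     // Some(HashPartitioner@...)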

Spark: Mapping an RDD of HBase row keys to an RDD of values

I have an RDD that contains HBase row keys. The RDD is too large to fit in memory. I need to get an RDD of values for each of the provided keys. Is there a way to do something like this:
keys.map(key => table.get(new Get(key)))
So the question is: how can I obtain an instance of HTable inside a map task? Should I instantiate an HConnection for every partition and then obtain an HTable instance from it, or is there a better way?
There are a few options, but first consider the fact that Spark does not allow you to create RDDs of RDDs. So really that leaves you with two options:
a list of RDDs
a key/value RDD
I would highly recommend the second one, as a list of RDDs could end up with you needing to perform a lot of reduces, which could massively increase the number of shuffles you need to perform. With that in mind, I would recommend you use a flatMap.
So here is some basic skeleton code that could get you that result:
val input: RDD[String] = ???
val completedRequests: RDD[(String, List[String])] = input.map(a => (a, table.get(new Get(a))))
val flattenedRequests: RDD[(String, String)] = completedRequests.flatMap { case (k, v) => v.map(b => (k, b)) }
You can now handle the RDD as one object, reduceByKey if there is a particular piece of information you need from it, and Spark will be able to access the data with optimal parallelism.
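On the question of where the HTable comes from inside the task: a common pattern is to open one connection per partition with mapPartitions. Below is a minimal sketch assuming the HBase 1.x+ client API (where ConnectionFactory replaces HConnection); the table, column family and qualifier names are placeholders.

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.rdd.RDD

// One HBase connection per partition, reused for every key in that partition.
def fetchValues(keys: RDD[String]): RDD[(String, Array[Byte])] =
  keys.mapPartitions { iter =>
    val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = connection.getTable(TableName.valueOf("my_table"))
    val fetched = iter.map { key =>
      val result = table.get(new Get(Bytes.toBytes(key)))
      key -> result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col"))
    }.toList // materialize before closing the connection
    table.close()
    connection.close()
    fetched.iterator
  }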
Hope that helps!

In Apache Spark, why does RDD.union not preserve the partitioner?

As everyone knows, partitioners in Spark have a huge performance impact on any "wide" operations, so they are usually customized in such operations. I was experimenting with the following code:
import org.apache.spark.HashPartitioner

val rdd1 = sc.parallelize(1 to 50).keyBy(_ % 10)
  .partitionBy(new HashPartitioner(10))
val rdd2 = sc.parallelize(200 to 230).keyBy(_ % 13)
val cogrouped = rdd1.cogroup(rdd2)
println("cogrouped: " + cogrouped.partitioner)
val unioned = rdd1.union(rdd2)
println("union: " + unioned.partitioner)
I see that by default cogroup() always yields an RDD with the customized partitioner, but union() doesn't; it always reverts back to the default. This is counterintuitive, as we usually assume that a PairRDD should use its first element as the partition key. Is there a way to "force" Spark to merge two PairRDDs so they use the same partition key?
union is a very efficient operation, because it doesn't move any data around. If rdd1 has 10 partitions and rdd2 has 20 partitions then rdd1.union(rdd2) will have 30 partitions: the partitions of the two RDDs put after each other. This is just a bookkeeping change, there is no shuffle.
But it necessarily discards the partitioner. A partitioner is constructed for a given number of partitions, and the resulting RDD has a number of partitions that is different from both rdd1 and rdd2.
After taking the union you can run partitionBy to shuffle the data and organize it by key.
There is one exception to the above. If rdd1 and rdd2 have the same partitioner (with the same number of partitions), union behaves differently. It will join the partitions of the two RDDs pairwise, giving it the same number of partitions as each of the inputs had. This may involve moving data around (if the partitions were not co-located) but will not involve a shuffle. In this case the partitioner is retained. (The code for this is in PartitionerAwareUnionRDD.scala.)
This is no longer universally true: if two RDDs have exactly the same partitioner and number of partitions, the unioned RDD will also have those same partitions and that partitioner. This was introduced in https://github.com/apache/spark/pull/4629 and incorporated into Spark 1.3.
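A quick way to check this behaviour yourself on a reasonably recent Spark version (the numbers are arbitrary): when both inputs share the same partitioner, the union keeps it along with the partition count; otherwise the partitions are simply concatenated and the partitioner is dropped.

import org.apache.spark.HashPartitioner

val p = new HashPartitioner(10)
val a = sc.parallelize(1 to 50).keyBy(_ % 10).partitionBy(p)
val b = sc.parallelize(51 to 100).keyBy(_ % 10).partitionBy(p)

println(a.union(b).partitioner)       // Some(HashPartitioner@...) -- retained
println(a.union(b).getNumPartitions)  // 10, not 20

val c = sc.parallelize(1 to 50).keyBy(_ % 10) // no explicit partitioner
println(a.union(c).partitioner)       // None -- partitions are just concatenated
println(a.union(c).getNumPartitions)  // 10 + c's default partition count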
