It seems that the "partitioner" of a pair RDD is reset to None after most transformations (e.g. values() or toDF()). However, my understanding is that the partitioning is not always actually changed by these transformations.
Since cogroup (and possibly other operations) performs more efficiently when the inputs are known to be co-partitioned, I'm wondering if there's a way to tell Spark that the RDDs are still co-partitioned.
See the simple example below, where I create two co-partitioned RDDs, cast them to DataFrames, and perform cogroup on the resulting RDDs. A similar example could be made with values(), adding the keys back on afterwards.
Although this example is simple, in my real case I might load two parquet DataFrames that share the same partitioning.
Is this possible and would it result in a performance benefit in this case?
from pyspark.sql import Row

data1 = [Row(a=1, b=2), Row(a=2, b=3)]
data2 = [Row(a=1, c=4), Row(a=2, c=5)]
rdd1 = sc.parallelize(data1)
rdd2 = sc.parallelize(data2)
rdd1 = rdd1.map(lambda x: (x.a, x)).partitionBy(2)
rdd2 = rdd2.map(lambda x: (x.a, x)).partitionBy(2)
print(rdd1.cogroup(rdd2).getNumPartitions())  # 2 partitions: co-partitioning is recognized
rdd3 = rdd1.toDF(["a", "b"]).rdd
rdd4 = rdd2.toDF(["a", "c"]).rdd
print(rdd3.cogroup(rdd4).getNumPartitions())  # 4 partitions (2 empty): the partitioner was lost
In the Scala API most transformations include a preservesPartitioning=true option. Some of the Python RDD APIs retain that capability, but groupBy, for example, is a significant exception. As for the DataFrame API, the partitioning scheme seems to be mostly outside of end-user control, even on the Scala side.
It is likely then that you would have to:
restrict yourself to using RDDs, i.e. refrain from the DataFrame/Dataset approach
be choosy about which RDD transformations you use: look at the ones that either retain the parent's partitioning scheme or accept preservesPartitioning=true (see the sketch after this list)
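As a minimal Scala sketch of that distinction (assuming a live SparkContext sc; the data is illustrative): mapValues keeps the parent's partitioner because it cannot touch the keys, a plain map drops it, and mapPartitions lets you assert preservesPartitioning=true when you know the keys are unchanged.

import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq((1, "a"), (2, "b"))).partitionBy(new HashPartitioner(2))

pairs.mapValues(_.toUpperCase).partitioner       // Some(HashPartitioner) - keys untouched, partitioner kept
pairs.map { case (k, v) => (k, v) }.partitioner  // None - map may change keys, so the partitioner is dropped

// mapPartitions lets you promise that the keys were not changed:
pairs.mapPartitions(_.map { case (k, v) => (k, v + "!") }, preservesPartitioning = true).partitioner
// Some(HashPartitioner) again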
I am using Spark 2.0 to read data from a parquet file.
val Df = sqlContext.read.parquet("c:/data/parquet1")
val dfSelect = Df.select("id", "Currency", "balance")
val dfSumForeachId = dfSelect.groupBy("id").sum("balance")
val total = dfSumForeachId.agg(sum("sum(balance)")).first().getDouble(0)
In order to get the total balance, is this the best way to do it, using the action first() on a DataFrame?
In Spark 2.0, is it fine to use groupBy on a key? Does it have the same performance issue as groupByKey on an RDD, i.e. does it need to shuffle the whole data set over the network and then perform the aggregation, or is the aggregation performed locally, like reduceByKey in earlier versions of Spark?
Thanks
Getting the data by using first() is a perfectly valid way of getting it. That said, doing:
val total = dfSelect.agg(sum("balance")).first().getDouble(0)
would probably give you better performance for getting the total.
groupByKey and reduceByKey work exactly the same as in previous versions, and for the same reasons: groupByKey makes no assumption about the aggregation you want to perform and therefore cannot do partial aggregations the way reduceByKey does.
When you do a DataFrame groupBy and sum, you are effectively doing a reduceByKey with +, and the second aggregation is a reduce with +. That said, the DataFrame does it more efficiently because, knowing exactly what is being computed, it can apply many optimizations such as whole-stage code generation.
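For intuition, here is a minimal RDD-level sketch (the data and names are illustrative) of what the DataFrame aggregation above boils down to: reduceByKey sums within each partition before shuffling, whereas groupByKey would ship every (id, balance) record across the network first.

val balances = sc.parallelize(Seq(("id1", 10.0), ("id2", 5.0), ("id1", 2.5)))

val sumPerId = balances.reduceByKey(_ + _)   // roughly groupBy("id").sum("balance")
val total    = sumPerId.values.reduce(_ + _) // roughly agg(sum("sum(balance)")).first()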
Is it possible to create a JavaPairRDD<K,V> from a List<Tuple2<K,V>> with a specified partitioner? The method parallelizePairs in JavaSparkContext only takes the number of slices and does not allow using a custom partitioner, and invoking partitionBy(...) results in a shuffle, which I would like to avoid.
Why do I need this? Let's say I have rdd1 of some type JavaPairRDD<K,V> which is partitioned according to the hashCode of K. Now, I would like to create rdd2 of another type JavaPairRDD<K,U> from a List<Tuple2<K,U>>, in order to finally obtain rdd3 = rdd1.join(rdd2).mapValues(...). If rdd2 is not partitioned the same way rdd1 is, the cogroup call inside join will result in expensive data movement across the machines. Calling rdd2.partitionBy(rdd1.partitioner()) does not help either, since it also invokes a shuffle. Therefore, it seems the only remedy is to ensure rdd2 is created with the same partitioner as rdd1 to begin with. Any suggestions?
PS. If the List<Tuple2<K,U>> is small, another option is a broadcast hash join, i.e. making a HashMap<K,U> from the List<Tuple2<K,U>>, broadcasting it to all partitions of rdd1, and performing a map-side join. This turns out to be faster than repartitioning rdd2; however, it is not an ideal solution.
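For reference, a sketch of that broadcast map-side join, written in Scala for brevity (the types and data here are illustrative, not the asker's actual ones):

val rdd1 = sc.parallelize(Seq((1, "a"), (2, "b")))        // stands in for the large, already-partitioned side
val smallSide: Map[Int, Double] = Map(1 -> 1.0, 2 -> 2.0) // built from the small List<Tuple2<K,U>>

val bcast = sc.broadcast(smallSide)
val joined = rdd1.flatMap { case (k, v) =>
  bcast.value.get(k).map(u => (k, (v, u)))  // emit only keys present on both sides (inner join)
}
// joined keeps rdd1's partitions and no shuffle is triggered.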
I have an RDD that contains HBase row keys. The RDD is too large to fit in memory. I need to get an RDD of values for each of the provided keys. Is there a way to do something like this:
keys.map(key => table.get(new Get(key)))
So the question is: how can I obtain an instance of HTable inside the map task? Should I instantiate an HConnection for every partition and then obtain an HTable instance from it, or is there a better way?
There are a few options, but first consider the fact that Spark does not allow you to create RDDs of RDDs. So really that leaves you with two options:
a list of RDDs
a key/value RDD
I would highly recommend the second one, as a list of RDDs could leave you needing to perform a lot of reduces, which could massively increase the number of shuffles you have to perform. With that in mind, I would recommend you use a flatMap.
So here is some basic skeleton code that could get you that result:
val input: RDD[String] = ???
val completedRequests: RDD[(String, List[String])] = input.map(a => (a, table.get(new Get(a))))
val flattenedRequests: RDD[(String, String)] = completedRequests.flatMap { case (k, v) => v.map(b => (k, b)) }
You can now handle the RDD as one object, use reduceByKey if there is a particular piece of information you need from it, and Spark will be able to access the data with optimal parallelism.
Hope that helps!
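As for the per-partition connection idea in the original question, here is a hedged sketch using mapPartitions with the newer HBase client API (ConnectionFactory/Table rather than HConnection/HTable); the table name is a placeholder and keys is assumed to be an RDD[String]:

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
import org.apache.hadoop.hbase.util.Bytes

val values = keys.mapPartitions { iter =>
  val connection = ConnectionFactory.createConnection(HBaseConfiguration.create()) // one connection per partition
  val table = connection.getTable(TableName.valueOf("my_table"))                   // placeholder table name
  val results = iter.map(key => (key, table.get(new Get(Bytes.toBytes(key))))).toList // materialize before closing
  table.close()
  connection.close()
  results.iterator
}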
I am trying to find any information on the ordering of the rows in an RDD.
Here is what I am trying to do:
rdd1, rdd2
rdd3 = rdd1.union(rdd2)
In rdd3, is there any guarantee that rdd1's records will appear first and rdd2's afterwards?
In my tests I saw this behavior of union happening, but wasn't able to find it in any docs.
Just FYI, I really do not care about the ordering within the RDDs themselves (i.e. the order of rdd1's or rdd2's data is not a concern), but after the union, rdd1's records must come first; that is the requirement.
In Spark, the elements within a particular partition are unordered; however, the partitions themselves are ordered (see http://spark.apache.org/docs/latest/programming-guide.html#background).
If you check your rdd3, you should find that it is just all the partitions of rdd1 followed by all the partitions of rdd2, so in this case the results happen to be ordered in the way you want. You can read in "In Apache Spark, why does RDD.union not preserve the partitioner?" that simply concatenating the partitions of the two RDDs is the standard behaviour of Spark.
So in this case, it appears that union will give you what you want. However, this behaviour is an implementation detail of union, not part of its interface definition, so you cannot rely on it not being reimplemented with different behaviour in the future.
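A small Scala sketch of that point (the data is illustrative): union simply concatenates the partitions, so rdd1's partitions come first, and glom() turns each partition into an array so you can inspect it.

val rdd1 = sc.parallelize(Seq(1, 2, 3), 2)
val rdd2 = sc.parallelize(Seq(10, 20, 30), 2)
val rdd3 = rdd1.union(rdd2)

rdd3.getNumPartitions // 4: rdd1's two partitions followed by rdd2's two
rdd3.glom().collect().foreach(p => println(p.mkString(",")))
// prints rdd1's partitions first, then rdd2's - but, as noted above, that is an implementation detail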
Is it possible to duplicate an RDD into two or several RDDs?
I want to use the cassandra-spark driver and save an RDD into a Cassandra table and, in addition, keep going with more calculations (and eventually save the result to Cassandra as well).
RDDs are immutable and transformations on RDDs create new RDDs. Therefore, it's not necessary to create copies of an RDD to apply different operations.
You could save the base RDD to secondary storage and further apply operations to it.
This is perfectly OK:
val rdd = ???
val base = rdd.keyBy(...)
base.saveToCassandra(ks, table)
val processed = base.map(...).reduceByKey(...)
processed.saveToCassandra(ks, processedTable)
val analyzed = base.map(...).join(suspectsRDD).reduceByKey(...)
analyzed.saveAsTextFile("./path/to/save")
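One optional addition, not part of the original answer but a common companion to this pattern: since base feeds three separate actions above, caching it keeps Spark from recomputing its lineage for each save.

base.cache() // or base.persist(...) with an explicit storage level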