I have multiple RDDs with one common field CustomerId.
For eg:
debitcardRdd has data as (CustomerId, debitField1, debitField2, ......)
creditcardRdd has data as (CustomerId, creditField1, creditField2, ....)
netbankingRdd has data as (CustomerId, nbankingField1, nbankingField2, ....)
We perform different transformations on each individual RDD; however, we also need to perform a transformation on the data from all 3 RDDs by grouping on CustomerId.
Example: (CustomerId, debitField1, creditField2, nbankingField1, ....)
Is there any way we can group the data from all the RDDs based on the same key?
Note: in Apache Beam this can be done using coGroupByKey; I am just checking whether such an alternative is available in Spark.
Just cogroup
debitcardRdd.keyBy(_.CustomerId).cogroup(
  creditcardRdd.keyBy(_.CustomerId),
  netbankingRdd.keyBy(_.CustomerId)
)
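For completeness, here is a minimal sketch of how the cogrouped result could be consumed; the case classes and any field names beyond CustomerId are assumptions made purely for illustration:
// Illustrative case classes; only CustomerId comes from the question.
case class Debit(CustomerId: Long, debitField1: String)
case class Credit(CustomerId: Long, creditField1: String)
case class Netbanking(CustomerId: Long, nbankingField1: String)
// cogroup yields, per CustomerId, one Iterable per source RDD.
val grouped = debitcardRdd.keyBy(_.CustomerId).cogroup(
  creditcardRdd.keyBy(_.CustomerId),
  netbankingRdd.keyBy(_.CustomerId))
// Flatten into one combined record per customer, e.g. by taking the first
// element of each side (how to merge multiple rows per customer is up to you).
val combined = grouped.map { case (customerId, (debits, credits, netbankings)) =>
  (customerId,
    debits.headOption.map(_.debitField1),
    credits.headOption.map(_.creditField1),
    netbankings.headOption.map(_.nbankingField1))
}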
In contrast to the other answer, the .keyBy is, imho, not actually required here if the RDDs are already keyed by CustomerId, and note that cogroup (which is not well documented) can extend beyond two RDDs (Spark provides overloads for up to three other RDDs):
val rddREScogX = rdd1.cogroup(rdd2, rdd3, rdd4)
Points should go to the first answer.
Related
I receive a Dataset and I am required to join it with another table. Hence the simplest solution that came to my mind was to create a second Dataset for the other table and perform the joinWith.
def joinFunction(dogs: Dataset[Dog]): Dataset[(Dog, Cat)] = {
  val cats: Dataset[Cat] = spark.table("dev_db.cat").as[Cat]
  dogs.joinWith(cats, ...)
}
Here my main concern is with spark.table("dev_db.cat"), as it feels like we are referring to all of the cat data, as in
SELECT * FROM dev_db.cat
and then doing the join at a later stage. Or will the query optimizer directly perform the join without referring to the whole table? Is there a better solution?
Here are some suggestions for your case:
a. If you have where, filter, limit, take, etc. operations, try to apply them before joining the two datasets. Spark can't push these kinds of filters down for you, so you have to reduce the amount of target records yourself as much as possible. Here is an excellent source of information on the Spark optimizer.
b. Try to co-locate the datasets and minimize the shuffled data by using the repartition function. The repartitioning should be based on the keys that participate in the join, i.e.:
val dogsRepartitioned = dogs.repartition(1024, col("key_col1"), col("key_col2"))
dogsRepartitioned.join(cats, Seq("key_col1", "key_col2"), "inner")
c. Try to use broadcast for the smaller dataset if you are sure that it can fit in memory (or increase the value of spark.sql.autoBroadcastJoinThreshold). This gives a certain boost to the performance of your Spark program, since it ensures the co-existence of the two datasets within the same node; a sketch follows below.
If you can't apply any of the above then Spark doesn't have a way to know which records should be excluded and therefore will scan all the available rows from both datasets.
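As a rough illustration of point c, a minimal sketch using the broadcast hint; the join condition (Dog.catId referencing Cat.id) is an assumption, since the question elides it:
import org.apache.spark.sql.functions.broadcast
// Hint Spark to broadcast the (presumably smaller) cats dataset, so every
// executor gets a full copy and the dogs side does not need to be shuffled.
val joined = dogs.joinWith(broadcast(cats), dogs("catId") === cats("id"))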
You need to do an explain and see whether predicate pushdown is used; then you can judge whether your concern is correct or not.
In general, though, if no complex datatypes are used and there are no evident datatype mismatches, pushdown takes place. You can see that with a simple createOrReplaceTempView as well. See https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/3741049972324885/4201913720573284/4413065072037724/latest.html
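For example, a quick way to check, reusing joinFunction and dogs from the question:
// Print the parsed, analyzed, optimized and physical plans; in the physical plan,
// look for PushedFilters in the scan over dev_db.cat to see what was pushed down.
joinFunction(dogs).explain(true)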
I have 2 datasets, and I want to create a joined dataset, so I did:
Dataset<Row> join = ds1.join(ds2, "id");
However, for performance enhancement, I tried to replace the join with .where(cond) (I also tried .filter(cond)), like this:
Dataset<Row> join = ds1.where(col("id").equalTo(ds2.col("id")));
This also works, but not when one of the datasets is empty (in that case it returns the non-empty dataset), which is not the expected result.
So my question is: why doesn't .where work properly in that case, and is there another optimized solution for joining 2 datasets without using join()?
Join and where conditions are 2 different things. Your where-condition code will fail due to an attribute resolution issue: a where or filter condition is specific to that DataFrame, so if you mention a second DataFrame in the condition, it won't iterate over it the way a join does. Please check whether your code is producing the right result at all.
Absolutely one of the key points when you want to join two RDDs is the partitioner used for those two. If the first and the second RDD have the same partitioner, then your join operation will perform as well as it possibly can. If the partitioners differ, then the first RDD's partitioner will be used to partition the second RDD.
Then try to use a "light" key, e.g. an encoded or hashed output of a String instead of the raw String, and the same partitioner for both RDDs.
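A minimal sketch of pre-partitioning both RDDs with the same partitioner before the join; leftRdd, rightRdd and their id field are hypothetical names for illustration:
import org.apache.spark.HashPartitioner
// Using the same partitioner (and persisting) on both sides lets the join run
// without re-shuffling either RDD at join time.
val partitioner = new HashPartitioner(200)
val left = leftRdd.keyBy(_.id).partitionBy(partitioner).persist()
val right = rightRdd.keyBy(_.id).partitionBy(partitioner).persist()
val joined = left.join(right)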
I loaded data from HBase, did some operations on that data, and a paired RDD was created. I want to use the data of this RDD in my next function. I have half a million records in the RDD.
Can you please suggest a performance-effective way of reading data by key from the paired RDD?
Do the following:
val rdd2 = rdd1.sortByKey()
rdd2.lookup(key)
This will be fast.
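A brief note on why: sortByKey gives the RDD a RangePartitioner, and lookup uses a known partitioner to scan only the partition that can contain the key. If you plan to do repeated lookups, it is worth persisting the sorted RDD first; a small sketch (someKey is a placeholder):
// Persist the sorted RDD so repeated lookups don't recompute the whole lineage.
val sorted = rdd1.sortByKey().persist()
sorted.lookup(someKey) // only the partition that can contain someKey is scanned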
That is a tough use case. Can you use some datastore and index it?
Check out Splice Machine (Open Source).
From the driver only, you can use rdd.lookup(key) to return all values associated with the provided key.
You can use
rddName.take(5)
where 5 is the number of top most elements to be returned. You can change the number accordingly.
Also to read the very first element, you can use
rddName.first
GroupByKey suffers from shuffling the data, and GroupByKey functionality can be achieved either by using combineByKey or reduceByKey. So when should this API be used? Is there any use case?
Combine and reduce will also eventually shuffle, but they have better memory and speed performance characteristics because they are able to do more work to reduce the volume of data before the shuffle.
Consider if you had to sum a numeric attribute by group in an RDD[(group, num)]. groupByKey will give you RDD[(group, List[num])], which you can then manually reduce using map. The shuffle would need to move all the individual nums to the destination partitions/nodes to get that list - many rows being shuffled.
Because reduceByKey knows what you are doing with the nums (i.e. summing them), it can sum each individual partition before the shuffle - so you'd have at most one row per group being written out to each shuffle partition/node.
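A small illustration of the difference, assuming sc is the SparkContext and pairs is a toy RDD[(String, Int)]:
val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
// groupByKey: every num crosses the shuffle, then we sum on the reduce side.
val viaGroup = pairs.groupByKey().mapValues(_.sum)
// reduceByKey: each partition pre-sums its local values (map-side combine),
// so at most one row per key per partition crosses the shuffle.
val viaReduce = pairs.reduceByKey(_ + _)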
According to the link below, GroupByKey should be avoided.
Avoid GroupByKey
Avoid GroupByKey when the data in the merge field will be reduced to a single value, e.g. a sum for a particular key.
Use GroupByKey when you know that the merge field is not going to be reduced to a single value, e.g. collecting the values into a list (but avoid building that list with reduce(_ ++ _)).
The reason is that reducing a list allocates memory on both the map side and the reduce side, and the memory allocated on executors that don't own the key is wasted during the shuffle.
A good example would be TopN.
More on this -
https://github.com/awesome-spark/spark-gotchas/blob/master/04_rdd_actions_and_transformations_by_example.md#be-smart-about-groupbykey
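For instance, top-N per key is a case where the grouped values are genuinely needed and cannot be collapsed into a single value up front; a rough sketch, with pairs and n as hypothetical examples:
val n = 3
val pairs = sc.parallelize(Seq(("a", 5), ("a", 1), ("a", 9), ("b", 2)))
// Keep only the n largest values per key.
val topN = pairs.groupByKey().mapValues(_.toSeq.sortBy(v => -v).take(n))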
I would say that if groupByKey is the last transformation in your chain of work (or anything you do after it has a narrow dependency only), you may consider it.
The reason reduceByKey is preferred is:
1. It combines on the map side, as alister mentioned above.
2. reduceByKey also partitions the data so that the sum/agg becomes narrow, i.e. it can happen within partitions.
A function is defined to transform an RDD. Therefore, the function is called once for each element in the RDD.
The function needs to call an external web service to look up reference data, passing as a parameter data from the current element in the RDD.
Two questions:
Is there an issue with issuing a web service call within Spark?
The data from the web service needs to be cached. What is the best way to hold (and subsequently reference) the cached data? The simple way would be to hold the cache in a collection within the Scala class that contains the function being passed to the RDD. Would this be efficient, or is there a better approach for caching in Spark?
Thanks
There isn't really any mechanism for "caching" (in the sense that you mean). Seems like the best approach would be to split this task into two phases:
Get the distinct "keys" by which you must access the external lookup, and perform the lookup once for each key
Use this mapping to perform the lookup for each record in the RDD
I'm assuming there would potentially be many records accessing the same lookup key (otherwise "caching" won't be of any value anyway), so performing the external calls for the distinct keys is substantially faster.
How should you implement this?
If you know this set of distinct keys is small enough to fit into your driver machine's memory:
map your data into the distinct keys by which you'd want to cache these fetched values, and collect it, e.g. : val keys = inputRdd.map(/* get key */).distinct().collect()
perform the fetching on driver-side (not using Spark)
use the resulting Map[Key, FetchedValues] in any transformation on your original RDD - it will be serialized and sent to each worker where you can perform the lookup. For example, assuming the input has records for which the foreignId field is the lookup key:
val keys = inputRdd.map(record => record.foreignId).distinct().collect()
val lookupTable = keys.map(k => (k, fetchValue(k))).toMap
val withValues = inputRdd.map(record => (record, lookupTable(record.foreignId)))
Alternatively - if this map is large (but still can fit in driver memory), you can broadcast it before you use it in RDD transformation - see Broadcast Variables in Spark's Programming Guide
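A rough sketch of that broadcast variant, reusing the names from the snippet above and assuming sc is the SparkContext:
// Broadcast the lookup table once, instead of shipping it inside every task closure.
val lookupTableBc = sc.broadcast(lookupTable)
val withValues = inputRdd.map(record => (record, lookupTableBc.value(record.foreignId)))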
Otherwise (if this map might be too large) - you'll need to use join if you want to keep the data in the cluster, but still refrain from fetching the same element twice:
val byKeyRdd = inputRdd.keyBy(record => record.foreignId)
val lookupTableRdd = byKeyRdd
  .keys
  .distinct()
  .map(k => (k, fetchValue(k))) // this time fetchValue is done in the cluster - concurrently for different values
val withValues = byKeyRdd.join(lookupTableRdd)
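The joined RDD is keyed by foreignId and holds (record, fetchedValue) pairs, so a final unpacking step might look like this sketch:
// withValues: RDD[(foreignId, (record, fetchedValue))]
val recordsWithValues = withValues.map { case (_, (record, fetchedValue)) => (record, fetchedValue) }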