Apache Spark RDD value lookup - apache-spark

I loaded data from HBase, did some operations on that data, and a pair RDD was created. I want to use the data of this RDD in my next function. I have half a million records in the RDD.
Can you please suggest a performance-effective way of reading data by key from the pair RDD?

Do the following:
val rdd2 = rdd1.sortByKey()
rdd2.lookup(key)
This will be fast: sortByKey gives the RDD a range partitioner, so lookup only has to scan the partition that can contain the key.
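As a rough Scala sketch of that pattern (the names hbasePairRdd and the sample rows are placeholders, not from the question, and local mode plus caching after the sort are assumed so repeated lookups reuse the range-partitioned RDD):
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("lookup-example").setMaster("local[*]"))

// Stand-in for the pair RDD produced from the HBase load.
val hbasePairRdd = sc.parallelize(Seq(("row1", "a"), ("row2", "b"), ("row3", "c")))

// sortByKey installs a RangePartitioner; cache so repeated lookups reuse it.
val sorted = hbasePairRdd.sortByKey().cache()

// lookup runs a job only over the partition that can contain the key.
val values: Seq[String] = sorted.lookup("row2")   // Seq("b")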

That is a tough use case. Can you use some datastore and index it?
Check out Splice Machine (Open Source).

From the driver only, you can use rdd.lookup(key) to return all values associated with the provided key.

You can use
rddName.take(5)
where 5 is the number of elements to return from the front of the RDD; change the number accordingly.
Also, to read just the very first element, you can use
rddName.first

Related

How to use groupByKey() on multiple RDDs?

I have multiple RDDs with one common field, CustomerId.
For example:
debitcardRdd has data as (CustomerId, debitField1, debitField2, ...)
creditcardRdd has data as (CustomerId, creditField1, creditField2, ...)
netbankingRdd has data as (CustomerId, nbankingField1, nbankingField2, ...)
We perform different transformations on each individual RDD; however, we need to perform a transformation on the data from all three RDDs by grouping on CustomerId.
Example: (CustomerId, debitField1, creditField2, nbankingField1, ...)
Is there any way we can group the data from all RDDs based on the same key?
Note: in Apache Beam this can be done by using CoGroupByKey; I am just checking whether such an alternative is available in Spark.
Just cogroup:
debitcardRdd.keyBy(_.CustomerId).cogroup(
  creditcardRdd.keyBy(_.CustomerId),
  netbankingRdd.keyBy(_.CustomerId)
)
In contrast to the other answer, the .keyBy is, imho, not actually required here if the RDDs are already keyed by CustomerId, and note that cogroup (which is not well described in the docs) extends beyond two RDDs, up to three others in the RDD API:
val rddREScogX = rdd1.cogroup(rdd2, rdd3, rdd4)
Points should go to the first answer.
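For concreteness, a minimal sketch of the cogroup approach, assuming simple case classes for the three record types (the field names other than CustomerId and the sample rows are made up for illustration):
case class Debit(CustomerId: String, debitField1: Double)
case class Credit(CustomerId: String, creditField1: Double)
case class NetBanking(CustomerId: String, nbankingField1: Double)

val debitcardRdd  = sc.parallelize(Seq(Debit("c1", 10.0), Debit("c2", 20.0)))
val creditcardRdd = sc.parallelize(Seq(Credit("c1", 5.0)))
val netbankingRdd = sc.parallelize(Seq(NetBanking("c2", 7.5)))

// cogroup the three RDDs on CustomerId; each key is paired with three
// Iterables, one per input RDD (empty when that source has no rows for the key).
val grouped = debitcardRdd.keyBy(_.CustomerId).cogroup(
  creditcardRdd.keyBy(_.CustomerId),
  netbankingRdd.keyBy(_.CustomerId)
)

grouped.foreach { case (customerId, (debits, credits, netbanking)) =>
  println(s"$customerId -> ${debits.size} debit(s), ${credits.size} credit(s), ${netbanking.size} netbanking row(s)")
}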

How does saveToCassandra() work?

I want to know: when I use rdd.saveToCassandra(), does this function save all elements of the current RDD into the Cassandra table in a single operation, or does it save element by element, similar to how map processes each element of an RDD and returns a new parsed element?
Thanks
Neither the first option nor the second one. It writes data after grouping it into batches of a configured size (by default 1024 bytes per batch and 1000 batches per Spark task). If you are interested in the details, it's open source, so check RDDFunctions and TableWriter for a start.
Updated as a response to comments: you may split your RDD into multiple RDDs and save each one using saveToCassandra. RDD splitting is not a standard feature of Spark as of now, so you need a third-party library such as Silex; check its documentation for flatMuxPartitions.
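For orientation, a minimal sketch of how such a save typically looks with the DataStax spark-cassandra-connector; the keyspace, table, column names, and sample rows are placeholders, and the two batching settings shown are the knobs the defaults mentioned above refer to:
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("save-to-cassandra-sketch")
  .set("spark.cassandra.connection.host", "127.0.0.1")
  // Batching knobs referenced above (these values are the documented defaults).
  .set("spark.cassandra.output.batch.size.bytes", "1024")
  .set("spark.cassandra.output.batch.grouping.buffer.size", "1000")
val sc = new SparkContext(conf)

// Placeholder rows; the connector maps the tuple elements onto the listed columns.
val users = sc.parallelize(Seq((1, "alice"), (2, "bob")))
users.saveToCassandra("my_keyspace", "users", SomeColumns("id", "name"))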

When should the groupByKey API be used in Spark programming?

groupByKey suffers from shuffling the data, and its functionality can often be achieved by using combineByKey or reduceByKey instead. So when should this API be used? Is there any use case?
Combine and reduce will also eventually shuffle, but they have better memory and speed performance characteristics because they are able to do more work to reduce the volume of data before the shuffle.
Consider having to sum a numeric attribute grouped by key, i.e. an RDD[(group, num)]. groupByKey will give you RDD[(group, List[num])], which you can then manually reduce using map. The shuffle needs to move all the individual nums to the destination partitions/nodes to build that list, so many rows get shuffled.
Because reduceByKey knows what you are doing with the nums (i.e. summing them), it can sum each individual partition before the shuffle, so you'd have at most one row per group being written out to each shuffle partition/node.
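As a rough Scala sketch of that comparison, with made-up data (both lines compute the same per-key sums):
val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 5)))

// groupByKey: every individual num is shuffled to the destination partition,
// and the sum is computed only after the full Iterable has been materialized.
val sumsViaGroup = pairs.groupByKey().mapValues(_.sum)

// reduceByKey: partial sums are computed per partition (map-side combine)
// before the shuffle, so at most one row per key crosses the network.
val sumsViaReduce = pairs.reduceByKey(_ + _)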
According to the link below, groupByKey should be avoided.
Avoid GroupByKey
Avoid groupByKey when the data in the merged field will be reduced to a single value, e.g. a sum for a particular key.
Use groupByKey when you know the merged field is not going to be reduced to a single value, e.g. collecting a per-key List; doing that with reduceByKey(_ ++ _) is what should be avoided.
The reason is that reducing into a list allocates memory on both the map side and the reduce side, and the memory allocated on an executor that doesn't own the key is wasted during the shuffle.
A good example would be TopN.
More on this -
https://github.com/awesome-spark/spark-gotchas/blob/master/04_rdd_actions_and_transformations_by_example.md#be-smart-about-groupbykey
I would say that if groupByKey is the last transformation in your chain of work (or anything you do after it has only a narrow dependency), then you may consider it.
The reasons reduceByKey is preferred are:
1. Map-side combine, as alister mentioned above.
2. reduceByKey also partitions the data so that the sum/aggregation becomes narrow, i.e. it can happen within partitions.

How to access Map modified in RDD, in the driver program of Apache Spark?

Need help.
I am working on Apache Spark 1.2.0. I have a requirement, or rather I should say I am stuck on an issue.
It is like this:
I am running a map function on an RDD in which I am creating some object instances and storing those instances in a ConcurrentMap against some key. Now, after the map function has finished, I need the data that was stored in the ConcurrentMap in the driver program, which as of now is blank outside the map function.
Is it at all possible? Can I access it by any means?
Thanks,
Nitin
I think you are misusing Spark or misunderstanding the concept. The thing you want to do can be achieved with the mapPartitions function. This function provides you an iterator over all the rows in the input RDD partition; this way you know when the processing has finished and can either save the ConcurrentMap you've generated to persistent storage or return its iterator as the function result.
If you would elaborate on your use case or attach the code, I would be able to recommend the right solution for you.
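A minimal Scala sketch of the mapPartitions idea described above, assuming a simple (key, value) RDD; the names records and localMap and the v * 10 transformation are placeholders standing in for the object creation, and the map's entries are returned as the partition's output so the driver can collect them:
import scala.collection.concurrent.TrieMap

val records = sc.parallelize(Seq(("k1", 1), ("k2", 2), ("k3", 3)), numSlices = 2)

val driverSideMap = records.mapPartitions { iter =>
  // Per-partition concurrent map, built on the executor.
  val localMap = TrieMap.empty[String, Int]
  iter.foreach { case (k, v) => localMap.put(k, v * 10) } // stand-in for the object creation
  // Return the entries so they flow back through the RDD instead of being lost.
  localMap.iterator
}.collect().toMap   // now accessible in the driver program

println(driverSideMap)   // Map(k1 -> 10, k2 -> 20, k3 -> 30)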

How to find the number of keys created in map part?

I am trying to write a Spark application that would find the number of keys that have been created in the map function. I could find no function that would allow me to do that.
One way I've thought of is using an accumulator, where I'd add 1 to the accumulator variable in the reduce function. My idea is based on the assumption that accumulator variables are shared across nodes as counters.
Please guide.
If you are looking for something like the Hadoop counters in Spark, the most accurate approximation is an accumulator that you can increment in every task, but you do not have any information about the amount of data Spark has processed so far.
If you only want to know how many distinct keys you have in your RDD, you could do something like a count of the distinct mapped keys: rdd.map(t => t._1).distinct.count.
Hope this will be useful for you.
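A short sketch of both options, assuming a pair RDD with made-up data; longAccumulator is the Spark 2.x API, and note that the accumulator counts processed records rather than distinct keys:
val pairRdd = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// Option 1: count distinct keys directly (runs a Spark job).
val distinctKeys = pairRdd.map(_._1).distinct().count()       // 2

// Option 2: Hadoop-counter style, incremented once per record in each task.
val recordCounter = sc.longAccumulator("recordsSeen")
pairRdd.foreach { case (_, _) => recordCounter.add(1) }        // 3 records, not 2 keys
println(s"distinct keys = $distinctKeys, records seen = ${recordCounter.value}")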
