create unique values for each key in a spark RDD - apache-spark

I would like to create an RDD of key, value pairs where each key would have a unique value. The purpose is to "remember" key indices for later use since keys might be shuffled around the partitions, and basically create a lookup table of sorts. I am vectorizing some text and need to create feature vectors so I have to have a unique value for each key.
I tried zipping a second RDD with my RDD of keys, but the problem is that if the two RDDs are not partitioned in exactly the same way, you end up losing elements.
My second attempt is to use a hash generator like the one used in scikit-learn but I'm wondering if there is some other "spark-native" way of doing this? I'm using PySpark, not Scala...

zipWithIndex and zipWithUniqueId were just added to PySpark (https://github.com/apache/spark/pull/2092) and will be available in the forthcoming Spark 1.1.0 release (they're currently available in the Spark master branch).
If you're using an older version of Spark, you should be able to cherry-pick that commit to backport these functions, since I think it only adds lines to rdd.py.
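
Once you are on 1.1.0 or later, the lookup table described in the question can be built directly with zipWithIndex. A minimal sketch (the RDD contents and variable names are placeholders, not anything from the original post):
from pyspark import SparkContext

sc = SparkContext(appName="key-index-lookup")
keys_rdd = sc.parallelize(["apple", "banana", "apple", "cherry"])

# zipWithIndex assigns consecutive indices (it runs a small job first to
# count partition sizes); zipWithUniqueId is cheaper but leaves gaps.
key_index = keys_rdd.distinct().zipWithIndex()   # RDD of (key, index) pairs

lookup = key_index.collectAsMap()   # only sensible for small vocabularies
print(lookup)                       # e.g. {'apple': 0, 'banana': 1, 'cherry': 2}
collectAsMap is only appropriate when the set of keys fits in driver memory; otherwise keep key_index as an RDD and join against it.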

As mentioned by @aaronman, this is a simple operation that for some reason hasn't made it into the PySpark API yet. Going off the Java implementation, here's what seems to work (indices are assigned consecutively within each partition, starting from the partition's cumulative offset, so every element gets a unique index):
def count_partitions(idx, iterator):
    # Yield (partition_id, number_of_elements) for each partition.
    yield (idx, sum(1 for _ in iterator))

def zipindex(iterator, starts, idx):
    # Pair every element with a globally unique, consecutive index.
    start_index = starts[idx]
    for i, item in enumerate(iterator):
        yield (item, start_index + i)

counts = rdd.mapPartitionsWithIndex(count_partitions).collectAsMap()
# Turn the per-partition counts into starting offsets (cumulative sums).
starts = [0]
for k in range(len(counts) - 1):
    starts.append(starts[-1] + counts[k])
rdd_index = rdd.mapPartitionsWithIndex(lambda idx, it: zipindex(it, starts, idx))

Related

Spark HashPartitioner Collision Mechanism?

Does anyone know if Spark's HashPartitioner has an automatic collision mechanism to assign a key to a new partition? I.e., if I have very skewed data where a single key holds many records, then by
partition = hash(key) % num_partitions
I will land many records in the same partition, more than memory will hold. In this case, does the HashPartitioner have something like probing to assign records to a new partition, or does it not? If it does not, do I need to implement a custom partitioner to deal with the skewed key? Thanks very much.
I don't think the HashPartitioner will ever put records with the same key into two different partitions. The javadoc for Partitioner clearly says the following:
An object that defines how the elements in a key-value pair RDD are
partitioned by key. Maps each key to a partition ID, from 0 to
numPartitions - 1.
Note that, partitioner must be deterministic, i.e. it must return the
same partition id given the same partition key.
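To make the quoted contract concrete, hash partitioning boils down to a deterministic function of the key alone, roughly like the sketch below (an approximation only: the Scala HashPartitioner uses the key's hashCode with a non-negative modulo, and PySpark's partitionBy defaults to portable_hash):
def hash_partition(key, num_partitions):
    # Deterministic: the same key always maps to the same partition id,
    # so a single hot key can never be spread across several partitions.
    return hash(key) % num_partitions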
If putting records with the same key into the same partition is not a requirement for you, you can try the following without implementing a custom partitioner.
Let's say you want to write the dataframe into 1000 files.
Add a new column to your dataframe with random integers between 0 and 999.
from pyspark.sql.functions import rand, round
from pyspark.sql.types import IntegerType

_num_output_files = 1000
# Random integer in [0, 999] used purely to spread rows across partitions.
df = df.withColumn('rand', round(rand() * (_num_output_files - 1), 0).astype(IntegerType()))
Without loss of generality, assume the rand column is the i-th column of the dataframe. We need to use that column as the key for the RDD and then partition by that key. This will ensure an almost uniform distribution of data across all partitions. The following code snippet achieves that.
# Key each row by the rand column (x[i-1] because i is 1-based).
tmp_rdd = df.rdd.keyBy(lambda x: x[i - 1])
# The key is already an integer in [0, _num_output_files), so the identity
# function works as the partition function.
tmp_rdd = tmp_rdd.partitionBy(_num_output_files, lambda x: x)
df_rdd = spark.createDataFrame(tmp_rdd.map(lambda x: x[1]))
Note: this is a handy snippet to check the current distribution of records across partitions in PySpark: print('partition distrib: ' + str(df_rdd.rdd.glom().map(len).collect())). After calling the previous set of methods you should see roughly the same number of records in each partition.

How to use groupByKey() on multiple RDDs?

I have multiple RDDs with one common field CustomerId.
For eg:
debitcardRdd has data as (CustomerId, debitField1, debitField2, ......)
creditcardRdd has data as (CustomerId, creditField1, creditField2, ....)
netbankingRdd has data as (CustomerId, nbankingField1, nbankingField2, ....)
We perform different transformations on each individual RDD; however, we need to perform a transformation on the data from all three RDDs grouped by CustomerId.
Example: (CustomerId, debitField1, creditField2, nbankingField1, ....)
Is there any way we can group the data from all the RDDs based on the same key?
Note: In Apache Beam this can be done with coGroupByKey; I am just checking whether such an alternative is available in Spark.
Just cogroup
debitcardRdd.keyBy(_.CustomerId).cogroup(
  creditcardRdd.keyBy(_.CustomerId),
  netbankingRdd.keyBy(_.CustomerId)
)
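For reference, a hedged PySpark sketch of the same idea (the field layout is an assumption based on the question; groupWith is PySpark's cogroup variant that accepts more than one other RDD):
# Assume each RDD holds tuples whose first field is the customer id,
# e.g. (customerId, field1, field2, ...).
debit_by_id = debitcardRdd.keyBy(lambda row: row[0])
credit_by_id = creditcardRdd.keyBy(lambda row: row[0])
netbank_by_id = netbankingRdd.keyBy(lambda row: row[0])

# Yields (customerId, (debit_rows, credit_rows, netbanking_rows)), where each
# element of the value tuple is an iterable of the matching records.
grouped = debit_by_id.groupWith(credit_by_id, netbank_by_id)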
In contrast to the answer above, the .keyBy is not, imho, actually required here (if your RDDs are already key-value pairs), and note that cogroup, although not well documented, can extend to more than two RDDs:
val rddREScogX = rdd1.cogroup(rdd2,rdd3,rddn, ...)
Points should go to the first answer.

Spark: where doesn't work properly

I have 2 datasets, and I want to create a joined dataset, so I did:
Dataset<Row> join = ds1.join(ds2, "id");
However, for performance enhancement I tried to replace the join with .where(cond) (I also tried .filter(cond)), like this:
Dataset<Row> join = ds1.where(col("id").equalTo(ds2.col("id")));
This also works, except when one of the datasets is empty (in that case it returns the non-empty dataset), which is not the expected result.
So my question is: why doesn't .where work properly in that case, and is there another optimized solution for joining 2 datasets without using join()?
Join and where are two different things. Your where-condition code will fail because of an attribute-resolution issue: a where/filter condition is specific to a single DataFrame, so if you reference a second DataFrame in the condition it won't iterate over it the way a join does. Please check whether your code is returning the correct result at all.
Absolutely, one of the key points when you join two RDDs is the partitioner used for the two of them. If the first and second RDD have the same partitioner, the join performs as well as it can; if the partitioners differ, the first RDD's partitioner is used to repartition the second RDD.
Also try to use a "light" key, e.g. an encoded or hashed output of a String rather than the raw value, and the same partitioner for both RDDs.
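To illustrate the co-partitioning point, here is a sketch assuming both sides are PySpark pair RDDs keyed by id (the data and names are placeholders):
from pyspark import SparkContext

sc = SparkContext(appName="copartitioned-join")

left = sc.parallelize([(i, "left-%d" % i) for i in range(1000)])
right = sc.parallelize([(i, "right-%d" % i) for i in range(0, 1000, 2)])

num_partitions = 8
# Partition both sides with the same partitioner (same number of partitions
# and same partition function), so the join does not have to reshuffle the
# already co-partitioned data.
left_part = left.partitionBy(num_partitions).cache()
right_part = right.partitionBy(num_partitions).cache()

joined = left_part.join(right_part, num_partitions)
print(joined.take(5))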

Spark data manipulation with wholeTextFiles

I have 20k compressed files of ~2MB to manipulate in Spark. My initial idea was to use wholeTextFiles() so that I get filename -> content tuples. This is useful because I need to maintain this kind of pairing (the processing is done on a per-file basis, with each file representing a minute of gathered data). However, whenever I need to map/filter/etc. the data while maintaining this filename -> content association, the code gets ugly (and perhaps not efficient?), i.e.
Data.map(lambda (x,y) : (x, y.changeSomehow))
The data itself, i.e. the content of each file, would be nice to read as a separate RDD because it contains tens of thousands of lines of data; however, one cannot have an RDD of RDDs (as far as I know).
Is there any way to ease the process? Any workaround that would basically allow me to use the content of each file as an RDD, hence allowing me to do rdd.map(lambda x: change(x)) without the ugly bookkeeping of the filename (and the use of list comprehensions instead of transformations)?
The goal of course is to also maintain the distributed approach and to not inhibit it in any way.
The last step of the processing will be to gather together everything through a reduce.
More background: trying to identify (near) ship collisions on a per-minute basis, then plot their paths.
If you have an ordinary map function (o1 -> o2), you can use mapValues; for a flatMap-style function (o1 -> Collection()), there is flatMapValues.
It will keep the key (in your case, the file name) and change only the values.
For example:
rdd = sc.wholeTextFiles(...)
# RDD of e.g. one pair: /test/file.txt -> "Apache Spark"
rddMapped = rdd.mapValues(lambda x: veryImportantDataOf(x))
# result: one pair: /test/file.txt -> "Spark"
Using reduceByKey you can then reduce the results.
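A hedged sketch of how this keeps the per-file association while working line by line (the path, the per-line transformation, and the final reduction are placeholders, not anything from the original answer):
from pyspark import SparkContext

sc = SparkContext(appName="per-file-processing")

# Each element is (filename, whole_file_content).
files = sc.wholeTextFiles("hdfs:///path/to/compressed/files/*")

# flatMapValues splits each file into lines while keeping the filename key,
# so later transformations stay per-file without manual bookkeeping.
lines = files.flatMapValues(lambda content: content.splitlines())

# Per-line transformation, still keyed by filename (placeholder: line length).
parsed = lines.mapValues(lambda line: len(line))

# Final per-file aggregation with reduceByKey, as suggested above.
per_file_totals = parsed.reduceByKey(lambda a, b: a + b)
print(per_file_totals.take(3))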

Grouping related keys of an RDD together

I have a generated RDD with a set of key-value pairs. Assume that the keys are [10, 20, 25, 30, 40, 50]. The real keys are nearby geographic bins of size X by X meters that need to be aggregated into bins of size 2X by 2X.
So in this RDD I need to aggregate keys that have a relation between them, for example a key that is twice the current key, say 10 and 20. These will be added together to give 30, and the values will also be added together. Similarly, the result set would be [30, 25, 70, 50].
I am assuming that since map and reduce work on the current key of an element in an RDD, there is no way to do this using map, groupByKey, or aggregateByKey, because the grouping I want needs the state of the previous key.
I was thinking the only way to do this is to iterate through the elements in the RDD using foreach and, for each element, also pass in the entire RDD.
def group_rdds_together(rdd, rdd_list):
    key, val = rdd
    xbin, ybin = key
    rdd_list.foreach(group_similar_keys, xbin, ybin)

bin_rdd.map(lambda x: group_rdds_together(x, bin_rdd))
For that I have to pass the RDD into the map lambda, as well as custom parameters to the foreach function.
What I am doing is horribly wrong; I just wanted to illustrate where I am going with this. There should be a simpler and better way than this.
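If the intent is really to merge each fine-grained bin into its enclosing coarser bin, a key-only transformation may be enough, since the new key can be computed from the current key alone. A hedged sketch under that assumption (the bin coordinates, sizes, and values are placeholders, not taken from the question):
from pyspark import SparkContext

sc = SparkContext(appName="bin-aggregation")

# Keys are (xbin, ybin) coordinates of X-by-X bins; values are counts.
bin_rdd = sc.parallelize([((0, 0), 3), ((1, 0), 2), ((2, 3), 5)])

def coarser_bin(key):
    # Map an X-by-X bin to the 2X-by-2X bin that contains it.
    xbin, ybin = key
    return (xbin // 2, ybin // 2)

aggregated = (bin_rdd
              .map(lambda kv: (coarser_bin(kv[0]), kv[1]))
              .reduceByKey(lambda a, b: a + b))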
