Spark: partition (key, value) pairs into different partitions by key

I want each partition to contain only one key. Code in spark-shell:
val rdd = sc.parallelize(List(("a",1), ("a",2),("b",1), ("b",3),("c",1),("e",5))).cache()
import org.apache.spark.HashPartitioner
rdd.partitionBy(new HashPartitioner(4)).glom().collect()
And, the result is:
Array[Array[(String, Int)]] = Array(Array(), Array((a,1), (a,2), (e,5)), Array((b,1), (b,3)), Array((c,1)))
There are 4 keys ("a", "b", "c", "e"), but they end up in just 3 partitions even though I defined a HashPartitioner with 4 partitions. I think it's a hash collision because I use HashPartitioner. So how can I get different keys into different partitions?
I have read this answer, but it still does not solve my problem.

You are right, it is a hash collision: some of the keys produce hash values for which hash % numPartitions returns the same result.
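For the sample keys this is easy to check in the spark-shell: HashPartitioner assigns a key to partition key.hashCode % numPartitions (taken as a non-negative value), and with 4 partitions "a" and "e" happen to collide:
Seq("a", "b", "c", "e").foreach { k => println(s"$k -> ${((k.hashCode % 4) + 4) % 4}") }
// a -> 1, b -> 2, c -> 3, e -> 1  ("a" and "e" share partition 1; partition 0 stays empty)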
The only solution here is to create your own partitioner, which puts each key into a separate partition. Just make sure to have enough partitions.
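A minimal sketch of such a partitioner, assuming the full set of keys is known up front (the ExactPartitioner name and the key list are only illustrative):
import org.apache.spark.Partitioner

// illustrative exact-key partitioner: every distinct key gets its own partition
// (a key not present in `keys` would throw here)
class ExactPartitioner(keys: Seq[String]) extends Partitioner {
  private val keyToPart = keys.zipWithIndex.toMap
  override def numPartitions: Int = keys.size
  override def getPartition(key: Any): Int = keyToPart(key.asInstanceOf[String])
}

rdd.partitionBy(new ExactPartitioner(Seq("a", "b", "c", "e"))).glom().collect()
// expected: Array(Array((a,1), (a,2)), Array((b,1), (b,3)), Array((c,1)), Array((e,5)))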

Related

Spark HashPartitioner Collision Mechanism?

Does anyone know if Spark's HashPartitioner has an automatic collision mechanism to assign a key to a new partition? I.e., if I have very skewed data where a single key holds many records, then by
partition = hash(key) % num_partitions
I will land many records in the same partition, which memory won't hold. In this case, does the HashPartitioner have something like probing to assign records to a new partition, or does it not? If it does not, do I need to implement a custom partitioner to deal with the skewed key? Thanks very much.
I don't think the HashPartitioner is going to put records with the same key into two different partitions in any situation. The javadoc for Partitioner clearly says the following:
An object that defines how the elements in a key-value pair RDD are partitioned by key. Maps each key to a partition ID, from 0 to numPartitions - 1.
Note that, partitioner must be deterministic, i.e. it must return the same partition id given the same partition key.
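As a quick spark-shell check of that determinism (the partition count and key below are arbitrary):
import org.apache.spark.HashPartitioner

val p = new HashPartitioner(8)
// the same key always maps to the same partition id, no matter how often it is asked
p.getPartition("some-key") == p.getPartition("some-key")   // always true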
If putting records with the same key into the same partition is not a requirement for you, you could try the following without implementing a custom partitioner.
Let's say you want to write the dataframe into 1000 files.
Add a new column to your dataframe with random integers between 0 and 999.
from pyspark.sql.functions import rand, round
from pyspark.sql.types import IntegerType
_num_output_files = 1000
df = df.withColumn('rand', round(rand() * (_num_output_files - 1), 0).astype(IntegerType()))
Without loss of generality, let's assume the rand column is the i-th column in the dataframe. We need to use that column as the key for the RDD, and then partition by that key. This ensures an almost uniform distribution of data across all partitions. The following code snippet achieves that.
# i is the 1-based position of the 'rand' column, so x[i-1] is its value
tmp_rdd = df.rdd.keyBy(lambda x: x[i - 1])
# the key is already an integer in [0, _num_output_files), so the identity
# function works as the partition function
tmp_rdd = tmp_rdd.partitionBy(_num_output_files, lambda x: x)
df_rdd = spark.createDataFrame(tmp_rdd.map(lambda x: x[1]))
Note: This is a handy code snippet to check the current distribution of records across partitions in PySpark: print('partition distrib: ' + str(df_rdd.rdd.glom().map(len).collect())). After calling the previous set of methods you should see roughly the same number in each of the partitions.

Spark: where doesn't work properly

I have 2 datasets, and I want to create a joined dataset, so I did:
Dataset<Row> join = ds1.join(ds2, "id");
However, for performance enhancement I tried to replace the join with .where(cond) (I also tried .filter(cond)), like this:
Dataset<Row> join = ds1.where(col("id").equalTo(ds2.col("id")));
which also works, but not when one of the datasets is empty (in that case it returns the non-empty dataset). However, this is not the expected result.
So my question is why .where doesn't work properly in that case, and whether there is another optimized solution for joining 2 datasets without using join().
Join and a where condition are 2 different things. Your where-condition code will fail due to an attribute resolution issue: a where or filter condition is specific to one DataFrame, and if you reference a second DataFrame in the condition it won't be iterated over the way a join would iterate over it. Please check whether your code is returning any result at all.
One of the key points when you want to join two RDDs is the partitioner used for the two of them. If the first and the second RDD have the same partitioner, the join operation performs as well as it can. If the partitioners differ, the first RDD's partitioner is used to partition the second RDD.
Then try to use a "light" key, e.g. the encoded or hashed output of a String instead of the raw value, and the same partitioner for both RDDs.
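As a rough sketch of that advice in the spark-shell (the sample data, key type, and partition count are just placeholders), pre-partitioning both sides with the same partitioner lets the join reuse that layout instead of re-shuffling both RDDs:
import org.apache.spark.HashPartitioner

val partitioner = new HashPartitioner(4)
// give both RDDs the same partitioner up front; the join can then reuse it
val left  = sc.parallelize(Seq((1, "a"), (2, "b"))).partitionBy(partitioner).cache()
val right = sc.parallelize(Seq((1, "x"), (3, "y"))).partitionBy(partitioner).cache()
left.join(right).collect()   // Array((1,(a,x)))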

Apache Cassandra Several Partition Keys or Single Computed Key?

I am fairly new to Apache Cassandra, and one thing I am having a hard time understanding is whether I should have a table with several partition keys or a single computed key (computed in an application layer).
In my specific case I have 16 partition keys k1...k16 that make a single data element unique. With several partition keys I need to provide them all in my select statement, and I am okay with this, but are there any pros/cons of doing this in terms of storage and/or performance?
The way I understand this, the storage might be larger, but the partition keys are 'human readable' and potentially queryable by other clients of this data. I assume that Cassandra computes some hash on my partition keys whether it's a single value or several.
My question is: are there storage/performance issues or any other considerations I should think about when choosing between several partition keys and a single application-computed partition key?
You are correct, Cassandra converts a multi-part partition key into a single hash. So I think any efficiency gains from computing the hash in your application would be minimal at best.
Also, just in case you don't know this, keep in mind that the primary key is divided into the partition key and the clustering keys.
Cheers
Ben

How do you query cassandra for a set of keys?

Given a set of primary keys (including the partition and clustering keys), what is the most performant way to query those rows from Cassandra?
I am trying to implement a method that, given a list of keys, will return a Spark RDD for a couple of other columns in the CF. I've implemented a solution based on this question, Distributed loading of a wide row into Spark from Cassandra, but it returns an RDD with a partition for each key. If the list of keys is large this will be inefficient and cause too many connections to Cassandra.
As such, I'm looking for an efficient way to query Cassandra for a set of primary keys.
The fastest solution should be to group them by partition key, use the IN operator (or > if possible) for the clustering keys, and then, if needed, split these "supersets" client side.
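A very rough sketch of that grouping idea with the DataStax spark-cassandra-connector (the keyspace, table, column names, and key types are hypothetical, and the connector's cassandraTable/select/where API is assumed to be on the classpath):
import com.datastax.spark.connector._

// requested (partition key, clustering key) pairs, grouped by partition key
val keys = Seq(("p1", 1), ("p1", 2), ("p2", 7))
val grouped = keys.groupBy(_._1).mapValues(_.map(_._2))

// one query per partition key, fetching all of its requested clustering keys via IN,
// then union the per-partition-key results into a single RDD
val perPartition = grouped.map { case (pk, cks) =>
  sc.cassandraTable("my_ks", "my_table")
    .select("col1", "col2")
    .where("pk = ? and ck in ?", pk, cks)
}
val result = sc.union(perPartition.toSeq)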
Cheers,
Carlo

create unique values for each key in a spark RDD

I would like to create an RDD of (key, value) pairs where each key has a unique value. The purpose is to "remember" key indices for later use, since keys might be shuffled around the partitions, and basically to create a lookup table of sorts. I am vectorizing some text and need to create feature vectors, so I have to have a unique value for each key.
I tried zipping a second RDD to my RDD of keys, but the problem is that if the two RDDs are not partitioned in exactly the same way, you end up losing elements.
My second attempt is to use a hash generator like the one used in scikit-learn, but I'm wondering if there is some other "Spark-native" way of doing this? I'm using PySpark, not Scala...
zipWithIndex and zipWithUniqueId were just added to PySpark (https://github.com/apache/spark/pull/2092) and will be available in the forthcoming Spark 1.1.0 release (they're currently available in the Spark master branch).
If you're using an older version of Spark, you should be able to cherry-pick that commit in order to backport these functions, since I think it only adds lines to rdd.py.
As mentioned by @aaronman, this is a simple operation that for some reason hasn't made it into the PySpark API yet. Going off the Java implementation, here's what seems to work (it assigns consecutive indices within each partition, offset by the sizes of the preceding partitions):
def count_partitions(idx, iterator):
    # emit (partition index, number of elements in that partition)
    yield (idx, sum(1 for _ in iterator))

def zipindex(iterator, offsets, idx):
    # offsets[idx] is the total number of elements in all preceding partitions
    start_index = offsets[idx]
    for i, item in enumerate(iterator):
        yield (item, start_index + i)

counts = rdd.mapPartitionsWithIndex(count_partitions).collectAsMap()
# build cumulative start offsets: partition k starts after partitions 0..k-1
offsets = [0]
for k in range(len(counts) - 1):
    offsets.append(offsets[-1] + counts[k])
rdd_index = rdd.mapPartitionsWithIndex(lambda k, it: zipindex(it, offsets, k))
