Spark HashPartitioner Collision Mechanism?

Does anyone know if Spark's HashPartitioner has an automatic collision mechanism to assign a key to a new partition? I.e., if I have very skewed data where a single key holds many records, then by
partition = hash(key) % num_partitions
I will land many records in the same partition, which won't fit in memory. In this case, does the HashPartitioner have something like probing to assign records to a new partition, or does it not? If it does not, do I need to implement a customized partitioner to deal with the skewed key? Thanks very much.

I don't think the HashPartitioner is going to put records with the same key into two different partitions in any situation. The javadoc for Partitioner clearly says the following:
An object that defines how the elements in a key-value pair RDD are
partitioned by key. Maps each key to a partition ID, from 0 to
numPartitions - 1.
Note that, partitioner must be deterministic, i.e. it must return the
same partition id given the same partition key.
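To make that determinism concrete, here is a rough sketch of the mapping (a simplified Python illustration, not Spark's actual Scala implementation):
# Every record with a given key always lands in the same partition,
# regardless of how many records share that key.
def partition_for(key, num_partitions):
    return hash(key) % num_partitions  # Spark uses the key's hashCode, kept non-negative

for key, value in [("hot_key", 1), ("hot_key", 2), ("hot_key", 3), ("other", 1)]:
    print(key, "->", partition_for(key, 8))
# "hot_key" maps to one fixed partition id every time, so a heavily skewed key
# produces one oversized partition; there is no probing or spill-over to other partitions.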
If putting records with the same key into the same partition is not a requirement for you, maybe you can try the following without implementing a custom partitioner.
Let's say you want to write the dataframe into 1000 files.
Add a new column to your dataframe with random integers between 0 and 999.
from pyspark.sql.functions import rand, round  # required imports
from pyspark.sql.types import IntegerType
_num_output_files = 1000
df = df.withColumn('rand', round(rand() * (_num_output_files - 1), 0).astype(IntegerType()))
Without loss of generality, let's assume the rand column is the i-th column of the dataframe. We need to use that column as the key for the RDD and then partition by that key. This will ensure an almost uniform distribution of data across all partitions. The following code snippet achieves that.
tmp_rdd = df.rdd.keyBy(lambda x: x[i - 1])  # key each row by its salt value
tmp_rdd = tmp_rdd.partitionBy(_num_output_files, lambda x: x)  # identity partitioner: salt value == partition id
df_rdd = spark.createDataFrame(tmp_rdd.map(lambda x: x[1]))
Note: This is a handy code snippet to check the current distribution of records across partitions in PySpark: print('partition distrib: ' + str(df_rdd.rdd.glom().map(len).collect())). After calling the previous set of methods you should see roughly the same number in each of the partitions.

Related

How can I reduce the partition count for a large amount of data in Cassandra, or is it even necessary?

I have an estimated ~500 million rows of data with 5 million unique numbers. My query must get data by number and event_date. With number as the partition key, there will be 5 million partitions. I think having a lot of small partitions is not good, and timeouts occur during queries. I'm having trouble defining the partition key. I have found some synthetic sharding strategies, but couldn't apply them to my model. I could define the partition key by mod of the number, but then rows aren't distributed evenly among partitions.
How can I model this to reduce the partition count, and is it necessary to reduce it at all? Is there any limit on the partition count?
CREATE TABLE events_by_number_and_date (
    number bigint,
    event_date int, /* e.g. 20200520 */
    event text,
    col1 int,
    col2 decimal,
    PRIMARY KEY (number, event_date)
);
For your query, changing the data model won't help, as you're using a kind of query that is unsuitable for Cassandra. Although Cassandra supports aggregations such as max, count, avg, sum, ..., they are designed to work inside a single partition, not across the whole cluster. If you issue them without a restriction on the partition key, the coordinating node needs to reach every node in the cluster, and they will need to go through all the data in the cluster.
You can still do this kind of query, but it's better to use something like Spark to do it, as it's heavily optimized for parallel data processing, and the Spark Cassandra Connector is able to perform the querying of the data correctly. If you can't use Spark, you can implement your own full token range scan, using code similar to this. But in any case, don't expect a "real-time" answer (< 1 sec).
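For illustration, here is a minimal PySpark sketch of reading the table through the Spark Cassandra Connector; the keyspace name and connector version below are assumptions on my part:
# Spark must be started with the connector on the classpath, e.g.
# --packages com.datastax.spark:spark-cassandra-connector_2.12:3.4.1
df = (spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(keyspace="my_keyspace", table="events_by_number_and_date")
      .load())

# The connector splits the full token range into Spark partitions, so a
# cluster-wide aggregation like this is processed in parallel by the executors
# instead of funnelling through a single coordinator node.
df.groupBy("event_date").count().show()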

Spark: (key, value) partition into different partition by key

I want each partition to hold only one key. Code in spark-shell:
val rdd = sc.parallelize(List(("a",1), ("a",2),("b",1), ("b",3),("c",1),("e",5))).cache()
import org.apache.spark.HashPartitioner
rdd.partitionBy(new HashPartitioner(4)).glom().collect()
And, the result is:
Array[Array[(String, Int)]] = Array(Array(), Array((a,1), (a,2), (e,5)), Array((b,1), (b,3)), Array((c,1)))
There are 4 keys ("a", "b", "c", "e"), but they end up in just 3 partitions even though I defined a HashPartitioner with 4 partitions. I think it's a hash collision because I use HashPartitioner. So how can I get different keys into different partitions?
I have read this answer, but it still doesn't solve my question.
You are right. It is a collision of hashes - some of the keys produce hash values for which hash % numPartitions returns the same value.
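For these particular keys the arithmetic works out exactly as your output shows (Spark's HashPartitioner uses the key's hashCode, and for a one-character Java String that is just the character code):
"a".hashCode = 97,  97 % 4 = 1
"b".hashCode = 98,  98 % 4 = 2
"c".hashCode = 99,  99 % 4 = 3
"e".hashCode = 101, 101 % 4 = 1
So "a" and "e" collide in partition 1, and partition 0 stays empty.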
The only solution here is to create your own partitioner, which will put each key into a separate partition. Just make sure to have enough partitions.
More about partitioning is here and here and here and here.
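Your example is in the Scala shell, but as a minimal sketch of the idea in PySpark (where partitionBy accepts an arbitrary partition function), given an equivalent pair RDD named rdd you could map each distinct key to its own partition id:
# Build an explicit key -> partition id table (assumes a small number of distinct keys).
key_to_partition = {k: i for i, k in enumerate(sorted(rdd.keys().distinct().collect()))}

exact = rdd.partitionBy(len(key_to_partition), lambda k: key_to_partition[k])
print(exact.glom().map(len).collect())  # each partition now holds exactly one key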

Cassandra partition keys organisation

I am trying to store the following structure in Cassandra:
ShopID, UserID, FirstName, LastName, etc.
Most of the queries on it are:
select * from table where ShopID = ? and UserID = ?
That's why it is useful to set (ShopID, UserID) as the primary key.
According to the docs, the default partition key in Cassandra is the first column of the primary key - in my case that's ShopID. But I want to distribute the data uniformly across the Cassandra cluster; I cannot allow all data from one ShopID to be stored in only one partition, because some shops have 10M records and some only 1k.
I can set up (ShopID, UserID) as the partition key; then I can reach a uniform distribution of records across the Cassandra cluster. But after that I cannot retrieve all users that belong to some ShopID:
select *
from table
where ShopID = ?
It's obvious that this query demands a full scan of the whole cluster, but I have no way to do that. And it looks like a very hard constraint.
My question is how to reorganize the data to solve both problems (uniform data partitioning and the possibility of making such scan-like queries) at the same time.
In general, you need to make the user id a clustering column and add some artificial information to your table and partition key when saving. This allows you to break a large natural partition into multiple synthetic ones. But now you need to query all the synthetic partitions when reading to combine them back into the natural partition. So the goal is to find a reasonable trade-off between the number (size) of synthetic partitions and the number of read queries needed to combine all of them.
A comprehensive description of possible implementations can be found here and here
(Example 2: User Groups).
Also take a look at the solution in Example 3 (User Groups by Join Date), where querying/ordering/grouping is performed by a clustering column of date type. It can be useful if you have similar queries.
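As a rough sketch of the bucketing idea described above, using the DataStax Python driver (the table name, column names, contact point, and bucket count below are all illustrative assumptions, not from your schema):
from cassandra.cluster import Cluster

NUM_BUCKETS = 16  # illustrative; size it so each (shop, bucket) partition stays reasonably small

session = Cluster(['127.0.0.1']).connect('shop_ks')

# (shop_id, bucket) becomes the partition key; user_id stays a clustering column.
session.execute("""
    CREATE TABLE IF NOT EXISTS users_by_shop (
        shop_id bigint,
        bucket int,
        user_id bigint,
        first_name text,
        last_name text,
        PRIMARY KEY ((shop_id, bucket), user_id)
    )
""")

def bucket_for(user_id):
    return user_id % NUM_BUCKETS  # deterministic, so point lookups still hit a single partition

# Lookup by (ShopID, UserID): compute the bucket, read one partition.
def get_user(shop_id, user_id):
    return session.execute(
        "SELECT * FROM users_by_shop WHERE shop_id = %s AND bucket = %s AND user_id = %s",
        (shop_id, bucket_for(user_id), user_id)).one()

# All users of a shop: read every synthetic partition and combine client-side.
def users_for_shop(shop_id):
    rows = []
    for b in range(NUM_BUCKETS):
        rows.extend(session.execute(
            "SELECT * FROM users_by_shop WHERE shop_id = %s AND bucket = %s",
            (shop_id, b)))
    return rows
The trade-off mentioned above shows up directly in NUM_BUCKETS: more buckets means smaller partitions, but more queries per shop when reading.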
Each node in Cassandra is responsible for some token ranges. Cassandra derives a token from a row's partition key using hashing and sends the record to the node whose token range includes this token. Different records can have the same token, and they are grouped into partitions. For simplicity we can assume that each Cassandra node stores the same number of partitions. We also want partitions to be roughly equal in size so they are distributed uniformly between nodes. If a partition is too huge, it means that one of our nodes needs more resources to process it. But if we break it into multiple smaller ones, we increase the chance that they will be evenly distributed across all nodes.
However, the distribution of token ranges between nodes is not related to the distribution of records between partitions. When we add a new node, it just assumes responsibility for an even portion of the token ranges from the other nodes, and as a result an even share of the partitions. If we had 2 nodes with 3 GB of data each, then after adding a third node each node stores 2 GB of data. That's why scalability isn't affected by partitioning, and you don't need to change your historical data after adding a new node.

How do you query cassandra for a set of keys?

Given a set of primary keys (including the partition and clustering keys), what is the more performant way to query those rows from cassandra?
I am trying to implement a method that, given a list of keys, will return a Spark RDD for a couple of other columns in the CF. I've implemented a solution based on this question, Distributed loading of a wide row into Spark from Cassandra, but this returns an RDD with a partition for each key. If the list of keys is large this will be inefficient and cause too many connections to Cassandra.
As such, I'm looking for an efficient way to query cassandra for a set of primary keys.
The fastest solution should be to group them by partition key, use the IN operator (or range predicates such as > where possible) on the clustering keys, and then, if needed, split these "supersets" client-side.
Cheers,
Carlo
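A minimal Python-driver sketch of that grouping (the table and column names here are hypothetical):
from collections import defaultdict
from cassandra.query import ValueSequence

# keys: iterable of (partition_key, clustering_key) pairs to fetch.
def fetch_rows(session, keys):
    by_partition = defaultdict(list)
    for pk, ck in keys:
        by_partition[pk].append(ck)

    rows = []
    for pk, cks in by_partition.items():
        # One query per partition; IN applies to the clustering key only,
        # so each query is served by a single replica set.
        rows.extend(session.execute(
            "SELECT pk, ck, col_a, col_b FROM my_table WHERE pk = %s AND ck IN %s",
            (pk, ValueSequence(cks))))
    return rows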

create unique values for each key in a spark RDD

I would like to create an RDD of key, value pairs where each key would have a unique value. The purpose is to "remember" key indices for later use since keys might be shuffled around the partitions, and basically create a lookup table of sorts. I am vectorizing some text and need to create feature vectors so I have to have a unique value for each key.
I tried this with zipping a second RDD to my RDD of keys, but the problem is that if the two RDDs are not partitioned in exactly the same way, you end up losing elements.
My second attempt is to use a hash generator like the one used in scikit-learn but I'm wondering if there is some other "spark-native" way of doing this? I'm using PySpark, not Scala...
zipWithIndex and zipWithUniqueId were just added to PySpark (https://github.com/apache/spark/pull/2092) and will be available in the forthcoming Spark 1.1.0 release (they're currently available in the Spark master branch).
If you're using an older version of Spark, you should be able to cherry-pick that commit in order to backport these functions, since I think it only adds lines to rdd.py.
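Once available, usage looks like this (a trivial illustration; the zipWithUniqueId output shown assumes two partitions):
rdd = sc.parallelize(["a", "b", "c", "d"], 2)
rdd.zipWithIndex().collect()     # [('a', 0), ('b', 1), ('c', 2), ('d', 3)]
rdd.zipWithUniqueId().collect()  # ids are unique but not consecutive: [('a', 0), ('b', 2), ('c', 1), ('d', 3)]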
As mentioned by @aaronman, this is a simple operation that for some reason hasn't made it into the PySpark API yet. Going off the Java implementation, here's what seems to work (indices are assigned consecutively, partition by partition):
def count_partitions(split_index, iterator):
    # Number of records in this partition.
    yield (split_index, sum(1 for _ in iterator))

def zipindex(items, start_indices, split_index):
    # Pair each item with its global index.
    start_index = start_indices[split_index]
    for i, item in enumerate(items):
        yield (item, start_index + i)

# Per-partition record counts, keyed by partition index.
parts = rdd.mapPartitionsWithIndex(count_partitions).collectAsMap()
# Start index of each partition = cumulative count of all earlier partitions.
indices = [0]
for k in range(len(parts) - 1):
    indices.append(indices[-1] + parts[k])
rdd_index = rdd.mapPartitionsWithIndex(lambda k, it: zipindex(it, indices, k))
