Given a set of primary keys (including the partition and clustering keys), what is the most performant way to query those rows from Cassandra?
I am trying to implement a method that, given a list of keys, will return a Spark RDD for a couple of other columns in the CF. I've implemented a solution based on this question Distributed loading of a wide row into Spark from Cassandra, but it returns an RDD with a partition for each key. If the list of keys is large this will be inefficient and cause too many connections to Cassandra.
As such, I'm looking for an efficient way to query Cassandra for a set of primary keys.
The fastest solution should be to group the keys by partition key and use the IN operator (or a range predicate, if possible) on the clustering keys, and then, if needed, split these "supersets" client side.
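A minimal sketch of that approach with the DataStax Java driver (table and column names here are hypothetical, not from the original question): group the requested keys by partition key, issue one SELECT per partition with IN on the clustering key, and flatten the results client side.

import com.datastax.driver.core.Cluster
import scala.collection.JavaConverters._

// Assumed shape of the incoming keys (hypothetical column names).
case class Key(partitionId: Int, clusteringId: Int)

val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
val session = cluster.connect("my_keyspace")

val keys = Seq(Key(1, 10), Key(1, 11), Key(2, 42))   // example input list

val stmt = session.prepare(
  "SELECT partition_id, clustering_id, col_a, col_b " +
  "FROM my_cf WHERE partition_id = ? AND clustering_id IN ?")

// One query per partition, with all requested clustering keys in a single IN.
val rows = keys.groupBy(_.partitionId).toSeq.flatMap { case (pid, ks) =>
  val cks: java.util.List[Integer] = ks.map(k => Int.box(k.clusteringId)).asJava
  session.execute(stmt.bind(Int.box(pid), cks)).all().asScala
}

You can then parallelize the per-partition queries (or turn the result into an RDD) however your application needs.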
Cheers,
Carlo
What is the 'wide partition pattern' in Cassandra? In the book 'Cassandra: The Definitive Guide' it seems to be a recommended thing, but in some online articles I see it's something to be avoided.
So what actually is it, and is it preferable or not?
A partition in Cassandra represents a grouping of related rows. In Cassandra it is recommended to model your data so that rows that are read together fall in the same partition; this is called the wide partition pattern.
Lookups in Cassandra are very fast when you query by partition key, so the wide partition pattern is recommended. But with this recommendation comes a warning: your wide partitions should not become too large.
The reason for the warning (avoiding large partitions) is that searching within a very large partition becomes slow, and large partitions also put a lot of pressure on the heap.
For a better understanding, I would recommend reading this blog: https://thelastpickle.com/blog/2019/01/11/wide-partitions-cassandra-3-11.html
A wide partition is a partition that contains many cells/values (number_of_cells = columns * rows).
The commonly used time-series design pattern is a great example of wide-partition table design: you use a date bucket as the partition key, and each bucket contains all the timestamps that fall in the date range defined by that bucket.
This model illustrates why it's called a "wide" partition.
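As a sketch of that pattern (hypothetical table and column names, using the DataStax Java driver from Scala), each (sensor_id, date_bucket) partition collects all readings for one sensor on one day, and the ts clustering column is what makes the partition wide:

import com.datastax.driver.core.Cluster

val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
val session = cluster.connect("my_keyspace")

// Composite partition key = (sensor_id, date_bucket); ts is the clustering column.
session.execute(
  """CREATE TABLE IF NOT EXISTS sensor_readings (
    |  sensor_id   int,
    |  date_bucket date,
    |  ts          timestamp,
    |  value       double,
    |  PRIMARY KEY ((sensor_id, date_bucket), ts)
    |)""".stripMargin)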
A DynamoDB example follows the same idea (a sort key in DynamoDB plays the same role as a clustering key in Cassandra).
I am using this line to get entries from my Cassandra database
val data1 = ssc
  .cassandraTable("orbigo2", "my_trips")
  .select("trip_id")
  .where("user_id = ?", uid)
But this is taking a lot of time; I guess the reason is that my uid is not a primary key column, only an indexed column.
Is there any way I can speed this up?
The only real fix I can recommend is to change the data model so that uid becomes the partition key - in that case the lookup will be much faster. Right now the query just scans the whole table and extracts the necessary data.
Also, you may consider using DataFrames instead of RDDs - the DataFrame API has more optimizations and may be able to use secondary indexes. You can check the execution plan with .explain on the DataFrame to see how the data is accessed.
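A minimal sketch of the DataFrame variant (assuming the same keyspace/table as above, a SparkSession named spark, and the uid variable from your code; this is not your original code):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// Read the table through the Cassandra DataFrame source; the filter on
// user_id may be pushed down to Cassandra (verify with .explain).
val trips = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "orbigo2", "table" -> "my_trips"))
  .load()
  .filter($"user_id" === uid)
  .select("trip_id")

trips.explain()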
I am fairly new to Cassandra and currently have the following table:
CREATE TABLE time_data (
    id           int,
    secondary_id int,
    timestamp    timestamp,
    value        bigint,
    PRIMARY KEY ((id, secondary_id), timestamp)
);
The compound partition key (with secondary_id) is necessary in order to stay under the maximum recommended partition size.
The issue I am running into is that I would like to run the query SELECT * FROM time_data WHERE id = ?. Because the table has a compound partition key, this query requires filtering. I realize this queries a lot of data and partitions, but it is necessary for the application. For reference, id has relatively low cardinality and secondary_id has high cardinality.
What is the best way around this? Should I simply allow filtering on the query? Or is it better to create a secondary index like CREATE INDEX id_idx ON time_data (id)?
You will need to specify the full partition key in your queries (ALLOW FILTERING will hurt performance badly in most cases).
One way to go, if you know all the secondary_id values (you could add a table to track them if necessary), is to do the job in your application: query every (id, secondary_id) pair and process the results afterwards, as sketched below. This has the disadvantage of being more complex, but the advantage that it can be done with async queries in parallel, so many nodes in your cluster participate in processing your task.
See also https://www.datastax.com/dev/blog/java-driver-async-queries
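A rough sketch of that fan-out, using the DataStax Java driver's async API from Scala (the helper name is hypothetical; it assumes the known secondary_id values are already at hand):

import com.datastax.driver.core.{Cluster, ResultSetFuture, Row}
import scala.collection.JavaConverters._

val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
val session = cluster.connect("my_keyspace")

val stmt = session.prepare(
  "SELECT * FROM time_data WHERE id = ? AND secondary_id = ?")

// Fire one async query per (id, secondary_id) pair, then collect the rows.
def rowsForId(id: Int, secondaryIds: Seq[Int]): Seq[Row] = {
  val futures: Seq[ResultSetFuture] =
    secondaryIds.map(sid => session.executeAsync(stmt.bind(Int.box(id), Int.box(sid))))
  futures.flatMap(f => f.getUninterruptibly.all().asScala)
}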
I have a slightly special request.
Constellation: I use a Redis DB to store geo data and use GEORADIUS to get it back, sorted by distance. With these keys I then look up the data in Cassandra, but the Cassandra result comes back sorted by key (or something else).
What I want is to get the information back in the same order I requested it.
The partition key is built from an id (which I get back from Redis) and a status.
Can I tell Cassandra to sort by my array of ids?
Partition keys are designed to be randomly distributed across different nodes. You can use the ByteOrderedPartitioner to do ordered queries, but the BOP is considered an anti-pattern in Cassandra and I would highly recommend against it. You can read more about it here: Cassandra ByteOrderedPartitioner.
You can add more columns to the primary key, which determine how data is stored on disk within a partition. These are known as clustering keys, and you can do ORDER BY queries on them. This is a good document on clustering keys: https://docs.datastax.com/en/cql/3.1/cql/ddl/ddl_compound_keys_c.html.
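As a generic illustration only (hypothetical table and column names, not your schema), rows within a single partition can be returned ordered by a clustering column:

import com.datastax.driver.core.Cluster

val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
val session = cluster.connect("my_keyspace")

// distance is a clustering column, so each area_id partition is stored
// sorted by it and can serve ORDER BY queries on it.
session.execute(
  """CREATE TABLE IF NOT EXISTS places_by_area (
    |  area_id  int,
    |  distance double,
    |  place_id text,
    |  PRIMARY KEY (area_id, distance)
    |)""".stripMargin)

val rs = session.execute(
  "SELECT place_id, distance FROM places_by_area WHERE area_id = 42 ORDER BY distance ASC")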
If you can share more schema details, I can suggest what to use as clustering key.
I am fairly new to Apache Cassandra, and one thing I am having a hard time understanding is whether I should have a table with several partition key columns or a single computed key (computed in the application layer).
In my specific case I have 16 partition key columns k1...k16 that together make a single data element unique. With several partition key columns I need to provide all of them in my SELECT statement, and I am okay with this, but are there any pros/cons of doing so in terms of storage and/or performance?
The way I understand it, the storage might be larger, but the partition keys are 'human readable' and potentially queryable by other clients of this data. I assume that Cassandra computes some hash over my partition key whether it's a single value or several.
My question is: are there storage/performance issues, or any other considerations I should think about, with several partition key columns versus a single application-computed partition key?
You are correct: Cassandra converts a multi-part partition key into a single hash. So I think any efficiency gains from computing the hash in your application would be minimal at best.
Also, just in case you don't know this, keep in mind that the primary key is divided into the partition key and the clustering keys.
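A small illustration of that point (hypothetical table, not from your question): all partition key components have to be supplied together, and the token() function shows the single hash Cassandra derives from the composite key.

import com.datastax.driver.core.Cluster

val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
val session = cluster.connect("my_keyspace")

session.execute(
  """CREATE TABLE IF NOT EXISTS items (
    |  k1 text, k2 text, k3 text,
    |  payload text,
    |  PRIMARY KEY ((k1, k2, k3))
    |)""".stripMargin)

// The whole composite partition key is hashed to one token; token() exposes it.
val row = session.execute(
  "SELECT token(k1, k2, k3), payload FROM items WHERE k1 = 'a' AND k2 = 'b' AND k3 = 'c'"
).one()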
Cheers
Ben