Cassandra - iterate over all Row Keys without duplicates on random partitioner

get_range_slices iterates over all keys, even with the random partitioner. As I understand it, the result of this query will not contain duplicate keys, because it walks the ring in ascending order. Since keys are hashed, it seems Cassandra would need an additional "index" to execute such a query, for example each key keeping a reference to the next key (which is not the case).
Could someone give me some hints on how Cassandra implements iteration over all keys with the random partitioner?

Results are returned in random order. Or more specifically, token order (the hashed value of the keys).
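A minimal sketch of why token order makes duplicate-free iteration possible without any per-key "next" pointer. It is a toy model, not Cassandra's code: token() only loosely imitates the random partitioner (an MD5 hash read as an integer), range_slices() and iterate_all() are hypothetical names, and the paging here resumes strictly after the last token seen, whereas a real client usually resumes from the last key and drops the repeated first row.

```python
import hashlib

def token(key: str) -> int:
    # Toy stand-in for a RandomPartitioner token: MD5 digest as an integer
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")

def range_slices(keys, start_token=-1, page_size=3):
    """Return up to page_size keys with token > start_token, in token order."""
    ordered = sorted(keys, key=token)  # rows are kept sorted by token, not by key
    page = [k for k in ordered if token(k) > start_token]
    return page[:page_size]

def iterate_all(keys, page_size=3):
    """Page through every key by always resuming after the last token seen."""
    seen = []
    last = -1
    while True:
        page = range_slices(keys, last, page_size)
        if not page:
            return seen
        seen.extend(page)
        last = token(page[-1])

keys = [f"user{i}" for i in range(10)]
result = iterate_all(keys)
```

Because every page starts after the previous page's last token, each key is visited exactly once, and the order looks random with respect to the original key values.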

EDIT: I am not sure I understood the original question. If you have 100 nodes, you would not want to run get_range_slices over everything from a single node. Typically you would install Hadoop map/reduce on top of Cassandra, using Cassandra's adapter, so you can process all keys in parallel.
get_range_slices is generally not used for getting "all" the keys on the random partitioner. Instead, map/reduce is used, because it is much faster to send your code to each machine and have every machine execute in parallel, so you traverse the entire data set much sooner.
I.e. maybe you need to look into map/reduce instead of get_range_slices?
Another option is PlayOrm's partitioning, if you use PlayOrm: you can use Storm and have a machine processing each partition, and you can run
PARTITIONS(:partitionId) SELECT * FROM Table
to get all the rows for a partition.
You can of course do joins and such too, and they are fast because they read from multiple disks in parallel; when dealing with disks, you want that parallelism to speed things up.

Related

Datamodel for Scylla/Cassandra for table partition key is not known beforehand -> static field?

I am using ScyllaDB, but I think this also applies to Cassandra, since ScyllaDB is compatible with Cassandra.
I have the following table (I have ~5 tables of this kind):
create table batch_job_conversation (
conversation_id uuid,
primary key (conversation_id)
);
This is used by a batch job to make sure some fields are kept in sync. In the application, a lot of concurrent writes and reads can happen. Once in a while, I correct the values with a batch job.
Many writes can hit the same row, so rows get overwritten. The batch job currently picks up rows with this query:
select * from batch_job_conversation
The batch job then reads the data as it is at that point and makes sure things are in sync. I think this query is bad because it stresses all the partitions and the coordinator node, since it needs to visit ALL partitions.
My question is whether it is better for this kind of table to have a fixed field. Something like this:
create table batch_job_conversation (
always_zero int,
conversation_id uuid,
primary key ((always_zero), conversation_id)
);
And then the query would be this:
select * from batch_job_conversation where always_zero = 0
For each batch job I can use a different partition key. These tables will all hold roughly the same number of rows (a few thousand at most). The same rows will probably be overwritten many times.
Is it better to have a fixed value? Is there another way to handle this? I don't have a logical partition key I can use.
The second model would create a LARGE partition and you don't want that, trust me ;-)
(you would do a partition scan on top of a large partition, which is worse than the original full scan)
(another piece of advice: keep your partitions small and have a lot of them, then all your CPUs will be used fairly equally)
The first approach is OK, and is called a FULL SCAN, BUT
you need to manage it properly.
There are several ways; we blogged about it in https://www.scylladb.com/2017/02/13/efficient-full-table-scans-with-scylla-1-6/
and basically it boils down to divide and conquer.
Also note that Spark implements full scans too.
hth
L
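As a rough illustration of the divide-and-conquer idea, the full scan can be cut into token subranges that are queried separately (and, in a real job, in parallel). This is only a sketch under assumptions: it assumes the default Murmur3 partitioner with tokens from -2^63 to 2^63-1, split_token_range() and scan_queries() are hypothetical helpers, and the number of subranges is something you would tune for your cluster.

```python
MIN_TOKEN = -2**63      # assumed Murmur3 token range
MAX_TOKEN = 2**63 - 1

def split_token_range(n):
    """Split the full token range into n contiguous (start, end) subranges."""
    total = MAX_TOKEN - MIN_TOKEN
    bounds = [MIN_TOKEN + (total * i) // n for i in range(n)] + [MAX_TOKEN]
    return list(zip(bounds[:-1], bounds[1:]))

def scan_queries(table, pk, n):
    """Build one CQL query per subrange; together they cover the whole table."""
    queries = []
    for i, (start, end) in enumerate(split_token_range(n)):
        # Only the first subrange includes its lower bound, so no token is read twice
        lower = ">=" if i == 0 else ">"
        queries.append(
            f"SELECT * FROM {table} "
            f"WHERE token({pk}) {lower} {start} AND token({pk}) <= {end}"
        )
    return queries

queries = scan_queries("batch_job_conversation", "conversation_id", 4)
```

Each query touches only a slice of the ring, so the subranges can be handed to several workers instead of one coordinator streaming the entire table.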

Are secondary indices always a bad idea in Cassandra even if I specify them in conjunction with the partitioning key in all my queries?

I know that secondary indices in Cassandra are generally a bad idea because the index is stored locally on each node, i.e. not distributed across the cluster, which may result in a query scanning a huge number of nodes. However, I don't understand why they are still a bad idea if I always specify the partition key in my queries and only use the secondary index as a final filter. I've read that they don't scale with large amounts of data even if I specify the partition key. Is this true, and if so, why?
In general, secondary indexes are a bad idea, not only because of the distributed part, but also because of the index size and the number of distinct values: if the indexed field has very high or very low cardinality, you will spend time scanning many rows or many columns.
You can also run into other issues when dealing with tombstones.
To answer your question: a secondary index in Cassandra doesn't scale that well, but if you also supply the partition key, and thereby tell Cassandra which node has the data, it performs much better!
You can find more details in section F here:
https://www.datastax.com/blog/2016/04/cassandra-native-secondary-index-deep-dive
I hope this helps!
These guys have a nice write-up on the performance impacts of secondary indexes:
https://pantheon.io/blog/cassandra-scale-problem-secondary-indexes
The main impact (from the post) is that secondary indexes are local to each node, so to satisfy a query by indexed value, each node has to query its own records to build the final result set (as opposed to a primary key query, where it is known exactly which node needs to be queried). So there's not just an impact on writes, but on read performance as well.
Consider Cassandra on a ring of five machines, with a primary index of user IDs and a secondary index of user emails. If you were to query for a user by their ID, i.e. by their primary indexed key, any machine in the ring would know which machine has a record of that user. One query, one read from disk. However, to query a user by their email, i.e. by their secondary indexed value, each machine has to query its own record of users. One query, five reads from disk. By either scaling the number of users system-wide or scaling the number of machines in the ring, the noise-to-signal ratio increases and the overall efficiency of reading drops, in some cases to the point of timing out.
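The five-machine example can be mimicked with a toy model that counts how many nodes are read per lookup. It is a simplified sketch, not how Cassandra is implemented: Ring is a hypothetical class, placement by user_id modulo the node count stands in for the partitioner, and every node is assumed to be consulted for an email lookup.

```python
class Ring:
    """Toy ring: each node holds its own rows plus a local email index."""

    def __init__(self, node_count):
        self.rows = [{} for _ in range(node_count)]         # per node: user_id -> email
        self.email_index = [{} for _ in range(node_count)]  # per node: email -> user_id

    def owner(self, user_id):
        # Hypothetical placement; a real cluster uses the partitioner's token
        return user_id % len(self.rows)

    def insert(self, user_id, email):
        node = self.owner(user_id)
        self.rows[node][user_id] = email
        self.email_index[node][email] = user_id

    def find_by_id(self, user_id):
        # Primary key lookup: the owning node is known, so only one node is read
        node = self.owner(user_id)
        return self.rows[node].get(user_id), 1

    def find_by_email(self, email):
        # Secondary index lookup: every node has to check its own local index
        reads, found = 0, None
        for index in self.email_index:
            reads += 1
            if email in index:
                found = index[email]
        return found, reads

ring = Ring(node_count=5)
for uid in range(20):
    ring.insert(uid, f"user{uid}@example.com")

by_id = ring.find_by_id(7)
by_email = ring.find_by_email("user7@example.com")
```

In this model the second element of each result is the number of nodes read, which stays at one for the ID lookup and grows with the ring size for the email lookup.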
Please refer to the link below for a good explanation of secondary indexes.
https://dzone.com/articles/cassandra-scale-problem

Cassandra Query Performance: Using IN clause for one portion of the composite partition key

I currently have a table set up in Cassandra that has either text, decimal or date type columns with a composite partition key of a business_date and an account_number. For queries to this table, I need to be able to support look-ups for a single account, or for a list of accounts, for a given date.
Example:
select x,y,z from my_table where business_date = '2019-04-10' and account_number IN ('AAA', 'BBB', 'CCC')
// Note: both partition key columns are provided in this query
I've been struggling to resolve performance issues related to accessing this data because I'm noticing latency patterns that I am having trouble trying to understand / explain.
In many scenarios, the same exact query can be run a total of three times in a short period by the client application. For these scenarios, I see that two out of three requests will have really bad response times (800 ms), and one of them will have a really fast one (50 ms). At first I thought this would be due to key or row caches, however, I'm not so sure since I believe that if this were true, the third request out of the three should always be the fastest, which isn't the case.
The second issue I believed I was facing was the actual data model itself. Although the queries are being submitted with all the partition keys being provided, since it's an IN clause, the results would be separate partitions and can be distributed across the cluster and so, this would be a bad access pattern. However, I see these latency problems when even single account queries are run. Additionally, I see queries that come with 15 - 20 accounts performing really well (under 50ms), so I'm not sure if the data model is actually an issue.
Cluster setup:
Datacenters: 2
Number of nodes per data center: 3
Keyspace Replication: local_dc = 2, remote_dc = 2
Java Driver set:
Load-balancing: DCAware with LatencyAware
Protocol: v3
Queries are still set up to use "IN" clauses instead of async individual queries
Read_consistency: LOCAL_ONE
Does anyone have any ideas / clues of what I should be focusing on in terms of really identifying the root cause of this issue?
Using IN on the partition key is generally a bad idea, even for composite partition keys. The value of the partition key defines the location of your data in the cluster, and different partition key values will most probably put data on different servers. In that case the coordinator node (the one that received the query) has to contact every node that holds the data, wait for all of those nodes to deliver their results, and only then send the results back to you.
If you need to query several partition keys, it will be faster to issue individual queries asynchronously and collect the results on the client side.
Also, please note that the TokenAware policy works best when you use a PreparedStatement: in that case the driver is able to extract the value of the partition key and find out which server holds the data for it.
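The pattern of issuing one query per partition and collecting the results on the client can be sketched as below. This is an illustration only: run_query() is a hypothetical stand-in that returns a placeholder instead of talking to a cluster, and a thread pool is used here simply to show the fan-out and gather shape; a real client would use its driver's own asynchronous execution with a prepared statement.

```python
from concurrent.futures import ThreadPoolExecutor

def run_query(business_date, account_number):
    # Hypothetical stand-in for one single-partition query such as
    #   SELECT x,y,z FROM my_table WHERE business_date = ? AND account_number = ?
    # It returns a placeholder row instead of contacting a cluster.
    return {"business_date": business_date, "account_number": account_number}

def fetch_accounts(business_date, accounts, max_workers=8):
    """Issue one query per account concurrently and gather results client-side."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            account: pool.submit(run_query, business_date, account)
            for account in accounts
        }
        # Each future resolves independently, so one slow partition
        # does not hold back the others until the final gather
        return {account: future.result() for account, future in futures.items()}

rows = fetch_accounts("2019-04-10", ["AAA", "BBB", "CCC"])
```

Each request targets exactly one partition, so a token-aware driver can route it straight to a replica instead of making one coordinator fan out for the whole IN list.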

Cassandra : Batch write optimisation

I get a bulk write request for, let's say, some 20 keys from a client.
I can either write them to C* in one batch, or write them individually in an async way and wait on the futures for them to complete.
Writing in one batch does not seem to be a good option according to the documentation, as my insertion rate will be high, and if the keys belong to different partitions the coordinators will have to do extra work.
Is there a way in the DataStax Java driver to group keys that belong to the same partition, club them into small batches, and then do individual unlogged batch writes asynchronously? That way I make fewer RPC calls to the server, and at the same time the coordinator only has to write locally. I will be using the token aware policy.
Your idea is right, but there is no built-in way; you usually do that manually.
The main rule here is to use TokenAwarePolicy, so that some of the coordination happens on the driver side.
Then you can group your requests by equality of partition key, which is probably enough, depending on your workload.
What I mean by "grouping by equality of partition key" is that, for example, you have some data that looks like
MyData { partitioningKey, clusteringKey, otherValue, andAnotherOne }
Then, when inserting several such objects, you group them by MyData.partitioningKey. That is, for each existing partitioningKey value, you take all objects with that partitioningKey and wrap them in a BatchStatement. Now you have several BatchStatements, so just execute them.
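The grouping step itself is independent of the driver and can be sketched in a few lines. This is a minimal illustration in Python rather than the Java driver: MyData mirrors the example structure above with snake_case field names, and group_by_partition() is a hypothetical helper; each resulting group is what you would then wrap in one unlogged batch.

```python
from collections import defaultdict
from typing import NamedTuple

class MyData(NamedTuple):
    partitioning_key: str
    clustering_key: int
    other_value: str

def group_by_partition(rows):
    """Group rows so that each group contains a single partition key value."""
    batches = defaultdict(list)
    for row in rows:
        batches[row.partitioning_key].append(row)
    return dict(batches)

rows = [
    MyData("p1", 1, "a"),
    MyData("p2", 1, "b"),
    MyData("p1", 2, "c"),
]
batches = group_by_partition(rows)
# One batch per distinct partition key; every batch is a single-partition write
```

With this input there are two groups, one for "p1" with two rows and one for "p2" with one row, so the 3 inserts become 2 single-partition batches.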
If you wish to go further and mimic Cassandra's hashing, look at the cluster metadata via the getMetadata method of the com.datastax.driver.core.Cluster class; there is a getTokenRanges method, whose result you can compare with the output of Murmur3Partitioner.getToken (or of whichever partitioner you configured in cassandra.yaml). I've never tried that myself, though.
So I would recommend implementing the first approach and then benchmarking your application. I'm using that approach myself, and on my workload it works far better than without batches, let alone batches without grouping.
Logged batches should be used carefully in Cassandra because they impose additional overhead. It also depends on how the partition keys are distributed. If your bulk write targets a single partition, then using an unlogged batch results in a single insert operation.
In general, writing them individually in an async manner seems to be a good approach, as pointed out here:
https://medium.com/#foundev/cassandra-batch-loading-without-the-batch-the-nuanced-edition-dd78d61e9885
The article above links to sample code showing how to handle multiple async writes:
https://gist.github.com/rssvihla/26271f351bdd679553d55368171407be#file-bulkloader-java
https://gist.github.com/rssvihla/4b62b8e5625a805583c1ce39b1260ff4#file-bulkloader-java
EDIT:
Please also read this:
https://inoio.de/blog/2016/01/13/cassandra-to-batch-or-not-to-batch/#14
What does a single partition batch cost?
There’s no batch log written for single partition batches. The
coordinator doesn’t have any extra work (as for multi partition
writes) because everything goes into a single partition. Single
partition batches are optimized: they are applied with a single
RowMutation [10].
In a few words: single partition batches don’t put much more load on
the server than normal writes.
What does a multi partition batch cost?
Let me just quote Christopher Batey, because he has summarized this
very well in his post “Cassandra anti-pattern: Logged batches” [3]:
Cassandra [is first] writing all the statements to a batch log. That
batch log is replicated to two other nodes in case the coordinator
fails. If the coordinator fails then another replica for the batch log
will take over. [..] The coordinator has to do a lot more work than
any other node in the cluster.
Again, in bullets what has to be done:
serialize the batch statements
write the serialized batch to the batch log system table
replicate of this serialized batch to 2 nodes
coordinate writes to nodes holding the different partitions
on success remove the serialized batch from the batch log (also on the 2 replicas)
Remember that unlogged batches for multiple partitions are deprecated since Cassandra 2.1.6

Getting rid of confusion regarding NoSQL databases

This question is about NoSQL (for instance take cassandra).
1. Is it true that when you use a NoSQL database without data replication, you have no consistency concerns? Not even in the case of concurrent access?
2. What happens in case of a partition where the same row has been written in both partitions, possibly multiple times? When the partition is gone, which written value is used?
3. Let's say you use N=5 W=3 R=3. This means you have guaranteed consistency, right? How good is it to use this quorum? Isn't having 3 nodes return the data a big overhead?
4. Can you specify on a per-query basis in Cassandra whether you want the query to have guaranteed consistency? For instance, you do an insert query and you want to enforce that all replicas complete the insert before the value is returned by a read operation?
5. If you have: employees{PK:employeeID, departmentId, employeeName, birthday} and department{PK:departmentID, departmentName} and you want to get the birthday of all employees with a specific department name. Two problems:
you can't ask for all the employees with a given birthday (because you can only query on the primary key)
you can't join the employee and the department column families because joins are impossible.
So what you can do is create a column family:
departmentBirthdays{PK:(departmentName, birthday), [employees-whos-birthday-it-is]}
In that case, whenever an employee is fired/hired, it has to be removed from/added to the departmentBirthdays column family. Is this process something you have to do manually? So you have to manually create queries to update all redundant/denormalized data?
I'll answer this from the perspective of Cassandra, because that's what you seem to be looking at (hardly any two NoSQL stores are the same!).
1. For a single node, all operations are in sequence. Concurrency issues can be orthogonal though: your web client may have made a request, and then another, but due to network load Cassandra got the second one first. That may or may not be an issue. There are approaches around such problems, like immutable data. You can also leverage "lightweight transactions".
2. Cassandra uses last write wins to resolve conflicts. Depending on your replication factor and the consistency level of your query, this can work well.
3. Quorum for reads AND writes will give you consistency. There is an edge case: if the coordinator doesn't know a quorum node is down, it sends the write requests, and the write would then complete when quorum is re-established. The client in this case would get a timeout and not a failure. The subsequent query may get stale data, but any query after that will get the latest data. This is an extreme edge case, and typically N=5, R=3, W=3 will give you full consistency. Reading from three nodes isn't actually that much of an overhead. For a query with R=3, the client makes the request to the node it's connected to (the coordinator). The coordinator queries the replicas in parallel (not sequentially). It then merges the results with LWW to get the final result (and issues read repairs etc. if needed). As the queries happen in parallel, the overhead is greatly reduced.
4. Yes.
5. This is a matter of data modelling. You describe one approach (though partitioning on birthday rather than department might be better and result in a more even distribution of partitions). Do you need the employee and department tables, i.e. are they needed for other queries? If not, maybe you just need one. If you denormalize, you'll need to maintain the data manually. In Cassandra 3.0, global indexes will allow you to query on an index without being inefficient (which is the case when using a secondary index without specifying the partition key today). Yet another option is to partition employees by birthday, do two queries, and do the join in memory in the client. Cassandra queries hitting a single partition are very fast, so doing two won't really be that expensive.
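The quorum reasoning above (N=5, R=3, W=3) comes down to read and write sets overlapping in at least one replica, which holds whenever R + W > N, followed by a last-write-wins merge of the replies. A small sketch of both steps, with hypothetical helper names and integer timestamps standing in for Cassandra's write timestamps:

```python
def quorum_overlaps(n, r, w):
    """True if every read set of size r shares a replica with every write set of size w."""
    return r + w > n

def last_write_wins(replies):
    """replies: list of (timestamp, value) pairs; return the newest value."""
    return max(replies, key=lambda reply: reply[0])[1]

# N=5, R=3, W=3: 3 + 3 > 5, so a read always reaches a replica that saw the write
consistent = quorum_overlaps(5, 3, 3)

# Three replicas answer in parallel; one is stale, the newest timestamp is kept
value = last_write_wins([(100, "old"), (250, "new"), (250, "new")])
```

With R=2 and W=2 on the same five replicas the sets may miss each other (2 + 2 is not greater than 5), which is why lower consistency levels can return stale data.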

Resources