Performance while writing events into a Cassandra table

Query 1: Event data from devices is stored in a Cassandra table, so this is time series data. If we need to store older-dated events (cached on the device because of some issue) at the current time, will we run into a performance issue? If so, how can we avoid it?
Query 2: Is it good practice to write each event into the Cassandra table as soon as it comes in? Or should we queue events for some time and write several in one go, if that improves Cassandra write performance significantly?

Q1: This all depends on the table design. Usually this shouldn't be an issue, but it may depend on your access patterns and compaction strategy. If you have a table structure, please share it.
Q2: Individual writes shouldn't be a problem, but it really depends on your throughput requirements. If you write several data points that belong to the same partition key, you can potentially use unlogged batches; in that case Cassandra performs only one write for all the inserts in the batch. Please read this document.
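To make the "same partition key" condition concrete, here is a minimal Python sketch that groups queued events by partition key before they are written, so that each group could be sent as one unlogged batch. It assumes device_id is the partition key; the function name, the event tuples, and the max_batch_size cap are all hypothetical, and no driver calls are shown.

```python
from collections import defaultdict
from itertools import islice


def group_by_partition(events, key_fn, max_batch_size=50):
    """Yield (partition_key, rows) pairs; every rows list belongs to one partition."""
    by_partition = defaultdict(list)
    for event in events:
        by_partition[key_fn(event)].append(event)

    for partition_key, rows in by_partition.items():
        it = iter(rows)
        # Cap the batch size (an assumed limit) so no single batch grows without bound.
        while chunk := list(islice(it, max_batch_size)):
            yield partition_key, chunk


# Hypothetical events: (device_id, event_time, payload); device_id is the assumed partition key.
events = [
    ("dev-1", 1, "a"),
    ("dev-2", 1, "b"),
    ("dev-1", 2, "c"),
    ("dev-1", 3, "d"),
]
batches = list(group_by_partition(events, key_fn=lambda e: e[0], max_batch_size=2))
```

Each yielded group would map to one unlogged batch; events for different partitions stay in separate groups.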

Related

Proper Consistency Level to read 'everything'

I'm creating a sync program to periodically copy our Cassandra data into another database. The database I'm copying from only gets INSERTs - data is never UPDATEd or DELETEd. I would like to address Cassandra's eventual consistency model in two ways:
1 - Each sync scan overlaps the last by a certain time span. For example, if the scan happens every hour, then each scan looks an hour and a half backwards. The data contains a unique key, so reading the same record in more than one scan is not an issue.
2 - I use a Consistency level of ALL to ensure that I'm scanning all of the nodes on the cluster for the data.
Is ALL the best Consistency for this situation? I just need to see a record on any node, I don't care if it appears on any other nodes. But I don't want to miss any INSERTed records either. But I also don't want to experience timeouts or performance issues because Cassandra is waiting for multiple nodes to see that record.
To complicate this a bit more, this Cassandra network is made up of 6 clusters in different geographic locations. I am only querying one. My assumption is that the overlap mentioned in #1 will eventually catch up records that exist on other clusters.
The query I'm doing is like this:
SELECT ... FROM transactions WHERE userid=:userid AND transactiondate>:(lastscan-overlap)
Where userid is the partitioning key and transactiondate is a clustering column. The list of userids is sourced elsewhere.
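The overlapping scan described above can be sketched in Python as follows. This is only an illustration of the window arithmetic and of de-duplicating by the unique key; the "key" field and both function names are hypothetical, and the actual query execution is left out.

```python
from datetime import datetime, timedelta


def scan_window_start(last_scan, overlap):
    """Lower bound for the transactiondate filter: previous scan time minus the overlap."""
    return last_scan - overlap


def merge_scan(seen_keys, rows):
    """Return only rows whose unique key was not copied before, and remember them."""
    fresh = []
    for row in rows:
        if row["key"] not in seen_keys:
            seen_keys.add(row["key"])
            fresh.append(row)
    return fresh


start = scan_window_start(datetime(2024, 1, 1, 12, 0), timedelta(minutes=30))

seen = set()
first = merge_scan(seen, [{"key": "t1"}, {"key": "t2"}])
# The second scan re-reads t2 because of the overlap; only t3 is new.
second = merge_scan(seen, [{"key": "t2"}, {"key": "t3"}])
```

In practice the target database's unique key may already make the re-read rows harmless, in which case the in-memory set is unnecessary.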
"I use a Consistency level of All to ensure that I'm scanning all of the nodes on the cluster for the data"
So consistency ALL has more to do with the number of data replicas read than it does with the number of nodes contacted. If you have a replication factor (RF) of 3 and query a single row at ALL, then Cassandra will hash your partition key to figure out the three nodes responsible for that row, contact all 3 nodes, and wait for all 3 to respond.
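As a rough illustration of how the consistency level translates into the number of replica responses the coordinator waits for, here is a small Python sketch. It assumes a single data center and covers only ONE, QUORUM and ALL; the function name is hypothetical.

```python
def replicas_required(consistency_level, replication_factor):
    """Number of replica responses awaited for a read at the given level."""
    if consistency_level == "ONE":
        return 1
    if consistency_level == "QUORUM":
        # A quorum is a strict majority of the replicas.
        return replication_factor // 2 + 1
    if consistency_level == "ALL":
        return replication_factor
    raise ValueError(f"unsupported consistency level: {consistency_level}")


for level in ("ONE", "QUORUM", "ALL"):
    print(level, replicas_required(level, 3))
```

With RF=3 this gives 1, 2 and 3 responses respectively, regardless of how many nodes the cluster has.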
"I just need to see a record on one node"
So I think you'd be fine with LOCAL_ONE, in this regard.
The only possible advantage of using ALL is that it helps to enforce data consistency by triggering a read repair 100% of the time. So if eventual consistency is a concern, that's a "plus." But *_ONE is definitely faster.
"The CL documentation talks a lot about 'stale data', but I am interested in 'new data'"
In your case, I don't see stale data as a possibility, so you should be OK there. The issue you would face instead is that, if one or more replicas failed during the write operation, a query at LOCAL_ONE may or may not hit the only replica that actually has the row. So your data wouldn't be stale vs. new, it'd be exists vs. does not exist. One point I talk about in the linked answer is that writing at a higher consistency level and reading at LOCAL_ONE might work for your use case.
A few years ago, I wrote an answer about the different consistency levels, which you might find helpful in this case:
If lower consistency level is good then why we need to have a higher consistency(QUORUM,ALL) level in Cassandra?

Is RDBMS with redundancy as good as nosql dbs?

I am reading about NoSQL DBs (specifically Cassandra), and it says that Cassandra is faster for writing and that queries are fast as well. Schema design is based more on the queries than on the data. For example, you have queries like in this example
Then I have a question: suppose I design the RDBMS schema in a way similar to Cassandra's and ensure that no joins are required for queries. Will I still get any significant performance gains by using Cassandra (NoSQL DBs)?
There is no exact answer, but here are a few points:
JOIN is just one of many things. Cassandra stores the data physically based on the partition keys, which makes reads by partition as fast as possible.
On the performance side: it's not about the performance at the beginning but about keeping the performance consistent over a period of time. Say, for example, you have a time-series-like requirement where data is inserted every second. RDBMS performance will usually degrade as the data grows, and it is not easy to keep the indexes and stats up to date, while Cassandra fits a time series pattern better, and as the data grows it is easy to scale out by adding nodes.
On the write performance: Cassandra's write workflow itself is different and is designed to accept writes quickly (the complicated processes such as merging SSTables and compaction happen in the background without affecting the actual write).
In short, you need to review the business case and make a decision.

How to Create slowness in Cassandra?

I want to create slowness in Cassandra to test my application. Are there any specific ways to induce slowness in Cassandra? In an RDBMS we use locking, so that other operations wait until the lock is released. As Cassandra doesn't have locking, is there any other way to create deadlock, slowness, etc.?
You could use the cassandra-stress tool.
You could check out our project, simulacron: https://github.com/datastax/simulacron
This is a C*/DSE simulator that was written specifically to test things like race conditions and error conditions. You would have to prime all your relevant queries ahead of time, but it would allow you to introduce a wait time, or errors, in your responses. You can also simulate a large cluster on your local machine.
There is also a similar tool called scassandra, which does much of the same thing.
http://www.scassandra.org/
There are many ways to do it; I'll list two:
Create a UDF with a sleep/wait function in it, if your version of Cassandra supports it.
Link to the docs:
https://docs.datastax.com/en/cql/3.3/cql/cql_using/useCreateUDF.html
Create a large table (the larger it is, the slower the query will run), and run:
select some_column from table where other_column = 'something' allow filtering;
where other_column is not a partition key of the table. This results in a full table scan, and since Cassandra isn't built for that, it will take some time (and also I/O and CPU).
It may be easier to just limit the network on the nodes. Depending on the OS you're using, there are different options.

Storing binary blobs in Cassandra

I am building a simple HTTP service, that stores arbitrary binary objects. The service is backed by Cassandra. It is a simplified version of Amazon's S3. The system must withstand a heavy write load and should be highly available on the write and read path.
The stored data is kind of immutable. It can be deleted, but it cannot be updated. Therefore, data inconsistency is not an issue. The datastore must be able to efficiently expire old data.
The service uses Netflix's Astyanax library, which provides a recipe for storing (large) binary objects in Cassandra.
I see two solutions to tackle the problem, both of which have pros and cons. It is hard for me to estimate which way fits Cassandra better.
Single table with TTL
Astyanax automatically chunks large objects into small pieces and stores them into a single table. A TTL is assigned to each blob to expire it after a certain period of time. A compaction run removes blobs, when the TTL is expired.
This solution works and is pretty straightforward to implement. I started using the SizeTieredCompactionStrategy, but I think that DateTieredCompactionStrategy might be the better choice when dealing with TTL data.
My main concern is: can Cassandra's compaction keep up? Has anyone experience with a similar use case?
Sharding data by time
Another approach would be to shard the data by time. I could create a table for each day and store the chunks in that table. In this case I can drop the complete table to get rid of the expired data.
This solution requires a little more effort in the implementation, but simplifies and probably speeds up the deletion of expired data.
How performant is Cassandra in dropping a table?
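A minimal Python sketch of the table-per-day idea, to show what the bookkeeping might look like. The "blobs_" prefix, the YYYYMMDD suffix, and both function names are assumptions made for illustration; issuing the actual CREATE TABLE and DROP TABLE statements is left out.

```python
from datetime import date, timedelta


def table_for_day(day, prefix="blobs_"):
    """Name of the (hypothetical) daily table that stores chunks written on `day`."""
    return f"{prefix}{day:%Y%m%d}"


def expired_tables(today, retention_days, existing_tables, prefix="blobs_"):
    """Daily tables strictly older than the retention cutoff, oldest first."""
    cutoff = table_for_day(today - timedelta(days=retention_days), prefix)
    # Names compare chronologically because the suffix is zero-padded YYYYMMDD.
    return sorted(t for t in existing_tables if t.startswith(prefix) and t < cutoff)


existing = ["blobs_20240309", "blobs_20240301", "blobs_20240303"]
to_drop = expired_tables(date(2024, 3, 10), 7, existing)
```

Each name in to_drop would then be dropped as a whole, which is the step that replaces per-row TTL expiry in this approach.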
The correct option for your scenario is DateTieredCompactionStrategy with a TTL assigned to each blob.
Refer:
http://www.datastax.com/dev/blog/datetieredcompactionstrategy

Getting rid of confusion regarding NoSQL databases

This question is about NoSQL (take Cassandra, for instance).
Is it true that when you use a NoSQL database without data replication you have no consistency concerns? Not even in the case of concurrent access?
What happens in case of a partition where the same row has been written in both partitions, possibly multiple times? When the partition is gone, which written value is used?
Let's say you use N=5 W=3 R=3. This means you have guaranteed consistency, right? How good is it to use this quorum? Isn't having 3 nodes return the data a big overhead?
Can you specify on a per-query basis in Cassandra whether you want the query to have guaranteed consistency? For instance, you do an insert query and you want to enforce that all replicas complete the insert before the value is returned by a read operation?
If you have: employees{PK:employeeID, departmentId, employeeName, birthday} and department{PK:departmentID, departmentName} and you want to get the birthday of all employees with a specific department name. Two problems:
you can't ask for all the employees with a given birthday (because you can only query on the primary key)
You can't join the employee and the department column families because joins are impossible.
So what you can do is create a column family:
departmentBirthdays{PK:(departmentName, birthday), [employees-whos-birthday-it-is]}
In that case whenever an employee is fired/hired it has to be removed/added in the departmentBirthdays column family. Is this process something you have to do manually? So you have to manually create queries to update all redundant/denormalized data?
I'll answer this from the perspective of Cassandra, coz that's what you seem to be looking at (hardly any two NoSQL stores are the same!).
For a single node, all operations are in sequence. Concurrency issues can be orthogonal though: your web client may have made a request, and then another, but due to network load, Cassandra got the second one first. That may or may not be an issue. There are approaches around such problems, like immutable data. You can also leverage "lightweight transactions".
Cassandra uses last write wins to resolve conflicts. Depending on your replication factor and the consistency level of your query, this can work well.
Quorum for reads AND writes will give you consistency. There is an edge case: if the coordinator doesn't know that a quorum node is down, it sends the write requests anyway, and the write only completes once quorum is re-established. The client in this case would get a timeout and not a failure. The subsequent query may get the stale data, but any query after that will get the latest data. This is an extreme edge case, and typically N=5, R=3, W=3 will give you full consistency. Reading from three nodes isn't actually that much of an overhead. For a query with R=3, the client makes that request to the node it's connected to (the coordinator). The coordinator queries the replicas in parallel (not sequentially). It merges the results with LWW to get the result (and issues read repairs etc. if needed). As the queries happen in parallel, the overhead is greatly reduced.
Yes.
This is a matter of data modelling. You describe one approach (though partitioning on birthday rather than dept might be better and result in a more even distribution of partitions). Do you need the employee and department tables? Are they needed for other queries? If not, maybe you just need one. If you denormalize, you'll need to maintain the data manually. In Cassandra 3.0, global indexes will allow you to query on an index without being inefficient (which is the case when using a secondary index without specifying the partition key today). Yet another option is to partition employees by birthday, do two queries, and do the join in memory in the client. Cassandra queries hitting a partition are very fast, so doing two won't really be that expensive.
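The quorum arithmetic and the last-write-wins merge mentioned in this answer can be illustrated with a short Python sketch. It is a simplified model, not Cassandra's implementation: the function names are hypothetical, replica responses are plain (write_timestamp, value) tuples, and timestamp ties are not modeled.

```python
def read_write_sets_overlap(n, r, w):
    """True when every read set of r replicas must intersect every write set of w replicas."""
    return r + w > n


def last_write_wins(replica_values):
    """Pick the value with the highest write timestamp among the replica responses."""
    return max(replica_values, key=lambda tv: tv[0])[1]


# N=5, R=3, W=3 from the question: any 3 readers share at least one node with any 3 writers.
consistent = read_write_sets_overlap(5, 3, 3)

# Three hypothetical replica responses; one replica has already seen the newer write.
merged = last_write_wins([(10, "old"), (12, "new"), (10, "old")])
```

With R=2 and W=3 on N=5 the check fails, because a read could land on the two replicas that missed the write.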

Resources