We are exploring using Cassandra to store time-series data, so this may be somewhat of a noob question. One of the use cases is to read data from a Kafka stream, look for matches, and increment a counter (e.g. 5 customers have clicked through link alpha on page beta, so increment (beta, alpha) by 5). However, we expect to need a very wide degree of parallelism to keep up with the load, so there may be more than one consumer reading from Kafka at the same time.
My question is: How would Cassandra resolve multiple simultaneous writes to a given counter from multiple sources?
It's my understanding that multiple writes to the counter with different timestamps will be added to the counter in timestamp order. However, if two writes arrived with the exact same timestamp, would Cassandra's LWW model throw out one of those counter increments?
If we were to have a large cluster (100+ nodes), writes at ALL or QUORUM may not be sufficiently performant to keep up with the message traffic. Writes at THREE would seem likely to result in a situation where process #1 writes to nodes A, B, and C, but process #2 writes to X, Y, and Z. Would LWTs work here, or do they not play well with counter activity?
I would build a proof of concept and benchmark it; it will most likely work just fine. Counters are not super performant in Cassandra, though, especially if there will be a lot of contention.
Counters are not like normal writes with a simple LWW; the replica applying the increment takes a lock on the counter partition and does a read-before-write, backed by specialized caches. The partition lock contention will slow things down some, and the read-before-write is an expensive multi-hop process compared to a plain write.
Use QUORUM; don't try to do something funky with CLs for counters, especially before benchmarking to know whether you need it. A 100-node cluster should be able to handle a lot as long as you're not trying to update the same partitions constantly.
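For illustration, here is a minimal sketch of that kind of counter increment at QUORUM, using the DataStax Java driver 4.x; the keyspace, table, and column names (clicks.link_counters) are made up for the example.

```java
// Assumed (hypothetical) schema:
// CREATE TABLE clicks.link_counters (
//     page text, link text, hits counter,
//     PRIMARY KEY (page, link));

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.DefaultConsistencyLevel;
import com.datastax.oss.driver.api.core.cql.SimpleStatement;

public class ClickCounter {
    private final CqlSession session = CqlSession.builder().build();

    // Counter increments commute, so concurrent consumers updating the same
    // (page, link) through different coordinators add up rather than overwrite.
    public void recordClicks(String page, String link, long delta) {
        SimpleStatement stmt = SimpleStatement.builder(
                "UPDATE clicks.link_counters SET hits = hits + ? WHERE page = ? AND link = ?")
            .addPositionalValues(delta, page, link)
            .setConsistencyLevel(DefaultConsistencyLevel.QUORUM)
            .build();
        session.execute(stmt);
    }
}
```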
I have trouble understanding the new features of Hazelcast 5.0. What is the main difference between these data structures? As I understand it, a PNCounter is a counter that replicates its data, and when there are no more updates the replicas converge; I want to understand how Hazelcast's PNCounter controls concurrency.
Is the maximum replica count related to the maximum number of nodes running in the Hazelcast cluster?
How does it work internally? I need to understand this because I'm working with counters that track the activity of several clients; we create 1000 or more PNCounters for different activities, because I don't know whether a single PNCounter would cope.
Does the client know which counter it needs to connect to, or does the counter follow a certain routing logic? I don't understand this feature, and I really want to know the difference between PNCounter and IAtomicLong.
To me it looks like an IAtomicLong with the added feature that it can replicate.
It all boils down to the CAP Theorem.
In summary, out of Consistency, Availability and Partition tolerance, you can pick two out of three. And since Hazelcast is distributed by nature, partitions are a given, so your choice is between Consistency and Availability.
IAtomicLong is a member of CP Subsystem API
-- https://docs.hazelcast.com/imdg/4.2/data-structures/iatomiclong
A Conflict-free Replicated Data Type (CRDT) is a distributed data structure that achieves high availability by relaxing consistency constraints. There may be several replicas for the same data and these replicas can be modified concurrently without coordination. This means that you may achieve high throughput and low latency when updating a CRDT data structure. On the other hand, all of the updates are replicated asynchronously.
-- https://docs.hazelcast.com/imdg/4.2/data-structures/pn-counter
In summary, IAtomicLong sacrifices Availability for Consistency. The result will always be correct, but it might not always be available.
PNCounter makes the opposite trade-off. It's always available (depending on the number of nodes in the cluster, of course), but it's only eventually consistent, as it's replicated asynchronously.
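A minimal sketch contrasting the two, assuming the Hazelcast 5.x Java API; the member setup, structure names, and deltas are just placeholders.

```java
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.cp.IAtomicLong;
import com.hazelcast.crdt.pncounter.PNCounter;

public class CounterComparison {
    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();

        // CRDT counter: the update is applied on one replica and replicated
        // asynchronously, so it stays available but is only eventually consistent.
        PNCounter activity = hz.getPNCounter("client-42-activity");
        activity.addAndGet(5);
        System.out.println("approximate count = " + activity.get());

        // CP data structure: every update goes through the CP Subsystem
        // (Raft consensus), so the value is linearizable, but calls can block
        // or fail if the CP group loses its majority.
        IAtomicLong strict = hz.getCPSubsystem().getAtomicLong("client-42-strict");
        strict.incrementAndGet();
        System.out.println("exact count = " + strict.get());

        hz.shutdown();
    }
}
```

So the choice between them is essentially the one described above: PNCounter for throughput and availability, IAtomicLong when every read must reflect every prior update.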
I am doing some research to find a way to store a huge volume of data (temporarily, until consumers process the messages) from various message producers. The data comes from different sources, say HTTP, FTP, SMPP or file upload, and each type may have tens or hundreds of instances creating messages. The volume of messages produced can grow so large that the consumers lag behind, since processing a message can take a long time. The system currently uses RabbitMQ in some parts, but its performance drops when a huge backlog of unconsumed messages builds up (I'm also looking into improving that, but that's a separate issue). As an alternative, I am looking at Apache Kafka, which uses the disk for persisting messages.
Reading through many articles on the internet, I found some that talk about Apache Cassandra having very fast writes, processing a million inserts per second with a similar read volume. I was astonished and tried to find some leads on using Cassandra for my case, but with no clear results.
Assuming I have a large number of message producers, can a Cassandra cluster handle inserts (in batches) fast enough (overall high throughput) that the producers are not throttled?
I am sure some of you have used Cassandra for this or a similar use case; please share your experiences. (I am happy to provide more information if this does not suffice.)
Yes! Cassandra can handle writes very effectively. But in my experience, using it as a messaging system (a queue and the like) brings some technical constraints because of tombstones.
Cassandra doesn't remove deleted rows immediately; it marks them with a tombstone to be garbage-collected later. Over time, if there are a lot of deletions (e.g. dequeued messages), overall performance will suffer, and quite quickly.
You can go for Cassandra, but you will have to implement workarounds for the tombstone problem (time buckets, multiple status tables).
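As a rough illustration of the time-bucket idea (a sketch only, assuming the DataStax Java driver 4.x; the keyspace/table, hourly bucketing, and TTL are made-up placeholders):

```java
// Hypothetical schema: old buckets expire via TTL instead of per-row deletes,
// which keeps tombstones off the hot read path.
// CREATE TABLE queue.messages (
//     bucket text, id timeuuid, payload blob,
//     PRIMARY KEY (bucket, id)) WITH default_time_to_live = 86400;

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.uuid.Uuids;
import java.nio.ByteBuffer;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class BucketedQueue {
    private static final DateTimeFormatter HOUR =
        DateTimeFormatter.ofPattern("yyyy-MM-dd-HH").withZone(ZoneOffset.UTC);
    private final CqlSession session = CqlSession.builder().build();

    // Producers append to the partition for the current hour; consumers read
    // one bucket at a time and simply move on to the next bucket when done,
    // instead of deleting individual messages.
    public void enqueue(byte[] payload) {
        session.execute(
            "INSERT INTO queue.messages (bucket, id, payload) VALUES (?, ?, ?)",
            HOUR.format(Instant.now()), Uuids.timeBased(), ByteBuffer.wrap(payload));
    }
}
```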
IMHO, Apache Kafka is much more appropriate to the messaging use case and can also be scaled massively.
I'm asking this question because I would like to understand how I can run RethinkDB better, meaning: what kind of hardware it should run on, what kind of filesystem it should run on, and what other system configuration would maximize its throughput.
I'm trying to fill a table as fast as I can with documents that are {"n": <counter>, "rand": <Math.random()>}. I read somewhere that this is faster with batches of 200 documents, so that's what I'm inserting. I am also using soft durability. I started one Node.js process doing this and it inserts on average 10k documents per second, which is pretty good.
But while this is happening, rethinkdb is using about 70% of one core (I have 8 virtual cores, it's an i7-4770) and the nodejs process is using 5%. So it seems that CPU is not the bottleneck.
As soon as I start another nodejs process doing the same thing, the inserts per second on both processes drop to about 4k-5k. Again, the CPU load keeps the same.
I fired up iotop and I do see a lot of action there, but not what I expected. I configured two SSDs in a RAID0, and a quick dd test says I can write and read at about 800 MBps. That's far above the actual read and write speeds iotop reports (average read ~14 MBps, average write ~50 MBps).
So how can I exhaust my machine's resources? What does RethinkDB need to run faster? Why doesn't it use more resources and achieve higher throughput?
More information on what it's running on: It's an EX40SSD from Hetzner, two SSDs in a software RAID0, ext4 filesystem (tomorrow I'll try to mount it with noatime to see if it's better). The rethinkdb configuration is everything by default, the inserts are done to a table that has only one shard and one replica. Please feel free to ask anything else relevant I might have forgotten to mention.
Thanks in advance.
What I suspect is going on here is lock contention on the actual btrees. When you're inserting a large batch of documents, the system grabs various parts of the btree in parallel to update it with the new documents. This is a set of read-write locks -- other parts of the system can still read, but if you insert another large batch in parallel, there is a high probability that it will touch similar parts of the btree, and therefore has to wait for the system to start unlocking as it inserts parts of the first batch. (This isn't specific to RethinkDB, but a problem in databases in general) This is likely why you're not hitting 100% CPU/disk throughput.
There are a few things you can try, but note that there are subtleties to various approaches. Benchmarking in general is hard.
You can try to shard the table into 32 shards and retry your benchmark. You don't actually have to create a cluster, you can shard into 32 shards on a single machine. This will result in multiple btrees, so you'll minimize contention and will be able to use more system resources. Note that while this will likely increase the throughput, increasing the number of shards also slightly increases the latency, so you might need to significantly increase parallelism before you start seeing throughput increases.
You can try not to batch writes and instead write one document at a time (which generally approximates real-world use cases a little better). Then, start hundreds of parallel clients instead of just one or two, and have all of them write one document at a time in parallel (see the sketch after this list). Note that you need to make sure the clients themselves aren't a bottleneck in this situation.
You can try to rerun your benchmark and also spin up clients that read from the database in parallel with the writes. In RethinkDB reads can usually go through even when you're writing to a particular document, so this will give you the opportunity to up CPU usage and bypass contention.
Pay attention to the IDs of the documents. If the database is large enough (say, millions of documents), and the IDs you're inserting are random, you're much less likely to touch the same parts of the btree so contention becomes less of an issue.
You can combine various approaches (sharding, reading + writing, various numbers of concurrent clients) to start getting a sense for how the database behaves in various scenarios.
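As a rough sketch of the "hundreds of parallel clients, one document at a time" suggestion (using the RethinkDB Java driver for illustration; the host, table name, client count, and document count are placeholders):

```java
import com.rethinkdb.RethinkDB;
import com.rethinkdb.net.Connection;

public class ParallelInsertBench {
    private static final RethinkDB r = RethinkDB.r;

    public static void main(String[] args) throws InterruptedException {
        int clients = 100;                        // scale this up gradually
        Thread[] workers = new Thread[clients];
        for (int c = 0; c < clients; c++) {
            workers[c] = new Thread(() -> {
                // One connection per client so the clients themselves
                // don't become the bottleneck.
                try (Connection conn = r.connection().hostname("localhost").port(28015).connect()) {
                    for (int i = 0; i < 100_000; i++) {
                        r.table("bench")
                         .insert(r.hashMap("n", i).with("rand", Math.random()))
                         .optArg("durability", "soft")   // matches the original test
                         .run(conn);
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
            workers[c].start();
        }
        for (Thread t : workers) t.join();
    }
}
```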
Note that there might be things going on that you wouldn't normally be aware of. For example, RethinkDB has a log-structured storage engine that does live compaction on disk, and this might use up some IO (and CPU) cycles that you'd be surprised by if you didn't know about live compaction. There are dozens of other components like this that might compound to surprising behavior, as these systems are typically very complex under the hood.
Hope this helps -- would love to hear about your progress on the benchmarks. We do a lot of them internally, and it's an art and a science to discover the boundaries of the system's performance on different use cases.
My guess is that the bottleneck here is the disk system, but not its throughput. What's more likely is that writes are happening in chunks that are too small to be efficient, or that there are delays due to latency between individual writes.
It's also possible that the latency between individual write queries coming from the client and their processing on the server slows the system down.
Here are a few things I recommend trying:
Increase the batch size further. Your documents are very small, so I think you might get significantly higher throughput with batches of 1,000-10,000 documents (there is a sketch of this after the list). This might work especially well in combination with the next point.
Run multiple concurrent clients. You mentioned that you have run 2 clients concurrently, but this might not be enough. I recommend running 16-32 if possible.
Check the cache size RethinkDB is using. By default, RethinkDB picks the cache size as a fraction of the available memory, but that is not always reliable. I recommend passing a --cache-size <MB> parameter to RethinkDB (or adding the cache-size=<MB> parameter to the configuration file, if you're using one). I can see that your server has 32 GB of RAM. I recommend using in the range of 20000 MB (or even more) as the cache size. A larger cache reduces the number of reads, but up to a certain limit also increases the amount of unsaved data that RethinkDB can accumulate in RAM to make disk writes more efficient.
Experiment with the --io-threads <THREADS> parameter. The default is 64, but you can try increasing it to e.g. 128 and see if it has any effect.
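Here is a small sketch of the larger-batch suggestion, using the RethinkDB Java driver for illustration; the table name, batch size, and number of batches are placeholders.

```java
import com.rethinkdb.RethinkDB;
import com.rethinkdb.net.Connection;
import java.util.ArrayList;
import java.util.List;

public class BatchInsertBench {
    private static final RethinkDB r = RethinkDB.r;

    public static void main(String[] args) {
        try (Connection conn = r.connection().hostname("localhost").port(28015).connect()) {
            int batchSize = 5_000;   // vs. the original 200
            long n = 0;
            for (int b = 0; b < 200; b++) {
                // Build one large batch and send it as a single insert call.
                List<Object> batch = new ArrayList<>(batchSize);
                for (int i = 0; i < batchSize; i++) {
                    batch.add(r.hashMap("n", n++).with("rand", Math.random()));
                }
                r.table("bench").insert(batch).optArg("durability", "soft").run(conn);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
```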
I have a Cassandra question. Do you know how Cassandra does updates/increments of counters?
I want to use a Storm bolt (CassandraCounterBatchingBolt from the storm-contrib repo on GitHub) which writes into Cassandra. However, I'm not sure how some of the implementation of the incrementCounterColumn() method works, and there are also the limitations of Cassandra counters (from: http://wiki.apache.org/cassandra/Counters) which make them useless for my scenario IMHO:
If a write fails unexpectedly (timeout or loss of connection to the coordinator node) the client will not know if the operation has been performed. A retry can result in an over count CASSANDRA-2495.
Counter removal is intrinsically limited. For instance, if you issue very quickly the sequence "increment, remove, increment" it is possible for the removal to be lost
Anyway, here is my scenario:
I update the same counter faster than the updates propagate to other Cassandra nodes.
Example:
Say I have 3 cassandra nodes. The counters on each of these nodes are 0.
Node1:0, node2:0, node3:0
An increment of 5 arrives. It starts at node2 and still needs to propagate to node1 and node3:
Node1:0, node2:5, node3:0
In the meantime, another increment of 3 arrives before the previous one has propagated. Assuming the 3 starts at a different node than the 5 did, we have:
Node1:3, node2:5, node3:0
Now if the 3 gets propagated to the other nodes AS AN INCREMENT and not as a new value (and the same for the 5), then eventually all nodes will equal 8, and this is what I want.
If the 3 overwrites the 5 (because it has a later timestamp), that is problematic – not what I want.
Do you know how these updates/increments are handled by Cassandra?
Note that a read before a write is still susceptible to the same problem, depending on which replica node the read executes against (QUORUM can still fail if propagation is not far enough along).
I'm also thinking that putting a cache between my Storm bolt and Cassandra might solve this issue, but that's a story for another time.
Counters in C* have a complex internal representation that avoids most (but not all) problems of counting things in a leaderless distributed system. I like to think of them as sharded counters. A counter consists of a number of sub-counters identified by host ID and a version number. The host that receives the counter operation increments only its own sub-counter, and also increments the version. It then replicates its whole counter state to the other replicas, which merge it with their states. When the counter is read the node handling the read operation determines the counter value by summing up the total of the counts from each host.
On each node a counter increment is just like everything else in Cassandra, just a write. The increment is written to the memtable, and the local value is determined at read time by merging all of the increments from the memtable and all SSTables.
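If it helps, here is a toy model of that idea; this is not Cassandra's actual code, just an illustration of per-host sub-counters that are merged by version and summed at read time.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Toy sketch of a "sharded" counter: one sub-counter per host ID.
class ShardedCounter {
    // hostId -> { version, count }
    private final Map<UUID, long[]> shards = new HashMap<>();

    // A host only ever bumps its own shard, then ships its whole state to peers.
    void incrementLocal(UUID hostId, long delta) {
        long[] shard = shards.computeIfAbsent(hostId, id -> new long[]{0L, 0L});
        shard[0]++;          // version
        shard[1] += delta;   // this host's sub-count
    }

    // Merging keeps, for each host, the shard with the higher version,
    // so replaying the same replicated state is harmless.
    void merge(ShardedCounter other) {
        other.shards.forEach((host, theirs) -> shards.merge(host, theirs.clone(),
            (mine, incoming) -> incoming[0] > mine[0] ? incoming : mine));
    }

    // A read sums the sub-counts from every host.
    long value() {
        return shards.values().stream().mapToLong(s -> s[1]).sum();
    }
}
```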
I hope that explanation helps you believe me when I say that you don't have to worry about incrementing counters faster than Cassandra can handle. Since each node keeps its own sub-counter, and never replicates increment operations, there is no possibility of counts getting lost to the kind of race conditions a read-modify-write scheme would introduce. If Cassandra accepts the write, you're pretty much guaranteed that it will count.
What you're not guaranteed, though, is that the count will appear correct at all times. If an increment is written to one node but the counter value is read from another just after, there is no guarantee that the increment has been replicated, and you also have to consider what would happen during a network partition. This is more or less the same as for any write in Cassandra; it's part of its eventually consistent nature, and it depends on which consistency levels you use for the operations.
There is also the possibility of a lost acknowledgement. If you do an increment and lose the connection to Cassandra before you get the response back, you can't know whether or not your write got through. And when you get the connection back you can't tell either, since you don't know what the count was before you incremented. This is an inherent problem with systems that choose availability over consistency, and the price you pay for many of the other benefits.
Finally, the issue with rapid increment/remove/increment sequences is real, and something you should avoid. The problem is that the increment operation will essentially resurrect the column, and if these operations come close enough to each other they might get the same timestamp. Cassandra is strictly last-write-wins and determines "last" based on the timestamp of the operation. If two operations have the same timestamp, the "greater" one wins, meaning the one that sorts later in a strict byte order. It's real, but I wouldn't worry too much about it unless you're doing very rapid writes and deletes to the same value (which is probably a fault in your data model anyway).
Here's a good guide to the internals of Cassandra's counters: http://www.datastax.com/wp-content/uploads/2011/07/cassandra_sf_counters.pdf
The current version of counters is just not a good fit for a use case that requires guarantees of no over-counting and immediate consistency.
There are increment and decrement operations, and those will not collide with each other, and, barring any lost mutations or replayed mutations, will give you a correct result.
The rewrite of Cassandra counters (https://issues.apache.org/jira/browse/CASSANDRA-6504) might be interesting to you, and it should address all of the current concerns with getting a correct count.
In the meantime, if I had to implement this on top of a current version of Cassandra, and an accurate count was essential, I would probably store each increment or decrement as a column, and do read-time aggregation of the results, while writing back a checkpoint so you don't have to read back to the beginning of time to calculate subsequent results.
That adds a lot of burden to the read side, though it is extremely efficient on the write path, so it may or may not work for your use case.
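A rough sketch of that pattern, assuming the DataStax Java driver 4.x; the keyspace/table (metrics.click_events) is made up and the checkpointing part is omitted for brevity.

```java
// Hypothetical schema:
// CREATE TABLE metrics.click_events (
//     page text, link text, event_id uuid, delta bigint,
//     PRIMARY KEY ((page, link), event_id));

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.Row;
import java.util.UUID;

public class EventSourcedCounter {
    private final CqlSession session = CqlSession.builder().build();

    // The caller generates eventId once per logical event and reuses it on
    // retry, so a replayed INSERT overwrites the same row instead of adding
    // to the count (unlike a retried counter increment).
    public void increment(String page, String link, UUID eventId, long delta) {
        session.execute(
            "INSERT INTO metrics.click_events (page, link, event_id, delta) VALUES (?, ?, ?, ?)",
            page, link, eventId, delta);
    }

    // Read-time aggregation: sum every delta in the partition. A periodic
    // checkpoint row (not shown) would keep this scan from growing forever.
    public long total(String page, String link) {
        long sum = 0;
        for (Row row : session.execute(
                "SELECT delta FROM metrics.click_events WHERE page = ? AND link = ?",
                page, link)) {
            sum += row.getLong("delta");
        }
        return sum;
    }
}
```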
To understand updates/increments, i.e. write operations, I suggest you read about Gossip, the protocol Cassandra uses for internode communication. In Gossip, every participant (node) maintains its state using the tuple σ(K) = (V, N), where σ(K) is the state of key K with value V and version number N.
To maintain a single version of truth for a piece of data, Gossip uses a reconciliation mechanism, either Precise or Scuttlebutt (the current one). With Scuttlebutt reconciliation, before updating any tuple the nodes communicate with each other to check who holds the highest version (newest value) of the key; whoever holds the highest version is responsible for the write operation.
For further information, read this article.
I have a table counting around 1000 page views per second. What read and write ConsistencyLevel should I use with it? I am using the Cassandra Thrift client.
Carlo has more or less the right idea. But you have to balance it with your use case.
I work in the game industry and we use Cassandra for player data. It is quite heavily bound by the read-modify-write pattern, which is not Cassandra's strong suit. But we also have some functionality that is write-heavy (thousands of writes for a few reads a day).
This is my opinion, based upon experience, of how you should use the consistency levels.
Write + Read at QUORUM means that before returning, both operations wait for a majority of the replicas to confirm the operation. It is the solution I use when reads and writes are at roughly the same frequency (the player data blob).
Write ONE + Read ALL is useful for something very write-heavy. We use this for high scores, for example (written often, read every 5 minutes to regenerate the high score table for the whole game).
You could use Write ANY if you do not care about the data that much (non-critical logs come to mind).
The only use case I could come up with for Write ALL + Read ONE would be messaging or feeds with periodic checks for updates. Chat and messaging seem a good fit for that, since Cassandra does not have subscription/push functionality.
Write & Read at ALL is a bad setup. It is a waste of resources, as you get the same consistency as with one of the three setups I mentioned above.
A final note about Write ANY vs. Write ONE: ANY only confirms that something in the cluster (even just a hint on the coordinator) has received the mutation, while ONE confirms that it has been applied by at least one replica. ANY is not safe, as it can return without error even if all the nodes responsible for that mutation are down, or under any other condition that could make the mutation fail after reception. It is also slightly quicker, which is its only advantage (I only use it as an async dump for logs that are not critical), but do not trust the response 100%.
A good reference to study this subject about cassandra is http://www.datastax.com/docs/1.2/dml/data_consistency
If you want reads to always be consistent, the rule is
(write consistency level + read consistency level) > replication factor.
So you could
Write All + Read All (worst solution)
Write One + Read All (second-worst solution)
Write All + Read One (probably faster solution)
Write Quorum + Read Quorum (imho, best solution)
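To make the QUORUM + QUORUM option concrete: with RF = 3, QUORUM is 2 replicas, and 2 + 2 > 3, so every read overlaps the latest successful write. Here is a sketch using the newer DataStax Java driver 4.x rather than Thrift, purely for illustration; the table and columns are made up.

```java
// Hypothetical schema:
// CREATE TABLE game.player_data (id text PRIMARY KEY, payload text);

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.DefaultConsistencyLevel;
import com.datastax.oss.driver.api.core.cql.Row;
import com.datastax.oss.driver.api.core.cql.SimpleStatement;

public class QuorumReadWrite {
    private final CqlSession session = CqlSession.builder().build();

    // Write at QUORUM: at least 2 of the 3 replicas acknowledge the write.
    public void save(String playerId, String payload) {
        session.execute(SimpleStatement.builder(
                "UPDATE game.player_data SET payload = ? WHERE id = ?")
            .addPositionalValues(payload, playerId)
            .setConsistencyLevel(DefaultConsistencyLevel.QUORUM)
            .build());
    }

    // Read at QUORUM: at least 2 replicas answer, so the read set always
    // intersects the replicas that acknowledged the last write.
    public String load(String playerId) {
        Row row = session.execute(SimpleStatement.builder(
                "SELECT payload FROM game.player_data WHERE id = ?")
            .addPositionalValues(playerId)
            .setConsistencyLevel(DefaultConsistencyLevel.QUORUM)
            .build()).one();
        return row == null ? null : row.getString("payload");
    }
}
```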
Keep in mind that if one of the replicas is down during a read/write operation at CL ALL, the operation will fail, so I'd avoid ALL.
Regards, Carlo
Based on their documentation (https://docs.datastax.com/en/cql/3.0/cql/ddl/ddl_counters_c.html), consistency level ONE is recommended. I guess some sort of merging is used to resolve conflicts for counter columns, instead of the usual last-write-wins. That's likely why setting a counter to a specific value is not allowed.