We are attempting to use Prometheus to monitor a Cassandra cluster (v2.1.x) in my organization. I have a question about the read and write latency metrics.
For node-level read and write latency, the metric I am using is:
org.apache.cassandra.metrics<type=(ClientRequest), scope=(Read|Write), name=(Latency)><>(OneMinuteRate)
At the table level, I am using: org.apache.cassandra.metrics<type=(Keyspace), keyspace=(\S*), name=(ReadLatency|WriteLatency)><>(OneMinuteRate)
I tested the metrics by running the cassandra-stress utility against a sample table. At the table level, I see roughly 5,000 writes per second. This is as expected.
But at the node level the writes are around 100 per second. My requirement is to graph both node-level and table-level writes per second.
Yes, the ClientRequest Latency metric is at the node level. The reason you only see 100 writes per second there is that this node is receiving very few client requests as a coordinator.
The table-level metric is for all nodes in the cluster, meaning that together all your nodes are doing 5,000 writes per second to this table.
You should find out whether you have any hotspots (nodes that handle more traffic than others). Perhaps you are using a limited number of partition keys, which would target a limited number of nodes; this is most likely your issue (a poor data model). You can also check which load balancing policy your driver is using and change it; a sketch of doing that from the driver follows below.
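For example, here is a minimal sketch of switching to a token-aware, DC-aware round-robin policy with the DataStax C# driver (the contact point and the "dc1" datacenter name are placeholders): it spreads coordinator duty round-robin across the local datacenter and routes each request to a replica that actually owns the partition.

using Cassandra; // DataStax C# driver (NuGet package "CassandraCSharpDriver")

var cluster = Cluster.Builder()
    .AddContactPoint("10.0.0.1")                     // placeholder contact point
    .WithLoadBalancingPolicy(
        new TokenAwarePolicy(                        // prefer replicas that own the partition
            new DCAwareRoundRobinPolicy("dc1")))     // "dc1" is a placeholder local DC name
    .Build();
var session = cluster.Connect();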
I am currently managing a Percona XtraDB cluster composed of 5 nodes that handles millions of inserts every day. Write performance is very good, but reading is not so fast, especially when I request a big dataset.
The records inserted are sensor time series.
I would like to try Apache Cassandra to replace the Percona cluster, but I don't understand how data reading works. I am looking for something able to split a query across all the nodes and read in parallel from more than one node.
I know that Cassandra shards the data and can keep replicas of each shard.
If I have 5 nodes and I set a replication factor of 5, will reading be 5x faster?
Cassandra read path
A read request initiated by a client is sent to a coordinator node, which uses the partitioner to determine which replicas are responsible for the data and whether the consistency level can be met.
The coordinator checks whether it is itself responsible for the data. If yes, it satisfies the request; if not, it sends the request to the fastest-answering replica (determined using the dynamic snitch). A digest request is also sent to the other replicas.
The coordinator then compares the returned digests; if they all match and the consistency level has been met, the data from the fastest-answering replica is returned. If the digests do not match, the coordinator issues read repair operations.
On each node a few steps are performed: check the row cache, check the memtables, check the SSTables. More information: How is data read? and ReadPathForUsers.
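If you want to watch this path for your own queries, the driver can ask the coordinator to trace a request. A minimal sketch with the DataStax C# driver (contact point, keyspace, table, and query are placeholders):

using System;
using Cassandra; // DataStax C# driver

var cluster = Cluster.Builder().AddContactPoint("10.0.0.1").Build(); // placeholder contact point
var session = cluster.Connect("sensors");                            // placeholder keyspace

// Enable tracing so the coordinator records the steps it performs for this read.
var rs = session.Execute(
    new SimpleStatement("SELECT * FROM readings WHERE sensor_id = 42") // placeholder query
        .EnableTracing());

// The trace events show digest requests to replicas, memtable/SSTable reads, read repair, etc.
foreach (var e in rs.Info.QueryTrace.Events)
    Console.WriteLine(e.Source + ": " + e.Description);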
Load balancing queries
Since your replication factor equals the number of nodes, each node holds all of your data. So when a coordinator node receives a read query, it can satisfy it from its own local data; in particular, if you use a LOCAL_ONE consistency level, the request will be pretty fast.
The client drivers implement the load balancing policies, which means that on the client you can configure how queries are spread around the cluster (see the short sketch below). Some more reading: ClientRequestsRead.
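For instance (DataStax C# driver; table and key are placeholders), a per-statement LOCAL_ONE read, given an ISession obtained as in the sketch above:

// LOCAL_ONE: a single replica in the local DC answers; with RF = node count, every node has a copy.
var rows = session.Execute(
    new SimpleStatement("SELECT value FROM readings WHERE sensor_id = 42") // placeholder query
        .SetConsistencyLevel(ConsistencyLevel.LocalOne));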
If I have 5 nodes and I set a replication factor of 5, will reading be 5x faster?
No. It means you will have up to 5 copies of the data to ensure that your query can be satisfied when nodes are down. Cassandra does not divide up the work for the read. Instead it tries to force you to design your data in a way that makes the reads efficient and fast.
The best way to read from Cassandra is to make sure that each query you generate hits a single Cassandra partition, which means the partition key is provided as a query parameter: the first part of a simple primary key (x, y, z), or the first bracket of a compound primary key ((x, y), z).
This goes back to the Cassandra table design principle of designing your tables around your query needs; a small sketch follows below.
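As an illustrative sketch (keyspace, table, and column names are made up; DataStax C# driver assumed), a time-series table keyed so that each query hits exactly one partition, queried with a prepared statement:

using Cassandra; // DataStax C# driver

var cluster = Cluster.Builder().AddContactPoint("10.0.0.1").Build(); // placeholder contact point
var session = cluster.Connect();

// Partition key = (sensor_id, day), clustering column = ts:
// every query below lands on exactly one partition.
// Assumes a keyspace named "sensors" already exists.
session.Execute(@"
    CREATE TABLE IF NOT EXISTS sensors.readings_by_day (
        sensor_id int,
        day       text,
        ts        timestamp,
        value     double,
        PRIMARY KEY ((sensor_id, day), ts)
    )");

// The full partition key is supplied as query parameters, so the read targets one replica set.
var ps = session.Prepare(
    "SELECT ts, value FROM sensors.readings_by_day WHERE sensor_id = ? AND day = ?");
var rows = session.Execute(ps.Bind(42, "2016-05-01"));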
Replication is about copies of data and Partitioning is about distributing data.
https://docs.datastax.com/en/cassandra/3.0/cassandra/architecture/archPartitionerAbout.html
Some references about Cassandra modelling:
https://www.datastax.com/dev/blog/the-most-important-thing-to-know-in-cassandra-data-modeling-the-primary-key
https://www.datastax.com/dev/blog/basic-rules-of-cassandra-data-modeling
It is recommended to keep partitions around 100 MB or less, but this is not compulsory.
You can use the cassandra-stress utility to get a report of how your reads and writes perform.
This may sound like a dumb question, but I still wanted an expert to answer/confirm it.
Let's say I have a 3-node Cassandra cluster, with one database and just one table. For this single table, let's say I get a throughput of 1K writes/second with the 3-node cluster. If tomorrow my write load on this table scales to 10K or 20K writes/second, will I be able to handle it by increasing the size of the cluster by, say, 10x or 20x?
My understanding of Cassandra says it is possible (as Cassandra is both read and write scalable), but I would want an expert to confirm.
Yes, Cassandra has Linear Scalability.
The scalability is linear as shown in the chart below. Each client system generates about 17,500 write requests per second, and there are no bottlenecks as we scale up the traffic. Each client ran 200 threads to generate traffic across the cluster.
Source : https://medium.com/netflix-techblog/benchmarking-cassandra-scalability-on-aws-over-a-million-writes-per-second-39f45f066c9e
Yes, but only if your data is properly modeled: your data especially needs to be distributed evenly among your partition keys (since they map to specific replica nodes) to avoid hot spots. Given that, yes, Cassandra will scale horizontally well.
A "table" in Cassandra is distributed among all nodes in your cluster. Each node is responsible for a range of tokens, which are hashes of the partition key portion of your primary key.
Now, if you double your node count, for example, the existing token ranges are split and redistributed while the new nodes bootstrap, so each node only handles half of your initial requests. If you then double your request rate, each node carries roughly the same load as before. A sketch of inspecting this token ownership from a client follows below.
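If you want to see this ownership from the client side, the driver's cluster metadata can tell you which nodes are replicas for a given partition key. A rough sketch (DataStax C# driver; contact point, keyspace, and key value are placeholders), assuming a single text partition-key column, whose token is computed over its UTF-8 bytes:

using System;
using System.Text;
using Cassandra; // DataStax C# driver

var cluster = Cluster.Builder().AddContactPoint("10.0.0.1").Build(); // placeholder contact point

// Serialized partition key for a single text column: its UTF-8 bytes.
byte[] partitionKey = Encoding.UTF8.GetBytes("sensor-42");           // placeholder key value

// Which nodes own the token range this key hashes into?
foreach (var host in cluster.Metadata.GetReplicas("my_keyspace", partitionKey)) // placeholder keyspace
    Console.WriteLine(host.Address);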
For read-intensive workloads, choosing a higher replication factor helps when you can live with stale data for a while (e.g. read and write at a low consistency level).
There are good tutorials from DataStax available here https://academy.datastax.com/
Datastax states that:
What are the benefits of Apache Cassandra?
Massively scalable ring architecture: Based on the best of Amazon Dynamo and Google BigTable, Cassandra’s peer-to-peer architecture overcomes the limitations of master-slave designs and allows for both high availability and massive scalability.
Linear scale performance: Nodes added to a Cassandra cluster (all done online) increase the throughput of your database in a predictable, linear fashion for both read and write operations.
So the answer is YES, it is possible. It may take some time to add new nodes and redistribute tokens, but it will scale as you change the number of nodes.
If you need more info to understand how it will scale, check the links below:
Benchmarking Cassandra Scalability on AWS
Adding nodes to Cassandra
Adding, replacing, moving and removing nodes
Yes, it is so, but with a single remark: you should also consider the replication factor (RF) and consistency level (CL), as they affect scaling behaviour too.
For example, if you initially have 10 nodes with RF=3 and you increase the node count to 20 with the same RF=3, you'll get a linear increase in write throughput.
But if you want to increase read throughput, you need to increase the RF, and with an increased RF you have to decrease the write consistency level to keep write throughput up.
To summarize, you cannot increase both read and write throughput linearly with the same RF and CL parameters.
I'm trying to understand how to use the org.apache.cassandra.metrics:type=Messaging metric. I set up 3 datacenters with 1 node each. When I measure the metric, for each node I get 2 cross-datacenter metrics and 1 cross-node latency metric, as follows (for the node in DC-2):
org.apache.cassandra.metrics:type=Messaging,name=dc3-Latency
5.3387457013878636E7
org.apache.cassandra.metrics:type=Messaging,name=CrossNodeLatency
1.1471964354991291E8
org.apache.cassandra.metrics:type=Messaging,name=dc1-Latency
1.6108579786605054E8
However, I have no processes using the cluster currently. Is Cassandra doing a dummy write to measure this metric? Also, what does the cross-node latency metric mean here, given that each DC contains only one node?
The metric records the incoming latency of everything that uses the messaging service. The messaging service is used for reads/writes, but it is also used for streaming and gossip. Gossip fires every second between all the nodes, so that is probably what dominates the metric in your situation. Also, some tables (system_distributed, system_traces, and some DSE tables if you are using DSE) may be written to even on a fairly idle system.
Whenever a message is sent from one node to another, a timestamp is attached to it along with some versioning information. The first thing the receiving node does (ignoring the obvious OS/socket handling) is, more or less, compare that timestamp to "now"; this is what drives the metric. It then looks at which datacenter the source node belongs to in order to decide which metrics to update.
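Conceptually (this is only an illustrative sketch, not Cassandra's actual implementation, which is Java and uses histogram timers), the bookkeeping looks roughly like this:

using System;
using System.Collections.Generic;

// Illustrative only: approximate the Messaging latency bookkeeping described above.
class MessagingLatencySketch
{
    private readonly Dictionary<string, List<TimeSpan>> _perDcLatency = new Dictionary<string, List<TimeSpan>>();
    private readonly List<TimeSpan> _crossNodeLatency = new List<TimeSpan>();

    // Called for every incoming message; sentAtUtc was stamped by the sending node.
    public void Record(string sourceDc, DateTime sentAtUtc)
    {
        var latency = DateTime.UtcNow - sentAtUtc;   // compare the attached timestamp to "now"
        _crossNodeLatency.Add(latency);              // feeds the CrossNodeLatency metric
        if (!_perDcLatency.TryGetValue(sourceDc, out var perDc))
            _perDcLatency[sourceDc] = perDc = new List<TimeSpan>();
        perDc.Add(latency);                          // feeds e.g. the dc1-Latency / dc3-Latency metrics
    }
}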
I am experimenting with Cassandra 3.0.2 on a 6-node cluster and have found some "unintuitive" read-scaling and workload patterns.
Query:
select count(*) from dvds
where the dvds table has 280k records.
With default vnode settings (num_tokens: 256), I've found that increasing the node count from 1 to 2 improves read performance by about 35%, but each additional node beyond 2 decreases performance by about 30%.
With vnodes disabled (num_tokens: 1 and initial_token set manually on each node), a 6-node cluster performs about 35% better than with num_tokens: 256, but the following pattern is clearly observable: the coordinator node's CPU consumption is either about 50% (of the capacity of a single CPU core) or about 110-120%, whereas the other nodes consume either about 0% or about 60-70% of a single core. The unintuitive part is this: when one node is busy, the other nodes are idle. (When the coordinator's CPU consumption is at 110-120%, all the other nodes are pretty much idle; when the coordinator's CPU is at 50%, one of the other nodes is busy.)
The strongest hypothesis I could come up with was that the cluster is unable to handle the network traffic, but the coordinator's network traffic (where, I assume, a network scalability issue would hit hardest) didn't seem to exceed 1 Mb/s at any point in time. (The network interfaces on the nodes are 10/100 Mbps.) Also, with a network scalability issue, I would expect the num_tokens: 1 setup to show initially high CPU load on all nodes except the coordinator, or at least some evenly distributed simultaneous load.
Please, can anybody shed some light on this?
count(*) has its place, but it is very expensive: the coordinator essentially has to pull everything down from all the nodes, merge it, and count it. The only thing it provides over reading everything and counting it locally is reducing some network load between the coordinator and your application.
If you need this count regularly, I would recommend using a counter or LWT to maintain it, so that fetching it becomes a single read operation (build the data model around your queries, not around abstractions of the data); a sketch follows below. If you need it once, or infrequently, Hadoop/Spark is a great option. You can also get a decent estimate from the EstimatedPartitionSize metric (per node, though), depending on your data model.
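A minimal sketch of the counter approach (keyspace, table, and column names are made up; DataStax C# driver assumed): bump the counter alongside every insert, and reading the count becomes a single-partition read instead of a full scan.

using System.Linq;
using Cassandra; // DataStax C# driver

var cluster = Cluster.Builder().AddContactPoint("10.0.0.1").Build(); // placeholder contact point
var session = cluster.Connect("media");                              // placeholder keyspace, assumed to exist

// A counter table keyed by the thing being counted.
session.Execute(
    "CREATE TABLE IF NOT EXISTS row_counts (table_name text PRIMARY KEY, row_count counter)");

// Increment whenever a row is inserted into dvds.
session.Execute("UPDATE row_counts SET row_count = row_count + 1 WHERE table_name = 'dvds'");

// Fetching the count is now one single-partition read.
var row = session.Execute("SELECT row_count FROM row_counts WHERE table_name = 'dvds'").First();
long count = row.GetValue<long>("row_count");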
I'm running performance tests against Azure Table Storage (ATS), and it behaves a bit oddly when using multiple virtual machines against the same table / storage account.
The entire pipeline is non-blocking (await/async) and uses TPL for concurrent and parallel execution.
First of all, it's very strange that with this setup I'm only getting about 1,200 insertions per second. This is running on an L VM, i.e. 4 cores and 800 Mbps.
I'm inserting 100,000 rows with a unique PK and a unique RK, which should give the best possible distribution.
The following behavior is even more deterministic:
When I run 1 VM, I get about 1,200 insertions per second.
When I run 3 VMs, I get about 730 insertions per second on each.
It's quite amusing to read the blog post where they specify their targets:
https://azure.microsoft.com/en-gb/blog/windows-azures-flat-network-storage-and-2012-scalability-targets/
Single Table Partition– a table partition are all of the entities in a table with the same partition key value, and usually tables have many partitions. The throughput target for a single table partition is:
Up to 2,000 entities per second
Note, this is for a single partition, and not a single table. Therefore, a table with good partitioning, can process up to the 20,000 entities/second, which is the overall account target described above.
What should I do to be able to utilize the 20k per second, and how would it be possible to execute more than 1.2k per VM?
--
Update:
I've now also tried using 3 storage accounts, one for each individual node, and am still getting the same performance/throttling behavior, which I can't find a logical reason for.
--
Update 2:
I've optimized the code further and can now execute about 1,550 insertions per second.
--
Update 3:
I've now also tried US West. The performance is worse there, about 33% lower.
--
Update 4:
I tried executing the code from an XL machine, which has 8 cores instead of 4 and double the memory and bandwidth, and got only a 2% increase in performance, so clearly this problem is not on my side.
A few comments:
You mention that you are using unique PK/RK values to get the best distribution, but you have to keep in mind that PK balancing is not immediate. When you first create a table, the entire table is served by one partition server. So even if you are doing inserts across several different PKs, they will still be going to one partition server and be bottlenecked by the scalability target for a single partition. The partition master will only start splitting your partitions among multiple partition servers after it has identified hot partition servers. In your test of under 2 minutes you will not see the benefit of multiple partition servers or PKs. The throughput in the article is targeted towards a well-distributed PK scheme with frequently accessed data, causing the data to be divided among multiple partition servers.
The size of your VM is not the issue, as you are not blocked on CPU, memory, or bandwidth. You can achieve full storage performance from a small VM size.
Check out http://research.microsoft.com/en-us/downloads/5c8189b9-53aa-4d6a-a086-013d927e15a7/default.aspx. I just did a quick test using that tool from a WebRole VM in the same datacenter as my storage account, and from a single instance of the tool on a single VM I achieved ~2800 items per second upload and ~7300 items per second download. This was using 1024-byte entities, 10 threads, and a batch size of 100. I don't know how efficient this tool is or whether it disables Nagle's Algorithm, as I was unable to get great results (I got ~1000/second) using a batch size of 1, but at least with the batch size of 100 it shows that you can achieve a high number of items per second. This was done in US West.
Are you using storage client library 1.7 (Microsoft.WindowsAzure.StorageClient.dll) or 2.0 (Microsoft.WindowsAzure.Storage.dll)? The 2.0 library has some performance improvements and should yield better results.
I suspect this may have to do with TCP Nagle.
See this MSDN article and this blog post.
In essence, TCP Nagle is a protocol-level optimization that batches up small requests. Since you are sending lots of small requests this is likely to negatively affect your performance.
You can disable TCP Nagle by executing this code when starting your application
ServicePointManager.UseNagleAlgorithm = false;
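If you go down this route, two other ServicePointManager settings are commonly tuned alongside it for table storage workloads (the connection-limit value below is illustrative, not prescriptive):

using System.Net;

// Allow more concurrent HTTP connections to the storage endpoint than the small default.
ServicePointManager.DefaultConnectionLimit = 100;   // illustrative value

// Skip the extra Expect: 100-continue round trip on each request.
ServicePointManager.Expect100Continue = false;

// Disable Nagle batching of the many small requests table storage generates.
ServicePointManager.UseNagleAlgorithm = false;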
Are the compute instances and storage account in the same affinity group? Affinity groups ensure that network proximity between the services is optimal and should result in lower latency at the network level.
You can find affinity group configuration under the network tab.
I would tend to believe that the maximum throughput figures are for an optimized load. For example, I bet you can achieve higher performance using batch requests than with the individual requests you are doing now. And of course, if you use GUIDs for your PK, you can't batch in your current test.
So what if you changed your test to batch-insert entities in groups of 100 (the maximum per batch), still using GUIDs, but where each group of 100 entities shares the same PK?
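A rough sketch of that idea (assuming the 2.0 storage client library; the connection string, table name, and entity shape are placeholders): all 100 entities in a batch share one PartitionKey, which is a requirement for a TableBatchOperation anyway, and the whole group is written in a single round trip.

using System;
using Microsoft.WindowsAzure.Storage;        // storage client library 2.0
using Microsoft.WindowsAzure.Storage.Table;

class Item : TableEntity { }                 // placeholder entity with no extra properties

class BatchInsertSketch
{
    static void Main()
    {
        var account = CloudStorageAccount.Parse("UseDevelopmentStorage=true");       // placeholder connection string
        var table = account.CreateCloudTableClient().GetTableReference("perftest");  // placeholder table name
        table.CreateIfNotExists();

        // One batch = up to 100 entities, all sharing the same PartitionKey.
        var pk = Guid.NewGuid().ToString();
        var batch = new TableBatchOperation();
        for (var i = 0; i < 100; i++)
            batch.Insert(new Item { PartitionKey = pk, RowKey = Guid.NewGuid().ToString() });

        table.ExecuteBatch(batch);           // 100 inserts, one round trip
    }
}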