I have a two-datacenter Cassandra pair with single replication; each datacenter contains a single node, and each datacenter sits on a separate physical server on the network. If one datacenter crashes, the other one will continue to be available for reads and writes. I started up my Java application on a 3rd server and everything is running OK: it's reading and writing to Cassandra.
Next I disconnected the 2nd datacenter's server from the network by pulling its network cable.
I expected the application to continue running with no exceptions against the 1st datacenter, but that was not the case.
The following exception started to occur in the application:
me.prettyprint.hector.api.exceptions.HUnavailableException: : May not be enough replicas present to handle consistency level.
at me.prettyprint.cassandra.service.ExceptionsTranslatorImpl.translate(ExceptionsTranslatorImpl.java:60)
at me.prettyprint.cassandra.service.KeyspaceServiceImpl$9.execute(KeyspaceServiceImpl.java:354)
at me.prettyprint.cassandra.service.KeyspaceServiceImpl$9.execute(KeyspaceServiceImpl.java:343)
at me.prettyprint.cassandra.service.Operation.executeAndSetResult(Operation.java:101)
at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:232)
at me.prettyprint.cassandra.service.KeyspaceServiceImpl.operateWithFailover(KeyspaceServiceImpl.java:131)
at me.prettyprint.cassandra.service.KeyspaceServiceImpl.getSuperColumn(KeyspaceServiceImpl.java:360)
at me.prettyprint.cassandra.model.thrift.ThriftSuperColumnQuery$1.doInKeyspace(ThriftSuperColumnQuery.java:51)
at me.prettyprint.cassandra.model.thrift.ThriftSuperColumnQuery$1.doInKeyspace(ThriftSuperColumnQuery.java:45)
at me.prettyprint.cassandra.model.KeyspaceOperationCallback.doInKeyspaceAndMeasure(KeyspaceOperationCallback.java:20)
at me.prettyprint.cassandra.model.ExecutingKeyspace.doExecute(ExecutingKeyspace.java:85)
at me.prettyprint.cassandra.model.thrift.ThriftSuperColumnQuery.execute(ThriftSuperColumnQuery.java:44)
Once I reconnected the network cable to the 2nd server, the error stopped.
Here are more details (Cassandra 1.0.10):
1) Here's the describe output from Cassandra on both datacenters:
Keyspace: AdvancedAds:
Replication Strategy: org.apache.cassandra.locator.NetworkTopologyStrategy
Durable Writes: true
Options: [DC2:1, DC1:1]
2) I ran nodetool ring against each instance:
./nodetool -h 111.111.111.111 -p 11000 ring
Address DC Rack Status State Load Owns Token
1
111.111.111.111 DC1 RAC1 Up Normal 1.07 GB 100.00% 0 # <-- us
111.111.111.222 DC2 RAC1 Up Normal 1.1 GB 0.00% 1
./nodetool -h 111.111.111.222 ring -port 11000
Address DC Rack Status State Load Owns Token
1
111.111.111.111 DC1 RAC1 Up Normal 1.07 GB 100.00% 0
111.111.111.222 DC2 RAC1 Up Normal 1.1 GB 0.00% 1 # <-- us
3) I checked the cassandra.yaml
the seeds are 111.111.111.111, 111.111.111.222
4) I checked the cassandra-topology.properties
111.111.111.111
# Cassandra Node IP=Data Center:Rack
# datacenter 1
111.111.111.111=DC1:RAC1 # <-- us
# datacenter 2
111.111.111.222=DC2:RAC1
default=DC1:r1
111.111.111.222
# Cassandra Node IP=Data Center:Rack
# datacenter 1
111.111.111.111=DC1:RAC1
# datacenter 2
111.111.111.222=DC2:RAC1 # <-- us
default=DC1:r1
5) We set the consistencyLevel to LOCAL_QUORUM in our Java application as follows:
public Keyspace getKeyspace(final String keyspaceName, final String serverAddresses)
{
    Keyspace ks = null;
    Cluster c = clusterMap.get(serverAddresses);
    if (c != null)
    {
        ConfigurableConsistencyLevel policy = new ConfigurableConsistencyLevel();
        policy.setDefaultReadConsistencyLevel(consistencyLevel);
        policy.setDefaultWriteConsistencyLevel(consistencyLevel);

        // Create Keyspace
        ks = HFactory.createKeyspace(keyspaceName, c, policy);
    }
    return ks;
}
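For context, the consistencyLevel and clusterMap fields referenced above aren't shown; here is a minimal sketch of how they might be wired up with Hector. The cluster name and the connect() helper are assumptions; only LOCAL_QUORUM comes from the setup described above.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.HConsistencyLevel;
import me.prettyprint.hector.api.factory.HFactory;

public class CassandraConnector
{
    // Assumed wiring for the fields used in getKeyspace(); only LOCAL_QUORUM comes from the question.
    private final HConsistencyLevel consistencyLevel = HConsistencyLevel.LOCAL_QUORUM;
    private final Map<String, Cluster> clusterMap = new ConcurrentHashMap<String, Cluster>();

    public void connect(final String serverAddresses)
    {
        // "TestCluster" is an arbitrary name; serverAddresses is a comma-separated host list,
        // e.g. "111.111.111.111,111.111.111.222"
        Cluster c = HFactory.getOrCreateCluster("TestCluster", serverAddresses);
        clusterMap.put(serverAddresses, c);
    }
}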
I was told this configuration would work, but maybe I'm missing something.
Thanks for any insight
Hector is known to return spurious unavailable errors. The native protocol Java driver does not have this problem: https://github.com/datastax/java-driver
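For illustration, a LOCAL_QUORUM read with that driver might look like the sketch below, assuming a Cassandra version new enough to speak the native protocol (1.2+); the contact point and keyspace come from the question, the table and query are made up.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class NativeDriverRead
{
    public static void main(String[] args)
    {
        Cluster cluster = Cluster.builder()
                .addContactPoint("111.111.111.111") // the DC1 node from the question
                .build();
        Session session = cluster.connect("AdvancedAds");

        // Hypothetical table and query, just to show where the consistency level is set.
        SimpleStatement stmt = new SimpleStatement("SELECT * FROM ads LIMIT 10");
        stmt.setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM);

        ResultSet rs = session.execute(stmt);
        System.out.println(rs.one());

        cluster.close();
    }
}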
If you have only two nodes and your data would be placed on the node that is actually down, you may not be able to achieve full write availability when that consistency is required. Cassandra would normally cover this with hinted handoff, but for the QUORUM consistency level an UnavailableException will be thrown anyway.
The same is true when requesting data belonging to the down node.
However, it also seems like your cluster is not well balanced: your node 111.111.111.111 owns 100% while 111.111.111.222 seems to own 0%, and looking at your tokens, they appear to be the reason for that.
Check out how to set the initial token here: http://www.datastax.com/docs/0.8/install/cluster_init#token-gen-cassandra
Additionally, you may want to check Another Question, which contains an answer with more reasons why a situation like this may happen.
LOCAL_QUORUM won't work if you configure NetworkTopologyStrategy like this:
Options: [DC2:1, DC1:1] # this will make LOCAL_QUORUM and QUORUM fail always
LOCAL_QUORUM and (in my experience) QUORUM require data centers to have at least 2 replicas up. If you want a quorum spanning your data centers, you have to set the consistency level to the datacenter-agnostic TWO (see the sketch after the examples).
More examples:
Options: [DC2:3, DC1:1] # LOCAL_QUORUM for clients in DC2 works, QUORUM fails
Options: [DC2:2, DC1:1] # LOCAL_QUORUM in DC2 works, but down after 1 node failure
# QUORUM fails, TWO works.
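Applying that suggestion to the Hector setup from the question might look like the sketch below, assuming your Hector version exposes HConsistencyLevel.TWO:

import me.prettyprint.cassandra.model.ConfigurableConsistencyLevel;
import me.prettyprint.hector.api.HConsistencyLevel;

public class TwoConsistencyPolicy
{
    // Build a policy that defaults both reads and writes to the datacenter-agnostic TWO.
    public static ConfigurableConsistencyLevel build()
    {
        ConfigurableConsistencyLevel policy = new ConfigurableConsistencyLevel();
        policy.setDefaultReadConsistencyLevel(HConsistencyLevel.TWO);
        policy.setDefaultWriteConsistencyLevel(HConsistencyLevel.TWO);
        // Pass this policy to HFactory.createKeyspace(keyspaceName, cluster, policy) as in the question.
        return policy;
    }
}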
Related
After a few successful data ingestions into Cassandra with Spark,
an error is now returned every time I try to ingest data with Spark (after a few minutes, or instantly):
Caused by: com.datastax.oss.driver.api.core.AllNodesFailedException: Could not reach any contact point, make sure you've provided valid addresses
I checked with plain cqlsh (not Spark), and a similar error is indeed returned too (on 2 of the 4 nodes):
Connection error: ('Unable to connect to any servers', {'1.2.3.4': error(111, "Tried connecting to [('1.2.3.4', 9042)]. Last error: Connection refused")})
So basically, when I ingest into Cassandra with Spark, some nodes go down at some point, and I have to reboot the node in order to access it again through cqlsh (and Spark).
What is strange is that nodetool status still reports the node as "UP", while cqlsh reports connection refused for that node.
I tried to investigate the logs, but I have a big problem: there is nothing in the logs, not a single exception triggered server-side.
What should I do in my case? Why does a node go down or become unresponsive like this, and how can I prevent it?
Thanks
!!! edit !!!
Some of the details asked for, below:
Cassandra infrastructure:
network: 10 Gbps
two datacenters: datacenter1 and datacenter2
4 nodes in each datacenter
2 replicas per datacenter:
CREATE KEYSPACE my_keyspace WITH replication = {'class': 'NetworkTopologyStrategy', 'datacenter1': '2', 'datacenter2': '2'} AND durable_writes = true;
consistency used for input and output: LOCAL_QUORUM
total physical memory per node: 128 GB
memory split per node: 64 GB dedicated to each Cassandra instance, and 64 GB dedicated to each Spark worker (co-located on each Cassandra node)
storage: 4 TB NVMe for each node
Spark application config:
total executor cores: 24 (4 instances * 6 cores each)
total executor RAM: 48 GB (4 instances * 8 GB each)
Cassandra config in Spark:
spark.sql.catalog.cassandra.spark.cassandra.output.batch.size.rows 1
spark.sql.catalog.cassandra.spark.cassandra.output.concurrent.writes 100
spark.sql.catalog.cassandra.spark.cassandra.output.batch.grouping.key none
spark.sql.catalog.cassandra.spark.cassandra.output.throughputMBPerSec 80
spark.sql.catalog.cassandra.spark.cassandra.output.consistency.level LOCAL_QUORUM
spark.sql.catalog.cassandra.spark.cassandra.output.metrics false
spark.sql.catalog.cassandra.spark.cassandra.connection.timeoutMS 90000
spark.sql.catalog.cassandra.spark.cassandra.query.retry.count 10
spark.sql.catalog.cassandra com.datastax.spark.connector.datasource.CassandraCatalog
spark.sql.extensions com.datastax.spark.connector.CassandraSparkExtensions
Just curious, but what is the replication factor (RF) of the keyspace, and what consistency level is being used for the write operation?
I'll echo Alex, and say that usually this happens because Spark is writing faster than Cassandra can process. That leaves you with two options:
Increase the size of the cluster to handle the write load.
Throttle back the write throughput of the Spark job (see the sketch below).
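As a sketch of the throttling option mentioned above, the same catalog settings from the question can be set to more conservative values; the exact numbers below are illustrative, not recommendations.

import org.apache.spark.sql.SparkSession;

public class ThrottledIngest
{
    public static void main(String[] args)
    {
        SparkSession spark = SparkSession.builder()
                .config("spark.sql.extensions",
                        "com.datastax.spark.connector.CassandraSparkExtensions")
                .config("spark.sql.catalog.cassandra",
                        "com.datastax.spark.connector.datasource.CassandraCatalog")
                // More conservative write pressure than the question's 100 / 80:
                .config("spark.sql.catalog.cassandra.spark.cassandra.output.concurrent.writes", "5")
                .config("spark.sql.catalog.cassandra.spark.cassandra.output.throughputMBPerSec", "10")
                .getOrCreate();

        // ... run the same ingestion job with the reduced settings, then raise them gradually.
        spark.stop();
    }
}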
One thing worth calling out:
2 replicas per datacenter
consistency used for input and output : LOCAL_QUORUM
So you'll probably get more throughput by dropping the write consistency to LOCAL_ONE.
Remember, quorum == (RF / 2) + 1 (integer division), which means LOCAL_QUORUM with an RF of 2 is 2.
So I do recommend dropping to LOCAL_ONE, because right now Spark is effectively operating at ALL consistency.
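A minimal sketch of that arithmetic (integer division), with per-datacenter replication factors 2 and 3 for comparison:

public class QuorumMath
{
    // Quorum uses integer division: quorum = (RF / 2) + 1
    static int quorum(int rf)
    {
        return rf / 2 + 1;
    }

    public static void main(String[] args)
    {
        // RF 2 per DC (this cluster): LOCAL_QUORUM needs both local replicas, i.e. effectively ALL in that DC.
        System.out.println(quorum(2)); // 2
        // RF 3 per DC: one replica per DC can be down and LOCAL_QUORUM still succeeds.
        System.out.println(quorum(3)); // 2
    }
}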
Which JMX indicators do I need to care about?
Can't remember the exact name of it, but if you can find the metric for disk IOPs or throughput, I wonder if it's hitting a threshold and plateauing.
I have configured a Cassandra cluster with 3 nodes:
Node1(192.168.0.2) , Node2(192.168.0.3), Node3(192.168.0.4)
Created a keyspace 'test' with a replication factor of 2:
CREATE KEYSPACE test WITH replication = {'class': 'SimpleStrategy',
'replication_factor': 2};
When I stop either Node2 or Node3 (one at a time, and both at once), I am able to do CRUD operations on the keyspace/table.
When I stop Node1 and try to update/create a row from Node4 or Node3, I get the following error even though Node3 and Node4 are up and running:
com.datastax.driver.core.exceptions.NoHostAvailableException: All
host(s) tried for query failed (tried: /192.168.0.4:9042
(com.datastax.driver.core.exceptions.DriverException: Timeout while
trying to acquire available connection (you may want to increase the
driver number of per-host connections)))
I am not sure how Cassandra elects a leader if a leader node dies.
So, you are using replication_factor 2, which means only 2 nodes will have a replica of your keyspace (not all 3 nodes).
My first advice is to change the RF to 3.
You also have to pay attention to the consistency level you are using. If you have only 2 copies of your data (RF: 2) and you are using consistency level QUORUM, it will try to write the data to half of the nodes + 1, in this case both replicas. So if 1 node is down, you will not be able to write/read data.
To verify where the data is replicated, you can look at how the ring is laid out in your cluster. As you are using SimpleStrategy, it places replicas in clockwise direction, and in your case the data is copied to the nodes at 192.168.0.2 and 192.168.0.3.
Take a look at the concepts of replication factor: http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/architecture/architectureDataDistributeReplication_c.html
And Consistency Level: http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/dml/dml_config_consistency_c.html
Great answer about RF vs CL: https://stackoverflow.com/a/24590299/6826860
You can use this calculator to find out whether your setup has decent consistency. In your case the result is: "You can survive the loss of no nodes without impacting the application".
I think I wasn't clear in my response. The replication factor is about how many copies of your data will exist. The consistency level is how many copies your client will wait to be made before getting a response from the server.
Ex: all your nodes are up. The client makes a CQL request with CL QUORUM; the server writes the data to 2 nodes (3/2 + 1) and replies to the client, and in the background it copies the data to the third node as well.
In your example, if you shut down 2 nodes of a 3-node cluster, you will never achieve QUORUM to make requests (with CL QUORUM), so you have to use consistency level ONE; once the nodes are up again, Cassandra will copy the data to them. One thing that can happen is: before Cassandra copies the data to the other 2 nodes, the client makes a request to node1 or node2 and the data is not there yet.
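For illustration, here's how the default consistency level could be dropped to ONE with the DataStax Java driver that the exception above suggests is in use; the contact points come from the question, the table is made up.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.QueryOptions;
import com.datastax.driver.core.Session;

public class LowerConsistency
{
    public static void main(String[] args)
    {
        // Default every request to CL ONE so a single live replica is enough.
        QueryOptions options = new QueryOptions().setConsistencyLevel(ConsistencyLevel.ONE);

        Cluster cluster = Cluster.builder()
                .addContactPoints("192.168.0.3", "192.168.0.4") // the surviving nodes
                .withQueryOptions(options)
                .build();
        Session session = cluster.connect("test");

        // Hypothetical table, just to show a write going through at CL ONE.
        session.execute("INSERT INTO my_table (id, value) VALUES (1, 'x')");

        cluster.close();
    }
}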
I'm experiencing sync issues between different nodes in the same datacenter in Cassandra. The keyspace is set to a replication factor of 3 with NetworkTopologyStrategy and has 3 nodes in the DC, effectively making sure each node has a copy of the data. When nodetool status is run, it shows all three nodes in the DC own 100% of the data.
Yet the applied_migrations column family in that keyspace is not in sync. This is strange because only a single column family within the keyspace is impacted; all the other column families are fully replicated among the three nodes. The test was done by doing a count of rows on each of the column families in the keyspace.
keyspace_name | durable_writes | strategy_class | strategy_options
--------------+----------------+------------------------------------------------------+----------------------------
core_service | True | org.apache.cassandra.locator.NetworkTopologyStrategy | {"DC_DATA_1":"3"}
keyspace: core_service
Datacenter: DC_DATA_1
=====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN host_ip_address_1_DC_DATA_1 3.75 MB 256 100.0% 3851106b RAC1
UN host_ip_address_2_DC_DATA_1 3.72 MB 256 100.0% d1201142 RAC1
UN host_ip_address_3_DC_DATA_1 3.72 MB 256 100.0% 81625495 RAC1
Datacenter: DC_OPSCENTER_1
==========================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN host_ip_address_4_DC_OPSCENTER_1 631.31 MB 256 0.0% 39e4f8af RAC1
Query: select count(*) from core_service.applied_migrations;
host_ip_address_1_DC_DATA_1 core_service applied_migrations
count
-------
1
(1 rows)
host_ip_address_2_DC_DATA_1 core_service applied_migrations
count
-------
2
(1 rows)
host_ip_address_3_DC_DATA_1 core_service applied_migrations
count
-------
2
(1 rows)
host_ip_address_4_DC_OPSCENTER_1 core_service applied_migrations
count
-------
2
(1 rows)
A similar error is received as described in the issue below. Because not all rows of data are available, the migration script fails when it tries to create a table that already exists:
https://github.com/comeara/pillar/issues/25
I require strong consistency
If you want to ensure that your reads are consistent you need to use the right consistency levels.
For RF 3 the following are your options (a quick check of the overlap rule follows the list):
Write CL ALL and read with CL ONE or greater.
Write CL QUORUM and read CL QUORUM. This is what's recommended by Magro, who opened the issue you linked to. It's also the most common because you can lose one node and still read and write.
Write CL ONE but read CL ALL.
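These combinations work because the read and write replica sets overlap whenever read replicas + write replicas > RF. A quick sketch of that check for RF 3; the consistency levels in the comments map to the options above:

public class ConsistencyOverlap
{
    // Reads see the latest write when read replicas + write replicas > RF.
    static boolean overlaps(int readReplicas, int writeReplicas, int rf)
    {
        return readReplicas + writeReplicas > rf;
    }

    public static void main(String[] args)
    {
        int rf = 3;
        System.out.println(overlaps(1, 3, rf)); // write ALL (3), read ONE (1) -> true
        System.out.println(overlaps(2, 2, rf)); // write QUORUM (2), read QUORUM (2) -> true
        System.out.println(overlaps(3, 1, rf)); // write ONE (1), read ALL (3) -> true
    }
}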
What does Cassandra do to improve consistency?
Cassandra's anti entropy mechanisms are:
Repair will ensure that your nodes are consistent. It gives you a consistency baseline, and for this reason it should be run as part of your maintenance operations. Run repair more often than your gc_grace_seconds in order to prevent deleted data from coming back (zombies). DataStax OpsCenter has a Repair Service that automates this task.
Manually you can run:
nodetool repair
in one node or
nodetool repair -pr
in each of your nodes. The -pr option will ensure you only repair a node's primary ranges.
Read repair happens probabilistically (configurable in the table definition; see the sketch after this list). When you read a row, C* will notice if some of the replicas don't have the latest data and fix it.
Hints are collected by other nodes when a node is unavailable to take a write.
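To tune the read-repair probability mentioned above for the out-of-sync table, you can alter its options; a sketch assuming a pre-4.0 Cassandra version (these options were removed in 4.0), an illustrative contact point, and illustrative values.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class ReadRepairTuning
{
    public static void main(String[] args)
    {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();

        // Raise the read-repair probabilities on the out-of-sync table.
        // 0.2 and 0.1 are illustrative values, not recommendations.
        session.execute("ALTER TABLE core_service.applied_migrations "
                + "WITH read_repair_chance = 0.2 AND dclocal_read_repair_chance = 0.1");

        cluster.close();
    }
}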
Manipulating c* Schemas
I noticed that the whole point of Pillar is "to automatically manage Cassandra schema as code". This is a dangerous notion, especially if Pillar is a distributed application (I don't know if it is), because it may cause schema collisions that can leave a cluster in a wacky state.
Assuming that Pillar is not a distributed / multi-threaded system, you can ensure that you do not break the schema by calling checkSchemaAgreement() in the Java driver before and after each schema change.
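A minimal sketch of that pattern with a 3.x-era DataStax Java driver; the contact point and the example DDL are made up, checkSchemaAgreement() is the actual driver call.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class SafeSchemaChange
{
    public static void main(String[] args)
    {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();

        // Wait for agreement before the change...
        if (!cluster.getMetadata().checkSchemaAgreement())
        {
            throw new IllegalStateException("Schema not in agreement, aborting migration");
        }

        session.execute("CREATE TABLE IF NOT EXISTS core_service.example (id int PRIMARY KEY)");

        // ...and again afterwards, before running the next migration step.
        if (!cluster.getMetadata().checkSchemaAgreement())
        {
            throw new IllegalStateException("Schema change did not propagate to all nodes in time");
        }

        cluster.close();
    }
}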
Long term
Cassandra schemas will be more robust and handle distributed updates. Watch (and vote for) CASSANDRA-9424
I am new to Cassandra, and at work I have a 4-node cluster.
nodetool gossipinfo tells me that there is one datacenter, 2 racks, and 2 nodes in each rack. The replication factor is defined as 2. nodetool ring tells me that each node has 50% ownership. There are 2 seed nodes in our config; each rack has 1 seed node.
Does this mean that for each rack, there is one seed node and its replicated node? If that is the case, then why is the data size not the same for the seed node and its replicated node?
What happens if one node goes down? Will it have any impact on the data availability of the cluster?
Seeds
Seed nodes are only special in the way that new nodes joining the cluster contact the seed nodes to find out about other nodes and the topology of the ring. But in Cassandra all nodes are the same, i.e. there are no master or slave, no primary or secondary nodes. Because of this, you can elect any (or all) nodes as seeds.
Since seeds only relate to gossip information, they do not have anything to do with replicated data.
Size
In relation to data size, each node will never be exactly the same since each partition/row size is never the same. If you look at the nodetool cfstats output, you will see that there is a big range between minimum and maximum sizes.
Availability
If reads are done with consistency level CL=ONE, then if a node is down the other replica will continue to serve requests. But if reads are done with a higher consistency, then reads will fail, since they need 2 nodes to be available, i.e. CL=LOCAL_QUORUM requires (RF/2 + 1) = 2 nodes to respond.
EDIT: Response to:
Shouldn't each node own 25%?
Ownership
In Cassandra, data is not "distributed" across ALL nodes in ALL DCs. In fact, a DC is a copy of another DC depending on the replication factor.
To illustrate, consider the following keyspace definition:
CREATE KEYSPACE "myKS"
WITH REPLICATION = {
    'class' : 'NetworkTopologyStrategy',
    'DC1' : 2,
    'DC2' : 2};
Based on this definition, the myKS keyspace has 2 replicas in DC1 and 2 replicas in DC2. Since each of your data centres only has 2 nodes, this effectively means that each DC is a copy of the other.
Following from that, since the tokens are split between 2 nodes, each node owns half of the data which is 50%. So in DC1, each node owns 50% and in DC2 (which is a copy of DC1) each node also owns 50%.
We are trying to provision a Cassandra cluster to use as a KV storage.
We are deploying a 3-node production cluster on our production DC, but we would also want to have a single node as a DR copy on our disaster recovery DC.
Using PropertyFileSnitch we have
10.1.1.1=DC1:R1
10.1.1.2=DC1:R1
10.1.1.3=DC1:R1
10.2.1.1=DC2:R1
We plan on using a keyspace with the following definition:
CREATE KEYSPACE "cassandraKV"
WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'DC1' : 2, 'DC2' : 1};
In order to achieve the following:
2 replicas distributed among 3 nodes in DC1 (66% of total data per node) while still allowing a single node to go down without any data loss.
1 replica in DC2 (100% of total data per node)
We see the ownership distributed at 25% per node, while we would expect 33% on each node in DC1 and 100% in DC2.
Is this above configuration correct?
Thanks
My guess is that you ran nodetool status without specifying the keyspace. This will end up just showing you the general distribution of the tokens in your cluster, which will not be representative of your "cassandraKV" keyspace.
On running nodetool status "cassandraKV" you should see
Datacenter: DC1
10.1.1.1: 66%
10.1.1.2: 66%
10.1.1.3: 66%
Datacenter: DC2
10.2.1.1: 100%
You should see 66% for the DC1 nodes because each node holds 1 copy of its primary range (33%) and one copy of whatever it is replicating (33%), while DC2 holds 100% of all the data you are currently storing.
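For completeness, the effective ownership reported by nodetool status <keyspace> can be sanity-checked as min(1, RF / nodes in the DC); a small sketch with the numbers from this setup:

public class OwnershipMath
{
    // Effective per-node ownership for a keyspace: min(1, RF / nodes-in-DC).
    static double effectiveOwnership(int rf, int nodesInDc)
    {
        return Math.min(1.0, (double) rf / nodesInDc);
    }

    public static void main(String[] args)
    {
        System.out.printf("DC1: %.1f%%%n", 100 * effectiveOwnership(2, 3)); // 66.7%, the ~66% above
        System.out.printf("DC2: %.1f%%%n", 100 * effectiveOwnership(1, 1)); // 100.0%
    }
}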