Single node Cassandra behaving like multinode Cassandra - cassandra

I have a single node Cassandra cluster which hits to ReadTimeOutException and I observe the following logs in server which seems weird to me,
ERROR [SharedPool-Worker-91] 2018-05-29 12:09:53,023 ErrorMessage.java:338 - Unexpected exception during request
java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 1 responses.
at org.apache.cassandra.auth.CassandraRoleManager.getRole(CassandraRoleManager.java:489) ~[apache-cassandra-3.0.8.jar:3.0.8]
at org.apache.cassandra.auth.CassandraRoleManager.getRoles(CassandraRoleManager.java:269) ~[apache-cassandra-3.0.8.jar:3.0.8]
at org.apache.cassandra.auth.RolesCache.getRoles(RolesCache.java:66) ~[apache-cassandra-3.0.8.jar:3.0.8]
at org.apache.cassandra.auth.Roles.hasSuperuserStatus(Roles.java:51) ~[apache-cassandra-3.0.8.jar:3.0.8]
at org.apache.cassandra.auth.AuthenticatedUser.isSuper(AuthenticatedUser.java:71) ~[apache-cassandra-3.0.8.jar:3.0.8]
It says Operation timed out - received only 1 responses., In Single node why Its saying expecting more than one response? Could some explain pls.
NOTE : I have enabled different strategy for this system_auth keyspace
CREATE KEYSPACE system_auth WITH replication = {'class': 'NetworkTopologyStrategy', 'datacenterproc': '1'} AND durable_writes = true;
and set consistency level as LOCAL_QUORUM
Cassandra server version : 3.0.8
Will this be a reason?

GCs longer than the timeout (like the 9 second GCs you posted in other questions) can cause internal auth timeouts. Most likely it received the response, had the GC, then registered as timeout.

Related

Getting schema version mismatch error while trying to migrate from DSE Cassandra(6.0.5) to Apache Cassandra(3.11.3)

We are trying to migrate/replicate data from a DSE Cassandra node to apache Cassandra node.
Have done POC of that on the local machine and getting a schema version problem.
Below are the details of the POC and the occurring problem.
DSE Cassandra node Details:
dse_version: 6.0.5
release_version: 4.0.0.605
sstable format:
aa-1-bti-CompressionInfo.db
aa-1-bti-Digest.crc32
aa-1-bti-Partitions.db
aa-1-bti-Statistics.db
aa-1-bti-Data.db
aa-1-bti-Filter.db
aa-1-bti-Rows.db
aa-1-bti-TOC.txt
Apache Cassandra node Details:
release_version: 3.11.3
sstable format:
mc-1-big-CompressionInfo.db
mc-1-big-Digest.crc32
mc-1-big-Statistics.db
mc-1-big-Data.db
mc-1-big-Filter.db
mc-1-big-TOC.txt
mc-1-big-Summary.db
There is 1 cluster(My Cluster) in which I have a total of 4 nodes.
2 dse nodes(lets say DSE1 and DSE2) are in one Datacenter(i.e., dc1).
2 apache nodes(lets say APC1 and APC2) are in other Datacenter(i.e., dc2).
Note: I have used NetworkTopologyStrategy Topology Strategy for keyspace and GossipingPropertyFileSnitch as endpoint_snitch. Have added
$JVM_OPTS -Dcassandra.allow_unsafe_replace=true in cassandra-env.sh file also.
When I am creating keyspace on DSE1 node with following CQL query:
CREATE KEYSPACE abc
WITH REPLICATION = {
'class' : 'NetworkTopologyStrategy',
'dc1' : 2,
'dc2' : 2
}
AND DURABLE_WRITES = true;
The keyspace is getting created on DSE2 node but the following error is thrown by cqlsh:
Warning: schema version mismatch detected; check the schema versions of your nodes in system.local and system.peers
Also the following error is thrown from one of the 2 Apache nodes(APC1/APC2):
org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find table for cfId 02559ab1-91ee-11ea-8450-2df21166f6a4. If a table was just created, this is likely due to the schema not being fully propagated. Please wait for schema agreement on table creation
Have checked the schema version also on all 4 nodes, getting below result:
Cluster Information:
Name: My Cluster
Snitch: org.apache.cassandra.locator.GossipingPropertyFileSnitch
DynamicEndPointSnitch: enabled
Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
Schema versions:
84c22c85-8165-398f-ab9a-e25a6169b7d3: [127.0.0.4, 127.0.0.6]
4c451173-5a05-3691-9a14-520419f849da: [127.0.0.5, 127.0.0.7]
Have tried to resolve the same using solution given in the following link:
https://myadventuresincoding.wordpress.com/2019/04/03/cassandra-fix-schema-disagreement/
But the problem still persists.
Also, Can we migrate data from DSE Cassandra node to apache Cassandra node naturally as suggested in the following link:
Migrate Datastax Enterprise Cassandra to Apache Cassandra
Can anyone please suggest, how to overcome the problem. Is there any other upgradation or compatibility fix we need to implement to resolve this.

Cassandra read timeout exception while insert

I have a 16 node cassandra cluster and I am inserting 50.000 rows, pretty much in parallel, from an external tool(which is installed in every node) in every cassandra node with Simba Cassandra JDBC Driver. While the insertion takes place, sometimes/rarely, I get the following error on (mostly/usually) two of the nodes:
Execute failed: [Simba]CassandraJDBCDriver Error setting/closing connection: All host(s) tried for query failed (tried: localhost/127.0.0.1:9042 (com.simba.cassandra.shaded.datastax.driver.core.exceptions.ReadTimeoutException: Cassandra timeout during read query at consistency ONE (1 responses were required but only 0 replica responded))).
java.sql.SQLException: [Simba]CassandraJDBCDriver Error setting/closing connection: All host(s) tried for query failed (tried: localhost/127.0.0.1:9042 (com.simba.cassandra.shaded.datastax.driver.core.exceptions.ReadTimeoutException: Cassandra timeout during read query at consistency ONE (1 responses were required but only 0 replica responded))).
Caused by: com.simba.cassandra.shaded.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: localhost/127.0.0.1:9042 (com.simba.cassandra.shaded.datastax.driver.core.exceptions.ReadTimeoutException: Cassandra timeout during read query at consistency ONE (1 responses were required but only 0 replica responded)))
The weird thing is that it is a readtimeout exception, while I am just trying to insert. I have not changed any read_time_out or other parameters in the .yaml file, so they are the default. This means that if I try to count(*) on something from cqlsh I also get a readtimeout exception.
ReadTimeout: Error from server: code=1200 [Coordinator node timed out waiting for replica nodes' responses] message="Operation timed out - received only 0 responses." info={'received_responses': 0, 'required_responses': 1, 'consistency': 'ONE'}
I do not know if these two are related though! Any ideas on what might be going on and how to avoid the first error "All host(s) tried for query failed"??

Accumulo's createtable command gets stuck and does not create a table

I was trying to create a table inside Accumulo using the createtable command and found out that it was getting stuck. I waited for around 20 mins before cancelling the createtable command.
createtable test_table
I have one master and 2 tablet servers and found out that my master and one of the tablets died. I could not telnet to port 9997 of that particular tablet server and I could not even telnet to port 29999 (master.port.client in accumulo-site.xml). When I saw the tserver logs of the dead server, I saw the following entries.
2016-05-10 02:12:07,456 [zookeeper.DistributedWorkQueue] INFO : Got unexpected z
ookeeper event: None for /accumulo/be4f66be-1508-4314-9bff-888b56d9b0ce/recovery
2016-05-10 02:12:23,883 [zookeeper.ZooCache] WARN : Saw (possibly) transient exc
eption communicating with ZooKeeper, will retry
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode =
Session expired for /accumulo/be4f66be-1508-4314-9bff-888b56d9b0ce/tables
at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
at org.apache.accumulo.fate.zookeeper.ZooCache$1.run(ZooCache.java:210)
at org.apache.accumulo.fate.zookeeper.ZooCache.retry(ZooCache.java:162)
at org.apache.accumulo.fate.zookeeper.ZooCache.getChildren(ZooCache.java
:221)
at org.apache.accumulo.core.client.impl.Tables.exists(Tables.java:142)
at org.apache.accumulo.server.tabletserver.LargestFirstMemoryManager.tab
leExists(LargestFirstMemoryManager.java:149)
at org.apache.accumulo.server.tabletserver.LargestFirstMemoryManager.get
MemoryManagementActions(LargestFirstMemoryManager.java:175)
at org.apache.accumulo.tserver.TabletServerResourceManager$MemoryManagem
entFramework.manageMemory(TabletServerResourceManager.java:408)
at org.apache.accumulo.tserver.TabletServerResourceManager$MemoryManagem
entFramework.access$400(TabletServerResourceManager.java:318)
at org.apache.accumulo.tserver.TabletServerResourceManager$MemoryManagem
entFramework$2.run(TabletServerResourceManager.java:346)
at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.jav
a:35)
at java.lang.Thread.run(Thread.java:745)
2016-05-10 02:12:23,884 [zookeeper.ZooCache] WARN : Saw (possibly) transient exc
eption communicating with ZooKeeper, will retry
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode =
Session expired for /accumulo/be4f66be-1508-4314-9bff-888b56d9b0ce/tables/!0/con
f/table.classpath.context
at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
at org.apache.accumulo.fate.zookeeper.ZooCache$2.run(ZooCache.java:264)
at org.apache.accumulo.fate.zookeeper.ZooCache.retry(ZooCache.java:162)
at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:289)
at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:238)
at org.apache.accumulo.server.conf.ZooCachePropertyAccessor.get(ZooCache
PropertyAccessor.java:117)
at org.apache.accumulo.server.conf.ZooCachePropertyAccessor.get(ZooCache
PropertyAccessor.java:103)
at org.apache.accumulo.server.conf.TableConfiguration.get(TableConfigura
tion.java:99)
at org.apache.accumulo.tserver.constraints.ConstraintChecker.classLoader
Changed(ConstraintChecker.java:93)
at org.apache.accumulo.tserver.tablet.Tablet.checkConstraints(Tablet.jav
a:1225)
at org.apache.accumulo.tserver.TabletServer$8.run(TabletServer.java:2848
)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:51
1)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.
access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.
run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.
java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor
.java:617)
at java.lang.Thread.run(Thread.java:745)
2016-05-10 02:12:23,887 [zookeeper.ZooReader] WARN : Saw (possibly) transient ex
ception communicating with ZooKeeper
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode =
Session expired for /accumulo/be4f66be-1508-4314-9bff-888b56d9b0ce/tservers/accu
mulo.tablet.2:9997
at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
at org.apache.accumulo.fate.zookeeper.ZooReader.getStatus(ZooReader.java
:132)
at org.apache.accumulo.fate.zookeeper.ZooLock.process(ZooLock.java:383)
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.j
ava:522)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
2016-05-10 02:12:24,252 [watcher.MonitorLog4jWatcher] INFO : Changing monitor lo
g4j address to accumulo.master:4560
2016-05-10 02:12:24,252 [watcher.MonitorLog4jWatcher] INFO : Enabled log-forward
ing
Even the master server's logs had the same stacktrace. My zookeeper is running.
At first, I thought it was a disk issue. Maybe there was no space. But that was not the case. I ran the fsck on the accumulo instance.volumes and it returned the HEALTHY status.
Does anyone know what exactly happened and if possible, how to avoid it?
EDIT : Even the tracer_accumulo.master.log had the same stacktrace.
ZooKeeper session expirations occur when a thread inside the ZooKeeper client does not get run within the necessary time (by default, 30s) to maintain the session which is an in-memory state between ZooKeeper client and server. There is no single explanation for this, but many common culprits:
JVM garbage collection pauses in the client. Accumulo should log a warning if it experienced a pause.
Lack of CPU time. If the host itself is overburdened, Accumulo might not have the cycles to run all of the tasks it needs to in a timely manner.
Lack of sockets/filehandles, Accumulo could be trying to connect to ZooKeeper, but be unable to open new connections
ZooKeeper might be rate-limiting connections as a denial-of-service prevention. Check the zookeeper logs for errors about dropping/denying new connections from a specific IP, and, if you see these errors, consider increasing maxClientCnxns in zoo.cfg.

Cassandra: Operation time-out

I am using cassandra node js driver and i am getting following error:
error: Database error found %s . On selectAllJobs() call
{ name: 'ResponseError',
message: 'Operation timed out - received only 0 responses.',
info: 'Represents an error message from the server',
code: 4608,
consistencies: 1,
received: 0,
blockFor: 1,
isDataPresent: 0,
query: 'SELECT * FROM cron_tasks WHERE type =? AND starts < ? ALLOW FILTERING ;' }
This error occurred when i ported to new instance of AWS. Earlier, everything went fine.
Cassandra version:
[cqlsh 4.1.1 | Cassandra 2.0.12 | CQL spec 3.1.1 | Thrift protocol 19.39.0]
Read_timeout error means that the coordinator of the query does not know whether the request succeeded or failed, so all it can tell the client is that the request timed out.
In your case, it means that the coordinator of the query sent the request internally to the replica but the replica didn't respond in time.
You can enable query tracing and execute in cqlsh to understand why it is happening.
You can read more about how Cassandra deals with replica failure.

Cassandra Streaming error - Unknown keyspace system_traces

In our dev cluster, which has been running smooth before, when we replace a node (which we have been doing constantly) the following failure occurs and prevents the replacement node from joining.
cassandra version is 2.0.7
What can be done about it?
ERROR [STREAM-IN-/10.128.---.---] 2014-11-19 12:35:58,007 StreamSession.java (line 420) [Stream #9cad81f0-6fe8-11e4-b575-4b49634010a9] Streaming error occurred
java.lang.AssertionError: Unknown keyspace system_traces
at org.apache.cassandra.db.Keyspace.<init>(Keyspace.java:260)
at org.apache.cassandra.db.Keyspace.open(Keyspace.java:110)
at org.apache.cassandra.db.Keyspace.open(Keyspace.java:88)
at org.apache.cassandra.streaming.StreamSession.addTransferRanges(StreamSession.java:239)
at org.apache.cassandra.streaming.StreamSession.prepare(StreamSession.java:436)
at org.apache.cassandra.streaming.StreamSession.messageReceived(StreamSession.java:368)
at org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:289)
at java.lang.Thread.run(Thread.java:745)
I got the same error while I was trying to setup my cluster, and as I was experimenting with different switches in cassandra.yaml, I restarted the service multiple times and removed the system dir under data directory (/var/lib/cassandra/data as mentioned here).
I guess for some reason cassandra tries to load system_traces keyspace and fails (the other dir under /var/lib/cassandra/data), and nodetool throws this error. You can just remove both system and system_traces before starting cassandra service, or even better delete all content of bommitlog, data and savedcache there.
This works obviously if you dont have any data just yet in the system.

Resources