As part of a research project our lab is running, we would like to record the startup time of the Cassandra service. Is there any way I can record this data? (One way I could achieve this is by sniffing whether the port is open or closed, but I don't feel that is a reliable approach.)
I couldn't find any standard profiler that could help with this; please point me to one if it exists.
One thing you could do is look for specific messages in the system.log.
When Cassandra starts up, the first message is about reading the cassandra.yaml file:
INFO [main] 2022-01-13 08:32:06,717 YamlConfigurationLoader.java:93 - Configuration location: file:/Users/aaronploetz/local/apache-cassandra-4.0.0/conf/cassandra.yaml
When startup is complete, there is a log message which clearly states this:
INFO [main] 2022-01-13 08:32:11,823 CassandraDaemon.java:780 - Startup complete
Depending on your exact definition of "startup time," you could just look for messages from the CassandraDaemon. That initially kicks off after the config file loads, but it's the beginning of the Java process actually coming up.
Key on those messages in the log, and you'll also get the times.
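If you just want the number, a low-tech option is to parse those two timestamps out of system.log and take the difference. Here's a rough sketch of that idea; the log path is an assumption, so point it at wherever your system.log actually lives:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StartupTimeFromLog {
    // Cassandra log timestamps look like "2022-01-13 08:32:06,717"
    private static final Pattern TS =
            Pattern.compile("\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2},\\d{3}");
    private static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    public static void main(String[] args) throws IOException {
        // Assumed log location; adjust for your installation
        List<String> lines = Files.readAllLines(Paths.get("/var/log/cassandra/system.log"));

        LocalDateTime start = null;
        LocalDateTime end = null;
        for (String line : lines) {
            if (line.contains("Configuration location")) {
                // First message of a (re)start: reset so we pair it with its own completion
                start = timestampOf(line);
                end = null;
            } else if (line.contains("Startup complete")) {
                end = timestampOf(line);
            }
        }

        if (start != null && end != null) {
            System.out.println("Startup took " + Duration.between(start, end).toMillis() + " ms");
        } else {
            System.out.println("Could not find both startup messages in the log");
        }
    }

    private static LocalDateTime timestampOf(String line) {
        Matcher m = TS.matcher(line);
        return m.find() ? LocalDateTime.parse(m.group(), FMT) : null;
    }
}

Reading the whole file is fine for a lab setup; for long-running nodes you would want to scan from the end or rotate the log between runs.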
If by "profiler" you mean being able to do it programmatically, you can replicate the code in the BootstrapBinaryDisabledTest class. In the bootstrap() method, you will see that it checks the logs for the string "Starting listening for CQL clients":
node.logs().watchFor("Starting listening for CQL clients");
A node will only accept requests from CQL clients (applications) on port 9042 (by default) when the initialisation process has completed successfully. For Cassandra 4.0, the log entry looks like:
INFO [main] <date time> PipelineConfigurator.java:125 - Starting listening for CQL clients on /x.x.x.x:9042 (unencrypted)...
For Cassandra 3.11, the log entry looks like:
INFO [main] <date time> Server.java:159 - Starting listening for CQL clients on /x.x.x.x:9042 (unencrypted)...
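If you can't easily pull the in-jvm dtest classes into your project, a bare-bones stand-in for watchFor() is to start your own timer when you launch the daemon and tail system.log until that line appears. A minimal sketch, assuming the default log location and a simple 200 ms poll:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class WaitForCqlClients {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Start this right after you start the Cassandra daemon
        long begin = System.currentTimeMillis();

        // Assumed log path; adjust for your installation.
        // Note: this scans from the top of the file, so truncate or rotate the log
        // between runs to avoid matching a line from a previous startup.
        try (BufferedReader reader = new BufferedReader(new FileReader("/var/log/cassandra/system.log"))) {
            while (true) {
                String line = reader.readLine();
                if (line == null) {
                    Thread.sleep(200);   // nothing new in the log yet, poll again
                    continue;
                }
                if (line.contains("Starting listening for CQL clients")) {
                    long elapsed = System.currentTimeMillis() - begin;
                    System.out.println("Node ready for CQL clients after " + elapsed + " ms");
                    return;
                }
            }
        }
    }
}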
The startup time for Cassandra is measured from when the daemon is started until the node is listening for CQL clients. Cheers!
Related
Okay. Where to start? I am deploying a set of Spark applications to a Kubernetes cluster. I have one Spark Master, 2 Spark Workers, MariaDB, a Hive Metastore (that uses MariaDB - and it's not a full Hive install - it's just the Metastore), and a Spark Thrift Server (that talks to Hive Metastore and implements the Hive API).
So this setup is working pretty well for everything except the setup of the Thrift Server job (start-thriftserver.sh in the Spark sbin directory on the thrift server pod). By "working well" I mean that from outside my cluster I can create Spark jobs and submit them to the master, and then using the Web UI I can see my test app run to completion utilizing both workers.
Now the problem. When you launch start-thriftserver.sh it submits a job to the cluster with itself as the driver (I believe, and that is correct behavior). When I look at the related Spark job via the Web UI, I see that its workers repeatedly get launched and then exit shortly thereafter. Looking at the workers' stderr logs, I see that every worker launches and tries to connect back to the thrift server pod on the spark.driver.port, which I also believe is correct behavior. The gotcha is that the connection fails with an unknown host exception: the worker uses the raw Kubernetes pod name of the thrift server (not a service name, and with no IP embedded in the name) and reports that it cannot find the thrift server that initiated the job. But Kubernetes DNS only registers service names, plus pod records keyed by their private IPs; the raw name of a pod (without an IP) is never registered in DNS. That is simply not how Kubernetes works.
So my question: I am struggling to figure out why the Spark worker pod is using a raw pod name to try to find the thrift server. It seems it should never do this, and that it should be impossible to ever satisfy that request. I have wondered if there is some Spark config setting that would tell the workers that the (thrift) driver they need to be looking for is actually spark-thriftserver.my-namespace.svc, but despite a lot of searching I can't find anything.
There are so many settings that go into a cluster like this that I don't want to barrage you with info. One thing that might clarify my setup: the following string is dumped at the top of a failing worker's log. Notice the raw pod name of the thrift server in the --driver-url. If anyone has any clue what steps to take to fix this, please let me know. I'll edit this post and share settings etc. as people request them. Thanks for helping.
Spark Executor Command: "/usr/lib/jvm/java-1.8-openjdk/jre/bin/java" "-cp" "/spark/conf/:/spark/jars/*" "-Xmx512M" "-Dspark.master.port=7077" "-Dspark.history.ui.port=18081" "-Dspark.ui.port=4040" "-Dspark.driver.port=41617" "-Dspark.blockManager.port=41618" "-Dspark.master.rest.port=6066" "-Dspark.master.ui.port=8080" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@spark-thriftserver-6bbb54768b-j8hz8:41617" "--executor-id" "12" "--hostname" "172.17.0.6" "--cores" "1" "--app-id" "app-20220408001035-0000" "--worker-url" "spark://Worker@172.17.0.6:37369"
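To double-check the DNS point above, this is the kind of lookup test I can run from inside one of the worker pods. Treat it as a sketch: the pod name is the one from the command above, and the service name is just the one I would expect to resolve in my namespace, so adjust both for your setup.

import java.net.InetAddress;
import java.net.UnknownHostException;

public class ResolveCheck {
    public static void main(String[] args) {
        // Raw pod name taken from the --driver-url above vs. the service name I expect to work
        String[] names = {
            "spark-thriftserver-6bbb54768b-j8hz8",
            "spark-thriftserver.my-namespace.svc"
        };
        for (String name : names) {
            try {
                System.out.println(name + " -> " + InetAddress.getByName(name).getHostAddress());
            } catch (UnknownHostException e) {
                System.out.println(name + " -> unknown host (not registered in cluster DNS)");
            }
        }
    }
}

If the DNS behaviour is what I think it is, the raw pod name should fail exactly the way the executor does, while the service name resolves, which supports my hunch that the workers are being handed the wrong driver hostname.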
Is there a timeout setting in the cassandra.yaml file used to cause server-side timeouts when issuing a drop table command?
I'm using the following versions of software:
Cassandra database version: 3.11.2
Cassandra datastax java driver version: 3.4.0
I tried changing the cassandra.yaml settings write_request_timeout_in_ms, truncate_request_timeout_in_ms, and request_timeout_in_ms all to 10 ms and then issued a drop table statement via the DataStax Java driver. From my application logs I can see the statement takes about 2 seconds when measured from the client (the client and database both run on my local development machine, which is doing nothing else during this test) and finishes without a timeout.
I then executed the exact same test but replaced the "drop table" text in the statement with "truncate table" with no other changes and saw the expected timeout "com.datastax.driver.core.exceptions.TruncateException: Error during truncate: Truncate timed out - received only 0 responses".
I tried searching the Cassandra GitHub project but couldn't find a reference in the code showing how the server-side timeouts are applied, so I am hoping someone knows the answer to this question.
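For reference, the client-side test is essentially the following sketch (the keyspace and table names are placeholders, the contact point assumes my local machine, and the cassandra.yaml timeouts are already lowered to 10 ms as described above):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.exceptions.TruncateException;

public class DropVsTruncateTest {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {

            // The drop consistently takes about 2 seconds and never times out
            long t0 = System.currentTimeMillis();
            session.execute("DROP TABLE my_keyspace.my_table");
            System.out.println("DROP took " + (System.currentTimeMillis() - t0) + " ms");

            // The truncate fails with the expected server-side timeout
            try {
                session.execute("TRUNCATE TABLE my_keyspace.my_other_table");
            } catch (TruncateException e) {
                System.out.println("TRUNCATE timed out: " + e.getMessage());
            }
        }
    }
}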
I'm running KairosDB (a time series database) on top of Cassandra. I'm seeing a recurring debug message from Cassandra/KairosDB that comes up all the time. They are able to connect to each other, but I fear something is not configured correctly. Here is a snippet of the messages and how they repeat over time:
08:06:00.000 [QuartzScheduler_QuartzSchedulerThread] DEBUG [QuartzSchedulerThread.java:268] - batch acquisition of 0 triggers
08:06:00.000 [QuartzScheduler_Worker-3] DEBUG [JobRunShell.java:212] - Calling execute on job DEFAULT.org.kairosdb.core.reporting.MetricReporterService
08:06:00.001 [QuartzScheduler_Worker-3] DEBUG [MetricReporterService.java:93] - Reporting metrics
08:06:09.335 [Hector.me.prettyprint.cassandra.connection.CassandraHostRetryService-1] DEBUG [CassandraHostRetryService.java:123] - Retry service fired... nothing to do.
Has anyone experienced this before and knows the issue?
Thanks in advance!
There is no problem with these messages; they are perfectly normal.
KairosDB has some jobs triggered by Quartz that report metrics, and when using Cassandra (via the Hector client) it periodically checks for hosts that have been marked down or are unresponsive.
We are having issues with dropped connections in a kafka-node client. It appears that ZooKeeper is resetting connections on us. When I run a tcpdump and view it in Wireshark, I see the following all over the place:
The source is always one of our ZooKeeper servers and the destination is our Kafka consumer. It appears that our client handles these just fine in most situations. In fact, I'm not at all convinced this is the cause of our failures, but it does seem odd. I was hoping someone with more experience in how kafka-node, ZooKeeper, and Kafka interact could provide some explanation.
ADDING SOME DETAILS FROM LOG
So, I see a few things in the logs. First, there are a ton of the following:
2016-03-11 20:26:32,357 [myid:2] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@868] - Client attempting to establish new session at /10.196.2.106:59300
2016-03-11 20:26:32,358 [myid:2] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@822] - Connection request from old client /10.196.2.106:59296; will be dropped if server is in r-o mode
Then there are a whole bunch of these:
2016-03-12 03:40:49,041 [myid:1] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@357] - caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x1527b11827bfcfe, likely client has closed socket
at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228)
at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
at java.lang.Thread.run(Thread.java:745)
2016-03-12 03:40:49,042 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection for client /10.196.2.106:33197 which had sessionid 0x1527b11827bfcfe
The strange thing is that those IPs correlate to our Secor servers, so this is probably not related.
Other than that, I do not see anything really out of the ordinary.
While trying to use Cassandra version 2.0.1, I started facing a version handshaking problem.
There was an exception from OutboundTcpConnection.java stating that handshaking is not possible with a particular node.
I had a look at the TCP dump and ruled out any problem in the network layer.
The application is not completing the handshaking process. Moreover, port 7000 is still active.
For example, all 8 of my nodes are up, but when I run nodetool status, some nodes show a DN (down) status. On further examination, the TCP backlog queue was found to be overflowing and the affected server had stopped listening for the other servers in the cluster.
I am still not able to spot the root cause of this problem.
Note: I tried the previous version of Cassandra, 1.2.4, and it was working fine at that time. Before going to production, I thought it would be better to move to the 2.0.x version, mainly to avoid migration overhead later. Can anyone provide an idea on this?
The exception I am getting is:
INFO [HANDSHAKE-/aa.bb.cc.XX] 2013-10-03 17:36:16,948 OutboundTcpConnection.java (line 385) Handshaking version with /aa.bb.cc.XX
INFO [HANDSHAKE-/aa.bb.cc.YY] 2013-10-03 17:36:17,280 OutboundTcpConnection.java (line 396) Cannot handshake version with /aa.bb.cc.YY
This sounds like https://issues.apache.org/jira/browse/CASSANDRA-6349. You should upgrade.