Kafka Zookeeper Security Authentication & Authorization (JAAS) Using SASL

Regarding Kafka-ZooKeeper security using DIGEST-MD5 authentication, I am trying to rotate/change the credentials (password) in both the server (ZooKeeper) and client (Kafka) JAAS config files.
We have a cluster of 3 ZooKeeper nodes and 3 Kafka broker nodes with the JAAS configuration files below.
kafka.conf
Client {
org.apache.zookeeper.server.auth.DigestLoginModule required
username="super"
password="password";
};
zookeeper.conf
Server {
org.apache.zookeeper.server.auth.DigestLoginModule required
user_super="password";
};
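For context (not from the original post): these JAAS files are typically passed to the respective JVMs via java.security.auth.login.config. The file paths and environment variables below are only illustrative of a common setup with the stock start scripts.
# Kafka broker JVM (the ZooKeeper "Client" section above)
export KAFKA_OPTS="-Djava.security.auth.login.config=/etc/kafka/kafka.conf"
# ZooKeeper server JVM (the "Server" section above)
export SERVER_JVMFLAGS="-Djava.security.auth.login.config=/etc/zookeeper/zookeeper.conf"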
To rotate, we do a rolling restart of the server (ZooKeeper) instances after updating the credential (password). During this process, after updating the same credential/password for the super user on the client (Kafka) instances one at a time, we notice
[2019-06-15 17:17:38,929] INFO [ZooKeeperClient] Waiting until connected. (kafka.zookeeper.ZooKeeperClient)
[2019-06-15 17:17:38,929] INFO [ZooKeeperClient] Connected. (kafka.zookeeper.ZooKeeperClient)
these INFO-level messages in the server logs, which eventually result in an unclean shutdown and restart of the broker, impacting writes and reads for longer than expected. I have tried commenting out requireClientAuthScheme=sasl in the ZooKeeper zoo.cfg (https://cwiki.apache.org/confluence/display/ZOOKEEPER/Client-Server+mutual+authentication) to allow any client to authenticate to ZooKeeper, but with no success.
As an alternative approach, I also tried to update the credential/password in the JAAS config dynamically using sasl.jaas.config, and I get the same exception documented in this JIRA (reference: https://issues.apache.org/jira/browse/KAFKA-8010).
Does anyone have any suggestions? Thanks in advance.

Related

Errors persisting after recovering YugabyteDB cluster

[Question posted by a user on YugabyteDB Community Slack]
We’re trying to do a postmortem on an issue we hit in our cluster. It looks like one of our 3 nodes went down and the other two were unable to process requests until it came back. Looking over the logs, I see this message a lot from both before and during the outage:
W0810 00:46:40.740047 3997211 leader_election.cc:285] T 00000000000000000000000000000000 P f65e3577ff4e42a3b935c36a99be1fb9 [CANDIDATE]: Term 7 pre-election: Tablet error from VoteRequest() call to peer df99aaa63d14414785aa9842fcf2fdc1: Invalid argument (yb/tserver/service_util.h:75): RequestConsensusVote: Wrong destination UUID requested. Local UUID: 55065b84a4df41ffac5841463871778a. Requested UUID: df99aaa63d14414785aa9842fcf2fdc1
I0810 00:46:40.740072 3997211 leader_election.cc:244] T 00000000000000000000000000000000 P f65e3577ff4e42a3b935c36a99be1fb9 [CANDIDATE]: Term 7 pre-election: Election decided. Result: candidate lost.
Unfortunately, we lost the logs from the node that went down due to a data loss issue on our side. Also, I'm actually still seeing the messages above even though the cluster has recovered, so it looks like we're still in that state.
What does this mean and does it prevent the cluster from electing a new leader?
The yb-master process recently running on prod-db-us-2 has a UUID of 55065b84a4df41ffac5841463871778a but the yb-master process running on prod-db-us-1 believes that the yb-master on prod-db-us-2 has a UUID of df99aaa63d14414785aa9842fcf2fdc1. This seems like a configuration issue.
My guess is that 55065b84a4df41ffac5841463871778a was originally df99aaa63d14414785aa9842fcf2fdc1. The UUID could change if the data directory is wiped.
You had a data loss incident on prod-db-us-2 about a month and a half ago, so that's probably when the UUID changed.
Here’s the official documentation for replacing a failed master: https://docs.yugabyte.com/preview/troubleshoot/cluster/replace_master/
Alternatively, you could wipe 55065b84a4df41ffac5841463871778a and create a new yb-master using the gflag instance_uuid_override to force it to initialize with UUID df99aaa63d14414785aa9842fcf2fdc1.
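A rough sketch of that second option, assuming the stock yb-master binary; the data directory and master addresses below are placeholders, and only --instance_uuid_override comes from the answer above.
# after wiping the data directory of the master that currently reports 55065b84a4df41ffac5841463871778a
./bin/yb-master \
  --fs_data_dirs=/mnt/d0 \
  --master_addresses=<existing yb-master addresses> \
  --instance_uuid_override=df99aaa63d14414785aa9842fcf2fdc1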

Kafka Schema Registry failed to initialize after ZK and Brokers are configured with SASL_PLAINTEXT Security

We are using the Confluent Community edition setup for Kafka. We currently have a requirement to configure ACLs around the cluster, so we have configured the ZooKeeper and broker nodes such that clients require SASL_PLAINTEXT authentication (username/password) to publish/subscribe to the cluster. This works perfectly without Schema Registry; however, when configuring Schema Registry, it is unable to initialize and throws the exception below, even though we have configured it to use a SASL_PLAINTEXT connection to the broker/ZooKeeper nodes. Is there anything I'm missing? Please help.
Please note we are using the allow.everyone.if.no.acl.found=true flag and currently we don't have any ACLs defined, so I don't think we need to set up any ACLs for the _schemas topic which Schema Registry uses to initialize.
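For reference, the SASL-related portion of schema-registry.properties in such a setup typically looks something like the following; the broker addresses, JAAS path, and values are illustrative, not the poster's actual configuration.
# schema-registry.properties (illustrative)
kafkastore.bootstrap.servers=broker1:9092,broker2:9092,broker3:9092
kafkastore.security.protocol=SASL_PLAINTEXT
kafkastore.sasl.mechanism=PLAIN
kafkastore.topic=_schemas
# JAAS credentials are normally passed to the JVM, e.g.:
# export SCHEMA_REGISTRY_OPTS="-Djava.security.auth.login.config=/etc/schema-registry/schema_registry_jaas.conf"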
[2019-12-17 00:33:23,844] ERROR Error starting the schema registry (io.confluent.kafka.schemaregistry.rest.SchemaRegistryRestApplication:64)
io.confluent.kafka.schemaregistry.exceptions.SchemaRegistryInitializationException: Error initializing kafka store while initializing schema registry
at io.confluent.kafka.schemaregistry.storage.KafkaSchemaRegistry.init(KafkaSchemaRegistry.java:212)
at io.confluent.kafka.schemaregistry.rest.SchemaRegistryRestApplication.initSchemaRegistry(SchemaRegistryRestApplication.java:62)
at io.confluent.kafka.schemaregistry.rest.SchemaRegistryRestApplication.setupResources(SchemaRegistryRestApplication.java:73)
at io.confluent.kafka.schemaregistry.rest.SchemaRegistryRestApplication.setupResources(SchemaRegistryRestApplication.java:40)
at io.confluent.rest.Application.createServer(Application.java:201)
at io.confluent.kafka.schemaregistry.rest.SchemaRegistryMain.main(SchemaRegistryMain.java:42)
Caused by: io.confluent.kafka.schemaregistry.storage.exceptions.StoreInitializationException: Timed out trying to create or validate schema topic configuration
at io.confluent.kafka.schemaregistry.storage.KafkaStore.createOrVerifySchemaTopic(KafkaStore.java:172)
at io.confluent.kafka.schemaregistry.storage.KafkaStore.init(KafkaStore.java:114)
at io.confluent.kafka.schemaregistry.storage.KafkaSchemaRegistry.init(KafkaSchemaRegistry.java:210)
... 5 more
Caused by: java.util.concurrent.TimeoutException
at org.apache.kafka.common.internals.KafkaFutureImpl$SingleWaiter.await(KafkaFutureImpl.java:108)
at org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:274)
at io.confluent.kafka.schemaregistry.storage.KafkaStore.createOrVerifySchemaTopic(KafkaStore.java:165)
... 7 more

Accumulo's createtable command gets stuck and does not create a table

I was trying to create a table inside Accumulo using the createtable command and found that it was getting stuck. I waited for around 20 minutes before cancelling the createtable command.
createtable test_table
I have one master and 2 tablet servers, and found that my master and one of the tablet servers had died. I could not telnet to port 9997 of that particular tablet server, and I could not even telnet to port 29999 (master.port.client in accumulo-site.xml). When I looked at the tserver logs of the dead server, I saw the following entries.
2016-05-10 02:12:07,456 [zookeeper.DistributedWorkQueue] INFO : Got unexpected zookeeper event: None for /accumulo/be4f66be-1508-4314-9bff-888b56d9b0ce/recovery
2016-05-10 02:12:23,883 [zookeeper.ZooCache] WARN : Saw (possibly) transient exception communicating with ZooKeeper, will retry
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /accumulo/be4f66be-1508-4314-9bff-888b56d9b0ce/tables
at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
at org.apache.accumulo.fate.zookeeper.ZooCache$1.run(ZooCache.java:210)
at org.apache.accumulo.fate.zookeeper.ZooCache.retry(ZooCache.java:162)
at org.apache.accumulo.fate.zookeeper.ZooCache.getChildren(ZooCache.java:221)
at org.apache.accumulo.core.client.impl.Tables.exists(Tables.java:142)
at org.apache.accumulo.server.tabletserver.LargestFirstMemoryManager.tableExists(LargestFirstMemoryManager.java:149)
at org.apache.accumulo.server.tabletserver.LargestFirstMemoryManager.getMemoryManagementActions(LargestFirstMemoryManager.java:175)
at org.apache.accumulo.tserver.TabletServerResourceManager$MemoryManagementFramework.manageMemory(TabletServerResourceManager.java:408)
at org.apache.accumulo.tserver.TabletServerResourceManager$MemoryManagementFramework.access$400(TabletServerResourceManager.java:318)
at org.apache.accumulo.tserver.TabletServerResourceManager$MemoryManagementFramework$2.run(TabletServerResourceManager.java:346)
at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
at java.lang.Thread.run(Thread.java:745)
2016-05-10 02:12:23,884 [zookeeper.ZooCache] WARN : Saw (possibly) transient exception communicating with ZooKeeper, will retry
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /accumulo/be4f66be-1508-4314-9bff-888b56d9b0ce/tables/!0/conf/table.classpath.context
at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
at org.apache.accumulo.fate.zookeeper.ZooCache$2.run(ZooCache.java:264)
at org.apache.accumulo.fate.zookeeper.ZooCache.retry(ZooCache.java:162)
at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:289)
at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:238)
at org.apache.accumulo.server.conf.ZooCachePropertyAccessor.get(ZooCachePropertyAccessor.java:117)
at org.apache.accumulo.server.conf.ZooCachePropertyAccessor.get(ZooCachePropertyAccessor.java:103)
at org.apache.accumulo.server.conf.TableConfiguration.get(TableConfiguration.java:99)
at org.apache.accumulo.tserver.constraints.ConstraintChecker.classLoaderChanged(ConstraintChecker.java:93)
at org.apache.accumulo.tserver.tablet.Tablet.checkConstraints(Tablet.java:1225)
at org.apache.accumulo.tserver.TabletServer$8.run(TabletServer.java:2848)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
2016-05-10 02:12:23,887 [zookeeper.ZooReader] WARN : Saw (possibly) transient exception communicating with ZooKeeper
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /accumulo/be4f66be-1508-4314-9bff-888b56d9b0ce/tservers/accumulo.tablet.2:9997
at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
at org.apache.accumulo.fate.zookeeper.ZooReader.getStatus(ZooReader.java:132)
at org.apache.accumulo.fate.zookeeper.ZooLock.process(ZooLock.java:383)
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
2016-05-10 02:12:24,252 [watcher.MonitorLog4jWatcher] INFO : Changing monitor log4j address to accumulo.master:4560
2016-05-10 02:12:24,252 [watcher.MonitorLog4jWatcher] INFO : Enabled log-forwarding
Even the master server's logs had the same stacktrace. My zookeeper is running.
At first, I thought it was a disk issue, perhaps no space left, but that was not the case. I ran fsck on the Accumulo instance.volumes and it returned a HEALTHY status.
Does anyone know what exactly happened and if possible, how to avoid it?
EDIT : Even the tracer_accumulo.master.log had the same stacktrace.
ZooKeeper session expirations occur when a thread inside the ZooKeeper client does not get to run within the necessary time (by default, 30s) to maintain the session, which is in-memory state shared between the ZooKeeper client and server. There is no single explanation for this, but there are several common culprits:
JVM garbage collection pauses in the client. Accumulo should log a warning if it experienced a pause.
Lack of CPU time. If the host itself is overburdened, Accumulo might not have the cycles to run all of the tasks it needs to in a timely manner.
Lack of sockets/file handles. Accumulo could be trying to connect to ZooKeeper but be unable to open new connections.
ZooKeeper might be rate-limiting connections as a denial-of-service prevention. Check the zookeeper logs for errors about dropping/denying new connections from a specific IP, and, if you see these errors, consider increasing maxClientCnxns in zoo.cfg.
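As an illustration of that last point, the relevant zoo.cfg setting might look like this; the value shown is only an example (the ZooKeeper default is 60), so tune it to your environment.
# zoo.cfg
# maximum number of concurrent connections a single client IP may open (0 = unlimited)
maxClientCnxns=100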

What is zookeeper.broker.path?

I'm learning Spark and Kafka and came across this project, kafka-spark-consumer, that seems to consume messages from Kafka efficiently. This project requires configuring a few Kafka and ZooKeeper properties, and that's where I'm struggling. I mean, what does the property zookeeper.broker.path mean? Sorry if it's a basic question.
I have configured Kafka as a single node with the following properties,
broker.id=1
port=9093
log.dir=/tmp/kafka-logs-1
and zookeeper as,
zookeeper.connect=localhost:2181/brokers
zookeeper.connection.timeout.ms=6000
If I try to configure zookeeper.broker.path with /brokers, I get the following exception from the consumer:
Exception in thread "main" java.lang.RuntimeException: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /brokers/topics/<name>/partitions
at consumer.kafka.ReceiverLauncher.getNumPartitions(ReceiverLauncher.java:217)
at consumer.kafka.ReceiverLauncher.createStream(ReceiverLauncher.java:79)
at consumer.kafka.ReceiverLauncher.launch(ReceiverLauncher.java:51)
at com.ibm.spark.streaming.KafkaConsumer.run(KafkaConsumer.java:78)
at com.ibm.spark.streaming.KafkaConsumer.start(KafkaConsumer.java:43)
at com.ibm.spark.streaming.KafkaConsumer.main(KafkaConsumer.java:103)
Can you help me understand what the zookeeper broker path is here and how I can configure it?
EDIT
The above error was caused by a non-existent topic; the moment I created the topic, the error went away.
As answered by user007, the /brokers node is created in ZooKeeper by default.
There is no need for '/brokers' in the zookeeper.connect property. It should be
zookeeper.connect=localhost:2181
I am not familiar with the "kafka-spark-consumer" project you mentioned, but /brokers is usually the default node Kafka creates in ZooKeeper. I haven't seen any library asking the user to configure it.
/brokers is the znode path under which metadata like topics are stored.
Go to the Kafka bin directory, then invoke the ZooKeeper shell: ./zookeeper-shell.sh localhost
Then run ls /brokers. You should be able to see topics and other child nodes created there.
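A hedged example of what that check might look like; the port is ZooKeeper's default, $KAFKA_HOME stands in for wherever Kafka is installed, and the child nodes listed are what a typical Kafka cluster creates, so your output will differ.
cd $KAFKA_HOME/bin
./zookeeper-shell.sh localhost:2181
# inside the shell, list the broker metadata znodes:
ls /brokers
# typically shows children such as [ids, topics, seqid]
ls /brokers/topics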

Cassandra and defuncting connection

I've got a question about Cassandra. I haven't found any "understandable answer" yet...
I built a cluster of 3 nodes (RackInferringSnitch) on different VMs. I'm using DataStax's Java Driver to read and update my keyspace (with CSVs).
When one node is down (i.e. 10.10.6.172), I get this debug warning:
INFO 00:47:37,195 New Cassandra host /10.10.6.172:9042 added
INFO 00:47:37,246 New Cassandra host /10.10.6.122:9042 added
DEBUG 00:47:37,264 [Control connection] Refreshing schema
DEBUG 00:47:37,384 [Control connection] Successfully connected to /10.10.6.171:9042
DEBUG 00:47:37,391 Adding /10.10.6.172:9042 to list of queried hosts
DEBUG 00:47:37,395 Defuncting connection to /10.10.6.172:9042
com.datastax.driver.core.TransportException: [/10.10.6.172:9042] Channel has been closed
at com.datastax.driver.core.Connection$Dispatcher.channelClosed(Connection.java:621)
at
[...]
[...]
DEBUG 00:47:37,400 [/10.10.6.172:9042-1] Error connecting to /10.10.6.172:9042 (Connection refused: /10.10.6.172:9042)
DEBUG 00:47:37,407 Error creating pool to /10.10.6.172:9042 ([/10.10.6.172:9042] Cannot connect)
DEBUG 00:47:37,408 /10.10.6.172:9042 is down, scheduling connection retries
DEBUG 00:47:37,409 First reconnection scheduled in 1000ms
DEBUG 00:47:37,410 Adding /10.10.6.122:9042 to list of queried hosts
DEBUG 00:47:37,423 Adding /10.10.6.171:9042 to list of queried hosts
DEBUG 00:47:37,427 Adding /10.10.6.122:9042 to list of queried hosts
DEBUG 00:47:37,435 Shutting down pool
DEBUG 00:47:37,439 Adding /10.10.6.171:9042 to list of queried hosts
DEBUG 00:47:37,443 Shutting down pool
DEBUG 00:47:37,459 Connected to cluster: WormHole
I wanted to know whether I need to handle this exception or whether it will be handled by itself (I mean, when the node comes back up again, will Cassandra do the correct write if the batch was a write?).
EDIT : Current consistency level is ONE.
The DataStax driver keeps track of which nodes are available at all times and routes queries (load balancing) based on this information. The way it does this is based on your reconnection policy.
You will see debug-level messages when nodes are detected as down, etc. This is no cause for concern, as the driver will re-route to other available nodes; it will also retry the downed nodes periodically to find out if they are back up. If you had a problem and the data was not getting saved to Cassandra, you would see timeout errors. No action is necessary in this case.
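For completeness, here is a hedged sketch of how the reconnection behaviour mentioned above is configured with the DataStax Java driver (2.x/3.x API); the contact point, keyspace name, and delays are illustrative, not taken from the question.
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.policies.ExponentialReconnectionPolicy;

public class ClusterConnect {
    public static void main(String[] args) {
        // Retry hosts marked down starting after 1s, backing off up to 60s between attempts.
        Cluster cluster = Cluster.builder()
                .addContactPoint("10.10.6.171")   // any reachable node
                .withReconnectionPolicy(new ExponentialReconnectionPolicy(1000, 60000))
                .build();
        Session session = cluster.connect("my_keyspace");  // keyspace name is a placeholder
        // Queries issued here are routed only to hosts the driver currently sees as up.
        session.execute("SELECT release_version FROM system.local");
        cluster.close();
    }
}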
