I am trying to connect some NiFi nodes (1.13.2) to an external ZooKeeper in order to cluster the service, but I am running into errors while NiFi performs leader election through ZooKeeper. The error I found in nifi-app.log was:
2021-07-12 12:32:24,673 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@448ca30f Connection State changed to RECONNECTED
2021-07-12 12:32:24,673 WARN [main] o.a.nifi.controller.StandardFlowService There is currently no Cluster Coordinator. This often happens upon restart of NiFi when running an embedded ZooKeeper. Will register this node to become the active Cluster Coordinator and will attempt to connect to cluster again
2021-07-12 12:32:24,673 INFO [main] o.a.n.c.l.e.CuratorLeaderElectionManager CuratorLeaderElectionManager[stopped=false] Attempted to register Leader Election for role 'Cluster Coordinator' but this role is already registered
2021-07-12 12:32:24,774 INFO [main-EventThread] o.a.c.f.state.ConnectionStateManager State change: SUSPENDED
2021-07-12 12:32:24,774 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@448ca30f Connection State changed to SUSPENDED
2021-07-12 12:32:24,893 INFO [main-EventThread] o.a.c.f.state.ConnectionStateManager State change: RECONNECTED
2021-07-12 12:32:24,893 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@448ca30f Connection State changed to RECONNECTED
2021-07-12 12:32:24,894 INFO [main-EventThread] o.a.c.f.state.ConnectionStateManager State change: SUSPENDED
2021-07-12 12:32:24,894 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@448ca30f Connection State changed to SUSPENDED
2021-07-12 12:32:24,894 ERROR [main-EventThread] o.a.c.f.imps.CuratorFrameworkImpl Background operation retry gave up
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
at org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:862)
at org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:647)
at org.apache.curator.framework.imps.WatcherRemovalFacade.processBackgroundOperation(WatcherRemovalFacade.java:152)
at org.apache.curator.framework.imps.FindAndDeleteProtectedNodeInBackground$2.processResult(FindAndDeleteProtectedNodeInBackground.java:104)
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:630)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
My nifi.properties configs are:
nifi.state.management.configuration.file=./conf/state-management.xml
# The ID of the local state provider
nifi.state.management.provider.local=local-provider
# The ID of the cluster-wide state provider. This will be ignored if NiFi is not clustered but must be populated if running in a cluster.
nifi.state.management.provider.cluster=zk-provider
# Specifies whether or not this instance of NiFi should run an embedded ZooKeeper server
nifi.state.management.embedded.zookeeper.start=false
# Properties file that provides the ZooKeeper properties to use if <nifi.state.management.embedded.zookeeper.start> is set to true
nifi.state.management.embedded.zookeeper.properties=./conf/zookeeper.properties
nifi.cluster.is.node=true
nifi.cluster.node.address=nifi-node01
nifi.cluster.node.protocol.port=9999
nifi.cluster.node.protocol.threads=20
nifi.cluster.node.protocol.max.threads=50
nifi.cluster.node.event.history.size=25
nifi.cluster.node.connection.timeout=10 sec
nifi.cluster.node.read.timeout=10 sec
nifi.cluster.node.max.concurrent.requests=100
nifi.cluster.firewall.file=
nifi.cluster.flow.election.max.wait.time=1 mins
nifi.cluster.flow.election.max.candidates=
nifi.zookeeper.connect.string=zookeeper-node0.fqdn:2181,zookeeper-node1.fqdn:2181,zookeeper-node2.fqdn:2181
nifi.zookeeper.connect.timeout=10 secs
nifi.zookeeper.session.timeout=10 secs
nifi.zookeeper.root.node=/nifi-prod
nifi.zookeeper.client.secure=false
nifi.zookeeper.security.keystore=
nifi.zookeeper.security.keystoreType=
nifi.zookeeper.security.keystorePasswd=
nifi.zookeeper.security.truststore=
nifi.zookeeper.security.truststoreType=
nifi.zookeeper.security.truststorePasswd=
nifi.zookeeper.auth.type=default
nifi.zookeeper.kerberos.removeHostFromPrincipal=
nifi.zookeeper.kerberos.removeRealmFromPrincipal=
The state-management configs for zk-provider are:
<cluster-provider>
<id>zk-provider</id>
<class>org.apache.nifi.controller.state.providers.zookeeper.ZooKeeperStateProvider</class>
<property name="Connect String"> zookeeper-node0.fqdn:2181, zookeeper-node1.fqdn:2181, zookeeper-node2.fqdn:2181</property>
<property name="Root Node">/nifi-prod</property>
<property name="Session Timeout">10 seconds</property>
<property name="Access Control">Open</property>
</cluster-provider>
The ZooKeeper security configs on CM (CDH 6.3.3) are set as shown in a console screenshot (not reproduced here).
And the znode ACL on /nifi-prod is:
[zk: zookeeper-node0:2181(CONNECTED) 1] getAcl /nifi-prod
'world,'anyone
: cdrwa
Note: I got this ACL by connecting to ZooKeeper from the NiFi node with this command: /opt/cloudera/parcels/CDH-6.3.3-1.cdh6.3.3.p0.1796617/lib/zookeeper/bin/zkCli.sh -server zookeeper-node0:2181
Could this error occur because I did not set the security configs, even though the zkCli command works from the NiFi node without authentication? The ZooKeeper log does not print any errors.
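One sanity check that may help narrow this down (a sketch, not a confirmed fix): from each NiFi node, verify that every host in nifi.zookeeper.connect.string actually answers on port 2181. This assumes nc is installed and, on ZooKeeper 3.5+, that the stat four-letter-word command is allowed by 4lw.commands.whitelist:
# run from each NiFi node; hosts taken from nifi.zookeeper.connect.string
for zk in zookeeper-node0.fqdn zookeeper-node1.fqdn zookeeper-node2.fqdn; do
  echo "--- $zk ---"
  echo stat | nc -w 5 "$zk" 2181   # prints server stats if this ZooKeeper node is reachable and serving
done
If any of the hosts times out or refuses the connection, the repeated SUSPENDED/RECONNECTED cycle and the ConnectionLoss error above would point to a network or firewall issue rather than to the ACLs, since /nifi-prod is already world cdrwa.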
Related
spark.deploy.zookeeper.url
The documentation only introduces connections without a ZooKeeper password:
https://spark.apache.org/docs/latest/configuration.html#deploy
If ZooKeeper has a password, how do I set up Spark HA?
Thank you.
I tried to configure it like this, but got an error:
-Dspark.deploy.zookeeper.url=test:test123@172.28.1.43:2181
2023-02-08 16:16:53,448 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=test:test123@172.28.1.43:2181 sessionTimeout=60000 watcher=org.apache.curator.ConnectionState@8eb0a94
2023-02-08 16:17:03,495 WARN zookeeper.ClientCnxn: Session 0x0 for server test:test123@172.28.1.43:2181, unexpected error, closing socket connection and attempting reconnect
java.lang.IllegalArgumentException: Unable to canonicalize address test:test123@172.28.1.43:2181 because it's not resolvable
at org.apache.zookeeper.SaslServerPrincipal.getServerPrincipal(SaslServerPrincipal.java:65)
at org.apache.zookeeper.SaslServerPrincipal.getServerPrincipal(SaslServerPrincipal.java:41)
at org.apache.zookeeper.ClientCnxn$SendThread.startConnect(ClientCnxn.java:1001)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1060)
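For reference, the standalone HA settings documented on the linked configuration page only take host:port pairs; a ZooKeeper connect string does not carry credentials, which is why test:test123@... ends up being treated as an unresolvable host name in the error above. A minimal sketch using the ZooKeeper host from the attempt above (the /spark directory is an assumption):
# set on each standalone master before starting it
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=172.28.1.43:2181 -Dspark.deploy.zookeeper.dir=/spark"
If the ensemble requires authentication, that would normally be handled on the client side (for example SASL/digest via a JAAS login config passed with -Djava.security.auth.login.config) rather than inside the URL, though I have not verified that end to end with the standalone master.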
Every node of the cluster is on version DSE 6.0.14, and they are set up in SSL mode (listening on port 7001).
We're trying to add a node running the open-source 4.0 RC1 version.
We forced the port for internode communication on this node:
storage_port: 7001
otherwise the node tries to communicate on port 7000, which is closed.
We encountered the following error when trying to start the service on the new node:
INFO [main] 2021-05-10 16:22:00,985 StorageService.java:528 - Gathering node replacement information for /10.135.66.204:7001
DEBUG [main] 2021-05-10 16:22:00,986 YamlConfigurationLoader.java:112 - Loading settings from file:/etc/cassandra/default.conf/cassandra.yaml
DEBUG [main] 2021-05-10 16:22:00,996 YamlConfigurationLoader.java:112 - Loading settings from file:/etc/cassandra/default.conf/cassandra.yaml
INFO [Messaging-EventLoop-3-1] 2021-05-10 16:22:01,138 InboundConnectionInitiator.java:281 - peer /10.137.65.201:54916 only supports messaging versions lower (2) than this node supports (10)
ERROR [Messaging-EventLoop-3-2] 2021-05-10 16:22:01,237 NoSpamLogger.java:98 - /xx.xxx.xx.xxx:7001->/xx.xxx.xx.xxx:7001-URGENT_MESSAGES-[no-channel] failed to connect
java.nio.channels.ClosedChannelException: null
[...]
INFO [ScheduledTasks:1] 2021-05-10 16:22:02,398 TokenMetadata.java:525 - Updating topology for all endpoints that have changed
ERROR [Messaging-EventLoop-3-1] 2021-05-10 16:22:09,467 InboundConnectionInitiator.java:360 - Failed to properly handshake with peer /xx.xxx.xx.xxx:54922. Closing the channel.
java.lang.AssertionError: null
[...]
I don't know whether the error comes from a mistake in the config of the OSS 4.0 node, or from an incompatibility between the new node's version and the version of the existing nodes in the cluster.
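Since the existing DSE nodes only listen for TLS on port 7001, the internode encryption section of the new node's cassandra.yaml generally has to be filled in as well. The snippet below is only a sketch with placeholder keystore paths and passwords, not a confirmed fix for the handshake errors above:
storage_port: 7001
server_encryption_options:
  internode_encryption: all
  keystore: /etc/cassandra/conf/server-keystore.jks      # placeholder path
  keystore_password: changeit                            # placeholder
  truststore: /etc/cassandra/conf/server-truststore.jks  # placeholder path
  truststore_password: changeit                          # placeholder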
The Cassandra service (3.11.5) stops automatically after it is started or restarted on Amazon Linux.
I have a fresh installation of Cassandra on a new Amazon Linux instance (t3.xlarge), and when I run
sudo service cassandra start
or
sudo service cassandra restart
the service stops automatically after 1 or 2 seconds. I looked into the logs and found the following.
I am not sure why; I haven't changed any snitch-related configs and it has always been SimpleSnitch. I don't have multiple Cassandra instances, just a single one on a single EC2 node.
Logs
INFO [main] 2020-02-12 17:40:50,833 ColumnFamilyStore.java:426 - Initializing system.schema_aggregates
INFO [main] 2020-02-12 17:40:50,836 ViewManager.java:137 - Not submitting build tasks for views in keyspace system as storage service is not initialized
INFO [main] 2020-02-12 17:40:51,094 ApproximateTime.java:44 - Scheduling approximate time-check task with a precision of 10 milliseconds
ERROR [main] 2020-02-12 17:40:51,137 CassandraDaemon.java:759 - Cannot start node if snitch's data center (datacenter1) differs from previous data center (dc1). Please fix the snitch configuration, decommission and rebootstrap this node or use the flag -Dcassandra.ignore_dc=true.
Installation steps
sudo curl -OL https://www.apache.org/dist/cassandra/redhat/311x/cassandra-3.11.5-1.noarch.rpm
sudo rpm -i cassandra-3.11.5-1.noarch.rpm
sudo pip install cassandra-driver
export CQLSH_NO_BUNDLED=true
sudo chkconfig --levels 3 cassandra on
The issue is in your log file:
ERROR [main] 2020-02-12 17:40:51,137 CassandraDaemon.java:759 - Cannot start node if snitch's data center (datacenter1) differs from previous data center (dc1). Please fix the snitch configuration, decommission and rebootstrap this node or use the flag -Dcassandra.ignore_dc=true.
It seems that you started the cluster, stopped it and renamed the datacenter from dc1 to datacenter1.
In order to fix:
If no data is stored, delete the data directories
If data is stored, rename the datacenter back to dc1 in the config
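A sketch of the second option, assuming the node uses GossipingPropertyFileSnitch (with SimpleSnitch the datacenter name is fixed to datacenter1 and cannot be renamed); the alternative flag is taken straight from the error message:
# /etc/cassandra/conf/cassandra-rackdc.properties (only read by GossipingPropertyFileSnitch)
dc=dc1
rack=rack1
# or, to ignore the mismatch instead, add the flag named in the error message to jvm.options:
-Dcassandra.ignore_dc=true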
I had the same problem, where the Cassandra service immediately stopped after it was started.
In the Cassandra configuration file located at /etc/cassandra/cassandra.yaml, change the cluster_name back to the previous one, like this:
...
# The name of the cluster. This is mainly used to prevent machines in
# one logical cluster from joining another.
cluster_name: 'dc1'
# This defines the number of tokens randomly assigned to this node on the ring
# The more tokens, relative to other nodes, the larger the proportion of data
...
I'm trying to submit a Spark job to YARN (without HDFS) in HA mode.
For submitting, I'm using org.apache.spark.deploy.SparkSubmit.
When I send the request from the machine with the active ResourceManager, it works well. But if I try to send it from the machine with the standby ResourceManager, the job fails with this error:
DEBUG org.apache.hadoop.ipc.Client - Connecting to spark2-node-dev/10.10.10.167:8032
DEBUG org.apache.hadoop.ipc.Client - Connecting to /0.0.0.0:8032
org.apache.hadoop.ipc.Client - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep
However, when I send the request via the command line (spark-submit), it works well from both the active and the standby machine.
What could be causing the problem?
P.S. I use the same parameters for both ways of submitting the job (org.apache.spark.deploy.SparkSubmit and the spark-submit command line), and the yarn.resourcemanager.hostname.rm_id properties are defined for all RM hosts.
The problem was the absence of yarn-site.xml from the classpath of the jar that invokes SparkSubmit. The submitter invoked this way does not take the YARN_CONF_DIR or HADOOP_CONF_DIR environment variables into account, so it cannot see yarn-site.xml.
One solution I found was to put yarn-site.xml on that classpath, for example:
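A sketch of what that can look like when launching the submitting JVM by hand; the jar names, the main class, and the /etc/hadoop/conf path are placeholders for your own setup:
# make sure the directory containing yarn-site.xml is on the submitter's classpath
java -cp "my-submitter.jar:/etc/hadoop/conf:$SPARK_HOME/jars/*" \
  org.apache.spark.deploy.SparkSubmit \
  --master yarn --deploy-mode cluster \
  --class com.example.MyApp my-app.jar
Bundling yarn-site.xml into the root of the jar itself should achieve the same effect, since the jar root is on the classpath.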
I'm using the Spark Standalone Mode tutorial page to install Spark in standalone mode.
1- I have started a master by:
./sbin/start-master.sh
2- I have started a worker by:
./bin/spark-class org.apache.spark.deploy.worker.Worker spark://ubuntu:7077
Note: spark://ubuntu:7077 is my master URL, which I can see in the Master WebUI.
Problem: With the second command, a worker starts successfully, but it cannot associate with the master. It retries repeatedly and then gives this message:
15/02/08 11:30:04 WARN Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkMaster@ubuntu:7077]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: ubuntu/127.0.1.1:7077
15/02/08 11:30:04 INFO RemoteActorRefProvider$RemoteDeadLetterActorRef: Message [org.apache.spark.deploy.DeployMessages$RegisterWorker] from Actor[akka://sparkWorker/user/Worker#-1296628173] to Actor[akka://sparkWorker/deadLetters] was not delivered. [20] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
15/02/08 11:31:15 ERROR Worker: All masters are unresponsive! Giving up.
What is the problem?
Thanks
I usually start from the spark-env.sh template and set the properties that I need. For a simple cluster you need:
SPARK_MASTER_IP
Then create a file called "slaves" in the same directory as spark-env.sh and add the slaves' IPs (one per line). Make sure you can reach all slaves through ssh.
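A minimal sketch of those two files; the IP addresses are placeholders for your own machines:
# conf/spark-env.sh
export SPARK_MASTER_IP=192.168.1.10
# conf/slaves -- one worker IP per line
192.168.1.11
192.168.1.12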
Finally, copy this configuration to every machine in your cluster. Then start the entire cluster by executing the start-all.sh script and try spark-shell to check your configuration.
> sbin/start-all.sh
> bin/spark-shell
You can set export SPARK_LOCAL_IP="Your-IP" (the IP address Spark binds to on this node) in $SPARK_HOME/conf/spark-env.sh.
In my case, using Spark 2.4.7 in standalone mode, I had created a passwordless ssh key using ssh-keygen, but was still asked for the worker's password when starting the cluster.
What I did was follow the instructions here
https://www.cyberciti.biz/faq/how-to-set-up-ssh-keys-on-linux-unix/
This line solved the problem:
ssh-copy-id -i $HOME/.ssh/id_rsa.pub user@server-ip