I am unable to start a 3 node universe with yb-master. I am following the docs here:
https://docs.yugabyte.com/latest/deploy/manual-deployment/start-masters/#verify-health
I created 3 master.conf files for 3 separate ips.
For 10.0.0.185:
--master_addresses=10.0.0.185:7100,10.0.0.141:7100,10.0.0.119:7100
--rpc_bind_addresses=10.0.0.185:7100
--fs_data_dirs=/home/mark/yuga/y1
For 10.0.0.141:
--master_addresses=10.0.0.141:7100,10.0.0.185:7100,10.0.0.119:7100
--rpc_bind_addresses=10.0.0.141:7100
--fs_data_dirs=/home/mark/yuga/y1
For 10.0.0.119:
--master_addresses=10.0.0.119:7100,10.0.0.141:7100,10.0.0.185:7100
--rpc_bind_addresses=10.0.0.119:7100
--fs_data_dirs=/home/mark/yuga/y1
I started each node up with the command ./bin/yb-master --flagfile master.conf >& ./y1/yb-master.out &
What seems to happen is that the first 2 nodes start up fine but as soon as I try to spin up the third node, the first node crashes and I end up with the error:
At first I thought may be this has to do with the servers I've got so I changed up the order I spin up the yb-masters, but it's always the first one I spin up first that dies.
Looking at the yb-master.INFO for each ip from yb1/yb-data/master/logs/yb-master.INFO with the command cat y1/yb-data/master/logs/yb-master.INFO | grep master I see:
The one that crashes:
This master's current role is: FOLLOWER
And the other two show:
I0110 00:02:56.565732 3292 client-internal.cc:2384] New master addresses: [10.0.0.141:7100,10.0.0.185:7100,10.0.0.119:7100, 10.0.0.141:7100, 10.0.0.185:7100, 10.0.0.119:7100]
E0110 00:02:58.069311 3162 async_initializer.cc:99] Failed to initialize client: Timed out (yb/rpc/rpc.cc:224): Could not locate the leader master: GetLeaderMasterRpc(addrs: [10.0.0.141:7100, 10.0.0.185:7100, 10.0.0.119:7100, 10.0.0.141:7100, 10.0.0.185:7100, 10.0.0.119:7100], num_attempts: 46) passed its deadline 1101.945s (passed: 1.504s): Network error (yb/util/net/socket.cc:551): recvmsg error: Connection refused (system error 111)
I0110 00:02:59.071501 3293 client-internal.cc:2355] Reinitialize master addresses from file: master.conf
I0110 00:02:59.071782 3293 client-internal.cc:2384] New master addresses: [10.0.0.141:7100,10.0.0.185:7100,10.0.0.119:7100, 10.0.0.141:7100, 10.0.0.185:7100, 10.0.0.119:7100]
and
I0110 00:02:57.610631 2128 master_service.cc:531] Patching role from leader to follower because of: Leader not ready to serve requests (yb/master/scoped_leader_shared_lock.cc:123): Leader not yet ready to serve requests: leader_ready_term_ = -1; cstate.current_term = 1 [suppressed 77 similar messages]
I0110 00:02:58.072002 2144 client-internal.cc:2355] Reinitialize master addresses from file: master.conf
I0110 00:02:58.072276 2144 client-internal.cc:2384] New master addresses: [10.0.0.119:7100,10.0.0.141:7100,10.0.0.185:7100, 10.0.0.119:7100, 10.0.0.141:7100, 10.0.0.185:7100]
I'm not sure why I'm seeing those errors, am I missing something while attempting to start up the 3 yb-masters?
I should also mention that I've ensured all 3 nodes have the correct system configurations, as mentioned here: https://docs.yugabyte.com/latest/deploy/manual-deployment/system-config/#setting-system-wide-ulimits
Related
I have d/l'd arangodb3-linux-3.9.2 from GIT on Centos 7. I created a database dir and ran the README instructions for a standalone start. The first time it runs, I get 100 failures, the key INFO log lines seem to be
... [INFO] server started component=arangodb pid=49827 type=single
... [INFO] Wait on 49827 returned component=arangodb exit-status=1 trap-cause=-1
It creates the log file, setup.json and a single8529 dir in the database dir I sped'd. Is it just taking too long to start? The whole 100 fails take about 1 or 2 seconds.
If I try to run it again with the same README instructions, the next time I get this error:
... [FATAL] Failed to run service error="open /.../single8529/data/ENGINE: no such file"
I have also tried with --starter.host 127.0.0.1 -- to simplify
Also I and can confirm that port 8529 is open
I couldn't get arangodb 'starter' according to their README to work, but this does start the server:
arangod --database.directory MYDIR --rocksdb.max-background-jobs 4
I have been using georep for the last two months and posted this on their GitHub but no answers so far.
Description of problem: after copying ~8TB without any issue, some nodes are flipping between Active and Faulty with the following error message in gsync log:
ssh> failed with UnicodeDecodeError: 'ascii' codec can't decode byte 0xf2 in position 60: ordinal not in range(128).
Default encoding in all machines is utf-8
Command to reproduce the issue:
gluster volume georeplication master_vol user#slave_machine::slave_vol start
The full output of the command that failed:
The command itself it's fine but you need to start it to fail, hence the command it's not the issue on it's own
Expected results:
No such failures, copy should go as planned
Mandatory info:
The output of the gluster volume info command:
Volume Name: volname
Type: Distributed-Replicate
Volume ID: d5a46398-9638-4b50-9db0-4cd7019fa526
Status: Started
Snapshot Count: 0
Number of Bricks: 12 x 2 = 24
Transport-type: tcp
Bricks: 24 bricks (omited the names cause not relevant and too large)
Options Reconfigured:
features.ctime: off
cluster.min-free-disk: 15%
performance.readdir-ahead: on
server.event-threads: 8
cluster.consistent-metadata: on
performance.cache-refresh-timeout: 1
diagnostics.client-log-level: WARNING
diagnostics.brick-log-level: WARNING
performance.flush-behind: off
performance.cache-size: 5GB
performance.cache-max-file-size: 1GB
performance.io-thread-count: 32
performance.write-behind-window-size: 8MB
client.event-threads: 8
network.inode-lru-limit: 1000000
performance.md-cache-timeout: 1
performance.cache-invalidation: false
performance.stat-prefetch: on
features.cache-invalidation-timeout: 30
features.cache-invalidation: off
cluster.lookup-optimize: on
performance.client-io-threads: on
nfs.disable: on
transport.address-family: inet
storage.owner-uid: 33
storage.owner-gid: 33
features.bitrot: on
features.scrub: Active
features.scrub-freq: weekly
cluster.rebal-throttle: lazy
geo-replication.indexing: on
geo-replication.ignore-pid-check: on
changelog.changelog: on
The output of the gluster volume status command:
Don't really think this is relevant as everything seems fine, if needed I'll post it
The output of the gluster volume heal command:
Same as before
**- Provide logs present on following locations of client and server nodes -
/var/log/glusterfs/
Not the relevant ones as is georep, posting the exact issue: (this log is from master volume node)
[2022-09-23 09:53:32.565196] I [master(worker /bricks/brick1/data):1439:process] _GMaster: Entry Time Taken [{MKD=0}, {MKN=0}, {LIN=0}, {SYM=0}, {REN=0}, {RMD=0}, {CRE=0}, {duration=0.0000}, {UNL=0}]
[2022-09-23 09:53:32.565651] I [master(worker /bricks/brick1/data):1449:process] _GMaster: Data/Metadata Time Taken [{SETA=0}, {SETX=0}, {meta_duration=0.0000}, {data_duration=1663926812.5656}, {DATA=0}, {XATT=0}]
[2022-09-23 09:53:32.566270] I [master(worker /bricks/brick1/data):1459:process] _GMaster: Batch Completed [{changelog_end=1663925895}, {entry_stime=None}, {changelog_start=1663925895}, {stime=(0, 0)}, {duration=673.9491}, {num_changelogs=1}, {mode=xsync}]
[2022-09-23 09:53:32.668133] I [master(worker /bricks/brick1/data):1703:crawl] _GMaster: processing xsync changelog [{path=/var/lib/misc/gluster/gsyncd/georepsession/bricks-brick1-data/xsync/XSYNC-CHANGELOG.1663926139}]
[2022-09-23 09:53:33.358545] E [syncdutils(worker /bricks/brick1/data):325:log_raise_exception] : connection to peer is broken
[2022-09-23 09:53:33.358802] E [syncdutils(worker /bricks/brick1/data):847:errlog] Popen: command returned error [{cmd=ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/secret.pem -p 22 -oControlMaster=auto -S /tmp/gsyncd-aux-ssh-GcBeU5/38c083bada86a45a28e6710377e456f6.sock geoaccount#slavenode6 /usr/libexec/glusterfs/gsyncd slave mastervol geoaccount#slavenode1::slavevol --master-node masternode21 --master-node-id 08c7423e-c2b6-4d40-adc8-d2ded4f66608 --master-brick /bricks/brick1/data --local-node slavenode6 --local-node-id bc1b3971-50a7-4b32-a863-aaaa02419de6 --slave-timeout 120 --slave-log-level INFO --slave-gluster-log-level INFO --slave-gluster-command-dir /usr/sbin --master-dist-count 12}, {error=1}]
[2022-09-23 09:53:33.358927] E [syncdutils(worker /bricks/brick1/data):851:logerr] Popen: ssh> failed with UnicodeDecodeError: 'ascii' codec can't decode byte 0xf2 in position 60: ordinal not in range(128).
[2022-09-23 09:53:33.672739] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change [{status=Faulty}]
[2022-09-23 09:53:45.477905] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change [{status=Initializing...}]
**- Is there any crash ? Provide the backtrace and coredump
Provided log up
Additional info:
Master volume: 12x2 Distributed-replicated setup, been working for a couple years no, no big issues as of today. 160TB of Data
Slave volume: 2x(5+1) Distributed-disperse setup, created exclusively to be a slave georep node. Managed to copy 11TB of data from master node, but it's failing.
The operating system / glusterfs version:
On ALL nodes: Glusterfs version= 9.6
Master nodes OS: CentOS 7
Slave nodes OS: Debian11
Extra questions
Don't really know if it's the place to ask this, but while we're at it, any guidance as of how to improve sync performance? Tried changing the parameter sync_jobs up to 9 (from 3) but as we've seen (while it was working) it'd only copy from 3 nodes max, at a "low" speed (about 40% of our bandwidth). It could go as high as 1Gbps but the max we got was 370Mbps.
Also, is there any in-depth documentation for georep? The basics we found were too basic and we did miss more doc to read and dig up into.
cassandra service (3.11.5) stops automatically after it starts/restart on AWS linux.
I have fresh installation of cassandra on new instance of AWS linux (t3.xlarge) and
sudo service cassandra start
or
sudo service cassandra restart
after 1 or 2 seconds, the service stop automatically. I looked into logs and I found these.
I am not sure, I havent change configs related to snitch and its always SimpleSnitch. I dont have any multiple cassandras. Just only on single EC2.
Logs
INFO [main] 2020-02-12 17:40:50,833 ColumnFamilyStore.java:426 - Initializing system.schema_aggregates
INFO [main] 2020-02-12 17:40:50,836 ViewManager.java:137 - Not submitting build tasks for views in keyspace system as storage service is not initialized
INFO [main] 2020-02-12 17:40:51,094 ApproximateTime.java:44 - Scheduling approximate time-check task with a precision of 10 milliseconds
ERROR [main] 2020-02-12 17:40:51,137 CassandraDaemon.java:759 - Cannot start node if snitch's data center (datacenter1) differs from previous data center (dc1). Please fix the snitch configuration, decommission and rebootstrap this node or use the flag -Dcassandra.ignore_dc=true.
Installation steps
sudo curl -OL https://www.apache.org/dist/cassandra/redhat/311x/cassandra-3.11.5-1.noarch.rpm
sudo rpm -i cassandra-3.11.5-1.noarch.rpm
sudo pip install cassandra-driver
export CQLSH_NO_BUNDLED=true
sudo chkconfig --levels 3 cassandra on
The issue is in your log file:
ERROR [main] 2020-02-12 17:40:51,137 CassandraDaemon.java:759 - Cannot start node if snitch's data center (datacenter1) differs from previous data center (dc1). Please fix the snitch configuration, decommission and rebootstrap this node or use the flag -Dcassandra.ignore_dc=true.
It seems that you started the cluster, stopped it and renamed the datacenter from dc1 to datacenter1.
In order to fix:
If no data is stored, delete the data directories
If data is stored, rename the datacenter back to dc1 in the config
I had the same problem , where cassandra service immediately stops after it was started.
in the cassandra configuration file located at /etc/cassandra/cassandra.yaml change the cluster_name to the previous one, like this:
...
# The name of the cluster. This is mainly used to prevent machines in
# one logical cluster from joining another.
cluster_name: 'dc1'
# This defines the number of tokens randomly assigned to this node on the ring
# The more tokens, relative to other nodes, the larger the proportion of data
...
I'm using Spark Standalone Mode tutorial page to install Spark in Standalone mode.
1- I have started a master by:
./sbin/start-master.sh
2- I have started a worker by:
./bin/spark-class org.apache.spark.deploy.worker.Worker spark://ubuntu:7077
Note: spark://ubuntu:7077 is my master name, which I can see it in Master-WebUI.
Problem: By second command, a worker started successfully. But it couldn't associate with master. It tries repeatedly and then give this message:
15/02/08 11:30:04 WARN Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkMaster#ubuntu:7077]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: ubuntu/127.0.1.1:7077
15/02/08 11:30:04 INFO RemoteActorRefProvider$RemoteDeadLetterActorRef: Message [org.apache.spark.deploy.DeployMessages$RegisterWorker] from Actor[akka://sparkWorker/user/Worker#-1296628173] to Actor[akka://sparkWorker/deadLetters] was not delivered. [20] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
15/02/08 11:31:15 ERROR Worker: All masters are unresponsive! Giving up.
What is the problem?
Thanks
I usually start from spark-env.sh template. And I set, properties that I need. For simple cluster you need:
SPARK_MASTER_IP
Then, create a file called "slaves" in the same directory as spark-env.sh and slaves ip's (one per line). Assure you reach all slaves through ssh.
Finally, copy this configuration in every machine of your cluster. Then start the entire cluster executing start-all.sh script and try spark-shell to check your configuration.
> sbin/start-all.sh
> bin/spark-shell
You can set export SPARK_LOCAL_IP="You-IP" #to set the IP address Spark binds to on this node in $SPARK_HOME/conf/spark-env.sh
In my case, using spark 2.4.7 in standalone mode, I've created a passwordless ssh key using ssh-keygen, but still got asked for worker password when starting the cluster.
What I did was follow the instructions here
https://www.cyberciti.biz/faq/how-to-set-up-ssh-keys-on-linux-unix/
This line solved the problem:
ssh-copy-id -i $HOME/.ssh/id_rsa.pub user#server-ip
I am trying to configure two windows servers in my network as Cassandra cluster.
I did some reading in various sites and changed the below in Cassandra.yalm
after changing the default value of 127.0.0.1 to actual IP the Cassandra service is not starting.
I also added the map to actual IP to localhost in (windows) hosts file.
After doing the above change, the service is coming up when I start the service. it is stopping immediately.
The reason I am changing this IP is to make this a cluster with two node setup,
Please let me know if I miss some thing.
Version: Datastax community version of Cassandra
Server : windows.
Thx
Muthu
Message from Cassandra.txt in logs dir:
ERROR [main] 2014-09-18 11:43:12,155 DatabaseDescriptor.java (line 116) Fatal configuration error
org.apache.cassandra.exceptions.ConfigurationException: Invalid yaml Caused by: Can't construct a java object for tag:yaml.org,2002:org.apache.cassandra.config.Config; exception=Cannot create property=seed_provider for JavaBean=org.apache.cassandra.config.Config#34e5190a; No suitable constructor with 2 arguments found for class org.apache.cassandra.config.SeedProviderDef in 'reader', line 8, column 1: cluster_name: 'Test Cluster'
If you want to create Cassandra cluster you must have at least two nodes and configure /etc/cassandra/cassandra.yaml
cassandra.yaml
cluster_name: 'Some Cluster Name'
listen_address: [Current IP]
rpc_address: [Current IP]
seed_provuder:
- seeds: "[Current IP], [Remote IP]"
Note: seeds must have at least two IPs which must be reachable for each other
Clean and start Cassandra instance
sudo rm -rf /var/lib/cassandra/* /var/log/cassandra/*
Note: Cassandra instance must be killed before cleaning those folders.