My problem is about Pacemaker. Suppose the cluster has two resources, and one of them is starting, taking, say, 3 minutes. If the monitor of the other resource fails during those 3 minutes, Pacemaker does not immediately call the stop/start methods to restart it; instead it waits for the first resource to finish starting. Only after the first resource has started completely does the second resource begin restarting. Does anyone know why? Thank you very much!
My cluster version:
corosync 2.3.4
pacemaker 1.1.13
My cluster configuration is as follows. For debugging, I added "sleep 60" to the start function of the OCF agent (a sketch of that change follows the configuration below).
crm configure show
node 168002177: 192.168.2.177
node 168002178: 192.168.2.178
node 168002179: 192.168.2.179
primitive fm_mgt fm_mgt \
op monitor interval=20s timeout=120s \
op stop interval=0 timeout=120s on-fail=restart \
op start interval=0 timeout=120s on-fail=restart \
meta target-role=Started
primitive logserver logserver \
op monitor interval=20s timeout=120s \
op stop interval=0 timeout=120s on-fail=restart \
op start interval=0 timeout=120s on-fail=restart \
meta target-role=Started
clone fm_mgt_replica fm_mgt
clone logserver_replica logserver
property cib-bootstrap-options: \
have-watchdog=false \
dc-version=1.1.13-10.el7-44eb2dd \
cluster-infrastructure=corosync \
stonith-enabled=false \
start-failure-is-fatal=false
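For reference, the debugging change mentioned above is roughly this; a minimal sketch, assuming a shell-based OCF agent (everything in the function body apart from the sleep is a placeholder):
fm_mgt_start() {
    # artificial delay so the start operation takes at least 60 seconds,
    # which makes it easy to observe how Pacemaker schedules the other resource
    sleep 60
    # ... the agent's original start logic goes here ...
    return $OCF_SUCCESS
}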
When I kill the fm_mgt service on node 177 and then kill the logserver service on 177, fm_mgt needs at least one minute to start; during that minute, logserver is not restarted until fm_mgt has recovered completely.
crm status
Last updated: Thu Oct 26 06:40:24 2017 Last change: Thu Oct 26 06:36:33 2017 by root via crm_resource on 192.168.2.177
Stack: corosync
Current DC: 192.168.2.179 (version 1.1.13-10.el7-44eb2dd) - partition with quorum
3 nodes and 6 resources configured
Online: [ 192.168.2.177 192.168.2.178 192.168.2.179 ]
Full list of resources:
Clone Set: logserver_replica [logserver]
logserver (ocf::heartbeat:logserver): FAILED 192.168.2.177
Started: [ 192.168.2.178 192.168.2.179 ]
Clone Set: fm_mgt_replica [fm_mgt]
Started: [ 192.168.2.178 192.168.2.179 ]
Stopped: [ 192.168.2.177 ]
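As an aside for anyone debugging the same scheduling behaviour, a hedged suggestion: crm_simulate (shipped with Pacemaker) can show the transition the policy engine computes from the live cluster state, including which actions it orders before others:
# show scores and pending actions for the live CIB
crm_simulate -sL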
On my instance, I added a runner on an Apple Silicon M1, but this runner doesn't start.
That's why I assigned a project to it, hoping it would start, but I see this.
How can I check why there is a red "!"? What prevents it from starting?
This is what I did.
Create the Docker runner:
docker stop gitlab-runner && docker rm gitlab-runner
docker run -d --name gitlab-runner --restart always \
-v /var/run/docker.sock:/var/run/docker.sock \
-v /Users/Shared/gitlab-runner/config:/etc/gitlab-runner \
gitlab/gitlab-runner:latest
Register the runner:
docker run --rm -v /Users/Shared/gitlab-runner/config:/etc/gitlab-runner gitlab/gitlab-runner register \
--non-interactive \
--executor "docker" \
--docker-image hannesa2/android-ndk:api28_emu \
--url "http://latitude:8083/" \
--registration-token "<TOKEN>" \
--description "M1 pro Android NDK + Emu" \
--tag-list "android,android-ndk,android-emu" \
--run-untagged="true" \
--locked="false" \
--access-level="not_protected"
and I see this in the Docker log:
Runtime platform arch=arm64 os=linux pid=8 revision=4b9e985a version=14.4.0
Starting multi-runner from /etc/gitlab-runner/config.toml... builds=0
Running in system-mode.
Configuration loaded builds=0
listen_address not defined, metrics & debug endpoints disabled builds=0
[session_server].listen_address not defined, session endpoints disabled builds=0
ERROR: Checking for jobs... forbidden runner=Jc2yrs_v
ERROR: Checking for jobs... forbidden runner=Jc2yrs_v
ERROR: Checking for jobs... forbidden runner=Jc2yrs_v
ERROR: Runner http://latitude:8083/Jc2yrs_v8JkJyMJAGUd_ is not healthy and will be disabled!
Configuration loaded builds=0
Thank you
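For anyone hitting the same "Checking for jobs... forbidden" loop, a hedged suggestion (the cause is not visible in these logs alone; a revoked or reset registration token is a common culprit): ask the runner to re-verify its stored credentials against the GitLab server, and inspect what it actually saved at registration time:
# check whether GitLab still accepts the token saved during registration
docker exec gitlab-runner gitlab-runner verify
# inspect the token the runner stored
cat /Users/Shared/gitlab-runner/config/config.toml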
Below is the code I'm using to write the output DataFrame to BigQuery from a PySpark cluster (Dataproc). While running this earlier I was getting a heartbeat timeout issue, which I fixed. Then I was getting "executor memory overhead exceeded", so I increased that. Now this chunk of code runs indefinitely, and the log shows
2020-06-12 02:55:45.395 IST Cache Size Before Clean: 34922812, Total Deleted: 0, Public Deleted: 0, Private Deleted: 0
What should I understand from it? Is it running or not? If it's not running, what is the solution?
# write the output DataFrame to BigQuery via the spark-bigquery connector,
# staging the data in a temporary GCS bucket first
output.write \
    .format("bigquery") \
    .option("table", "{}.{}".format(bq_dataset, bq_table)) \
    .option("temporaryGcsBucket", gcs_bucket) \
    .mode('append') \
    .save()
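For context, a job like this is typically submitted with the spark-bigquery connector on the classpath. A minimal sketch, assuming a Dataproc cluster; the script name, cluster name, and region are placeholders, and the jar is the connector's documented public location (pick the build matching your Scala version):
gcloud dataproc jobs submit pyspark write_to_bq.py \
    --cluster=my-cluster \
    --region=us-central1 \
    --jars=gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar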
The Cassandra service (3.11.5) stops automatically after it starts/restarts on AWS Linux.
I have a fresh installation of Cassandra on a new AWS Linux instance (t3.xlarge). When I run
sudo service cassandra start
or
sudo service cassandra restart
the service stops automatically after 1 or 2 seconds. I looked into the logs and found these.
I am not sure why; I haven't changed any configs related to the snitch, and it's always SimpleSnitch. I don't have multiple Cassandras, just a single EC2 instance.
Logs
INFO [main] 2020-02-12 17:40:50,833 ColumnFamilyStore.java:426 - Initializing system.schema_aggregates
INFO [main] 2020-02-12 17:40:50,836 ViewManager.java:137 - Not submitting build tasks for views in keyspace system as storage service is not initialized
INFO [main] 2020-02-12 17:40:51,094 ApproximateTime.java:44 - Scheduling approximate time-check task with a precision of 10 milliseconds
ERROR [main] 2020-02-12 17:40:51,137 CassandraDaemon.java:759 - Cannot start node if snitch's data center (datacenter1) differs from previous data center (dc1). Please fix the snitch configuration, decommission and rebootstrap this node or use the flag -Dcassandra.ignore_dc=true.
Installation steps
sudo curl -OL https://www.apache.org/dist/cassandra/redhat/311x/cassandra-3.11.5-1.noarch.rpm
sudo rpm -i cassandra-3.11.5-1.noarch.rpm
sudo pip install cassandra-driver
export CQLSH_NO_BUNDLED=true
sudo chkconfig --level 3 cassandra on
The issue is in your log file:
ERROR [main] 2020-02-12 17:40:51,137 CassandraDaemon.java:759 - Cannot start node if snitch's data center (datacenter1) differs from previous data center (dc1). Please fix the snitch configuration, decommission and rebootstrap this node or use the flag -Dcassandra.ignore_dc=true.
It seems that you started the cluster, stopped it and renamed the datacenter from dc1 to datacenter1.
In order to fix:
If no data is stored, delete the data directories
If data is stored, rename the datacenter back to dc1 in the config
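A minimal sketch of the second option, assuming the node uses GossipingPropertyFileSnitch and the stock RPM paths (both are assumptions; adjust to your layout):
# /etc/cassandra/conf/cassandra-rackdc.properties
dc=dc1
rack=rack1
Alternatively, the error message itself offers an escape hatch: add -Dcassandra.ignore_dc=true to the JVM options (one flag per line in /etc/cassandra/conf/jvm.options on 3.11) and restart the service.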
I had the same problem, where the Cassandra service immediately stopped after it was started.
In the Cassandra configuration file located at /etc/cassandra/cassandra.yaml, change the cluster_name to the previous one, like this:
...
# The name of the cluster. This is mainly used to prevent machines in
# one logical cluster from joining another.
cluster_name: 'dc1'
# This defines the number of tokens randomly assigned to this node on the ring
# The more tokens, relative to other nodes, the larger the proportion of data
...
I followed this guide:
https://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Clusters_from_Scratch/
I stayed with the Active/Passive DRBD file system sharing. I had to reboot my cluster, and now I am getting the following error:
Current DC: rbx-1 (version 1.1.16-12.el7_4.4-94ff4df) - partition with quorum
Last updated: Tue Nov 28 17:01:14 2017
Last change: Tue Nov 28 16:40:09 2017 by root via cibadmin on rbx-1
2 nodes configured
5 resources configured
Node rbx-2: UNCLEAN (offline)
Online: [ rbx-1 ]
Full list of resources:
ClusterIP (ocf::heartbeat:IPaddr2): Started rbx-1
WebSite (ocf::heartbeat:apache): Stopped
Master/Slave Set: WebDataClone [WebData]
WebData (ocf::linbit:drbd): FAILED rbx-1 (blocked)
Stopped: [ rbx-2 ]
WebFS (ocf::heartbeat:Filesystem): Stopped
Failed Actions:
* WebData_stop_0 on rbx-1 'invalid parameter' (2): call=20, status=complete, exitreason='none',
last-rc-change='Tue Nov 28 16:27:58 2017', queued=0ms, exec=3ms
Daemon Status:
corosync: active/disabled
pacemaker: active/disabled
pcsd: active/enabled
Any ideas?
Also, does anyone have any recommended guides for submitting jobs?
This post is relatively old at this point, but I'll leave this here for others who stumble upon the same issue.
This problem has to do with the DRBD integration script that Pacemaker uses. If it's broken, missing, has incorrect permissions, etc., you can get an error like this. On CentOS 7 that script is located at /usr/lib/ocf/resource.d/linbit/drbd.
Note: this is specific to the guide mentioned by the OP, but it may help you:
Section 7.1 has a big "IMPORTANT" block that talks about replacing the Pacemaker integration script due to a bug. If you use the command it gives there, you actually replace the script with a 404 error page, which obviously doesn't work and causes this error. You can fix the issue by restoring the original script, either by reinstalling DRBD...
yum remove -y kmod-drbd84 drbd84-utils
yum install -y kmod-drbd84 drbd84-utils
...or by finding just the drbd script elsewhere and adding/replacing it at /usr/lib/ocf/resource.d/linbit/drbd. Make sure its permissions are correct and that it is set as executable.
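A quick way to check whether the script is intact, as a sketch (the path assumes the linbit provider directory on CentOS 7):
# should report a shell script, not an HTML document
file /usr/lib/ocf/resource.d/linbit/drbd
# the first line should be a shebang such as #!/bin/sh
head -n 1 /usr/lib/ocf/resource.d/linbit/drbd
# make it executable if it isn't
chmod 755 /usr/lib/ocf/resource.d/linbit/drbd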
Hope that helps!
I am trying to start the cluster on three clean CentOS machines.
I tried to keep this post short, so I am not attaching config files, because I used this guide and the config files follow it:
https://www.percona.com/doc/percona-xtradb-cluster/5.7/add-node.html#add-node
Starting the first node: OK.
Starting the second node: error.
Here is the log on the second node:
2017-09-28T15:05:09.367856Z 0 [Note] WSREP: Initiating SST/IST transfer on JOINER side (wsrep_sst_xtrabackup-v2 --role 'joiner' --address '192.168.14.104' --datadir '/var/lib/mysql/' --defaults-file '/etc/my.cnf' --defaults-group-suffix '' --parent '5490' '' )
2017-09-28T15:05:09.368984Z 0 [ERROR] WSREP: Failed to read 'ready ' from: wsrep_sst_xtrabackup-v2 --role 'joiner' --address '192.168.14.104' --datadir '/var/lib/mysql/' --defaults-file '/etc/my.cnf' --defaults-group-suffix '' --parent '5490' ''
Read: '(null)'
2017-09-28T15:05:09.369064Z 0 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup-v2 --role 'joiner' --address '192.168.14.104' --datadir '/var/lib/mysql/' --defaults-file '/etc/my.cnf' --defaults-group-suffix '' --parent '5490' '' : 2 (No such file or directory)
2017-09-28T15:05:09.370161Z 2 [ERROR] WSREP: Failed to prepare for 'xtrabackup-v2' SST. Unrecoverable.
2017-09-28T15:05:09.370192Z 2 [ERROR] Aborting
The second node's startup is failing because it is unable to perform an SST (full state transfer) from the donor node.
This is failing because xtrabackup-v2 is failing. You need to check the logs on the donor node to get a better idea as to why, but possible reasons include:
Insufficient memory on donor node
Syntax error in my.cnf on the donor node (xtrabackup is pickier about syntax than normal MySQL -- check for duplicate lines, which MySQL accepts but xtrabackup doesn't)
File permissions
xtrabackup installed incorrectly, not installed, or wrong version
mismatch in wsrep configuration between nodes
invalid credentials for wsrep authentication (see the sketch below for how the SST user is typically set up)
There are several reasons the SST could have failed, so you need to examine the logs on the first node too. It could be blocked ports, no SST user created, a wrong SST password, missing xtrabackup software, etc. It's impossible to tell from only what you provided.
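As a concrete example of the credentials item above, the donor needs an SST user that matches the wsrep_sst_auth setting in my.cnf. A minimal sketch, assuming the defaults from the Percona guide; "sstuser" and "s3cretPass" are placeholders:
# in my.cnf on every node, [mysqld] section:
# wsrep_sst_auth = "sstuser:s3cretPass"
# create the matching user on the donor node
mysql -u root -p -e "CREATE USER 'sstuser'@'localhost' IDENTIFIED BY 's3cretPass'; \
  GRANT RELOAD, LOCK TABLES, PROCESS, REPLICATION CLIENT ON *.* TO 'sstuser'@'localhost';"
It is also worth confirming that the cluster ports (4444 for SST, 4567 for group communication, 4568 for IST) are open between the nodes.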