How to delete the monitoring data of a cluster node that is offline? - tidb

I have deployed TiDB in our production environment. I want to know how to delete the monitoring data of a cluster node that is offline.

The offline node usually refers to a TiKV node. You can determine whether the offline process has finished using pd-ctl or the monitoring dashboard. After the node goes offline, perform the following steps:
Manually stop the relevant services on the offline node.
Delete the node_exporter target of the corresponding node from the Prometheus configuration file (see the sketch after these steps).
Delete the entry for the corresponding node from the Ansible inventory.ini.
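A minimal sketch of the Prometheus edit mentioned above, assuming the scrape configuration uses static_configs targets and that PyYAML is available; the file path and host address below are placeholders for your deployment's values:

import yaml  # PyYAML

PROM_CONFIG = "/path/to/prometheus.yml"   # placeholder; use your deployment's config location
OFFLINE_HOST = "172.16.10.5"              # placeholder address of the node that went offline

with open(PROM_CONFIG) as f:
    config = yaml.safe_load(f)

# Drop any static_configs target that points at the offline node
# (for example its node_exporter endpoint such as 172.16.10.5:9100).
for job in config.get("scrape_configs", []):
    for static in job.get("static_configs", []):
        static["targets"] = [
            t for t in static.get("targets", []) if not t.startswith(OFFLINE_HOST)
        ]

with open(PROM_CONFIG, "w") as f:
    yaml.safe_dump(config, f, default_flow_style=False)

# Reload or restart Prometheus afterwards so the change takes effect.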

Related

How to know if a machine in a Spark cluster 'participates' in a job

I wanted to know when it is safe to remove a machine (node) from a cluster.
My assumption is that it could be safe to remove a machine if the machine does not have any containers, and it does not store any useful data.
Using the APIs at https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html, we can do
GET http://<rm http address:port>/ws/v1/cluster/nodes
to get the information of each node like
<node>
<rack>/default-rack</rack>
<state>RUNNING</state>
<id>host1.domain.com:54158</id>
<nodeHostName>host1.domain.com</nodeHostName>
<nodeHTTPAddress>host1.domain.com:8042</nodeHTTPAddress>
<lastHealthUpdate>1476995346399</lastHealthUpdate>
<version>3.0.0-SNAPSHOT</version>
<healthReport></healthReport>
<numContainers>0</numContainers>
<usedMemoryMB>0</usedMemoryMB>
<availMemoryMB>8192</availMemoryMB>
<usedVirtualCores>0</usedVirtualCores>
<availableVirtualCores>8</availableVirtualCores>
<resourceUtilization>
<nodePhysicalMemoryMB>1027</nodePhysicalMemoryMB>
<nodeVirtualMemoryMB>1027</nodeVirtualMemoryMB>
<nodeCPUUsage>0.006664445623755455</nodeCPUUsage>
<aggregatedContainersPhysicalMemoryMB>0</aggregatedContainersPhysicalMemoryMB>
<aggregatedContainersVirtualMemoryMB>0</aggregatedContainersVirtualMemoryMB>
<containersCPUUsage>0.0</containersCPUUsage>
</resourceUtilization>
</node>
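For reference, a short sketch of how I pull this listing and flag candidate nodes; the ResourceManager address is a placeholder, and I am assuming the JSON wrapper shape ({"nodes": {"node": [...]}}) based on the same REST docs:

import json
import urllib.request

# Placeholder ResourceManager address; replace with your <rm http address:port>.
RM = "http://resourcemanager.example.com:8088"

req = urllib.request.Request(
    RM + "/ws/v1/cluster/nodes",
    headers={"Accept": "application/json"},  # ask for JSON instead of the XML shown above
)
with urllib.request.urlopen(req) as resp:
    data = json.load(resp)

# Assumption: the JSON form wraps the list as {"nodes": {"node": [...]}};
# adjust if your Hadoop version differs.
for node in data.get("nodes", {}).get("node", []):
    if node.get("state") == "RUNNING" and node.get("numContainers", 0) == 0:
        print("idle node:", node["nodeHostName"])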
If numContainers is 0, I assume it does not run containers. However, can it still store any data on disk that other downstream tasks can read?
I could not find whether Spark lets us know this. I assume that if a machine still stores some data useful for the running job, it may maintain a heartbeat with the Spark driver or some central controller? Can we check this by scanning TCP or UDP connections?
Is there any other way to check whether a machine in a Spark cluster participates in a job?
I am not sure whether you just want to know if a node is running any task (if that's what you mean by 'participate') or whether you want to know if it is safe to remove a node from the Spark cluster.
I will try to explain the latter point.
Spark has the ability to recover from failures, which also applies to any node being removed from the cluster.
The node removed can be an executor or an application master.
If an application master is removed, the entire job fails. But if you are using YARN as the resource manager, the job is retried and YARN provides a new application master. The number of retries is configured in:
yarn.resourcemanager.am.max-attempts
By default, this value is 2.
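If you want to confirm what your cluster actually uses, one way is to read it back from the running daemon. This is only a sketch, assuming the ResourceManager exposes the standard Hadoop /conf servlet on its web port and that the address below is replaced with yours:

import urllib.request
import xml.etree.ElementTree as ET

# Placeholder ResourceManager web address; /conf dumps the daemon's
# effective configuration as XML.
RM = "http://resourcemanager.example.com:8088"

with urllib.request.urlopen(RM + "/conf") as resp:
    root = ET.parse(resp).getroot()

for prop in root.findall("property"):
    if prop.findtext("name") == "yarn.resourcemanager.am.max-attempts":
        print("AM max attempts:", prop.findtext("value"))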
If a node on which a task is running is removed, the resource manager (YARN) will stop receiving heartbeats from that node. The application master will know it is supposed to reschedule the failed task, as it will no longer receive progress updates from the previous node. It will then request resources from the resource manager and reschedule the task.
As far as data on these nodes is concerned, you need to understand how tasks and their output are handled. Every node has its own local storage for the output of the tasks running on it. After a task completes successfully, the OutputCommitter moves its output from local storage to the job's shared storage (HDFS), from where the data is picked up for the next stage of the job.
When a task fails (maybe because the node running it failed or was removed), it is rerun on another available node.
In fact, the application master will also rerun the tasks that completed successfully on that node, as their output stored on the node's local storage will no longer be available.

Scale Cassandra with copying data manually

I created an AMI from my Cassandra machine and then launched a new instance. After making config changes (setting the seed node to the first one and setting auto_bootstrap: false), when I start Cassandra and run nodetool status, it shows data on both nodes. I just want to know if the cluster actually knows that both nodes have the data and, if a request comes in, can route it to the second node as well.
Without manually copying the data, streaming does not complete. It fails after a certain period of time, and then I have to run 'nodetool bootstrap resume' again to restart the bootstrapping process, which fails again.
I don't think it should work this way (the whole copying approach).
Why can't you perform normal bootstrapping? What are the error messages in the logs when you try? What is the RF (replication factor) of your keyspace?
In addition to your data, Cassandra also saves information about the node on disk in the system tables, for example the node ID, so you can't just replicate the image. If you copied the Cassandra image and only changed the config, this won't work; you should delete all the data before starting the node and joining it to the cluster.
EDIT:
If you are going with auto_bootstrap: off:
Remove all the data from the new server (both the data and commit log directories).
Start the node, and after it joins, run rebuild (see the sketch after these steps).
Run repair after the process is finished.
If you are going with auto_bootstrap: on:
Remove all the data from the new server (both the data and commit log directories).
Start the node and monitor the bootstrapping.
Before trying either of these, remove the node you couldn't add from the cluster.
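A minimal sketch of the auto_bootstrap: off path, driven from Python on the new node. It assumes nodetool is on the PATH, the data and commit log directories were already wiped, and the node has already joined; the source datacenter is omitted because it depends on your setup:

import subprocess

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Stream the data this node should own from the existing node(s).
# You can pass a source datacenter name, e.g. ["nodetool", "rebuild", "dc1"].
run(["nodetool", "rebuild"])

# Repair once the rebuild has finished so replicas are consistent.
run(["nodetool", "repair"])

# Check the result.
run(["nodetool", "status"])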

Cassandra 2+ HPC Deployment

I am trying to deploy Cassandra on a Linux-based HPC cluster and I need some guidelines if possible. Specifically, what is the difference between running Cassandra locally and in a cluster?
When running locally (in which case it runs smoothly), we duplicate the original files for every node inside our Cassandra directory and apply the appropriate changes for IP address, RPC, JMX, etc. However, when managing a network, which files do we need to install on each node: the whole package with all the files, or just some of the required ones,
such as bin/cassandra.in.sh, conf/cassandra.yaml, and bin/cassandra?
I am a little confused about what to store on each node separately so I can start working with the cluster.
You need to install Cassandra on each node (VM), i.e. the whole package, and then update the config files as necessary. As described here, to configure a cluster in a single data center you need to (a sketch of the per-node config edits follows this list):
Install Cassandra on each node
Configure cluster name
Configure seeds
Configure snitch, if needed
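A rough sketch of the per-node cassandra.yaml edits those steps imply, using PyYAML; the cluster name, addresses, and snitch below are placeholders, and the seed_provider indexing assumes the stock single-entry SimpleSeedProvider layout:

import yaml  # PyYAML

CLUSTER_NAME = "HPC Cluster"        # placeholder; must be identical on every node
SEEDS = "10.0.0.1,10.0.0.2"         # placeholder seed node addresses
NODE_ADDRESS = "10.0.0.3"           # placeholder; this node's own address

with open("conf/cassandra.yaml") as f:
    conf = yaml.safe_load(f)

conf["cluster_name"] = CLUSTER_NAME                      # same on every node
conf["listen_address"] = NODE_ADDRESS                    # differs per node
conf["rpc_address"] = NODE_ADDRESS                       # differs per node
conf["endpoint_snitch"] = "GossipingPropertyFileSnitch"  # only if you need to change the snitch
conf["seed_provider"][0]["parameters"][0]["seeds"] = SEEDS

with open("conf/cassandra.yaml", "w") as f:
    yaml.safe_dump(conf, f, default_flow_style=False)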

Icinga2 cluster node local checks not executing

I am using an Icinga2 2.3.2 cluster HA setup with three nodes in the same zone and the IDO database on a separate server. All are CentOS 6.5. IcingaWeb2 is installed on the active master.
I configured four local checks for each node, including the cluster health check, as described in the documentation. I installed the Icinga Classic UI on all three nodes because I am not able to see the local checks configured for the nodes in IcingaWeb2.
Configs are syncing, checks are executing, and all three nodes are connected to each other. But the configured local checks, specific to each node, are not executing properly; I verified this in the Classic UI.
a. All local checks are executed only once whenever
- one of the nodes is disconnected or reconnected
- configuration changes are made on the master and icinga2 is reloaded
b. But after that, only one check keeps running properly on one node and the remaining ones do not.
I have attached screenshots of the Classic UI on all nodes.
Please help me fix this. Thanks in advance.

Priam backup automatic restore

I have a Cassandra cluster managed by Priam, with 3 nodes. I use ephemeral disks to store my Cassandra data, so when I start a node, the Cassandra data dir is empty.
I have Priam properly configured and I can see backups being saved in Amazon S3. Suppose a node goes down and then I start another node. Will Priam know how to automatically restore the backup from S3 when the node comes up again? The Cassandra data dir will start empty, so I am assuming Priam would give the new node the same token as the old one and it would restore the data... Right?
Yes. I have been running standalone Cassandra on EC2, small Cassandra clusters on Mesos on EC2, and larger DataStax Enterprise clusters (with Cassandra) on EC2.
I have been using the Priam 3.x branch.
On restore, it calculates the initial_token, updates the cassandra.yaml file, restores the snapshot and incremental backup files, and restarts Cassandra.
According to Priam/Netflix conventions, if you have a 3-node cluster with Cassandra, your nodes should be named some_thing-other-things. Each node should be part of an Auto Scaling group called some_thing, and each node should also use a Security Group named some_thing.
Create a 3 node dev cluster and test your backups and restores with data that you can easily recreate, that you don't care about too much. Get used to managing the Auto-scaling groups and Priam. Then, try it on test clusters with data that you care about.
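Before relying on automatic restore, it is also worth sanity-checking that the backups really are landing in S3. A small sketch with boto3; the bucket name and prefix are placeholders, since the actual key layout comes from your Priam configuration:

import boto3

s3 = boto3.client("s3")
resp = s3.list_objects_v2(
    Bucket="my-priam-backups",   # placeholder bucket
    Prefix="backups/",           # placeholder; Priam's key layout depends on its configuration
    MaxKeys=20,
)
for obj in resp.get("Contents", []):
    print(obj["LastModified"], obj["Key"])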
