Monitoring Cassandra with the Prometheus monitoring tool

My Prometheus server is on a CentOS 7 machine and Cassandra is on CentOS 6. I am trying to monitor Cassandra's JMX port 7199 with Prometheus, but I keep getting errors with my YAML file. I am not sure why I am not able to connect to the CentOS 6 (Cassandra) machine. Is my YAML file wrong, or does it have something to do with JMX port 7199?
here is my YAML file:
# my global config
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: cassandra
    static_configs:
      - targets: ['10.1.0.22:7199']
Here is my prometheus log:
level=info ts=2017-12-08T04:30:53.92549611Z caller=main.go:215 msg="Starting Prometheus" version="(version=2.0.0, branch=HEAD, revision=0a74f98628a0463dddc90528220c94de5032d1a0)"
level=info ts=2017-12-08T04:30:53.925623847Z caller=main.go:216 build_context="(go=go1.9.2, user=root#615b82cb36b6, date=20171108-07:11:59)"
level=info ts=2017-12-08T04:30:53.92566228Z caller=main.go:217 host_details="(Linux 3.10.0-693.5.2.el7.x86_64 #1 SMP Fri Oct 20 20:32:50 UTC 2017 x86_64 localhost.localdomain (none))"
level=info ts=2017-12-08T04:30:53.932807536Z caller=web.go:380 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2017-12-08T04:30:53.93303681Z caller=targetmanager.go:71 component="target manager" msg="Starting target manager..."
level=info ts=2017-12-08T04:30:53.932905473Z caller=main.go:314 msg="Starting TSDB"
level=info ts=2017-12-08T04:30:53.987468942Z caller=main.go:326 msg="TSDB started"
level=info ts=2017-12-08T04:30:53.987582063Z caller=main.go:394 msg="Loading configuration file" filename=prometheus.yml
level=info ts=2017-12-08T04:30:53.988366778Z caller=main.go:371 msg="Server is ready to receive requests."
level=warn ts=2017-12-08T04:31:00.561007282Z caller=main.go:377 msg="Received SIGTERM, exiting gracefully..."
level=info ts=2017-12-08T04:31:00.563191668Z caller=main.go:384 msg="See you next time!"
level=info ts=2017-12-08T04:31:00.566231211Z caller=targetmanager.go:87 component="target manager" msg="Stopping target manager..."
level=info ts=2017-12-08T04:31:00.567070099Z caller=targetmanager.go:99 component="target manager" msg="Target manager stopped"
level=info ts=2017-12-08T04:31:00.567136027Z caller=manager.go:455 component="rule manager" msg="Stopping rule manager..."
level=info ts=2017-12-08T04:31:00.567162215Z caller=manager.go:461 component="rule manager" msg="Rule manager stopped"
level=info ts=2017-12-08T04:31:00.567186356Z caller=notifier.go:483 component=notifier msg="Stopping notification handler..."
If anyone has instructions on how to connect Prometheus to Cassandra when they are on two different machines, that would be helpful too.

This is not a problem with your config: Prometheus received a TERM signal and shut down gracefully.
If you are not getting metrics, check whether http://10.1.0.22:7199/metrics loads and returns metrics. You can also check the Prometheus server's /targets page for the scrape status of each target.
If your Cassandra server is not serving anything on a /metrics endpoint, it is most likely because the Prometheus exporter for Cassandra is not configured properly: Prometheus cannot scrape the JMX port directly, it needs an HTTP endpoint that exposes metrics in the Prometheus text format.
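For reference, here is a minimal sketch of the usual setup with the Prometheus JMX exporter running as a Java agent inside Cassandra. The jar location, the config path, and the exporter port 7070 are placeholders, not values taken from the question:

On the Cassandra host (CentOS 6), add the agent to cassandra-env.sh and restart Cassandra:
  JVM_OPTS="$JVM_OPTS -javaagent:/opt/jmx_exporter/jmx_prometheus_javaagent.jar=7070:/opt/jmx_exporter/cassandra.yml"

On the Prometheus host (CentOS 7), scrape the exporter's HTTP port instead of the raw JMX port:
  scrape_configs:
    - job_name: cassandra
      static_configs:
        - targets: ['10.1.0.22:7070']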

Related

Grafana UI not showing up (via docker-compose)

I've been facing some issues with Grafana and Docker. I'm trying to get Grafana and InfluxDB up for monitoring my MISP instance.
The project which I'm following is: https://github.com/MISP/misp-grafana
The InfluxDB container is up and receiving logs via Telegraf, so that part seems to be working fine. But the Grafana login page does not show up via https://ip:3000/login. From inside the machine where MISP is running (the Docker containers also run on this machine) I can curl the address and receive the HTML of Grafana's login page.
Could someone help me? I have already tried lots of suggestions, but none of them work as expected.
I've already tried disabling iptables and firewalld (since I'm using CentOS 7), and nothing helped.
My docker-compose file is:
services:
  influxdb:
    image: influxdb:latest
    container_name: influxdb
    volumes:
      - influxdb-storage:/var/lib/influxdb2:rw
      # - ./influxdb/ssl/influxdb-selfsigned.crt:/etc/ssl/influxdb-selfsigned.crt:rw
      # - ./influxdb/ssl/influxdb-selfsigned.key:/etc/ssl/influxdb-selfsigned.key:rw
    ports:
      - "8086:8086"
    environment:
      - DOCKER_INFLUXDB_INIT_MODE=${DOCKER_INFLUXDB_INIT_MODE}
      - DOCKER_INFLUXDB_INIT_USERNAME=${DOCKER_INFLUXDB_INIT_USERNAME}
      - DOCKER_INFLUXDB_INIT_PASSWORD=${DOCKER_INFLUXDB_INIT_PASSWORD}
      - DOCKER_INFLUXDB_INIT_ORG=${DOCKER_INFLUXDB_INIT_ORG}
      - DOCKER_INFLUXDB_INIT_BUCKET=${DOCKER_INFLUXDB_INIT_BUCKET}
      - DOCKER_INFLUXDB_INIT_ADMIN_TOKEN=${DOCKER_INFLUXDB_INIT_ADMIN_TOKEN}
      # - INFLUXD_TLS_CERT=/etc/ssl/influxdb-selfsigned.crt
      # - INFLUXD_TLS_KEY=/etc/ssl/influxdb-selfsigned.key
  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    volumes:
      - grafana-storage:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    depends_on:
      - influxdb
    environment:
      - GF_SECURITY_ADMIN_USER=${DOCKER_GRAFANA_USERNAME}
      - GF_SECURITY_ADMIN_PASSWORD=${DOCKER_GRAFANA_PASSWORD}
    network_mode: host

volumes:
  influxdb-storage:
  grafana-storage:
The logs from Grafana container are:
GF_PATHS_CONFIG='/etc/grafana/grafana.ini' is not readable.
GF_PATHS_DATA='/var/lib/grafana' is not writable.
GF_PATHS_HOME='/usr/share/grafana' is not readable.
You may have issues with file permissions, more information here: http://docs.grafana.org/installation/docker/#migrate-to-v51-or-later
logger=settings t=2023-01-18T11:23:18.545281441Z level=info msg="Starting Grafana" version=9.3.2 commit=21c1d14e91 branch=HEAD compiled=2022-12-14T10:40:18Z
logger=settings t=2023-01-18T11:23:18.545574106Z level=info msg="Config loaded from" file=/usr/share/grafana/conf/defaults.ini
logger=settings t=2023-01-18T11:23:18.545595127Z level=info msg="Config loaded from" file=/etc/grafana/grafana.ini
logger=settings t=2023-01-18T11:23:18.545601849Z level=info msg="Config overridden from command line" arg="default.paths.data=/var/lib/grafana"
logger=settings t=2023-01-18T11:23:18.545610343Z level=info msg="Config overridden from command line" arg="default.paths.logs=/var/log/grafana"
logger=settings t=2023-01-18T11:23:18.54561656Z level=info msg="Config overridden from command line" arg="default.paths.plugins=/var/lib/grafana/plugins"
logger=settings t=2023-01-18T11:23:18.545623137Z level=info msg="Config overridden from command line" arg="default.paths.provisioning=/etc/grafana/provisioning"
logger=settings t=2023-01-18T11:23:18.5456313Z level=info msg="Config overridden from command line" arg="default.log.mode=console"
logger=settings t=2023-01-18T11:23:18.545637996Z level=info msg="Config overridden from Environment variable" var="GF_PATHS_DATA=/var/lib/grafana"
logger=settings t=2023-01-18T11:23:18.545648448Z level=info msg="Config overridden from Environment variable" var="GF_PATHS_LOGS=/var/log/grafana"
logger=settings t=2023-01-18T11:23:18.545654176Z level=info msg="Config overridden from Environment variable" var="GF_PATHS_PLUGINS=/var/lib/grafana/plugins"
logger=settings t=2023-01-18T11:23:18.545663184Z level=info msg="Config overridden from Environment variable" var="GF_PATHS_PROVISIONING=/etc/grafana/provisioning"
logger=settings t=2023-01-18T11:23:18.545668879Z level=info msg="Config overridden from Environment variable" var="GF_SECURITY_ADMIN_USER=tsec"
logger=settings t=2023-01-18T11:23:18.545682275Z level=info msg="Config overridden from Environment variable" var="GF_SECURITY_ADMIN_PASSWORD=*********"
logger=settings t=2023-01-18T11:23:18.545689113Z level=info msg="Path Home" path=/usr/share/grafana
logger=settings t=2023-01-18T11:23:18.545699682Z level=info msg="Path Data" path=/var/lib/grafana
logger=settings t=2023-01-18T11:23:18.545705402Z level=info msg="Path Logs" path=/var/log/grafana
logger=settings t=2023-01-18T11:23:18.545710714Z level=info msg="Path Plugins" path=/var/lib/grafana/plugins
logger=settings t=2023-01-18T11:23:18.545732177Z level=info msg="Path Provisioning" path=/etc/grafana/provisioning
logger=settings t=2023-01-18T11:23:18.5457395Z level=info msg="App mode production"
logger=sqlstore t=2023-01-18T11:23:18.545859098Z level=info msg="Connecting to DB" dbtype=sqlite3
logger=migrator t=2023-01-18T11:23:18.575806909Z level=info msg="Starting DB migrations"
logger=migrator t=2023-01-18T11:23:18.584646143Z level=info msg="migrations completed" performed=0 skipped=464 duration=1.036135ms
logger=plugin.loader t=2023-01-18T11:23:18.694560017Z level=info msg="Plugin registered" pluginID=input
logger=secrets t=2023-01-18T11:23:18.695056176Z level=info msg="Envelope encryption state" enabled=true currentprovider=secretKey.v1
logger=query_data t=2023-01-18T11:23:18.698004003Z level=info msg="Query Service initialization"
logger=live.push_http t=2023-01-18T11:23:18.709944098Z level=info msg="Live Push Gateway initialization"
logger=infra.usagestats.collector t=2023-01-18T11:23:19.076511711Z level=info msg="registering usage stat providers" usageStatsProvidersLen=2
logger=provisioning.plugins t=2023-01-18T11:23:19.133661231Z level=error msg="Failed to read plugin provisioning files from directory" path=/etc/grafana/provisioning/plugins error="open /etc/grafana/provisioning/plugins: no such file or directory"
logger=provisioning.notifiers t=2023-01-18T11:23:19.133823449Z level=error msg="Can't read alert notification provisioning files from directory" path=/etc/grafana/provisioning/notifiers error="open /etc/grafana/provisioning/notifiers: no such file or directory"
logger=provisioning.alerting t=2023-01-18T11:23:19.133926705Z level=error msg="can't read alerting provisioning files from directory" path=/etc/grafana/provisioning/alerting error="open /etc/grafana/provisioning/alerting: no such file or directory"
logger=provisioning.alerting t=2023-01-18T11:23:19.133951102Z level=info msg="starting to provision alerting"
logger=provisioning.alerting t=2023-01-18T11:23:19.133992848Z level=info msg="finished to provision alerting"
logger=ngalert.state.manager t=2023-01-18T11:23:19.134747843Z level=info msg="Warming state cache for startup"
logger=grafanaStorageLogger t=2023-01-18T11:23:19.140618617Z level=info msg="storage starting"
logger=http.server t=2023-01-18T11:23:19.140693638Z level=info msg="HTTP Server Listen" address=[::]:3000 protocol=http subUrl= socket=
logger=ngalert.state.manager t=2023-01-18T11:23:19.16651119Z level=info msg="State cache has been initialized" states=0 duration=31.757492ms
logger=ticker t=2023-01-18T11:23:19.166633607Z level=info msg=starting first_tick=2023-01-18T11:23:20Z
logger=ngalert.multiorg.alertmanager t=2023-01-18T11:23:19.166666209Z level=info msg="starting MultiOrg Alertmanager"
logger=context userId=0 orgId=0 uname= t=2023-01-18T11:23:27.6249068Z level=info msg="Request Completed" method=GET path=/ status=302 remote_addr=1.134.26.244 time_ms=0 duration=610.399µs size=29 referer= handler=/
logger=cleanup t=2023-01-18T11:33:19.14617771Z level=info msg="Completed cleanup jobs" duration=7.595219ms
The curl http://ip:3000/login command from inside the CentOS 7 machine returns the following (just a part of it, since it's big):
<!doctype html><html lang="en"><head><meta charset="utf-8"/><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"/><meta name="viewport" content="width=device-width"/><meta name="theme-color" content="#000"/><title>Grafana</title><base href="/"/><link rel="preload" href="public/fonts/roboto/RxZJdnzeo3R5zSexge8UUVtXRa8TVwTICgirnJhmVJw.woff2" as="font" crossorigin/><link rel="icon" type="image/png" href="public/img/fav32.png"/><link rel="apple-touch-icon" sizes="180x180" href="public/img/apple-touch-icon.png"/><link rel="mask-icon" href="public/img/grafana_mask_icon.svg" color="#F05A28"/><link rel="stylesheet" href="public/build/grafana.dark.960bbecc684cac29c4a2.css"/><script nonce="">performance.mark('frontend_boot_css_time_seconds');</script><meta name="apple-mobile-web-app-capable" content="yes"/><meta name="apple-mobile-web-app-status-bar-style" content="black"/><meta name="msapplication-TileColor" content="#2b5797"/><meta name="msapplication-config" content="public/img/browserconfig.xml"/></head><body class="theme-dark app-grafana"><style>.preloader {
height: 100%;
flex-direction: column;
display: flex;
justify-content: center;
align-items: center;
}
...
The telnet ip 3000 command from another machine gives me an error.
The output of `netstat -naput | grep LISTEN` on the CentOS 7 machine:
tcp6 0 0 :::8086 :::* LISTEN 30124/docker-proxy-
tcp6 0 0 :::3000 :::* LISTEN 30241/grafana-serve
I've already tried changing the 3000 port to another one (to avoid firewall blocks), but it did not work.
Help me please.
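One thing worth checking before disabling the firewall entirely is whether firewalld on the CentOS 7 host actually allows port 3000. A minimal sketch, assuming the interface sits in the default public zone:

  # open Grafana's port and persist it across reloads
  sudo firewall-cmd --zone=public --add-port=3000/tcp --permanent
  sudo firewall-cmd --reload
  # verify the port is now listed
  sudo firewall-cmd --zone=public --list-ports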

Prometheus Execution Timeout Exceeded

I use Grafana to monitor my company's infrastructure. Everything worked fine until this week, when I started to see alerts in Grafana with an error message:
request handler error: Post "http://prometheus-ip:9090/api/v1/query_range": dial tcp prometheus-ip:9090: i/o timeout
I tried to restart the Prometheus server, but it seems that it can't be stopped; I have to kill -9 the process and restart it. Here's the log:
Jun 16 01:04:01 prometheus prometheus[18869]: time="2022-06-16T01:04:01+02:00" level=info msg="All requests for rebuilding the label indexes queued. (Actual processing may lag behind.)" source="crashrecovery.go:529"
Jun 16 01:04:01 prometheus prometheus[18869]: time="2022-06-16T01:04:01+02:00" level=info msg="Checkpointing fingerprint mappings..." source="persistence.go:1480"
Jun 16 01:04:02 prometheus prometheus[18869]: time="2022-06-16T01:04:02+02:00" level=info msg="Done checkpointing fingerprint mappings in 286.224481ms." source="persistence.go:1503"
Jun 16 01:04:02 prometheus prometheus[18869]: time="2022-06-16T01:04:02+02:00" level=warning msg="Crash recovery complete." source="crashrecovery.go:152"
Jun 16 01:04:02 prometheus prometheus[18869]: time="2022-06-16T01:04:02+02:00" level=info msg="362306 series loaded." source="storage.go:378"
Jun 16 01:04:02 prometheus prometheus[18869]: time="2022-06-16T01:04:02+02:00" level=info msg="Starting target manager..." source="targetmanager.go:61"
Jun 16 01:04:02 prometheus prometheus[18869]: time="2022-06-16T01:04:02+02:00" level=info msg="Listening on :9090" source="web.go:235"
Jun 16 01:04:15 prometheus prometheus[18869]: time="2022-06-16T01:04:15+02:00" level=warning msg="Storage has entered rushed mode." chunksToPersist=420483 maxChunksToPersist=524288 maxMemoryChunks=1048576 memoryChunks=655877 source="storage.go:1660" urgencyScore=0.8020076751708984
Jun 16 01:09:02 prometheus prometheus[18869]: time="2022-06-16T01:09:02+02:00" level=info msg="Checkpointing in-memory metrics and chunks..." source="persistence.go:612"
Jun 16 01:10:05 prometheus prometheus[18869]: time="2022-06-16T01:10:05+02:00" level=info msg="Done checkpointing in-memory metrics and chunks in 1m3.127365726s." source="persistence.go:639"
Jun 16 01:12:25 prometheus prometheus[18869]: time="2022-06-16T01:12:25+02:00" level=warning msg="Received SIGTERM, exiting gracefully..." source="main.go:230"
Jun 16 01:12:25 prometheus prometheus[18869]: time="2022-06-16T01:12:25+02:00" level=info msg="See you next time!" source="main.go:237"
Jun 16 01:12:25 prometheus prometheus[18869]: time="2022-06-16T01:12:25+02:00" level=info msg="Stopping target manager..." source="targetmanager.go:75"
Jun 16 01:12:28 prometheus prometheus[18869]: time="2022-06-16T01:12:28+02:00" level=info msg="Stopping rule manager..." source="manager.go:374"
Jun 16 01:12:28 prometheus prometheus[18869]: time="2022-06-16T01:12:28+02:00" level=info msg="Rule manager stopped." source="manager.go:380"
Jun 16 01:12:28 prometheus prometheus[18869]: time="2022-06-16T01:12:28+02:00" level=info msg="Stopping notification handler..." source="notifier.go:369"
Jun 16 01:12:28 prometheus prometheus[18869]: time="2022-06-16T01:12:28+02:00" level=info msg="Stopping local storage..." source="storage.go:396"
Jun 16 01:12:28 prometheus prometheus[18869]: time="2022-06-16T01:12:28+02:00" level=info msg="Stopping maintenance loop..." source="storage.go:398"
Jun 16 01:12:28 prometheus prometheus[18869]: time="2022-06-16T01:12:28+02:00" level=info msg="Maintenance loop stopped." source="storage.go:1259"
Jun 16 01:12:28 prometheus prometheus[18869]: time="2022-06-16T01:12:28+02:00" level=info msg="Stopping series quarantining..." source="storage.go:402"
Jun 16 01:12:28 prometheus prometheus[18869]: time="2022-06-16T01:12:28+02:00" level=info msg="Series quarantining stopped." source="storage.go:1701"
Jun 16 01:12:28 prometheus prometheus[18869]: time="2022-06-16T01:12:28+02:00" level=info msg="Stopping chunk eviction..." source="storage.go:406"
Jun 16 01:12:28 prometheus prometheus[18869]: time="2022-06-16T01:12:28+02:00" level=info msg="Chunk eviction stopped." source="storage.go:1079"
Jun 16 01:12:28 prometheus prometheus[18869]: time="2022-06-16T01:12:28+02:00" level=info msg="Checkpointing in-memory metrics and chunks..." source="persistence.go:612"
Jun 16 01:12:44 prometheus prometheus[18869]: time="2022-06-16T01:12:44+02:00" level=info msg="Done checkpointing in-memory metrics and chunks in 16.170119611s." source="persistence.go:639"
Jun 16 01:12:44 prometheus prometheus[18869]: time="2022-06-16T01:12:44+02:00" level=info msg="Checkpointing fingerprint mappings..." source="persistence.go:1480"
Jun 16 01:12:45 prometheus prometheus[18869]: time="2022-06-16T01:12:45+02:00" level=info msg="Done checkpointing fingerprint mappings in 651.409422ms." source="persistence.go:1503"
Jun 16 01:12:45 prometheus systemd[1]: prometheus.service: State 'stop-sigterm' timed out. Skipping SIGKILL.
Jun 16 01:13:06 prometheus systemd[1]: prometheus.service: State 'stop-final-sigterm' timed out. Skipping SIGKILL. Entering failed mode.
Jun 16 01:13:06 prometheus systemd[1]: prometheus.service: Unit entered failed state.
Jun 16 01:13:06 prometheus systemd[1]: prometheus.service: Failed with result 'timeout'.
Jun 16 01:13:24 prometheus prometheus[20547]: time="2022-06-16T01:13:24+02:00" level=info msg="Starting prometheus (version=1.5.2+ds, branch=debian/sid, revision=1.5.2+ds-2+b3)" source="main.go:75"
Jun 16 01:13:24 prometheus prometheus[20547]: time="2022-06-16T01:13:24+02:00" level=info msg="Build context (go=go1.7.4, user=pkg-go-maintainers#lists.alioth.debian.org, date=20170521-14:39:14)" source="main.go:76"
Jun 16 01:13:24 prometheus prometheus[20547]: time="2022-06-16T01:13:24+02:00" level=info msg="Loading configuration file /etc/prometheus/prometheus.yml" source="main.go:248"
Jun 16 01:13:24 prometheus prometheus[20547]: time="2022-06-16T01:13:24+02:00" level=error msg="Could not lock /path/to/prometheus/metrics/DIRTY, Prometheus already running?" source="persistence.go:198"
Jun 16 01:13:24 prometheus prometheus[20547]: time="2022-06-16T01:13:24+02:00" level=error msg="Error opening memory series storage: resource temporarily unavailable" source="main.go:182"
Jun 16 01:13:24 prometheus systemd[1]: prometheus.service: Main process exited, code=exited, status=1/FAILURE
Jun 16 01:13:44 prometheus systemd[1]: prometheus.service: State 'stop-sigterm' timed out. Skipping SIGKILL.
Jun 16 01:14:02 prometheus prometheus[18869]: time="2022-06-16T01:14:02+02:00" level=info msg="Local storage stopped." source="storage.go:421"
Jun 16 01:14:02 prometheus systemd[1]: prometheus.service: Unit entered failed state.
Jun 16 01:14:02 prometheus systemd[1]: prometheus.service: Failed with result 'exit-code'.
Jun 16 01:14:03 prometheus systemd[1]: prometheus.service: Service hold-off time over, scheduling restart.
Jun 16 01:14:03 prometheus prometheus[20564]: time="2022-06-16T01:14:03+02:00" level=info msg="Starting prometheus (version=1.5.2+ds, branch=debian/sid, revision=1.5.2+ds-2+b3)" source="main.go:75"
Jun 16 01:14:03 prometheus prometheus[20564]: time="2022-06-16T01:14:03+02:00" level=info msg="Build context (go=go1.7.4, user=pkg-go-maintainers#lists.alioth.debian.org, date=20170521-14:39:14)" source="main.go:76"
Jun 16 01:14:03 prometheus prometheus[20564]: time="2022-06-16T01:14:03+02:00" level=info msg="Loading configuration file /etc/prometheus/prometheus.yml" source="main.go:248"
Jun 16 01:14:04 prometheus prometheus[20564]: time="2022-06-16T01:14:04+02:00" level=info msg="Loading series map and head chunks..." source="storage.go:373"
Jun 16 01:14:08 prometheus prometheus[20564]: time="2022-06-16T01:14:08+02:00" level=info msg="364314 series loaded." source="storage.go:378"
Jun 16 01:14:08 prometheus prometheus[20564]: time="2022-06-16T01:14:08+02:00" level=info msg="Starting target manager..." source="targetmanager.go:61"
Jun 16 01:14:08 prometheus prometheus[20564]: time="2022-06-16T01:14:08+02:00" level=info msg="Listening on :9090" source="web.go:235"
Jun 16 01:14:08 prometheus prometheus[20564]: time="2022-06-16T01:14:08+02:00" level=warning msg="Storage has entered rushed mode." chunksToPersist=448681 maxChunksToPersist=524288 maxMemoryChunks=1048576 memoryChunks=687476 source="storage.go:1660" urgencyScore=0.8557910919189453
When restarted like this, Prometheus enters recovery mode, which takes 1 h 30 min to complete. When it's done, the logs show the following:
Jun 16 16:10:42 prometheus prometheus[32708]: time="2022-06-16T16:10:42+02:00" level=info msg="Storage does not need throttling anymore." chunksToPersist=524288 maxChunksToPersist=524288 maxToleratedMemChunks=1153433 memoryChunks=1049320 source="storage.go:935"
Jun 16 16:10:42 prometheus prometheus[32708]: time="2022-06-16T16:10:42+02:00" level=error msg="Storage needs throttling. Scrapes and rule evaluations will be skipped." chunksToPersist=525451 maxChunksToPersist=524288 maxToleratedMemChunks=1153433 memoryChunks=1050483 source="storage.go:927"
Jun 16 16:15:31 prometheus prometheus[32708]: time="2022-06-16T16:15:31+02:00" level=info msg="Checkpointing in-memory metrics and chunks..." source="persistence.go:612"
Jun 16 16:16:28 prometheus prometheus[32708]: time="2022-06-16T16:16:28+02:00" level=info msg="Done checkpointing in-memory metrics and chunks in 57.204367083s." source="persistence.go:639"
The checkpointing repeats often and each run takes about a minute.
The monitoring graphs for this server were attached as a screenshot (not reproduced here).
Here are the flags used:
/usr/bin/prometheus --storage.local.path /path/to/prometheus/metrics --storage.local.retention=1460h0m0s --storage.local.series-file-shrink-ratio=0.3
Prometheus version:
prometheus --version
prometheus, version 1.5.2+ds (branch: debian/sid, revision: 1.5.2+ds-2+b3)
build user: pkg-go-maintainers#lists.alioth.debian.org
build date: 20170521-14:39:14
go version: go1.7.4
I decided to move some scrape targets to another server, so this one is not as loaded as before. However, it still has to scrape metrics from 50+ other servers. What could be the cause of this?
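Not a definitive diagnosis, but the "rushed mode" and "Storage needs throttling" messages mean the 1.x local storage cannot persist chunks as fast as they are ingested. A sketch of the flags that are usually raised in that situation; the doubled values below are placeholders and would need to be sized against the machine's available RAM (each in-memory chunk holds 1024 bytes of sample data plus bookkeeping overhead):

  /usr/bin/prometheus --storage.local.path /path/to/prometheus/metrics \
    --storage.local.retention=1460h0m0s \
    --storage.local.series-file-shrink-ratio=0.3 \
    --storage.local.memory-chunks=2097152 \
    --storage.local.max-chunks-to-persist=1048576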

Prometheus not getting metrics in snapshots

I want to export data from one Prometheus instance and import it into another one.
I found three posts, and all of them said to just put the snapshots into the new instance under storage.tsdb.path/snapshots and it would work.
However, I can't reproduce that. Prometheus never seems to acknowledge my copied snapshots in its logs (see the last section), I don't know why, and I found nothing related to this issue.
Please let me know what I am missing. It seems like an easy thing, but I can't figure it out.
I also asked this question in the Prometheus Users group, but no one has replied to it yet.
Reference posts
https://www.robustperception.io/taking-snapshots-of-prometheus-data
This post says pointing storage.tsdb.path at the snapshot directory would work; I tried it, but it didn't work.
https://devopstales.github.io/home/backup-and-retore-prometheus/
https://suraj.io/post/how-to-backup-and-restore-prometheus/
What did I do
I use Docker and did the following steps:
1. Run the Prometheus container with --web.enable-admin-api.
2. Take a snapshot via the API: curl -XPOST http://localhost:9090/api/v1/admin/tsdb/snapshot
3. Use docker cp to copy the snapshots out and check the directory size, which is ~250M:
$ du -sh snapshots
250M snapshots
I tried 2 ways to import the data:
1. Copy the snapshots into another Prometheus container.
2. Copy the snapshots to /data/snapshots and mount /data as the Prometheus storage.tsdb.path.
Note: /data is empty except for the snapshots directory.
My tests were done on 2021/4/22 and my data is from around 2021/4/16. All of the containers' retention time is the default 15 days.
What I expect to see
I use Grafana's "explore" function to check the metric "up" and try to see data on 4/16, but nothing shows.
What did you see instead? Under which circumstances?
Containers from both way 1 and way 2 only show metrics from when they started on 2021/4/22; there are no metrics on 2021/4/16.
Environment
System information:
My host is a vmware player 16 virtual machine running Ubuntu 18.04
Linux 5.4.0-70-generic x86_64
Prometheus version:
I'm using the container version of Prometheus.
prometheus, version 2.25.2 (branch: HEAD, revision: bda05a23ada314a0b9806a362da39b7a1a4e04c3)
build user: root#de38ec01ef10
build date: 20210316-18:07:52
go version: go1.15.10
platform: linux/amd64
Prometheus configuration file:
global:
  scrape_interval: 10s
  scrape_timeout: 5s
  evaluation_interval: 15s
  external_labels:
    monitor: 'monitor'

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

rule_files:
  - "/prom_setup/alert.rules"
  # - "second.rules"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets:
          - 192.168.41.164:9090
  - job_name: 'host_A'
    static_configs:
      - targets: ['192.168.41.164:9100']
  - job_name: 'container_A'
    static_configs:
      - targets: ['cadvisor:8080']
docker-compose file:
prometheus:
  container_name: promethues
  image: prom/prometheus
  privileged: true
  volumes:
    - ./prometheus.yml:/etc/prometheus/prometheus.yml
    # try to mount only snapshots to prometheus
    - ./prom_data/:/prometheus/data/
  command:
    - '--config.file=/etc/prometheus/prometheus.yml'
    - '--web.enable-lifecycle'
    # to enable snapshot
    - '--web.enable-admin-api'
File permissions on host
data directory
drwxrwxrwx 5 lou lou 4096 Apr 22 08:28 prom_data/
snapshots directory
drwxrwxrwx 4 lou lou 4096 Apr 22 08:28 snapshots/
$ du -sh snapshots
250M snapshots
Logs:
promethues | level=info ts=2021-04-22T00:28:20.992Z caller=main.go:366 msg="No time or size retention was set so using the default time retention" duration=15d
promethues | level=info ts=2021-04-22T00:28:20.993Z caller=main.go:404 msg="Starting Prometheus" version="(version=2.25.2, branch=HEAD, revision=bda05a23ada314a0b9806a362da39b7a1a4e04c3)"
promethues | level=info ts=2021-04-22T00:28:20.993Z caller=main.go:409 build_context="(go=go1.15.10, user=root#de38ec01ef10, date=20210316-18:07:52)"
promethues | level=info ts=2021-04-22T00:28:20.993Z caller=main.go:410 host_details="(Linux 5.4.0-70-generic #78~18.04.1-Ubuntu SMP Sat Mar 20 14:10:07 UTC 2021 x86_64 8fa848a981f9 (none))"
promethues | level=info ts=2021-04-22T00:28:20.993Z caller=main.go:411 fd_limits="(soft=1048576, hard=1048576)"
promethues | level=info ts=2021-04-22T00:28:20.993Z caller=main.go:412 vm_limits="(soft=unlimited, hard=unlimited)"
promethues | level=info ts=2021-04-22T00:28:20.998Z caller=web.go:532 component=web msg="Start listening for connections" address=0.0.0.0:9090
promethues | level=info ts=2021-04-22T00:28:21.003Z caller=main.go:779 msg="Starting TSDB ..."
promethues | level=info ts=2021-04-22T00:28:21.008Z caller=head.go:668 component=tsdb msg="Replaying on-disk memory mappable chunks if any"
promethues | level=info ts=2021-04-22T00:28:21.008Z caller=head.go:682 component=tsdb msg="On-disk memory mappable chunks replay completed" duration=4.448µs
promethues | level=info ts=2021-04-22T00:28:21.008Z caller=head.go:688 component=tsdb msg="Replaying WAL, this may take a while"
promethues | level=info ts=2021-04-22T00:28:21.008Z caller=head.go:740 component=tsdb msg="WAL segment loaded" segment=0 maxSegment=0
promethues | level=info ts=2021-04-22T00:28:21.008Z caller=head.go:745 component=tsdb msg="WAL replay completed" checkpoint_replay_duration=24.518µs wal_replay_duration=292.774µs total_replay_duration=360.306µs
promethues | level=info ts=2021-04-22T00:28:21.009Z caller=tls_config.go:191 component=web msg="TLS is disabled." http2=false
promethues | level=info ts=2021-04-22T00:28:21.009Z caller=main.go:799 fs_type=EXT4_SUPER_MAGIC
promethues | level=info ts=2021-04-22T00:28:21.010Z caller=main.go:802 msg="TSDB started"
promethues | level=info ts=2021-04-22T00:28:21.010Z caller=main.go:928 msg="Loading configuration file" filename=/etc/prometheus/prometheus.yml
promethues | level=info ts=2021-04-22T00:28:21.011Z caller=main.go:959 msg="Completed loading of configuration file" filename=/etc/prometheus/prometheus.yml totalDuration=1.487753ms remote_storage=2.162µs web_handler=883ns query_engine=1.467µs scrape=420.502µs scrape_sd=113.37µs notify=17.922µs notify_sd=28.211µs rules=550.056µs
promethues | level=info ts=2021-04-22T00:28:21.011Z caller=main.go:751 msg="Server is ready to receive web requests."
Solved. I had the same issue; it was my mistake, and it's true that the documentation is not 100% clear either.
I had the snapshot stored in a directory like {DATA}\{XXXX-XXXX}\{YYYY}. My mistake was that I was copying the content of directory {XXXX-XXXX}\{YYYY}, when I should have copied the content of directory {XXXX-XXXX}. Once I did that, it worked.
If you check the storage documentation (https://prometheus.io/docs/prometheus/latest/storage/) you will understand.
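In other words, the snapshot directory itself already contains the TSDB block directories, and those block directories must end up directly under the target instance's storage.tsdb.path. A minimal sketch of the round trip, where the container names, the snapshot name, and the target path are placeholders:

  # take the snapshot on the source instance (requires --web.enable-admin-api)
  curl -XPOST http://localhost:9090/api/v1/admin/tsdb/snapshot
  # the response contains the snapshot name, e.g. {"status":"success","data":{"name":"20210416T000000Z-XXXX"}}

  # copy the snapshot directory out of the source container
  docker cp source-prometheus:/prometheus/snapshots/20210416T000000Z-XXXX ./restore

  # restore: copy the block directories straight into the target data dir,
  # not into a snapshots/ subdirectory, then restart the target
  cp -r ./restore/* /path/to/target/tsdb-data/
  docker restart target-prometheus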

GitLab keeps loading and finally fails when deploying a dockerized node.js app

GitLab Job Log
Running with gitlab-runner 13.2.0-rc2 (45f2b4ec)
  on docker-auto-scale fa6cab46
section_start:1595233272:prepare_executor
Preparing the "docker+machine" executor
Using Docker executor with image gitlab/dind:latest ...
Starting service docker:dind ...
Pulling docker image docker:dind ...
Using docker image sha256:d5d139be840a6ffa04348fc87740e8c095cade6e9cb977785fdba51e5fd7ffec for docker:dind ...
Waiting for services to be up and running...
*** WARNING: Service runner-fa6cab46-project-18378289-concurrent-0-31a688551619da9f-docker-0 probably didn't start properly.
Health check error:
service "runner-fa6cab46-project-18378289-concurrent-0-31a688551619da9f-docker-0-wait-for-service" timeout
Health check container logs:
Service container logs:
2020-07-20T08:21:19.734721788Z time="2020-07-20T08:21:19.734543379Z" level=info msg="Starting up"
2020-07-20T08:21:19.742928068Z time="2020-07-20T08:21:19.742802844Z" level=warning msg="could not change group /var/run/docker.sock to docker: group docker not found"
2020-07-20T08:21:19.743943014Z time="2020-07-20T08:21:19.743853574Z" level=warning msg="[!] DON'T BIND ON ANY IP ADDRESS WITHOUT setting --tlsverify IF YOU DON'T KNOW WHAT YOU'RE DOING [!]"
2020-07-20T08:21:19.764021012Z time="2020-07-20T08:21:19.763898078Z" level=info msg="libcontainerd: started new containerd process" pid=23
2020-07-20T08:21:19.764159337Z time="2020-07-20T08:21:19.764107864Z" level=info msg="parsed scheme: \"unix\"" module=grpc
2020-07-20T08:21:19.764207629Z time="2020-07-20T08:21:19.764179926Z" level=info msg="scheme \"unix\" not registered, fallback to default scheme" module=grpc
2020-07-20T08:21:19.764319635Z time="2020-07-20T08:21:19.764279612Z" level=info msg="ccResolverWrapper: sending update to cc: {[{unix:///var/run/docker/containerd/containerd.sock 0 <nil>}] <nil>}" module=grpc
2020-07-20T08:21:19.764371375Z time="2020-07-20T08:21:19.764344798Z" level=info msg="ClientConn switching balancer to \"pick_first\"" module=grpc
2020-07-20T08:21:19.969344247Z time="2020-07-20T08:21:19.969193121Z" level=info msg="starting containerd" revision=7ad184331fa3e55e52b890ea95e65ba581ae3429 version=v1.2.13
2020-07-20T08:21:19.969863044Z time="2020-07-20T08:21:19.969784495Z" level=info msg="loading plugin "io.containerd.content.v1.content"..." type=io.containerd.content.v1
2020-07-20T08:21:19.970042302Z time="2020-07-20T08:21:19.969997665Z" level=info msg="loading plugin "io.containerd.snapshotter.v1.btrfs"..." type=io.containerd.snapshotter.v1
2020-07-20T08:21:19.970399514Z time="2020-07-20T08:21:19.970336671Z" level=warning msg="failed to load plugin io.containerd.snapshotter.v1.btrfs" error="path /var/lib/docker/containerd/daemon/io.containerd.snapshotter.v1.btrfs must be a btrfs filesystem to be used with the btrfs snapshotter"
2020-07-20T08:21:19.970474776Z time="2020-07-20T08:21:19.970428684Z" level=info msg="loading plugin "io.containerd.snapshotter.v1.aufs"..." type=io.containerd.snapshotter.v1
2020-07-20T08:21:20.019585153Z time="2020-07-20T08:21:20.019421401Z" level=warning msg="failed to load plugin io.containerd.snapshotter.v1.aufs" error="modprobe aufs failed: "ip: can't find device 'aufs'\nmodprobe: can't change directory to '/lib/modules': No such file or directory\n": exit status 1"
2020-07-20T08:21:20.019709540Z time="2020-07-20T08:21:20.019668899Z" level=info msg="loading plugin "io.containerd.snapshotter.v1.native"..." type=io.containerd.snapshotter.v1
2020-07-20T08:21:20.019934319Z time="2020-07-20T08:21:20.019887606Z" level=info msg="loading plugin "io.containerd.snapshotter.v1.overlayfs"..." type=io.containerd.snapshotter.v1
2020-07-20T08:21:20.020299876Z time="2020-07-20T08:21:20.020218529Z" level=info msg="loading plugin "io.containerd.snapshotter.v1.zfs"..." type=io.containerd.snapshotter.v1
2020-07-20T08:21:20.021038477Z time="2020-07-20T08:21:20.020887571Z" level=info msg="skip loading plugin "io.containerd.snapshotter.v1.zfs"..." type=io.containerd.snapshotter.v1
2020-07-20T08:21:20.021162370Z time="2020-07-20T08:21:20.021121663Z" level=info msg="loading plugin "io.containerd.metadata.v1.bolt"..." type=io.containerd.metadata.v1
2020-07-20T08:21:20.021406797Z time="2020-07-20T08:21:20.021348536Z" level=warning msg="could not use snapshotter aufs in metadata plugin" error="modprobe aufs failed: "ip: can't find device 'aufs'\nmodprobe: can't change directory to '/lib/modules': No such file or directory\n": exit status 1"
2020-07-20T08:21:20.021487917Z time="2020-07-20T08:21:20.021435946Z" level=warning msg="could not use snapshotter zfs in metadata plugin" error="path /var/lib/docker/containerd/daemon/io.containerd.snapshotter.v1.zfs must be a zfs filesystem to be used with the zfs snapshotter: skip plugin"
2020-07-20T08:21:20.021581245Z time="2020-07-20T08:21:20.021533539Z" level=warning msg="could not use snapshotter btrfs in metadata plugin" error="path /var/lib/docker/containerd/daemon/io.containerd.snapshotter.v1.btrfs must be a btrfs filesystem to be used with the btrfs snapshotter"
2020-07-20T08:21:20.030531741Z time="2020-07-20T08:21:20.030427430Z" level=info msg="loading plugin "io.containerd.differ.v1.walking"..." type=io.containerd.differ.v1
2020-07-20T08:21:20.030639854Z time="2020-07-20T08:21:20.030604536Z" level=info msg="loading plugin "io.containerd.gc.v1.scheduler"..." type=io.containerd.gc.v1
2020-07-20T08:21:20.030779501Z time="2020-07-20T08:21:20.030736875Z" level=info msg="loading plugin "io.containerd.service.v1.containers-service"..." type=io.containerd.service.v1
2020-07-20T08:21:20.030865060Z time="2020-07-20T08:21:20.030833703Z" level=info msg="loading plugin "io.containerd.service.v1.content-service"..." type=io.containerd.service.v1
2020-07-20T08:21:20.030955439Z time="2020-07-20T08:21:20.030912981Z" level=info msg="loading plugin "io.containerd.service.v1.diff-service"..." type=io.containerd.service.v1
2020-07-20T08:21:20.031027842Z time="2020-07-20T08:21:20.030998003Z" level=info msg="loading plugin "io.containerd.service.v1.images-service"..." type=io.containerd.service.v1
2020-07-20T08:21:20.031132325Z time="2020-07-20T08:21:20.031083782Z" level=info msg="loading plugin "io.containerd.service.v1.leases-service"..." type=io.containerd.service.v1
2020-07-20T08:21:20.031202966Z time="2020-07-20T08:21:20.031174445Z" level=info msg="loading plugin "io.containerd.service.v1.namespaces-service"..." type=io.containerd.service.v1
2020-07-20T08:21:20.031286993Z time="2020-07-20T08:21:20.031253528Z" level=info msg="loading plugin "io.containerd.service.v1.snapshots-service"..." type=io.containerd.service.v1
2020-07-20T08:21:20.031370557Z time="2020-07-20T08:21:20.031312376Z" level=info msg="loading plugin "io.containerd.runtime.v1.linux"..." type=io.containerd.runtime.v1
2020-07-20T08:21:20.031709756Z time="2020-07-20T08:21:20.031650044Z" level=info msg="loading plugin "io.containerd.runtime.v2.task"..." type=io.containerd.runtime.v2
2020-07-20T08:21:20.031941868Z time="2020-07-20T08:21:20.031897088Z" level=info msg="loading plugin "io.containerd.monitor.v1.cgroups"..." type=io.containerd.monitor.v1
2020-07-20T08:21:20.032929781Z time="2020-07-20T08:21:20.032846588Z" level=info msg="loading plugin "io.containerd.service.v1.tasks-service"..." type=io.containerd.service.v1
2020-07-20T08:21:20.033064279Z time="2020-07-20T08:21:20.033014391Z" level=info msg="loading plugin "io.containerd.internal.v1.restart"..." type=io.containerd.internal.v1
2020-07-20T08:21:20.034207198Z time="2020-07-20T08:21:20.034120505Z" level=info msg="loading plugin "io.containerd.grpc.v1.containers"..." type=io.containerd.grpc.v1
2020-07-20T08:21:20.034316027Z time="2020-07-20T08:21:20.034279582Z" level=info msg="loading plugin "io.containerd.grpc.v1.content"..." type=io.containerd.grpc.v1
2020-07-20T08:21:20.034402334Z time="2020-07-20T08:21:20.034369239Z" level=info msg="loading plugin "io.containerd.grpc.v1.diff"..." type=io.containerd.grpc.v1
2020-07-20T08:21:20.034482782Z time="2020-07-20T08:21:20.034452282Z" level=info msg="loading plugin "io.containerd.grpc.v1.events"..." type=io.containerd.grpc.v1
2020-07-20T08:21:20.034564724Z time="2020-07-20T08:21:20.034533365Z" level=info msg="loading plugin "io.containerd.grpc.v1.healthcheck"..." type=io.containerd.grpc.v1
2020-07-20T08:21:20.034645756Z time="2020-07-20T08:21:20.034617060Z" level=info msg="loading plugin "io.containerd.grpc.v1.images"..." type=io.containerd.grpc.v1
2020-07-20T08:21:20.034722695Z time="2020-07-20T08:21:20.034689037Z" level=info msg="loading plugin "io.containerd.grpc.v1.leases"..." type=io.containerd.grpc.v1
2020-07-20T08:21:20.034800005Z time="2020-07-20T08:21:20.034770572Z" level=info msg="loading plugin "io.containerd.grpc.v1.namespaces"..." type=io.containerd.grpc.v1
2020-07-20T08:21:20.034873069Z time="2020-07-20T08:21:20.034837050Z" level=info msg="loading plugin "io.containerd.internal.v1.opt"..." type=io.containerd.internal.v1
2020-07-20T08:21:20.036608424Z time="2020-07-20T08:21:20.036525701Z" level=info msg="loading plugin "io.containerd.grpc.v1.snapshots"..." type=io.containerd.grpc.v1
2020-07-20T08:21:20.036722927Z time="2020-07-20T08:21:20.036684403Z" level=info msg="loading plugin "io.containerd.grpc.v1.tasks"..." type=io.containerd.grpc.v1
2020-07-20T08:21:20.036799326Z time="2020-07-20T08:21:20.036769392Z" level=info msg="loading plugin "io.containerd.grpc.v1.version"..." type=io.containerd.grpc.v1
2020-07-20T08:21:20.036876692Z time="2020-07-20T08:21:20.036844684Z" level=info msg="loading plugin "io.containerd.grpc.v1.introspection"..." type=io.containerd.grpc.v1
2020-07-20T08:21:20.037291381Z time="2020-07-20T08:21:20.037244979Z" level=info msg=serving... address="/var/run/docker/containerd/containerd-debug.sock"
2020-07-20T08:21:20.037493736Z time="2020-07-20T08:21:20.037445814Z" level=info msg=serving... address="/var/run/docker/containerd/containerd.sock"
2020-07-20T08:21:20.037563487Z time="2020-07-20T08:21:20.037522305Z" level=info msg="containerd successfully booted in 0.069638s"
2020-07-20T08:21:20.087933162Z time="2020-07-20T08:21:20.087804902Z" level=info msg="Setting the storage driver from the $DOCKER_DRIVER environment variable (overlay2)"
2020-07-20T08:21:20.088415387Z time="2020-07-20T08:21:20.088327506Z" level=info msg="parsed scheme: \"unix\"" module=grpc
2020-07-20T08:21:20.088533804Z time="2020-07-20T08:21:20.088465157Z" level=info msg="scheme \"unix\" not registered, fallback to default scheme" module=grpc
2020-07-20T08:21:20.088620947Z time="2020-07-20T08:21:20.088562235Z" level=info msg="ccResolverWrapper: sending update to cc: {[{unix:///var/run/docker/containerd/containerd.sock 0 <nil>}] <nil>}" module=grpc
2020-07-20T08:21:20.088709546Z time="2020-07-20T08:21:20.088654016Z" level=info msg="ClientConn switching balancer to \"pick_first\"" module=grpc
2020-07-20T08:21:20.092857445Z time="2020-07-20T08:21:20.092749940Z" level=info msg="parsed scheme: \"unix\"" module=grpc
2020-07-20T08:21:20.092962469Z time="2020-07-20T08:21:20.092914347Z" level=info msg="scheme \"unix\" not registered, fallback to default scheme" module=grpc
2020-07-20T08:21:20.093060327Z time="2020-07-20T08:21:20.093013905Z" level=info msg="ccResolverWrapper: sending update to cc: {[{unix:///var/run/docker/containerd/containerd.sock 0 <nil>}] <nil>}" module=grpc
2020-07-20T08:21:20.093142744Z time="2020-07-20T08:21:20.093102173Z" level=info msg="ClientConn switching balancer to \"pick_first\"" module=grpc
2020-07-20T08:21:20.149109416Z time="2020-07-20T08:21:20.148965236Z" level=info msg="Loading containers: start."
2020-07-20T08:21:20.159351905Z time="2020-07-20T08:21:20.159146135Z" level=warning msg="Running modprobe bridge br_netfilter failed with message: ip: can't find device 'bridge'\nbridge 167936 1 br_netfilter\nstp 16384 1 bridge\nllc 16384 2 bridge,stp\nip: can't find device 'br_netfilter'\nbr_netfilter 24576 0 \nbridge 167936 1 br_netfilter\nmodprobe: can't change directory to '/lib/modules': No such file or directory\n, error: exit status 1"
2020-07-20T08:21:20.280536391Z time="2020-07-20T08:21:20.280402152Z" level=info msg="Default bridge (docker0) is assigned with an IP address 172.18.0.0/16. Daemon option --bip can be used to set a preferred IP address"
2020-07-20T08:21:20.337028532Z time="2020-07-20T08:21:20.336889956Z" level=info msg="Loading containers: done."
2020-07-20T08:21:20.435200532Z time="2020-07-20T08:21:20.435033092Z" level=info msg="Docker daemon" commit=48a66213fe graphdriver(s)=overlay2 version=19.03.12
2020-07-20T08:21:20.436386855Z time="2020-07-20T08:21:20.436266338Z" level=info msg="Daemon has completed initialization"
2020-07-20T08:21:20.476621441Z time="2020-07-20T08:21:20.475137317Z" level=info msg="API listen on [::]:2375"
2020-07-20T08:21:20.477679219Z time="2020-07-20T08:21:20.477535808Z" level=info msg="API listen on /var/run/docker.sock"
*********
Pulling docker image gitlab/dind:latest ...
Using docker image sha256:cc674e878f23bdc3c36cc37596d31adaa23bca0fc3ed18bea9b59abc638602e1 for gitlab/dind:latest ...
section_end:1595233326:prepare_executor
section_start:1595233326:prepare_script
Preparing environment
Running on runner-fa6cab46-project-18378289-concurrent-0 via runner-fa6cab46-srm-1595233216-1bd30100...
section_end:1595233330:prepare_script
section_start:1595233330:get_sources
Getting source from Git repository
$ eval "$CI_PRE_CLONE_SCRIPT"
Fetching changes with git depth set to 50...
Initialized empty Git repository in /builds/xxx.us/backend/.git/
Created fresh repository.
Checking out 257ffdf2 as stage...
Skipping Git submodules setup
section_end:1595233333:get_sources
section_start:1595233333:restore_cache
Restoring cache
Checking cache for stage node:14.5.0-alpine-2...
Downloading cache.zip from https://storage.googleapis.com/gitlab-com-runners-cache/project/18378289/stage%20node:14.5.0-alpine-2
Successfully extracted cache
section_end:1595233334:restore_cache
section_start:1595233334:step_script
Executing "step_script" stage of the job script
ln: failed to create symbolic link '/sys/fs/cgroup/systemd/name=systemd': Operation not permitted
time="2020-07-20T08:22:14.844844859Z" level=warning msg="[!] DON'T BIND ON ANY IP ADDRESS WITHOUT setting -tlsverify IF YOU DON'T KNOW WHAT YOU'RE DOING [!]"
time="2020-07-20T08:22:14.846663310Z" level=info msg="libcontainerd: new containerd process, pid: 57"
time="2020-07-20T08:22:14.906788853Z" level=info msg="Graph migration to content-addressability took 0.00 seconds"
time="2020-07-20T08:22:14.907996055Z" level=info msg="Loading containers: start."
time="2020-07-20T08:22:14.910877638Z" level=warning msg="Running modprobe bridge br_netfilter failed with message: modprobe: ERROR: ../libkmod/libkmod.c:556 kmod_search_moddep() could not open moddep file '/lib/modules/4.19.78-coreos/modules.dep.bin'\nmodprobe: ERROR: ../libkmod/libkmod.c:556 kmod_search_moddep() could not open moddep file '/lib/modules/4.19.78-coreos/modules.dep.bin'\n, error: exit status 1"
time="2020-07-20T08:22:14.912665866Z" level=warning msg="Running modprobe nf_nat failed with message: `modprobe: ERROR: ../libkmod/libkmod.c:556 kmod_search_moddep() could not open moddep file '/lib/modules/4.19.78-coreos/modules.dep.bin'`, error: exit status 1"
time="2020-07-20T08:22:14.914201302Z" level=warning msg="Running modprobe xt_conntrack failed with message: `modprobe: ERROR: ../libkmod/libkmod.c:556 kmod_search_moddep() could not open moddep file '/lib/modules/4.19.78-coreos/modules.dep.bin'`, error: exit status 1"
time="2020-07-20T08:22:14.989456423Z" level=warning msg="Could not load necessary modules for IPSEC rules: Running modprobe xfrm_user failed with message: `modprobe: ERROR: ../libkmod/libkmod.c:556 kmod_search_moddep() could not open moddep file '/lib/modules/4.19.78-coreos/modules.dep.bin'`, error: exit status 1"
time="2020-07-20T08:22:14.990108153Z" level=info msg="Default bridge (docker0) is assigned with an IP address 172.18.0.0/16. Daemon option --bip can be used to set a preferred IP address"
time="2020-07-20T08:22:15.029286773Z" level=info msg="Loading containers: done."
time="2020-07-20T08:22:15.029664106Z" level=info msg="Daemon has completed initialization"
time="2020-07-20T08:22:15.029823541Z" level=info msg="Docker daemon" commit=23cf638 graphdriver=overlay2 version=1.12.1
time="2020-07-20T08:22:15.048665494Z" level=info msg="API listen on /var/run/docker.sock"
time="2020-07-20T08:22:15.049046558Z" level=info msg="API listen on [::]:7070"
# Keeps loading and finally fails after a couple of seconds
gitlab-ci.yml
cache:
  key: '$CI_COMMIT_REF_NAME node:14.5.0-alpine'
  paths:
    - node_modules/

stages:
  - release
  - deploy

variables:
  TAGGED_IMAGE: '$CI_REGISTRY_IMAGE:latest'

.release:
  stage: release
  image: docker:19.03.12
  services:
    - docker:dind
  variables:
    DOCKER_DRIVER: overlay2
    DOCKER_BUILDKIT: 1
  before_script:
    - docker version
    - docker info
    - echo "$CI_JOB_TOKEN" | docker login --username $CI_REGISTRY_USER --password-stdin $CI_REGISTRY
  script:
    - docker build --pull --tag $TAGGED_IMAGE --cache-from $TAGGED_IMAGE --build-arg NODE_ENV=$CI_ENVIRONMENT_NAME .
    - docker push $TAGGED_IMAGE
  after_script:
    - docker logout $CI_REGISTRY

.deploy:
  stage: deploy
  image: gitlab/dind:latest
  services:
    - docker:dind
  variables:
    DOCKER_COMPOSE_PATH: '~/docker-composes/$CI_PROJECT_PATH/docker-compose.yml'
  before_script:
    - 'which ssh-agent || ( apt-get update -y && apt-get install openssh-client -y )'
    - eval $(ssh-agent -s)
    - echo "$DEPLOY_SERVER_PRIVATE_KEY" | tr -d '\r' | ssh-add -
    - mkdir -p ~/.ssh
    - chmod 700 ~/.ssh
    - ssh-keyscan $DEPLOYMENT_SERVER_IP >> ~/.ssh/known_hosts
    - chmod 644 ~/.ssh/known_hosts
  script:
    - rsync -avR --rsync-path="mkdir -p ~/docker-composes/$CI_PROJECT_PATH/; rsync" ./docker-compose.yml root@$DEPLOYMENT_SERVER_IP:~/docker-composes/$CI_PROJECT_PATH/
    - ssh root@$DEPLOYMENT_SERVER_IP "echo "$CI_REGISTRY_PASSWORD" | docker login --username $CI_REGISTRY_USER --password-stdin $CI_REGISTRY; docker-compose -f $DOCKER_COMPOSE_PATH rm -f -s -v $CI_COMMIT_REF_NAME; docker pull $TAGGED_IMAGE; docker-compose -f $DOCKER_COMPOSE_PATH up -d $CI_COMMIT_REF_NAME;"

release_stage:
  extends: .release
  only:
    - stage
  environment:
    name: staging

deploy_stage:
  extends: .deploy
  only:
    - stage
  environment:
    name: staging
Dockerfile
ARG NODE_ENV
FROM node:14.5.0-alpine
ARG NODE_ENV
ENV NODE_ENV ${NODE_ENV}
# Set working directory
WORKDIR /var/www/
# Install app dependencies
COPY package.json package-lock.json ./
RUN npm ci --silent --only=production
COPY . ./
# Start the application
CMD [ "npm", "run", "start" ]
docker-compose.yml
version: '3.8'

services:
  redis-stage:
    container_name: redis-stage
    image: redis:6.0.5-alpine
    ports:
      - '7075:6379'
    restart: always
    networks:
      - my-proxy-net
  stage:
    container_name: xxx-backend-stage
    image: registry.gitlab.com/xxx.us/backend:latest
    build: .
    expose:
      - '7070'
    restart: always
    networks:
      - my-proxy-net
    depends_on:
      - redis-stage
    environment:
      VIRTUAL_HOST: backend.xxx.us
      VIRTUAL_PROTO: https
      LETSENCRYPT_HOST: backend.xxx.us

networks:
  my-proxy-net:
    external:
      name: mynetwork
Update 1
I got a warning on the page claiming I have used over 30% of my shared runner minutes. Maybe it is about not having enough minutes.
Update 2
The release stage completes successfully.
Update 3
Before I ran into this problem, I had deployed once successfully. I decided to test that commit again and see if it succeeds, but it fails!
I fixed the issue. In my case, it was the PORT (definitely) and HOST (maybe) environment variables I had defined manually in the GitLab CI/CD Variables section. It seems PORT, and maybe HOST, are reserved environment variables for GitLab and/or Docker.
By the way, I couldn't find anything in the docs that says not to use those variable names.
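A minimal sketch of the workaround, assuming the application reads its port from the environment; APP_PORT and APP_HOST are made-up names used only for illustration:

  # .gitlab-ci.yml (sketch) - define non-reserved names instead of PORT / HOST
  variables:
    APP_PORT: '7070'
    APP_HOST: '0.0.0.0'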

OpenEBS target pod is not able to communicate with its replicas after deleting one of the worker nodes from the cluster

I'm having a problem with an OpenEBS data store. The setup has 3 OpenEBS storage replicas on 3 different VMs.
Initially the workload pod (PostgreSQL) went into read-only mode, so I first deleted the worker node and, after things didn't recover, the OpenEBS ctrl pod. Now it seems the ctrl pod cannot reconnect with all 3 replicas and keeps showing the message:
level=warning msg="No of yet to be registered replicas are less than 3 , No of registered replicas: 1"
The replica that seems to have managed to connect keeps logging repeatedly:
time="2019-01-22T08:04:12Z" level=info msg="Get Volume info from controller"
time="2019-01-22T08:04:12Z" level=info msg="Register replica at controller"
Target pod logs
"2019-01-22T06:55:46.064Z","pvc-a2e1d1bf-db64-11e8-9384-fee6a1e98ebe-ctrl-6658c7df95-m2hnf","time=""2019-01-22T06:55:46Z"" level=error msg=""Mode: ReadOnly"""
"2019-01-22T06:55:46.065Z","pvc-a2e1d1bf-db64-11e8-9384-fee6a1e98ebe-ctrl-6658c7df95-m2hnf","time=""2019-01-22T06:55:46Z"" level=error msg=""Mode: ReadOnly"""
"2019-01-22T06:55:48.076Z","pvc-a2e1d1bf-db64-11e8-9384-fee6a1e98ebe-ctrl-6658c7df95-m2hnf","time=""2019-01-22T06:55:48Z"" level=error msg=""Mode: ReadOnly"""
"2019-01-22T06:55:48.075Z","pvc-a2e1d1bf-db64-11e8-9384-fee6a1e98ebe-ctrl-6658c7df95-m2hnf","time=""2019-01-22T06:55:48Z"" level=error msg=""Mode: ReadOnly"""
"2019-01-22T06:55:50.085Z","pvc-a2e1d1bf-db64-11e8-9384-fee6a1e98ebe-ctrl-6658c7df95-m2hnf","time=""2019-01-22T06:55:50Z"" level=error msg=""Mode: ReadOnly"""
"2019-01-22T06:55:49.083Z","pvc-a2e1d1bf-db64-11e8-9384-fee6a1e98ebe-ctrl-6658c7df95-m2hnf","time=""2019-01-22T06:55:49Z"" level=error msg=""Mode: ReadOnly"""
"2019-01-22T06:55:50.086Z","pvc-a2e1d1bf-db64-11e8-9384-fee6a1e98ebe-ctrl-6658c7df95-m2hnf","time=""2019-01-22T06:55:50Z"" level=warning msg=busy"
"2019-01-22T06:55:50.085Z","pvc-a2e1d1bf-db64-11e8-9384-fee6a1e98ebe-ctrl-6658c7df95-m2hnf","time=""2019-01-22T06:55:50Z"" level=error msg=""Mode: ReadOnly"""
"2019-01-22T06:55:49.084Z","pvc-a2e1d1bf-db64-11e8-9384-fee6a1e98ebe-ctrl-6658c7df95-m2hnf","time=""2019-01-22T06:55:49Z"" level=error msg=""Mode: ReadOnly"""
"2019-01-22T06:55:53.105Z","pvc-a2e1d1bf-db64-11e8-9384-fee6a1e98ebe-ctrl-6658c7df95-m2hnf","time=""2019-01-22T06:55:53Z"" level=warning msg=busy"
"2019-01-22T06:55:53.104Z","pvc-a2e1d1bf-db64-11e8-9384-fee6a1e98ebe-ctrl-6658c7df95-m2hnf","time=""2019-01-22T06:55:53Z"" level=error msg=""Mode: ReadOnly"""
"2019-01-22T06:55:55.117Z","pvc-a2e1d1bf-db64-11e8-9384-fee6a1e98ebe-ctrl-6658c7df95-m2hnf","time=""2019-01-22T06:55:55Z"" level=error msg=""Mode: ReadOnly"""
"2019-01-22T06:55:54.107Z","pvc-a2e1d1bf-db64-11e8-9384-fee6a1e98ebe-ctrl-6658c7df95-m2hnf","time=""2019-01-22T06:55:54Z"" level=error msg=""Mode: ReadOnly"""
"2019-01-22T06:55:54.107Z","pvc-a2e1d1bf-db64-11e8-9384-fee6a1e98ebe-ctrl-6658c7df95-m2hnf","time=""2019-01-22T06:55:54Z"" level=error msg=""Mode: ReadOnly"""
Replica pod which is not yet connected:
"2019-01-22T06:56:24.117Z","pvc-a2e1d1bf-db64-11e8-9384-fee6a1e98ebe-rep-7bf8ffb665-v5777","time=""2019-01-22T06:56:24Z"" level=info msg=""Done running ssync [ssync -host 10.233.91.202 -timeout 7 -port 9700 volume-head-010.img.meta]"""
"2019-01-22T06:56:24.866Z","pvc-a2e1d1bf-db64-11e8-9384-fee6a1e98ebe-rep-7bf8ffb665-v5777","time=""2019-01-22T06:56:24Z"" level=info msg=""source file size: 12884901888, setting up directIo: true"""
"2019-01-22T06:56:11.390Z","pvc-a2e1d1bf-db64-11e8-9384-fee6a1e98ebe-rep-7bf8ffb665-v5777","time=""2019-01-22T06:56:11Z"" level=info msg=""Get Volume Usage"""
"2019-01-22T06:56:23.881Z","pvc-a2e1d1bf-db64-11e8-9384-fee6a1e98ebe-rep-7bf8ffb665-v5777","time=""2019-01-22T06:56:23Z"" level=info msg=""Snapshotting [d82c79af-06fd-4bc4-bd67-c54fa636e596] volume, user created false, created time 2019-01-22T06:56:23Z"""
"2019-01-22T06:56:23.924Z","pvc-a2e1d1bf-db64-11e8-9384-fee6a1e98ebe-rep-7bf8ffb665-v5777","10.233.96.147 - - [22/Jan/2019:06:56:23 +0000] ""POST /v1/replicas/1?action=snapshot HTTP/1.1"" 200 14804"
"2019-01-22T06:56:24.049Z","pvc-a2e1d1bf-db64-11e8-9384-fee6a1e98ebe-rep-7bf8ffb665-v5777","time=""2019-01-22T06:56:24Z"" level=info msg=""Running ssync [ssync -host 10.233.91.202 -timeout 7 -port 9700 volume-head-010.img.meta]"""
"2019-01-22T06:56:24.828Z","pvc-a2e1d1bf-db64-11e8-9384-fee6a1e98ebe-rep-7bf8ffb665-v5777","time=""2019-01-22T06:56:24Z"" level=info msg=""Running ssync [ssync -host 10.233.91.202 -timeout 7 -port 9701 volume-snap-6b38fe32-98ab-4f95-8b2d-05ba9aebfe0e.img]"""
"2019-01-22T06:56:24.885Z","pvc-a2e1d1bf-db64-11e8-9384-fee6a1e98ebe-rep-7bf8ffb665-v5777","time=""2019-01-22T06:56:24Z"" level=info msg=""The file is a hole: [ 0: 3145728](3145728)"""
"2019-01-22T06:56:24.886Z","pvc-a2e1d1bf-db64-11e8-9384-fee6a1e98ebe-rep-7bf8ffb665-v5777","time=""2019-01-22T06:56:24Z"" level=info msg=""Ssync client: exit code 0"""
"2019-01-22T06:56:23.872Z","pvc-a2e1d1bf-db64-11e8-9384-fee6a1e98ebe-rep-7bf8ffb665-v5777","time=""2019-01-22T06:56:23Z"" level=info msg=""GetReplica for id 1"""
"2019-01-22T06:56:24.019Z","pvc-a2e1d1bf-db64-11e8-9384-fee6a1e98ebe-rep-7bf8ffb665-v5777","time=""2019-01-22T06:56:24Z"" level=info msg=GetReplica"
"2019-01-22T06:56:24.019Z","pvc-a2e1d1bf-db64-11e8-9384-fee6a1e98ebe-rep-7bf8ffb665-v5777","time=""2019-01-22T06:56:24Z"" level=info msg=""GetReplica for id 1"""
"2019-01-22T06:56:24.886Z","pvc-a2e1d1bf-db64-11e8-9384-fee6a1e98ebe-rep-7bf8ffb665-v5777","time=""2019-01-22T06:56:24Z"" level=info msg=""Done running ssync [ssync -host 10.233.91.202 -timeout 7 -port 9701 volume-snap-6b38fe32-98ab-4f95-8b2d-05ba9aebfe0e.img]"""
"2019-01-22T06:56:25.607Z","pvc-a2e1d1bf-db64-11e8-9384-fee6a1e98ebe-rep-7bf8ffb665-v5777","time=""2019-01-22T06:56:25Z"" level=info msg=""source file size: 112, setting up directIo: false"""
"2019-01-22T06:56:25.614Z","pvc-a2e1d1bf-db64-11e8-9384-fee6a1e98ebe-rep-7bf8ffb665-v5777","time=""2019-01-22T06:56:25Z"" level=warning msg=""Failed to open server: 10.233.91.202:9702, Retrying..."""
"2019-01-22T06:56:26.628Z","pvc-a2e1d1bf-db64-11e8-9384-fee6a1e98ebe-rep-7bf8ffb665-v5777","time=""2019-01-22T06:56:26Z"" level=info msg=""Ssync client: exit code 0"""
"2019-01-22T06:56:28.353Z","pvc-a2e1d1bf-db64-11e8-9384-fee6a1e98ebe-rep-7bf8ffb665-v5777","time=""2019-01-22T06:56:28Z"" level=info msg=""Running ssync [ssync -host 10.233.91.202 -timeout 7 -port 9703 volume-snap-a03e749d-31ac-4375-9559-14fb141fc3d7.img]"""
"2019-01-22T06:56:28.419Z","pvc-a2e1d1bf-db64-11e8-9384-fee6a1e98ebe-rep-7bf8ffb665-v5777","time=""2019-01-22T06:56:28Z"" level=info msg=""source file size: 12884901888, setting up directIo: true"""
"2019-01-22T06:56:28.428Z","pvc-a2e1d1bf-db64-11e8-9384-fee6a1e98ebe-rep-7bf8ffb665-v5777","time=""2019-01-22T06:56:28Z"" level=info msg=""The file is a hole: [ 0: 3145728](3145728)"""
"2019-01-22T06:56:28.431Z","pvc-a2e1d1bf-db64-11e8-9384-fee6a1e98ebe-rep-7bf8ffb665-v5777","time=""2019-01-22T06:56:28Z"" level=info msg=""Ssync client: exit code 0"""
"2019-01-22T06:56:29.121Z","pvc-a2e1d1bf-db64-11e8-9384-fee6a1e98ebe-rep-7bf8ffb665-v5777","time=""2019-01-22T06:56:29Z"" level=info msg=""Running ssync [ssync -host 10.233.91.202 -timeout 7 -port 9704 volume-snap-a03e749d-31ac-4375-9559-14fb141fc3d7.img.meta]"""
"2019-01-22T06:56:29.900Z","pvc-a2e1d1bf-db64-11e8-9384-fee6a1e98ebe-rep-7bf8ffb665-v5777","time=""2019-01-22T06:56:29Z"" level=info msg=""Syncing volume-snap-f8771212-06d3-400b-ad12-c063ef8ed827.img to 10.233.91.202:9705...\n"""
"2019-01-22T06:56:29.900Z","pvc-a2e1d1bf-db64-11e8-9384-fee6a1e98ebe-rep-7bf8ffb665-v5777","time=""2019-01-22T06:56:29Z"" level=info msg=""source file size: 12884901888, setting up directIo: true"""
"2019-01-22T06:56:29.904Z","pvc-a2e1d1bf-db64-11e8-9384-fee6a1e98ebe-rep-7bf8ffb665-v5777","time=""2019-01-22T06:56:29Z"" level=warning msg=""Failed to open server: 10.233.91.202:9705, Retrying..."""
"2019-01-22T06:56:25.607Z","pvc-a2e1d1bf-db64-11e8-9384-fee6a1e98ebe-rep-7bf8ffb665-v5777","time=""2019-01-22T06:56:25Z"" level=info msg=""Syncing volume-snap-6b38fe32-98ab-4f95-8b2d-05ba9aebfe0e.img.meta to 10.233.91.202:9702...\n"""
"2019-01-22T06:56:25.584Z","pvc-a2e1d1bf-db64-11e8-9384-fee6a1e98ebe-rep-7bf8ffb665-v5777","time=""2019-01-22T06:56:25Z"" level=info msg=""Running ssync [ssync -host 10.233.91.202 -timeout 7 -port 9702 volume-snap-6b38fe32-98ab-4f95-8b2d-05ba9aebfe0e.img.meta]"""
"2019-01-22T06:56:28.419Z","pvc-a2e1d1bf-db64-11e8-9384-fee6a1e98ebe-rep-7bf8ffb665-v5777","time=""2019-01-22T06:56:28Z"" level=info msg=""Syncing volume-snap-a03e749d-31ac-4375-9559-14fb141fc3d7.img to 10.233.91.202:9703...\n"""
"2019-01-22T06:56:29.215Z","pvc-a2e1d1bf-db64-11e8-9384-fee6a1e98ebe-rep-7bf8ffb665-v5777","time=""2019-01-22T06:56:29Z"" level=info msg=""Done running ssync [ssync -host 10.233.91.202 -timeout 7 -port 9704 volume-snap-a03e749d-31ac-4375-9559-14fb141fc3d7.img.meta]"""
"2019-01-22T06:56:29.880Z","pvc-a2e1d1bf-db64-11e8-9384-fee6a1e98ebe-rep-7bf8ffb665-v5777","time=""2019-01-22T06:56:29Z"" level=info msg=""Running ssync [ssync -host 10.233.91.202 -timeout 7 -port 9705 volume-snap-f8771212-06d3-400b-ad12-c063ef8ed827.img]"""
"2019-01-22T06:56:28.434Z","pvc-a2e1d1bf-db64-11e8-9384-fee6a1e98ebe-rep-7bf8ffb665-v5777","time=""2019-01-22T06:56:28Z"" level=info msg=""Done running ssync [ssync -host 10.233.91.202 -timeout 7 -port 9703 volume-snap-a03e749d-31ac-4375-9559-14fb141fc3d7.img]"""
"2019-01-22T06:56:29.211Z","pvc-a2e1d1bf-db64-11e8-9384-fee6a1e98ebe-rep-7bf8ffb665-v5777","time=""2019-01-22T06:56:29Z"" level=info msg=""Ssync client: exit code 0"""
"2019-01-22T06:56:29.183Z","pvc-a2e1d1bf-db64-11e8-9384-fee6a1e98ebe-rep-7bf8ffb665-v5777","time=""2019-01-22T06:56:29Z"" level=info msg=""source file size: 164, setting up directIo: false"""
"2019-01-22T06:56:29.905Z","pvc-a2e1d1bf-db64-11e8-9384-fee6a1e98ebe-rep-7bf8ffb665-v5777","time=""2019-01-22T06:56:29Z"" level=warning msg=""Failed to open server: 10.233.91.202:9705, Retrying..."""
"2019-01-22T06:56:41.391Z","pvc-a2e1d1bf-db64-11e8-9384-fee6a1e98ebe-rep-7bf8ffb665-v5777","time=""2019-01-22T06:56:41Z"" level=info msg=GetUsage"
"2019-01-22T06:56:41.392Z","pvc-a2e1d1bf-db64-11e8-9384-fee6a1e98ebe-rep-7bf8ffb665-v5777","10.233.96.147 - - [22/Jan/2019:06:56:41 +0000] ""GET /v1/replicas/1/volusage HTTP/1.1"" 200 200"
"2019-01-22T06:56:41.390Z","pvc-a2e1d1bf-db64-11e8-9384-fee6a1e98ebe-rep-7bf8ffb665-v5777","time=""2019-01-22T06:56:41Z"" level=info msg=""Get Volume Usage"""
"2019-01-22T06:59:11.392Z","pvc-a2e1d1bf-db64-11e8-9384-fee6a1e98ebe-rep-7bf8ffb665-v5777","time=""2019-01-22T06:59:11Z"" level=info msg=GetUsage"
"2019-01-22T06:59:11.392Z","pvc-a2e1d1bf-db64-11e8-9384-fee6a1e98ebe-rep-7bf8ffb665-v5777","time=""2019-01-22T06:59:11Z"" level=info msg=""Get Volume Usage"""
"2019-01-22T07:00:38.050Z","pvc-a2e1d1bf-db64-11e8-9384-fee6a1e98ebe-rep-7bf8ffb665-v5777","time=""2019-01-22T07:00:38Z"" level=error msg=""Received EOF: EOF"""
"2019-01-22T07:00:38.050Z","pvc-a2e1d1bf-db64-11e8-9384-fee6a1e98ebe-rep-7bf8ffb665-v5777","time=""2019-01-22T07:00:38Z"" level=info msg=""Restart AutoConfigure Process"""
"2019-01-22T07:00:43.232Z","pvc-a2e1d1bf-db64-11e8-9384-fee6a1e98ebe-rep-7bf8ffb665-v5777","10.233.91.234 - - [22/Jan/2019:07:00:43 +0000] ""POST /v1/replicas/1?action=start HTTP/1.1"" 200 1091"
"2019-01-22T07:00:43.238Z","pvc-a2e1d1bf-db64-11e8-9384-fee6a1e98ebe-rep-7bf8ffb665-v5777","time=""2019-01-22T07:00:43Z"" level=info msg=""GetReplica for id 1"""
"2019-01-22T07:00:43.409Z","pvc-a2e1d1bf-db64-11e8-9384-fee6a1e98ebe-rep-7bf8ffb665-v5777","time=""2019-01-22T07:00:43Z"" level=info msg=""GetReplica for id 1"""
"2019-01-22T07:00:43.465Z","pvc-a2e1d1bf-db64-11e8-9384-fee6a1e98ebe-rep-7bf8ffb665-v5777","time=""2019-01-22T07:00:43Z"" level=info msg=GetReplica"
"2019-01-22T07:00:43.239Z","pvc-a2e1d1bf-db64-11e8-9384-fee6a1e98ebe-rep-7bf8ffb665-v5777","time=""2019-01-22T07:00:43Z"" level=info msg=""Got signal: 'open', proceed to open replica"""
"2019-01-22T07:00:43.585Z","pvc-a2e1d1bf-db64-11e8-9384-fee6a1e98ebe-rep-7bf8ffb665-v5777","10.233.91.234 - - [22/Jan/2019:07:00:43 +0000] ""POST /v1/replicas/1?action=snapshot HTTP/1.1"" 200 15190"
"2019-01-22T07:00:43.666Z","pvc-a2e1d1bf-db64-11e8-9384-fee6a1e98ebe-rep-7bf8ffb665-v5777","time=""2019-01-22T07:00:43Z"" level=info msg=GetReplica"
After going through the logs, I can see that the replicas were registered with the controller, but one of the replicas is being synced from another healthy replica, which might take some time.
After a while, the target pod's warning
level=warning msg="No of yet to be registered replicas are less than 3 , No of registered replicas: 1"
no longer shows up. I think it is recovering right now. I have about 12 GiB of data.
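For anyone landing in the same state, a minimal sketch of watching the resync from the Kubernetes side; the pvc-... pod names are taken from the logs above, and the openebs namespace is an assumption (it may be the application's namespace instead):

  # list the controller and replica pods for the volume
  kubectl get pods -n openebs -o wide | grep pvc-a2e1d1bf

  # follow the controller until all replicas register and the ReadOnly messages stop
  kubectl logs -f -n openebs pvc-a2e1d1bf-db64-11e8-9384-fee6a1e98ebe-ctrl-6658c7df95-m2hnf

  # follow the replica that is being rebuilt
  kubectl logs -f -n openebs pvc-a2e1d1bf-db64-11e8-9384-fee6a1e98ebe-rep-7bf8ffb665-v5777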
