Is there a gcloud API to detect when a Compute Engine server is completely up? - node.js

I create a VM instance. I can connect to it as soon as the SSH daemon is started, but this is too early because kernel startup is only at approximately 30%. Is there a gcloud or other API to get the VM state once the kernel has finished starting up?
Nov 18 10:58:51 image-name google: No startup script found in metadata.
Nov 18 10:58:53 image-name kernel: [ 27.491829] aufs au_opts_verify:1570:docker[2414]: dirperm1 breaks the protection by the permission bits on the lower branch
Nov 18 10:58:53 image-name kernel: [ 27.703142] aufs au_opts_verify:1570:docker[2414]: dirperm1 breaks the protection by the permission bits on the lower branch
Nov 18 10:58:53 image-name kernel: [ 27.735867] aufs au_opts_verify:1570:docker[2414]: dirperm1 breaks the protection by the permission bits on the lower branch
Nov 18 10:58:53 image-name kernel: [ 27.771732] aufs au_opts_verify:1570:docker[2260]: dirperm1 breaks the protection by the permission bits on the lower branch
Nov 18 10:58:53 image-name kernel: [ 27.797540] device vethfa3ab85 entered promiscuous mode
Nov 18 10:58:53 image-name kernel: [ 27.804420] IPv6: ADDRCONF(NETDEV_UP): vethfa3ab85: link is not ready
Nov 18 10:58:53 image-name kernel: [ 28.028306] IPv6: ADDRCONF(NETDEV_CHANGE): vethfa3ab85: link becomes ready
Nov 18 10:58:53 image-name kernel: [ 28.035505] docker0: port 1(vethfa3ab85) entered forwarding state
Nov 18 10:58:53 image-name kernel: [ 28.041963] docker0: port 1(vethfa3ab85) entered forwarding state
Nov 18 10:58:53 image-name kernel: [ 28.048532] IPv6: ADDRCONF(NETDEV_CHANGE): docker0: link becomes ready
Nov 18 10:58:54 image-name kernel: [ 28.980082] IPv6: eth0: IPv6 duplicate address fe80::42:acff:fe11:1 detected!
->>> about here I can SSH to the server
Nov 18 10:59:08 image-name kernel: [ 43.068094] docker0: port 1(vethfa3ab85) entered forwarding state
Nov 18 10:59:53 image-name kernel: [ 87.944452] aufs au_opts_verify:1570:docker[2864]: dirperm1 breaks the protection by the permission bits on the lower branch
Nov 18 10:59:53 image-name kernel: [ 88.001012] aufs au_opts_verify:1570:docker[2864]: dirperm1 breaks the protection by the permission bits on the lower branch
Nov 18 10:59:53 image-name kernel: [ 88.049510] aufs au_opts_verify:1570:docker[2815]: dirperm1 breaks the protection by the permission bits on the lower branch
->>> I want to know about this point in the startup process
My problem is that I can connect to it via SSH while kernel startup is below 30% and some processes have not yet started. I want to detect somehow whether the server has completed startup. Or is there a script I can push to the server (through the GCE APIs) that notifies me when the server is completely up?
gcloud compute instances describe image-name returns the same output from the moment the instance is started until kernel startup is complete.
(In my case I use the Node.js GCE API, but this should not make any difference.)

Presently I am not aware of any native Google API that can report the progress of an instance start.
However, here is a quick workaround; check whether it fits your requirements.
You can use either a Google startup script or the native Linux rc.local. The concept is the same, so I'll explain it for rc.local [as it is generic and not tied to Google].
rc.local is one of the last things run in the boot sequence: any command, script, or call placed in rc.local [which is itself an sh or bash script] is executed at the end of the boot process.
So the idea is: in your Google image's rc.local, add a script or call that sends you a notification, or writes the "boot is done" state to a central system such as a key-value store or Cloud Storage. A minimal sketch follows below.
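For example, a minimal rc.local sketch (assuming gcloud is available on the image; the metadata key name bootdone is just a placeholder):
#!/bin/sh -e
# /etc/rc.local -- reaching this point means the rest of the boot sequence has finished.
# Look up this instance's zone from the metadata server.
ZONE=$(curl -s -H "Metadata-Flavor: Google" \
  "http://metadata.google.internal/computeMetadata/v1/instance/zone" | cut -d/ -f4)
# Flip a custom metadata key that an external watcher can poll;
# swap this for a notification or a write to Cloud Storage if you prefer.
gcloud compute instances add-metadata "$(hostname)" \
  --zone "$ZONE" --metadata bootdone=True
exit 0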

Similar to Kamran's answer, but here is how I get this done. It relies on a Google startup script and an image where gcloud is installed by default (though you could rework this to use just curl and API calls).
On instance creation/configuration, I set a custom metadata flag: serverready=False
At the end of my Google startup script, I have this:
sudo gcloud compute instances add-metadata $(hostname) \
--metadata serverready=True \
--zone $(curl \
"http://metadata.google.internal/computeMetadata/v1/instance/zone" \
-H "Metadata-Flavor: Google"|cut -d/ -f4)
When I create the instance, I can just poll the metadata for the serverready key and have my app wait until it sees serverready=True.
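A rough polling loop against that flag might look like this (instance name and zone are placeholders, and the grep may need adjusting to your gcloud output):
INSTANCE=my-instance
ZONE=us-central1-a
# Poll the instance metadata until the startup script has flipped the flag.
until gcloud compute instances describe "$INSTANCE" --zone "$ZONE" --format=yaml \
    | grep -A1 "key: serverready" | grep -q "True"; do
  echo "waiting for startup to finish..."
  sleep 10
done
echo "serverready=True -- the startup script has completed"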

Related

Kubernetes NFS PV: Lock reclaim failed

Configuration:
The NFS server and the k8s cluster (a single-node cluster) run on two machines and use the same OS and NFS software, as below:
[root@test-2 ~]# yum info nfs-utils
Failed to set locale, defaulting to C
Loaded plugins: fastestmirror
Loading mirror speeds from cached hostfile
* base: mirrors.tuna.tsinghua.edu.cn
* extras: mirrors.bfsu.edu.cn
* updates: mirrors.huaweicloud.com
Installed Packages
Name : nfs-utils
Arch : x86_64
Epoch : 1
Version : 1.3.0
Release : 0.68.el7
Size : 1.1 M
Repo : installed
From repo : base
Summary : NFS utilities and supporting clients and daemons for the kernel NFS server
URL : http://sourceforge.net/projects/nfs
License : MIT and GPLv2 and GPLv2+ and BSD
Description : The nfs-utils package provides a daemon for the kernel NFS server and
: related tools, which provides a much higher level of performance than the
: traditional Linux NFS server used by most users.
:
: This package also contains the showmount program. Showmount queries the
: mount daemon on a remote host for information about the NFS (Network File
: System) server on the remote host. For example, showmount can display the
: clients which are mounted on that host.
:
: This package also contains the mount.nfs and umount.nfs program.
[root@test-2 ~]# cat /etc/redhat-release
CentOS Linux release 7.7.1908 (Core)
[root@test-2 ~]# uname -a
Linux test-2 3.10.0-1062.el7.x86_64 #1 SMP Wed Aug 7 18:08:02 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
[root@test-2 ~]# cat /etc/exports
/home/nfs 192.168.0.0/24(rw,sync,no_root_squash,no_subtree_check,insecure)
K8S version: v1.17.9
Problems:
The application (a StatefulSet) running on k8s uses a PV that was dynamically provisioned by the k8s-nfs-provisioner; the PV is actually backed by a directory on the remote NFS server. The application keeps going into "CrashLoopBackOff" because it constantly hits "input/output error" when writing data to the PV after only a few seconds of running.
Meanwhile, I saw a lot of errors in /var/log/messages:
Dec 2 17:11:36 localhost kernel: NFS: nfs4_reclaim_open_state: Lock reclaim failed!
Dec 2 17:11:36 localhost kernel: NFS: nfs4_reclaim_open_state: Lock reclaim failed!
Dec 2 17:11:36 localhost kernel: NFS: nfs4_reclaim_open_state: Lock reclaim failed!
Dec 2 17:11:36 localhost kernel: NFS: nfs4_reclaim_open_state: Lock reclaim failed!
Dec 2 17:11:36 localhost kernel: NFS: nfs4_reclaim_open_state: Lock reclaim failed!
Dec 2 17:11:36 localhost kernel: NFS: nfs4_reclaim_open_state: Lock reclaim failed!
Dec 2 17:12:05 localhost kernel: NFS: nfs4_reclaim_open_state: Lock reclaim failed!
Dec 2 17:12:05 localhost kernel: NFS: nfs4_reclaim_open_state: Lock reclaim failed!
Dec 2 17:21:41 localhost kernel: NFS: nfs4_reclaim_open_state: Lock reclaim failed!
Dec 2 17:21:41 localhost kernel: NFS: nfs4_reclaim_open_state: Lock reclaim failed!
Dec 2 17:21:41 localhost kernel: NFS: nfs4_reclaim_open_state: Lock reclaim failed!
Dec 2 17:21:41 localhost kernel: NFS: nfs4_reclaim_open_state: Lock reclaim failed!
Dec 2 17:21:42 localhost kernel: NFS: nfs4_reclaim_open_state: Lock reclaim failed!
I took a tcpdump until "Lock reclaim failed" appeared in the system log, and found many NFS errors, as below:
NFS4ERR_BADSESSION (10052)
NFS4ERR_STALE_CLIENTID (10022)
NFS4ERR_NO_GRACE (10033)
I'm not sure if they're related to the "lock reclaim failed" or the "input/output" error.
I have encountered this problem on different machines from time to time and it really annoys me.
Does anyone know the root cause or how to fix it? Big thanks in advance.
Screenshots
application pod log
NFS errors in tcpdump
nfsstat -m output on k8s
nfsstat -c output on k8s; NOTE the high open_noat value.
NFS server configuration (my k8s node is 111.1.30.16)

Weave takes my node offline when I hit a container's ip

I'm running Weave with kubernetes/cni. I have a WordPress/MySQL pod running on a kube minion. When I hit the URL of the WordPress service in the browser, my node goes down (on Azure). I upgraded to 8 cores and 14 GB RAM, and now when I hit the WordPress URL I find that I can't access the internet (e.g. curl google.com), which I could do before hitting the WordPress URL.
I was curious, so I ran tail -f /var/log/syslog and found the following, which may be relevant. Note that I have an nginx pod on the same node and I can access its URL without incident; however, the following happens when I hit the WordPress installation page:
Jul 21 16:38:27 sc-minion-1 kernel: [ 6319.723678] device vethwepl4fe8074 entered promiscuous mode
Jul 21 16:38:27 sc-minion-1 kernel: [ 6319.732271] eth0: renamed from vethwepg4fe8074
Jul 21 16:38:27 sc-minion-1 systemd-udevd[8388]: Could not generate persistent MAC address for vethwepl4fe8074: No such file or directory
Jul 21 16:38:27 sc-minion-1 kernel: [ 6319.744321] IPv6: ADDRCONF(NETDEV_UP): vethwepl4fe8074: link is not ready
Jul 21 16:38:27 sc-minion-1 kernel: [ 6319.744661] IPv6: ADDRCONF(NETDEV_CHANGE): vethwepl4fe8074: link becomes ready
Jul 21 16:38:27 sc-minion-1 kernel: [ 6319.744730] weave: port 3(vethwepl4fe8074) entered forwarding state
Jul 21 16:38:27 sc-minion-1 kernel: [ 6319.744736] weave: port 3(vethwepl4fe8074) entered forwarding state
Jul 21 16:38:27 sc-minion-1 docker[883]: time="2016-07-21T16:38:27.987193241Z" level=error msg="Handler for GET /images/nginx/json returned error: No such image: nginx"
Jul 21 16:38:27 sc-minion-1 kubelet[2088]: I0721 16:38:27.987696 2088 provider.go:91] Refreshing cache for provider: *credentialprovider.defaultDockerConfigProvider
Jul 21 16:38:28 sc-minion-1 kernel: [ 6320.564156] IPv6: eth0: IPv6 duplicate address fe80::3c2b:b1ff:fe47:fe26 detected!
Jul 21 16:38:42 sc-minion-1 kernel: [ 6334.748036] weave: port 3(vethwepl4fe8074) entered forwarding state
Jul 21 16:38:42 sc-minion-1 docker[883]: time="2016-07-21T16:38:42.988502303Z" level=warning msg="Error getting v2 registry: Get https://registry-1.docker.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"
Jul 21 16:38:42 sc-minion-1 docker[883]: time="2016-07-21T16:38:42.988545804Z" level=error msg="Attempting next endpoint for pull after error: Get https://registry-1.docker.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"
Jul 21 16:39:02 sc-minion-1 docker[883]: time="2016-07-21T16:39:02.989940573Z" level=error msg="Not continuing with pull after error: Network timed out while trying to connect to https://index.docker.io/v1/repositories/library/nginx/images. You may want to check your internet connection or if you are behind a proxy."

After suspend, guest OS hangs when using Vagrant with NFS

Host OS Ubuntu 15.10
Guest OS Ubuntu 14.10
Using Vagrant with NFS, VirtualBox, and a static IP on the private network.
It works perfectly, except that after the host OS has been suspended, the entire guest OS becomes unusable.
This does not happen when using the normal VirtualBox shared folders.
It's not only the NFS shared folder that is unusable; the entire OS hangs.
Even syslog does not seem to show much activity.
This is syslog on the guest, from waking up until vagrant halt is completed.
Feb 26 07:15:33 vagrant kernel: [ 8375.252989] e1000: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
Feb 26 07:16:11 vagrant kernel: [ 8413.109832] nfs: server 192.168.33.1 not responding, still trying
Feb 26 07:16:38 vagrant kernel: [ 8440.687476] nfs: server 192.168.33.1 not responding, still trying
Feb 26 07:17:01 vagrant CRON[3776]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Feb 26 07:20:33 vagrant rsyslogd: [origin software="rsyslogd" swVersion="7.4.4" x-pid="753" x-info="http://www.rsyslog.com"] exiting on signal 15.
How can this be fixed?
How should I debug it?

Could not generate persistent MAC address for vethXXXXX: No such file or directory

I'm starting to use CoreOS (on AWS ECS). After having it launch my first container, I see this in journalctl:
Could not generate persistent MAC address for vethXXXX: No such file or directory
Here's more context. I've removed the time and instance information, but this is all within the same second. Note there are two distinct veth entries. I don't know if that means anything.
systemd[1]: Started docker container 1234
systemd[1]: Starting docker container 1234
dockerd[595]: time="2015-07-23T23:30:52Z" level=info msg="GET /v1.17/containers/1234/json"
dockerd[595]: time="2015-07-23T23:30:52Z" level=info msg="+job container_inspect(1234)"
systemd-timesyncd[473]: Network configuration changed, trying to establish connection.
systemd-udevd[7501]: Could not generate persistent MAC address for vethYYYY: No such file or directory
kernel: device vethXXXX entered promiscuous mode
kernel: IPv6: ADDRCONF(NETDEV_UP): vethXXXX: link is not ready
systemd-udevd[7508]: Could not generate persistent MAC address for vethXXXX: No such file or directory
systemd-networkd[497]: vethXXXX: Configured
kernel: eth0: renamed from vethYYYY
kernel: IPv6: ADDRCONF(NETDEV_CHANGE): vethXXXX: link becomes ready
kernel: docker0: port 2(vethXXXX) entered forwarding state
kernel: docker0: port 2(vethXXXX) entered forwarding state
systemd-networkd[497]: vethXXXX: Gained carrier
I found a discussion of this error on Ubuntu and it comes down to removing a udev rule, which doesn't seem to exist on CoreOS. There's a discussion about iptables with OpenVPN, which again doesn't seem to apply. Here's a bridge rule for LXC on Ubuntu. Again, I don't see how to apply that.
I haven't done anything with the networkd or flannel configuration. If the problems are in that area, I need specific steps on how to fix it for use in AWS ECS.
I had a similar issue with a systemd service that creates a device. I then figured out the service type was "simple" and fixed it by changing this to "forking" (Type=forking); a sketch of the change is below.
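For illustration, the change would look something like this in the unit file (the unit description and ExecStart path are placeholders):
[Unit]
Description=Create network device at boot

[Service]
# Was Type=simple; with Type=forking, systemd waits for the initial process
# to fork and exit before it considers the service started.
Type=forking
ExecStart=/usr/local/bin/create-dev.sh

[Install]
WantedBy=multi-user.target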
I found this problem on my own Ubuntu 18.04 in 2021.
The problem was /etc/docker/daemon.json containing { "iptables" : false }.
After changing it to { "iptables" : true } and restarting Docker, there were no more problems.
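In other words, the fix amounts to something like this (assuming Docker is managed by systemd):
# Edit /etc/docker/daemon.json so it contains:
#   { "iptables": true }
# then restart Docker:
sudo systemctl restart docker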

How to detect Openwrt kern.info and daemon.info events?

My background is mostly Windows programming in C and C++. Recently I've had the chance to work with some embedded Linux systems also, but I'm still new at this.
Right now I'm working on a utility for Openwrt that needs to react to network and system events that occur during normal operation.
I've been able to use Hotplug for some events, but others still elude me. I can parse the output of the system log using logread, but that seems primitive and hackish.
In particular I'd like to get a callback similar to what hotplug does for some of the 'kern.info kernel' and 'daemon.info' events. For example:
Mar 31 19:42:32 OpenWrt kern.info kernel: [ 369.540000] device wlan0 left promiscuous mode
Mar 31 19:42:32 OpenWrt kern.info kernel: [ 369.540000] br-lan: port 2(wlan0) entered disabled state
Mar 31 19:42:32 OpenWrt kern.info kernel: [ 369.730000] device wlan1 left promiscuous mode
Mar 31 19:42:32 OpenWrt kern.info kernel: [ 369.730000] br-lan: port 3(wlan1) entered disabled state
Mar 31 19:42:34 OpenWrt kern.info kernel: [ 371.360000] device wlan0 entered promiscuous mode
Mar 31 19:45:56 OpenWrt daemon.info hostapd: wlan0: STA 04:f7:e4:00:00:00 IEEE 802.11: authenticated
Mar 31 19:45:56 OpenWrt daemon.info hostapd: wlan0: STA 04:f7:e4:00:00:00 IEEE 802.11: associated (aid 1)
Mar 31 19:45:56 OpenWrt daemon.info hostapd: wlan0: STA 04:f7:e4:00:00:00 WPA: pairwise key handshake completed (WPA)
Mar 31 19:45:56 OpenWrt daemon.info hostapd: wlan0: STA 04:f7:e4:00:00:00 WPA: group key handshake completed (WPA)
Mar 31 19:45:56 OpenWrt daemon.info dnsmasq-dhcp[5005]: DHCPREQUEST(br-lan) 10.1.1.51 04:f7:e4:00:00:00
Mar 31 19:45:56 OpenWrt daemon.info dnsmasq-dhcp[5005]: DHCPNAK(br-lan) 10.1.1.51 04:f7:e4:00:00:00 wrong network
Mar 31 19:46:00 OpenWrt daemon.info dnsmasq-dhcp[5005]: DHCPDISCOVER(br-lan) 04:f7:e4:00:00:00
Mar 31 19:46:00 OpenWrt daemon.info dnsmasq-dhcp[5005]: DHCPOFFER(br-lan) 192.168.1.198 04:f7:e4:1c:09:00
Mar 31 19:46:00 OpenWrt daemon.info dnsmasq-dhcp[5005]: DHCPDISCOVER(br-lan) 04:f7:e4:1c:09:00
Mar 31 19:46:00 OpenWrt daemon.info dnsmasq-dhcp[5005]: DHCPOFFER(br-lan) 192.168.1.198 04:f7:e4:1c:09:00
Mar 31 19:46:01 OpenWrt daemon.info dnsmasq-dhcp[5005]: DHCPREQUEST(br-lan) 192.168.1.198 04:f7:e4:1c:09:00
Mar 31 19:46:01 OpenWrt daemon.info dnsmasq-dhcp[5005]: DHCPACK(br-lan) 192.168.1.198 04:f7:e4:1c:09:00 My-iPhone
Log entries like the DHCPOFFER (as seen in your example) are generated individually by the corresponding process (for example, by udhcpc) using the Unix syslog mechanism (kind of like the Windows Event Logging API).
By default on OpenWRT logging is handled by the syslogd process provided by the busybox package. This is fairly primitive and simply sends messages to the circular buffer you see using logread and/or to a UDP socket.
You can upgrade logging on OpenWRT to use the syslog-ng package. This has a much more advanced configuration and you should be able to use this to send filtered log events to a script that you can write to do what you need with them.
opkg install syslog-ng
syslog-ng is a GPL product, but the documentation is now buried beneath a commercial web site; one would hope you can get it from the source code via http://freecode.com/projects/syslog-ng. Note that OpenWRT seems to provide version 1.6.12, whose documentation I had trouble finding when I implemented it on my OpenWRT devices, but I eventually found it via the Wayback Machine: https://web.archive.org/web/20070406054439/http://www.balabit.com/products/syslog_ng/reference-1.6/syslog-ng.html/x731.html (for example).
A configuration file fragment that would pull out those DHCP messages and send them to a standalone log file would look a bit like:
source src { unix-stream("/dev/log"); internal(); };
destination dhcp_messages { file("/var/log/dhcpmessages"); };
filter f_dhcp { match("dnsmasq-dhcp"); };
log {
    source(src);
    filter(f_dhcp);
    destination(dhcp_messages);
};
You will probably find the pipe() or program() destination drivers the most useful for your application. For example, using a program() driver you could send selected messages to a shell script that parses them and saves them into an SQLite database; a sketch follows below.
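A sketch of a program() destination that hands the matching lines to a script (the script path is a placeholder; syslog-ng keeps the script running and writes one message per line to its stdin):
destination dhcp_script { program("/usr/local/bin/handle-dhcp.sh"); };
log {
    source(src);
    filter(f_dhcp);
    destination(dhcp_script);
};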
