AKS nodes failed provisioning - Azure

So I have an AKS cluster in a DEV environment which was working fine. Today I noticed that some pods that were being removed/uninstalled via Helm were stuck in the Terminating state.
I found out that none of the 3 nodes were ready. When I stopped the cluster and started it again, the VMs failed to create in the VMSS with the associated message:
VM has reported a failure when processing extension 'vmssCSE'. Error message: "Enable failed: failed to execute command: command terminated with exit status=50
From what I have found, it looks like the VMs in the scale set are missing outbound internet connectivity; however, the associated NSG has only the default rules.
When inspecting the VMSS status, it says the following:
VM has reported a failure when processing extension 'vmssCSE'. Error message: "Enable failed: failed to execute command: command terminated with exit status=50 [stdout] [stderr] nc: connect to mcr.microsoft.com port 443 (tcp) failed: Connection timed out Command exited with non-zero status 1 0.00user 0.00system 2:10.07elapsed 0%CPU (0avgtext+0avgdata 2360maxresident)k 0inputs+8outputs (0major+113minor)pagefaults 0swaps " More information on troubleshooting is available at https://aka.ms/VMExtensionCSELinuxTroubleshoot
This troubleshooting doesn't seem to be helpful as it states:
When restricting egress traffic from an AKS cluster, there are required and optional recommended outbound ports / network rules and FQDN / application rules for AKS. If your settings are in conflict with any of these rules, certain kubectl commands won't work correctly. You may also see errors when creating an AKS cluster.
Verify that your settings aren't conflicting with any of the required or optional recommended outbound ports / network rules and FQDN / application rules.
But the default rules have not changed, so I'm lost at this point.
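For reference, one way to check outbound connectivity directly from a scale-set instance without SSH is run-command (a sketch; the resource group and VMSS names below are placeholders for your node resource group):

# Test reachability of the endpoints the vmssCSE extension needs
az vmss run-command invoke \
  --resource-group MC_myResourceGroup_myCluster_westeurope \
  --name aks-nodepool1-12345678-vmss \
  --instance-id 0 \
  --command-id RunShellScript \
  --scripts "nc -zvw5 mcr.microsoft.com 443; nc -zvw5 management.azure.com 443"

If those time out as well, the block is probably outside the NSG, e.g. a custom route table (UDR) on the subnet sending 0.0.0.0/0 to a firewall, or a missing outbound method (load balancer or NAT gateway).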

Related

Proxmox - migration fails with exit code 255

Attempting to migrate a container between Proxmox nodes failed, reporting that the following command exited with code 255:
TASK ERROR: command '/usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=violet' root@172.20.20.1 pvecm mtunnel -migration_network 172.20.20.1/16 -get_migration_ip' failed: exit code 255
Running the command manually shows the underlying error message:
could not get migration ip: multiple, different, IP address configured for network 172.20.20.1/16
It turns out I had a second interface on the target host configured on the same network (e.g., eno1 had 172.20.0.1 and eno3 had 172.20.20.3). Removing/disabling one of these interfaces resolved the issue.
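For anyone hitting the same thing, a quick way to spot and clear the duplicate (the interface name here is only an example):

# List IPv4 addresses; look for two interfaces inside the migration network
ip -4 addr show

# Temporarily take the extra interface down, then make the change
# permanent in /etc/network/interfaces
ip link set eno3 down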

KYPO deployment failure: OpenStack returning "No valid host was found"

I have deployed DevStack for my OpenStack using the default configuration and am trying to deploy KYPO. I am running ./create-base.sh and getting the following error:
[kypo-proxy-jump-stack]: CREATE_FAILED Resource CREATE failed: ResourceInError: resources.kypo-proxy-jump: Went to status ERROR due to "Message: No valid host was found. , Code: 500"
[kypo-proxy-jump-stack.kypo-proxy-jump]: CREATE_FAILED ResourceInError: resources.kypo-proxy-jump: Went to status ERROR due to "Message: No valid host was found. , Code: 500"
My DevStack config (contents of local.conf):
[[local|localrc]]
#Enable heat services
enable_service h-eng h-api h-api-cfn h-api-cw
[[local|localrc]]
#Enable heat plugin
enable_plugin heat https://opendev.org/openstack/heat
IMAGE_URL_SITE="https://download.fedoraproject.org"
IMAGE_URL_PATH="/pub/fedora/linux/releases/33/Cloud/x86_64/images/"
IMAGE_URL_FILE="Fedora-Cloud-Base-33-1.2.x86_64.qcow2"
IMAGE_URLS+=","$IMAGE_URL_SITE$IMAGE_URL_PATH$IMAGE_URL_FILE
There is a workaround: you need to reduce the size of the kypo-proxy-jump flavor.
Something like this:
openstack flavor create --ram 2048 --disk 10 --vcpus 1 standard.medium
However, check your OpenStack resources and logs first; the root cause is most likely a lack of resources (disk, memory, or CPU) on the hypervisor.
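For example, you can compare the flavor's requirements against what the hypervisor actually has free, and check why the scheduler refused (command availability and the systemd unit name may vary with your DevStack version):

# Aggregate free/used resources across all hypervisors
openstack hypervisor stats show

# Per-hypervisor detail
openstack hypervisor list

# The scheduler's reason for rejecting the request
sudo journalctl -u devstack@n-sch | grep -i "no valid host"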

Error: Rotate certificates in Azure Kubernetes Service (AKS)

I used this link to rotate certificates in AKS: https://learn.microsoft.com/en-us/azure/aks/certificate-rotation. The certificates were updated, but my cluster is now in a failed state, and because of this my application is down.
I get the error below when running az aks rotate-certs -g $RESOURCE_GROUP_NAME -n $CLUSTER_NAME:
ERROR: "error": { "code": "ErrorCodeRotateClusterCertificates", "message": "VMASAgentPoolReconciler retry failed: Category: ClientError; SubCode: OutboundConnFailVMExtensionError; Dependency: Microsoft.Compute/virtualMachines/extensions; OrginalError: Code=\"VMExtensionProvisioningError\" Message=\"VM has reported a failure when processing extension 'cse-agent-0'. Error message: \\\"Enable failed: failed to execute command: command terminated with exit status=50\\n[stdout]\\n\\n[stderr]\\ncurl: option --proxy-insecure: is unknown\\ncurl: try 'curl --help' or 'curl --manual' for more information\\nCommand exited with non-zero status 2\\n0.00user 0.00system 0:00.00elapsed 100%!!(MISSING)C(string=VMAS agent pools reconciling)PU (0avgtext+0avgdata 7044maxresident)k\\n0inputs+8outputs (0major+372minor)pagefaults 0swaps\\n\\\"\\r\\n\\r\\nMore information on troubleshooting is available at https://aka.ms/VMExtensionCSELinuxTroubleshoot \"; AKSTeam: NodeProvisioning, Retriable: false" } }
Kubernetes version: 1.14.8
Please help me resolve this issue.
What version of Ubuntu are you running on your nodes? From that error (the node's curl is too old to recognize the --proxy-insecure option), I'm guessing Ubuntu 16.04 or older.
I'm not sure if it will work, but instead of trying to rotate certificates, can you try upgrading the nodes?
You might also want to consider just creating a new cluster, and using VMSS instead of VMAS.
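If you want to try the upgrade path, it would look roughly like this (the target version is only a placeholder; list what's actually available first):

# See which Kubernetes versions the cluster can upgrade to
az aks get-upgrades -g $RESOURCE_GROUP_NAME -n $CLUSTER_NAME --output table

# Upgrade the control plane and nodes to a version from that list
az aks upgrade -g $RESOURCE_GROUP_NAME -n $CLUSTER_NAME --kubernetes-version <target-version>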

chef-server-ctl reconfigure / Creating an admin user on the Chef server

I am fairly new to Linux (and brand new to Chef) and I have run into an issue when setting up my Chef server. I am trying to create an admin user with the command
sudo chef-server-ctl user-create admin Admin Ladmin admin@example.com \
  examplepass -f admin.pem
but I keep getting this error:
ERROR: Connection refused connecting...
ERROR: Connection refused connecting to https://127.0.0.1/users/, retry 5/5
ERROR: Network Error: Connection refused - Connection refused connecting to https://..., giving up
Check your knife configuration and network settings
I also noticed that when I ran chef-server-ctl I got this output:
[2016-12-21T13:24:59-05:00] ERROR: Running exception handlers
Running handlers complete
[2016-12-21T13:24:59-05:00] ERROR: Exception handlers complete
Chef Client failed. 0 resources updated in 01 seconds
[2016-12-21T13:24:59-05:00] FATAL: Stacktrace dumped to /var/opt/opscode/local-mode-cache/chef-stacktrace.out
[2016-12-21T13:24:59-05:00] FATAL: Please provide the contents of the stacktrace.out file if you file a bug report
[2016-12-21T13:24:59-05:00] FATAL: Chef::Exceptions::CannotDetermineNodeName: Unable to determine node name: configure node_name or configure the system's hostname and fqdn
I read that this error is due to a prerequisite mistake, but I'm uncertain what it means or how to fix it, so any input would be greatly appreciated.
Your server does not have a valid FQDN (i.e., a fully qualified host name). You'll have to fix this before installing the Chef server.
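A minimal sketch of fixing that on a systemd-based distribution (the hostname is only an example):

# Give the machine a fully qualified hostname
sudo hostnamectl set-hostname chef.example.com

# Make it resolvable locally
echo "127.0.1.1 chef.example.com chef" | sudo tee -a /etc/hosts

# Verify: both should print without errors
hostname && hostname -f

# Then re-run the setup
sudo chef-server-ctl reconfigure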

Failing to connect to the master with Spark on Google Compute Engine

I am trying out a Hadoop/Spark cluster on Google Compute Engine through the "Launch click-to-deploy software" feature.
I have created 1 master and 2 slave nodes, and I can launch spark-shell on the cluster, but when I try to launch spark-shell from my own computer, it fails.
I launch:
./bin/spark-shell --master spark://IP or Hostname:7077
And I get this stack trace:
15/04/09 10:58:06 INFO AppClient$ClientActor: Connecting to master akka.tcp://sparkMaster@IP or Hostname:7077/user/Master...
15/04/09 10:58:06 WARN AppClient$ClientActor: Could not connect to akka.tcp://sparkMaster@IP or Hostname:7077: akka.remote.InvalidAssociation: Invalid address: akka.tcp://sparkMaster@IP or Hostname:7077
15/04/09 10:58:06 WARN Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkMaster@IP or Hostname:7077]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: IP or Hostname: unknown error
Please let me know how to overcome this problem.
See the comment from Daniel Darabos. By default, all incoming connections are blocked except for SSH, RDP, and ICMP. To be able to connect from the Internet to the Hadoop master instance, you must first open port 7077 for the 'hadoop-master' tag in your project:
gcloud compute --project PROJECT firewall-rules create allow-spark \
--allow TCP:7077 \
--target-tags hadoop-master
See Firewalls, Adding a firewall, and gcloud compute firewall-rules create in the GCE public documentation for further details and all the options.
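To verify the rule took effect (placeholders for your project and the master's external IP):

# Confirm the rule exists
gcloud compute firewall-rules list --project PROJECT

# From your local machine, the Spark master port should now answer
nc -zv MASTER_EXTERNAL_IP 7077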
