AKS + ACI: ProviderFailed "Duplicate volumes 'kube-api-access-XXXXX' are found"

I'm using AKS with virtual nodes (ACI). Until today everything worked fine.
Today when I deploy my pod again, I get this odd error:
Reason: ProviderFailed
Message: api call to https://management.azure.com/subscriptions/blabla/resourceGroups/blabla/providers/Microsoft.ContainerInstance/containerGroups/blabla?api-version=2021-07-01: got HTTP response status code 400 error code "DuplicateVolumes": Duplicate volumes 'kube-api-access-XXXXX' are found.
Of course, the pod is failing in ACI. If I run it on the master node (a non-virtual node), it works as always.
The image is the same, the code is the same, nothing changed. I also tried a commit from last week (which I know worked) and it fails anyway.
I checked, and no volumes share the same name...
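For completeness, this is how I checked for duplicate volume names in the pod spec (a sketch; the pod name is a placeholder):
# Print every volume name in the pod spec; 'uniq -d' shows only duplicated
# names, so empty output means there are no duplicates
kubectl get pod my-pod -o jsonpath='{range .spec.volumes[*]}{.name}{"\n"}{end}' | sort | uniq -d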
I had opened a ticket about Container Group quota, and today they asked me to test again since the change would be "completed"... It didn't work, and that coincides with the first time I got this error...
Any ideas?
Thanks in advance!

Related

Azure Container Instance says it's running okay, but the URL doesn't show my app in the browser

I'm trying to get a translator service (https://github.com/LibreTranslate/LibreTranslate) running in either Azure Container Apps (ACA) or Azure Container Instances (ACI). I tried ACA first, but the image is too large: ACA only allows images up to 4 GB and this one is about 7 GB, so that didn't work out. I then tried ACI, and there it shows as running, with no errors or anything else that I can find, but when I go to the URL provided after the container is running, I get nothing except a stream timeout after a while.
Does anyone know what I am doing wrong?
I tried following the how-to guides provided by Azure and Microsoft, but they haven't helped so far.
I have tried to repro the same in my lab environment and got the below results.
Step 1: Clone the repo.
Step 2: To test building and running the image on the local machine, run the shell script run.sh. Refer to the README.md file in the source code repo for more information. Once the script has run successfully, check the images and containers that were provisioned.
Step 3: Create a tag for the image libretranslate/libretranslate and push it to ACR.
Step 4: Create an ACI instance with the below configuration:
CPU cores: 2
Memory: 2 GiB
Port: 5000, TCP
Once the container instance is provisioned, wait at least 2-5 minutes and then verify <public-IP>:5000.
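For reference, a minimal sketch of the corresponding CLI call (the resource group, instance name, registry, and credentials are all placeholders):
# Create the container instance with a public IP and TCP port 5000
az container create \
  --resource-group my-rg \
  --name libretranslate-aci \
  --image myacr.azurecr.io/libretranslate/libretranslate:latest \
  --registry-username <acr-username> \
  --registry-password <acr-password> \
  --cpu 2 \
  --memory 2 \
  --ports 5000 \
  --protocol TCP \
  --ip-address Public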

Airflow can't reach logs from webserver due to 403 error

I use Apache Airflow for daily ETL jobs. I installed it in Azure Kubernetes Service using the provided Helm chart. It had been running fine for half a year, but recently I became unable to access the logs in the webserver (this always used to work fine).
I'm getting the following error:
*** Log file does not exist: /opt/airflow/logs/dag_id=analytics_etl/run_id=manual__2022-09-26T09:25:50.010763+00:00/task_id=copy_device_table/attempt=18.log
*** Fetching from: http://airflow-worker-0.airflow-worker.default.svc.cluster.local:8793/dag_id=analytics_etl/run_id=manual__2022-09-26T09:25:50.010763+00:00/task_id=copy_device_table/attempt=18.log
*** !!!! Please make sure that all your Airflow components (e.g. schedulers, webservers and workers) have the same 'secret_key' configured in 'webserver' section and time is synchronized on all your machines (for example with ntpd) !!!!!
****** See more at https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#secret-key
****** Failed to fetch log file from worker. Client error '403 FORBIDDEN' for url 'http://airflow-worker-0.airflow-worker.default.svc.cluster.local:8793/dag_id=analytics_etl/run_id=manual__2022-09-26T09:25:50.010763+00:00/task_id=copy_device_table/attempt=18.log'
For more information check: https://httpstatuses.com/403
What have I tried:
I've made sure that the log file exists (I can exec into the airflow-worker-0 pod and read the file on command line in the location specified in the error).
I've rolled back my deployment to an earlier commit from when I know for sure it was still working, but it made no difference.
I was using webserverSecretKeySecretName in the values.yaml configuration. I changed the secret that name was pointing to (deleted it and created a new one, as described here: https://airflow.apache.org/docs/helm-chart/stable/production-guide.html#webserver-secret-key), but it didn't help (no difference, same error); see the sketch after this list.
I changed the config to use a plain-text webserverSecretKey instead; no difference.
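The secret recreation from the production guide looked roughly like this (the secret name is a placeholder; use whatever your values.yaml points at):
# Generate a fresh key and recreate the secret referenced by webserverSecretKeySecretName
kubectl delete secret my-webserver-secret
kubectl create secret generic my-webserver-secret \
  --from-literal="webserver-secret-key=$(python3 -c 'import secrets; print(secrets.token_hex(16))')"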
My thoughts/observations:
The error states that the log file doesn't exist, but that's not true. It probably just can't access it.
The time is the same in all pods (I double checked by exec-ing into them and typing date on the command line).
The webserver secret is the same in the worker, the scheduler, and the webserver (I double checked by exec-ing into them and finding the corresponding env variable); see also the sketch below.
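A quicker way to compare the effective key across components is to ask Airflow itself (a sketch; the pod names are placeholders and depend on your release):
# Print the resolved secret_key in each component; all outputs should match
for pod in airflow-webserver-0 airflow-scheduler-0 airflow-worker-0; do
  kubectl exec "$pod" -- airflow config get-value webserver secret_key
done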
Any ideas?
Turns out this was a known bug in the latest Airflow release (2.4.0), which the official Helm chart had started deploying, reported here:
https://github.com/apache/airflow/discussions/26490
It should be resolved in Airflow 2.4.1, which should be available in the next couple of days.
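Once 2.4.1 is published, picking it up should just be a matter of bumping the chart values (a sketch, assuming the release is named airflow):
# Pin the Airflow version the official chart deploys
helm upgrade airflow apache-airflow/airflow \
  --set airflowVersion=2.4.1 \
  --set defaultAirflowTag=2.4.1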

"Error: Key not loaded" in h2o deployed through a K3s cluster, using python3 client

I can confirm the 3-replica cluster of h2o inside K3s is correctly deployed, as running h2o.init(ip="x.x.x.x") in the Python 3 interpreter works as expected. I followed the instructions noted here: https://www.h2o.ai/blog/running-h2o-cluster-on-a-kubernetes-cluster/
Nevertheless, I had to modify the service.yaml and comment out the line which says clusterIP: None, as K3s was complaining about its inability to set the clusterIP to None. Even so, I can certify it is working correctly, and I am able to use an external IP to connect to the cluster.
If I try to load the dataset using the h2o cluster inside the K3s cluster using the exact same steps as described here http://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html, this is the output that I get:
>>> train = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
...
h2o.exceptions.H2OResponseError: Server error java.lang.IllegalArgumentException:
Error: Key not loaded: Key<Frame> https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv
Request: POST /3/ParseSetup
data: {'check_header': '0', 'source_frames': '["https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv"]'}
The same error occurs if I use the h2o.upload_file("x.csv") method.
There is a clue about what may be happening here: Key not loaded: Key<Frame> while POSTing source frame through ParseSetup in H2O API call, but I am not using curl, and I cannot find any parameter that could help me overcome this issue: http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/h2o.html?highlight=import_file#h2o.import_file
I need to use the Python client inside the same K3s cluster for various technical reasons, so I am unable to launch either Flow or Firebug to find out what may be happening.
I can confirm it is working correctly when I simply issue a h2o.init(), using the local Java instance.
UPDATE 1:
I have tried in different K3s clusters without success. I changed the service.yaml to a NodePort, and now this is the error traceback:
>>> train = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
...
h2o.exceptions.H2OResponseError: Server error java.lang.IllegalArgumentException:
Error: Job is missing
Request: GET /3/Jobs/$03010a2a016132d4ffffffff$_a2366be93ec99a78d7bc161de8c54d67
UPDATE 2:
I have tried using different services (NodePort, LoadBalancer, ClusterIP) and none of them work. I have also tried using Minikube with the official image, and with a custom image made by me, without success. I suspect this is something related to either h2o itself, or the clustering between pods. I will keep digging; hopefully there is some gold to be found.
UPDATE 3:
I also found out that the post about running H2O in Docker is really outdated (https://www.h2o.ai/blog/h2o-docker/), and the Dockerfile present on GitHub does not work either (I changed it to uncomment the ENTRYPOINT section, without success): https://github.com/h2oai/h2o-3/blob/master/Dockerfile
Even so, the custom image I built for h2o-k8s works seamlessly in pure Docker. I am wondering why it still does not work in K8s...
UPDATE 4:
I have tried modifying the environment variable called H2O_KUBERNETES_SERVICE_DNS without success.
In the meantime, the cluster became unavailable; that is, the readiness probes would not complete successfully. No matter what I change now, it does not work.
I spun up a local K3d cluster to see what happened, and surprisingly, the readiness probes were not failing there, using v3.30.0.6. Then I started testing with R instead of Python, and I am glad I did, because it may have pinpointed what was wrong: there is a version mismatch between the client and the server. So I updated the image accordingly, to v3.30.0.1.
But now the readiness probe is again not working in my k3d cluster, so I am unable to test it.
It seems it is working now: R client version 3.30.0.1 with server version 3.30.0.1. I also tried Python client version 3.30.0.7 with server version 3.30.0.7 and it started working. Marvelous. The problem was caused by a version mismatch between the client and the server, as the Python client had been updated to 3.30.0.7 while the latest server image for Docker was 3.30.0.6.
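For anyone hitting the same thing, aligning the versions is just a matter of pinning the Python client to the server build (version numbers as in my setup):
# Install the Python client matching the server version reported by h2o.init()
pip install h2o==3.30.0.7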

Azure: what could be the cause of the error "Unable to edit or replace deployment"?

When I recreated my VM, I got the following error:
Problem occurred during request to Azure services. Cloud provider details: Unable to edit or replace deployment 'VM-Name': previous deployment from '8/20/2019 6:20:33 AM' is still active (expiration time is '8/27/2019 5:17:41 AM'). Please see https://aka.ms/arm-deploy for usage details.
Please help me understand: what could be the cause of this error?
UPDATED:
This deployment has not been started previously.
Prior to this, errors were received during creation:
Azure is not available now. Please Try again later
There were several such errors, one at a time, and then I got the error:
Unable to edit or replace deployment
Here are my assumptions; tell me whether I am right:
I launched the image, then after some time I recreated it.
Creation began, but at that moment the connection with Azure was lost.
Then, when the connection was restored, it tried to run a deployment that had not been removed in the previous attempt (because there was no connection with Azure).
As a result, I got this error.
Does this theory make sense?
It's exactly what it says: there is another deployment with the same name in progress at this time. Either change the name of the deployment you are trying to queue, or wait for the other deployment to finish or fail.
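If waiting is not an option, the in-flight deployment can also be cancelled explicitly (a sketch for a resource-group-scoped deployment; the group and deployment names are placeholders):
# Find deployments still running in the resource group
az deployment group list --resource-group my-rg \
  --query "[?properties.provisioningState=='Running'].name" -o tsv
# Cancel the one that is blocking you
az deployment group cancel --resource-group my-rg --name VM-Name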
This can also occur if you use Bicep templates for your ARM deployment and multiple modules or resources in the template have the same name:
module fooModule '../modules/foo.bicep' = {
  name: 'foo'
}
module barModule '../modules/bar.bicep' = {
  name: 'foo'
}
I got the same error. Initially the pipeline was working, but when it was retriggered the pipeline took more time, so I canceled the deployment and made a fresh rerun, which hit this error. I think I need to wait until that canceled deployment has failed.

Errors trying to resize an instance group with the AWS CLI

I have a standing EMR cluster and a daily job I want to run. I was trying to use the AWS CLI to resize the cluster, with the plan of adding this to the crontab so the cluster would grow and then shrink later. (I don't have the ability to use auto-scaling, so that's out.)
I have read the Amazon documentation and the examples they give don't work. I've tried the natural variations, but end up getting nowhere.
According to the documentation, the command is
aws emr modify-instance-groups --instance-groups InstanceGroupId=ig-31JXXXXXXBTO,InstanceCount=4
However, when I try this with my own instance group ID, I get:
Error parsing parameter '--instance-groups': Expected: '<second>', received: '<none>' for input:InstanceGroupId=ig-31JXXXXXXBTO,
I've tried doing things like removing the instance count, hoping to be pointed at more documentation...
aws emr modify-instance-groups --instance-groups InstanceGroupId=ig-WCXEP0AXCGJS
which gives the response
An error occurred (ValidationException) when calling the ModifyInstanceGroups operation: Please provide either an instance count or a list of EC2 instance ids to terminate.
I've tried several variations without luck. Any ideas? Thanks.
I ended up submitting a trouble ticket through Amazon.
The resize command requires that no space occur after the comma. The troubleshooter has reported this behavior and the unhelpful error to the developers.
aws emr modify-instance-groups --instance-groups InstanceGroupId=ig-31JXXXXXXBTO,InstanceCount=4
will work, as long as there is no space after the comma. Hopefully they'll either fix that or provide a better error message.
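Another way to sidestep the shorthand parser entirely is to pass the instance groups as JSON (a sketch with a placeholder group ID):
# JSON syntax avoids the comma/space parsing quirk of the shorthand form
aws emr modify-instance-groups \
  --instance-groups '[{"InstanceGroupId":"ig-XXXXXXXXXXXX","InstanceCount":4}]'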
