Airflow webserver can't fetch task logs from worker due to 403 error - Azure

I use Apache Airflow for daily ETL jobs. I installed it in Azure Kubernetes Service using the provided Helm chart. It has been running fine for half a year, but recently I've become unable to access the task logs in the webserver (this always worked before).
I'm getting the following error:
*** Log file does not exist: /opt/airflow/logs/dag_id=analytics_etl/run_id=manual__2022-09-26T09:25:50.010763+00:00/task_id=copy_device_table/attempt=18.log
*** Fetching from: http://airflow-worker-0.airflow-worker.default.svc.cluster.local:8793/dag_id=analytics_etl/run_id=manual__2022-09-26T09:25:50.010763+00:00/task_id=copy_device_table/attempt=18.log
*** !!!! Please make sure that all your Airflow components (e.g. schedulers, webservers and workers) have the same 'secret_key' configured in 'webserver' section and time is synchronized on all your machines (for example with ntpd) !!!!!
****** See more at https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#secret-key
****** Failed to fetch log file from worker. Client error '403 FORBIDDEN' for url 'http://airflow-worker-0.airflow-worker.default.svc.cluster.local:8793/dag_id=analytics_etl/run_id=manual__2022-09-26T09:25:50.010763+00:00/task_id=copy_device_table/attempt=18.log'
For more information check: https://httpstatuses.com/403
What have I tried:
I've made sure that the log file exists (I can exec into the airflow-worker-0 pod and read the file on command line in the location specified in the error).
I've rolled back my deployment to an earlier commit from when I know for sure it was still working, but it made no difference.
I was using webserverSecretKeySecretName in the values.yaml configuration. I deleted the secret that name points to and created a new one, as described here: https://airflow.apache.org/docs/helm-chart/stable/production-guide.html#webserver-secret-key, but it made no difference (same error).
I also changed the config to use webserverSecretKey instead (in plain text); still no difference.
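For reference, recreating the secret boiled down to something like this (a sketch; "my-webserver-secret" is a placeholder for whatever name webserverSecretKeySecretName points to, and the "webserver-secret-key" key name is the one the production guide uses):
# delete the old secret and create a fresh one with a random key
kubectl delete secret my-webserver-secret
kubectl create secret generic my-webserver-secret \
  --from-literal="webserver-secret-key=$(python3 -c 'import secrets; print(secrets.token_hex(16))')"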
My thoughts/observations:
The error states that the log file doesn't exist, but that's not true. It probably just can't access it.
The time is the same in all pods (I double-checked by exec-ing into them and running date on the command line).
The webserver secret is the same in the worker, the scheduler, and the webserver (I double-checked by exec-ing into them and inspecting the corresponding env variable).
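For reference, those last two checks amounted to roughly this (a sketch; the pod/deployment names come from the default chart release name, and the exact env variable name may differ depending on how the chart injects the secret):
# compare clocks across components
kubectl exec airflow-worker-0 -- date
kubectl exec deploy/airflow-webserver -- date
# compare the secret key each component sees
kubectl exec airflow-worker-0 -- printenv AIRFLOW__WEBSERVER__SECRET_KEY
kubectl exec deploy/airflow-webserver -- printenv AIRFLOW__WEBSERVER__SECRET_KEY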
Any ideas?

Turns out this was a known bug in the latest Airflow release (2.4.0) deployed by the official Helm chart, reported here:
https://github.com/apache/airflow/discussions/26490
It should be resolved in version 2.4.1, which should be available within the next couple of days.

Related

RHEL osbuild-composer system repository override is not working

As per the documentation (https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/composing_a_customized_rhel_system_image/managing-repositories_composing-a-customized-rhel-system-image), I tried to override the system repository with a custom base URL, but the blueprint depsolve shows the error below:
# composer-cli blueprints depsolve Test1-blueprint
2022-06-09 08:06:58,841: Test1-blueprint: This system does not have any valid subscriptions. Subscribe it before specifying rhsm: true in sources.
And after the next service restart, osbuild-composer does not start:
ERROR: Info Error: Get "http://localhost/api/v1/projects/source/info/appstream": dial unix /run/weldr/api.socket: connect: connection refused
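For context, the override was set up roughly like this, following the linked doc (a sketch; rhel-8.json is the RHEL 8 default file name and may differ on other releases):
# copy the stock repository definition and point its baseurl at the custom repo
mkdir -p /etc/osbuild-composer/repositories
cp /usr/share/osbuild-composer/repositories/rhel-8.json /etc/osbuild-composer/repositories/
# edit the copied file, then restart the services so they pick it up
systemctl restart osbuild-composer.socket osbuild-composer.service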
Am I missing something here?
Having all manner of issues with this myself. A trawl of my /var/log/messages file suggests that, for me at least, osbuild-composer is failing to start because /etc/osbuild-composer/osbuild-composer.toml does not exist. The actual error is "permission denied", but the file simply isn't there.
This is on RHEL 8.5; I updated to 8.6 this morning and have the same problem.
Edit: I've removed everything and reverted to using the lorax backend, as per chapter 2.2 in the linked doc (the same one I was following). My 'composer-cli compose types' command now at least works. Fingers crossed.
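For anyone else digging into this, the checks I ran before reverting boiled down to roughly this (a sketch; unit and file names are the package defaults on my box):
# service state and recent logs (same errors as in /var/log/messages)
systemctl status osbuild-composer.socket osbuild-composer.service
journalctl -u osbuild-composer --since "-1h"
# the config file the error complains about
ls -l /etc/osbuild-composer/osbuild-composer.toml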

"Error: Key not loaded" in h2o deployed through a K3s cluster, using python3 client

I can confirm the 3-replica h2o cluster inside K3s is correctly deployed, as running h2o.init(ip="x.x.x.x") in the Python3 interpreter works as expected. I followed the instructions here: https://www.h2o.ai/blog/running-h2o-cluster-on-a-kubernetes-cluster/
Nevertheless, I had to modify service.yaml and comment out the line that says clusterIP: None, as K3s was complaining about being unable to set the clusterIP to None. Even so, I can confirm it works correctly, and I am able to use an external IP to connect to the cluster.
If I try to load the dataset into the h2o cluster inside the K3s cluster, following the exact steps described here: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html, this is the output I get:
>>> train = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
...
h2o.exceptions.H2OResponseError: Server error java.lang.IllegalArgumentException:
Error: Key not loaded: Key<Frame> https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv
Request: POST /3/ParseSetup
data: {'check_header': '0', 'source_frames': '["https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv"]'}
The same error occurs if I use the h2o.upload_file("x.csv") method.
There is a clue about what may be happening here: "Key not loaded: Key<Frame> while POSTing source frame through ParseSetup in H2O API call", but I am not using curl, and I cannot find any parameter that could help me overcome this issue: http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/h2o.html?highlight=import_file#h2o.import_file
I need to use the Python client inside the same K3s cluster for various technical reasons, so I am not able to launch either Flow or Firebug to see what may be happening.
I can confirm it works correctly when I simply issue an h2o.init() using the local Java instance.
UPDATE 1:
I have tried in different K3s clusters without success. I changed the service.yaml to a NodePort, and now this is the error traceback:
>>> train = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
...
h2o.exceptions.H2OResponseError: Server error java.lang.IllegalArgumentException:
Error: Job is missing
Request: GET /3/Jobs/$03010a2a016132d4ffffffff$_a2366be93ec99a78d7bc161de8c54d67
UPDATE 2:
I have tried using different services (NodePort, LoadBalancer, ClusterIP) and none of them work. I have also tried using Minikube, both with the official image and with a custom image I built, without success. I suspect this is related either to h2o itself or to the clustering between pods. I will keep digging and hopefully turn up something useful.
UPDATE 3:
I also found out that the post about running H2O in Docker (https://www.h2o.ai/blog/h2o-docker/) is really outdated, and the Dockerfile on GitHub does not work either (I changed it to uncomment the ENTRYPOINT section, without success): https://github.com/h2oai/h2o-3/blob/master/Dockerfile
Even so, I tried the custom image I built for h2o-k8s, and it works seamlessly in pure Docker. I am wondering why it still does not work in K8s...
UPDATE 4:
I have tried modifying the environment variable called H2O_KUBERNETES_SERVICE_DNS without success.
In the meantime, the cluster became unavailable; that is, the readinessProbes would not complete successfully. No matter what I change now, it does not work.
I spun up a local K3d cluster to see what happened, and surprisingly, the readinessProbes were not failing with v3.30.0.6. I then started testing with R instead of Python, and I am glad I did, because it may have pinpointed what was wrong: there is a version mismatch between the client and the server. So I updated the image accordingly to v3.30.0.1.
But once again, the readinessProbe is failing in my k3d cluster, so I am unable to test it.
It seems to be working now: R client version 3.30.0.1 with server version 3.30.0.1. I also tried Python client 3.30.0.7 with server 3.30.0.7 and it started working. Marvelous. The problem was caused by a version mismatch between the client and the server, as the Python client had been updated to 3.30.0.7 while the latest server image for Docker was 3.30.0.6.
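For reference, checking and pinning the client boiled down to something like this (a sketch; the version number is simply the one my server image was running):
# client version; h2o.init() also prints the cluster version on connect, and the two should match
python3 -c "import h2o; print(h2o.__version__)"
# pin the client to the server's version
pip3 install "h2o==3.30.0.7"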

Unable to start Kudu master

While starting kudu-master, I am getting the error below and am unable to start the Kudu cluster.
F0706 10:21:33.464331 27576 master_main.cc:71] Check failed: _s.ok() Bad status: Invalid argument: Unable to initialize catalog manager: Failed to initialize sys tables async: on-disk master list (hadoop-master:7051, slave2:7051, slave3:7051) and provided master list (:0) differ. Their symmetric difference is: :0, hadoop-master:7051, slave2:7051, slave3:7051
It is a cluster of 8 nodes, and I have specified 3 masters in master.gflagfile on the master nodes, as shown below.
--master_addresses=hadoop-master,slave2,slave3
TL;DR
If this is a new installation, and assuming the master IP addresses are correct, I believe the easiest solution is the following (a rough sketch follows these steps):
Stop kudu masters
Nuke the <kudu-data-dir>/master directory
Start kudu masters
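A rough sketch of those steps (assuming a packaged install where the master runs as a service; <kudu-data-dir> is whatever your --fs_wal_dir / --fs_data_dirs flags point at):
# on every master node
sudo service kudu-master stop
# wipe the master's on-disk state; only safe on a fresh install with nothing to keep
sudo rm -rf <kudu-data-dir>/master
sudo service kudu-master start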
Explanation
I believe the most common (if not only) cause of this error (Failed to initialize sys tables async: on-disk master list (hadoop-master:7051, slave2:7051, slave3:7051) and provided master list (:0) differ.) is a kudu master node getting added incorrectly. The error suggests that kudu-master thinks it's running on a single node rather than a 3-node cluster.
Maybe you did not intend to "add a node", but that's most likely what happened. I'm saying this because I had the same problem; after some googling and debugging, I discovered that during the installation I had started kudu-master before putting the correct IP addresses in master.gflagfile, so kudu-master was spun up thinking it was running on a single node, not a 3-node cluster. Using the steps above to cleanly reinstall kudu-master solved my problem.

Can't backup to S3 with OpsCenter 5.2.1

I upgraded OpsCenter from 5.1.3 to 5.2.0 (and then to 5.2.1). I had a scheduled backup to a local server and an S3 location configured before the upgrade, which worked fine with OpsCenter 5.1.3. I made no changes to the scheduled backup during or after the upgrade.
The day after the upgrade, the S3 backup failed. In opscenterd.log, I see these errors:
2015-09-28 17:00:00+0000 [local] INFO: Instructing agents to start backups at Mon, 28 Sep 2015 17:00:00 +0000
2015-09-28 17:00:00+0000 [local] INFO: Scheduled job 458459d6-d038-41b4-9094-7d450e4bac6f finished
2015-09-28 17:00:00+0000 [local] INFO: Snapshots started on all nodes
2015-09-28 17:00:08+0000 [] WARN: Marking request d960ad7b-2ccd-40a4-be7e-8351ac038c53 as failed: {'sstables': {u'solr_admin': {u'solr_resources': {'total_size': 155313, 'total_files': 12, 'done_files': 0, 'errors': [u'{:type :opsagent.backups.destinations/destination-not-found, :message "Destination missing: 62f5a26abce7463bad9deb7380979c4a"}', u'{:type :opsagent.backups.destinations/destination-not-found, :message "Destination missing: 62f5a26abce7463bad9deb7380979c4a"}', u'{:type :opsagent.backups.destinations/destination-not-found, :message "Destination missing: 62f5a26abce7463bad9deb7380979c4a"}', shortened for brevity.
The S3 location no longer appears in OpsCenter when I edit the scheduled backup job. When I try to re-add the S3 location, using the same bucket and credentials as before, I get the following error:
Location validation error: Call to /local/backups/destination_validate timed out.
Also, I don't know if this is related, but for completeness, I see some of these errors in the opscenterd.log as well:
WARN: No http agent exists for definition file update. This is likely due to SSL import failure.
I get this behavior with either DataStax Enterprise 4.5.1 or 4.7.3.
I have been having the exact same problem since updating to OpsCenter 5.2.x and was just able to get it working properly.
I removed all the settings suggested in the previous answer and then created new buckets in us-west-1, us-west-2 and us-standard. After this I was able to add all of those as destinations quickly and easily.
It appears to me that OpsCenter may be trying to list the objects in the bucket you initially configure; in my case, the two existing buckets we were using held 11 TB and 19 GB of data respectively.
This could explain why increasing the timeout for some worked and not others.
Hope this helps.
Try adding the remote_backup_region property to the cluster configuration file under the [agents] heading in "cluster-name".conf. Valid values are: us-standard, us-west-1, us-west-2, eu-west-1, ap-northeast-1, ap-southeast-1
Does that help?
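Something like the following, where the file path is the usual default for a package install and the region is just an example (a sketch):
# /etc/opscenter/clusters/<cluster-name>.conf
[agents]
remote_backup_region = us-west-2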
The problem was resolved by a combination of 2 things.
Delete the entire contents of the existing S3 bucket (or create a new bucket as previously suggested by #kaveh-nowroozi).
Edit /etc/datastax-agent/datastax-agent-env.sh and increase the heap size to 512M as suggested by a DataStax engineer. The default was set at 128M and I kept doubling it until backups became successful.
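For reference, the change in datastax-agent-env.sh was essentially this (a sketch; the exact variable layout in the file may differ between agent versions):
# /etc/datastax-agent/datastax-agent-env.sh
# default was -Xmx128M; kept doubling it until backups succeeded
JVM_OPTS="$JVM_OPTS -Xmx512M"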

MsDeploy remoting executing manifest twice

I have:
Created a manifest for msdeploy to:
Stop, Uninstall, Copy over, Install, and Start a Windows service.
Created a package from the manifest
Executed msdeploy against the package against a remote server.
Problem: It executes the entire manifest twice.
Tried: I have tinkered with the waitInterval and waitAttempts thinking it was timing out and starting over, but that hasn't helped.
Question: What might be making it execute twice?
The Manifest:
<sitemanifest>
<runCommand path="net stop TestSvc"
waitInterval="240000"
waitAttempts="1"/>
<runCommand
path="C:\Windows\Microsoft.NET\Framework\v4.0.30319\installutil.exe /u
C:\msdeploy\TestSvc\TestSvc\bin\Debug\TestSvc.exe"
waitInterval="240000"
waitAttempts="1"/>
<dirPath path="C:\msdeploy\TestSvc\TestSvc\bin\Debug" />
<runCommand
path="C:\Windows\Microsoft.NET\Framework\v4.0.30319\installutil.exe
C:\msdeploy\TestSvc\TestSvc\bin\Debug\TestSvc.exe"
waitInterval="240000"
waitAttempts="1"/>
<runCommand path="net start TestSvc"
waitInterval="240000"
waitAttempts="1"/>
</sitemanifest>
The command issued to package it:
"C:\Program Files\IIS\Microsoft Web Deploy V2\msdeploy"
-verb:sync
-source:manifest=c:\msdeploy\custom.xml
-dest:package=c:\msdeploy\package.zip
The command issued to execute it:
"C:\Program Files\IIS\Microsoft Web Deploy V2\msdeploy"
-verb:sync
-source:package=c:\msdeploy\package.zip
-dest:auto,computername=<computerNameHere>
I am running as a domain user who has administrative access on the box. I have also tried passing credentials - it is not a permissions issue, the commands are succeeding, just executing twice.
Edit:
I enabled -verbose and found some interesting lines in the log:
Verbose: Performing synchronization pass #1.
...
Verbose: Source filePath (C:\msdeploy\MyTestWindowsService\MyTestWindowsService\bin\Debug\MyTestWindowsService.exe) does not match destination (C:\msdeploy\MyTestWindowsService\MyTestWindowsService\bin\Debug\MyTestWindowsService.exe) differing in attributes (lastWriteTime['11/08/2011 23:40:30','11/08/2011 23:39:52']). Update pending.
Verbose: Source filePath (C:\msdeploy\MyTestWindowsService\MyTestWindowsService\bin\Debug\MyTestWindowsService.pdb) does not match destination (C:\msdeploy\MyTestWindowsService\MyTestWindowsService\bin\Debug\MyTestWindowsService.pdb) differing in attributes (lastWriteTime['11/08/2011 23:40:30','11/08/2011 23:39:52']). Update pending.
After these lines, the files aren't copied the first time, but they are copied the second time.
...
Verbose: The dependency check 'DependencyCheckInUse' found no issues.
Verbose: Received response from agent (HTTP status 'OK').
Verbose: The current synchronization pass is missing stream content for 2 objects.
Verbose: Performing synchronization pass #2.
...
High Level
Normally I deploy a freshly built package with newer bits than are on the server.
During pass two, it duplicates everything that was done in pass one.
In pass 1, it will:
Stop, Uninstall, (delete some log files created by the service install), Install, and Start a Windows service
In pass 2, it will:
Stop, Uninstall, Copy files over, Install, and Start a Windows service.
I have no idea why it doesn't copy over the files in pass 1, or why pass 2 is triggered.
If I redeploy the same package instead of deploying fresh bits, it will run all the steps in pass 1 and not run pass 2, probably because the files have the same timestamp.
There is not enough information in the question to really reproduce the problem and give a specific answer... but there are several things to check/change/try to make this work:
runCommand needs specific privileges
waitInterval="240000" and waitAttempt="1" (double quotes instead of single quotes)
permissions for the deployment service / deployment agent regarding directories etc. on the target machine
use tempAgent feature
work through the troubleshooting section esp. the logs and try the -whatif and -verbose options
EDIT - after the addition of the -verbose output:
I see these possibilities:
Time
Both machines have a difference in time (either one of them is just a bit off or some timezone issue...)
Filesystem
If one of the filesystems is FAT this could lead to problems (timestamp resolution...)
EDIT 2 - as per comments:
In my last EDIT I wrote about timestamps because my suspicion is that something goes wrong when these are compared... that could be, for example, differing clocks between the two machines (even a difference of 30 sec can have an impact) and/or some timezone issue...
I wrote about the filesystem, esp. FAT, since the timestamp resolution of FAT is about 2 seconds while NTFS has a much higher resolution; again, this could have an impact when comparing timestamps...
From what you describe I would suggest the following workarounds:
use preSync and postSync for the Service handling parts (i.e. preSync for stop + uninstall and postSync for install + start) and do only the pure sync in the manifest or commandline
OR
use a script for the runCommand parts
EDIT 3 - as per comment from Merlyn Morgan-Graham the result for future reference:
When using the runCommand provider, use batch files. For some reason this made it stop running two passes.
The problem with this solution is that one can't specify the installation directory of the service via a SetParameters.xml file (same for dontUseCommandExe / preSync / postSync regarding SetParameters.xml).
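For illustration, the batch-file variant looks roughly like this (a sketch; stop-uninstall.cmd and install-start.cmd are hypothetical wrapper names, and the paths are the ones from the manifest above):
rem stop-uninstall.cmd
net stop TestSvc
C:\Windows\Microsoft.NET\Framework\v4.0.30319\installutil.exe /u C:\msdeploy\TestSvc\TestSvc\bin\Debug\TestSvc.exe
rem install-start.cmd
C:\Windows\Microsoft.NET\Framework\v4.0.30319\installutil.exe C:\msdeploy\TestSvc\TestSvc\bin\Debug\TestSvc.exe
net start TestSvc
The manifest's runCommand entries then just point at those .cmd files (e.g. path="C:\msdeploy\stop-uninstall.cmd") instead of containing the commands inline.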
EDIT 4 - as per comment from Merlyn Morgan-Graham:
The timeout params control when that specific command gets killed, not how long to wait for the Windows service itself to stop... in this case it seems the Windows service takes rather long to stop, so only the runCommands get executed without the copy/sync, and a new attempt at the whole run is initiated...
I had the same problem, but I don't create a package.zip file; I perform the synchronization directly in one step.
The preSync/postSync solution helped me a lot and there is no need to use manifest files.
You can try the following command in your case:
"C:\Program Files\IIS\Microsoft Web Deploy V2\msdeploy"
-verb:sync
-preSync:runCommand="net stop TestSv && C:\Windows\Microsoft.NET\Framework\v4.0.30319\installutil.exe /u
C:\msdeploy\TestSvc\TestSvc\bin\Debug\TestSvc.exe",waitInterval=240000,waitAttempts=1
-source:dirPath="C:\msdeploy\TestSvc\TestSvc\bin\Debug"
-dest:auto,computername=<computerNameHere>
-postSync:runCommand="C:\Windows\Microsoft.NET\Framework\v4.0.30319\installutil.exe
C:\msdeploy\TestSvc\TestSvc\bin\Debug\TestSvc.exe && net start TestSvc",waitInterval=240000,waitAttempts=1
"-verb:sync" parameter means you synchronize data between a source and a destination. In your case your case, first time you perform synchronization between the "C:\msdeploy\TestSvc\TestSvc\bin\Debug" folder and the "package.zip". Plus, you are using manifest file, so when you perform second synchronization between the "package.zip" and the destination "computername", msbuild uses previously provided manifest twice for the destination and for the source, so each manifest operation runs twice.
I used the && trick to perform several commands in one command line.
Also, in my case, I had to add a timeout operation to be sure the service was completely stopped ("ping -n 30 127.0.0.1 > nul").
