Can't back up to S3 with OpsCenter 5.2.1 - cassandra

I upgraded OpsCenter from 5.1.3 to 5.2.0 (and then to 5.2.1). I had a scheduled backup to a local server and an S3 location configured before the upgrade, which worked fine with OpsCenter 5.1.3. I made no changes to the scheduled backup during or after the upgrade.
The day after the upgrade, the S3 backup failed. In opscenterd.log, I see these errors:
2015-09-28 17:00:00+0000 [local] INFO: Instructing agents to start backups at Mon, 28 Sep 2015 17:00:00 +0000
2015-09-28 17:00:00+0000 [local] INFO: Scheduled job 458459d6-d038-41b4-9094-7d450e4bac6f finished
2015-09-28 17:00:00+0000 [local] INFO: Snapshots started on all nodes
2015-09-28 17:00:08+0000 [] WARN: Marking request d960ad7b-2ccd-40a4-be7e-8351ac038c53 as failed: {'sstables': {u'solr_admin': {u'solr_resources': {'total_size': 155313, 'total_files': 12, 'done_files': 0, 'errors': [u'{:type :opsagent.backups.destinations/destination-not-found, :message "Destination missing: 62f5a26abce7463bad9deb7380979c4a"}', u'{:type :opsagent.backups.destinations/destination-not-found, :message "Destination missing: 62f5a26abce7463bad9deb7380979c4a"}', u'{:type :opsagent.backups.destinations/destination-not-found, :message "Destination missing: 62f5a26abce7463bad9deb7380979c4a"}', shortened for brevity.
The S3 location no longer appears in OpsCenter when I edit the scheduled backup job. When I try to re-add the S3 location, using the same bucket and credentials as before, I get the following error:
Location validation error: Call to /local/backups/destination_validate timed out.
Also, I don't know if this is related, but for completeness, I see some of these errors in the opscenterd.log as well:
WARN: No http agent exists for definition file update. This is likely due to SSL import failure.
I get this behavior with either DataStax Enterprise 4.5.1 or 4.7.3.

I have been having the exact same problem since updating to OpsCenter 5.2.x, and I was just able to get it working properly.
I removed all the settings suggested in the previous answer and then created new buckets in us-west-1, us-west-2 and us-standard. After this I was able to add all of those as destinations quickly and easily.
It appears to me that OpsCenter may be trying to list the objects in the bucket you initially configure, which in my case, for the 2 existing buckets we were using, meant listing 11 TB and 19 GB of data respectively.
This could explain why increasing the timeout worked for some people and not others.
Hope this helps.

Try adding the remote_backup_region property to the cluster configuration file under the [agents] heading in "cluster-name".conf. Valid values are: us-standard, us-west-1, us-west-2, eu-west-1, ap-northeast-1, ap-southeast-1
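For example, a minimal sketch of that setting (the cluster config file typically lives under the OpsCenter config directory, e.g. /etc/opscenter/clusters/<cluster-name>.conf; the path and the region value shown are assumptions, adjust to your setup):

[agents]
remote_backup_region = us-west-2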
Does that help?

The problem was resolved by a combination of 2 things:
1. Delete the entire contents of the existing S3 bucket (or create a new bucket, as previously suggested by @kaveh-nowroozi).
2. Edit /etc/datastax-agent/datastax-agent-env.sh and increase the heap size to 512M, as suggested by a DataStax engineer. The default was set at 128M and I kept doubling it until backups became successful.
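As a rough sketch (the exact variable name in datastax-agent-env.sh may differ by version; the relevant part is the -Xmx value passed to the agent JVM):

# /etc/datastax-agent/datastax-agent-env.sh
# default was -Xmx128M; I kept doubling it until backups succeeded
JVM_OPTS="$JVM_OPTS -Xmx512M"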

Can't change RDS Postgres major version from the AWS console?

I have an RDS Postgres database, currently sitting at version 14.3.
I want to schedule a major version upgrade to 14.5 to happen during the maintenance window.
I want to do this manually via the console, because last time I did a major version upgrade of Postgres by changing the CDK definition, the deploy command applied the DB version change immediately, resulting in a few minutes of downtime for the database (manifesting as connection errors in the application connecting to the database).
When I go into the AWS RDS console, do a "modify" action and select the "DB Engine Version" - it only shows one option, which is the current DB version: "14.3".
According to the RDS doco, 14.4, 14.5 and 14.6 are all valid upgrade targets: https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_UpgradeDBInstance.PostgreSQL.html#USER_UpgradeDBInstance.PostgreSQL.MajorVersion
Also, when I do aws rds --profile raido-prod describe-db-engine-versions --engine postgres --engine-version 14.3 it shows those versions in the ValidUpgradeTarget collection.
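For reference, the same command with a --query filter lists just the valid target versions:

aws rds --profile raido-prod describe-db-engine-versions \
  --engine postgres --engine-version 14.3 \
  --query 'DBEngineVersions[].ValidUpgradeTarget[].EngineVersion'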
Using CDK version 2.63.0
Database CDK code:
// Relevant aws-cdk-lib v2 imports for this excerpt (the code below is part of a larger construct):
import { Duration, RemovalPolicy } from 'aws-cdk-lib';
import { InstanceClass, InstanceSize, InstanceType } from 'aws-cdk-lib/aws-ec2';
import {
  Credentials, DatabaseInstance, DatabaseInstanceEngine,
  ParameterGroup, PostgresEngineVersion,
} from 'aws-cdk-lib/aws-rds';

// starting with 14.3 to test the manual upgrade process
const engineVersion14_3 = PostgresEngineVersion.VER_14_3;
const dbParameterGroup14_3 = new ParameterGroup(this, 'Postgres_14_3', {
  description: "RaidoDB postgres " + engineVersion14_3.postgresFullVersion,
  engine: DatabaseInstanceEngine.postgres({
    version: engineVersion14_3,
  }),
  parameters: {
    // none we need right now
  },
});

/* Note that even after this stack has been deployed, this param group
   will not be created; I guess it will only be created when it's attached
   to an instance? */
const engineVersion14_5 = PostgresEngineVersion.VER_14_5;
// CDK strips underscores out of the name, hoping periods will remain
const dbParameterGroup14_5 = new ParameterGroup(this, 'Postgres.14.5.', {
  description: "RaidoDB postgres " + engineVersion14_5.postgresFullVersion,
  engine: DatabaseInstanceEngine.postgres({
    version: engineVersion14_5,
  }),
  parameters: {
    // none we need right now
  },
});

this.pgInstance = new DatabaseInstance(this, 'DbInstance', {
  databaseName: this.dbName,
  instanceIdentifier: this.dbName,
  credentials: Credentials.fromSecret(
    this.raidoDbSecret,
    this.userName,
  ),
  vpc: props.vpc,
  vpcSubnets: {
    subnetGroupName: props.subnetGroupName,
  },
  publiclyAccessible: false,
  subnetGroup: dbSubnetGroup,
  multiAz: false,
  availabilityZone: undefined,
  securityGroups: [props.securityGroup],
  /* Should we size a bigger instance for prod?
     Plan is to wait until it's needed - there will be some downtime for
     changing these. There's also the "auto" thing. */
  allocatedStorage: 20,
  instanceType: InstanceType.of(InstanceClass.T4G, InstanceSize.SMALL),
  engine: DatabaseInstanceEngine.postgres({
    version: engineVersion14_3,
  }),
  parameterGroup: dbParameterGroup14_3,
  /* Not sure what this does; changing it to true didn't allow me to change
     the version in the console. */
  allowMajorVersionUpgrade: true,
  /* 14.3.x -> 14.3.x+1 will happen automatically in the maintenance window,
     with potential downtime. */
  autoMinorVersionUpgrade: true,
  // longer in prod
  backupRetention: Duration.days(30),
  /* This enables DB termination protection.
     When the stack is destroyed, the DB will be detached from the stack but
     not deleted. */
  removalPolicy: RemovalPolicy.RETAIN,
  // explain and document the threat model before changing this
  storageEncrypted: false,
  /* "Enhanced monitoring".
     I turned this on while trying to figure out how to change the DB version.
     I still don't think we should have it enabled until we know how/when we'll
     use it - because it costs money in CloudWatch Logging, Metric fees and
     performance (execution and logging of the metrics from the DB server). */
  monitoringInterval: Duration.minutes(1),
  monitoringRole: props.monitoringRole,
  /* Useful for identifying expensive queries and missing indexes.
     Retention default of 7 days is fine. */
  enablePerformanceInsights: true,
  // UTC
  preferredBackupWindow: '11:00-11:30',
  preferredMaintenanceWindow: 'Sun:12:00-Sun:13:00',
});
So, the question: What do I need to do in order to be able to schedule the DB version upgrade in the maintenance window?
I made a lot of changes during the day trying to diagnose the issue before I posted this question, thinking I must be doing something wrong.
When I came back to work the next day, the modify screen DB Engine Version field contained the upgrade options I was originally expecting.
Below is my documentation of the issue (unfortunately, our CDK repo is not public):
Carried out by STO, CDK version was 2.63.0.
This page documents my attempt to manually schedule the DB version upgrade
using the console for application during the maintenance window.
We need to figure out how to do this since the DB upgrade process results in a
few minutes of downtime, so we'd prefer to avoid doing it in-hours.
It's preferred that we figure out how to schedule the upgrade, but if the choice
comes down to asking team members to work outside of hours or accepting
downtime, we will generally choose to have the downtime during business hours.
Note that the Postgres instance create time is:
Sat Jan 28 2023 10:34:32 GMT+1000
Action plan
in the RDS console
Modify DB instance: raido
DB engine version: change 14.3 to 14.5
save, selecting "schedule upgrade for maintenance window" (as opposed to
"apply immediately")
Actions taken
2023-02-02
When I tried to just go in and change it manually in the AWS console, the only
option presented on the "modify" screen was 14.3 - there was nothing to
change the version to.
I tried creating a 14.5 param group in CDK, but it just ignored me and didn't
create the param group - I'm guessing because it's expected to actually be
attached to a DB.
Tried copying and creating a new param group to change the version, but there's
no "version" param in the param group.
Tried manually creating a 14.5 DB ("sto-manual-14-5"), but even after the
DB was reported as successfully created (as 14.5), "14.3" was still the only
option in the "modify" screen for raido-db.
Tried creating a t3.micro in case t4g was the problem - no good.
Tried disabling minor version auto-upgrade - no good.
Note that 14.6 is currently the most recent version; both manually created 14.3
and 14.5 databases presented no other versions to upgrade to, so this problem
doesn't seem to be related to the CDK.
List available upgrades: aws rds --profile raido-prod describe-db-engine-versions --engine postgres --engine-version 14.3
Shows 14.4, 14.5 and 14.6 as valid target versions.
This page also shows the versions that should be valid upgrade targets, as at
2023-02-02: https://docs.amazonaws.cn/en_us/AmazonRDS/latest/UserGuide/USER_UpgradeDBInstance.PostgreSQL.html#USER_UpgradeDBInstance.PostgreSQL.MajorVersion
After all this, I noticed that we had the instance declared as
allowMajorVersionUpgrade: false, so I changed that to true and deployed,
but still can't select any other versions.
Also tried aws rds --profile raido-prod describe-pending-maintenance-actions
but it showed no pending actions.
I found this SO answer talking about RDS problems with T3 instances (note, I
have previously upgraded major versions of T3 RDS postgres instances):
https://stackoverflow.com/a/69295017/924597
On the manually created t3.micro instance, I tried upgrading the instance size
to a standard large instance. Didn't work.
Found this SO answer talking about the problem being related to having
outstanding "recommended actions" entries:
https://stackoverflow.com/a/75236812/924597
We did have an entry talking about "enhanced monitoring".
Tried enabling "enhanced monitoring" via the CDK, because it was listed as
a "recommended action" on the RDS console.
After deploying the CDK stack, the console showed that enhanced monitoring was
enabled, but the "recommended action" to enable it was still listed.
At this point, the console still showed 14.3 as the only option in the list on
the modify screen.
Posted to StackOverflow: Can't change RDS Postgres major version from the AWS console?
Posted to AWS repost: https://repost.aws/questions/QU4zuJeb9OShGfVISX6_Kx4w/
Stopped work for the day.
2023-02-03
In the morning, the RDS console no longer shows the "recommended action" to
enable enhanced monitoring.
The modify screen now shows "14.3, 14.4, 14.5 and 14.6" as options for the
DB Engine Version (as expected and originally desired).
Given the number of changes I tried above, I'm not sure which of them, if any,
caused the console to start displaying the correct options.
It may have been a temporary issue with RDS, or AWS support may have seen my
question on the AWS repost forum and done something to the account.
Note that I did not raise a formal support request via the AWS console.
I wanted to try to confirm whether enhanced monitoring was the cause of the
issue, so I changed the CDK back (there is no "enhanced monitoring" flag; I
just commented out the code that set the monitoring role and interval).
After deploying the CDK stack, there was no change to the RDS instance -
enhanced monitoring was still enabled.
I did a manual modify via the RDS console to disable enhanced monitoring.
The change did apply and was visible in the console, but the "recommended
actions" list did not show any issues.
At this point I had to attend a bunch of meetings, lunch, etc.
When I came back after lunch, the "recommended actions" list now shows an
"enhanced monitoring" entry.
But the modify console page still shows the 14.3 - 14.6 DB engine options, so
I don't think "enhanced monitoring" was the cause of the problem.
I scheduled the major version upgrade (14.3 -> 14.5, because 14.6 is not yet
supported by the CDK) for the next maintenance window.
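For reference, a hedged CLI equivalent of scheduling that same upgrade for the maintenance window (this is my understanding of the flags, not what I actually ran):

aws rds --profile raido-prod modify-db-instance \
  --db-instance-identifier raido \
  --engine-version 14.5 \
  --allow-major-version-upgrade \
  --no-apply-immediately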
Analysis
My guess is that the issue was caused by having allowMajorVersionUpgrade set
to false. I think changing it to true is what caused the other
version options to eventually show up on the modify page. I think the
reason the options didn't show up on the modify page straight after deploying
the CDK change comes down to waiting for eventual consistency to converge.

Airflow can't reach logs from webserver due to 403 error

I use Apache Airflow for daily ETL jobs. I installed it in Azure Kubernetes Service using the provided Helm chart. It's been running fine for half a year, but recently I've been unable to access the logs in the webserver (this always used to work fine).
I'm getting the following error:
*** Log file does not exist: /opt/airflow/logs/dag_id=analytics_etl/run_id=manual__2022-09-26T09:25:50.010763+00:00/task_id=copy_device_table/attempt=18.log
*** Fetching from: http://airflow-worker-0.airflow-worker.default.svc.cluster.local:8793/dag_id=analytics_etl/run_id=manual__2022-09-26T09:25:50.010763+00:00/task_id=copy_device_table/attempt=18.log
*** !!!! Please make sure that all your Airflow components (e.g. schedulers, webservers and workers) have the same 'secret_key' configured in 'webserver' section and time is synchronized on all your machines (for example with ntpd) !!!!!
****** See more at https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#secret-key
****** Failed to fetch log file from worker. Client error '403 FORBIDDEN' for url 'http://airflow-worker-0.airflow-worker.default.svc.cluster.local:8793/dag_id=analytics_etl/run_id=manual__2022-09-26T09:25:50.010763+00:00/task_id=copy_device_table/attempt=18.log'
For more information check: https://httpstatuses.com/403
What have I tried:
I've made sure that the log file exists (I can exec into the airflow-worker-0 pod and read the file on command line in the location specified in the error).
I've rolled back my deployment to an earlier commit from when I know for sure it was still working, but it made no difference.
I was using webserverSecretKeySecretName in the values.yaml configuration. I changed the secret to which that name was pointing (deleted it and created a new one, as described here: https://airflow.apache.org/docs/helm-chart/stable/production-guide.html#webserver-secret-key) but it didn't work (no difference, same error).
I changed the config to use a webserverSecretKey instead (in plain text), no difference (see the sketch below).
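For reference, a minimal sketch of the two values.yaml variants I tried (the secret name and key shown are placeholders, not my real values):

# variant 1: reference an existing Kubernetes secret
webserverSecretKeySecretName: my-webserver-secret
# variant 2: set the key directly, in plain text
# webserverSecretKey: some-static-secret-key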
My thoughts/observations:
The error states that the log file doesn't exist, but that's not true. It probably just can't access it.
The time is the same in all pods (I double checked by exec-ing into them and typing date in the command line)
The webserver secret is the same in the worker, the scheduler, and the webserver (I double checked by exec-ing into them and finding the corresponding env variable; a quick check is sketched below)
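As a rough sketch of that check (worker pod name taken from the error above; repeat for the scheduler and webserver pods):

kubectl exec -it airflow-worker-0 -- airflow config get-value webserver secret_key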
Any ideas?
Turns out this was a known bug with the latest release (2.4.0) of the official Airflow Helm chart, reported here:
https://github.com/apache/airflow/discussions/26490
Should be resolved in version 2.4.1 which should be available in the next couple of days.
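Once 2.4.1 is available, a hedged sketch of bumping the Airflow version under the chart (release name, namespace and values file are assumptions for my setup):

helm upgrade airflow apache-airflow/airflow -n airflow \
  -f values.yaml \
  --set airflowVersion=2.4.1 \
  --set defaultAirflowTag=2.4.1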

CouchDB v1.7.1 database replication to CouchDB v2.3.0 database fails

In Fauxton, I've setup a replication rule from a CouchDB v1.7.1 database to a new CouchDB v2.3.0 database.
The source does not have any authentication configured. The target does. I've added the username and password to the Job Configuration.
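For context, the job is roughly equivalent to a _replicator document like the following sketch (URLs taken from the log below; inline credentials are shown only for illustration):

{
  "source": "https://my-website.com/source-database-name/",
  "target": "http://username:password@my-website.com/target-database-name/",
  "continuous": true
}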
It looks like the replication got stuck somewhere in the process. 283.8 KB (433 documents) are present in the new database. The source contains about 18.7 MB (7215 docs) of data.
When restarting the database, I'm always getting the following error:
[error] 2019-02-17T17:29:45.959000Z nonode#nohost <0.602.0> --------
throw:{unauthorized,<<"unauthorized to access or create database
http://my-website.com/target-database-name/">>}:
Replication 5b4ee9ddc57bcad01e549ce43f5e31bc+continuous failed to
start "https://my-website.com/source-database-name/ "
-> "http://my-website.com/target-database-name/ " doc
<<"shards/00000000-1fffffff/_replicator.1550593615">>:<<"1e498a86ba8e3349692cc1c51a00037a">>
stack:[{couch_replicator_api_wrap,db_open,4,[{file,"src/couch_replicator_api_wrap.erl"},{line,114}]},{couch_replicator_scheduler_job,init_state,1,[{file,"src/couch_replicator_scheduler_job.erl"},{line,584}]}]
I'm not sure what is going on here. From the logs I understand there's an authorization issue, but the database is already present (hence, it has already been partially replicated).
What does this error mean and how can it be resolved?
The reason for this error is that the CouchDB v2.3.0 instance was being re-initialized on reboot, which required me to fill in the cluster configuration again.
Therefore, the replication could not continue until I had re-applied the configuration.
The issue with having to re-apply the cluster configuration has been solved in another SO question.

Unable to start Kudu master

While starting kudu-master, I am getting the error below and am unable to start the Kudu cluster.
F0706 10:21:33.464331 27576 master_main.cc:71] Check failed: _s.ok() Bad status: Invalid argument: Unable to initialize catalog manager: Failed to initialize sys tables async: on-disk master list (hadoop-master:7051, slave2:7051, slave3:7051) and provided master list (:0) differ. Their symmetric difference is: :0, hadoop-master:7051, slave2:7051, slave3:7051
It is a cluster of 8 nodes, and I have provided 3 masters as given below in master.gflagfile on the master nodes.
--master_addresses=hadoop-master,slave2,slave3
TL;DR
If this is a new installation, and assuming the master IP addresses are correct, I believe the easiest solution is to (see the shell sketch after this list):
Stop kudu masters
Nuke the <kudu-data-dir>/master directory
Start kudu masters
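A rough shell sketch of those steps, assuming a package install managed by systemd and a default data directory (the service name and <kudu-data-dir> are assumptions; adjust to your environment):

# run on each master node
sudo systemctl stop kudu-master
# move rather than delete, in case anything needs to be recovered
sudo mv /var/lib/kudu/master /var/lib/kudu/master.bak   # i.e. <kudu-data-dir>/master
sudo systemctl start kudu-master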
Explanation
I believe the most common (if not only) cause of this error (Failed to initialize sys tables async: on-disk master list (hadoop-master:7051, slave2:7051, slave3:7051) and provided master list (:0) differ.) is when a kudu master node gets added incorrectly. The error suggests that kudu-master thinks it's running on a single node rather than 3-node cluster.
Maybe you did not intend to "add a node", but that's most likely what happened. I'm saying this because I had the same problem; after some googling and debugging, I discovered that during the installation I had started kudu-master before putting the correct IP addresses in master.gflagfile, so kudu-master was spun up thinking it was running on a single node, not a 3-node cluster. Using the steps above to cleanly reinitialize kudu-master, my problem was solved.

WebLogic crashes

I am running Liferay 6.2 on WebLogic 12c server.
Out of nowhere it just stopped working.
This is the last thing I see before it throws a flurry of exceptions:
<Jan 10, 2014 2:53:28 PM EST> <Notice> <LoggingService> <BEA-320400> <The log file C:\Oracle_2\Middleware\user_projects\domains\liferay\servers\AdminServer\logs\AdminServer.log will be rotated. Reopen the log file if tailing has stopped. This can happen on some platforms, such as Windows.>
<Jan 10, 2014 2:53:28 PM EST> <Notice> <LoggingService> <BEA-320401> <The log file has been rotated to C:\Oracle_2\Middleware\user_projects\domains\liferay\servers\AdminServer\logs\AdminServer.log00369. Log messages will continue to be logged in C:\Oracle_2\Middleware\user_projects\domains\liferay\servers\AdminServer\logs\AdminServer.log.>
The errors are shown here http://www.pastebin.ca/2532946
Anyone have any ideas on this?
As you can see in the log files (see the excerpt of your log file below), either Liferay is not able to get a handle to the HSQL database, or the HSQL db might have been corrupted when you updated it.
13:11:16,769 WARN [C3P0PooledConnectionPoolManager[identityToken->uArzPQ2m]-HelperThread-#4][BasicResourcePool:1851] com.mchange.v2.resourcepool.BasicResourcePool$ScatteredAcquireTask#933b16 -- Acquisition Attempt Failed!!! Clearing pending acquires. While trying to acquire a needed new resource, we failed to succeed more than the maximum number of allowed acquisition attempts (3). Last acquisition attempt exception:
java.sql.SQLException: error in script file line: 15 unexpected token: AVG
So you need to answer the questions below:
Did you use any client tool to make changes to your HSQL db?
If yes, did you close the connection to the HSQL database before starting Liferay?
If not, Liferay won't be able to acquire a lock on your db and will fail to start.
If not, did you make DB changes directly in the HSQL db file?
This is NOT recommended. Roll back your changes and use an HSQL client to make your db changes (a hedged example is sketched below).
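As a rough sketch of opening the embedded db with the GUI client bundled in hsqldb.jar (the jar and db paths are assumptions based on a default Liferay layout; stop Liferay first so the file lock is released):

java -cp hsqldb.jar org.hsqldb.util.DatabaseManagerSwing \
  --url jdbc:hsqldb:file:/path/to/liferay-home/data/hsql/lportal --user sa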
HTH!
P.S. Is this issue a duplicate of https://stackoverflow.com/questions/21052236/weblogic-wont-start? If so, please delete that one.
