ArangoDB - Help diagnosing database corruption after system restart

I've been working with ArangoDB for a few months now within a local, single-node development environment that regularly gets restarted for maintenance reasons. About 5 or 6 times now my development database has become corrupted after a controlled restart of my system. When it occurs, the corruption is subtle in that the Arango daemon seems to start OK and the database structurally appears as expected through the web interface (collections and documents are there). The problems have included the Foxx microservice system failing to upload my validated service code (generic 500 service error) as well as queries using filters not returning expected results (damaged indexes?). When this happens, the only way I've been able to recover is by deleting the database and rebuilding it.
I'm looking for advice on how to debug this issue - such as what to look for in log files, server configuration options that may apply, etc. I've read most of the development documentation, but only skimmed over the deployment docs, so perhaps there's an obvious setting I'm missing somewhere to adjust reliability/resilience? (this is a single-node local instance).
Thanks for any help/advice!
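One concrete starting point is to scan the server log for warnings or errors immediately after a restart. A minimal sketch using the HTTP log API, assuming a default local instance on port 8529 and root credentials (adjust to your setup):

# Pull warning-and-above entries from the ArangoDB server log and
# flag lines that hint at storage or index trouble.
import requests

BASE = "http://localhost:8529"   # assumption: default local endpoint
AUTH = ("root", "")              # assumption: default root credentials

resp = requests.get(BASE + "/_admin/log", params={"upto": "warning"}, auth=AUTH)
resp.raise_for_status()
log = resp.json()   # parallel arrays: lid, level, timestamp, text

suspicious = ("corrupt", "checksum", "index", "wal", "sync", "foxx")
for ts, text in zip(log["timestamp"], log["text"]):
    if any(word in text.lower() for word in suspicious):
        print(ts, text)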

Please note that issues like this should rather be discussed on GitHub.

Related

Inconsistency Errors in kombu using celery and redis with the key '_kombu.binding.reply.celery.pidbox'

I have two Django sites (archive and test-archive) on one machine. Each has its own virtual environment and different celery queues and daemons, using Python 3.6.9 on Ubuntu 18.04, Django 3.0.2, Redis 4.0.9, Celery 4.3, and Kombu 4.6.3. This server has 16 GB of RAM; under load there is at least 10 GB free, and swap usage is minimal.
I keep getting this error in my logs:
kombu.exceptions.InconsistencyError:
Cannot route message for exchange 'reply.celery.pidbox': Table empty or key no longer exists.
Probably the key ('_kombu.binding.reply.celery.pidbox') has been removed from the Redis database.
I tried:
downgrading Kombu to 4.5 for both sites, per some Stack Overflow posts
setting maxmemory=2GB and maxmemory-policy=allkeys-lru in redis.conf, per the Celery docs (https://docs.celeryproject.org/en/stable/getting-started/backends-and-brokers/redis.html#broker-redis); originally the settings were the defaults (unlimited memory and noeviction), and the errors were present with both versions of Kombu
I still get those errors when one site is under load (i.e. doing something like uploading a set of images and processing them) and the other site is idle.
What is a little strange is that on some test runs using test-archive, test-archive will not have any errors, while archive will show those errors, even though the archive site is not doing anything. On other identical test runs using test-archive, test-archive will generate the errors and archive will not.
I know this is a reported bug in kombu/celery, so I am wondering if anyone has a workaround that works more often than not for this configuration. What versions of Celery, Kombu, Redis, etc. seem to work reliably together? I am happy to share my config files or log files, but there are so many that I thought it would be best to start this discussion with the problem statement and my setup and see what else is needed.
Thanks!
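One workaround that comes up repeatedly for this error is to disable the worker features that publish to the pidbox channels. A minimal sketch, assuming a Redis broker on localhost and an app module named tasks (both placeholders):

# The '_kombu.binding.reply.celery.pidbox' key is used by Celery's remote
# control machinery; disabling it avoids the InconsistencyError at the cost
# of losing `celery inspect` / `celery control`.
from celery import Celery

app = Celery("tasks", broker="redis://localhost:6379/0")
app.conf.worker_enable_remote_control = False

# Gossip, mingle and heartbeats also chatter over broker channels; many
# reports suggest starting workers without them:
#   celery -A tasks worker --without-gossip --without-mingle --without-heartbeat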

Why is Azure Scale-Out Crashing the Web App

I'm currently running Umbraco on a web app for Microsoft Azure. Anytime I enable scaling out and the web app starts scaling out, I get the error:
"Process cannot access the file, Examine Indexes write.lock because it is being used by another file.
The website then needs to be restarted before it becomes fully functioning again. Is there a setting on Umbraco that I'm missing?
Or is it something that happens with Azure Web Apps Auto Scaling features?
This sounds like an issue with the indexes. Your index appears to be getting locked when scaling out. Ideally, if you're running in a load-balanced environment, you should have a single index for all instances instead of one per instance. I've used Azure Search in the past and it's worked perfectly; swapping out the index isn't too difficult with Umbraco, and there's plenty of information available online.
In the future you shouldn't need to restart the entire site; rebuilding the indexes should be fine.
Also, what version of Umbraco are you running? This may be of some help; I encountered some similar issues a few months ago, although unrelated to scaling:
https://issues.umbraco.org/issue/U4-10735
Sounds like you need to isolate your index files so they aren't shared across the different instances and don't lock each other out. There are a few ways to do this depending on the version you are running, but in 7.3, I think you update the index file location to include the instance name, like ~/App_Data/TEMP/ExamineIndexes/{machinename}/Internal/ (see the sketch below).
For more details, see https://our.umbraco.com/documentation/getting-started/setup/server-setup/load-balancing/flexible#if-you-plan-on-using-auto-scaling
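To illustrate that advice, the relevant entries in ~/Config/ExamineIndex.config might look something like this for the default index sets (a sketch for Umbraco 7.x; set names and paths are assumptions, so check them against your own config):

<!-- Give each scaled-out instance its own index folder via the
     {machinename} token, so instances stop fighting over write.lock. -->
<ExamineLuceneIndexSets>
  <IndexSet SetName="InternalIndexSet"
            IndexPath="~/App_Data/TEMP/ExamineIndexes/{machinename}/Internal/" />
  <IndexSet SetName="ExternalIndexSet"
            IndexPath="~/App_Data/TEMP/ExamineIndexes/{machinename}/External/" />
</ExamineLuceneIndexSets>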

GitLab runner errors occasionally

I have GitLab set up with runners on a dedicated VM (24 GB RAM, 12 vCPUs, and a very low runner concurrency of 6).
Everything worked fine until I added more browser tests - 11 at the moment.
These tests are in the browser-test stage and start properly.
My problem is that they sometimes succeed and sometimes fail, with totally random errors.
Sometimes a test cannot resolve the host; other times it is unable to find an element on the page...
If I rerun the failed tests, everything always goes green.
Does anyone have an idea of what is going wrong here?
BTW, I've checked - this dedicated VM is not overloaded...
I have resolved all my initial issues (not tested under full machine load so far); however, I've decided to post some of my experiences.
First of all, I was experimenting with gitlab-runner concurrency (to speed things up), and it turned out that it filled my storage space really quickly. So for anybody experiencing storage shortcomings, I suggest installing this package.
Secondly, I was using runner cache and artifacts, which in the end were cluttering my tests a bit, and I believe that was the root cause of my problems.
My observations:
If you want to take advantage of cache in gitlab-runner, remember that by default it is accessible only on the host where the runner starts, and that the cache is restored on top of your checkout, meaning it overwrites files from your project.
Artifacts are a little more flexible, because they are stored on and fetched from your GitLab installation. You should develop your own naming convention for them (using variables) to control what is fetched/cached between stages and to make sure everything works as you would expect (see the sketch after this list).
Cache and artifacts in your tests should be used with caution and understanding, because they can introduce tons of problems if not used properly...
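To illustrate the naming-convention point, a .gitlab-ci.yml fragment might look like this (job name and paths are placeholders; the predefined CI_* variables are standard GitLab):

browser-test:
  stage: browser-test
  cache:
    # scope the cache per branch so concurrent pipelines don't clobber
    # each other's files
    key: "$CI_COMMIT_REF_SLUG"
    paths:
      - vendor/
      - node_modules/
  artifacts:
    # name artifacts after job + branch so it is obvious what is fetched
    # between stages
    name: "$CI_JOB_NAME-$CI_COMMIT_REF_SLUG"
    when: always
    expire_in: 1 week
    paths:
      - tests/Browser/screenshots/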
Side note:
Although my VM was not overloaded, certain storage lags were causing timeouts in the network, and finally in Dusk, when running multiple gitlab-runners concurrently...
Update as of 2019-02:
I have finally tested this under full load, and I can confirm my earlier side note: the point about machine overload is more than true.
After tweaking Linux parameters to handle a big load (max open files, connections, sockets, timeouts, etc.) on the hosts running gitlab-runners, all concurrent tests pass green, without any strange, occasional errors. The tweaks were along these lines (see below).
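Illustrative values only, not recommendations; tune for your own hosts:

# /etc/security/limits.d/gitlab-runner.conf - raise the open-file limit
# for the user the runner executes as
gitlab-runner  soft  nofile  65535
gitlab-runner  hard  nofile  65535

# /etc/sysctl.d/99-gitlab-runner.conf - more sockets and connection backlog
net.core.somaxconn = 1024
net.ipv4.ip_local_port_range = 10240 65535
fs.file-max = 2097152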
Hope this helps anybody configuring gitlab-runners...

Foxx apps debugging workflow?

What is the recommended workflow to debug Foxx applications?
I am currently working on a pretty big application, and it seems to me I am doing something wrong, because the way I am proceeding does not seem maintainable at all:
Make your changes in the Foxx app (e.g. new endpoints).
Upload your Foxx app to ArangoDB.
Test your changes (e.g. trigger API calls).
Check the logs to see if something went wrong.
Go to 1.
I experienced great time savings by shifting more of the development workflow to the terminal client arangosh. Especially when debugging more complex endpoints, you can isolate queries and functions and debug each individually in the terminal. When you're done debugging, you merge your code into the Foxx app and mount it. Require modules as you would in Foxx, and just enter variables as arguments to your functions or queries.
You can use arangosh either directly from the terminal or via the embedded terminal in the ArangoDB frontend.
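For instance, a session like the following (collection, query and module names are all placeholders) isolates a query and a helper function before merging the code back into the app:

// run the exact query your endpoint would execute, with hand-entered
// bind variables
var db = require("org/arangodb").db;
var cursor = db._query(
  "FOR u IN users FILTER u.status == @status RETURN u",
  { status: "active" }
);
cursor.toArray();

// pull in one of your app's modules and poke at a single function;
// how the module path resolves depends on your arangosh module setup
var helpers = require("my-foxx-app/lib/helpers");
helpers.computeScore({ visits: 3 });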
You may also save some time by switching to development mode, which lets changes in your code be reflected directly in the mounted app without fetching, mounting and unmounting each time.
That additional flexibility costs some performance, so make sure to switch back to production mode once your Foxx app is ready for deployment.
When developing a Foxx app, I would suggest using the development mode. This also helps a lot with debugging, as you get faster feedback. It works as follows:
Start arangod with the dev-app-path option, like this: arangod --javascript.dev-app-path /PATH/TO/FOXX_APPS /PATH/TO/DB, where the path to Foxx apps is the folder that contains a database folder that contains your Foxx apps sorted by database. More information can be found in the documentation.
Make your changes; there is no need to deploy the app or anything. The app now automatically reloads on every request. Change, try out, change, try out...
There are currently no debugging capabilities. We are planning to add more support for unit testing of Foxx apps in the near future, so you can have a more TDD-like workflow.

FluentMigrator Migration From Application_Start

I am currently changing our database deployment strategy to use FluentMigrator and have been reading up on how to run it. Some people have suggested that it can be run from Application_Start. I like this idea, but other people are saying no without specifying reasons, so my questions are:
Is it a bad idea to run the database migration on application start, and if so, why?
We are planning on moving our sites to Azure cloud services; if we don't run the migration from Application_Start, how and when should we run it, considering we want to make deployment as simple as possible?
Wherever it is run, how do we ensure it runs only once, given that we will have a website and multiple worker roles as well? (We could just ensure the migration code is only called from the website, but in the future we may scale to 2 or more instances; would that mean it could run more than once?)
I would appreciate any insight on how others handle the migration of the database during deployment, particularly from the perspective of deployments to azure cloud services.
EDIT:
Looking at the comment below, I can see the potential problems of running during Application_Start. Perhaps the issue is that I am trying to solve the problem with the wrong tool, and FluentMigrator may not be the way to go in our case: we have a large number of stored procedures, views, etc., so as part of the migration I was going to have to use SQL scripts to keep them at the right version, and I don't think migrating down would be possible.
What I liked about the idea of running during Application_Start was that I could build a single deployment package for Azure, upload it to staging, and the database migration would run and that would be it - rather than running manual scripts - and then just swap into production.
Running migrations during Application_Start can be a viable approach. Especially during development.
However there are some potential problems:
Application_Start will take longer, and FluentMigrator will run every time the App Pool is recycled. Depending on your IIS configuration, this could be several times a day.
If you do this in production, users might be affected; e.g. trying to access a table while it is being changed will result in an error.
DBAs don't usually approve.
What happens if the migrations fail on startup? Is your site down then?
My opinion ->
For a site with a decent amount of traffic I would prefer to have a build script and more control over when I change the database schema. For a hobby (or small, non-critical) project this approach would be fine. If you do run at startup, it could look something like the sketch below.
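A rough sketch using the older FluentMigrator 1.x-era in-process runner API (connection string, processor choice and class names are placeholders; if you run multiple instances, guard this with a distributed lock such as sp_getapplock so it executes only once):

using System.Diagnostics;
using System.Reflection;
using FluentMigrator.Runner;
using FluentMigrator.Runner.Announcers;
using FluentMigrator.Runner.Initialization;
using FluentMigrator.Runner.Processors;
using FluentMigrator.Runner.Processors.SqlServer;

public static class MigrationBootstrap
{
    // call this once from Application_Start
    public static void MigrateUp(string connectionString)
    {
        var announcer = new TextWriterAnnouncer(s => Debug.WriteLine(s));
        var context = new RunnerContext(announcer);
        var options = new ProcessorOptions { PreviewOnly = false, Timeout = 60 };

        var factory = new SqlServer2008ProcessorFactory();
        var processor = factory.Create(connectionString, announcer, options);

        // runs every migration in this assembly that hasn't been applied yet
        var runner = new MigrationRunner(
            Assembly.GetExecutingAssembly(), context, processor);
        runner.MigrateUp(true);
    }
}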
An alternative approach that I've used in the past is to make your migrations non-breaking - that is, you write your migrations in such a way that they can be deployed before any code changes and work with the existing code. This way, code and migrations can be deployed independently 95% of the time. For example, instead of changing an existing stored procedure you create a new one, and if you want to rename a table column you add a new one instead (see the sketch after the list below).
The benefits of this are:
Your database changes can be applied before any code changes. You're then free to roll back any breaking code changes or breaking migrations.
Breaking migrations won't take the existing site down.
DBAs can run the migrations independently.
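A sketch of such a non-breaking migration in FluentMigrator (table and column names are placeholders): add a new nullable column alongside the old one instead of renaming it, so the schema change can ship before the code change.

using FluentMigrator;

[Migration(201902011200)]
public class AddCustomerDisplayName : Migration
{
    public override void Up()
    {
        // new column next to the old one; existing code keeps working
        Alter.Table("Customers")
            .AddColumn("DisplayName").AsString(255).Nullable();
    }

    public override void Down()
    {
        // rolling back is safe because nothing depends on the column yet
        Delete.Column("DisplayName").FromTable("Customers");
    }
}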
