How to optimize knex migrations? - node.js

I'm working on a project that has been using Bookshelf.js (with the Knex.js migration system) since its beginning (a year and a half ago).
We now have a little under 80 migrations and it's starting to take a long time (more than 2 minutes) to run them all. We deploy using continuous integration, so the migrations have to run in both the test process and the deployment process.
I'd like to know how to optimize that. Is it possible to start from a clean state? I don't care about losing rollback possibilities. The project is much more mature now and we don't need to iterate much on the data structure anymore.
Is there any best practice? I'm coming from the Doctrine (PHP) world and it's really different.
Thanks for your advice!

1. Create a database dump from your current database state.
2. Always use that dump to initialize a new database for tests.
3. Run migrations on top of the already-initialized database.
That way, the migration system applies only the migrations that were added after the initial dump was taken.
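A minimal sketch of that flow in Node.js, assuming PostgreSQL with psql on the PATH and made-up file and connection names (adapt to your own CI scripts). One important detail: the dump must include the knex_migrations table so knex knows which migrations have already been applied.

const { execSync } = require("child_process");
const knex = require("knex")({
  client: "pg",
  connection: { host: "localhost", database: "app_test", user: "postgres" },
});

async function prepareTestDatabase() {
  // 1. Restore the baseline dump taken from the current schema
  //    (it contains the knex_migrations bookkeeping table as well).
  execSync("psql app_test < baseline_dump.sql");
  // 2. Apply only the migrations added after the dump was created;
  //    knex skips everything already recorded in knex_migrations.
  await knex.migrate.latest();
  await knex.destroy();
}

prepareTestDatabase().catch(console.error);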

When using knex.schema.createTable to create a table with foreign keys referencing another table, and you later run knex migrate:latest, the referenced table has to be created before the table that declares the foreign keys. For example, if table1 has a foreign key key1 referencing table2, you can make sure table2 is processed first by adding numbers in front of the migration file names. Your migrations folder will then contain 1table2.js and 2table1.js. This looks hacky and not pretty, but it works!
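As an illustration (table and column names are made up), here are two minimal knex migration files using that prefix trick; knex applies migration files in filename order, so 1table2.js runs before 2table1.js:

// migrations/1table2.js -- runs first because of the "1" prefix
exports.up = (knex) =>
  knex.schema.createTable("table2", (t) => {
    t.increments("id");
    t.string("name");
  });
exports.down = (knex) => knex.schema.dropTable("table2");

// migrations/2table1.js -- can now declare its foreign key safely
exports.up = (knex) =>
  knex.schema.createTable("table1", (t) => {
    t.increments("id");
    t.integer("key1").unsigned().references("id").inTable("table2");
  });
exports.down = (knex) => knex.schema.dropTable("table1");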

Related

Cassandra Prepared Statement and adding new columns

We are using cached PreparedStatements for queries against DataStax Cassandra, but if we need to add new columns to a table, we have to restart our application server to re-cache the prepared statements.
I came across this bug report, which explains a solution:
https://datastax-oss.atlassian.net/browse/JAVA-420
It basically gives a workaround: don't use "SELECT * FROM table" in the query, but "SELECT column_names FROM table" instead.
But now we have run into the same issue with DELETE statements: after adding a new column to a table, the prepared DELETE statement no longer deletes a record.
I don't think we can use the same workaround mentioned in the ticket for SELECT statements, as * or column names don't make sense when deleting a row.
Any help would be appreciated. We basically want to avoid having to restart our application server for any additions to database tables.
An easy solution that requires a little bit of coding: use JMX.
Let me explain.
In your application code, keep a cache of all prepared statements (you can use the Guava cache implementation, for example). The key to access the cache can be, for example, the query string.
Now, expose a JMX method to clear the cache and force the application to prepare the queries again.
Every time you update the schema, just call the appropriate method(s) to clear the cache; you don't need to restart your application.

How to delete all collections and documents in ArangoDB

I am trying to put together a unit test setup with ArangoDB. For that I need to be able to reset the test database around every test.
I know we can delete a database directly via the REST API, but the documentation mentions that creation and deletion can "take a while".
Would that be the recommended way to do that kind of setup, or is there an AQL statement to do something similar?
After some struggling with a similar need, I found this solution:
for (let col of db._collections()) {
  if (!col.properties().isSystem) {
    db._drop(col._name);
  }
}
You can, for example, retrieve the list of all collections (excluding system ones) and drop or truncate them. The latter removes all documents but keeps the indexes. Alternatively, you can use an AQL REMOVE statement.
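For a Node.js test setup, a rough sketch of the truncate variant using the arangojs driver (arangojs v7 assumed; the URL and database name are placeholders):

const { Database } = require("arangojs");

const db = new Database({ url: "http://localhost:8529", databaseName: "test_db" });

async function resetCollections() {
  // collections() excludes system collections by default
  const collections = await db.collections();
  // truncate keeps the collections and their indexes, only removes the documents
  await Promise.all(collections.map((c) => c.truncate()));
}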
Creation of databases may indeed take a while (a few seconds). If that's too expensive in a unit test setup that sets up and tears down the environment for each single test, there are the following options:
- create and drop a dedicated test database only once per test suite (containing multiple tests), and create/drop the required collections per test (see the sketch after this list). This has turned out to be fast enough in many cases, but it depends on how many tests are contained in each test suite.
- do not create and drop a dedicated test database, but only have each test create and drop the required collections. This is the fastest option and should be good enough if you start each test run in a fresh database. However, it requires the tests to clean everything up properly. This is normally not a problem, because tests will usually use dedicated collections anyway. An exception is graph data: creating a named graph stores the graph description in the _graphs collection, and the graph must be deleted from there again.
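A hedged sketch of the first option with arangojs (v7 assumed; names are placeholders): one database per test suite, fresh collections per test, and the whole database dropped at the end.

const { Database } = require("arangojs");

const system = new Database({ url: "http://localhost:8529" }); // connects to _system

async function runSuite(tests) {
  const name = "test_suite_" + Date.now();
  await system.createDatabase(name);           // once per suite
  const db = system.database(name);
  try {
    for (const test of tests) {
      const col = await db.createCollection("fixtures"); // per test
      await test(db);
      await col.drop();                         // clean up per test
    }
  } finally {
    await system.dropDatabase(name);            // once per suite
  }
}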
Executing the following AQL query deletes all documents in the collection yourcollectionname:
FOR u IN yourcollectionname
REMOVE u IN yourcollectionname
https://docs.arangodb.com/3.0/AQL/Operations/Remove.html

Weird issue with Azure SQL Database v12: the database is always slow on the first insert or delete execution, but not with v11

We are using MVC4, ASP.NET 4.5, Entity Framework 6.
When we used Azure SQL Database v11, initial record inserts and deletes via EF worked fine and quickly. However, now on v12, I notice that initial inserts and deletes can be very slow, especially if we choose a new value when inserting. If we insert a new record with the same value, the response is rapid. The delay I am talking about is about 30 secs on S1, 15 secs on S2 and 7 secs on S3.
As I say, we never encountered this on v11.
Any ideas gratefully received.
EDIT1
I've just been doing some diagnostics and it seems that a view I was using now runs very slowly the first time:
db.ExecuteStoreCommand("DELETE FROM Vw_Widget where Id={0}", ID);
Do I need to rejig views in any way for Azure SQL Database v12?
EDIT2
Looking at the code a little more, I see that I have added a delete trigger to the view; basically I set up the view so I could use this trigger code in certain situations. I am now trying to take the trigger code out and run it from the app, which runs a lot quicker. Perhaps this code should be a stored procedure.
You definitely need to run some diagnostics on your view to check the performance of your query, and you may need to tune it. The times you are reporting are far too high for these operations. Make sure you run inserts and deletes against the target tables, not against views; the best practice is not to insert or delete through views.
Use views only in SELECT statements.
I had a similar problem when migrating a SQL database from v2 to v12. I was working with the Business edition and tried to migrate to S0, and the performance of the DB was not good. After some time I discovered that the DTU model has specific views for monitoring which provisioning tier you need. If the problem only occurs the first time, your application is probably making a lot of queries to load data into memory, and these can affect the performance of your CRUD statements.
SELECT end_time
, (SELECT Max(v)
FROM (VALUES (avg_cpu_percent)
, (avg_data_io_percent)
, (avg_log_write_percent)
) AS value(v)) AS [avg_DTU_percent]
FROM sys.dm_db_resource_stats
ORDER BY end_time DESC;
More information about that can be found on this page:
https://azure.microsoft.com/en-us/documentation/articles/sql-database-upgrade-server-portal/

How do production Cassandra DBAs do table changes & additions?

I am interested in how a production Cassandra DBA's processes change when performing many releases over a year. During the releases, columns in tables change frequently, and so does the number of Cassandra tables, as new features and queries are supported.
In a relational DB, in production, you create the 'view' and BOOM, you get the data already there, loaded from the view's query.
With Cassandra, does the DBA have to create a new Cassandra table AND write/run a script to copy all the required data into that table? Can a production-level Cassandra DBA provide some pointers on their processes?
We run a small shop, so I can tell you how I manage table/keyspace changes, and that may differ from how others get it done. First, I keep a text .cql file in our (private) Git repository that has all of our tables and keyspaces in their current formats. When changes are made, I update that file. This lets other developers know what the current tables look like, without having to use SSH or DevCenter. This also has the added advantage of giving us a file that allows us to restore our schema with a single command.
If it's a small change (like adding a new column) I'll try to get that out there just prior to deploying our application code. If it's a new table, I may create that earlier, as a new table without code to use it really doesn't hurt anything.
However, if it is a significant change...such as updating/removing an existing column or changing a key...I will create it as a new table. That way, we can deploy our code to use the new table(s), and nobody ever knows that we switched something behind the scenes. Obviously, if the table needs to have data in it, I'll have export/import scripts ready ahead of time and run those right after we deploy.
Larger corporations with enterprise deployments use tools like Chef to manage their schema deployments. When you have a large number of nodes or clusters, an automated deployment tool is really the best way to go.

MongoDB: can I trigger secondary replication only at a given time, or manually?

I'm not a MongoDB expert, so I'm a little unsure about the server setup now.
I have a single instance running MongoDB 3.0.2 with WiredTiger, accepting both read and write ops. It collects logs from clients, so the write load is decent. Once a day I want to process these logs and calculate some metrics using the aggregation framework; the data set to process is roughly all logs from the last month, and the whole calculation takes about 5-6 hours.
I'm thinking about splitting writes and reads to avoid locks on my collections (the server continues to write logs while I'm reading; newly written logs may match my queries, but I can skip them because I don't need 100% accuracy).
In other words, I want a setup with a secondary for reads, where replication is not performed continuously but starts at a configured time, or better, is triggered before all the read operations start.
I do all my processing from Node.js, so one option I see here is to export the data created in some period like [yesterday, today] and import it into the read instance myself, then run the calculations after the import is done. I was looking at replica sets and master/slave replication as possible setups, but I didn't figure out how to configure them to achieve the described scenario.
So maybe I'm wrong and missing something here? Are there any other options to achieve this?
Your idea of using a replica-set is flawed for several reasons.
First, a replica-set always replicates the whole mongod instance. You can't enable it for individual collections, and certainly not only for specific documents of a collection.
Second, deactivating replication and enabling it before you start your report generation is not a good idea either. When you enable replication, the new slave will not be immediately up to date. It will take a while until it has processed the changes since its last contact with the master. There is no way to tell how long this will take (you can check how far a secondary is behind the primary using rs.status() and comparing the secondary's optimeDate with its lastHeartbeat date).
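For example, a rough check from the mongo shell (the exact fields in the rs.status() output vary a bit between versions):

// print each member's state and optime; the gap between the primary's and
// a secondary's optimeDate is a rough measure of replication lag
rs.status().members.forEach(function (m) {
  print(m.name + "  " + m.stateStr + "  optimeDate: " + m.optimeDate);
});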
But when you want to perform data-mining on a subset of your documents selected by timespan, there is another solution.
Transfer the documents you want to analyze to a new collection. You can do this with an aggregation pipeline consisting only of a $match stage which matches the documents from the last month, followed by an $out stage. The $out operator specifies that the results of the aggregation are not sent to the application/shell, but are instead written to a new collection (which is automatically emptied before this happens). You can then perform your reporting on the new collection without locking the actual one. It also has the advantage that you are now operating on a much smaller collection, so queries will be faster, especially those which can't use indexes. Also, your data won't change between aggregations, so your reports won't have inconsistencies caused by the data changing while they run.
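Since the question mentions doing everything from Node.js, here is a minimal sketch of that pipeline with the official mongodb driver (the connection string, database, collection and the createdAt field are assumptions; adjust them to your schema):

const { MongoClient } = require("mongodb");

async function snapshotLastMonth() {
  const client = await MongoClient.connect("mongodb://localhost:27017");
  const db = client.db("logs_db");
  const since = new Date(Date.now() - 30 * 24 * 60 * 60 * 1000);
  // $match selects roughly the last month of logs, $out writes them to a
  // separate collection that is replaced on every run
  await db
    .collection("logs")
    .aggregate([
      { $match: { createdAt: { $gte: since } } },
      { $out: "logs_last_month" },
    ])
    .toArray(); // iterating the cursor is what actually executes $out
  await client.close();
}

snapshotLastMonth().catch(console.error);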
If you become certain that you will need a second server for report generation, you can still use replication and perform the aggregation on the secondary. However, I would really recommend building a proper replica set (consisting of a primary, a secondary and an arbiter) and leaving replication active at all times. Not only will that make sure your data isn't outdated when you generate your reports, it also gives you the important benefit of automatic failover should your primary go down for some reason.
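For reference, a replica set of that shape can be initiated from the mongo shell roughly like this (hostnames are placeholders):

rs.initiate({
  _id: "rs0",
  members: [
    { _id: 0, host: "db1.example.com:27017" },                          // primary candidate
    { _id: 1, host: "db2.example.com:27017" },                          // secondary (run reports here)
    { _id: 2, host: "arbiter.example.com:27017", arbiterOnly: true }    // arbiter, holds no data
  ]
});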
