I'm supplying client drivers for a database I maintain. The DB has many tables with well-defined schemas (Cassandra in this case).
From time to time there will be breaking changes (stemming from product and system requirements), and the clients will "break" in the sense that the queries they have been performing until now will no longer be correct against the newer schemas.
I'm curious to know if there is a good, clean way to "version" the clients so they work with the corresponding tables.
For instance, a naive implementation could add the version number to the table name, i.e. for every table in the DB, append a version number to the table name.
The clients would always query tables that match this naming convention. Newer breaking versions would change the table name to match the newer version and clients would be upgraded accordingly.
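To make the naive approach concrete, here is a minimal sketch in Java (assuming the DataStax driver on the client side; the version constant and the "users" table name are hypothetical) of a client that derives every table name from a pinned schema version:

```java
// Sketch of the naive approach: the client pins a schema version and derives
// every table name from it. SCHEMA_VERSION and "users" are made-up examples.
public final class TableNames {

    // Bumped whenever a breaking schema change ships along with new tables.
    private static final int SCHEMA_VERSION = 2;

    public static String versioned(String baseName) {
        return baseName + "_v" + SCHEMA_VERSION;   // e.g. "users_v2"
    }
}

// Usage when building a query:
// session.execute("SELECT * FROM " + TableNames.versioned("users") + " WHERE id = ?", id);
```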
Is there a better way to handle this?
It's also possible to keep one version number for your DB and one version number stored in your client; when a breaking change is made, you bump the database version.
When the client starts, a version check is performed, and if the versions mismatch an auto-upgrade can be done.
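A minimal sketch of that idea with the DataStax Java driver; the schema_info table, its columns, and the version constant are all hypothetical:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class SchemaVersionCheck {
    // Schema version this client build was written against (assumption: bumped on breaking changes).
    private static final int CLIENT_SCHEMA_VERSION = 3;

    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("my_keyspace")) {

            // Hypothetical single-row table holding the current schema version.
            Row row = session.execute(
                "SELECT version FROM schema_info WHERE name = 'current'").one();
            int dbVersion = (row == null) ? 0 : row.getInt("version");

            if (dbVersion > CLIENT_SCHEMA_VERSION) {
                // Database is ahead of this client: refuse to start (or trigger a client upgrade).
                throw new IllegalStateException(
                    "Schema version " + dbVersion + " requires a newer client");
            } else if (dbVersion < CLIENT_SCHEMA_VERSION) {
                // Client is ahead: apply pending migrations before serving traffic.
                // runMigrations(session, dbVersion, CLIENT_SCHEMA_VERSION);
            }
        }
    }
}
```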
I came across the same problem a few months ago. We have to load the schema according to the version that our client should support. The solution we found is as follows:
Along with the schema, one more table is created containing the following fields: version_no, ks_name, table_name, column_name, add/drop, is_loaded, with primary key (version_no, (ks_name, table_name, column_name)). Note: if you have a single keyspace, you can drop that column, or the table name can itself be written as ks_name.table_name.
Then, whenever we want to load a new version, we log the changes in that table, and when we load the previous schema again, the script makes sure the logged alterations are reversed so that it rolls back to that previous version of the schema. Make sure that you update the is_loaded field, as it is the only way to tell whether a schema is half loaded or the script failed, so it will not raise further errors. Hope it helps!
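A rough sketch of that change-log table created and populated through the Java driver. Names here are illustrative: change_type stands in for the add/drop field, and the clustering columns are listed individually, since standard CQL only groups parentheses around the partition key:

```java
import com.datastax.driver.core.Session;

public class SchemaChangeLog {

    // Creates the change-log table described above (keyspace name is hypothetical).
    public static void createChangeLogTable(Session session) {
        session.execute(
            "CREATE TABLE IF NOT EXISTS my_keyspace.schema_changes (" +
            "  version_no   int," +
            "  ks_name      text," +
            "  table_name   text," +
            "  column_name  text," +
            "  change_type  text," +        // 'add' or 'drop'
            "  is_loaded    boolean," +
            "  PRIMARY KEY (version_no, ks_name, table_name, column_name)" +
            ")");
    }

    // Records a single alteration belonging to a schema version.
    public static void logChange(Session session, int versionNo, String ksName,
                                 String tableName, String columnName,
                                 String changeType, boolean isLoaded) {
        session.execute(
            "INSERT INTO my_keyspace.schema_changes " +
            "(version_no, ks_name, table_name, column_name, change_type, is_loaded) " +
            "VALUES (?, ?, ?, ?, ?, ?)",
            versionNo, ksName, tableName, columnName, changeType, isLoaded);
    }
}
```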
I want to add a lastUpdated column to my Cassandra tables, and have this column auto-populated by Cassandra whenever a row is updated.
Does cassandra have triggers that could do this?
The goal is to not have to update the code base; just a DB migration should make this work.
Cassandra stores the write time for each column/row/partition. In fact, this is what the coordinators use to determine the latest version of the data and return it to the clients.
Depending on what you're after, you can use this metadata for your purposes without having to store it separately. There is a built-in function WRITETIME() that returns the timestamp in microseconds for when the data was written (details here).
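For example, a small Java-driver query that reads a column together with its write time (keyspace, table, and column names are made up here):

```java
import java.util.UUID;

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class WriteTimeExample {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("my_keyspace")) {

            // WRITETIME() works on regular (non-primary-key) columns and returns
            // microseconds since epoch for the last write to that column.
            Row row = session.execute(
                "SELECT email, WRITETIME(email) AS updated_at FROM users WHERE id = ?",
                UUID.fromString("00000000-0000-0000-0000-000000000001")).one();

            if (row != null) {
                long micros = row.getLong("updated_at");
                System.out.println("email last written at (ms): " + micros / 1000);
            }
        }
    }
}
```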
On the subject of database triggers, they exist in Cassandra but there hasn't been much work done for this functionality for 5 (maybe 7) years so I personally wouldn't recommend their use. I think triggers were always considered experimental and never moved on. In fact, they were deprecated in DataStax Enterprise around 5 years ago. Cheers!
We are using cached PreparedStatement for queries to DataStax Cassandra. But if we need to add new columns to a table, we need to restart our application server to recache the prepared statement.
I came across this bug in Cassandra that explains the solution:
https://datastax-oss.atlassian.net/browse/JAVA-420
It basically gives a workaround: don't use "SELECT * FROM table" in the query, but use "SELECT column_names FROM table" instead.
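For reference, a minimal sketch of that workaround with the DataStax Java driver (table and column names are invented for illustration): listing the columns explicitly keeps the prepared statement's result metadata valid when new columns are added later.

```java
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

public class PreparedQueries {
    // Workaround from JAVA-420: name the columns instead of using "*".
    public static PreparedStatement prepareSelect(Session session) {
        return session.prepare("SELECT id, name, email FROM users WHERE id = ?");
    }
}
```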
But now we came across the same issue with Delete statements. After adding a new column to a table, the Delete prepared statement does not delete a record.
I don't think we can use the same workaround as mentioned in the ticket for the Select statement, as * or a column list does not make sense when deleting a row.
Any help would be appreciated. We basically want to avoid having to restart our application server for any additions to database tables.
Easy solution that requires a little bit of coding: use JMX.
Let me explain.
In your application code, keep a cache (you can use the Guava cache implementation, for example) of all prepared statements. The key to access the cache can be, for example, the query string.
Now, expose a JMX method to clear the cache and force the application to re-prepare the queries.
Every time you update the schema, just call the appropriate method(s) to clear the cache; you don't need to restart your application.
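A minimal sketch of that setup, assuming the DataStax Java driver and Guava are on the classpath; the class, cache size, and JMX ObjectName are all made up for illustration:

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;

// MBean interface: the single operation exposed over JMX.
interface StatementCacheMBean {
    void clearCache();
}

public class StatementCache implements StatementCacheMBean {

    private final LoadingCache<String, PreparedStatement> cache;

    public StatementCache(final Session session) {
        // Cache keyed by the CQL string; a miss triggers a prepare() on the cluster.
        this.cache = CacheBuilder.newBuilder()
                .maximumSize(1000)
                .build(new CacheLoader<String, PreparedStatement>() {
                    @Override
                    public PreparedStatement load(String cql) {
                        return session.prepare(cql);
                    }
                });
    }

    // Look up (or prepare and cache) a statement for the given CQL string.
    public PreparedStatement get(String cql) {
        return cache.getUnchecked(cql);
    }

    // JMX-exposed operation: drop everything so the next get() re-prepares
    // against the current schema.
    @Override
    public void clearCache() {
        cache.invalidateAll();
    }

    // Register this instance with the platform MBean server so the operation
    // can be invoked from jconsole, jmxterm, or any other JMX client.
    public void registerMBean() throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        server.registerMBean(this, new ObjectName("com.example:type=StatementCache"));
    }
}
```

After a schema change, invoking clearCache() over JMX drops the cached statements, and the next get() re-prepares them against the current schema, with no application restart.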
I am interested in how a production DBA's processes change when running Cassandra and performing many releases over a year. During the releases, columns in tables would change frequently, and so would the number of Cassandra tables, as new features and queries are supported.
In a relational DB, in production, you create the view and BOOM, the data is already there, loaded from the view's query.
With Cassandra, does the DBA have to create a new Cassandra table AND have to write/run a script to copy all the required data into that table? Can a production level Cassandra DBA provide some pointers on their processes?
We run a small shop, so I can tell you how I manage table/keyspace changes, and that may differ from how others get it done. First, I keep a text .cql file in our (private) Git repository that has all of our tables and keyspaces in their current formats. When changes are made, I update that file. This lets other developers know what the current tables look like, without having to use SSH or DevCenter. This also has the added advantage of giving us a file that allows us to restore our schema with a single command.
If it's a small change (like adding a new column) I'll try to get that out there just prior to deploying our application code. If it's a new table, I may create that earlier, as a new table without code to use it really doesn't hurt anything.
However, if it is a significant change...such as updating/removing an existing column or changing a key...I will create it as a new table. That way, we can deploy our code to use the new table(s), and nobody ever knows that we switched something behind the scenes. Obviously, if the table needs to have data in it, I'll have export/import scripts ready ahead of time and run those right after we deploy.
Larger corporations with enterprise deployments use tools like Chef to manage their schema deployments. When you have a large number of nodes or clusters, an automated deployment tool is really the best way to go.
After adding a pair of columns to the schema, I want to select them via select *. Instead, select * returns the old set of columns and none of the new ones.
Per the documentation's recommendation, I use {prepare: true} to smooth over the difference between JavaScript floats and Cassandra ints/bigints (I don't really need the prepared statement here; it is just to resolve the ResponseError: Expected 4 or 0 byte int issue, and I also don't want to bother with query hints).
So on the first execution of select * I had 3 columns. After this, I added 2 columns to the schema. select * still returns 3 columns if it is used with {prepare: true} and 5 columns if used without it.
I want a way to reliably refresh this cache or make the Cassandra driver re-prepare statements on each app start.
I don't consider restarting database cluster a reliable way.
This is actually an issue in Cassandra that was fixed in 2.1.3 (CASSANDRA-7910). The problem is that on a schema update, the prepared statements are not evicted from the cache on the Cassandra side. If you are running a version earlier than 2.1.3 (which is likely, since 2.1.3 was released last week), there really isn't a way to work around this unless you create another, separate prepared statement that is slightly different (extra spaces or something, to produce a distinct statement).
When running with 2.1.3 and changing the table schema, C* will properly evict the relevant prepared statements from the cache, and when the driver sends another query using that statement, Cassandra will respond with an 'UNPREPARED' message, which should provoke the nodejs driver to reprepare the query and resend the request for you.
On the Node.js driver, you can programmatically clear the prepared statement metadata:
client.metadata.clearPrepared();
We're currently using the Microsoft Sync Framework 2.1 to sync data between a cloud solution and thick clients. Syncing is initiated by the clients and is download only. Both ends are using SQL Server and I'm using the SqlSyncScopeProvisioning class to provision scopes. We cannot guarantee that the clients will be running the latest version of our software, but we have full control of the cloud part and this will always be up to date.
We're considering supporting versioning of scopes so that if, for example, I modify a table to have a new column then I can keep any original scope (e.g. 'ScopeA_V1'), whilst adding another scope that covers all the same data as the first scope but also with the new column (e.g. 'ScopeA_V2'). This would allow older versions of the client to continue syncing until they had been upgraded.
In order to achieve this I'm designing data model changes in a specific way so that I can only add columns and tables, never remove, and all new columns must be nullable. In theory I think this should allow older versions of my scopes to carry on functioning even if they aren't syncing the new data.
I think I'm almost there, but I've hit a stumbling block. When I provision the new versions of existing scopes I'm getting the correctly versioned copies of my SelectChanges stored procedure, but all the table-specific stored procedures (not specific to scopes, i.e. tableA_update, tableA_delete, etc.) are not being updated, as I think the provisioner sees them as existing and doesn't think they need to be updated.
Is there a way I can get the provisioner to update the relevant stored procedures (_update, _insert, etc) so that it adds in the new parameters for the new columns with default values (null), allowing both the new and old versions of the scopes to use them?
Also if I do this then when the client is upgraded to the newer version, will it resync the new columns even though the rows have already been synced (albeit with nulls in the new columns)?
Or am I going about this the completely wrong way? Is there another way to make scopes backwards compatible with older versions?
Sync Framework out of the box doesn't support updating scope definitions to accommodate schema changes.
Creating a new scope via SetCreateProceduresForAdditionalScopeDefault will only create a new scope and a new _selectchanges stored procedure, but will re-use all the other stored procedures, tracking tables, triggers and UDTs.
I wrote a series of blog posts on what needs to be changed to accommodate schema changes here: http://jtabadero.wordpress.com/2011/03/21/modifying-sync-framework-scope-definition-part-1-introduction/
The subsequent posts show some ways to hack the provisioning scripts.
To answer your other question about whether the addition of a new column will resync that column or the row: the answer is no. First, change tracking is at the row level. Second, adding a column will not fire the triggers that update the tracking tables, which indicate whether there are changes to be synced.