Cassandra prepared statements and adding new columns

We are using cached PreparedStatement objects for queries to Cassandra via the DataStax driver. But whenever we add new columns to a table, we have to restart our application server to re-cache the prepared statements.
I came across this bug report against the DataStax Java driver, which explains a workaround:
https://datastax-oss.atlassian.net/browse/JAVA-420
It basically says not to use "SELECT * FROM table" in the query, but to list the columns explicitly: "SELECT column_names FROM table".
But now we have come across the same issue with DELETE statements. After adding a new column to a table, the prepared DELETE statement no longer deletes a record.
I don't think the workaround mentioned in the ticket for SELECT statements applies here, as neither * nor an explicit column list makes sense when deleting a row.
Any help would be appreciated. We basically want to avoid having to restart our application server for any additions to database tables.

We basically want to avoid having to restart our application server for any additions to database tables
An easy solution that requires a little bit of coding: use JMX.
Let me explain.
In your application code, keep a cache (you can use the Guava cache implementation, for example) of all prepared statements. The cache key can be, for example, the query string.
Now, expose a JMX method to clear the cache and force the application to re-prepare the queries.
Every time you update the schema, just call the appropriate method(s) to clear the cache; you don't need to restart your application.
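
Here is a minimal sketch of that idea, assuming the DataStax Java driver and Guava are on the classpath; the class, method, and ObjectName below are made up for the example:

// StatementCacheMBean.java -- standard MBean interface (the name must end in "MBean")
public interface StatementCacheMBean {
    void clearCache();
}

// StatementCache.java -- the prepared statement cache, exposed over JMX
import java.lang.management.ManagementFactory;
import javax.management.ObjectName;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;

public class StatementCache implements StatementCacheMBean {

    private final LoadingCache<String, PreparedStatement> cache;

    public StatementCache(final Session session) {
        // Prepare lazily: the first get() for a query string prepares it;
        // subsequent calls return the cached PreparedStatement.
        this.cache = CacheBuilder.newBuilder()
                .build(new CacheLoader<String, PreparedStatement>() {
                    @Override
                    public PreparedStatement load(String query) {
                        return session.prepare(query);
                    }
                });
    }

    public PreparedStatement get(String query) {
        return cache.getUnchecked(query);
    }

    @Override
    public void clearCache() {
        // Invoked over JMX after a schema change; the next get() for each
        // query re-prepares it against the new schema.
        cache.invalidateAll();
    }

    public void register() throws Exception {
        ManagementFactory.getPlatformMBeanServer()
                .registerMBean(this, new ObjectName("myapp:type=StatementCache"));
    }
}

After an ALTER TABLE, connect with jconsole (or any JMX client) and invoke clearCache(); no restart needed.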

Related

Is Celery still necessary in Django?

I'm creating a web app with Django 3.1, and there are lots of DB interactions, mostly between three tables. The queries mostly use results from recent inputs: query1 runs and updates table1, query2 uses table1 to update table2, and query3 uses the column updated by query2 to update other columns of table2. All of these run every time users input or update info.
Perhaps a visual will be clearer.
query1 = Model1.objects.filter(...).annotate(...)
query2 = Model2.objects.filter(...).update(A=Subquery(query1.values(...)[:1]))  # Subquery from django.db.models
query3 = Model2.objects.filter(...).update(B=F("A") * F("C"))  # F from django.db.models
I'm beginning to worry about speed between Python and PostgreSQL, and about losing data when multiple users start using it at the same time. I read about Celery and Django's asynchronous support, but it's not clear whether I need Celery or not.
This is a very simplified version, but you get the gist. Can someone help me out here, please?
You would consider using Celery if your Django view has a long-running task and you don't want the user to wait for completion or the application server to time out. If the database updates are quick, then you probably do not need it. PostgreSQL is a multi-user database, so you do not need to worry too much about users clobbering other users' changes; if the three updates must be applied together, wrap them in a single transaction (e.g., Django's transaction.atomic).

Is there an idiomatic way of versioning clients to a database?

I'm supplying client drivers to a database I am maintaining. The DB has lots of tables with well defined schemas. (Cassandra in this case)
From time to time there will be some breaking changes (stemming from product and system requirements), and the clients will "break" in the sense that the queries they were performing until now will no longer be correct with regard to the newer schemas.
I'm curious to know if there is a good clean way to "version" the clients to work with the corresponding tables?
For instance, a naive implementation could append a version number to every table name in the DB (e.g., users_v2).
The clients would always query tables that match this naming convention. Newer breaking versions would change the table name to match the newer version, and clients would be upgraded accordingly.
Is there a better way to handle this?
It's also possible to keep one version number for your DB and one stored in your client; when a breaking change is made, you bump the database version.
When the client starts, a version check is performed, and if the versions mismatch, an auto-upgrade can be done.
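
A minimal sketch of that startup check with the DataStax Java driver; the schema_info table, its columns, and the hard-coded version are all assumptions for the example:

import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class SchemaVersionCheck {

    // The schema version this client release was built against.
    private static final int CLIENT_VERSION = 3;

    public static void checkOnStartup(Session session) {
        // Assumed single-row table: schema_info(id text PRIMARY KEY, version_no int)
        Row row = session.execute(
                "SELECT version_no FROM schema_info WHERE id = 'current'").one();
        int dbVersion = row.getInt("version_no");
        if (dbVersion != CLIENT_VERSION) {
            // Either run the migrations between the two versions here
            // (the auto-upgrade), or refuse to start with a clear message.
            throw new IllegalStateException("Client built for schema v"
                    + CLIENT_VERSION + " but database is at v" + dbVersion);
        }
    }
}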
I came across the same problem a few months ago. We have to load the schema according to the version our client should support. The solution we found is as follows:
Along with the schema, one more table is created with the following fields: version_no, ks_name, table_name, column_name, an add/drop flag, and is_loaded, with PRIMARY KEY ((version_no), ks_name, table_name, column_name). Note: if you have a single keyspace, you can remove that column, or the table name can itself be written as ks_name.table_name.
Then, whenever we want to load a new version, we log the changes in that table, and when we load a previous schema again, the script makes sure the old alterations are undone so that it rolls back to that previous version of the schema. Make sure you update the is_loaded field, as it is the only way to tell whether a schema is only half loaded because the script failed, so that it does not raise further errors. Hope it helps!
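
A sketch of that bookkeeping table, issued through the Java driver; the CQL types, the change_type column standing in for the add/drop flag, and making is_loaded static are my assumptions:

import com.datastax.driver.core.Session;

public class MigrationLog {

    public static void createVersionTable(Session session) {
        // One row per column-level change; version_no is the partition key,
        // so all changes for a version can be read (and rolled back) together.
        session.execute(
            "CREATE TABLE IF NOT EXISTS schema_versions ("
            + "  version_no int,"
            + "  ks_name text,"
            + "  table_name text,"
            + "  column_name text,"
            + "  change_type text,"          // 'add' or 'drop'
            + "  is_loaded boolean STATIC,"  // one flag per version
            + "  PRIMARY KEY ((version_no), ks_name, table_name, column_name))");
    }

    // Flip the flag only after every ALTER for the version succeeded, so a
    // half-loaded schema can be detected on the next run.
    public static void markLoaded(Session session, int versionNo) {
        session.execute(
            "UPDATE schema_versions SET is_loaded = true WHERE version_no = ?",
            versionNo);
    }
}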

MemSQL - why can't I do a cross-database INSERT ... SELECT?

I'm trying to do a simple insert with a field list from a table in one database to a table in another.
insert into db_a.target_table (field1,field2,field3) select field1,field2,field3 from db_b.source_table;
The error message seems straightforward:
MemSQL does not support this type of query: Cross-database INSERT ... SELECT
Oddly enough, this example does work:
insert into db_a.target_table select * from db_b.source_table;
But this seems like such a common scenario. Has anyone run into a similar issue, and were you able to work around it?
Unfortunately, this isn't allowed because it is difficult to keep such queries transactional; multi-statement transactions are used internally to guarantee transactionality of the single insert-select (if one partition fails (dup key or something), we want to roll back everything!). Since we don't have cross-db multi-statement transactions (yet!), we don't have cross-db insert-select (yet!).
Stay tuned for nicer solutions.
However, if you REALLY want to do this, here is what you do.
PROCEED AT YOUR OWN RISK. THIS IS NOT A SUPPORTED PROCEDURE.
But it should work.
1) On db_a, create a table with the same columns as source_table, but make the shard key SHARD() (i.e., keyless sharding).
2) On db_b, run SHOW PARTITIONS.
3) For each of those partitions, create a connection to db_b_<ordinal> on the host and port listed in SHOW PARTITIONS. Run SHOW DATABASES on that connection and you'll see some databases called db_a_<another>. Pick one; it doesn't matter which. Run INSERT INTO db_a_<another>.source_table SELECT * FROM db_b_<ordinal>.source_table. (A scripted sketch of steps 2 and 3 follows below.)
3.5) At this point, you haven't yet written to a table you care about, but now we will. Look at db_a.source_table. Is everything correct? Is all the data there? Run SHOW CREATE TABLE and double-check that the shard key is SHARD KEY () (it should be in the comments). Everything look good? OK, we can proceed.
4) After you're done doing this for EVERY partition, you can do INSERT INTO db_a.target_table (cols) SELECT cols FROM db_a.source_table, or whatever you want.
Good luck!
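
For anyone scripting steps 2 and 3, here is a rough JDBC sketch; MemSQL speaks the MySQL wire protocol, but the SHOW PARTITIONS syntax and result column names, the URL format, and the choice of db_a_0 as the staging partition are all assumptions to check against your cluster:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class CrossDbCopy {

    public static void copyAllPartitions(Connection aggregator,
                                         String user, String password) throws Exception {
        try (Statement st = aggregator.createStatement();
             ResultSet rs = st.executeQuery("SHOW PARTITIONS ON db_b")) {
            while (rs.next()) {
                int ordinal = rs.getInt("Ordinal");   // assumed result columns
                String host = rs.getString("Host");
                int port = rs.getInt("Port");
                // Step 3: connect straight to the leaf that owns this partition...
                String url = "jdbc:mysql://" + host + ":" + port + "/";
                try (Connection leaf = DriverManager.getConnection(url, user, password);
                     Statement leafSt = leaf.createStatement()) {
                    // ...and copy its slice into the keyless staging copy in db_a.
                    leafSt.executeUpdate("INSERT INTO db_a_0.source_table"
                            + " SELECT * FROM db_b_" + ordinal + ".source_table");
                }
            }
        }
    }
}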

How do production Cassandra DBAs do table changes & additions?

I am interested in how a production Cassandra DBA's processes change when performing many releases over a year. During the releases, columns in tables change frequently, and so does the number of Cassandra tables, as new features and queries are supported.
In a relational DB, in production, you create the 'view' and BOOM, the data is already there, loaded from the view's query.
With Cassandra, does the DBA have to create a new Cassandra table AND write/run a script to copy all the required data into that table? Can a production-level Cassandra DBA provide some pointers on their processes?
We run a small shop, so I can tell you how I manage table/keyspace changes, and that may differ from how others get it done. First, I keep a text .cql file in our (private) Git repository that has all of our tables and keyspaces in their current form. When changes are made, I update that file. This lets other developers know what the current tables look like without having to use SSH or DevCenter. It also has the added advantage of giving us a file that allows us to restore our schema with a single command (e.g., with cqlsh -f).
If it's a small change (like adding a new column) I'll try to get that out there just prior to deploying our application code. If it's a new table, I may create that earlier, as a new table without code to use it really doesn't hurt anything.
However, if it is a significant change...such as updating/removing an existing column or changing a key...I will create it as a new table. That way, we can deploy our code to use the new table(s), and nobody ever knows that we switched something behind the scenes. Obviously, if the table needs to have data in it, I'll have export/import scripts ready ahead of time and run those right after we deploy.
Larger corporations with enterprise deployments use tools like Chef to manage their schema deployments. When you have a large number of nodes or clusters, an automated deployment tool is really the best way to go.

Cassandra nodejs DataStax driver doesn't return newly added columns via prepared statement execution

After adding a pair of columns to the schema, I want to select them via SELECT *. Instead, SELECT * returns the old set of columns and none of the new ones.
Following the documentation's recommendation, I use {prepare: true} to smooth over the difference between JavaScript floats and Cassandra ints/bigints (I don't really need a prepared statement here; it is just to resolve the ResponseError: Expected 4 or 0 byte int issue, and I also don't want to bother myself with query hints).
So on the first execution of SELECT * I had 3 columns. After this, I added 2 columns to the schema. SELECT * still returns 3 columns if used with {prepare: true}, and 5 columns if used without it.
I want a way to reliably refresh this cache, or to make the Cassandra driver prepare statements on each app start.
I don't consider restarting the database cluster a reliable way.
This is actually an issue in Cassandra that was fixed in 2.1.3 (CASSANDRA-7910). The problem is that on a schema update, the prepared statements are not evicted from the cache on the Cassandra side. If you are running a version earlier than 2.1.3 (which is likely, since 2.1.3 was released last week), there really isn't a way to work around this unless you create another, slightly different prepared statement (with extra spaces or something, to produce a separate unique statement).
When running with 2.1.3 and changing the table schema, C* will properly evict the relevant prepared statements from the cache, and when the driver sends another query using that statement, Cassandra will respond with an 'UNPREPARED' message, which should provoke the nodejs driver to re-prepare the query and resend the request for you.
On the Node.js driver, you can programmatically clear the prepared statement metadata:
client.metadata.clearPrepared();
