How do production Cassandra DBA's do table changes & additions? - cassandra

I am interested in how the Cassandra production DBA's processes change when using Cassandra and performing many releases over a year. During the releases, columns in tables would change frequently and so would the number of Cassandra tables, as new features and queries are supported.
In the relational DB, in production, you create the 'view' and BOOM you get the data already there - loaded from the view's query.
With Cassandra, does the DBA have to create a new Cassandra table AND have to write/run a script to copy all the required data into that table? Can a production level Cassandra DBA provide some pointers on their processes?

We run a small shop, so I can tell you how I manage table/keyspace changes, and that may differ from how others get it done. First, I keep a text .cql file in our (private) Git repository that has all of our tables and keyspaces in their current formats. When changes are made, I update that file. This lets other developers know what the current tables look like, without having to use SSH or DevCenter. This also has the added advantage of giving us a file that allows us to restore our schema with a single command.
If it's a small change (like adding a new column) I'll try to get that out there just prior to deploying our application code. If it's a new table, I may create that earlier, as a new table without code to use it really doesn't hurt anything.
However, if it is a significant change...such as updating/removing an existing column or changing a key...I will create it as a new table. That way, we can deploy our code to use the new table(s), and nobody ever knows that we switched something behind the scenes. Obviously, if the table needs to have data in it, I'll have export/import scripts ready ahead of time and run those right after we deploy.
Larger corporations with enterprise deployments use tools like Chef to manage their schema deployments. When you have a large number of nodes or clusters, an automated deployment tool is really the best way to go.

Related

Redis and Postgresql synchronization (online users status)

In an NodeJS application I have to maintain a "who was online in the last N minutes" state. Since there is potentially thousands of online users - for performance reasons - I decided to not update my Postgresql user table for this task.
I choosed to use Redis to manage the online status. It's very easy and efficient.
But now I want to make complex queries to the user table, sorted by the online status.
I was thinking of creating a online table filled every minute from a Redis snapshot, but I'm not sure it's the best solution.
Following the table filling, will the next query referencing the online table take a big hit caused by the new indexes creation or loading?
Does anyone know a better solution?
I had to solve almost this exact same issue, but I took a different approach because I Didn't like the issues caused by trying to mix Redis and Postgres.
My solution was to collect the online data in a queue (Zero MQ in my case) but any queueing system should work, or a stream processing facility like Amazon Kinesis (The alternative I looked at.) I then inserted the data in batches into a second table (not the users table). I don't delete or update that table, only inserts and queries are allowed.
Doing things this way preserved the ability to do joins between the last online data and the users table without bogging down the database or creating many updates on the user tables. It has the side effect of giving us a lot of other useful data.
One thing to note that I have though about when thinking of other solutions to this problem is that your users table in transactional data(OLTP) while the latest online information is really analytics data (OLAP), so if you have a data warehouse, data lake, big data, or whatever term of the week you want to use for storing this type of data and querying against it that may be a better solution.

How to optimize knex migration?

i'm working on a project that has been using bookshelfjs (with knexjs migration system) since its beginning (1 year and a half).
We now have a little bit less than 80 migrations and it's starting to take a lot of time (more than 2 minutes) to run all migrations. We are deploying using continuous integration so the migrations have to be run in the test process and in the deployment process.
I'd like to know how to optimize that. Is that possible to start from a clean state ? I don't care about losing rollback possibilities. The project is much more mature right now and we don't need to iterate much anymore on the data structure part.
Is there any best practice ? I'm coming from the Doctrine (PHP) world and it's really different.
Thanks for your advice !
Create database dump from your current database state.
Always use that dump to initialize new database for tests
Run migrations on top of already initialized database
In that way migration system applies only newly added migrations to top of existing initial dump.
When using knex.schema.createTable to create table with foregin keys from another table, and later when you run knex migrate:latest, the table with foreign keys should be processed before the one using the foreign keys. For example, table1 has foreign key key1 from talbe2, to make sure table2 is processed first, you can add numbers before the name of the table. Then in your migrations folder, there will be 1table2.js, 2table1.js. This looks hacky and not pretty, but it works!

Alternative of Cassandra for storing User data with high IO

We are looking for a technology stack which will have the following criteria.
We will be having around 10 million customer.
Each customer will be having around 20MB+ of data.
Data of each user will be updated everyday.
We need to store the data for more than six months.
We may need to query on the data any time within the time span of six months.
Currently we are thinking to use Cassandra, but the limitation of maximum storage per node in Cassandra should be less than 3TB, we are looking for other alternatives to use with or without Cassandra.
Well, I don't know if my suggestion applies for your case. We had a similar case with one of our products. There was created a blob field to record binary data, as pdf documents, that made the database grew considerably.
The solution we made was to create a second database, as a repository for records older then one year. At the application server there's a service running which:
1) Copies the records, from specific tables, older then one year to this second database;
2) Deletes records from the main database, once we have a copy in the other side;
3) Queries that need data older then one year are directed to this second database;
Sure, we had to do some implementations on the code to adapt to this situation, but is running good so far.
You can try ScyllaDB. It's a C++ reimplementation of Cassandra at 10x the speed. Scylla supports 10TB/node and there are examples of larger amounts per node. Proper disclosure - I work there but am speaking from experience.
You can definitely consider just to store the metadata itself in the database and the blobs on a separate nodes outside but it's complex and Scylla can store it all altogether. Such a similar system is already in production and we hope that user will eventually open source it

Mongodb, can i trigger secondary replication only at the given time or manually?

I'm not a mongodb expert, so I'm a little unsure about server setup now.
I have a single instance running mongo3.0.2 with wiredtiger, accepting both read and write ops. It collects logs from client, so write load is decent. Once a day I want to process this logs and calculate some metrics using aggregation framework, data set to process is something like all logs from last month and all calculation takes about 5-6 hours.
I'm thinking about splitting write and read to avoid locks on my collections (server continues to write logs while i'm reading, newly written logs may match my queries, but i can skip them, because i don't need 100% accuracy).
In other words, i want to make a setup with a secondary for read, where replication is not performing continuously, but starts in a configured time or better is triggered before all read operations are started.
I'm making all my processing from node.js so one option i see here is to export data created in some period like [yesterday, today] and import it to read instance by myself and make calculations after import is done. I was looking on replica set and master/slave replication as possible setups but i didn't get how to config it to achieve the described scenario.
So maybe i wrong and miss something here? Are there any other options to achieve this?
Your idea of using a replica-set is flawed for several reasons.
First, a replica-set always replicates the whole mongod instance. You can't enable it for individual collections, and certainly not only for specific documents of a collection.
Second, deactivating replication and enabling it before you start your report generation is not a good idea either. When you enable replication, the new slave will not be immediately up-to-date. It will take a while until it has processed the changes since its last contact with the master. There is no way to tell how long this will take (you can check how far a secondary is behind the primary using rs.status() and comparing the secondaries optimeDate with its lastHeartbeat date).
But when you want to perform data-mining on a subset of your documents selected by timespan, there is another solution.
Transfer the documents you want to analyze to a new collection. You can do this with an aggregation pipeline consisting only of a $match which matches the documents from the last month followed by an $out. The out-operator specifies that the results of the aggregation are not sent to the application/shell, but instead written to a new collection (which is automatically emptied before this happens). You can then perform your reporting on the new collection without locking the actual one. It also has the advantage that you are now operating on a much smaller collection, so queries will be faster, especially those which can't use indexes. Also, your data won't change between your aggregations, so your reports won't have any inconsistencies between them due to data changing between them.
When you are certain that you will need a second server for report generation, you can still use replication and perform the aggregation on the secondary. However, I would really recommend you to build a proper replica-set (consisting of primary, secondary and an arbiter) and leave replication active at all times. Not only will that make sure that your data isn't outdated when you generate your reports, it also gives you the important benefit of automatic failover should your primary go down for some reason.

SQL Azure Elastic Scale - reference table

I am not sure if I understand the idea of reference tables correctly. In my mind it is a table that contains the same data in every shard. Am I wrong? I am asking because I have no idea how should I insert data to the reference table to make the data multiply in every shard. Or maybe it is impossible? Can anyone clarify this issue?
Yes, the idea of a Reference Table is that the same data is contained on every shard. If you have small numbers of shards and data changes are rare, you can open multiple connections in your application and apply the changes to multiple DBs simultaneously. Or you can construct a management script that iterates periodically across all shards to update reference data or performs a bulk-insert of a fresh image of rows.
The new feature previewing in Azure SQL Database called Elastic DB Jobs, which allows you to define a SQL script for operations that you want to take place on all shards, and then runs the script asynchronously with eventual completion guarantees. You can potentially use this to update reference tables. Details on the feature are here.

Resources