Data analysis of prod Scylla DB without hitting prod? - cassandra

Is there a way to extract data from a Scylla database for use in data analysis without directly querying from the production DB?
I want to run large queries against the data, but I don't want to take down production.

The typical way folks accomplish this is to build an "analytics DC" in the same physical data center.
So if (one of) your prod DCs is named "dc-west", you would create a new one named "dc-west-analytics" or something like that. Once the new DC's nodes are out there, change the keyspace to replicate to it. Run a repair on the new DC, and it should be ready for use.
On the app side or wherever the queries are running from, make sure it uses the LOCAL consistency levels and points to “dc-west-analytics” as its “local” DC.
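For concreteness, here is a rough sketch of what that looks like from the client side with the Node.js cassandra-driver (which also works against ScyllaDB). The keyspace name, replication factors, and contact points below are hypothetical; adjust them to your topology.

```typescript
import cassandra from "cassandra-driver";

// 1) Replicate the keyspace to the new analytics DC (run once, from cqlsh or any client).
//    Keyspace name and replication factors are hypothetical.
const alterKeyspace = `
  ALTER KEYSPACE my_keyspace WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'dc-west': 3,
    'dc-west-analytics': 3
  }`;

// 2) Analytics clients connect with the analytics DC as their "local" DC and use a
//    LOCAL_* consistency level so their reads never touch the prod DC.
const analyticsClient = new cassandra.Client({
  contactPoints: ["analytics-node-1.example.com"],
  localDataCenter: "dc-west-analytics",
  keyspace: "my_keyspace",
});

async function run() {
  await analyticsClient.connect();
  await analyticsClient.execute(alterKeyspace);
  const result = await analyticsClient.execute(
    "SELECT * FROM my_table WHERE id = ?",
    ["some-id"],
    { prepare: true, consistency: cassandra.types.consistencies.localQuorum }
  );
  console.log(result.rows);
  await analyticsClient.shutdown();
}

run().catch(console.error);
```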

In ScyllaDB Enterprise, a feature called Workload Prioritization allows you to assign CPU and I/O shares to your analytics and production workloads, isolating them from each other.
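As a hedged sketch, the service levels behind Workload Prioritization are managed with CQL statements along these lines; the role names and share values here are made up, so double-check the exact SERVICE_LEVEL syntax against your ScyllaDB Enterprise version.

```typescript
// Rough sketch of Workload Prioritization via service levels (ScyllaDB Enterprise).
import cassandra from "cassandra-driver";

const client = new cassandra.Client({
  contactPoints: ["scylla-node-1.example.com"],
  localDataCenter: "dc-west",
});

async function configureServiceLevels() {
  // Give the OLTP workload most of the shares, and analytics a smaller slice.
  await client.execute("CREATE SERVICE_LEVEL IF NOT EXISTS sl_oltp WITH shares = 800");
  await client.execute("CREATE SERVICE_LEVEL IF NOT EXISTS sl_analytics WITH shares = 200");

  // Attach each service level to the role that the corresponding app authenticates with.
  await client.execute("ATTACH SERVICE_LEVEL sl_oltp TO oltp_role");
  await client.execute("ATTACH SERVICE_LEVEL sl_analytics TO analytics_role");

  await client.shutdown();
}

configureServiceLevels().catch(console.error);
```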

Related

How to migrate data between Apache Pulsar clusters?

How do folks migrate data between Pulsar environments, either for disaster recovery or blue-green deployment? Are they copying data to a new AWS region or a K8s namespace?
One of the easiest approaches is to rely on geo-replication to replicate the data across different clusters.
This PIP was just created to address this exact scenario of blue-green deployments, and to address issues with a cluster becoming non-functional or corrupted when an upgrade or update (perhaps an experimental settings change) goes wrong: https://github.com/pkumar-singh/pulsar/wiki/PIP-95:-Live-migration-of-producer-consumer-or-reader-from-one-Pulsar-cluster-to-another
Until then, aside from using geo-replication, another potential way around this is an active-active setup (with two live clusters) and a DNS endpoint that producers and consumers use, which can be cut over from the old cluster to the new cluster just before taking the old cluster down. If the latency blip this approach would cause isn't acceptable, you could replicate just the ingest topics from the old cluster to the new cluster, cut over your consumers, and then cut over your producers. Depending on how your flows are designed, you may need to replicate additional topics to make that happen. Keep in mind that on the cloud there can be different cost implications depending on how you do this. (Network traffic across regions is typically much more expensive than traffic within a single zone.)
If you want to keep the clusters completely separate (meaning you avoid Pulsar geo-replication), you can splice a Pulsar replicator function into each of your flows to manually produce your messages to the new cluster until the traffic is cut over. If you do that, just be mindful that the function must have its own subscription on each topic you deploy it to, to ensure that you're not losing messages.
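As an illustration of that last option, here is a minimal sketch of such a replicator written against the Node.js pulsar-client package; the service URLs, topic, and subscription names are hypothetical.

```typescript
// Manual replicator sketch: consume from the old cluster and re-produce to the new one
// until traffic is fully cut over.
import Pulsar from "pulsar-client";

async function replicate() {
  const oldCluster = new Pulsar.Client({ serviceUrl: "pulsar://old-cluster:6650" });
  const newCluster = new Pulsar.Client({ serviceUrl: "pulsar://new-cluster:6650" });

  // The replicator needs its own subscription so it sees every message
  // independently of the real consumers.
  const consumer = await oldCluster.subscribe({
    topic: "persistent://public/default/ingest",
    subscription: "replicator-sub",
    subscriptionType: "Failover",
  });
  const producer = await newCluster.createProducer({
    topic: "persistent://public/default/ingest",
  });

  for (;;) {
    const msg = await consumer.receive();
    await producer.send({ data: msg.getData() }); // copy the payload across
    consumer.acknowledge(msg); // only ack once it has been safely re-produced
  }
}

replicate().catch(console.error);
```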

Azure Cosmos DB: How to create read replicas for a specific container

In Azure Cosmos DB, is it possible to create multiple read replicas at a database / container / partition key level to increase read throughput? I have several containers that will need more than 10K RU/s per logical partition key, and re-designing my partition key logic is not an option right now. Thus, I'm thinking of replicating data (eventual consistency is fine) several times.
I know Azure offers global distribution with Cosmos DB, but what I'm looking for is replication within the same region, and ideally not a full database replication but a container replication. Container-level replication would be more cost-effective, since I don't need to replicate most containers and I need to replicate the others up to 10 times.
A few options are available though:
- Within the same region there is no built-in replication option, but you could use the Change Feed to replicate to another DB (with the re-design in mind) purely for serving read queries.
- It might be a better idea to use the serverless option (currently in preview) or the autoscale option. You can also look at provisioned throughput and reserve the provisioned RUs for 1 or 3 years, paying monthly just as you would in the PAYG model but with a large discount.
- Another option is to run a latency test from a VM in the region where your main DB and app are running, find the closest region in terms of latency (ms), and, if the latency is bearable, use global replication to that region and start using it. I use this tool for latency tests, but run it from a VM within the region where your app/DB is running.
My guess is your queries are all cross-partition and each query is consuming a ton of RU/s. Unfortunately there is no feature in Cosmos DB to help in the way you're asking.
Your options are to create more containers, use change feed to replicate data to them in-region, and then add some sort of routing mechanism to route requests in your app. You will of course only get eventual consistency with this, so if you have high concurrency needs this won't work. Your only real option then is to address the scalability issues with your design today.
Something that can help is this live data migrator. You can use it to keep a second container in sync with your original, which will allow you to eventually migrate off your first design to another that scales better.
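To make the change-feed idea from the answers above more concrete, here is a rough sketch using @azure/cosmos. The endpoint, database, and container names are hypothetical, and the exact change feed iterator API varies between SDK versions, so treat this as a starting point rather than a drop-in solution.

```typescript
// Copy changes from the source container into an in-region "read replica" container
// using the change feed, then route reads to the replica in the app.
import { CosmosClient } from "@azure/cosmos";

const client = new CosmosClient({
  endpoint: "https://my-account.documents.azure.com",
  key: process.env.COSMOS_KEY!,
});
const db = client.database("mydb");
const source = db.container("orders");
const replica = db.container("orders-read-replica");

// One pass over the change feed: upsert every changed document into the replica.
async function syncOnce() {
  const iterator = source.items.changeFeed({ startFromBeginning: true });
  while (iterator.hasMoreResults) {
    const { result: changes } = await iterator.fetchNext();
    for (const doc of changes ?? []) {
      await replica.items.upsert(doc); // replica is eventually consistent only
    }
  }
}

// Simple routing: writes go to the source container, reads go to the replica.
async function readOrder(id: string, partitionKey: string) {
  const { resource } = await replica.item(id, partitionKey).read();
  return resource;
}
```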

Which MongoDB scaling strategy (Sharding, Replication) is suitable for concurrent connections?

Consider this scenario:
I have multiple devclouds (remote workplaces for developers); they are all virtual machines running on the same bare-metal server.
Until now, each developer has used their own MongoDB container running on Docker, so the number of MongoDB containers can add up to over 50 instances across devclouds.
The problem is that while 50 instances are running at the same time, only about 5 people are actually performing read/write operations against their own instances, so the other 45 running instances waste the server's resources.
Should I use only one MongoDB cluster (a combined set of MongoDB instances) for everyone, so that they connect to a single endpoint (via the internal network) and avoid wasting resources?
I am considering a sharding strategy, but if one node is taken down (one VM shuts down), is that OK for availability (redundancy)?
I am pretty new to sharding and replication and look forward to hearing your solutions. Thank you.
If each developer expects to have full control over their database deployment, you can't combine the deployments. Otherwise one developer can delete all data in the deployment, etc.
If each developer expects to have access to one database, you can deploy a single replica set serving all developers and assign one database per developer (via authentication).
Sharding in the MongoDB sense (a sharded cluster) is not really going to help in this scenario, since an application generally uses all of the shards. You can of course "shard manually" by setting up multiple replica sets.
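To illustrate the one-database-per-developer approach, here is a sketch using the Node.js mongodb driver; the connection string, usernames, and database names are hypothetical.

```typescript
// One database per developer on a single replica set, enforced through authentication.
import { MongoClient } from "mongodb";

async function provisionDeveloper(devName: string) {
  // Connect as an admin user against the shared 3-member replica set.
  const admin = new MongoClient(
    "mongodb://admin:secret@mongo-1,mongo-2,mongo-3/?replicaSet=rs0&authSource=admin"
  );
  await admin.connect();

  // Create a user who is dbOwner of their own database only; they cannot touch
  // other developers' databases or drop the whole deployment.
  await admin.db(`dev_${devName}`).command({
    createUser: devName,
    pwd: "change-me",
    roles: [{ role: "dbOwner", db: `dev_${devName}` }],
  });

  await admin.close();
}

provisionDeveloper("alice").catch(console.error);
```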

CouchDB V2.0 required hardware and software configuration

I want to deploy CouchDB v2 to manage a database of 30 terabytes. Can you please suggest a minimal hardware configuration?
- Number of server
- Number of nodes
- Number of cluster
- Number of replicas
- Size of disk per couchDB instance
- etc.
Thanks!
You want 3 servers minimum due to quorum. Other than that, I would recommend at least 2 clusters of 3. If you want to be geographically dispersed, then you want a cluster of 3 in each location. I think those are the basic rules.
If it's a single database with 30 TB, I think you need some way to avoid that... Here are some ideas:
Look at the nature of the docs stored in it and see whether you can move the types of docs that are frequently accessed out to a different DB, and change the application to use that DB.
As suggested by fred above, use the 3 servers and multiple clusters.
Backup and recovery: if the database is 30 TB, the backup will also take the same amount of space. You would normally want the backup in a different datacenter, and replicating 30 TB will take a lot of time.
Read the CouchDB docs on how deletion happens; you might want to use filtered replication (see the sketch after this list), which will again take more space.
Keeping the above points in mind, you might want 3 servers, as suggested by fred, to run CouchDB for your business, and more servers to maintain backups and doc deletions over the long term.
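Here is the filtered-replication sketch mentioned above: it posts a continuous, selector-based replication document to CouchDB's /_replicator database over the HTTP API. The hostnames, credentials, and selector are hypothetical.

```typescript
// Set up a continuous, selector-filtered replication for backup purposes.
async function createFilteredReplication() {
  const auth = "Basic " + Buffer.from("admin:password").toString("base64");

  const res = await fetch("http://couch-primary:5984/_replicator", {
    method: "POST",
    headers: { "Content-Type": "application/json", Authorization: auth },
    body: JSON.stringify({
      _id: "backup-active-docs",
      source: "http://admin:password@couch-primary:5984/bigdb",
      target: "http://admin:password@couch-backup:5984/bigdb_backup",
      continuous: true,
      // Only replicate docs matching this selector, instead of all 30 TB.
      selector: { type: "active" },
    }),
  });

  console.log(res.status, await res.json());
}

createFilteredReplication().catch(console.error);
```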

Regarding process-related issues in Node.js and MongoDB

I am a novice Node.js programmer. I have a few questions regarding process-related issues like locking and race conditions in Node.js and MongoDB.
My code works perfectly in the local environment, but when I move to production and face a large number of requests, I might encounter certain issues.
How do we avoid write-level race conditions for Mongo slaves located in different regions? I.e., say one piece of data is being written locally, but the true value for it is being written remotely and is delayed.
Considering we have Node processes located regionally, would a write need to hit the Mongo master located in another region, which then routes the request to a regional slave? This considerably increases the latency of each write; how do we avoid that? Can we have direct writes to regional slaves from local processes, with some kind of replication to maintain data consistency?
I use a Node REST API and Mongoose as the MongoDB driver. Any help would be deeply appreciated. Thank you.
MongoDB's automatic failover and high availability features are provided by what's called replication. The standard MongoDB terms are "primary" for master and "secondary" for slave, so I'll use those terms to be consistent with the documentation and the user base at large. I think both of your questions are answered by one fact: in a replica set, the primary is the only member that accepts writes from clients, ever. The secondaries get the data replicated to them asynchronously a short time later. To answer the questions directly:
No writes to slaves except internal replication of writes from the primary, so no "race condition" with writes can arise.
All writes must go to the primary. The replication system will distribute the data to the secondaries asynchronously. You can read from secondaries, but it isn't a best practice despite its occasional utility. I'd suggest reading about replica set read preference and reading Asya Kamsky's blog post about scaling with replica sets before deciding to read from secondaries.
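To tie that back to Mongoose (which the question mentions), here is a small sketch; the connection string and model are hypothetical. Writes always go to the primary, and individual queries can opt into secondary reads if a bit of replication lag is acceptable.

```typescript
import mongoose from "mongoose";

async function main() {
  // Connect to the replica set; the driver discovers the current primary automatically.
  await mongoose.connect("mongodb://mongo-1,mongo-2,mongo-3/myapp?replicaSet=rs0");

  const Order = mongoose.model("Order", new mongoose.Schema({ status: String }));

  // Writes always go to the primary, wherever it currently lives.
  await Order.create({ status: "new" });

  // Reads hit the primary by default; this query explicitly allows a secondary,
  // which may return slightly stale (eventually consistent) data.
  const recent = await Order.find({ status: "new" }).read("secondaryPreferred");
  console.log(recent.length);

  await mongoose.disconnect();
}

main().catch(console.error);
```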
