Is it possible to create two or more datacentres in YugabyteDB?
Each datacentre would have its own RF, and the datacentres could be replicated asynchronously.
We are currently working on a distributed database where we read from or write to a local datacentre; the geo-datacentre is queried only if the local datacentre fails to serve us.
Is this sort of solution supported in YugabyteDB? If not, we may face write latency because the nodes are distributed across different geographical locations.
This feature is currently on our roadmap and scheduled for a beta release in 4Q 2019. You can read more about it here: YugaByte Multi-Region 2DC deployment
Is there a way to extract data from a Scylla database for use in data analysis without directly querying from the production DB?
I want to run large queries against the data but don't want to take down production.
The typical way folks accomplish this is to build an “analytics DC” in the same physical DC.
So if (one of) your prod DCs is named “dc-west”, you would create a new one named “dc-west-analytics” or something like that. Once the new DC's nodes are out there, change the keyspace to replicate to it. Run a repair on the new DC, and it should be ready for use.
On the app side or wherever the queries are running from, make sure it uses the LOCAL consistency levels and points to “dc-west-analytics” as its “local” DC.
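As a rough sketch of the "change the keyspace to replicate to it" step, the helper below only builds the CQL text for an `ALTER KEYSPACE` with `NetworkTopologyStrategy`. The keyspace name `app_data` and the replication factor of 3 per DC are assumptions for illustration, not values from the question; substitute your own.

```python
from typing import Dict


def alter_keyspace_replication(keyspace: str, dc_replicas: Dict[str, int]) -> str:
    """Build a CQL ALTER KEYSPACE statement that replicates to every listed DC."""
    if not dc_replicas:
        raise ValueError("at least one datacenter is required")
    # Each entry becomes 'dc-name': replication_factor inside the options map.
    options = ", ".join(f"'{dc}': {rf}" for dc, rf in dc_replicas.items())
    return (
        f"ALTER KEYSPACE {keyspace} WITH replication = "
        f"{{'class': 'NetworkTopologyStrategy', {options}}};"
    )


# Hypothetical keyspace and replication factors; the DC names come from the answer.
statement = alter_keyspace_replication(
    "app_data", {"dc-west": 3, "dc-west-analytics": 3}
)
print(statement)
```

After running a statement like this and repairing the new DC, the analytics client would use a LOCAL consistency level (for example `LOCAL_ONE` or `LOCAL_QUORUM`) with “dc-west-analytics” set as its local DC.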
In ScyllaDB Enterprise, a feature called Workload Prioritization allows you to assign CPU and I/O shares to your analytics and production workloads, isolating them from each other.
How do folks migrate data between Pulsar environments, either for disaster recovery or for blue-green deployment? Are they copying data to a new AWS region or K8s namespace?
One of the easiest approaches is to rely on geo-replication to replicate the data across different clusters.
This PIP was just created to address this exact scenario of blue-green deployments and to address issues with a cluster becoming non-functioning or corrupted in case an upgrade or update (perhaps of experimental setting changes) goes wrong: https://github.com/pkumar-singh/pulsar/wiki/PIP-95:-Live-migration-of-producer-consumer-or-reader-from-one-Pulsar-cluster-to-another
Until then, aside from using geo-replication, another potential way around this is an active-active setup (with two live clusters) with a DNS endpoint that producers and consumers use, which can be cut over from the old cluster to the new cluster just before the old cluster goes down. If the blip in latency this approach would cause is not acceptable, you could replicate just the ingest topics from the old cluster to the new cluster, cut over your consumers, and then cut over your producers. Depending on how your flows are designed, you may need to replicate additional topics to make that happen. Keep in mind that on the cloud there can be different cost implications depending on how you do this. (Network traffic across regions is typically much more expensive than traffic within a single zone.)
If you want to keep the clusters completely separate (meaning you avoid Pulsar geo-replication), you can splice a Pulsar replicator function into each of your flows to manually produce your messages to the new cluster until the traffic is cut over. If you do that, just be mindful that the function must have its own subscription on each topic you deploy it to, so that you're not losing messages.
Consider this scenario:
I have multiple devclouds (remote workplaces for developers); they are all virtual machines running on the same bare-metal server.
In the past, each devcloud used its own MongoDB container running on Docker, so the number of MongoDB containers can add up to over 50 instances across devclouds.
The problem is that while 50 instances are running at the same time, only 5 people actually perform read/write operations against their own instances. The other 45 running instances waste the server's resources.
Should I instead use a single MongoDB cluster, built from a set of MongoDB instances, so that everyone connects to one endpoint (via the internal network) and resources are not wasted?
I am considering a sharding strategy, but if one node is taken down (one VM shut down), is availability (redundancy) still OK?
I am pretty new to sharding and replication and look forward to hearing your solutions. Thank you.
If each developer expects to have full control over their database deployment, you can't combine the deployments. Otherwise one developer can delete all data in the deployment, etc.
If each developer expects to have access to one database, you can deploy a single replica set serving all developers and assign one database per developer (via authentication).
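To make the one-database-per-developer idea concrete, here is a minimal sketch that builds a MongoDB `createUser` command document for each developer, with a `readWrite` role scoped to that developer's database only. The `dev_<name>` naming scheme, the developer names, and the placeholder passwords are assumptions for illustration.

```python
from typing import Dict, Tuple


def create_user_command(developer: str, password: str) -> Tuple[str, Dict]:
    """Return (database name, createUser command) for one developer."""
    db_name = f"dev_{developer}"  # hypothetical naming scheme
    command = {
        "createUser": developer,
        "pwd": password,
        # The role is limited to the developer's own database.
        "roles": [{"role": "readWrite", "db": db_name}],
    }
    return db_name, command


# Hypothetical developers; real passwords should come from a secret store.
developers = {"alice": "change-me-1", "bob": "change-me-2"}
commands = [create_user_command(name, pwd) for name, pwd in developers.items()]
for db_name, command in commands:
    print(db_name, command["roles"])
```

Each command would then be run by an administrator against the replica set, for example through a driver's "run command" call on the matching database. Because a user authenticated this way only holds a role on their own database, they cannot drop or read anyone else's data.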
Sharding in the MongoDB sense (a sharded cluster) is not really going to help in this scenario, since an application generally uses all of the shards. You can of course "shard manually" by setting up multiple replica sets.
I want to deploy CouchDB 2.x to manage a database of 30 terabytes. Can you please suggest a minimal hardware configuration?
- Number of servers
- Number of nodes
- Number of clusters
- Number of replicas
- Size of disk per couchDB instance
- etc.
Thanks !
You want 3 servers minimum due to quorum. Other than that, I would recommend at least 2 clusters of 3. If you want to be geographically dispersed, then you want a cluster of 3 in each location. I think those are the basic rules.
If it's a single database with 30 TB, I think you need to find some way to avoid that... Here are some ideas:
- Look at the nature of the docs stored in it and see if you can move the types of doc that are frequently accessed to a different db, and change the application to use it.
- As suggested by fred above, use 3 servers and multiple clusters.
- Backup and recovery: if the database is 30 TB, the backup would take the same space. You would normally want the backup in a different datacenter. Replication of 30 TB will take a lot of time.
- Read the CouchDB docs on how deletion happens; you might want to use filtered replication, which will again take more space.
Keeping the above points in mind, you might want 3 servers, as suggested by fred, to run CouchDB for your business, and more servers to maintain backups and handle doc deletions over a long time.
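For a back-of-the-envelope disk estimate, the sketch below spreads the replicated data evenly over the nodes and adds headroom, since compaction writes a new file before the old one is removed. It assumes CouchDB 2.x's default of 3 replicas per shard (`n=3`); the 50% headroom figure is my own assumption, not a CouchDB recommendation.

```python
def per_node_disk_tb(data_tb: float, replicas: int, nodes: int,
                     headroom: float = 0.5) -> float:
    """Estimate disk needed per node, in TB, for an evenly sharded cluster."""
    if nodes < replicas:
        raise ValueError("need at least as many nodes as replicas")
    raw_tb = data_tb * replicas / nodes  # replicated data held by each node
    return raw_tb * (1 + headroom)       # extra room for compaction and growth


# 30 TB database with 3 replicas, on a 3-node and on a 6-node cluster.
three_nodes = per_node_disk_tb(30, replicas=3, nodes=3)
six_nodes = per_node_disk_tb(30, replicas=3, nodes=6)
print(three_nodes, six_nodes)  # 45.0 22.5
```

On a 3-node cluster every node holds a full copy, so adding nodes is the only way to bring the per-node disk below the database size. A second cluster or an off-site backup needs the same calculation again on top of this.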
I'm investigating two alternatives for using a Hadoop cluster. The first is HDInsight (with either Blob or HDFS storage); the second is deploying a powerful Windows Server on Microsoft Azure and running HDP (Hortonworks Data Platform) on it (using virtualization). The second alternative gives me more flexibility, but what I'm interested in is the overhead of each alternative. Any ideas on that? In particular, what is the effect of Blob storage on efficiency?
This is a pretty broad question, so an answer of "it depends," is appropriate here. When I talk with customers, this is how I see them making the tradeoff. It's a spectrum of control at one end, and convenience on the other. Do you have specific requirements on which Linux distro or Hadoop distro you deploy? Then you will want to go with IaaS and simply deploy there. That's great, you get a lot of control, but patching and operations are still your responsibility.
We refer to HDInsight as a managed service, and what we mean by that is that we take care of running it for you (e.g., there is an SLA we provide on the cluster itself and the apps running on it, not just "can I ping the VM"). We operate that cluster, patch the OS, patch Hadoop, etc. So there is lots of convenience there, but we don't let you choose which Linux distro to use or allow you to have an arbitrary set of Hadoop bits there.
From a perf perspective, HDInsight can deploy on any Azure node size, similar to IaaS VMs (this is a new feature launched this week). On the question of Blob efficiency, you should try both out and see what you think. The nice part about Blob store is that you get more economic flexibility: you can deploy a small cluster on a massive volume of data if that cluster only needs to run on a small chunk of the data (as compared to putting it all in HDFS, where you need all of the nodes running all of the time to fit all of your data).
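To illustrate the sizing difference, here is a small sketch of the minimum node count needed just to hold a dataset in HDFS. The 100 TB dataset and 10 TB of usable disk per node are made-up numbers; the replication factor of 3 is the HDFS default.

```python
import math


def min_hdfs_nodes(data_tb: float, usable_tb_per_node: float,
                   replication: int = 3) -> int:
    """Minimum nodes that must stay running to hold all data in HDFS."""
    if usable_tb_per_node <= 0:
        raise ValueError("usable_tb_per_node must be positive")
    return math.ceil(data_tb * replication / usable_tb_per_node)


# Hypothetical cluster: 100 TB of data, 10 TB usable disk per node.
nodes_for_storage = min_hdfs_nodes(100, 10)
print(nodes_for_storage)  # 30
```

With the data in Blob storage instead, the cluster size is driven by the compute the job needs rather than by this storage floor, so a job that touches only a small slice of the data can run on a handful of nodes.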