We are building our SolrCloud servers and want to increase the replicationFactor, but we don't want to set it to 3 because we have a lot of data.
So I am wondering whether it makes sense to set replicationFactor to 2, what the impact would be, and whether this could cause problems with replica leader election, such as split brain, etc.?
Thanks
replicationFactor will not affect whether a split-brain situation arises or not. The cluster details are stored in ZooKeeper, and as long as you have a working ZooKeeper ensemble, Solr will not have this issue. This means you should make sure you have 2xF+1 ZooKeeper nodes (minimum 3).
From the ZooKeeper documentation:
For the ZooKeeper service to be active, there must be a majority of
non-failing machines that can communicate with each other.
To create a deployment that can tolerate the failure of F machines,
you should count on deploying 2xF+1 machines.
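To make the 2xF+1 rule concrete, here is a quick sketch (plain Python, nothing Solr-specific) of how many ZooKeeper node failures an ensemble of a given size can survive:

```python
def tolerated_failures(ensemble_size: int) -> int:
    # ZooKeeper stays available only while a strict majority of the
    # ensemble is up, so N nodes tolerate floor((N - 1) / 2) failures.
    return (ensemble_size - 1) // 2

for n in (1, 2, 3, 5, 7):
    print(f"{n} ZooKeeper node(s): tolerates {tolerated_failures(n)} failure(s)")
# 3 nodes tolerate 1 failure and 5 tolerate 2; note that 2 nodes tolerate
# none, which is why an even-sized ensemble buys you nothing extra.
```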
Here are some links explaining it further:
http://lucene.472066.n3.nabble.com/SolrCloud-and-split-brain-tp3989857p3989868.html
http://lucene.472066.n3.nabble.com/Whether-replicationFactor-2-makes-sense-tp4300204p4300206.html
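For reference, replicationFactor is just a parameter you pass when creating the collection through the Collections API. A minimal sketch using Python's requests (the host, collection name and shard count below are placeholders):

```python
import requests

SOLR_URL = "http://localhost:8983/solr"  # placeholder: any node of your SolrCloud cluster

# Create a collection that keeps 2 copies of every shard (replicationFactor=2).
resp = requests.get(
    f"{SOLR_URL}/admin/collections",
    params={
        "action": "CREATE",
        "name": "mycollection",   # placeholder collection name
        "numShards": 3,           # placeholder shard count
        "replicationFactor": 2,
        "wt": "json",
    },
)
resp.raise_for_status()
print(resp.json())
```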
Consider the following scenario:
I have multiple devclouds (remote workspaces for developers); they are all virtual machines running on the same bare-metal server.
In the past, each developer used their own MongoDB container running on Docker, so the number of MongoDB containers can add up to over 50 instances across the devclouds.
The problem is that while 50 instances are running at the same time, only about 5 people actually perform read/write operations against their own instances, so the other 45 running instances waste the server's resources.
Should I instead use a single MongoDB cluster (a combined set of MongoDB instances) for everyone, so that they all connect to one endpoint only (via the internal network) and stop wasting resources?
I am considering a sharding strategy, but the concern is: if one node is taken down (one VM shut down), is availability (redundancy) still OK?
I am pretty new to sharding and replication and am looking forward to your solutions. Thank you.
If each developer expects to have full control over their database deployment, you can't combine the deployments. Otherwise one developer can delete all data in the deployment, etc.
If each developer expects to have access to one database, you can deploy a single replica set serving all developers and assign one database per developer (via authentication).
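As a rough illustration of the one-database-per-developer approach (assuming the replica set runs with authentication enabled; the URI, developer name and password below are placeholders), you could create a user whose roles are scoped to their own database only:

```python
from pymongo import MongoClient

# Connect as an administrator of the shared replica set (placeholder URI).
client = MongoClient(
    "mongodb://admin:secret@mongo0,mongo1,mongo2/?replicaSet=rs0&authSource=admin"
)

developer = "alice"                      # placeholder developer name
dev_db = client[f"{developer}_db"]

# Create a user that can only read/write (and administer) this one database.
dev_db.command(
    "createUser",
    developer,
    pwd="change-me",                     # placeholder password
    roles=[{"role": "dbOwner", "db": f"{developer}_db"}],
)
```

Each developer then connects with their own credentials and can only touch their own database.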
Sharding in the MongoDB sense (a sharded cluster) is not really going to help in this scenario, since an application generally uses all of the shards. You can, of course, "shard manually" by setting up multiple replica sets.
I have a Kubernetes cluster created and an app deployed to it.
First I deployed firstapp.yaml, which created a pod and a service to expose the pod externally.
I have two nodes in the cluster, and I then made another deployment with secondapp.yaml.
I noticed that the second deployment went to a different node, although this is the desired behaviour for logical separation.
Is this something that is provided by Kubernetes? How will it manage deployments made using different files? Will they always go to separate nodes (if there are nodes provisioned)?
If not, what is the practice to follow if I want logical separation between two nodes that I want to behave as two environments, let's say a dev and a QA environment?
No, they will not necessarily go to different nodes. The scheduler determines where to put a pod based on various criteria.
As for your last question: it makes no sense. You can use namespaces/network policies to separate environments; you shouldn't care which node(s) your pods are on. That's the whole point of having a cluster.
You can use placement constraints to achieve what you ask for, but it makes no sense at all.
https://kubernetes.io/docs/concepts/configuration/assign-pod-node/
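If you want to see where the scheduler actually placed things, a small sketch with the official Kubernetes Python client (assuming a working kubeconfig and the default namespace) prints each pod together with the node it landed on:

```python
from kubernetes import client, config

config.load_kube_config()        # uses your local kubeconfig
core = client.CoreV1Api()

# Show which node the scheduler picked for every pod in the default namespace.
for pod in core.list_namespaced_pod("default").items:
    print(f"{pod.metadata.name} -> {pod.spec.node_name}")
```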
Agreed with #4c74356b41.
As an addition: it does not matter where your pods are. You can have multiple replicas of your application split between, say, 50 nodes, and they can still communicate with each other (services, service discovery, the CNI network) and share resources, etc.
And yes, this is the default behavior of Kubernetes, which you can influence with taints, tolerations, resources and limits, and node affinity and anti-affinity (you can find a lot of information about each of those in the documentation or simply by googling). Where pods are scheduled also depends on node capacity: your pod was placed on a particular node because the scheduler calculated that it had the best score, taking the conditions mentioned above into account. You can find details about the process here.
Again, as #4c74356b41 mentions, if you want to split your cluster into multiple environments, let's say for different teams or, as you mention, for dev and QA environments, you can use namespaces for that. They basically create smaller clusters inside your cluster (note that this is more of a logical separation and not a separation from a security perspective, until you add other components such as roles).
Depending on your use case, you can just add a namespace field to your deployment YAML to specify which namespace you want to deploy your pods into; it still does not matter which nodes they end up on.
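As a small sketch of the namespace approach (again using the Kubernetes Python client; the namespace names are just the dev/qa example from the question), you would create the namespaces once and then point each deployment's metadata.namespace at the right one:

```python
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

# One namespace per environment; deployments then set metadata.namespace
# to "dev" or "qa", and the scheduler is still free to use any node.
for env in ("dev", "qa"):
    core.create_namespace(
        client.V1Namespace(metadata=client.V1ObjectMeta(name=env))
    )
```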
Please note that what I wrote is oversimplified and I didn't mention many things in between, which you can easily find in most Kubernetes tutorials.
Is it possible to have a Cassandra cluster where each server runs multiple instances of Cassandra (each instance being part of the same cluster)?
I'm aware that if there's a single server in the cluster, it's possible to run multiple instances of Cassandra on it, but is it also possible to have multiple such servers in the cluster? If so, what would the configuration look like (listen address, ports, etc.)?
Even if it is possible, I understand that there might not be any performance benefit at all; I just wanted to know whether it's theoretically possible.
Yes, it's possible, and such a setup is often used for testing, for example with CCM, although that puts the instances on multiple loopback addresses (127.0.0.1, 127.0.0.2, ...). DataStax Enterprise also has so-called Multi-instance.
You need to carefully configure your instances so they don't collide, separating the ports and addresses (listen_address, native_transport_port, storage_port, the JMX port, and so on) as well as the data and log directories. Right now, using Docker is potentially the simpler way to implement it.
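As a sketch of what a CCM-style layout looks like from the client side (assuming three instances on one machine, each bound to its own loopback alias so they can all keep the default ports), the driver simply sees them as three nodes of one cluster:

```python
from cassandra.cluster import Cluster

# Three Cassandra instances on the same physical server, each with its own
# loopback address (127.0.0.1, 127.0.0.2, 127.0.0.3) and the default port 9042.
cluster = Cluster(
    contact_points=["127.0.0.1", "127.0.0.2", "127.0.0.3"],
    port=9042,
)
session = cluster.connect()

# The driver's metadata shows all instances as members of the same cluster.
for host in cluster.metadata.all_hosts():
    print(host.address, "up" if host.is_up else "down")

cluster.shutdown()
```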
But why do you need to do it? Unless you have a really beefy machine with a lot of RAM and multiple SSDs, this won't bring you any additional performance.
Yes, it is possible; I have even worked with 5 instances running on one server in a production cluster.
Trust me, it is still running, but the recurring issues I had were constantly high GC, dropped mutations and high latency, so of course it is not good to have this kind of setup.
But to answer your question: yes, it is possible, and it can be run in production as well.
I have an ASP.NET Core 2.0 Web API with relatively simple logic (a simple select on a SQL Azure DB, returning about 1000-2000 records; no joins, aggregates, functions, etc.). I have only one GET API, which is called from an Angular SPA. Both are deployed in Service Fabric as stateless services, hosted in Kestrel as self-hosting exes.
Considering the number of users and how often they refresh, I've determined there will be around 15,000 requests per minute, in other words 250 req/sec.
I'm trying to understand the different settings when creating my Service Fabric cluster.
I want to know:
How many node types? (I've determined Front-End and Back-End.)
How many nodes per node type?
What is the VM size I need to select?
I have read the Azure documentation on cluster capacity planning. While I understand the concepts, I don't have a frame of reference for determining the actual values I need to provide for the questions above.
In most places where you read about planning a cluster, they will suggest that this subject is part science and part art, because there is no easy answer to the question. It's hard to answer because it depends a lot on the complexity of your application; without knowing the internals of how it works, we can only guess at a solution.
Based on your questions, the best guidance I can give you is: measure first, measure again, measure... plan later. Your application might be memory-intensive, network-intensive, CPU- or disk-bound, and so on; the only way to find the best configuration is to understand it.
To understand your application before you make any decision about the SF structure, you could simply deploy a simple cluster with multiple node types containing one node of each VM size, measure your application's behavior on each of them, then add more nodes, spread multiple instances of your service across these nodes, and see which configuration is the best fit for each service.
1. How many node types?
I like to map node types 1:1 to the roles in your application, but that is not a law; it depends on how much resource each service consumes. If a service consumes enough resources to keep a single VM (node) busy (memory, CPU, disk, IO), it is a good candidate for its own node type. In other cases there are lightweight services for which provisioning an entire VM (node) would be a waste of resources, for example scheduled jobs, backups, and so on. In that case you could provision a set of machines shared by these services. One important thing to keep in mind when you share a node type between multiple services is that they will compete for resources (memory, CPU, network, disk), and the performance measurements you took for each service in isolation might no longer hold, so they may require more resources; the only option is to test them together.
Another point is the number of replicas. Having a single instance of your service is not reliable, so you will have to create replicas of it (I describe the right number in the next answer). In this case you end up with the service's load split across multiple nodes, which can leave the node type under-utilized; that is where you would consider putting several services on the same node type.
2. How many nodes per node type?
As stated before, it will depend on your service resource consumption, but a very basic rule is a minimum of 3 per node type.
Why 3?
Because 3 is the lowest number at which you can do a rolling upgrade and still guarantee a quorum of 51% of nodes/services/instances running.
1 node: if you have a service running 1 instance on a node type of 1 node, when you deploy a new version of your service you have to bring down this instance before the new one comes up, so you would not have any instance to serve the load while upgrading.
2 nodes: similar to 1 node, but in this case you keep only 1 node running during the upgrade; in case of failure, you wouldn't have a failover to handle the load until the new instance comes up. It is worse if you are running a stateful service, because you will have only one copy of your data during the upgrade, and in case of failure you might lose data.
3 nodes: during an upgrade you still have 2 nodes available; when the one being updated comes back, the next one is taken down and you still have 2 nodes running. In case of failure of one node, the other nodes can support the load until a new node is deployed.
3 nodes does not mean your cluster will be highly reliable; it means the chances of failure and data loss will be lower, but you might be unlucky and lose 2 nodes at the same time. As suggested in the docs, in production it is better to always keep the number of nodes at 5 or more, and to plan for a quorum of 51% of nodes/services to be available. I would recommend 5, 7 or 9 nodes in cases where you really need higher uptime (99.9999...%).
3. What VM size do I need to select?
As said before, only measurements will give this answer.
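As a back-of-the-envelope sketch of how to turn those measurements into a node count (every number below except the 250 req/sec from the question is a placeholder you would replace with real measurements):

```python
import math

target_rps = 250              # ~15,000 requests/minute from the question
measured_rps_per_node = 120   # placeholder: measured throughput of one node of the chosen VM size
headroom = 0.6                # placeholder: run nodes at ~60% of measured capacity
minimum_nodes = 5             # recommended minimum for a production node type

needed = math.ceil(target_rps / (measured_rps_per_node * headroom))
print("nodes required for the back-end node type:", max(needed, minimum_nodes))
```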
Observations:
These recommendations do not take into account planning for the primary node type. It is recommended to have at least 5 nodes in the primary node type, since that is where the SF system services are placed; they are responsible for managing the cluster, so they must be highly reliable, otherwise you risk losing control of your cluster. If you plan to share these nodes with your application services, keep in mind that your services might impact them, so you should always monitor them to check for any impact.
I want to deploy CouchDB v2 to manage a database of 30 terabytes. Can you please suggest the minimal hardware configuration?
- Number of servers
- Number of nodes
- Number of clusters
- Number of replicas
- Disk size per CouchDB instance
- etc.
Thanks!
You want 3 servers minimum due to quorum. Other than that, I would recommend at least 2 clusters of 3. If you want to be geographically dispersed, then you want a cluster of 3 in each location. I think those are the basic rules.
If it's a single database with 30 TB, I think you need some way to avoid that... Here are some ideas:
Look at the nature of the docs stored in it and see whether you can move the types of docs that are frequently accessed out to a different db, and change the application to use it.
As suggested by fred above, use 3 servers and multiple clusters.
Backup and recovery: if the database is 30 TB, the backup will also take the same space, and you would normally want the backup in a different datacenter. Replicating 30 TB will take a lot of time.
Read the CouchDB docs on how deletion happens; you might want to use filtered replication, which will again take more space.
Keeping the above points in mind, you might want 3 servers, as suggested by fred, to run CouchDB for your business, plus more servers to maintain backups and handle doc deletions over the long term.
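To give a feel for the raw numbers involved, here is a rough sizing sketch: the replica count of 3 is CouchDB 2's default n, the overhead factor is a placeholder allowance for compaction and view indexes, and the backup is counted once, as noted above.

```python
import math

data_tb = 30         # logical database size from the question
replicas = 3         # CouchDB 2 keeps n=3 copies of each shard by default
overhead = 1.3       # placeholder allowance for compaction and view indexes
backup_tb = data_tb  # one full backup copy, ideally in another datacenter

raw_tb = data_tb * replicas * overhead
for nodes in (3, 6, 12):
    per_node = math.ceil(raw_tb / nodes)
    print(f"{nodes} nodes: ~{per_node} TB of data disk per node, "
          f"plus ~{backup_tb} TB of backup storage cluster-wide")
```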