Find the proper number of shards in Elasticsearch - Node.js

I've started working with Elasticsearch. I don't have a cluster, shards, or replicas yet; I just have some nodes that are not joined into a cluster.
First, I want to improve search on my site with Elasticsearch. Now imagine I have 4 nodes; I want to know how many shards I should have on just one node.
I don't want the default 5 shards. My requirements are the following:
qps = 50
size of document = 300k
size of RAM for one node = 5 GB
How many shards are needed on one node in Elasticsearch when we have no cluster?

It's recommended to set the number of shards relative to the number of nodes: if you have 1 node, you need 1 or 2 shards. But it also depends on the number of documents; aim for roughly 1 million documents per shard.
In conclusion, use one or two shards, but if you have more than 2 million documents, you need more nodes.
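For illustration, here is a minimal sketch of setting an explicit shard count from Node.js, assuming the 7.x @elastic/elasticsearch client, a local node on localhost:9200, and a hypothetical index called products; the values of 2 shards and 0 replicas simply mirror the single-node advice above, not a general recommendation.

```typescript
import { Client } from "@elastic/elasticsearch";

// Hypothetical single-node setup; adjust the node URL to your environment.
const client = new Client({ node: "http://localhost:9200" });

async function createIndexWithExplicitShards(): Promise<void> {
  // Override the default shard count at index-creation time.
  // With one node and ~1M docs per shard, 1-2 primaries is usually enough;
  // replicas are 0 because there is no second node to host them.
  await client.indices.create({
    index: "products", // hypothetical index name
    body: {
      settings: {
        number_of_shards: 2,
        number_of_replicas: 0,
      },
    },
  });
}

createIndexWithExplicitShards().catch(console.error);
```

The shard count has to be chosen at creation time, since it cannot be changed later without reindexing.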

Related

How does cassandra scale for a single key read

How does cassandra handle large amount of reads for a single key? Think about a very popular celebrity whose twitter page is hit consistently.
You will usually have multiple replicas of each shard. Let's say your replica count is 3. Then reads for a single key can be spread over the nodes hosting those replicas. But that's the limit of the parallelism: adding more nodes to your cluster would not increase the number of replicas, so the traffic would still have to go to just those 3 nodes. There are various tricks people use for such cases, e.g. caching in the web server so it doesn't have to keep going back to the database, or denormalizing the data so it is spread over more nodes.
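As a minimal sketch of the caching trick mentioned above, using the DataStax Node.js driver (cassandra-driver); the keyspace, table, and 5-second TTL are hypothetical:

```typescript
import { Client } from "cassandra-driver";

const client = new Client({
  contactPoints: ["127.0.0.1"],
  localDataCenter: "datacenter1",
});

// Tiny in-process cache so a hot key (e.g. a celebrity profile)
// doesn't hit the same 3 replicas on every single request.
const cache = new Map<string, { value: unknown; expires: number }>();
const TTL_MS = 5_000; // hypothetical freshness window

async function getProfile(userId: string): Promise<unknown> {
  const hit = cache.get(userId);
  if (hit && hit.expires > Date.now()) return hit.value;

  // Cache miss: read from Cassandra (hypothetical keyspace/table).
  const result = await client.execute(
    "SELECT * FROM social.profiles WHERE user_id = ?",
    [userId],
    { prepare: true }
  );
  const row = result.first();
  cache.set(userId, { value: row, expires: Date.now() + TTL_MS });
  return row;
}
```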

Solr improve Search speed

How can I optimize Solr to improve search speed? I have tried different cache mechanisms, but they did not help. We are searching across 65 million records with Solr, and a search takes approximately 45 seconds; I want to search those 65 million records in approximately 5-10 seconds. Please suggest how I can reduce the search time.
I am using Apache Solr (version 5.2.1).
You can create multiple cores and split your data across them. As the data gets divided across cores, each search is limited to a single core and its smaller index, which can improve your search speed.
In my case I have data of different categories, so I created a core for each category, named after the category. When a search request comes in for a category, the request is sent only to that category's core.
The second approach is sharding, which will again split the data, with each shard holding a portion of the index data.
When the data is too large for one node, you can break it up and store it in sections by creating one or more shards. Each shard is a portion of the logical index, or core, and it's the set of all nodes containing that section of the index.
It is highly recommended that you use SolrCloud when you need to scale up or scale out.
Below are links that will help you with SolrCloud:
https://cwiki.apache.org/confluence/display/solr/Distributed+Search+with+Index+Sharding
https://cwiki.apache.org/confluence/display/solr/Shards+and+Indexing+Data+in+SolrCloud
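To make the two approaches concrete, here is a sketch of querying Solr from Node.js over HTTP (Node 18+ assumed for the global fetch); the core names, hosts, and category-to-core mapping are hypothetical, and the shards parameter in the second function shows how a manually sharded query would be expressed:

```typescript
// Route a search to the core named after its category, as described above.
// Assumes Solr on localhost:8983 and cores named after hypothetical categories.
async function searchByCategory(category: string, q: string): Promise<any> {
  const core = encodeURIComponent(category); // e.g. "books", "electronics"
  const url =
    `http://localhost:8983/solr/${core}/select` +
    `?q=${encodeURIComponent(q)}&wt=json`;
  const res = await fetch(url);
  return res.json();
}

// With manual sharding instead, one query can fan out to every shard
// by listing them in the `shards` parameter (hosts below are hypothetical).
async function searchAllShards(q: string): Promise<any> {
  const shards = [
    "solr1:8983/solr/books_shard1",
    "solr2:8983/solr/books_shard2",
  ].join(",");
  const url =
    `http://localhost:8983/solr/books_shard1/select` +
    `?q=${encodeURIComponent(q)}&wt=json&shards=${encodeURIComponent(shards)}`;
  const res = await fetch(url);
  return res.json();
}
```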

Brand new to Cassandra, having trouble understanding replication topology

So I'm taking over our Cassandra cluster after the previous admin left so I'm busy trying to learn as much as I can about it. I'm going through all the documentation on Datastax's site as we're using their product.
That said, on the replication factor part I'm having a bit of trouble understanding why I wouldn't have the replication factor set to the number of nodes I have. I have four nodes currently and one datacenter, all nodes are located in the same physical location as well.
What, if any, benefit would there be to having a replication factor of less than 4?
I'm just thinking that it would be beneficial from a fault-tolerance standpoint if each node had its own copy/replica of the data; I'm not sure why I would want fewer replicas than the number of nodes I have. Are there performance tradeoffs or other reasons? Am I COMPLETELY missing the concept here (entirely possible)?
There are a few reasons why you might not want to increase your RF from 3 to 4:
Increasing your RF effectively multiplies your original data volume by that amount. Depending on your data volume and data density, you may not want to incur the additional storage hit; keeping RF below the number of nodes is also what lets the cluster grow beyond a single node's capacity.
Depending on your consistency level you could experience a performance hit. For example, when writing at QUORUM consistency level (CL) with an RF of 3, you wait for 2 replicas to acknowledge before confirming the write to the client; with an RF of 4 you would be waiting for 3 (see the sketch after this answer).
Regardless of the CL, with an RF equal to the number of nodes every write will eventually go to every node. That is more activity on your cluster and may not perform well if your nodes aren't sized for that workload.
You mentioned fault tolerance. With an RF of 4 and reads at CL ONE, you can absorb up to 3 of your servers being down simultaneously and your app will still be up. From a fault-tolerance perspective this is pretty impressive, but also unlikely: if you have 3 nodes down at the same time in the same DC, the 4th is probably also down (natural disaster, flood, who knows...).
At the end of the day it all depends on your needs, and C* is nothing if not configurable. An RF of 3 is very common among Cassandra implementations.
Check out this deck by Joe Chu
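As a sketch of the consistency-level point above, using the DataStax Node.js driver; the keyspace and table are hypothetical. At QUORUM, the coordinator waits for floor(RF/2) + 1 replica acknowledgements: 2 of 3 when RF is 3, 3 of 4 when RF is 4.

```typescript
import { Client, types } from "cassandra-driver";

const client = new Client({
  contactPoints: ["127.0.0.1"],
  localDataCenter: "datacenter1",
  keyspace: "demo", // hypothetical keyspace
});

async function writeAtQuorum(id: types.Uuid, name: string): Promise<void> {
  // QUORUM = floor(RF / 2) + 1 replicas must acknowledge the write:
  // 2 of 3 when RF = 3, 3 of 4 when RF = 4.
  await client.execute(
    "INSERT INTO users (id, name) VALUES (?, ?)", // hypothetical table
    [id, name],
    { prepare: true, consistency: types.consistencies.quorum }
  );
}
```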
The reason why your RF is often less than the number of nodes in the cluster is explained in the post: Cassandra column family bigger than nodes drive space. This post provides insight into this interesting aspect of Cassandra replication. Here's a summary of the post:
QUESTION: ... every node has 2 TB of drive space and the column family is replicated on every node, so every node contains a full copy of it ... after some years that column family will exceed 2 TB ...
Answer: RF can be less than the number of nodes and does not need to grow when you add more nodes.
For example, if you today had 3 nodes with RF 3, each node would contain a copy of all the data, as you say. But if you then add 3 more nodes and keep RF at 3, each node will have half the data. You can keep adding more nodes so that each node contains a smaller and smaller proportion of the data ... there is no limit in principle to how big your data can be.
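Since the replication factor is a property of the keyspace rather than of the cluster size, it stays at 3 no matter how many nodes you add. A sketch of setting it with the Node.js driver (keyspace and data-center names are hypothetical):

```typescript
import { Client } from "cassandra-driver";

const client = new Client({
  contactPoints: ["127.0.0.1"],
  localDataCenter: "datacenter1",
});

async function createKeyspaceWithRf3(): Promise<void> {
  // RF 3 on a 4-node (or larger) cluster: each node holds only part of the
  // data, so total capacity can grow past a single node's 2 TB drive.
  await client.execute(
    "CREATE KEYSPACE IF NOT EXISTS demo WITH replication = " +
      "{'class': 'NetworkTopologyStrategy', 'datacenter1': 3}"
  );
}

createKeyspaceWithRf3().catch(console.error);
```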

Regarding maxIndexingThreads config in solrconfig.xml

I have a Solr cluster with 8 servers (4 shards with one replica each). I have 80 client threads indexing into this cluster. The client is running on a different machine. I am trying to figure out the optimal number of indexing threads.
Now, solrconfig.xml has a config option, maxIndexingThreads:
"The maximum number of simultaneous threads that may be indexing documents at once in IndexWriter; if more than this many threads arrive they will wait for others to finish. Default in Solr/Lucene is 8."
I want to know whether this configuration is per Solr instance or per core (or collection).
Also, is there a way to specify the number of threads used by queries?
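Independently of how maxIndexingThreads is scoped on the server, the 80 client threads mentioned above can be throttled on the Node.js side. A rough sketch, where the update URL, collection name, batch shape, and the limit of 8 concurrent requests are all hypothetical:

```typescript
// Send batches of documents to Solr with a fixed number of in-flight requests,
// mirroring the idea of limiting concurrent indexing work on the client side.
async function indexWithLimitedConcurrency(
  batches: object[][],
  concurrency = 8 // hypothetical client-side limit
): Promise<void> {
  const queue = [...batches];
  const worker = async () => {
    for (let batch = queue.shift(); batch; batch = queue.shift()) {
      // POST a JSON array of documents to the (hypothetical) collection.
      await fetch("http://localhost:8983/solr/mycollection/update?commit=false", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify(batch),
      });
    }
  };
  await Promise.all(Array.from({ length: concurrency }, worker));
}
```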

Practical Limits of ElasticSearch + Cassandra

I am planning on using Elasticsearch to index my Cassandra database. I am wondering if anyone has seen the practical limits of Elasticsearch. Do things get slow in the petabyte range? Also, has anyone had any problems using Elasticsearch to index Cassandra?
See this thread from 2011, which mentions ElasticSearch configurations with 1700 shards each of 200GB, which would be in the 1/3 petabyte range. I would expect that the architecture of ElasticSearch would support almost limitless horizontal scalability, because each shard index works separately from all other shards.
The practical limits (which would apply to any other solution as well) include the time needed to actually load that much data in the first place. Managing a Cassandra cluster (or any other distributed datastore) of that size will also involve significant workload just for maintenance, load balancing etc.
Sonian is the company kimchy alludes to in that thread. We have over a petabyte on AWS across multiple ES clusters. There isn't a technical limit to how far you can scale ES horizontally, but as DNA mentioned there are practical problems. The biggest by far is the network, and that applies to every distributed data store: you can only move so much across the wire at a time. When ES has to recover from a failure, it has to move data. The best option is to use smaller shards across more nodes (more concurrent transfer), but you risk a higher rate of failure and an exorbitant cost per byte.
As DNA mentioned, the thread talks about 1700 shards, but it is really 1700 indexes, each with 1 shard and 1 replica. So it is quite possible that those 1700 indexes are not on a single machine but are spread across multiple machines.
So this in itself is never a problem.
I am currently starting to work with Elassandra (Elasticsearch + Cassandra).
I am also having problems indexing Cassandra with Elasticsearch. My problem is basically the node configuration.
Running $ nodetool status shows each node's Host ID, and then running:
curl -XGET http://localhost:9200/_cluster/state/?pretty=true
you can check that one of the nodes has the same name as that Host ID.
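A small sketch of the same check from Node.js (assuming Node 18+ for the global fetch and Elassandra's HTTP API on localhost:9200): it fetches the cluster state and prints each node's name so it can be compared with the Host IDs from nodetool status.

```typescript
// Fetch the cluster state and print each Elasticsearch node's id and name
// so they can be compared with the Host IDs reported by `nodetool status`.
async function listEsNodeNames(): Promise<void> {
  const res = await fetch("http://localhost:9200/_cluster/state?pretty=true");
  const state: any = await res.json();
  for (const id of Object.keys(state.nodes)) {
    console.log(`${id} -> ${state.nodes[id].name}`);
  }
}

listEsNodeNames().catch(console.error);
```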
