Regarding maxIndexingThreads config in solrconfig.xml - multithreading

I have a Solr cluster with 8 servers (4 shards with one replica each). I have 80 client threads indexing into this cluster, and the client runs on a different machine. I am trying to figure out the optimal number of indexing threads.
Now, solrconfig.xml has a config for maxIndexingThreads:
"The maximum number of simultaneous threads that may be indexing documents
at once in IndexWriter; if more than this many threads arrive they will wait
for others to finish. Default in Solr/Lucene is 8. "
I want to know whether this configuration is per Solr instance or per core (or collection).
Also, is there a way to specify the number of threads used by queries?
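For reference, this setting lives in the indexConfig section of solrconfig.xml; a minimal sketch (the value is just the default quoted above, and the setting may not exist in newer Solr releases):
<indexConfig>
  <maxIndexingThreads>8</maxIndexingThreads>
</indexConfig>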

Related

Cassandra concurrent read and write

I am trying to understand Cassandra concurrent reads and writes. I came across the property
concurrent_reads (default is 8)
A good rule of thumb is 4 concurrent_reads per processor core. May increase the value for systems with fast I/O storage.
So, as per the definition (correct me if I am wrong), 4 threads can access the database concurrently. Let's say I am trying to run the following query:
SELECT max(column1) FROM testtable WHERE duration = 'month';
I am just trying to execute this query; what is the role of concurrent_reads in executing it?
That's how many active reads can run at a single time per host. This is viewable if you type nodetool tpstats, under the read stage. If the active count is pegged at the number of concurrent readers and you have a pending queue, it may be worth trying to increase this. It's pretty normal for people to have this at ~128 when using decent-sized heaps and SSDs. This is very hardware dependent, so the defaults are conservative.
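For reference, this is the knob in cassandra.yaml (128 here is just the illustrative figure from above, not a general recommendation):
# cassandra.yaml
concurrent_reads: 128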
Keep in mind that the work on these threads is very fast, usually measured in sub-millisecond times. But assuming each read takes 1 ms, even with only 4 readers, Little's law gives you a maximum of about 4000 (local) reads per second per node (1000/1 * 4). With RF=3 and quorum consistency you are doing a minimum of 2 replica reads per request, so you can divide by 2 to get a theoretical (real life is messier) max throughput.
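A quick sketch of that back-of-the-envelope arithmetic (all numbers are the illustrative ones above, not measurements):
// Little's law: throughput = concurrency / latency
val concurrentReads = 4                   // concurrent_reads
val latencySeconds = 0.001                // assume 1 ms per local read
val localReadsPerSecond = concurrentReads / latencySeconds               // 4000 per node
val replicaReadsPerRequest = 2                                           // RF=3, QUORUM
val clientReadsPerSecond = localReadsPerSecond / replicaReadsPerRequest  // ~2000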
The aggregation functions (i.e. max) are processed on the coordinator after fetching the data from the replicas (each doing a local read and sending a response), and they are not directly impacted by concurrent_reads since they are handled in the native transport and request/response stages.
From Cassandra 2.2 onward, the standard aggregate functions min, max, avg, sum and count are built in. So I don't think concurrent_reads will have any effect on your query.

find the proper number of shards in elasticsearch

I've started working with Elasticsearch. I don't have a cluster, shards, or replicas; I just have some nodes that are not in a cluster.
I want to improve search for my site with Elasticsearch. Now imagine I have 4 nodes; I want to know how many shards I should have on just one node.
I don't want the default 5 shards. My requirements are the following:
qps = 50
size of document = 300k
RAM for one node = 5 GB
How many shards are needed on one node in Elasticsearch when there is no cluster?
It's recommended to set the number of shards relative to the number of nodes: if you have 1 node, you need one or two shards. But it also depends on the number of documents; aim for roughly 1 million documents per shard.
In conclusion, use one or two shards, but if you have more than 2 million documents, you need more nodes.
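As a rough illustration of choosing the shard count yourself instead of taking the default (the index name and numbers are placeholders, and recent Elasticsearch versions also require a Content-Type: application/json header):
curl -XPUT 'http://localhost:9200/myindex' -d '
{
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 0
  }
}'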

Will Elasticsearch survive this much load or simply die?

We have an Elasticsearch server with 1 cluster of 3 nodes, and we are expecting 800-1000 queries fired per second. We want to know: if we get a load of around 1000 queries per second, will Elasticsearch respond with delays, or will it simply stop working?
Queries are all query_string, fuzzy (prefix & wildcard queries are not used).
There are a few factors to consider, assuming your network has the necessary throughput:
What's the CPU speed and number of cores for each node?
You should have 2 GHz quad cores at the very least. The nodes should also be dedicated to ELK, so they aren't busy with other tasks.
How much RAM do your nodes have?
You probably want to be north of 10 GB at least.
Are your logs filtered and indexed?
Having your logs filtered will greatly reduce the workload generated by the queries. Additionally, filtered logs can make it so that you don't have to query as much with wildcards (which are very expensive).
Hope that helps point you in a better direction :)
One immediate suggestion: if you are expecting sustained query rates of 800-1K/sec, you do not want the nodes storing the data (which will be handling indexing of new records, merging, and shard rebalancing) to also have to deal with query scatter/gather operations. Consider a client + data node topology where you keep your 3 data nodes and add n client nodes (data and master set to false in their configs, as in the sketch below). The actual value of n will vary based on your measured performance; this is something you'll want to determine via experimentation.
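A minimal sketch of what such a client (coordinating-only) node configuration could look like in elasticsearch.yml (older-style settings; newer releases use node.roles instead):
# elasticsearch.yml on each added client node
node.master: false   # never eligible to become master
node.data: false     # holds no shards; only routes queries and merges results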
Other factors being equal or unknown, abundant memory is a good resource to have. Review the Elastic team's guidance on hardware and be sure to follow the link to the discussion on heap sizing.

How does partitions map to tasks in Spark?

If I partition an RDD into, say, 60 partitions and I have a total of 20 cores spread across 20 machines (i.e. 20 single-core instances), then the number of tasks is 60 (equal to the number of partitions). Why is this beneficial over having a single partition per core and 20 tasks?
Additionally, I have run an experiment where I set the number of partitions to 2; checking the UI shows 2 tasks running at any one time. However, what has surprised me is that it switches instances on completion of tasks, e.g. node1 and node2 do the first 2 tasks, then node6 and node8 do the next set of 2 tasks, etc. I thought that by setting the number of partitions to fewer than the cores (and instances) in the cluster, the program would just use the minimum number of instances required. Can anyone explain this behaviour?
For the first question: you might want to have more granular tasks than strictly necessary in order to load less into memory at the same time. It can also help with fault tolerance, as less work needs to be redone in case of failure. It is nevertheless a tunable parameter. In general, the answer depends on the kind of workload (IO bound, memory bound, CPU bound).
As for the second one: I believe version 1.3 has some code to dynamically request resources. I'm unsure in which version the change was introduced, but older versions just request the exact resources you configure your driver with. As for how a partition moves from one node to another: AFAIK Spark will pick the data for a task from the node that has a local copy of that data on HDFS. Since HDFS has multiple copies (3 by default) of each block of data, there are multiple options for running any given piece.
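A small sketch of the partition-to-task mapping being discussed (spark-shell style, with sc already available; the numbers are the ones from the question):
// 60 partitions -> 60 tasks for the stage, even though only 20 cores
// can run tasks simultaneously, so they execute in roughly 3 waves.
val rdd = sc.parallelize(1 to 1000000, numSlices = 60)
println(rdd.partitions.length)   // 60
rdd.map(_ * 2).count()           // check the Spark UI: one stage, 60 tasks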

Optimal number of actor instances (threads) needed to load documents onto SolrCloud

I have a situation where I need to load documents from my app (millions of them) into SolrCloud, with ZooKeeper as the configuration synchronization service. I am stuck with performance issues due to the heavy flux of incoming documents. Let's say I have two Solr shards running and two ZooKeeper hosts for each shard. My approach is something like this:
import akka.actor.{Actor, Props}
import akka.routing.SmallestMailboxRouter
import org.apache.solr.client.solrj.impl.CloudSolrServer
import org.apache.solr.common.SolrInputDocument

case class loadDoc(doc: SolrInputDocument)

// Router created globally with 8 routee instances, based on some black-box tests
// showing that a single Solr instance can use 8 threads in parallel for loading.
val rtr = system.actorOf(Props(new solrCloudActor(zkHost, core))
  .withRouter(SmallestMailboxRouter(nrOfInstances = 8)))
// ...
// Repeated millions of times, depending on the number of documents:
val doc: SolrInputDocument = new SolrInputDocument()
doc.addField("key", "value")
// ...
rtr ! loadDoc(doc) // sending the doc to the router here

class solrCloudActor(zkHost: String, solrCoreName: String) extends Actor {
  val server: CloudSolrServer = new CloudSolrServer(zkHost)
  server.setDefaultCollection(solrCoreName)
  def receive = {
    case loadDoc(d) => server.add(d)
  }
}
My concerns here:
Is this approach correct? It made sense when I had a single instance of Solr and created 8 router instances of an HttpClient actor, instead of SolrCloud with ZooKeeper.
What is the optimal number of threads needed to keep Solr loading at its peak when I have millions of documents queued? Is it numofshards x some_optimal_number, does it depend on a per-shard, per-core basis, or is it the average: (numofshards x some_optimal_number + numberofcore)/numberofcore?
Do I even need to worry about parallelism at all? Can the single CloudSolrServer instance, which I initialize with all the comma-separated ZooKeeper hosts, take care of distributing the docs?
If I am going in a completely wrong direction, please suggest a better way to improve performance.
The number of actors and the number of threads are not the same thing. Actors use threads from a pool as and when they have work to do.
The number of threads that can be running concurrently is limited to the pool size which (unless otherwise specifically configured) is dynamic, but typically matches the number of cores.
So the ideal number of pooled actors is roughly the same as the number of pooled threads.
The number of pooled threads, in an ideal world, is the number of cores.
But... we don't live in an ideal world. An ideal world has no blocking operations, no network or other IO latency, no other processes competing for resources on the machine, etc. etc.
In a non-ideal (a.k.a. real) world, the best number depends on your codebase and your specific environment. Only you and your profiler can answer that one.
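As a hedged illustration of "unless otherwise specifically configured": with Akka's default fork-join dispatcher the pool size is derived from the core count, but you can pin it explicitly. The numbers below are placeholders, not a recommendation:
import akka.actor.ActorSystem
import com.typesafe.config.ConfigFactory

// Clamp the default dispatcher to exactly 8 threads, so at most 8 routees
// can be running (and blocking inside server.add) at the same time.
val tuned = ConfigFactory.parseString("""
  akka.actor.default-dispatcher.fork-join-executor {
    parallelism-min = 8
    parallelism-max = 8
  }
""")
val system = ActorSystem("indexing", tuned)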
