We suffer from "slow down of solr's indexing performance over time"
We try to measure indexing throughput of SolrCloud. 8 SolrJ clients sends update request to a solr and each client has 20 threads. (Total 160 threads sends request to solr) We believe that throughput depends on disk I/O performance of solr machine.
The results are below :
At first few time, solr uses full disk write ability. But IO amounts per time is getting lower over time. As the result indexing speed also slow down.
Test environment:
1 physical machine,
256GB RAM,
RAID-0 data volume (11disk),
16core CPU.
1 collection,
1 shard,
1 rf,
60gb HeapSize,
java8
Zookeeper is running on same machine.
CPU and network, zookeeper are not bottleneck.
Related
I deployed a Cassandra 2.2 ring composed by 4 nodes in the cloud with 8 vCPU and 8GB of ram. I am running some tests now with cassandra-stress and YCSB tools to test its performance. I am mainly interested in read requests with a small amount of write requests (95%/5%).
Running the experiments, I noticed that even setting a high number of threads (or clients) the CPU (and disk) does not saturate, but still always around the 60% of utilisation.
I am trying to figure out where is the bottleneck in my system. From the hardware point of view it seems all ok to me.
dstat
I also looked into the Cassandra configuration file to see if there are some tuning parameters to increase the system throughput. I increase the value of concurrent_read/write parameter, but it doesn't increase the performance.
The log file also does not contain any warning.
What it could be that is limiting my system?
Thanks
You might want to consider running cassandra-stress from outside the cluster and on multiple instances as described in
Usage of the Cassandra tool cassandra-stress
I have successfully installed a multi-node Cassandra cluster with 10nodes,
The nodetool status command shows every node is UP and NORMAL.
but the Performance I am getting is very bad.
here are my results:
Operations /seconds = 4000
Read Latency = 13ms
write Latency = 10ms
I am using YCSB to measure performance
Tuning that I have done till now:
Consistency level = 1
Replication Factor = 3
Heap size = 4GB
My Hardware:
Each node is a VM with CentOS
2GHZ CPU with 8 cores
8GB RAM
1GB/ps N/W
Please let me know what more settings I can tweak to get maximum performance out of my cluster.
If you have 1 system with 10 VMs running on it and 1 disk, the performance of any (not in-memory) database will be bad. Especially with spinning disks (no matter how expensive they are) is going to be a major contention point. With a really good SSD you may be able to pull off a few instances, but performance stress testing will likely always hit either that or a CPU bottleneck (if things configured correctly for system).
Pretty good chance with 4gb heaps and a stress workload you are going to be hitting GC and memory issues, do you have any monitoring around that? Can use visualvm and connect to the ip:7199 (ip set in cassandra-env.sh).
8gb of ram per vm is on the minimum spec end. You want at least 8gb of JVM heap with space for the offheap stuff and OS. A 16gb system is likely sufficient. Once again the shared disk will kill performance so it will only go so far. but should be able to do far better than 4k/sec.
This is a follow-on question to Why is my cassandra throughput not improving when I add nodes?. I have configured my client and nodes as closely as I could to what is recommended here: http://docs.datastax.com/en/cassandra/2.1/cassandra/install/installRecommendSettings.html. The whole setup is not exactly world class (the client is on a laptop with 32G of RAM and a modern'ish processor, for example). I am more interested in developing an intuition for the cassandra infrastructure at this point.
I notice that if I shut down all but one of the nodes in the cluster and run my test client against it, I get a throughput of ~120-140 inserts/s and a CPU utilization of ~30-40%. When I crank up all 6 nodes and run this one client against them, I see a throughput of ~110-120 inserts/s and my CPU utilization gets to between ~80-100%.
All my tests are run with a clean DB (I completely delete all DB files and restart from scratch) and I insert 30M rows.
My test client is multi-threaded and each thread writes exclusively to one partition using unlogged batches, as recommend by various sources for a schema like mine (e.g. https://lostechies.com/ryansvihla/2014/08/28/cassandra-batch-loading-without-the-batch-keyword/).
Is this CPU spike expected behavior?
We have a solr cloud setup of 4 shards (one shard per physical machine) having ~100 million documents. Zookeeper is on one of those 4 machines. We encounter complex queries having wild cards and proximity searches together and it sometimes takes more than 15 secs to get top 100 documents. Query traffic is very very low at the moment (2-3 queries every minute). 4 Servers hosting cloud have following specs:
(2 servers -> 64 GB RAM, 24 CPU cores, 2.4 GHz) + (2 servers -> 48 GB RAM, 24 CPU cores, 2.4GHz).
We are providing 8 GB JVM memory per shard. Our 510GB index on SSDs per machine (totalling to 4*510 GB = 2.4TB) is mapped into OS disk cache on remaining RAM on each server. So I suppose RAM is not an issue for us.
Now Interesting thing to note is: When a query is fired to the cloud, only one CPU core is utilized to 100% and rest are all at 0%. Same behaviour is replicated on all the machines. No other processes are running on these machines.
Shouldn't solr be doing multi-threading of some-kind to utilize the CPU cores? Can I anyhow increase CPU consumption for each queries as traffic is not a problem. If so, how?
A single request to a Solr shard is largely processed single-threaded (you can set threads for faceting on multiple fields). Rule of thumb is to keep document count for shards to no more than a very few hundreds of millions. You are well below that with 25M/shard, but as you say, your queries are complex. What you see is simple the effect of single-threaded processing.
The solution to your problem is to use more shards, as all shards are queried in parallel. As you have a lot of free CPU cores and very little traffic, you might want to try running 10 shards on each machine. It is not a problem for SolrCloud to use 40 shards in total and the increased merging overhead should be insignificant compared to your heavy queries.
I suspect the answer is "it depends", but is there any general guidance about what kind of hardware to plan to use for Presto?
Since Presto uses a coordinator and a set of workers, and workers run with the data, I imagine the main issues will be having sufficient RAM for the coordinator, sufficient network bandwidth for partial results sent from workers to the coordinator, etc.
If you can supply some general thoughts on how to size for this appropriately, I'd love to hear them.
Most people are running Trino (formerly PrestoSQL) on the Hadoop nodes they already have. At Facebook we typically run Presto on a few nodes within the Hadoop cluster to spread out the network load.
Generally, I'd go with the industry standard ratios for a new cluster: 2 cores and 2-4 gig of memory for each disk, with 10 gigabit networking if you can afford it. After you have a few machines (4+), benchmark using your queries on your data. It should be obvious if you need to adjust the ratios.
In terms of sizing the hardware for a cluster from scratch some things to consider:
Total data size will determine the number of disks you will need. HDFS has a large overhead so you will need lots of disks.
The ratio of CPU speed to disks depends on the ratio between hot data (the data you are working with) and the cold data (archive data). If you just starting your data warehouse you will need lots of CPUs since all the data will be new and hot. On the other hand, most physical disks can only deliver data so fast, so at some point more CPUs don't help.
The ratio of CPU speed to memory depends on the size of aggregations and joins you want to perform and the amount of (hot) data you want to cache. Currently, Presto requires the final aggregation results and the hash table for a join to fit in memory on a single machine (we're actively working on removing these restrictions). If you have larger amounts of memory, the OS will cache disk pages which will significantly improve the performance of queries.
In 2013 at Facebook we ran our Presto processes as follows:
We ran our JVMs with a 16 GB heap to leave most memory available for OS buffers
On the machines we ran Presto we didn't run MapReduce tasks.
Most of the Presto machines had 16 real cores and used processor affinity (eventually cgroups) to limit Presto to 12 cores (so the Hadoop data node process and other things could run easily).
Most of the servers were on a 10 gigabit networks, but we did have one large old crufty cluster using 1 gigabit (which worked fine).
We used the same configuration for the coordinator and the workers.
In recent times, we ran the following:
The machines had 256 GB of memory and we ran a 200 GB Java heap
Most of the machines had 24-32 real cores and Presto was allocated all cores.
The machines had only minimal local storage for logs, with all table data remote (in a proprietary distributed file system).
Most servers had a 25 gigabit network connection to a fabric network.
The coordinators and workers had similar configurations.