Elasticsearch bad indexing time - azure

I am trying to migrate (copy) 35 million documents (which is a standard amount, not too big) between couchbase to elasticsearch.
My elasticsearch (version 1.3) cluster composed from 3 A3 (4 cores, 7 GB memory) CentOS Severs on Microsoft Azure (each server equals to a large server on Amazon)..
I used "timing data flow" indexing to store the docuemnts. each index represents a month and composed by 3 shards and 2 replicas.
when i start the migration script i see that the insertion time is becoming very slow (about 10 documents per second) and the load average of each server in the cluster jumping over than 1.5.
In addition, the JVM memory is being increased almost to 100% while the cpu shows 20% and the IOps shows 20 at max.
(i used Marvel CNC to get all these data)
Does anyone faced these kind of indexing problems in elasticsearch?
I would like to know if there are any parameters that i should be aware about to extend java memory?
is my cluster specifications good enough to handle 100 indexing per second.
is the indexing time depends on how big is the index? and should it be that slow?
Thnx Niv

I am quoting an answer I got in google group (link)
A couple of suggestions:
Disable replicas before large amounts of inserts (set replica count to 0), and only enable it afterwards again.
Use batching, actual batch size would depends on many factors (doc sizes, network, instances strengths)
Follow ES's advice on node setup, e.g. allocate 50% of the available memory size to the Java heap of ES, don't run anything else
on that machine, and disable swappiness.
Your index is already sharded, try spreading it out to 3 different servers instead of having them on one server ("virtual shards"). This
will help fan out the indexing load.
If you don't specify the document IDs yourself, make sure you use the latest ES, there's a significant improvement there in the ID
generation mechanism which could help speeding up things.
I applied points 1 & 3 and it seems that the problems solved :)
now i am indexing in rate of 80 docs per second and the load avg is low (0.7 at max)
I have to give the credit to Itamar Syn-Hershko that posted this reply.

Related

poor s3 upload performance using spark

I want to verify the results of
http://improve.dk/pushing-the-limits-of-amazon-s3-upload-performance/
but using spark. So, far increase the number of partitions results in same or worse upload speed. not even close to the author's 1 GB/sec. Granted, my instance is M1.xlarge, not optimized for network, but still it is rated at 1Gb / sec. And, for this purpose, I am only trying to verify the benefit of partitioning RDD and parallel save of each partition.
My hunch is the limit of concurrent connections, which the article stated to be 2 for Windows server. I am using Amazon linux, for which I have seen numbers like 20 concurrent connections by default. If that is true, I should see increase if throughput with the experimental parameters I used. Is there a way to verify this number. Or, if its low, how would I increase it?
Ok, there is apparently a bad problem with Spark - S3 interface. I repeated the experiment using aws client and threads, just as described in the article, and got a clear performance gain with increasing thread count, And the boost in speed is up to 10 times.

Will Elasticsearch survive this much load or simply die?

We have Elasticsearch Server with 1 cluster 3 Nodes, we are expecting that queries fired per second will be 800-1000, so we want to know if we get load like 1000 queries per second then will the elasticsearch server respond with delays or it will simply stop working ?
Queries are all query_string, fuzzy (prefix & wildcard queries are not used).
There's a few factors to consider assuming that your network has the necessary throughput:
What's the CPU speed and number of cores for each node?
Should have 2GHZ quad cores at the very least. Also the nodes should be dedicated to ELK, so they aren't busy with other tasks.
How much ram do your nodes have?
Probably want to be north of 10GB at least
Are your logs filtered and indexed?
Having your logs filtered will greatly reduce the work load generated by the queries. Additionally, filtered logs can make it so that you don't have to query as much with wild cards (which are very expensive).
Hope that helps point in a better direction :)
One immediate suggestion: if you are expecting sustained query rates of 800 - 1K/sec you do not want the nodes storing the data (which will be handling indexing of new records, merging and shard rebalancing) to also be having to deal with query scatter/gather operations. Consider a client + data node topology where you keep your 3 nodes and add n client nodes (data and master set to false in their configs.) The actual value for n will vary based on your actual performance; this will be something you'll want to determine via experimentation.
Other factors equal or unknown, abundant memory is a good resource to have. Review the Elastic team's guidance on hardware and be sure to link through to the discussion on heap.

Azure Table Storage transaction limitations

I'm running performance tests against ATS and its behaving a bit weird when using multiple virtual machines against the same table / storage account.
The entire pipeline is non blocking (await/async) and using TPL for concurrent and parallel execution.
First of all its very strange that with this setup i'm only getting about 1200 insertions. This is running on a L VM box, that is 4 cores + 800mbps.
I'm inserting 100.000 rows with unique PK and unique RK, that should leverage the ultimate distribution.
Even more deterministic behavior is the following.
When I run 1 VM i get about 1200 insertions per second.
When I run 3 VM i get about 730 on each insertions per second.
Its quite humors to read the blog post where they are specifying their targets.
https://azure.microsoft.com/en-gb/blog/windows-azures-flat-network-storage-and-2012-scalability-targets/
Single Table Partition– a table partition are all of the entities in a table with the same partition key value, and usually tables have many partitions. The throughput target for a single table partition is:
Up to 2,000 entities per second
Note, this is for a single partition, and not a single table. Therefore, a table with good partitioning, can process up to the 20,000 entities/second, which is the overall account target described above.
What shall I do to be able to utilize the 20k per second, and how would it be possible to execute more than 1,2k per VM?
--
Update:
I've now also tried using 3 storage accounts for each individual node and is still getting the performance / throttling behavior. Which i can't find a logical reason for.
--
Update 2:
I've optimized the code further and now i'm possible to execute about 1550.
--
Update 3:
I've now also tried in US West. The performance is worse there. About 33% lower.
--
Update 4:
I tried executing the code from a XL machine. Which is 8 cores instead of 4 and the double amount of memory and bandwidth and got a 2% increase in performance so clearly this problem is not on my side..
A few comments:
You mention that you are using unique PK/RK to get ultimate
distribution, but you have to keep in mind that the PK balancing is
not immediate. When you first create a table, the entire table will
be served by 1 partition server. So if you are doing inserts across
several different PKs, they will still be going to one partition
server and be bottlenecked by the scalability target for a single
partition. The partition master will only start splitting your
partitions among multiple partition servers after it has identified hot
partition servers. In your <2 minute test you will not see the
benefit of multiple partiton servers or PKs. The throughput in the
article is targeted towards a well distributed PK scheme with
frequently accessed data, causing the data to be divided amongst
multiple partition servers.
The size of your VM is not the issue as
you are not blocked on CPU, Memory, or Bandwidth. You can achieve
full storage performance from a small VM size.
Check out
http://research.microsoft.com/en-us/downloads/5c8189b9-53aa-4d6a-a086-013d927e15a7/default.aspx.
I just now did a quick test using that tool from a WebRole VM in the
same datacenter as my storage account and I acheived, from a single
instance of the tool on a single VM, ~2800 items per second upload
and ~7300 items per second download. This is using 1024 byte
entities, 10 threads, and 100 batch size. I don't know how efficient this tool is or if it disables Nagles Algorithm as I was unable to get great results (I got ~1000/second) using a batch size of 1, but at least with the 100 batch size it shows that you can achieve high items/second. This was done in US West.
Are you using Storage client library 1.7 (Microsoft.Azure.StorageClient.dll) or 2.0 (Microsoft.Azure.Storage.dll)? The 2.0 library has some performance improvements and should yield better results.
I suspect this may have to do with TCP Nagle.
See this MSDN article and this blog post.
In essence, TCP Nagle is a protocol-level optimization that batches up small requests. Since you are sending lots of small requests this is likely to negatively affect your performance.
You can disable TCP Nagle by executing this code when starting your application
ServicePointManager.UseNagleAlgorithm = false;
Are the compute instances and storage account in the same affinity group? Affinity groups ensure that network proximity between the services is optimal and should result in lower latency at the network level.
You can find affinity group configuration under the network tab.
I would tend to believe that the maximum throughput is for an optimized load. For example, I bet you that you can achieve higher performance using Batch requests than individual requests you are doing now. And of course, if you use GUIDs for your PK, you can't Batch in your current test.
So what if you changed your test to batch insert entities in groups of 100 (maximum per batch), still using GUIDs, but for which 100 entities would have the same PK?

cassandra read performance for large number of keys

Here is situation
I am trying to fetch around 10k keys from CF.
Size of cluster : 10 nodes
Data on node : 250 GB
Heap allotted : 12 GB
Snitch used : property snitch with 2 racks in same Data center.
no. of sstables for cf per node : around 8 to 10
I am supercolumn approach.Each row contains around 300 supercolumn which in terms contain 5-10 columns.I am firing multiget with 10k row keys and 1 supercolumn.
When fire the call 1st time it take around 30 to 50 secs to return the result.After that cassandra serves the data from key cache.Then it return the result in 2-4 secs.
So cassandra read performance is hampering our project.I am using phpcassa.Is there any way I can tweak cassandra servers so that I can get result faster?
Is super column approach affects the read performance?
Use of super columns is best suited for use cases where the number of sub-columns is a relatively small number. Read more here:
http://www.datastax.com/docs/0.8/ddl/column_family
Just in case you haven't done this already, since you're using phpcassa library, make sure that you've compiled the Thrift library. Per the "INSTALLING" text file in the phpcassa library folder:
Using the C Extension
The C extension is crucial for phpcassa's performance.
You need to configure and make to be able to use the C extension.
cd thrift/ext/thrift_protocol
phpize
./configure
make
sudo make install
Add the following line to your php.ini file:
extension=thrift_protocol.so
After doing much of RND about this stuff we figured there is no way you can get this working optimally.
When cassandra is fetching these 10k rows 1st time it is going to take time and there is no way to optimize this.
1) However in practical, probability of people accessing same records are more.So we take maximum advantage of key cache.Default setting for key cache is 2 MB. So we can afford to increase it to 128 MB with no problems of memory.
After data loading run the expected queries to warm up the key cache.
2) JVM works optimally at 8-10 GB (Dont have numbers to prove it.Just observation).
3) Most important if you are using physical machines (not cloud OR virtual machine) then do check out the disk scheduler you are using.Set it NOOP which is good for cassandra as it reads all keys from one section reducing disk header movement.
Above changes helped to bring down time required for querying within acceptable limits.
Along with above changes if you have CFs which are small in size but frequently accessed enable row caching for it.
Hope above info is useful.

Practical Limits of ElasticSearch + Cassandra

I am planning on using ElasticSearch to index my Cassandra database. I am wondering if anyone has seen the practical limits of ElasticSearch. Do things get slow in the petabyte range? Also, has anyone has any problems using ElasticSearch to index Cassandra?
See this thread from 2011, which mentions ElasticSearch configurations with 1700 shards each of 200GB, which would be in the 1/3 petabyte range. I would expect that the architecture of ElasticSearch would support almost limitless horizontal scalability, because each shard index works separately from all other shards.
The practical limits (which would apply to any other solution as well) include the time needed to actually load that much data in the first place. Managing a Cassandra cluster (or any other distributed datastore) of that size will also involve significant workload just for maintenance, load balancing etc.
Sonian is the company kimchy alludes to in that thread. We have over a petabyte on AWS across multiple ES clusters. There isn't a technical limitation to how far horizontally you can scale ES, but as DNA mentioned there are practical problems. The biggest by far is network. It applies to every distributed data storage. You can only move so much across the wire at a time. When ES has to recover from a failure, it has to move data. The best option is to use smaller shards across more nodes (more concurrent transfer), but you risk a higher rate of failure and exhorbitant cost per byte.
AS DNA mentioned, 1700 shards, but it is not 1700 shards but there are 1700 indexes each with 1 shard and 1 replica. So it is quite possible that these 1700 indexes are not present on single machine but are split around multiple machines.
So this is never a problem
I am currently starting working with Elisandra (Elasticsearch + Cassandra)
I am also, having problems to index Cassandra with elasticsearch. My problem is basically the node configuration.
Doing $ nodetool status you can see Host ID and then ruining:
curl -XGET http://localhost:9200/_cluster/state/?pretty=true
You can check that one of the node: is the same name as Host ID

Resources