Elasticsearch indexing speed with Nodejs


I have an Elasticsearch index with a large number of documents. Up to this point I've been using JavaScript with Node.js. I made a cron job that runs every 24 hours and updates the documents individually based on any metadata changes. As some of you may know, this is probably the slowest possible way to do it: single-threaded Node.js with individual indexing requests to Elasticsearch. When the cron job runs, it moves at a snail's pace. I can update a document roughly every 1.5-2 seconds, which means it would take around 27 hours to update all the documents. I'm using a free-tier AWS ES instance, so I don't have access to certain features that would help me speed up the process.
Does anyone know of a way to speed up the indexing? If I were to call for a bulk update, how would that manifest in javascript? If I were to use another language to multi-thread it, what would be the fastest option?

I did not understand your question "If I were to call for a bulk update, how would that manifest in javascript?".
A bulk update should be the best solution regardless of language or framework. Of course, you can explore other languages like Ruby and use threads to distribute the bulk updates and speed them up.
From experience, a bulk update with a batch size between 4,000 and 7,000 documents works just fine. You may want to fine-tune the size within this range.
Ensure that refresh_interval is set to a large value, so that the index is not refreshed (that is, newly indexed documents are not made searchable) very frequently. IMO the default value will also do. Read more here.
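To make the "how would that manifest in javascript" part concrete, here is a minimal sketch of how the changed documents could be grouped into batches and turned into Bulk API request bodies. The index name `my-index`, the `id`/`changes` document shape, and the tiny batch size are assumptions for illustration only; older Elasticsearch versions may also expect a `_type` in the action line.

```javascript
// Split an array into consecutive batches of at most `size` items.
function chunk(items, size) {
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Build the newline-delimited JSON body of one _bulk request:
// an action line followed by a partial-document line for each update.
function buildBulkUpdateBody(indexName, docs) {
  const lines = [];
  for (const doc of docs) {
    lines.push(JSON.stringify({ update: { _index: indexName, _id: doc.id } }));
    lines.push(JSON.stringify({ doc: doc.changes }));
  }
  return lines.join('\n') + '\n'; // the bulk body must end with a newline
}

// Hypothetical sample data standing in for the nightly metadata changes.
const changed = [
  { id: '1', changes: { title: 'a' } },
  { id: '2', changes: { title: 'b' } },
  { id: '3', changes: { title: 'c' } },
];

// A batch size of 2 keeps the example readable; the answer suggests thousands.
const bodies = chunk(changed, 2).map((batch) => buildBulkUpdateBody('my-index', batch));
console.log(bodies[0]);
```

Each string in `bodies` would then be sent as the body of one `POST /_bulk` request, so the 24-hour job issues a few hundred requests instead of one per document.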

Related

Elasticsearch how to check for a status of a bulk indexing request?

I am bulk indexing docs containing country shapes (files here) into Elasticsearch, based on the cshapes dataset.
The geoshapes have a lot of points in "geometry":{"type":"MultiPolygon", and the bulk request takes a long time to complete (and sometimes does not complete, which is a separate and already reported problem).
Since the client times out (I use the official ES Node.js client), I would like to have a way to check what the status of the bulk request is, without having to use enormous timeout values.
What I would like is to have a status such as active/running, completed or aborted. I guess that just querying a single doc in the batch would not tell me whether the request has been aborted.
Is this possible?

I'm not sure if this is exactly what you're looking for, but it may be helpful. Whenever I'm curious about what my cluster is doing, I check out the tasks API.
The tasks API shows you all of the tasks that are currently running on your cluster. It will give you information about individual tasks, such as the task ID, start time, and running time. Here's the command:
curl -XGET http://localhost:9200/_tasks?group_by=parents | python -m json.tool
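Since the question uses Node.js, the JSON returned by that command could be filtered down to bulk-related tasks along these lines. This is only a sketch: the response shape (a `tasks` object keyed by task ID, with `action` and `running_time_in_nanos` fields) is assumed from typical tasks API output and may differ between versions, and the sample data is made up.

```javascript
// Pick out bulk-write tasks from a parsed _tasks response.
function findBulkTasks(tasksResponse) {
  return Object.entries(tasksResponse.tasks || {})
    .filter(([, task]) => task.action.includes('bulk'))
    .map(([taskId, task]) => ({
      taskId,
      action: task.action,
      runningSeconds: task.running_time_in_nanos / 1e9,
    }));
}

// Hypothetical, trimmed-down response used only to exercise the function.
const sample = {
  tasks: {
    'node1:101': { action: 'indices:data/write/bulk', running_time_in_nanos: 2500000000 },
    'node1:102': { action: 'cluster:monitor/tasks/lists', running_time_in_nanos: 1000000 },
  },
};

const bulkTasks = findBulkTasks(sample);
console.log(bulkTasks);
```

An empty result while the client is still waiting would suggest the bulk request is no longer running on the cluster, though it would not say whether it completed or was aborted.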

Elasticsearch doesn't provide a way to check the status of an ongoing Bulk request (documentation reference here).
First, check that your request succeeds with a smaller input, so you know there is no problem with the way you are making the request. Second, try dividing the data into smaller chunks and calling the Bulk API on them in parallel.
You can also try with a higher request_timeout value, but I guess that is something you don't want to do.
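The "smaller chunks in parallel" suggestion might look roughly like the following in Node.js. It is a sketch under assumptions: `sendBulk` is a hypothetical stand-in for the real Bulk API call, and the concurrency limit of 2 is arbitrary.

```javascript
// Run `worker` over `batches` with at most `limit` requests in flight at once.
async function runInParallel(batches, limit, worker) {
  const results = new Array(batches.length);
  let next = 0;
  // Each lane repeatedly claims the next unprocessed batch until none are left.
  async function lane() {
    while (next < batches.length) {
      const i = next++;
      results[i] = await worker(batches[i], i);
    }
  }
  const lanes = Array.from({ length: Math.min(limit, batches.length) }, lane);
  await Promise.all(lanes);
  return results;
}

// Hypothetical stand-in for the real Bulk API call; it only simulates latency.
async function sendBulk(batch) {
  await new Promise((resolve) => setTimeout(resolve, 10));
  return { items: batch.length, errors: false };
}

// Small, already-chunked input: three batches of shapes to index.
const batches = [[1, 2], [3, 4], [5]];
const done = runInParallel(batches, 2, sendBulk);
done.then((results) => console.log(results.map((r) => r.items)));
```

Bounding the number of in-flight requests keeps each one small enough to finish within the client timeout without flooding the cluster with every chunk at once.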

A side note on why your requests might take a long time (unless you are simply indexing too many documents in a single bulk run): if you have configured your own precision for geo shapes, make sure you also configure distance_error_pct. Otherwise no error is assumed, which results in documents with a lot of terms that take a long time to index.

MongoDB: can I trigger secondary replication only at a given time or manually?

I'm not a MongoDB expert, so I'm a little unsure about my server setup.
I have a single instance running MongoDB 3.0.2 with WiredTiger, accepting both read and write ops. It collects logs from clients, so the write load is decent. Once a day I want to process these logs and calculate some metrics using the aggregation framework. The data set to process is something like all logs from the last month, and the whole calculation takes about 5-6 hours.
I'm thinking about splitting writes and reads to avoid locks on my collections (the server continues to write logs while I'm reading; newly written logs may match my queries, but I can skip them, because I don't need 100% accuracy).
In other words, I want a setup with a secondary for reads, where replication does not run continuously, but starts at a configured time or, better, is triggered before the read operations start.
I'm doing all my processing from Node.js, so one option I see is to export the data created in some period like [yesterday, today], import it into the read instance myself, and run the calculations after the import is done. I looked at replica sets and master/slave replication as possible setups, but I couldn't figure out how to configure them to achieve the described scenario.
So maybe I'm wrong and missing something here? Are there any other options to achieve this?

Your idea of using a replica-set is flawed for several reasons.
First, a replica-set always replicates the whole mongod instance. You can't enable it for individual collections, and certainly not only for specific documents of a collection.
Second, deactivating replication and enabling it before you start your report generation is not a good idea either. When you enable replication, the new slave will not be immediately up-to-date. It will take a while until it has processed the changes since its last contact with the master. There is no way to tell in advance how long this will take (you can check how far a secondary is behind the primary using rs.status() and comparing the secondary's optimeDate with its lastHeartbeat date).
But when you want to perform data-mining on a subset of your documents selected by timespan, there is another solution.
Transfer the documents you want to analyze to a new collection. You can do this with an aggregation pipeline consisting only of a $match which matches the documents from the last month followed by an $out. The out-operator specifies that the results of the aggregation are not sent to the application/shell, but instead written to a new collection (which is automatically emptied before this happens). You can then perform your reporting on the new collection without locking the actual one. It also has the advantage that you are now operating on a much smaller collection, so queries will be faster, especially those which can't use indexes. Also, your data won't change between your aggregations, so your reports won't have any inconsistencies between them due to data changing between them.
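As a rough sketch, the $match plus $out pipeline described above could be built in Node.js like this. The `createdAt` field and the collection names `logs` and `logs_report` are assumptions for illustration; the actual log schema is not given in the question.

```javascript
// Build a pipeline that copies the documents from the last `days` days
// into a separate collection for reporting.
function buildSnapshotPipeline(now, days, targetCollection) {
  const since = new Date(now.getTime() - days * 24 * 60 * 60 * 1000);
  return [
    // Select only the documents in the reporting window (assumed `createdAt` field).
    { $match: { createdAt: { $gte: since, $lt: now } } },
    // Write the matched documents to the target collection, replacing its contents.
    { $out: targetCollection },
  ];
}

const pipeline = buildSnapshotPipeline(new Date('2015-06-01T00:00:00Z'), 30, 'logs_report');
console.log(JSON.stringify(pipeline, null, 2));

// With the Node.js driver this would be run roughly as (not executed here):
//   await db.collection('logs').aggregate(pipeline).toArray();
```

The daily metrics would then be computed with further aggregations on `logs_report` rather than on the live log collection.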
If you are certain that you need a second server for report generation, you can still use replication and perform the aggregation on the secondary. However, I would really recommend that you build a proper replica-set (consisting of primary, secondary and an arbiter) and leave replication active at all times. Not only will that make sure that your data isn't outdated when you generate your reports, it also gives you the important benefit of automatic failover should your primary go down for some reason.

Inserting mappings for a large amount of existing data

I am currently testing inserting a large number of mappings for existing data into a shardmap using Elastic Scale. It turns out the whole process is time consuming: it inserts around 10 mappings/second. Is there any way to speed up the insertion, e.g. by inserting batches of mappings or going directly via stored procedures?

We know from our own testing that inserting mappings is time consuming. Here are a couple of options I'd suggest you try:
You can run multiple parallel threads inserting the mappings.
You can increase the service level objective for the shard map database for the time where you do the bulk load.
I understand why you would want to load mappings in bulk for test scenarios. However, I am not sure I understand the reason why you will need so many mappings that this becomes a problem. Could you explain a bit more?
Thanks,
Torsten

Coming back to this question, since we have now published the ShardManagement PowerShell module along with some sample scripts here: https://gallery.technet.microsoft.com/scriptcenter/Azure-SQL-DB-Elastic-731883db. This should help you set and query the existing range/list mappings quickly.

dynamodb scan with delay (or limited provision) - nodejs

I am just starting to learn DynamoDB and ran into the following problem.
I am using connect-dynamodb to implement a session database with DynamoDB, and while developing and learning at the same time I found out that scans are expensive. However, connect-dynamodb (like any db framework) uses a reap interval to clean up expired sessions, and scans the table every X interval.
I found a nice solution here, but it uses a Java class, and I was wondering if there is a similar solution for Node.js.
If not, I would be glad to hear about any other good solution for an infrequently scheduled read burst, like a scan with a "delay" to avoid exceeding read capacity.
Thanks.

I use Node and DynamoDB a lot and just looked inside the connect-dynamodb module.
The main problem with this module is that it uses a table of type "HASH".
It should be a "RANGE" table with expires as the range key.
Then it is possible to do a query instead of a scan, which is way cheaper.
So my advice is not to use this module ;-)
Or fork it and change it to a RANGE table!
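If the module is kept as-is and the scan stays, the "scan with delay" idea from the question can be sketched in Node.js as a paged scan that pauses between pages. This is an assumption-laden sketch, not part of connect-dynamodb: `scanPage` is a hypothetical function that performs one scan call, and the in-memory `fakeScanPage` only imitates DynamoDB's `Limit`, `ExclusiveStartKey` and `LastEvaluatedKey` paging.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Scan page by page, pausing between pages so reads are spread over time.
async function pacedScan(scanPage, { pageSize, delayMs }, onItems) {
  let startKey;
  do {
    const params = { Limit: pageSize };
    if (startKey) params.ExclusiveStartKey = startKey;
    const page = await scanPage(params);
    await onItems(page.Items || []);
    startKey = page.LastEvaluatedKey; // undefined once the last page is reached
    if (startKey) await sleep(delayMs);
  } while (startKey);
}

// Hypothetical in-memory stand-in for a DynamoDB scan call.
const fakeTable = [{ id: 'a' }, { id: 'b' }, { id: 'c' }, { id: 'd' }, { id: 'e' }];
async function fakeScanPage({ Limit, ExclusiveStartKey }) {
  const start = ExclusiveStartKey ? ExclusiveStartKey.index : 0;
  const end = Math.min(start + Limit, fakeTable.length);
  const page = { Items: fakeTable.slice(start, end) };
  if (end < fakeTable.length) page.LastEvaluatedKey = { index: end };
  return page;
}

const seen = [];
let pages = 0;
const scanDone = pacedScan(fakeScanPage, { pageSize: 2, delayMs: 5 }, (items) => {
  pages += 1;
  seen.push(...items);
});
scanDone.then(() => console.log(`scanned ${seen.length} items in ${pages} pages`));
```

Tuning `pageSize` and `delayMs` against the table's provisioned read capacity spreads the reap over a longer period instead of spending the capacity in one burst.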

Using Solr with frequently updated data

I have a site search I would like to implement using Solr. Unfortunately, I also have a lot of frequently updated dynamic data in my MySQL database from cron jobs, which I would also like to be searchable.
I would automatically assume that constantly updating records in Solr is not a good idea so is there a workable solution to give me the text-search power of Solr as well as being able to filter based on these frequently updated fields?

I think this depends on what "frequently" means and how much Solr lag you can tolerate.
In my case, I update Solr twice every minute, which works fine.
This is based on a MySQL DB with a few hundred updates a minute.
In this situation it's important NOT to run an optimize on every Solr update/commit. It is better to run an optimize every n hours.
So in the end, all the new MySQL data will be visible in Solr with a delay of at most 30 seconds.
Whether this is acceptable depends on your situation.
