ElasticSearch - Update or new index? - search

Requirements:
A single ElasticSearch index needs to be constructed from a bunch of flat files that gets dropped every week
Apart from this weekly feed, we also get intermittent diff files, providing additional data that was not a part of the original feed (insert or update, no delete)
The time to parse and load these files (weekly full feed or the diff files) into ElasticSearch is not very huge
The weekly feeds received in two consecutive weeks are expected to have significant differences (deletes, additions, updates)
The index is critical for the apps to function and it needs to have close to zero downtime
We are not concerned about the exact changes made in a feed, but we need to have the ability to rollback to the previous version in case the current load fails for some reason
To state the obvious, searches need to be fast and responsive
Given these requirements, we are planning to do the following:
For incremental updates (diff) we can insert or update records as-is using the bulk API
For full updates we will reconstruct a new index and swap the alias as mentioned in this post. In case of a rollback, we can revert to the previous working index (backups are also maintained if the rollback needs to go back a few versions)
Questions:
Is this the best approach or is it better to CRUD documents on the previously created index using the built-in versioning, when re-constructing an index?
What is the impact of modifying data (delete, update) to the underlying lucene indices/shards? Can modifications cause fragmentation or inefficiency?

At first glance, I'd say that your overall approach is sound. Creating a new index every week with the new data and swapping an alias is a good approach if you need
zero downtime and
to be able to rollback to the previous indices for whatever reason
If you were to keep only one index and CRUD your documents in there, you'd not be able to rollback if anything goes wrong and you could end up in a mixed state with data from the current week and data from the week earlier.
Every time you update (even one single field) or delete a document, the previous version will be flagged as deleted in the underlying Lucene segment. When the Lucene segments have grown sufficiently big, ES will merge them and wipe out the deleted documents. However, in your case, since you're creating an index every week (and eventually delete the index from the week prior), you won't land into a situation where you'll have space and/or fragmentation issues.

Related

Sorting enormous dataset

I have an enormous dataset (over 300 million documents). It is a system for archiving data and rollback capability.
The rollback capability is a cursor which iterates trough the whole dataset and performs few post requests to some external end points, it's a simple piece of code.
The data being iterated over needs to be send ordered by the timestamp (filed in the document). The DB was down for some time, so backup DB was used, but has received older data which has been archived manually, and later all was merged with the main DB.
Older data breaks the order. I need to sort this dataset, but the problem is the size; there is not enough RAM available to perform this operation at once. How I can achieve this sorting?
PS: The documents do not contain any indexed fields.
There's no way to do an efficient sort without an index. If you had an index on the date field then things would already be sorted (in a sense), so getting things in a desired order is very cheap (after the overhead of the index).
The only way to sort all entries without an index is to fetch the field you want to sort for every single document and sort them all in memory.
The only good options I see are to either create an index on the date field (by far the best option) or increase the RAM on the database (expensive and not scalable).
Note: since you have a large number of documents it's possible that even your index wouldn't be super scalable -- in that case you'd need to look into sharding the database.

Updating lucene index frequently causing performance degrade

I am trying to add lucene.net on my project where searching getting more complicated data. but transaction (or table modifying frequently like inserting new data or modifying the field which is used in lucene index).
Is it good to use lucene.net searching here?
How can I find modified fields & update to specific lucene index which is already created? Lucene index contains documents that are deleted from the table then how can I remove them from lucene index?
while loading right now,
I have removed index which are not available in the table based on unique Field
inserting if index does not exist otherwise updating all index which are matching table unique field
While loading page it's taking more time than normal, due to my removing/inserting/updating index method calling.
How can I proceed with it?
Lucene is absolutely suited for this type of feature. It is completely thread-safe... IF you use it the right way.
Solution pointers
Create a single IndexWriter and keep it in a globally accessible singleton (either a global static variable or via dependency injection). IWs are completely threadsafe. NEVER open multiple IWs on the same folder.
Perform all updates/deletes via this singleton. (I had one project doing 100's of ops/second with no issues, even on slightly crappy hardware).
Depending on the frequency of change and the latency acceptable to the app, you could:
Send an update/delete to the index every time you update the DB
Keep a "transaction log" or queue (probably in the same DB) of changed rows and deletions (which are are to track otherwise). Then update the index by consuming the log/queue.
To search, create your IndexSearcher with searcher = new IndexSearcher(writer.GetReader()). This is part of the NRT (near real time) pattern. NEVER create a separate IndexReader on an index folder that is also open by an IW.
Depending on your pattern of usage you may wish to introduce a period of "latency" between changes happening and those changes being "visible" to the searches...
Instances of IS are also threadsafe. So you can also keep an instance of an IS through which all your searches go. Then recreate it periodically (eg with a timer) then swap it using Interlocked.Exchange.
I previously created a small framework to isolate this from the app and make it reusable.
Caveat
Having said that... Hosting this inside IIS does raise some problems. IIS will occasionally restart your app. Is will also (by default) start the new instance before stopping the existing one, then swaps them (so you don't see the startup time of the new one).
So, for a short time there will be two instances of the writer (which is bad!)
You can tell IIS to disable "overlapping" or increase the time between restarts. But this will cause other side-effects.
So, you are actually better creating a separate service to host your lucene bits. A simple self hosted WebAPI Windows service is ideal and pretty simple. This also gives you better control over where the index folder goes and the ability to host it on a different machine (which isolates the disk IO load). And means that the service can be accessed from other parts of your system, tested separately etc etc
Why is this "better" than one of the other services suggested?
It's a matter of choice. I am a huge fan of ElasticSearch. It solves a lot of problems around scale and resilience. It also uses the latest version of Java Lucene which is far, far ahead of lucene.net in terms of capability and performance. (The same goes for the other two).
BUT, ES and Solr are Java (which may or may not be an issue for you). AzureSearch is hosted in Azure which again may or may not be an issue.
All three will require climbing a learning curve and will require infrastructure support or external third party SaaS commitment.
If you keep the service inhouse and in c# it keeps it simple and you have control over the capabilities and the shape of the API can be turned for your needs.
No "right" answer. You'll have to make choices based on your situation.
You should be indexing preferrably according to some schedule (periodically). The easiest approach is to keep the date of last index and then query for all the changes since then and index new, update and remove records. In order to keep track of removed entries in the database you will need to have a log of deleted records with a date it was removed. You can then query using that date to what needs to be removed from the lucene.
Now simply run that job every 2 minutes or so.
That said, Lucene.net is not really suited for web application, you should consider using ElasticSearch, SOLR or AzureSearch. Basically server that can handle load and multi threading better.

Can ElasticSearch delete all and insert new documents in a single query?

I'd like to swap out all documents for a specific index's type. I'm thinking about this like a database transaction, where I'd:
Delete all documents inside of the type
Create new documents
Commit
It appears that this is possible with ElasticSearch's bulk API, but is there a more direct way?
Based on the following statement, from the elasticsearch Delete by Query API Documentation:
Note, delete by query bypasses versioning support. Also, it is not recommended to delete "large chunks of the data in an index", many times, it’s better to simply reindex into a new index.
You might want to reconsider removing entire types and recreating them from the same index. As this statement suggests, it is better to simply reindex. In fact I have a scenario where we have an index of manufacturer products and when a manufacturer sends an updated list of products, we load the new data into our persistent store and then completely rebuild the entire index. I have implemented the use of Index Aliases to allow for masking the actual index being used. When products changes occur a process is started to rebuild the new index in the background (a process that currently takes about 15 minutes) and then switch the alias to the new index once the data load is complete and delete the old index. So this is completely seamless and does not cause any downtime for our users.

Running query on database after a document/row is of certain age

What is the best practice for running a database-query after any document in a collection become of certain age?
Let's say this is a node.js web-system with mongoDB, with a collection of posts. After a new post is inserted, it should be updated with some data after 60 minutes.
Would a cron-job that checks all posts with (age < one hour) every minute or two be the best solution? What would be the least stressing solution if this system has >10.000 active users?
Some ideas:
Create a second collection as a queue with a "time to update" field which would contain the time at which the source record needs to be updated. Index it, and scan through looking for values older than "now".
Include the field mentioned above in the original document and index it the same way
You could just clear the value when done or reset it to the next 60 minutes depending on behavior (rather than inserting/deleting/inserting documents into the collection).
By keeping the update-collection distinct, you have a better chance of always keeping the entire working set of queued updates in memory (compared to storing the update info in your posts).
I'd kick off the update not as a web request to the same instance of Node but instead as a separate process so as to not block user-requests.
As to how you schedule it -- that's up to you and your architecture and what's best for your system. There's no right "best" answer, especially if you have multiple web servers or a sharded data system.
You might use a capped collection, although you'd run the risk of potentially losing records needing to be updated (although you'd gain performance)

Is it possible to make conditional inserts with Azure Table Storage

Is it possible to make a conditional insert with the Windows Azure Table Storage Service?
Basically, what I'd like to do is to insert a new row/entity into a partition of the Table Storage Service if and only if nothing changed in that partition since I last looked.
In case you are wondering, I have Event Sourcing in mind, but I think that the question is more general than that.
Basically I'd like to read part of, or an entire, partition and make a decision based on the content of the data. In order to ensure that nothing changed in the partition since the data was loaded, an insert should behave like normal optimistic concurrency: the insert should only succeed if nothing changed in the partition - no rows were added, updated or deleted.
Normally in a REST service, I'd expect to use ETags to control concurrency, but as far as I can tell, there's no ETag for a partition.
The best solution I can come up with is to maintain a single row/entity for each partition in the table which contains a timestamp/ETag and then make all inserts part of a batch consisting of the insert as well as a conditional update of this 'timestamp entity'. However, this sounds a little cumbersome and brittle.
Is this possible with the Azure Table Storage Service?
The view from a thousand feet
Might I share a small tale with you...
Once upon a time someone wanted to persist events for an aggregate (from Domain Driven Design fame) in response to a given command. This person wanted to ensure that an aggregate would only be created once and that any form of optimistic concurrency could be detected.
To tackle the first problem - that an aggregate should only be created once - he did an insert into a transactional medium that threw when a duplicate aggregate (or more accurately the primary key thereof) was detected. The thing he inserted was the aggregate identifier as primary key and a unique identifier for a changeset. A collection of events produced by the aggregate while processing the command, is what is meant by changeset here. If someone or something else beat him to it, he would consider the aggregate already created and leave it at that. The changeset would be stored beforehand in a medium of his choice. The only promise this medium must make is to return what has been stored as-is when asked. Any failure to store the changeset would be considered a failure of the whole operation.
To tackle the second problem - detection of optimistic concurrency in the further life-cycle of the aggregate - he would, after having written yet another changeset, update the aggregate record in the transactional medium if and only if nobody had updated it behind his back (i.e. compared to what he last read just before executing the command). The transactional medium would notify him if such a thing happened. This would cause him to restart the whole operation, rereading the aggregate (or changesets thereof) to make the command succeed this time.
Of course, now he had solved the writing problems, along came the reading problems. How would one be able to read all the changesets of an aggregate that made up its history? Afterall, he only had the last committed changeset associated with the aggregate identifier in that transactional medium. And so he decided to embed some metadata as part of each changeset. Among the meta data - which is not so uncommon to have as part of a changeset - would be the identifier of the previous last committed changeset. This way he could "walk the line" of changesets of his aggregate, like a linked list so to speak.
As an additional perk, he would also store the command message identifier as part of the metadata of a changeset. This way, when reading changesets, he could know in advance if the command he was about to execute on the aggregate was already part of its history.
All's well that ends well ...
P.S.
1. The transactional medium and changeset storage medium can be the same,
2. The changeset identifier MUST not be the command identifier,
3. Feel free to punch holes in the tale :-),
4. Although not directly related to Azure Table Storage, I've implemented the above tale successfully using AWS DynamoDB and AWS S3.
How about storing each event at "PartitionKey/RowKey" created based on AggregateId/AggregateVersion?where AggregateVersion is a sequential number based on how many events the aggregate already has.
This is very deterministic, so when adding a new event to the aggregate, you will make sure that you were using the latest version of it, because otherwise you'll get an error saying that the row for that partition already exists. At this time you can drop the current operation and retry, or try to figure out if you could merge the operation anyways if the new updates to the aggregate do not conflict to the operation you just did.

Resources