Pagination after reindex in Azure Search

Pagination after reindex in Azure Search - azure

I am new in Azure Search Service and I am not sure I got one important thing about it:
Let's pretend the situation when I am as a client scrolling down through results of my search query:
"New Y". I have 1000 elements, every page contains 10 of it. But during my scroll reindex operation has been started and some elements changed their position concerning new updates in data source (Azure Table).
Will I see next pages during my scrolling after reindex with probably some duplicated data or it still be the old "snapshot" of data I was scrolling before?

You'll see the changes as you execute subsequent requests. To Azure Search each request is independent and it represents a new search (caching aside), which for paging scenarios just happens to have a different "skip" number.
This means that if your data is changing you might see an item more than once (if it moves across pages due to changes) or even skip one (if it moves from a page you didn't see yet to a page you already saw).
There's no way to get a strictly consistent view of search matches outside of a single result. If you need to approximate this behavior you can request a larger page (using "top"), cache the results and present them in chunks. We find that in practice this is rarely needed for most search scenarios, but if search is backing a part of an app that needs consistency you might need to something along those lines.

Related

In React, speed up large initial fetch of data when web app loads

Similar to the project I am working on, this website has a search bar at the top of its home page:
On the linked website, the search bar works seemingly immediately when visiting the website. According to their own website, there have been roughly ~20K MLB players in MLB history, and this is a good estimate for the number of dropdown items in this select widget.
On my project, it currently takes 10-15 seconds to make this fetch (from MongoDB, using Node + Express) for a table with ~15mb of data that contains the data for the select's dropdown items. This 15mb of data is as small as I could make this table, as it includes only two keys (1 for the id, and 1 for the name for each dropdown). This table is large because there are more than 150K options to choose from in my project's select widget. I currently have to disable the widget for the first 15 seconds while the data loads, which results in a bad user experience.
Is there any way to make the data required for the select widget immediately available to the select when users visit, that way the widget does not have to be disabled? In particular:
Can I use localstorage to store this table in the users browser? is 15MB too big for localstorage? This table changes / increases in size daily (not too persistent), and a table in localstorage would then be outdated the next day, no?
Can I avoid all-together having to do this fetch? Potentially there is a way to load the correct data into react only when a user searches for that user?
Some other approach?
Saving / fetching quicker for 15mb of data for this select will improve our react app's user experience by quite a bit.

The data on the site you link to is basically 20k in size. It does not contain all the players but fetches the data as needed when you click on a link in the drop-down. So if you have 20Mb of searchable data, then you need to find a way to only load it as required. How to do that sensibly depends on the nature of the data. Many search bars with large result sets behind them will use a typeahead search where the user's input is posted back as they type (with a decent debounce interval) and the search results matching the user's input sent back in real time (usually with a limit of, say, the first 20 or 50 results).
So basically the answer is to find a way to only serve up the data that the user needs rather than basically downloading the entire database to the browser (option 2 of your list). You must obviously provide a search API to allow that to happen.

ElasticSearch - Update or new index?

Requirements:
A single ElasticSearch index needs to be constructed from a bunch of flat files that gets dropped every week
Apart from this weekly feed, we also get intermittent diff files, providing additional data that was not a part of the original feed (insert or update, no delete)
The time to parse and load these files (weekly full feed or the diff files) into ElasticSearch is not very huge
The weekly feeds received in two consecutive weeks are expected to have significant differences (deletes, additions, updates)
The index is critical for the apps to function and it needs to have close to zero downtime
We are not concerned about the exact changes made in a feed, but we need to have the ability to rollback to the previous version in case the current load fails for some reason
To state the obvious, searches need to be fast and responsive
Given these requirements, we are planning to do the following:
For incremental updates (diff) we can insert or update records as-is using the bulk API
For full updates we will reconstruct a new index and swap the alias as mentioned in this post. In case of a rollback, we can revert to the previous working index (backups are also maintained if the rollback needs to go back a few versions)
Questions:
Is this the best approach or is it better to CRUD documents on the previously created index using the built-in versioning, when re-constructing an index?
What is the impact of modifying data (delete, update) to the underlying lucene indices/shards? Can modifications cause fragmentation or inefficiency?

At first glance, I'd say that your overall approach is sound. Creating a new index every week with the new data and swapping an alias is a good approach if you need
zero downtime and
to be able to rollback to the previous indices for whatever reason
If you were to keep only one index and CRUD your documents in there, you'd not be able to rollback if anything goes wrong and you could end up in a mixed state with data from the current week and data from the week earlier.
Every time you update (even one single field) or delete a document, the previous version will be flagged as deleted in the underlying Lucene segment. When the Lucene segments have grown sufficiently big, ES will merge them and wipe out the deleted documents. However, in your case, since you're creating an index every week (and eventually delete the index from the week prior), you won't land into a situation where you'll have space and/or fragmentation issues.

Updating lucene index frequently causing performance degrade

I am trying to add lucene.net on my project where searching getting more complicated data. but transaction (or table modifying frequently like inserting new data or modifying the field which is used in lucene index).
Is it good to use lucene.net searching here?
How can I find modified fields & update to specific lucene index which is already created? Lucene index contains documents that are deleted from the table then how can I remove them from lucene index?
while loading right now,
I have removed index which are not available in the table based on unique Field
inserting if index does not exist otherwise updating all index which are matching table unique field
While loading page it's taking more time than normal, due to my removing/inserting/updating index method calling.
How can I proceed with it?

Lucene is absolutely suited for this type of feature. It is completely thread-safe... IF you use it the right way.
Solution pointers
Create a single IndexWriter and keep it in a globally accessible singleton (either a global static variable or via dependency injection). IWs are completely threadsafe. NEVER open multiple IWs on the same folder.
Perform all updates/deletes via this singleton. (I had one project doing 100's of ops/second with no issues, even on slightly crappy hardware).
Depending on the frequency of change and the latency acceptable to the app, you could:
Send an update/delete to the index every time you update the DB
Keep a "transaction log" or queue (probably in the same DB) of changed rows and deletions (which are are to track otherwise). Then update the index by consuming the log/queue.
To search, create your IndexSearcher with searcher = new IndexSearcher(writer.GetReader()). This is part of the NRT (near real time) pattern. NEVER create a separate IndexReader on an index folder that is also open by an IW.
Depending on your pattern of usage you may wish to introduce a period of "latency" between changes happening and those changes being "visible" to the searches...
Instances of IS are also threadsafe. So you can also keep an instance of an IS through which all your searches go. Then recreate it periodically (eg with a timer) then swap it using Interlocked.Exchange.
I previously created a small framework to isolate this from the app and make it reusable.
Caveat
Having said that... Hosting this inside IIS does raise some problems. IIS will occasionally restart your app. Is will also (by default) start the new instance before stopping the existing one, then swaps them (so you don't see the startup time of the new one).
So, for a short time there will be two instances of the writer (which is bad!)
You can tell IIS to disable "overlapping" or increase the time between restarts. But this will cause other side-effects.
So, you are actually better creating a separate service to host your lucene bits. A simple self hosted WebAPI Windows service is ideal and pretty simple. This also gives you better control over where the index folder goes and the ability to host it on a different machine (which isolates the disk IO load). And means that the service can be accessed from other parts of your system, tested separately etc etc
Why is this "better" than one of the other services suggested?
It's a matter of choice. I am a huge fan of ElasticSearch. It solves a lot of problems around scale and resilience. It also uses the latest version of Java Lucene which is far, far ahead of lucene.net in terms of capability and performance. (The same goes for the other two).
BUT, ES and Solr are Java (which may or may not be an issue for you). AzureSearch is hosted in Azure which again may or may not be an issue.
All three will require climbing a learning curve and will require infrastructure support or external third party SaaS commitment.
If you keep the service inhouse and in c# it keeps it simple and you have control over the capabilities and the shape of the API can be turned for your needs.
No "right" answer. You'll have to make choices based on your situation.

You should be indexing preferrably according to some schedule (periodically). The easiest approach is to keep the date of last index and then query for all the changes since then and index new, update and remove records. In order to keep track of removed entries in the database you will need to have a log of deleted records with a date it was removed. You can then query using that date to what needs to be removed from the lucene.
Now simply run that job every 2 minutes or so.
That said, Lucene.net is not really suited for web application, you should consider using ElasticSearch, SOLR or AzureSearch. Basically server that can handle load and multi threading better.

couchdb request never completes

I am trying to request a large number of documents from my database (which has over 400k documents). I started using _all_docs built-in view. I first tried with this query:
http://database:port/databasename/_all_docs?limit=100&include_docs=true
No problem. Completes as expected. Now to ramp it up:
http://database:port/databasename/_all_docs?limit=1000&include_docs=true
Still fine. Took longer, more data, etc. as expected. Ramp it up again:
http://database:port/databasename/_all_docs?limit=10000&include_docs=true
Request never completes. The Dev tools in chrome show Size = 5.3MB (seems to be significant), and this occurs no matter what value for the limit parameter I use that is over 6500ish. No matter if i specify 6500 or 10,000, it always returns 5.3MB downloaded, and the request stalls.
I have also tried other combinations, such as "skip" and it seems that limit + skip must be < 6500 or I get the same stall.
My environment: Couchdb 1.6.1, Ubuntu 14.04.3 LTS, Azure A1 standard

you have to prewarm your queries, just throwing a 100K or more docs and expecting that you'd get them out of couchdb won't work, it just won't work.
When you ask for some items from a view (in your case Default View), at the first read CouchDB will notice that the B-tree for the view doesn't exist yet, so it goes ahead and builds it on the first read. Depending on how many documents you have in your database, that can take a while, putting a good work load on your database.
On every subsequent read, CouchDB will check if documents have changed since the last write, and throw the changed documents at the map and reduce function. So if you only query some views from time to time, but have lots of changes in between, expect some delays on the next read.
There are 2 ways to handle this situation
1. Pre-warm your view - run a cronjob that does reads to make sure that your view has the B-Tree for this View.
2. Prepare your view in advance for a particular query before inserting the data in the couchdb.
and for now if you really want to read all your docs, don't read them all at once, rather use the skip, limit range queries.

caching search results effectively in redis

I need a way to cache searches on my node.js app. I had an idea that uses redis but I am not sure how to implement it.
What I want to do is have a hard limit on how many searches are going to be cached because I have a limited amount of RAM. For each search, I want to store the search query and the corresponding search results.
Lets say that my hard limit on the number of searches cached is 4. Each search query is a box in the following diagram:
If there was a new search that was not cached, the new search gets pushed to the top, and the search query at the bottom gets removed.
But if there was a search that was cached, the cached search query gets removed from its position and added to the top of the cache. For example, if search 3 was searched.
By doing this, I use a relatively same amount of memory while the most searched queries would always float around in the cache and less popular searches would go through the cache and get removed.
My question is, how exactly would I do this? I thought I may have been able to do it with lists, but I am not sure how you can check if a value exists in a list. I also thought I might be able to do this with sorted sets, where I would set the score of the set to be the index, but then if a search query gets moved within the cache, I would need to change the score of every single element in the set.

Simplest for you is to spin up new redis instance just for handling search cache. For this instance you can set max memory as needed. Then you will set maxmemory-policy for this instance to allkeys-lru. By doing this, redis will automatically delete least recently used cache entry (which is what you want). Also, you will be limiting really by memory usage not by max number of cache entries.
To this redis instance you will then insert keys as: search:$seachterm => $cachedvalue and set expire for this key for few minutes for example (so you don't serve stale answers). By doing this redis will do hard work for you.

You definitely want to use a sortedset
here's what you do:
1st query: select the top element from your sortedset: zrevrange(0,1) WITHSCORES
2nd query: in a multi, do:
A. insert your element with the score that you retrieved + 1. If that element exists in the list already it will simply be rescored and not added twice.
B. zremrankbyrank. I haven't tested this, but I think the parameters you want are (0,-maxListSize)

Take a look at ZREMRANGEBYRANK . You can limit the amount of data in your sorted set to a given size.
http://redis.io/commands/zremrangebyrank

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string