Faster autocomplete implementation techniques - search

I have a website that gets around 5,000 hits a day on average. It has an autocomplete search box on every page. I have around 5,000 keywords in my database, and that number will grow gradually.
I make an AJAX call to an ashx handler as soon as the user enters the third character into the autocomplete search box. The ashx handler then fetches all the keywords from the database that start with the user's query.
But I find this process slow. I am considering two options here.
1. Storing the keywords in an XML file, and then searching that file using techniques such as XPathNavigator, LINQ, etc.
2. Storing all the keywords in a SortedList/HashSet object and keeping that object in the cache.
I am unable to decide which option is more feasible. What is the performance overhead or risk of keeping the whole object in the cache?

Use a trie; it is the data structure used by search engines and by the autocomplete dictionaries in mobile phones.
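For illustration, here is a minimal sketch of a prefix trie for this kind of keyword lookup (written in Java purely as an example; the same structure ports directly to C#, and the built trie can be held in the application cache). The class and method names are illustrative:

import java.util.*;

public class AutocompleteTrie {

    private static class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isWord;
    }

    private final Node root = new Node();

    // Add one keyword to the trie (stored lower-cased for case-insensitive matching).
    public void insert(String word) {
        Node node = root;
        for (char c : word.toLowerCase().toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.isWord = true;
    }

    // Return every stored keyword that starts with the given prefix.
    public List<String> startsWith(String prefix) {
        Node node = root;
        String p = prefix.toLowerCase();
        for (char c : p.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return Collections.emptyList(); // no keyword has this prefix
        }
        List<String> results = new ArrayList<>();
        collect(node, new StringBuilder(p), results);
        return results;
    }

    private void collect(Node node, StringBuilder prefix, List<String> results) {
        if (node.isWord) results.add(prefix.toString());
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            prefix.append(e.getKey());
            collect(e.getValue(), prefix, results);
            prefix.deleteCharAt(prefix.length() - 1);
        }
    }
}

Building the trie once (for example at application start) and querying it in memory makes each prefix lookup proportional to the length of the prefix plus the number of matches, rather than scanning all 5,000 keywords on every keystroke.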

Related

REST API: Infinite scroll pagination in the GUI, but allow searching through all entries

I have Express running in a Node.js server, which serves as a backend for my React frontend application.
The frontend application fetches data from the backend (which is stored in Mongo) through a REST call, and displays this data in a table.
The amount of data is growing by the day, so I thought I should look into reducing the amount of data transferred to the frontend application, to avoid unnecessary strain on the backend.
I'm not sure if this is the right way to approach this, but I've been thinking I would look into having the backend fetch a limited number of entries, so that only this data is displayed in the frontend table.
The problem arises with searching - when the user wants to search the data in the table, I'll need to be able to search through all entries, not just the data loaded into the table.
I guess one option would be to have the search function actually query the REST API, instead of searching the table itself.
If I'm on the right track, I guess I could implement REST API pagination, along the lines of the example found in https://refactoringfactory.wordpress.com/2012/09/08/pagination-in-node-js-and-express/. Other suggestions on how to implement pagination are welcome.
I'd very much like some input on the approach I described, and suggestions for smarter ways to implement this.
EDIT: I changed the title somewhat to include "Infinite scroll pagination". This is what I'm looking to implement. At the moment I have a click-on-pages pagination setup, but I would like to replace it with infinite scroll pagination.
I've been thinking I would look into having the backend fetch a limited number of entries, so that only this data is displayed in the frontend table.
This is common practice in my experience. The term for it is "pagination." Have a look at this SO question regarding best practices for pagination in REST APIs: API pagination best practices.
The problem arises with searching - when the user wants to search the data in the table, I'll need to be able to search through all entries, not just the data loaded into the table.
I guess one option would be to have the search function actually query the REST API, instead of searching the table itself.
Again, you got it. Doing small filters/searches on the client is fine for a limited number of entries, but if you need to only retrieve items matching search criteria in the first place, then adding that functionality to your REST API is the right choice.
Right, you should do both:
Pagination: you might implement it by exposing two parameters on the REST endpoint for the listing:
?p=<number>: page number, defaults to 1
?l=<number>: number of items per page / page length, defaults to something between maybe 10 and 100
Search: implement it by exposing one parameter on the REST endpoint for the listing:
?q=<string>: you can define this however you want, e.g. a string that is matched against one or more fields of the data
If you want to minimize network traffic, you might also add one more parameter to explicitly select the fields you want returned, like this:
?f=<string>: the string could be something like id,name,age, so the API returns only those three fields per record.
All these parameters should be accepted by the list endpoint in your RESTful API.
Example:
http://example.com/api/cars/?p=2&l=15&q=toyota&f=id,brand,model,color
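To make the semantics of these parameters concrete, here is a minimal, self-contained sketch that applies q, p, l and f to an in-memory list of records (written in Java purely for illustration; in the Node/Express/Mongo stack described above, the filter, skip/limit and field projection would be pushed down into the database query rather than done in memory):

import java.util.*;
import java.util.stream.Collectors;

public class ListingExample {

    // Applies ?q (search), ?p (page), ?l (page length) and ?f (fields) to a list of records.
    static List<Map<String, Object>> list(List<Map<String, Object>> all,
                                          String q, int page, int pageLength,
                                          Set<String> fields) {
        return all.stream()
                // ?q: match the search string against every field value
                .filter(rec -> q == null || rec.values().stream()
                        .anyMatch(v -> String.valueOf(v).toLowerCase().contains(q.toLowerCase())))
                // ?p and ?l: skip the previous pages, then take one page
                .skip((long) (page - 1) * pageLength)
                .limit(pageLength)
                // ?f: keep only the requested fields
                .map(rec -> {
                    Map<String, Object> out = new LinkedHashMap<>(rec);
                    if (fields != null) out.keySet().retainAll(fields);
                    return out;
                })
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Map<String, Object>> cars = List.of(
                Map.of("id", 1, "brand", "Toyota", "model", "Yaris", "color", "red"),
                Map.of("id", 2, "brand", "Ford", "model", "Focus", "color", "blue"));
        // Equivalent of GET /api/cars/?p=1&l=15&q=toyota&f=id,brand,model,color
        System.out.println(list(cars, "toyota", 1, 15, Set.of("id", "brand", "model", "color")));
    }
}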

Are there reasons why FTSearch would not be a suitable alternative to DBColumn in a Type-Ahead on an XPage, when trying to improve performance?

I have a general requirement in my current project to make an existing XPage application faster. One thing we looked at was how to speed up some of the slower type-ahead fields, and one solution which seems to be fast is implementing them using FTSearch rather than the DBColumn we originally had. I want advice on whether this would be an OK approach, or whether there are suggestions for doing what we need in a different way.
Background:
While there are a number of factors affecting speed (network latency, server OS, available server memory, etc.), we are using 8.5.3 and have optimized the application in general as far as we can, making use of the IBM Toolkit to find problem areas, and also using the features IBM added in 8.5.3 to help with this (e.g. partial execution, the optimized JS and CSS option, etc.). Unfortunately we are stuck with the server running on a 32-bit Windows OS with 3.5 GB of RAM for another few months.
Some of the slowest elements to respond are certain type-aheads which reference a large number of documents. The worst one averages around 5 or 6 seconds before the suggestion list appears for a type-ahead-enabled field.
It uses SSJS to call a Java class to perform a dbcolumn call (using Ferry Kranenburg's XPages snippet) to get a unique list from a view, then back in SSJS it loops through the array to check whether each entry contains the search key value; if found, it adds a highlight (bold) HTML tag around the search text in the word, then returns the formatted list back to the browser.
I added a print statement to output the elapsed time it takes to run the code, and on average today on our dev server it is around 3250 ms.
I tried a few things to see how we could make this process faster:
Added a Java class to do all processing (so not using SSJS). This only saved an average of 100ms.
Using a view-scoped Managed Bean, I loaded the unique Lookup list into memory when the page is loaded. This produces a really fast type-ahead response (16ms), but I suspect this is a very bad way to do this with a large data set - and could really impact the general server if multiple users were accessing the application. I tried to find information on what would be considered a large object, but couldn't find any guidance or recommendation on how much is too much to store in memory (I searched JSF and XPage sites). Does anyone have any suggestions on this?
Still in a Java class - instead of performing a dblookup to get the 'list' of all values to search through, I have the code run an FT search to get the doc collection, then loop over each doc to extract the field value I want and add it to a SortedSet (which automatically disallows duplicates), then loop over the sorted set to insert the bold tags around the search term, and return that to the browser (see the sketch after the update below). This takes on average 100 ms - which is great and barely noticeable. Are there any drawbacks to this approach - or reasons I should not do it this way?
Thanks for any feedback or advice on this.
Pam.
Update Aug 14, 2013: I tried another approach (inspired by the IBM/Tony McGuckin Insights application on OpenNTF), as the Company Search type-ahead in that application uses managed beans and is fast across a lot of data.
4. Although the Insights application deals with data split across multiple databases, the principle for the type-ahead is similar. I couldn't use a view with getAllEntriesByKey though, as I needed to search for a string within the text too, not just at the start of the entry. I tried creating a ViewEntryCollection based on a view FTSearch, but as we have a lot of duplicate names in the column, this didn't give the unique list I wanted. I then tried using a NotesViewNavigator on a categorized view and looping through that. This produced the unique list I needed, but it turned out to be slower than any of the other methods above. (I did implement these ViewNavigator performance tips.)
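For reference, here is a minimal sketch of the approach described in point 3 above, assuming a Java helper called from SSJS (the result cap, the FT query syntax used, and the returned markup are illustrative; the exact HTML your type-ahead control expects may differ):

import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Pattern;
import lotus.domino.Database;
import lotus.domino.Document;
import lotus.domino.DocumentCollection;
import lotus.domino.NotesException;

public class TypeAheadHelper {

    // Runs an FT search, collects unique field values, and bolds the search term.
    public static String suggest(Database db, String fieldName, String searchKey)
            throws NotesException {
        // The database must have an FT index; cap the result size to keep it responsive.
        DocumentCollection dc = db.FTSearch("[" + fieldName + "] CONTAINS " + searchKey, 500);

        // A TreeSet keeps the values unique and sorted.
        Set<String> unique = new TreeSet<String>(String.CASE_INSENSITIVE_ORDER);
        Document doc = dc.getFirstDocument();
        while (doc != null) {
            unique.add(doc.getItemValueString(fieldName));
            Document next = dc.getNextDocument(doc);
            doc.recycle();
            doc = next;
        }

        // Wrap the search key in <b> tags and build the suggestion list.
        StringBuilder html = new StringBuilder("<ul>");
        for (String value : unique) {
            html.append("<li>")
                .append(value.replaceAll("(?i)(" + Pattern.quote(searchKey) + ")", "<b>$1</b>"))
                .append("</li>");
        }
        return html.append("</ul>").toString();
    }
}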
From my standpoint, performance may be affected by any of the many layers every Domino application (not only XPages) consists of.
From the top: browser (DOM, JS, CSS, HTML...), network (latency, DNS, SSO...), down to the application layer (efficient algorithms, caches), the database/API (amount of data, indexes, reader names...) and OS/hardware (disks, memory...).
Regarding the things you tested:
That is interesting, but could be expected: SSJS is cached and may use a lower-level API to get data (NAPI).
For your environment (32-bit / 3.5 GB RAM) I do NOT recommend caching big lists, especially if you apply it as a pattern to many fields/forms/applications. A cache in a WeakHashMap could be more stable, though.
Use of FT search is perfectly fine, unless you need data that updates frequently. The FT index needs some time and resources to update.
My suggestion is: go for FT if it solves your problem. Definitely put FT through some heavy performance testing on your server first.
(I cannot comment because of my low reputation)
I have recently been tackling a similar problem. Here are some additional points to consider:
Are there many duplicate keywords in the view? Consider making a categorized view for @DbColumn.
FTSearching a view is often slower than FTSearching the database, I believe. See Andre Guirard's article. Consider using db.FTSearch() and refining your FT query to include the view's selection formula, if possible.
The FT index can be updated programmatically with db.updateFTIndex(). If keywords are added rarely but need to be instantly available, you can perform the index update in the keyword document's QuerySave event (or similar). We used this approach when the keywords were stored in a different (much smaller) database, and the update was very fast.
The memory consumption can be checked this way:
Install XPages Toolbox from OpenNTF.
Open your application.
Create a JVM memory dump (Session dumps - Generate Heap Dump).
Install the Eclipse Memory Analyzer Tool (MAT).
Install the IBM Diagnostic Tool Framework into Memory Analyzer.
Load your memory dump into MAT. You will see every Java object and its size.
In the end, I believe that there is no single general answer to your question. You need to test different approaches to find the fastest solution in your environment.
One problem with FT search is this error:
The full text index for this database is in use
Based on my experience, this will occur for a while (maybe a few seconds) when the indexer task starts to index the database. If your users are not very demanding, they can just try again and it will probably work.
But in many cases you want to minimize the errors users get and will have to handle this error gracefully. I've built my own FTSearch method which waits a bit and tries again until the error is no longer received. This shows up as slowness to the user instead of an error.
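As a rough sketch, a wrapper along those lines might look like this (the retry count, the delay, and matching on the error text are all illustrative; checking the NotesException error code for your environment would be more robust than matching the message):

import lotus.domino.Database;
import lotus.domino.DocumentCollection;
import lotus.domino.NotesException;

public class FtSearchRetry {

    // Retries the FT search a few times while the index is busy,
    // instead of surfacing the error to the user.
    public static DocumentCollection ftSearchWithRetry(Database db, String query, int maxDocs)
            throws NotesException, InterruptedException {
        NotesException last = null;
        for (int attempt = 0; attempt < 5; attempt++) {
            try {
                return db.FTSearch(query, maxDocs);
            } catch (NotesException e) {
                if (e.text != null && e.text.contains("full text index for this database is in use")) {
                    last = e;                // index is temporarily locked by the indexer task
                    Thread.sleep(500);       // wait a bit before trying again
                } else {
                    throw e;                 // any other error is re-thrown immediately
                }
            }
        }
        throw last; // still failing after all retries
    }
}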

Best approach to data filtering

I'm building a little library application to have a visual catalogue of my programming ebooks.
For now, I've added some of my ebooks' info into a ko.observableArray in my BooksViewModel.js file.
Later, I'll be implementing a NodeJS application with all the data saved in MongoDB (via Mongoose) and accessed from there, but for now I'm just experimenting directly with Knockout.js.
By default, my library shows all the books I added, unorganized, so I'm looking to implement "categories" by language. Every book object contains a language attribute.
I want to filter the books shown by language, but I'm a little confused about the best way to do this.
The books in the array are not organized; they are all just dropped there. Some are about JavaScript, others about C, and so on.
At first I thought about creating a separate array for each language, and then implementing a method in the ViewModel to select the array corresponding to the requested language.
Later, I would implement a NodeJS API to get them by language, let's say:
GET /languages/C // returns JSON with all the books about C
The ViewModel could contain a method:
self.findByLanguage = function (lang) {
    // GET /languages/:lang (e.g. with jQuery), then replace the observableArray's contents
    $.getJSON('/languages/' + lang, function (data) {
        self.books(data); // don't reassign self.books; push the data into the observable
    });
};
But that would query the database every time. I guess it's better to load the whole books JSON first, save all of them to an array on the client side, and then filter them. That way only one request would be made.
I could have a global array containing all the books, and then implement the filter with ko.utils.arrayFilter.
What do you guys think will be the best approach? Maybe there is a better way.
Thanks in advance!
If "my programming ebooks" means this application is for you only, there's a trivial difference between querying all and only the selected few books as the database load will generally be close to zero in either of these cases. The number of books would be a few hundred perhaps.
But wait, what's the actual benefits from loading them all at once?
Upsides of storing the whole list client-side
If you are always looking at most of the categories, it will save you some milliseconds of database load and all the bandwidth involved in changing categories.
Downsides
Bandwidth usage is awful, and initial page loading is slower, giving you plenty of books you don't want or need.
The database system you're using treats speed as an important optimization factor. Add an index on language and querying should be done in no time anyway. While you're using arrays as your data source, this might not show in comparison to 'just sending the whole array'.
Opening the page in multiple windows/browsers/on multiple PCs will require you to synchronize all changes to all clients. If you don't do this, you'll have stale objects until you reload the page, which is exactly what you should avoid if you keep the list client-side.
If you're planning to run this on your local computer or within your local network, speed should be a trivial issue, so why not let the database do the work? If you're not and speed is an issue, I would personally value "I can load category X pretty fast" over "Initial page loading is slow, but it's fast once everything's loaded".

How does Solr work with data split into different services and therefore not synchronously available?

Take, for instance, an e-commerce store with catalog and price data in different web services. Now, we know that Solr does not allow partial updates to a document field (JIRA bug), so how do you index these two services?
I had three possibilities, but I'm not sure which one is correct:
Partial update - not possible
Solr join - have price and catalog in separate indexes and join them in Solr. You can't join them in your client-side code without screwing up pagination and facet counts. I don't know if this is possible pre-Solr 4.0.
Have some sort of intermediate indexing service, which composes an entire document based on the results from both of these services and sends it for indexing. However, there are two problems with this approach:
3.1 You can still compose documents partially, and then, when the document is complete, set a flag indicating that it is a complete document. However, to do this, each time a document has to be indexed the service first has to check whether the document already exists in the index, edit it, and push it back. So, a big performance hit.
3.2 Your intermediate service checks whether a particular id is available from all services - if not, it silently drops it and hopes that by the time it appears in the other service, the first service will already be populated. This is OK, but it means that an item is not available in search until all fields are available (not always desirable - if you don't have a price, you can simply mark the item as out-of-stock and still have it available).
Of all these methods, only #3.2 looks viable to me - does anyone know how you do this kind of thing with DIH? Because now you have two different entry points (two different web services) into indexing, and each has to check the other.
The usual way to solve this is close to your 3.2: write code that composes the document you want to index from the different available services. The usual flow would be to fetch all the items from the catalog, then fetch the prices while indexing. Whether you want to include catalog items in the search that don't have prices available depends on your business rules for the service. If you want to speed up the process (fetch product, fetch price, repeat), expand the API to fetch 1000 products and then the prices for all of those products at the same time.
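As a rough sketch of that flow (using SolrJ; class names vary by Solr version, and CatalogService, PriceService and the field names are hypothetical stand-ins for the two web services):

import java.util.List;
import java.util.Map;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class CatalogIndexer {

    // Stand-ins for the two web services; in practice these would be HTTP clients.
    interface CatalogService { List<Map<String, Object>> fetchAll(); }
    interface PriceService { Double priceFor(String id); }

    public static void index(CatalogService catalog, PriceService prices) throws Exception {
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/products").build();
        // The catalog drives the indexing; prices are fetched per product while composing the document.
        for (Map<String, Object> p : catalog.fetchAll()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", p.get("id"));
            doc.addField("name", p.get("name"));
            Double price = prices.priceFor(String.valueOf(p.get("id")));
            if (price != null) {
                doc.addField("price", price);
            } else {
                doc.addField("in_stock", false); // business rule: index anyway, mark as unavailable
            }
            solr.add(doc);
        }
        solr.commit();
        solr.close();
    }
}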
There is no reason why you should drop an item from the index if it doesn't have a price, unless you don't want items without prices in your index. It's up to you and your particular needs what kind of information has to be available before indexing the document.
As far as I remember 4.0 will probably support partial updates as it moves to the new abstraction layer for the index files, although I'm not sure it'll make your situation that much more flexible.
Approach 3.2 is the most common, though I think about it slightly differently. First, think about what you want in your search results, then create one Solr document for each potential result, with as much information as you can get. If it is OK to have a missing price, then add the document that way.
You may also want to match the documents in Solr, but get the latest data for display from the web services. That gives fresh results and avoids skew between the batch updates to Solr and the live data.
Don't hold your breath for fine-grained updates to be added to Solr and Lucene. It gets a lot of its speed from not having record-level locking and updates.

Search Elmah AllXml field

I want to implement a search function on any field of Elmah errors. Besides using full-text search on the AllXml field (which is relatively difficult to set up), is there any way to make the search fast? My site has a lot of traffic and generates a lot of errors per minute.
PS: if I use full-text search, given that a lot of new errors are generated, can I search new errors in (near) real time?
It's almost guaranteed that full-text searching is going to be the fastest way to search the data, especially since Elmah stores the XML in an ntext field. Your only other options would be to do a text search using LIKE (slower and more limited than full-text), or to convert the ntext field into an xml data type every time you need to do a search. Depending on the number of errors you're searching over, that could be a very costly process.
The only downside to a full-text solution is that you run the risk of false positives when the search term matches part of the XML markup itself (such as searching on the word "item" or "value"). As to your question about whether you'd be able to search errors in real time, that depends on your database platform. SQL Server can be configured in a number of ways to give you nearly real-time full-text search capabilities (see http://technet.microsoft.com/en-us/library/ms142575.aspx).
