I have a database that holds keywords for items, along with their localizations in different languages (around 30 languages are supported right now), if any exist for that item. I want to be able to search these items using Azure Search. However, I'm not sure how to set up the index architecture. Two solutions come to mind in this scenario:
Either I will
1) have a different index for each language, and use that language's analyzer for that index. Later, when I want to search, I will also need to detect the language of the user's query and then search the index corresponding to that language.
or
2) have a single index with many fields corresponding to the different localizations of the item. Azure Search supports prioritizing languages when searching, so knowing the user's language may come in handy, but it is not necessarily a must.
I'm kind of new to this stuff, so any pointers, links, ideas etc. will be of tremendous help, even if they don't answer the question directly.
Option 2 is what we recommend (having a single index with one field per language). You can set some static priorities by assigning field weights using a scoring profile. If you are able to detect the language used in a query, you can scope the search to just that language using the searchFields option.
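To make that concrete, here is a rough sketch of option 2 against the Azure Search REST API using Python's requests. The service URL, admin key, index and field names, analyzer choices and API version below are all placeholders/assumptions, not taken from the question:

```python
import requests

# Placeholder service endpoint, key and index name -- replace with your own.
SERVICE = "https://my-service.search.windows.net"
API_VERSION = "2020-06-30"  # use whichever API version your service targets
HEADERS = {"Content-Type": "application/json", "api-key": "<admin-key>"}

# One searchable field per language, each with a language-specific analyzer,
# plus an optional scoring profile that sets static per-language priorities.
index_definition = {
    "name": "items",
    "fields": [
        {"name": "id", "type": "Edm.String", "key": True},
        {"name": "keywords_en", "type": "Edm.String", "searchable": True, "analyzer": "en.lucene"},
        {"name": "keywords_de", "type": "Edm.String", "searchable": True, "analyzer": "de.lucene"},
        {"name": "keywords_fr", "type": "Edm.String", "searchable": True, "analyzer": "fr.lucene"},
        # ...repeat for the other languages you support
    ],
    "scoringProfiles": [
        {"name": "preferEnglish",
         "text": {"weights": {"keywords_en": 2.0, "keywords_de": 1.0, "keywords_fr": 1.0}}}
    ],
}
requests.put(f"{SERVICE}/indexes/items?api-version={API_VERSION}",
             json=index_definition, headers=HEADERS)

# If you can detect the query language (here: German), scope the search to that field.
query = {"search": "fahrrad", "searchFields": "keywords_de", "scoringProfile": "preferEnglish"}
hits = requests.post(f"{SERVICE}/indexes/items/docs/search?api-version={API_VERSION}",
                     json=query, headers=HEADERS).json()
print(hits.get("value", []))
```

If the query language cannot be detected, simply omit searchFields and let the scoring profile decide which language matches rank highest.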
I have a requirement within my application to fuzzy match a string value inputted by the user, against a datastore.
I am basically attempting to find possible duplicates in the process in which data is added to the system.
I have looked at Metaphone, Double Metaphone, and Soundex, and the conclusion I have come to is that they are all well and good when dealing with a single-word input string; however, I am trying to match against an undefined number of words (they are actually place names).
I did consider splitting each of the words out of the string (removing any I define as noise words), then implementing some logic to determine which place names in my data store best match, based on the keys from whichever algorithm I choose (a rough sketch of this token-level idea follows the questions below). The advantage I see is that I could selectively tighten or loosen the match criteria to suit the application; however, this does seem a little dirty to me.
So my question(s) are:
1: Am I approaching this problem in the right way? Yes, I understand it will be quite expensive; however (without going too deeply into the implementation), this information will be coming from a memcache database.
2: Are there any algorithms out there that already specialise in phonetically matching multiple words? If so, could you please provide me with some information on them, and if possible their strengths and limitations.
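A minimal sketch of the token-level idea described above, assuming an in-memory list of place names stands in for the datastore. The tiny Soundex implementation is illustrative only (Metaphone or Double Metaphone keys would slot in the same way), and the noise-word list and scoring are made up:

```python
def soundex(word):
    """Very small Soundex implementation -- illustrative only, not production grade."""
    groups = {"bfpv": "1", "cgjkqsxz": "2", "dt": "3", "l": "4", "mn": "5", "r": "6"}
    word = "".join(c for c in word.upper() if c.isalpha())
    if not word:
        return ""
    digits = ""
    for ch in word:
        for letters, code in groups.items():
            if ch.lower() in letters:
                digits += code
                break
        else:
            digits += "0"  # vowels and h, w, y act as separators here
    collapsed = digits[0]
    for d in digits[1:]:
        if d != collapsed[-1]:  # drop adjacent duplicate codes
            collapsed += d
    return (word[0] + collapsed[1:].replace("0", "") + "000")[:4]

NOISE_WORDS = {"the", "of", "and"}  # whatever you define as noise

def phonetic_keys(name):
    return {soundex(t) for t in name.split() if t.lower() not in NOISE_WORDS}

def score(query, candidate):
    # Fraction of phonetic keys the two names share (Jaccard overlap).
    q, c = phonetic_keys(query), phonetic_keys(candidate)
    return len(q & c) / max(len(q | c), 1)

# Toy "datastore" of place names; the slightly wrong query still ranks
# the intended place first.
places = ["Newcastle upon Tyne", "New Castle", "Carlisle"]
query = "newcastle on tyne"
print(sorted(((score(query, p), p) for p in places), reverse=True))
```

Because the score is just key overlap, tightening or loosening the match is a matter of moving the acceptance threshold.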
You may want to look into a locality-sensitive hash such as the Nilsimsa hash. I have used Nilsimsa to "hash" craigslist posts across various cities to search for duplicates (NOTE: I'm not a CL employee, this was just a personal project I was working on).
Most of these methods aren't as tunable as you may want (basically you get some loosely defined "edit distance" metric), and they're not phonetic, just character-based.
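To make the "character based, not phonetic" point concrete, here is a simplified stand-in (my own sketch, not Nilsimsa itself): comparing strings by the overlap of their character trigrams gives roughly the same kind of loosely defined, spelling-driven similarity.

```python
def trigrams(text):
    """Set of character 3-grams, lowercased, whitespace normalised."""
    t = " ".join(text.lower().split())
    return {t[i:i + 3] for i in range(len(t) - 2)}

def char_similarity(a, b):
    """Jaccard overlap of character trigrams: 1.0 = identical, 0.0 = disjoint."""
    ga, gb = trigrams(a), trigrams(b)
    return len(ga & gb) / max(len(ga | gb), 1)

# This rewards shared spellings, not shared pronunciations:
print(char_similarity("Newcastle upon Tyne", "Newcastle-upon-Tyne"))  # fairly high
print(char_similarity("Leicester", "Lester"))  # much lower, despite sounding alike
```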
I'm just getting used to using Elasticsearch in our platform, and so far it's proven to be a superb move, but other than some built-in stats I haven't found any reference to creating a report of sorts. I guess the closest comparison would be facets, but it seems they need to be predefined in order to show stats for them.
What I would like to know is, is it possible to run reports such as:
What are the most popular phrases within the indexed content for the last 24 hours, week, etc.? This would be similar to what's used to produce Word/Tag Clouds, but without relying on user input (common search terms for example) as a source.
Can facets be suggested rather than specified based on the most popular phrases for a particular search term(s)? i.e. If someone searches for the term "Music" and the most popular phrases including "Music" are things like "Music Awards" or "Electronic Music", can those facets be returned without designing them explicitly into the initial request?
As you can see, what I'd like to know is whether we can gain any analytics from the indexed content, not just explicit results.
For these types of reports and analytics of data in Elasticsearch you may want to try Kibana3 (http://three.kibana.org/). This will only work if your data has a timestamp, as that is a requirement for analyzing data in Kibana. The tool is very flexible and I believe it will give you the insights about your data that you are looking for.
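If you would rather pull those numbers out with a query instead of a dashboard, Elasticsearch versions that support aggregations can approximate "most popular terms over the last 24 hours" with a terms aggregation over a time-filtered search. A rough sketch via Python's requests; the host, index and field names are made up:

```python
import requests

ES = "http://localhost:9200"

# Most frequent terms in the "content" field over the last 24 hours.
# Note: a terms aggregation on an analysed text field may need fielddata
# enabled or a keyword sub-field, depending on your version and mapping.
body = {
    "size": 0,
    "query": {"range": {"timestamp": {"gte": "now-24h"}}},
    "aggs": {
        "popular_terms": {"terms": {"field": "content", "size": 25}},
        # A significant_terms aggregation, combined with a match query on
        # "music", is one way to get "suggested facets": terms that are
        # unusually frequent in the matching documents.
        # "suggested_facets": {"significant_terms": {"field": "content"}},
    },
}
r = requests.post(f"{ES}/my-index/_search", json=body)
for bucket in r.json()["aggregations"]["popular_terms"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
```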
I am developing an Azure-based website and I want to provide search capabilities using Lucene (structured JSON objects would be indexed and stored in Lucene, while other content such as Word documents would be indexed in Lucene but stored in blob storage). I want the search to be secure, such that one user would never see a document belonging to another user. I want to allow ad-hoc searches as typed by the user. Lastly, I want to query programmatically to return predefined sets of data, such as "all notes for user X". I think I understand how to add properties to each document to achieve these 3 objectives. (I am listing them here so that anyone kind enough to answer will have a better idea of what I am trying to do.)
My questions revolve around performance and security.
Can I improve document security by having a separate index for each user, or is including the user's ID as a parameter in each search sufficient?
Can I improve indexing speed and total throughput of the system by having a separate index for each user? My thinking is that having separate indexes would allow me to scale the system by having multiple index writers (perhaps even on different server instances) working at the same time, each on their own index.
Any insight would be greatly appreciated.
Regards,
Nate
Of course, one index.
You can do even better than what you suggested by using ManifoldCF (an Apache product that knows how to handle Solr) to manage security.
And one off-topic, uninformed suggestion: I'd rather use CloudBees or Heroku (or Amazon) instead of Azure.
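Whichever product ends up executing the search, the usual single-index pattern is to store an owner field on every document and make a user filter mandatory in the query layer. As a rough illustration of that idea (using a hypothetical Solr core and field names, since the answer above points at Solr via ManifoldCF):

```python
import requests

SOLR = "http://localhost:8983/solr/notes"  # hypothetical core

def search_for_user(user_id, user_query):
    """Every search is forced through a filter on the owning user,
    so one user can never see another user's documents."""
    params = {
        "q": user_query,                 # the ad-hoc query typed by the user
        "fq": f"owner_id:{user_id}",     # mandatory security filter
        "wt": "json",
    }
    return requests.get(f"{SOLR}/select", params=params).json()

# "All notes for user X" is just the same filter with a match-all query.
all_notes = search_for_user("user-x", "*:*")
```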
Until you use several machines for indexing, I think it's more convenient to use a single index. The Lucene community has done a lot of work to make the indexing process as efficient as it can be. So unless you intentionally want to implement distributed indexing, I don't recommend splitting indexes.
However, there are several reasons why you might want to split indexes:
if your machine has several IO devices that can be utilized in parallel. In this case, if you are IO bound, splitting indexes is a good idea.
splitting document fields between indexes (this is what ParallelReader is intended for). This is a more exotic form of splitting, but it may be a good idea if searches are performed using different groups of fields. Suppose we have two search query types: the first uses the fields name and type, and the second uses the fields price and discount. If those fields are updated at different rates (I assume name updates are far rarer than price updates), updating only part of the index requires fewer IO resources. This gives the system more overall throughput.
I am doing a multilingual search, and I will use Lucene as the tool to do it.
I already have the translated contents; there will be 3 or 4 language versions of each document.
For indexing and search, there are 4 possible strategies. For each document/contents:
each language is indexed in a different index/directory.
each language is indexed in a different document but in the same index.
each language is indexed in a different Field but in the same document.
all the languages are indexed in the same Field in a document.
But I have not tested each of these ways yet; could anyone experienced tell me which is the better way to do multilingual search?
Thanks!
Although the question has been asked a couple of years ago, it's still a great question.
There are a couple of aspects to consider when evaluating the different solution approaches:
are language specific analyzers used at indexing time?
is the query language always known (e.g. user selectable)?
does the query language always match one of the "content" languages?
should only content matching the query language be returned?
is relevancy important?
If (1.) & (5.) apply in your project, you should not consider any strategy that (re-)uses the same field for multiple languages in the same inverted index, as term frequencies for the various languages get all mixed up (independent of whether you index your multilingual content as one document or as multiple documents). It might be interesting to know that adding "n" language-specific fields does not result in an "n"-times larger index, but for obvious reasons it comes with some overhead.
Single Field (Strategies 2 & 4)
+ only one field to query
+ scales well for additional languages
+ can distinguish/filter languages (if multiple documents, and extra language field)
- cannot distinguish/filter languages (if single document)
- cannot just display the queried language (if single document)
- "wrong" term frequencies (as all languages mixed up)
Multiple Fields (Strategy 3)
+ correct term frequencies
+ can easily restrict/filter queries for particular language(s)
+ facilitates Auto-Complete & Spellcheck / Did-You-Mean
- more fields to index
- more fields to query
Multiple Indices (Strategy 1)
+ correct term frequencies
+ can easily restrict/filter queries for particular language(s)
+ facilitates Auto-Complete & Spellcheck / Did-You-Mean
- each additional language requires its own index
Independent of a single- or multiple-fields approach, your solution might need to handle result collapsing for matches in the "wrong" language if you index your content as multiple documents. One approach could be to add a language field and filter on it.
Recommendation: The approach/strategy you choose depends on the project's requirements. Whenever possible, I would opt for a multiple fields or multiple indices approach.
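Lucene itself is a Java library, so concrete code depends on your stack; purely to make the multiple-fields layout (strategy 3) tangible, here is a sketch of the same idea expressed against a recent Elasticsearch (which is Lucene-based), where the built-in "english"/"french"/"german" analyzers play the role of Lucene's per-field analyzers. Host, index and field names are made up:

```python
import requests

ES = "http://localhost:9200"

# Strategy 3: one document per item, one field per language,
# each field analysed with that language's analyzer.
mapping = {
    "mappings": {
        "properties": {
            "title_en": {"type": "text", "analyzer": "english"},
            "title_fr": {"type": "text", "analyzer": "french"},
            "title_de": {"type": "text", "analyzer": "german"},
        }
    }
}
requests.put(f"{ES}/items", json=mapping)

# If the query language is known, restrict the query to that field;
# term frequencies stay per-language, so relevancy is not skewed.
query = {"query": {"match": {"title_fr": "chanson"}}}
print(requests.post(f"{ES}/items/_search", json=query).json())
```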
In short, it depends on your needs, but I would go with option 3 or 1.
1) would probably be the best way if there is no overlap / shared fields between the languages at all.
3) would be the way to go if there are several fields that need to be shared across languages, as this saves disk space and allows a larger part of the index to fit in the file system cache.
I would not recommend 2): this makes your search queries more complex and forces Lucene to consider more documents.
4) will make your search query very complex, unless you want users to be able to search in any language without selecting it first.
We have enums, free-text, and referenced fields etc. in our DB.
Each enum has its own translation, free-text could be in any language. We'd like to do efficient large-scale free-text searching and enum value based searching.
I know of solutions like Solr which are nice, but that would mean we'd have to index entire de-normalized records with all the text of all the languages in the system. This seems a bit excessive.
What are some recommended approaches for searching multilingual normalized data? Anyone tackle this before?
ETL. Extract, Transform, Load. In other words, get the data out of your existing databases, transform it (which is more than merely denormalizing it) and load it into SOLR. The SOLR db will be a lot smaller than the existing databases because there is no relational overhead. And SOLR search takes most of the load off of your existing database servers.
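As a heavily simplified sketch of that ETL step, assuming a relational source reachable from Python and a Solr core named "search" (every table, column and field name below is made up):

```python
import requests
import sqlite3  # stand-in for whatever relational database you actually use

SOLR_UPDATE = "http://localhost:8983/solr/search/update?commit=true"

# Extract: pull the normalised rows together with their enum translations.
conn = sqlite3.connect("app.db")
rows = conn.execute("""
    SELECT i.id, i.free_text, i.language, e.label_en AS category_en
    FROM items i JOIN enum_labels e ON e.enum_id = i.category_id
""").fetchall()

# Transform: flatten each row into one denormalised Solr document.
docs = [
    {
        "id": str(item_id),
        "free_text": free_text,
        "language": language,
        "category_en": category_en,
    }
    for item_id, free_text, language, category_en in rows
]

# Load: post the documents to Solr's JSON update handler.
requests.post(SOLR_UPDATE, json=docs,
              headers={"Content-Type": "application/json"})
```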
Take a good look at how to configure and use SOLR and learn about SOLR cores. You may want to put some languages in separate cores because that way you can more effectively use the various stemming algorithms in SOLR. But even with multilingual data you can still use bigrams (such as are used with Chinese language analysis).
Having multiple cores makes searching a bit more complex since you can try either a single language index, or an all-languages index. But it is much more effective to group language data and apply language specific stopwords, protected words, stemming and language analysis tools.
Normally you would include some key data in the index so that when you find a record via SOLR search, you can then reference directly into the source db. Also, you can have normalised and non-normalised data together; for instance, an enum could be recorded in a normalised field in English as well as in a non-normalised field in the same language as the free text. A field can be duplicated in order to apply two different analysis and filtering treatments.
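If the core uses a managed schema, one way to set up such a duplicated field is Solr's Schema API; a sketch, with illustrative field names and field types borrowed from the sample configs:

```python
import requests

SCHEMA = "http://localhost:8983/solr/search/schema"

# Keep the raw value for exact/enum filtering and copy it into a second
# field that gets full text analysis (tokenising, stemming, stopwords...).
changes = {
    "add-field": [
        {"name": "place_name", "type": "string", "stored": True},
        {"name": "place_name_text", "type": "text_en", "stored": False},
    ],
    "add-copy-field": {"source": "place_name", "dest": "place_name_text"},
}
requests.post(SCHEMA, json=changes)
```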
It would be worth your while to trial this with a subset of your data in order to learn how SOLR works and how best to configure it.