How do multi-language search engines work? - search

Today, while searching for some videos on YouTube, I found that YouTube can return relevant results even when you search in languages other than English.
I tried searching for this on Google, but all I found were some APIs for doing this programmatically. Can someone shed some light on the theory behind this? Papers, links, explanations: anything would do.
Thanks

When I've done this with Elasticsearch, I've simply mapped multiple fields for each document, like:
"text_val": {
  "type": "text",
  "fields": {
    "en": {
      "type": "text",
      "analyzer": "english"
    },
    "it": {
      "type": "text",
      "analyzer": "italian"
    }
  }
}
And then I just search both fields for every query. This works well and is good enough for many applications. However, I'm sure Google is doing something much more complex, almost certainly including language identification on both the indexed documents and the query. In case you want to do language identification, I've used the Python langid library before and had good results.
The problem you're going to face using Elasticsearch for this kind of thing, in my experience, isn't the multi-language part, but that the analyzers for languages other than English don't always work as well as you would like. You may have to write a custom analyzer, with rules to handle lots of special cases, tuned for your specific dataset.
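For what it's worth, here is a rough sketch of what "search both fields for every query" could look like as a request body. It only builds the JSON; the helper name build_multilang_query and the sample query are made up for illustration, and the field names assume the text_val mapping shown above. I've picked multi_match with most_fields here, which is one reasonable choice rather than the only one.

```python
import json

def build_multilang_query(user_query, base_field="text_val", languages=("en", "it")):
    """Build a request body that searches every language sub-field (hypothetical helper)."""
    # One sub-field per analyzer, e.g. text_val.en and text_val.it
    fields = [f"{base_field}.{lang}" for lang in languages]
    return {
        "query": {
            "multi_match": {
                "query": user_query,
                "fields": fields,
                # most_fields combines the scores of every sub-field that matches
                "type": "most_fields",
            }
        }
    }

body = build_multilang_query("ricette di pasta")
print(json.dumps(body, indent=2))
```

If you add language identification later, you could pass only the detected language in languages instead of querying all sub-fields.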

Related

Azure Search: Is there support for conjugation in the French or any language analyzer?

I am facing a business requirement for the French language that conjugation must be supported. For example, if the user searches for "Être", then the search should also find other forms of the verb (voice, mood, tense, etc.).
Based on what I have seen, the Azure Search fr.microsoft analyzer (or a custom analyzer built on top of it) supports this. I have verified this by searching for "Être" and finding documents with: est, EST, sera, sont and etre.
It does not, however, find documents with the following: ete, etes, Ete, Etes.
I searched and found this page which documents the simple and compound forms of Être.
http://conjugator.reverso.net/conjugation-french-verb-%C3%AAtre.html
It does not look like the Microsoft French language analyzer supports all of them. Is this true? If so, how do I ensure that all of them are handled? Do I need to add "ete" and "etes" as synonyms for "Être"? If so, would I also need to add "Ete" and "Etes" as synonyms for "Être"?
Is there a way for me to get documentation on all the French conjugation support in Azure Search?
Last but not least, how do I better understand ALL the conjugation for "Être"? I tried using the Analyzer API...
{ "analyzer": "fr.microsoft", "text": "Être" }
But I only get the following responses:
{
  "@odata.context": "https://one-adscope-search-poc2.search.windows.net/$metadata#Microsoft.Azure.Search.V2016_09_01.AnalyzeResult",
  "tokens": [
    {
      "token": "etre",
      "startOffset": 0,
      "endOffset": 4,
      "position": 0
    },
    {
      "token": "être",
      "startOffset": 0,
      "endOffset": 4,
      "position": 0
    }
  ]
}
In Azure Search, our linguistic analyzers use normalized forms to match different conjugations of a word. For example, at indexing time, the Microsoft analyzer analyzes the word 'sont' to 'etre' and indexes both the original and the normalized/lemmatized form of the word. At query time, say you issue a search query with 'est'. The word 'est' also analyzes to 'etre', so the query finds the document containing 'sont'. The responses from the Analyze API you shared align with this expectation.
Unfortunately, we don't provide an exhaustive list of conjugations in our documentation. You may be able to generate the list by running a sample of your documents through the Analyze API and inspecting the responses.
Finally, you can use our synonyms feature to fill the gaps. I noticed that the words that are not matching (ete, etes, Ete, Etes) all analyze to the base form 'ete'. You can define a synonym rule that says 'etre' and 'ete' are equivalent. The synonyms feature is currently in private preview. Feel free to reach out to me at nateko AT microsoft if you want to try it out.
Hope this helps.
Nate
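To make the "generate the list from Analyze API responses" idea a bit more concrete, here is a small sketch that works on already-collected responses. Everything in it is an assumption for illustration: the function name unmatched_words is made up, and apart from "Être" the token lists are hand-written from the description above (sont and est analyze to 'etre', ete and etes to 'ete'), not captured API output. Two words are treated as matching when their token lists share at least one token.

```python
def unmatched_words(query_word, analyzed):
    """Return the words whose tokens share nothing with the query word's tokens."""
    query_tokens = set(analyzed[query_word])
    return sorted(
        word
        for word, tokens in analyzed.items()
        if word != query_word and not query_tokens & set(tokens)
    )

# word -> tokens, as you might collect them from the Analyze API (illustrative values)
analyzed = {
    "Être": ["etre", "être"],
    "sont": ["etre", "sont"],
    "est": ["etre", "est"],
    "ete": ["ete"],
    "etes": ["ete", "etes"],
}

# Words listed here are the candidates for a synonym rule
print(unmatched_words("Être", analyzed))
```

Running this over a sample of your own documents should surface the forms that need a synonym rule, such as 'ete' here.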

How to build search with faceting over unknown/unspecified set of attributes/properties?

I'm working on a product search engine with a big, constantly growing set of undefined products. Each product has different attributes, and currently they're saved in an array of string key-value pairs like this:
"attributes": [
  {
    "key": "Producttype",
    "value": "Headphones - 3.5 mm plug"
  },
  {
    "key": "Weight",
    "value": "280 g"
  },
  {
    "key": "Soundmode",
    "value": "Stereo"
  },
  ....
]
Each product also has a category. I'm using Elasticsearch 2.4.x, via spring-data-elasticsearch, to persist the data that I want to search. It's possible to upgrade to the newest Elasticsearch version if needed.
As you can see, the attributes are really generic. Nested objects are also needed to be able to search on these attributes. I'm also thinking about preprocessing these attributes into a standardized format. For example, the "Weight" key might be written in different forms like "Productweight" or "Weight of product". Because there are a lot of attributes and I wouldn't like to create a custom property/field for each one, I thought about mapping only the important ones (like weight) to their own custom fields and mapping the other attributes as described above.
Now if someone searches for, say, "iphone", I would like to show some facets on the left of the search result page. The facets should differ if someone searches for "Adidas shoes". Is this possible with the format above using nested objects? Is it possible to build the facets dynamically from the result set Elasticsearch returns? E.g. the most common properties among the result products should be used to create facets. Or do I have to persist some predefined filters/facets on each category? I think that would be too much work, and it also wouldn't work for search results where products belong to different categories. What's the best practice for building a search feature with faceting on entities with n different properties that can grow in the future?

Azure Search: Searching for singular version of a word, but still include plural version in results

I have a question about a peculiar behavior I noticed in my custom analyzer (as well as in the fr.microsoft analyzer). The Analyze API tests below use the “fr.microsoft” analyzer, but I saw exactly the same behavior with my “text_contains_search_custom_analyzer” custom analyzer (which makes sense, as I based it on the fr.microsoft analyzer).
UAT reported that when they search for “femme” (singular) they expect documents with “femmes” (plural) to be found as well. But when I tested with the Analyze API, it appeared that the Azure Search service only tokenizes plural -> plural + singular, whereas tokenizing a singular word produces only the singular token. See below for examples.
Is there a way I can allow a user to search for the singular version of a word, but still include the plural version of that word in the search results? Or will I need to use synonyms to overcome this issue?
Request with “femme”
{
  "analyzer": "fr.microsoft",
  "text": "femme"
}
Response from “femme”
{
  "@odata.context": "https://EXAMPLESEARCHINSTANCE.search.windows.net/$metadata#Microsoft.Azure.Search.V2016_09_01.AnalyzeResult",
  "tokens": [
    {
      "token": "femme",
      "startOffset": 0,
      "endOffset": 5,
      "position": 0
    }
  ]
}
Request with “femmes”
{
  "analyzer": "fr.microsoft",
  "text": "femmes"
}
Response from “femmes”
{
  "@odata.context": "https://EXAMPLESEARCHINSTANCE.search.windows.net/$metadata#Microsoft.Azure.Search.V2016_09_01.AnalyzeResult",
  "tokens": [
    {
      "token": "femme",
      "startOffset": 0,
      "endOffset": 6,
      "position": 0
    },
    {
      "token": "femmes",
      "startOffset": 0,
      "endOffset": 6,
      "position": 0
    }
  ]
}
You are using the Analyze API, which runs text analyzers; that is not the same as searching with the Search API.
Text analyzers support the search engine when it builds its indexes, which are what really sits at the core of a search engine. To structure a search index, the documents that go into it need to be analyzed, and this is where the analyzers come in. They understand different languages and can parse a text and make sense of it, i.e. splitting it into words, removing stop words, understanding sentences and so on. Or as they put it in the docs: https://learn.microsoft.com/en-us/rest/api/searchservice/language-support
Searchable fields undergo analysis that most frequently involves word-breaking, text normalization, and filtering out terms. By default, searchable fields in Azure Search are analyzed with the Apache Lucene Standard analyzer (standard lucene) which breaks text into elements following the "Unicode Text Segmentation" rules. Additionally, the standard analyzer converts all characters to their lower case form.
So what you are seeing is actually perfectly right: the French analyzer breaks down the word you send in and returns the possible tokens from the text. For the first text it cannot find any possible tokens other than 'femme' (I guess there are no other words like 'fem' or 'femm' in French?), but for the second one it can find both 'femme' and 'femmes'.
So, what you are seeing is the normal behavior of a text analyzer.
Searching for the same text using the Search API, on the other hand, should return documents containing both 'femme' and 'femmes', provided you have set the right analyzer (for instance fr.microsoft) on the searchable field. The default 'standard' analyzer does not handle plurals and other inflections of the same word.
Just to add to yoape's response, the fr.microsoft analyzer reduces inflected words to their base form. In your case, the word femmes is reduced to its singular form femme. All the cases that you described will work:
Searching with the base form of a word if an inflected form was in the document. Let's say you're indexing a document with Vive with Femmes. The search engine will index the following terms: vif, vivre, vive, femme, femmes. If you search with any of these terms, e.g. femme, the document will match.
Searching with an inflected form of a word if the base form was in the document. Let's say you're indexing a document with the text Femme fatale. The search engine will index the following terms: femme, fatal, fatale. If you search with the term femmes, the analyzer will also produce its base form. Your query will become femmes OR femme. Documents with any of these terms will match.
Searching with an inflected form if another inflected form of that word was in the document. If you have a document with allez, the terms allez and aller will be indexed. If you search for alle, the query becomes alle OR aller. Since both inflected forms are reduced to the same base form, the document will match.
The key takeaway here is that the analyzer processes not only the documents but also the query terms. Terms are normalized according to language-specific rules.
I hope that explains it.
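Here is a toy simulation of the behavior described above, in case it helps to see index-time and query-time analysis side by side. It is only a sketch: TOY_ANALYSIS is a hand-written lookup table that stands in for the fr.microsoft analyzer (it is not the real analyzer), the two sample documents are made up, and a query is treated as an OR over its tokens.

```python
# Hand-written stand-in for the analyzer: word -> tokens (illustrative only)
TOY_ANALYSIS = {
    "femme": ["femme"],
    "femmes": ["femme", "femmes"],
    "fatale": ["fatal", "fatale"],
}

def analyze(text):
    """Lowercase, split on whitespace and expand each word into its tokens."""
    tokens = set()
    for word in text.lower().split():
        tokens.update(TOY_ANALYSIS.get(word, [word]))
    return tokens

def build_index(documents):
    """Inverted index: token -> set of document ids containing it."""
    index = {}
    for doc_id, text in documents.items():
        for token in analyze(text):
            index.setdefault(token, set()).add(doc_id)
    return index

def search(index, query):
    """The query goes through the same analyzer; any matching token is a hit."""
    hits = set()
    for token in analyze(query):
        hits |= index.get(token, set())
    return hits

documents = {1: "Femme fatale", 2: "les femmes"}
index = build_index(documents)
print(search(index, "femme"))
print(search(index, "femmes"))
```

Both queries find both documents, because the plural document was indexed under its base form as well, and the plural query is expanded to include the base form.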

CouchDB view default options in design document not working

The problem is simple: I have written map functions in a design document of a CouchDB database, which emit something like {"_id":doc._id}. Together with the include_docs=true query option, I get the desired results with the linked documents. Because the map functions are designed to work with include_docs=true, I put this option in the design document to make it the default:
{...
"options":{"include_docs":true}
...}
However, when I query the view, the results still come back without the linked documents, and I need to specify the option explicitly in the query. I also tried putting other query options (e.g. limit=200) into the design document, but they did not work either.
I am using CouchDB 1.5 and cannot find any discussion, issue or bug regarding this. Does anyone have any idea? Thanks in advance!
Edit: I have reported the issue in Apache, and I am told that the statement about this was removed.
_design/ddoc/options cannot do that.
According to the CouchDB docs, the properties of a design doc's options object only affect view indexing, not view querying (the only two settings being local_seq and include_design).
_design/ddoc/rewrites can!
If you want to set query options server side, you can do so by specifying a rewrites array in your design document.
Let's say you want to expose a query to _view/myview that has include_docs set to true, you add the following rewrites array to your design document:
{ "_id": "_design/myddoc"
, "views": { "myview": { "map": "function(doc) { ... }" } }
, "rewrites":
[ { "from": "allmyviews/myview"
, "to": "_view/myview"
, "query":
{ "include_docs": "true"
}
}
]
}
Now, when you request http://localhost:5984/mydb/_design/myddoc/_rewrite/allmyviews/myview without the include_docs parameter, couchdb will respond as if you had included it.
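As a quick illustration of what the rule is meant to achieve, this sketch builds the rewritten URL and the direct view URL it is supposed to stand in for. It only does string building and makes no request; the helper names are made up, and the host, database and design document names are the ones used in the example above.

```python
from urllib.parse import urlencode

# Same host, database and design document as in the example above
BASE = "http://localhost:5984/mydb/_design/myddoc"

def rewritten_url(rule):
    """URL the client calls: the rule's 'from' path under _rewrite."""
    return f"{BASE}/_rewrite/{rule['from']}"

def equivalent_direct_url(rule):
    """Direct view URL with the rule's query parameters spelled out."""
    return f"{BASE}/{rule['to']}?{urlencode(rule['query'])}"

rule = {
    "from": "allmyviews/myview",
    "to": "_view/myview",
    "query": {"include_docs": "true"},
}

print(rewritten_url(rule))
print(equivalent_direct_url(rule))
```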

couchdb match multiple inconsistent keys

Considering the following two documents:
{
  "_id": "a6b8d3d7e2d61c97f4285220c103abca",
  "_rev": "7-ad8c3eaaab2d4abfa01abe36a74da171",
  "File": "/store/document/scan_bgd123.jpg",
  "Commend": "Describes a person",
  "DateAdded": "2014-07-17T14:13:00Z",
  "Name": "Joe",
  "LastName": "Soap",
  "Height": "192cm",
  "Age": "25"
}
{
  "_id": "a6b8d3d7e2d61c97f4285220c103c4a9",
  "_rev": "1-f43410cb2fe51bfa13dfcedd560f9511",
  "File": "/store/document/scan_adf123.jpg",
  "Comment": "Describes a car",
  "Make": "Ford",
  "Year": "2011",
  "Model": "Focus",
  "Color": "Blue"
}
How would I find a document based on multiple criteria, say for example "Make"="Ford" and "Color"="Blue"? I realize I need a view for this, but I don't know what the key is going to be, and as you can see from the two documents, the key/value pairs aren't consistent. The only consistent item will be the "File" key.
I'm attempting to create a CouchDB database that will store the location of files, tagged with key/value pairs.
EDIT:
Perhaps I should reconsider my data structure and modify it slightly?
{
  "_id": "a6b8d3d7e2d61c97f4285220c103c4a9",
  "_rev": "1-f43410cb2fe51bfa13dfcedd560f9511",
  "File": "/store/document/scan_adf123.jpg",
  "Tags": {
    "Comment": "Describes a car",
    "Make": "Ford",
    "Year": "2011",
    "Model": "Focus",
    "Color": "Blue"
  }
}
So, I need to find documents by a Key>Value pair in the tags, or by any number of Key>Value pairs, to filter down to the document I want. The problem here is that I want to tag objects with Key>Value pairs. These tags could be very different per view, so the next document will have a whole different set of Key>Value pairs.
CouchDB supports a flexible schema. Documents do not need to be consistent to be queryable. The view for your scenario is pretty straightforward. Here is a map function that should do the trick.
function(doc) {
  if (doc.Make && doc.Color) {
    emit([doc.Make, doc.Color], null);
  }
}
This gives you a view which you can then query with the key passed as a query-string parameter, like
/view-name?key=["Ford","Blue"]&include_docs=true
This should give you the desired result.
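If you build that request from code, note that the key has to be JSON and URL-encoded in the query string. A minimal sketch, assuming a local server and made-up database, design document and view names (mydb, myddoc, by_make_color); view_url is just an illustrative helper, and no request is actually sent:

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit

def view_url(base, db, ddoc, view, key, include_docs=True):
    """Build the URL for querying a view by an exact (possibly compound) key."""
    # The key must be sent as JSON, so an array key becomes ["Ford","Blue"]
    params = {"key": json.dumps(key, separators=(",", ":"))}
    if include_docs:
        params["include_docs"] = "true"
    # urlencode percent-escapes the brackets, quotes and comma
    return f"{base}/{db}/_design/{ddoc}/_view/{view}?{urlencode(params)}"

url = view_url("http://localhost:5984", "mydb", "myddoc", "by_make_color", ["Ford", "Blue"])
print(url)

# Decoding the query string again shows the key the server would receive
decoded_key = json.loads(parse_qs(urlsplit(url).query)["key"][0])
print(decoded_key)
```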
Edit based on comment
For that you will need two separate views. Every view in CouchDB is designed to fulfil a specific query need. This means that you have to think about the access strategy for your data. It is a lot more work on your part initially, but for the trouble you are rewarded with data that is indexed and has very fast access times.
So, to answer your question directly: create two views. One for Make, like we have already done, and another for Name, like
function(doc) {
  if (doc.Name && doc.LastName) {
    emit([doc.Name, doc.LastName], null);
  }
}
Now the Name view will index only those documents that have a name in them, whereas the Make view will index only those documents that have a make in them.
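To see which documents end up in which view, here is a small Python simulation of the two map functions. It is only a sketch of the idea and not how CouchDB runs views: the documents are trimmed-down copies of the two from the question, build_view is a made-up helper, and the Name view is assumed to emit [Name, LastName].

```python
def make_view(doc):
    """Stand-in for the Make map function: emit only if both fields exist."""
    if doc.get("Make") and doc.get("Color"):
        yield [doc["Make"], doc["Color"]], None

def name_view(doc):
    """Stand-in for the Name map function."""
    if doc.get("Name") and doc.get("LastName"):
        yield [doc["Name"], doc["LastName"]], None

def build_view(map_fn, docs):
    """Run a map function over all documents and collect the emitted rows."""
    rows = []
    for doc in docs:
        for key, value in map_fn(doc):
            rows.append({"id": doc["_id"], "key": key, "value": value})
    # View rows are kept sorted by key; plain Python ordering stands in for
    # CouchDB's collation rules here
    return sorted(rows, key=lambda row: row["key"])

docs = [
    {"_id": "a6b8d3d7e2d61c97f4285220c103abca", "Name": "Joe", "LastName": "Soap"},
    {"_id": "a6b8d3d7e2d61c97f4285220c103c4a9", "Make": "Ford", "Color": "Blue"},
]

print(build_view(make_view, docs))
print(build_view(name_view, docs))
```

Each view ends up with exactly one row, because a document that lacks the checked fields never emits anything.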
What happens when a requirement comes up in the future for which you don't have a view?
You can try a few things.
1. This is probably the easiest solution: use couchdb-lucene for your dynamic queries. In this case your architecture will use CouchDB views for the queries that you know your application needs, and a Lucene index for the queries that you don't yet know you might need. So, for instance, you have indexed name and last name in a CouchDB view, but a requirement arises to query by age: simply dump the age field into Lucene and it will take care of the rest.
2. Another approach is the PPP technique, where you exploit the fact that creating a view is a one-time cost: you can build views during less active hours and deploy them to a production service once they are built.
3. Combine steps 1 and 2! Use Lucene to handle ad hoc requests while you are building views using the PPP technique.
