I created an Azure index for my DocumentDB collection, and it seems to be working fine. The index has properties for a user account like FirstName, LastName, and Username. The problem is the default tokenizer seems to be tokenizing the Username field. While I want token matches for the first two fields, I'd like character matching for the usernames. Is there an easy way to achieve this through the Azure portal? If not, how can I achieve this?
Adding another answer based on your above comments. So basically in the best case, what you want to do is prefix, suffix and wildcard search. So if the username was user246392, you could find it by typing "use", "392" or even "er246". The prefix is easy, because you could search use* and it would find it.
Kendra Little did a really nice blog post on how to leverage RegEx with Azure Search, which can allow you to do the full wildcard part of your ask (i.e. search for "392").
If you wanted to do the suffix search, you can do a trick that is quite efficient where you create a new field that would be a custom analyzer that would index the words in opposite order. Here is an example of a index schema that would allow this (over suffixName field)
{
"name":"people",
"fields": [
{ "name":"id", "type":"Edm.String", "key":true, "searchable":false },
{"name": "suffixName", "type": "Edm.String", "searchable":true, "indexAnalyzer":"suffixIndexingAnalyzer", "searchAnalyzer":"reverseText"}
],
"analyzers": [
{
"#odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
"name": "suffixIndexingAnalyzer",
"tokenizer": "keyword_v2",
"tokenFilters": [
"asciifolding",
"lowercase",
"reverse",
"my_edgeNGramForSuffix"
],
"charFilters": []
},
{
"#odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
"name": "reverseText",
"tokenizer": "classic",
"tokenFilters": [
"lowercase",
"reverse"
],
"charFilters": []
}
],
"tokenFilters":[
{
"#odata.type": "#Microsoft.Azure.Search.EdgeNGramTokenFilterV2",
"name": "my_edgeNGramForSuffix",
"minGram": 2,
"maxGram": 25,
"side": "front"
}
]
}
Can you give us an example of what you would want to do over this username field? I am not sure what you mean by character matching. Is it a RegEx based character match? If so, perhaps a custom analyzer that enabled RegEx searched might help for this field? Please note, RegEx is not as performant as typical indexing as we would need to scan the entire content as opposed to going to the inverted index to find token matches.
Related
I have a feeds collection with documents like this:
{
"created": 1510000000,
"find": [
"title of the document",
"body of the document"
],
"filter": [
"/example.com",
"-en"
]
}
created contains an epoch timestamp
find contains an array of fulltext snippets, e.g. the title and the body of a text
filter is an array with further search tokens, such as hashtags, domains, locales
Problem is that find contains fulltext snippets, which we want to tokenize, e.g. with a text analyzer, but filter contains final tokens which we want to compare as a whole, e.g. with the identity analyzer.
Goal is to combine find and filter into a single custom analyzer or to combine two analyzers using two SEARCH statements or something to that end.
I did manage to query by either find or by filter successfully, but do not manage to query by both. This is how I query by filter:
I created a feeds_search view:
{
"writebufferIdle": 64,
"type": "arangosearch",
"links": {
"feeds": {
"analyzers": [
"identity"
],
"fields": {
"find": {},
"filter": {},
"created": {}
},
"includeAllFields": false,
"storeValues": "none",
"trackListPositions": false
}
},
"consolidationIntervalMsec": 10000,
"writebufferActive": 0,
"primarySort": [],
"writebufferSizeMax": 33554432,
"consolidationPolicy": {
"type": "tier",
"segmentsBytesFloor": 2097152,
"segmentsBytesMax": 5368709120,
"segmentsMax": 10,
"segmentsMin": 1,
"minScore": 0
},
"cleanupIntervalStep": 2,
"commitIntervalMsec": 1000,
"id": "362444",
"globallyUniqueId": "hD6FBD6EE239C/362444"
}
and I created a sample query:
FOR feed IN feeds_search
SEARCH ANALYZER(feed.created < 9990000000 AND feed.created > 1500000000
AND (feed.find == "title of the document")
AND (feed.`filter` == "/example.com" OR feed.`filter` == "-uk"), "identity")
SORT feed.created
LIMIT 20
RETURN feed
The sample query works, because find contains the full text (identity analyzer). As soon as I switch to a text analyzer, single word tokens work for find, but filter no longer works.
I tried using a combination of SEARCH and FILTER, which gives me the desired result, but I assume it probably performs worse than having the SEARCH analyzer do the whole thing. I see that analyzers is an array in the view syntax, but I seem not to be able to set individual fields for each analyzer.
The analyzers can be added as a property to each field in fields. What is specified in analyzers is the default and is used in case a more specific analyzer is not set for a given field.
"analyzers": [
"identity"
],
"fields": {
"find": {
"analyzers": [
"text_en"
]
},
"filter": {},
"created": {}
},
Credits: Simran at ArangoDB
I would like to do a query matches against two properties of the same item in a sub-collection.
Example:
[
{
"name": "Person 1",
"contacts": [
{ "type": "email", "value": "person.1#xpto.org" },
{ "type": "phone", "value": "555-12345" },
]
}
]
I would like to be able to search by emails than contain xpto.org but,
doing something like the following doesn't work:
search.ismatchscoring('email','contacts/type,','full','all') and search.ismatchscoring('/.*xpto.org/','contacts/value,','full','all')
instead, it will consider the condition in the context of the main object and objects like the following will also match:
[
{
"name": "Person 1",
"contacts": [
{ "type": "email", "value": "555-12345" },
{ "type": "phone", "value": "person.1#xpto.org" },
]
}
]
Is there any way around this without having an additional field that concatenates type and value?
Just saw the official doc. At this moment, there's no support for correlated search:
This happens because each clause applies to all values of its field in
the entire document, so there's no concept of a "current sub-document
https://learn.microsoft.com/en-us/azure/search/search-howto-complex-data-types
and https://learn.microsoft.com/en-us/azure/search/search-query-understand-collection-filters
The solution I've implemented was creating different collections per contact type.
This way I'm able to search directly in, lets say, the email collection without the need for correlated search. It might not be the solution for all cases but it works well in this case.
This is my source from ES:
"_source": {
"queryHash": "query412236215",
"id": "query412236215",
"content": {
"columns": [
{
"name": "Catalog",
"type": "varchar(10)",
"typeSignature": {
"rawType": "varchar",
"typeArguments": [],
"literalArguments": [],
"arguments": [
{
"kind": "LONG_LITERAL",
"value": 10
}
]
}
}
],
"data": [
[
"apm"
],
[
"postgresql"
],
[
"rest"
],
[
"system"
],
[
"tpch"
]
],
"query_string": "show catalogs",
"execution_time": 1979
},
"createdOn": "1514269074289"
}
How can i get the n records inside _source.data?
Lets say _source.data have 100 records , I want only 10 at a time ,also is it possible to assign offset for next 10 records?
Thanks
Take a look at scripting. As far as I know there isn't any built-in solution because Elasticsearch is primarily built for searching and filtering with a document store only as a secondary concern.
First, the order in _source is stable, so it's not totally impossible:
When you get a document back from Elasticsearch, any arrays will be in
the same order as when you indexed the document. The _source field
that you get back contains exactly the same JSON document that you
indexed.
However, arrays are indexed—made searchable—as multivalue fields,
which are unordered. At search time, you can’t refer to "the first
element" or "the last element." Rather, think of an array as a bag of
values.
However, source filtering doesn't cover this, so you're out of luck with arrays.
Also inner hits won't help you. They do have options for sort, size, and from, but those will only return the matched subdocuments and I assume you want to page freely through all of them.
So your final hope is scripting, where you can build whatever you want. But this is probably not what you want:
Do you really need paging here? Results are transferred in a compressed fashion, so the overhead of paging is probably much larger than transferring the data in one go.
If you do need paging, because your array is huge, you probably want to restructure your documents.
tldr;
How to match and filter localized search with a localized index ?
long version
I have an application where the user search must be done in the context of it's language.
In elastic search index, I want documents with both i18n properties and non i18n properties (I want to avoid creating multiple index, one for each language).
The mapping of the document should look like :
'entry': {
'properties': {
'name' : {'type': 'string'}, /* unlocalized properties */
'category': { /* localized properties */
"properties" : {
"lang_fr" : {
"type" : "string"
},
"lang_de" : {
"type" : "string"
}
}
},}}
having that, I have two requirements:
1) Matching: when doing a search, exclude from search the localized fields that are not concerned by the user language (let's say the user's language is 'fr', I want to exclude 'de' fields from search. How to do this without specifying the entire list of fields I want to search on. To start simple, I tried this but it doesn't work :
{
"query": {
"match": {
"*.lang_fr": "full_text"
}
}
}
However, "categories.lang_fr": "full_text" works well. But I don't want to maintain the list of fields in the query. I want a general rule like you can do in SolR.
2) Filtering: when I retrieve my results, I want to filter out all localized fields that doesn't corresponds to my user language. In other words, using the source filter, I'd like to have all unlocalized fields, exclude all fields starting with "lang" , but include all fields being 'lang_fr'. I tried the following but it doesn't work:
{
"_source": {
"include": [ "*", "*.lang_fr" ],
"exclude": [ "*.lang_*" ],
}
...}
the wildcard operator doesn't seems to work. I partially have what I want if I specify "categories.lang_de", but again, I don't want to maintain the list of fields, I want a generic rule. The include/exclude operation doesn't work as I would like. The only thing that actually works is a query where I specify all languages to exclude for all fields specifically, such as :
{
"_source": {
"exclude": [ "categories.lang_de", "categories.lang_en", "categories.lang_it",
"another_field.lang_de", "catanother_fieldgories.lang_en", "another_field.lang_it"],
}
...}
for 'fr' search.
I'm quite surprised I couldn't find anything on google. I see it as a very standard case of i18n applied to elasticsearch. Maybe I'm modelizing i18n the wrong way in ES ?
thank you in advance !
You can achieve the first one using a query_string query which takes advantage of the powerful Lucene expression language and allows to specify wildcard in field names:
{
"query": {
"query_string": {
"query": "\\*.lang_fr:full_text"
}
}
}
or you can also specify the field name in the fields parameter, like this
{
"query": {
"query_string": {
"query": "full_text"
"fields": ["*.lang_fr"]
}
}
}
As for your second one, source filtering is indeed the way to go but I suggest simply excluding all languages but the one you're searching for. For instance, if the search is in French, you'd simply exclude all other languages without necessarily having to enumerate all the fields, just all the languages that you don't want (which would be much less). That would allow you to add localized fields as you go without having to change the query.
{
"_source": {
"exclude": [ "*.lang_de", "*.lang_it" ],
}
...}
I am investigate possibility to switch to ElasticSearch from SphinxSearch.
What is good about SphinxSearch - full-text search just work out of the bot on pretty good level. Make it work on ElasticSearch appeared not as easy as I expected.
In my project I have search box with typeahead, means I stype Clint E and see dropdown with results including Clint Eastwood on the first place. Type robert down and see Robert Downey Jr. on the first place. All this I achieved with SphinxSearch out of the box just providing it my DB credentials and SQL query to pull the necessary fields.
On the other hand, with ElasticSearch I can't get satisfying results even after a day of reading about Fuzzy Like This Query, matching, partial matching and other. A lot of information but it does not make task easier. I feel like I need to be PhD in search just to make it work at simplest level.
So far I ended up with such configuration
{
"settings": {
"analysis": {
"analyzer": {
"stem": {
"tokenizer": "standard",
"filter": [
"standard",
"lowercase",
"stop",
"porter_stem"
]
}
}
}
},
"mappings": {
"movies": {
"dynamic": true,
"properties": {
"title": {
"type": "string",
"analyzer": "stem"
}
}
}
}
}
The Query look like this:
{
"query": {
"query_string": {
"query": "clint eastw"
"default_field": "title"
}
}
}
But quality of search in this case is not satisfying at all - back to my example, it can not find Clint Eastwood profile until I type his name completely.
Then I tried to use
{
"query": {
"fuzzy_like_this": {
"fields": [
"title"
],
"like_text": "clint eastw",
"max_query_terms": 25,
"fuzziness": 0.5
}
}
}
It helps but not much, now I can find what I need with shorter request clint eastwo and after some manipulations with parameters with clint eastw but still not encouraging.
So I wonder, is there a simple recipe how to cook full-text search with ElasticSearch and get decent quality of results. I spend a day reading but didn't find the solution.
Couple of images to demonstrate what I am talking about:
Elastic, name almost complete but no expected result, note that there is no better match as well.
One letter after, elastic found it!
At the same moment Sphinx shining :)
Elasticsearch ships with auto completion suggester.
You need not put this into query functioanility , the way it works is on token level and not on partial token level.
Go for completion suggester , it also have support for fuzzy logic.