How can I get elastic search to return results inside angle brackets? - search

I'm new to elastic search. I'm trying to fix our search so that it will allow users to search on content within html tags. Currently, we're using a whitespace tokenizer because we need it to return results on hyphenated names. Consequently, aname123-suffix project is indexed as ["aname123-suffix", "project"] and a user search for "aname123-*" returns the correct results.
My problem arises because we also want to be able to search on content within html tags. So, for example for a project called <aname123>-suffix project, we'd like to be able to enter the search term <aname123>-* and get back the correct results.
The index has the correct tokens for a whitespace tokenizer, namely ["<aname123>-suffix", "project"] but when my search string is "\<aname123\>\-suffix" or "\\<aname123\\>\\-suffix" elastic search returns no results.
I think the solution lies either in
modifying the search string so that elastic search returns <aname123>-suffix when I ask for it; or
being able to index the content within the tag separately from the whitespace tokens, i.e. ["<aname123>-suffix", "project", "aname123", "suffix"]
So far I've been approaching it by changing the indexing, but I have not yet succeeded. A standard tokenizer will allow search results for content within tags, but it fails to return search results for aname123-*. Currently my analyzer settings look like this:
{ "analysis":
{ "analyzer":
{ "my_whitespace_analyzer" :
{"type": "custom"
{"tokenizer": "whitespace},
{"filter": ["standard", "lowercase", "stop"]}
}
},
{ "my_tag_analyzer":
{"type": "custom"
{"tokenizer": "standard"},
{"filter": ["standard", "lowercase", "stop"]}
}
}
}
}
I can create a custom char filter that strips out the < and the >, so my index contains aname123; but for some reason elastic search still does not return correct results when searching on <aname123>*. However, when I use instead a standard analyzer, the index contains aname123 and it returns the expected results for <aname123>* ... What is so special about angle brackets in elastic search?

You may want to take a look at the html_strip character filter:
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-htmlstrip-charfilter.html
An example from one of the elasticsearch developers is here:
https://gist.github.com/clintongormley/780895

Related

AZURE SEARCH, ismatch not filtering

I'm using Azure Search for perform some customs search in a database.
I got this one field that have this kind of structure:
"STUFF": "05-05-16-00|"
but I'm having trouble by creating the filter, because its possible that I'll not have all the numbers that builds this structure. It all depends that what the final user will type. So I need a wildcard to fill the blanks with the missing numbers, like this
"05-05-??-??" -> the pipe is important, because this field can have more than 1 code inside.
Now I need to catch all the possible elements that STARTS WITH 05-05, like, for example: 05-05-11-01
I thought I suposed to use the search.ismatch() function, but it doesnt work.
here some code:
search.ismatch('05-05-??-??','STUFF');
And the results were:
"STUFF": "02-02-16-00|",
"STUFF": "02-02-14-00|",
this is driving me crazy, because I dont know why this results came back.
Maybe is important to know that Im performing a POST request to the Azure Search API with this code in 'filter'
Maybe i should to escape this especial characters like - and ? like this
search.ismatch('05\\-05\\-\\?\\?\\-\\?\\?','STUFF')
But the results were the same.
Can somebody please help me ?
EDIT 1
following this Article I change some things and make the following search:
search.ismatch('\"05-00*\"','STUFF','simple', 'all')
And I starting the get some results, but now this is my results:
"STUFF": "06-05-02-00|", //WRONG
"STUFF": "05-02-05-01|", //RIGHT
"STUFF": "05-02-02-07|", //RIGHT
For some reason, it's returing the right structure but not the in the front of the text.
EDIT 2
I made some changes and change all the "-" for the keyword "OU" and I'm trying to follow this question to make sore like a "contains", but i perfoming a POST request with the following parameters
{
"search": "*",
"filter": "search.ismatch('/.*08010000OU/.*','STUFF', 'full', 'all')",
"skip": "0",
"count": true
}
Im trying to use a wildcard in the begining of the query search because I still missing some information.
I believe you won't be able to solve this using the StandardAnalyzer. Try switching to WhitespaceAnalyzer for this particular field and it probably will work with "05-05*"

Unable to full text search in Solr

I have some data in solr. I want to search which name is Chinmay Sahu See below I have 3 results in output. But I got 3 instead of 1. Because Content name searched partially.
I want to full search those name having Chinmay Sahu only that contents will come.
Output:
"docs": [
{
"id": "741fde46a654879949473b2cdc577913",
"content_id": "1277",
"name": "Chinmay Sahu",
"_version_": 1596995745829879800
},
{
"id": "4e98d680efaab3afe051f3ddc00dc5f2",
"content_id": "1825",
"name": "Chinmay Panda",
"_version_": 1596995745829879800
}
{
"id": "741fde46a654879949473b2cdc577913",
"content_id": "1259",
"name": "Sasmita Sahu",
"_version_": 1596995745829879800
}
]
Query:
name:Chinmay Sahu
Expected :
"docs": [
{
"id": "741fde46a654879949473b2cdc577913",
"content_id": "1277",
"name": "Chinmay Sahu",
"_version_": 1596995745829879800
},
]
Please help
Try doing this
name:"Chinmay Sahu"
You need to do a phrase query to match the exact name.
I am guessing in your case the name field is using Standard tokenizer which will split tokens if whitespace is there. So while indexing in all the 3 docs there will be a token called "chinmay".
While you search using
name:Chinmay Sahu
Solr will search it like this since if there is no fieldName specified before a token solr automatically searches it in default_field.(however default field is removed from solr 7.3, So it depends on what version of solr are you using.
)
Name:chinmay AND default_field:sahu
So since all the three docs are having chinmay as a token in the index,the query will match all 3 docs.
Now i dont know what your default field is? can you post your solr schema? That way we can explain why you are seeing those 3 docs.
Since root545 already explained that field:foo bar will search for foo in field and bar in the default search field, I'll suggest that it seems like you don't want to concern yourself with the exact Lucene syntax for searching. The edismax query parser is well suited for separating the typed search string from what fields are being searched and whether you want all tokens to match.
The query in that case would be just Chinmay Sahu, while you'd set q.op=AND (all terms must match), defType=edismax (use the edismax query parser) and qf=name (search the name field):
q=Chinmay Sahu&q.op=AND&defType=edismax&qf=name
You can also tune the different phrase parameters to make sure that names with the tokens in the exact same sequence will be boosted higher than those that have them in the opposite sequence (i.e. Sahu Chinmay).
If this is a programmatic search where no user is actually typing in the suggestion, using a phrase search as suggested is the way to go (name:"Chinmay Sahu").
I would suggest using query like
name:(Chinmay Sahu)
And make sure default operator is AND either in settings or query string like q.op=AND
With that approach you can use user input much easier since you don't need to parse it too much.

CouchDB find by search term

I import a CSV file to CouchDB with the correct structure.
Now I would like to search for records matching one search term in ANY of the fields. Here is an example record :
{
"_id": "QW141401",
"_rev": "1-7aae4ce6f6c148d82d7d6e1e3ba28542",
"PART": {
"ONE": "QUA01135",
"TWO": "W/364",
"THREE": "QUA04384",
"FOUR": "QUA12167"
},
"FOO": {
"BAR": "C40"
},
"DÉSIGNATION": "THE QUICK BROWN FOX"
}
Now given a search term, for example QUA04384 this record should come up. Aloso for C40. And, if possible, also for a partial match like FOX
The keys under PART and FOO can change from record to record...
This could be a similar question. Probably you are looking for a Full Text Search mechanism.
Yo can try with couchdb-lucene or elasticseach
A stupid way to do this is to build an additional field (call it 'fulltext') in each Lucene document, containing the concatenation of all other field values. (Remember to build this completely dynamically so that every single field has its contents in this additional field no matter what the original field name was.) Then you can perform your searches on this 'fulltext' field.

Case insensitive search in arrays for CosmosDB / DocumentDB

Lets say I have these documents in my CosmosDB. (DocumentDB API, .NET SDK)
{
// partition key of the collection
"userId" : "0000-0000-0000-0000",
"emailAddresses": [
"someaddress#somedomain.com", "Another.Address#someotherdomain.com"
]
// some more fields
}
I now need to find out if I have a document for a given email address. However, I need the query to be case insensitive.
There are ways to search case insensitive on a field (they do a full scan however):
How to do a Case Insensitive search on Azure DocumentDb?
select * from json j where LOWER(j.name) = 'timbaktu'
e => e.Id.ToLower() == key.ToLower()
These do not work for arrays. Is there an alternative way? A user defined function looks like it could help.
I am mainly looking for a temporary low-effort solution to support the scenario (I have multiple collections like this). I probably need to switch to a data structure like this at some point:
{
"userId" : "0000-0000-0000-0000",
// Option A
"emailAddresses": [
{
"displayName": "someaddress#somedomain.com",
"normalizedName" : "someaddress#somedomain.com"
},
{
"displayName": "Another.Address#someotherdomain.com",
"normalizedName" : "another.address#someotherdomain.com"
}
],
// Option B
"emailAddressesNormalized": {
"someaddress#somedomain.com", "another.address#someotherdomain.com"
}
}
Unfortunately, my production database already contains documents that would need to be updated to support the new structure.
My production collections contain only 100s of these items, so I am even tempted to just get all items and do the comparison in memory on the client.
If performance matters then you should consider one of the normalization solution you have proposed yourself in question. Then you could index the normalized field and get results without doing a full scan.
If for some reason you really don't want to retouch the documents then perhaps the feature you are missing is simple join?
Example query which will do case-insensitive search from within array with a scan:
SELECT c FROM c
join email in c.emailAddresses
where lower(email) = lower('ANOTHER.ADDRESS#someotherdomain.com')
You can find more examples about joining from Getting started with SQL commands in Cosmos DB.
Note that where-criteria in given example cannot use an index, so consider using it only along another more selective (indexed) criteria.

poor search performance for certain wildcard queries

I am having performance issues when using wildcard searching for certain letter combinations, and I am not sure what else I need to to to possibly improve it. All of my documents are following an envelope pattern that look something like the following.
<pdbe:person-envelope>
<person xmlns="http://schemas.abbvienet.com/people-db/model">
<account>
<domain/>
<username/>
</account>
<upi/>
<title/>
<firstName>
<preferred/>
<given/>
</firstName>
<middleName/>
<lastName>
<preferred/>
<given/>
</lastName>
</person>
<pdbe:raw/>
</pdbe:person-envelope>
I have a field defined called name, which includes the firstName and lastName paths:
{
"field-name": "name",
"field-path": [
{
"path": "/pdbe:person-envelope/pdbm:person/pdbm:firstName",
"weight": 1
},
{
"path": "/pdbe:person-envelope/pdbm:person/pdbm:lastName",
"weight": 1
}
],
"trailing-wildcard-searches": true,
"trailing-wildcard-word-positions": true,
"three-character-searches": true
}
When I do some queries using search:search, some come back fast, whereas others come back slow. This is with the filtered queries.
search:search("name:ha*",
<options xmlns="http://marklogic.com/appservices/search">
<constraint name="name">
<word>
<field name="name"/>
</word>
</constraint>
<return-plan>true</return-plan>
</options>
)
I can see from the query plan that it is going to filter over all 136547 fragments in the db. But this query works fast.
<search:query-resolution-time>PT0.013205S</search:query-resolution-time>
<search:snippet-resolution-time>PT0.008933S</search:snippet-resolution-time>
<search:total-time>PT0.036542S</search:total-time>
However a search for name:tj* takes a long time, and also filters over all of the 136547 fragments.
<search:query-resolution-time>PT6.168373S</search:query-resolution-time>
<search:snippet-resolution-time>PT0.004935S</search:snippet-resolution-time>
<search:total-time>PT12.327275S</search:total-time>
I have the same indexes on both. Are there any other indexes I should be enabling when I am specifically just doing a search via the field constraint? I have these other indexes enabled on the database itself, in general.
"collection-lexicon": true,
"triple-index": true,
"word-searches": true,
"word-positions": true
I tried doing an unfiltered query, but that did not help as I got a bunch of matches on the whole document, and not the the fields I wanted. I even tried to set the root-fragment to just my person element, but that did not seem to help things.
"fragment-root": [
{
"namespace-uri": "http://schemas.abbvienet.com/people-db/model",
"localname": "person"
}
]
Thanks for any ideas.
Fragment roots are helpful if you want to use a searchable expression for that person element, and mostly if it occurs multiple times in one document. It won't make your current search constrain on that element.
In your case you enabled a number of relevant options, but the wildcard option only works for 4 characters of more. If you want to search on wildcards with less characters, you need to enable the three, two and one character search options.
The search phrases mentioned above both contained two characters with a wildcard. Since you only enabled the three character option, it had to rely on filtering. The fact some run fast, some slow is probably because of caching. If you repeat the same query, MarkLogic will return the result from cache.
For performance testing you would either have to restart MarkLogic regularly to flush caches, or search on (semi) random strings to avoid MarkLogic being able to cache. Or maybe both..
HTH!

Resources