CouchDB find by search term - couchdb

I imported a CSV file into CouchDB with the correct structure.
Now I would like to search for records matching one search term in ANY of the fields. Here is an example record:
{
"_id": "QW141401",
"_rev": "1-7aae4ce6f6c148d82d7d6e1e3ba28542",
"PART": {
"ONE": "QUA01135",
"TWO": "W/364",
"THREE": "QUA04384",
"FOUR": "QUA12167"
},
"FOO": {
"BAR": "C40"
},
"DÉSIGNATION": "THE QUICK BROWN FOX"
}
Now, given a search term such as QUA04384, this record should come up. The same goes for C40 and, if possible, for a partial match like FOX.
The keys under PART and FOO can change from record to record.

This could be a similar question. You are probably looking for a full-text search mechanism.
You can try couchdb-lucene or Elasticsearch.

A stupid way to do this is to build an additional field (call it 'fulltext') in each Lucene document, containing the concatenation of all other field values. (Remember to build this completely dynamically so that every single field has its contents in this additional field no matter what the original field name was.) Then you can perform your searches on this 'fulltext' field.
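The concatenation idea above can be sketched in a few lines of Python. This is only an illustration under assumptions: the helper names (collect_values, build_fulltext, matches) are made up for this example, and the substring test stands in for what a real full-text index would do.

```python
def collect_values(node):
    """Yield every leaf value under node as a string, whatever the key names are."""
    if isinstance(node, dict):
        for value in node.values():
            yield from collect_values(value)
    elif isinstance(node, list):
        for item in node:
            yield from collect_values(item)
    elif node is not None:
        yield str(node)


def build_fulltext(doc):
    """Concatenate all field values into one searchable string."""
    return " ".join(collect_values(doc))


def matches(doc, term):
    """Case-insensitive substring match against the concatenated values."""
    return term.lower() in build_fulltext(doc).lower()


# The example record from the question.
record = {
    "_id": "QW141401",
    "_rev": "1-7aae4ce6f6c148d82d7d6e1e3ba28542",
    "PART": {"ONE": "QUA01135", "TWO": "W/364", "THREE": "QUA04384", "FOUR": "QUA12167"},
    "FOO": {"BAR": "C40"},
    "DÉSIGNATION": "THE QUICK BROWN FOX",
}
```

Because the walk is driven by the values rather than the key names, records whose keys under PART and FOO differ are handled the same way.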

Related

solr sort by score not working properly

I am using Solr v6.2.1. We are not getting accurate results using "sort score desc".
Let's assume we have a list of documents in our index as below:
[{
"id": "1",
"content": ["java developer"]
},
{
"id": "2",
"content": ["Java is object oriented.Java robust language.Core java "]
},
{
"id": "3",
"content": ["java is platform independent. Java language."]
}]
Content is defined as a multivalued field in the schema:
<field name="content" type="text_general" multiValued="true" indexed="true" stored="true"/>
When I search for java using the query below
curl "http://localhost:8983/solr/test/select?fl=score,id&q=(java)&wt=json&sort=score%20desc"
I expect the document with id 2 to come first, as it contains the most matches for java. But Solr is giving inconsistent results.
Please suggest why I am not getting the desired results.
You need to set defType to edismax in your query. Here is the query again:
http://localhost:8983/solr/test/select?fl=score,id&q=(java)&wt=json&sort=score desc&defType=edismax
Once you pass edismax as defType, sorting on score starts working as expected.
First, as suggested by Rahul, you should specify df, the default query field, so that the query runs explicitly against that field.
Secondly, your assumption that the doc with the most occurrences of a particular term should show up as the first result is not correct. What you are referring to is called term frequency, or tf for short. The ranking function used by Solr to calculate the relevance score uses tf along with idf, the inverse document frequency. You can read more about it under Okapi BM25.
Roughly, the score translates into tf * idf, where idf is a logarithm that shrinks as more documents contain the term.
This will ensure that the most relevant documents for a particular query are retrieved. Intuitively, this means that, since 'Java' is present in other documents as well, the terms that differentiate doc 2 are probably 'object oriented', 'robust'.
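To make the tf and idf interplay concrete, here is a minimal BM25 sketch in Python applied to the three documents from the question. It is an approximation under assumptions: the tokenizer is a plain regex rather than text_general, k1=1.2 and b=0.75 are the commonly used defaults, and Solr's real scores will differ in detail (for example through how field lengths are stored).

```python
import math
import re


def tokenize(text):
    """Lowercase and split on non-word characters (a stand-in for the real analyzer)."""
    return re.findall(r"\w+", text.lower())


def bm25_scores(docs, term, k1=1.2, b=0.75):
    """Return a BM25 score per document id for a single query term."""
    tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokens)
    avg_len = sum(len(t) for t in tokens.values()) / n_docs
    doc_freq = sum(1 for t in tokens.values() if term in t)
    # idf is low when the term occurs in most documents.
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    scores = {}
    for doc_id, toks in tokens.items():
        tf = toks.count(term)
        # Longer fields are penalised relative to the average length.
        length_norm = k1 * (1 - b + b * len(toks) / avg_len)
        scores[doc_id] = idf * tf * (k1 + 1) / (tf + length_norm)
    return scores


DOCS = {
    "1": "java developer",
    "2": "Java is object oriented.Java robust language.Core java ",
    "3": "java is platform independent. Java language.",
}

scores = bm25_scores(DOCS, "java")
```

In this sketch tf saturates and longer fields are normalized down, so the three scores end up close together even though doc 2 contains the term three times. Small differences like these are one plausible reason the ordering looks inconsistent.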

Unable to full text search in Solr

I have some data in Solr and want to find the documents whose name is Chinmay Sahu. As shown below, I get 3 results instead of 1, because the name is matched partially.
I want an exact search, so that only documents whose name is Chinmay Sahu are returned.
Output:
"docs": [
{
"id": "741fde46a654879949473b2cdc577913",
"content_id": "1277",
"name": "Chinmay Sahu",
"_version_": 1596995745829879800
},
{
"id": "4e98d680efaab3afe051f3ddc00dc5f2",
"content_id": "1825",
"name": "Chinmay Panda",
"_version_": 1596995745829879800
},
{
"id": "741fde46a654879949473b2cdc577913",
"content_id": "1259",
"name": "Sasmita Sahu",
"_version_": 1596995745829879800
}
]
Query:
name:Chinmay Sahu
Expected :
"docs": [
{
"id": "741fde46a654879949473b2cdc577913",
"content_id": "1277",
"name": "Chinmay Sahu",
"_version_": 1596995745829879800
}
]
Please help
Try doing this
name:"Chinmay Sahu"
You need to do a phrase query to match the exact name.
I am guessing that in your case the name field uses the standard tokenizer, which splits the value into tokens on whitespace. So at index time "Chinmay Sahu" becomes the tokens chinmay and sahu, "Chinmay Panda" becomes chinmay and panda, and so on.
When you search using
name:Chinmay Sahu
only Chinmay is bound to the name field. If no field name is specified before a token, Solr automatically searches for it in the default field (however, the default field was removed in Solr 7.3, so it depends on which version of Solr you are using). With the default OR operator, Solr therefore runs the query like this:
name:chinmay OR default_field:sahu
Two of the docs have chinmay as a token in the name field, and the third presumably matches sahu through the default field, so the query matches all 3 docs.
I don't know what your default field is. Can you post your Solr schema? That way we can explain exactly why you are seeing those 3 docs.
Since root545 already explained that field:foo bar will search for foo in field and bar in the default search field, I'll suggest that it seems like you don't want to concern yourself with the exact Lucene syntax for searching. The edismax query parser is well suited for separating the typed search string from what fields are being searched and whether you want all tokens to match.
The query in that case would be just Chinmay Sahu, while you'd set q.op=AND (all terms must match), defType=edismax (use the edismax query parser) and qf=name (search the name field):
q=Chinmay Sahu&q.op=AND&defType=edismax&qf=name
You can also tune the different phrase parameters to make sure that names with the tokens in the exact same sequence will be boosted higher than those that have them in the opposite sequence (i.e. Sahu Chinmay).
If this is a programmatic search where no user is actually typing in the suggestion, using a phrase search as suggested is the way to go (name:"Chinmay Sahu").
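Both variants can be built programmatically. The following Python sketch only assembles the query strings. The function names are made up for this example, and the escaping in phrase_query covers just backslashes and double quotes.

```python
from urllib.parse import urlencode


def edismax_query(user_input, field="name"):
    """Query string for typed user input: every term must match in the given field."""
    params = {"q": user_input, "q.op": "AND", "defType": "edismax", "qf": field}
    return urlencode(params)


def phrase_query(value, field="name"):
    """Query string for an exact phrase match, e.g. name:"Chinmay Sahu"."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return urlencode({"q": f'{field}:"{escaped}"'})


edismax_qs = edismax_query("Chinmay Sahu")
phrase_qs = phrase_query("Chinmay Sahu")
```

The resulting strings can be appended to the select URL of whichever core is being queried.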
I would suggest using a query like
name:(Chinmay Sahu)
and making sure the default operator is AND, either in the settings or in the query string (q.op=AND).
With that approach you can handle user input much more easily, since you don't need to parse it much.

poor search performance for certain wildcard queries

I am having performance issues when using wildcard searching for certain letter combinations, and I am not sure what else I need to do to possibly improve it. All of my documents follow an envelope pattern that looks something like the following.
<pdbe:person-envelope>
<person xmlns="http://schemas.abbvienet.com/people-db/model">
<account>
<domain/>
<username/>
</account>
<upi/>
<title/>
<firstName>
<preferred/>
<given/>
</firstName>
<middleName/>
<lastName>
<preferred/>
<given/>
</lastName>
</person>
<pdbe:raw/>
</pdbe:person-envelope>
I have a field defined called name, which includes the firstName and lastName paths:
{
"field-name": "name",
"field-path": [
{
"path": "/pdbe:person-envelope/pdbm:person/pdbm:firstName",
"weight": 1
},
{
"path": "/pdbe:person-envelope/pdbm:person/pdbm:lastName",
"weight": 1
}
],
"trailing-wildcard-searches": true,
"trailing-wildcard-word-positions": true,
"three-character-searches": true
}
When I do some queries using search:search, some come back fast, whereas others come back slow. This is with the filtered queries.
search:search("name:ha*",
<options xmlns="http://marklogic.com/appservices/search">
<constraint name="name">
<word>
<field name="name"/>
</word>
</constraint>
<return-plan>true</return-plan>
</options>
)
I can see from the query plan that it is going to filter over all 136547 fragments in the db. But this query works fast.
<search:query-resolution-time>PT0.013205S</search:query-resolution-time>
<search:snippet-resolution-time>PT0.008933S</search:snippet-resolution-time>
<search:total-time>PT0.036542S</search:total-time>
However a search for name:tj* takes a long time, and also filters over all of the 136547 fragments.
<search:query-resolution-time>PT6.168373S</search:query-resolution-time>
<search:snippet-resolution-time>PT0.004935S</search:snippet-resolution-time>
<search:total-time>PT12.327275S</search:total-time>
I have the same indexes on both. Are there any other indexes I should be enabling when I am specifically just doing a search via the field constraint? I have these other indexes enabled on the database itself, in general.
"collection-lexicon": true,
"triple-index": true,
"word-searches": true,
"word-positions": true
I tried doing an unfiltered query, but that did not help, as I got a bunch of matches on the whole document and not on the fields I wanted. I even tried to set the fragment root to just my person element, but that did not seem to help things.
"fragment-root": [
{
"namespace-uri": "http://schemas.abbvienet.com/people-db/model",
"localname": "person"
}
]
Thanks for any ideas.
Fragment roots are helpful if you want to use a searchable expression for that person element, and mostly if it occurs multiple times in one document. It won't make your current search constrain on that element.
In your case you enabled a number of relevant options, but the wildcard option only works for 4 characters or more. If you want to search on wildcards with fewer characters, you need to enable the three, two and one character search options.
The search phrases mentioned above both contained two characters with a wildcard. Since you only enabled the three character option, it had to rely on filtering. The fact that some run fast and some run slow is probably because of caching. If you repeat the same query, MarkLogic will return the result from cache.
For performance testing you would either have to restart MarkLogic regularly to flush caches, or search on (semi) random strings so that MarkLogic cannot answer from cache. Or maybe both.
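As a rough illustration of the "(semi) random strings" idea, this Python sketch generates wildcard search terms with random prefixes. The function name and parameters are hypothetical, and the terms would still have to be passed to search:search by whatever test harness is in use.

```python
import random
import string


def random_wildcard_terms(count, prefix_len=2, constraint="name", seed=None):
    """Build terms like 'name:tj*' with random lowercase prefixes."""
    rng = random.Random(seed)  # a fixed seed makes a test run repeatable
    terms = []
    for _ in range(count):
        prefix = "".join(rng.choice(string.ascii_lowercase) for _ in range(prefix_len))
        terms.append(f"{constraint}:{prefix}*")
    return terms


terms = random_wildcard_terms(5, seed=1)
```

Using a different seed per run avoids hitting the same cached queries, while keeping each individual run reproducible.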
HTH!

How can I get elastic search to return results inside angle brackets?

I'm new to elastic search. I'm trying to fix our search so that it will allow users to search on content within html tags. Currently, we're using a whitespace tokenizer because we need it to return results on hyphenated names. Consequently, aname123-suffix project is indexed as ["aname123-suffix", "project"] and a user search for "aname123-*" returns the correct results.
My problem arises because we also want to be able to search on content within html tags. So, for example for a project called <aname123>-suffix project, we'd like to be able to enter the search term <aname123>-* and get back the correct results.
The index has the correct tokens for a whitespace tokenizer, namely ["<aname123>-suffix", "project"] but when my search string is "\<aname123\>\-suffix" or "\\<aname123\\>\\-suffix" elastic search returns no results.
I think the solution lies either in
modifying the search string so that elastic search returns <aname123>-suffix when I ask for it; or
being able to index the content within the tag separately from the whitespace tokens, i.e. ["<aname123>-suffix", "project", "aname123", "suffix"]
So far I've been approaching it by changing the indexing, but I have not yet succeeded. A standard tokenizer will allow search results for content within tags, but it fails to return search results for aname123-*. Currently my analyzer settings look like this:
{ "analysis":
{ "analyzer":
{ "my_whitespace_analyzer":
{ "type": "custom",
"tokenizer": "whitespace",
"filter": ["standard", "lowercase", "stop"]
},
"my_tag_analyzer":
{ "type": "custom",
"tokenizer": "standard",
"filter": ["standard", "lowercase", "stop"]
}
}
}
}
I can create a custom char filter that strips out the < and the >, so my index contains aname123; but for some reason elastic search still does not return correct results when searching on <aname123>*. However, when I use instead a standard analyzer, the index contains aname123 and it returns the expected results for <aname123>* ... What is so special about angle brackets in elastic search?
You may want to take a look at the html_strip character filter:
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-htmlstrip-charfilter.html
An example from one of the elasticsearch developers is here:
https://gist.github.com/clintongormley/780895
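For reference, the combined token list the question asks for (whitespace tokens plus the bare words inside them) can be simulated in plain Python. This is not Elasticsearch analysis, only a sketch of the target output. In Elasticsearch itself, one way to get a similar effect would presumably be to index the same text under two analyzers. All function names here are made up for the example.

```python
import re


def whitespace_tokens(text):
    """Mimic a whitespace tokenizer followed by a lowercase filter."""
    return text.lower().split()


def inner_tokens(text):
    """Extract bare alphanumeric words, dropping <, > and - entirely."""
    return re.findall(r"[a-z0-9]+", text.lower())


def combined_tokens(text):
    """Whitespace tokens first, then any inner words not already present."""
    seen = []
    for token in whitespace_tokens(text) + inner_tokens(text):
        if token not in seen:
            seen.append(token)
    return seen


result = combined_tokens("<aname123>-suffix project")
```

With both kinds of token available, a search for the full hyphenated form and a search for just aname123 would each have something to match against.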

ElasticSearch: Match Field Within Query

Searching for a string within an indexed document is simple with match. What about the opposite? I need to look for matches of a string field within a query. For example, searching for:
correct horse battery staple
Should match a document with a field with a value of horse battery, and only that. What is the query for that with ElasticSearch?
Edit: Here's a thread about someone wanting to do the same thing, but never received any replies: https://groups.google.com/d/topic/elasticsearch/IYDu5-0YD6E/discussion
An inverted index is not good at telling exactly which set of terms a document contains. A solution found in the definitive guide was to index the term count and to query over the different possible combinations, which is very tedious.
Here is a related question (it's about filters, but the problem is the same) with more developed answers.
The solution I came to was to use the percolator API. I indexed the field value as a search query, and then matched it against a document that contained the query string. This method is working quite well. Here is how I'm creating the percolator:
curl -XPUT localhost:9200/myindex/.percolator/model-2332 -d '
{
"query": {
"match_phrase": {
"name": "horse battery"
}
}
}'
And how I'm querying for it:
curl -XGET localhost:9200/myindex/model/_percolate -d '
{
"doc": {
"name": "correct horse battery staple"
}
}'
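The reversal behind this approach (store the field values as queries, then run the search text past them as a document) can be illustrated with a small in-memory Python sketch. It does not use Elasticsearch at all. The matching is a naive whitespace-token phrase match, and model-9999 is a made-up entry added for contrast.

```python
def tokens(text):
    """Lowercase and split on whitespace."""
    return text.lower().split()


def contains_phrase(doc_tokens, phrase_tokens):
    """True if phrase_tokens appear in doc_tokens as a contiguous run."""
    n = len(phrase_tokens)
    if n == 0:
        return False
    return any(doc_tokens[i:i + n] == phrase_tokens
               for i in range(len(doc_tokens) - n + 1))


def percolate(stored_queries, text):
    """Return the ids of stored phrase queries that match the given text."""
    doc_tokens = tokens(text)
    return [query_id for query_id, phrase in stored_queries.items()
            if contains_phrase(doc_tokens, tokens(phrase))]


stored = {
    "model-2332": "horse battery",   # from the percolator example above
    "model-9999": "battery horse",   # hypothetical, should not match
}

hits = percolate(stored, "correct horse battery staple")
```

Only the stored phrase that occurs in order inside the search text is returned, which is the behaviour the question asks for.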
