Is there any way to search through CouchDB documents for substring - couchdb

CouchDB lets you search by startkey, by exact key-value pair, and so on.
But is there any way to search for a substring in a specified field?
The problem is this: our news database consists of about 40,000 news documents. Say they have title, content and url fields. We want to find the news documents that have "restaurant" in their title. Is there any way to do it?
The View Collation wiki page says nothing about this :( And it seems strange to me that there is no tool for this problem and that all I can do is parse the JSON results with Python, PHP or something else. In MySQL it's simply the LOCATE() function.

Use couchdb-lucene.

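Be careful here. Lucene is not always the best answer.
If you're only searching one limited field, and only for a whole word like "restaurant", then Lucene, which is really meant to tokenize large texts and documents, can be overkill. You can get the same effect by splitting the title into words and emitting each word as a key.
function(doc) {
  // Skip documents that have no string title.
  if (typeof doc.title !== "string") return;
  // Lowercase so that "Restaurant" and "restaurant" share a key.
  var words = doc.title.toLowerCase().split(/\s+/);
  for (var i = 0; i < words.length; i++) {
    // Emit null instead of doc; query with include_docs=true to get the document.
    if (words[i]) emit(words[i], null);
  }
}
Also, neither Lucene nor CouchDB supports substring search where the string is not at the beginning of a word.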
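One possible workaround for matches in the middle of a word, offered as an assumption rather than something the answer above prescribes, is to emit every suffix of each title word. A substring search then becomes a prefix range on the view (startkey=term, endkey=term + "\ufff0"). The names suffixKeys, buildIndex and searchSubstring are hypothetical, and emit is stubbed locally so the sketch runs outside CouchDB. Note that this multiplies the size of the index.

```javascript
// Hypothetical helper: every suffix of every word in a title, lowercased.
// "Restaurant" -> "restaurant", "estaurant", "staurant", ...
function suffixKeys(title) {
  const keys = [];
  const words = String(title || "").toLowerCase().split(/\s+/).filter(Boolean);
  for (const word of words) {
    for (let i = 0; i < word.length; i++) {
      keys.push(word.slice(i));
    }
  }
  return keys;
}

// In a CouchDB design document the map function would call the built-in
// emit(); here a local stub collects rows so the sketch runs standalone.
function buildIndex(docs) {
  const rows = [];
  const emit = (key, value) => rows.push({ key, id: value });
  for (const doc of docs) {
    for (const key of suffixKeys(doc.title)) emit(key, doc._id);
  }
  return rows;
}

// Emulates a view query with startkey=term and endkey=term + "\ufff0".
function searchSubstring(rows, term) {
  const start = term.toLowerCase();
  const end = start + "\ufff0";
  const ids = rows.filter((r) => r.key >= start && r.key <= end).map((r) => r.id);
  return [...new Set(ids)].sort();
}

const docs = [
  { _id: "n1", title: "New restaurant opens downtown" },
  { _id: "n2", title: "City budget approved" },
  { _id: "n3", title: "Best Restaurants of the year" },
];
const rows = buildIndex(docs);
console.log(searchSubstring(rows, "staur")); // ids of titles containing "staur"
```

For 40,000 short titles this is probably manageable, but for long text fields the number of emitted keys grows quickly, and that is where a real full-text engine starts to pay off.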

Related

Full text search with different types besides String in MongoDB

I want to use full text search in MongoDB, and I know the solution of using text indexes (https://docs.mongodb.com/manual/core/index-text/). But, this solution is meant for searching on String type fields only.
How can I perform full text search on other types as well? Suppose I have a collection with documents in which I have fields from a variety of types like String, Number etc.
What can I do?
P.S: I use MongoDB native driver for Nodejs.
Wildcard Indexing
There can be scenarios where you want all text content in your documents to be searchable. A wildcard text index covers every field that contains string data, so maybe this will help you.
db.collection.createIndex({"$**":"text"})
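A text index only covers string values (and arrays of strings), so a value stored as a Number is still not matched by the wildcard index. One workaround, which is my assumption and not part of the answer above, is to keep a derived string field with all scalar values flattened into it and to text-index that field. The names buildSearchText and searchText are hypothetical.

```javascript
// Hypothetical helper: flatten the scalar values of a document into one string.
function buildSearchText(doc) {
  const parts = [];
  const visit = (value) => {
    if (value === null || value === undefined) return;
    if (value instanceof Date) {
      parts.push(value.toISOString());
    } else if (Array.isArray(value)) {
      value.forEach(visit);
    } else if (typeof value === "object") {
      Object.values(value).forEach(visit);
    } else {
      parts.push(String(value)); // numbers and booleans become text
    }
  };
  for (const [key, value] of Object.entries(doc)) {
    if (key !== "_id" && key !== "searchText") visit(value);
  }
  return parts.join(" ");
}

const product = { _id: 1, name: "Espresso machine", price: 249, tags: ["kitchen", 2024] };
const prepared = { ...product, searchText: buildSearchText(product) };
console.log(prepared.searchText); // Espresso machine 249 kitchen 2024
```

With the native Node.js driver you would then insert prepared, call collection.createIndex({ searchText: "text" }) and query with collection.find({ $text: { $search: "249" } }). BSON types such as ObjectId would need their own branch in visit, and the field has to be rebuilt on every update.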

Azure-search: How to get documents which exactly contain search term

This question/answer dealt with a pretty similar topic, but I couldn't find the solution I was searching for.
How to practially use a keywordanalyzer in azure-search?
Starting situation:
I created a resource with multiple indexes. One of these indexes contains a Collection(Edm.String) field.
From this field I only want to get documents which exactly contain the search term. For example, the field contains values like these: "Hovercraft zero", "Hovercraft one", "Hovercraft two".
If the search term is "Hover" all three documents should be returned. If the search term is "craft zer" only the document "Hovercraft zero" should be returned. The document shouldn't get a higher score, the desired behaviour is that I only get the "Hovercraft zero" document as result.
Further information:
It is not possible to set the searchmode to all (like it was recommended in the question on the top) because I just want to set this behaviour for this specific field and not for all search queries. It also is not possible to let the responsibility on the user to enter the search term with quotes.
What I have tried so far:
Use the keyword analyzer like it was described in the question on top: no success
Use an indexanalyzer with specific token filters (ngram, lowercase) and a searchanalyzer as a keyword analyzer: no success
Use Charfilters to manipulate the search term and manually set the quotes on the first and last position (craft zer -> "craft zer"). Like Yahnoosh explained in the question on top, the query parser processes the query string before the analyzers are applied. So: no success
Is there any solution for this issue?
Or is there another approach to achieve the desired behaviour?
Hopefully someone can help.
Thanks in advance!
Using your example with three documents: "Hovercraft zero", "Hovercraft one", "Hovercraft two"
Issue a prefix query to find all documents that contain terms that start with "Hover"
search=Hover*
To match the term "craft zer", you need to use the keyword analyzer (or the keyword tokenizer with the lowercase token filter) at indexing time to make sure the elements of your string collection are not tokenized. Then at query time you can issue a regex query (note that regex queries are much slower than term or prefix queries):
search=/.*craft zer.*/&queryType=full
Also, please use the Analyze API to test your custom analyzer configurations. It will help you make sure the analyzer produces the terms you expect.
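Since the user is not supposed to type any special syntax, the regex has to be built from the raw search term on the application side. Here is a minimal sketch of that step. The function names are hypothetical, and the list of characters to escape is my assumption about Lucene regular expressions (plus "/", which delimits the regex), so check it against the service documentation.

```javascript
// Characters treated as special by Lucene regular expressions, plus "/",
// which delimits a regex in the full query syntax. Assumed list; verify it
// against the service documentation before relying on it.
const REGEX_SPECIALS = /[.?+*|{}[\]()"\\#@&<>~\/]/g;

// Prefix every special character with a backslash.
function escapeForLuceneRegex(term) {
  return term.replace(REGEX_SPECIALS, "\\$&");
}

// Builds the query-string portion for a "contains" match on a field that was
// indexed with the keyword analyzer.
function buildContainsQuery(term) {
  const regex = "/.*" + escapeForLuceneRegex(term) + ".*/";
  return "search=" + encodeURIComponent(regex) + "&queryType=full";
}

console.log(buildContainsQuery("craft zer"));
// search=%2F.*craft%20zer.*%2F&queryType=full
```

The output is just the URL-encoded form of the regex query shown above, so it can be appended to the request as is.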
Thanks @Yahnoosh for your answer. I found a solution that worked for me.
Short example:
I have an index with three fields (field1, field2, field3). From field3 I want only the documents that exactly contain the search term. From field1 and field2 I want to get a "standard" result.
Solution:
I rewrote the search query to:
field1:{searchterm} || field2:{searchterm} || field3:"{searchterm}" &queryType=full
With this search query, field1 and field2 are queried in the "standard" way and field3 is queried with the behaviour I was looking for. There are surely more efficient and elegant ways to solve this, but it worked for me.
If anybody has a better solution let me know ;)
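The query rewrite described above could be sketched like this. The helper names are hypothetical. The parentheses around the field1 and field2 terms are my addition, meant to keep a multi-word term scoped to its field, and the escape list is my reading of the special characters in the full Lucene query syntax, so treat both as assumptions.

```javascript
// Characters with special meaning in the full Lucene query syntax (assumed list).
const QUERY_SPECIALS = /[+\-&|!(){}[\]^"~*?:\\\/]/g;

// Prefix every special character with a backslash so user input stays literal.
function escapeQueryTerm(term) {
  return term.replace(QUERY_SPECIALS, "\\$&");
}

// field1 and field2 get a regular term search, field3 a quoted phrase search.
function buildFieldedQuery(term) {
  const escaped = escapeQueryTerm(term);
  return (
    "field1:(" + escaped + ") || field2:(" + escaped + ') || field3:"' + escaped + '"'
  );
}

const search = buildFieldedQuery("craft zer");
console.log(search); // field1:(craft zer) || field2:(craft zer) || field3:"craft zer"
console.log("search=" + encodeURIComponent(search) + "&queryType=full");
```

Escaping the term first matters here, because a stray quote or colon typed by the user would otherwise change the structure of the query.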

PouchDB get documents by ID with certain string in them

I would like to get all documents whose ID contains a certain string, but I can't seem to find a solution for it.
For example, I have the following doc IDs:
vw_10
vw_11
bmw_12
vw_13
bmw_14
volvo_15
vw_16
How can I get allDocs with the string vw_ "in" it?
Use the batch fetch API:
db.allDocs({startkey: "vw_", endkey: "vw_\ufff0"})
Note: \ufff0 is a very high Unicode character, used here as a sentinel so that the range covers every ID that starts with the prefix.
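To see which IDs such a range selects without spinning up a database, here is a small standalone sketch. prefixRange and idsInRange are hypothetical helpers, and the plain string comparison only approximates the database's own ordering.

```javascript
// Hypothetical helper: allDocs options selecting every _id that starts with prefix.
function prefixRange(prefix) {
  return { startkey: prefix, endkey: prefix + "\ufff0", include_docs: true };
}

// Stand-in for the database: checks which ids fall inside the range.
// Plain string comparison is an approximation of the database's collation,
// good enough for simple ASCII ids like these.
function idsInRange(ids, { startkey, endkey }) {
  return ids.filter((id) => id >= startkey && id <= endkey).sort();
}

const ids = ["vw_10", "vw_11", "bmw_12", "vw_13", "bmw_14", "volvo_15", "vw_16"];
console.log(idsInRange(ids, prefixRange("vw_"))); // the four vw_ ids
// With a real PouchDB instance (assumed to be in `db`):
//   db.allDocs(prefixRange("vw_")).then((result) => console.log(result.rows));
```

Note that volvo_15 is excluded even though it starts with "v", because "vo" sorts before "vw". The range trick only works for prefixes, not for a string in the middle of the ID.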
You can use the PouchDB find plugin, which IMO is far more capable than allDocs for querying. It has a regex operator, $regex, which lets you do exactly this.
db.find({selector: {_id: {$regex: 'vw_'}}});
It's in beta at the time of writing, but we are about to ship a production app with it. That's how stable it has been so far. See https://github.com/nolanlawson/pouchdb-find for more on PouchDB Find.
You'd better have a view on the key you want to search. This ensures that the key is indexed. Otherwise, the search might be too slow.

Solr : Return documents that start with query keywords

I am using Solr for indexing some documents and then searching them. I want documents that start with the search keywords to rank higher in the results. How can I achieve that?
E.g.
If the search keyword is "php"
and there are two documents with content :
php developer
ajax php
then i want to return 'php developer' first instead of 'ajax php'.
Any suggestions on how to return results in this order?
I am looking for some sort of analyzer that indexes only the first word of a field's content, so that I can give that field a lot of weight when querying. Maybe that can help. I couldn't find such an analyzer for my purposes.
You can boost the first tokens using payload. Refer to the link mentioned in Payloads

WildcardQuery error in Solr

I use solr to search for documents and when trying to search for documents using this query "id:*", I get this query parser exception telling that it cannot parse the query with * or ? as the first character.
HTTP Status 400 - org.apache.lucene.queryParser.ParseException: Cannot parse 'id:*': '*' or '?' not allowed as first character in WildcardQuery
type Status report
message org.apache.lucene.queryParser.ParseException: Cannot parse 'id:*': '*' or '?' not allowed as first character in WildcardQuery
description The request sent by the client was syntactically incorrect (org.apache.lucene.queryParser.ParseException: Cannot parse 'id:*': '*' or '?' not allowed as first character in WildcardQuery).
Is there any patch for getting this to work with just * ? Or is it very costly to do such a query?
If you want all documents, do a query on *:*
If you want all documents with a certain field (e.g. id) try id:[* TO *]
Lucene doesn't allow you to start WildcardQueries with an asterisk by default, because those are incredibly expensive queries and will be very, very, very slow on large indexes.
If you're using the Lucene QueryParser, call setAllowLeadingWildcard(true) on it to enable it.
If you want all of the documents with a certain field set, you are much better off querying or walking the index programmatically than using QueryParser. You should really only use QueryParser to parse user input.
id:[a* TO z*] id:[0* TO 9*] etc.
I just did this in lukeall on my index and it worked, therefore it should work in Solr which uses the standard query parser. I don't actually use Solr.
In base Lucene there's a good reason why you'd never query for every document: to query at all you must first open an IndexReader on the index directory and apply a query to it. So you can skip the query entirely and use the IndexReader methods numDocs() to get a count of all the documents, and document(int n) to retrieve any of the documents.
If you are just trying to get all documents, Solr does support the *:* query. It's the only time I know of that Solr will let you begin a query with an *. I'm sure you've probably seen this as the default query in the Solr admin page.
If you are trying to do a more specific query with an * as the first character, like say id:*456, then one of the best ways I've seen is to index that field twice: once normally (field name: id), and once with all the characters reversed (field name: reverse_id). Then you can essentially do the query id:*456 by sending the query reverse_id:654* instead. Hope that makes sense.
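The reversed-field trick can be sketched on the client side as follows. The function names are hypothetical, and this only shows the string handling at index time and at query time, not any Solr API.

```javascript
// Reverse a string by Unicode code point.
function reverseString(value) {
  return Array.from(value).reverse().join("");
}

// At index time: add the reversed copy alongside the original field.
function withReversedId(doc) {
  return { ...doc, reverse_id: reverseString(doc.id) };
}

// At query time: rewrite a leading-wildcard pattern such as "*456" into a
// trailing-wildcard query on the reversed field. Other patterns pass through.
function rewriteLeadingWildcard(field, pattern) {
  const rest = pattern.slice(1);
  const isLeadingOnly = pattern.startsWith("*") && rest.length > 0 && !rest.includes("*");
  if (!isLeadingOnly) return field + ":" + pattern;
  return "reverse_" + field + ":" + reverseString(rest) + "*";
}

console.log(withReversedId({ id: "123456" }).reverse_id); // 654321
console.log(rewriteLeadingWildcard("id", "*456")); // reverse_id:654*
```

Solr also ships a ReversedWildcardFilterFactory that does this kind of reversal inside the analysis chain, which may be worth a look before rolling your own.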
You can also search the Solr user group mailing list at http://www.mail-archive.com/solr-user@lucene.apache.org/ where questions like this come up quite often.
The following Solr issue is a request to be able to configure the default lucene query parser.
https://issues.apache.org/jira/browse/SOLR-218
In this issue you can find the following description how to 'patch' Solr. This modification would allow you to start queries with a *.
Jonas Salk: I've basically updated only one Java file: SolrQueryParser.java.
public SolrQueryParser(IndexSchema schema, String defaultField) {
...
setAllowLeadingWildcard(true);
setLowercaseExpandedTerms(true);
...
}
...
public SolrQueryParser(QParser parser, String defaultField, Analyzer analyzer) {
...
setAllowLeadingWildcard(true);
setLowercaseExpandedTerms(true);
...
}
I'm not sure if setLowercaseExpandedTerms is needed...
I'm assuming with id:* you're just trying to match all documents, right?
I've never used solr before, but in my Lucene experience, when ingesting data, we've added a hidden field to every document, then when we need to return every record we do a search for the string constant in that field that's the same for every record.
If you can't add a field like that in your situation, you could use a RegexQuery with a regex that would match anything that could be found in the id field.
Edit: actually answering the question. I've never heard of a patch to get that to work, but I would be surprised if it could even be made to work reasonably well. See this question for a reason why unconstrained PrefixQuery's can cause a problem.
Actually, I have been using a workaround for this. I append a character to the id, eg: A1, A2, etc.
With such values in the field, it is possible to search using the query id:A*
But I would love to find out whether a true solution exists.
