lucene index special characters - search

I am indexing fields that contain characters like <, {, //, " etc. in Lucene.
Is that fine?
My search query will not contain these special characters, but the fields retrieved in response to a query might contain one or more of them.
Table: keywords
Fields: keyword, text
When the user enters a search term, it will be matched against the keyword column, which may contain special characters.

I think it doesn't matter as long as your search term/search phrase/search query doesn't contain special characters.
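If a query ever does need to contain one of these characters, Lucene's classic QueryParser can escape it so it is treated as literal text rather than query syntax. A minimal Java sketch (the "keyword" field and the analyzer choice are illustrative):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;

// QueryParser.escape backslash-escapes query-syntax characters
// such as { } / " * ? so the parser matches them literally.
String raw = "{config}/path";
String safe = QueryParser.escape(raw);  // \{config\}\/path
Query q = new QueryParser("keyword", new StandardAnalyzer()).parse(safe);  // parse throws ParseException

Note that an analyzer like StandardAnalyzer typically strips such punctuation at both index and query time anyway, which is why special characters in indexed fields are usually harmless.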

Related

cts search returning wrong results for wildcarded search

I am facing an issue with a 'wildcarded' search in an 'unfiltered' cts search query.
Problem explanation:
I have inserted the documents below into the database.
xdmp:document-insert('/a/a1.xml', <root><aa>123</aa></root>);
xdmp:document-insert('/a/a2.xml', <root><aa>12</aa></root>);
xdmp:document-insert('/a/a3.xml', <root><aa>1</aa></root>);
In the query below I am looking for documents having only one digit in the 'aa' element, but it returns all of the documents inserted above.
cts:search(
  doc(),
  cts:element-word-query(xs:QName('aa'), '?', ('wildcarded')),
  'unfiltered'
)
If I perform a 'filtered' search I get the right result, which is doc '/a/a3.xml'.
The same issue occurs when the search term is '??' (expecting docs containing a two-digit number in the 'aa' element) and '???' (expecting docs containing a three-digit number in the 'aa' element).
Below indexes are set to true:
three character searches
three character word positions
fast element character searches
trailing wildcard searches
trailing wildcard word positions
fast element trailing wildcard searches
I am curious to know why this is happening and how I can correct it.
An unfiltered search can only return accurate results if there is an index that can satisfy the query. You can see how your query is formulated for index resolution using xdmp:plan:
xdmp:plan(
  cts:search(doc(), cts:element-word-query(xs:QName("aa"), "?", "wildcarded"))
)
In your case, you have no index that can do this, and the plan will show that you are just asking for all documents with that element in them. The three character and trailing wildcard indexes only work if there are three or more non-wildcard characters, and the fast element character index just means that whatever character indexes you have are applied with the element context.

We recommend that for wildcards you add a codepoint collation word lexicon. You can add it to the database as a whole, or, if you know you only need these kinds of wildcards for this particular element, you can add an element word lexicon. Lexicon expansion can then be used to resolve the wildcard.
This happens in a heuristic way automatically (which is to say, depending on the size of your database and the number of lexicon matches, we may formulate the query in more or less accurate ways), but there are also various options to force the handling to behave a certain way. See the API documentation for cts:element-word-query.
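For example, once an element word lexicon (codepoint collation) exists on 'aa', the wildcard can also be expanded through the lexicon explicitly. A sketch, assuming that lexicon is configured:

cts:search(
  doc(),
  cts:element-word-query(
    xs:QName('aa'),
    (: expand '?' against the element word lexicon; the returned
       words are OR'd together inside the word query :)
    cts:element-word-match(xs:QName('aa'), '?')
  ),
  'unfiltered'
)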

MongoDB: Indexing for a live search

Situation
I need to create a live search with MongoDB, but I don't know which index is better to use: a normal one or a text one. Yesterday I found the main differences between them. I have the following document:
{
title: 'What vitamins are found in blueberries'
//other fields
}
So, when the user enters blue, the system must find this document (... blueberries).
Problem
I found these differences in an article about them:
A text index on the other hand will tokenize and stem the content of the field. So it will break the string into individual words or tokens, and will further reduce them to their stems so that variants of the same word will match ("talk" matching "talks", "talked" and "talking" for example, as "talk" is a stem of all three).
So, why is a text index, and its subsequent searches, faster than a regex on a non-indexed text field? It's because text indexes work as a dictionary, a clever one that's capable of discarding words on a per-language basis (defaults to English). When you run a text search query, you run it against the dictionary, saving yourself the time that would otherwise be spent iterating over the whole collection.
That's what I need, but:
The $text operator can search for words and phrases. The query matches on the complete stemmed words. For example, if a document field contains the word blueberry, a search on the term blue will not match the document. However, a search on either blueberry or blueberries will match.
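To illustrate that behavior with the document above (mongo shell; the collection name is illustrative):

db.articles.createIndex({ title: "text" })
db.articles.find({ $text: { $search: "blueberries" } })  // matches: stems to the indexed "blueberri"
db.articles.find({ $text: { $search: "blue" } })         // no match: "blue" is a different stem
db.articles.find({ title: /blue/ })                      // matches, but cannot use the text index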
Question
I need a fast, clever dictionary, but I also need searching by substring. How can I combine these two methods?

Azure Search- Is there way to get exact match of words?

In Azure Search, is there a way to get an exact-match result for multiple words?
If I search for "Coca Cola Millenials", can I get back only the results matching the phrase "Coca Cola Millenials"?
Are you asking whether you can search for the phrase "Coca Cola Millenials"? Yes, you can. Surround the phrase with quotes as you did in this question.
From our documentation:
The phrase operator encloses a phrase in quotation marks. For example, while Roach Motel (without quotes) would search for documents containing Roach and/or Motel anywhere in any order, "Roach Motel" (with quotes) will only match documents that contain that whole phrase together and in that order (text analysis still applies).
Hope that helps

ArangoDB Full Text Index Performance

I have 4842 documents with the sample format
{"ID":"12345","NAME":"name_value","KIND":"kind_value",...,"Secondary":{...},"Tertiary":{...}} where "..." stands for a few more key/value pairs that vary per object.
I indexed KIND as a fulltext index using db.collection.ensureFulltextIndex("KIND") before inserting the data. Also, KIND is just a one-word string, i.e. without spaces.
The following queries were executed via AQL:
FOR doc IN FULLTEXT(collection, 'KIND', 'DeploymentFile') RETURN doc --> takes 3.54s (avg)
FOR doc IN collection FILTER doc.KIND == 'DeploymentFile' RETURN doc --> takes 1.16s (avg)
Both queries returned 2944 objects.
Q1. Assuming that we have used a fulltext index and I haven't hash-indexed KIND, shouldn't the query using the FULLTEXT function be faster than the plain == operation (since == doesn't utilize the fulltext index)? If so, what am I doing wrong here?
Q2. Utilizing the fulltext index, can I perform a query which does a CONTAINS string or LIKE string?
--- UPDATE Q2. The requirement is searching for a substring within a parent string (which is only one word). The substring can lie anywhere within the parent string (the SQL equivalent of LIKE '%text%').
Q1: The fulltext index does allow for more complex queries. It splits the text at word breaks and checks whether a word occurs within a larger text. None of these features are needed in your example, so the index generates more overhead than it saves.
In your example it would be better to create a skip-list or hash-index and search for equality.
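For example (arangosh; a sketch assuming the same collection as in the question):

db.collection.ensureHashIndex("KIND");
db._query("FOR doc IN collection FILTER doc.KIND == 'DeploymentFile' RETURN doc");

With a hash index on KIND, the equality filter becomes an index lookup instead of a collection scan.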
Q2: In the simplest form, a fulltext query contains just the sought word. If multiple search words are given in a query, they should be separated by commas. All search words are combined with a logical AND by default, and only documents that contain all search words will be returned. This default behavior can be changed by providing extra control characters in the fulltext query, which are:
+: logical AND (intersection)
|: logical OR (union)
-: negation (exclusion)
Examples:
"banana": searches for documents containing "banana"
"banana,apple": searches for documents containing both "banana" AND "apple"
"banana,|orange": searches for documents containing either "banana" OR "orange" OR both
"banana,-apple": searches for documents that contains "banana" but NOT "apple".
Logical operators are evaluated from left to right.
Each search word can optionally be prefixed with complete: or prefix:, with complete: being the default. This allows searching for complete words or for word prefixes. Suffix searches or any other form of partial-word matching are currently not supported.
Examples:
"complete:banana": searches for documents containing the exact word "banana"
"prefix:head": searches for documents with words that start with prefix "head"
"prefix:head,banana": searches for documents contain words starting with prefix - "head" and that also contain the exact word "banana".
Complete match and prefix search options can be combined with the logical operators.
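Given the update to Q2 (substring anywhere within the word), the fulltext index therefore cannot help. A non-indexed AQL filter is the fallback, at the cost of a full collection scan (a sketch; LIKE performs the SQL-style %text% match):

FOR doc IN collection
  FILTER LIKE(doc.KIND, '%text%')
  RETURN doc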

Google Site Search - Filtering with PageMap attributes that contain blanks / special characters

We're using a PageMap to provide structured data with our HTML content. Part of this structured data is a set of keywords that are displayed on the result page. Furthermore, it should also be possible to filter the results by such a keyword.
We have keywords that contain spaces as well as special characters. Here's an excerpt of a result element returned by the XML API of Google Site Search:
<PageMap>
<DataObject type="document">
<Attribute name="mykeywords">Computer & Hobby</Attribute>
...
</DataObject>
</PageMap>
This works perfectly for displaying the result. However, for filtering we would have to pass a query like this:
more:pagemap:document-mykeywords:computer___hobby
How can we determine the query string from the result in the XML? Simply by lowercasing the value and replacing every non-word character with _? How reliable is this?
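(Written out in JavaScript for illustration, that guess does reproduce the example above; whether it matches Google's normalization in every case is exactly the open question:)

'Computer & Hobby'.toLowerCase().replace(/\W/g, '_')  // "computer___hobby"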
Or is it better to provide two distinct attributes in our PageMap, one for the label of the keyword and the other for the id of the keyword?
