Search and Search All in COBOL - search

I want to know the result when we try to search an item in COBOL using SEARCH or SEARCH ALL and this item appears multiple times in the table.
Will either of the two find all the occurrences?

Will either of the two find all the occurrences?
No. Either form of SEARCH will identify one, and only one, table element.
The SEARCH statement is used to search a table for a table element that satisfies the specified condition and to adjust the value of the associated index to indicate that table element.
For SEARCH, the final value of the index or identifier points to the first table element that matches the conditions.
For SEARCH ALL, the final setting of the search index points to one of the matching elements, but it is undefined which one.

Neither will, but with SEARCH you can set the starting index and run a second search to find subsequent entries.
Search
The Search verb does a linear search through the table. Table entries can be in any sequence.
If there are multiple matching entries, the first one at or after the starting index will be found.
You can use the SET statement (SET index-name TO value) to set the starting position.
Search All
The Search All verb does a binary search of the table. The table must be in key sequence.
If there are multiple matches, any one of them could be found. For large tables, Search All will be the faster option.
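The difference between the two verbs can be sketched outside COBOL. The following is a minimal Python illustration, not COBOL semantics in full: `linear_search`, `find_all`, and `binary_search` are hypothetical helper names, and indexes are zero-based here, whereas COBOL table indexes start at 1. It shows a linear search being restarted just past each hit to collect every occurrence, and a binary search landing on whichever duplicate it happens to probe first.

```python
def linear_search(table, key, start=0):
    """Like SEARCH: return the first index at or after `start` holding `key`, else -1."""
    for i in range(start, len(table)):
        if table[i] == key:
            return i
    return -1


def find_all(table, key):
    """Repeat the linear search, restarting just past each hit (the SET-then-SEARCH idea)."""
    hits = []
    i = linear_search(table, key)
    while i != -1:
        hits.append(i)
        i = linear_search(table, key, i + 1)
    return hits


def binary_search(sorted_table, key):
    """Like SEARCH ALL: needs a sorted table; returns some matching index, else -1."""
    lo, hi = 0, len(sorted_table) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if sorted_table[mid] == key:
            return mid  # may be any one of several duplicates
        if sorted_table[mid] < key:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1
```

With duplicates in the table, `find_all` reports every position, while `binary_search` reports a single position with no guarantee that it is the first.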

Related

cts search returning wrong results for wildcarded search

I am facing an issue with a ‘wildcarded’ search in an ‘unfiltered’ cts:search query.
Problem explanation:
I have inserted the documents below into the database.
xdmp:document-insert('/a/a1.xml', <root><aa>123</aa></root>);
xdmp:document-insert('/a/a2.xml', <root><aa>12</aa></root>);
xdmp:document-insert('/a/a3.xml', <root><aa>1</aa></root>);
In the query below I am looking for documents that have only one digit in the ‘aa’ element.
But the query returns all the documents I inserted above.
cts:search(
doc(),
cts:element-word-query(xs:QName('aa'), '?', ('wildcarded')),
'unfiltered'
)
If I perform a ‘filtered’ search, I get the right result, which is doc ‘/a/a3.xml’.
The same issue occurs when the search term is ‘??’ (expecting docs that contain a two-digit number in the ‘aa’ element) and ‘???’ (expecting docs that contain a three-digit number in the ‘aa’ element).
The following index settings are set to true:
three character searches
three character word positions
fast element character searches
trailing wildcard searches
trailing wildcard word positions
fast element trailing wildcard searches
I am curious to know why this is happening and how I can correct it.
An unfiltered search can only return accurate results if there is an index that can satisfy the query. You can see how your query is being formulated to index resolution using xdmp:plan:
xdmp:plan(
cts:search(doc(),cts:element-word-query(xs:QName("aa"),"?","wildcarded"))
)
In your case, you have no index that can do this and the plan will show that you are just asking for all documents with that element in them. The three character and trailing wildcard indexes only work if there are three or more non-wildcard characters, and the fast element character index just means to apply whatever character indexes you have with the element context. We recommend that for wildcards you add a codepoint collation word lexicon. You can add it to the database as a whole, or, if you know you only need these kinds of wildcards for this particular element, you can add an element word lexicon. Lexicon expansion can then be used to resolve the wildcard.
This happens in a heuristic way automatically (which is to say, depending on the size of your database and the number of lexicon matches, we may formulate the query in more or less accurate ways), but there are also various options to force the handling to behave a certain way. See the API documentation for cts:element-word-query.
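The idea of lexicon expansion can be illustrated with a toy model. The Python sketch below is not how MarkLogic implements it; `word_index` and `wildcard_lookup` are hypothetical names, and the dictionary stands in for a word lexicon plus a word index. It expands the wildcard pattern against the list of known words first, then looks up only the words that matched, which is why a lexicon lets the index alone return the right documents.

```python
from fnmatch import fnmatchcase

# Toy stand-in for a word lexicon plus word index: word -> URIs containing it.
word_index = {
    "123": {"/a/a1.xml"},
    "12": {"/a/a2.xml"},
    "1": {"/a/a3.xml"},
}


def wildcard_lookup(index, pattern):
    """Expand `pattern` against the lexicon, then union the postings of matching words."""
    lexicon = sorted(index)  # every distinct word known to the index
    expanded = [word for word in lexicon if fnmatchcase(word, pattern)]
    uris = set()
    for word in expanded:
        uris |= index[word]
    return uris
```

In this model `wildcard_lookup(word_index, "?")` yields only `/a/a3.xml`, because `?` expands to the single one-character word in the lexicon. Without a lexicon to expand against, there is nothing in the index to distinguish the three documents.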

Search for exact term in an Algolia index

I want to filter an index by an exact value of an attribute. I wonder what possibilities Algolia offers for that.
Querying an index always results in a search for substrings; that is, a search term abc will always match any object whose attribute values contain abc. What I want to achieve is a search for abc that finds only objects where abc is the exact value of an attribute (in this case I have specific attributes to search in).
One possibility I came up with was tagging, which doesn't seem to be the best approach.
Edit
I think I could also use facet filters. I thought about the pros and cons of each and can't come up with arguments that place one above the other.
You're right with your edit that facet filters would be the way to go on this one. You'll get the exact match you're looking for and won't have to create a new attribute of _tags to use the tag filter.

ArangoDB Full Text Index Performance

I have 4842 documents with a sample format
{"ID":"12345","NAME":"name_value","KIND":"kind_value",...,"Secondary":{...},"Tertiary":{...}} where “...” stands for a varying number of additional key-value pairs per object
I have indexed KIND as a full text index using db.collection.ensureFulltextIndex("KIND") before inserting data. Also, KIND is just a one-word string, i.e. without spaces.
Via AQL following queries were executed:
FOR doc IN FULLTEXT(collection, 'KIND', 'DeploymentFile') RETURN doc --> takes 3.54s (avg)
FOR doc IN collection FILTER doc.KIND == 'DeploymentFile' RETURN doc --> takes 1.16s (avg)
2944 Objects returned in both queries
Q1. Assuming that we have used a fulltext index and I haven't hash indexed KIND, shouldn't the query using the FULLTEXT function be faster than the normal == operation (since == doesn't use the fulltext index)? If so, what am I doing wrong here?
Q2. Using the fulltext index, can I perform a query that does a CONTAINS string or LIKE string?
---UPDATE Q2.The requirement is searching for a substring within a parent string (which is only one word). The substring can lie anywhere within the parent string. (SQL equivalent of LIKE '%text%')
Q1: The fulltext index allows for more complex queries. It splits the text at word breaks and checks whether a word occurs within a larger text. None of these features are needed in your example, so the index generates more overhead than it saves.
In your example it would be better to create a skip-list or hash index and search for equality.
Q2: In the simplest form, a fulltext query contains just the sought word. If multiple search words are given in a query, they should be separated by commas. All search words will be combined with a logical AND by default, and only such documents will be returned that contain all search words. This default behavior can be changed by providing the extra control characters in the fulltext query, which are:
+: logical AND (intersection)
|: logical OR (union)
-: negation (exclusion)
Examples:
"banana": searches for documents containing "banana"
"banana,apple": searches for documents containing both "banana" AND "apple"
"banana,|orange": searches for documents containing either "banana" OR "orange" OR both
"banana,-apple": searches for documents that contain "banana" but NOT "apple".
Logical operators are evaluated from left to right.
Each search word can optionally be prefixed with complete: or prefix:, with complete: being the default. This allows searching for complete words or for word prefixes. Suffix searches or any other forms of partial-word matching are currently not supported.
Examples:
"complete:banana": searches for documents containing the exact word "banana"
"prefix:head": searches for documents with words that start with prefix "head"
"prefix:head,banana": searches for documents that contain words starting with the prefix "head" and that also contain the exact word "banana".
Complete match and prefix search options can be combined with the logical operators.
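The query rules quoted above can be mimicked in a few lines to see what they do and do not match. This is a toy Python model of the described syntax only, not ArangoDB's implementation; `tokenize`, `term_matches`, and `fulltext` are hypothetical names, and the word splitting is a simple assumption. It evaluates the comma-separated terms left to right and supports the `+`, `|`, `-` operators and the `complete:` and `prefix:` modes.

```python
import re


def tokenize(text):
    """Split text into lowercase words (a simplifying assumption about word breaks)."""
    return set(re.findall(r"\w+", text.lower()))


def term_matches(term, words):
    """Check one search word, honouring an optional complete: or prefix: mode."""
    mode, word = term.split(":", 1) if ":" in term else ("complete", term)
    word = word.lower()
    if mode == "prefix":
        return any(w.startswith(word) for w in words)
    return word in words  # complete: is the default


def fulltext(docs, query):
    """Evaluate a comma-separated query left to right over {doc_id: text}."""
    result = None
    for raw in query.split(","):
        raw = raw.strip()
        if not raw:
            continue
        op, term = (raw[0], raw[1:]) if raw[0] in "+|-" else ("+", raw)
        hits = {d for d, text in docs.items() if term_matches(term, tokenize(text))}
        if result is None:
            result = set(docs) - hits if op == "-" else hits
        elif op == "+":
            result &= hits  # logical AND
        elif op == "|":
            result |= hits  # logical OR
        else:
            result -= hits  # negation
    return result or set()
```

In this model a query such as `"nan"` does not match a document containing "banana", which mirrors the point for Q2: only complete words and word prefixes are matched, so a LIKE '%text%' style substring search is not something this kind of index answers.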

How to search phrase queries in inverted index structure?

If we want to search for a phrase query like "t1 t2 t3" (t1, t2, t3 must appear consecutively, in that order) in an inverted index structure, which approach should we use?
1 - First we search for the term "t1" and find all documents that contain "t1", then do the same for "t2" and then "t3". Then we find the documents in which the positions of "t1", "t2" and "t3" are next to each other.
2 - First we search for the term "t1" and find all documents that contain "t1"; then, within the documents we found, we search for "t2"; and then, within that result, we find the documents that contain "t3".
I have a full inverted index. I want to know which of the approaches above is more efficient, (1) or (2)?
thanks a lot.
As the wikipedia entry well explains,
There are two main variants of inverted indexes: A record level inverted index (or inverted file index or just inverted file) contains a list of references to documents for each word. A word level inverted index (or full inverted index or inverted list) additionally contains the positions of each word within a document. The latter form offers more functionality (like phrase searches), but needs more time and space to be created.
Since you don't tell us which variant you have, we can't really answer your question precisely, but thinking about each possibility will help.
To open and search documents is typically a costly operation, unless your documents are unusually small, so you want to minimize that -- and option (2) doesn't really minimize it. If you have an inverted list, with option (1) you won't even need to open any document; if you only have an inverted file, you'll inevitably need to open documents and scan them (since you otherwise lack information to confirm word adjacency) -- but at least with option (1) you minimize the number of documents you have to open and scan (only those in the intersection of the lists of documents containing each word).
So, in either case, option (1) is more promising (unless your documents are peculiarly small).
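Option (1) on a word-level (positional) inverted index can be sketched as follows. This is a minimal Python illustration under simple assumptions (whitespace tokenization, everything in memory); `build_positional_index` and `phrase_search` are hypothetical names. It intersects the document lists of all terms first, then checks positions only for the surviving documents, so no document text is reopened at query time.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each word to {doc_id: [positions]} for a {doc_id: text} collection."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word][doc_id].append(pos)
    return index


def phrase_search(index, phrase):
    """Return ids of documents where the phrase's words occur at consecutive positions."""
    terms = phrase.lower().split()
    if not terms or any(term not in index for term in terms):
        return set()
    # Step 1: intersect the document lists of every term.
    candidates = set(index[terms[0]])
    for term in terms[1:]:
        candidates &= set(index[term])
    # Step 2: for each candidate, look for a start position followed by the other terms.
    hits = set()
    for doc_id in candidates:
        for start in index[terms[0]][doc_id]:
            if all(start + offset in index[term][doc_id]
                   for offset, term in enumerate(terms)):
                hits.add(doc_id)
                break
    return hits
```

With a record-level index the first step stays the same, but the second step would have to open and scan each candidate document instead of consulting stored positions.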

Lucene number extracting

I have this number extracting problem.
I want to get all matches that don't have a certain number in them,
e.g. 125501874, 125001873
Every number that has 55 at position 2 (the third and fourth digits) is to be excluded.
The first digit ranges from 0 to 9 and the second from 1 to 9, so the real range is [01-99]
(we cannot have 00 as the first two digits).
With Lucene I wanted to add NOT field:[01-99]55*
But it doesn't seem to work. Is there an easy way to find ??55* and disregard it in a Search("NOT field:[01-99]55*")?
Thank you Lucene guru
Lucene can do this very efficiently if one creates an "index-only" field with only the third and fourth digits in it. The complete value can be "stored" (or stored and indexed if other queries use the whole number) in the original field.
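The derived-field idea can be shown without any Lucene API. The Python sketch below only illustrates the data preparation and the resulting filter; `key_digits` and the in-memory `index_only` mapping are hypothetical stand-ins for an index-only field, and no Lucene calls are made.

```python
def key_digits(number):
    """Return the third and fourth digits, the value the index-only field would hold."""
    return number[2:4]


# The two sample values from the question.
records = ["125501874", "125001873"]

# Stand-in for the extra index-only field: full number -> key digits.
index_only = {number: key_digits(number) for number in records}

# Excluding "55" is now an exact term match on the derived field, not a wildcard.
kept = [number for number in records if index_only[number] != "55"]
```

Because the exclusion becomes an exact match on a short field, the query no longer needs a leading wildcard on the full number.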
Update: A followup comment asked, "Is [there] a way to create a temporary index on only the second digit?"
Using a ParallelReader "vertically partitions" the fields of an index. One partition could hold the current index, with its fields, while the other is a temporary index with the new field, possibly stored in a RAMDirectory.
Assuming the number is "stored" in the original index, iterate over each document in the original index, retrieve the stored field, parse out the key digits, and add a Document to the temporary index with the new field. As the ParallelReader documentation states, it is imperative that the document numbers match in both indexes.
Thank you erickson. Your solution using ParallelReader would probably be the best, if only I could use temporary indexes; because we cache the search query, we will need those later.
But like you said before, it is better to start with an index on the relevant digits straightaway.
I have another solution.
NOT field:0?55*
NOT field:1?55*
...
NOT field:9?55*
It is efficient enough for the search I'm doing, and it bypasses the first-character wildcard limitation. I wouldn't use it if there were more digits to check or if they were farther from the start.
I'm now testing this on a million rows and it's pretty efficient for our needs.
