How to index keywords using couchdb-lucene

I'm trying to build a CouchDB view using couchdb-lucene to query on keywords. I want Lucene to index them without any processing.
I'm using the "index": "not_analyzed" option, but it still does not behave as I expected.
When I query for /works/OL1000010W, couchdb-lucene converts it to lowercase and strips the leading / character.
$ curl -s 'http://127.0.0.1:5984/editions_1k/_fti/_design/seeds/by_seed?q=seed:/works/OL1000010W&limit=1'
{
"rows": [],
"total_rows": 0,
"skip": 0,
"search_duration": 1,
"q": "seed:works/ol1000010w",
"fetch_duration": 0,
"etag": "11e4be5bdb5c1598",
"limit": 1
}
Is there any way to make couchdb-lucene index it without processing and stop couchdb-lucene from processing the query?
Here is my design document:
https://gist.github.com/670374

I found that this is due to a bug in couchdb-lucene:
https://github.com/rnewson/couchdb-lucene/issues/#issue/92
The workaround is to write the view like this:
{
"analyzer": "keyword",
"index": "function(doc) {...}"
}
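For illustration, here is a minimal Python sketch of how the workaround could be assembled and queried. The database, design document, and index names follow the query URL in the question. The "fulltext" key is an assumption about how couchdb-lucene design documents are laid out (the gist is not reproduced here), the helper names build_design_doc and build_query_url are hypothetical, and the index function body is left elided as in the original.

```python
import json
from urllib.parse import urlencode


def build_design_doc(index_function: str) -> dict:
    """Return a design document whose by_seed index uses the keyword analyzer."""
    return {
        "_id": "_design/seeds",
        # Assumption: couchdb-lucene reads its indexes from the "fulltext" key.
        "fulltext": {
            "by_seed": {
                # The keyword analyzer treats the whole value as a single token.
                "analyzer": "keyword",
                "index": index_function,
            }
        },
    }


def build_query_url(base: str, db: str, seed: str, limit: int = 1) -> str:
    """Build the _fti query URL, percent-encoding the seed value."""
    params = urlencode({"q": f"seed:{seed}", "limit": limit})
    return f"{base}/{db}/_fti/_design/seeds/by_seed?{params}"


design_doc = build_design_doc("function(doc) {...}")  # body elided in the original
query_url = build_query_url("http://127.0.0.1:5984", "editions_1k", "/works/OL1000010W")

print(json.dumps(design_doc, indent=2))
print(query_url)
```

Percent-encoding the query value keeps the / and : characters from being misread as part of the URL itself; it does not change how Lucene analyzes the term.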

ArangoDB search view is not consistent with collection

I am using ArangoDB version 3.9.3 and I am seeing an inconsistency between a collection and the ArangoSearch view defined on top of it.
The number of documents found by querying the collection is nearly half of the number of documents found by querying the view.
I am using the following query
RETURN COUNT(FOR n IN collection RETURN n)
which returns 4353.
And the query on view
RETURN COUNT(FOR n IN view SEARCH true RETURN n)
returns 7303.
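The comparison above can be scripted so that it is easy to re-run after each delete/create loop. This is only a sketch: run_aql is a hypothetical callable that executes an AQL string and returns its single result (for example, a thin wrapper around whatever driver you use), and count_mismatch is a name introduced here for illustration. The fake executor simply replays the counts reported above.

```python
from typing import Callable


def count_mismatch(run_aql: Callable[[str], int], collection: str, view: str) -> int:
    """Return the view count minus the collection count (0 means they agree)."""
    collection_count = run_aql(f"RETURN COUNT(FOR n IN {collection} RETURN n)")
    view_count = run_aql(f"RETURN COUNT(FOR n IN {view} SEARCH true RETURN n)")
    return view_count - collection_count


def fake_run_aql(query: str) -> int:
    """Stand-in executor that replays the counts reported in the question."""
    return 7303 if "SEARCH true" in query else 4353


print(count_mismatch(fake_run_aql, "collection", "view"))  # 7303 - 4353 = 2950
```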
Because of this inconsistency, the LIMIT operation does not work as expected and queries return extra results. The view also seems to return old documents.
I also tested older ArangoDB versions.
It happens on 3.7.18, 3.8.7, and the latest 3.9.3. I am using a single-node instance.
My workflow is something like this:
I delete and re-create a document (changing a few of its attributes) many times (around 1000) in a loop, which leads to the inconsistency in the view.
This seems to happen only when I have primarySort/storedValues defined.
I create the view on app startup; it looks like this:
{
"writebufferSizeMax": 33554432,
"writebufferIdle": 64,
"cleanupIntervalStep": 2,
"commitIntervalMsec": 250,
"consolidationIntervalMsec": 500,
"consolidationPolicy": {
"type": "tier",
"segmentsBytesFloor": 2097152,
"segmentsBytesMax": 5368709120,
"segmentsMax": 10,
"segmentsMin": 1,
"minScore": 0
},
"primarySortCompression": "none",
"writebufferActive": 0,
"links": {
"FilterNodes": {
"analyzers": [
"identity"
],
"fields": {
"name": {},
"city": {}
},
"includeAllFields": false,
"storeValues": "none",
"trackListPositions": false
}
},
"globallyUniqueId": "h9943DEC4CDFB/231",
"id": "231",
"storedValues": [],
"primarySort": [
{
"field": "name",
"asc": true
},
{
"field": "city",
"asc": true
}
],
"type": "arangosearch"
}
I am unsure whether I am doing something wrong or whether this is a long-standing bug; either way, it looks like a serious inconsistency.
Has anyone else encountered this? And can anyone help?
Thanks

How to include two analyzers into a single SEARCH statement?

I have a feeds collection with documents like this:
{
"created": 1510000000,
"find": [
"title of the document",
"body of the document"
],
"filter": [
"/example.com",
"-en"
]
}
created contains an epoch timestamp
find contains an array of fulltext snippets, e.g. the title and the body of a text
filter is an array with further search tokens, such as hashtags, domains, locales
The problem is that find contains full-text snippets, which we want to tokenize, e.g. with a text analyzer, while filter contains final tokens, which we want to compare as a whole, e.g. with the identity analyzer.
The goal is to combine find and filter in a single custom analyzer, or to combine two analyzers using two SEARCH statements, or something to that end.
I managed to query by either find or filter successfully, but not by both. This is how I query by filter:
I created a feeds_search view:
{
"writebufferIdle": 64,
"type": "arangosearch",
"links": {
"feeds": {
"analyzers": [
"identity"
],
"fields": {
"find": {},
"filter": {},
"created": {}
},
"includeAllFields": false,
"storeValues": "none",
"trackListPositions": false
}
},
"consolidationIntervalMsec": 10000,
"writebufferActive": 0,
"primarySort": [],
"writebufferSizeMax": 33554432,
"consolidationPolicy": {
"type": "tier",
"segmentsBytesFloor": 2097152,
"segmentsBytesMax": 5368709120,
"segmentsMax": 10,
"segmentsMin": 1,
"minScore": 0
},
"cleanupIntervalStep": 2,
"commitIntervalMsec": 1000,
"id": "362444",
"globallyUniqueId": "hD6FBD6EE239C/362444"
}
and I created a sample query:
FOR feed IN feeds_search
SEARCH ANALYZER(feed.created < 9990000000 AND feed.created > 1500000000
AND (feed.find == "title of the document")
AND (feed.`filter` == "/example.com" OR feed.`filter` == "-uk"), "identity")
SORT feed.created
LIMIT 20
RETURN feed
The sample query works because find is matched against the full text (identity analyzer). As soon as I switch to a text analyzer, single-word tokens work for find, but filter no longer works.
I tried a combination of SEARCH and FILTER, which gives me the desired result, but I assume it performs worse than having SEARCH do the whole thing. I see that analyzers is an array in the view syntax, but I cannot seem to assign individual fields to each analyzer.
Analyzers can be added as a property of each field in fields. What is specified in the link-level analyzers is the default, and it is used when no more specific analyzer is set for a given field.
"analyzers": [
"identity"
],
"fields": {
"find": {
"analyzers": [
"text_en"
]
},
"filter": {},
"created": {}
},
Credits: Simran at ArangoDB
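With a per-field analyzer in place, both fields can be queried in one SEARCH. The sketch below keeps the link definition and an AQL string side by side in Python. The query is an assumption about how the pieces fit together (only the find condition is wrapped in ANALYZER with TOKENS) and is not part of the original answer; the search word "title" and the helper analyzers_for are made up for the example.

```python
# Link definition from the answer: text_en for "find", identity (the default) elsewhere.
feeds_link = {
    "analyzers": ["identity"],
    "fields": {
        "find": {"analyzers": ["text_en"]},
        "filter": {},
        "created": {},
    },
    "includeAllFields": False,
    "storeValues": "none",
    "trackListPositions": False,
}

# Assumed query: only the find condition is evaluated with text_en;
# the filter comparisons fall back to the identity analyzer.
COMBINED_QUERY = """
FOR feed IN feeds_search
  SEARCH feed.created > 1500000000 AND feed.created < 9990000000
    AND ANALYZER(feed.find IN TOKENS("title", "text_en"), "text_en")
    AND (feed.`filter` == "/example.com" OR feed.`filter` == "-uk")
  SORT feed.created
  LIMIT 20
  RETURN feed
"""


def analyzers_for(field: str, link: dict) -> list:
    """Return the analyzers that apply to a field, falling back to the link default."""
    return link["fields"].get(field, {}).get("analyzers", link["analyzers"])


print(analyzers_for("find", feeds_link))    # ['text_en']
print(analyzers_for("filter", feeds_link))  # ['identity']
```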

How to reduce query execution time using mango query in CouchDB?

I am paginating 15000 records using a Mango query in CouchDB, but the more records I skip, the longer the query takes to execute.
Here is my query:
{
"selector": {
"name": {"$ne": "null"}
},
"fields": ["_id", "_rev", "name", "email" ],
"sort": [{"name": "asc" }],
"limit": 10,
"skip": '.$skip.'
}
Here the skip value is dynamic and depends on the page number; as the skip value increases, the query execution time increases too.
CouchDB "Mango" queries that use the $ne (not equal) operator tend to suffer performance issues because of the way the indexing works. One solution is to create an index that only contains documents where name does not equal null, using CouchDB's relatively new partial index feature.
Partial indexes allow the database to be filtered at index time, so that the built index only contains documents that pass the filter test you specify. The index can then be used with a query at query time to further winnow the data set down.
An index is created by calling the /db/_index endpoint:
POST /db/_index HTTP/1.1
Content-Type: application/json
Content-Length: 144
Host: localhost:5984
{
"index": {
"partial_filter_selector": {
"name": {
"$ne": "null"
}
},
"fields": ["_id", "_rev", "name", "email"]
},
"ddoc": "mypartialindex",
"type" : "json"
}
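This creates an index that only includes documents whose name passes the $ne test. Note that "null" in quotes is a JSON string; write null without quotes if the intent is to exclude JSON null values. We can then specify this index at query time: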
{
"selector": {
"name": {
"$ne": "null"
}
},
"use_index": "mypartialindex"
}
In the above query, my selector is choosing all records, but the index it is accessing is already filtered. You may add additional clauses to the selector here to further filter the data at query time.
Partial indexing is described in more detail in the CouchDB documentation.
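As a sketch of how the two request bodies above might be built programmatically, the Python below constructs them as dictionaries and passes skip as an integer instead of splicing it into a string. The function names and the page/page_size parameters are hypothetical; the JSON fields are the ones shown above, with limit and skip added to the answer's query as an example of the additional clauses it mentions.

```python
import json


def build_index_request() -> dict:
    """Body for POST /db/_index, copied from the partial index shown above."""
    return {
        "index": {
            "partial_filter_selector": {"name": {"$ne": "null"}},
            "fields": ["_id", "_rev", "name", "email"],
        },
        "ddoc": "mypartialindex",
        "type": "json",
    }


def build_page_query(page: int, page_size: int = 10) -> dict:
    """Body for POST /db/_find for a zero-based page number."""
    if page < 0 or page_size <= 0:
        raise ValueError("page must be >= 0 and page_size must be > 0")
    return {
        "selector": {"name": {"$ne": "null"}},
        "use_index": "mypartialindex",
        "limit": page_size,
        "skip": page * page_size,  # an integer, not a spliced string
    }


print(json.dumps(build_index_request(), indent=2))
print(json.dumps(build_page_query(page=3)))
```

Serializing with json.dumps guarantees valid JSON regardless of the page number, which string concatenation does not.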

Need help on Azure search with search term having asterisk(*)

We are facing an issue with the Azure Search API when the search term ends with an asterisk (*) and also contains special characters.
We are calling our production Azure Search API with the JSON object below and get no results. Notice the search term "déménage*" with an asterisk (*) at the end.
https://one-adscope-search-fr-prod.search.windows.net/indexes/one-adscope-advancedsearch-fr/docs/search?api-version=2016-09-01
{
"count": "true",
"facets": null,
"orderby": "firstSeenDate desc,creativeIdNumber asc",
"search": "déménage*",
"searchFields": "keywordSignatureLangSearch,keywordSloganLangSearch,keywordTextLangSearch,keywordScriptLangSearch,keywordIncrustTVLangSearch,keywordVisualKeywordsLangSearch,keywordAgencyLangSearch,keywordMusicTitleLangSearch,keywordMusicPerformerLangSearch,keywordMusicAuthorLangSearch,categoryLevel_1_nameLangSearch,categoryLevel_2_nameLangSearch,categoryLevel_3_nameLangSearch,categoryLevel_4_nameLangSearch,categoryLevel_5_nameLangSearch,productLevel_1_nameLangSearch,productLevel_2_nameLangSearch,productLevel_3_nameLangSearch,productLevel_4_nameLangSearch,productLevel_5_nameLangSearch,campaignNamesLangSearch,themeNamesLangSearch,creativeTitleLangSearch,visualLangSearch,keyword_tagsLangSearch,countryNameLangSearch,directorLangSearch,hashtagsLangSearch,illustratorLangSearch,inlayLangSearch,csmediaNameLangSearch,subMediaNameLangSearch,modifVersionLangSearch,photographerLangSearch,productionLangSearch,taglineLangSearch,partnersLangSearch,creativeLabelLangSearch,propertyNameLangSearch,sponsorshipProgramTitleLangSearch",
"searchMode": "any",
"select": "",
"skip": 0,
"top": 250,
"queryType": "full"
}
But when we call the API with the same JSON except for one change, the search term without the asterisk (*) at the end, i.e. "déménage", we get appropriate results.
Note that all the other fields below, including searchFields, are the same.
{
"count": "true",
"facets": null,
"orderby": "firstSeenDate desc,creativeIdNumber asc",
"search": "déménage",
"searchFields": "keywordSignatureLangSearch,keywordSloganLangSearch,keywordTextLangSearch,keywordScriptLangSearch,keywordIncrustTVLangSearch,keywordVisualKeywordsLangSearch,keywordAgencyLangSearch,keywordMusicTitleLangSearch,keywordMusicPerformerLangSearch,keywordMusicAuthorLangSearch,categoryLevel_1_nameLangSearch,categoryLevel_2_nameLangSearch,categoryLevel_3_nameLangSearch,categoryLevel_4_nameLangSearch,categoryLevel_5_nameLangSearch,productLevel_1_nameLangSearch,productLevel_2_nameLangSearch,productLevel_3_nameLangSearch,productLevel_4_nameLangSearch,productLevel_5_nameLangSearch,campaignNamesLangSearch,themeNamesLangSearch,creativeTitleLangSearch,visualLangSearch,keyword_tagsLangSearch,countryNameLangSearch,directorLangSearch,hashtagsLangSearch,illustratorLangSearch,inlayLangSearch,csmediaNameLangSearch,subMediaNameLangSearch,modifVersionLangSearch,photographerLangSearch,productionLangSearch,taglineLangSearch,partnersLangSearch,creativeLabelLangSearch,propertyNameLangSearch,sponsorshipProgramTitleLangSearch",
"searchMode": "any",
"select": "",
"skip": 0,
"top": 250,
"queryType": "full"
}
Please advise at the earliest.
Thanks,
Bhavik Shah
I suspect the documents returned without the suffix operator '*' match because the diacritics are removed from the search term during lexical analysis, whereas a prefix term such as déménage* is not analyzed and is compared with the indexed terms as typed. Please see this post for details: Prefix queries (*) in Azure Search don't return expected results
Consider changing your query to search=déménage* OR déménage
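To see why the two searches can behave differently, here is a toy Python illustration. It assumes that the field's analyzer folds diacritics at index time and that a prefix term is matched as typed; fold_diacritics and the sample indexed terms are made up for the example and are not Azure Search's actual implementation.

```python
import unicodedata


def fold_diacritics(text: str) -> str:
    """Strip combining accents, e.g. 'é' becomes 'e' (a rough stand-in for ASCII folding)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Hypothetical index content: terms are stored after analysis (folded, lowercased).
indexed_terms = [fold_diacritics(t).lower() for t in ("déménage", "déménagement")]


def prefix_matches(prefix: str) -> list:
    """Prefix search against the toy index, with no analysis of the prefix."""
    return [term for term in indexed_terms if term.startswith(prefix)]


def term_matches(term: str) -> list:
    """Whole-term search: the term is analyzed (folded) before lookup."""
    folded = fold_diacritics(term).lower()
    return [t for t in indexed_terms if t == folded]


print(prefix_matches("déménage"))  # [] because the accents are kept
print(prefix_matches("demenage"))  # ['demenage', 'demenagement']
print(term_matches("déménage"))    # ['demenage']
```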

Case insensitive search in mongodb and nodejs inside an array

I want to perform a case-insensitive tag search against tag keywords. I need this for a single keyword, and I would also like to know how to do it for multiple keywords. The problem is that the following queries return nothing. I am new to Node.js and MongoDB, so if there is any mistake in the queries, please correct me.
The tags can be 'tag1' or 'TAG1' or 'taG1'.
For a single tag keyword search I used (it returns no results):
db.somecollection.find({'Tags':{'TagText': new RegExp('Tag5',"i")}, 'Status':'active'})
For a multiple tag keyword search (this needs to be case-insensitive too):
db.somecollection.find({'Tags':{'TagText': {"$in": ['Tag3','Tag5', 'Tag16']}}, 'Status':'active'})
The record set in the db:
{
"results": {
"products": [
{
"_id": "5858cc242dadb72409000029",
"Permalink": "some-permalink-1",
"Tags": [
{"TagText":"Tag1"},
{"TagText":"Tag2"},
{"TagText":"Tag3"},
{"TagText":"Tag4"},
{"TagText":"Tag5"}
],
"Viewcount": 3791
},
{
"_id": "58523cc212dadb72409000029",
"Permalink": "some-permalink-2",
"Tags": [
{"TagText":"Tag8"},
{"TagText":"Tag2"},
{"TagText":"Tag1"},
{"TagText":"Tag7"},
{"TagText":"Tag2"}
],
"Viewcount": 1003
},
{
"_id": "5858cc242dadb11839084523",
"Permalink": "some-permalink-3",
"Tags": [
{"TagText":"Tag11"},
{"TagText":"Tag3"},
{"TagText":"Tag1"},
{"TagText":"Tag6"},
{"TagText":"Tag18"}
],
"Viewcount": 2608
},
{
"_id": "5850cc242dadb11009000029",
"Permalink": "some-permalink-4",
"Tags": [
{"TagText":"Tag14"},
{"TagText":"Tag12"},
{"TagText":"Tag4"},
{"TagText":"Tag5"},
{"TagText":"Tag7"}
],
"Viewcount": 6202
}
],
"count": 4
}
}
Create a text index on the field that you want to search. (Text search is case insensitive by default.)
db.somecollection.createIndex( { "Tags.TagText": "text" } )
For more options, see https://docs.mongodb.com/v3.2/core/index-text/#index-feature-text
Use the $text operator with $search to search the content.
For more options, see https://docs.mongodb.com/v3.2/reference/operator/query/text/#op._S_text
Search with single term
db.somecollection.find({$text: { $search: "Tag3"}});
Search with multiple search terms
db.somecollection.find({$text: { $search: "Tag3 Tag5 Tag16"}});
Update:
It looks like you want case-insensitive equality, which can easily be achieved with a regex and dot notation on the embedded field. You will not need text search, so drop the text index.
Search with single term
db.somecollection.find({'Tags.TagText': {$regex: /^Tag3$/i}}).pretty();
Search with multiple search terms
db.somecollection.find({'Tags.TagText': {$in: [/^Tag11$/i, /^Tag6$/i]}}).pretty();
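When the tags come from user input, the anchored patterns can be generated rather than typed by hand. The Python sketch below builds the same kind of filter as the queries above. build_tag_filter and matches_locally are hypothetical helpers, and handing compiled patterns to a driver is an assumption about your client library; the local matcher only exists to check the patterns without a database.

```python
import re


def build_tag_filter(tags: list) -> dict:
    """Build a filter matching documents that have any of the tags, ignoring case."""
    # re.escape keeps characters such as '.' or '+' in a tag from acting as regex syntax.
    patterns = [re.compile(f"^{re.escape(tag)}$", re.IGNORECASE) for tag in tags]
    return {"Tags.TagText": {"$in": patterns}}


def matches_locally(document: dict, tag_filter: dict) -> bool:
    """Evaluate the filter in plain Python, only to check the patterns without a database."""
    patterns = tag_filter["Tags.TagText"]["$in"]
    return any(
        pattern.search(tag["TagText"])
        for tag in document.get("Tags", [])
        for pattern in patterns
    )


sample = {"Tags": [{"TagText": "taG11"}, {"TagText": "Tag3"}]}
print(matches_locally(sample, build_tag_filter(["tag11", "TAG6"])))  # True
print(matches_locally(sample, build_tag_filter(["tag1"])))           # False
```

The ^ and $ anchors matter: without them, a search for tag1 would also match Tag11 and Tag18.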
