Two special characters separated by a blank are not found by Solr (e.g. "! !")
I have this document in my index:
http://localhost:8983/solr/koolcha/get?id=547deb3649dbae548b0f0100
{
"doc": {
"status": "xxxxxx",
"updated": "2014-12-05T09:47:27Z",
"ns": "foo3.bags",
"created": "2014-12-02T16:39:18Z",
"_ts": 6.2177735253447e+18,
"label": "_DSC0571.tif",
"project": "xxxxx",
"assignee": "xxxxx",
"folderid": "! !",
"_version_": 1.5180111153642e+18,
"_id": "547deb3649dbae548b0f0100",
"bagid": "xxxxx"
}
}
When I try to search for it by 'folderid':
http://localhost:8983/solr/koolcha/select?q=folderid:\!%20\!
Solr does not find anything:
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">0</int>
<lst name="params">
<str name="q">folderid:\! \!</str>
</lst>
</lst>
<result name="response" numFound="0" start="0"/>
</response>
If I use some other value it works, even with special characters; e.g. '!!' works.
Only a combination of special characters and blanks returns nothing.
Is this a bug in Solr, or am I doing something wrong?

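Besides escaping, I think you have to quote the value in the query:
http://localhost:8983/solr/koolcha/select?q=folderid:"\! \!"
Without the quotes, the blank ends the first clause, so only the first \! is searched in folderid and the second \! is searched in the default field.
And use the Lucene query parser (which you probably already do).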
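The quoting and escaping above can be sketched in Python as follows. This is only an illustration: the helper names escape_term and field_phrase_query are made up for this example, and the list of special characters is an assumption based on the standard Lucene query syntax, so check it against your Solr version.

```python
from urllib.parse import quote

# Characters with a special meaning in the Lucene/Solr query syntax
# (assumption: taken from the standard query parser documentation).
SPECIALS = set('+-&|!(){}[]^"~*?:\\/')

def escape_term(value):
    """Prefix every special character with a backslash."""
    return "".join("\\" + ch if ch in SPECIALS else ch for ch in value)

def field_phrase_query(field, value):
    """Build field:"value" so that blanks stay inside a single clause."""
    return f'{field}:"{escape_term(value)}"'

q = field_phrase_query("folderid", "! !")
# Percent-encode the whole query so quotes, backslashes and blanks survive the URL.
url = "http://localhost:8983/solr/koolcha/select?q=" + quote(q, safe="")
print(q)
print(url)
```

Percent-encoding the complete q value avoids having to think about which characters the browser or HTTP client would otherwise rewrite.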

Related

Azure Cognitive Search, how to configure analyzer to support "startsWith"?

I have a field in Azure Cognitive Search that has special characters in it.
The values look like this: some_id: 'SOME*STUFF*123'
I'm trying to run a "startsWith" query, but it doesn't return anything as soon as the regex tries to match anything beyond the *.
After a bit of googling I found out that the analyzer is probably responsible, possibly breaking strings apart at '*'.
So I changed the analyzer to "keyword", since I read multiple times that this is the analyzer you are supposed to use for this.
The new config looks like this:
{
"name": "some_id",
"type": "Edm.String",
"facetable": false,
"filterable": true,
"key": false,
"retrievable": true,
"searchable": true,
"sortable": true,
"analyzer": "keyword",
"indexAnalyzer": null,
"searchAnalyzer": null,
"synonymMaps": [],
"fields": []
},
My request looks like this:
{
"count": true,
"skip": 0,
"top": 5,
"searchMode": "any",
"queryType": "full",
"search": "some_id:/SO(.*)/" // SOME\\*S(.*) also doesn't work
}
I get zero matches.
With the standard analyzer I started getting no matches as soon as I had a \\* in my regex (I escaped them with \\).
Clarification on requirements:
I cannot change any data; the values (including the *) cannot be changed. I'm trying to have the whole field matched as a single token that I can run startsWith on.
For example, the regex /SOME\\*ST(.*)/ is supposed to return exactly the entries that fully match the regex. No magic with separators or tokens, simply the whole value as a single token that I can run startsWith on.
In other words, I want exactly the same results that string.startsWith(value) would give in JavaScript.
I'm guessing there is something wrong with either my config or my requests. Can anyone help me?
IMHO, you should work with a different separator. For example:
Field1 (FROM) | Field2 (TO)
SOME*STUFF*123 | SOME||STUFF||123
Then use a custom analyzer to break terms at every ||. Additionally, you can use a tokenizer that splits each term into 3-character tokens.
Samples:
SOM
OME
STU
TUF
UFF
123
Then search using:
SOM*
and it should return the data you're looking for. It would be better if you could provide more details about your content and give us samples, but this answer should point you in the right direction.
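Independently of the tokenization approach, the escaping in the question's request is easy to get wrong, because the backslash has to be escaped once for the regex and once more for JSON. Below is a minimal Python sketch of building such a request body. The helper names prefix_regex and build_request are hypothetical, and the set of regex operators is an assumption based on the Lucene regular-expression syntax. It does not handle a "/" inside the prefix, and it does not by itself guarantee that the service returns matches.

```python
import json

# Operators of the Lucene regular-expression syntax
# (assumption: verify this list against the service documentation).
LUCENE_REGEX_SPECIALS = set('.?+*|{}[]()"\\#@&<>~')

def prefix_regex(prefix):
    """Return a /.../ regex that matches any term starting with `prefix` literally."""
    escaped = "".join(
        "\\" + ch if ch in LUCENE_REGEX_SPECIALS else ch for ch in prefix
    )
    return "/" + escaped + ".*/"

def build_request(field, prefix, top=5):
    """Request body shaped like the one in the question."""
    return {
        "count": True,
        "skip": 0,
        "top": top,
        "searchMode": "any",
        "queryType": "full",
        "search": f"{field}:{prefix_regex(prefix)}",
    }

# json.dumps adds the second level of escaping needed inside a JSON string.
body = json.dumps(build_request("some_id", "SOME*ST"))
print(body)
```

Letting the JSON serializer produce the doubled backslash keeps the regex itself readable in the code.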

Search, Sort and aggregate documents

I have a database with two different document types:
{
"id": "1",
"type": "User",
"username": "User 1"
}
and a second document type with the following structure:
{
"id": "2",
"type": "Asset",
"name": "Asset one",
"owner_id": "1" //id of the user who owns the asset
}
We need to display the list of existing assets and the name of the owner (side by side). We were able to achieve this by using views and linked documents. The problem is, now we need to be able to search and sort which is not supported by views.
Is what we're trying to accomplish possible using CouchDB? Can we do this using search indexes?
We're using CouchDB 2.3.1 and we're not able to upgrade (at least for now).
I need to search for username and asset name and also be able to sort by these fields. We don't need a full featured search. Something like matches (case insensitive) is good enough.
The id / owner_id specified in the examples represent the document _id. A user will not own more than ~10 assets. The normal scenario will be 2 or 3 assets.
Without knowing the complete nature of the asset documents (e.g. lifetime, immutability, etc.) this may get you moving in a positive direction. The problem appears to be that information from both documents is needed to generate a meaningful view, and that information is not available in one place.
Assuming asset names are immutable and the number of assets per user is low, consider decoupling and denormalizing the owner_id relationship by keeping a list of assets in the User document.
For example, a User document where the assets property contains a collection of owned asset document information (_id, name):
{
"_id": "1",
"type": "User",
"username": "User 1",
"assets": [
[
"2",
"Asset one"
],
[
"10",
"Asset ten"
]
]
}
Given this structure, an Asset document is fairly thin
{
"_id": "2",
"type": "Asset",
"name": "Asset one"
}
I will assume there is much more information in the Asset documents than presented.
So how do you get searched and sorted results? Consider a design doc _design/user/_view/assets with the following map function:
function (doc) {
if(doc.type === "User" && doc.assets) {
for(var i = 0; i < doc.assets.length; i++) {
/* emit user name, asset name, value as asset doc id */
emit(doc.username + '/' + doc.assets[i][1], { _id: doc.assets[i][0] });
/* emit asset name with leading /, value as User doc _id */
emit('/' + doc.assets[i][1], { _id: doc._id })
}
}
}
Let's assume the database only has the one user "User 1" and two Asset documents, "Asset one" and "Asset ten".
This query (using curl)
curl -G <db endpoint>/_design/user/_view/assets
yields
{
"total_rows":4,"offset":0,"rows":[
{"id":"1","key":"/Asset one","value":{"_id":"1"}},
{"id":"1","key":"/Asset ten","value":{"_id":"1"}},
{"id":"1","key":"User 1/Asset one","value":{"_id":"2"}},
{"id":"1","key":"User 1/Asset ten","value":{"_id":"10"}}
]
}
Not very interesting, except notice that the rows are returned in ascending order according to their keys. To reverse the order, simply add the descending=true parameter:
curl -G <db endpoint>/_design/user/_view/assets?descending=true
yields
{
"total_rows":4,"offset":0,"rows":[
{"id":"1","key":"User 1/Asset ten","value":{"_id":"10"}},
{"id":"1","key":"User 1/Asset one","value":{"_id":"2"}},
{"id":"1","key":"/Asset ten","value":{"_id":"1"}},
{"id":"1","key":"/Asset one","value":{"_id":"1"}}
]
}
Now here's where things get cool, and those cool things are startkey and endkey.
Because of the structure of the keys, we can query all assets for "User 1" and have the Asset documents returned in order of asset name, leveraging the slash in the key:
curl -G <db endpoint>/_design/user/_view/assets
-d "startkey="""User%201/"""" -d "endkey="""User%201/\uFFF0""""
(Note: I'm on Windows, where double quotes have to be escaped ;( )
yields
{
"total_rows":4,"offset":2,"rows":[
{"id":"1","key":"User 1/Asset one","value":{"_id":"2"}},
{"id":"1","key":"User 1/Asset ten","value":{"_id":"10"}}
]
}
This is a prefix search. Note the use of the high Unicode character \uFFF0 as a terminator; we're asking for all rows in the view whose key starts with "User 1/".
Likewise, to get a sorted list of all Assets:
curl -G <db endpoint>/_design/user/_view/assets
-d "startkey="""/"""" -d "endkey="""/\uFFF0""""
yields
{
"total_rows":4,"offset":0,"rows":[
{"id":"1","key":"/Asset one","value":{"_id":"1"}},
{"id":"1","key":"/Asset ten","value":{"_id":"1"}}
]
}
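The startkey/endkey prefix trick can be sketched in Python, both as a way to build the query string without shell quoting problems and as a local simulation of which keys fall inside the range. This is only an illustration: the helper names prefix_range, view_query and simulate are made up, and the simulation sorts by Unicode code point, whereas CouchDB uses its own collation rules, so results can differ for other keys.

```python
import json
from bisect import bisect_left, bisect_right
from urllib.parse import quote, urlencode

HIGH = "\ufff0"  # high Unicode character used as the end-of-prefix terminator

def prefix_range(prefix):
    """Return (startkey, endkey) covering every key that starts with `prefix`."""
    return prefix, prefix + HIGH

def view_query(prefix, **extra):
    """Build the query string; view keys must be JSON-encoded strings."""
    start, end = prefix_range(prefix)
    params = {"startkey": json.dumps(start), "endkey": json.dumps(end)}
    params.update(extra)
    return urlencode(params, quote_via=quote)

def simulate(keys, prefix):
    """Local stand-in for the view: sort the keys and slice the range."""
    ordered = sorted(keys)
    start, end = prefix_range(prefix)
    return ordered[bisect_left(ordered, start):bisect_right(ordered, end)]

keys = ["User 1/Asset one", "/Asset ten", "User 1/Asset ten", "/Asset one"]
print(view_query("User 1/"))
print(simulate(keys, "User 1/"))
print(simulate(keys, "/"))
```

The generated query string can be appended to <db endpoint>/_design/user/_view/assets? in place of the hand-escaped -d arguments.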
Since the Asset document _id is emitted, use include_docs to fetch the Asset document:
curl -G <db endpoint>/_design/user/_view/assets -d "include_docs=true"
-d "startkey="""User%201/"""" -d "endkey="""User%201/\uFFF0""""
yields
{
"total_rows": 4,
"offset": 2,
"rows": [
{
"id": "1",
"key": "User 1/Asset one",
"value": {
"_id": "2"
},
"doc": {
"_id": "2",
"_rev": "2-f4e78c52b04b77e4b5d2787c21053155",
"type": "Asset",
"name": "Asset one"
}
},
{
"id": "1",
"key": "User 1/Asset ten",
"value": {
"_id": "10"
},
"doc": {
"_id": "10",
"_rev": "2-30cf9245b2f3e95f22a06cee6789d91d",
"type": "Asset",
"name": "Asset 10"
}
}
]
}
The same goes for the Asset rows, where the User _id is emitted.
Caveat
The major drawback here is that deleting an Asset document requires updating the User document; not the end of the world, but it would be ultra nice to avoid that dependency.
Given the original 1-1 relationship of asset to user, getting rid of the Asset document altogether and simply storing all Asset data with the User document might be feasible depending on your usage, and wildly reduces complexity.
I hope the above inspires a solution. Good luck!

Sort Solr documents based on a substring from a multivalued field

I'm not sure if I can achieve this.
I have the documents below in the index:
{
"name": "nissan",
"type": "product",
"features":["build_100",
"stability_80"]
}
{
"name": "toyota",
"type": "product",
"features":["stability_100",
"design_30"]
}
{
"name": "Audi",
"type": "product",
"features":["build_70",
"design_100"]
}
For a search for "design" in the features field I get docs 2 and 3 back. My question is: is there a way I could sort/rank the documents based on the number after the "_", so that in the above case I would get doc 3 first and then doc 2?
If this can be achieved by changing the document structure, that is also fine with me.
Index them as independent fields and make sure docValues are enabled on them (they are enabled by default in recent versions of Solr).
<dynamicField name="feature_*" type="int" indexed="true" stored="true"/>
You then index each feature as a separate field:
"feature_design": 100,
"feature_build": 70,
and so on. Sorting by the field can then be done in the same way you'd sort on any other field (for example sort=feature_design desc to get the highest value first).
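The restructuring can be done before indexing. Here is a minimal Python sketch of it, using the three documents from the question; the helper name features_to_fields and the "feature_" prefix are assumptions chosen to match the field names above, and the final sort only imitates locally what sort=feature_design desc would do in Solr.

```python
def features_to_fields(features, prefix="feature_"):
    """Turn ["build_70", "design_100"] into {"feature_build": 70, "feature_design": 100}."""
    fields = {}
    for item in features:
        # Split on the last underscore so names containing "_" still work.
        name, _, score = item.rpartition("_")
        fields[prefix + name] = int(score)
    return fields

docs = [
    {"name": "nissan", "type": "product", "features": ["build_100", "stability_80"]},
    {"name": "toyota", "type": "product", "features": ["stability_100", "design_30"]},
    {"name": "Audi", "type": "product", "features": ["build_70", "design_100"]},
]

# Replace the multivalued field with one numeric field per feature.
indexed = [
    {**{k: v for k, v in d.items() if k != "features"}, **features_to_fields(d["features"])}
    for d in docs
]

# Local imitation of filtering on "design" and sorting by it, highest first.
ranked = sorted(
    (d for d in indexed if "feature_design" in d),
    key=lambda d: d["feature_design"],
    reverse=True,
)
print([d["name"] for d in ranked])
```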

character "#" in regex azure search lucene

When I implement a search in Azure Search, a query for text containing the # character returns no results.
Which analyzer are you using for the search field? If you did not specify an analyzer, it defaults to the Lucene standard analyzer, which discards punctuation and symbols, so the email address abc#bcd.gov.co is split into several separate tokens. As documented, a regex search query only applies to single tokenized terms. The regex /.bcd.gov.co./ doesn't find the email address because it does not match any of the tokenized terms. You can either use the whitespace analyzer or build a custom one that doesn't discard punctuation or symbols, so that regex matching applies to the entire address.
Hope this helps. Thanks.
Nate
Here is a sample configuration:
{
"name": "Username",
"type": "Edm.String",
"searchable": true,
"filterable": false,
"sortable": false,
"facetable": false,
"analyzer": "email_analyzer"
},
"analyzers": [
{
"name": "email_analyzer",
"@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
"tokenizer": "uax_url_email"
}
]
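To see why the choice of analyzer matters for regex queries, here is a small Python simulation. It is not Lucene's actual tokenizer: crude_tokenize is a deliberately simplified stand-in that drops every non-alphanumeric character, whole_value_tokenize stands in for an analyzer that keeps the address as one term, and the pattern is a hypothetical example. The one property it relies on is that a regex query has to match an entire indexed term.

```python
import re

def crude_tokenize(text):
    """Simplified stand-in for an analyzer that discards punctuation and symbols."""
    return re.findall(r"[A-Za-z0-9]+", text)

def whole_value_tokenize(text):
    """Stand-in for an analyzer that keeps the address as a single term."""
    return [text]

def regex_hits(pattern, terms):
    """A regex query matches only if it matches one whole indexed term."""
    return [t for t in terms if re.fullmatch(pattern, t)]

value = "abc#bcd.gov.co"
pattern = r".*#bcd\.gov\.co"  # hypothetical query pattern containing '#'

print(regex_hits(pattern, crude_tokenize(value)))        # no term contains '#'
print(regex_hits(pattern, whole_value_tokenize(value)))  # the full address matches
```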

elasticsearch query issue with ngram

I have this data in my index:
https://gist.github.com/bitgandtter/6794d9b48ae914a3ac7c
As you can see in the mapping, I'm using ngrams with lengths from 3 to 20.
When I execute this query:
GET /my_index/user/_search?search_type=dfs_query_then_fetch
{
"query": {
"filtered": {
"query":{
"multi_match":{
"query": "F",
"fields": ["username","firstname","middlename","lastname"],
"analyzer": "custom_search_analyzer"
}
}
}
}
}
I should get all 8 documents I have indexed, but I get only 6; the two that are left out have the names Franz and Francis. I expect those two as well, because the "F" is included in their data, but for some reason it is not working.
When I execute:
GET /my_index/user/_search?search_type=dfs_query_then_fetch
{
"query": {
"filtered": {
"query":{
"multi_match":{
"query": "Fran",
"fields": ["username","firstname","middlename","lastname"],
"analyzer": "custom_search_analyzer"
}
}
}
}
}
I get those two documents.
If I lower the ngram minimum to 1 I get all the documents, but I think this will affect the performance of the query.
What am I missing here? Thanks in advance.
NOTE: all the examples were written using Sense.
This is expected: since min_gram is specified as 3, the minimum length of a token produced by the custom analyzer is 3 code points.
Hence the first token for "Franz Silva" would be "Fra",
and the token "F" would not match this document.
You can check which tokens the analyzer produces using:
curl -XGET "http://<server>/index_name/_analyze?analyzer=custom_analyzer&text=Franz%20Silva"
Also note that since the "custom_analyzer" specified above does not specify "token_chars", the tokens can contain spaces.
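The effect of min_gram can be reproduced with a few lines of Python. This is only a sketch of how an ngram tokenizer enumerates substrings, assuming no token_chars restriction and no further filters; the function name ngrams is made up, and the real analyzer from the gist may add steps such as lowercasing.

```python
def ngrams(text, min_gram=3, max_gram=20):
    """All substrings of `text` with a length between min_gram and max_gram."""
    grams = []
    for start in range(len(text)):
        longest = min(max_gram, len(text) - start)
        for size in range(min_gram, longest + 1):
            grams.append(text[start:start + size])
    return grams

tokens = ngrams("Franz Silva")
print("F" in tokens)      # False: shorter than min_gram
print("Fran" in tokens)   # True
print("z S" in tokens)    # True: spaces are kept without token_chars

# With min_gram lowered to 1, single characters become tokens as well.
print("F" in ngrams("Franz Silva", min_gram=1))  # True
```

Lowering min_gram makes the single-letter query match, at the cost of many more tokens per value in the index.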
