poor search performance for certain wildcard queries - search

I am having performance issues when using wildcard searching for certain letter combinations, and I am not sure what else I need to to to possibly improve it. All of my documents are following an envelope pattern that look something like the following.
<pdbe:person-envelope>
<person xmlns="http://schemas.abbvienet.com/people-db/model">
<account>
<domain/>
<username/>
</account>
<upi/>
<title/>
<firstName>
<preferred/>
<given/>
</firstName>
<middleName/>
<lastName>
<preferred/>
<given/>
</lastName>
</person>
<pdbe:raw/>
</pdbe:person-envelope>
I have a field defined called name, which includes the firstName and lastName paths:
{
"field-name": "name",
"field-path": [
{
"path": "/pdbe:person-envelope/pdbm:person/pdbm:firstName",
"weight": 1
},
{
"path": "/pdbe:person-envelope/pdbm:person/pdbm:lastName",
"weight": 1
}
],
"trailing-wildcard-searches": true,
"trailing-wildcard-word-positions": true,
"three-character-searches": true
}
When I do some queries using search:search, some come back fast, whereas others come back slow. This is with the filtered queries.
search:search("name:ha*",
<options xmlns="http://marklogic.com/appservices/search">
<constraint name="name">
<word>
<field name="name"/>
</word>
</constraint>
<return-plan>true</return-plan>
</options>
)
I can see from the query plan that it is going to filter over all 136547 fragments in the db. But this query works fast.
<search:query-resolution-time>PT0.013205S</search:query-resolution-time>
<search:snippet-resolution-time>PT0.008933S</search:snippet-resolution-time>
<search:total-time>PT0.036542S</search:total-time>
However a search for name:tj* takes a long time, and also filters over all of the 136547 fragments.
<search:query-resolution-time>PT6.168373S</search:query-resolution-time>
<search:snippet-resolution-time>PT0.004935S</search:snippet-resolution-time>
<search:total-time>PT12.327275S</search:total-time>
I have the same indexes on both. Are there any other indexes I should be enabling when I am specifically just doing a search via the field constraint? I have these other indexes enabled on the database itself, in general.
"collection-lexicon": true,
"triple-index": true,
"word-searches": true,
"word-positions": true
I tried doing an unfiltered query, but that did not help as I got a bunch of matches on the whole document, and not the the fields I wanted. I even tried to set the root-fragment to just my person element, but that did not seem to help things.
"fragment-root": [
{
"namespace-uri": "http://schemas.abbvienet.com/people-db/model",
"localname": "person"
}
]
Thanks for any ideas.

Fragment roots are helpful if you want to use a searchable expression for that person element, and mostly if it occurs multiple times in one document. It won't make your current search constrain on that element.
In your case you enabled a number of relevant options, but the wildcard option only works for 4 characters of more. If you want to search on wildcards with less characters, you need to enable the three, two and one character search options.
The search phrases mentioned above both contained two characters with a wildcard. Since you only enabled the three character option, it had to rely on filtering. The fact some run fast, some slow is probably because of caching. If you repeat the same query, MarkLogic will return the result from cache.
For performance testing you would either have to restart MarkLogic regularly to flush caches, or search on (semi) random strings to avoid MarkLogic being able to cache. Or maybe both..
HTH!

Related

solr distinct query I want only certain fields to be listed

I want to find the number of unique records based on myparam value.
Solr distinct query I want only certain fields to be listed.
too many ifs in the distinctValues ​​array in the results. whereas I just want to get the countDistinct value.
url:
http://xxxxxxx:18282/solr/2022/select?q=:&wt=json&rows=0&stats=on&stats.calcdistinct=true&stats.field=myparam
In fact, it would be great if I could get a result like the one below.
result:
{
"responseHeader":{
"status":0,
"QTime":10627,
"params":{
"q":"*:*",
"stats.calcdistinct":"true",
"stats":"on",
"rows":"0",
"wt":"json",
"stats.field":"myparam"}},
"response":{"numFound":816091,"start":0,"docs":[]
},
"stats":{
"stats_fields":{
"myparam":{
"countDistinct":5,
}}}}
I found the answer I was looking for in solr functions.
adress:18282/solr/2021/select?q=:&wt=json&json.facet={x:'unique(username)'}&rows=0
Accordingly, I can find out how many different users there are.

AZURE SEARCH, ismatch not filtering

I'm using Azure Search for perform some customs search in a database.
I got this one field that have this kind of structure:
"STUFF": "05-05-16-00|"
but I'm having trouble by creating the filter, because its possible that I'll not have all the numbers that builds this structure. It all depends that what the final user will type. So I need a wildcard to fill the blanks with the missing numbers, like this
"05-05-??-??" -> the pipe is important, because this field can have more than 1 code inside.
Now I need to catch all the possible elements that STARTS WITH 05-05, like, for example: 05-05-11-01
I thought I suposed to use the search.ismatch() function, but it doesnt work.
here some code:
search.ismatch('05-05-??-??','STUFF');
And the results were:
"STUFF": "02-02-16-00|",
"STUFF": "02-02-14-00|",
this is driving me crazy, because I dont know why this results came back.
Maybe is important to know that Im performing a POST request to the Azure Search API with this code in 'filter'
Maybe i should to escape this especial characters like - and ? like this
search.ismatch('05\\-05\\-\\?\\?\\-\\?\\?','STUFF')
But the results were the same.
Can somebody please help me ?
EDIT 1
following this Article I change some things and make the following search:
search.ismatch('\"05-00*\"','STUFF','simple', 'all')
And I starting the get some results, but now this is my results:
"STUFF": "06-05-02-00|", //WRONG
"STUFF": "05-02-05-01|", //RIGHT
"STUFF": "05-02-02-07|", //RIGHT
For some reason, it's returing the right structure but not the in the front of the text.
EDIT 2
I made some changes and change all the "-" for the keyword "OU" and I'm trying to follow this question to make sore like a "contains", but i perfoming a POST request with the following parameters
{
"search": "*",
"filter": "search.ismatch('/.*08010000OU/.*','STUFF', 'full', 'all')",
"skip": "0",
"count": true
}
Im trying to use a wildcard in the begining of the query search because I still missing some information.
I believe you won't be able to solve this using the StandardAnalyzer. Try switching to WhitespaceAnalyzer for this particular field and it probably will work with "05-05*"

Indexing arrays in CosmosDB

Why doesn't CosmosDB index arrays by default? The default index path is
"path": "/*"
Doesn't that mean "index everything"? Not "index everything except arrays".
If I add my array field to the index with something like this:
"path": "/tags/[]/?"
It will work and start indexing that particular array field.
But my question is why doesn't "index everything" index everything?
EDIT: Here's a blog post that describes the behavior I'm seeing. http://www.devwithadam.com/2017/08/querying-for-items-in-array-in-cosmosdb.html Array_Contains queries are very slow, clearly not using the index. If you add the field in question to the index explicitly then the queries are fast (clearly they start using the index).
"New" index layout
As stated in Index Types
Azure Cosmos containers support a new index layout that no longer uses
the Hash index kind. If you specify a Hash index kind on the indexing
policy, the CRUD requests on the container will silently ignore the
index kind and the response from the container only contains the Range
index kind. All new Cosmos containers use the new index layout by
default.
The below issue does not apply to the new index layout. There the default indexing policy works fine (and delivers the results in 36.55 RUs). However pre-existing collections may still be using the old layout.
"Old" index layout
I was able to reproduce the issue with ARRAY_CONTAINS that you are asking about.
Setting up a CosmosDB collection with 100,000 posts from the SO data dump (e.g. this question would be represented as below)
{
"id": "50614926",
"title": "Indexing arrays in CosmosDB",
/*Other irrelevant properties omitted */
"tags": [
"azure",
"azure-cosmosdb"
]
}
And then performing the following query
SELECT COUNT(1)
FROM t IN c.tags
WHERE t = 'sql-server'
The query took over 2,000 RUs with default indexing policy and 93 with the following addition (as shown in your linked article)
{
"path": "/tags/[]/?",
"indexes": [
{
"kind": "Hash",
"dataType": "String",
"precision": -1
}
]
}
However what you are seeing here is not that the array values aren't being indexed by default. It is just that the default range index is not useful for your query.
The range index uses keys based on partial forward paths. So will contain paths such as the following.
tags/0/azure
tags/0/c#
tags/0/oracle
tags/0/sql-server
tags/1/azure-cosmosdb
tags/1/c#
tags/1/sql-server
With this index structure it starts at tags/0/sql-server and then reads all of the remaining tags/0/ entries and the entirety of the entries for tags/n/ where n is an integer greater than 0. Each distinct document mapping to any of these needs to be retrieved and evaluated.
By contrast the hash index uses reverse paths (more details - PDF)
StackOverflow theoretically allows a maximum of 5 tags per question to be added by the UI so in this case (ignoring the fact that a few questions have more tags through site admin activities) the reverse paths of interest are
sql-server/0/tags
sql-server/1/tags
sql-server/2/tags
sql-server/3/tags
sql-server/4/tags
With the reverse path structure finding all paths with leaf nodes of value sql-server is straight forward.
In this specific use case as the arrays are bounded to a maximum of 5 possible values it is also possible to use the original range index efficiently by looking at just those specific paths.
The following query took 97 RUs with default indexing policy in my test collection.
SELECT COUNT(1)
FROM c
WHERE 'sql-server' IN (c.tags[0], c.tags[1], c.tags[2], c.tags[3], c.tags[4])
Cosmos DB does indexes all the element of an Array. By, default, All Azure Cosmos DB data is indexed. Read more here https://learn.microsoft.com/en-us/azure/cosmos-db/indexing-policies

Unable to full text search in Solr

I have some data in solr. I want to search which name is Chinmay Sahu See below I have 3 results in output. But I got 3 instead of 1. Because Content name searched partially.
I want to full search those name having Chinmay Sahu only that contents will come.
Output:
"docs": [
{
"id": "741fde46a654879949473b2cdc577913",
"content_id": "1277",
"name": "Chinmay Sahu",
"_version_": 1596995745829879800
},
{
"id": "4e98d680efaab3afe051f3ddc00dc5f2",
"content_id": "1825",
"name": "Chinmay Panda",
"_version_": 1596995745829879800
}
{
"id": "741fde46a654879949473b2cdc577913",
"content_id": "1259",
"name": "Sasmita Sahu",
"_version_": 1596995745829879800
}
]
Query:
name:Chinmay Sahu
Expected :
"docs": [
{
"id": "741fde46a654879949473b2cdc577913",
"content_id": "1277",
"name": "Chinmay Sahu",
"_version_": 1596995745829879800
},
]
Please help
Try doing this
name:"Chinmay Sahu"
You need to do a phrase query to match the exact name.
I am guessing in your case the name field is using Standard tokenizer which will split tokens if whitespace is there. So while indexing in all the 3 docs there will be a token called "chinmay".
While you search using
name:Chinmay Sahu
Solr will search it like this since if there is no fieldName specified before a token solr automatically searches it in default_field.(however default field is removed from solr 7.3, So it depends on what version of solr are you using.
)
Name:chinmay AND default_field:sahu
So since all the three docs are having chinmay as a token in the index,the query will match all 3 docs.
Now i dont know what your default field is? can you post your solr schema? That way we can explain why you are seeing those 3 docs.
Since root545 already explained that field:foo bar will search for foo in field and bar in the default search field, I'll suggest that it seems like you don't want to concern yourself with the exact Lucene syntax for searching. The edismax query parser is well suited for separating the typed search string from what fields are being searched and whether you want all tokens to match.
The query in that case would be just Chinmay Sahu, while you'd set q.op=AND (all terms must match), defType=edismax (use the edismax query parser) and qf=name (search the name field):
q=Chinmay Sahu&q.op=AND&defType=edismax&qf=name
You can also tune the different phrase parameters to make sure that names with the tokens in the exact same sequence will be boosted higher than those that have them in the opposite sequence (i.e. Sahu Chinmay).
If this is a programmatic search where no user is actually typing in the suggestion, using a phrase search as suggested is the way to go (name:"Chinmay Sahu").
I would suggest using query like
name:(Chinmay Sahu)
And make sure default operator is AND either in settings or query string like q.op=AND
With that approach you can use user input much easier since you don't need to parse it too much.

Case insensitive search in arrays for CosmosDB / DocumentDB

Lets say I have these documents in my CosmosDB. (DocumentDB API, .NET SDK)
{
// partition key of the collection
"userId" : "0000-0000-0000-0000",
"emailAddresses": [
"someaddress#somedomain.com", "Another.Address#someotherdomain.com"
]
// some more fields
}
I now need to find out if I have a document for a given email address. However, I need the query to be case insensitive.
There are ways to search case insensitive on a field (they do a full scan however):
How to do a Case Insensitive search on Azure DocumentDb?
select * from json j where LOWER(j.name) = 'timbaktu'
e => e.Id.ToLower() == key.ToLower()
These do not work for arrays. Is there an alternative way? A user defined function looks like it could help.
I am mainly looking for a temporary low-effort solution to support the scenario (I have multiple collections like this). I probably need to switch to a data structure like this at some point:
{
"userId" : "0000-0000-0000-0000",
// Option A
"emailAddresses": [
{
"displayName": "someaddress#somedomain.com",
"normalizedName" : "someaddress#somedomain.com"
},
{
"displayName": "Another.Address#someotherdomain.com",
"normalizedName" : "another.address#someotherdomain.com"
}
],
// Option B
"emailAddressesNormalized": {
"someaddress#somedomain.com", "another.address#someotherdomain.com"
}
}
Unfortunately, my production database already contains documents that would need to be updated to support the new structure.
My production collections contain only 100s of these items, so I am even tempted to just get all items and do the comparison in memory on the client.
If performance matters then you should consider one of the normalization solution you have proposed yourself in question. Then you could index the normalized field and get results without doing a full scan.
If for some reason you really don't want to retouch the documents then perhaps the feature you are missing is simple join?
Example query which will do case-insensitive search from within array with a scan:
SELECT c FROM c
join email in c.emailAddresses
where lower(email) = lower('ANOTHER.ADDRESS#someotherdomain.com')
You can find more examples about joining from Getting started with SQL commands in Cosmos DB.
Note that where-criteria in given example cannot use an index, so consider using it only along another more selective (indexed) criteria.

Resources