Azure Cognitive Search - No results using wildcards on content with DOT

I'm using Azure Cognitive Search to build a rich search experience inside a web application, but I'm facing the following issue: a field of the index contains codes like "Z.A.01.12", "A.A.44.11" and so on. I'm trying to use the wildcard * as a suffix in order to search for all the results that start with the value Z.A (just an example):
"Z.A.01.12" -> Z.A* => No results found.
"Z.A.01.12" -> Z.A.* => No results found.
"Z.A.01.12" -> Z\.A* => No results found.
I have tried different analyzers (standard Lucene, en.microsoft, whitespace and keyword), but even when I can see that only one token is produced containing the entire content (for example with whitespace), querying the service with a wildcard returns "No results found".
I have already set queryType=full and searchMode=any. Furthermore, I also tried to escape the . with "\", but the result is always empty. Is there anything I can do to handle these cases?
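For reference, this is roughly how the query is being issued. A minimal sketch using the azure-search-documents Python SDK, with placeholder endpoint, index name and key:

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

# Placeholder service details
client = SearchClient(
    endpoint="https://<service>.search.windows.net",
    index_name="<index>",
    credential=AzureKeyCredential("<api-key>"),
)

# Full Lucene syntax so the wildcard is honoured; this still comes back empty
results = client.search(search_text="Z.A*", query_type="full", search_mode="any")
for doc in results:
    print(doc)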

Related

How to filter blobs in Azure SDK for Python

I want to search for blobs in my Azure blob storage according to a specific tag (like .name, .creation_date, .size, ...).
My current approach returns all blobs from the container with MyContainerClient.list_blobs and searches for the corresponding tag afterwards. Since my container stores around 800,000 blobs, this takes around 20 min, which is not usable for a live view of the content.
But I also found another ContainerClient function, .find_blobs_by_tags(filter_expression: str), where I can search for a specific blob whose tags match the specified condition.
In the Azure API they specify this filter_expression as "yourtagname"='firsttag', therefore I specified "name"='example.jpg' or "creation_date"='2021-07-04 09:35:19+00:00'.
Azure SDK Python - ContainerClient.find_blobs_by_tag
Unfortunately I always get an error:
azure.core.exceptions.HttpResponseError: Error parsing query at or near character position 1: unexpected 'creation_time'
RequestId:63bd850b-401e-005f-745e-400d5a000000
Time:2022-03-25T15:40:22.4156367Z
ErrorCode:InvalidQueryParameterValue
queryparametername:where
queryparametervalue:'creation_time'='0529121f-7676-46c7-8a52-424664774240/0529121f-7676-46c7-8a52-424664774240.json'
reason:This query parameter value is invalid.
Content: <?xml version="1.0" encoding="utf-8"?>
<Error><Code>InvalidQueryParameterValue</Code><Message>Error parsing query at or near character position 1: unexpected &apos;creation_time&apos;
RequestId:63bd850b-401e-005f-745e-400d5a000000
Time:2022-03-25T15:40:22.4156367Z</Message><QueryParameterName>where</QueryParameterName><QueryParameterValue>&apos;creation_time&apos;=&apos;0529121f-7676-46c7-8a52-424664774240/0529121f-7676-46c7-8a52-424664774240.json&apos;</QueryParameterValue><Reason>This query parameter value is invalid.</Reason></Error>
Does anyone have experience with these Azure function calls?
Looking at the GitHub code (in the find_blobs_by_tags function), it says:
:param str filter_expression:
The expression to find blobs whose tags matches the specified condition.
eg. "\"yourtagname\"='firsttag' and \"yourtagname2\"='secondtag'"
It looks like you are missing the escape characters? Can you try including them?
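For illustration, a minimal sketch of the escaped call, with placeholder connection details. Note that find_blobs_by_tags only matches blob index tags you have set yourself, so "name" here stands for a tag you would have added to the blob, not a built-in blob property:

from azure.storage.blob import ContainerClient

# Placeholder connection details
container = ContainerClient.from_connection_string(
    "<connection-string>", container_name="<container>")

# Note the escaped double quotes around the tag name, as in the docstring
for blob in container.find_blobs_by_tags("\"name\"='example.jpg'"):
    print(blob.name)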

Finding Related Topics using Google Knowledge Graph API

I'm currently working on a behavioral targeting application and I need a considerably large keyword database/tool/provider that enables applications to reach similar keywords via a given keyword. I recently found Freebase, which had been providing a similar service before Google acquired it and integrated it into their Knowledge Graph. I was wondering if it's possible to get a list of related topics/keywords for a given entity.
import json
import urllib.parse
import urllib.request

api_key = 'API_KEY_HERE'
query = 'Yoga'
service_url = 'https://kgsearch.googleapis.com/v1/entities:search'
params = {
    'query': query,
    'limit': 10,
    'indent': True,
    'key': api_key,
}
# Build the request URL and decode the JSON response
url = service_url + '?' + urllib.parse.urlencode(params)
response = json.loads(urllib.request.urlopen(url).read())
for element in response['itemListElement']:
    print(element['result']['name'] + ' (' + str(element['resultScore']) + ')')
The script above returns the results below, though I'd like to receive topics related to yoga, such as health, fitness, gym and so on, rather than things that have the word "Yoga" in their name.
Yoga Sutras of Patanjali (71.245544)
Yōga, Tokyo (28.808222)
Sri Aurobindo (28.727333)
Yoga Vasistha (28.637642)
Yoga Hosers (28.253984)
Yoga Lin (27.524054)
Patanjali (27.061115)
Yoga Journal (26.635073)
Kripalu Center (26.074436)
Yōga Station (25.10318)
I'd really appreciate any suggestions, and I'm also open to using any other API if there is any that I could make use of. Cheers.
See your point :) So here's the script I use for that, using Serpstat's API. Here's how it works:
The script collects keywords from Serpstat's database
Then it collects search suggestions from Serpstat's database
Finally, it collects search suggestions from Google's suggestions
Note that to make the script work correctly, it's preferable to fill in all input boxes, but not all of them are required.
Keyword — the required seed keyword
Search Engine — the search engine for which the analysis will be carried out. For example, for US Google you need to set g_us. The entire list of available search engines can be found here.
Limit — the maximum number of phrases from the organic results that will take part in the analysis. You cannot set more than 1000 here.
Default keys — a list of one- and two-word keywords. You should give each of them some "weight" so you receive some kind of result if something goes wrong.
Format: type; keyword; weight. Every keyword should be written on a new line.
Types:
w — one word
p — two words
Examples:
"w; bottle; 50" — initial weight of word bottle is 50.
"p; plastic bottle; 30" — initial weight of phrase plastic bottle is 30.
"w; plastic bottle; 20" — incorrect. You cannot use a two-word phrase for the "w" type.
Bad words — comma-separated list of words you want the script to exclude from the results.
Token — here you need to enter your token for API access. It can be found on your profile page.
You can download the source code for the script here.
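As a rough illustration of the first step (pulling keywords from the API), here is a minimal sketch. The endpoint, method name and parameter names are assumptions for illustration, not Serpstat's documented API, so check their docs for the real values:

import requests

# Hypothetical JSON-RPC style call; endpoint, method and params are illustrative only
API_URL = "https://api.serpstat.com/v4"  # assumed endpoint
payload = {
    "id": 1,
    "method": "SerpstatKeywordProcedure.getKeywords",  # assumed method name
    "params": {"keyword": "bottle", "se": "g_us", "size": 100},
}
resp = requests.post(API_URL, params={"token": "YOUR_TOKEN"}, json=payload)
for row in resp.json().get("result", {}).get("data", []):
    print(row.get("keyword"))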

Get default stop word list in elastic search

I am trying to find out what the predefined stop word lists for Elasticsearch are, but I have found no documented read API for this.
So, I want to find the word lists for these predefined variables (_arabic_, _armenian_, _basque_, _brazilian_, _bulgarian_, _catalan_, _czech_, _danish_, _dutch_, _english_, _finnish_, _french_, _galician_, _german_, _greek_, _hindi_, _hungarian_, _indonesian_, _irish_, _italian_, _latvian_, _norwegian_, _persian_, _portuguese_, _romanian_, _russian_, _sorani_, _spanish_, _swedish_, _thai_, _turkish_).
I found the English stop word list in the documentation, but I want to check whether it is the one my server really uses, and also check the stop word lists for the other languages.
The stop words used by the English Analyzer are the same as the ones defined in the Standard Analyzer, namely the ones you found in the documentation.
The stop word files for all other languages can be found in the Lucene repository in the analysis/common/src/resources/org/apache/lucene/analysis folder.
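If you want to verify what your own server actually filters out, one option is to push sample text through the _analyze API with a stop filter and see which tokens disappear. A minimal sketch, assuming Elasticsearch is reachable on localhost:9200:

import requests

# Tokens that are in the _english_ stop set will be missing from the output
resp = requests.post(
    "http://localhost:9200/_analyze",
    json={
        "tokenizer": "standard",
        "filter": [{"type": "stop", "stopwords": "_english_"}],
        "text": "the quick brown fox is near that lazy dog",
    },
)
print([t["token"] for t in resp.json()["tokens"]])
# "the", "is" and "that" should be absent; swap _english_ for any other list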

exclude a certain path from all user searches

Unfortunately we have a special folder named "_archive" everywhere in our repository.
This folder has its purpose. But when searching for content/documents we want to exclude it and all content beneath "_archive".
So, what I want is to exclude the path and its members from all user searches. The syntax is easy with FTS:
your_query AND -PATH:"//cm:_archive//*"
to test:
https://www.docdroid.net/RmKj9gB/search-test.pdf.html
Take the PDF and put it into your repo twice:
/some_random_path/search-test.pdf
/some_random_path/_archive/search-test.pdf
In the node browser everything works as expected:
TEXT:"HODOR" AND -PATH:"//cm:_archive//*"
= 1 result
TEXT:"HODOR"
= 2 results
So, my idea was to edit search.get.config.xml and add the exclusion to the list of properties:
<search>
<default-operator>AND</default-operator>
<default-query-template>%(cm:name cm:title cm:description ia:whatEvent
ia:descriptionEvent lnk:title lnk:description TEXT TAG) AND -PATH:"//cm:_archive//*"
</default-query-template>
</search>
But it does not work as intended! As soon as I use 'text:' or 'name:' in the search field, the exclusion seems to be ignored.
What other options do I have? Basically I just want to add the exclusion to the base query after the default query template is applied.
Version is Alfresco Community 5.0.d
thanks!
I guess you're mistaken about what query templates are meant for. Take a look at the Wiki.
What you're basically doing is programmatically saying: I've got a keyword and I want to match the keyword against the following metadata fields.
By default it will match cm:name, cm:title, cm:description, etc. This can be changed to a custom field or in other cases to ALL.
So putting an extra AND or OR in here won't work, because this isn't the actual query which will be built. I could go on more about the query templates, but that won't do you any good.
In your case you'll need to modify the search.get webscript of Alfresco and the method getSearchResults(params) in search.lib.js (which gets imported).
Somewhere at the end of the method it does the following:
ftsQuery = '(' + ftsQuery + ') AND -TYPE:"cm:thumbnail" AND -TYPE:"cm:failedThumbnail" AND -TYPE:"cm:rating" AND -TYPE:"st:site"' + ' AND -ASPECT:"st:siteContainer" AND -ASPECT:"sys:hidden" AND -cm:creator:system AND -QNAME:comment\\-*';
Just add your path query to it and that will do.
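For instance, the amended line could look roughly like this (a sketch; only the trailing -PATH clause is new):

// Same exclusions as before, plus the _archive path filter from the question
ftsQuery = '(' + ftsQuery + ') AND -TYPE:"cm:thumbnail" AND -TYPE:"cm:failedThumbnail" AND -TYPE:"cm:rating" AND -TYPE:"st:site"' + ' AND -ASPECT:"st:siteContainer" AND -ASPECT:"sys:hidden" AND -cm:creator:system AND -QNAME:comment\\-*' + ' AND -PATH:"//cm:_archive//*"';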

Using indexed types for ElasticSearch in Titan

I currently have a VM running Titan over a local Cassandra backend and would like the ability to use ElasticSearch to index strings using CONTAINS matches and regular expressions. Here's what I have so far:
After titan.sh is run, a Groovy script is used to load in the data from separate vertex and edge files. The first stage of this script loads the graph from Titan and sets up the ES properties:
config.setProperty("storage.backend","cassandra")
config.setProperty("storage.hostname","127.0.0.1")
config.setProperty("storage.index.elastic.backend","elasticsearch")
config.setProperty("storage.index.elastic.directory","db/es")
config.setProperty("storage.index.elastic.client-only","false")
config.setProperty("storage.index.elastic.local-mode","true")
The second part of the script sets up the indexed types:
g.makeKey("property").dataType(String.class).indexed("elastic",Edge.class).make();
The third part loads the data from the CSV files; this has been tested and works fine.
My problem is, I don't seem to be able to use the ElasticSearch functions when I do a Gremlin query. For example:
g.E.has("property",CONTAINS,"test")
returns 0 results, even though I know this field contains the string "test" for that property at least once. Weirder still, when I change CONTAINS to something that isn't recognised by ElasticSearch, I get a "no such property" error. I can also perform exact string matches and any numerical comparisons, including greater than or less than; however, I suspect the default indexing method is being used instead of ElasticSearch in these cases.
Due to the lack of errors when I try to run a more advanced ES query, I am at a loss on what is causing the problem here. Is there anything I may have missed?
Thanks,
Adam
I'm not quite sure what's going wrong in your code. From your description everything looks fine. Can you try the following script (just paste it into your Gremlin REPL):
config = new BaseConfiguration()
config.setProperty("storage.backend","inmemory")
config.setProperty("storage.index.elastic.backend","elasticsearch")
config.setProperty("storage.index.elastic.directory","/tmp/es-so")
config.setProperty("storage.index.elastic.client-only","false")
config.setProperty("storage.index.elastic.local-mode","true")
g = TitanFactory.open(config)
g.makeKey("name").dataType(String.class).make()
g.makeKey("property").dataType(String.class).indexed("elastic",Edge.class).make()
g.makeLabel("knows").make()
g.commit()
alice = g.addVertex(["name":"alice"])
bob = g.addVertex(["name":"bob"])
alice.addEdge("knows", bob, ["property":"foo test bar"])
g.commit()
// test queries
g.E.has("property",CONTAINS,"test")
g.query().has("property",CONTAINS,"test").edges()
The last two lines should return something like e[1t-4-1w][4-knows-8]. If that works and you still can't figure out what's wrong in your code, it would be good if you could share your full code (e.g. on GitHub or in a Gist).
Cheers,
Daniel
