Google Datastore partial string matching - node.js

I have read that there is the Search API. But it seems like this API does not exist for Node.JS.
How can I partially match strings for querying entities without knowing the full name of the attribute?
For example I want to select all users that start with a G. How can I accomplish this?
Thank you for your help!

While you cannot do "true" partial string matching (i.e. contains) with Datastore, you can do a "begins with" query.
Basically, create a composite inequality filter like this:
SELECT * FROM USER WHERE USERNAME >= 'G' AND USERNAME < 'G\ufffd'.
Here, \ufffd (the Unicode replacement character, U+FFFD) is a very high code point, so it sorts after any character you would normally expect to follow 'G' in a username.
This would return all entities with their usernames starting with 'G'. You can use the same technique for matching multiple characters (e.g. >= 'JA' and < 'JA\ufffd').
Note that the string values/indexes in the Datastore are case sensitive, so you need an indexed property with all characters in either lower case or upper case so you can perform the search accordingly.
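As a minimal sketch in Node.js using the @google-cloud/datastore client (the User kind and the lower-cased, indexed username property are illustrative, not part of the original answer):

const {Datastore} = require('@google-cloud/datastore');
const datastore = new Datastore();

// Return all users whose (lower-cased) username starts with the given prefix.
async function usersStartingWith(prefix) {
  const p = prefix.toLowerCase();
  const query = datastore
    .createQuery('User')
    .filter('username', '>=', p)
    .filter('username', '<', p + '\ufffd');
  const [users] = await datastore.runQuery(query);
  return users;
}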
You can also mimic a word search like this -
Let's say you have a property named name that stores the following values:
John Doe
John Smith
James Mike Murphy
To do a word search (e.g. find entities containing the word smith, or both james and murphy), create another property (e.g. nameIndex) and store the words from name as an array property (note that all words are converted to lower case):
["john","doe"]
["john", "smith"]
["james", "mike" "murphy"]
Now you can do a word search using the nameIndex property -
SELECT * FROM Entity WHERE nameIndex = 'smith'
SELECT * FROM Entity WHERE nameIndex = 'james' AND nameIndex='murphy'
Again, note that nameIndex needs to store the data in a fixed case (lower or upper) and your query parameters should use that case. Also, OR queries are not supported unless the client library you are using supports them (typically by running multiple queries).
This approach won't work if your property holds more than 1500 bytes of data (the limit for indexed string properties).
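In Node.js, the word search might look like this (a sketch reusing the datastore instance from the snippet above; the Entity kind and the nameIndex property follow the example queries, and multiple equality filters on the same array property are combined with AND):

// Find entities whose nameIndex array contains both 'james' and 'murphy'.
async function findJamesMurphy() {
  const query = datastore
    .createQuery('Entity')
    .filter('nameIndex', '=', 'james')
    .filter('nameIndex', '=', 'murphy');
  const [matches] = await datastore.runQuery(query);
  return matches;
}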
Again, the proposed solutions are not a replacement for full-text search engines; they are just a couple of tricks you can do with Datastore alone that may satisfy simple requirements.

You can't perform partial match searches on the Datastore entities (let alone without knowing the name of the property/attribute). See Appengine Search API vs Datastore
And the Search API is, indeed, not available in the flexible environment (which includes Node.js). A potential alternative is indicated in the Search section of Migrating Services from the Standard Environment to the Flexible Environment:
The Search service is currently unavailable outside of the standard environment. You can host any full-text search database such as ElasticSearch on Google Compute Engine and access it from both the standard and flexible environments.
UPDATE:
Node.JS is currently available in the standard environment as well, see:
Now, you can deploy your Node.js app to App Engine standard environment
Google App Engine Node.js Standard Environment Documentation

Related

How do you construct an Azure Search query to return a wildcard search based solely on a specific field?

If I may have missed this in some other area of SO please redirect me, but I don't think this is a duplicate question.
I am using Azure Search with an index field called Title which is searchable and filterable using the Standard Lucene Analyzer.
When using the built-in explorer, if I want to return all Job Titles that are explicitly named Full Stack Developer I can achieve it this way:
$filter=Title eq 'Full Stack Developer'&$count=true
But if I want to retrieve all the Job Titles using a wildcard to return all records having Full Stack in the name this way:
$filter=Title eq 'Full Stack*'&$count=true
The first 20 or so records returned are spot on, but after that point I get a mix of records that have absolutely nothing in common with the Title I specified in the query. My initial assumption was that perhaps Azure was matching my specified Title but also performing an inclusive keyword search on the rest of the text.
Though I found a few instances where that hypothesis seemed to prove out, so many more of the records returned invalidated that altogether.
Maybe I don't fully understand the mechanics under the hood of Azure Search, so even though my query appears to be valid, my expectation of the result is way off.
So how should my query look to perform a wildcard search that guarantees the words specified are included in the Titles returned, if this is possible? And what would be the correct syntax to make the query inclusive with OR operators?
Azure Cognitive Search allows you to limit wildcard searches to specific fields only. To do so, you will need to specify the names of the fields in which you want to perform the search in the searchFields parameter.
Your search URL in this case would be something like:
https://accountname.search.windows.net/indexes/indexname/docs?api-version=2020-06-30&searchFields=Title&search=Full Stack*
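The same query can also be sent as a POST to the search endpoint (a sketch using the same placeholder account and index names; note that the space in the search term would need URL-encoding in the GET form above):

POST https://accountname.search.windows.net/indexes/indexname/docs/search?api-version=2020-06-30
Content-Type: application/json
api-key: <your-query-key>

{
  "search": "Full Stack*",
  "searchFields": "Title",
  "count": true
}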

finding organization and industry/sector from string in dbpedia

I am generating a short list of 10 to 20 strings which I want to look up on DBpedia to see if they have an organization tag and, if so, return the industry/sector tag. I have been looking at the SPARQLWrapper queries on their website but am having trouble constructing one that returns the organization and sector/industry for my string. Is there a way to do this?
If I use the code below I get, I think, a list of industry types rather than the industry of the company.
from SPARQLWrapper import SPARQLWrapper, JSON
sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
SELECT ?industry WHERE
{ <http://dbpedia.org/resource/IBM> a ?industry}
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
Instead of looking at queries which are meant to help you understand the querying tool, you should start by looking at the data which is being queried. For instance, just click http://dbpedia.org/resource/IBM, and look at the properties (the left hand column) to see its rdf:type values (of which there are MANY)!
Note that IBM is not described as a ?industry. IBM is described as a <http://dbpedia.org/resource/Public_company> (among other things). On the other hand, IBM is also described as having three values for <http://dbpedia.org/ontology/industry> --
<http://dbpedia.org/resource/Cloud_computing>
<http://dbpedia.org/resource/Information_technology>
<http://dbpedia.org/resource/Cognitive_computing>
I don't know whether these are what you're actually looking for or not, but hopefully what I've done above will start you down the right path to whatever you do want to get out of DBpedia.
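For example, a query along these lines (a sketch, not part of the original answer) would return those industry values directly, and could be dropped into the SPARQLWrapper snippet above in place of the original query:

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>

SELECT ?industry WHERE {
  dbr:IBM dbo:industry ?industry
}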

Returning accented as well as normal result set via azure search filters

Does anyone know how to ensure we can return the normal result as well as the accented result set via the Azure Search filter? For example, the filter query below in Azure Search returns a record named unicorn when I look for a record with the name unicorn.
var result = searchServiceClient.Documents.SearchAsync<myDto>("*", new SearchParameters
{
    SearchFields = new List<string> { "Name" },
    Filter = "Name eq 'unicorn'"
});
This is all good, but what I want is to write a filter such that it returns the record named unicorn as well as the record named únicorn (note the accented first character), provided that both records exist.
This can be achieved when searching for such a name via the search query, using a language analyzer or the standard ASCII folding analyzer as mentioned in this link. What I am struggling to find out is how we can implement the same with Azure filters.
Please let me know if anyone has got any solutions around this.
Filters are applied on the non-analyzed representation of the data, so I don't think there's any way to do any type of linguistic analysis on filters. One way to work around this is to manually create a field which only does lowercasing + ASCII folding (no tokenization) and then issue Lucene queries that look like this:
"normal search query terms" AND customFilterColumn:"filtérValuèWithÄccents"
Basically the document would both need to match the search terms in any field AND also match the filter term in the “customFilterColumn”. This may not be sufficient for your needs, but at least you understand the art of the possible.
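As a sketch, the relevant fragment of the index definition for such a field and its custom analyzer might look like the following (folding_keyword is a made-up analyzer name; customFilterColumn matches the field used in the query above):

"fields": [
  { "name": "customFilterColumn", "type": "Edm.String", "searchable": true, "analyzer": "folding_keyword" }
],
"analyzers": [
  {
    "name": "folding_keyword",
    "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
    "tokenizer": "keyword_v2",
    "tokenFilters": [ "lowercase", "asciifolding" ]
  }
]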
Using filters, it won't work unless you specify all the possibilities in advance, for example:
$filter=name eq 'unicorn' or name eq 'únicorn'
You would be better off using a different analyzer that reduces accented characters to their base form. As another possibility, you can try fuzzy search:
search=unicorn~&highlight=Name

Solr Fuzzy search (max 2 edits)

I am using Solr 6.0.0
I am using the data-driven schema configuration. Most of the configuration is standard.
I have a document in Solr with
name:"aquickbrownfox"
Now if I do a fuzzy search like:
name:aquickbrownfo~0.7
OR
name:aquickbrownf~0.7
It lists out the record in the results.
But if I do a search like:
name:aquickbrown~0.7
It does not list the record.
Does it have something to do with maxEdits in solrconfig.xml, which is set to 2?
I tried increasing it, but I could not create a collection with that configuration. It gave an error:
ERROR: Error CREATEing SolrCore 'my-search': Unable to create core
[my-search] Caused by: Invalid maxEdits
A maximum of 2 edits seems to be a serious limitation. I wonder what the use is of passing a fractional value after the ~ operator.
My Usecase:
I have a contact database and am supposed to detect duplicates based on three parameters: Name, Email and Phone, so I rely on Solr for fuzzy search. Email and Phone are relatively easy to handle with simple assumptions; Name seems to be trickier. For each word in the Name, I plan to do a fuzzy search. I expected the optional parameter after ~ to work without the maxEdits distance limitation.
The documentation no longer suggests using a fractional value after the tilde - see http://lucene.apache.org/core/4_6_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#Fuzzy_Searches for more information.
However, you are correct that only 2 changes are allowed to be made to the search string in order to carry out a fuzzy search. I would guess this limitation strikes a balance between efficiency and usefulness.
The maxEdits parameter in solrconfig.xml applies to the DirectSpellChecker configuration, and doesn't affect your searching, unless you're using the spell checker.
For your use case, your best approach may be to index the name field twice, using different field configurations: one using a simple set of analyzers and filters (i.e. StandardTokenizerFactory, StandardFilterFactory, LowerCaseFilterFactory), and the other using a phonetic matcher such as the Beider-Morse filter. You can use the first field to carry out fuzzy searches, and the second to look for names which may be spelled differently but sound the same as the name being checked.
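A rough schema.xml sketch of what those two field configurations could look like (field and type names are illustrative, not from the original answer):

<fieldType name="text_simple" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<fieldType name="text_phonetic" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.BeiderMorseFilterFactory" nameType="GENERIC" ruleType="APPROX" concat="true"/>
  </analyzer>
</fieldType>

<field name="name" type="text_simple" indexed="true" stored="true"/>
<field name="name_phonetic" type="text_phonetic" indexed="true" stored="false"/>
<copyField source="name" dest="name_phonetic"/>

Fuzzy queries (e.g. name:smith~2) would then go against name, while queries against name_phonetic match spellings that sound alike.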

Neo4j quick lookup of string within property of many nodes

I'm using Neo4j 2.2.2 and I'm currently using a regex search to find a string in the name property across 600k nodes.
Each node is structured with a minimum of the following two properties.
{
  name: 'some string of text',
  sid: 12345
}
I've created an index on name and another index on sid. Lookups on sid are very fast. Searches [using regex] are very slow. Currently I'm searching for a string with a * before and after.
What can be done with neo to make searching for string within a property very fast?
If doing something special within Neo4j is not ideal, I could theoretically stand up some supplementary algorithm/service separate from Neo4j that searches for a string value within the name property and then gives me the sid, which is then used to look up the node within Neo4j.
Help me do fast string search with neo4j, please. :)
You can use legacy fulltext indexing to speed up your search. This blog shows you how.
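A rough sketch of how that can look on Neo4j 2.x (names_ft is a made-up index name, and existing nodes still have to be added to the legacy index, either explicitly or via auto-indexing):

Create a legacy fulltext index over the REST API:
POST http://localhost:7474/db/data/index/node
{ "name": "names_ft", "config": { "type": "fulltext", "provider": "lucene" } }

Then query it from Cypher with Lucene syntax (the indexed values are tokenized into words):
START n=node:names_ft("name:string*")
RETURN n.name, n.sid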
In general, regexes are very expensive. From my point of view, you should find another solution for that.
Could you please tell us more about your use case and why you want to use regex?
One solution you already suggested yourself: store the sid and name in another format (or database) that has better performance for regex searching than Neo4j.
Or do some analysis of the name property's content and, based on that, create a representation of the content as a graph.
e.g.
* a node for the number of letters in the name property
* a node for the starting letter
* split the name property into multiple properties
* etc.
