Why use APOC instead of Cypher for simple queries? - neo4j-apoc

I have written (and optimized, using "PROFILE") a Cypher query that returns the neighbors of a given node. Now I have found an APOC procedure (apoc.neighbors.athop) that seems to do the same thing.
Is the APOC version better? Faster? More robust?
I understand the value of APOC when there is no counterpart in regular Cypher for the given behavior. In the case of collecting neighbors, the Cypher seems easy:
MATCH (target:SomeLabel)
WITH target
MATCH (target)-[:ADJOINS]-(neighbor:SomeLabel)
RETURN target, neighbor
As I understand it, the APOC counterpart is:
MATCH (target:SomeLabel)
WITH target
CALL apoc.neighbors.athop(target, "ADJOINS", 1)
YIELD node
RETURN node
Why would I choose the latter over the former?

An OPTIONAL MATCH clause achieves a similar result (it additionally returns target nodes that have no neighbors, with neighbor bound to null):
MATCH (target:SomeLabel)
OPTIONAL MATCH (target)-[:ADJOINS]-(neighbor:SomeLabel)
RETURN target, neighbor
(For the record, OPTIONAL MATCH is not a recent addition; it has been part of Cypher since Neo4j 2.0, so it was already available when this question was asked.)
On the other hand, the APOC procedures apoc.neighbors.athop/byhop/tohop let you pass the relationship pattern, i.e. type and direction (incoming, outgoing, bidirectional), and the distance, i.e. the number of hops between two nodes, as dynamic parameters. This means you can determine the relationship pattern and traversal depth in your application and pass them as parameters through the Neo4j driver (see the driver sketch after the example below).
Example for apoc.neighbors.byhop:
MATCH (p:Person {name: "Praveena"})
CALL apoc.neighbors.byhop(p, "KNOWS", 2)
YIELD nodes
RETURN nodes
This query would traverse these relationships:
(praveena)-[:FOLLOWS]->(joe)
(praveena)-[:FOLLOWS]->(joe)-[:FOLLOWS]->(mark)
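Because the relationship specification is just a string argument, the whole pattern can come from application code. Here is a minimal sketch (not from the original answer) of passing the relationship type and hop count as query parameters through the official Neo4j Python driver; the URI, credentials, label, and property names are illustrative assumptions:

from neo4j import GraphDatabase

# Placeholder connection details -- adjust for your deployment.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def neighbors_at_hop(name, rel_type, distance):
    # The relationship pattern and depth are plain Cypher parameters,
    # so they can be decided at runtime instead of baked into the query.
    query = """
        MATCH (p:Person {name: $name})
        CALL apoc.neighbors.athop(p, $relType, $distance)
        YIELD node
        RETURN node
    """
    with driver.session() as session:
        result = session.run(query, name=name, relType=rel_type, distance=distance)
        return [record["node"] for record in result]

# "FOLLOWS>" restricts the traversal to outgoing FOLLOWS relationships.
nodes = neighbors_at_hop("Praveena", "FOLLOWS>", 2)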

Related

ArangoDB REGEX_TEST index acceleration?

Is there a way to use an index while performing REGEX_TEST() on a string field to retrieve documents in ArangoDB?
Also, if there is any way to optimize this, please let me know.
There is no index acceleration available for the REGEX_TEST() AQL function, and it is unlikely to come in the future. This is not because there is no interest from users and developers, but because it's not really possible to build an index data structure that would allow speeding up regular expression evaluation.
Regular expressions as supported by ArangoDB allow for many different types of expressions, but because they can differ so much, there is almost no chance to have a suitable index. For equality comparisons there are hash indexes, which are probably the fastest kind of index. For range queries there are skiplist indexes, and there are of course quite a few more index types known in computer science, but I'm not aware of a single one that could speed up arbitrary regex.
If your expression allows it, maybe there is a chance to add a filter criterion before REGEX_TEST() which can utilize an index. This will mostly be limited to case-sensitive prefix matching, e.g. FILTER REGEX_TEST(doc.str, "a[a-z]*") could be extended to FILTER doc.str >= "a" AND doc.str < "b" AND REGEX_TEST(doc.str, "a[a-z]*"), allowing a skiplist index to be used so the regex is only evaluated on documents where str starts with "a". Some simple regexes like [fm]oo|bar could even be rewritten to a set of equality comparisons: FILTER doc.str IN ["foo","moo","bar"]. Also have a look at ArangoSearch.
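As a concrete illustration, here is a hedged sketch of that prefix-bounded filter through the python-arango client; the database name, credentials, and collection name are assumptions for the example:

from arango import ArangoClient

# Placeholder connection details.
db = ArangoClient().db("mydb", username="root", password="passwd")

# The range condition can be served by a skiplist/persistent index on
# "str", so REGEX_TEST() only runs on documents whose str starts with "a".
aql = """
FOR doc IN docs
  FILTER doc.str >= "a" AND doc.str < "b"
  FILTER REGEX_TEST(doc.str, "a[a-z]*")
  RETURN doc
"""
for doc in db.aql.execute(aql):
    print(doc["str"])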

Azure Search - exact match as first or single result

I'm using Azure Search with the rich Lucene Query Parser syntax. I append "~1" to a term to allow an edit distance of one symbol. But I've run into the problem that results are not ordered with the exact match first. (For example, "blue~1" returns "blues", "blue", "glue". Or when searching for a product SKU like "P002", I get "P003", "P005", "P004", "P002", "P001", "P006".)
So my question: is there a way to specify that the entity with an exact match must be first in the list, or be the single search result, even when I'm using fuzzy search with "~1"?
With the Lucene query syntax you can boost individual subqueries, for example: term^2 | term~1 - this translates to "find documents that match 'term' OR 'term' with edit distance 1, and score the exact matches higher relative to fuzzy matches by a factor of two".
search=blue^2|blue~1&queryType=full
There is no guarantee that the exact match will always be first in the results set as the document score is a function of term frequency and inverse document frequency. If the fuzzy sub-query expands the input term to a term that's very unique in your document corpus you may need to bump the boosting factor (2 in my example). In general, relying on the relevance score for ordering is not a practical idea. Take a look at my answer in the following post for more information: Azure Search scoring
Let me know if this helps
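For reference, here is a minimal sketch (mine, not the answerer's) of sending that boosted query through the Azure Search REST API with Python's requests library; the service name, index name, API version, and key are placeholders:

import requests

url = "https://myservice.search.windows.net/indexes/products/docs/search"
headers = {"api-key": "<query-key>", "Content-Type": "application/json"}
body = {
    "search": "blue^2 | blue~1",  # boost the exact term over the fuzzy one
    "queryType": "full",          # enable the full Lucene query syntax
}
resp = requests.post(url, params={"api-version": "2020-06-30"},
                     headers=headers, json=body)
for doc in resp.json()["value"]:
    # Higher @search.score should now favor the exact match.
    print(doc["@search.score"], doc)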

Negative filterVertices option for traversal

The GRAPH_TRAVERSAL function has an option called filterVertices, which the documentation says is used to let only those vertices matching the examples pass through. Is there a negative version of this, e.g. to allow everything except the vertices matching the filter?
In many cases this would be useful, e.g. traverse everything except those marked disabled (or old-version) or something like that. Of course this can be done with a JS function, but why not built-in?
You're right, it's currently not possible; if you want to use GRAPH_TRAVERSAL you have to write your own visitor function.
However, the recommended way is to use the new pattern matching where you can use FILTER statements like this:
db._query("FOR vertex IN 1..3 OUTBOUND 'circles/A' GRAPH
'traversalGraph' FILTER vertex._key != 'G' return v._key")
.toArray();
This lets you use arbitrary filter expressions on vertices, edges, and paths and their sub-parts.
In general, our development focus will be on the pattern-matching traversals and doing as much as possible in AQL. If you would like to implement such a feature for the general graph module, contributions are always welcome.
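Applied to the original use case (skip vertices marked disabled), a hedged sketch via the python-arango client might look like this; the graph, collection, and attribute names are assumptions:

from arango import ArangoClient

db = ArangoClient().db("mydb", username="root", password="passwd")

# Negative vertex filter: traverse everything except disabled vertices.
aql = """
FOR v IN 1..3 OUTBOUND 'circles/A' GRAPH 'traversalGraph'
  FILTER v.disabled != true
  RETURN v._key
"""
for key in db.aql.execute(aql):
    print(key)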

Optimized way of negating values in Solr?

I am trying to search for results excluding a particular id in Solr. I have found that this can be done in two ways:
(1) fq=userid:(-750376)
(2) fq=-userid:750376
Both are working fine and both give correct results. But can anyone tell me which of the two is the better way? Which one should I prefer?
You can find out what query the fq parameter's value is parsed into by turning on debugQuery (add the parameter debug=true). Then, in the Solr response, there should be an entry "parsed_filter_queries" under "debug", and the entry should show the string representation of the parsed filter query (or queries) being used.
In your case, both forms of fq should be parsed into the same query, i.e. a boolean query with a single clause stating that the term userid:750376 must not occur. Therefore, which form you use does not matter, at least in terms of correctness or performance.
To us the two queries look slightly different, but to Solr both are the same.
Solr first parses the query you provide, then searches for the results. In your case, for both of these filter queries:
fq=userid:(-750376)
fq=-userid:750376
Solr's "parsed_filter_queries" entry is -userid:750376.
You can check this by enabling debugQuery from the Admin UI, or by passing debugQuery=true with the query. Hope this helps.
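To see this yourself, a minimal sketch of the debug request with Python's requests library (host, core, and field names are placeholders):

import requests

resp = requests.get(
    "http://localhost:8983/solr/mycore/select",
    params={"q": "*:*", "fq": "-userid:750376",
            "debugQuery": "true", "wt": "json"},
)
# Both fq spellings should show up here in the same parsed form.
print(resp.json()["debug"]["parsed_filter_queries"])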

Search with attribute-value correspondence in Lucene

Here's a text with ambiguous words:
"A man saw an elephant."
Each word has attributes: lemma, part of speech, and various grammatical attributes depending on its part of speech.
For "saw" it is like:
{lemma: see, pos: verb, tense: past}, {lemma: saw, pos: noun, number: singular}
All these attributes come from third-party tools; Lucene itself is not involved in the word disambiguation.
I want to perform a query like "pos=verb & number=singular" and NOT get "saw" in the result (no single parse of "saw" is both a verb and singular).
I thought of encoding distinct grammatical annotations into strings like "l:see;pos:verb;t:past|l:saw;pos:noun;n:sg" and searching for regexp "pos\:verb[^\|]+n\:sg", but I definitely can't afford regexp queries due to performance issues.
Maybe some hacks with posting list payloads can be applied?
UPD: A draft of my solution
Here are the specifics of my project: there is a fixed maximum of parses a word can have (say, 8).
So, I thought of inserting the parse number into each attribute's payload and using this payload at the posting-list intersection stage.
E.g., we have a posting list for 'pos = Verb' like ...|...|1.1234|...|..., and a posting list for 'number = Singular': ...|...|2.1234|...|...
While processing a query like 'pos = Verb AND number = Singular', the 'x.1234' entries would be accepted at all stages of posting-list processing until the intersection stage, where they would be rejected because their parse numbers do not correspond.
I think this is a pretty compact solution, but how hard would be incorporating it into Lucene?
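To make the idea concrete, here is a plain-Python sketch of that parse-number intersection (illustrative only, not Lucene internals):

def intersect(postings_a, postings_b):
    # Intersect two sorted posting lists of (doc_id, parse_no) pairs;
    # a match requires agreement on both the document and the parse.
    i, j, out = 0, 0, []
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:
            out.append(postings_a[i])
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1
        else:
            j += 1
    return out

# Doc 1234: parse 1 is the verb reading, parse 2 the noun reading.
pos_verb = [(1234, 1)]         # posting list for 'pos = Verb'
number_singular = [(1234, 2)]  # posting list for 'number = Singular'

# The document matches both terms, but never within the same parse,
# so the intersection is empty and "saw" is correctly excluded.
print(intersect(pos_verb, number_singular))  # []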
So... the cheater way of doing this is (indeed) to control how you build the Lucene index.
When constructing the Lucene index, modify each word before Lucene indexes it so that it includes all the necessary attributes of the word. If you index things this way, you must do lookups in the same way.
One way:
For each type of query you run, you must build the index tokens in the same way.
Example:
saw (noun parse) becomes noun-saw -- index it as that.
saw (verb parse) also becomes verb-past-see -- index it as that.
saw (noun parse) also becomes noun-singular-saw -- index it as that.
The other way:
If you want attribute-based lookup in a single index, you'd probably have to do something like permutation expansion on the word 'saw', so that instead of just noun-saw you'd index all possible permutations of the necessary attributes and combine them in a big logic statement at query time.
Not sure if this is a good answer, but that's all I could think of.
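For what it's worth, here is a hedged Python sketch of that expansion step (the token format and helper are made up for illustration): each parse emits one combined token per subset of its attributes, so a conjunctive query matches only if a single parse carries all the queried attributes.

from itertools import combinations

def expand_tokens(parses):
    # Emit one indexable token per attribute subset of each parse.
    for parse in parses:
        items = sorted(parse.items())  # canonical attribute order
        for r in range(1, len(items) + 1):
            for combo in combinations(items, r):
                yield "-".join(f"{k}:{v}" for k, v in combo)

saw_parses = [
    {"lemma": "see", "pos": "verb", "tense": "past"},
    {"lemma": "saw", "pos": "noun", "number": "singular"},
]
tokens = set(expand_tokens(saw_parses))

# "pos=verb AND number=singular" becomes one combined token; it is
# absent because no single parse has both attributes.
print("number:singular-pos:verb" in tokens)  # False
print("pos:verb-tense:past" in tokens)       # True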
