Negative filterVertices option for traversal - arangodb

The GRAPH_TRAVERSAL function has an option called filterVertices, which the documentation states is used to only allow vertices matching the examples to go through. Is there any negative version of this, e.g. to allow everything except those matching the filter?
In many cases this would be useful, e.g. traverse everything except those marked disabled (or old-version) or something like that. Of course this can be done with a JS function, but why not built-in?

You're right, it's currently not possible, and if you want to use GRAPH_TRAVERSAL you have to write your own visitor function.
However, the recommended way is to use the new pattern matching where you can use FILTER statements like this:
db._query("FOR vertex IN 1..3 OUTBOUND 'circles/A' GRAPH
'traversalGraph' FILTER vertex._key != 'G' return v._key")
.toArray();
This way you can use arbitrary filter expressions on vertices, edges, paths and their sub-parts.
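For the original use case (traverse everything except vertices marked disabled), a plain vertex filter works the same way; the disabled attribute here is just the asker's hypothetical example:
db._query("FOR vertex IN 1..3 OUTBOUND 'circles/A' GRAPH 'traversalGraph' FILTER vertex.disabled != true RETURN vertex._key").toArray();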
In general our development focus will be on the pattern-matching traversals and doing as much as possible in AQL. If you would like to implement such a feature for the general graph module, contributions are always welcome.

Related

Find any values from a set/list within a string

Long-time lurker, first-time poster. I'm hoping to get some advice from the brilliant minds in this community. In the project I'm working on, the goal is to look at a user-provided string and determine if the content of that string contains any (one or many) matches to a list of match criteria. For example:
User-provided string: "I like thing a and thing b"
Match List:

Match Criteria | Match Type                  | Category
Foo            | Exact (Case Insensitive)    | Bar
Thing a        | Contains (Case Insensitive) | Things
Thing b        | Contains (Case Insensitive) | Stuff
In this case, it would return the following matches:
Thing a > Things
Thing b > Stuff
As of now, my approach is to iterate through the match criteria list and check each list item against the user-supplied string using the Match Type specified (Exact, Contains, Regular Expression), returning a list of the matches and then doing some stuff with that list. This approach works, even when matching ~100 rules and handling a 200-record batch, but it seems obvious that the performance will be pretty terrible if a large number of rules is introduced.
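For concreteness, here's a minimal sketch of that loop in Apex (MatchRule and its fields are made-up names standing in for however the rules are actually stored):
public class RuleMatcher {
    public class MatchRule {
        public String criteria;
        public String matchType;  // 'Exact', 'Contains' or 'Regex'
        public String category;
    }

    public static List<MatchRule> findMatches(String input, List<MatchRule> rules) {
        List<MatchRule> hits = new List<MatchRule>();
        for (MatchRule r : rules) {
            Boolean matched = false;
            if (r.matchType == 'Exact') {
                matched = input.equalsIgnoreCase(r.criteria);
            } else if (r.matchType == 'Contains') {
                matched = input.containsIgnoreCase(r.criteria);
            } else {  // 'Regex'
                matched = Pattern.compile(r.criteria).matcher(input).find();
            }
            if (matched) {
                hits.add(r);
            }
        }
        return hits;
    }
}
(Compiling the regex on every call is wasteful; caching compiled patterns per rule would help, but the loop is still O(rules x records) either way.)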
Is there a better way to do this that's supported in Apex and callable from a trigger? I would love to learn a more sophisticated approach if there is one.
Thanks in advance!
What do you need it for? Is it a pure Apex exercise, or is it "close" to certain standard sObjects? There are lots of built-in features around "fuzzy matching".
In no specific order...
Have you looked into "all things Einstein", from categorising leads to predicting how likely an opportunity is to close? It might not be the direction you expected to take, but who knows.
Obviously SOSL comes to mind; it's what powers the global search. It automatically does some substitutions for you, like Mike -> Michael (there's a small SOSL sketch after this list).
Matching rules, duplicate rules. You'd have a limit of, say, 5 active rules, but you could hook them up from Apex, including creative abuse of the system. "Dear Salesforce, let's pretend I'm making such and such Opportunity, can you find me similar Opportunities?" (plot twist: you're not making an Oppty at all, you're creating an account for some venture capitalist looking for investments that match his preferences). Give matching rules a go, if not for everything then at least for more creative fuzzy matching. You really don't want to implement soundex, levenshtein etc. manually...
Tags? There's a somewhat forgotten feature from SF Classic; it creates a bunch of tables (AccountTag, ContactTag). This plus SOSL could be close to what you need.
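To give an idea, a single 'Contains'-style criterion as a SOSL probe could be as small as this (the objects and fields returned are placeholders; adapt them to wherever your records live):
List<List<SObject>> results = [FIND 'thing a' IN ALL FIELDS RETURNING Account(Id, Name), Contact(Id, Name)];
// results[0] holds the matching Accounts, results[1] the matching Contacts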
Additionally if you need this for anything close to Knowledge Base:
Data Categories come to mind
KB supports synonyms, letting you define your (not very intuitive) "thing b => stuff" mapping.
and it should survive translations

ArangoDB REGEX_TEST index acceleration?

Is there a way to use an index when performing REGEX_TEST() on a string field to retrieve documents in ArangoDB?
Also, if there is any way to optimize this, please let me know.
There is no index acceleration available for the REGEX_TEST() AQL function, and it is unlikely to come in the future. Not because there is no interest from users and developers, but because it's not really possible to build an index data structure that would speed up regular expression evaluation.
Regular expressions as supported by ArangoDB allow for many different types of expressions, but because they can differ so much, there is almost no chance of having a suitable index. For equality comparisons there are hash indexes, which are probably the fastest kind of index. For range queries there are skiplist indexes, and there are of course quite a few more index types known in computer science, but I'm not aware of a single one that could speed up arbitrary regexes.
If your expression allows, maybe there is a chance to add a filter criterion before REGEX_TEST() which can utilize an index? This will mostly be limited to case-sensitive prefix matching, e.g. FILTER REGEX_TEST(doc.str, "a[a-z]*") could be extended to FILTER doc.str >= "a" AND doc.str < "b" AND REGEX_TEST(doc.str, "a[a-z]*"), allowing a skiplist index to be used so the regex is only evaluated on documents where str starts with a. Or some simple regex like [fm]oo|bar could be rewritten to a set of equality comparisons: FILTER doc.str IN ["foo","moo","bar"]. Also have a look at ArangoSearch.
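In arangosh the prefix trick looks like this (the collection name things and attribute str are made up):
db.things.ensureIndex({ type: "skiplist", fields: ["str"] });
db._query("FOR doc IN things FILTER doc.str >= 'a' AND doc.str < 'b' AND REGEX_TEST(doc.str, 'a[a-z]*') RETURN doc").toArray();
The range condition is satisfied from the skiplist index, and REGEX_TEST() then runs only on the already-narrowed candidate set.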

How does one use an IndexedTraversal in Control.Lens to perform an index-aware action for each element?

I have an IndexedTraversal (from the Control.Lens package) and I would like to apply an index-aware monadic action to each element in it. Unfortunately, all of the convenient ways I can see of doing something like this, such as ^! combined with the act function, seem to ignore the index of each element. Is there a nice way to run an action for every element (and its index) in an indexed traversal?
Does imapMOf work? You would use it as imapMOf someIndexedTraversal actionWithIndex dataStructure I think.
If you just need to perform an action, there's also imapMOf_ in Control.Lens.Fold.
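A minimal sketch of both, using itraversed on a list so the index is the list position (that concrete traversal is just for illustration):
import Control.Lens

main :: IO ()
main = do
  -- effects only: imapMOf_ from Control.Lens.Fold
  imapMOf_ itraversed (\i x -> putStrLn (show i ++ ": " ++ x)) ["foo", "bar", "baz"]
  -- rebuild the structure while acting on each element
  ys <- imapMOf itraversed step [10, 20, 30]
  print ys  -- [10,21,32]
  where
    step :: Int -> Int -> IO Int
    step i x = do
      putStrLn ("visiting index " ++ show i)
      return (x + i)
(In later lens versions, itraverseOf and itraverseOf_ do the same job under less surprising names.)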
I haven't used indexed traversals much, but I find the API a bit confusing. Most of the time I use lenses with either ^. or ^!, but for indexed traversals it seems the usual way is to use one of the special index-aware functions, which seems a bit different.

Search with attribute values correspondence in Lucene

Here's a text with ambiguous words:
"A man saw an elephant."
Each word has attributes: lemma, part of speech, and various grammatical attributes depending on its part of speech.
For "saw" it is like:
{lemma: see, pos: verb, tense: past}, {lemma: saw, pos: noun, number: singular}
All these attributes come from third-party tools; Lucene itself is not involved in the word disambiguation.
I want to perform a query like "pos=verb & number=singular" and NOT to get "saw" in the result.
I thought of encoding distinct grammatical annotations into strings like "l:see;pos:verb;t:past|l:saw;pos:noun;n:sg" and searching for regexp "pos\:verb[^\|]+n\:sg", but I definitely can't afford regexp queries due to performance issues.
Maybe some hacks with posting list payloads can be applied?
UPD: A draft of my solution
Here are the specifics of my project: there is a fixed maximum number of parses a word can have (say, 8).
So, I thought of inserting the parse number into each attribute's payload and using this payload at the posting-list intersection stage.
E.g., we have a posting list for 'pos = Verb' like ...|...|1.1234|...|..., and a posting list for 'number = Singular': ...|...|2.1234|...|...
While processing a query like 'pos = Verb AND number = Singular', the 'x.1234' entries would be accepted at all stages of posting-list processing until the intersection stage, where they would be rejected because of non-corresponding parse numbers.
I think this is a pretty compact solution, but how hard would it be to incorporate into Lucene?
So... the cheater way of doing this is (indeed) to control how you build the Lucene index.
When constructing the index, modify each word before Lucene indexes it so that it includes all the necessary attributes of the word. If you index things this way, you must do the lookup in the same way.
One way:
This means for each type of query you do, you must also build an index in the same way.
Example:
saw (noun parse) becomes noun-saw -- index it as that.
saw (verb parse) also becomes verb-past-see -- index it as that.
saw (noun parse) also becomes noun-singular-saw -- index it as that.
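A rough sketch of that scheme with Lucene (the field name "token", the analyzer and the directory are all just assumptions; one Document per word occurrence):
import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class ParseIndexer {
    public static void main(String[] args) throws Exception {
        Directory dir = new ByteBuffersDirectory(); // RAMDirectory on older Lucene versions
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new KeywordAnalyzer()));

        // one document per word occurrence; each parse contributes combined
        // tokens for every attribute combination you plan to query on,
        // with attributes in a fixed canonical order
        Document saw = new Document();
        saw.add(new StringField("token", "noun-saw", Field.Store.NO));
        saw.add(new StringField("token", "noun-singular-saw", Field.Store.NO));
        saw.add(new StringField("token", "verb-see", Field.Store.NO));
        saw.add(new StringField("token", "verb-past-see", Field.Store.NO));
        writer.addDocument(saw);
        writer.close();

        // a query for pos=verb & number=singular becomes a lookup for
        // "verb-singular-*" (e.g. a PrefixQuery), which matches nothing here,
        // so the ambiguous "saw" is correctly not returned
    }
}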
The other way:
If you want attribute-based lookup in a single index, you'd probably have to do something like permutation completion on the word 'saw', so that instead of noun-saw you'd have all possible permutations of the necessary attributes in a big logic statement.
Not sure if this is a good answer, but that's all I could think of.

Identifying frequent formulas in a codebase

My company maintains a domain-specific language that syntactically resembles the Excel formula language. We're considering adding new builtins to the language. One way to do this is to identify verbose commands that are repeatedly used in our codebase. For example, if we see people always write the same 100-character command to trim whitespace from the beginning and end of a string, that suggests we should add a trim function.
Seeing a list of frequent substrings in the codebase would be a good start (though sometimes the frequently used commands differ by a few characters because of different variable names used).
I know there are well-established algorithms for doing this, but first I want to see if I can avoid reinventing the wheel. For example, I know this concept is the basis of many compression algorithms, so is there a compression module that lets me retrieve the dictionary of frequent substrings? Any other ideas would be appreciated.
The string matching is just the low-hanging fruit, the obvious cases. The harder cases are where you're doing similar things but in a different order. For example, suppose you have:
X+Y
Y+X
Your string-matching approach won't realize that those are effectively the same. If you want to go a bit deeper, I think you need to parse the formulas into an AST and actually compare the ASTs. If you did that, you could see that the trees are actually the same, since the binary operator '+' is commutative.
You could also apply reduction rules to reduce complex expressions to simpler ones, for example:
(X * A) + ( X * B)
X * ( A + B )
Those are also the same! String matching won't help you there.
Parse into AST
Reduce and Optimize the functions
Compare the resulting AST to other ASTs
If you find a match then replace them with a call to a shared function.
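A toy sketch of the parse-and-compare steps, using Python's ast module as a stand-in parser (your DSL would need its own parser, and real reduction rules are a much bigger job):
import ast

COMMUTATIVE = (ast.Add, ast.Mult)

def normalize(node):
    # return a hashable canonical form; operands of commutative
    # operators are sorted so X+Y and Y+X come out identical
    if isinstance(node, ast.BinOp):
        left, right = normalize(node.left), normalize(node.right)
        op = type(node.op).__name__
        if isinstance(node.op, COMMUTATIVE):
            left, right = sorted([left, right])
        return (op, left, right)
    if isinstance(node, ast.Name):
        return ('var', node.id)
    if isinstance(node, ast.Constant):
        return ('const', repr(node.value))
    raise ValueError('unhandled node: ' + ast.dump(node))

def canon(formula):
    return normalize(ast.parse(formula, mode='eval').body)

print(canon('X+Y') == canon('Y+X'))          # True: '+' is commutative
print(canon('X*A+X*B') == canon('X*(A+B)'))  # False: needs a distributivity rewrite
Counting canonical forms across the codebase then surfaces the frequent ones.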
I would think you could use an existing full-text indexer like Lucene, and implement your own Analyzer and Tokenizer that is specific to your formula language.
You'd then be able to run queries and see the most used formulas, which ones appear next to each other, etc.
Here's a quick article to get you started:
Lucene Analyzer, Tokenizer and TokenFilter
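A minimal custom Analyzer might start like this (package paths shift between Lucene versions, and treating letters, digits and underscores as token characters is just one made-up choice for a formula language):
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.util.CharTokenizer;

public class FormulaAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        // emit identifiers and numbers as tokens, splitting on
        // operators, parentheses and whitespace
        Tokenizer source = new CharTokenizer() {
            @Override
            protected boolean isTokenChar(int c) {
                return Character.isLetterOrDigit(c) || c == '_';
            }
        };
        return new TokenStreamComponents(source, new LowerCaseFilter(source));
    }
}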
You might want to look into tag-cloud generators. I couldn't find any source in the minute that I spent looking, but here's an online one:
http://tagcloud.oclc.org/tagcloud/TagCloudDemo, which probably won't work since it uses spaces as delimiters.
