How to use the term position parameter in Xapian query constructors

Xapian docs talk about a query constructor that takes a term position parameter, to be used in phrase searches:
Quote:
This constructor actually takes a couple of extra parameters, which
may be used to specify positional and frequency information for terms
in the query:
Xapian::Query(const string & tname_,
Xapian::termcount wqf_ = 1,
Xapian::termpos term_pos_ = 0)
The term_pos represents the position of the term in the query. Again,
this isn't useful for a single term query by itself, but is used for
phrase searching, passage retrieval, and other operations which
require knowledge of the order of terms in the query (such as
returning the set of matching terms in a given document in the same
order as they occur in the query). If such operations are not
required, the default value of 0 may be used.
And in the reference, we have:
Xapian::Query::Query ( const std::string & tname_,
Xapian::termcount wqf_ = 1,
Xapian::termpos pos_ = 0
)
A query consisting of a single term.
And:
typedef unsigned termpos
A term position within a document or query.
So, say I want to build a query for the phrase: "foo bar baz", how do I go about it?!
Does term_pos_ provide relative position values, i.e., define the order of terms within the document:
(I'm using the Python bindings API here, as I'm more familiar with it.)
q = xapian.Query(xapian.Query.OP_AND, [xapian.Query("foo", wqf, 1), xapian.Query("bar", wqf, 2), xapian.Query("baz", wqf, 3)])
And just for the sake of testing, suppose we did:
q = xapian.Query(xapian.Query.OP_AND, [xapian.Query("foo", wqf, 3), xapian.Query("bar", wqf, 4), xapian.Query("baz", wqf, 5)])
So this would give the same results as the previous example?!
And suppose we have:
q = xapian.Query(xapian.Query.OP_AND, [xapian.Query("foo", wqf, 2), xapian.Query("bar", wqf, 4), xapian.Query("baz", wqf, 5)])
So now this would match documents where "foo" and "bar" are separated by one term, followed by "baz"?
Is that how it works, or does this parameter refer to absolute positions of the indexed terms?
Edit:
And how is OP_PHRASE related to this? I found some online samples using OP_PHRASE like this:
q = xapian.Query(xapian.Query.OP_PHRASE, term_list)
This makes obvious sense, but then what is the role of the term_pos_ constructor parameter in phrase searches? Is it a more surgical way of doing things?

// Build positional term sub-queries and combine them with OP_PHRASE.
// (querylist here is some pre-existing container of sub-queries.)
int pos = 1;
std::list<Xapian::Query> subs;
subs.push_back(Xapian::Query("foo", 1, pos++)); // wqf = 1, positions 1, 2, ...
subs.push_back(Xapian::Query("bar", 1, pos++));
querylist.push_back(Xapian::Query(Xapian::Query::OP_PHRASE, subs.begin(), subs.end()));
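For comparison, a Python-bindings version of that snippet might look like this (a minimal sketch; it only shows the mechanics of attaching wqf and positions to sub-queries, not an answer to the relative-vs-absolute question):

import xapian

# Phrase query built from plain terms, as in the online samples:
q1 = xapian.Query(xapian.Query.OP_PHRASE, ["foo", "bar", "baz"])

# The more "surgical" variant, mirroring the C++ snippet above:
# each sub-query carries wqf = 1 and an explicit query position 1, 2, 3.
subs = [xapian.Query(term, 1, pos)
        for pos, term in enumerate(["foo", "bar", "baz"], start=1)]
q2 = xapian.Query(xapian.Query.OP_PHRASE, subs)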

How can I generate a graph by constraining it to be subisomorphic to a given graph, while not subisomorphic to another?

TL;DR: How can I generate a graph while constraining it to be subisomorphic to every graph in a positive list while not being subisomorphic to any graph in a negative list?
I have a list of directed heterogeneous attributed graphs labeled as positive or negative. I would like to find the smallest list of patterns (graphs with special values) such that:
Every input graph has a pattern that matches (= 'P is subisomorphic to G, and the mapped nodes have the same attribute values')
A positive pattern can only match a positive graph
A positive pattern does not match any negative graph
A negative pattern can only match a negative graph
A negative pattern does not match any positive graph
Example:
Input: g1(+), g2(-), g3(+), g4(+), g5(-), g6(+)
Acceptable solution: p1(+), p2(+), p3(-), where p1(+) matches g1(+) and g4(+); p2(+) matches g3(+) and g6(+); and p3(-) matches g2(-) and g5(-)
Non-acceptable solution: p1(+), p2(-), where p1(+) matches g1(+), g2(-), g3(+); p2(-) matches g4(+), g5(-), g6(+)
Currently, I'm able to generate graphs matching every graph in a list, but I can't manage to enforce the constraint 'A positive pattern does not match any negative graph'. I made a predicate 'matches', which takes as input a pattern and a graph, and uses a local array of variables 'mapping' to try and map nodes together. But when I try to use that predicate in a negative context, the following error is returned: MiniZinc: flattening error: free variable in non-positive context.
How can I bypass that limitation? I tried to code the opposite predicate 'not_matches', but I've not yet found how to specify 'for all node mappings, the isomorphism is invalid'. I also can't define the mapping outside the predicate, because a pattern can match a graph more than once and I need to be able to get all mappings.
Here is a reproducible example:
include "globals.mzn";
predicate p(array [1..5] of var 0..10:arr1, array [1..5] of 1..10:arr2)=
let{array [1..5] of var 1..5: mapping; constraint all_different(mapping)} in (forall(i in 1..5)(arr1[i]=0\/arr1[i]=arr2[mapping[i]]));
array [1..5] of var 0..10:arr;
constraint p(arr,[1,2,3,4,5]);
constraint p(arr,[1,2,3,4,6]);
constraint not p(arr,[1,2,3,5,6]);
solve satisfy;
For that example, the decision variable is an array, and the predicate p is true if a mapping exists such that the values of the arrays are mapped together. One or more elements of the array can also be 0, used here as a wildcard.
[1,2,3,4,0] is an acceptable solution
[0,0,0,0,0] is not acceptable: it matches anything, and the solution should not match [1,2,3,5,6]
[1,2,3,4,7] is not acceptable: it doesn't match anything (as there is no 7 in the parameter arrays)
Thanks in advance! =)
Edit: Added non-acceptable solutions
It is probably good to note that MiniZinc's limitation is not coincidental. When the creation of a free variable is negated, rather than finding a valid assignment for the variable, the model would instead have to prove that no such valid assignment exists. This is a much harder problem, and it would bring MiniZinc into the field of quantified constraint programming. The only general solution (to still receive the same flattened constraint model) would be to iterate over all possible values for each variable and enforce the negated constraints. Since the number of possibilities quickly explodes and the chance of getting a good model is small, MiniZinc does not do this automatically and throws this error instead.
This technique would work in your case as well. In the not_matches version of your predicate, you can iterate over all possible permutations (the possible mappings) and enforce that they are not correct (partial) mappings. This would be a correct way to enforce the constraint, but it would quickly explode. I believe, however, that there is a different way to enforce this constraint that will work better.
My idea stems from the fact that, although the most natural way to describe a permutation from one array to another is to actually create the assignment from the first to the second, when dealing with discrete variables you can instead enforce that each array contains exactly the same number of occurrences of each possible value. As such, a predicate that enforces that X is a permutation of Y might be written as:
predicate is_perm(array[int] of var $$E: X, array[int] of var $$E: Y) =
  let {
    array[int] of int: vals = [i | i in (dom_array(X) union dom_array(Y))]
  } in global_cardinality(X, vals) = global_cardinality(Y, vals);
Notably, this predicate can be negated because it doesn't contain any free variables. All new variables (the resulting values of global_cardinality) are functionally defined. When negated, only the relation = has to be changed to !=.
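For instance, a hypothetical toy model exercising the predicate and its negation (assuming the is_perm definition above and the function form of global_cardinality from "globals.mzn", as used there) might look like:

include "globals.mzn";
% (is_perm as defined above)
array [1..4] of var 1..4: X;
constraint is_perm(X, [2, 1, 4, 3]);     % X must be some reordering of [2,1,4,3]
constraint not is_perm(X, [1, 2, 2, 3]); % flattens to != between the count arrays
solve satisfy;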
In your model, we are not considering just full permutations, but rather partial permutations, with a dummy value (0) for the unused positions. As such, the p predicate might also be written:
predicate p(array [int] of var 0..10: X, array [int] of var 1..10: Y) =
  let {
    set of int: vals = lb_array(Y)..ub_array(Y); % must not include the dummy value
    array[vals] of var int: countY = global_cardinality(Y, [i | i in vals]);
    array[vals] of var int: countX = global_cardinality(X, [i | i in vals]);
  } in forall(i in vals) (countX[i] <= countY[i]);
Again this predicate does not contain any free variables and can be negated. In this case, the forall can be changed into an exists with a negated body, as sketched below.
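A sketch of that negated form, following the same structure as p above:

predicate not_p(array [int] of var 0..10: X, array [int] of var 1..10: Y) =
  let {
    set of int: vals = lb_array(Y)..ub_array(Y);
    array[vals] of var int: countY = global_cardinality(Y, [i | i in vals]);
    array[vals] of var int: countX = global_cardinality(X, [i | i in vals]);
  } in exists(i in vals) (countX[i] > countY[i]); % negation of countX[i] <= countY[i]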
There are a few things that we can still do to optimise p for this use case. First, it seems that global_cardinality is only defined for variables, but since Y is guaranteed to be par, we can rewrite it so the correct counts are computed during MiniZinc's compilation. Second, lb_array(Y)..ub_array(Y) gives the tightest possible set, which means that in your example a slightly different version of the global cardinality function is evaluated for each call, where a single shared version could have been used:
predicate p(array [1..5] of var 0..10: X, array [1..5] of 1..10: Y) =
  let {
    % CHANGE: use the declared values of Y to ensure CSE will reuse the `global_cardinality` result values
    set of int: vals = 1..10; % does not include the dummy value
    % CHANGE: parameter evaluation of global_cardinality
    array[vals] of int: countY = [count(j in index_set(Y)) (i = Y[j]) | i in vals];
    array[vals] of var int: countX = global_cardinality(X, [i | i in 1..10]);
  } in forall(i in vals) (countX[i] <= countY[i]);
Regarding the example: one approach might be to rewrite the not p(...) constraint as a specific not_p(...) constraint, but I'm not sure how that should be formulated.
Here's an example but it's probably not correct:
predicate not_p(array [1..5] of var 0..10: arr1, array [1..5] of 1..10: arr2) =
  let {
    array [1..5] of var 1..5: mapping;
    constraint all_different(mapping);
  } in
  exists(i in 1..5)(
    arr1[i] != 0
    /\
    arr1[i] != arr2[mapping[i]]
  );
This gives 500 solutions, such as:
arr = [1, 0, 0, 0, 0];
----------
arr = [2, 0, 0, 0, 0];
----------
arr = [3, 0, 0, 0, 0];
...
----------
arr = [2, 0, 0, 3, 4];
----------
arr = [2, 0, 1, 3, 4];
----------
arr = [2, 1, 0, 3, 4];
Update: I added not before the exists loop.

Dynamic logical expression evaluation

I need to evaluate a dynamic logical expression, and I know that this is not possible in plain ABAP.
I found the class cl_java_script, and with this class I could achieve my requirement. I tried something like this:
result = cl_java_script=>create( )->evaluate( `( 1 + 2 + 3 ) == 6 ;` ).
After executing the evaluate method, result = true as expected. But my happiness ended when I looked into the class documentation, which says This class is obsolete.
My question is: is there another way to achieve this?
Using any Turing-complete language to parse a "dynamic logical expression" is a terrible idea, as an attacker might be able to run any program inside your expression; e.g. while(true) { } will crash your variant using cl_java_script. Also, although I don't know the details of cl_java_script, I assume it launches a separate JS runtime in a separate thread somewhere, which does not seem to be the most efficient choice for calculating such a small dynamic expression.
Instead you could implement your own small parser. This has the advantage that you can limit what it supports to the bare minimum whilst being able to extend it to everything you need in your use case. Here's a small example using reverse Polish notation (RPN) which is able to correctly evaluate the expression you've shown (using RPN simplifies parsing a lot, though one can for sure also build a full-fledged expression parser):
REPORT z_expr_parser.

TYPES:
  BEGIN OF multi_value,
    integer TYPE REF TO i,
    boolean TYPE REF TO bool,
  END OF multi_value.

CLASS lcl_rpn_parser DEFINITION.
  PUBLIC SECTION.
    METHODS:
      constructor
        IMPORTING
          text TYPE string,
      parse
        RETURNING VALUE(result) TYPE multi_value.
  PRIVATE SECTION.
    DATA:
      tokens TYPE STANDARD TABLE OF string,
      stack  TYPE STANDARD TABLE OF multi_value.
    METHODS pop_int
      RETURNING VALUE(result) TYPE i.
    METHODS pop_bool
      RETURNING VALUE(result) TYPE abap_bool.
ENDCLASS.

CLASS lcl_rpn_parser IMPLEMENTATION.
  METHOD constructor.
    " A most simple lexer: tokens are separated by single blanks.
    SPLIT text AT ' ' INTO TABLE tokens.
    ASSERT lines( tokens ) > 0.
  ENDMETHOD.

  METHOD pop_int.
    " Pop the top of the stack, asserting it holds an integer.
    DATA(peek) = stack[ lines( stack ) ].
    ASSERT peek-integer IS BOUND.
    result = peek-integer->*.
    DELETE stack INDEX lines( stack ).
  ENDMETHOD.

  METHOD pop_bool.
    " Pop the top of the stack, asserting it holds a boolean.
    DATA(peek) = stack[ lines( stack ) ].
    ASSERT peek-boolean IS BOUND.
    result = peek-boolean->*.
    DELETE stack INDEX lines( stack ).
  ENDMETHOD.

  METHOD parse.
    LOOP AT tokens INTO DATA(token).
      IF token = '='.
        DATA(comparison) = xsdbool( pop_int( ) = pop_int( ) ).
        APPEND VALUE #( boolean = NEW #( comparison ) ) TO stack.
      ELSEIF token = '+'.
        DATA(addition) = pop_int( ) + pop_int( ).
        APPEND VALUE #( integer = NEW #( addition ) ) TO stack.
      ELSE.
        " Assumption: the token is an integer literal.
        DATA value TYPE i.
        value = token.
        APPEND VALUE #( integer = NEW #( value ) ) TO stack.
      ENDIF.
    ENDLOOP.
    ASSERT lines( stack ) = 1.
    result = stack[ 1 ].
  ENDMETHOD.
ENDCLASS.

START-OF-SELECTION.
  " 1 + 2 + 3 = 6 in RPN:
  DATA(program) = |1 2 3 + + 6 =|.
  DATA(parser) = NEW lcl_rpn_parser( program ).
  DATA(result) = parser->parse( ).
  ASSERT result-boolean IS BOUND.
  ASSERT result-boolean->* = abap_true.
SAP's BRF is an option, but potentially massive overkill in your scenario.
Here is a blog on calling BRF from ABAP.
And here is how rules/expressions can be defined dynamically.
BUT, if you know enough about the source problem to generate
1 + 2 + 3 = 6
then it is hard to imagine why a simple custom parser can't be used.
Just how complex should the expressions be?
I'd probably write my own parser before investing in calling BRF.
Since some/many BSPs use server-side JavaScript and not ABAP as the scripting language, I can't see SAP removing the kernel routine anytime soon.
SYSTEM-CALL JAVA SCRIPT EVALUATE.
So maybe consider just calling cl_java_script anyway until it becomes an issue.
Then worry about a parser if and when it is really no longer a valid call.
But there is definitely some movement in the obsolete space here.
SAP is pushing/forcing you to the cloud with the SDK to execute such things:
https://sap.github.io/cloud-sdk/docs/js/overview-cloud-sdk-for-javascript

How can I get element by the Index from the OrderedTable in Nim?

Nim has OrderedTable; how can I get an element by its index rather than by its key?
If it's possible, is this an efficient operation, like O(log n) or better?
import tables
let map = {"a": 1, "b": 2, "c": 3}.toOrderedTable
Something like
map.getByIndex(1) # No such method
P.S.
I'm currently using both seq and Table to provide both keyed and indexed access, and I wonder if they could be replaced by OrderedTable:
type IndexedMap = ref object
  list*: seq[float]
  map*: Table[string, float]
There is no direct index access to ordered tables because of their internal structure. The typical way to access the elements in order is:
import tables
let map = {"a": 1, "b": 2, "c": 3}.toOrderedTable
for f in map.keys:
  echo $f
Basically, accessing the keys iterator. If you click through the source link in the documentation, you reach the actual iterator code:
let L = len(t)
forAllOrderedPairs:
  yield t.data[h].key
And if you follow the implementation of the forAllOrderedPairs template (it's recommended to use an editor with jump-to-implementation capabilities to inspect such code more easily):
if t.counter > 0:
  var h = t.first
  while h >= 0:
    var nxt = t.data[h].next
    if isFilled(t.data[h].hcode):
      yieldStmt
    h = nxt
No idea about performance there, but it won't be as fast as accessing a simple list/array, because the internal structure of OrderedTable contains a hidden data field with the actual keys and values, and it requires an extra conditional check to verify that the entry is actually being used. This implementation detail is probably a compromise to avoid reshuffling the whole list after a single item deletion.
If your accesses are infrequent, using the iterator to find the value might be enough. If benchmarking shows it's a bottleneck, you could try freezing the keys/values iterator into a local list and using that instead, as long as you don't mutate the OrderedTable further.
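For example, a minimal sketch of that freezing approach (assuming no further mutation of the table):

import tables, sequtils

let map = {"a": 1, "b": 2, "c": 3}.toOrderedTable
# Freeze the current key order into a seq once...
let frozenKeys = toSeq(map.keys)
# ...then "index access" is an O(1) seq index plus a normal table lookup.
echo map[frozenKeys[1]] # -> 2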
Or return to your original idea of keeping a separate list.

Query sub-sequences in time-series sequence data

I'm facing a problem, and I feel like there's a solution in graph theory or graph databases. My knowledge of these fields is very limited. I'm hoping someone can recognise my problem and perhaps point me to the name of a technique used to solve it.
Simplified Example:
I am dealing with time-series of states. A simple example, where there are only two states:
TS State
t0 T
t1 F
t2 F
t3 F
t4 T
t5 T
t6 T
t7 F
t... ...
I could convert this into a graph with two nodes (T and F), where the "dwell time" in each state is an attribute (in brackets):
T(1) -> F(3) -> T(3) -> F(1)
An example of my problem is to write a "query" that extracts any sub-sequence matching this pattern F(>=2) -> T(<10).
In my example above, my query would extract the sub-sequence:
F(3) -> T(3)
But if it were present in the dataset, the query could also extract sequences like:
F(2) -> T(8)
F(20) -> T(3)
The example I've put up is simplified: there are more than two states, and more advanced queries would allow loops, where these loops could be constrained in either the overall time spent in the loop or the number of iterations. E.g.:
`T(>2) -> [loops of F(1)->T(1)] -> T(<10)`
Where my loop could perhaps be constrained not to take more than 10 iterations, or not more than 10 time units.
The icing on the cake would be to find sequences like this
T(n)->F(<n)
This translates as: sequences that start with T (and stay in T for n time units), followed by the F state, staying in F for less than n (i.e., F is shorter than the preceding T).
What I tried:
I originally thought of converting this to a string and using a regex to extract matches. Regex could do almost all I need, but falls short of comprehending arithmetic like "greater than". I guess I could keep my raw time-series of states (TFFFTTTF) and run a regex on that... but it seems pretty ugly.
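For illustration, the run-length-encoded variant of that idea might look like this in Python (a toy sketch for the F(>=2) -> T(<10) query; the dwell-time predicates are plain functions, which is exactly the arithmetic a regex can't express):

from itertools import groupby

series = "TFFFTTTF"
# Run-length encode the raw state string: [('T', 1), ('F', 3), ('T', 3), ('F', 1)]
runs = [(state, len(list(group))) for state, group in groupby(series)]

# The query F(>=2) -> T(<10) as (state, dwell-time predicate) pairs:
pattern = [("F", lambda d: d >= 2), ("T", lambda d: d < 10)]

# Slide the pattern over the runs and keep every matching window.
matches = []
for i in range(len(runs) - len(pattern) + 1):
    window = runs[i:i + len(pattern)]
    if all(s == ps and ok(d) for (s, d), (ps, ok) in zip(window, pattern)):
        matches.append(window)
print(matches)  # -> [[('F', 3), ('T', 3)]]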
The fields of natural language processing, graph theory, and graph databases come to mind as ones that would face similar problems.
I don't know how I would encode the "duration of state" attribute in my graph. I don't know if there's some sort of "industry-standard" query language for sub-sequence searches in graph databases.
Questions:
Is there a framework for solving these sub-sequence extraction problems, and if so, what is it called? Is there a "best practice"? How should I structure my data? Is there a query language to query sub-sequences in a database of sequences?
I might flip the problem around. You've indicated that this is time-series data. Given that, I might create a new state node every time the state changes. I would then encode the "dwell" time in the previous node and link the new node to the previous state node, creating a linked list in the graph database. With this structure, your pattern query becomes simple.
Objectivity/DB is a schema-based object/graph database with a complete set of graph navigational query capabilities. It has its own query language called Declarative Objectivity, or DO.
We start with a schema definition:
UPDATE SCHEMA {
  CREATE CLASS State {
    label     : String,
    dwellTime : INTEGER { Storage: B32 },
    prev      : Reference { referenced: State, Inverse: next },
    next      : Reference { referenced: State, Inverse: prev }
  }
};
Then we can execute a DO query like the following:
MATCH p = (:State {label == 'T' AND dwellTime > 5})
-->(:State {label == 'F' AND dwellTime > 5})
-->(:State {label == 'T' AND dwellTime < 2})
-->(:State {label == 'T' AND dwellTime > 100})
-->(:State {label == 'F' AND dwellTime > 100})
RETURN p;
This kind of query will find all of the "TFTTF" patterns that meet the specified dwell times.

What is the !$ (bang dollar) operator in Nim?

In the example of defining a custom hash function on page 114 of Nim in Action, the !$ operator is used to "finalize the computed hash".
import tables, hashes

type
  Dog = object
    name: string

proc hash(x: Dog): Hash =
  result = x.name.hash
  result = !$result

var dogOwners = initTable[Dog, string]()
dogOwners[Dog(name: "Charlie")] = "John"
And in the paragraph below:
The !$ operator finalizes the computed hash, which is necessary when writing a custom hash procedure. The use of the !$ operator ensures that the computed hash is unique.
I am having trouble understanding this. What does it mean to "finalize" something? And what does it mean to ensure that something is unique in this context?
Your questions might be answered if, instead of reading the single description of the !$ operator, you take a look at the beginning of the hashes module documentation. As you can see there, primitive data types have a hash() proc which returns their own hash. But if you have a complex object with many fields, you might want to create a single hash for the object itself, and how do you do that? Without going into hash theory, and treating hashes like black boxes, you need two kinds of procs to produce a valid hash: the addition/concatenation operator and the finalization operator. So you end up using !& to keep adding (or mixing) individual hashes into a temporary value, and then use !$ to finalize that temporary value into the final hash. The Nim in Action example might have been easier to understand if the Dog object had more than a single field, thus requiring the use of both operators:
import tables, hashes, sequtils

type
  Dog = object
    name: string
    age: int

proc hash(x: Dog): Hash =
  result = x.name.hash !& x.age.hash
  result = !$result

var dogOwners = initTable[Dog, string]()
dogOwners[Dog(name: "Charlie", age: 2)] = "John"
dogOwners[Dog(name: "Charlie", age: 5)] = "Martha"

echo toSeq(dogOwners.keys)
for key, value in dogOwners:
  echo "Key ", key.hash, " for ", key, " points at ", value
As for why hash values are temporarily concatenated and then finalized, that depends on which algorithms the Nim developers have chosen for hashing. You can see from the source code that hash concatenation and finalization are mostly bit shifting. Unfortunately, the source code doesn't explain or point at any other reference explaining why this is done and why this specific hashing algorithm was selected over others. You could try asking on the Nim forums, and maybe improve the documentation/source code with your findings.
