Solr: find groups containing a value

I have a structure of documents indexed in Solr that are grouped together by a property.
Let's say I have a group consisting of three documents:
A -> B -> C
I want to perform a query by a property value V that will return the whole group if A or B or C contains the value V.
For example, the query will return my whole group (A, B, C) if B contains the value V.
Is this possible in Solr?
Thanks!

If I understand correctly, yes, this is possible. You can use the Graph query parser to do this:
you index your docs with the right info on how docs are linked to each other in some fields (see the sample in the docs).
then, you query like this:
q={!graph+from=in_edge+to=out_edge}id:A
where id:A is the query that selects the starting set of docs, and the {!graph ...} part retrieves all docs reachable from that starting set.
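As a rough sketch, suppose the group A -> B -> C is indexed with edge info along these lines (the field prop and the edge values here are purely illustrative, not from the original question):

{ "id": "A", "out_edge": ["1"] }
{ "id": "B", "prop": "V", "in_edge": ["1"], "out_edge": ["2"] }
{ "id": "C", "in_edge": ["2"] }

Querying by the property value then uses the matching docs as the starting set and traverses the edges in each direction:

q={!graph+from=in_edge+to=out_edge}prop:V
q={!graph+from=out_edge+to=in_edge}prop:V

If B matches prop:V, one direction reaches C and the other reaches A, so together the two queries return the whole group.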
Some caveats:
works only in standalone Solr or in a single SolrCloud shard (some graph features are also available in streaming expressions, and those work across all shards, but they have fewer features for now)
depending on what the graph of docs you want to reach looks like (and how you index the edge info), in some cases you might need to run two queries: one to get the docs 'from' the starting set and another to get the docs 'to' the starting set (as in the sketch above).


How to Traverse to get all children of a specific type using AQL

I have a graph representing levels in a game. Some examples of types of nodes I have are "Level", "Room", "Chair". The types are represented by VertexCollections where one collection is called "Level" etc. They are connected by edges from separate EdgeCollections, for example "Level" -roomEdge-> "Room" -chairEdge-> "Chair". Rooms can also contain rooms etc. so the depth is unknown.
I want to start at an arbitrary "Level"-vertex and traverse the whole subtree and find all chairs that belong to the level.
I'm trying to see if ArangoDB would work better for me than OrientDB; in OrientDB I use the query:
SELECT FROM (TRAVERSE out() FROM startNode) WHERE #class = 'Chair'
I have tried the AQL query:
FOR v IN 1..6 OUTBOUND @startVertex GRAPH 'testDb' FILTER IS_SAME_COLLECTION('Chair', v) == true RETURN v;
It does, however, seem to execute much more slowly than the OrientDB query (~1 second vs ~0.1 seconds).
The code I'm using for the query is the following:
String statement = "FOR v IN 1..6 OUTBOUND @startVertex GRAPH 'testDb' FILTER IS_SAME_COLLECTION('Chair', v) == true RETURN v";
timer.start();
ArangoCursor<BaseDocument> cursor = db.query(statement, new MapBuilder().put("startVertex", "Level/"+startNode.getKey()).get(), BaseDocument.class);
timer.saveTime();
Both solutions are running on the same hardware without any optimization done, both databases are used "out of the box". Both cases use the same data (~1 million vertices) and return the same result.
So my question is if I'm doing things correctly in my AQL query or is there a better way to do it? Have I misunderstood the concept of VertexCollections and how they're used?
Is there a reason you have multiple collections for each entity type, e.g. one collection for Rooms, one for Levels, one for Chairs?
One option is to have a single collection that contains all your entities, identifying the type of each entity with a type: "Chair" or type: "Level" key on the document.
You then have a single relationship collection that holds edges both _to and _from the entity collection.
You can then start at a given node (for example a Level) and find all entities of type Chair that it is connected to with a query like:
FOR v, e, p IN 1..6 OUTBOUND Level_ID Relationship_Collection
FILTER p.vertices[-1].type == 'Chair'
RETURN v
You could return v (final vertex) or e (final edge) or p (all paths).
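For illustration, documents in this scheme might look like the following (the collection name Entities is an assumption; Relationship_Collection is taken from the query above):

// entity collection "Entities"
{ "_key": "level1", "type": "Level" }
{ "_key": "room1", "type": "Room" }
{ "_key": "chair1", "type": "Chair" }

// edge collection "Relationship_Collection"
{ "_from": "Entities/level1", "_to": "Entities/room1" }
{ "_from": "Entities/room1", "_to": "Entities/chair1" }

Traversing 1..6 OUTBOUND from Entities/level1 over Relationship_Collection then reaches room1 and, through it, chair1.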
I'm not sure you need to use a graph object; rather, use a relationship collection that adds relationships to your entity collection.
Graphs are good if you need them, but they are not necessary for traversal queries. Read the ArangoDB documentation to see whether you need them; I usually don't use them, as a graph can slow performance down a little bit.
Remember to look at indexes, and use the 'Explain' feature in the UI to see how your indexes are being used. Maybe add a hash index to the 'type' key.
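A minimal arangosh sketch of that last suggestion (again assuming the entity collection is called Entities):

db.Entities.ensureIndex({ type: "hash", fields: ["type"] });

Whether this pays off depends on the selectivity of 'type'; the 'Explain' feature will show whether the traversal's filter actually uses the index.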

inverted index sets - querying key prefixes

I'm using Redis in order to build an inverted index system for words and the documents that contain those words.
The setup is really simple: Redis Sets, where the key of the Set is i:word and the members of the Set are the ids of the documents that contain this word.
Let's say I have 2 sets: i:example and i:result.
The query "example result" will intersect i:example and i:result and return all the ids that are members of both.
But what I'm looking for is a way to perform (in an efficient manner) a query like "ex res". The result set should contain at least all the ids from the query "example result".
Solutions that I have thought of:
Create prefix sets of size 2: p:ex contains {"example", "expertise", "ex", ...}. The lookup running time will not be a problem: O(1) to get the set and O(n) to check all elements in the set for words that start with the prefix (where n = set.size()). But I worry about the added size cost.
Using SCAN: but I'm not sure about the running time. Will a query like SCAN 0 MATCH ex* take O(n), where n is the number of keys in the db? I know Redis is fast, but it's probably not an optimized solution for a query like "ex machi cont".
The usual way to go about this is the first approach you mentioned, but usually you'd go with segments that are 3+ chars long. Note that you'll need to have a set for each segment, e.g. i:exa, i:exam, i:examp, i:exampl and of course i:example.
This will naturally take up space in your database (hence the suggestion to start at 3 rather than 2 characters). A possible tweak is to keep in the i:len(3) sets only references to the i:len(4+) sets instead of document ids. This will require more read operations but will give significant savings in terms of RAM.
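A quick redis-cli illustration of the two variants (the words and ids are made up):

SADD i:exa 1 2 7
SADD i:exam 1 2
SADD i:example 1 2

versus the RAM-saving tweak, where the 3-char sets point at the longer sets instead of holding ids:

SADD i:exa i:exam i:examp i:exampl i:example

In the second variant, a lookup for "exa" first reads i:exa and then fetches the referenced sets to collect the actual document ids.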
You should explore the lexicographical ranges for Sorted Sets that were added in v2.8.9. By calling ZRANGEBYLEX you can get ranges of members (e.g. all the words that start with ex). While this could be useful in this context by itself, consider that you can also use your Sorted Set's members creatively to encode a word and its document reference. This can help you get over the "loss" of the score (since all scores need to be the same for lexicographical ordering to work). For example, assuming the word "beg" in doc 1 and the word "bed" in doc 2:
ZADD index 0 "beg:1" 0 "bed:2"
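A hypothetical lookup for everything starting with "be" would then be:

ZRANGEBYLEX index [be (bf

which returns bed:2 and beg:1; splitting each member on the colon gives back the word and its document id.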
Lastly, here's a little something to think about too - adding suffix searching (e.g. everything that ends with "ample"): https://redislabs.com/blog/how-to-use-redis-at-least-x1000-more-efficiently

Why is this ArangoDB query too slow?

I am a new ArangoDB user and I am using the following query
FOR i IN meteo
FILTER
i.`POM` == "Maxial"
&& TO_NUMBER(i.`TMP`) < 4.2
&& DATE_TIMESTAMP(i.`DTM`) > DATE_TIMESTAMP("2014-12-10")
&& DATE_TIMESTAMP(i.`DTM`) < DATE_TIMESTAMP("2014-12-15")
RETURN
i.`TMP`
on a 2 million document collection. It has an index on the three fields that are filtered. It takes approx. 9 seconds in the web interface.
Is it possible to run it faster?
Thank you
Hugo
I have no access to the underlying data and data distribution, nor to the exact index definitions, so I can only give rather general advice:
1. Use the explain() command in order to see if the query makes use of indexes, and if yes, which ones.
2. If explain() shows that no index is used, check whether the attributes contained in the query's FILTER conditions are actually indexed. There is the db.<collection>.getIndexes() command to check which attributes are indexed.
3. If indexes are present but not used by the query, the indexes may have the wrong type. For example, a hash index will only be used for equality comparisons (i.e. ==) and not for other comparison types (<, <=, >, >= etc.). Additionally, a hash index will only be used if all of its indexed attributes are used in the query's FILTER conditions. A skiplist index will only be used if at least its first attribute is used in a FILTER condition. If further attributes of the skiplist index are specified in the query (from left to right), they may also be used and allow filtering more documents.
4. Only a single index will be picked when scanning a collection. Having multiple, separate indexes on "POM", "TMP", and "DTM" won't help this query, because it will use at most one of them per collection it iterates over. Instead, try to put multiple attributes into a single index if the query can benefit from it.
5. The more selective an index is, the better. For example, an index on a single attribute may filter out a lot of documents, but a combined index on multiple attributes may filter out even more. For this particular query, a skiplist index on [ "POM", "DTM" ] may be the right choice (in combination with 6.).
6. The only attribute for which the optimizer may consider an index lookup in the given original query is the "POM" attribute. The reason is that the other attributes are used inside function calls (i.e. TO_NUMBER(), DATE_TIMESTAMP()). In general, indexes will not be used for attributes that appear inside function calls: for TO_NUMBER(i.TMP) < 4.2 no index will be used, and the same goes for DATE_TIMESTAMP(i.DTM) > DATE_TIMESTAMP("2014-12-10"). Modifying the conditions so the indexed attributes are compared directly to a constant or a one-time calculated value can enable more candidate indexes. If possible, rewrite the conditions so that only the indexed attribute is present on one side of the comparison. For this particular query, it would be better to use i.DTM > "2014-12-10" instead of DATE_TIMESTAMP(i.DTM) > DATE_TIMESTAMP("2014-12-10").
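Putting 5. and 6. together, a possible rewrite might look like this (a sketch only, assuming DTM is stored as an ISO date string so that plain string comparison preserves date order):

db.meteo.ensureIndex({ type: "skiplist", fields: [ "POM", "DTM" ] });

FOR i IN meteo
  FILTER i.POM == "Maxial"
    && i.DTM > "2014-12-10"
    && i.DTM < "2014-12-15"
    && TO_NUMBER(i.TMP) < 4.2
  RETURN i.TMP

The POM and DTM conditions can now be satisfied from the skiplist index, and only the remaining candidates are checked against the TMP condition.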

Optimized way of negating values in Solr?

I am trying to search for results with the negation of a particular id in Solr. I have found that this can be done in two ways:
(1) fq=userid:(-750376)
(2) fq=-userid:750376
Both are working fine and both are giving correct results. But can anyone tell me which of the two is the better way? Which one should I prefer?
You can find out what query the fq parameter's value is parsed into by turning on query debugging (add the parameter debug=true). Then, in the Solr response, there should be an entry "parsed_filter_queries" under "debug", and the entry should show the string representation of the parsed filter query (or queries) being used.
In your case, both forms of fq should be parsed into the same query, i.e. a boolean query with a single clause stating that the term userid:750376 must not occur. Therefore, which form you use does not matter, at least in terms of correctness or performance.
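For instance (the core name and URL are illustrative), a request such as

http://localhost:8983/solr/mycore/select?q=*:*&fq=-userid:750376&debug=true

should contain something along these lines in the debug section of the response:

"parsed_filter_queries": [ "-userid:750376" ]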
To us the two queries look a little different, but to Solr both are the same.
First, Solr parses the query you provide, then it searches for the results. In your case, Solr's "parsed_filter_queries" entry is -userid:750376 for both of the queries:
fq=userid:(-750376)
fq=-userid:750376
You can check this by enabling debugQuery in the Admin UI. You can also pass debugQuery=true with the query. Hope this helps.

saving intermediate steps in gremlin

I am writing a query which should detect certain loops within a graph, which means that I need to assign names to certain nodes within the path so that I can compare nodes later in the path with the saved ones, for example A -> B -> C -> A. Is this possible within Gremlin?
It sounds like you're looking for something like this:
https://github.com/tinkerpop/gremlin/wiki/Except-Retain-Pattern
where you keep a list of previously traversed vertices and then utilize that list later in the traversal.
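A rough Gremlin 2 (Groovy) sketch of that pattern, applied to the A -> B -> C -> A example (the starting vertex id and the fixed number of out steps are illustrative):

x = []
// save the starting vertex in x, walk three steps out,
// and keep only the results that match a vertex saved in x
g.v('A').aggregate(x).out.out.out.retain(x)

If this traversal returns the starting vertex, the path loops back to it.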
