I need to make a view that emits a value for each pair of documents (a cartesian product of _all_docs with itself).
For example, assume the DB has documents with IDs a, b, c; the view should then emit 9 keys aa, ab, ac, ba, ..., cc (assuming no grouping).
E.g. if the documents are "cities" with coordinates, the view returns pairs of cities and the distance between them (the real example is more complicated), so I could then use a _list function to compute the "top 10 closest cities" and so on.
This looks like a very simple task, however Google and SO searches give no results. Am I missing some magic keyword here?
I can't think of a way to do this in CouchDB. Fundamentally, this doesn't lend itself to map/reduce indexes: in the map function you only have access to one document at a time, and in the reduce stage you need to reduce the result (computing the cartesian product would expand it).
If you use another system to precompute the distances between the cities, then CouchDB is likely a good fit for storing and querying the result of that cartesian product (e.g. to find the top 10 closest cities). However, you might also want to look at a graph database (Neo4j or Giraph).
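As a rough illustration of the "precompute elsewhere" idea, here is a minimal Python sketch that builds one document per ordered pair of cities. The sample coordinates, the haversine distance, the helper names and the "_id"/"from"/"to"/"distance_km" field layout are all assumptions made for the example, not something taken from the question.

```python
import itertools
import math

# Hypothetical sample data; in practice these would be read from _all_docs.
cities = {
    "a": (52.52, 13.405),    # (latitude, longitude) in degrees
    "b": (48.8566, 2.3522),
    "c": (51.5074, -0.1278),
}


def haversine_km(p, q):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def pair_documents(docs):
    """Build one document per ordered pair (cartesian product of docs with itself)."""
    return [
        {
            "_id": f"{a}:{b}",  # assumed ID scheme for the pair documents
            "from": a,
            "to": b,
            "distance_km": haversine_km(docs[a], docs[b]),
        }
        for a, b in itertools.product(docs, repeat=2)
    ]


pairs = pair_documents(cities)
```

The resulting documents could then be bulk-loaded into a separate database, where an ordinary view keyed on something like [from, distance_km] would answer "closest cities" queries without needing a cross-document map function.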
For some context: I am currently using Azure Cosmos DB with the Gremlin API. Because of the storage-scaling architecture, it's much less expensive to perform an '.out()' operation than an '.in()' operation, so I always create edges in both directions and choose which one to traverse with '.out()' depending on the direction I want to query.
We use the graph to associate events with users. Whenever a user 'U' raises an event 'E', we create two edges:
g.V('U').addE('raisedEvent').to(g.V('E'))
g.V('E').addE('raisedByUser').to(g.V('U'))
Very rarely, one of these queries fails for one reason or another and we end up with only a single edge between the two vertices. I've been trying to find a way to query for all vertices that have only a uni-directional relationship given a set of 'paired' edge-labels, in order to find these errors and re-create the missing edge.
Basically I need a query where...
given a pair of edge labels E1 (for outgoing, V1-->V2), E2 (for incoming V1<--V2)
finds all vertices V1 that have at least one outgoing edge E1 to another vertex V2 where V2 doesn't have an edge E2 going back to V1; and vice-versa
Example:
// given a graph
g.addV('user').property('id','user_1')
g.addV('user').property('id','user_2')
g.addV('user').property('id','user_3')
g.addV('user').property('id','user_4')
g.addV('event').property('id','event_1')
g.addV('event').property('id','event_2')
g.addV('event').property('id','event_3')
g.addV('event').property('id','event_4')
g.V('user_1').addE('raisedEvent').to(g.V('event_1')).V('event_1').addE('raisedByUser').to(g.V('user_1'))
g.V('user_2').addE('raisedEvent').to(g.V('event_2')).V('event_2').addE('raisedByUser').to(g.V('user_2'))
g.V('user_2').addE('raisedEvent').to(g.V('event_3'))
g.V('event_4').addE('raisedByUser').to(g.V('user_3'))
// i.e.
// (user_1) <--> (event_1)
// (event_2) <--> (user_2) ---> (event_3)
// (event_4) ---> (user_3)
// (user_4)
// Then, the query should match with user_2 and user_3...
// ...as they contain uni-directional links to events
Edit: Note - the Cosmos DB implementation of the 'is()' operation doesn't support taking traversal results as input, i.e. queries such as
where(__.outE('raisedEvent').count().is(__.out('raisedEvent').outE('raisedByUser').count()))
are currently unsupported in Cosmos DB.
If possible, it would also be great to get a list of which pairs of vertices have a bad link (e.g. in this case [(user_2, event_3), (user_3, event_4)]), but just knowing which vertices have a bad link will be very useful already.
Thanks to Kelvin Lawrence, I ended up using this pattern to get a list of vertex id pairs that are only uni-directionally connected from a to b:
g.V().hasLabel("user").as('a').out('raisedEvent').where(__.not(out('raisedByUser').as('a'))).as('b').select('a','b').by('id')
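For readers who want to sanity-check the result outside the graph engine, the same "paired edge" check can be sketched client-side in plain Python. This is only an illustration under the assumption that the edges have been exported as (source id, label, target id) tuples; the function name unpaired and the tuple layout are made up for the example and are not part of any Gremlin or Cosmos DB API.

```python
# Edges as (source_id, label, target_id), mirroring the example graph in the question.
edges = [
    ("user_1", "raisedEvent", "event_1"),
    ("event_1", "raisedByUser", "user_1"),
    ("user_2", "raisedEvent", "event_2"),
    ("event_2", "raisedByUser", "user_2"),
    ("user_2", "raisedEvent", "event_3"),
    ("event_4", "raisedByUser", "user_3"),
]


def unpaired(edges, forward, backward):
    """Return (pairs missing the backward edge, pairs missing the forward edge).

    Both results are lists of (forward_source, forward_target) tuples,
    i.e. (user, event) for the labels used here.
    """
    fwd = {(src, dst) for src, label, dst in edges if label == forward}
    # Flip backward edges so both sets use the same (user, event) orientation.
    bwd = {(dst, src) for src, label, dst in edges if label == backward}
    return sorted(fwd - bwd), sorted(bwd - fwd)


missing_backward, missing_forward = unpaired(edges, "raisedEvent", "raisedByUser")
```

With the sample graph, missing_backward holds the pair whose 'raisedByUser' edge needs to be recreated and missing_forward the pair whose 'raisedEvent' edge is absent.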
I want to map a timestamp t and an identifier id to a certain state of an object. I can do so by mapping a tuple (t,id) -> state_of_id_in_t. I can use this mapping to access one specific (t,id) combination.
However, sometimes I want to know all states (with matching timestamps t) of a specific id (i.e. id -> a set of (t, state_of_id_in_t)) and sometimes all states (with matching identifiers id) of a specific timestamp t (i.e. t -> a set of (id, state_of_id_in_t)). The problem is that I can't just put all of these in a single large matrix and do a linear search based on what I want. The number of (t,id) tuples for which I have states is very large (1M+) and very sparse (some timestamps have many states, others none, etc.). How can I make such a dict, which can deal with accessing its contents by partial keys?
I created two distinct dicts, dict_by_time and dict_by_id, which are dicts of dicts. dict_by_time maps a timestamp t to a dict of ids, which each point to a state. Similarly, dict_by_id maps an id to a dict of timestamps, which each point to a state. This way I can access a state or a set of states however I like. Notice that the 'leaves' of both dicts (dict_by_time and dict_by_id) point to the same objects, so it's just the way I access the states that's different; the states themselves are the same Python objects.
dict_by_time = {'t_1': {'id_1': 'some_state_object_1',
                        'id_2': 'some_state_object_2'},
                't_2': {'id_1': 'some_state_object_3',
                        'id_2': 'some_state_object_4'}}
dict_by_id = {'id_1': {'t_1': 'some_state_object_1',
                       't_2': 'some_state_object_3'},
              'id_2': {'t_1': 'some_state_object_2',
                       't_2': 'some_state_object_4'}}
Again, notice the leaves are shared across both dicts.
I don't think it is good to do it using two dicts, simply because maintaining both of them when adding new timestamps or identifiers results in double work and could easily lead to inconsistencies when I do something wrong. Is there a better way to solve this? Complexity is very important, which is why I can't just do manual searching and need to use some sort of HashMap magic.
You can always trade insertion complexity for lookup complexity. Instead of using a single dict, you can create a class with an add method and a lookup method. Internally, you can keep track of the data using 3 different dictionaries: one uses the (t,id) tuple as key, one uses t as the key and one uses id as the key. Depending on the arguments given to lookup, you can return the result from one of the dictionaries.
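A minimal sketch of such a class might look like the following. The class name StateIndex, the keyword arguments of lookup and the choice to return copies of the inner dicts are assumptions made for illustration; the answer only specifies an add method, a lookup method and three internal dictionaries.

```python
from collections import defaultdict


class StateIndex:
    """Stores states keyed by (t, id) and also indexed by t alone and by id alone."""

    def __init__(self):
        self._by_pair = {}                  # (t, id) -> state
        self._by_time = defaultdict(dict)   # t -> {id: state}
        self._by_id = defaultdict(dict)     # id -> {t: state}

    def add(self, t, id_, state):
        # All three indexes are updated in one place, so they cannot drift apart.
        self._by_pair[(t, id_)] = state
        self._by_time[t][id_] = state
        self._by_id[id_][t] = state

    def lookup(self, t=None, id_=None):
        """Return one state for (t, id), or a dict of states for a partial key."""
        if t is not None and id_ is not None:
            return self._by_pair[(t, id_)]
        if t is not None:
            return dict(self._by_time.get(t, {}))
        if id_ is not None:
            return dict(self._by_id.get(id_, {}))
        raise ValueError("lookup needs t, id_, or both")


index = StateIndex()
index.add("t_1", "id_1", "some_state_object_1")
index.add("t_1", "id_2", "some_state_object_2")
index.add("t_2", "id_1", "some_state_object_3")
```

Each add costs three hash insertions instead of one, while every kind of lookup stays a constant-time hash access (plus the copy of the inner dict for partial keys). The stored state objects are shared between the three dictionaries, so memory overhead is limited to the extra keys.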
I have a graph representing levels in a game. Some examples of types of nodes I have are "Level", "Room", "Chair". The types are represented by VertexCollections where one collection is called "Level" etc. They are connected by edges from separate EdgeCollections, for example "Level" -roomEdge-> "Room" -chairEdge-> "Chair". Rooms can also contain rooms etc. so the depth is unknown.
I want to start at an arbitrary "Level"-vertex and traverse the whole subtree and find all chairs that belong to the level.
I'm trying to see if ArangoDB would work better than OrientDB for me. In OrientDB I use the query:
SELECT FROM (TRAVERSE out() FROM startNode) WHERE @class = 'Chair'
I have tried the AQL query:
FOR v IN 1..6 OUTBOUND @startVertex GRAPH 'testDb' FILTER IS_SAME_COLLECTION('Chair', v) == true RETURN v;
It does however seem to be executing much slower compared to the OrientDB query (~1 second vs ~0.1 second).
The code I'm using for the query is the following:
String statement = "FOR v IN 1..6 OUTBOUND @startVertex GRAPH 'testDb' FILTER IS_SAME_COLLECTION('Chair', v) == true RETURN v";
timer.start();
ArangoCursor<BaseDocument> cursor = db.query(statement, new MapBuilder().put("startVertex", "Level/"+startNode.getKey()).get(), BaseDocument.class);
timer.saveTime();
Both solutions are running on the same hardware without any optimization done, both databases are used "out of the box". Both cases use the same data (~1 million vertices) and return the same result.
So my question is if I'm doing things correctly in my AQL query or is there a better way to do it? Have I misunderstood the concept of VertexCollections and how they're used?
Is there a reason you have multiple collections for each entity type, e.g. one collection for Rooms, one for Levels, one for Chairs?
One option is to have a single collection that contains your entities, and you identify the type of entity it is with a Type: "Chair" or Type: "Level" key on the document.
You then have a single relationship collection, that holds edges both _to and _from the entity collection.
You can then start at a given node (for example a Level) and find all entities of type Chair that it is connected to with a query like:
FOR v, e, p IN 1..6 OUTBOUND Level_ID Relationship_Collection
FILTER p.vertices[-1].Type == 'Chair'
RETURN v
You could return v (the final vertex), e (the final edge) or p (the full path leading to that vertex).
I'm not sure you need to use a graph object. Instead, you can use a relationships collection that adds relationships to your entity collection.
Graphs are good if you need them, but not necessary for traversal queries. Read the documentation at ArangoDB to see if you need them, usually I don't use them as using a graph can slow performance down a little bit.
Remember to look at indexes, and use the 'Explain' feature in the UI to see how your indexes are being used. Maybe add a hash index to the 'Type' key.
Take US cities for example and say I want the traversal of all cities and roads that go through NYC, Chicago and Seattle.
This can be done with TRAVERSAL AQL function (using filterVertices). However this function only takes the ID and not the vertex example as in GRAPH_TRAVERSAL.
GRAPH_TRAVERSAL doesn't have a filter option, so my question is: is there a way to filter the results using graph operations?
The feature is actually there but was somehow not documented. I have added it to our documentation, which should be updated soon. Sorry for the inconvenience.
filterVertices takes a list of vertex examples.
Note: you can also give the name of a custom AQL function with the signature function(config, vertex, path) for more specific filtering.
vertexFilterMethod defines what should be done with all other vertices:
"prune" will not follow edges attached to these vertices. (Used here)
"exclude" will not include this specific vertex.
["prune", "exclude"] both of the above. (default)
An example query for your question is the following (airway is my graph):
FOR x in GRAPH_TRAVERSAL("airway", "a/SFO", "outbound", {filterVertices: [{_key: "SFO"}, {_key: "NYC"}, {name: "Chicago"}, {name: "Seattle"}], vertexFilterMethod: "prune"}) RETURN x
Hint: Make sure you include the start vertex in the filter as well. Otherwise the query will always return an empty array (the first visited vertex is pruned immediately).
Hoping that someone here will be able to provide some mysql advice...
I am working on a categorical searchtag system. I have tables like the following:
EXERCISES
    exerciseID
    exerciseTitle
SEARCHTAGS
    searchtagID
    parentID ( -> searchtagID)
    searchtag
EXERCISESEARCHTAGS
    exerciseID (Foreign key -> EXERCISES)
    searchtagID (Foreign key -> SEARCHTAGS)
Searchtags can be arranged in an arbitrarily deep tree. So for example I might have a tree of searchtags that looks like this...
Body Parts
    Head
    Neck
    Arm
        Shoulder
        Elbow
    Leg
        Hip
        Knee
Muscles
    Pecs
    Biceps
    Triceps
Now...
I want to select all of the searchtags in ONE branch of the tree that reference at least ONE record in the subset of records referenced by a SINGLE searchtag in a DIFFERENT branch of the tree.
For example, let's say the searchtag "Arm" points to a subset of exercises. If any of the exercises in that subset are also referenced by searchtags from the "Muscles" branch of SEARCHTAGS, I would like to select for them. So my query could potentially return "Biceps," "Triceps".
Two questions:
1) What would the SELECT query for something like this look like? (If such a thing is even possible without creating a lot of slowdown. I'm not sure where to start...)
2) Is there anything I should do to tweak my data structure to ensure this query will continue to run fast, even as the tables get big?
Thanks in advance for your help, it's much appreciated.
An idea: consider using a cache table that saves all ancestor relationships in your searchtags:
CREATE TABLE SEARCHTAGRELATIONS (
parentID INT,
descendantID INT
);
Also include the tag itself as both parent and descendant (so, for the searchtag with id 1, the relations table includes a row with (1,1)).
That way, you no longer have to walk the parent/descendant relationships recursively and can join against a flat table. Assuming "Muscles" has the ID 5,
SELECT descendantID FROM SEARCHTAGRELATIONS WHERE parentID=5
returns all searchtags contained in "Muscles".
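To show how the cache table could answer the original question, here is a small self-contained sketch using Python's built-in sqlite3 module as a stand-in for MySQL. The sample tags, IDs and exercise assignments are invented, and the query assumes that a tag such as "Arm" also covers exercises attached to its descendants; if only directly tagged exercises should count, the inner join on SEARCHTAGRELATIONS can be dropped.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE SEARCHTAGS (searchtagID INTEGER PRIMARY KEY, parentID INTEGER, searchtag TEXT);
CREATE TABLE EXERCISESEARCHTAGS (exerciseID INTEGER, searchtagID INTEGER);
CREATE TABLE SEARCHTAGRELATIONS (parentID INTEGER, descendantID INTEGER);
""")

# Hypothetical sample tree: (searchtagID, parentID, searchtag).
tags = [
    (1, None, "Body Parts"), (2, 1, "Arm"), (3, 2, "Elbow"),
    (5, None, "Muscles"), (6, 5, "Biceps"), (7, 5, "Triceps"), (8, 5, "Pecs"),
]
conn.executemany("INSERT INTO SEARCHTAGS VALUES (?, ?, ?)", tags)

# Fill the cache table: one row per (ancestor, tag) pair, including (tag, tag).
parent_of = {tag_id: parent for tag_id, parent, _ in tags}
for tag_id in parent_of:
    ancestor = tag_id
    while ancestor is not None:
        conn.execute("INSERT INTO SEARCHTAGRELATIONS VALUES (?, ?)", (ancestor, tag_id))
        ancestor = parent_of[ancestor]

# Hypothetical exercises: 10 -> Elbow + Biceps, 11 -> Arm + Triceps, 12 -> Pecs only.
conn.executemany(
    "INSERT INTO EXERCISESEARCHTAGS VALUES (?, ?)",
    [(10, 3), (10, 6), (11, 2), (11, 7), (12, 8)],
)

# Tags inside one branch that share at least one exercise with a given tag.
QUERY = """
SELECT DISTINCT t.searchtag
FROM SEARCHTAGRELATIONS AS branch
JOIN EXERCISESEARCHTAGS AS hit ON hit.searchtagID = branch.descendantID
JOIN SEARCHTAGS AS t ON t.searchtagID = hit.searchtagID
WHERE branch.parentID = :branch_root
  AND hit.exerciseID IN (
      SELECT est.exerciseID
      FROM SEARCHTAGRELATIONS AS sub
      JOIN EXERCISESEARCHTAGS AS est ON est.searchtagID = sub.descendantID
      WHERE sub.parentID = :tag
  )
ORDER BY t.searchtag
"""
rows = conn.execute(QUERY, {"branch_root": 5, "tag": 2}).fetchall()
result = [row[0] for row in rows]
```

With this sample data, result lists the "Muscles" tags that share an exercise with "Arm". On the real tables, indexes on SEARCHTAGRELATIONS (parentID, descendantID) and on both columns of EXERCISESEARCHTAGS are a reasonable starting point for keeping the joins fast; checking the plan with EXPLAIN on the actual data is advisable.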
Alternatively, use modified preorder tree traversal, also known as the nested set model. It requires two fields (left and right) instead of one (parent id), and makes certain operations harder, but makes selecting whole branches much easier.