I'm new to gremlin and trying to find out how to get an article along with the author and attachments in the same result using Azure Cosmos DB with GraphSON.
My graph looks like this:
[User] <- (edge: author) - [Article] - (edge: attachments) -> [File1, File2]
I would like to fetch everything I need in the UI to show an article along with the author and info about the attachments in one request.
What I'm trying to fetch is something similar to this pseudo-code:
{
article: {...},
author: [{author1}],
attachment: [{file1}, {file2}]
}
My attempt so far:
g.V().hasLabel('article').as('article').out('author', 'attachments').as('author','attachments').select('article', 'author', 'attachments')
How can I write the query to get the distinct values?
When asking questions about Gremlin it is always helpful to provide some sample data in a form like this:
g.addV('user').property('name','jim').as('jim').
addV('user').property('name','alice').as('alice').
addV('user').property('name','bill').as('bill').
addV('article').property('title','Gremlin for Beginners').as('article').
addV('file').property('file','/files/a.png').as('a').
addV('file').property('file','/files/b.png').as('b').
addE('authoredBy').from('article').to('jim').
addE('authoredBy').from('article').to('alice').
addE('authoredBy').from('article').to('bill').
addE('attaches').from('article').to('a').
addE('attaches').from('article').to('b').iterate()
Note that I modified your edge label names to be more verb-like so that they distinguish themselves better from the noun-like vertex labels. It tends to read nicely with the direction of the edge, as in: article --authoredBy-> user
Anyway, your problem is most easily solved with the project() step:
gremlin> g.V().has('article','title','Gremlin for Beginners').
......1> project('article','authors','attachments').
......2> by().
......3> by(out('authoredBy').fold()).
......4> by(out('attaches').fold())
==>[article:v[6],authors:[v[0],v[2],v[4]],attachments:[v[10],v[8]]]
In the above query, note the use of fold() within the by() steps - that forces the full iteration of the inner traversal and collects its results into a list. If you omit fold() you will get just one result (i.e. the first).
Going one step further, I added valueMap() and next'd the result so that you could better see the properties contained in the vertices above.
gremlin> g.V().has('article','title','Gremlin for Beginners').
......1> project('article','authors','attachments').
......2> by(valueMap()).
......3> by(out('authoredBy').valueMap().fold()).
......4> by(out('attaches').valueMap().fold()).next()
==>article={title=[Gremlin for Beginners]}
==>authors=[{name=[jim]}, {name=[alice]}, {name=[bill]}]
==>attachments=[{file=[/files/b.png]}, {file=[/files/a.png]}]
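If it helps, here is a plain-Python sketch (hypothetical data, no Gremlin driver involved) of why fold() matters inside by() - without it each inner traversal yields only its first result, with it you get the whole list:

```python
# Plain-Python stand-in for the out-edges of the article vertex;
# this is an illustration of the semantics, not a Gremlin API.
edges = {
    "authoredBy": ["jim", "alice", "bill"],
    "attaches": ["/files/a.png", "/files/b.png"],
}

def first(xs):   # what by(out(...)) alone would hand back
    return xs[0]

def fold(xs):    # what by(out(...).fold()) hands back
    return list(xs)

result = {
    "article": "Gremlin for Beginners",
    "authors": fold(edges["authoredBy"]),
    "attachments": fold(edges["attaches"]),
}

assert first(edges["authoredBy"]) == "jim"           # only the first author
assert result["authors"] == ["jim", "alice", "bill"]  # all three, folded
```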
I recently started using Arango since I want to make use of the advantages of graph databases. However, I'm not yet sure what's the most elegant and efficient approach to query an item from a document collection and applying fields to it that are part of a relation.
I'm used to using joins in SQL databases and population in NoSQL databases, but I'm not sure how it works here.
I created a document collection called posts. For example, this is a post:
{
"title": "Foo",
"content": "Bar"
}
And I also have a document collection called tags. A post can have any amount of tags, and my goal is to fetch either all or specific posts, but with their tags included, so for example this as my returning query result:
{
"title": "Foo",
"content": "Bar",
"tags": ["tag1", "tag2"]
}
I tried creating those two document collections and an edge collection post-tags-relation where I added an item for each tag from the post to the tag. I also created a graph, although I'm not yet sure what the vertex field is used for.
My query looked like this
FOR v, e, p IN 1..2 OUTBOUND 'posts/testPost' GRAPH 'post-tags-relation' RETURN v
And it did give me the tag, but my goal is to fetch a post and include the tags in the same document... The path vertices do contain all tags and the post, but in separate arrays, which is not nice and easy to use (and probably not the right way). I'm probably missing something important here. Hopefully someone can help.
You're really close - it looks like your query to get the tags is correct. Now, just add a bit to return the source document:
FOR post IN posts
FILTER post._key == 'testPost'
LET tags = (
FOR v IN 1..2 OUTBOUND post
GRAPH 'post-tags-relation'
RETURN v.value
)
RETURN MERGE(
post,
{ tags }
)
Or, if you want to skip the FOR/FILTER process:
LET post = DOCUMENT('posts/testPost')
LET tags = (
FOR v IN 1..2 OUTBOUND post
GRAPH 'post-tags-relation'
RETURN v.value
)
RETURN MERGE(
post,
{ tags }
)
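MERGE simply overlays the subquery result onto the post document. The same idea as a plain-Python sketch (dicts standing in for ArangoDB documents):

```python
# Dicts standing in for ArangoDB documents; merge_post mirrors
# AQL's MERGE(post, { tags }): keys from the second argument win.
post = {"title": "Foo", "content": "Bar"}
tags = ["tag1", "tag2"]

def merge_post(doc, extra):
    merged = dict(doc)    # copy so the stored document is untouched
    merged.update(extra)  # overlay the extra fields
    return merged

result = merge_post(post, {"tags": tags})
# {'title': 'Foo', 'content': 'Bar', 'tags': ['tag1', 'tag2']}
```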
As for graph definition, there are three required fields:
edge definitions (an edge collection)
from collections (where your edges come from)
to collections (where your edges point to)
The non-obvious vertex collections field is there to allow you to include a set of vertex-only documents in your graph. When these documents are searched and how they're filtered remains a mystery to me. Personally, I've never used this feature (my data has always been connected) so I can't say when it would be valuable, but someone thought it was important to include.
I'm using Amazon Neptune to create and query a simple graph database. I'm currently running my code in an AWS Jupyter Notebook but will eventually move the code to Python (gremlin_python). As you can probably guess I'm pretty new to Gremlin and graph databases in general.
I have the following data
g.addV('person').property(id, 'john')
.addV('person').property(id, 'jim')
.addV('person').property(id, 'pam')
.addV('game').property(id, 'G1')
.addV('game').property(id, 'G2')
.addV('game').property(id, 'G3').iterate()
g.V('john').as('p').V('G1').addE('bought').from('p').iterate()
g.V('john').as('p').V('G2').addE('bought').from('p').iterate()
g.V('john').as('p').V('G3').addE('bought').from('p').iterate()
g.V('jim').as('p').V('G1').addE('bought').from('p').iterate()
g.V('jim').as('p').V('G2').addE('bought').from('p').iterate()
g.V('pam').as('p').V('G1').addE('bought').from('p').iterate()
Three persons and three games in the database. My goal is: given a person, find out which persons have bought the same games as them, and which games those are.
After looking at sample code (mostly from https://tinkerpop.apache.org/docs/current/recipes/#recommendation) I have the following query that tries to find the persons who bought the same games as the target person:
g.V('john').as('target') // Target person we are interested in comparing against
.out('bought').aggregate('target_games') // Games bought by target
.in('bought').where(P.neq('target')).dedup() // Persons who bought same games as target (excluding target and without duplicates)
.group().by().by(out("bought").where(P.within("target_games")).count()) // Find persons, group by number of co owned games
.unfold().order().by(values, desc).toList() // Unfold to create list, order by greatest number of common games
Which gives me the results:
{v[jim]: 2}
{v[pam]: 1}
Which tells me that jim has 2 of the same games as john while pam only has 1. But I want my query to return the actual games they have in common like so (still in order of most common games):
{v[jim]: ['G1', 'G2']}
{v[pam]: ['G1']}
Thanks for your help.
There are a few different ways this query could be written. Here is one way that uses a mid-traversal V() step: having found John's games, it finds all the other people who are not John, looks at their games, and checks whether they intersect with the games that John owns.
gremlin> g.V('john').as('j').
......1> out().
......2> aggregate('owns').
......3> V().
......4> hasLabel('person').
......5> where(neq('j')).
......6> group().
......7> by(id).
......8> by(out('bought').where(within('owns')).dedup().fold())
==>[pam:[v[G1]],jim:[v[G1],v[G2]]]
However, the mid-traversal V() approach is not really needed, as you can just look at the incoming vertices from the games that John owns:
gremlin> g.V('john').as('j').
......1> out().
......2> aggregate('owns').
......3> in('bought').
......4> where(neq('j')).
......5> group().
......6> by(id).
......7> by(out('bought').where(within('owns')).dedup().fold())
==>[pam:[v[G1]],jim:[v[G1],v[G2]]]
Finally, here is a third way, where the dedup step is applied sooner. This is likely to be the most efficient of the three.
gremlin> g.V('john').as('j').
......1> out().
......2> aggregate('owns').
......3> in('bought').
......4> where(neq('j')).
......5> dedup().
......6> group().
......7> by(id).
......8> by(out('bought').where(within('owns')).fold())
==>[pam:[v[G1]],jim:[v[G1],v[G2]]]
UPDATED based on comments discussion. I'm not sure that this is a simpler query but you can extract a group from a projection like this:
gremlin> g.V('john').as('j').
......1> out().as('johnGames').
......2> in('bought').
......3> where(neq('j')).as('personPurchasedJohnGames').
......4> project('johnGames','personPurchasedJohnGames').
......5> by(select('johnGames')).
......6> by(select('personPurchasedJohnGames')).
......7> group().
......8> by(select('personPurchasedJohnGames')).
......9> by(select('johnGames').fold())
==>[v[pam]:[v[G1]],v[jim]:[v[G1],v[G2]]]
but actually you can further reduce this to
gremlin> g.V('john').as('j').
......1> out().as('johnGames').
......2> in('bought').
......3> where(neq('j')).as('personPurchasedJohnGames').
......4> group().
......5> by(select('personPurchasedJohnGames')).
......6> by(select('johnGames').fold())
==>[v[pam]:[v[G1]],v[jim]:[v[G1],v[G2]]]
So now we have many choices to pick from! It will be interesting to measure these and see if any are faster than the others. In general I tend to avoid as() steps, as they cause path tracking to be turned on (using up memory), but since we already have an as('j') in the other queries it's not really a big deal here.
EDITED AGAIN to add ordering of results
g.V('john').as('j').
out().as('johnGames').
in('bought').
where(neq('j')).as('personPurchasedJohnGames').
group().
by(select('personPurchasedJohnGames')).
by(select('johnGames').fold()).
unfold().
order().
by(select(values).count(local),desc)
{v[jim]: [v[G1], v[G2]]}
{v[pam]: [v[G1]]}
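For intuition, here is the same group-and-order computation in plain Python (this is not gremlin_python, just the equivalent logic run over the sample data above):

```python
# Sample ownership data from the question.
bought = {
    "john": {"G1", "G2", "G3"},
    "jim": {"G1", "G2"},
    "pam": {"G1"},
}

def common_games(target):
    """For each other person, the games they share with target,
    ordered by number of shared games, descending."""
    owns = bought[target]
    shared = {
        p: sorted(owns & games)          # intersection with target's games
        for p, games in bought.items()
        if p != target and owns & games  # skip target and non-overlapping people
    }
    return sorted(shared.items(), key=lambda kv: len(kv[1]), reverse=True)

print(common_games("john"))
# [('jim', ['G1', 'G2']), ('pam', ['G1'])]
```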
I'm currently reading The Practitioner's Guide to Graph Data and am trying to solve the following problem (just for learning purposes). The following is in the context of the book's movie dataset, which in this example makes use of a "Tag" vertex, a "Movie" vertex, and a "rated" edge which has a rating property with a value of 1-5.
Just for practice, and to extend my understanding of concepts from the book, I would like to get all movies tagged with "comedy" and calculate the mean NPS. To do this, I want to aggregate all positive (+1) and neutral or negative (-1) ratings into a list. Then I wish to divide the sum of these values by the number of values in this list (the mean). This is what I attempted:
dev.withSack{[]}{it.clone()}. // create a sack with an empty list that clones when split
V().has('Tag', 'tag_name', 'comedy').
in('topic_tagged').as('film'). // walk to movies tagged as comedy
inE('rated'). // walk to the rated edges
choose(values('rating').is(gte(3.0)),
sack(addAll).by(constant([1.0])),
sack(addAll).by(constant([-1.0]))). // add a value of 1 or -1 to this movie's list, depending on the rating
group().
by(select('film').values('movie_title')).
by(project('a', 'b').
by(sack().unfold().sum()). // add all values from the list
by(sack().unfold().count()). // Count the values in the list
math('a / b')).
order(local).
by(values, desc)
This ends up with each movie either being "1.0" or "-1.0".
"Journey of August King The (1995)": "1.0",
"Once Upon a Time... When We Were Colored (1995)": "1.0", ...
In my testing, it seems the values aren't aggregating into the collection how I expected. I've tried various approaches but none of them achieve my expected result.
I am aware that I can achieve this result by adding and subtracting from a sack with an initial value of "0.0", then dividing by the edge count, but I am hoping for a more efficient solution by using a list and avoiding an additional traversal to the edges to get the count.
Is it possible to achieve my result using a list? If so, how?
Edit 1:
The much simpler code below, taken from Kelvin's example, will aggregate each rating by simply using the fold step:
dev.V().
has('Tag', 'tag_name', 'comedy').
in('topic_tagged').
project('movie', 'result').
by('movie_title').
by(inE('rated').
choose(values('rating').is(gte(3.0)),
constant(1.0),
constant(-1.0)).
fold()) // replace fold() with mean() to calculate the mean, or do something with the collection
I feel a bit embarrassed that I completely forgot about the fold step, as folding and unfolding are so common. Overthinking, I guess.
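To convince myself the weighting works, here is the same computation in plain Python (hypothetical ratings for a single movie; the 3.0 threshold is the one from the query):

```python
# Hypothetical ratings for one movie; >= 3.0 counts as +1.0, else -1.0,
# mirroring choose(values('rating').is(gte(3.0)), constant(1.0), constant(-1.0)).
ratings = [4.5, 2.0, 3.0, 1.5, 5.0]

scores = [1.0 if r >= 3.0 else -1.0 for r in ratings]  # the fold()ed list
mean_nps = sum(scores) / len(scores)                   # mean() over that list
# scores == [1.0, -1.0, 1.0, -1.0, 1.0]; mean_nps == 0.2
```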
You might consider a different approach using aggregate rather than sack. You can also use the mean step to avoid needing the math step. As I don't have your data I made an example that uses the air-routes data set and uses the airport elevation instead of the movie rating in your case.
gremlin> g.V().hasLabel('airport').limit(10).values('elev')
==>1026
==>151
==>542
==>599
==>19
==>143
==>14
==>607
==>64
==>313
Using a weighting system similar to yours yields
gremlin> g.V().hasLabel('airport').limit(10).
......1> choose(values('elev').is(gt(500)),
......2> constant(1),
......3> constant(-1))
==>1
==>-1
==>1
==>1
==>-1
==>-1
==>-1
==>1
==>-1
==>-1
Those results can be aggregated into a bulk set
gremlin> g.V().hasLabel('airport').limit(10).
......1> choose(values('elev').is(gt(500)),
......2> constant(1),
......3> constant(-1)).
......4> aggregate('x').
......5> cap('x')
==>[1,1,1,1,-1,-1,-1,-1,-1,-1]
From there we can take the mean value
gremlin> g.V().hasLabel('airport').limit(10).
......1> choose(values('elev').is(gt(500)),
......2> constant(1),
......3> constant(-1)).
......4> aggregate('x').
......5> cap('x').
......6> unfold().
......7> mean()
==>-0.2
Now, this is of course contrived, as you would not usually write aggregate('x').cap('x').unfold().mean() - you would just use mean() by itself. However, using this pattern you should be able to solve your problem.
EDITED TO ADD
Thinking about this more you can probably write the query without even needing an aggregate - something like this (below). I used the air route distance edge property to simulate something similar to your query. The example just uses one airport to keep it simple. First just creating the list of scores...
gremlin> g.V().has('airport','code','SAF').
......1> project('airport','mean').
......2> by('code').
......3> by(outE().
......4> choose(values('dist').is(gt(350)),
......5> constant(1),
......6> constant(-1)).
......7> fold())
==>[airport:SAF,mean:[1,1,1,-1]]
and finally creating the mean value
gremlin> g.V().has('airport','code','SAF').
......1> project('airport','mean').
......2> by('code').
......3> by(outE().
......4> choose(values('dist').is(gt(350)),
......5> constant(1),
......6> constant(-1)).
......7> mean())
==>[airport:SAF,mean:0.5]
Edited again
If the edge property may not exist, you can do something like this...
gremlin> g.V().has('airport','code','SAF').
......1> project('airport','mean').
......2> by('code').
......3> by(outE().
......4> coalesce(values('x'),constant(100)).
......5> choose(identity().is(gt(350)),
......6> constant(1),
......7> constant(-1)).
......8> fold())
==>[airport:SAF,mean:[-1,-1,-1,-1]]
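The coalesce(values('x'), constant(100)) step is just a default for a missing property. In plain-Python terms (hypothetical edge values, not the air-routes data), dict.get with a default plays the same role:

```python
# Edges modeled as dicts; 'x' may be missing, in which case the default 100
# is used before the > 350 test, mirroring coalesce(values('x'), constant(100)).
edges = [{"x": 400}, {"x": 120}, {}, {}]

scores = [1 if e.get("x", 100) > 350 else -1 for e in edges]
# [1, -1, -1, -1]  (the two missing-property edges fall back to 100)
```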
We can update a vertex for example:
g.V(vertex_id).property("name","Marko")
is there any way to replace a vertex?
So you want to replace all properties of one vertex by properties of another vertex (at least that's how I understand your question together with your comment).
To delete all properties you simply have to drop them:
g.V(vertex_id).properties().drop().iterate()
and we can see how to copy all properties from one vertex to another in this response by Daniel Kuppitz to a question on how to merge two vertices:
g.V(vertex_with_new_properties).
sideEffect(properties().group("p").by(key).by(value())).
cap("p").unfold().as("kv").
V(vertex_id).
property(select("kv").select(keys), select("kv").select(values)).
iterate()
When we combine those two traversals, then we get a traversal that drops the old properties and copies over the new properties from the other vertex:
g.V(vertex_id).
sideEffect(properties().drop()).
V(vertex_with_new_properties).
sideEffect(properties().group("p").by(key).by(value())).
cap("p").unfold().as("kv").
V(vertex_id).
property(select("kv").select(keys), select("kv").select(values)).
iterate()
In action for the modern graph:
// properties before for both vertices:
gremlin> g.V(1).valueMap(true)
==>{id=1, label=person, name=[marko], age=[29]}
gremlin> g.V(2).valueMap(true)
==>{id=2, label=person, name=[vadas], age=[27]}
// Replace all properties of v[1]:
gremlin> g.V(1).
sideEffect(properties().drop()).
V(2).
sideEffect(properties().group("p").by(key).by(value())).
cap("p").unfold().as("kv").
V(1).
property(select("kv").select(keys), select("kv").select(values)).
iterate()
// v[1] properties after:
gremlin> g.V(1).valueMap(true)
==>{id=1, label=person, name=[vadas], age=[27]}
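To see the semantics outside of Gremlin, here is a plain-Python sketch (vertices modeled as property dicts; not a Gremlin API) of what the combined traversal does - drop everything, then copy everything over:

```python
# Vertices modeled as property dicts. The 'label' key is protected here only
# because in TinkerPop the label is not a property and is never dropped.
v1 = {"label": "person", "name": "marko", "age": 29}
v2 = {"label": "person", "name": "vadas", "age": 27}

def replace_properties(target, source, protected=("label",)):
    """Drop all of target's (non-protected) properties, then copy
    every key/value pair over from source."""
    for k in list(target):
        if k not in protected:
            del target[k]            # mirrors properties().drop()
    for k, v in source.items():
        if k not in protected:
            target[k] = v            # mirrors property(select(keys), select(values))
    return target

replace_properties(v1, v2)
# v1 is now {'label': 'person', 'name': 'vadas', 'age': 27}
```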
We use OrientDB and when using the Gremlin terminal, we cannot query for a single user id.
We have this
gremlin> g.V('#class','PERSON')[0..<5].map();
==>{id=50269488}
==>{id=55225663}
==>{id=6845786}
==>{id=55226938}
==>{id=55226723}
gremlin> g.V('#class','PERSON').has('id',50269488)[0..<5].map();
gremlin>
As you can see I tried filtering for that first id, but it doesn't return anything. I even tried typecasting to 50269488L as suggested here
any tips what to try?
I guess it's because property id is reserved somehow.
An example:
gremlin> g.V.id
==>#15:0
==>#15:1
...
This returns the RecordId instead of the property id.
From studio, e.g.:
create class PERSON extends V
create Property PERSON.id2 long
create vertex PERSON set id2 = 12345
Then this should work:
gremlin> g.V('#class','PERSON').has('id2',12345L)[0..<5].map();
==>{id2=12345}
UPDATE:
A workaround to this problem is to filter with getProperty method:
g.V('#class','PERSON').filter{it.getProperty("id")==12345}[0..<5].map();