Find people who bought the same games as someone else - python-3.x

I'm using Amazon Neptune to create and query a simple graph database. I'm currently running my code in an AWS Jupyter Notebook but will eventually move the code to Python (gremlin_python). As you can probably guess I'm pretty new to Gremlin and graph databases in general.
I have the following data
g.addV('person').property(id, 'john')
.addV('person').property(id, 'jim')
.addV('person').property(id, 'pam')
.addV('game').property(id, 'G1')
.addV('game').property(id, 'G2')
.addV('game').property(id, 'G3').iterate()
g.V('john').as('p').V('G1').addE('bought').from('p').iterate()
g.V('john').as('p').V('G2').addE('bought').from('p').iterate()
g.V('john').as('p').V('G3').addE('bought').from('p').iterate()
g.V('jim').as('p').V('G1').addE('bought').from('p').iterate()
g.V('jim').as('p').V('G2').addE('bought').from('p').iterate()
g.V('pam').as('p').V('G1').addE('bought').from('p').iterate()
That gives me 3 persons and 3 games in the database. My goal is: given a person, find which other persons have bought the same games as them, and which games those are.
After looking at sample code (mostly from https://tinkerpop.apache.org/docs/current/recipes/#recommendation) I have the following code that tries to find the persons who bought the same games as the target:
g.V('john').as('target') // Target person we are interested in comparing against
.out('bought').aggregate('target_games') // Games bought by target
.in('bought').where(P.neq('target')).dedup() // Persons who bought same games as target (excluding target and without duplicates)
.group().by().by(out("bought").where(P.within("target_games")).count()) // Group persons by number of co-owned games
.unfold().order().by(values, desc).toList() // Unfold to create list, order by greatest number of common games
Which gives me the results:
{v[jim]: 2}
{v[pam]: 1}
Which tells me that jim has 2 of the same games as john while pam only has 1. But I want my query to return the actual games they have in common like so (still in order of most common games):
{v[jim]: ['G1', 'G2']}
{v[pam]: ['G1']}
Thanks for your help.

There are a few different ways this query could be written. Here is one way that uses a mid-traversal V step: having found John's games, it finds all the other people who are not John, looks at their games, and checks whether they intersect with the games that John owns.
gremlin> g.V('john').as('j').
......1> out().
......2> aggregate('owns').
......3> V().
......4> hasLabel('person').
......5> where(neq('j')).
......6> group().
......7> by(id).
......8> by(out('bought').where(within('owns')).dedup().fold())
==>[pam:[v[G1]],jim:[v[G1],v[G2]]]
However, the mid-traversal V approach is not really needed, as you can just look at the incoming vertices from the games that John owns:
gremlin> g.V('john').as('j').
......1> out().
......2> aggregate('owns').
......3> in('bought').
......4> where(neq('j')).
......5> group().
......6> by(id).
......7> by(out('bought').where(within('owns')).dedup().fold())
==>[pam:[v[G1]],jim:[v[G1],v[G2]]]
Finally, here is a third way, where the dedup step is applied sooner. This is likely to be the most efficient of the three.
gremlin> g.V('john').as('j').
......1> out().
......2> aggregate('owns').
......3> in('bought').
......4> where(neq('j')).
......5> dedup().
......6> group().
......7> by(id).
......8> by(out('bought').where(within('owns')).fold())
==>[pam:[v[G1]],jim:[v[G1],v[G2]]]
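For anyone moving this to gremlin_python later, the shape of the result can be sketched with plain Python sets over the question's sample data. This is an in-memory stand-in for illustration, not a Neptune/gremlin_python call:

```python
# Plain-Python sketch of the "common games" logic, using the sample data
# from the question (john/jim/pam and games G1-G3).
bought = {
    "john": {"G1", "G2", "G3"},
    "jim": {"G1", "G2"},
    "pam": {"G1"},
}

def common_games(target):
    """Map each other person to the sorted list of games shared with target."""
    owns = bought[target]
    return {
        person: sorted(owns & games)
        for person, games in bought.items()
        if person != target and owns & games
    }

print(common_games("john"))  # {'jim': ['G1', 'G2'], 'pam': ['G1']}
```

The set intersection `owns & games` plays the role of `where(within('owns'))` in the Gremlin queries above.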
UPDATED based on the discussion in the comments. I'm not sure that this is a simpler query, but you can extract a group from a projection like this:
gremlin> g.V('john').as('j').
......1> out().as('johnGames').
......2> in('bought').
......3> where(neq('j')).as('personPurchasedJohnGames').
......4> project('johnGames','personPurchasedJohnGames').
......5> by(select('johnGames')).
......6> by(select('personPurchasedJohnGames')).
......7> group().
......8> by(select('personPurchasedJohnGames')).
......9> by(select('johnGames').fold())
==>[v[pam]:[v[G1]],v[jim]:[v[G1],v[G2]]]
but actually you can further reduce this to
gremlin> g.V('john').as('j').
......1> out().as('johnGames').
......2> in('bought').
......3> where(neq('j')).as('personPurchasedJohnGames').
......4> group().
......5> by(select('personPurchasedJohnGames')).
......6> by(select('johnGames').fold())
==>[v[pam]:[v[G1]],v[jim]:[v[G1],v[G2]]]
So now we have many choices to pick from! It will be interesting to measure these and see if any are faster than the others. In general I tend to avoid as steps, because they cause path tracking to be turned on (using up memory), but since we already have an as('j') in the other queries it's not really a big deal here.
EDITED AGAIN to add ordering of results
g.V('john').as('j').
out().as('johnGames').
in('bought').
where(neq('j')).as('personPurchasedJohnGames').
group().
by(select('personPurchasedJohnGames')).
by(select('johnGames').fold()).
unfold().
order().
by(select(values).count(local),desc)
{v[jim]: [v[G1], v[G2]]}
{v[pam]: [v[G1]]}
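The final ordering step amounts to sorting the grouped map by the size of each value list. A plain-Python sketch of just that step, using the result shape from the answer above:

```python
# Sort (person, games) pairs by number of shared games, descending --
# the plain-Python analogue of unfold().order().by(select(values).count(local), desc).
results = {"pam": ["G1"], "jim": ["G1", "G2"]}
ordered = sorted(results.items(), key=lambda kv: len(kv[1]), reverse=True)
print(ordered)  # [('jim', ['G1', 'G2']), ('pam', ['G1'])]
```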

Tinkerpop Gremlin - How to aggregate variables into traversal independant collections

I'm currently reading The Practitioner's Guide to Graph Data and am trying to solve the following problem (just for learning purposes). The following is in the context of the book's movie dataset, which in this example makes use of a "Tag" vertex, a "Movie" vertex and a "rated" edge that has a rating property with a value of 1-5.
Just for practice, and to extend my understanding of concepts from the book, I would like to get all movies tagged with "comedy" and calculate the mean NPS. To do this, I want to aggregate all positive (+1) and neutral or negative (-1) ratings into a list. Then I wish to divide the sum of these values by the number of values in this list (the mean). This is what I attempted:
dev.withSack{[]}{it.clone()}. // create a sack with an empty list that clones when split
V().has('Tag', 'tag_name', 'comedy').
in('topic_tagged').as('film'). // walk to movies tagged as comedy
inE('rated'). // walk to the rated edges
choose(values('rating').is(gte(3.0)),
       sack(addAll).by(constant([1.0])),
       sack(addAll).by(constant([-1.0]))). // add a value of 1 or -1 to this movie's list, depending on the rating
group().
by(select('film').values('movie_title')).
by(project('a', 'b').
by(sack().unfold().sum()). // add all values from the list
by(sack().unfold().count()). // Count the values in the list
math('a / b')).
order(local).
by(values, desc)
This ends up with each movie either being "1.0" or "-1.0".
"Journey of August King The (1995)": "1.0",
"Once Upon a Time... When We Were Colored (1995)": "1.0", ...
In my testing, it seems the values aren't aggregating into the collection how I expected. I've tried various approaches but none of them achieve my expected result.
I am aware that I can achieve this result by adding and subtracting from a sack with an initial value of "0.0", then dividing by the edge count, but I am hoping for a more efficient solution by using a list and avoiding an additional traversal to the edges to get the count.
Is it possible to achieve my result using a list? If so, how?
Edit 1:
The much simpler code below, taken from Kelvin's example, will aggregate each rating by simply using the fold step:
dev.V().
has('Tag', 'tag_name', 'comedy').
in('topic_tagged').
project('movie', 'result').
by('movie_title').
by(inE('rated').
choose(values('rating').is(gte(3.0)),
constant(1.0),
constant(-1.0)).
fold()) // replace fold() with mean() to calculate the mean, or do something with the collection
I feel a bit embarrassed that I completely forgot about the fold step, as folding and unfolding are so common. Overthinking, I guess.
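The choose/fold scoring in the Edit 1 query can be sketched in plain Python. The ratings below are made-up sample values, not the book's dataset; the point is just the +1/-1 mapping and the mean:

```python
# Each rating >= 3.0 scores +1, anything lower scores -1; the per-movie
# NPS is the mean of those scores. Sample ratings are illustrative.
ratings_by_movie = {
    "Movie A": [4.0, 5.0, 2.0],
    "Movie B": [1.0, 2.0],
}

def nps(ratings):
    scores = [1.0 if r >= 3.0 else -1.0 for r in ratings]
    return sum(scores) / len(scores)   # the math('a / b') part

print({movie: nps(rs) for movie, rs in ratings_by_movie.items()})
```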
You might consider a different approach using aggregate rather than sack. You can also use the mean step to avoid needing the math step. As I don't have your data, I made an example that uses the air-routes data set, with the airport elevation standing in for the movie rating in your case.
gremlin> g.V().hasLabel('airport').limit(10).values('elev')
==>1026
==>151
==>542
==>599
==>19
==>143
==>14
==>607
==>64
==>313
Using a weighting system similar to yours yields
gremlin> g.V().hasLabel('airport').limit(10).
......1> choose(values('elev').is(gt(500)),
......2> constant(1),
......3> constant(-1))
==>1
==>-1
==>1
==>1
==>-1
==>-1
==>-1
==>1
==>-1
==>-1
Those results can be aggregated into a bulk set
gremlin> g.V().hasLabel('airport').limit(10).
......1> choose(values('elev').is(gt(500)),
......2> constant(1),
......3> constant(-1)).
......4> aggregate('x').
......5> cap('x')
==>[1,1,1,1,-1,-1,-1,-1,-1,-1]
From there we can take the mean value
gremlin> g.V().hasLabel('airport').limit(10).
......1> choose(values('elev').is(gt(500)),
......2> constant(1),
......3> constant(-1)).
......4> aggregate('x').
......5> cap('x').
......6> unfold().
......7> mean()
==>-0.2
Now, this is of course contrived: you would not usually write aggregate('x').cap('x').unfold().mean(), you would just use mean() by itself. However, using this pattern you should be able to solve your problem.
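As a sanity check, the same +1/-1 weighting and mean can be reproduced in plain Python using the ten elevation values printed earlier:

```python
# Reproduce the elevation weighting: +1 if elev > 500, else -1, then mean.
# The ten values are copied from the air-routes sample output above.
elevs = [1026, 151, 542, 599, 19, 143, 14, 607, 64, 313]
scores = [1 if e > 500 else -1 for e in elevs]
print(sum(scores) / len(scores))  # -0.2
```

Four elevations exceed 500, six do not, so the mean is (4 - 6) / 10 = -0.2, matching the Gremlin output.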
EDITED TO ADD
Thinking about this more, you can probably write the query without even needing an aggregate - something like this (below). I used the air-routes dist edge property to simulate something similar to your query. The example just uses one airport to keep it simple. First, just creating the list of scores...
gremlin> g.V().has('airport','code','SAF').
......1> project('airport','mean').
......2> by('code').
......3> by(outE().
......4> choose(values('dist').is(gt(350)),
......5> constant(1),
......6> constant(-1)).
......7> fold())
==>[airport:SAF,mean:[1,1,1,-1]]
and finally creating the mean value
gremlin> g.V().has('airport','code','SAF').
......1> project('airport','mean').
......2> by('code').
......3> by(outE().
......4> choose(values('dist').is(gt(350)),
......5> constant(1),
......6> constant(-1)).
......7> mean())
==>[airport:SAF,mean:0.5]
Edited again
If the edge property may not exist, you can do something like this...
gremlin> g.V().has('airport','code','SAF').
......1> project('airport','mean').
......2> by('code').
......3> by(outE().
......4> coalesce(values('x'),constant(100)).
......5> choose(identity().is(gt(350)),
......6> constant(1),
......7> constant(-1)).
......8> fold())
==>[airport:SAF,mean:[-1,-1,-1,-1]]
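The coalesce-with-a-default idea maps naturally onto dict.get in plain Python. The edge property maps below are illustrative, not real air-routes edges:

```python
# When the property is missing, fall back to a default before scoring --
# the plain-Python analogue of coalesce(values('x'), constant(100)).
edges = [{"dist": 400}, {"dist": 200}, {}]  # last edge lacks the property
scores = [1 if e.get("dist", 100) > 350 else -1 for e in edges]
print(scores)  # [1, -1, -1]
```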

If there are vertices (e.g. Star, Movie) and edges (e.g. star_in, director, ...) in ArangoDB, how to query movies which someone both starred in and directed?

If there are vertices (e.g. Star, Movie) and edges (e.g. star_in, director, producer) in ArangoDB, and I want to get the movies that Stephen Chow both starred in and directed, how do I write the query statement?
In this case you can use an AQL graph traversal:
FOR n IN ANY #startId ##edgeCollection OPTIONS {bfs:true,uniqueVertices: 'global'}
RETURN n._id
ANY/INBOUND/OUTBOUND determines the direction of the edges, while #startId is your start vertex (in this case Stephen Chow) and ##edgeCollection is the edge collection to use.
When two conditions should be applied (starring and directing), an INTERSECTION of two such traversals can be used.
The following AQL query is a draft for your use case:
FOR x IN INTERSECTION
((FOR y IN ANY 'star/StephenChow' star_in OPTIONS {bfs: true, uniqueVertices: 'global'} RETURN y._id),
(FOR y IN ANY 'star/StephenChow' director OPTIONS {bfs: true, uniqueVertices: 'global'} RETURN y._id))
RETURN x
A working Actor/Movie example can be found in the Cookbook section of the documentation.
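The INTERSECTION of the two traversals reduces to a set intersection of neighbor ids. A plain-Python sketch with made-up movie ids (not real ArangoDB data):

```python
# Movies reachable via star_in AND via director from the same start vertex --
# the plain-Python analogue of INTERSECTION over two neighbor queries.
star_in = {"star/StephenChow": {"movie/KungFuHustle", "movie/Shaolin"}}
director = {"star/StephenChow": {"movie/KungFuHustle"}}

start = "star/StephenChow"
both = star_in[start] & director[start]
print(both)  # {'movie/KungFuHustle'}
```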

I am using JanusGraph 0.2 with Cassandra 3.9. How can I replace a vertex?

We can update a vertex for example:
g.V(vertex_id).property("name","Marko")
is there any way to replace a vertex?
So you want to replace all properties of one vertex by properties of another vertex (at least that's how I understand your question together with your comment).
To delete all properties you simply have to drop them:
g.V(vertex_id).properties().drop().iterate()
and we can see how to copy all properties from one vertex to another in this response by Daniel Kuppitz to a question on how to merge two vertices:
g.V(vertex_with_new_properties).
sideEffect(properties().group("p").by(key).by(value())).
cap("p").unfold().as("kv").
V(vertex_id).
property(select("kv").select(keys), select("kv").select(values)).
iterate()
When we combine those two traversals, then we get a traversal that drops the old properties and copies over the new properties from the other vertex:
g.V(vertex_id).
sideEffect(properties().drop()).
V(vertex_with_new_properties).
sideEffect(properties().group("p").by(key).by(value())).
cap("p").unfold().as("kv").
V(vertex_id).
property(select("kv").select(keys), select("kv").select(values)).
iterate()
In action for the modern graph:
// properties before for both vertices:
gremlin> g.V(1).valueMap(true)
==>{id=1, label=person, name=[marko], age=[29]}
gremlin> g.V(2).valueMap(true)
==>{id=2, label=person, name=[vadas], age=[27]}
// Replace all properties of v[1]:
gremlin> g.V(1).
sideEffect(properties().drop()).
V(2).
sideEffect(properties().group("p").by(key).by(value())).
cap("p").unfold().as("kv").
V(1).
property(select("kv").select(keys), select("kv").select(values)).
iterate()
// v[1] properties after:
gremlin> g.V(1).valueMap(true)
==>{id=1, label=person, name=[vadas], age=[27]}
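The drop-then-copy pattern is, in effect, a dictionary replace. A plain-Python sketch using the modern-graph values from the transcript above (dicts standing in for vertex property sets):

```python
# Replace all of v1's properties with v2's properties.
v1 = {"name": "marko", "age": 29}
v2 = {"name": "vadas", "age": 27}

v1.clear()     # analogue of g.V(1).properties().drop()
v1.update(v2)  # analogue of copying each key/value pair via property()

print(v1)  # {'name': 'vadas', 'age': 27}
```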

Output vertex and adjacent vertices in same query with tinkerpop3

I'm new to gremlin and trying to find out how to get an article along with the author and attachments in the same result using Azure Cosmos DB with GraphSON.
My graph looks like this:
[User] <- (edge: author) - [Article] - (edge: attachments) -> [File1, File2]
I would like to fetch everything I need in the UI to show an article along with the author and info about attachments in one request.
What I'm trying to fetch is something similar to this pseudo-code:
{
article: {...},
author: [{author1}],
attachment: [{file1}, {file2}]
}
My attempt so far:
g.V().hasLabel('article').as('article').out('author', 'attachments').as('author','attachments').select('article', 'author', 'attachments')
How can I write the query to get the distinct values?
When asking questions about Gremlin it is always helpful to provide some sample data in a form like this:
g.addV('user').property('name','jim').as('jim').
addV('user').property('name','alice').as('alice').
addV('user').property('name','bill').as('bill').
addV('article').property('title','Gremlin for Beginners').as('article').
addV('file').property('file','/files/a.png').as('a').
addV('file').property('file','/files/b.png').as('b').
addE('authoredBy').from('article').to('jim').
addE('authoredBy').from('article').to('alice').
addE('authoredBy').from('article').to('bill').
addE('attaches').from('article').to('a').
addE('attaches').from('article').to('b').iterate()
Note that I modified your edge label names to be more verb-like so that they distinguish themselves better from the noun-like vertex labels. It tends to read nicely with the direction of the edge, as in: article --authoredBy-> user
Anyway, your problem is most easily solved with the project() step:
gremlin> g.V().has('article','title','Gremlin for Beginners').
......1> project('article','authors','attachments').
......2> by().
......3> by(out('authoredBy').fold()).
......4> by(out('attaches').fold())
==>[article:v[6],authors:[v[0],v[2],v[4]],attachments:[v[10],v[8]]]
In the above code, note the use of fold() within the by() steps - that will force the full iteration of the inner traversal and get it into a list. If you miss that step you will get just one result (i.e. the first).
Going one step further, I added valueMap() and next'd the result so that you could better see the properties contained in the vertices above.
gremlin> g.V().has('article','title','Gremlin for Beginners').
......1> project('article','authors','attachments').
......2> by(valueMap()).
......3> by(out('authoredBy').valueMap().fold()).
......4> by(out('attaches').valueMap().fold()).next()
==>article={title=[Gremlin for Beginners]}
==>authors=[{name=[jim]}, {name=[alice]}, {name=[bill]}]
==>attachments=[{file=[/files/b.png]}, {file=[/files/a.png]}]
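What project() assembles here is just a map whose by() values are folded lists. A plain-Python sketch of the resulting shape, using the sample data from the answer (dicts and lists standing in for vertices):

```python
# Build the article/authors/attachments map that project() produces.
# fold() corresponds to collecting ALL traversal results into a list;
# without it you'd keep only the first author or attachment.
edges = {
    "authoredBy": ["jim", "alice", "bill"],
    "attaches": ["/files/a.png", "/files/b.png"],
}
result = {
    "article": "Gremlin for Beginners",
    "authors": list(edges["authoredBy"]),    # folded list, not just "jim"
    "attachments": list(edges["attaches"]),
}
print(result)
```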

Gremlin access a property "id"

We use OrientDB and when using the Gremlin terminal, we cannot query for a single user id.
We have this
gremlin> g.V('#class','PERSON')[0..<5].map();
==>{id=50269488}
==>{id=55225663}
==>{id=6845786}
==>{id=55226938}
==>{id=55226723}
gremlin> g.V('#class','PERSON').has('id',50269488)[0..<5].map();
gremlin>
As you can see I tried filtering for that first id, but it doesn't return anything. I even tried typecasting to 50269488L as suggested here
any tips what to try?
I guess it's because property id is reserved somehow.
An example:
gremlin> g.V.id
==>#15:0
==>#15:1
...
This returns the RecordId instead of the property id.
From studio, e.g.:
create class PERSON extends V
create Property PERSON.id2 long
create vertex PERSON set id2 = 12345
Then this should work:
gremlin> g.V('#class','PERSON').has('id2',12345L)[0..<5].map();
==>{id2=12345}
UPDATE:
A workaround to this problem is to filter with getProperty method:
g.V('#class','PERSON').filter{it.getProperty("id")==12345}[0..<5].map();
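The closure workaround simply compares the stored property value rather than the record id. A plain-Python sketch with illustrative data (dicts standing in for PERSON vertices):

```python
# Filter on the user-defined "id" property value, not the database's
# own record id -- the analogue of filter{it.getProperty("id")==...}.
people = [{"id": 50269488}, {"id": 55225663}, {"id": 6845786}]
matches = [p for p in people if p.get("id") == 50269488]
print(matches)  # [{'id': 50269488}]
```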
