How to use UNWIND to execute a block composed of a MATCH and two FOREACHs? - node.js

I'm running Neo4j queries from node.js using the neo4j-driver. A lot has been simplified here to cut irrelevant information, but everything needed is included.
I have been trying to make a query to ingest a data set with some quirks, defined as follows:
Curriculum: A list of Publications
Publication: Contains data about a publication and a field that is a list of Authors
Author: Relevant fields are externalId and normalizedFullName.
externalId is an id that comes from the data's origin system. It is not guaranteed to be present, but if it is, it will uniquely identify a node
normalizedFullName will always be present, and it's OK to assume the same author will always have the same name wherever it appears; it is also acceptable that a full name may not be unique and that at some point two different persons may be stored as the same node
It is possible for an author to be part of one publication with only its normalizedFullName and part of another with both normalizedFullName AND externalId. As you can see, the data is not very consistent, but that is not a problem for my purposes.
It will look like this (don't mind any syntax errors):
"curriculum": [
{
"data": {
"fieldA": "a",
"fieldB": "b"
},
"authors": [
{
"externalId": "",
"normalizedFullName": "namea namea"
},
{
"externalId": "123456",
"normalizedFullName": "nameb nameb"
}
]
},
{
"data": {
"fieldA": "d",
"fieldB": "e"
},
"authors": [
{
"externalId": "123321",
"normalizedFullName": "namea namea"
},
{
"externalId": "123456",
"normalizedFullName": "nameb nameb"
}
]
}
]
Merging everything
Merging the publication part is trivial, but things get complicated when it comes to the authors since I have to follow this logic (simplified here) to merge an author:
IF the author doesn't have an externalId OR no node has already been created with that externalId THEN
merge by normalizedFullName
ELSE IF there is already a node with this externalId THEN
merge by externalId
So, acknowledging that I would need some kind of conditional merge, and finding that it could be achieved with "the FOREACH trick", I was able to come up with this little monster (comments added for clarity):
// For each publication, merge it
UNWIND {publications} as publication
MERGE (p:Publication { fieldA: publication.data.fieldA, fieldB: publication.data.fieldB })
ON CREATE SET p = publication.data
WITH p, publication.authors AS authors
// Then, for each author in this publication
UNWIND authors AS author
// IF the author doesn't have an externalId OR no node has already been created with that externalId THEN
MATCH (a:Author) WHERE a.externalId = author.data.externalId AND a.externalId <> '' WITH count(a) as found, author, p
// Merge by name
FOREACH(ignoreMe IN CASE WHEN found = 0 THEN [1] ELSE [] END |
MERGE (aa:Author { normalizedFullName: author.data.normalizedFullName })
ON CREATE SET aa = author.data
MERGE (aa)-[:CONTRIBUTED]->(p)
)
// Else, merge by externalId
FOREACH(ignoreMe IN CASE WHEN found > 0 THEN [1] ELSE [] END |
MERGE (aa:Author { externalId: author.data.externalId })
ON CREATE SET aa = author.data
MERGE (aa)-[:CONTRIBUTED]->(p)
)
Note: this is not the real query I'm using, but it shows the exact structure.
The Problem
It doesn't work. It only creates the publications (correctly) and never the authors. It seems the MATCH, the FOREACH, or a combination of both is breaking the loop I expected UNWIND to produce.
I'm at a point where I can't find a way to do it properly. I also can't figure out what is wrong, even after checking the available documentation.
So, what do I do?
(let me know if any more information is needed)
Thanks in advance for any insight!

First of all: author.data.externalId does not exist. The right property path is author.externalId (without data). The same goes for author.data.normalizedFullName.
I simulated your scenario by putting your data set as a parameter in the Neo4j Browser interface. After that I ran your query. As expected, the authors are never created.
I corrected your query with these steps:
Changed author.data.externalId to author.externalId and author.data.normalizedFullName to author.normalizedFullName.
Changed MATCH (a:Author) to OPTIONAL MATCH (a:Author) to ensure that the query will continue even if no results are found.
Removed count(a) as found (no longer necessary) and changed the tests from found = 0 to a IS NULL and from found > 0 to a IS NOT NULL.
Your corrected query:
UNWIND {publications} as publication
MERGE (p:Publication { fieldA: publication.data.fieldA, fieldB: publication.data.fieldB })
ON CREATE SET p = publication.data
WITH p, publication.authors AS authors
UNWIND authors AS author
OPTIONAL MATCH (a:Author) WHERE a.externalId = author.externalId AND a.externalId <> '' WITH a, author, p
FOREACH(ignoreMe IN CASE WHEN a IS NULL THEN [1] ELSE [] END |
MERGE (aa:Author { normalizedFullName: author.normalizedFullName })
ON CREATE SET aa = author
MERGE (aa)-[:CONTRIBUTED]->(p)
)
FOREACH(ignoreMe IN CASE WHEN a IS NOT NULL THEN [1] ELSE [] END |
MERGE (aa:Author { externalId: author.externalId })
ON CREATE SET aa = author
MERGE (aa)-[:CONTRIBUTED]->(p)
)
The data set was created as expected after I ran this query (the screenshot of the resulting graph is omitted here).
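Since the question mentions running this from node.js with the neo4j-driver, here is a minimal sketch of how the {publications} parameter could be passed from your code. The connection URL, credentials, and variable names below are assumptions, not part of the original question, and newer driver versions use the $publications parameter syntax instead:
// Minimal sketch: passing the curriculum array as the "publications" parameter.
// Connection details and names below are placeholders.
const neo4j = require('neo4j-driver').v1; // 1.x driver versions expose the API under .v1

const driver = neo4j.driver('bolt://localhost:7687', neo4j.auth.basic('neo4j', 'password'));
const session = driver.session();

const curriculum = [ /* the publications array from the data set above */ ];

const cypher = `
UNWIND {publications} AS publication
MERGE (p:Publication { fieldA: publication.data.fieldA, fieldB: publication.data.fieldB })
ON CREATE SET p = publication.data
// ... rest of the corrected query above ...
`;

session.run(cypher, { publications: curriculum })
  .then(result => {
    session.close();
    driver.close();
  })
  .catch(err => console.error(err));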

I think the problem (or at least one problem) is that if your author MATCH fails, the entire row for that author will be wiped out, and the rest of the query will not execute for that author.
Try using OPTIONAL MATCH instead; that will preserve the row and allow the query to finish for those rows.
As for additional options on how to do conditional cypher operations, we actually just released new versions of APOC Procedures with conditional cypher execution, so take a look at apoc.do.when() when you get the chance.
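For illustration, here is a sketch of how the query above could use apoc.do.when() instead of the two FOREACH blocks. This assumes APOC is installed, and the parameter syntax inside the inner queries may vary with your Neo4j/APOC version; treat it as an outline rather than a drop-in replacement:
// Sketch only: the same conditional merge, expressed with apoc.do.when().
const apocCypher = `
UNWIND {publications} AS publication
MERGE (p:Publication { fieldA: publication.data.fieldA, fieldB: publication.data.fieldB })
ON CREATE SET p = publication.data
WITH p, publication.authors AS authors
UNWIND authors AS author
OPTIONAL MATCH (a:Author) WHERE a.externalId = author.externalId AND a.externalId <> ''
WITH a, author, p
CALL apoc.do.when(
  a IS NOT NULL,
  'WITH $author AS author, $p AS p MERGE (aa:Author { externalId: author.externalId }) ON CREATE SET aa = author MERGE (aa)-[:CONTRIBUTED]->(p) RETURN aa',
  'WITH $author AS author, $p AS p MERGE (aa:Author { normalizedFullName: author.normalizedFullName }) ON CREATE SET aa = author MERGE (aa)-[:CONTRIBUTED]->(p) RETURN aa',
  { author: author, p: p }
) YIELD value
RETURN count(*)
`;
// Run it the same way as above: session.run(apocCypher, { publications: curriculum })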

Related

ArangoDB populate relation as field over graph query

I recently started using Arango since I want to make use of the advantages of graph databases. However, I'm not yet sure what the most elegant and efficient approach is to query an item from a document collection and apply fields to it that are part of a relation.
I'm used to using population or joins in SQL and NoSQL databases, but I'm not sure how it works here.
I created a document collection called posts. For example, this is a post:
{
"title": "Foo",
"content": "Bar"
}
And I also have a document collection called tags. A post can have any number of tags, and my goal is to fetch either all or specific posts, but with their tags included, so that, for example, this would be my returned query result:
{
"title": "Foo",
"content": "Bar",
"tags": ["tag1", "tag2"]
}
I tried creating those two document collections and an edge collection post-tags-relation where I added an item for each tag from the post to the tag. I also created a graph, although I'm not yet sure what the vertex field is used for.
My query looked like this
FOR v, e, p IN 1..2 OUTBOUND 'posts/testPost' GRAPH 'post-tags-relation' RETURN v
And it did give me the tag, but my goal is to fetch a post and include the tags in the same document... The path vertices do contain all the tags and the post, but in separate arrays, which is not nice and easy to use (and probably not the right way). I'm probably missing something important here. Hopefully someone can help.
You're really close - it looks like your query to get the tags is correct. Now, just add a bit to return the source document:
FOR post IN posts
FILTER post._key == 'testPost'
LET tags = (
FOR v IN 1..2 OUTBOUND post
GRAPH 'post-tags-relation'
RETURN v.value
)
RETURN MERGE(
post,
{ tags }
)
Or, if you want to skip the FOR/FILTER process:
LET post = DOCUMENT('posts/testPost')
LET tags = (
FOR v IN 1..2 OUTBOUND post
GRAPH 'post-tags-relation'
RETURN v.value
)
RETURN MERGE(
post,
{ tags }
)
As for graph definition, there are three required fields:
edge definitions (an edge collection)
from collections (where your edges come from)
to collections (where your edges point to)
The non-obvious vertex collections field is there to allow you to include a set of vertex-only documents in your graph. When these documents are searched and how they're filtered remains a mystery to me. Personally, I've never used this feature (my data has always been connected) so I can't say when it would be valuable, but someone thought it was important to include.
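If it helps, this is roughly what defining such a graph looks like from arangosh with the general-graph module. The collection and graph names below are only illustrative, and the third argument to _create is that optional list of vertex-only (orphan) collections:
// Run in arangosh. Names below are illustrative, not taken from the question verbatim.
var graphModule = require("@arangodb/general-graph");

// _relation(edgeCollectionName, fromCollections, toCollections)
var edgeDefinition = graphModule._relation("post-tags-relation", ["posts"], ["tags"]);

// _create(graphName, edgeDefinitions, orphanCollections)
// The empty array at the end is the "vertex collections" (orphan) list discussed above.
graphModule._create("posts-tags-graph", [edgeDefinition], []);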

Efficiently count Documents with different values for a given field

I am trying to count the number of documents that are in each possible state in a particular Arango collection.
This should be possible in one pass over all of the documents, using a bucket-sort-like strategy: iterate over all documents and, if the value of the state field hasn't been seen before, add a counter with a value of 1 to a list; if you have seen that state before, increment its counter. Once you've reached the end, you'll have a counter for each possible state in the DB indicating how many documents are currently stored with that state.
I can't seem to figure out how to write this type of logic in AQL to submit as a query. My current strategy is like this:
Loop over all documents, filtering only docs of a particular state.
Loop over all documents, filtering only docs of a different particular state.
...
All states have been filtered.
Return size of each set
This works, but I'm sure it's much slower than it should be. This also means that if we add a new state, we have to update the query to loop over all docs an additional time, filtering based on the new state. A bucket-sort like query would be quick, and would need no updating as new states are created as well.
If these were the documents:
{A}
{B}
{B}
{C}
{A}
Then I'd like the result to be
{ A:2, B:2, C:1 }
Where A, B, & C are values for a particular field. My current strategy filters like so:
LET docsA = (
FOR doc in collection
FILTER doc.state == A
RETURN doc
)
Then I manually construct the return object, calling LENGTH on each list of docs.
Any help or additional info would be greatly appreciated.
What about using a COLLECT function? (see docs here)
FOR doc IN collection
COLLECT s = doc.state WITH COUNT INTO c
RETURN { state: s, count: c }
This would return something like:
[
{ state: 'A', count: 23 },
{ state: 'B', count: 2 },
{ state: 'C', count: 45 }
]
Would that accomplish what you are after?
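If you specifically want the { A: 2, B: 2, C: 1 } shape from the question, you can reshape those rows on the client side; a minimal plain-JavaScript sketch (the variable names and sample values are illustrative):
// rows = the result of the COLLECT query above
const rows = [
  { state: 'A', count: 23 },
  { state: 'B', count: 2 },
  { state: 'C', count: 45 }
];

const counts = rows.reduce((acc, row) => {
  acc[row.state] = row.count; // one key per distinct state value
  return acc;
}, {});
// counts => { A: 23, B: 2, C: 45 }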

ArangoDb AQL Graph queries traversal example

I am having some trouble wrapping my head around how to traverse a certain graph to extract some data.
Given a collection of "users" and a collection of "places".
And a "likes" edge collection to denote that a user likes a certain place. The "likes" edge collection also has a "review" property to store a user's review about the place.
And a "follows" edge collection to denote that a user follows another user.
How can I traverse the graph to fetch all the places that I like, with my review of each place and the reviews of the users I follow who also like the same place?
For example, in the above graph, I am user 6327 and I reviewed both places (7968 and 16213).
I also follow user 6344, who also happens to have reviewed place 7968.
How can I get all the places that I like and the reviews of the people that I follow who also reviewed the same places that I like?
An expected output would be something like the following:
[
{
name: "my name",
place: "place 1",
id: 1,
review: "my review about place 1"
},
{
name: "my name",
place: "place 2",
id: 2,
review: "my review about place 2"
},
{
name: "name of the user I follow",
place: "place 2",
id: 2,
review: "review about place 2 from the user I follow"
}
]
There are a number of ways to do this query, and it also depends on where you want to add parameters, but for the sake of simplicity I've built this quite verbose query below to help you understand one way of approaching the problem.
One way is to determine the _id of your user record, then find all the _id's of the friends you follow, and then to work out all related reviews in one query.
I take a different approach below, and that is to:
Determine the reviews you have written
Determine who you follow
Determine the reviews the people you follow have written
Merge together your reviews with those of the people you follow
It is possible to merge these queries together more optimally, but I thought it worth breaking them out like this (and showing the output of each stage as well as the final answer) to help you see what data is available.
A key thing to understand about AQL graph queries is how you have access to vertices, edges, and paths when you perform a query.
A path is an object in its own right and it's worth investigating the contents of that object to better understand how to exploit it for path information.
This query assumes:
users document collection contains users
places document collection contains places
follows edge collection tracks users following other users
reviews edge collection tracks reviews people wrote
Note: When providing an id on each record I used the id of the review, because if you know that id you can fetch the edge document and get the id of both the user and the place as well as read all the data about the review.
LET my_reviews = (
FOR vertices, edges, paths IN 1..1 OUTBOUND "users/6327" reviews
RETURN {
name: FIRST(paths.vertices).name,
review_id: FIRST(paths.edges)._id,
review: FIRST(paths.edges).review,
place: LAST(paths.vertices).place
}
)
LET who_i_follow = (
FOR v IN 1..1 OUTBOUND "users/6327" follows
RETURN v
)
LET reviews_of_who_i_follow = (
FOR users IN who_i_follow
FOR vertices, edges, paths in 1..1 OUTBOUND users._id reviews
RETURN {
name: FIRST(paths.vertices).name,
review_id: FIRST(paths.edges)._id,
review: FIRST(paths.edges).review,
place: LAST(paths.vertices).place
}
)
RETURN {
my_reviews: my_reviews,
who_i_follow: who_i_follow,
reviews_of_who_i_follow: reviews_of_who_i_follow,
merged_reviews: UNION(my_reviews, reviews_of_who_i_follow)
}
The first vertex in paths.vertices is the starting vertex (users/6327)
The last vertex in paths.vertices is the end of the path, e.g. who you follow
The first edge in paths.edges is the review that the user made of the place
Here is another more compact version of the query that takes a param, the _id of the user that is 'you'.
LET target_users = APPEND(TO_ARRAY(@user), (
FOR v IN 1..1 OUTBOUND @user follows RETURN v._id
))
LET selected_reviews = (
FOR u IN target_users
FOR vertices, edges, paths in 1..1 OUTBOUND u reviews
LET user = FIRST(paths.vertices)
LET place = LAST(paths.vertices)
LET review = FIRST(paths.edges)
RETURN {
name: user.name,
review_id: review._id,
review: review.review,
place: place.place
}
)
RETURN selected_reviews
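For completeness, a minimal sketch of binding that @user parameter from node.js with arangojs (the server URL and database selection are assumptions and depend on your arangojs version; the short query here just demonstrates the binding and can be swapped for the full compact query above):
// Sketch only: connection details are placeholders.
const { Database } = require("arangojs");
const db = new Database({ url: "http://localhost:8529" }); // select your database as appropriate for your arangojs version

const query = `
  LET target_users = APPEND(TO_ARRAY(@user), (
    FOR v IN 1..1 OUTBOUND @user follows RETURN v._id
  ))
  RETURN target_users
`;

db.query(query, { user: "users/6327" }) // binds @user
  .then(cursor => cursor.all())
  .then(rows => console.log(rows))
  .catch(err => console.error(err));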

How do I keep existing data in couchbase and only update the new data without overwriting

So, say I have created some records/documents under a bucket and the user updates only one column out of 10 in the RDBMS, so I am trying to send only that one column's data and update it in Couchbase. But the problem is that Couchbase is overwriting the entire record and putting NULLs in the rest of the columns.
One approach is to copy all the data from the existing record after fetching it from Couchbase, and then overwrite the new column while copying the data from the old one. But that doesn't look like an optimal approach.
Any suggestions?
You can use N1QL UPDATE statements; google for Couchbase N1QL.
UPDATE replaces a document that already exists with updated values.
update:
UPDATE keyspace-ref [use-keys-clause] [set-clause] [unset-clause] [where-clause] [limit-clause] [returning-clause]
set-clause:
SET path = expression [update-for] [ , path = expression [update-for] ]*
update-for:
FOR variable (IN | WITHIN) path (, variable (IN | WITHIN) path)* [WHEN condition ] END
unset-clause:
UNSET path [update-for] (, path [ update-for ])*
keyspace-ref: Specifies the keyspace for which to update the document.
You can add an optional namespace-name to the keyspace-name in this way:
namespace-name:keyspace-name.
use-keys-clause: Specifies the keys of the data items to be updated. Optional. Keys can be any expression.
set-clause: Specifies the value for an attribute to be changed.
unset-clause: Removes the specified attribute from the document.
update-for: The update-for clause uses the FOR statement to iterate over a nested array and SET or UNSET the given attribute for every matching element in the array.
where-clause: Specifies the condition that needs to be met for data to be updated. Optional.
limit-clause: Specifies the greatest number of objects that can be updated. This clause must have a non-negative integer as its upper bound. Optional.
returning-clause: Returns the data you updated as specified in the result_expression.
RBAC Privileges
The user executing the UPDATE statement must have the Query Update privilege on the target keyspace. If the statement has any clauses that need to read data, such as a SELECT clause or a RETURNING clause, then the Query Select privilege is also required on the keyspaces referred to in the respective clauses. For more details about user roles, see Authorization.
For example,
To execute the following statement, user must have the Query Update privilege on travel-sample.
UPDATE `travel-sample` SET foo = 5
To execute the following statement, user must have the Query Update privilege on the travel-sample and Query Select privilege on beer-sample.
UPDATE `travel-sample`
SET foo = 9
WHERE city = (SELECT raw city FROM `beer-sample` WHERE type = "brewery")
To execute the following statement, user must have the Query Update privilege on `travel-sample` and Query Select privilege on `travel-sample`.
UPDATE `travel-sample`
SET city = "San Francisco"
WHERE lower(city) = "sanfrancisco"
RETURNING *
Example
The following statement changes the "type" of the product, "odwalla-juice1" to "product-juice".
UPDATE product USE KEYS "odwalla-juice1" SET type = "product-juice" RETURNING product.type
"results": [
{
"type": "product-juice"
}
]
This statement removes the "type" attribute from the "product" keyspace for the document with the "odwalla-juice1" key.
UPDATE product USE KEYS "odwalla-juice1" UNSET type RETURNING product.*
"results": [
{
"productId": "odwalla-juice1",
"unitPrice": 5.4
}
]
This statement unsets the "gender" attribute in the "children" array for the document with the key, "dave" in the tutorial keyspace.
UPDATE tutorial t USE KEYS "dave" UNSET c.gender FOR c IN children END RETURNING t
"results": [
{
"t": {
"age": 46,
"children": [
{
"age": 17,
"fname": "Aiden"
},
{
"age": 2,
"fname": "Bill"
}
],
"email": "dave#gmail.com",
"fname": "Dave",
"hobbies": [
"golf",
"surfing"
],
"lname": "Smith",
"relation": "friend",
"title": "Mr.",
"type": "contact"
}
}
]
Starting with version 4.5.1, the UPDATE statement has been improved to SET nested array elements. The FOR clause is enhanced to evaluate functions and expressions, and the new syntax supports multiple nested FOR expressions to access and update fields in nested arrays. Additional array levels are supported by chaining the FOR clauses.
Example
UPDATE default
SET i.subitems = ( ARRAY OBJECT_ADD(s, 'new', 'new_value' )
FOR s IN i.subitems END )
FOR s IN ARRAY_FLATTEN(ARRAY i.subitems
FOR i IN items END, 1) END;
If you're using structured (JSON) data, you need to read the existing record, then update the field you want in your program's data structure, and then send the record up again. You can't update individual fields in the JSON structure without sending it all up again. There isn't a way around this that I'm aware of.
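A rough sketch of that read-then-replace round trip with the Couchbase Node SDK, assuming the 2.x callback-style API; the bucket name, key, and field below are placeholders, and depending on your server version you may also need cluster.authenticate(). Passing the CAS value from the read guards against silently overwriting a concurrent change:
// Sketch only: bucket name, document key and field are placeholders.
const couchbase = require('couchbase');
const cluster = new couchbase.Cluster('couchbase://localhost');
const bucket = cluster.openBucket('myBucket');

bucket.get('docKey', (err, result) => {
  if (err) throw err;
  const doc = result.value;          // the full existing document
  doc.changedColumn = 'new value';   // update only the field that changed

  // Replace with the CAS from the read, so a concurrent update fails instead of being overwritten.
  bucket.replace('docKey', doc, { cas: result.cas }, (replaceErr) => {
    if (replaceErr) throw replaceErr;
  });
});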
It is indeed true: to update individual items in a JSON doc, you need to fetch the entire document and overwrite it.
We are working on adding individual item updates in the near future.

Searching required data in couchdb

I have documents like:
{_id:1,
name:"john"
}
{_id:2,
name:"john boss"
}
{_id:3,
name:"jim"
}
I have to search the data wherever "john" is stored in the documents. For example, if I search "john", I should get the data related to _id:1 & _id:2. Please guide me to get the result.
I would appreciate it if anyone could provide a solution.
I suggest a CouchDB view to show you all "words" from the "name" field.
function(doc) {
// map function: _design/example/_view/names
if(!doc.name) // Optionally do more testing for doc type, etc. here.
return
// Emit one row per word in the name field (first name, last name, etc.).
var words = doc.name.split(/\s+/)
for(var i = 0; i < words.length; i++)
emit(words[i].toLowerCase(), doc._id)
}
Now if you query /db/_design/example/_view/names?key="john", you will get two rows: one for doc id 1, and another for id 2. I also added a conversion to lower case, so searching for "john" will match people named "John".
Duplicates are possible: the same doc ID listed multiple times, e.g. for {"name":"John John"}; however you are guaranteed that all duplicate rows will be adjacent.
You can also add ?include_docs=true to your request to get the full document for each row.
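If you are calling this from node.js, here is a small sketch using the nano client; the server URL and database name are placeholders, and it maps directly onto the /db/_design/example/_view/names request above:
// Sketch only: server URL and database name are placeholders.
const nano = require('nano')('http://localhost:5984');
const db = nano.db.use('db');

db.view('example', 'names', { key: 'john', include_docs: true }, (err, body) => {
  if (err) throw err;
  body.rows.forEach(row => {
    // e.g. 1 'john', 2 'john boss'
    console.log(row.id, row.doc && row.doc.name);
  });
});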
