Combining new ArangoSearch views and graph traversals

I've read through the ArangoDB 3.4 docs and the ArangoSearch view tutorial, but I'm still unclear on whether and how views can be combined with graph traversals. There is an example of a graph/view join in the tutorial; however, what I need to do is simply filter the candidate pool resulting from a traversal with a view-based text search. For example:
"for i in 2..2 outbound start_doc edges1, inbound edges2 [filter by view] return i"
The initial 2-hop traversal from the "start_doc" vertex will result in a much smaller candidate pool than the entire collection. I want to then perform a text search on this candidate pool using a configured view (probably with the "text_en" analyzer).
Would I just define the view expression after the traversal? Or would I need a "union_distinct" function to combine the traversal and search results? (This seems like it would be very inefficient given a potentially very large result set from the view.)
Thanks!

This is how I solved a similar problem; perhaps it will work for you too:
FOR i IN 2..2 OUTBOUND start_doc edges1, INBOUND edges2
  FILTER (
    FOR x IN view
      SEARCH x._key == i._key AND search_condition
      LIMIT 1
      RETURN x
  ) != []
  RETURN i
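For reference, here is a minimal sketch of running that query from Python with the python-arango driver. The credentials, the view name v_docs, and the PHRASE condition are placeholders standing in for your own setup and for search_condition:

from arango import ArangoClient

client = ArangoClient()
db = client.db("mydb", username="root", password="")  # hypothetical credentials

# Traverse first, then keep only vertices the view search also matches.
# "v_docs" and the PHRASE(...) condition are assumptions.
aql = """
FOR i IN 2..2 OUTBOUND @start edges1, INBOUND edges2
  FILTER (
    FOR x IN v_docs
      SEARCH x._key == i._key AND PHRASE(x.text, @phrase, "text_en")
      LIMIT 1
      RETURN x
  ) != []
  RETURN i
"""

cursor = db.aql.execute(aql, bind_vars={"start": "vertices/start_doc", "phrase": "some words"})
for doc in cursor:
    print(doc)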

Related

django remove m2m instance when there are no more relations

In case we had these models:
class Publication(models.Model):
    title = models.CharField(max_length=30)

class Article(models.Model):
    headline = models.CharField(max_length=100)
    publications = models.ManyToManyField(Publication)
According to https://docs.djangoproject.com/en/4.0/topics/db/examples/many_to_many/, both objects must be saved before we can create the relation:
p1 = Publication(title='The Python Journal')
p1.save()
a1 = Article(headline='Django lets you build web apps easily')
a1.save()
a1.publications.add(p1)
Now, if we called delete() on either of those objects, the object would be removed from the DB along with the relation between the two. Up to this point I understand.
But is there any way to make it so that, when an Article is removed, all the Publications no longer related to any Article are deleted from the DB too? Or is the only way to query the related Publications first and then iterate through them, like:
to_delete = []
qset = a1.publications.all()
for publication in qset:
    if publication.article_set.count() == 1:
        to_delete.append(publication.id)
a1.delete()
Publication.objects.filter(id__in=to_delete).delete()
But this has lots of problems, especially a concurrency one, since a publication might get used by another article between the call to .count() and the final .delete().
Is there any way of doing this automatically, like doing a "conditional" on_delete=models.CASCADE when creating the model or something?
Thanks!
I tried with @Ersain's answer:
a1.publications.annotate(article_count=Count('article_set')).filter(article_count=1).delete()
Couldn't make it work. First of all, I couldn't find the article_set keyword in the relationship:
django.core.exceptions.FieldError: Cannot resolve keyword 'article_set' into field. Choices are: article, id, title
And then, running the count filter on the QuerySet after filtering by article returned ALL the publications from the article, instead of just the ones with article_count=1. So finally this is the code I managed to make work:
Publication.objects.annotate(article_count=Count('article')).filter(article_count=1).filter(article=a1).delete()
Definitely I'm not an expert; I'm not sure whether this is the best approach or how expensive it really is, so I'm open to suggestions. But as of now it's the only solution I found to perform this operation atomically.
You can remove the related objects using this query:
a1.publications.annotate(article_count=Count('article_set')).filter(article_count=1).delete()
annotate creates a temporary field for the queryset (an alias field) which aggregates the number of related Article objects for each instance in the queryset of Publication objects, using the Count function. Count is a built-in aggregation function in SQL, which returns the number of rows from a query (the number of related instances in this case). Then we filter those results where article_count equals 1 and remove them.
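To the question's last point: there is no conditional on_delete for many-to-many relations, but the cleanup can be made automatic by attaching the working query to a signal. A minimal sketch, assuming it lives in the same models.py as Article and Publication (the receiver name is made up):

from django.db import transaction
from django.db.models import Count
from django.db.models.signals import pre_delete
from django.dispatch import receiver

@receiver(pre_delete, sender=Article)
def delete_orphan_publications(sender, instance, **kwargs):
    # pre_delete fires before the M2M rows are removed, so the
    # article's publications are still reachable here.
    with transaction.atomic():
        # The same query the asker ended up with, scoped to the
        # article being deleted: publications whose only article
        # is this one.
        Publication.objects.annotate(article_count=Count('article')) \
            .filter(article_count=1) \
            .filter(article=instance) \
            .delete()

The transaction narrows the race window between counting and deleting, but does not remove it entirely; for full safety you would still need database-level locking.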

Indexing arrays in CosmosDB

Why doesn't CosmosDB index arrays by default? The default index path is
"path": "/*"
Doesn't that mean "index everything", not "index everything except arrays"?
If I add my array field to the index with something like this:
"path": "/tags/[]/?"
It will work and start indexing that particular array field.
But my question is why doesn't "index everything" index everything?
EDIT: Here's a blog post that describes the behavior I'm seeing: http://www.devwithadam.com/2017/08/querying-for-items-in-array-in-cosmosdb.html ARRAY_CONTAINS queries are very slow, clearly not using the index. If you add the field in question to the index explicitly, the queries become fast (clearly they start using the index).
"New" index layout
As stated in Index Types
Azure Cosmos containers support a new index layout that no longer uses the Hash index kind. If you specify a Hash index kind on the indexing policy, the CRUD requests on the container will silently ignore the index kind and the response from the container only contains the Range index kind. All new Cosmos containers use the new index layout by default.
The issue below does not apply to the new index layout; there the default indexing policy works fine (and delivers the results in 36.55 RUs). However, pre-existing collections may still be using the old layout.
"Old" index layout
I was able to reproduce the issue with ARRAY_CONTAINS that you are asking about.
I set up a CosmosDB collection with 100,000 posts from the SO data dump (e.g. this question would be represented as below):
{
    "id": "50614926",
    "title": "Indexing arrays in CosmosDB",
    /* Other irrelevant properties omitted */
    "tags": [
        "azure",
        "azure-cosmosdb"
    ]
}
Then I ran the following query:
SELECT COUNT(1)
FROM t IN c.tags
WHERE t = 'sql-server'
The query took over 2,000 RUs with the default indexing policy and 93 RUs with the following addition (as shown in your linked article):
{
    "path": "/tags/[]/?",
    "indexes": [
        {
            "kind": "Hash",
            "dataType": "String",
            "precision": -1
        }
    ]
}
However, what you are seeing here is not that the array values aren't indexed by default. It is just that the default range index is not useful for your query.
The range index uses keys based on partial forward paths, so it will contain paths such as the following:
tags/0/azure
tags/0/c#
tags/0/oracle
tags/0/sql-server
tags/1/azure-cosmosdb
tags/1/c#
tags/1/sql-server
With this index structure the query starts at tags/0/sql-server, then reads all of the remaining tags/0/ entries and the entirety of the entries for tags/n/, where n is an integer greater than 0. Each distinct document mapping to any of these needs to be retrieved and evaluated.
By contrast, the hash index uses reverse paths (more details - PDF).
Stack Overflow's UI allows a maximum of 5 tags per question, so in this case (ignoring the fact that a few questions have more tags through site admin activities) the reverse paths of interest are:
sql-server/0/tags
sql-server/1/tags
sql-server/2/tags
sql-server/3/tags
sql-server/4/tags
With the reverse path structure, finding all paths with leaf nodes of value sql-server is straightforward.
In this specific use case, as the arrays are bounded to a maximum of 5 possible values, it is also possible to use the original range index efficiently by looking at just those specific paths.
The following query took 97 RUs with the default indexing policy in my test collection:
SELECT COUNT(1)
FROM c
WHERE 'sql-server' IN (c.tags[0], c.tags[1], c.tags[2], c.tags[3], c.tags[4])
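For old-layout collections, the explicit Hash path can also be applied programmatically. A minimal sketch with the azure-cosmos Python SDK follows; the account URL, key, database, container name, and partition key are all placeholders:

from azure.cosmos import CosmosClient, PartitionKey

client = CosmosClient("https://myaccount.documents.azure.com", credential="<key>")
database = client.get_database_client("mydb")

# The default "index everything" policy plus an explicit Hash index
# on the tags array, matching the policy fragment shown above.
indexing_policy = {
    "indexingMode": "consistent",
    "includedPaths": [
        {"path": "/*"},
        {
            "path": "/tags/[]/?",
            "indexes": [
                {"kind": "Hash", "dataType": "String", "precision": -1}
            ]
        }
    ]
}

database.replace_container(
    "posts",  # hypothetical container name
    partition_key=PartitionKey(path="/id"),
    indexing_policy=indexing_policy,
)

On new-layout containers the Hash kind is silently ignored, as quoted above, so this only matters for pre-existing collections.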
Cosmos DB does index all the elements of an array. By default, all Azure Cosmos DB data is indexed. Read more here: https://learn.microsoft.com/en-us/azure/cosmos-db/indexing-policies

Search in hierarchical tree on neo4j

I have a database organized as several hierarchical trees.
Nodes are organized by number: nodes whose numbers begin with the same digits are connected by relationships. For example: (5)-[connect]-(50)-[connect]-(507), etc. I want to search for, say, node 301 starting from the first parent node, node 3. How do I write this query in Cypher?
If you want to search for a specific node starting from the first parent, I would suggest the following query:
MATCH (n {number:1})-[:CONNECT*0..]->(n1) return n, n1;
This query searches for the node with property number = 1 and finds all children related to it through the CONNECT relationship. If you want to search for a specific child node, change the query this way:
MATCH (n {number:1})-[:CONNECT*0..]->(n1 {number:101}) return n, n1;
In the *0.. part you can define the depth to which you want to search; to search down to depth n, use *0..n. This documentation is a good place to start with the MATCH clause and paths: https://neo4j.com/docs/developer-manual/current/cypher/clauses/match/
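If you are issuing this from application code, here is a minimal sketch with the official neo4j Python driver; the connection details are placeholders, and it assumes the number property is stored as an integer and the relationship type is CONNECT:

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

query = """
MATCH (n {number: $root})-[:CONNECT*0..]->(n1 {number: $target})
RETURN n, n1
"""

with driver.session() as session:
    # Find node 301 starting from its first parent, node 3.
    for record in session.run(query, root=3, target=301):
        print(record["n1"])

driver.close()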

Search Documents from two collections in MarkLogic

In MarkLogic, I want to search across two collections by joining the id element of docs in collection1 to the id element of docs in collection2. When they match, I need the resulting documents from both collections.
I have the code below, but it is very slow. How can I use cts:search or search:search to achieve the same thing?
for $i in collection('demographic')/individual,
$j in collection('membership')/membership[enrolleIndividualId/id/text() = $i/individual/id/text()])
return {$i,$j}
Update:
I should note that your sample is not valid XQuery: return element root { $i, $j } would be valid. Also, you should not use the /text() node selector, as its behavior can be counterintuitive. You can compare elements directly in an XPath predicate ([enrolleIndividualId/id eq $i/individual/id]). Use /fn:string() in place of /text() if you need the contents of an element as a string. I'd also recommend using the atomic equality operator eq in place of the sequence equality operator = when directly comparing individual elements.
Original Answer:
There are several approaches to implementing joins in MarkLogic, but I would first question your data model. From the names of the elements in your sample query, it looks like you are using a relational model (individuals have memberships). MarkLogic is a document database, and it's optimized for denormalized documents. You will be much better served to process your data and generate new individual documents that each contain the relevant membership data.
That being said, here's how you could join your documents:
First, you will need range indices to write performant joins. If the id element from your sample query is not unique to individuals, you will need path range indices on enrolledIndividualId/id and individual/id; otherwise, a simple element range index on id will do.
The most common join pattern in MarkLogic uses a "shotgun-OR" query; first retrieving values from the lexicon backing a range index, and then constructing an or-query from those values to retrieve the relevant documents. This won't work directly in your case, as you want to retrieve both sides of the join. You can either run a search for each pair of documents, or run a single search for one side, and then an additional document read for each document.
pairs:
for $value in cts:values(cts:path-reference("individual/id"))
return
  cts:search(/,
    cts:or-query((
      cts:and-query((
        cts:collection-query("demographic"),
        cts:path-range-query("individual/id", "=", $value))),
      cts:and-query((
        cts:collection-query("membership"),
        cts:path-range-query("enrolledIndividualId/id", "=", $value))))),
    "unfiltered")
shotgun-OR plus iteration:
for $doc in
  cts:search(/,
    cts:and-query((
      cts:collection-query("demographic"),
      cts:path-range-query("individual/id", "=",
        cts:values(cts:path-reference("individual/id"))))),
    "unfiltered")
return
  cts:search(/,
    cts:and-query((
      cts:collection-query("membership"),
      cts:path-range-query("enrolledIndividualId/id", "=", $doc/individual/id))),
    "unfiltered")
As you can see, each approach requires I/O proportional to the number of docs/values you want to join. If you only needed the shotgun-OR (i.e., a query for documents based on criteria from other documents), you would only need two requests: the initial cts:values() call to retrieve values from a lexicon, and the cts:search() call using a query built from those values.
Note: the cts:query objects used in these examples could be used in conjunction with the Search API by means of the search:resolve() function.
Given your apparent data model, you will be much better served by processing your data into individual, de-normalized documents.
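To make that concrete, a denormalized individual document might look something like the sketch below; every element name other than the ones from your query is invented:

<individual>
  <id>12345</id>
  <!-- other demographic content -->
  <membership>
    <enrolleIndividualId>
      <id>12345</id>
    </enrolleIndividualId>
    <!-- other membership content -->
  </membership>
</individual>

With the membership embedded, the query collapses to a single cts:search over one collection, and no join (and no per-value I/O) is needed.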

Couchdb filter using reduce functions/linked documents

Considering:
doc profile:
{
    "_id": "1",
    "name": "john",
    "likes": ["2222", "1111"]
}
docs likes:
{
    "_id": "2222",
    "value": "true"
}
{
    "_id": "1111",
    "value": "false"
}
I have a filter in my Xamarin app to get the profile, and it works well, but I need to include the "children" (linked) docs. I can do this with a view setting include_docs=true, but I want CouchDB to do the filtering so I can use replication.
Alternatively, it would be possible to accomplish the same result if I could use a reduce function to filter the data, but I can't make the filter use the reduce function. So, any ideas?
The expected result would be:
doc profile:
{
    "_id": "1",
    "name": "john",
    "likes": [
        {"_id": "2222", "value": "true"},
        {"_id": "1111", "value": "false"}
    ]
}
Thanks!
I can do this with a view setting include_docs=true but I want couchdb to filter so I can use replication
You might already know this, but you can use CouchDB views as filters.
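For reference, a minimal sketch of triggering a replication that uses a view as its filter, via CouchDB's built-in _view filter; the server URL, database names, and design-doc/view names are made up:

import requests

# _view is a built-in replication filter: only documents emitted by
# the named view's map function are replicated.
requests.post(
    "http://localhost:5984/_replicate",
    auth=("admin", "password"),
    json={
        "source": "profiles",
        "target": "profiles-mobile",
        "filter": "_view",
        "query_params": {"view": "profile-ddoc/by_profile"},
    },
)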
Also, it would be possible to accomplish the same result if I could use a reduce function to filter data
The reduce function is for "reducing" the values that are returned by the map function. The map function returns a key and a value like so:
emit(key,value)
The reduce function only gets the keys and the values that are returned from a map function. For example if you call a view with
?key=abc
and it returns results like
[
    {
        "_id": ...,
        "type": "abc"
    },
    {
        "_id": ...,
        "type": "abc"
    },
    ...
]
You already have all the documents filtered by the key "abc". The reduce function gets as inputs the keys, the values, and a rereduce parameter. If you use the reduce function as a post-map processing step to further filter the results from the view, there will be two problems:
There is no way to pass a parameter to a reduce. The keys that you specify will only be used by the map function and then passed as they are to reduce.
It is not a good idea anyway. With reduce you want to return a small value that aggregates the results you get from a view. Taking the above example: if the map function emits, say, an integer as its value (in emit(key,value), suppose that value is an integer), the reduce function may return a sum or aggregate of those values. But returning a modified document is not what a reduce function is for. From the docs:
"A reduce function must reduce the input values to a smaller output value. If you are building a composite return structure in your reduce, or only transforming the values field, rather than summarizing it, you might be misusing this feature. "
List functions might be better suited to what you are trying to do. If you want to process the results of a view query before returning them, they are the way to go.
In a list function you get the set of results returned by the view function, and you can even pass additional parameters if you'd like to apply complex filters to them. But you won't be able to use list functions for replication either.
Finally, replication works at the document level. Documents have a _rev field that is used by the replicator process to check what version the document is in before replication is performed. So you won't be able to replicate the results returned by a view; only the documents themselves will be replicated.
