Define Graph Schema in AWS Neptune to prevent data duplication - tinkerpop3

When using TinkerPop/JanusGraph I am able to define VertexLabels and Property Keys, which I can then use to create composite indexes. I read somewhere in the Neptune documentation that indexes are not necessary (or supported).
My question is then: how do I prevent duplication when loading data into the database? The only examples I found in the AWS documentation involve loading data where a unique ID is already provided for each record, which to me means I would first need to extract the data from an RDBMS in order to have all the IDs and their relationships before I can load it.
Am I understanding this correctly? If not, how could I solve this?

Yes, your understanding is correct. The uniqueness constraint for vertices and edges applies to their ~id property, i.e. IDs are unique.
There are two ways to insert data into Neptune. You can either use the loader interface (recommended) or insert via Gremlin.
Case#1: Insert via bulk loader (recommended)
The loader only supports CSV format for now and, as you observed, it requires user-defined IDs for vertices and edges.
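One way to get user-defined IDs without first collecting them all from the RDBMS is to derive each ID from a natural key. The Python sketch below illustrates that idea. The hashing scheme, the function names and the "lion" label are assumptions made for illustration. The header row uses the loader's Gremlin CSV column names (~id, ~label, and a typed property column).

```python
import csv
import hashlib
import io


def vertex_id(label: str, natural_key: str) -> str:
    """Derive a stable ID from a label and a natural key (hypothetical scheme)."""
    digest = hashlib.sha256(f"{label}|{natural_key}".encode("utf-8")).hexdigest()
    return f"{label}-{digest[:16]}"


def vertices_csv(label: str, names: list[str]) -> str:
    """Render vertex rows with the columns ~id, ~label and name:String."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["~id", "~label", "name:String"])
    seen = set()
    for name in names:
        vid = vertex_id(label, name)
        if vid in seen:  # skip records repeated within the same extract
            continue
        seen.add(vid)
        writer.writerow([vid, label, name])
    return out.getvalue()


# "Simba" appears twice in the input but yields a single row.
csv_text = vertices_csv("lion", ["Simba", "Mufasa", "Simba"])
```

Because the same record always maps to the same ~id, an edge file can compute its ~from and ~to values with the same function, without looking up the IDs first.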
Case#2: Insert via Gremlin
For insertion via Gremlin providing IDs is optional. If you do not provide an ID, then Neptune will automatically assign a unique ID to the vertex or the edge.
e.g. g.addV() adds a vertex and assigns a unique identifier to it.
Further regarding case#2, you can add the two vertices and the relationship in the same query. This does not require knowledge of the ID auto-assigned to the vertex by the database.
g.addV().as("node1").property("name","Simba").addV().as("node2").property("name","Mufasa").addE("knows").from("node1").to("node2")
Alternatively, use a unique property identifier to query for nodes from the DB:
g.addV().property("name","Simba");
g.addV().property("name","Mufasa");
g.V().has("name","Simba").as("node1").V().has("name","Mufasa").as("node2").addE("knows").from("node1").to("node2");

Related

What is the difference between gremlin in neptune to azure selecting a single vertex?

We switched our database from Azure to Neptune. In Azure you could select one vertex and the Gremlin query returned the id, the label and all properties of this vertex. If you do the same on Neptune, just the id and the label are returned. How can I get Neptune to return the id, the label and all properties of a vertex? Is there an option you can choose in the Neptune configuration? If there is no option, which query do I have to execute to get the id, the label and all properties of a vertex?
The difference might have to do with a number of things. The first thing that comes to mind is that if you went from CosmosDB to Neptune you might now be using bytecode-based traversals, in which case they don't return properties (just references, which means id and label, as you are seeing). If you didn't switch, then it's possible that Neptune is more aligned with TinkerPop in terms of serialization semantics, which call for references only in newer versions.
Either way, it's considered a best practice to only return the data that you need in the form you want it, rather than a graph element with all properties. The reasoning is similar to why you wouldn't do SELECT * FROM table in SQL - you would specify the column names.

Cosmos DB with multiple partition keys

We're looking at potentially using a single Cosmos DB collection to hold multiple document types in a multi-tenanted environment using a tenant ID as the partition key. The path to the tenant ID may differ in each document type, and I am therefore looking at various ways of exposing the partition key to Cosmos DB to enable correct partitioning / querying.
I have noticed that the Paths property of DocumentCollection.PartitionKey is a collection and was therefore wondering whether it is possible to pass multiple paths during the creation of a document collection and what the behaviour of this might be. Ideally, I would like Cosmos to scan each of these paths and use the first value or aggregate of values as the partition key but cannot find any documentation suggesting that this is indeed the behaviour.
The MSDN documentation for this property is pretty useless and none of the associated documentation seems to answer the question. Does anyone know about, or has anyone previously used, multiple partition key paths in a collection?
To be clear, I'm looking for links to additional documentation about, and/or direct experience of, Cosmos DB's behaviour when specifying multiple partition keys in the PartitionKey.Paths collection when creating a DocumentCollection.
This question has also been posted in the Azure Community Support forums.
Thanks, Ian
The best way to do this is to define a generic partition key like “pk”, then assign its value based on each of your object types. You can, for example, manage this during serialization by having a different property in each class serialized to “pk”.
The reason the partition key is an array in DocumentCollection.PartitionKey is to allow us to introduce compound partition keys, where the combination of multiple properties like (“firstName”, “lastName”) forms the partition key. This is a little different from what you need.
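The generic “pk” suggestion can be illustrated with a short Python sketch. It is not Cosmos DB SDK code. The two document classes, their tenant field names and the TENANT_FIELD mapping are hypothetical. The sketch only shows each type copying its own tenant ID into a shared "pk" field at serialization time.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class User:
    tenant_id: str
    name: str


@dataclass
class Invoice:
    owner_tenant: str
    total: float


# Hypothetical mapping: which attribute holds the tenant ID for each type.
TENANT_FIELD = {User: "tenant_id", Invoice: "owner_tenant"}


def to_document(obj) -> str:
    """Serialize obj to JSON, adding a type tag and the shared "pk" field."""
    body = asdict(obj)
    body["type"] = type(obj).__name__
    body["pk"] = body[TENANT_FIELD[type(obj)]]
    return json.dumps(body)
```

With this in place the collection would be created with a single partition key path pointing at the "pk" property, regardless of where each type stores its tenant ID.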
Further to the above, I ended up adding a partition key property to the document container as suggested by Aravind and then used David Fowler's excellent QueryInterceptor NuGet package to apply an ExpressionVisitor which translated any equivalence expression on the specific document type's tenant id property into an equivalence expression on the partition key property. This ensured that queries would be performed against only the single, correct partition. Furthermore, I was able to use the ExpressionVisitor as a safety feature: it enforces that all queries provide a filter on tenant id (as, obviously, tenants should never be able to see each other's documents), and if none has been specified then no records are returned (an invalid equivalence expression is added on the partition key property).
This has been tested and seems to be working well.
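The rewrite rule described above can be shown in miniature. This is not the QueryInterceptor or ExpressionVisitor code, only a Python sketch of the same idea applied to a filter represented as a plain dict. The field names and the NO_MATCH sentinel are hypothetical.

```python
# Hypothetical tenant ID field names used by the different document types.
TENANT_FIELDS = {"tenant_id", "owner_tenant"}
# Sentinel partition key value that no stored document is assumed to carry.
NO_MATCH = "__no_such_tenant__"


def scope_to_partition(filters: dict) -> dict:
    """Turn a tenant ID equality into an equality on "pk".

    If the caller supplied no tenant filter, pin "pk" to a value that
    matches nothing, so the query returns no records.
    """
    rewritten = {}
    tenant = None
    for field, value in filters.items():
        if field in TENANT_FIELDS:
            tenant = value
        else:
            rewritten[field] = value
    rewritten["pk"] = tenant if tenant is not None else NO_MATCH
    return rewritten
```

A filter such as {"tenant_id": "t1", "name": "Ann"} becomes {"name": "Ann", "pk": "t1"}, while a filter with no tenant field is scoped to the sentinel value.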

CosmosDB: Enforce unique constraints

Is it possible to enforce unique constraints on Azure CosmosDB's graph model? If I'm registering new users and need to ensure only unique email addresses/usernames/etc. are used, how can this be accomplished there?
Depending on your partitioning strategy you can use these values to enforce uniqueness either by using them in the partition key directly or by using them as an id inside of a known partition for your users.
We recently launched unique key support for Cosmos DB. This should work seamlessly for graph collections as well.
https://learn.microsoft.com/en-us/azure/cosmos-db/unique-keys
Just create a graph with the desired unique key path. After that, adding a vertex with the same unique key value as an existing vertex should fail.

homogeneous vs heterogeneous in documentdb

I am using Azure DocumentDB and all my experience in NoSql has been in MongoDb. I looked at the pricing model and the cost is per collection. In MongoDb I would have created 3 collections for what I was using: Users, Firms, and Emails. I noted that this approach would cost $24 per collection per month.
I was told by the people I work with that I'm doing it wrong: I should have all three of those things stored in a single collection with a field to describe what the data type is, and each collection should be related by date or geographic area so one part of the world has a smaller portion to search.
I was also told to:
"Combine different types of documents into a single collection and add
a field across all to separate them in searching like a type field or
something"
I would never have dreamed of doing that in Mongo, as it would make indexing, shard keys, and other things hard to get right.
There might not be many fields that overlap between the objects (example: Email and Firm objects).
I can do it this way, but I can't seem to find a single example of anyone else doing it that way - which indicates to me that maybe it isn't right. Now, I don't need an example, but can someone point me to some location that describes which is the 'right' way to do it? Or, if you do create a single collection for all data - other than Azure's pricing model, what are the advantages / disadvantages in doing that?
Any good articles on DocumentDb schema design?
Yes. In order to leverage CosmosDb to its full potential, you need to think of a collection as an entire database system and not as a "table" designed to hold only one type of object.
Sharding in Cosmos is exceedingly simple. You just specify a field that all of your documents will populate and select that as your partition key. If you pick a generic name such as key or partitionKey, you can easily separate the storage of your inbound emails from users, and from anything else, by picking appropriate values.
// Each document type exposes the same property, which is used as the partition key.
class InboundEmail
{
    public string Key { get; set; } = "EmailsPartition";
    // other properties
}

class User
{
    public string Key { get; set; } = "UsersPartition";
    // other properties
}
What I'm showing is still only an example though. In reality your partition key values should be even more dynamic. It's important to understand that queries against a known partition are extremely quick. As soon as you need to scan across multiple partitions you'll see much slower and more costly results.
So, in an app that ingests a lot of user data, keeping a single user's activity together in one partition might make sense for that particular entity.
If you want evidence that this is the appropriate way to use CosmosDb, consider the addition of the new Gremlin Graph APIs. Graphs are inherently heterogeneous, as they contain many different entities and entity types as well as the relationships between them. The query boundary of Cosmos is at the collection level, so if you tried putting your entities all in different collections, none of the Graph API or queries would work.
EDIT:
I noticed in the comments you made this statement: "And you would have an index on every field in both objects". CosmosDb does automatically index every field of every document. It uses a special proprietary path-based indexing mechanism that ensures every path of your JSON tree is indexed. You have to specifically opt out of this auto-indexing feature.

Blazegraph Tinkerpop 3 Indexing

I am trying to learn about Blazegraph. At the moment I am puzzled how I can optimise simple lookups.
Suppose all my vertices have a property id, which is unique. This property is set by the user. Is there any way to speed up finding a vertex of a particular id while still sticking to the Tinkerpop APIs?
Is the search API defined here the only way?
My previous experience is in TitanDB and in Titan's case it's possible to define an index which the Tinkerpop APIs integrate with flawlessly. Is there any way to achieve the same results in Blazegraph without using the Search API?
Whether a mid-traversal V() uses an index or not depends on a)
whether a suitable index exists and b) whether the particular graph
system provider implemented this functionality.
Gremlin (TinkerPop) does not specify how to set indexes, although the documentation presents things like the following:
graph.createIndex("username",Vertex.class)
But that may be reserved for the TinkerGraph implementation. As a matter of fact, the documentation says:
Each graph system will have different mechanism by which indices and
schemas are defined. TinkerPop3 does not require any conformance in
this area. In TinkerGraph, the only definitions are around indices.
With other graph systems, property value types, indices, edge labels,
etc. may be required to be defined a priori to adding data to the
graph.
There is an example for Neo4j:
TinkerPop3 does not provide method interfaces for defining
schemas/indices for the underlying graph system. Thus, in order to
create indices, it is important to call the Neo4j API directly.
But the code is very specific to that plugin:
graph.cypher("CREATE INDEX ON :person(name)")
Note that for Blazegraph the search uses a built-in full-text index.

Resources