Titan+Cassandra and String Vertex Ids - cassandra

It looks like we can get Titan 1.0 use custom long ids by setting "graph.set-vertex-id" to true. Is there some way to use non-long (i.e. String) ids as Vertex Ids? Seeing that the Tinkerpop api supports Strings, and there's a feature called "StringIds", is there some way of enabling that feature? I'm using Titan with Cassandra.

I think this goes against Titan's internal structure. One of the Titan devs recommends here to just use your own indexed property. This is reiterated here and here stating that unique indexed properties should be used.
I think the reason for this is that the internal ids actually refer to locations on the system. As stated here:
The (64 bit) vertex id (which Titan uniquely assigns to every vertex) is the key which points to the row containing the vertex’s adjacency list.

No, String identifiers are not supported in the StandardTitanGraph.features(). You could consider using an indexed String property as an alternative.

Related

Define Graph Schema in AWS Neptune to prevent data duplication

When using TinkerPop/JanusGraph I am able to define, VertexLabels and Property Keys which I can than use to create composite indexes. I read somewhere on the Neptune documentation that indexes are not necessary (or supported).
My question is then how do I prevent duplication when loading data into the database? The only examples I found on the AWS documentation involves loading data where an Unique ID is already provided for each record, which for me seems like I would need to first extract data from a RDBMS in order to have all the IDs and their relationships before I can load it.
Am I understanding this correctly, if not how could I solve this?
Yes your understanding is correct. Uniqueness constraint for vertices & edges applies on their ~id property i.e. IDs are unique.
There are two ways to insert data into Neptune. You can either use the loader interface(recommended) or insert via Gremlin.
Case#1: Insert via bulk loader (recommended)
Inserting via loader only supports CSV format for now and as you observed, it does necessarily require user defined IDs for Vertices and Edges.
Case#2: Insert via Gremlin
For insertion via Gremlin providing IDs is optional. If you do not provide an ID, then Neptune will automatically assign a unique ID to the vertex or the edge.
e.g. g.addV() adds a vertex and assigns a unique identifier to it.
Further regarding case#2, you can add the two vertices and the relationship in the same query. This does not require knowledge of the ID auto-assigned to the vertex by the database.
g.addV().as("node1").property("name","Simba").addV().as("node2").property("name","Mufasa").addE("knows").from("node1").to("node2")
Alternatively, use a unique property identifier to query for nodes from the DB:
g.addV().property("name","Simba");
g.addV().property("name","Mufasa");
g.V().has("name","Simba").as("node1").V().has("name","Mufasa").as("node2").addE("knows").from("node1").to("node2");

Cosmos DB with multiple partition keys

We're looking at potentially using a single Cosmos DB collection to hold multiple document types in a multi-tenanted environment using a tenant ID as the partition key. The path to tenant id may change in each document type and I am therefore looking at various was of exposing the partition key to Cosmos DB to enable correct partitioning / querying.
I have noticed that the Paths property of DocumentCollection.PartitionKey is a collection and was therefore wondering whether it is possible to pass multiple paths during the creation of a document collection and what the behaviour of this might be. Ideally, I would like Cosmos to scan each of these paths and use the first value or aggregate of values as the partition key but cannot find any documentation suggesting that this is indeed the behaviour.
The MSDN documentation for this property is pretty useless and none of the associated documentation seems to answer the question. Does anyone know about or previously used multiple partition key paths in a collection?
To be clear, I'm looking for links to additional documentation about and/or direct experience of the Cosmos DB's behaviour when specifying multiple partition keys in the PartitionKey.Paths collection when creating a DocumentCollection.
This question has also been posted in the Azure Community Support forums.
Thanks, Ian
The best way to do this is to assign a generic partition key like “pk”, then assign this value based on each of your object types. You can for example, manage this during serialization by having different properties for each class to be serialized to “pk”.
The reason partition key is an array in DocumentCollection.PartitionKey is to allow us to introduce compound partition keys, where the combination of multiple properties like (“firstName”, “lastName”) form the partition key. This is a little different from what you need.
Further to the above, I ended up adding a partition key property to the document container as suggested by Aravind and then used David Fowler's excellent QueryInteceptor nuget package to apply an ExpressionVisitor which translated any equivalence expression relating to the specific document type's tenant id property into a equivalence expression on the partition key property. This ensured that queries would be performed against only the single, correct partition. Furthermore, I was able to use the ExpressionVisitor as a safety feature in that it is able to enforce that all queries provide a filter on tenant id (as, obviously, tenants should never be able to see each others documents) and if none has been specified then no records are returned (an invalid equivalence expression is added to the partition key property).
This has been tested and seems to be working well.

CosmosDB: Enforce unique constraints

Is it possible to enforce unique constraints on Azure CosmosDB's graph model? If I'm registering new users and need to ensure only unique email addresses/usernames/etc. are used, how can this be accomplished there?
Depending on your partitioning strategy you can use these values to enforce uniqueness either by using them in the partition key directly or by using them as an id inside of a known partition for your users.
We recently launched unique key support for Cosmos DB. This should work seamlessly for graph collections as well.
https://learn.microsoft.com/en-us/azure/cosmos-db/unique-keys
Just, create a graph with desired unique key path. After that adding a vertex with the same unique key compared to an existing vertex, should fail.

Blazegraph Tinkerpop 3 Indexing

I am trying to learn about Blazegraph. At the moment I am puzzled how I can optimise simple lookups.
Suppose all my vertices have a property id, which is unique. This property is set by the user. Is there any way to speed up finding a vertex of a particular id while still sticking to the Tinkerpop APIs?
Is the search API defined here the only way?
My previous experience is in TitanDB and in Titan's case it's possible to define an index which the Tinkerpop APIs integrate with flawlessly. Is there any way to achieve the same results in Blazegraph without using the Search API?
Whether a mid-traversal V() uses an index or not, depends on a)
whether suitable index exists and b) if the particular graph system
provider implemented this functionality.
Gremlin (Tinkerpop) does not specify how to set indexes although the documentation presents things like the following
graph.createIndex("username",Vertex.class)
But may be reserved for the ThinkerGraph implementation, as a matter of fact it says
Each graph system will have different mechanism by which indices and
schemas are defined. TinkerPop3 does not require any conformance in
this area. In TinkerGraph, the only definitions are around indices.
With other graph systems, property value types, indices, edge labels,
etc. may be required to be defined a priori to adding data to the
graph.
There is an example for Neo4J
TinkerPop3 does not provide method interfaces for defining
schemas/indices for the underlying graph system. Thus, in order to
create indices, it is important to call the Neo4j API directly.
But the code is very specific for that plugin
graph.cypher("CREATE INDEX ON :person(name)")
Note that for BlazeGraph the search uses a built in full-text index

DocumentDB: get all documents of same entity type

I'm storing documents of several different types (entity types?) in a single collection. What would be the best way get all documents of a certain type (like you would do with select * from a table).
Options I see so far:
Include the type as a property. But that would mean a looking into every document when getting the documents, right?
Prepend the type name to the document id and try searching by id with typename*.
Is there a better way to do this?
There's no built-in entity-type property, but you can certainly create your own, and ensure that it's indexed. At this point, it's as straightforward as adding a WHERE clause:
WHERE docs.docType = "SomeType"
Assuming it's a hash-based index, this should provide efficient lookups and filter out unwanted document types.
While you can embed the type into a property (such as document id), you'd then have to do partial string matches, which won't be as efficient as an indexed-property comparison.
If you're curious to know what this query is costing you, the RU value is displayed both in the portal and via x-ms-request-charge return header.
I agree with David's answer and using a single docType field is what I did when I first started using DocumentDB. However, there is another option that I started using after doing some experiments. That is to create an is<Type> field and setting its value to true. This is slightly more efficient for queries than using a single string field, because the indexes themselves are smaller partial indexes, but could potentially take up slightly more storage space.
The other advantage to this approach is that it provides advantages for inheritance and mixins. For example, I have both isLookup=true and isState=true on certain entities. I also have other lookup types. Then in my application code, some behaviors are common for all lookup fields and other behaviors are only applicable to the State type.
If you index the type property on the collection, it will not be a complete scan.

Resources