Bulk import of graph data with ArangoDB Java driver

I have a question regarding bulk import when working with the graph layer of ArangoDB and its Java driver. I'm using ArangoDB 3.4.5 with Java driver 5.0.0.
In the document layer, it's possible to use ArangoCollection.importDocuments to insert several documents at once. However, for the graph-layer collections, ArangoEdgeCollection and ArangoVertexCollection, there is no importDocuments function (or a corresponding importVertices/importEdges function). So if I want to bulk import my graph data, I have to bypass the graph layer and call importDocuments myself on the vertex collections, *_ELEMENT-PROPERTIES, *_ELEMENT-HAS-PROPERTIES, and edge collections separately.
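To be explicit about what I mean by going through the document layer, a minimal sketch of the workaround would look like the following (the database name, collection names and attributes are made up, and the edge part is precisely what I am unsure about when the edge collection was created by the graph module):

import com.arangodb.ArangoCollection;
import com.arangodb.ArangoDB;
import com.arangodb.ArangoDatabase;
import com.arangodb.entity.BaseDocument;
import com.arangodb.entity.BaseEdgeDocument;

import java.util.ArrayList;
import java.util.List;

public class GraphBulkImport {
    public static void main(String[] args) {
        ArangoDB arango = new ArangoDB.Builder().host("127.0.0.1", 8529).user("root").build();
        ArangoDatabase db = arango.db("myGraphDb");

        // Vertices go through the plain document API of the vertex collection.
        List<BaseDocument> vertices = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            BaseDocument v = new BaseDocument("p" + i);
            v.addAttribute("name", "person-" + i);
            vertices.add(v);
        }
        ArangoCollection personCol = db.collection("persons");
        personCol.importDocuments(vertices);

        // Edges are just documents with _from/_to, imported the same way.
        List<BaseEdgeDocument> edges = new ArrayList<>();
        for (int i = 1; i < 1000; i++) {
            edges.add(new BaseEdgeDocument("persons/p0", "persons/p" + i));
        }
        ArangoCollection knowsCol = db.collection("knows");
        knowsCol.importDocuments(edges);

        arango.shutdown();
    }
}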
Furthermore, when the edge collections already exist in the database, it is not even possible to perform a bulk import, because the existing collection is already defined as an edge collection.
Or maybe what I'm writing is not true and I have overlooked something essential?
If not, is there a reason why bulk import is not implemented for the graph layer? Or is graph bulk import just a nice-to-have item that hasn't been implemented yet?
Based on my findings above, bulk import of graph data with the Java driver is, in my opinion, not possible if the graph collections already exist (because of the edge collections). It would only be possible if the edge collections were created from scratch as ordinary collections, which already smells like having to write my own basic graph layer (which I don't want to do, of course).
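By "ordinary collections" I mean creating them as plain document collections instead of edge collections; a small sketch of the difference, again with made-up names:

import com.arangodb.ArangoDB;
import com.arangodb.ArangoDatabase;
import com.arangodb.entity.CollectionType;
import com.arangodb.model.CollectionCreateOptions;

public class CollectionTypes {
    public static void main(String[] args) {
        ArangoDatabase db = new ArangoDB.Builder().build().db("myGraphDb");

        // The workaround: keep the edge data in a plain document collection created from scratch.
        db.createCollection("knowsAsDocuments");

        // What the graph module sets up (and what already exists in my database): a real edge collection.
        db.createCollection("knows", new CollectionCreateOptions().type(CollectionType.EDGES));
    }
}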
I guess another option is importing JSON data, which I haven't analyzed much so far because it seems inconvenient when I need to manipulate (or create) the data in Java before storing it. That is why I would really like to work with the Java driver.
Thank you very much for any reply, opinion or corrections.

Related

With ArangoDB what is the practical difference between 1 named graph with x edge definitions vs x named graphs with 1 edge definition?

Is the difference only logical / housekeeping-related?
I did read this question, but the answer there only deals with one edge definition vs. multiple edge definitions within a graph, which is now already covered in the documentation. So I'm curious.
I have used Arango for 6 years and don't use Graph objects; all my queries are plain AQL queries. That means you don't need a Graph to get the benefits of a graph database or to perform traversals.
The way I think of a 'Graph' in Arango is that it's a limited / curated view of your collections that is queryable, but it is also helpful if you want it to manage some level of integrity on deletes.
Overall, it slows down a traversal, so I find it better to avoid them. A key driver for my decision is that I don't need such views, and I don't need the system to handle deleting edges when I delete a connected vertex, but that's just my use case.
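To illustrate, a traversal that names the edge collection directly (an anonymous graph, no named Graph involved) is plain AQL. Since the main question here is about the Java driver, a rough sketch of running such a query through it could look like this; the database and collection names are made up:

import com.arangodb.ArangoCursor;
import com.arangodb.ArangoDB;
import com.arangodb.ArangoDatabase;
import com.arangodb.entity.BaseDocument;
import com.arangodb.util.MapBuilder;

import java.util.Map;

public class AnonymousTraversal {
    public static void main(String[] args) {
        ArangoDatabase db = new ArangoDB.Builder().build().db("myGraphDb");

        // Traverse directly over the edge collection "knows" -- no named graph needed.
        String aql = "FOR v, e, p IN 1..2 OUTBOUND @start knows RETURN v";
        Map<String, Object> bindVars = new MapBuilder().put("start", "persons/p0").get();

        ArangoCursor<BaseDocument> cursor = db.query(aql, bindVars, null, BaseDocument.class);
        cursor.forEachRemaining(v -> System.out.println(v.getKey()));
    }
}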

Are the new fields on Processes and Tasks in Viewflow 1.6.0 for library users, or for internal use only?

Viewflow 1.6.0 introduces new fields ("data", a JSON field, and "artifact", a generic foreign key). They are present on both Processes and Tasks.
Are these intended to be available to library users, or are they Viewflow internal-use-only? I did not see anything in the docs or the GitHub issues list to clarify the matter, so a pointer would be appreciated if I missed it.
Yep, it's for library users; it allows using proxy models instead of real tables for keeping process-only data.
The data field is JSON, so it can be used with the jsonstore field (https://github.com/viewflow/jsonstore), which exposes JSON data as a real Django field. That way it can be used with ModelForms as usual.
Example: https://github.com/viewflow/viewflow/blob/master/demo/helloworld/models.py#L6
Artifact allows linking a process to your own data models without creating a separate table for that.
All of this lets you avoid joins when building the list of all tasks from different flows for a user.

Python 3: Storing data without loading it into memory

I am currently building a Flask app that will have more data than I think I can load into memory. I have searched in many places and found that SQL seems to be a good solution; sadly, I cannot use SQL for this project due to some of its limitations.
My project consists of many entries of the form
database[lookupkey1][lookupkey2]...and more lookup keys
My current plan is to override __getitem__, __setitem__, and __delitem__ and replace them with calls to the database. Is there any kind of database that can store large amounts of nested maps/dictionaries like
{"spam":{"bar":["foo","test"],"foo":["bar"]}}
I am also currently using JSON to save data, so it would be appreciated if the database had an easy way to migrate my current data.
Sorry that I'm not very good at writing Stack Overflow questions.
Most document-oriented DBs like MongoDB would allow you to save data as nested dict-list-like objects and query them using their keys and indexes.
P.S. Accessing such a DB through Python's dict accessors is a bad idea, as it would produce a separate DB query for each step, which is highly inefficient and may lead to performance problems. Try looking at an ORM for the DB you choose, as most ORMs allow you to access a document-oriented DB's data in a way similar to accessing dicts and lists.

Blazegraph Tinkerpop 3 Indexing

I am trying to learn about Blazegraph. At the moment I am puzzled about how to optimise simple lookups.
Suppose all my vertices have a property id, which is unique. This property is set by the user. Is there any way to speed up finding a vertex with a particular id while still sticking to the TinkerPop APIs?
Is the search API defined here the only way?
My previous experience is with TitanDB, and in Titan's case it's possible to define an index that the TinkerPop APIs integrate with flawlessly. Is there any way to achieve the same result in Blazegraph without using the Search API?
Whether a mid-traversal V() uses an index or not depends on a) whether a suitable index exists and b) whether the particular graph system provider implemented this functionality.
Gremlin (TinkerPop) does not specify how to create indexes, although the documentation presents things like the following:
graph.createIndex("username",Vertex.class)
But that may be reserved for the TinkerGraph implementation; as a matter of fact, the documentation says:
Each graph system will have different mechanism by which indices and schemas are defined. TinkerPop3 does not require any conformance in this area. In TinkerGraph, the only definitions are around indices. With other graph systems, property value types, indices, edge labels, etc. may be required to be defined a priori to adding data to the graph.
There is an example for Neo4j:
TinkerPop3 does not provide method interfaces for defining schemas/indices for the underlying graph system. Thus, in order to create indices, it is important to call the Neo4j API directly.
But the code is very specific to that plugin:
graph.cypher("CREATE INDEX ON :person(name)")
Note that for Blazegraph, the search uses a built-in full-text index.
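To make the quoted passages concrete: createIndex above is TinkerGraph's own API, not part of the portable TinkerPop interfaces and not something Blazegraph exposes. A minimal sketch against the in-memory TinkerGraph reference implementation (property names made up) looks like this:

import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.structure.Vertex;
import org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerGraph;

public class TinkerGraphIndexExample {
    public static void main(String[] args) throws Exception {
        TinkerGraph graph = TinkerGraph.open();

        // Provider-specific call: TinkerGraph maintains a key index on the "uid" property.
        graph.createIndex("uid", Vertex.class);

        graph.addVertex("uid", "u-42", "name", "alice");
        graph.addVertex("uid", "u-43", "name", "bob");

        GraphTraversalSource g = graph.traversal();

        // The has() step is portable Gremlin; whether it is index-backed is up to the provider.
        Vertex alice = g.V().has("uid", "u-42").next();
        String name = alice.value("name");
        System.out.println(name);

        graph.close();
    }
}

The lookup itself stays within the TinkerPop API; only the index definition is provider-specific, which is the point the quoted documentation makes.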

Concerns about Core Data

I'm getting ready to dive into my first Core Data adventure. While evaluating the framework, two questions came up that really got me thinking about whether to use Core Data at all for this project or to stick with SQLite.
My app will heavily rely upon importing data from an external source. I'm aware that one can import into Core Data, but handling complex relationships seems complicated and tedious. Is there an easy way to accomplish complex imports?
The app has to be able to execute complex queries spanning multiple tables or having multiple conditions. Building these predicates and expressions simply scares me...
Is it worth taking the plunge and using Core Data, or should I stick with SQLite?
As I and others have said before, Core Data is really an object-graph management framework. It manages the relationships between model objects, including constraints on their cardinality, and manages cascading deletes etc. It also manages constraints on individual attributes. Core Data just happens to also be able to persist that object graph to disk. It can do this in a number of formats, including XML, binary, and via SQLite. Thus, Core Data is really orthogonal to SQLite. If your task is dealing with an embedded SQL-compatible database, go with SQLite. If your task is managing the model layer of an MVC app, go with Core Data. In specific answers to your questions:
There is no magic that can automatically import complex data into any model. That said, it is relatively easy in Core Data. Taking a multi-pass approach and using the SQLite backend can help with memory consumption by allowing you to keep only a subset of the data in memory at a time. If the data sets can be kept in memory, you can write a custom persistent store format that reads/writes directly to your legacy data format from within Core Data (see the Atomic Store Programming Guide).
Building a complex NSPredicate declaratively is somewhat verbose but shouldn't scare you. The Predicate Programming Guide is a good place to start. You can, of course, also write predicates using a string format, much like a string-formatted SQL statement. It's worth noting that, as described above, the predicates in Core Data are on the objects and object graph, not on the SQL tables. If you really want to think at the level of tables, stick with SQLite and write your own wrapper.
I can't really speak to your first point.
However, regarding your second point, using Core Data means you don't have to really worry about complex queries since you can just pretend that all the relationships are properly established in memory already (Apple's implementation details aside). It doesn't matter how complex a join it might be in a database environment because you really aren't in a database environment. If you need to get the fourth child of the grandparent of your current object and then find that child's pet's name and breed, all you do is traverse up the object tree in code using a series of messages or properties. No worries about joins or anything. The only problem is it might be really slow depending on your objects' relationships, but I can't really speak accurately to that since I haven't actually implemented anything using Core Data (I've just read about it extensively on Apple's and others' websites).
If the data importer from an external source is written against the same Core Data model (for the destination side of the import), nothing will be conceptually different compared to using/updating the same data through the Core Data stack of your actual application.
If you create the data importer without using the Core Data stack, make sure you learn the database schema that the Core Data model generates/expects. There is nothing magical there; just make sure you follow how cross-entity relationships are implemented and how entity hierarchies are stored.
I recently had to create a data importer from an Access database into a Core Data based SQLite store as a .NET app. Once my destination Core Data model was defined, I created a small app that populated the SQLite store with randomly generated entities (including all the expected relationships). Then I reverse engineered how Core Data actually creates the SQLite store for the model and how it handles relationships by studying the generated and persisted data. Then I implemented the .NET based importer/data-transformer according to my observations. In the end, I got a perfectly Core Data friendly data store that could be opened and modified from the application using the Core Data stack on Mac OS X.
