Titan 1.0.0 to DataStax Enterprise Migration - groovy

I have some existing code that I have written in Groovy for data ingestion into Titan with a Cassandra + Elasticsearch backend. With the release of DataStax Enterprise 5.0, I was looking to see if the existing Titan code could be migrated over.
The primary use of the code was to parse out some fields, transform some of the values (ex: datetimestamp -> epoch), and check for edge uniqueness when adding new edges (ex: an 'A likes Apples' relation should only appear once in the graph even though multiple 'A likes Apples' relations may appear in the raw file).
What I have tried so far:
Using the DSE Graph Loader with edge label multiplicity set to single (no properties) and vertex multiplicity set to single:
data = File.text(filepath).delimiter(',').header('a', 'b', 'c')
load(data).asVertices { }
load(data).asEdges { }
Using this template, vertices are unique (one vertex per vertex label). However, for edge labels defined in the schema as single, an exception is thrown every time the "same" edge is added again. Is it possible to add checks within the loading script for uniqueness?
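For reference, a filled-in version of that mapping looks roughly like the following (labels, keys and field names here are simplified placeholders, not my real schema):
load(data).asVertices {
    label "person"
    key "a"
}
load(data).asEdges {
    label "likes"
    outV "a", {
        label "person"
        key "a"
    }
    inV "b", {
        label "fruit"
        key "b"
    }
}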
Loading data through the gremlin console
:load filepath
I'm finding that my pre-existing code throws quite a few exceptions when executing the load command. After getting rid of a few Java/Titan classes that weren't importing (TitanManagement and SimpleDateFormat could not be imported), I am getting a
org.apache.tinkerpop.gremlin.groovy.plugin.RemoteException
Any tips on getting gremlin-console integration working?
One last question: are there any functions that have been removed with the DataStax acquisition of Titan?
Thanks in advance!

We are looking at a feature enhancement to the Graph Loader to support the duplicate edge check. If your edges only have single cardinality, you can enforce that with the edge label's cardinality setting, .single().
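As a rough sketch (label and property names are placeholders, not your schema), a schema along these lines makes 'likes' single cardinality, so a second identical edge between the same pair of vertices is rejected:
// property key and vertex labels are illustrative only
schema.propertyKey('name').Text().single().create()
schema.vertexLabel('person').properties('name').create()
schema.vertexLabel('fruit').properties('name').create()
// single() allows at most one 'likes' edge between any given pair of vertices
schema.edgeLabel('likes').single().connection('person', 'fruit').create()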
For the second item, are you using the DSE supplied Gremlin Console? Is your console local and your cluster located on another machine? What was the setup of your Titan environment?
For context, DataStax did not purchase Titan. Titan is an open source Graph Database and remains an open source Graph Database. DataStax acquired the Aurelius team, the creators of Titan. The Aurelius team built a new Graph Database that was inspired by Titan and is compliant with TinkerPop. There are feature and implementation detail differences between DSE Graph and Titan which can be found here - http://docs.datastax.com/en/latest-dse/datastax_enterprise/graph/graphTOC.html
One that may interest you is the integration of DSE Search and DSE Graph.

Related

How to export large Neo4j datasets for analysis in an automated fashion

I've run into a technical challenge around Neo4j usage that has had me stumped for a while. My organization uses Neo4j to model customer interaction patterns. The graph has grown to a size of around 2 million nodes and 7 million edges. All nodes and edges have between 5 and 10 metadata properties. Every day, we export data on all of our customers from Neo4j to a series of python processes that perform business logic.
Our original method of data export was to use paginated cypher queries to pull the data we needed. For each customer node, the cypher queries had to collect many types of surrounding nodes and edges so that the business logic could be performed with the necessary context. Unfortunately, as the size and density of the data grew, these paginated queries began to take too long to be practical.
Our current approach uses a custom Neo4j procedure to iterate over nodes, collect the necessary surrounding nodes and edges, serialize the data, and place it on a Kafka queue for downstream consumption. This method worked for some time, but it is now taking long enough that it is also becoming impractical, especially considering that we expect the graph to grow an order of magnitude in size.
I have tried the cypher-for-apache-spark and neo4j-spark-connector projects, but neither has been able to provide the query and data transfer speeds that we need.
We currently run on a single Neo4j instance with 32GB memory and 8 cores. Would a cluster help mitigate this issue?
Does anyone have any ideas or tips for how to perform this kind of data export? Any insight into the problem would be greatly appreciated!
As far as I remember, Neo4j doesn't support horizontal scaling and all data is stored on a single node. To use Spark, you could try to store your graph across 2+ nodes and load the parts of the dataset from those separate nodes to "simulate" parallelization. I don't know if that is supported by either of the connectors you mention.
But as mentioned in the comments on your question, maybe you could try an alternative approach. One idea:
Find a data structure representing everything you need to train your model.
Store this "flattened" graph in a key-value store (Redis, Cassandra, DynamoDB...).
Whenever something changes in the graph, push a message to your Kafka topic.
Add consumers that update the data in the graph and in your key-value store right afterwards (i.e. update only the graph branch impacted by the change; there is no need to export the whole graph or to change the key-value store at the same moment, although that would very probably lead to duplicating the logic). A rough sketch of such a consumer is at the end of this answer.
Have your model query the key-value store directly.
It also depends on how often your data changes, and on how deep and broad your graph is.
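As a very rough sketch of such a consumer, assuming Redis as the key-value store, a topic named 'graph-changes' and a JSON payload with a customerId field (all of these are placeholders, not something your setup defines):
// Sketch only: topic name, key layout and payload format are assumptions
@Grab('org.apache.kafka:kafka-clients:0.10.2.1')
@Grab('redis.clients:jedis:2.9.0')
import org.apache.kafka.clients.consumer.KafkaConsumer
import groovy.json.JsonSlurper
import redis.clients.jedis.Jedis

def props = new Properties()
props.put('bootstrap.servers', 'localhost:9092')
props.put('group.id', 'graph-flattener')
props.put('key.deserializer', 'org.apache.kafka.common.serialization.StringDeserializer')
props.put('value.deserializer', 'org.apache.kafka.common.serialization.StringDeserializer')

def consumer = new KafkaConsumer<String, String>(props)
consumer.subscribe(['graph-changes'])
def redis = new Jedis('localhost')
def json = new JsonSlurper()

while (true) {
    // each message describes one changed customer subgraph, serialized as JSON
    consumer.poll(1000).each { record ->
        def change = json.parseText(record.value())
        // overwrite only the affected "branch" in the key-value store
        redis.set("customer:${change.customerId}".toString(), record.value())
    }
}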
Neo4j Enterprise supports clustering: you could use the Causal Clustering feature, launch as many read replicas as needed, and run the queries in parallel on the read replicas. See this link: https://neo4j.com/docs/operations-manual/current/clustering/setup-new-cluster/#causal-clustering-add-read-replica
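As a minimal sketch (URI, credentials and the Cypher query are placeholders), using the Neo4j Java driver from Groovy: the bolt+routing scheme plus a READ session sends the query to a read replica of the causal cluster:
// Minimal sketch: URI, credentials and query are placeholders
@Grab('org.neo4j.driver:neo4j-java-driver:1.4.4')
import org.neo4j.driver.v1.*

def driver = GraphDatabase.driver('bolt+routing://core-host:7687',
        AuthTokens.basic('neo4j', 'password'))
def session = driver.session(AccessMode.READ)
try {
    def result = session.run('MATCH (c:Customer) RETURN c.id AS id LIMIT 10')
    result.each { record -> println record.get('id').asString() }
} finally {
    session.close()
    driver.close()
}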

Where is slow queries data in OpsCenter read from?

Since our former data model is not very well designed, the Slow Queries panel shows that there are some queries which are performing slowly.
As I am planning to redesign the data model, I want to clear out the old information displayed in this panel, so that I only see information about my new data model. However, I do not know where OpsCenter reads this data from.
My idea is that if this information is stored in a table or file, I can truncate or delete it. Or am I totally wrong with that assumption, and could this instead be done through a configuration file modification or something similar?
OpsCenter Version: 6.0.3
Cassandra Version: 2.1.15.1423
DataStax Enterprise Version: 4.8.10
It reads from dse_perf.node_slow_log. Each node tracks new events in that log as they occur and stores its top X. When you view the panel in the UI, OpsCenter fetches the top X from each node and merges them. To "reset" it, you can truncate the table and restart the DataStax agents to clear their current top X. A feature to reset this for you is planned, but in 6.0.3 it's a little difficult.
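For example, the truncate can be done from cqlsh, or, as a sketch, with the DataStax Java driver from a Groovy script (the contact point is a placeholder); the agents still need a restart afterwards:
// Sketch only: contact point is a placeholder; restart the DataStax agents afterwards
@Grab('com.datastax.cassandra:cassandra-driver-core:2.1.10')
import com.datastax.driver.core.Cluster

def cluster = Cluster.builder().addContactPoint('10.0.0.1').build()
def session = cluster.connect()
// clears the per-node slow query history that OpsCenter reads
session.execute('TRUNCATE dse_perf.node_slow_log')
cluster.close()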

How to write a Spark data frame to a Neo4j database

I'd like to build this workflow:
preprocess some data with Spark, ending with a data frame
write such dataframe to Neo4j as a set of nodes
My idea is really basic: write each row of the data frame as a node, where each column value represents the value of a node attribute.
I have seen many articles, including neo4j-spark-connector and Introducing the Neo4j 3.0 Apache Spark Connector, but they all focus on importing data from a Neo4j db into Spark. So far I haven't been able to find a clear example of writing a Spark data frame to a Neo4j database.
Any pointers to documentation or very basic examples are much appreciated.
Reading this issue answered my question.
Long story short, neo4j-spark-connector can write Spark data to a Neo4j db, and yes, the documentation of the new release is lacking.
You can write a small routine and use the open-source Neo4j Java driver,
https://github.com/neo4j/neo4j-java-driver
for example.
Simply serialise the rows of the data frame to JSON (e.g. with df.toJSON), then use the above driver to create your Neo4j nodes and push them into your Neo4j instance.
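For example (a rough sketch, assuming the rows have already been collected to the driver as JSON strings, e.g. via df.toJSON().collect(); the node label, URI and credentials are placeholders):
// Rough sketch: node label, URI and credentials are placeholders
@Grab('org.neo4j.driver:neo4j-java-driver:1.4.4')
import org.neo4j.driver.v1.*
import groovy.json.JsonSlurper

def writeRows(List<String> jsonRows) {
    def driver = GraphDatabase.driver('bolt://localhost:7687',
            AuthTokens.basic('neo4j', 'password'))
    def session = driver.session()
    def slurper = new JsonSlurper()
    try {
        // one parameterised statement per batch: each row map becomes one node's properties
        // note: JSON decimals arrive as BigDecimal and may need converting to double first
        def rows = jsonRows.collect { slurper.parseText(it) }
        session.run('UNWIND $rows AS row CREATE (n:Record) SET n = row', [rows: rows])
    } finally {
        session.close()
        driver.close()
    }
}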
I know the question is pretty old, but I don't think the neo4j-spark-connector can solve your issue. The full story, sample code and details are available here, but to cut a long story short: if you look carefully at the Neo4jDataFrame.mergeEdgeList example (which has been suggested), you'll notice that it instantiates a driver for each row in the dataframe. That will work in a unit test with 10 rows, but you can't expect it to work in a real-world scenario with millions or billions of rows. There are also other defects explained in the link above, where you can find a CSV-based solution as well. Hope it helps.

Can GraphX be used to store, process, query and update large distributed graphs?

Can GraphX store, process, query and update large distributed graphs?
Does GraphX support these features or the data must be extracted from a Graph database source that will be subsequently processed by GraphX?
I want to avoid costs related to network communication and data movement.
It can actually be done, albeit with some pretty complicated measures. MLnick from GraphFlow posted on the Titan mailing list here that he managed to use Spark 0.8 on a Titan/Cassandra graph using FaunusVertex and TitanCassandraInputFormat, and that there was a problem with Groovy 1.8.9 and the newer Kryo version.
In his GraphFlow presentation at Spark Summit, he seemed to have made Titan/HBase over Spark 0.7.x work.
Or, if you're savvy enough to implement the TitanInputFormat/TitanOutputFormat from Titan 0.5, perhaps you could keep us in the loop. The Titan developers have said they do want to support Spark but haven't had the time/resources to do so.
Using Spark on a Titan database is pretty much the only option I can think of for your question.
Spark doesn't really have built-in support for long-term storage yet, other than through HDFS (technically it doesn't need to run on HDFS, but it is heavily integrated with it). So you could just store all the edges and vertices in files, but this is clearly not the most efficient approach. Another option is to use a graph database like Neo4j.

Using Pig with Cassandra CQL3

When trying to run Pig against a CQL3-created Cassandra schema,
-- This script simply gets a row count of the given column family
rows = LOAD 'cassandra://Keyspace1/ColumnFamily/' USING CassandraStorage();
counted = foreach (group rows all) generate COUNT($1);
dump counted;
I get the following error:
Error: Column family 'ColumnFamily' not found in keyspace 'KeySpace1'
I understand that this is by design, but I have been having trouble finding the correct method to load CQL3 tables into Pig.
Can someone point me in the right direction? Is there a missing bit of documentation?
This is now supported in Cassandra 1.2.8
As you mention, this is by design: if Thrift were updated to allow for this, it would compromise backwards compatibility. Instead of creating keyspaces and column families using CQL (I'm guessing you used cqlsh), try using the C* CLI.
Take a look at these issues as well:
https://issues.apache.org/jira/browse/CASSANDRA-4924
https://issues.apache.org/jira/browse/CASSANDRA-4377
Per this https://github.com/alexliu68/cassandra/pull/3, it appears that this fix is planned for the 1.2.6 release of Cassandra. It sounds like they're trying to get that out in the reasonably near future, but of course there's no certain ETA.
As e90jimmy said, it's supported in Cassandra 1.2.8, but there is an issue when using the counter column type. This was fixed by Alex Liu, but due to a regression problem in 1.2.7 the patch didn't go ahead:
https://issues.apache.org/jira/browse/CASSANDRA-5234
To correct this, wait until 2.0 becomes production ready, or download the source, apply the patch from the above link yourself and rebuild the Cassandra .jar. That has worked for me so far...
The best way to access CQL3 tables in Pig is by using the CqlStorage handler.
The syntax is similar to what you have above:
row = LOAD 'cql://Keyspace/ColumnFamily/' USING CqlStorage();
More info in the Dev Blog Post.

Resources