Best way to calculate nth degree connections on a graph - apache-spark

I have a graph database in ArangoDB which, for some users, has a node depth of around 100 levels and around 205k nodes. For normal use the AQL traversal queries work well, but there are some scenarios in which the traversal takes too much time:
1. Calculating the maximum depth for some user.
2. Calculating the weight for a root node which has a node depth of 50+ and 200k+ child nodes. Here the weight is calculated by summing the weights of all child nodes.
3. Fetching all the nodes on a specific level for some node.
The solutions I have tried so far:
For #1 and #2, I calculate the maximum depth and the weight in the background and keep them in a cache to avoid real-time processing. They are recalculated whenever the graph changes.
For #3, I tried placing the graph on shards, which actually worsened performance (expected, since a graph traversal cannot take advantage of sharding).
I need suggestions for the following:
Is it a good idea to pre-calculate the user ids on each level and keep them in a cache for each user (something like the sketch below)?
Is there a (better) way to split the graph across shards so that my queries can run in parallel and finish fetching the node data sooner?
Can tools like Elasticsearch or Spark help improve the performance of a graph query?
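Something like this rough sketch is what I have in mind for the level cache (assuming the python-arango driver and a named graph 'userGraph'; all names are placeholders for the real schema):

# Rough sketch of suggestion 1: pre-compute node ids per level and cache them.
from arango import ArangoClient

db = ArangoClient(hosts="http://localhost:8529").db(
    "mydb", username="root", password="secret")

def ids_at_level(root_id, level):
    """Ids of the vertices exactly `level` hops below `root_id`."""
    cursor = db.aql.execute(
        """
        FOR v IN @lvl..@lvl OUTBOUND @root GRAPH 'userGraph'
            OPTIONS { bfs: true, uniqueVertices: 'global' }
            RETURN v._id
        """,
        bind_vars={"root": root_id, "lvl": level},
    )
    return list(cursor)

# Recompute in a background job whenever the graph changes, then serve reads
# from the cache instead of traversing 100 levels at request time.
level_cache = {lvl: ids_at_level("users/123", lvl) for lvl in range(1, 101)}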
Thanks

Related

How does Apache Cassandra perform on a single read of millions of records?

Much has been written about how Cassandra's redundancy provides good performance for thousands of incoming requests from different locations, but I haven't found anything on the throughput of a single big request. That's what this question is about.
I am assessing Apache Cassandra's potential as a database solution to the following problem:
The client would be a single-server application with exclusive access to the Cassandra database, co-located in the same datacentre. The Cassandra instance might be a few nodes, but likely not more than 5.
When a certain feature runs on the application (triggered occasionally by a human) it will populate Cassandra with up to 5M records representing short arrays of float data, as well as delete such records. The records will not be updated and we never need to access individual elements of an array. The arrays can be of different lengths, but will typically have around 100 elements, and each row might represent 0-20 arrays.
For example:
id  | array1                | array2
123 | [1.0, 2.5, ..., 10.8] | [0.0, 0.5, ..., 1.0]
Bonus question: Should I use a list of doubles to represent this, or should I serialize the arrays to JSON?
At some point the user requests a report and the server should read all 5M records, interpret the arrays, do some aggregation, and plot some data on the screen. Might the read operation take <1s, <10s, <100s? How can I estimate the throughput in this case, assuming it is the bottleneck?
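For a rough sense of scale before benchmarking, here is a back-of-envelope on the raw data volume (assuming ~100 doubles per array and one array read per record; purely illustrative, not a throughput estimate):

# Back-of-envelope: raw payload only, ignoring row/serialization/network overhead.
rows = 5_000_000
doubles_per_row = 100      # assumed average array length actually read
bytes_per_double = 8
payload_gb = rows * doubles_per_row * bytes_per_double / 1e9
print(f"~{payload_gb:.1f} GB to pull for one report")   # ~4.0 GB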
Let me start with your second use case. Because your data is distributed across the nodes, a broad range query that is not narrowed down to a single partition is going to be slow in Cassandra.
Cassandra is well suited to querying, and to searching, when you know the partition key.
Even with only 5M records, assuming they are scattered across 5 different nodes, Cassandra has to go through all the nodes and aggregate the results for your reporting use case, and eventually the query times out. This specific use case is not viable in Cassandra as a single query, but if you aggregate in your service and make multiple calls, one per partition/bucket, it will perform very fast.
In general, the access pattern is what matters, and reads win: the data can be stored in any form, but reading it wisely is what matters to Cassandra.
So that answers your second part. Thank you.
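To make the bucketing idea concrete, here is a minimal sketch (not from the original answer), assuming a hypothetical table arrays_by_bucket partitioned by (job_id, bucket) and the DataStax Python driver; the service fans out one single-partition query per bucket and aggregates client-side:

# Sketch: fan out one single-partition query per bucket, then aggregate in the service.
# Hypothetical table:
#   CREATE TABLE arrays_by_bucket (
#       job_id int, bucket int, id bigint, array1 list<double>,
#       PRIMARY KEY ((job_id, bucket), id));
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("mykeyspace")

stmt = session.prepare(
    "SELECT array1 FROM arrays_by_bucket WHERE job_id = ? AND bucket = ?")

N_BUCKETS = 64  # assumed bucket count; rows get bucket = id % N_BUCKETS on write
futures = [session.execute_async(stmt, (42, b)) for b in range(N_BUCKETS)]

total, count = 0.0, 0
for f in futures:
    for row in f.result():           # each future resolves to rows from one partition
        total += sum(row.array1)
        count += len(row.array1)

print(total / count if count else 0.0)   # example aggregate: global mean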

Cassandra performance on querying fewer or more nodes

Consider a growing volume of data and two extreme choices:
1. Evenly distribute all data across all nodes in the cluster
2. Pack it onto as few nodes as possible
I prefer option 1 because, as the volume of data grows, we can scatter it across all nodes, so that each node carries the lowest load when queried.
However, some resources state that we shouldn't query all the nodes because that will slow down the query. Why would that slow the query? Isn't it just a normal scatter and gather? They even claim it hurts linear scalability, as adding more nodes will further drag down the query.
(Maybe I am missing something about how Cassandra performs the query; some background reference is appreciated.)
On the contrary, some resources state that we should go with option 2 because it queries the least number of nodes.
Of course there are no black-and-white choices here; everything has a tradeoff.
I want to know what the real difference between option 1 and option 2 is and, regarding the network querying, why option 1 would be slow.
I prefer option 1 because, as the volume of data grows, we can scatter it across all nodes, so that each node carries the lowest load when queried.
You definitely want to go with option #1. This is also preferable, in that new or replacement nodes will stream much faster than a cluster made of fewer, dense nodes.
However, some resources state that we shouldn't query all the nodes because that will slow down the query.
And those resources are absolutely correct. First of all, if you read through the resources which Alex posted above you'll discover how to build your tables so that your queries can be served by a single node. Running queries which only hit a single node is the best way around that problem.
Why would that slow the query?
Because in a distributed database environment, query time becomes network time. There are many people out there who like to run multi-key or unbound queries against Cassandra. When that happens, and the query is unable to find a single node with the data, Cassandra picks one node to designate as a "coordinator."
That node builds the result set with data from the other nodes, which means that in a 30-node cluster, that one node is now pulling data from the other 29. Even if these requests don't time out, the likelihood that the coordinator will crash from trying to manage too much data is very high.
The bottom line is that this is one of those tradeoffs between a CA relational database and an AP partitioned row store. Build your tables to support your queries, store data together which is queried together, and Cassandra will perform just fine.
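To illustrate "store data together which is queried together", here is a minimal sketch with a hypothetical sensor-readings model (none of these names come from the question), using the DataStax Python driver; because the partition key matches the query's filter, every read is served from a single partition:

# Sketch: key the table by what the query filters on, so each read is one partition.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()
session.execute(
    "CREATE KEYSPACE IF NOT EXISTS demo WITH replication = "
    "{'class': 'SimpleStrategy', 'replication_factor': 1}")
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.readings_by_sensor_day (
        sensor_id text,
        day       date,
        ts        timestamp,
        value     double,
        PRIMARY KEY ((sensor_id, day), ts)   -- partition key matches the query
    )""")

# Single-partition read: the coordinator talks only to the replicas owning
# (sensor_id='s-17', day='2023-05-01'); no cluster-wide scatter and gather.
rows = session.execute(
    "SELECT ts, value FROM demo.readings_by_sensor_day "
    "WHERE sensor_id = %s AND day = %s",
    ("s-17", "2023-05-01"))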

How to export large Neo4j datasets for analysis in an automated fashion

I've run into a technical challenge around Neo4j usage that has had me stumped for a while. My organization uses Neo4j to model customer interaction patterns. The graph has grown to a size of around 2 million nodes and 7 million edges. All nodes and edges have between 5 and 10 metadata properties. Every day, we export data on all of our customers from Neo4j to a series of python processes that perform business logic.
Our original method of data export was to use paginated cypher queries to pull the data we needed. For each customer node, the cypher queries had to collect many types of surrounding nodes and edges so that the business logic could be performed with the necessary context. Unfortunately, as the size and density of the data grew, these paginated queries began to take too long to be practical.
Our current approach uses a custom Neo4j procedure to iterate over nodes, collect the necessary surrounding nodes and edges, serialize the data, and place it on a Kafka queue for downstream consumption. This method worked for some time but is now taking long enough so that it is also becoming impractical, especially considering that we expect the graph to grow an order of magnitude in size.
I have tried the cypher-for-apache-spark and neo4j-spark-connector projects, neither of which have been able to provide the query and data transfer speeds that we need.
We currently run on a single Neo4j instance with 32GB memory and 8 cores. Would a cluster help mitigate this issue?
Does anyone have any ideas or tips for how to perform this kind of data export? Any insight into the problem would be greatly appreciated!
As far as I remember, Neo4j doesn't support horizontal scaling: all data is stored on a single node. To use Spark you could try storing your graph on 2+ instances and loading parts of the dataset from those separate instances to "simulate" parallelization. I don't know whether that is supported by either of the connectors you mention.
But as mentioned in the comments on your question, maybe you could try an alternative approach. One idea:
1. Find a data structure representing everything you need to train your model.
2. Store this "flattened" graph in a key-value store (Redis, Cassandra, DynamoDB...).
3. When something changes in the graph, push a message to your Kafka topic.
4. Add consumers that update the data in the graph and then update the key-value store directly afterwards (i.e. update only the graph branch impacted by the change; there is no need to export the whole graph or to update the key-value store in the same moment, though this will very probably lead to duplicating some logic).
5. Have your model query the key-value store directly.
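A rough sketch of steps 4 and 5, assuming kafka-python, redis-py, and a hypothetical message format carrying a customer_id and an already-flattened view of the impacted branch; the real change events and flattening logic would come from your own pipeline:

# Sketch: consume graph-change events and keep the flattened view in Redis up to date.
import json
from kafka import KafkaConsumer   # kafka-python
import redis                      # redis-py

consumer = KafkaConsumer(
    "graph-changes",                              # hypothetical topic name
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
store = redis.Redis(host="localhost", port=6379)

for msg in consumer:
    event = msg.value
    customer_id = event["customer_id"]
    # 'flattened' is assumed to be the recomputed view of just the impacted branch,
    # produced by whatever updated the graph; the whole graph is never re-exported.
    store.set(f"customer:{customer_id}", json.dumps(event["flattened"]))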
It also depends on how often your data changes, and on how deep and broad your graph is.
Neo4j Enterprise supports clustering: you could use the Causal Cluster feature, launch as many read replicas as needed, and run the queries in parallel on the read replicas. See this link: https://neo4j.com/docs/operations-manual/current/clustering/setup-new-cluster/#causal-clustering-add-read-replica
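A minimal sketch of that approach with the official Neo4j Python driver (5.x assumed; the label, property, and paging details are hypothetical): with a neo4j:// routing URI, read transactions are routed to followers and read replicas, so pages can be exported in parallel:

# Sketch: fan the export out as read-only, paged queries against the cluster.
from concurrent.futures import ThreadPoolExecutor
from neo4j import GraphDatabase

driver = GraphDatabase.driver("neo4j://cluster-host:7687", auth=("neo4j", "password"))

PAGE = 10_000
QUERY = (
    "MATCH (c:Customer) WITH c ORDER BY c.id SKIP $skip LIMIT $limit "
    "MATCH (c)-[r]-(n) "
    "RETURN c.id AS id, collect({rel: type(r), node: properties(n)}) AS context"
)

def export_page(page_no):
    with driver.session() as session:
        # execute_read routes to a read replica / follower on a causal cluster
        return session.execute_read(
            lambda tx: [rec.data() for rec in
                        tx.run(QUERY, skip=page_no * PAGE, limit=PAGE)])

with ThreadPoolExecutor(max_workers=8) as pool:
    pages = list(pool.map(export_page, range(20)))   # 20 pages as an example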

Cosmos DB Graph Edge partitioning

Cosmos DB has pre-announced general availability of Gremlin (Graph API). Probably by the end of 2017 it will get out of preview, so we might consider it stable enough for production. That brings me to the following:
We are designing a system with an estimated user-base up to 100 million users. Each user will have some documents in Cosmos to store user-related data, those documents are partitioned on the id of the user (a Guid). So when estimations come true we will end up with at least 100 million partitions, each containing a bunch of documents.
Not only will we store user-related data but also interrelated data (relationships) between users. On paper Cosmos should be very well suited for these kinds of scenarios, utilizing it cross-api with Document API for normal data and Graph API purely for the relationships.
An example of one of these relationships is a Follow. For instance UserX can Follow UserY. To realize this relationship, we created a Gremlin query that creates an Edge:
g.V().hasId('{userX.Id}').has('pkey','{userX.Partition}')
.addE('follow').to(g.V().hasId('{userY.Id}').has('pkey','{userY.Partition}'))
The resulting Edge automatically gets assigned to the partition of UserX, because UserX is the out-vertex.
When querying on outgoing edges (all the users that UserX is following), all is fine and well because the query is limited to the partition for UserX.
g.V().hasId('{userX.Id}').has('pkey','{userX.Partition}').outE('follow').inV()
However when inverting the query (find all followers of UserY), looking for incoming edges, the situation changes - to my knowledge this will result in a full cross-partition query:
g.V().hasId('{userY.Id}').has('pkey','{userY.Partition}').inE('follow').outV()
In my opinion a full cross-partition query with 100 million partitions is unacceptable.
I have tried putting the Edge between UserX and UserY inside its own partition, but the Graph API does not let me do this. (Edit: Changed Cosmos to Graph API)
Now I have come to the point of implementing a pair of edges between UserX and UserY: one outgoing edge for UserX and one outgoing edge for UserY, which I try to keep in sync. All this to optimize the speed of my queries, but it also introduces more work to achieve eventual consistency.
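Concretely, the dual-edge workaround could look something like the sketch below (the reverse label 'followedBy' is hypothetical; submit the queries with whichever Gremlin client the application already uses, and retry or compensate if the second write fails, since the pair is only eventually consistent):

# Sketch of the dual-edge workaround: one edge stored in each user's partition,
# so both "who does X follow" and "who follows Y" stay single-partition reads.
def follow_queries(user_x, user_y):
    fwd = (  # lives in UserX's partition (UserX is the out-vertex)
        f"g.V().hasId('{user_x['id']}').has('pkey','{user_x['partition']}')"
        f".addE('follow')"
        f".to(g.V().hasId('{user_y['id']}').has('pkey','{user_y['partition']}'))"
    )
    rev = (  # lives in UserY's partition (UserY is the out-vertex)
        f"g.V().hasId('{user_y['id']}').has('pkey','{user_y['partition']}')"
        f".addE('followedBy')"
        f".to(g.V().hasId('{user_x['id']}').has('pkey','{user_x['partition']}'))"
    )
    return [fwd, rev]

# Followers of UserY then become a single-partition query:
#   g.V().hasId('{userY.Id}').has('pkey','{userY.Partition}').outE('followedBy').inV()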
Then again, I am wondering whether the Graph API is really up to these kinds of scenarios, or whether I am missing something here.
I will start by clearing up a slight misconception you have regarding Cosmos DB partitioning. 100 million users doesn't mean 100 million partitions; it simply means 100 million partition keys. When you create a Cosmos DB graph it starts with 10 physical partitions (this is the starting default, which can be changed upon request) and then scales automatically as data grows.
In this case the 100 million users will be distributed among those 10 physical partitions, so a full cross-partition query will hit 10 physical partitions. Also note that these partitions are hit in parallel, so the expected latency would be similar to hitting one partition, unless the operation is aggregate-like in nature.
This is a classic partitioning dilemma, not unique to Cosmos/Graph.
If your usage pattern is lots of queries with a small scope, then cross-partition is bad. If it is returning large data sets, then the cross-partition overhead is probably insignificant next to the benefits of parallelism. Unless you have a constant high volume of queries, I think the cross-partition overhead is overstated (MS seem to think everyone is building the next Facebook on Cosmos).
In the OP's case you can optimise for "x follows y", for "x is followed by y", or for both by having an edge each way. Note that RUs are reserved on a per-partition basis (i.e. total RU / number of partitions), so to use them efficiently you need either a high volume of evenly distributed single-partition queries, or queries that span multiple partitions.
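As an illustrative example (numbers made up): a graph provisioned at 10,000 RU/s spread over 10 physical partitions gives each partition roughly 1,000 RU/s. Queries that always land on the same few partitions can be throttled there while the RUs of the other partitions sit idle, whereas evenly distributed single-partition queries, or fan-out queries, actually consume the full budget.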

Is it bad to use INDEX in Cassandra if performance is not important?

Background
We have recently started a "Big Data" project where we want to track what users are doing with our product (how often they log in, which features they click on, etc.: your basic user analytics stuff). We still don't know exactly what questions we will be asking, but most of them will be of the "how often did X occur over the last Y months?" type, so we started storing the data sooner rather than later, figuring we can always migrate and re-shape it when we need to, but if we don't store it, it is gone forever.
We are now looking at what sorts of questions we can ask. In a typical RDBMS, this stage would consist of slicing and dicing the data in many different dimensions, exporting to Excel, producing graphs, looking for trends etc - it seems that for Cassandra, this is rather difficult to do.
Currently we are using Apache Spark, and submitting Spark SQL jobs to slice and dice the data. This actually works really well, and we are getting the data we need, but it is rather cumbersome as there doesn't seem to be any native API for Spark that we can connect to from our workstations, so we are stuck using the spark-submit script and a Spark app that wraps some SQL from the command line and outputs to a file which we then have to read.
The question
In a table (or Column Family) with ~30 columns running on 3 nodes with RF 2, how bad would it be to add an INDEX to every non-PK column, so that we could simply query it using CQL across any column? Would there be a horrendous impact on the performance of writes? Would there be a large increase in disk space usage?
The other option I have been investigating is using Triggers, so that for each row inserted, we populated another handful of tables (essentially, custom secondary index tables) - is this a more acceptable approach? Does anyone have any experience of the performance impact of Triggers?
Impact of adding more indexes:
This really depends on your data structure, its distribution, and how you access it; you were right earlier when you compared this process to an RDBMS. For Cassandra, it's best to define your queries first and then build the data model around them.
These guys have a nice write-up on the performance impacts of secondary indexes:
https://pantheon.io/blog/cassandra-scale-problem-secondary-indexes
The main impact (from the post) is that secondary indexes are local to each node, so to satisfy a query by an indexed value, each node has to query its own records to build the final result set (as opposed to a primary-key query, where it is known exactly which node needs to be queried). So there's an impact not just on writes, but on read performance as well.
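A minimal sketch of the "custom secondary index tables" alternative mentioned in the question, assuming a hypothetical events table and the DataStax Python driver; the application writes the base row and the query-table row together (a logged batch here, at the cost of some extra write latency):

# Sketch: maintain a query table keyed by the column you want to filter on,
# instead of relying on a node-local secondary index.
from cassandra.cluster import Cluster
from cassandra.query import BatchStatement

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("analytics")   # hypothetical keyspace

insert_base = session.prepare(
    "INSERT INTO events (user_id, event_time, feature, payload) VALUES (?, ?, ?, ?)")
insert_by_feature = session.prepare(
    "INSERT INTO events_by_feature (feature, event_time, user_id, payload) VALUES (?, ?, ?, ?)")

def record_event(user_id, event_time, feature, payload):
    # A logged batch keeps the two tables consistent with each other.
    batch = BatchStatement()
    batch.add(insert_base, (user_id, event_time, feature, payload))
    batch.add(insert_by_feature, (feature, event_time, user_id, payload))
    session.execute(batch)

# Querying by feature is now a normal partition-key lookup, not an index scan:
#   SELECT * FROM events_by_feature WHERE feature = 'login' AND event_time > ...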
In terms of working out the performance of your data model, I'd recommend using the cassandra-stress tool; you can combine it with a data modeler tool that Datastax has built to quickly generate profile YAMLs:
http://www.datastax.com/dev/blog/data-modeler
For example, I ran the basic stress profile without and then with secondary indexes on the default table, and the "with indexes" batch of writes took a little over 40% longer to complete. There was also an increase in GC operations / duration etc.
