I have a bipartite graph of urban traffic and I want to implement SimRank on it. I have two RDDs, each containing the nodes and neighbors of one side of the graph: one RDD has cars as nodes, with the cameras each car has passed, and the other has cameras as nodes, with the cars that passed each camera. I am using a random walk. I start with a query (for example, the id of a camera) and want to find similar cameras. I look up that camera in the camera RDD and pick one of its neighbors at random. Then I look up the selected car id in the car RDD, and the process keeps going until I have run into the same camera a few times, at which point I conclude that it is similar to the query. The problem is that the code is very slow: it takes 2 hours.
Do you have any suggestions?
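For reference, the walk described above can be sketched in plain Python. This is only a sketch under assumptions: both adjacency lists are assumed to fit in memory as ordinary dicts (for example after a `collectAsMap()` on each pair RDD), and the function name, the `walks`/`steps` parameters and the visit-count ranking are illustrative choices, not the full SimRank recurrence. The point is that every hop is a dictionary access rather than a separate RDD `lookup`.

```python
import random
from collections import Counter

def random_walk_similar(query, camera_to_cars, car_to_cameras,
                        walks=1000, steps=2, seed=0):
    """Rank cameras by how often random walks from `query` end on them.

    camera_to_cars / car_to_cameras are plain dicts (assumed to fit in memory).
    """
    rng = random.Random(seed)
    hits = Counter()
    for _ in range(walks):
        camera = query
        for _ in range(steps):
            cars = camera_to_cars.get(camera)
            if not cars:
                break                      # dead end: camera with no cars
            car = rng.choice(cars)         # camera -> random car
            cameras = car_to_cameras.get(car)
            if not cameras:
                break                      # dead end: car with no cameras
            camera = rng.choice(cameras)   # car -> random camera
        if camera != query:
            hits[camera] += 1
    return hits.most_common()

# Tiny hypothetical graph: c1 and c2 share car "a"; c3 is unrelated.
camera_to_cars = {"c1": ["a", "b"], "c2": ["a"], "c3": ["z"]}
car_to_cameras = {"a": ["c1", "c2"], "b": ["c1"], "z": ["c3"]}
ranking = random_walk_similar("c1", camera_to_cars, car_to_cameras)
```

On this toy graph only `c2` can ever be reached from `c1`, so it is the only camera in `ranking`.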
I have a graph database on ArangoDB with a node depth of around 100 levels and around 205k nodes for some users. For normal use, the AQL used to traverse the graph works well. But there are some scenarios in which the graph traversal takes too much time:
1. Calculating the maximum depth for a given user.
2. Calculating the weight of a root node that has a node depth of 50+ and 200k+ child nodes. Here the weight is calculated by summing the weight of each child node.
3. Fetching all the nodes on a specific level for a given node.
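To make the three computations concrete, here is a minimal in-memory sketch that derives all of them from one breadth-first pass. It is not AQL and makes several assumptions: the graph under a root is a tree (no cycles), `children` and `weight` are hypothetical dicts exported from the database, and a node's weight is taken to include its own weight plus that of all descendants (drop the `weight.get(node, 0)` term if only children should count).

```python
def level_stats(children, weight, root):
    """Return (max_depth, nodes_by_level, subtree_weight) for the tree under root."""
    nodes_by_level = []          # nodes_by_level[d] = nodes at depth d
    order = []                   # all nodes in breadth-first order
    frontier = [root]
    while frontier:
        nodes_by_level.append(frontier)
        order.extend(frontier)
        frontier = [c for n in frontier for c in children.get(n, [])]

    # Children always appear after their parent in BFS order, so walking the
    # order backwards guarantees each child is summed before its parent.
    subtree_weight = {}
    for node in reversed(order):
        subtree_weight[node] = weight.get(node, 0) + sum(
            subtree_weight[c] for c in children.get(node, []))

    return len(nodes_by_level) - 1, nodes_by_level, subtree_weight

# Hypothetical example data.
children = {"r": ["a", "b"], "a": ["c"]}
weight = {"r": 1, "a": 2, "b": 3, "c": 4}
max_depth, levels, totals = level_stats(children, weight, "r")
```

Because the pass is iterative, it does not hit recursion limits on trees that are 100 levels deep, and the per-level lists are exactly what a per-user cache would store.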
The solutions I have tried are as follows:
For #1 and #2, I calculate the maximum depth and weight in the background and keep them in a cache to avoid real-time processing. They are recalculated whenever the graph changes.
For #3, I tried placing the graph on shards, which actually worsened performance (this was expected, because the graph cannot take advantage of sharding).
I need suggestions for the following:
Is it a good idea to pre-calculate the user ids on each level and place them in a cache for each user?
Is there a better way to split the graph across shards so that my query can run in parallel and finish fetching the node data sooner?
Can tools like Elasticsearch or Spark help improve the performance of a graph query?
Thanks
I am trying to build a Watson Conversation for an application. I have created a single intent and it has multiple child dialog nodes. Two sibling dialog nodes have the same child nodes, so the hierarchy is repeated.
Is there any way to handle this situation, that is, to reduce the duplicate nodes or reuse the existing ones? As it stands, the same nodes are repeated under each sibling dialog node.
The image below is self-explanatory.
In the image, you can see that the two dialog nodes (#boolean:yes / #boolean:no) are the same under both sibling nodes.
So, without creating two similar nodes, how can I create a common node that is used by both siblings?
Any help, please...
To solve your issue, you can use a "continue from" and point it to the input node just before the place where you want to continue in the tree.
I am new to R-tree/B-tree data structures. Creating the tree is a bottom-up process, but searching for a node, range search and kNN search are all top-down processes. I am using kNN search but want to make an improvement: my data is a trajectory of points that are spatially close to each other. To find the k nearest neighbors of every point on the trajectory, I want to search for one point first; then, for the other points, I don't want to start from the root again. Instead, I want to start from the results of the first point and go up to their parents. This would let me avoid searching a lot of unnecessary pages. The problem is: how can I go up from a child to its parent in an R-tree/B-tree structure? Should I change the tree creation process and, whenever a split happens, fill in the parent[] property of the child? Is there a simpler way to solve this?
You could:
Store a pointer to the parent node in each child node, so you know how to move upwards in the node structure. Between queries, keep a pointer to the last leaf node; from there, follow the parent pointer upwards, check the parent node, move upwards again, and so on, until you reach a node where a different subtree should be picked.
Store only pointers to child nodes in every node. Then, between queries, save the whole path of nodes used to get from the root to a leaf in the last query. With that path you can walk backwards through the collection, which amounts to going upwards from the leaf used in the last query to a node where you should pick a different subtree.
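The second option (saving the root-to-leaf path) can be sketched as follows. This is a simplified illustration, not a full kNN search: the `Node` class, `contains`, and `locate` are hypothetical names, boxes are axis-aligned `(xmin, ymin, xmax, ymax)` tuples, and the descent just takes the first child whose box contains the point, whereas a real R-tree has overlapping boxes and a kNN search must also visit boxes that intersect the current search radius.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    box: tuple                                   # (xmin, ymin, xmax, ymax)
    children: list = field(default_factory=list) # empty for leaves

def contains(box, p):
    return box[0] <= p[0] <= box[2] and box[1] <= p[1] <= box[3]

def locate(root, p, last_path=None):
    """Return the path from root to the deepest node containing p.

    If last_path (the path from the previous query) is given, start from its
    leaf and climb only as far as needed instead of restarting at the root.
    """
    path = list(last_path) if last_path else [root]
    # Climb: drop nodes from the end of the path until one contains p.
    while len(path) > 1 and not contains(path[-1].box, p):
        path.pop()
    # Descend: follow the first child whose box contains p.
    node = path[-1]
    while node.children:
        nxt = next((c for c in node.children if contains(c.box, p)), None)
        if nxt is None:
            break
        path.append(nxt)
        node = nxt
    return path

leaf_a = Node(box=(0, 0, 5, 5))
leaf_b = Node(box=(6, 0, 10, 5))
root = Node(box=(0, 0, 10, 5), children=[leaf_a, leaf_b])

path1 = locate(root, (1, 1))             # starts at the root
path2 = locate(root, (2, 2), path1)      # stays in leaf_a, no climb needed
path3 = locate(root, (7, 1), path2)      # climbs to root, descends to leaf_b
```

For the first option, the same climb would follow a `parent` attribute on each node instead of popping from the saved path; the trade-off is that parent pointers must be kept up to date on every split.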
This question is about the routing table creation at a node in a p2p network based on Pastry.
I'm trying to simulate this scheme of routing table creation in a single JVM. I can't seem to understand how these routing tables are created from the point of joining of the first node.
I have N independent nodes, each with a 160-bit nodeId generated as a SHA-1 hash, and a function to determine the proximity between these nodes. Let's say the 1st node starts the ring and joins it. The protocol says that this node should have had its routing tables set up at this time. But I do not have any other nodes in the ring at this point, so how does it even begin to create its routing tables?
When the 2nd node wishes to join the ring, it sends a Join message(containing its nodeID) to the 1st node, which it passes around in hops to the closest available neighbor for this 2nd node, already existing in the ring. These hops contribute to the creation of routing table entries for this new 2nd node. Again, in the absence of sufficient number of nodes, how do all these entries get created?
I'm just beginning to take a look at the FreePastry implementation to get these answers, but it doesn't seem very apparent at the moment. If anyone could provide some pointers here, that'd be of great help too.
My understanding of Pastry is not complete, by any stretch of the imagination, but it was enough to build a more-or-less working version of the algorithm. Which is to say, as far as I can tell, my implementation functions properly.
To answer your first question:
The protocol says that this [first] node should have had its routing tables
set up at this time. But I do not have any other nodes in the ring at
this point, so how does it even begin to create its routing tables?
I solved this problem by first creating the Node and its state/routing tables. The routing tables, when you think about it, are just information about the other nodes in the network. Because this is the only node in the network, the routing tables are empty. I assume you have some way of creating empty routing tables?
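As a rough illustration of what "empty routing tables" can look like for the first node, here is a small sketch. The class and attribute names are hypothetical, the string hashed to obtain the ID is made up, and `B = 4` bits per digit is an assumed configuration value; only the 160-bit SHA-1 ID comes from the question.

```python
import hashlib

B = 4              # assumed bits per digit, so a node ID is read as hex digits
ROWS = 160 // B    # one routing-table row per digit of a 160-bit ID
COLS = 2 ** B      # one column per possible digit value

class PastryNode:
    def __init__(self, name):
        # 160-bit ID as 40 hex characters; `name` is a hypothetical seed string.
        self.node_id = hashlib.sha1(name.encode("utf-8")).hexdigest()
        # No other nodes are known yet, so every slot starts out empty.
        self.routing_table = [[None] * COLS for _ in range(ROWS)]
        self.leaf_set = set()
        self.neighborhood_set = set()

    def knows_nobody(self):
        return (not self.leaf_set and not self.neighborhood_set and
                all(slot is None for row in self.routing_table for slot in row))

first = PastryNode("node-1")
```

The first node is fully set up at this point; its tables simply contain no entries until other nodes join.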
To answer your second question:
When the 2nd node wishes to join the ring, it sends a Join
message(containing its nodeID) to the 1st node, which it passes around
in hops to the closest available neighbor for this 2nd node, already
existing in the ring. These hops contribute to the creation of routing
table entries for this new 2nd node. Again, in the absence of
sufficient number of nodes, how do all these entries get created?
You should take another look at the paper (PDF warning!) that describes Pastry; it does a rather good job of explaining the process for nodes joining and exiting the cluster.
If memory serves, the second node sends a message that not only contains its node ID, but actually uses its node ID as the message's key. The message is routed like any other message in the network, which ensures that it quickly winds up at the node whose ID is closest to the ID of the newly joined node. Every node that the message passes through sends its state tables to the newly joined node, which uses them to populate its own state tables. The paper explains some in-depth logic that takes the origin of the information into consideration when using it to populate the state tables. I believe this is intended to reduce the computational cost, but in my implementation I ignored it, as it would have been more expensive to implement, not less.
To answer your question specifically, however: the second node will send a Join message to the first node. The first node will send its state tables (empty) to the second node. The second node will add the sender of the state tables (the first node) to its state tables, then add the appropriate nodes in the received state tables to its own state tables (no nodes, in this case). The first node would forward the message on to a node whose ID is closer to the second node's, but no such node exists, so the message is considered "delivered", and both nodes are considered to be participating in the network at this time.
Should a third node join and route a Join message to the second node, the second node would send the third node its state tables. Then, assuming the third node's ID is closer to the first node's, the second node would forward the message to the first node, which would send the third node its state tables. The third node would build its state tables out of these received state tables, and at that point it is considered to be participating in the network.
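The join sequence described above can be simulated in a single process with a heavily simplified model. Everything here is an assumption made for illustration: node IDs are small integers, "closeness" is the plain absolute difference rather than distance on a circular ID space, and a single `known` dict stands in for the routing table and leaf set. The final step, where the new node announces itself to the nodes it learned about, is not spelled out in the walkthrough above and is included as an assumption so that existing nodes learn about the newcomer.

```python
class Node:
    def __init__(self, node_id):
        self.node_id = node_id
        self.known = {}   # node_id -> Node; stands in for all state tables

    def closest_known(self, key):
        # Self is listed first so that ties keep the message where it is.
        candidates = [self] + list(self.known.values())
        return min(candidates, key=lambda n: abs(n.node_id - key))

def join(new_node, bootstrap):
    """Route a Join keyed by new_node's ID, starting at bootstrap.

    Returns the IDs of the nodes the message passed through.
    """
    visited = []
    current = bootstrap
    while True:
        visited.append(current.node_id)
        # The node on the route sends its state; the new node records the
        # sender and every node listed in the received state.
        new_node.known[current.node_id] = current
        new_node.known.update(current.known)
        nxt = current.closest_known(new_node.node_id)
        if nxt is current:
            break             # nobody closer is known: message is delivered
        current = nxt
    # Assumed final step: the new node announces itself to everyone it knows.
    for node in list(new_node.known.values()):
        node.known[new_node.node_id] = new_node
    return visited

n1, n2, n3 = Node(100), Node(900), Node(150)   # hypothetical IDs
route2 = join(n2, bootstrap=n1)   # first node has empty tables
route3 = join(n3, bootstrap=n2)   # forwarded from n2 to n1, which is closer
```

Here the second join is forwarded from the second node to the first, mirroring the three-node scenario above, and afterwards every node knows the other two.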
Hope that helps.