How to find the indirect nodes connected to a particular node in Spark Graphx - apache-spark

I want to find the indirect nodes that are connected to a particular node.
I tried using the connected components operation of Graph, like below:
graph.connectedComponents
However, that computes components for the whole graph, but I want the result for a particular node.
I have also tried the following:
graph.edges.filter(_.srcId == x).map(_.dstId)
This gives the direct neighbours of a particular node, and I would have to apply it recursively using RDD operations only.
Could anyone please help with this?

Try something like this:
graph.edges.filter(_.srcId == x)                          // edges leaving x
  .map(e => (e.dstId, null))                              // key by x's direct neighbours
  .join(graph.collectNeighborIds(EdgeDirection.Either))   // attach each neighbour's own neighbour ids
  .flatMap { t => t._2._2 }                               // flatten into the two-hop vertex ids
  .collect.toSet
If you want to go deeper than this, I would use something like the Pregel API. Essentially, it lets you repeatedly send messages from node to node and aggregate the results.
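To see the moving parts in isolation first, here is a minimal sketch (assuming a SparkContext named sc is in scope; the tiny graph and the "propagate the smallest reachable vertex id" task are just illustrative) of how a Pregel computation is wired together:
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// Toy graph: 1 -> 2 -> 3, each vertex initially carries its own id.
val vertices: RDD[(VertexId, Long)] = sc.parallelize(Seq((1L, 1L), (2L, 2L), (3L, 3L)))
val edges: RDD[Edge[Int]] = sc.parallelize(Seq(Edge(1L, 2L, 0), Edge(2L, 3L, 0)))
val g = Graph(vertices, edges)

// Propagate the smallest vertex id seen so far downstream.
val smallest = g.pregel(Long.MaxValue)(
  (id, attr, msg) => math.min(attr, msg),       // vprog: keep the smaller of current value and message
  triplet =>                                    // sendMsg: only message when there is something new to say
    if (triplet.srcAttr < triplet.dstAttr) Iterator((triplet.dstId, triplet.srcAttr))
    else Iterator.empty,
  (a, b) => math.min(a, b)                      // mergeMsg: combine messages arriving at the same vertex
)
smallest.vertices.collect.foreach(println)
The iteration stops on its own once no vertex sends any more messages, which is the same termination mechanism the solution below relies on.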
Edit: Pregel Solution
I finally got the iterations to stop on their own. Edits below. Given this graph:
graph.vertices.collect
res46: Array[(org.apache.spark.graphx.VertexId, Array[Long])] = Array((4,Array()), (8,Array()), (1,Array()), (9,Array()), (5,Array()), (6,Array()), (2,Array()), (3,Array()), (7,Array()))
graph.edges.collect
res47: Array[org.apache.spark.graphx.Edge[Double]] = Array(Edge(1,2,0.0), Edge(2,3,0.0), Edge(3,4,0.0), Edge(5,6,0.0), Edge(6,7,0.0), Edge(7,8,0.0), Edge(8,9,0.0), Edge(4,2,0.0), Edge(6,9,0.0), Edge(7,9,0.0))
We are going to send messages of the type Array[Long] -- an array of all the VertexIds of connected nodes. Messages are going to go upstream -- the dst will send the src its VertexId along with all of the other downstream VertexIds. If the upstream node already knows about the connection, no message will be sent. Eventually, every node knows about every connected node and no more messages will be sent.
First we define our vprog. According to the docs:
the user-defined vertex program which runs on each vertex and receives
the inbound message and computes a new vertex value. On the first
iteration the vertex program is invoked on all vertices and is passed
the default message. On subsequent iterations the vertex program is
only invoked on those vertices that receive messages.
def vprog(id: VertexId, orig: Array[Long], newly: Array[Long]): Array[Long] = {
  (orig ++ newly).toSet.toArray   // merge the newly received ids into the vertex's current set
}
Then we define our sendMsg -- edited: swapped src & dst
a user supplied function that is applied to out edges of vertices that
received messages in the current iteration
def sendMsg(trip: EdgeTriplet[Array[Long], Double]): Iterator[(VertexId, Array[Long])] = {
  // Send dst's id plus everything dst already knows to src,
  // but only if src doesn't already know all of it (otherwise stay quiet so Pregel can terminate).
  if (trip.srcAttr.intersect(trip.dstAttr ++ Array(trip.dstId)).length != (trip.dstAttr ++ Array(trip.dstId)).toSet.size) {
    Iterator((trip.srcId, (Array(trip.dstId) ++ trip.dstAttr).toSet.toArray))
  } else Iterator.empty
}
Next our mergeMsg:
a user supplied function that takes two incoming messages of type A
and merges them into a single message of type A. This function must be
commutative and associative and ideally the size of A should not
increase.
Unfortunately, we're going to break the rule in the last sentence above:
def mergeMsg(a: Array[Long], b: Array[Long]): Array[Long] = {
  (a ++ b).toSet.toArray
}
Then we run pregel -- edited: removed maxIterations, defaults to Int.MaxValue
val result = graph.pregel(Array[Long]())(vprog, sendMsg, mergeMsg)
And you can look at the results:
result.vertices.collect
res48: Array[(org.apache.spark.graphx.VertexId, Array[Long])] = Array((4,Array(4, 2, 3)), (8,Array(8, 9)), (1,Array(1, 2, 3, 4)), (9,Array(9)), (5,Array(5, 6, 9, 7, 8)), (6,Array(6, 7, 9, 8)), (2,Array(2, 3, 4)), (3,Array(3, 4, 2)), (7,Array(7, 8, 9)))
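If you only care about a single node (the x from the question), you don't need to collect everything; a minimal sketch, assuming x holds the VertexId you are interested in, and keeping in mind that because messages flow upstream each vertex only learns about the nodes reachable downstream of it:
// Hypothetical lookup of the connected set computed for one vertex `x`.
val connectedToX: Array[Long] =
  result.vertices.filter { case (id, _) => id == x }.first()._2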

Related

Max Aggregation with Hazelcast-jet

I want to do a simple max across an entire dataset. I started with the Kafka example at: https://github.com/hazelcast/hazelcast-jet-code-samples/blob/0.7-maintenance/kafka/src/main/java/avro/KafkaAvroSource.java
I just changed the pipeline to:
p.drawFrom(KafkaSources.<Integer, User>kafka(brokerProperties(), TOPIC))
.map(Map.Entry::getValue)
.rollingAggregate(minBy(comparingInt(user -> (Integer) user.get(2))))
.map(user -> (Integer) user.get(2))
.drainTo(Sinks.list("result"));
and the code that reads the result to:
IListJet<Integer> res = jet.getList("result");
SECONDS.sleep(10);
System.out.println(res.get(0));
SECONDS.sleep(15);
System.out.println(res.get(0));
cancel(job);
to get the largest age of people in the topic. It however doesn't return 20 and seems to return different values on different runs. Any idea why?
You are using rollingAggregate, which produces a new output item every time it receives some input, but all you check is the first item it emitted. You must instead look at the latest item it emitted. One way to achieve that is to push the result into an IMap sink, using the same key every time (note, too, that the pipeline aggregates with minBy; if you really want the largest age, you presumably want maxBy):
p.drawFrom(KafkaSources.<Integer, User>kafka(brokerProperties(), TOPIC))
.withoutTimestamps()
.map(Map.Entry::getValue)
.rollingAggregate(minBy(comparingInt(user -> (Integer) user.get(2))))
.map(user -> entry("user", (Integer) user.get(2)))
.drainTo(Sinks.map("result"));
You can fetch the latest result with
IMap<String, Integer> result = jet.getMap("result");
System.out.println(result.get("user"));

GML room_goto() Error, Expecting Number

I'm trying to make a game that chooses a room from a pool of rooms using GML, but I get the following error:
FATAL ERROR in action number 3 of Create Event for object obj_control:
room_goto argument 1 incorrect type (5) expecting a Number (YYGI32)
at gml_Object_obj_control_CreateEvent_3 (line 20) - room_goto(returnRoom)
pool = ds_list_create()
ds_list_insert(pool, 0, rm_roomOne)
ds_list_insert(pool, 1, rm_roomTwo)
ds_list_insert(pool, 2, rm_roomThree)
ds_list_insert(pool, 3, rm_roomFour)
var returnIndex;
var returnRoom;
returnIndex = irandom(ds_list_size(pool))
returnRoom = ds_list_find_value(pool, returnIndex)
if (ds_list_size(pool) == 0) {
    room_goto(rm_menu_screen)
} else {
    room_goto(returnRoom)
}
I don't understand why the error message says it's expecting a number.
This is weird indeed... I think this should actually work, but I have no GameMaker around to test with.
For now you can also solve this using choose. That spares you the list entirely (and saves memory, because you are not cleaning the list up by destroying it, so it stays resident):
room_goto(choose(rm_roomOne, rm_roomTwo, rm_roomThree, rm_roomFour));
choose basically does exactly what you're looking for. Might not be the best way to go if you're re-using the group of items though.

Torch - Multithreading to load tensors into a queue for training purposes

I would like to use the threads library (or perhaps parallel) for loading/preprocessing data into a queue, but I am not entirely sure how it works. In summary:
Load data (tensors), preprocess the tensors (this takes time, hence why I am here) and put them in a queue. I would like to have as many threads as possible doing this so that the model is never waiting, or at least not waiting for long.
Take the tensor at the front of the queue, remove it, and forward it through the model.
I don't really understand the example at https://github.com/torch/threads well enough. A hint or example showing where I would load data into the queue and train on it would be great.
EDIT 14/03/2016
In this example "https://github.com/torch/threads/blob/master/test/test-low-level.lua" using a low level thread, does anyone know how I can extract data from these threads into the main thread?
Look at this multi-threaded data provider:
https://github.com/soumith/dcgan.torch/blob/master/data/data.lua
It runs this file in the thread:
https://github.com/soumith/dcgan.torch/blob/master/data/data.lua#L18
by calling it here:
https://github.com/soumith/dcgan.torch/blob/master/data/data.lua#L30-L43
And afterwards, if you want to queue a job into the thread, you provide two functions:
https://github.com/soumith/dcgan.torch/blob/master/data/data.lua#L84
The first one runs inside the thread, and the second one runs in the main thread after the first one completes.
Hopefully that makes it a bit more clear.
If Soumith's examples in the previous answer are not very easy to use, I suggest you build your own pipeline from scratch. Here is an example with two synchronized threads: one for writing data and one for reading data:
local t = require 'threads'
t.Threads.serialization('threads.sharedserialize')
local tds = require 'tds'
local dict = tds.Hash() -- only local variables work here, and only tables or tds.Hash()
dict[1] = torch.zeros(4)
local m1 = t.Mutex()
local m2 = t.Mutex()
local m1id = m1:id()
local m2id = m2:id()
m1:lock()
local pool = t.Threads(
  1,
  function(threadIdx)
  end
)
pool:addjob(
  -- first function: runs inside the worker thread and keeps writing data
  function()
    local t = require 'threads'
    local m1 = t.Mutex(m1id)
    local m2 = t.Mutex(m2id)
    while true do
      m2:lock()
      dict[1] = torch.randn(4)
      m1:unlock()
      print('W ===> ')
      print(dict[1])
      collectgarbage()
      collectgarbage()
    end
    return __threadid
  end,
  -- second function: runs in the main thread once the first one completes
  function(id)
  end
)
-- Code executing on master (the reader):
local a = 1
while true do
  m1:lock()
  a = dict[1]
  m2:unlock()
  print('R --> ')
  print(a)
end

Akka: How to ensure that message has been received?

I have an actor, Dispenser. What it does is:
dispense some objects on request
listen for newly arriving ones
Code follows:
class Dispenser extends Actor {
  override def receive: Receive = {
    case Get =>
      context.sender ! getObj()
    case x: SomeType =>
      addObj(x)
  }
}
In real processing it doesn't matter whether 1 ms or even a few seconds pass between the moment a new object is sent and the moment the dispenser starts dispensing it, so there is no code tracking that delay.
But now I'm writing a test for the dispenser, and I want to be sure that it first receives the new object and only then receives a Get request.
Here's the test code I came up with:
val dispenser = system.actorOf(Props.create(classOf[Dispenser]))
dispenser ! obj
Thread.sleep(100)
val task = dispenser ? Get()
val result = Await.result(task, timeout)
check(result)
It satisfies one important requirement: it doesn't change the original code. But it is
at least 100 ms slow even on very high-performance boxes
unstable, and it fails sometimes, because 100 ms (or any other constant) doesn't provide any guarantees.
So the question is: how do I write a test that satisfies the requirement and doesn't have the cons above (or any other obvious ones)?
You can take out the Thread.sleep(..) and your test will be fine. Akka guarantees the ordering you need.
With the code
dispenser ! obj
val task = dispenser ? Get()
dispenser will process obj before Get deterministically because
The same thread enqueues obj and then Get, so they sit in the correct order in the actor's mailbox
Actors process messages sequentially and one-at-a-time, so the two messages will be received by the actor and processed in the order they're queued in the mailbox.
(..if there's nothing else going on that's not in your sample code - routers, async processing in getObj or addObj, stashing, ..)
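Concretely, that is just the test from the question with the sleep removed (same names and helpers as in the question, so read it as a sketch rather than a drop-in snippet):
val dispenser = system.actorOf(Props.create(classOf[Dispenser]))
dispenser ! obj                           // enqueued first
val task = dispenser ? Get()              // enqueued second, so processed after obj
val result = Await.result(task, timeout)
check(result)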
The Akka FSM module is really handy for testing the underlying state and behavior of an actor, and it does not require changing the implementation specifically for tests.
Using TestFSMRef you can get the actor's current state and data:
val testActor = TestFSMRef(<actors constructor or Props>)
testActor.stateName shouldBe <state name>
testActor.stateData shouldBe <state data>
http://doc.akka.io/docs/akka/2.4.1/scala/fsm.html

Timeouts on traversing from node with millions of edges

I have a graph that has some nodes with millions of incident edges, using Titan 0.5.2 on top of Cassandra. E.g. this reproduces such a graph:
mgmt = g.getManagementSystem()
vidp = mgmt.makePropertyKey('vid').dataType(Integer.class).make()
mgmt.buildIndex('by_vid',Vertex.class).addKey(vidp).buildCompositeIndex()
mgmt.commit()
def v0 = g.addVertex([vid: 0, type: 'start'])
def random = new Random()
for (i in 1..10000000) {
  def v = g.addVertex([vid: i, type: 'claim'])
  v.addEdge('is-a', v0)
  def n = random.nextInt(i)
  def vr = g.V('vid', n).next()
  v.addEdge('test', vr)
  if (i % 10000 == 0) { g.commit(); }
}
So we have 10M vertices that all link to v0, plus some random links between the vertices. This query: g.V('vid', 0).in('is-a')[0] works fine, and so do g.V('vid', 0).in('is-a')[100] and g.V('vid', 0).in('is-a')[1000]. However, if I try to traverse further, i.e. g.V('vid', 0).in('is-a').out('test')[0], then the lookup gets stuck and eventually I get a read timeout exception from Cassandra:
com.thinkaurelius.titan.core.TitanException: Could not execute operation due to backend exception
Caused by: com.thinkaurelius.titan.diskstorage.TemporaryBackendException: Could not successfully complete backend operation due to repeated temporary exceptions after Duration[4000 ms]
at com.thinkaurelius.titan.diskstorage.util.BackendOperation.executeDirect(BackendOperation.java:86)
at com.thinkaurelius.titan.diskstorage.util.BackendOperation.execute(BackendOperation.java:42)
Caused by: com.netflix.astyanax.connectionpool.exceptions.TimeoutException: TimeoutException: [host=127.0.0.1(127.0.0.1):9160, latency=10000(10001), attempts=1]org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out
at com.netflix.astyanax.thrift.ThriftConverter.ToConnectionPoolException(ThriftConverter.java:188)
at com.netflix.astyanax.thrift.ThriftSyncConnectionFactoryImpl$ThriftConnection.execute(ThriftSyncConnectionFactoryImpl.java:
I also see high load on the Cassandra process, and it becomes unresponsive (i.e., trying to connect to it times out). So my question is: why is it impossible to traverse further from this node even though the step that actually touches lots of nodes is fine, and how could I make it work?
It appears you have effectively simulated a supernode. When you run the query
g.V('vid', 0).in('is-a')[0]
You are only requesting one object, which is a fast lookup. Likewise:
g.V('vid', 0).in('is-a')[100]
Also only requests one object, which is still fast. When you make the query:
g.V('vid', 0).in('is-a').out('test')[0]
You've just asked it to "find all the vertices reachable via the outgoing 'test' edge from each of those millions of in-vertices, and return the first one". The first thing it has to do is traverse all of those millions of edges before it can return the "first" vertex you requested. Try doing this:
g.V('vid', 0).in('is-a')[0].out('test')[0]
This will not iterate through all of those vertices.
