Max Aggregation with Hazelcast Jet

I want to do a simple max across an entire dataset. I started with the Kafka example at: https://github.com/hazelcast/hazelcast-jet-code-samples/blob/0.7-maintenance/kafka/src/main/java/avro/KafkaAvroSource.java
I just changed the pipeline to:
p.drawFrom(KafkaSources.<Integer, User>kafka(brokerProperties(), TOPIC))
 .map(Map.Entry::getValue)
 .rollingAggregate(minBy(comparingInt(user -> (Integer) user.get(2))))
 .map(user -> (Integer) user.get(2))
 .drainTo(Sinks.list("result"));
and the result-reading code to:
IListJet<Integer> res = jet.getList("result");
SECONDS.sleep(10);
System.out.println(res.get(0));
SECONDS.sleep(15);
System.out.println(res.get(0));
cancel(job);
to get the largest age of the people in the topic. However, it doesn't return 20 and seems to return different values on different runs. Any idea why?

You seem to be using rollingAggregate, which produces a new output item every time it receives some input, but all you check is the first item it emitted. You must instead find the latest item it emitted. One way to achieve that is to push the result into an IMap sink, using the same key every time. (Also note that your pipeline uses minBy even though you want the largest age; the snippet below uses maxBy instead.)
p.drawFrom(KafkaSources.<Integer, User>kafka(brokerProperties(), TOPIC))
 .withoutTimestamps()
 .map(Map.Entry::getValue)
 .rollingAggregate(maxBy(comparingInt(user -> (Integer) user.get(2))))
 .map(user -> entry("user", (Integer) user.get(2)))
 .drainTo(Sinks.map("result"));
You can fetch the latest result with
IMap<String, Integer> result = jet.getMap("result");
System.out.println(result.get("user"));

Related

How to correctly update a map field in firestore using python?

I have a map in Firestore and I want to update it regularly (not overwrite previous keys). Most of the time it works; however, sometimes it fails without throwing any exception. The only indication that something went wrong is that the result (https://cloud.google.com/firestore/docs/reference/rest/v1/WriteResult) has an update_time which I can compare to now(); if the difference is too large, I know the update did not happen just now. The problem is that afterwards the whole map is missing (all previous keys are gone). So not only did it fail to add the current keys, but it somehow wiped out the whole field.
Below is the full code:
error_keys = []
for key in data.keys():
    # continue if set is empty
    if not data[key]:
        continue
    try:
        new_keys = {
            f'myMap.{k}': v for k, v in data[key].items()}
        result = self.db.collection(u'myCollection').document(key).update(
            new_keys
        )
        now = datetime.now(tz=pytz.utc)
        dt_string = now.strftime("%d/%m/%Y %H:%M:%S.%fZ")
        duration = now - result.update_time
        duration_in_s = duration.total_seconds()
        minutes = divmod(duration_in_s, 60)[0]
        if minutes > 1.0:
            logger.warning("Diff to update time is larger than 1 min")
            logger.info(f'Now: {dt_string}')
            logger.info(f'Duration in minutes: {minutes}')
            logger.info(f'Adding {key} to error_keys')
            error_keys.append(key)
        logger.info(f'KEY: {key}: {data[key]} Update_time: {result.update_time} Diff_minutes: {minutes}')
    except:
        logger.warning(
            f'Could not write key: {key} with data {data[key]} to firebase.')
        error_keys.append(key)
        logger.exception('Exception writing keys')
logger.info(f'ERROR_KEYS: {error_keys}')
return error_keys
I am using:
google-cloud-firestore 2.1.0
Python 3.7.3

Dump series back into InfluxDB after querying with replaced field value

Scenario
I want to send data to an MQTT Broker (Cloud) by querying measurements from InfluxDB.
I have a field in the schema called status. It can be either 1 or 0. status=0 indicates that the series has not been sent to the cloud. If I get an acknowledgment from the MQTT broker, then I wish to write the series back into the database with status=1.
As mentioned in the InfluxDB FAQ on duplicate data: if a point has the same timestamp as an existing point but a different field value, then the updated field value will be shown.
In order to test this I created the following:
CREATE DATABASE dummy
USE dummy
INSERT meas_1,type=t1 status=0,value=123 1536157064275338300
query:
SELECT * FROM meas_1
provides
time                 status  type  value
1536157064275338300  0       t1    234
now if I want to overwrite the series I do the following:
INSERT meas_1,type=t1 status=1,value=123 1536157064275338300
which will overwrite the series
time                 status  type  value
1536157064275338300  1       t1    234
(Note: this is not possible via Tags currently in InfluxDB)
Usage
Query some information using the client with "status"=0.
Restructure the JSON to be sent to the cloud.
Send the information to the cloud.
If successful, write the output from step 1 back into the DB, but with status=1.
I am using the InfluxDBClient Python3 to create the Application (MQTT + InfluxDB)
The write_points API has a batch_size parameter, which requires an int as input.
I am not sure how I can use this in the application I want to build. Can someone guide me with this, or with the schema of the DB, so that I can upload actual and non-redundant information to the cloud?
The batch_size is actually the length of the list of measurements that needs to be passed to write_points.
Steps
Create client and query from measurement (here, we query gps information)
client = InfluxDBClient(database='dummy')
op = client.query('SELECT * FROM gps WHERE "status"=0', epoch='ns')
Make the ResultSet into a list:
batch = list(op.get_points('gps'))
Create an empty list for the updated points:
updated_batch = []
Parse through each measurement and change the status flag to 1. Note that numeric field values in InfluxDB default to float:
for each in batch:
    new_mes = {
        'measurement': 'gps',
        'tags': {
            'type': 'gps'
        },
        'time': each['time'],
        'fields': {
            'lat': float(each['lat']),
            'lon': float(each['lon']),
            'alt': float(each['alt']),
            'status': float(1)
        }
    }
    updated_batch.append(new_mes)
Finally, dump the points back via the client with batch_size set to the length of updated_batch:
client.write_points(updated_batch, batch_size=len(updated_batch))
This overwrites the series because the points contain the same timestamps, now with the status field set to 1.

GML room_goto() Error, Expecting Number

I'm trying to make a game that chooses a room from a pool of rooms using GML, but I get the following error:
FATAL ERROR in action number 3 of Create Event for object obj_control:
room_goto argument 1 incorrect type (5) expecting a Number (YYGI32)
at gml_Object_obj_control_CreateEvent_3 (line 20) - room_goto(returnRoom)
pool = ds_list_create()
ds_list_insert(pool, 0, rm_roomOne)
ds_list_insert(pool, 1, rm_roomTwo)
ds_list_insert(pool, 2, rm_roomThree)
ds_list_insert(pool, 3, rm_roomFour)
var returnIndex;
var returnRoom;
returnIndex = irandom(ds_list_size(pool))
returnRoom = ds_list_find_value(pool, returnIndex)
if (ds_list_size(pool) == 0){
    room_goto(rm_menu_screen)
}else{
    room_goto(returnRoom)
}
I don't understand why the error message says it's expecting a number.
This is weird indeed... I think this should actually work, but I have no GM around to test :(
For now you can also solve this using "choose". That saves you a list (and saves memory, because you're not cleaning up the list by deleting it, so it stays resident in memory):
room_goto(choose(rm_roomOne, rm_roomTwo, rm_roomThree, rm_roomFour));
choose basically does exactly what you're looking for. Might not be the best way to go if you're re-using the group of items though.

How to find the indirect nodes connected to a particular node in Spark Graphx

I want to find the indirect nodes that are connected to a particular node.
I tried using the connected components class of Graph like below...
graph.connectedComponents
However, it gives the components for the whole graph, but I want them for a particular node.
I have also tried the following:
graph.edges.filter(_.srcId == x).map(_.dstId)
This gives the directly connected nodes of a particular node, and I would have to apply it recursively using RDD operations only.
Could anyone please help with this?
Try something like this:
graph.edges.filter(_.srcId == x).map(e => (e.dstId, null)).join(
  graph.collectNeighborIds(EdgeDirection.Either)
).flatMap{ t => t._2._2 }.collect.toSet
If you want to go deeper than this, I would use something like the Pregel API. Essentially, it lets you repeatedly send messages from node to node and aggregate the results.
Edit: Pregel Solution
I finally got the iterations to stop on their own. Edits below. Given this graph:
graph.vertices.collect
res46: Array[(org.apache.spark.graphx.VertexId, Array[Long])] = Array((4,Array()), (8,Array()), (1,Array()), (9,Array()), (5,Array()), (6,Array()), (2,Array()), (3,Array()), (7,Array()))
graph.edges.collect
res47: Array[org.apache.spark.graphx.Edge[Double]] = Array(Edge(1,2,0.0), Edge(2,3,0.0), Edge(3,4,0.0), Edge(5,6,0.0), Edge(6,7,0.0), Edge(7,8,0.0), Edge(8,9,0.0), Edge(4,2,0.0), Edge(6,9,0.0), Edge(7,9,0.0))
We are going to send messages of the type Array[Long] -- an array of all the VertexIds of connected nodes. Messages are going to go upstream -- the dst will send the src its VertexId along with all of the other downstream VertexIds. If the upstream node already knows about the connection, no message will be sent. Eventually, every node knows about every connected node and no more messages will be sent.
First we define our vprog. According to the docs:
the user-defined vertex program which runs on each vertex and receives
the inbound message and computes a new vertex value. On the first
iteration the vertex program is invoked on all vertices and is passed
the default message. On subsequent iterations the vertex program is
only invoked on those vertices that receive messages.
def vprog(id: VertexId, orig: Array[Long], newly: Array[Long]): Array[Long] = {
  (orig ++ newly).toSet.toArray
}
Then we define our sendMsg -- edited: swapped src & dst
a user supplied function that is applied to out edges of vertices that
received messages in the current iteration
def sendMsg(trip: EdgeTriplet[Array[Long], Double]): Iterator[(VertexId, Array[Long])] = {
  if (trip.srcAttr.intersect(trip.dstAttr ++ Array(trip.dstId)).length != (trip.dstAttr ++ Array(trip.dstId)).toSet.size) {
    Iterator((trip.srcId, (Array(trip.dstId) ++ trip.dstAttr).toSet.toArray))
  } else Iterator.empty
}
Next our mergeMsg:
a user supplied function that takes two incoming messages of type A
and merges them into a single message of type A. This function must be
commutative and associative and ideally the size of A should not
increase.
Unfortunately, we're going to break the rule in the last sentence above:
def mergeMsg(a: Array[Long], b: Array[Long]): Array[Long] = {
  (a ++ b).toSet.toArray
}
Then we run pregel -- edited: removed maxIterations, defaults to Int.MaxValue
val result = graph.pregel(Array[Long]())(vprog, sendMsg, mergeMsg)
And you can look at the results:
result.vertices.collect
res48: Array[(org.apache.spark.graphx.VertexId, Array[Long])] = Array((4,Array(4, 2, 3)), (8,Array(8, 9)), (1,Array(1, 2, 3, 4)), (9,Array(9)), (5,Array(5, 6, 9, 7, 8)), (6,Array(6, 7, 9, 8)), (2,Array(2, 3, 4)), (3,Array(3, 4, 2)), (7,Array(7, 8, 9)))
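Since the original question asked about one particular node, here is a small follow-up sketch (not part of the original answer) showing how you might read the indirectly connected vertex ids for a single vertex out of the result computed above; x is a placeholder for the VertexId you care about:
import org.apache.spark.graphx.VertexId

// Placeholder: the vertex whose (in)directly connected nodes we want
val x: VertexId = 1L

// Keep only vertex x, then flatten its Array[Long] of connected vertex ids
val connectedToX: Set[Long] = result.vertices
  .filter { case (id, _) => id == x }
  .flatMap { case (_, reachable) => reachable }
  .collect()
  .toSet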

Akka: How to ensure that message has been received?

I have an actor Dispenser. What it does is it
dispenses objects on request
listens for newly arriving ones
Code follows
class Dispenser extends Actor {
  override def receive: Receive = {
    case Get =>
      context.sender ! getObj()
    case x: SomeType =>
      addObj(x)
  }
}
In real processing it doesn't matter whether 1 ms or even a few seconds pass between a new object being sent and the dispenser starting to dispense it, so there's no code tracking that.
But now I'm writing a test for the dispenser, and I want to be sure that it first receives the new object and only then receives the Get request.
Here's the test code I came up with:
val dispenser = system.actorOf(Props.create(classOf[Dispenser]))
dispenser ! obj
Thread.sleep(100)
val task = dispenser ? Get()
val result = Await.result(task, timeout)
check(result)
It satisfies one important requirement: it doesn't change the original code. But it is
At least 100 ms slow even on very high-performance boxes
Unstable, and fails sometimes, because 100 ms or any other constant doesn't provide any guarantees.
And the question is: how to write a test that satisfies the requirement and doesn't have the cons above (nor any other obvious cons)?
You can take out the Thread.sleep(..) and your test will be fine. Akka guarantees the ordering you need.
With the code
dispenser ! obj
val task = dispenser ? Get()
dispenser will process obj before Get deterministically because
The same thread puts obj then Get in the actor's mailbox, so they're in the correct order in the actor's mailbox
Actors process messages sequentially and one-at-a-time, so the two messages will be received by the actor and processed in the order they're queued in the mailbox.
(..if there's nothing else going on that's not in your sample code - routers, async processing in getObj or addObj, stashing, ..)
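Concretely, the test from the question would then look roughly like this (a sketch that simply drops the sleep and reuses the question's own names: obj, Get, check, timeout):
val dispenser = system.actorOf(Props.create(classOf[Dispenser]))

dispenser ! obj                  // enqueued first, processed first
val task = dispenser ? Get()     // enqueued second, processed after obj

val result = Await.result(task, timeout)
check(result)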
The Akka FSM module is really handy for testing the underlying state and behavior of an actor, and it does not require changing the actor's implementation specifically for tests.
By using TestFSMRef one can get the actor's current state and data with:
val testActor = TestFSMRef(<actors constructor or Props>)
testActor.stateName shouldBe <state name>
testActor.stateData shouldBe <state data>
http://doc.akka.io/docs/akka/2.4.1/scala/fsm.html
