I have a number of events that are based on values from devices. They are read in intervals, e.g. every hour. The events are delivered to an Event Hub, which is used as an input to a Stream Analytics (SA) job.
I want to aggregate and calculate an average value in SA. Currently, I aggregate and group the events in SA using an origin id and other properties to create the correct groups and averages. The problem is that the averages are not correct. I think the events are either not complete and/or not correlated correctly.
Using a TumblingWindow will produce a number of static windows based on time, but the events I need to aggregate might span two or more windows.
Using a SlidingWindow, as I understand it, will trigger output upon a specific condition and then "look back" over a specified interval. Is this correct? If so, I could attach the same id, like a JobId, to each event that I need aggregated, plus a value indicating whether it is the last event. When the last event enters SA, the SlidingWindow is triggered and we can "look back" for all the events with the same id. Is this possible?
Are there other options in this case? Basically I need to correlate a number of events based on characteristics other than time.
I hope you can help me.
I am building a job scheduler and I am stuck between two approaches. I have two types of jobs: ones that are scheduled for a specific date and ones that run hourly. For the specific-date ones, I poll my database table that stores the jobs and post the results to a RabbitMQ message broker where specific workers process them. This works well for more defined tasks like sending reminder notifications or emails. For the hourly jobs, I have a cron-expression-based job running with the logic directly in the function, so it does not go to a queue. Usually these are jobs to clean up my database or set certain values based on the previous day's activity, etc.
I am wondering what the best way to architect this is. Does it make sense to have all these smaller jobs running on a cadence as microservices, listening on a queue? Should I group all of them together into one service? Should I combine all the logic of both types into one large worker app?
In my opinion, as described, you have two different systems that are doing two different things, and this is a problem.
For example consider sending an email. With the design you have right now you have to write your email code twice -- once for messages that are sent to the queue and once for the ones sent from the cron.
Ideally you want to make supporting the system in your design as easy as possible. For example if your design uses a queue then ALL actions should work the same way -- get a command message and parameters off the queue and execute them. Your cron job and your scheduler would both add messages to the queue. Supporting a new action would just mean coding it once and then both systems could add a message to the queue.
The same should be true of your data model. The data model should support both types of schedule but share as much as possible.
The last time I created a system like this I did something along the following lines:
I had an events table -- this table had a list of events to fire at specific dates and times in the future.
I had a recur table -- this table had a list of recurring events (e.g. every week on a Tuesday at this time).
There was a process -- it would look at the events table to see if there was something that needed to fire. If there was, it would fire it. THEN it would remove it from the events table (no longer in the future) and log that it had been run. It would also check if this was an event from the recur table -- if it was, it would add the next future event to the events table.
Note: this simplified explanation of how the events table worked only has two tables, but in reality there were a number of others in the complete data model. For example, I had a number of tables to support different event types (email templates for the email events, etc).
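To make that concrete, here is a minimal sketch in Java of the polling process described above. The table and queue interfaces are hypothetical stand-ins, not an existing API; the point is that both the one-off events and the recurring jobs end up as command messages on the same queue, so each action only gets coded once.

import java.time.Instant;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical stand-ins for the events table, the recur table, and a RabbitMQ publisher.
record CommandMessage(String actionType, Map<String, String> parameters) {}
record ScheduledEvent(long id, String actionType, Map<String, String> parameters) {}

interface Recurrence { ScheduledEvent nextOccurrenceAfter(Instant now); }

interface EventStore {
    List<ScheduledEvent> findDueBefore(Instant now);
    void remove(long id);
    void logFired(long id, Instant at);
    void add(ScheduledEvent next);
}

interface RecurStore { Optional<Recurrence> findByEventId(long eventId); }

interface CommandQueue { void publish(CommandMessage message); }

class SchedulerLoop {
    private final EventStore events;
    private final RecurStore recurrences;
    private final CommandQueue queue;

    SchedulerLoop(EventStore events, RecurStore recurrences, CommandQueue queue) {
        this.events = events;
        this.recurrences = recurrences;
        this.queue = queue;
    }

    // Called on a short cadence, e.g. every minute by a cron trigger.
    void tick(Instant now) {
        for (ScheduledEvent event : events.findDueBefore(now)) {
            // "Firing" just means putting a command message on the queue; workers do the real work.
            queue.publish(new CommandMessage(event.actionType(), event.parameters()));
            events.remove(event.id());          // no longer in the future
            events.logFired(event.id(), now);   // keep a record of what has run
            // If this came from the recur table, schedule the next occurrence.
            recurrences.findByEventId(event.id())
                       .ifPresent(recur -> events.add(recur.nextOccurrenceAfter(now)));
        }
    }
}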
Is there a way to get a count of events in Azure Event Hub for a particular time period? I need to get the count of events that arrive within an hour.
No, there is no way to get it as of now. If you look at the docs, Event Hubs is a high-throughput, low-latency, durable stream of events on Azure, so a count may not be accurate at any given point in time: unlike queues, there is no concept of queue length in Azure Event Hubs, because the data is processed as a stream.
I'm not sure that the context of this question is correct, as a consumer group is just a logical grouping of those reading from an Event Hub and nothing gets published to it. For the remainder of my thoughts, I'm assuming that the nomenclature was a mistake, and what you're interested in is understanding what events were published to a partition of the Event Hub.
There's no available metric or report that I'm aware of that would surface the requested information. However, assuming that you know the time range that you're interested in, you can write a utility to compute this:
Connect one of the consumer types from the Event Hubs SDK of your choice to the partition, using FromEnqueuedTime or the equivalent as your starting position. (sample)
For each event read, inspect the EnqueuedTime or equivalent and compare it to the window that you're interested in.
If the event was enqueued within your window, increase your count and repeat, starting at #2. If the event was later than your window, you're done.
The count that you've accumulated would be the number of events that the Event Hubs broker received during that time interval.
Note: If you're doing this across partitions, you'll want to read each partition individually and count. The ReadEvents method on the consumer does not guarantee ordering or fairness; it would be difficult to have a deterministic spot to decide that you're done reading.
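As a rough illustration, here is what that utility could look like with the Azure Event Hubs SDK for Java (azure-messaging-eventhubs). The connection string, hub name, and time window below are placeholders you'd substitute; treat this as a sketch rather than a finished tool.

import com.azure.messaging.eventhubs.EventHubClientBuilder;
import com.azure.messaging.eventhubs.EventHubConsumerClient;
import com.azure.messaging.eventhubs.models.EventPosition;
import com.azure.messaging.eventhubs.models.PartitionEvent;
import java.time.Duration;
import java.time.Instant;

public class EventCountByWindow {
    public static void main(String[] args) {
        String connectionString = "<namespace-connection-string>";   // placeholder
        String eventHubName = "<event-hub-name>";                    // placeholder
        Instant windowStart = Instant.parse("2023-05-01T10:00:00Z"); // placeholder window
        Instant windowEnd = windowStart.plus(Duration.ofHours(1));

        EventHubConsumerClient consumer = new EventHubClientBuilder()
                .connectionString(connectionString, eventHubName)
                .consumerGroup(EventHubClientBuilder.DEFAULT_CONSUMER_GROUP_NAME)
                .buildConsumerClient();

        long count = 0;
        // Read each partition individually; ordering is only guaranteed within a partition.
        for (String partitionId : consumer.getPartitionIds()) {
            EventPosition position = EventPosition.fromEnqueuedTime(windowStart);
            boolean donePartition = false;
            while (!donePartition) {
                Iterable<PartitionEvent> batch = consumer.receiveFromPartition(
                        partitionId, 500, position, Duration.ofSeconds(10));
                boolean sawEvent = false;
                for (PartitionEvent partitionEvent : batch) {
                    sawEvent = true;
                    if (partitionEvent.getData().getEnqueuedTime().isBefore(windowEnd)) {
                        count++;   // enqueued inside the window, keep counting
                        // Resume after this event on the next receive call.
                        position = EventPosition.fromSequenceNumber(
                                partitionEvent.getData().getSequenceNumber());
                    } else {
                        donePartition = true;   // past the window, this partition is done
                        break;
                    }
                }
                if (!sawEvent) {
                    donePartition = true;       // nothing new arrived within the wait time
                }
            }
        }
        consumer.close();
        System.out.printf("Events enqueued between %s and %s: %d%n", windowStart, windowEnd, count);
    }
}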
Assuming you really just need to know "how many events did my Event Hub receive in a particular hour", you can take a look at the metrics published by the Event Hub. Be aware that, as mentioned in the other answers, the count might not be 100% accurate given the nature of the system.
Take a look at the Incoming Messages metric. If you take this for any given hour, it will give you the count of messages that were received during this time period. You can split the namespace metric by EventHub, and every consumer group will receive every single message, so you should be fine.
You can chart this metric in the portal UI for the hour you're interested in, and you should also be able to export it to a Log Analytics workspace.
An IoT sensor takes 1000 measurements at 1 kHz every 10 minutes and sends the values in ten separate messages to Azure IoT Hub. I am supposed to concatenate the ten separate messages back into one for further processing, e.g. calculating RMS and FFT.
The messages have the following structure:
{
  "SampleID": 12344,
  "PartitionIdx": 2,
  "NbrPartitions": 10,
  "Values": [12, 13, 14, 13, 12, 11, 10, 9]
}
So, the values of all messages having the same SampleID should be concatenated together in PartitionIdx order, after all ten have been received. I tried to use Stream Analytics but failed.
Is this too complex a task for Stream Analytics? If so, are there any other options besides coding a WebJob to do the concatenation?
There are two aspects to the question:
1. Knowing when all the values have arrived.
2. Concatenating them together.
For 1, if the events for a particular device id always arrive in order to the Event Hub and to the same Event Hub partition, or if you have an idea of how out of order the data can be, you can use the TIMESTAMP BY OVER() clause to create a different timeline for every device. This will block output for a deviceId until enough data from that partition has been received.
For 2, you can either use Collect(), as @js-azure mentioned, or, if you want the data formatted in a specific way instead of as an array of records, you can use JavaScript aggregates.
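If Stream Analytics turns out to be a poor fit and you end up doing the reassembly yourself (e.g. in the WebJob or a Function mentioned in the question), the core logic is small. Here is a minimal Java sketch; it assumes one call per incoming message, uses the field names from the message above, and the class and method names are made up for illustration.

import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.SortedMap;
import java.util.TreeMap;

class SampleAssembler {
    // SampleID -> partitions received so far (PartitionIdx -> Values)
    private final Map<Long, SortedMap<Integer, int[]>> pending = new HashMap<>();

    // Returns the full concatenated sample once all partitions have arrived, otherwise empty.
    Optional<int[]> onMessage(long sampleId, int partitionIdx, int nbrPartitions, int[] values) {
        SortedMap<Integer, int[]> parts =
                pending.computeIfAbsent(sampleId, id -> new TreeMap<>());
        parts.put(partitionIdx, values);

        if (parts.size() < nbrPartitions) {
            return Optional.empty();          // still waiting for more partitions
        }
        pending.remove(sampleId);             // all ten parts are here

        // Concatenate in PartitionIdx order (TreeMap iterates keys in ascending order).
        int total = parts.values().stream().mapToInt(a -> a.length).sum();
        int[] result = new int[total];
        int offset = 0;
        for (int[] part : parts.values()) {
            System.arraycopy(part, 0, result, offset, part.length);
            offset += part.length;
        }
        return Optional.of(result);           // ready for RMS/FFT processing
    }
}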
I am trying to develop microservices for an order / transaction process using the event sourcing concept. Staff can place an order / transaction for a customer by phone. The system also records the number of orders grouped by customer. It uses AWS Kinesis to send the customer id in the orderCreated event to the customer data service so we can increment the number of created orders. We separate order processing and customer data based on DDD concepts. But we should anticipate human error when staff select the wrong customer id for the order. So, there is a feature to change the customer for the related order.
The problem is that the orderUpdated event just contains the latest data of the order, which means the event only has the new customer id. We can increment the number of orders for the new customer, but we also need to decrement the number of orders for the previous customer id.
How to solve this problem? Could you give me some suggestions?
Thanks in advance
It sounds like an OrderUpdated event is not granular enough to meet your needs. Based on the information you provided, you likely want to have a more specific event like OrderCustomerChanged that contains both the new and the old customer Id.
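For illustration, a minimal Java sketch of such an event and the projection that keeps the per-customer counts consistent; the class and method names are made up, not an existing API. Because the event carries both ids, one handler can adjust both counters.

import java.util.HashMap;
import java.util.Map;

// Hypothetical event carrying both the old and the new customer id.
record OrderCustomerChanged(String orderId, String previousCustomerId, String newCustomerId) {}

class CustomerOrderCountProjection {
    private final Map<String, Integer> ordersByCustomer = new HashMap<>();

    void on(OrderCustomerChanged event) {
        // Decrement the old customer's count and increment the new customer's count.
        ordersByCustomer.merge(event.previousCustomerId(), -1, Integer::sum);
        ordersByCustomer.merge(event.newCustomerId(), 1, Integer::sum);
    }
}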
I am still trying to wrap my head around how to apply DDD and, most recently, CQRS to a real production business application. In my case, I am working on an inventory management system. It runs as a server-based application exposed via a REST API to several client applications. My focus has been on the domain layer with the API and clients to follow.
The command side of the domain is used to create a new Order and allows modifications, cancellation, marking an Order as fulfilled and shipped/completed. I, of course, have a query that returns a list of orders in the system (as read-only, lightweight DTOs) from the repository. Another query returns a PickList used by warehouse employees to pull items from the shelves to fulfill specific orders. In order to create the PickList, there are calculations, rules, etc. that must be evaluated to determine which orders are ready to be fulfilled; for example, whether all order line items are in stock. I need to read the same list of orders, iterate over the list and apply those rules and calculations to determine which items should be included in the PickList.
This is not a simple query, so how does it fit into the model?
UPDATE
While I may be able to maintain (store) a set of PickLists, they really are dynamic until an employee retrieves the next PickList. Consider the following scenario:
The first Order of the day is received. I can raise a domain event that triggers an AssemblePickListCommand which applies all of the rules and logic to create one or more PickLists for that Order.
A second Order is received. The event handler should now REPLACE the original PickLists with one or more new PickLists optimized across both pending Orders.
Likewise after a third Order is received.
Let's assume we now have two PickLists in the 'queue' because the optimization rules split the lists, since components are at opposite ends of the warehouse.
Warehouse employee #1 requests a PickList. The first PickList is pulled and printed.
A fourth Order is received. As before, the handler removes the second PickList from the queue (the only one remaining) and regenerates one or more PickLists based on the second PickList and the new Order.
The PickList 'assembler' will repeat this logic whenever a new Order is received.
My issue with this is that a request must either block while the PickList queue is being updated, or I have an eventual consistency issue that goes against the behavior the customer wants. Each time they request a PickList, they want it optimized based on all of the Orders received up to that point in time.
While I may be able to maintain (store) a set of PickLists, they really are dynamic until an employee retrieves the next PickList. Consider the following scenario:
The first Order of the day is received. I can raise a domain event that triggers an AssemblePickListCommand which applies all of the rules and logic to create one or more PickLists for that Order.
A second Order is received. The event handler should now REPLACE the original PickLists with one or more new PickLists optimized across both pending Orders.
This sounds to me like you are getting tangled trying to use a language that doesn't actually match the domain you are working in.
In particular, I don't believe that you would be having these modeling problems if the PickList "queue" was a real thing. I think instead there is an OrderItem collection that lives inside some aggregate, you issue commands to that aggregate to generate a PickList.
That is, I would expect a flow that looks like
onOrderPlaced(List<OrderItems> items)
warehouse.reserveItems(List<OrderItems> items)
// At this point, the items are copied into an unassigned
// items collection. In other words, the aggregate knows
// that the items have been ordered, and are not currently
// assigned to any picklist
fire(ItemsReserved(items))
onPickListRequested(Id<Employee> employee)
warehouse.assignPickList(Id<Employee> employee, PickListOptimizer optimizer)
// PickListOptimizer is your calculation, rules, etc that know how
// to choose the right items to put into the next pick list from
// a given collection of unassigned items. This is a stateless
// *domain service* -- it provides the query that the warehouse aggregate needs
// to figure out the right change to make, but it *doesn't* change
// the state of the aggregate -- that's the aggregate's responsibility
List<OrderItems> pickedItems = optimizer.chooseItems(this.unassignedItems);
this.unassignedItems.removeAll(pickedItems);
// This mockup assumes we can consider PickLists to be entities
// within the warehouse aggregate. You'd need some additional
// events if you wanted the PickList to have its own aggregate
Id<PickList> id = PickList.createId(...);
this.pickLists.put(id, new PickList(id, employee, pickedItems))
fire(PickListAssigned(id, employee, pickedItems));
onPickListCompleted(Id<PickList> pickList)
warehouse.closePicklist(Id<PickList> pickList)
this.pickLists.remove(pickList)
fire(PickListClosed(pickList))
onPickListAbandoned(Id<PickList> pickList)
warehouse.reassign(Id<PickList> pickList)
PickList list = this.pickLists.remove(pickList)
this.unassignedItems.addAll(list.pickedItems)
fire(ItemsReassigned(list.pickedItems))
Not great languaging -- I don't speak warehouse. But it covers most of your points: each time a new PickList is generated, it's being built from the latest state of pending items in the warehouse.
There's some contention -- you can't assign items to a pick list AND change the unassigned items at the same time. Those are two different writes to the same aggregate, and I don't think you are going to get around that as long as the client insists on a perfectly optimized picklist each time. It might be worthwhile to sit down with the domain experts and explore the real cost to the business if the second-best pick list is assigned from time to time. After all, there's already latency between placing the order and its arrival at the warehouse...
I don't really see what your specific question is. But the first thing that comes to mind is that pick list creation is not just a query but a full-blown business concept that should be explicitly modeled. It could then be created with an AssemblePicklist command, for instance.
You seem to have two roles/processes and possibly also two aggregate roots -- the salesperson works with orders, the warehouse worker with picklists.
AssemblePicklistsCommand() is triggered from order processing and recreates all currently unassigned picklists.
The warehouse worker fires an AssignPicklistCommand(userid), which tries to choose the most appropriate unassigned picklist and assign it to him (or does nothing if he already has an active picklist). He could then use GetActivePicklistQuery(userid) to get the picklist, pick items with PickPicklistItemCommand(picklistid, item, quantity), and finally MarkPicklistCompleteCommand() to signal he's done.
AssemblePicklist and AssignPicklist should block each other (serial processing, optimistic concurrency?), but the relation between AssignPicklist and GetActivePicklist is clean -- either you have a picklist assigned or you don't.
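To make the warehouse worker's side of that split concrete, here is a rough Java sketch. The command and query names follow the answer above; the store interface and handler are purely illustrative, and the "most appropriate picklist" choice is hidden behind the store.

import java.util.Optional;
import java.util.UUID;

// Contracts named above; field names are illustrative.
record AssignPicklistCommand(UUID userId) {}
record GetActivePicklistQuery(UUID userId) {}
record PickPicklistItemCommand(UUID picklistId, String item, int quantity) {}
record MarkPicklistCompleteCommand(UUID picklistId) {}

interface PicklistStore {
    Optional<UUID> findActivePicklist(UUID userId);
    // Chooses and assigns the most appropriate unassigned picklist atomically
    // (e.g. guarded by a version check for optimistic concurrency), or returns empty.
    Optional<UUID> takeBestUnassignedPicklist(UUID userId);
}

class AssignPicklistHandler {
    private final PicklistStore store;
    AssignPicklistHandler(PicklistStore store) { this.store = store; }

    Optional<UUID> handle(AssignPicklistCommand command) {
        // Do nothing if the worker already has an active picklist.
        Optional<UUID> active = store.findActivePicklist(command.userId());
        if (active.isPresent()) {
            return active;
        }
        // Otherwise pick the most appropriate unassigned picklist and assign it.
        return store.takeBestUnassignedPicklist(command.userId());
    }
}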