What is DAGScheduler.messageProcessingTime?

Can someone point me to where I can find descriptions of the metrics I get from the Spark sink?

This is a timer that tracks the time taken to process messages in the DAGScheduler's event loop.
Note that it appears to cover all kinds of events (JobSubmitted, JobCancelled, etc.).
It was introduced by SPARK-8344 to help troubleshoot issues and delays in the DAGScheduler's event loop, at least as far as I can tell.
I was hoping to be able to use it for payload processing time, but it does not seem to be the correct metric for that.
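If you want to observe this timer (and the rest of the DAGScheduler source) yourself, one option is to enable one of Spark's built-in metrics sinks. A minimal sketch of a metrics.properties configuration, assuming the ConsoleSink that ships with Spark; the 10-second period is just an illustrative value:

# metrics.properties - enable the console sink for all instances
*.sink.console.class=org.apache.spark.metrics.sink.ConsoleSink
*.sink.console.period=10
*.sink.console.unit=seconds

Pointing Spark at the file with --conf spark.metrics.conf=metrics.properties should make the driver periodically print timers such as <app-id>.driver.DAGScheduler.messageProcessingTime.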

Related

Event sourcing concurrency issue across multiple instances

I am new to the event sourcing concept, so there are a couple of things I don't understand. One of them is how to handle the following scenario:
I've got 2 instances of a service. Both of them listen to an event queue. There are two messages: CreateUser and UpdateUser. The first instance picks up CreateUser and the second instance picks up UpdateUser. For some reason the second instance handles its command more quickly, but there is no User to update yet, since it has not been created.
What am I getting wrong here?
Review: Race Conditions Don't Exist
A microsecond difference in timing shouldn’t make a difference to core business behaviors.
In other words, what you want is logic such that the order of the messages doesn't change the final result, plus a first-writer-wins policy (aka compare-and-swap), so that when you have two processes trying to update the same resource, the loser of the data race has to start over.
As a general rule, events should be understood to support multiple observers - all subscribers get to see all events. So a queue with competing consumers isn't the usual approach unless you are trying to distribute a specific subscriber across multiple processes.
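As an illustration, a minimal sketch of order-independent handling in C#; the message types, the in-memory store, and RetryLaterException are all hypothetical names (a real system would signal its queue to redeliver rather than throw):

using System;
using System.Collections.Generic;

// Hypothetical message and store types, for illustration only.
record CreateUser(Guid UserId, string Name);
record UpdateUser(Guid UserId, string Name);

class RetryLaterException : Exception { }

class UserHandler
{
    readonly Dictionary<Guid, string> users = new();

    public void Handle(CreateUser cmd) => users[cmd.UserId] = cmd.Name;

    public void Handle(UpdateUser cmd)
    {
        // Order-independent handling: if CreateUser hasn't been applied yet,
        // ask the queue to redeliver this message later instead of failing,
        // so UpdateUser eventually lands after CreateUser.
        if (!users.ContainsKey(cmd.UserId))
            throw new RetryLaterException();
        users[cmd.UserId] = cmd.Name;
    }
}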
You do not have a concurrency issue you can solve. This comes down to either using the wrong tools or not reading the documentation.
"Both of them listen to an event queue."
And that queue should support this. An example is Azure queues, where I can receive a message AND tell the queue not to show it to anyone else for X seconds (which is enough for me to decide whether I handled it or not). If I do not respond, the message is reinserted after that time. If I delete it first, there is no concurrency.
So, you need a backend queue that can handle this.
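A minimal sketch of that visibility-timeout pattern, assuming the Azure.Storage.Queues SDK; the connection string and queue name are placeholders:

using System;
using System.Threading.Tasks;
using Azure.Storage.Queues;

class VisibilityTimeoutDemo
{
    public static async Task Main()
    {
        var queue = new QueueClient("<connection-string>", "user-events");

        // Receive one message and hide it from other consumers for 30 seconds.
        var messages = await queue.ReceiveMessagesAsync(
            maxMessages: 1, visibilityTimeout: TimeSpan.FromSeconds(30));

        foreach (var msg in messages.Value)
        {
            // ... handle the event ...
            // Deleting the message before the timeout means no other consumer
            // ever sees it. If we crash instead, the message becomes visible
            // again after 30 seconds and is redelivered.
            await queue.DeleteMessageAsync(msg.MessageId, msg.PopReceipt);
        }
    }
}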

A few specific ServiceBus questions: AutoRenewTimeout, Filtering, DeleteSubscription, PeekLock mode

I have a few specific questions related to Service Bus that I wanted to get more insight into. I have read a lot of blogs and a lot of documentation about Service Bus (SB), and have implemented a solution that works using it. The questions I am going to ask have not been answered, or have been answered in an unclear way, so that's why I want to get into specifics. Here they go:
1) Just wanted to get more detail on the "OnMessage" function for a ServiceBus queue. From what I understand it's not a typical while()/Receive() loop with a Sleep() interval, but more of a "push" based approach, according to the documentation and my testing. So my question is: if I set "AutoRenewTimeout" to 1 minute, what happens to the thread that is waiting for a message to be received during that 1 minute before issuing a new Receive() operation? Does it sleep and wait for a message from ServiceBus to get "pushed" to it? Or is it more of an asynchronous approach, where the thread is released back to the pool and only grabbed again to do work when a message arrives, say, 47 seconds into the 1-minute AutoRenewTimeout window that I set? (No, I am not talking about "OnMessageAsync" here.) I guess this question touches the internals of WCF, on which SB is implemented.
2) Related to question 1), what would the significance be of setting "AutoRenewTimeout" to 5 minutes, 10 minutes, or 60 minutes, besides the obvious cost of the Receive() operation that would be triggered at those intervals? From a customer's perspective, the less frequently I trigger receive operations the more cost effective it is for me, but as far as "waiting for the message" during this period goes, from a service bus / application perspective, what is the recommended "AutoRenewTimeout" period? Is SB more prone to drop a connection if "AutoRenewTimeout" is set to 1 hour vs. 1 minute? Are more resources/threads used up when the "AutoRenewTimeout" is longer? I wasn't able to find any clear answers to this in any post/blog/documentation that I've read.
3) I have a Topic/Subscription application deployed on each server that both sends and receives messages to/from a topic. Now, I don't want each client (server) to receive the messages that it itself sent out; I only want to receive the messages that other servers (clients) sent out. I was able to implement a solution using a SqlFilter(), but I was wondering if there's any other/better way of doing this? While we're talking about SqlFilters: what is the performance impact, if any, on the speed of message delivery between subscription clients when using a simple comparison SqlFilter()? (I guess I could benchmark and see for myself, but I am just curious to take a peek at the internals.)
4) Regarding PeekLock mode on a message receiver: in a hypothetical scenario where a message is marked completed at the time it is received on the SB end, if someone were able to physically cut the wire between the SB end and the application VM, that message would be lost forever, because the application would not receive it, yet SB would have checked it off as completed. Is my understanding correct that in PeekLock mode the message is marked off as completed without the application end having any sure guarantee of ever receiving it?
5) Regarding deleting a subscription for a topic that, let's say, has 100 messages in it that were never received: would that count as 1 operation, or as 101 operations?
Thank you for taking the time and effort to read this. I would be glad to clarify any point if necessary. Abhishek, you seem to be one of the architects of SB, so if you see this post, I would really appreciate it if you would drop a few comments :)
Thank you!
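For reference, a minimal sketch of the APIs discussed in questions 1-3, using the WindowsAzure.ServiceBus SDK; the connection string, topic/subscription names, the "Sender" property, and the timeout values are placeholders:

using System;
using Microsoft.ServiceBus;
using Microsoft.ServiceBus.Messaging;

class ServiceBusSketch
{
    static void Main()
    {
        var connectionString = "<connection-string>"; // placeholder

        // Question 3: exclude our own messages with a SqlFilter over a
        // custom property stamped onto every message we send.
        var ns = NamespaceManager.CreateFromConnectionString(connectionString);
        if (!ns.SubscriptionExists("mytopic", "others-only"))
            ns.CreateSubscription("mytopic", "others-only",
                new SqlFilter($"Sender <> '{Environment.MachineName}'"));

        var topicClient = TopicClient.CreateFromConnectionString(connectionString, "mytopic");
        var message = new BrokeredMessage("payload");
        message.Properties["Sender"] = Environment.MachineName;
        topicClient.Send(message);

        // Questions 1-2: the OnMessage pump with AutoRenewTimeout, which is
        // the maximum duration for which the pump keeps renewing a message's
        // lock while the callback runs.
        var subClient = SubscriptionClient.CreateFromConnectionString(
            connectionString, "mytopic", "others-only");
        var options = new OnMessageOptions
        {
            AutoComplete = false,
            AutoRenewTimeout = TimeSpan.FromMinutes(1),
            MaxConcurrentCalls = 1
        };
        subClient.OnMessage(msg =>
        {
            // ... process ...
            msg.Complete(); // PeekLock: nothing is completed until we say so
        }, options);

        Console.ReadLine(); // keep the pump running
    }
}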

Process all file change events within specified time frame with TPL Dataflow

I am monitoring multiple log files across multiple directories. I need to trigger an SSIS package when a file has fired an onchange event. Easy enough, but the complication is that I don't want to trigger the SSIS package every time there is a change to the file. I want to wait and capture at least 5 minutes' worth of changes to a specific file.
Having used FileSystemWatcher before, I know it raises each onchange event on a new thread - my thought is to pass these events into a TPL block and have it wait for a specified time interval and then trigger an SSIS package. Basically, triggering a related SSIS package every 5 minutes if there have been file change events.
If anyone could point me in the right direction as a starting point I would greatly appreciate it!
I can't tell from your question whether TPL Dataflow is a requirement or just an idea.
I'd probably keep it simple and avoid TPL Dataflow. Just poll using a System.Threading.Timer and have it check System.IO.File.GetLastWriteTime on the file.
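A minimal sketch of that polling approach; the file path and the five-minute period are placeholders:

using System;
using System.IO;
using System.Threading;

class PollingWatcher
{
    static DateTime lastSeen = DateTime.MinValue;

    static void Main()
    {
        var path = @"C:\logs\app.log"; // placeholder path

        // Every 5 minutes, check whether the file changed since the last check.
        var timer = new Timer(_ =>
        {
            var lastWrite = File.GetLastWriteTimeUtc(path);
            if (lastWrite > lastSeen)
            {
                lastSeen = lastWrite;
                // ... trigger the SSIS package here ...
            }
        }, null, TimeSpan.Zero, TimeSpan.FromMinutes(5));

        Console.ReadLine(); // keep the process (and the timer) alive
    }
}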
If you want to get fancy you could use Rx, convert the FileSystemWatcher event to an Observable, and use the Buffer(TimeSpan) method.
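A sketch of the Rx variant, assuming the System.Reactive package; the directory path is a placeholder:

using System;
using System.IO;
using System.Reactive.Linq;

class RxWatcher
{
    static void Main()
    {
        var watcher = new FileSystemWatcher(@"C:\logs") { EnableRaisingEvents = true };

        // Collect all Changed events into 5-minute buffers; a non-empty buffer
        // means the file changed at least once in that window.
        Observable
            .FromEventPattern<FileSystemEventHandler, FileSystemEventArgs>(
                h => watcher.Changed += h, h => watcher.Changed -= h)
            .Buffer(TimeSpan.FromMinutes(5))
            .Where(batch => batch.Count > 0)
            .Subscribe(batch =>
            {
                // ... trigger the SSIS package once per window ...
            });

        Console.ReadLine();
    }
}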
TPL Dataflow doesn't have any intrinsic support for time windows; you'd have to roll your own, probably using one of the two aforementioned methods to build it out. My experience with TPL Dataflow is that it's too big and cumbersome for small tasks, and too rudimentary for big tasks, so I'd avoid taking that approach.

How to get events count from Microsoft Azure EventHub?

I want to get events count from Microsoft Azure EventHub.
I can use EventHubReceiver.Receive(maxcount), but it is slow with a large number of large events.
There is NamespaceManager.GetEventHubPartition(..).EndSequenceNumber property that seems to be doing the trick but I am not sure if it is correct approach.
EventHub doesn't have a notion of message count, as EventHub is a high-throughput, low-latency, durable stream of events in the cloud - the CORRECT current count at a given point in time could be wrong the very next millisecond! And hence, it wasn't provided :)
Hmm, we should have named EventHubs something like StreamHub - which would make this obvious!!
If what you are looking for is how much the receiver is lagging behind, then EventHubClient.GetPartitionRuntimeInformation().LastEnqueuedSequenceNumber is your best bet.
As long as no messages are sent to the partition, this value remains constant :)
On the receiver side, when a message is received, receivedEventData.SequenceNumber indicates the current sequence number you are processing, and the difference between EventHubClient.GetPartitionRuntimeInformation().LastEnqueuedSequenceNumber and EventData.SequenceNumber indicates how far the receiver of a partition is lagging behind - based on which the receiver process can scale the number of workers up or down (work distribution logic).
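As an illustration, a minimal sketch of that lag calculation, using the older Microsoft.ServiceBus.Messaging SDK these APIs come from; the connection string, hub name, and partition id are placeholders:

using System;
using Microsoft.ServiceBus.Messaging;

class LagCheck
{
    static void Main()
    {
        var client = EventHubClient.CreateFromConnectionString(
            "<connection-string>", "myhub");

        var group = client.GetDefaultConsumerGroup();
        var receiver = group.CreateReceiver("0"); // partition "0"

        var eventData = receiver.Receive();
        if (eventData != null)
        {
            var info = client.GetPartitionRuntimeInformation("0");

            // How far this receiver is behind the newest event in the partition.
            long lag = info.LastEnqueuedSequenceNumber - eventData.SequenceNumber;
            Console.WriteLine($"Partition 0 lag: {lag} events");
        }
    }
}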
You can use Stream Analytics, with a simple query:
SELECT COUNT(*)
FROM YourEventHub
GROUP BY TUMBLINGWINDOW(DURATION(hh, <Number of hours in which the events happened>))
Of course you will need to specify a time window, but you can potentially run it from when you started collecting data to now.
You will be able to output to SQL/Blob/Service Bus et cetera.
Then you can get the message out of the output from code and process it. It is quite complicated for a one-off count, but if you need it frequently and you have to write some code around it anyway, it could be the solution for you.

Patterns for idempotent Azure operations?

Does anybody know patterns for designing idempotent operations for Azure, especially Table Storage? The most common approach is to generate an operation id and cache it to verify new executions, but if I have dozens of workers processing operations, this approach becomes more complicated. :-))
Thanks!
Ok, so you haven't provided an example, as requested by knightpfhor and codingoutloud. That said, here's one very common way to deal with idempotent operations: Push your needed actions to a Windows Azure queue. Then, regardless of the number of worker role instances you have, only one instance may work on a specific queue item at a time. When a queue message is read from the queue, it becomes invisible for the amount of time you specify.
Now: a few things can happen during processing of that message:
You complete processing after your timeout period. When you go to delete the message, you get an exception.
You realize you're running out of time, so you increase the queue message timeout (today, you must call the REST API to do this; one day it'll be included in the SDK).
Something goes wrong, causing an exception in your code before you ever get to delete the message. Eventually, the message becomes visible in the queue again (after specified invisibility timeout period).
You complete processing before the timeout and successfully delete the message.
That deals with concurrency. For idempotency, that's up to you to ensure you can repeat an operation without side-effects. For example, you calculate someone's weekly pay, queue up a print job, and store the weekly pay in a Table row. For some reason, a failure occurs and you either don't ever delete the message or your code aborts before getting an opportunity to delete the message.
Fast-forward in time, and another worker instance (or maybe even the same one) re-reads this message. At this point, you should theoretically be able to simply re-perform the needed actions. If this isn't really possible in your case, you don't have an idempotent operation. However, there are a few mechanisms at your disposal to help you work around this:
Each queue message has a DequeueCount. You can use this to determine if the queue message has been processed before and, if so, take appropriate action (for example, examine the Table row for that employee).
Maybe there are stages of your processing pipeline that can't be repeated. In that case, you have the ability to modify the queue message contents while the message is still invisible to others and being processed by you. So imagine appending something like |SalaryServiceCalled, then a bit later appending |PrintJobQueued, and so on. Now, if you have a failure in your pipeline, you can figure out where you left off the next time you read the message.
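A minimal sketch of the DequeueCount check and in-place message update, using the current Azure.Storage.Queues SDK rather than the REST call mentioned above; the connection string, queue name, and stage markers are illustrative:

using System;
using System.Threading.Tasks;
using Azure.Storage.Queues;

class IdempotentWorker
{
    public static async Task Main()
    {
        var queue = new QueueClient("<connection-string>", "work-items");

        var response = await queue.ReceiveMessagesAsync(
            maxMessages: 1, visibilityTimeout: TimeSpan.FromMinutes(5));

        foreach (var msg in response.Value)
        {
            // DequeueCount > 1 means this message was handled (at least
            // partially) before, so inspect the markers to see what was done.
            if (msg.DequeueCount > 1 && msg.Body.ToString().Contains("|SalaryServiceCalled"))
            {
                // skip the salary step; resume from the next stage
            }

            // Record progress by rewriting the message while it is still
            // invisible to other workers.
            var updated = await queue.UpdateMessageAsync(
                msg.MessageId, msg.PopReceipt,
                msg.Body.ToString() + "|SalaryServiceCalled",
                visibilityTimeout: TimeSpan.FromMinutes(5));

            // ... remaining stages, then delete using the NEW pop receipt ...
            await queue.DeleteMessageAsync(msg.MessageId, updated.Value.PopReceipt);
        }
    }
}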
Hope that helps. Kinda shooting in the dark here, not knowing more about what you're trying to achieve.
EDIT: I guess I should mention that I don't see the connection between idempotency and Table Storage. I think that's more of a concurrency issue, as idempotency would need to be dealt with whether using Table Storage, SQL Azure, or any other storage container.
I believe you can use a reply-log storage approach to solve this problem.
