How retry works in Azure Service Bus Java - azure

I'm new to Service Bus and curious about RetryPolicy and how it works. As per the documentation, retries happen automatically for transient exceptions (MessagingException, ServerBusy), and the default retry count is 3, but we can set our own custom retry policy using the RetryExponential class.
I want to see in the logs whether the RetryPolicy actually tries to reconnect when an exception occurs.
How can I check this, and how can I reproduce the MessagingException and ServerBusy exceptions so that I can see the logs? I'm using the Azure Service Bus Java SDK.
Can anyone help me understand this? Thanks in advance.

The Java SDK is open source, and searching for retryPolicy in these files shows how the underlying implementation uses it:
CoreMessageSender
CoreMessageReceiver
For example, here's the flow for CoreMessageSender when an error is thrown
When an error occurs and it is a ServiceBusException, a retry is scheduled - See line
After waiting, it ensures the link is still open and increments the retry count - See line
This continues and on successful completion it resets the count - See line
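As for logging, the Java SDK uses SLF4J, and you can see the required logs with lines like these in your code (this assumes Log4j 1.x is the logging backend bound to SLF4J; lower the level if the messages you need are logged below WARN):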
import org.apache.log4j.Level;
import org.apache.log4j.Logger;
Logger.getLogger("com.microsoft.azure.servicebus").setLevel(Level.WARN);
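The flow described above (schedule a retry on a transient error, increment the retry count, reset it on success) can be sketched in plain Java. This is only an illustration under stated assumptions, not the SDK's code: TransientException, RetrySketch, delayFor and the doubling delay are hypothetical names and choices made for this example.

```java
import java.util.concurrent.Callable;

/** Hypothetical marker for errors worth retrying; not an SDK type. */
class TransientException extends Exception {
    TransientException(String message) {
        super(message);
    }
}

/** Illustrative retry loop with exponential backoff (assumed design, not the SDK's). */
class RetrySketch {
    private final int maxRetries;
    private final long baseDelayMillis;
    private int retryCount = 0;

    RetrySketch(int maxRetries, long baseDelayMillis) {
        this.maxRetries = maxRetries;
        this.baseDelayMillis = baseDelayMillis;
    }

    /** Delay doubles with each attempt: base, 2 * base, 4 * base, ... */
    long delayFor(int attempt) {
        return baseDelayMillis << attempt;
    }

    int getRetryCount() {
        return retryCount;
    }

    <T> T run(Callable<T> operation) throws Exception {
        while (true) {
            try {
                T result = operation.call();
                retryCount = 0; // success resets the count
                return result;
            } catch (TransientException e) {
                if (retryCount >= maxRetries) {
                    throw e; // retries exhausted, surface the error
                }
                long delay = delayFor(retryCount);
                retryCount++;
                System.out.println("Transient failure (" + e.getMessage()
                        + "), retry " + retryCount + " in " + delay + " ms");
                Thread.sleep(delay);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        RetrySketch retry = new RetrySketch(3, 100L);
        // Simulated operation: fails twice with a transient error, then succeeds.
        String result = retry.run(() -> {
            if (calls[0]++ < 2) {
                throw new TransientException("server busy");
            }
            return "sent";
        });
        System.out.println(result + " after " + calls[0] + " calls");
    }
}
```

Throwing a simulated transient exception from the operation, as main does here, is one way to watch retry log lines without depending on a real ServerBusy response from the service.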

Related

Could not get HttpClient cache - No ThreadContext available for thread id=1

I'm working on upgrading our service to use 3.63.0 (upgrading from 3.57.0) and I've noticed the following warning (with stack trace) shows up in the logs that wasn't there on the previous version:
2022-02-18 14:03:41.038 WARN 1088 --- [ main] c.s.c.s.c.c.AbstractHttpClientCache : Could not get HttpClient cache.
com.sap.cloud.sdk.cloudplatform.thread.exception.ThreadContextAccessException: No ThreadContext available for thread id=1.
at com.sap.cloud.sdk.cloudplatform.thread.ThreadLocalThreadContextFacade.lambda$tryGetCurrentContext$0(ThreadLocalThreadContextFacade.java:39) ~[cloudplatform-core-3.63.0.jar:na]
at io.vavr.Value.toTry(Value.java:1414) ~[vavr-0.10.4.jar:na]
at com.sap.cloud.sdk.cloudplatform.thread.ThreadLocalThreadContextFacade.tryGetCurrentContext(ThreadLocalThreadContextFacade.java:37) ~[cloudplatform-core-3.63.0.jar:na]
at io.vavr.control.Try.flatMapTry(Try.java:490) ~[vavr-0.10.4.jar:na]
at io.vavr.control.Try.flatMap(Try.java:472) ~[vavr-0.10.4.jar:na]
at com.sap.cloud.sdk.cloudplatform.thread.ThreadContextAccessor.tryGetCurrentContext(ThreadContextAccessor.java:84) ~[cloudplatform-core-3.63.0.jar:na]
at com.sap.cloud.sdk.cloudplatform.connectivity.RequestScopedHttpClientCache.getCache(RequestScopedHttpClientCache.java:28) ~[cloudplatform-connectivity-3.63.0.jar:na]
at com.sap.cloud.sdk.cloudplatform.connectivity.AbstractHttpClientCache.tryGetOrCreateHttpClient(AbstractHttpClientCache.java:78) ~[cloudplatform-connectivity-3.63.0.jar:na]
at com.sap.cloud.sdk.cloudplatform.connectivity.AbstractHttpClientCache.tryGetHttpClient(AbstractHttpClientCache.java:46) ~[cloudplatform-connectivity-3.63.0.jar:na]
at com.sap.cloud.sdk.cloudplatform.connectivity.HttpClientAccessor.tryGetHttpClient(HttpClientAccessor.java:153) ~[cloudplatform-connectivity-3.63.0.jar:na]
at com.sap.cloud.sdk.cloudplatform.connectivity.HttpClientAccessor.getHttpClient(HttpClientAccessor.java:131) ~[cloudplatform-connectivity-3.63.0.jar:na]
at com.octanner.mca.service.MarketingCloudApiContactService.uploadContacts(MarketingCloudApiContactService.java:138) ~[classes/:na]
...
This happens when the following calls are made...
Using the lower level API
HttpClient httpClient = HttpClientAccessor.getHttpClient(destination); // warning happens here
ODataRequestResultMultipartGeneric batchResult = requestBatch.execute(httpClient);
Using the higher level API
service
.getAllContactOriginData()
.withQueryParameter("$expand", "AdditionalIDs")
.top(size)
.filter(filter)
.executeRequest(destination)); // warning happens here
Even though this warning shows up in the logs the service requests do continue to work as expected. It's just a little concerning to see this and I'm wondering if maybe I have something misconfigured. I reviewed all of the java docs and the troubleshooting page and didn't see anything out of the ordinary other than how I am fetching my destination, but even using the DestinationAccessor didn't seem to make a difference. Also, I'm not doing any asynchronous or multi-tenant processing.
Any help or guidance you can give on this would be appreciated!
Cheers!
Such an issue is often the result of missing Spring Boot annotations - especially in synchronous executions.
Please refer to our documentation to learn more about the SAP Cloud SDK Spring Boot integration.
Edit Feb. 28th 2022
It is safe to ignore the logged warning if your application does not need any of the SAP Cloud SDK's multitenancy features.
Error Cause
The SAP Cloud SDK for Java recently (in version 3.63.0) introduced a change to the thread propagation behavior of the HttpClientCache.
With that change, we also adapted the logging in case the propagation didn't work as expected - this is often caused by not using the ThreadContextExecutor for wrapping asynchronous operations.
This is the reason for logs like the one described by the issue author.
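As a rough plain-Java analogy for why a thread can end up without a context, the sketch below uses a ThreadLocal: a value set on one thread is not visible on another unless the task is wrapped so that the value is carried over. ContextSketch, withCallerContext and describe are hypothetical names for this illustration and are not part of the SAP Cloud SDK; the SDK's ThreadContextExecutor is the real mechanism for its own thread context.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Plain-Java analogy; none of these names come from the SAP Cloud SDK. */
class ContextSketch {
    // Per-thread value, loosely analogous to a thread context holding tenant data.
    static final ThreadLocal<String> CONTEXT = new ThreadLocal<>();

    /** Captures the caller's value now and installs it on whichever thread runs the task. */
    static <T> Callable<T> withCallerContext(Callable<T> task) {
        final String captured = CONTEXT.get();
        return () -> {
            String previous = CONTEXT.get();
            CONTEXT.set(captured);
            try {
                return task.call();
            } finally {
                // Restore the worker thread's earlier state.
                if (previous == null) {
                    CONTEXT.remove();
                } else {
                    CONTEXT.set(previous);
                }
            }
        };
    }

    static String describe() {
        String value = CONTEXT.get();
        return value == null ? "no context" : "context=" + value;
    }

    public static void main(String[] args) throws Exception {
        CONTEXT.set("tenant-a");
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Callable<String> plain = ContextSketch::describe;
            // The unwrapped task runs on a worker thread that never had the value set.
            System.out.println(pool.submit(plain).get());
            // The wrapped task sees the value captured from the submitting thread.
            System.out.println(pool.submit(withCallerContext(plain)).get());
        } finally {
            pool.shutdown();
        }
    }
}
```

Run as written, the first line printed is "no context" and the second is "context=tenant-a".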
Planned Mitigation
In the meantime, we realized that these WARN logs are causing confusion on the consumer side.
We are working on improving the situation by lowering the log level to INFO for the message and to DEBUG for the exception.

EventHubConsumerClient Apache Qpid memory leak?

I am reading events from an Azure EventHub cluster synchronously via the receiveFromPartition method on the EventHubConsumerClient class.
I create the client once like so:
EventHubConsumerClient eventHubConsumerClient = new EventHubClientBuilder()
.connectionString(eventHubConnectionString)
.consumerGroup(consumerGroup)
.buildConsumerClient();
I then just use a ScheduledExecutorService to retrieve events every 1.5s via:
IterableStream<PartitionEvent> receivedEvents = eventHubConsumerClient.receiveFromPartition(
partitionId, 1, eventPosition);
The equivalent logic in V3 of the SDK worked fine (using PartitionReceivers), but now I am seeing OOMs in my JVM.
Running a profiler against a local version of the logic, I see that the majority of the heap (90%, mainly in the old generation) is taken up by byte[]s referenced by org.apache.qpid.proton.codec.CompositeReadableBuffer. This pattern is not present when I profile the V3 logic.
What could be causing a leak of the AMQP messages here? Do I need to interact with the SDK further, for example by closing a connection that I'm not aware of after each call?
Any advice would be very appreciated, thanks!
Turns out it was a bug, solved here: https://github.com/Azure/azure-sdk-for-java/issues/13775

Azure function goes idle when running in Consumption Plan with ServiceBus Queue trigger

I have also asked this question in the MSDN Azure forums, but have not received any guidance as to why my function goes idle.
I have an Azure function running on a Consumption plan that goes idle (i.e. does not respond to new messages on the ServiceBus trigger queue) despite following the instructions outlined in this GitHub issue:
The configuration for the function is the following json:
{
  "ConnectionStrings": {
    "MyConnectionString": "Server=tcp:project.database.windows.net,1433;Database=myDB;User ID=user#project;Password=password;Encrypt=True;Connection Timeout=30;"
  },
  "Values": {
    "serviceBusConnection": "Endpoint=sb://project.servicebus.windows.net/;SharedAccessKeyName=SharedAccessKeyName;SharedAccessKey=KEY_HERE"
  }
}
And the function signature is:
public static void ProcessQueue([ServiceBusTrigger("queueName", AccessRights.Listen, Connection = "serviceBusConnection")] ...)
Based on the discussion in the GitHub issue, I believed that having either a serviceBusConnection entry or an AzureWebJobsServiceBus entry should be enough to ensure that the central listener triggers the function when a new message is added to the Service Bus queue, but that is proving not to be the case.
Can anyone clarify the difference between how those two settings are used, or notice anything else with the settings I provided that might be causing the function to not properly be triggered after a period of inactivity?
There are several possible causes for this behavior. I have several Azure subscriptions, and only one of them had issues with Storage/Service Bus-based triggers firing only when the app is not idle. So far I have observed that the actions listed below prevent triggers from working correctly:
Creating any Storage-based trigger, deleting (for any reason) the triggering object, and re-creating it.
Corrupting Azure function input parameters by deleting or altering associated objects without recompiling the function.
Restarting the function app when one of the functions fails to compile or bind to a trigger or input parameter and hangs.
It has also been observed that using the legacy Connection Strings setting for trigger binding does not work.
A clean deploy of an affected function app will most likely solve the problem if it was caused by any of the actions described above.
EDIT:
It looks like this is also caused by setting Authorization/Authentication on the function app, but I have not yet figured out whether it happens in general or only when Auth has a specific configuration. I tested on the affected Azure subscription by disabling auth entirely: the function goes idle after 30-40 minutes, and the queue trigger still initiates an execution, though with a delay, as expected. I found an old bug related to this, but it says the issue is resolved.

Azure webjob - QueueTrigger stops triggering

I am running an azure webjobs SDK console application (continuous) with the recommended setup:
public static void ProcessQueueMessage([QueueTrigger("logqueue")] string logMessage, TextWriter logger)
The azure queue I am running against has ~6000 messages in it and I am running the web-job locally, as a console application.
The problem I'm having is that the processing randomly stops after processing between zero and ~30 messages. The console stays open, but no more console messages are displayed.
For example, it might just process 2 messages:
Executing: 'Functions.ProcessQueueMessage' - Reason: 'New queue message detected on 'QueueName'.'
Executed: 'Functions.ProcessQueueMessage' (Succeeded)
Executing: 'Functions.ProcessQueueMessage' - Reason: 'New queue message detected on 'QueueName'.'
Executed: 'Functions.ProcessQueueMessage' (Succeeded)
And then, nothing. There doesn't seem to be anything wrong with my internet connection and I can't trace the issues down to any particular messages.
Has anyone else had issues with this SDK?
Update:
I made sure that I was using the right versions of all of the dependencies by removing the NuGet packages and then re-running install-package Microsoft.Azure.WebJobs. I am now using WebJobs version 1.1.0, which has pulled in version 4.3 of Azure Storage.
As recommended by Matthew, I have pulled down the source code for Azure WebJobs to determine where the process is freezing up. Once the freeze-up occurred, I paused execution and checked the running threads, and found what I believe is the culprit within Microsoft.Azure.WebJobs.Host.CompositeTraceWriter:
protected virtual void InvokeTextWriter(TraceEvent traceEvent)
{
    if (_innerTextWriter != null)
    {
        string message = traceEvent.Message;
        if (!string.IsNullOrEmpty(message) &&
            message.EndsWith("\r\n", StringComparison.OrdinalIgnoreCase))
        {
            // remove any terminating return+line feed, since we're
            // calling WriteLine below
            message = message.Substring(0, message.Length - 2);
        }
        _innerTextWriter.WriteLine(message);
        if (traceEvent.Exception != null)
        {
            _innerTextWriter.WriteLine(traceEvent.Exception.ToDetails());
        }
    }
}
The line it freezes on is line 66 : _innerTextWriter.WriteLine(message);
_innerTextWriter is an instance of System.IO.TextWriter.SyncTextWriter
Is it possible there is some deadlock issue with this class or the way it is being used?
Some notes:
I am running in the debugger, so in this case I believe the textwriter is forwarding to the console internally
I have my batchsize set to 1 via config.Queues.BatchSize = 1;, not sure if that could matter
I'm currently working on setting up an environment on another computer so that I can see if it is reproducible somewhere other than this machine (surface book).
Update
The issue was me not understanding how the new Windows 10 command prompt works. Any time you click on the command window, it goes into "select" mode, which completely pauses execution of the process.
Basically: https://superuser.com/questions/419717/windows-command-prompt-freezing-randomly?newreg=ece53f5584254346be68f85d1fd2f18d
You can tell it is in this state because it will prefix the window title with the word "Select":
You have to press enter or click again to get it going once again.
So, two final comments:
1) What an incredibly confusing and un-intuitive behavior for a command window!
2) I hope some admin will come take pity on the shame I have brought upon myself and my family by deleting this question.
To get rid of this strange behavior, you can disable QuickEdit mode:
Strange. When it is in this stuck state, can you try adding a new queue message to the queue and see if that triggers? Are you sure your function isn't hanging internally? What version of the SDK are you using? You might also try upgrading to v1.1.0, which we just released last week. If there really are a bunch of messages in the queue waiting to be processed, I can't think of anything that would cause this. The queue listener in the SDK should chug along, reading batches of messages in parallel and dispatching them to your function. Have you changed any of the JobHostConfiguration.Queues configuration knobs? You haven't force-updated the Azure SDK to a version higher than the WebJobs SDK supports, have you?
Another option if you can't figure this out might be to clone the SDK, build it and debug it locally. The repo is here. The main queue processing loop is here.

Azure Document Db Worker Role

I am having problems getting the Microsoft.Azure.Documents library to initialize the client in an azure worker role. I'm using Nuget Package 0.9.1-preview.
I have mimicked what was done in the example for azure document
When running locally through the emulator I can connect fine with the documentdb and it runs as expected. When running in the worker role, I am getting a series of NullReferenceException and then ArgumentNullException.
The bottom System.NullReferenceException that is highlighted above has this call stack
so the nullReferenceExceptions start in this call at the new DocumentClient.
var endpoint = "myendpoint";
var authKey = "myauthkey";
var endpointUri = new Uri(endpoint);
DocumentClient client = new DocumentClient(endpointUri, authKey);
Nothing changes between running it locally and running it on the worker role other than the environment (obviously).
Has anyone gotten DocumentDb to work on a worker role or does anyone have an idea why it would be throwing null reference exceptions? The parameters getting passed into the DocumentClient() are filled.
UPDATE:
I tried rewriting it to be more generic, which at least let the worker role run and let me attach a debugger. It is throwing the error on the new DocumentClient. It seems like some security parameter being passed is null. Both of the required parameters on initialization are not null. Is there a security setting I need to change for my worker role to be able to connect to my DocumentDB? (It still works fine locally.)
UPDATE 2:
I can get the instance to run in release mode, but not debug mode. So it must be something to do with some security setting or storage setting that is misconfigured I guess?
It seems I'm getting System.Security.SecurityExceptions only when using DocumentDB; queues do not give me that error. All call stacks for that error seem to involve System.Diagnostics.EventLog. The very first exception I see in the IntelliTrace summary is System.Threading.WaitHandleCannotBeOpenedException.
More Info
Intellitrace summary exception data:
top is the earliest and bottom is the latest (so System.Security.SecurityException happens first then the NullReference)
The solution for me to get rid of the security exception and null reference exception was to disable intellitrace. Once I did that, I was able to deploy and attach debugger and see everything working.
Not sure what is between the null in intellitrace and the DocumentClient, but hopefully it's just in relation to the nuget and it will be fixed in the next iteration.
Unable to repro.
I created a new Worker Role. Single instance. Added authkey and endpoint config to cscfg.
Created a private static DocumentClient at WorkerRole class level.
Init DocumentClient in OnStart.
Dispose DocumentClient in OnStop.
In RunAsync, inside the loop, execute a query. Works as expected.
Test in emulator works.
Deployed as Release to Production slot. Works.
Deployed as Debug to Staging with Remote Debug. Works.
Attached VS to CloudService, breakpoint hit inside loop.
Working solution : http://ryancrawcour.blob.core.windows.net/samples/AzureCloudService1.zip
