Understanding Event Hub partition lease expiration in Azure

I have an event hub with 4 partitions and 2 consumer groups. I have 2 webjobs that read the data using an EventProcessor, each for a different consumer group.
I have configured the event processors like this:
var host = new EventProcessorHost(
PartitionManagerOptions = new PartitionManagerOptions
AcquireInterval = TimeSpan.FromSeconds(10),
RenewInterval = TimeSpan.FromSeconds(10),
LeaseInterval = TimeSpan.FromSeconds(30)
var options = EventProcessorOptions.DefaultOptions;
options.MaxBatchSize = 250;
await host.RegisterEventProcessorFactoryAsync(new PlanCareEventProcessorFactory(telemetryClient, configurationManager), options);
return host;
In my EventProcessor I keep track of the progress (some methods skipped to keep it short and readable):
internal class PlanCareEventProcessor : IEventProcessor
public Task OpenAsync(PartitionContext context)
namespaceManager = NamespaceManager.CreateFromConnectionString(configurationManager.EventHubConfiguration.ManagerConnectionString);
if (namespaceManager == null)
var currentSeqNo = context.Lease.SequenceNumber;
var lastSeqNo = namespaceManager.GetEventHubPartition(context.EventHubPath, context.ConsumerGroupName, context.Lease.PartitionId).EndSequenceNumber;
var delta = lastSeqNo - currentSeqNo;
var msg = $"Last processed seqnr for partition {context.Lease.PartitionId}: {currentSeqNo} of {lastSeqNo} in consumergroup '{context.ConsumerGroupName}' (lag: {delta})";
telemetryClient.TrackTrace(new TraceTelemetry(msg, SeverityLevel.Information));
telemetryClient.TrackMetric(new MetricTelemetry($"Partition_Lag_{context.Lease.PartitionId}_{context.ConsumerGroupName}", delta));
return Task.CompletedTask;
public async Task ProcessEventsAsync(PartitionContext context, IEnumerable<EventData> events)
await LogProgress(context);
private async Task LogProgress(PartitionContext context)
if (progressCounter >= 100)
await CheckPointAsync(context);
progressCounter = 0;
Now I noticed a difference in the webjobs when it comes to how often OpenAsync and CloseAsync are called. For one of the consumer groups this is about every half hour while for the other one it is several times a minute.
Since both webjobs use the same code and are running on the same app plan, what could be the reason for this?
It bothers me because checkpointing using await CheckPointAsync(context) is almost never done for one of the webjobs since it does not reach the threshold before the lease is gone.
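To make the checkpoint threshold described above concrete, here is a minimal sketch of a gate that fires on either an event count or an elapsed time, so a checkpoint does not depend on the count alone. `CheckpointGate` and its parameters are hypothetical names for illustration and are not part of the Event Hubs SDK; the processor would call `ShouldCheckpoint` from ProcessEventsAsync and checkpoint when it returns true.

```csharp
using System;

// Hypothetical helper, not part of the Event Hubs SDK: decides when a
// processor should checkpoint, by event count or by elapsed time.
public sealed class CheckpointGate
{
    private readonly int maxEvents;
    private readonly TimeSpan maxAge;
    private int eventsSinceCheckpoint;
    private DateTime lastCheckpointUtc;

    public CheckpointGate(int maxEvents, TimeSpan maxAge, DateTime nowUtc)
    {
        this.maxEvents = maxEvents;
        this.maxAge = maxAge;
        this.lastCheckpointUtc = nowUtc;
    }

    // Records a processed batch and returns true when a checkpoint is due.
    // The internal counters are reset whenever true is returned.
    public bool ShouldCheckpoint(int batchSize, DateTime nowUtc)
    {
        eventsSinceCheckpoint += batchSize;
        bool due = eventsSinceCheckpoint >= maxEvents
                   || nowUtc - lastCheckpointUtc >= maxAge;
        if (due)
        {
            eventsSinceCheckpoint = 0;
            lastCheckpointUtc = nowUtc;
        }
        return due;
    }
}
```

The clock is passed in rather than read from DateTime.UtcNow so the behaviour can be checked without waiting. This only changes when a checkpoint is attempted; it does not explain why the leases are lost so often.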


How to handle SQL Server rollbacks when an Azure Service Bus message fails to save to the queue and the two depend on each other?

I'm saving a row to my db (a class with teacher/students, time, date, etc.). Once I have the id, I create a message on my Azure Service Bus queue, using the unique id of the db row as the message body. I'm creating scheduled messages so I can notify the students before the class, and again after the class is over so they can rate/review their teacher.
QUESTION - How can I roll back the db row, or otherwise prevent it from being fully saved, if the message fails to save to the Azure Service Bus?
Currently I'm using a generic repository with UnitOfWork to save to my db. I catch the exception from my Service Bus service if it fails and then delete the row that was just saved, but it looks sloppy and I can see it leading to problems.
Here is what I'm doing now in the controller.
public async Task<IActionResult> OnCreating(OnCreatingEventDto onCreatingDto)
var userFromRepo = await _userManager.FindByEmailFromClaimsPrinciple(HttpContext.User);
if (userFromRepo == null)
return Unauthorized(new ApiResponse(401));
var newEvent = _mapper.Map<ClassEvent>(onCreatingDto);
var success = await _unitOfWork.Complete();
if (success > 0) {
try {
var sequenceNumber = await _serviceBusProducer.SendMessage(newEvent.Id.ToString(), newEvent.eventTime.AddDays(1), queueName);
newEvent.ServiceBusSequenceNumber = sequenceNumber;
var secondSuccess = await _unitOfWork.Complete();
if (secondSuccess > 0) {
return Ok();
} catch(Exception ex) {
_logger.LogError(ex, "Error saving to Service Bus");
var deleteSuccess = await _unitOfWork.Complete();
if (deleteSuccess > 0) {
return BadRequest(new ApiResponse(400, "Problem Creating Event"));
return BadRequest(new ApiResponse(400, "Problem Creating Event"));
Here is the method from my service that creates the message on the queue
public async Task<long> SendMessage(string messageBody, DateTimeOffset scheduledEnqueueTime, string queueName)
await using (ServiceBusClient client = new ServiceBusClient(_config["ServiceBus:Connection"]))
ServiceBusSender sender = client.CreateSender(_config["ServiceBus:" + queueName]);
ServiceBusMessage message = new ServiceBusMessage(messageBody);
var sequenceNumber =
await sender.ScheduleMessageAsync(message, scheduledEnqueueTime);
return sequenceNumber;
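The save, send, delete-on-failure flow described above can be isolated from the controller. The following is only a sketch of that compensation pattern under stated assumptions: `SaveThenSend`, `RunAsync` and the three delegates are hypothetical names, and the delegates stand in for the repository save, the Service Bus send, and the row delete.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical helper illustrating a save / send / compensate sequence.
public static class SaveThenSend
{
    // Runs save, then send. If send throws, runs undo on the saved value
    // and returns false; returns true when both steps succeed.
    public static async Task<bool> RunAsync<T>(
        Func<Task<T>> save, Func<T, Task> send, Func<T, Task> undo)
    {
        T saved = await save();
        try
        {
            await send(saved);
            return true;
        }
        catch (Exception)
        {
            // Compensate for the earlier save. A failure here propagates
            // to the caller, who must decide how to handle it.
            await undo(saved);
            return false;
        }
    }
}
```

This keeps the three steps in one place, but it is still a compensation and not a real transaction: if the process stops between the save and the undo, the row stays in the database.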

Azure Function Durable Timer does not wake up until app is touched

I have a Durable Orchestration that scales up and down Azure Cosmos DB throughput on request. The scale up is triggered via HTTP, and the scale down happens later via a Durable Timer that is supposed to wake up the Azure Function at the end of the current or next hour. Here is the Orchestrator Function:
public static class CosmosDbScalerOrchestrator
public static async Task RunOrchestrator(
[OrchestrationTrigger] IDurableOrchestrationContext context)
var cosmosDbScalerRequestString = context.GetInput<string>();
var didScale = await context.CallActivityAsync<bool>(nameof(ScaleUpActivityTrigger), cosmosDbScalerRequestString);
if (didScale)
var minutesUntilLastMinuteOfHour = 59 - context.CurrentUtcDateTime.Minute;
var minutesUntilScaleDown = minutesUntilLastMinuteOfHour < 15
? minutesUntilLastMinuteOfHour + 60
: minutesUntilLastMinuteOfHour;
var timeUntilScaleDown = context.CurrentUtcDateTime.Add(TimeSpan.FromMinutes(minutesUntilScaleDown));
await context.CreateTimer(timeUntilScaleDown, CancellationToken.None);
await context.CallActivityAsync(nameof(ScaleDownActivityTrigger), cosmosDbScalerRequestString);
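The timer arithmetic in the orchestrator can be checked in isolation. The sketch below copies the same calculation into a pure function; `ScaleDownSchedule` and `ScaleDownTime` are hypothetical names introduced for illustration, and the input stands in for context.CurrentUtcDateTime.

```csharp
using System;

// Hypothetical helper that reproduces the orchestrator's timer calculation.
public static class ScaleDownSchedule
{
    // Targets minute 59 of the current hour, or minute 59 of the next hour
    // when fewer than 15 minutes remain until then. Seconds are carried over
    // unchanged from the input.
    public static DateTime ScaleDownTime(DateTime nowUtc)
    {
        int minutesUntilLastMinuteOfHour = 59 - nowUtc.Minute;
        int minutesUntilScaleDown = minutesUntilLastMinuteOfHour < 15
            ? minutesUntilLastMinuteOfHour + 60
            : minutesUntilLastMinuteOfHour;
        return nowUtc.Add(TimeSpan.FromMinutes(minutesUntilScaleDown));
    }
}
```

For example, an input of 10:20:30 gives 10:59:30, while 10:50:00 gives 11:59:00 because only 9 minutes remain in the hour.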
Here is the ScaleUpActivityTrigger:
public class ScaleUpActivityTrigger
public static async Task<bool> Run([ActivityTrigger] string cosmosDbScalerRequestString, ILogger log)
var cosmosDbScalerRequest =
var scaler = new ContainerScaler(cosmosDbScalerRequest.ContainerId);
var currentThroughputForContainer = await scaler.GetThroughputForContainer();
// Return if would scale down
if (currentThroughputForContainer > cosmosDbScalerRequest.RequestedThroughput) return false;
var newThroughput = cosmosDbScalerRequest.RequestedThroughput < 25000
? cosmosDbScalerRequest.RequestedThroughput
: 25000;
await scaler.Scale(newThroughput);
return true;
and the ScaleDownActivityTrigger:
public class ScaleDownActivityTrigger
public static async Task Run([ActivityTrigger] string cosmosDbScalerRequestString, ILogger log)
var cosmosDbScalerRequest =
var scaler = new ContainerScaler(cosmosDbScalerRequest.ContainerId);
var minimumRusForContainer = await scaler.GetMinimumRusForContainer();
await scaler.Scale(minimumRusForContainer);
However, what I observe is that the Function is not awakened until something else triggers the Durable Orchestration. Notice the difference in the timestamps for when this was scheduled and when it happened.
Is the fact that it did not wake up until then by design, or a bug? If it is by design, how can I wake it up when I actually want to?

Azure Service Bus queue: how does message ordering work?

public static async Task DoMessage()
const int numberOfMessages = 10;
queueClient = new QueueClient(ConnectionString, QueueName);
await SendMessageAsync(numberOfMessages);
await queueClient.CloseAsync();
private static async Task SendMessageAsync(int numOfMessages)
for (var i = 0; i < numOfMessages; i++)
var messageBody = $"Message {i}";
var message = new Message(Encoding.UTF8.GetBytes(messageBody));
message.SessionId = i.ToString();
await queueClient.SendAsync(message);
catch (Exception e)
This is my sample code for sending messages with a session id to the Service Bus queue.
My question: if I call the DoMessage function twice (call the two sets MessageSet1 and MessageSet2, respectively), will MessageSet2 be received and processed by the Azure Function that handles the receiving end of the queue?
I want the sets handled in order: MessageSet1 first, then MessageSet2, and MessageSet2 must never be handled until MessageSet1 has finished.
There are a couple of issues with what you're doing.
First, Azure Functions do not currently support sessions. There's an issue for that you can track.
Second, the sessions you're creating are off. A session should be applied on a set of messages using the same SessionId. Meaning your for loop should be assigning the same SessionId to all the messages in the set. Something like this:
private static async Task SendMessageAsync(int numOfMessages, string sessionId)
{
    var tasks = new List<Task>();
    try
    {
        for (var i = 0; i < numOfMessages; i++)
        {
            var message = new Message(Encoding.UTF8.GetBytes($"Message {i}"));
            message.SessionId = sessionId;
            tasks.Add(queueClient.SendAsync(message));
        }
        await Task.WhenAll(tasks).ConfigureAwait(false);
    }
    catch (Exception e)
    {
        // handle exception
    }
}
For ordered messages using Sessions, see documentation here.
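The difference between the two ways of assigning SessionId can be seen without a queue. The sketch below is an in-memory illustration only and does not use the Service Bus API; `SessionGroupingSketch`, `Tag` and `BySession` are hypothetical names. It tags ten messages either with a per-message id (as in the question) or with one shared id (as suggested above) and counts the resulting sessions.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// In-memory illustration only; this is not the Service Bus API.
public static class SessionGroupingSketch
{
    // Builds (sessionId, body) pairs the way a sender would tag them.
    public static List<KeyValuePair<string, string>> Tag(
        int numOfMessages, Func<int, string> sessionIdFor)
    {
        return Enumerable.Range(0, numOfMessages)
            .Select(i => new KeyValuePair<string, string>(sessionIdFor(i), $"Message {i}"))
            .ToList();
    }

    // Groups message bodies by session id, keeping send order within a session.
    public static Dictionary<string, List<string>> BySession(
        IEnumerable<KeyValuePair<string, string>> messages)
    {
        return messages
            .GroupBy(m => m.Key, m => m.Value)
            .ToDictionary(g => g.Key, g => g.ToList());
    }
}
```

With `i => i.ToString()` as the id function there are ten sessions holding one message each, so no ordering relationship exists between the messages. With a constant id such as "MessageSet1" there is a single session holding all ten messages in send order.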

Tight loop: disk at 100%, quad-core CPU at 25% usage, only 15 MB/sec disk write speed

I have a tight loop that runs through a large number of carts, each containing around 10 event objects, and writes them to disk as JSON via an intermediate repository (jOliver's CommonDomain rewired to use GetEventStore.com):
// create ~200,000 carts, each with ~5 events
List<Cart> testData = TestData.GenerateFrom(products);
foreach (var cart in testData)
count = count + (cart as IAggregate).GetUncommittedEvents().Count;
The disk shows 100% utilisation, but the throughput is low (15 MB/sec, ~5,000 events per second). Why is this? Things I can think of are:
Since this is single-threaded, does the 25% CPU usage actually mean 100% of the one core I am on? (Is there any way to show which core my app is running on in Visual Studio?)
Am I constrained by I/O or by CPU? Can I expect better performance if I create my own thread pool, with one thread per CPU?
How come I can copy a file at ~120 MB/sec, but only get a throughput of 15 MB/sec in my app? Is this due to writing lots of small packets?
Anything else I have missed?
The code I am using is from the geteventstore docs/blog:
public class GetEventStoreRepository : IRepository
private const string EventClrTypeHeader = "EventClrTypeName";
private const string AggregateClrTypeHeader = "AggregateClrTypeName";
private const string CommitIdHeader = "CommitId";
private const int WritePageSize = 500;
private const int ReadPageSize = 500;
IStreamNamingConvention streamNamingConvention;
private readonly IEventStoreConnection connection;
private static readonly JsonSerializerSettings serializerSettings = new JsonSerializerSettings { TypeNameHandling = TypeNameHandling.None };
public GetEventStoreRepository(IEventStoreConnection eventStoreConnection, IStreamNamingConvention namingConvention)
this.connection = eventStoreConnection;
this.streamNamingConvention = namingConvention;
public void Save(IAggregate aggregate)
this.Save(aggregate, Guid.NewGuid(), d => { });
public void Save(IAggregate aggregate, Guid commitId, Action<IDictionary<string, object>> updateHeaders)
var commitHeaders = new Dictionary<string, object>
{CommitIdHeader, commitId},
{AggregateClrTypeHeader, aggregate.GetType().AssemblyQualifiedName}
var streamName = this.streamNamingConvention.GetStreamName(aggregate.GetType(), aggregate.Identity);
var newEvents = aggregate.GetUncommittedEvents().Cast<object>().ToList();
var originalVersion = aggregate.Version - newEvents.Count;
var expectedVersion = originalVersion == 0 ? ExpectedVersion.NoStream : originalVersion - 1;
var eventsToSave = newEvents.Select(e => ToEventData(Guid.NewGuid(), e, commitHeaders)).ToList();
if (eventsToSave.Count < WritePageSize)
{
    this.connection.AppendToStreamAsync(streamName, expectedVersion, eventsToSave).Wait();
}
else
{
    var startTransactionTask = this.connection.StartTransactionAsync(streamName, expectedVersion);
    var transaction = startTransactionTask.Result;
    var position = 0;
    while (position < eventsToSave.Count)
    {
        var pageEvents = eventsToSave.Skip(position).Take(WritePageSize);
        // Block until each page is written before moving on
        transaction.WriteAsync(pageEvents).Wait();
        position += WritePageSize;
    }
    transaction.CommitAsync().Wait();
}
private static EventData ToEventData(Guid eventId, object evnt, IDictionary<string, object> headers)
var data = Encoding.UTF8.GetBytes(JsonConvert.SerializeObject(evnt, serializerSettings));
var eventHeaders = new Dictionary<string, object>(headers)
EventClrTypeHeader, evnt.GetType().AssemblyQualifiedName
var metadata = Encoding.UTF8.GetBytes(JsonConvert.SerializeObject(eventHeaders, serializerSettings));
var typeName = evnt.GetType().Name;
return new EventData(eventId, typeName, true, data, metadata);
This was partially mentioned in the comments, but to expand on it: the code shown is effectively single-threaded. Although you call the async methods, you wait on each one immediately, so the work is synchronous, and you pay the latency and overhead of context switching and of the EventStore protocol round trip on every write. Either go properly async, without waiting on each call, and parallelize the writes (EventStore likes parallelization because it can batch multiple writes), or do the batching yourself and send, for example, 20 events at a time.
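As a rough illustration of the "do batching yourself" option, here is a minimal helper that splits a sequence into groups of a fixed size. `Batching` and `InBatches` are hypothetical names; how each batch is then written to EventStore is not shown, since that depends on how the events map to streams.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: splits a sequence into consecutive batches.
public static class Batching
{
    // Yields lists of at most batchSize items, preserving the input order.
    // The final batch may be smaller than batchSize.
    public static IEnumerable<List<T>> InBatches<T>(IEnumerable<T> source, int batchSize)
    {
        if (batchSize <= 0) throw new ArgumentOutOfRangeException(nameof(batchSize));
        var batch = new List<T>(batchSize);
        foreach (var item in source)
        {
            batch.Add(item);
            if (batch.Count == batchSize)
            {
                yield return batch;
                batch = new List<T>(batchSize);
            }
        }
        if (batch.Count > 0) yield return batch;
    }
}
```

With 45 items and a batch size of 20 this yields batches of 20, 20 and 5 items.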

tableclient.RetryPolicy Vs. TransientFaultHandling

A colleague and I have been tasked with finding connection-retry logic for Azure Table Storage. After some searching, I found this really cool Enterprise Library suite, which contains the Microsoft.Practices.TransientFaultHandling namespace.
Following a few code examples, I ended up creating an Incremental retry strategy, and wrapping one of our storage calls with the retryPolicy's ExecuteAction callback handler :
/// <inheritdoc />
public void SaveSetting(int userId, string bookId, string settingId, string itemId, JObject value)
// Define your retry strategy: retry 5 times, starting 1 second apart, adding 2 seconds to the interval each retry.
var retryStrategy = new Incremental(5, TimeSpan.FromSeconds(1), TimeSpan.FromSeconds(2));
var retryPolicy = new RetryPolicy<StorageTransientErrorDetectionStrategy>(retryStrategy);
var storageAccount = CloudStorageAccount.Parse(CloudConfigurationManager.GetSetting(StorageConnectionStringName));
retryPolicy.ExecuteAction(() =>
var tableClient = storageAccount.CreateCloudTableClient();
var table = tableClient.GetTableReference(SettingsTableName);
var entity = new Models.Azure.Setting
PartitionKey = GetPartitionKey(userId, bookId),
RowKey = GetRowKey(settingId, itemId),
UserId = userId,
BookId = bookId.ToLowerInvariant(),
SettingId = settingId.ToLowerInvariant(),
ItemId = itemId.ToLowerInvariant(),
Value = value.ToString(Formatting.None)
catch (StorageException exception)
Feeling awesome, I went to go show my colleague, and he smugly noted that we could do the same thing without having to include Enterprise Library, as the CloudTableClient object already has a setter for a retry policy. His code ended up looking like :
/// <inheritdoc />
public void SaveSetting(int userId, string bookId, string settingId, string itemId, JObject value)
var storageAccount = CloudStorageAccount.Parse(CloudConfigurationManager.GetSetting(StorageConnectionStringName));
var tableClient = storageAccount.CreateCloudTableClient();
// set retry for the connection
tableClient.RetryPolicy = new ExponentialRetry(TimeSpan.FromSeconds(2), 3);
var table = tableClient.GetTableReference(SettingsTableName);
var entity = new Models.Azure.Setting
PartitionKey = GetPartitionKey(userId, bookId),
RowKey = GetRowKey(settingId, itemId),
UserId = userId,
BookId = bookId.ToLowerInvariant(),
SettingId = settingId.ToLowerInvariant(),
ItemId = itemId.ToLowerInvariant(),
Value = value.ToString(Formatting.None)
catch (StorageException exception)
My Question :
Is there any major difference between these two approaches, aside from their implementations? They both seem to accomplish the same goal, but are there cases where it's better to use one over the other?
Functionally speaking, both are the same: they both retry requests in case of transient errors. However, there are a few differences:
Retry policy handling in the storage client library only handles retries for storage operations, while the transient fault handling block also retries SQL Azure, Service Bus and Cache operations in case of transient errors. So if you have a project that uses more than storage and you would like just one approach for handling transient errors, you may want to use the Transient Fault Handling Application Block.
One thing I liked about the transient fault handling block is that you can intercept retry operations, which you can't do with the retry policy. For example, look at the code below:
var retryManager = EnterpriseLibraryContainer.Current.GetInstance<RetryManager>();
var retryPolicy = retryManager.GetRetryPolicy<StorageTransientErrorDetectionStrategy>(ConfigurationHelper.ReadFromServiceConfigFile(Constants.DefaultRetryStrategyForTableStorageOperationsKey));
retryPolicy.Retrying += (sender, args) =>
// Log details of the retry.
var message = string.Format(CultureInfo.InvariantCulture, TableOperationRetryTraceFormat, "TableStorageHelper::CreateTableIfNotExist", storageAccount.Credentials.AccountName,
tableName, args.CurrentRetryCount, args.Delay);
TraceHelper.TraceError(message, args.LastException);
var isTableCreated = retryPolicy.ExecuteAction(() =>
var table = storageAccount.CreateCloudTableClient().GetTableReference(tableName);
return table.CreateIfNotExists(requestOptions, operationContext);
return isTableCreated;
catch (Exception)
In the code example above, I could intercept retry operations and do something there if I want to. This is not possible with storage client library.
Having said all of this, it is generally recommended to go with storage client library retry policy for retrying storage operations as it is an integral part of the package and thus would be kept up to date with the latest changes to the library.
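To illustrate the two ideas discussed here, an incremental delay schedule and a hook that runs before each retry, the following standalone sketch implements both without either library. It is not the Enterprise Library or storage client implementation: `IncrementalRetrySketch`, `Delays` and `Execute` are hypothetical names, the schedule simply follows the "start at 1 second, add 2 seconds each retry" description from the question, and every exception is treated as retryable, which a real transient error detection strategy would not do.

```csharp
using System;
using System.Collections.Generic;

// Standalone sketch; not the Enterprise Library or storage client API.
public static class IncrementalRetrySketch
{
    // Delays for an incremental schedule: initial, initial + step, initial + 2 * step, ...
    public static IEnumerable<TimeSpan> Delays(int retryCount, TimeSpan initial, TimeSpan step)
    {
        for (int i = 0; i < retryCount; i++)
        {
            yield return initial + TimeSpan.FromTicks(step.Ticks * i);
        }
    }

    // Runs action. On failure, calls onRetry and wait with the next delay and
    // tries again; rethrows the last exception once the delays are exhausted.
    public static T Execute<T>(Func<T> action, IEnumerable<TimeSpan> delays,
        Action<int, TimeSpan, Exception> onRetry, Action<TimeSpan> wait)
    {
        int attempt = 0;
        using (var schedule = delays.GetEnumerator())
        {
            while (true)
            {
                try
                {
                    return action();
                }
                catch (Exception ex)
                {
                    if (!schedule.MoveNext()) throw;
                    attempt++;
                    onRetry(attempt, schedule.Current, ex);
                    wait(schedule.Current);
                }
            }
        }
    }
}
```

In real use `wait` could be `Thread.Sleep`; it is passed in here so the behaviour can be exercised without actually sleeping. The `onRetry` callback plays the role that the Retrying event handler plays in the answer's example.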
