Corda RPC communication: slow performance

I have a scenario where a client triggers a CorDapp via RPC and waits for the result:
rpcConnection.proxy
    .startFlow(::ImportAssetFlow, importDto)
    .returnValue
    .get(importTimeout /* 30000 ms */, TimeUnit.MILLISECONDS)
The flow triggers correctly and returns the response, but the problem is the slow response after the flow has been processed. At the end of the FlowLogic.call() code block, the response should come back to the client via an Artemis message. In practice, the result arrives at the client via the Corda future only after ~12 seconds.
On the client side, I enabled debug-level logging of RPCClientProxyHandler to see how the process works:
2020-01-08 12:12:45.982 DEBUG [,,,] 78798 --- [global-threads)] n.c.c.r.internal.RPCClientProxyHandler : Got message from RPC server RpcReply(id=fc317c4a-3de4-4936-b4c3-768b8b727245, timestamp: 2020-01-08T10:12:44.237Z, entityType: Invocation, result=Success(FlowHandleImpl(id=[16566124-f7d2-41cf-b3a4-f86846073632], returnValue=net.corda.core.internal.concurrent.CordaFutureImpl#58f8aa01)), deduplicationIdentity=e3f6d696-dea4-45b0-95b8-f9c0fe363a9f)
2020-01-08 12:12:45.986 DEBUG [,,,] 78798 --- [global-threads)] n.c.c.r.internal.RPCClientProxyHandler : Got message from RPC server Observation(id=b3f0b064-6d82-4900-85e6-e70b7d00926a, timestamp: 2020-01-08T10:11:26.411Z, entityType: Invocation, content=[rx.Notification#b461fac0 OnNext Added(stateMachineInfo=StateMachineInfo([16566124-f7d2-41cf-b3a4-f86846073632], ***.workflow.asset.flow.ImportAssetFlow))], deduplicationIdentity=e3f6d696-dea4-45b0-95b8-f9c0fe363a9f)
2020-01-08 12:12:45.987 DEBUG [,,,] 78798 --- [global-threads)] n.c.c.r.internal.RPCClientProxyHandler : Got message from RPC server Observation(id=12887a04-f22c-422d-b684-c679f137d66b, timestamp: 2020-01-08T10:12:45.979Z, entityType: Invocation, content=[rx.Notification#4c59250 OnNext Starting], deduplicationIdentity=e3f6d696-dea4-45b0-95b8-f9c0fe363a9f)
2020-01-08 12:12:58.603 DEBUG [,,,] 78798 --- [global-threads)] n.c.c.r.internal.RPCClientProxyHandler : Got message from RPC server Observation(id=b83c15ca-9047-4958-a106-65165e5abfbd, timestamp: 2020-01-08T10:12:45.975Z, entityType: Invocation, content=[rx.Notification#e03cfa2d OnNext [B#2dceac3d], deduplicationIdentity=e3f6d696-dea4-45b0-95b8-f9c0fe363a9f)
2020-01-08 12:12:58.605 DEBUG [,,,] 78798 --- [global-threads)] n.c.c.r.internal.RPCClientProxyHandler : Got message from RPC server Observation(id=b83c15ca-9047-4958-a106-65165e5abfbd, timestamp: 2020-01-08T10:12:45.975Z, entityType: Invocation, content=[rx.Notification#15895539 OnCompleted], deduplicationIdentity=e3f6d696-dea4-45b0-95b8-f9c0fe363a9f)
There is a big gap between two events:
12:12:45.987 OnNext Starting - the start of the flow, which consumes 1k objects
12:12:58.603 OnNext [B#2dceac3d] - the actual result of the operation
So it takes ~12.5 s to return the response.
According to JProfiler, Corda processed the flow in ~1.3 s and sent the result back.
What can be the reason for such behavior? Slow journaling of Artemis messages?
UPD: I found that Corda has a suspend/resume mechanism (checkpoints) that stores the fiber state to disk so it can later be read back and resumed. net.corda.node.services.statemachine.FlowStateMachineImpl#run triggers co.paralleluniverse.fibers.Fiber#parkAndSerialize; this seems to be one of the time consumers.
Thanks a lot in advance.

There can be a number of causes:
JProfiler might not be reporting the flow duration correctly. Other profiling tools certainly have difficulty with the parking that Fibers do, since each resume from a suspend looks like a distinct method call. I would add some logging to your flow to check how long it really takes (see the sketch below).
If you return a really large result from your flow (e.g. a large collection of large objects), the serialization and deserialization of that result over RPC can take a noticeable amount of time. So, are you returning a large result?
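As a minimal sketch of that logging (the flow shape follows the question, but ImportDto, ImportResult and doImport() are placeholders, not names from the actual CorDapp), wall-clock timestamps inside call() show how long the flow body really takes, independently of what the profiler reports:

import co.paralleluniverse.fibers.Suspendable
import net.corda.core.flows.FlowLogic
import net.corda.core.flows.StartableByRPC

class ImportDto      // placeholder
class ImportResult   // placeholder

@StartableByRPC
class ImportAssetFlow(private val importDto: ImportDto) : FlowLogic<ImportResult>() {
    @Suspendable
    override fun call(): ImportResult {
        val start = System.currentTimeMillis()
        val result = doImport(importDto) // stands in for the real import work
        // FlowLogic provides a logger; this brackets only the flow body itself
        logger.info("ImportAssetFlow body took ${System.currentTimeMillis() - start} ms")
        return result
    }

    private fun doImport(dto: ImportDto): ImportResult = ImportResult()
}

Comparing the duration logged here against the ~12.5 s observed at the client separates time spent in the flow body from time spent in checkpointing, serialization and RPC delivery.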

I found the root cause of the delay: it seems that all entityManager transactions are flushed at the end of the flow, not at the end of the withEntityManager block as I expected:
fun save(assetBatch: PCAssetBatch): PCAssetBatch {
    return serviceHub.withEntityManager {
        persist(assetBatch)
        logger.debug("Batch [${assetBatch.batchId} - ${assetBatch.batchName}] has been persisted")
        assetBatch
    }
}
It took a while to understand this, because judging by the log line I believed the data had already been persisted. The problem comes from the entity with the @OneToMany annotation: PCAssetBatch contains a List of PCAssetBatchItem, and it seems Hibernate persists the items one by one rather than batching the inserts.
The delay was resolved by fixing this part of the code.
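The exact fix isn't shown above, but as a sketch of one way to make the cost visible where it is incurred (whether an explicit flush is appropriate depends on your transaction semantics), calling flush() inside the block forces the pending INSERTs to execute there instead of at the end of the flow:

fun save(assetBatch: PCAssetBatch): PCAssetBatch {
    return serviceHub.withEntityManager {
        persist(assetBatch)
        // Run the pending INSERTs now rather than when the flow ends,
        // so the log line below is only written once the rows really exist
        flush()
        logger.debug("Batch [${assetBatch.batchId} - ${assetBatch.batchName}] has been persisted")
        assetBatch
    }
}

For the @OneToMany children, enabling Hibernate's JDBC batching (the hibernate.jdbc.batch_size property) can also cut the per-row round trips, assuming the entity mapping and ID generation strategy allow it.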

Related

Pulsar: If a message gets nack'd (negativeAcknowledge()) when will it be redelivered?

If we cannot process a message (perhaps due to some timing problem or race condition) and we call
consumer.negativeAcknowledge(messageId);
when will it be redelivered so we can retry processing?
I am unable to figure out the default redelivery delay from the documentation:
https://pulsar.apache.org/docs/en/concepts-messaging/#negative-acknowledgement
https://pulsar.apache.org/docs/en/concepts-messaging/#acknowledgement-timeout
https://pulsar.apache.org/docs/en/reference-configuration/
The default is 60 seconds.
You can configure it on the consumer:
Consumer<byte[]> consumer = client.newConsumer()
        .topic("my-topic")
        .subscriptionName("my-sub")
        .negativeAckRedeliveryDelay(10, TimeUnit.SECONDS)
        .subscribe();
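Put together, the nack/redeliver loop looks roughly like this (a sketch in Kotlin against the Java client; process() and the service URL are placeholders, not from the question):

import java.util.concurrent.TimeUnit
import org.apache.pulsar.client.api.Message
import org.apache.pulsar.client.api.PulsarClient

fun process(msg: Message<ByteArray>) { /* your handler; placeholder */ }

fun main() {
    val client = PulsarClient.builder().serviceUrl("pulsar://localhost:6650").build()
    val consumer = client.newConsumer()
        .topic("my-topic")
        .subscriptionName("my-sub")
        .negativeAckRedeliveryDelay(10, TimeUnit.SECONDS)
        .subscribe()
    while (true) {
        val msg = consumer.receive()
        try {
            process(msg)
            consumer.acknowledge(msg)
        } catch (e: Exception) {
            // The message is redelivered to this subscription after the configured 10 s delay
            consumer.negativeAcknowledge(msg)
        }
    }
}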

When not to call FlushAsync on IAsyncCollector in Azure Functions ServiceBus binding

I have a range of Azure Function apps all outputting messages to Azure Service Bus using the IAsyncCollector<Message>:
public async Task Run([ServiceBus(...)] IAsyncCollector<Message> messages)
{
    ...
    await messages.AddAsync(msg);
}
I'm getting errors logged from time to time looking like this:
Microsoft.Azure.WebJobs.Host.FunctionInvocationException: Exception while executing function: Function
---> Microsoft.Azure.ServiceBus.ServiceBusTimeoutException: The operation did not complete within the allocated time 00:00:59.9999536 for object message.Reference: ..., 7/13/2020 2:46:24 PM
---> System.TimeoutException: The operation did not complete within the allocated time 00:00:59.9999536 for object message.
at Microsoft.Azure.Amqp.AsyncResult.End[TAsyncResult](IAsyncResult result)
at Microsoft.Azure.Amqp.SendingAmqpLink.EndSendMessage(IAsyncResult result)
at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic(IAsyncResult iar, Func`2 endFunction, Action`1 endAction, Task`1 promise, Boolean requiresSynchronization)
--- End of stack trace from previous location where exception was thrown ---
at Microsoft.Azure.ServiceBus.Core.MessageSender.OnSendAsync(IList`1 messageList)
--- End of inner exception stack trace ---
at Microsoft.Azure.ServiceBus.Core.MessageSender.OnSendAsync(IList`1 messageList)
at Microsoft.Azure.ServiceBus.RetryPolicy.RunOperation(Func`1 operation, TimeSpan operationTimeout)
at Microsoft.Azure.ServiceBus.RetryPolicy.RunOperation(Func`1 operation, TimeSpan operationTimeout)
at Microsoft.Azure.ServiceBus.Core.MessageSender.SendAsync(IList`1 messageList)
at Microsoft.Azure.WebJobs.ServiceBus.Bindings.MessageSenderExtensions.SendAndCreateEntityIfNotExists(MessageSender sender, Message message, Guid functionInstanceId, EntityType entityType, CancellationToken cancellationToken)
at My.Function.Run(String mySbMsg, IAsyncCollector`1 messages)
I'm having a bit of a hard time figuring out where in the pipeline this happens. But I have recently learned about the FlushAsync method:
await messages.AddAsync(msg);
await messages.FlushAsync();
My question is the following: why would I ever NOT include a call to FlushAsync in my function? Getting the timeout exception in my own code would make it possible to retry, do better exception logging, and more. Are there any downsides to flushing manually like this within the function code?
Why would I ever NOT include a call to FlushAsync in my function? Getting the timeout exception in my own code will make it possible to retry, do better exception logging, and more.
I'm going to go a bit further here and say that, having earned some experience with Azure Functions, I now avoid IAsyncCollector<T> completely. Some implementations publish on AddAsync; other implementations may publish on both AddAsync and FlushAsync. I suspect the Service Bus implementation actually publishes on AddAsync, in which case FlushAsync may be a no-op.
The nice part about IAsyncCollector<T> is that it gives you a "write these things" abstraction; all you have to do is provide a connection string, and the rest is magic. The problem with IAsyncCollector<T> is that it gives you an abstraction, and thus you have much less control.
Under the hood, how many retries are being done? Are they using constant delays or exponentially increasing? What's the behavior if it never succeeds? Usually none of this critical information is documented.
What's especially annoying is when the AF team changes the semantics of the abstraction. E.g., for some of the output bindings (either CosmosDB or storage, I don't remember), the retry behavior changed from one release of the functions SDK to the next.
So I've gravitated towards avoiding output bindings, especially IAsyncCollector<T>. I usually want tight but exponentially increasing, de-correlated, jittered retries with a cap of a minute or so, aborted when there's only a minute left of Functions runtime, at which point the recovery behavior changes to writing a message to an error queue (with retries). This is way more complex than IAsyncCollector<T> could ever provide, but it isn't that hard to do with Polly calling the SDKs directly.
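Polly is a .NET library, but the retry policy described above is straightforward to sketch in any language. Purely as an illustration of "de-correlated jittered retries with a cap" (Kotlin; the function name and constants are mine, not from any SDK):

import kotlin.random.Random

// Decorrelated jitter: each delay is drawn from [base, previous * 3], then capped.
fun <T> retryWithDecorrelatedJitter(
    baseMs: Long = 100,
    capMs: Long = 60_000, // cap of about a minute, as described above
    maxAttempts: Int = 8,
    block: () -> T
): T {
    var sleepMs = baseMs
    repeat(maxAttempts - 1) {
        try {
            return block()
        } catch (e: Exception) {
            sleepMs = minOf(capMs, Random.nextLong(baseMs, sleepMs * 3))
            Thread.sleep(sleepMs)
        }
    }
    return block() // final attempt: let the exception propagate to the caller
}

In a Functions context you would additionally check the remaining execution time before each attempt and fall back to writing to an error queue, as described above.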
Any downsides of flushing manually like this within the function code?
No. By default, IAsyncCollector<T>.FlushAsync is called by the functions host after your function executes. So if you call it yourself, you're just calling it early. It should be safe to call multiple times.
There's nothing wrong with calling FlushAsync in your code. The current implementation of IAsyncCollector for the Service Bus output binding batches your messages until FlushAsync is called or the function returns. When trying to send a high volume of messages, I stumbled on the same timeout exception as you. The solution I found was to call FlushAsync every N messages; in my case N=100 was the best trade-off. Calling FlushAsync too often triggers obvious performance penalties.

Memory issue with Camel resequencer

I have a Camel resequencer for giving priorities to incoming messages:
from(fromLocation)
    .routeId(routeName)
    .autoStartup(true)
    .convertBodyTo(String.class)
    .threads(1)
    .setHeader(PRIORITY, constant(1))
    .to(SEDA_PROCESS_NOTIFICATIONS + id);

from(SEDA_PROCESS_NOTIFICATIONS + id)
    .resequence(header(PRIORITY))
    .allowDuplicates()
    // ... rest of this route omitted in the question

from(endpoint2)
    .setHeader(PRIORITY, constant(2))
    .to(SEDA_PROCESS_NOTIFICATIONS + id);
However, after running this for some 3-4 hours, I get a memory exception in the resequencer's BatchProcessor: the BatchProcessor is not releasing memory. The error persists even if I decrease the batch size. If I use the resequencer's stream mode instead, not all of the messages are processed. What am I missing here?

Wait for messages processed by Service Bus OnMessage to finish

I'm using the Azure Service Bus SubscriptionClient.OnMessage method, configured to process up to 5 messages concurrently.
Within the code I need to wait for all messages to finish processing before I can continue (to properly shut down an Azure worker role). How do I do this?
Will SubscriptionClient.Close() block until all messages have finished processing?
Calling Close on SubscriptionClient or QueueClient will not block. Calling Close closes off the entity immediately, as far as I can tell. I tested quickly using the "Worker Role with Service Bus Queue" project template that shipped with the Windows Azure SDK 2.0: I added a thread sleep of many seconds in the message-processing action and then shut the role down while it was running. I saw the Close method get called while the messages were still sleeping in their processing threads, but it certainly did not wait for message processing to complete; the role simply closed down.
To handle this gracefully you'll need to do the same thing we did when dealing with any worker role that processes messages (Service Bus, Azure Storage queues or anything else): keep track of what is being worked on and shut down only when it is complete. There are several ways to deal with that, but all of them are manual and made messy in this case by the multiple threads involved.
Given the way that OnMessage works, you'll need to add something to the action that checks whether the role has been told to shut down and, if so, does no processing. The problem is that when the OnMessage action executes, it already HAS a message. You'd probably need to abandon the message but not exit the OnMessage action, otherwise it will keep being handed messages while there are any on the queue. You can't simply abandon the message and let execution leave the action, because then the system will hand it another message (possibly the same one), and several threads doing this may drive messages over their max dequeue count and into the dead-letter queue. You also can't call Close on the SubscriptionClient or QueueClient, which would stop the receive loop internally, because once you call Close, any outstanding message processing will throw an exception when .Complete, .Abandon, etc. is called on the message, since the message entity is now closed. This means you can't easily stop the incoming messages.
The main issue here is that you are using OnMessage and setting up concurrent message handling via MaxConcurrentCalls on the OnMessageOptions, which means the code that starts and manages the threads is buried in the QueueClient and SubscriptionClient, and you have no control over it. You can't reduce the thread count, stop the threads individually, etc. You'll need to create a way to put the OnMessage action threads into a state where they are aware that the system is being told to shut down, complete their current message, and then not exit the action, so that they are not continuously assigned new messages. This likely also means setting the OnMessageOptions to not use auto-complete and manually calling Complete in your OnMessage action.
Having to do all of this may severely reduce the actual benefit of using the OnMessage helper. Behind the scenes, OnMessage simply sets up a loop that calls Receive with the default timeout and hands messages off to another thread to run the action (loose description). What you get from the OnMessage approach is not having to write that handler yourself, but then the problem you're having is that, because you didn't write the handler yourself, you don't have control over those threads. Catch-22. If you really need to stop gracefully, you may want to step away from the OnMessage approach, write your own receive loop with threading, and within the main loop stop receiving new messages and wait for all the workers to finish.
One option, especially if the messages are idempotent (meaning processing them more than once yields the same result, which you should be mindful of anyway): if they are stopped mid-processing, they will simply reappear on the queue to be processed by another instance later. If the work itself isn't resource intensive and the operations are idempotent, this really can be an option. It's no different from an instance failing due to a hardware fault or other issue. Sure, it's not graceful or elegant, but it removes all the complexity I've mentioned, and interruption is something that can happen anyway due to other failures.
Note that OnStop is called when an instance is told to shut down. You've got 5 minutes you can delay this before the fabric simply shuts the instance off, so if your messages take longer than five minutes to process, it won't really matter whether you attempt to shut down gracefully or not; some will be cut off mid-processing.
You can tweak OnMessageAsync to wait for in-flight messages to complete and to block new messages from starting to be processed. Here is the implementation:
_subscriptionClient.OnMessageAsync(async message =>
{
    if (_stopRequested)
    {
        // Block processing of new messages. We want to wait for old messages to complete and exit.
        await Task.Delay(_waitForExecutionCompletionTimeout);
    }
    else
    {
        try
        {
            // Track executing messages
            _activeTaskCollection[message.MessageId] = message;
            await messageHandler(message);
            await message.CompleteAsync();
        }
        catch (Exception e)
        {
            // Handle the error by abandoning the message, or do nothing to force a retry
        }
        finally
        {
            BrokeredMessage savedMessage;
            if (!_activeTaskCollection.TryRemove(message.MessageId, out savedMessage))
            {
                _logger.LogWarning("Attempt to remove message id {0} failed.", message.MessageId);
            }
        }
    }
}, onMessageOptions);
And an implementation of Stop that waits for completion:
public async Task Stop()
{
    _stopRequested = true;
    DateTime startWaitTime = DateTime.UtcNow;
    while (DateTime.UtcNow - startWaitTime < _waitForExecutionCompletionTimeout && _activeTaskCollection.Count > 0)
    {
        await Task.Delay(_waitForExecutionCompletionSleepBetweenIterations);
    }
    await _subscriptionClient.CloseAsync();
}
Note that _activeTaskCollection is a ConcurrentDictionary (we could also use a counter with Interlocked to track the number of in-progress messages, but a dictionary lets you easily investigate what happened in case of errors).

Thread resource sharing

I'm struggling with multi-threaded programming...
I have an application that talks to an external device via a CAN-to-USB module. I've got the application talking on the CAN bus just fine, but there is a requirement for the application to transmit a "heartbeat" message every second.
This sounds like a perfect time to use threads, so I created a thread that wakes up every second and sends the heartbeat. The problem I'm having is sharing the CAN bus interface: the heartbeat must only be sent when the bus is idle. How do I share the resource?
Here is pseudocode showing what I have so far:
TMainThread
{
    Init:
        CanBusApi = new TCanBusApi;
        MutexMain = CreateMutex( "CanBusApiMutexName" );
        HeartbeatThread = new THeartbeatThread( CanBusApi );

    Execution:
        WaitForSingleObject( MutexMain );
        CanBusApi->DoSomething();
        ReleaseMutex( MutexMain );
}

THeartbeatThread( CanBusApi )
{
    Init:
        MutexHeart = CreateMutex( "CanBusApiMutexName" );

    Execution:
        Sleep( 1000 );
        WaitForSingleObject( MutexHeart );
        CanBusApi->DoHeartBeat();
        ReleaseMutex( MutexHeart );
}
The problem I'm seeing is that when DoHeartBeat is called, it causes the main thread to block while waiting for MutexMain, as expected, but DoHeartBeat also stops. DoHeartBeat doesn't complete until after WaitForSingleObject(MutexMain) times out in failure.
Does DoHeartBeat execute in the context of MainThread or HeartbeatThread? It seems to be executing in MainThread.
What am I doing wrong? Is there a better way?
Thanks,
David
I suspect that the CAN bus API is single-threaded under the covers. It may be marshaling your DoHeartBeat() request from your second thread back to the main thread; in that case there is no way for it to succeed, since your main thread is blocked. You can fix this in basically two ways: (1) send a message to the main thread telling it to do the heartbeat, rather than doing it on the second thread; or (2) use a timer on the main thread for your heartbeat instead of a second thread. (I do think that multithreading is overkill for this particular problem.)
First, re-read the spec for the heartbeat. Does it say that an actual heartbeat message must be received every second, or that some message must be received every second, with a heartbeat sent only when no other messages are in flight? The presence of data on the channel is de facto evidence that the communications channel is alive, so a dedicated heartbeat message may not be required.
If an actual heartbeat message is required every second, then the code above should use only one mutex, shared by both threads. The code as written creates two separate mutexes, so neither will actually block, and you'll end up with a collision on the channel and Bad Things Will Happen in CanBusApi. Make MutexMain visible as a global/class variable and have both threads reference it, as in the sketch below.
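The pseudocode above is Windows C++, but the one-shared-lock idea is the same in any language. A minimal illustration (Kotlin; CanBusApi here is a stand-in for the real API wrapper, not the questioner's class):

import java.util.concurrent.locks.ReentrantLock
import kotlin.concurrent.thread
import kotlin.concurrent.withLock

class CanBusApi {
    fun doSomething() { /* normal traffic on the CAN bus */ }
    fun doHeartBeat() { /* send the heartbeat frame */ }
}

val busLock = ReentrantLock() // ONE lock object, shared by both threads
val canBusApi = CanBusApi()

fun main() {
    thread(isDaemon = true) { // heartbeat thread
        while (true) {
            Thread.sleep(1000)
            busLock.withLock { canBusApi.doHeartBeat() } // runs only while the bus is free
        }
    }
    while (true) {
        busLock.withLock { canBusApi.doSomething() } // main work takes the same lock
    }
}

Note that this only serializes access to the bus; it doesn't address the marshaling issue raised in the first answer. If the underlying API really does marshal calls back to the main thread, a timer on the main thread remains the simpler design.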
