Sending data in batches using Spring Integration? - spring-integration

I am trying to read a file from GCP based on a notification received, as per the flow defined below:
File reader - deserializes the data into a collection and sends it for routing.
I am deserializing the data into a collection of objects and sending it to a router for further processing. Since I don't have control over the file size, I am considering some way of batching the reader step.
Currently, the file-reader service activator returns the whole Collection of deserialized objects.
Issue:
If I receive a large file, e.g. one with 200k records, I want to send the data to the header value router in batches rather than as a single collection of 200k objects.
If I convert the file-reader into a splitter and add an aggregator after it (Notification -> file-reader -> aggregator -> router), I would still need to return the collection of all the objects, not an iterator.
I don't want to load all the records into a collection.
Updated approach:
public <S> Collection<S> readData(DataInfo dataInfo, Class<S> clazz) {
    Resource gcpResource = context.getResource("classpath:data.json");
    var tempDataSet = new HashSet<S>();
    AtomicInteger pivot = new AtomicInteger();
    try (BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(gcpResource.getInputStream()))) {
        bufferedReader.lines().map(dataStr -> {
            try {
                return deserializeData(dataStr, clazz);
            } catch (JsonProcessingException ex) {
                throw new CustomException("PARSER-1001", "Error occurred while parsing", ex);
            }
        }).forEach(data -> {
            tempDataSet.add(data);
            // When tempDataSet reaches BATCH_SIZE, send a copy of the batch to the routing
            // channel and reset the pivot (a copy, so clear() doesn't mutate the sent payload)
            if (pivot.incrementAndGet() == BATCH_SIZE) {
                var message = MessageBuilder.withPayload(new HashSet<>(tempDataSet))
                        .setHeader(AppConstants.EVENT_HEADER_KEY, eventType)
                        .build();
                routingChannel.send(message);
                pivot.set(0);
                tempDataSet.clear();
            }
        });
        // whatever is left over is the final, partial batch
        return tempDataSet;
    } catch (Exception ex) {
        throw new CustomException("PARSER-1002", "Error occurred while parsing", ex);
    }
}
If the batch size is 100 and we receive 1010 objects, 11 batches would be created: 10 with 100 objects each and the last with 10.
In case I use a splitter and pass the stream to it, will it wait for the whole stream to finish and then send the collected collection, or can we achieve something close to the previous approach with it?

Not sure what the question is, but I would go with the FileSplitter + Aggregator solution. The first is exactly for the streaming file-reading use case. The second lets you buffer incoming messages until they reach some condition and then emit a single message downstream. That message could indeed have a collection as its payload.
Here are their docs for your consideration:
https://docs.spring.io/spring-integration/docs/current/reference/html/message-routing.html#aggregator
https://docs.spring.io/spring-integration/docs/current/reference/html/file.html#file-splitter
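To make that concrete, here is a minimal Java DSL sketch of such a flow, assuming the notification message carries the File to read and the 100-record batch size from the question; the channel names, the constant correlation key, and the deserialize helper are illustrative, not from the original post:

import org.springframework.context.annotation.Bean;
import org.springframework.integration.dsl.IntegrationFlow;
import org.springframework.integration.dsl.IntegrationFlows;
import org.springframework.integration.file.dsl.Files;

@Bean
public IntegrationFlow fileBatchingFlow() {
    return IntegrationFlows.from("fileNotificationChannel")
            // FileSplitter streams the file line by line instead of loading it whole
            .split(Files.splitter())
            .transform(String.class, line -> deserialize(line)) // hypothetical deserializer
            .aggregate(a -> a
                    // one constant key: every line belongs to the "current" batch
                    .correlationStrategy(m -> "batch")
                    // cut a batch as soon as 100 messages are buffered
                    .releaseStrategy(g -> g.size() >= 100)
                    // flush the trailing partial batch instead of losing it
                    .groupTimeout(5_000)
                    .sendPartialResultOnExpiry(true)
                    // completed groups are removed, so the next 100 start a fresh group
                    .expireGroupsUponCompletion(true))
            .channel("routingChannel")
            .get();
}

Only one batch is ever buffered at a time, so 200k records never sit in memory at once, and the 1010-record example above would come out as 10 groups of 100 plus a timed-out group of 10.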

Related

Sending messages from a file to Kinesis using Spark

We have files containing millions of lines. I need to read each line from a file and send it to Kinesis. I am trying the following code:
KinesisAsyncClient kinesisClient = KinesisClientUtil.createKinesisAsyncClient(
        KinesisAsyncClient.builder().region(region));
rdd.map(line -> {
    PutRecordRequest request = PutRecordRequest.builder()
            .streamName("MyTestStream")
            .data(SdkBytes.fromByteArray(line.getBytes()))
            .build();
    try {
        kinesisClient.putRecord(request).get();
    } catch (InterruptedException e) {
        LOG.info("Interrupted, assuming shutdown.");
    } catch (ExecutionException e) {
        LOG.error("Exception while sending data to Kinesis. Will try again next cycle.", e);
    }
    return null;
});
I am getting this error message:
object not serializable (class: software.amazon.awssdk.services.kinesis.DefaultKinesisAsyncClient
It seems KinesisAsyncClient is not the right object to use. Which other object can be used? NOTE: We don't want to use Spark (Structured) Streaming for this use case, because the files arrive only a few times a day; it doesn't make sense to keep a streaming app running.
Is there a better way to send messages to Kinesis via Spark? NOTE: We want to use Spark so that messages can be sent in a distributed fashion; sending each message sequentially takes too long.
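One common workaround, sketched below rather than taken from this thread: build the client inside foreachPartition, so each executor creates its own instance and the non-serializable client never crosses the driver/executor boundary (note also that map alone is lazy, so it needs an action to run at all). The region, stream name, and partition key here are placeholders, and the synchronous KinesisClient is assumed to be acceptable:

import java.nio.charset.StandardCharsets;
import software.amazon.awssdk.core.SdkBytes;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.kinesis.KinesisClient;
import software.amazon.awssdk.services.kinesis.model.PutRecordRequest;

rdd.foreachPartition(lines -> {
    // built on the executor, so the client itself is never serialized
    try (KinesisClient kinesisClient = KinesisClient.builder()
            .region(Region.US_EAST_1) // placeholder region
            .build()) {
        lines.forEachRemaining(line -> {
            PutRecordRequest request = PutRecordRequest.builder()
                    .streamName("MyTestStream")
                    .partitionKey(String.valueOf(line.hashCode())) // Kinesis requires a partition key
                    .data(SdkBytes.fromByteArray(line.getBytes(StandardCharsets.UTF_8)))
                    .build();
            kinesisClient.putRecord(request);
        });
    }
});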

Event Hub Consumer in Service Fabric

I'm trying to get a Service Fabric service to consistently pull messages from an Azure Event Hub. I seem to have everything wired up, but I notice that my consumer just stops pulling events.
I have a hub with a couple thousand events I've pushed to it. I configured the hub with 1 partition, and my Service Fabric service also has only 1 partition, to ease debugging.
The service starts, creates the EventHubClient, and from there uses it to create a PartitionReceiver. The receiver is passed to an "EventLoop" that enters an "infinite" while loop that calls receiver.ReceiveAsync. The code for the EventLoop is below.
What I am observing is that the first time through the loop I almost always get 1 message. The second time through I get somewhere between 103 and 200-ish messages. After that, I get no messages. It also seems that if I restart the service, I get the same messages again - but that's because on restart I have it start back at the beginning of the stream.
I would expect this to keep running until my 2000 messages were consumed, and then to wait for me (polling occasionally).
Is there something specific I need to do with the Azure.Messaging.EventHubs 5.3.0 package to make it keep pulling events?
//Here is how I am creating the EventHubClient:
var connectionString = "something secret";
var connectionStringBuilder = new EventHubsConnectionStringBuilder(connectionString)
{
    EntityPath = "NameOfMyEventHub"
};
try
{
    m_eventHubClient = EventHubClient.Create(connectionStringBuilder);
}
catch (Exception ex)
{
    // log the failure (elided from the original snippet)
}
//Here is how I am getting the partition receiver:
var receiver = m_eventHubClient.CreateReceiver("$Default", m_partitionId, EventPosition.FromStart());
//The event loop which the receiver is passed to:
private async Task EventLoop(PartitionReceiver receiver)
{
    m_started = true;
    while (m_keepRunning)
    {
        var events = await receiver.ReceiveAsync(m_options.BatchSize, TimeSpan.FromSeconds(5));
        if (events != null) //First 2/3 times events aren't null. After that, always null, and I know there are more in the partition.
        {
            var eventsArray = events as EventData[] ?? events.ToArray();
            m_state.NumProcessedSinceLastSave += eventsArray.Count();
            foreach (var evt in eventsArray)
            {
                //Process the event
                await m_options.Processor.ProcessMessageAsync(evt, null);
                string lastOffset = evt.SystemProperties.Offset;
                if (m_state.NumProcessedSinceLastSave >= m_options.BatchSize)
                {
                    m_state.Offset = lastOffset;
                    m_state.NumProcessedSinceLastSave = 0;
                    await m_state.SaveAsync();
                }
            }
        }
    }
    m_started = false;
}
EDIT: a question was asked about the number of partitions. The event hub has a single partition, and the SF service also has a single one.
I intend to use Service Fabric state to keep track of my offset into the hub, but that's not the concern for now.
Partition listeners are created for each partition. I get the partitions like this:
public async Task StartAsync()
{
    // slice the pie according to distribution
    // this partition can get one or more assigned Event Hub partition ids
    string[] eventHubPartitionIds = (await m_eventHubClient.GetRuntimeInformationAsync()).PartitionIds;
    string[] resolvedEventHubPartitionIds = m_options.ResolveAssignedEventHubPartitions(eventHubPartitionIds);
    foreach (var resolvedPartition in resolvedEventHubPartitionIds)
    {
        var partitionReceiver = new EventHubListenerPartitionReceiver(m_eventHubClient, resolvedPartition, m_options);
        await partitionReceiver.StartAsync();
        m_partitionReceivers.Add(partitionReceiver);
    }
}
When partitionListener.StartAsync is called, it actually creates the PartitionReceiver, like this (it's actually a bit more than this, but the branch taken is this one):
m_eventHubClient.CreateReceiver(m_options.EventHubConsumerGroupName, m_partitionId, EventPosition.FromStart());
Thanks for any tips.
Will
How many partitions do you have? I can't see in your code how you make sure you read all partitions in the default consumer group.
Any specific reason why you are using a PartitionReceiver instead of an EventProcessorHost?
To me, SF seems like a perfect fit for using the event processor host. I see there is already an SF-integrated solution that uses stateful services for checkpointing.
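For reference, the processor-style consumer looks roughly like this in the current Java SDK (azure-messaging-eventhubs plus the blob checkpoint store); the question uses .NET, where EventProcessorHost / EventProcessorClient plays the same role, so treat this only as a sketch of the pattern. The connection string and blobContainerAsyncClient are placeholders:

import com.azure.messaging.eventhubs.EventHubClientBuilder;
import com.azure.messaging.eventhubs.EventProcessorClient;
import com.azure.messaging.eventhubs.EventProcessorClientBuilder;
import com.azure.messaging.eventhubs.checkpointstore.blob.BlobCheckpointStore;

EventProcessorClient processor = new EventProcessorClientBuilder()
        .connectionString("<event-hubs-connection-string>", "NameOfMyEventHub")
        .consumerGroup(EventHubClientBuilder.DEFAULT_CONSUMER_GROUP_NAME) // "$Default"
        .checkpointStore(new BlobCheckpointStore(blobContainerAsyncClient)) // placeholder client
        .processEvent(ctx -> {
            // handle the event, then checkpoint so a restart resumes here
            // instead of replaying from the start of the stream
            ctx.updateCheckpoint();
        })
        .processError(ctx -> System.err.println(ctx.getThrowable()))
        .buildEventProcessorClient();
processor.start(); // keeps receiving and handles partition ownership for you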

Complete a message in a dead letter queue on Azure Service Bus

I want to be able to remove selected messages from my deadletter queue.
How is this accomplished?
I constantly get an error:
The operation cannot be completed because the RecieveContext is Null
I have tried every approach I can think of and read about; here is where I am now:
public void DeleteMessageFromDeadletterQueue<T>(string queueName, long sequenceNumber)
{
    var client = GetQueueClient(queueName, true);
    var messages = GetMessages(client);
    foreach (var m in messages)
    {
        if (m.SequenceNumber == sequenceNumber)
        {
            m.Complete();
        }
        else
        {
            m.Abandon();
        }
    }
}

/// <summary>
/// Return a list of all messages in a Queue
/// </summary>
/// <param name="client"></param>
/// <returns></returns>
private IEnumerable<BrokeredMessage> GetMessages(QueueClient client)
{
    var peekedMessages = client.PeekBatch(0, peekedMessageBatchCount).ToList();
    bool getmore = peekedMessages.Count() == peekedMessageBatchCount;
    while (getmore)
    {
        var moreMessages = client.PeekBatch(peekedMessages.Last().SequenceNumber, peekedMessageBatchCount);
        peekedMessages.AddRange(moreMessages);
        getmore = moreMessages.Count() == peekedMessageBatchCount;
    }
    return peekedMessages;
}
Not sure why this seems to be such a difficult task.
The issue here is that you've called PeekBatch, which returns messages that are only peeked. There is no receive context which you can then use to either complete or abandon the message. The Peek and PeekBatch operations only return the messages and do not lock them at all, even if the receiveMode is set to PeekLock. They're mostly for skimming the queue; you can't take an action on the messages. Note that the docs for both Abandon and Complete state they "must only be called on a message that has been received by using a receiver operating in Peek-Lock ReceiveMode." It's not made clear there, but the Peek and PeekBatch operations don't count, as they don't actually get a receive context. This is why it fails when you attempt to call Abandon. If you had actually found the one you were looking for, it would throw a different error when you called Complete.
What you want to do instead is use a ReceiveBatch operation in PeekLock ReceiveMode. This will actually pull a batch of messages back, and when you look through them and find the one you want, you can actually complete that message. When you call Abandon on the others, each message that isn't the one you want is immediately released back to the queue.
If your dead letter queue is fairly small, this usually won't be bad. If it is really large, this approach isn't the most efficient: you're treating the dead letter queue more like a heap and digging through it rather than processing the messages in order. That isn't uncommon when dealing with dead letter queues, which require manual intervention, but if you have a LOT of these it may be better to process the dead letter queue into a different type of store where you can more easily find and destroy messages, while still being able to recreate the ones that are okay and push them to a different queue to reprocess.
There may be other options, such as using Defer, if you are manually dead-lettering things. See How to use the MessageReceiver.Receive method by sequenceNumber on ServiceBus.
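For comparison, here is the same receive-complete-or-abandon loop sketched with the modern Java SDK (the snippets in this thread use the older .NET BrokeredMessage API); the connection string, queue name, batch size, and targetSequenceNumber are placeholders:

import java.time.Duration;
import com.azure.messaging.servicebus.ServiceBusClientBuilder;
import com.azure.messaging.servicebus.ServiceBusReceivedMessage;
import com.azure.messaging.servicebus.ServiceBusReceiverClient;
import com.azure.messaging.servicebus.models.ServiceBusReceiveMode;
import com.azure.messaging.servicebus.models.SubQueue;

ServiceBusReceiverClient dlqReceiver = new ServiceBusClientBuilder()
        .connectionString("<service-bus-connection-string>")
        .receiver()
        .queueName("myQueue")
        .subQueue(SubQueue.DEAD_LETTER_QUEUE)
        .receiveMode(ServiceBusReceiveMode.PEEK_LOCK) // a real lock, unlike peek
        .buildClient();

for (ServiceBusReceivedMessage msg : dlqReceiver.receiveMessages(100, Duration.ofSeconds(5))) {
    if (msg.getSequenceNumber() == targetSequenceNumber) {
        dlqReceiver.complete(msg);  // removes it from the dead letter queue
    } else {
        dlqReceiver.abandon(msg);   // releases the lock immediately
    }
}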
I was unsuccessful with MikeWo's suggestion, because when I used the combination of instantiating the DLQ QueueClient with ReceiveMode.PeekLock and pulling messages with ReceiveBatch, I was using the versions of Receive/ReceiveBatch that request a message by its SequenceNumber.
[Aside: in my app, I peek all the messages and list them, and have another handler to re-queue a dead-lettered message to the main queue based on its specific sequence number...]
But the call to Receive(long sequenceNumber) or ReceiveBatch(IEnumerable sequenceNumber) on the DLQ client always throws the exception "Failed to lock one or more specified messages. The message does not exist." (even when I only passed 1 and it is definitely in the queue).
Additionally, for reasons that aren't clear, ReceiveBatch(int messageCount) always returns only the next 1 message in the queue, no matter what value is used as the messageCount.
What finally worked for me was the following:
QueueClient queueClient, deadLetterClient;
GetQueueClients(qname, ReceiveMode.PeekLock, out queueClient, out deadLetterClient);
BrokeredMessage msg = null;
var mapSequenceNumberToBrokeredMessage = new Dictionary<long, BrokeredMessage>();
while (msg == null)
{
#if UseReceive
    var message = deadLetterClient.Receive();
#elif UseReceiveBatch
    var messageEnumerable = deadLetterClient.ReceiveBatch(CnCountOfMessagesToPeek).ToList();
    if ((messageEnumerable == null) || (messageEnumerable.Count == 0))
        break;
    else if (messageEnumerable.Count != 1)
        throw new ApplicationException("Invalid expectation that there'd always be only 1 msg returned by ReceiveBatch");
    // I've yet to get back more than one in the deadletter queue, but...
    var message = messageEnumerable.First();
#endif
    if (message.SequenceNumber == lMessageId)
    {
        msg = message;
        break;
    }
    else if (mapSequenceNumberToBrokeredMessage.ContainsKey(message.SequenceNumber))
    {
        // this means it's started the list over, so we didn't find it...
        break;
    }
    else
        mapSequenceNumberToBrokeredMessage.Add(message.SequenceNumber, message);
    message.Abandon();
}
if (msg == null)
    throw new ApplicationException("Unable to find a message in the deadletter queue with the SequenceNumber: " + msgid);
var strMessage = GetMessage(msg);
var newMsg = new BrokeredMessage(strMessage);
queueClient.Send(newMsg);
msg.Complete();

ReceiveFromAsync leaking SocketAsyncEventArgs?

I have a client application that receives a video stream from a server via a UDP or TCP socket.
Originally, when it was written using .NET 2.0, the code used BeginReceive/EndReceive and IAsyncResult.
The client displays each video in its own window and also uses its own thread for communicating with the server.
However, since the client is supposed to stay up for a long period of time, and there might be 64 video streams simultaneously, there is a "memory leak" of IAsyncResult objects that are allocated each time the data-receive callback is called.
This eventually causes the application to run out of memory, because the GC can't release the blocks in time. I verified this using the VS 2010 Performance Analyzer.
So I modified the code to use SocketAsyncEventArgs and ReceiveFromAsync (UDP case).
However, I still see a growth in memory blocks at:
System.Net.Sockets.Socket.ReceiveFromAsync(class System.Net.Sockets.SocketAsyncEventArgs)
I've read all the samples and posts about implementing this, and still have no solution.
Here's what my code looks like:
// class data members
private byte[] m_Buffer = new byte[UInt16.MaxValue];
private SocketAsyncEventArgs m_ReadEventArgs = null;
private IPEndPoint m_EndPoint; // local endpoint from the caller
Initializing:
m_Socket = new Socket(AddressFamily.InterNetwork, SocketType.Dgram, ProtocolType.Udp);
m_Socket.Bind(m_EndPoint);
m_Socket.SetSocketOption(SocketOptionLevel.Socket, SocketOptionName.ReceiveBuffer, MAX_SOCKET_RECV_BUFFER);
//
// initialize the socket event args structure.
//
m_ReadEventArgs = new SocketAsyncEventArgs();
m_ReadEventArgs.Completed += new EventHandler<SocketAsyncEventArgs>(readEventArgs_Completed);
m_ReadEventArgs.SetBuffer(m_Buffer, 0, m_Buffer.Length);
m_ReadEventArgs.RemoteEndPoint = new IPEndPoint(IPAddress.Any, 0);
m_ReadEventArgs.AcceptSocket = m_Socket;
Starting the read process:
bool waitForEvent = m_Socket.ReceiveFromAsync(m_ReadEventArgs);
if (!waitForEvent)
{
    readEventArgs_Completed(this, m_ReadEventArgs);
}
Read completion handler:
private void readEventArgs_Completed(object sender, SocketAsyncEventArgs e)
{
    if (e.BytesTransferred == 0 || e.SocketError != SocketError.Success)
    {
        //
        // we got an error on the socket or the connection was closed
        //
        Close();
        return;
    }
    try
    {
        // try to process a new video frame if enough data was read
        base.ProcessPacket(m_Buffer, e.Offset, e.BytesTransferred);
    }
    catch (Exception ex)
    {
        // log the error
    }
    bool willRaiseEvent = m_Socket.ReceiveFromAsync(e);
    if (!willRaiseEvent)
    {
        readEventArgs_Completed(this, e);
    }
}
Basically the code works fine and I see the video streams perfectly, but this leak is a real pain.
Did I miss anything???
Many thanks!!!
Instead of recursively calling readEventArgs_Completed when !willRaiseEvent, use a loop (or goto) to return to the top of the method. I noticed I was slowly chewing up stack space when I had a pattern similar to yours.

Download an undefined number of files with HttpWebRequest.BeginGetResponse

I have to write a small app which downloads a few thousand files. Some of these files contain references to other files that must be downloaded as part of the same process. The following code downloads the initial list of files, but I would like to download the referenced files as part of the same loop. What is happening now is that the loop completes before the first request comes back. Any idea how to achieve this?
var countdownLatch = new CountdownEvent(Urls.Count);
string url;
while (Urls.TryDequeue(out url))
{
    HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create(url);
    webRequest.BeginGetResponse(
        new AsyncCallback(ar =>
        {
            using (HttpWebResponse response = (ar.AsyncState as HttpWebRequest).EndGetResponse(ar) as HttpWebResponse)
            {
                using (var sr = new StreamReader(response.GetResponseStream()))
                {
                    string myFile = sr.ReadToEnd();
                    // TODO: Look for a reference to another file. If found, queue a new Url.
                }
            }
            countdownLatch.Signal(); // newly discovered urls are never counted in the latch - that's the gap
        }), webRequest);
}
countdownLatch.Wait();
One solution which comes to mind is to keep track of the number of pending requests and only finish the loop once no requests are pending and the Url queue is empty:
string url = null;
int requestCounter = 0;
AutoResetEvent requestFinished = new AutoResetEvent(false);
while (Urls.TryDequeue(out url) || Interlocked.CompareExchange(ref requestCounter, 0, 0) > 0)
{
    if (url != null)
    {
        Interlocked.Increment(ref requestCounter);
        HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create(url);
        webRequest.BeginGetResponse(
            new AsyncCallback(ar =>
            {
                try
                {
                    using (HttpWebResponse response = (ar.AsyncState as HttpWebRequest).EndGetResponse(ar) as HttpWebResponse)
                    {
                        using (var sr = new StreamReader(response.GetResponseStream()))
                        {
                            string myFile = sr.ReadToEnd();
                            // TODO: Look for a reference to another file. If found, queue a new Url.
                        }
                    }
                }
                finally
                {
                    Interlocked.Decrement(ref requestCounter);
                    requestFinished.Set();
                }
            }), webRequest);
    }
    else
    {
        // no url dequeued, but requests are still pending; wait for one to finish
        requestFinished.WaitOne();
    }
}
You are trying to write a web crawler. In order to write a good web crawler, you first need to define some parameters...
1) How many requests do you want to download simultaneously? In other words, how much throughput do you want? This will determine things like the number of outstanding requests, the threadpool size, etc.
2) You will have to have a queue of URLs. This queue is populated by each request that completes. You now need to decide on the growth strategy of the queue. For example, you cannot have an unbounded queue, as you can pump work items into the queue faster than you can download from the network.
Given this, you can design a system as follows:
Have at most N worker threads that actually download from the web. They take one item at a time from the queue and download the data. They parse the data and populate your URL queue.
If there are more than M URLs in the queue, then the queue blocks and does not allow any more URLs to be queued. Here you can do one of two things: either cause the enqueuing thread to block, or just discard the work item being enqueued. Once another work item completes on another thread and a URL is dequeued, the blocked thread will be able to enqueue successfully. A sketch of this design follows at the end of this answer.
With a system like this, you can ensure that you will not run out of system resources while downloading the data.
Implementation:
Note that if you are using async, then you are using an extra I/O thread to do the download. This is fine, as long as you are mindful of that fact. You can do a pure async implementation, where you have N BeginGetResponse() calls outstanding, and for each one that completes you start another one. This way you will always have N requests outstanding.
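Here is a rough sketch of that bounded-queue worker design in Java (the same shape applies in .NET); WORKERS, QUEUE_CAPACITY, and the extractUrls parser are all illustrative, and this version uses the "discard when full" strategy described above:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class Crawler {
    private static final int WORKERS = 8;              // "N" in the answer above
    private static final int QUEUE_CAPACITY = 10_000;  // "M" in the answer above

    private final BlockingQueue<String> queue = new ArrayBlockingQueue<>(QUEUE_CAPACITY);
    private final AtomicInteger pending = new AtomicInteger(); // queued + in flight
    private final HttpClient http = HttpClient.newHttpClient();

    public void crawl(List<String> seeds) throws InterruptedException {
        seeds.forEach(this::enqueue);
        ExecutorService pool = Executors.newFixedThreadPool(WORKERS);
        for (int i = 0; i < WORKERS; i++) pool.execute(this::worker);
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    private void enqueue(String url) {
        // "discard" strategy: drop the item when the queue is full (put() would block instead)
        if (queue.offer(url)) pending.incrementAndGet();
    }

    private void worker() {
        try {
            while (pending.get() > 0) {
                String url = queue.poll(500, TimeUnit.MILLISECONDS);
                if (url == null) continue; // another worker may still enqueue more
                try {
                    HttpResponse<String> resp = http.send(
                            HttpRequest.newBuilder(URI.create(url)).build(),
                            HttpResponse.BodyHandlers.ofString());
                    extractUrls(resp.body()).forEach(this::enqueue);
                } catch (Exception e) {
                    // log and move on to the next url
                } finally {
                    pending.decrementAndGet();
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private List<String> extractUrls(String body) {
        return List.of(); // placeholder: parse file references out of the response here
    }
}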
