We are using JdbcOperations.queryForStream() to fetch 30k+ rows from the database, because the Spring Integration JDBC outbound gateway returns only 1 record even with MaxRows(0). However, split() and aggregate() on the stream are not working. We need the aggregation to work so we know when all the stream records have been consumed and can perform a final operation.
The splitter doesn't know the size of Iterator, Stream or Flux request message payloads: https://docs.spring.io/spring-integration/docs/current/reference/html/message-routing.html#iterators. Therefore the sequenceSize header is 0 and the default aggregator cannot do its job because there is no sequenceSize to compare against. You must provide a custom releaseStrategy or rely on a groupTimeout to perform that final operation.
Another trick can be done with JDBC: before calling queryForStream() you can query for the count of records and set that value into a header on the reply message before the splitter. You can then use such a header in a custom releaseStrategy.
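For example, a minimal Java DSL sketch of that count-header trick (the header name recordCount, the table, and the use of a plain JdbcTemplate instead of the JDBC outbound gateway are assumptions):

@Bean
public IntegrationFlow streamingFlow(JdbcTemplate jdbcTemplate) {
    return f -> f
            // ask for the total count first and carry it in a header
            .enrichHeaders(h -> h.headerFunction("recordCount",
                    m -> jdbcTemplate.queryForObject("SELECT COUNT(*) FROM my_table", Integer.class)))
            // stream the rows; the splitter iterates the Stream one record at a time
            .handle((payload, headers) -> jdbcTemplate.queryForStream(
                    "SELECT keyword FROM my_table", (rs, i) -> rs.getString("keyword")))
            .split()
            // release once the group size reaches the pre-computed count
            .aggregate(a -> a.releaseStrategy(group -> {
                Integer expected = group.getMessages().iterator().next()
                        .getHeaders().get("recordCount", Integer.class);
                return expected != null && group.size() >= expected;
            }))
            .handle(m -> {
                // final operation once all streamed records have been consumed
            });
}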
See more info about aggregator features in docs: https://docs.spring.io/spring-integration/docs/current/reference/html/message-routing.html#aggregator
I'm new to Azure Stream Analytics. I have an Event Hub as the input source and now I'm trying to execute a simple query on this stream. An example query is like this:
SELECT
count(*)
INTO [output1]
FROM
[input1] TIMESTAMP BY Time
GROUP BY TumblingWindow(second, 10)
So I want to count the events which arrived within a certain time frame.
When executing this query, I always get the following error:
Request exceeded maximum allowed size limit
I have already narrowed down the checked time window, and I'm certain that the number of events within this time frame is not very large (at most several hundred), so I'm not sure how to avoid this error.
Do you have a hint?
Thanks!
Request exceeded maximum allowed size limit
This error (I believe it should be more explicit) indicates that you violated the Azure Stream Analytics resource and object limits.
It's not just about quantity, it's also about size. Please check your source inputs' size or try to reduce the window size and test again.
1. Does the record size of the source query mean that one event can only have 64 KB, or does this parameter mean 64 K events?
It means the size of one event should be below 64 KB.
2. Is there a possibility to use Stream Analytics to select only certain subfields of the event, or is the only way to reduce the event size before it is sent to the event hub?
As far as I know, ASA only collects data for processing, so the size depends entirely on the source side and your query SQL. Since you need to use COUNT, I'm afraid you have to do something on the Event Hub side. Please refer to my thoughts:
Use an Event Hub Azure Function trigger: when an event streams into the event hub, trigger the function, pick only the key-values you need, and save the trimmed event into another event hub namespace (just in order to reduce the size of the source event). Since you only need to COUNT records, I think this works for you.
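A rough sketch of such a function (assuming the Azure Functions Java programming model and Gson; the hub names, the connection setting and the field names are placeholders):

import com.google.gson.JsonObject;
import com.google.gson.JsonParser;
import com.microsoft.azure.functions.ExecutionContext;
import com.microsoft.azure.functions.annotation.EventHubOutput;
import com.microsoft.azure.functions.annotation.EventHubTrigger;
import com.microsoft.azure.functions.annotation.FunctionName;

public class TrimEventFunction {

    @FunctionName("trimEvent")
    @EventHubOutput(name = "output", eventHubName = "trimmed-events", connection = "EventHubConnection")
    public String run(
            @EventHubTrigger(name = "event", eventHubName = "source-events",
                    connection = "EventHubConnection") String event,
            ExecutionContext context) {
        // keep only the fields the ASA query needs, so each forwarded event stays well below the 64 KB limit
        JsonObject source = JsonParser.parseString(event).getAsJsonObject();
        JsonObject trimmed = new JsonObject();
        trimmed.add("Time", source.get("Time")); // the field used by TIMESTAMP BY in the query
        return trimmed.toString();
    }
}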
How to consume multiple messages from Pub/Sub? This seems like a simple problem that should have a simple solution, but currently I can't find an easy way to consume batches of records from Pub/Sub with spring-cloud-gcp-pubsub.
I'm using spring-cloud-gcp-pubsub to consume messages from Pub/Sub and process them in a Spring Boot application. My current setup is very simple: I have a PubSubInboundChannelAdapter and a ServiceActivator that consumes records. After some research I found Spring Integration Aggregators, but they didn't seem like a good way of doing this because it's not easy to propagate the acknowledgment downstream. Is there anything I'm missing? How can I consume batches of messages?
The PubSubInboundChannelAdapter is based on the subscription to the topic. So, it is going to be a stream of messages, and this PubSubInboundChannelAdapter reacts to each of them, converting it to a Spring Message and sending it downstream to the configured channel.
There is really no way to get a batch of messages during subscription.
Also keep in mind that there is no such thing as an offset in GCP Pub/Sub. You indeed have to acknowledge every single message you consume from Pub/Sub.
There is, however, a way to pull a batch of messages at once, using PubSubMessageSource. messageSource.setMaxFetchSize(5) does the trick, but this PubSubMessageSource still produces every message individually, so you are able to (n)ack them independently.
You can, of course, leverage the feature PubSubMessageSource uses - PubSubSubscriberOperations.pullAndConvert(). See its JavaDocs for more info:
/**
* Pull a number of messages from a Google Cloud Pub/Sub subscription and convert them to Spring messages with
* the desired payload type.
* @param subscription the subscription name
* @param maxMessages the maximum number of pulled messages
* @param returnImmediately returns immediately even if subscription doesn't contain enough
* messages to satisfy {@code maxMessages}
* @param payloadType the type to which the payload of the Pub/Sub messages should be converted
* @param <T> the type of the payload
* @return the list of received acknowledgeable messages
* @since 1.1
*/
<T> List<ConvertedAcknowledgeablePubsubMessage<T>> pullAndConvert(String subscription, Integer maxMessages,
Boolean returnImmediately, Class<T> payloadType);
So, this one looks like what you are looking for because you indeed are going to have a list of messages and each of them is a wrapper with (n)ack callbacks.
This API could be used in a custom @InboundChannelAdapter MessageSource or Supplier @Bean implementation.
But still: I don't see the benefit of whole-batch processing, since every message can be ack'ed individually without affecting the others.
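A rough sketch of such a custom adapter (assuming a "testSubscription" subscription and String payloads) that emits each pulled batch as one List payload, while every message in it can still be (n)acked individually:

@Bean
@InboundChannelAdapter(channel = "pubsubBatchChannel", poller = @Poller(fixedDelay = "5000"))
public MessageSource<List<ConvertedAcknowledgeablePubsubMessage<String>>> batchMessageSource(
        PubSubSubscriberOperations subscriberOperations) {
    return () -> {
        // pull up to 10 messages in a single request; returnImmediately = true avoids blocking the poller
        List<ConvertedAcknowledgeablePubsubMessage<String>> batch =
                subscriberOperations.pullAndConvert("testSubscription", 10, true, String.class);
        return batch.isEmpty() ? null : MessageBuilder.withPayload(batch).build();
    };
}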
Try using below:
@Bean
@InboundChannelAdapter(channel = "pubsubInputChannel", poller = @Poller(fixedDelay = "5000", maxMessagesPerPoll = "3"))
public MessageSource<Object> pubsubAdapter(PubSubTemplate pubSubTemplate) {
    PubSubMessageSource messageSource = new PubSubMessageSource(pubSubTemplate, "testSubscription");
    messageSource.setAckMode(AckMode.MANUAL);
    return messageSource;
}
The maxMessagesPerPoll property determines how many messages will be polled.
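With AckMode.MANUAL, each message then has to be acknowledged explicitly in the consumer; for example, something like:

@ServiceActivator(inputChannel = "pubsubInputChannel")
public void messageReceiver(String payload,
        @Header(GcpPubSubHeaders.ORIGINAL_MESSAGE) BasicAcknowledgeablePubsubMessage message) {
    // process the payload ...
    message.ack(); // or message.nack() to have it redelivered
}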
I have a Spark dataframe which I need to filter based on a condition.
The condition is: there is a column "keyword" in the dataframe, and I need to call an API, passing the value of this keyword column. This is to be done for all the keyword column values. The API sends back one number which I need to compare with a threshold value; if it's greater, return true, else false.
I wrote a UDF for that and it looks like below:
val filteredDf = df.filter(apiUdf(col("keyword_text")) === true)

val apiUdf = udf((topic: String) => {
  // HTTP API call ...
  // parse the result ...
  // find the number from the API response ...
  // compare it with some threshold value and return true/false
})
The issue here is that I am opening and closing an HTTP connection as many times as I have keywords. Can someone tell me how to optimize this, and also whether the UDF approach is fine here?
Spark UDFs are meant to implement complex logic and return a value (or values).
In distributed data processing, it is not a good design/approach to have executors calling external URLs:
It is a scaling issue, both with the size of the data and with the number of times a connection is opened/closed.
Also, in most production environments, executor nodes are not exposed to the internet.
I would advise:
save/collect all the col("keyword_text") values
for each keyword, call the HTTP API (not a Spark UDF) and get the response
save the data as some_id, keyword_text, api_result
Now, with df1 (some_id, keyword_text, api_result),
you can join df and df1 and filter on the api_result (see the sketch below).
I am not sure whether the HTTP API takes bulk/batch requests (usually most do); if it does, you can consider that approach.
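A rough sketch of the collect-then-join approach above (in Java, with a hypothetical callApi() helper standing in for the HTTP call and response parsing):

import static org.apache.spark.sql.functions.col;

import java.util.List;
import java.util.stream.Collectors;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class KeywordFilter {

    public static Dataset<Row> filterByApi(SparkSession spark, Dataset<Row> df, long threshold) {
        // 1. collect the distinct keywords to the driver (feasible for a reasonably small keyword set)
        List<String> keywords = df.select("keyword_text").distinct()
                .as(Encoders.STRING())
                .collectAsList();

        // 2. call the HTTP API once per keyword on the driver, outside any UDF
        List<Row> scored = keywords.stream()
                .map(k -> RowFactory.create(k, callApi(k)))
                .collect(Collectors.toList());

        StructType schema = new StructType()
                .add("keyword_text", DataTypes.StringType)
                .add("api_result", DataTypes.LongType);
        Dataset<Row> apiDf = spark.createDataFrame(scored, schema);

        // 3. join back on the keyword and filter on the API result
        return df.join(apiDf, "keyword_text")
                .filter(col("api_result").gt(threshold));
    }

    // hypothetical helper: performs the HTTP call and parses the number out of the response
    private static long callApi(String keyword) {
        return 0L;
    }
}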
We are using the Spring Integration 4.2.3 Aggregator component with a group-timeout defined, expecting the group to be timed out within the given timeout value while messages are being added to the group and the release size criteria is not met.
But we are seeing different results: when we put a heavy load on the service, the aggregator waits for all messages to be added to the group rather than expiring the group when the timeout is reached.
Is there any way to override the aggregator functionality to look at the first message rather than the last message when timing out the group?
Well, actually you can do what you need even now, using the same group-timeout-expression. But you have to consult the #root object of the evaluation context, which is exactly what you need - the MessageGroup. With that you can call one of these for your purpose:
/**
* @return the timestamp (milliseconds since epoch) associated with the creation of this group
*/
long getTimestamp();
/**
* @return the timestamp (milliseconds since epoch) associated with the time this group was last updated
*/
long getLastModified();
Therefore an expression for your original request might be like:
group-timeout-expression="timestamp + 10000 - T(System).currentTimeMillis()"
And we get an adjusted timeout which is applied to the scheduled task with a value like new Date(System.currentTimeMillis() + groupTimeout).
No; the timeout is currently based on the arrival of the last message only.
If you use a MessageGroupStoreReaper instead, the time is based on the group creation by default, but that can be changed by setting the group store's timeoutOnIdle to true.
If your group is not timing out at all, perhaps the thread pool in the default taskScheduler is exhausted - it only has 10 threads by default.
You can increase the pool size or inject a dedicated scheduler into the aggregator.
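For example, a dedicated scheduler bean (a sketch; the pool size is arbitrary) that can be injected into the aggregator instead of relying on the shared default taskScheduler:

@Bean
public ThreadPoolTaskScheduler aggregatorScheduler() {
    ThreadPoolTaskScheduler scheduler = new ThreadPoolTaskScheduler();
    scheduler.setPoolSize(50); // the default taskScheduler has only 10 threads
    scheduler.setThreadNamePrefix("aggregator-");
    return scheduler;
}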
We debugged the issue with your group-timeout-expression (timestamp + 20000 - T(System).currentTimeMillis()) and found out that the expression evaluates to a negative value as messages keep flowing in, thus causing the group to never get released.
The code block where the issue lies is in AbstractCorrelatingMessageHandler.java.
Once we removed the "groupTimeout >= 0" condition, the group gets expired through the else block, and the code now behaves the way we expected.
Could you let me know why you are not forcing the group to be timed out when the value becomes negative?
I have a batch process: we receive a START message in a queue and an END message in the same queue. After the START message, we receive thousands of messages in 3 other queues, which we filter, enrich, aggregate and finally transform to JSON (we can call this pipeline MAIN_PIPE).
After that START message we have an adapter that reads from the database, in a single message, the total number of elements we will receive (we can call this pipeline COUNTER_PIPE).
And after the END message, once we have treated ALL the messages, we have to send a request to an external service.
So, we need to count all treated messages (converted to JSON) in MAIN_PIPE and compare the count with the number from COUNTER_PIPE.
How can I compare that?
Would you mind also describing how you read from those 3 queues? It isn't clear to me where the correlation is between START and all the messages in the batch. If that is a regular message-driven channel adapter, there is a case where we may start receiving those messages while there is still no START, or no info about the count in the DB.
Anyway, I'd make it like this:
The START and END messages, as well as all messages in that batch, must have the same correlationKey to let an Aggregator form a batch in the end.
Since the group in this case is based on the count anyway, you have no choice but to send even the messages discarded by the filter to the aggregator. That might be just a simple error stub, so they can be distinguished properly in the aggregator's release function.
The releaseStrategy of the aggregator must iterate over the group to find the message with the count and compare it with the group size + 2 (START & END messages).
Does it make sense to you?
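A rough sketch of such a releaseStrategy (the "recordCount" header name is an assumption; use whichever header COUNTER_PIPE puts the DB count into):

import java.util.Objects;

import org.springframework.integration.aggregator.ReleaseStrategy;
import org.springframework.integration.store.MessageGroup;

public class BatchCountReleaseStrategy implements ReleaseStrategy {

    @Override
    public boolean canRelease(MessageGroup group) {
        // find the single message carrying the expected count (from COUNTER_PIPE)
        Integer expectedCount = group.getMessages().stream()
                .map(m -> m.getHeaders().get("recordCount", Integer.class))
                .filter(Objects::nonNull)
                .findFirst()
                .orElse(null);
        if (expectedCount == null) {
            return false; // the count message has not arrived yet
        }
        // release once all counted messages plus the START and END markers are present
        // (adjust by one more if the COUNTER_PIPE message itself is also part of the group)
        return group.size() >= expectedCount + 2;
    }
}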