I am looking to understand the best way to persist the output of an integration flow containing an aggregator (with a release strategy spanning approximately 3 minutes) into a data source.
I've been considering JDBC, but I am open to other options as well. Ideally I could use a Spring Integration component that deals well with buffered writes, maintains JDBC connections with low overhead, and can operate at a Spring Data level (I want to work with a domain model, not with raw SQL updates).
JdbcOutboundGateway and JdbcMessageHandler seem to be limited to direct SQL interaction.
I considered JdbcChannelMessageStore, but that seems to be limited to a binary representation of the messages.
Is a service activator a better-suited choice? Would I have to read the service activator's input from a pollable queue in order to ensure better write performance?
Transactional constraints are also under consideration for this integration flow, with the goal of accommodating about 10K messages/hour.
Any directions are highly appreciated. I have been having trouble finding resources about JDBC output from integration flows (mostly mentions of JDBC message handlers and gateways, without specific considerations for performance and data model flexibility).
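For illustration, something like the following sketch is what I have in mind for the service-activator option; OrderBatch, OrderRepository and the channel name are placeholders from my own domain, not anything prescribed by Spring Integration.

import org.springframework.context.annotation.Bean;
import org.springframework.integration.annotation.Poller;
import org.springframework.integration.annotation.ServiceActivator;
import org.springframework.integration.channel.QueueChannel;
import org.springframework.messaging.PollableChannel;
import org.springframework.stereotype.Component;

@Component
public class BatchPersister {

    private final OrderRepository orderRepository;   // hypothetical Spring Data repository

    public BatchPersister(OrderRepository orderRepository) {
        this.orderRepository = orderRepository;
    }

    @Bean
    public PollableChannel aggregatedChannel() {
        // The aggregator's output channel: buffers released groups so writes are decoupled
        return new QueueChannel();
    }

    @ServiceActivator(inputChannel = "aggregatedChannel",
            poller = @Poller(fixedDelay = "1000", maxMessagesPerPoll = "100"))
    public void persist(OrderBatch batch) {
        // Domain-level write through Spring Data instead of raw SQL
        orderRepository.saveAll(batch.getOrders());
    }
}

Is that roughly the shape such a flow should take, or is there a more idiomatic component for this?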
Related
I have been reading a lot of Spring Cloud Data Flow and related documentation in order to produce a data ingest solution that will run in my organization's Cloud Foundry deployment. The goal is to poll an HTTP service for data, perhaps three times per day for the sake of discussion, and insert/update that data in a PostgreSQL database. The HTTP service seems to provide tens of thousands of records per day.
One point of confusion thus far is the best practice, in the context of a DataFlow pipeline, for deduplicating polled records. The source data have no timestamp field to aid in tracking polling, only a coarse day-level date field. I also have no guarantee that records are never updated retroactively. The records appear to have a unique ID, so I can deduplicate them that way, but based on the documentation I am just not sure how best to implement that logic in DataFlow. As far as I can tell, the Spring Cloud Stream starters do not provide this out of the box. I was reading about Spring Integration's smart polling, but I'm not sure that addresses my concern either.
My intuition is to create a custom Processor Java component in a DataFlow Stream that performs a database query to determine whether polled records have already been inserted, then inserts the appropriate records into the target database, or passes them on down the stream. Is querying the target database in an intermediate step acceptable in a Stream app? Alternatively, I could implement this all in a Spring Cloud Task as a batch operation which triggers based on some schedule.
What is the best way to proceed with respect to a DataFlow app? What are common/best practices for achieving deduplication as I described above in a DataFlow/Stream/Task/Integration app? Should I copy the setup of a starter app or just start from scratch, because I am fairly certain I'll need to write custom code? Do I even need Spring Cloud DataFlow, because I'm not sure I'll be using its DSL at all? Apologies for all the questions, but being new to Cloud Foundry and all these Spring projects, it's daunting to piece it all together.
Thanks in advance for any help.
You are on the right track; given your requirements, you will most likely need to create a custom processor. You need to keep track of what has been inserted in order to avoid duplication.
There's nothing preventing you from writing such a processor in a stream app; however, performance may take a hit, since you will issue a DB query for each record.
If order is not important, you could parallelize the queries and process several messages concurrently, but in the end your DB would still pay the price.
Another approach would be to use a Bloom filter, which can help quite a lot in speeding up the check for already-inserted records.
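To make that concrete, here is a rough sketch of such a custom processor, under a few illustrative assumptions: the older @EnableBinding programming model, payloads already converted to a Map, a records table with an external_id column, and Guava's BloomFilter. The filter short-circuits the common "never seen" case, so the DB is only consulted when a possible duplicate is reported.

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import org.springframework.cloud.stream.annotation.EnableBinding;
import org.springframework.cloud.stream.messaging.Processor;
import org.springframework.integration.annotation.Filter;
import org.springframework.jdbc.core.JdbcTemplate;

@EnableBinding(Processor.class)
public class DedupProcessor {

    // Expected insertions and false-positive rate are illustrative values
    private final BloomFilter<String> seen =
            BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), 1_000_000, 0.01);

    private final JdbcTemplate jdbcTemplate;

    public DedupProcessor(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    // Pass a record downstream only if its unique ID has not been seen before
    @Filter(inputChannel = Processor.INPUT, outputChannel = Processor.OUTPUT)
    public boolean isNew(Map<String, Object> record) {
        String id = String.valueOf(record.get("id"));
        boolean isNew;
        if (!seen.mightContain(id)) {
            isNew = true;                          // definitely not seen before, no DB hit
        } else {
            // Possible duplicate: confirm against the target table
            Integer count = jdbcTemplate.queryForObject(
                    "SELECT COUNT(*) FROM records WHERE external_id = ?", Integer.class, id);
            isNew = count == null || count == 0;
        }
        if (isNew) {
            seen.put(id);                          // remember it for subsequent polls
        }
        return isNew;
    }
}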
You can start by cloning the starter apps: have a poller trigger an http client processor that fetches your data, pass it through your custom code processor, and finally send it to a jdbc sink. Something like: stream create time --trigger.cron=<CRON_EXPRESSION> | httpclient --httpclient.url-expression=<remote_endpoint> | customProcessor | jdbc
One of the advantages of using SCDF is that you could independently scale your custom processor via deployment properties such as deployer.customProcessor.count=8
Spring Cloud Data Flow builds integration streams for data on top of Spring Cloud Stream, which, in turn, is fully based on Spring Integration. All the principles that exist in Spring Integration can be applied there at the SCDF level.
It really might be the case that you won't be able to avoid some coding, but what you need is called an Idempotent Receiver in EIP terms. And Spring Integration provides one for us:
@ServiceActivator(inputChannel = "processChannel")
@IdempotentReceiver("idempotentReceiverInterceptor")
public void handle(Message<?> message) {
    // only messages the interceptor has not seen before reach this point
}
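The referenced idempotentReceiverInterceptor would be a bean along these lines; this is only a sketch, keying on a hypothetical recordId header and backed by an in-memory SimpleMetadataStore (a JDBC-backed metadata store could be swapped in so duplicates survive restarts):

import org.springframework.context.annotation.Bean;
import org.springframework.integration.channel.NullChannel;
import org.springframework.integration.handler.advice.IdempotentReceiverInterceptor;
import org.springframework.integration.metadata.SimpleMetadataStore;
import org.springframework.integration.selector.MetadataStoreSelector;

@Bean
public IdempotentReceiverInterceptor idempotentReceiverInterceptor() {
    // Reject messages whose key has already been recorded in the metadata store
    IdempotentReceiverInterceptor interceptor = new IdempotentReceiverInterceptor(
            new MetadataStoreSelector(
                    message -> message.getHeaders().get("recordId", String.class),
                    new SimpleMetadataStore()));
    interceptor.setDiscardChannel(new NullChannel());   // silently drop duplicates
    return interceptor;
}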
I am trying to design an IoT platform using the above-mentioned technologies. I would be happy if someone could comment on the architecture and whether it is good and scalable!
I get IoT sensor data through MQTT, which I will receive through Spark Streaming (there is an MQTT connector for Spark Streaming which does this). I only have to subscribe to the topics; a third-party server publishes the IoT data to them.
Then I parse the data and insert it into AWS DynamoDB. Yes, the whole setup will run on AWS.
I may have to process/transform the data in the future depending on the IoT use cases, so I thought Spark might be useful. Also, I have heard Spark Streaming is blazing fast.
It's a simple overview and I am not sure whether it is a good architecture. Would it be overkill to use Spark Streaming? Are there other ways to store the data received from MQTT directly in DynamoDB?
I cannot state whether your components will result in a scalable architecture, since you did not elaborate on how you will scale them, what the estimated load on such a system will be, or whether there will be load peaks.
If you are talking about scalability in terms of performance, you should also consider scalability in terms of pricing, which may be important to your project.
For instance, DynamoDB is a very scalable NoSQL database service which offers elastic performance with very efficient pricing. I do not know much about Apache Spark, and even though it has been designed to be very efficient at scale, how will you distribute the incoming data? Will you host multiple instances on EC2 and use auto scaling to manage them?
My advice would be to segregate your needs in terms of components to conduct a successful analysis. To summarize your statements:
You need to ingest incoming sensor telemetry at scale using MQTT.
You need to transform or enrich these data on the fly.
You need to insert these data (probably as time-series) into DynamoDB in order to build an event-sourcing system.
Since you mentioned Apache Spark, I imagine you would need to perform some analysis of these data, either in near real-time, or in batch, to build value out of your data.
My advice would be to use serverless, managed services in AWS so that you pay only for what you really use, can forget about maintenance and scalability, and can focus on your project.
AWS IoT is a platform built into AWS which will allow you to securely ingest data at any scale using MQTT.
This platform also embeds a rules engine, which allows you to build your business rules in the cloud: for example, intercepting incoming messages, enriching them, and calling other AWS services as a result (e.g. calling a Lambda function to do some processing on the ingested data).
The rules engine has a native connector to DynamoDB, which will allow you to insert your enriched or transformed data into a table.
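Purely to illustrate the Lambda-plus-DynamoDB path, here is a rough Java handler the rules engine could invoke; the SensorTelemetry table, the field names, and the Celsius-to-Fahrenheit enrichment are made up for the example.

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.PutItemRequest;
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import java.util.HashMap;
import java.util.Map;

public class TelemetryEnricher implements RequestHandler<Map<String, Object>, Void> {

    private final AmazonDynamoDB dynamoDb = AmazonDynamoDBClientBuilder.defaultClient();

    @Override
    public Void handleRequest(Map<String, Object> reading, Context context) {
        Map<String, AttributeValue> item = new HashMap<>();
        item.put("deviceId", new AttributeValue(String.valueOf(reading.get("deviceId"))));
        item.put("ts", new AttributeValue().withN(String.valueOf(reading.get("timestamp"))));
        // Example enrichment: also store the temperature in Fahrenheit
        double celsius = ((Number) reading.get("temperatureC")).doubleValue();
        item.put("temperatureF", new AttributeValue().withN(String.valueOf(celsius * 9 / 5 + 32)));
        dynamoDb.putItem(new PutItemRequest("SensorTelemetry", item));
        return null;
    }
}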
The rules engine also has a connector to the new Amazon Machine Learning service, if you want to get predictions on sensor data in real time.
You can then use other services such as EMR + Spark to batch-process your data once a day, week, or month.
The advantage here is that you assemble your components and use them as you go, meaning that you do not need the fully featured stack when you are beginning, but you still have the flexibility to make changes in the future.
An overview of the AWS IoT service.
I am planning the next generation of an analysis system I'm developing, and I am thinking of implementing it using one of the MapReduce/stream-processing platforms like Flink, Spark Streaming, etc.
For the analysis, the mappers must have DB access.
So my greatest concern is that when a mapper is parallelized, all the connections from the connection pool will be in use and a mapper might fail to access the DB.
How should I handle that?
Is this something I need to be concerned about?
As you have pointed out, a pull-style strategy is going to be inefficient and/or complex.
Your strategy for ingesting the meta-data from the DB will be dictated by the amount of meta-data and the frequency that the meta-data changes. Either way, moving away from fetching the meta-data when it's needed, and toward receiving updates when the meta-data is changed, is likely to be a good approach.
Some ideas (a small sketch follows this list):
Periodically dump the meta-data to flat file(s) in a distributed file system
Stream meta-data updates to your pipeline at write time to keep an in-memory cache up to date
Use a separate mechanism to fetch the meta-data, for instance Akka actors polling for changes
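As a rough sketch of the cache-based ideas, assuming plain JDBC and a hypothetical dimension_meta table, each worker could hold one instance of something like this (e.g. created in a Flink RichFunction's open() or inside a Spark foreachPartition), so the per-record path never touches a connection pool at all:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class MetadataCache {

    private final String jdbcUrl;
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private volatile Map<String, String> snapshot = new HashMap<>();

    public MetadataCache(String jdbcUrl, long refreshSeconds) {
        this.jdbcUrl = jdbcUrl;
        // One short-lived connection per refresh, instead of one per mapped record
        scheduler.scheduleAtFixedRate(this::refresh, 0, refreshSeconds, TimeUnit.SECONDS);
    }

    public String lookup(String key) {
        return snapshot.get(key);   // mappers read from memory only
    }

    private void refresh() {
        Map<String, String> next = new HashMap<>();
        try (Connection con = DriverManager.getConnection(jdbcUrl);
             PreparedStatement ps = con.prepareStatement("SELECT id, payload FROM dimension_meta");
             ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                next.put(rs.getString("id"), rs.getString("payload"));
            }
            snapshot = next;   // swap the whole snapshot atomically
        } catch (SQLException e) {
            // Keep serving the previous snapshot if a refresh fails
        }
    }
}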
It will depend on the trade-offs you are able to make for your given use-case.
If DB interactivity is unavoidable, I do wonder whether map-reduce-style frameworks are the best approach to solve your problem. In any case, failed tasks should be retried by the framework.
The requirement is: query the database every day at, say, 10 pm; based on the result set from the DB, call several third-party services, perform some business operations, and then complete the job.
What is the best possible way to achieve this in Spring? Would Spring Batch or Spring Batch Integration be a good fit?
Given your steps, it would be good for you to take a look at Spring Integration too, and decide for yourself which is the better fit.
Spring Integration provides a JDBC Inbound Channel Adapter, which can poll the DB using a cron trigger. The result of the DB query can then be sent to any other service, e.g. an <int-ws:outbound-gateway> or just a generic <service-activator>.
You can even process several records from the DB in parallel.
I am not sure what you mean by "and then complete the job", but the work will be done automatically after the last record is processed.
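As an illustrative sketch only, a Java DSL equivalent of that flow might look like this; the SQL, the 10 pm cron expression, and the thirdPartyService bean are assumptions for the example:

import javax.sql.DataSource;
import org.springframework.context.annotation.Bean;
import org.springframework.integration.dsl.IntegrationFlow;
import org.springframework.integration.dsl.IntegrationFlows;
import org.springframework.integration.dsl.Pollers;
import org.springframework.integration.jdbc.dsl.Jdbc;

@Bean
public IntegrationFlow nightlyDbFlow(DataSource dataSource, ThirdPartyService thirdPartyService) {
    return IntegrationFlows
            // Run the query every day at 22:00
            .from(Jdbc.inboundAdapter(dataSource, "SELECT * FROM orders WHERE processed = false"),
                    e -> e.poller(Pollers.cron("0 0 22 * * *")))
            .split()                                  // one message per DB row
            .handle(thirdPartyService, "process")     // call the third-party service(s)
            .get();
}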
I think you really could come up with something similar using just Spring Batch, because there are enough useful components there, such as readers for the DB, and you can implement your own writer to call the third-party services.
Plus, you can manage jobs via the JobRepository.
To understand the differences and the scope of each, you should read the manuals of both projects and decide for yourself how to proceed.
We have a Spring Integration application which polls from a database, marshals the result into XML, and then sends that XML as a message to a downstream system. The application works, but the issue we face is that some of the result sets can be massive. In such cases, it appears Spring Integration cannot handle the transformation because the result set is too big to hold in memory. I do not see a StAX marshaller in Spring Integration, as there is in, say, Spring Batch. That actually makes sense, because messaging usually means working with many small messages, not large files.
One option we have is to develop a Spring Batch application instead.
Is there a design we could adopt in Spring Integration to handle this? Does Spring Integration have any notion of streaming? For instance, would it be possible to read the result set in chunks, transform each piece, and send each piece as a separate message which is part of a set? Or is Spring Batch just a better fit?
Thanks very much
You can set max-rows-per-poll on the JDBC inbound channel adapter to limit the size of each result set.
If that's not practical, then you can use a combination of Spring Batch and Spring Integration: have the batch ItemWriter send the chunks into a Spring Integration flow via a <gateway/>.
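A rough sketch of that hand-off, assuming a hypothetical SourceRecord type and the List-based ItemWriter signature from Spring Batch 4.x; the gateway interface is backed by @MessagingGateway (the Java counterpart of <gateway/>) pointing at the channel that starts the marshalling flow:

import java.util.List;
import org.springframework.batch.item.ItemWriter;
import org.springframework.integration.annotation.Gateway;
import org.springframework.integration.annotation.MessagingGateway;

// In its own file: the proxy Spring Integration generates for this interface
// turns each call into one message on toXmlFlowChannel
@MessagingGateway
public interface ChunkGateway {

    @Gateway(requestChannel = "toXmlFlowChannel")
    void send(List<? extends SourceRecord> chunk);
}

// The Spring Batch writer simply delegates each chunk to the gateway
public class GatewayItemWriter implements ItemWriter<SourceRecord> {

    private final ChunkGateway gateway;

    public GatewayItemWriter(ChunkGateway gateway) {
        this.gateway = gateway;
    }

    @Override
    public void write(List<? extends SourceRecord> items) {
        // Each chunk becomes one message, so the XML transformation downstream
        // only ever sees a bounded slice of the overall result set
        gateway.send(items);
    }
}

This way Spring Batch handles the chunked, restartable read, and the existing Spring Integration flow only has to marshal chunk-sized payloads.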