Efficient way to do a batch job in spring

The requirement is to query the database every day at, say, 10 pm. Based on the result set from the DB, I need to call several third-party services, perform some business operations, and then complete the job.
What is the best way to achieve this in Spring? Would Spring Batch or Spring Batch Integration be a good fit?

Given your steps, it would be worth looking into Spring Integration as well and deciding for yourself which fits best.
Spring Integration provides a JDBC Inbound Channel Adapter, which can poll the DB using a Cron Trigger. You can send the result of the DB query to any other service, e.g. <int-ws:outbound-gateway> or just a generic <service-activator>.
You can even process several records from the DB in parallel.
I'm not sure what you mean by "and then complete the job", but the work finishes automatically after the last record has been processed.
I think you can build something similar using just Spring Batch, because it has plenty of useful components, such as readers for the DB, and you can implement your own to call third-party services.
Plus you can manage jobs via the job repository.
To understand the difference and scope, read the manuals of both projects and decide for yourself how to proceed.
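As a framework-free illustration of the "process several records in parallel, finish after the last record" idea above, here is a minimal sketch in plain Java. It is not Spring Integration or Spring Batch code; the class name `NightlyFanOut`, the row values, and the stand-in service function are all hypothetical, and the scheduling itself (the 10 pm trigger) is left out.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;
import java.util.stream.Collectors;

class NightlyFanOut {

    // Calls `service` once per row on a fixed-size pool and returns the results
    // in row order. The method returns only after the last row has been processed.
    static <R, T> List<T> processAll(List<R> rows, Function<R, T> service, int parallelism) {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            // Submit every row first so the calls can overlap...
            List<CompletableFuture<T>> futures = rows.stream()
                    .map(row -> CompletableFuture.supplyAsync(() -> service.apply(row), pool))
                    .collect(Collectors.toList());
            // ...then wait for each result in the original order.
            return futures.stream()
                    .map(CompletableFuture::join)
                    .collect(Collectors.toList());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Hypothetical rows from the nightly query and a stand-in for a third-party call.
        List<String> rows = List.of("order-1", "order-2", "order-3");
        System.out.println(processAll(rows, row -> row + ":done", 2));
    }
}
```

In either Spring project the equivalent parallelism would normally come from the framework's own executor support rather than a hand-rolled pool like this one.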

Related

Should I use a Webhook or AWS queue (SQS)?

I've been implementing an AWS SQS service for my project. I have 2 projects (microservices) and I want to sync data from one to the other. I intend to use SQS, but I am also considering a webhook. I know some of the basic pros and cons of each. So, my question is: should I use a webhook or SQS for my case?
Thanks for any help!

First of all, if you wish to sync 2 databases, you would probably want something that does not depend on your service. Try reading about change data capture; log scanners are a safe way to do that, and Debezium is a strong tool for it.
Second, if you wish to go with your own implementation, I would suggest the queueing approach. Its biggest advantage shows when the second service is down: with webhooks the information will be lost, while a queue (SQS or any other) will keep the data until the service is up again.
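The difference described above can be simulated in a few lines of plain Java. This is only a sketch under stated assumptions: `Receiver` is a hypothetical stand-in for the second microservice, the webhook sender is assumed not to retry, and an in-memory `BlockingQueue` stands in for SQS (no AWS API is used).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class QueueVsWebhook {

    // Stand-in for the receiving microservice; `up` simulates availability.
    static class Receiver {
        boolean up = true;
        final List<String> received = new ArrayList<>();

        // Returns false when the service is down, as a failed HTTP call would.
        boolean accept(String change) {
            if (!up) {
                return false;
            }
            received.add(change);
            return true;
        }
    }

    // Webhook style (assumed: no retry). If the receiver is down, the change is gone.
    static void pushWebhook(Receiver receiver, String change) {
        receiver.accept(change);
    }

    // Queue style: hand over what the receiver can take and leave the rest queued.
    static void drain(BlockingQueue<String> queue, Receiver receiver) {
        String next;
        while ((next = queue.peek()) != null && receiver.accept(next)) {
            queue.poll(); // remove only after the receiver accepted the message
        }
    }

    public static void main(String[] args) {
        Receiver receiver = new Receiver();
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        receiver.up = false;
        queue.add("user-42 updated");
        drain(queue, receiver);   // nothing is delivered, nothing is lost
        receiver.up = true;
        drain(queue, receiver);   // delivered once the service is back
        System.out.println(receiver.received);
    }
}
```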

SQS is your best bet here, for a couple of reasons:
- Reliability in case something is down.
- Ability to repopulate other microservices. For example, if you decide to create another microservice and need to populate it with data from the start, you will probably read everything from service 1 and put it in the queue for the new microservice.
- Scalability. Queues make your architecture horizontally scalable: just add machines that read from the queue and do the work in parallel.

CQRS and Event Sourcing Guide

I want to create a CQRS and Event Sourcing architecture that is very cheap, very flexible, and very uncomplicated.
I want to make sure that events never, ever fail to at least reach the publisher/event store, because that's where the business is.
Now, I have several options in mind:
Azure
With Azure, I don't know what to use:
- Azure Service Bus
- Azure Functions
- Azure WebJobs (I suppose this can be replaced with Azure Functions)
- ?? (something else I forgot or don't know?)
How reliable are these Azure serverless solutions?
Custom
For this I am thinking of using RabbitMQ; the problem is the cost of a virtual machine to run it.
All in all, I want:
- Ability to replay the messages/events in case of failure.
- Ability to easily add subscribers.
- Ability to select the subscribers upon which to replay the messages.
- The event store should be able to store very large event messages (or how else should I queue an image or file?).
- The event store MUST NEVER EVER get choked, or sleep.
- Speed of implementation/prototyping would be an added advantage.
What does your experience suggest?
What about other alternatives (e.g. Apache Kafka)?

Why not run Event Store? Created by Greg Young himself. Host where you need.

I am a Java user and have been using HornetQ (now Artemis, which I don't use), an alternative to RabbitMQ, for the longest time; its only problem is that it does not support replication, but it gets the job done when it comes to event sourcing. For your custom scenario, RabbitMQ is a good choice, but try running it on a DigitalOcean instance to keep costs low. If you are looking for simplicity and flexibility, you have only 2 choices: build your own, or forgo simplicity and pick up Apache Kafka, which brings complexity but will give you flexibility. You can also build an event store with MongoDB: https://www.mongodb.com/blog/post/event-sourcing-with-mongodb

Your requirements are too vague to make an optimal choice. You need to consider a lot of things, for instance the number of events per aggregate and the number of aggregates (note that these have to be statistical estimates). Those matter primarily because if you allow tens of thousands of events per aggregate, you will need snapshotting, which adds complexity you might not otherwise need.
But for regular use cases you could just use a relational database like Postgres as your (linearizable) event store. It also has LISTEN/NOTIFY functionality, so you would not really need a message bus either, and your application could be written in a reactive way.
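To make the event-store idea concrete, here is a minimal in-memory sketch of the two operations such a store needs: an append guarded by an expected version, and a replay that folds events back into state. Everything here is hypothetical (the class name, the string-encoded events); it is not Postgres code. In a relational store, one plausible way to get the same guard is a unique constraint on (aggregate id, version), but that is an assumption, not something the answer above specifies.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

class InMemoryEventStore {

    // aggregateId -> ordered events; an event's version is its position in the list + 1.
    private final Map<String, List<String>> streams = new HashMap<>();

    // Appends only if the caller has seen the latest version (optimistic concurrency).
    // Returns the new version of the aggregate's stream.
    synchronized int append(String aggregateId, int expectedVersion, String event) {
        List<String> stream = streams.computeIfAbsent(aggregateId, id -> new ArrayList<>());
        if (stream.size() != expectedVersion) {
            throw new IllegalStateException(
                    "expected version " + expectedVersion + " but stream is at " + stream.size());
        }
        stream.add(event);
        return stream.size();
    }

    // Rebuilds state by applying every stored event, oldest first.
    synchronized <S> S replay(String aggregateId, S initial, BiFunction<S, String, S> apply) {
        S state = initial;
        for (String event : streams.getOrDefault(aggregateId, List.of())) {
            state = apply.apply(state, event);
        }
        return state;
    }

    public static void main(String[] args) {
        InMemoryEventStore store = new InMemoryEventStore();
        store.append("acct-1", 0, "deposit:100");
        store.append("acct-1", 1, "withdraw:30");
        // Count the events as a trivial replay.
        System.out.println(store.replay("acct-1", 0, (count, event) -> count + 1));
    }
}
```

Snapshotting, as mentioned above, would amount to storing an intermediate `state` together with its version so that replay can start from there instead of from the first event.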

Which tools to use when migrating bounded data?

I recently started working on a content repository migration project between two different content management systems.
We have around 11 petabytes of documents in a source repository. We want to migrate all of them one document at a time by querying with source system API and saving through destination system API.
We will have a single standalone machine for this migration and should be able to manage (start, stop, resume) the whole process.
What platforms and tools would you suggest for such a task? Is Flink's DataSet API for bounded data suitable for this job?

Flink's DataStream API is probably a better choice than the DataSet API because the streaming API can be stopped/resumed and can recover from failures. By contrast, the DataSet API reruns failed jobs from the beginning, which isn't a good fit for a job that might run for days (or weeks).
While Flink's streaming API is designed for unbounded data streams, it also works very well for bounded datasets.
If the underlying CMSes can support doing the migration in parallel, Flink would easily accommodate this. The Async I/O feature would be helpful in that context. But if you are going to do the migration serially, then I'm not sure you'll get much benefit from a framework like Flink or Spark.

Basically what David said above. The main challenge I think you'll run into is tracking progress such that checkpointing/savepointing (and thus restarting) works properly.
This assumes you have some reasonably efficient and stable way to enumerate the unique IDs for all 1B documents in the source system. One approach we've used in a previous migration project (though not with Flink) was to use the document creation timestamp as the "event time".
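The progress-tracking concern can be sketched without any framework: keep a cursor into a stable enumeration of document IDs and advance it only after a document has been copied. This is a simplified, single-threaded sketch, not Flink checkpointing; the class name, the `copyOne` callback, and the in-memory cursor are hypothetical, and a real run would persist the cursor durably.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

class ResumableMigration {

    // Index of the next document to migrate. A real run would persist this
    // (file, database row) so that a restarted process can resume from it.
    private int cursor = 0;

    int cursor() {
        return cursor;
    }

    // Migrates documents in enumeration order until done or until `stop` is set.
    // Returns true once every document has been migrated.
    boolean run(List<String> sortedIds, Consumer<String> copyOne, AtomicBoolean stop) {
        while (cursor < sortedIds.size() && !stop.get()) {
            copyOne.accept(sortedIds.get(cursor)); // read via source API, write via destination API
            cursor++;                              // advance only after the copy succeeded
        }
        return cursor == sortedIds.size();
    }

    public static void main(String[] args) {
        List<String> ids = List.of("doc-1", "doc-2", "doc-3");
        List<String> copied = new ArrayList<>();
        ResumableMigration migration = new ResumableMigration();
        migration.run(ids, copied::add, new AtomicBoolean(false));
        System.out.println(copied);
    }
}
```

Because the cursor moves only after a successful copy, a crash in the middle of a document means that document is copied again on resume, so the destination write should tolerate repeats.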

Spring Cloud DataFlow http polling and deduplication

I have been reading much Spring Cloud DataFlow and related documentation in order to produce a data ingest solution that will run in my organization's Cloud Foundry deployment. The goal is to poll an HTTP service for data, perhaps three times per day for the sake of discussion, and insert/update that data in a PostgreSQL database. The HTTP service seems to provide 10s of thousands of records per day.
One point of confusion thus far is a best practice in the context of a DataFlow pipeline for deduplicating polled records. The source data do not have a timestamp field to aid in tracking polling, only a coarse day-level date field. I also have no guarantee that records are not ever updated retroactively. The records appear to have a unique ID, so I can dedup the records that way, but I am just not sure based on the documentation how best to implement that logic in DataFlow. As far as I can tell, the Spring Cloud Stream starters do not provide for this out-of-the-box. I was reading about Spring Integration's smart polling, but I'm not sure that's meant to address my concern either.
My intuition is to create a custom Processor Java component in a DataFlow Stream that performs a database query to determine whether polled records have already been inserted, then inserts the appropriate records into the target database, or passes them on down the stream. Is querying the target database in an intermediate step acceptable in a Stream app? Alternatively, I could implement this all in a Spring Cloud Task as a batch operation which triggers based on some schedule.
What is the best way to proceed with respect to a DataFlow app? What are common/best practices for achieving deduplication as I described above in a DataFlow/Stream/Task/Integration app? Should I copy the setup of a starter app or just start from scratch, because I am fairly certain I'll need to write custom code? Do I even need Spring Cloud DataFlow, because I'm not sure I'll be using its DSL at all? Apologies for all the questions, but being new to Cloud Foundry and all these Spring projects, it's daunting to piece it all together.
Thanks in advance for any help.

You are on the right track; given your requirements, you will most likely need to create a custom processor. You need to keep track of what has been inserted in order to avoid duplication.
There's nothing preventing you from writing such a processor in a stream app; however, performance may take a hit, since you will issue a DB query for each record.
If order is not important, you could parallelize the query so you could process several concurrent messages, but in the end your DB would still pay the price.
Another approach would be to use a Bloom filter, which can help quite a lot in speeding up the check for already-inserted records.
You can start by cloning the starter apps: you could have a poller trigger an http client processor that fetches your data, then go through your custom code processor, and finally to a jdbc sink. Something like:
stream create time --trigger.cron=<CRON_EXPRESSION> | httpclient --httpclient.url-expression=<remote_endpoint> | customProcessor | jdbc
One of the advantages of using SCDF is that you could independently scale your custom processor via deployment properties such as deployer.customProcessor.count=8.
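For readers unfamiliar with the Bloom filter suggestion, here is a minimal sketch of one in plain Java. It is illustrative only: the class name, the choice of hash functions (String.hashCode combined with FNV-1a via double hashing), and the sizing are assumptions, and a production system would more likely use an existing library implementation. The useful property is that a "no" answer is definite, so only "maybe" answers need a database lookup.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

class SimpleBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    SimpleBloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // Derives the i-th bit position from two base hashes (double hashing).
    private int position(String key, int i) {
        int h1 = key.hashCode();
        int h2 = fnv1a(key.getBytes(StandardCharsets.UTF_8));
        return Math.floorMod(h1 + i * h2, size);
    }

    // 32-bit FNV-1a hash; integer overflow is intended.
    private static int fnv1a(byte[] data) {
        int hash = 0x811c9dc5;
        for (byte b : data) {
            hash ^= (b & 0xff);
            hash *= 16777619;
        }
        return hash;
    }

    void add(String key) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(position(key, i));
        }
    }

    // false: the key was definitely never added.
    // true: the key was probably added; confirm against the database.
    boolean mightContain(String key) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(position(key, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        SimpleBloomFilter seen = new SimpleBloomFilter(1 << 16, 3);
        seen.add("record-1001"); // hypothetical record ID
        System.out.println(seen.mightContain("record-1001"));
    }
}
```

Note that a filter like this lives in one process's memory, so with several scaled-out processor instances each would hold its own copy unless the state is shared some other way.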

Spring Cloud Data Flow builds data integration streams on top of Spring Cloud Stream, which, in turn, is fully based on Spring Integration. So all the principles that exist in Spring Integration can be applied at the SCDF level as well.
It may well be that you won't be able to avoid some coding, but what you need is called an Idempotent Receiver in EIP. And Spring Integration provides one for us:
@ServiceActivator(inputChannel = "processChannel")
@IdempotentReceiver("idempotentReceiverInterceptor")
public void handle(Message<?> message)
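The idea behind the Idempotent Receiver pattern can be shown in plain Java, independent of Spring. The sketch below is not Spring Integration's implementation: `IdempotentHandler`, its key function, and the in-memory set of seen keys are hypothetical stand-ins for whatever interceptor and store the framework is configured with.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;
import java.util.function.Function;

class IdempotentHandler<M> implements Consumer<M> {

    // Keys of messages that were already handled (in-memory stand-in for a durable store).
    private final Set<String> seenKeys = ConcurrentHashMap.newKeySet();
    private final Function<M, String> keyOf;
    private final Consumer<M> delegate;

    IdempotentHandler(Function<M, String> keyOf, Consumer<M> delegate) {
        this.keyOf = keyOf;
        this.delegate = delegate;
    }

    @Override
    public void accept(M message) {
        // add() returns false when the key is already present, i.e. the message is a duplicate.
        if (seenKeys.add(keyOf.apply(message))) {
            delegate.accept(message);
        }
    }

    public static void main(String[] args) {
        List<String> handled = new ArrayList<>();
        // Hypothetical message format "<recordId>:<payload>"; the record ID is the dedup key.
        IdempotentHandler<String> handler =
                new IdempotentHandler<>(m -> m.split(":")[0], handled::add);
        handler.accept("7:first");
        handler.accept("7:again");
        System.out.println(handled);
    }
}
```

This version silently drops duplicates; routing them elsewhere or flagging them instead would be an equally valid design, depending on what downstream components need.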

spring integration flow with jdbc output

I am looking to understand what is the best way to persist the output of an integration flow containing an aggregator (with release strategy spanning approx. 3 minutes) into a datasource.
I've been considering JDBC, but open to other options as well. Ideally I could use a spring integration component that deals well with buffered writes, maintaining jdbc connections with low overhead and that can operate at a spring data level (I want to work with a domain model not with raw SQL updates).
JdbcOutboundGateway and JdbcMessageHandler seem to be limited to direct SQL interaction.
I considered JdbcChannelMessageStore, but it seemed that it would be limited to a binary representation of the messages.
Is a service activator a more suited choice? Would I have to read service activator inputs from a pollable queue in order to ensure better write performance?
Transactional constraints are also under consideration for this integration flow with the goal to accommodate about 10K messages/hour.
Any directions are highly appreciated. Have been having trouble finding resources about JDBC output from integration flows (mostly mentions of jdbc message handlers and gateways without specific considerations for performance and data model flexibility).
