Processing many files with a limited number of threads is not working properly. After a task completes, its thread is not re-used. How can I pass multiple files to a limited number of threads?
@Bean
public Step orderStep1() throws IOException {
    return stepBuilderFactory.get("orderStep1")
            .partitioner("slaveStep", partitioner())
            .step(slaveStep())
            .gridSize(40)
            .taskExecutor(taskExecutor1())
            .build();
}
@Bean
public ThreadPoolTaskExecutor taskExecutor1() {
    ThreadPoolTaskExecutor taskExecutor1 = new ThreadPoolTaskExecutor();
    taskExecutor1.setCorePoolSize(30);
    return taskExecutor1;
}
@Bean
@StepScope
public FlatFileItemReader<GPRSDAO> reader(
        @Value("#{stepExecutionContext['file']}") String file) {
    FlatFileItemReader<GPRSDAO> reader = new FlatFileItemReader<>();
    reader.setResource(new FileSystemResource(file));
    reader.setLineMapper(new DefaultLineMapper<GPRSDAO>() {
        {
            setLineTokenizer(new DelimitedLineTokenizer() {
                {
                    setNames(new String[] { "firstName", "lastName" });
                }
            });
            setFieldSetMapper(new BeanWrapperFieldSetMapper<GPRSDAO>() {
                {
                    setTargetType(GPRSDAO.class);
                }
            });
        }
    });
    // note: no reader.close() here; Spring Batch manages the stream lifecycle
    return reader;
}
There are 100 input files (the file is set dynamically at runtime via the stepExecutionContext).
You set the core pool size to 30 but did not set the maxPoolSize property, which defaults to Integer.MAX_VALUE. You need to set the max pool size explicitly to limit the number of threads.
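A minimal sketch of the executor with an explicit cap (the maxPoolSize and queueCapacity values here are illustrative assumptions, not taken from your configuration):
@Bean
public ThreadPoolTaskExecutor taskExecutor1() {
    ThreadPoolTaskExecutor taskExecutor1 = new ThreadPoolTaskExecutor();
    // core threads stay alive and are re-used for queued partition steps
    taskExecutor1.setCorePoolSize(30);
    // hard cap: the pool can never grow beyond this many threads
    taskExecutor1.setMaxPoolSize(30);
    // remaining partitions wait here until a thread frees up
    taskExecutor1.setQueueCapacity(100);
    return taskExecutor1;
}
With corePoolSize equal to maxPoolSize and a bounded queue, your 40 partitions are served by at most 30 threads, and each thread picks up the next queued partition as soon as it finishes one.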
I have the configuration below for file monitoring using Spring Integration and WebFlux.
It works well, but if I drop in 100 files it will pick up one file at a time with a 10 second gap between the "Received a notification of new file" log messages.
How do I poll for multiple files at once, so I don't have to wait 1000 seconds for all my files to finally register?
@Configuration
@EnableIntegration
public class FileMonitoringConfig {
    private static final Logger logger =
            LoggerFactory.getLogger(FileMonitoringConfig.class.getName());

    @Value("${monitoring.folder}")
    private String monitoringFolder;

    @Value("${monitoring.polling-in-seconds:10}")
    private int pollingInSeconds;

    @Bean
    Publisher<Message<Object>> myMessagePublisher() {
        return IntegrationFlows.from(
                        Files.inboundAdapter(new File(monitoringFolder))
                                .useWatchService(false),
                        e -> e.poller(Pollers.fixedDelay(pollingInSeconds, TimeUnit.SECONDS)))
                .channel(myChannel())
                .toReactivePublisher();
    }

    @Bean
    Function<Flux<Message<Object>>, Publisher<Message<Object>>> myReactiveSource() {
        return flux -> myMessagePublisher();
    }

    @Bean
    FluxMessageChannel myChannel() {
        return new FluxMessageChannel();
    }

    @Bean
    @ServiceActivator(
            inputChannel = "myChannel",
            async = "true",
            reactive = @Reactive("myReactiveSource"))
    ReactiveMessageHandler myMessageHandler() {
        return new ReactiveMessageHandler() {
            @Override
            public Mono<Void> handleMessage(Message<?> message) throws MessagingException {
                return Mono.fromFuture(doHandle(message));
            }

            private CompletableFuture<Void> doHandle(Message<?> message) {
                return CompletableFuture.runAsync(
                        () -> {
                            logger.info("Received a notification of new file: {}", message.getPayload());
                            File file = (File) message.getPayload();
                        });
            }
        };
    }
}
The Inbound Channel Adapter polls a single data record from the source per poll cycle.
Consider adding maxMessagesPerPoll(-1) to your poller() configuration.
See more in the docs: https://docs.spring.io/spring-integration/docs/current/reference/html/core.html#channel-adapter-namespace-inbound
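If I read the flow correctly, the change only touches the poller; a sketch under that assumption:
@Bean
Publisher<Message<Object>> myMessagePublisher() {
    return IntegrationFlows.from(
                    Files.inboundAdapter(new File(monitoringFolder))
                            .useWatchService(false),
                    e -> e.poller(Pollers.fixedDelay(pollingInSeconds, TimeUnit.SECONDS)
                            // -1 = no per-poll limit: drain every file found in the cycle
                            .maxMessagesPerPoll(-1)))
            .channel(myChannel())
            .toReactivePublisher();
}
With that, one poll cycle emits all 100 files instead of one file every 10 seconds.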
I can create a server that issues a lease to a single client as follows:
@Slf4j
public class LeaseServer {
private static final String SERVER_TAG = "server";
public static void main(String[] args) throws InterruptedException {
// Queue for incoming messages represented as Flux
// Imagine that every fireAndForget that is pushed is processed by a worker
int queueCapacity = 50;
BlockingQueue<String> messagesQueue = new ArrayBlockingQueue<>(queueCapacity);
// emulating a worker that process data from the queue
Thread workerThread =
new Thread(
() -> {
try {
while (!Thread.currentThread().isInterrupted()) {
String message = messagesQueue.take();
System.out.println("consume message:" + message);
Thread.sleep(100000); // emulating processing
}
} catch (InterruptedException e) {
throw new RuntimeException(e);
}
});
workerThread.start();
CloseableChannel server = getFireAndForgetServer(messagesQueue, workerThread);
TimeUnit.MINUTES.sleep(10);
server.dispose();
}
private static CloseableChannel getFireAndForgetServer(BlockingQueue<String> messagesQueue, Thread workerThread) {
CloseableChannel server =
RSocketServer.create((setup, sendingSocket) ->
Mono.just(new RSocket() {
@Override
public Mono<Void> fireAndForget(Payload payload) {
// add the element; if the queue overflows, terminate execution.
// This is specifically to show that a lease can limit the rate of
// fire-and-forget requests in this example
try {
if (!messagesQueue.offer(payload.getDataUtf8())) {
System.out.println("Queue has been overflowed. Terminating execution");
sendingSocket.dispose();
workerThread.interrupt();
}
} finally {
payload.release();
}
return Mono.empty();
}
}))
.lease(() -> Leases.create().sender(new LeaseCalculator(SERVER_TAG, messagesQueue)))
.bindNow(TcpServerTransport.create("localhost", 7000));
return server;
}
}
But how do I issue leases to multiple clients connected to that server?
Otherwise my queue will be written to by multiple clients at once, overflowing the service.
I can't find the details in the public documentation and materials.
Your help would be very much appreciated.
The Kafka documentation describes the following approach:
One Consumer Per Thread: A simple option is to give each thread its own consumer instance.
My code:
public class KafkaConsumerRunner implements Runnable {
private final AtomicBoolean closed = new AtomicBoolean(false);
private final CloudKafkaConsumer consumer;
private final String topicName;
public KafkaConsumerRunner(CloudKafkaConsumer consumer, String topicName) {
this.consumer = consumer;
this.topicName = topicName;
}
@Override
public void run() {
try {
this.consumer.subscribe(topicName);
ConsumerRecords<String, String> records;
while (!closed.get()) {
synchronized (consumer) {
records = consumer.poll(100);
}
for (ConsumerRecord<String, String> tmp : records) {
System.out.println(tmp.value());
}
}
} catch (WakeupException e) {
// Ignore exception if closing
System.out.println(e);
//if (!closed.get()) throw e;
}
}
// Shutdown hook which can be called from a separate thread
public void shutdown() {
closed.set(true);
consumer.wakeup();
}
public static void main(String[] args) {
CloudKafkaConsumer kafkaConsumer = KafkaConsumerBuilder.builder()
.withBootstrapServers("172.31.1.159:9092")
.withGroupId("test")
.build();
ExecutorService executorService = Executors.newFixedThreadPool(5);
executorService.execute(new KafkaConsumerRunner(kafkaConsumer, "log"));
executorService.execute(new KafkaConsumerRunner(kafkaConsumer, "log.info"));
executorService.shutdown();
}
}
but it doesn't work and throws an exception:
java.util.ConcurrentModificationException: KafkaConsumer is not safe for multi-threaded access
Furthermore, I read the source of Flink (an open-source platform for distributed stream and batch data processing). Flink's multi-threaded consumer usage is similar to mine:
long pollTimeout = Long.parseLong(flinkKafkaConsumer.properties.getProperty(KEY_POLL_TIMEOUT, Long.toString(DEFAULT_POLL_TIMEOUT)));
pollLoop: while (running) {
ConsumerRecords<byte[], byte[]> records;
//noinspection SynchronizeOnNonFinalField
synchronized (flinkKafkaConsumer.consumer) {
try {
records = flinkKafkaConsumer.consumer.poll(pollTimeout);
} catch (WakeupException we) {
if (running) {
throw we;
}
// leave loop
continue;
}
}
Flink's multi-threaded consumer code
What's wrong?
The Kafka consumer is not thread-safe. As you pointed out in your question, the documentation states:
A simple option is to give each thread its own consumer instance
But in your code, you have the same consumer instance wrapped by different KafkaConsumerRunner instances, so multiple threads are accessing the same consumer instance. The Kafka documentation clearly states:
The Kafka consumer is NOT thread-safe. All network I/O happens in the
thread of the application making the call. It is the responsibility of
the user to ensure that multi-threaded access is properly
synchronized. Un-synchronized access will result in
ConcurrentModificationException.
That's exactly the exception you received.
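A minimal sketch of the one-consumer-per-thread fix, reusing the KafkaConsumerBuilder from your code and assuming it builds independent consumer instances:
public static void main(String[] args) {
    ExecutorService executorService = Executors.newFixedThreadPool(5);
    for (String topic : new String[] { "log", "log.info" }) {
        // build a fresh consumer for each runner so no instance is shared across threads
        CloudKafkaConsumer kafkaConsumer = KafkaConsumerBuilder.builder()
                .withBootstrapServers("172.31.1.159:9092")
                .withGroupId("test")
                .build();
        executorService.execute(new KafkaConsumerRunner(kafkaConsumer, topic));
    }
    executorService.shutdown();
}
Because each thread now owns its consumer exclusively, the synchronized blocks around subscribe and poll become unnecessary.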
It is throwing the exception on your call to subscribe: this.consumer.subscribe(topicName);
Move that call into a synchronized block like this:
@Override
public void run() {
try {
synchronized (consumer) {
this.consumer.subscribe(topicName);
}
ConsumerRecords<String, String> records;
while (!closed.get()) {
synchronized (consumer) {
records = consumer.poll(100);
}
for (ConsumerRecord<String, String> tmp : records) {
System.out.println(tmp.value());
}
}
} catch (WakeupException e) {
// Ignore exception if closing
System.out.println(e);
//if (!closed.get()) throw e;
}
}
This may not be your case, but if you are merging the processing of data from several topics, you can read data from multiple topics with the same consumer. If not, it is preferable to create a separate job consuming each topic.
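For the single-consumer variant, the plain Kafka client accepts a collection of topics in subscribe; a sketch with the broker address and group id taken from the question:
Properties props = new Properties();
props.put("bootstrap.servers", "172.31.1.159:9092");
props.put("group.id", "test");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

// one consumer, owned by a single thread, subscribed to both topics
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Arrays.asList("log", "log.info"));
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(100);
    for (ConsumerRecord<String, String> record : records) {
        System.out.println(record.topic() + ": " + record.value());
    }
}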
I was hoping for some guidance on how to use the EventProcessorHost with a worker role. Basically, I am hoping to have the EventProcessorHost process the partitions in parallel, and I'm wondering where I should place this type of code within the worker role and whether I'm missing anything key.
var manager = NamespaceManager.CreateFromConnectionString(connectionString);
var desc = manager.CreateEventHubIfNotExistsAsync(path).Result;
var client = Microsoft.ServiceBus.Messaging.EventHubClient.CreateFromConnectionString(connectionString, path);
var host = new EventProcessorHost(hostname, path, consumerGroup, connectionString, blobStorageConnectionString);
EventHubProcessorFactory<EventData> factory = new EventHubProcessorFactory<EventData>();
host.RegisterEventProcessorFactoryAsync(factory);
Everything I've read says the EventProcessorHost will divide up the partitions on its own, but is the above code sufficient to process all the partitions asynchronously?
Here's a simplified version of how we process our event hub from a Worker Role. We keep the EventProcessorHost instance in the main Worker Role and register the IEventProcessor to start processing.
This way we can start it up and close it down when the worker responds to shutdown events, etc.
EDIT:
As for processing in parallel: when the IEventProcessor class finishes processing the current batch, it just grabs 10 more events from the event hub, handling all the fancy partition leasing for you.
It's a synchronous workflow. When I scale to multiple worker roles, the partitions get split between instances and processing gets faster, etc. You'd have to roll your own solution if you wanted to process the event hub in a different way.
public class WorkerRole : RoleEntryPoint
{
private readonly CancellationTokenSource _cancellationTokenSource = new CancellationTokenSource();
private readonly ManualResetEvent _runCompleteEvent = new ManualResetEvent(false);
private EventProcessorHost _eventProcessorHost;
public override bool OnStart()
{
ThreadPool.SetMaxThreads(4096, 2048);
ServicePointManager.DefaultConnectionLimit = 500;
ServicePointManager.UseNagleAlgorithm = false;
ServicePointManager.Expect100Continue = false;
var eventClient = EventHubClient.CreateFromConnectionString("consumersConnectionString",
"eventHubName");
_eventProcessorHost = new EventProcessorHost(Dns.GetHostName(), eventClient.Path,
eventClient.GetDefaultConsumerGroup().GroupName,
"consumersConnectionString", "blobLeaseConnectionString");
return base.OnStart();
}
public override void Run()
{
try
{
RunAsync(this._cancellationTokenSource.Token).Wait();
}
finally
{
_runCompleteEvent.Set();
}
}
private async Task RunAsync(CancellationToken cancellationToken)
{
// starts processing here
await _eventProcessorHost.RegisterEventProcessorAsync<EventProcessor>();
while (!cancellationToken.IsCancellationRequested)
{
await Task.Delay(TimeSpan.FromMinutes(1));
}
}
public override void OnStop()
{
_eventProcessorHost.UnregisterEventProcessorAsync().Wait();
_cancellationTokenSource.Cancel();
_runCompleteEvent.WaitOne();
base.OnStop();
}
}
I have multiple processors for specific partitions (you can guarantee FIFO ordering this way), but you can easily implement your own logic, i.e. skip the EventDataProcessor class and dictionary lookup in my example and just put your logic inside the ProcessEventsAsync method.
public class EventProcessor : IEventProcessor
{
private readonly Dictionary<string, IEventDataProcessor> _eventDataProcessors;
public EventProcessor()
{
_eventDataProcessors = new Dictionary<string, IEventDataProcessor>
{
{"A", new EventDataProcessorA()},
{"B", new EventDataProcessorB()},
{"C", new EventDataProcessorC()}
};
}
public Task OpenAsync(PartitionContext context)
{
return Task.FromResult<object>(null);
}
public async Task ProcessEventsAsync(PartitionContext context, IEnumerable<EventData> messages)
{
foreach(EventData eventData in messages)
{
// implement your own logic here, you could just process the data here, just remember that they will all be from the same partition in this block
try
{
IEventDataProcessor eventDataProcessor;
if(_eventDataProcessors.TryGetValue(eventData.PartitionKey, out eventDataProcessor))
{
await eventDataProcessor.ProcessMessage(eventData);
}
}
catch (Exception ex)
{
// log exception
}
}
await context.CheckpointAsync();
}
public async Task CloseAsync(PartitionContext context, CloseReason reason)
{
if (reason == CloseReason.Shutdown)
await context.CheckpointAsync();
}
}
Example of one of our EventDataProcessors:
public interface IEventDataProcessor
{
Task ProcessMessage(EventData eventData);
}
public class EventDataProcessorA : IEventDataProcessor
{
public async Task ProcessMessage(EventData eventData)
{
// Do Something specific with data from Partition "A"
}
}
public class EventDataProcessorB : IEventDataProcessor
{
public async Task ProcessMessage(EventData eventData)
{
// Do Something specific with data from Partition "B"
}
}
Hope this helps; it's been rock solid for us so far and scales easily to multiple instances.
I'm using nested asynchronous query execution with Cassandra. Data is continuously streamed in, and for each incoming record the block of Cassandra operations below is executed. It works fine for a while but then starts throwing a lot of NoHostAvailableExceptions.
Please help me out here.
Cassandra Session Connection code :
I use separate sessions for reads and writes. Each session connects to a different seed node, as I was told this would improve performance.
final com.datastax.driver.core.Session readSession = CassandraManager.connect("10.22.1.144", "fr_repo",
"READ");
final com.datastax.driver.core.Session writeSession = CassandraManager.connect("10.1.12.236", "fr_repo",
"WRITE");
The CassandraManager.connect method is below:
public static Session connect(String ip, String keySpace,String type) {
PoolingOptions poolingOpts = new PoolingOptions();
poolingOpts.setCoreConnectionsPerHost(HostDistance.REMOTE, 2);
poolingOpts.setMaxConnectionsPerHost(HostDistance.REMOTE, 400);
poolingOpts.setMaxSimultaneousRequestsPerConnectionThreshold(HostDistance.REMOTE, 128);
poolingOpts.setMinSimultaneousRequestsPerConnectionThreshold(HostDistance.REMOTE, 2);
cluster = Cluster
.builder()
.withPoolingOptions( poolingOpts )
.addContactPoint(ip)
.withRetryPolicy( DowngradingConsistencyRetryPolicy.INSTANCE )
.withReconnectionPolicy( new ConstantReconnectionPolicy( 100L ) ).build();
Session s = cluster.connect(keySpace);
return s;
}
Database operation code :
ResultSetFuture resultSetFuture = readSession.executeAsync(selectBound.bind(fr.getHashcode()));
Futures.addCallback(resultSetFuture, new FutureCallback<ResultSet>() {
public void onSuccess(com.datastax.driver.core.ResultSet resultSet) {
try {
Iterator<Row> rows = resultSet.iterator();
if (!rows.hasNext()) {
ResultSetFuture resultSetFuture = readSession.executeAsync(selectPrimaryBound
.bind(fr.getPrimaryKeyHashcode()));
Futures.addCallback(resultSetFuture, new FutureCallback<ResultSet>() {
public void onFailure(Throwable arg0) {
}
public void onSuccess(ResultSet arg0) {
Iterator<Row> rows = arg0.iterator();
if (!rows.hasNext()) {
writeSession.executeAsync(insertBound.bind(fr.getHashcode(), fr,
System.currentTimeMillis()));
writeSession.executeAsync(insertPrimaryBound.bind(
fr.getHashcode(),
fr.getCombinedPrimaryKeys(), System.currentTimeMillis()));
produceintoQueue(new Gson().toJson(frCompleteMap));
} else {
writeSession.executeAsync(updateBound.bind(fr,
System.currentTimeMillis(), fr.getHashcode()));
produceintoQueue(new Gson().toJson(frCompleteMap));
}
}
});
} else {
writeSession.executeAsync(updateLastSeenBound.bind(System.currentTimeMillis(),
fr.getHashcode()));
}
    } catch (Exception e) {
        e.printStackTrace();
    }
}

public void onFailure(Throwable t) {
    // required to complete the anonymous FutureCallback
    t.printStackTrace();
}
});
It sounds like you're sending more requests than your pool/cluster can handle. This is pretty easy to do when you're never actually waiting for a result, as is the case in your code. You're essentially just throwing as many requests as you can into the pipeline with no blocking, and there's no natural back pressure to slow down your app if the pool or cluster get backed up. So if your request volume is too high, eventually all the hosts will be busy with the backed up work queue. You can use nodetool tpstats to see what your request queues look like on each node.
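One way to add that missing back pressure (a sketch of the idea, not code from the question; the permit count is an illustrative value to tune for your cluster) is to gate submissions with a semaphore:
// Bound the number of outstanding async requests so the producer
// blocks instead of flooding the cluster.
final Semaphore inFlight = new Semaphore(256);

inFlight.acquireUninterruptibly(); // waits while too many requests are outstanding
ResultSetFuture future = readSession.executeAsync(selectBound.bind(fr.getHashcode()));
Futures.addCallback(future, new FutureCallback<ResultSet>() {
    public void onSuccess(ResultSet rs) {
        inFlight.release(); // free a slot once the server has answered
        // ... existing handling ...
    }

    public void onFailure(Throwable t) {
        inFlight.release();
        t.printStackTrace();
    }
});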