How to fix unexpected Hazelcast client shutdown - hazelcast-jet

I'm using Hazelcast Jet to perform aggregations on stream data. The problem is that the Hazelcast client shuts down unexpectedly.
I've implemented a simple pipeline with a remote map source, and the result is simply sinked into a map.
// init the pipeline
Pipeline p = Pipeline.create();
// configure the source: entries of a remote IMap, read through a client config
BatchSource<Map.Entry<Object, Object>> remoteBatchMap = Sources.remoteMap(<my remote map>, <my config>);
// add the source and sink to the pipeline
p.drawFrom(remoteBatchMap).drainTo(Sinks.map(SINK_MAP_NAME));
On the client side, the output is as expected for roughly the first 30 seconds. Then the shutdown happens, and from that point on the printed values freeze. OK, that is logical, since the client has been shut down. But how do I prevent the shutdown?
2019-07-25 14:22:18,214 INFO com.betex.service.FixtureOddTotalSummaryImpl [SockJS-2] Number of sink elements vs original (BCK): 254/41254
2019-07-25 14:22:19,359 INFO com.betex.service.FixtureOddTotalSummaryImpl [SockJS-2] Number of sink elements vs original (BCK): 262/41254
2019-07-25 14:22:20,496 INFO com.betex.service.FixtureOddTotalSummaryImpl [SockJS-2] Number of sink elements vs original (BCK): 269/41259
2019-07-25 14:22:20,786 INFO com.hazelcast.logging.StandardLoggerFactory$StandardLogger [hz._hzInstance_1_jet.async.thread-8] betex0.7899090253375379 [app] [3.1] [3.12.1] HazelcastClient 3.12.1 (20190611 - 0a0ee66) is SHUTTING_DOWN
2019-07-25 14:22:20,791 INFO com.hazelcast.logging.StandardLoggerFactory$StandardLogger [hz._hzInstance_1_jet.async.thread-8] betex0.7899090253375379 [app] [3.1] [3.12.1] Removed connection to endpoint: [192.168.41.3]:5701, connection: ClientConnection{alive=false, connectionId=1, channel=NioChannel{/192.168.26.78:64217->/192.168.41.3:5701}, remoteEndpoint=[192.168.41.3]:5701, lastReadTime=2019-07-25 14:22:19.980, lastWriteTime=2019-07-25 14:22:19.855, closedTime=2019-07-25 14:22:20.789, connected server version=3.12.1}
2019-07-25 14:22:20,794 INFO com.hazelcast.logging.StandardLoggerFactory$StandardLogger [hz._hzInstance_1_jet.async.thread-8] betex0.7899090253375379 [app] [3.1] [3.12.1] Removed connection to endpoint: [192.168.41.4]:5701, connection: ClientConnection{alive=false, connectionId=2, channel=NioChannel{/192.168.26.78:64218->/192.168.41.4:5701}, remoteEndpoint=[192.168.41.4]:5701, lastReadTime=2019-07-25 14:22:20.525, lastWriteTime=2019-07-25 14:22:20.376, closedTime=2019-07-25 14:22:20.793, connected server version=3.12.1}
2019-07-25 14:22:20,797 INFO com.hazelcast.logging.StandardLoggerFactory$StandardLogger [hz._hzInstance_1_jet.async.thread-8] betex0.7899090253375379 [app] [3.1] [3.12.1] HazelcastClient 3.12.1 (20190611 - 0a0ee66) is SHUTDOWN
2019-07-25 14:22:20,802 INFO com.hazelcast.logging.StandardLoggerFactory$StandardLogger [hz._hzInstance_1_jet.async.thread-8] [192.168.1.66]:5701 [jet] [3.1] Execution of job '8dc4-d1e2-df66-a444', execution 9622-ba74-b907-150c completed in 42,335 ms
2019-07-25 14:22:21,635 INFO com.betex.service.FixtureOddTotalSummaryImpl [SockJS-2] Number of sink elements vs original (BCK): 41246/41259
2019-07-25 14:22:22,771 INFO com.betex.service.FixtureOddTotalSummaryImpl [SockJS-2] Number of sink elements vs original (BCK): 41246/41259
2019-07-25 14:22:23,909 INFO com.betex.service.FixtureOddTotalSummaryImpl [SockJS-2] Number of sink elements vs original (BCK): 41246/41259
On the server side it says that the connection was closed by the other side, i.e. by my client:
2019-07-25 14:22:21.909 INFO 21375 --- [hz.betex.IO.thread-in-2] com.hazelcast.nio.tcp.TcpIpConnection : [192.168.41.3]:5701 [app] [3.1] Connection[id=159, /192.168.41.3:5701->192.168.26.78/192.168.26.78:64217, qualifier=null, endpoint=[192.168.26.78]:64217, alive=false, type=JAVA_CLIENT] closed. Reason: Connection closed by the other side
2019-07-25 14:22:21.910 INFO 21375 --- [hz.betex.event-14] c.h.client.impl.ClientEndpointManager : [192.168.41.3]:5701 [app] [3.1] Destroying ClientEndpoint{connection=Connection[id=159, /192.168.41.3:5701->192.168.26.78/192.168.26.78:64217, qualifier=null, endpoint=[192.168.26.78]:64217, alive=false, type=JAVA_CLIENT], principal='ClientPrincipal{uuid='c5286586-cbe2-4c84-8e74-4c2f1f59310a', ownerUuid='ebce22c4-ed31-4ccf-9808-b19005dc55f8'}, ownerConnection=true, authenticated=true, clientVersion=3.12.1, creationTime=1564057300564, latest statistics=null}
I'd be very happy to get some orientation and ideas on where to look for the problem.

If you haven't already wrapped the code in a try/catch, I'd try that. I remember running into something similar but can't recall the root cause; it may have been a ClassCastException or something serialization-related. There wasn't any clue in the output but once I added the try/catch and dumped a stack trace the issue was obvious.
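For example, a minimal sketch of what that could look like, assuming the pipeline p from the question is submitted through a JetInstance (the instance name and the submission code are illustrative, not taken from the original post):
import com.hazelcast.jet.Jet;
import com.hazelcast.jet.JetInstance;
import com.hazelcast.jet.Job;

JetInstance jet = Jet.newJetInstance();   // or Jet.newJetClient(), depending on your setup
try {
    Job job = jet.newJob(p);              // submit the pipeline built above
    job.join();                           // blocks until the job finishes and rethrows the failure cause
} catch (Exception e) {
    e.printStackTrace();                  // e.g. a ClassCastException or a serialization problem
}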

The cluster is independent of the clients. A Jet client can be used to submit a job and to monitor it, but if the client shuts down, the cluster isn't affected and the job continues to run.
You don't share your code, but you probably shut down the client yourself. You need to fix that in your code.
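As a hedged illustration only (the original code isn't shown), this is the kind of pattern to look for, i.e. a client shutdown that runs before the job is finished; it uses the same assumed names and imports as the sketch in the previous answer:
// Illustrative sketch; the variable names are assumptions, not the asker's code.
JetInstance jet = Jet.newJetClient();     // client used to submit the pipeline p
Job job = jet.newJob(p);

// A premature call like this shuts the client down while the job keeps running on the cluster:
// jet.shutdown();

job.join();                               // wait for the job to complete first
jet.shutdown();                           // then shut the client down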

Related

Error Code PINOT_UNABLE_TO_FIND_BROKER: No valid brokers found

I am trying to query Pinot table data using Presto; below are my configuration details.
Pinot is started on one of the SIT servers, i.e. 10.184.160.52:
Controller: 10.184.160.52:9000
Server: 10.184.160.52:7000
Broker: 10.184.160.52:8000
Presto is on a different server, 10.184.160.53; ports are open between these two servers.
Created a pinot.properties file at presto/etc/catalog/pinot.properties:
connector.name=pinot
pinot.controller-urls=Controller_Host:9000
bin/launcher run ---> loaded the Pinot catalog.
Started Presto with the Pinot catalog:
./presto --server 10.184.160.53:8080 --catalog pinot
show catalogs; (able to see my catalog)
pinot
show schemas; (able to see the schema also)
presto> show schemas;
Schema
--------------------
default
presto> use default;
USE
presto:default> show tables; (able to see the Pinot tables:)
Table
------------------------------
test
test2
test3
(3 rows)
Query 20210519_124218_00061_vcz4u, FINISHED, 1 node
Splits: 19 total, 19 done (100.00%)
0:00 [3 rows, 98B] [10 rows/s, 340B/s]
But when I run select * from test; it shows that no broker was found:
presto:default> select * from test;
Query 20210519_124230_00062_vcz4u failed: No valid brokers found for test
Complete Presto Logs:
Error Code PINOT_UNABLE_TO_FIND_BROKER (84213767)
Stack Trace
io.prestosql.pinot.PinotException: No valid brokers found for test
at io.prestosql.pinot.client.PinotClient.getBrokerHost(PinotClient.java:285)
at io.prestosql.pinot.client.PinotClient.sendHttpGetToBrokerJson(PinotClient.java:185)
at io.prestosql.pinot.client.PinotClient.getRoutingTableForTable(PinotClient.java:302)
at io.prestosql.pinot.PinotSplitManager.generateSplitsForSegmentBasedScan(PinotSplitManager.java:72)
at io.prestosql.pinot.PinotSplitManager.getSplits(PinotSplitManager.java:167)
at io.prestosql.split.SplitManager.getSplits(SplitManager.java:87)
at io.prestosql.sql.planner.DistributedExecutionPlanner$Visitor.visitScanAndFilter(DistributedExecutionPlanner.java:203)
at io.prestosql.sql.planner.DistributedExecutionPlanner$Visitor.visitTableScan(DistributedExecutionPlanner.java:185)
at io.prestosql.sql.planner.DistributedExecutionPlanner$Visitor.visitTableScan(DistributedExecutionPlanner.java:156)
at io.prestosql.sql.planner.plan.TableScanNode.accept(TableScanNode.java:143)
at io.prestosql.sql.planner.DistributedExecutionPlanner.doPlan(DistributedExecutionPlanner.java:124)
at io.prestosql.sql.planner.DistributedExecutionPlanner.doPlan(DistributedExecutionPlanner.java:131)
at io.prestosql.sql.planner.DistributedExecutionPlanner.plan(DistributedExecutionPlanner.java:101)
at io.prestosql.execution.SqlQueryExecution.planDistribution(SqlQueryExecution.java:470)
at io.prestosql.execution.SqlQueryExecution.start(SqlQueryExecution.java:386)
at io.prestosql.execution.SqlQueryManager.createQuery(SqlQueryManager.java:237)
at io.prestosql.dispatcher.LocalDispatchQuery.lambda$startExecution$7(LocalDispatchQuery.java:143)
at io.prestosql.$gen.Presto_350____20210519_105836_2.run(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
I am not able to understand what is happening here and why this issue shows up for the select statement. It looks like the Pinot broker is not accepting queries. Can someone kindly suggest what the issue is?
Update: This is because the connector does not support mixed-case table names. Mixed-case column names are supported. There is a pull request to add support for mixed-case table names: https://github.com/trinodb/trino/pull/7630

Airflow: get messages from Kafka

I am trying to get messages from Kafka in Airflow with the python-kafka package.
In a plain Python script it works, but in Airflow I get the following messages from the Kafka consumer and no messages from Kafka:
[2020-09-07 17:51:08,046] {logging_mixin.py:112} INFO - [2020-09-07 17:51:08,046] {parser.py:166} DEBUG - Processing response MetadataResponse_v1
[2020-09-07 17:51:08,047] {logging_mixin.py:112} INFO - [2020-09-07 17:51:08,047] {conn.py:1071} DEBUG - <BrokerConnection node_id=2 host=10.1.25.112:9092 <connected> [IPv4 ('10.1.25.112', 9092)]> Response 28 (55.747270584106445 ms): MetadataResponse_v1(brokers=[(node_id=2, host='10.1.25.112', port=9092, rack=None), (node_id=3, host='10.1.25.113', port=9092, rack=None), (node_id=1, host='10.1.25.111', port=9092, rack=None)], controller_id=1, topics=[(error_code=0, topic='dev.tracking.nifi.rmsp.monthly.flow.downloading', is_internal=False, partitions=[(error_code=0, partition=0, leader=3, replicas=[3], isr=[3])])])
[2020-09-07 17:51:08,048] {logging_mixin.py:112} INFO - [2020-09-07 17:51:08,047] {cluster.py:325} DEBUG - Updated cluster metadata to ClusterMetadata(brokers: 3, topics: 1, groups: 0)
[2020-09-07 17:51:08,048] {logging_mixin.py:112} INFO - [2020-09-07 17:51:08,048] {fetcher.py:296} DEBUG - Stale metadata was raised, and we now have an updated metadata. Rechecking partition existance
[2020-09-07 17:51:08,048] {logging_mixin.py:112} INFO - [2020-09-07 17:51:08,048] {fetcher.py:299} DEBUG - Removed partition TopicPartition(topic='dev.tracking.nifi.rmsp.monthly.flow.downloading', partition=(0,)) from offsets retrieval
[2020-09-07 17:51:08,049] {logging_mixin.py:112} INFO - [2020-09-07 17:51:08,048] {fetcher.py:247} DEBUG - Could not find offset for partition TopicPartition(topic='dev.tracking.nifi.rmsp.monthly.flow.downloading', partition=(0,)) since it is probably deleted
[2020-09-07 17:51:08,107] {logging_mixin.py:112} INFO - [2020-09-07 17:51:08,107] {group.py:453} DEBUG - Closing the KafkaConsumer.
The error occurs when this code is called:
# Issue #1780
# Recheck partition existence after after a successful metadata refresh
if refresh_future.succeeded() and isinstance(future.exception, Errors.StaleMetadata):
    log.debug("Stale metadata was raised, and we now have an updated metadata. Rechecking partition existance")
    unknown_partition = future.exception.args[0]  # TopicPartition from StaleMetadata
    if self._client.cluster.leader_for_partition(unknown_partition) is None:
        log.debug("Removed partition %s from offsets retrieval" % (unknown_partition, ))
        timestamps.pop(unknown_partition)
Why can't I get the topic leader?
I think you need to provide some more info on your problem and setup.
We cannot see the setup: for example, are you using Docker for Airflow and a local machine for Kafka?
Then some people might be able to help you out.

Logstash 6.2 - full persistent queue (wrong mapping?)

My queue is almost full and I see these errors in my log file:
[2018-05-16T00:01:33,334][WARN ][logstash.outputs.elasticsearch] Could not index event to Elasticsearch. {:status=>400, :action=>["index", {:_id=>nil, :_index=>"2018.05.15-el-mg_papi-prod", :_type=>"doc", :_routing=>nil}, #<LogStash::Event:0x608d85c1>], :response=>{"index"=>{"_index"=>"2018.05.15-el-mg_papi-prod", "_type"=>"doc", "_id"=>"mHvSZWMB8oeeM9BTo0V2", "status"=>400, "error"=>{"type"=>"mapper_parsing_exception", "reason"=>"failed to parse [papi_request_json.query.disableFacets]", "caused_by"=>{"type"=>"i_o_exception", "reason"=>"Current token (VALUE_TRUE) not numeric, can not use numeric value accessors\n at [Source: org.elasticsearch.common.bytes.BytesReference$MarkSupportingStreamInputWrapper#56b8442f; line: 1, column: 555]"}}}}}
[2018-05-16T00:01:37,145][INFO ][org.logstash.beats.BeatsHandler] [local: 0:0:0:0:0:0:0:1:5000, remote: 0:0:0:0:0:0:0:1:50222] Handling exception: org.logstash.beats.BeatsParser$InvalidFrameProtocolException: Invalid Frame Type, received: 69
[2018-05-16T00:01:37,147][INFO ][org.logstash.beats.BeatsHandler] [local: 0:0:0:0:0:0:0:1:5000, remote: 0:0:0:0:0:0:0:1:50222] Handling exception: org.logstash.beats.BeatsParser$InvalidFrameProtocolException: Invalid Frame Type, received: 84
...
[2018-05-16T15:28:09,981][INFO ][logstash.outputs.elasticsearch] retrying failed action with response code: 403 ({"type"=>"cluster_block_exception", "reason"=>"blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];"})
[2018-05-16T15:28:09,982][INFO ][logstash.outputs.elasticsearch] retrying failed action with response code: 403 ({"type"=>"cluster_block_exception", "reason"=>"blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];"})
[2018-05-16T15:28:09,982][INFO ][logstash.outputs.elasticsearch] retrying failed action with response code: 403 ({"type"=>"cluster_block_exception", "reason"=>"blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];"})
If I understand the first warning correctly, the problem is with the mapping. I have a lot of files in my Logstash queue folder. My questions are:
How do I empty my queue? Can I just delete all files from the Logstash queue folder (all logs will be lost) and then resend all the data to Logstash to the proper index?
How can I determine where exactly the mapping problem is, or which servers are sending data of the wrong type?
I have a pipeline on port 5000 named testing-pipeline, just for checking from Nagios whether Logstash is active. What are those [INFO ][org.logstash.beats.BeatsHandler] logs?
If I understand correctly, [INFO ][logstash.outputs.elasticsearch] is just logging about retrying to process the Logstash queue?
All servers run Filebeat 6.2.2. Thank you for your help.
All pages in the queue could be deleted, but that is not the proper solution. In my case, the queue was full because there were events with a different index mapping. In Elasticsearch 6 you cannot send documents with different mappings to the same index, so the logs stacked up in the queue because of those events (even if there is only one wrong event, all the others will not be processed). So how do you process all the data you can and skip the wrong events? The solution is to configure a DLQ (dead letter queue). Every event with response code 400 or 404 is moved to the DLQ so the others can be processed. The data from the DLQ can be processed later with a separate pipeline.
The wrong mapping can be identified from the error log: "error"=>{"type"=>"mapper_parsing_exception", ... }. To pinpoint the exact place with the wrong mapping, you have to compare the mapping of the events with the mapping of the indices.
The [INFO ][org.logstash.beats.BeatsHandler] messages were caused by the Nagios server. The check did not consist of a valid request, hence the handling exception. The check should test whether the Logstash service is active. Now I check the Logstash service on localhost:9600; for more info see here.
[INFO ][logstash.outputs.elasticsearch] means that Logstash is trying to process the queue, but the index is locked ([FORBIDDEN/12/index read-only / allow delete (api)]) because the indices were set to a read-only state. When there is not enough space on the server, Elasticsearch automatically switches indices to read-only. This can be changed via cluster.routing.allocation.disk.watermark.low; for more info see here.

spring xd losing messages when processing huge volume

I am using Spring XD. My stream looks like the one below, and I am running tests on a 3-node container setup with 1 admin node, with Rabbit as the transport:
aws-s3|processor1|http-client|processor2>queue:readyQueue
I have created the taps below:
tap1 aws-s3>s3Queue
tap2 processor1>processorQueue1
tap3 http-client>httpQueue
I ran the scenarios below in my tests:
Scenario 1: 5 files of 200k = 1 million records
concurrency of http-client = 70 and processor2 = 30
I see 900k messages in s3Queue
I see 889k messages in processorQueue1
I see 886k messages in httpQueue
I see 883k messages in processorQueue2
Messages are lost everywhere, and it's random.
Scenario 2:
5 files of 200k = 1 million records and all module concurrency = 1
I see 998,800 messages in s3Queue
I see 998,760 messages in processorQueue1
I see 997,540 messages in httpQueue
I see 997,530 messages in processorQueue2
Even these numbers are random and not consistent.
Scenario 3
I changed the stream as below, with concurrency = 1 and 5 files of 200k = 1 million records:
aws-s3 >testQueue
I get all my messages. I ran it 3 times with no issues; I get all my 1 million messages.
Scenario 4
I changed the stream as below, with concurrency = 1 and 5 files of 200k = 1 million records:
aws-s3 |processor1 >testQueue2
I get all my messages. I ran it 3 times with no issues; I get all my 1 million messages.
In scenarios 3 and 4, data ingestion was faster; it took about 5 minutes to process, and ingestion into the Rabbit transport queue was fast, around 5k msg/sec.
In scenario 1, data ingestion was slower; even the s3 module was pulling the data very slowly, around 300 to 1000 msg/sec.
In scenario 2, aws-s3 pulled data fast, around 3-4k msg/sec, but the http-client was slow, around 100 msg/sec.
I am thinking that XD threading is causing issues and that this is why I am losing messages. Please can you help me solve this issue?
Update
Scenario 5
I changed reply-timeout to -1 in the http-client, and then
I lost only 37 msgs.
Then I ran a 2nd iteration and lost 25,000 msgs; I see the following container logs when that happened:
2016-03-04T03:42:04-0500 1.2.1.RELEASE ERROR task-scheduler-7 handler.LoggingHandler - org.springframework.messaging.MessageHandlingException: error occurred in message handler [org.springframework.integration.amqp.outbound.AmqpOutboundEndpoint#b6700b1]; nested exception is org.springframework.amqp.AmqpIOException: java.io.IOException
at org.springframework.integration.handler.AbstractMessageHandler.handleMessage(AbstractMessageHandler.java:84)
at org.springframework.xd.dirt.integration.rabbit.RabbitMessageBus$SendingHandler.handleMessageInternal(RabbitMessageBus.java:891)
at org.springframework.integration.handler.AbstractMessageHandler.handleMessage(AbstractMessageHandler.java:78)
at org.springframework.integration.dispatcher.AbstractDispatcher.tryOptimizedDispatch(AbstractDispatcher.java:116)
at org.springframework.integration.dispatcher.UnicastingDispatcher.doDispatch(UnicastingDispatcher.java:101)
at org.springframework.integration.dispatcher.UnicastingDispatcher.dispatch(UnicastingDispatcher.java:97)
at org.springframework.integration.channel.AbstractSubscribableChannel.doSend(AbstractSubscribableChannel.java:77)
at org.springframework.integration.channel.AbstractMessageChannel.send(AbstractMessageChannel.java:287)
at org.springframework.integration.channel.interceptor.WireTap.preSend(WireTap.java:129)
at org.springframework.integration.channel.AbstractMessageChannel$ChannelInterceptorList.preSend(AbstractMessageChannel.java:392)
at org.springframework.integration.channel.AbstractMessageChannel.send(AbstractMessageChannel.java:282)
at org.springframework.integration.channel.AbstractMessageChannel.send(AbstractMessageChannel.java:245)
at sun.reflect.GeneratedMethodAccessor204.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:317)
at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:190)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:157)
at org.springframework.integration.monitor.DirectChannelMetrics.monitorSend(DirectChannelMetrics.java:114)
at org.springframework.integration.monitor.DirectChannelMetrics.doInvoke(DirectChannelMetrics.java:98)
at org.springframework.integration.monitor.DirectChannelMetrics.invoke(DirectChannelMetrics.java:92)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:179)
at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:207)
at com.sun.proxy.$Proxy1537.send(Unknown Source)
at org.springframework.messaging.core.GenericMessagingTemplate.doSend(GenericMessagingTemplate.java:115)
at org.springframework.messaging.core.GenericMessagingTemplate.doSend(GenericMessagingTemplate.java:45)
at org.springframework.messaging.core.AbstractMessageSendingTemplate.send(AbstractMessageSendingTemplate.java:95)
at org.springframework.integration.handler.AbstractMessageProducingHandler.sendOutput(AbstractMessageProducingHandler.java:231)
at org.springframework.integration.handler.AbstractMessageProducingHandler.produceOutput(AbstractMessageProducingHandler.java:154)
at org.springframework.integration.splitter.AbstractMessageSplitter.produceOutput(AbstractMessageSplitter.java:157)
at org.springframework.integration.handler.AbstractMessageProducingHandler.sendOutputs(AbstractMessageProducingHandler.java:102)
at org.springframework.integration.handler.AbstractReplyProducingMessageHandler.handleMessageInternal(AbstractReplyProducingMessageHandler.java:105)
Caused by: org.springframework.amqp.AmqpIOException: java.io.IOException
at org.springframework.amqp.rabbit.support.RabbitExceptionTranslator.convertRabbitAccessException(RabbitExceptionTranslator.java:63)
at org.springframework.amqp.rabbit.connection.SimpleConnection.createChannel(SimpleConnection.java:51)
at org.springframework.amqp.rabbit.connection.CachingConnectionFactory$ChannelCachingConnectionProxy.createBareChannel(CachingConnectionFactory.java:758)
at org.springframework.amqp.rabbit.connection.CachingConnectionFactory$ChannelCachingConnectionProxy.access$300(CachingConnectionFactory.java:747)
at org.springframework.amqp.rabbit.connection.CachingConnectionFactory.doCreateBareChannel(CachingConnectionFactory.java:419)
at org.springframework.amqp.rabbit.connection.CachingConnectionFactory.createBareChannel(CachingConnectionFactory.java:395)
at org.springframework.amqp.rabbit.connection.CachingConnectionFactory.getCachedChannelProxy(CachingConnectionFactory.java:364)
at org.springframework.amqp.rabbit.connection.CachingConnectionFactory.getChannel(CachingConnectionFactory.java:357)
at org.springframework.amqp.rabbit.connection.CachingConnectionFactory.access$1100(CachingConnectionFactory.java:75)
at org.springframework.amqp.rabbit.connection.CachingConnectionFactory$ChannelCachingConnectionProxy.createChannel(CachingConnectionFactory.java:763)
at org.springframework.amqp.rabbit.connection.ConnectionFactoryUtils$1.createChannel(ConnectionFactoryUtils.java:85)
at org.springframework.amqp.rabbit.connection.ConnectionFactoryUtils.doGetTransactionalResourceHolder(ConnectionFactoryUtils.java:134)
at org.springframework.amqp.rabbit.connection.ConnectionFactoryUtils.getTransactionalResourceHolder(ConnectionFactoryUtils.java:67)
at org.springframework.amqp.rabbit.core.RabbitTemplate.doExecute(RabbitTemplate.java:1035)
at org.springframework.amqp.rabbit.core.RabbitTemplate.execute(RabbitTemplate.java:1028)
at org.springframework.amqp.rabbit.core.RabbitTemplate.send(RabbitTemplate.java:540)
at org.springframework.amqp.rabbit.core.RabbitTemplate.convertAndSend(RabbitTemplate.java:635)
at org.springframework.integration.amqp.outbound.AmqpOutboundEndpoint.send(AmqpOutboundEndpoint.java:331)
at org.springframework.integration.amqp.outbound.AmqpOutboundEndpoint.handleRequestMessage(AmqpOutboundEndpoint.java:323)
at org.springframework.integration.handler.AbstractReplyProducingMessageHandler.handleMessageInternal(AbstractReplyProducingMessageHandler.java:99)
at org.springframework.integration.handler.AbstractMessageHandler.handleMessage(AbstractMessageHandler.java:78)
... 93 more
Caused by: java.io.IOException
at com.rabbitmq.client.impl.AMQChannel.wrap(AMQChannel.java:106)
at com.rabbitmq.client.impl.AMQChannel.wrap(AMQChannel.java:102)
at com.rabbitmq.client.impl.AMQChannel.exnWrappingRpc(AMQChannel.java:124)
at com.rabbitmq.client.impl.ChannelN.open(ChannelN.java:125)
at com.rabbitmq.client.impl.ChannelManager.createChannel(ChannelManager.java:134)
at com.rabbitmq.client.impl.AMQConnection.createChannel(AMQConnection.java:499)
at org.springframework.amqp.rabbit.connection.SimpleConnection.createChannel(SimpleConnection.java:44)
... 112 more
Caused by: com.rabbitmq.client.ShutdownSignalException: connection error
at com.rabbitmq.utility.ValueOrException.getValue(ValueOrException.java:67)
at com.rabbitmq.utility.BlockingValueOrException.uninterruptibleGetValue(BlockingValueOrException.java:33)
at com.rabbitmq.client.impl.AMQChannel$BlockingRpcContinuation.getReply(AMQChannel.java:348)
at com.rabbitmq.client.impl.AMQChannel.privateRpc(AMQChannel.java:221)
at com.rabbitmq.client.impl.AMQChannel.exnWrappingRpc(AMQChannel.java:118)
... 116 more
Caused by: com.rabbitmq.client.impl.UnknownChannelException: Unknown channel number 23364
at com.rabbitmq.client.impl.ChannelManager.getChannel(ChannelManager.java:80)
at com.rabbitmq.client.impl.AMQConnection$MainLoop.run(AMQConnection.java:552)
... 1 more
2016-03-04T03:42:05-0500 1.2.1.RELEASE ERROR AMQP Connection xxx:5672 connection.CachingConnectionFactory - Channel shutdown: channel error; protocol method: #method<channel.close>(reply-code=404, reply-text=NOT_FOUND - no queue 'xdbus.tap-s3.tap:stream:stream.batch-aws-s3-source.0' in vhost '/', class-id=50, method-id=20)
2016-03-04T03:53:13-0500 1.2.1.RELEASE ERROR AMQP Connection xxx:5672 connection.CachingConnectionFactory - Channel shutdown: connection error
2016-03-04T03:53:13-0500 1.2.1.RELEASE ERROR AMQP Connection xxx:5672 connection.CachingConnectionFactory - Channel shutdown: channel error; protocol method: #method<channel.close>(reply-code=404, reply-text=NOT_FOUND - no queue 'xdbus.tap-s3.tap:stream:stream.batch-aws-s3-source.0' in vhost '/', class-id=50, method-id=20)
2016-03-04T02:57:54-0500 1.2.1.RELEASE ERROR AMQP Connection xxx:8080 connection.CachingConnectionFactory - Channel shutdown: connection error
2016-03-04T02:57:55-0500 1.2.1.RELEASE ERROR AMQP Connection xxx:8080 connection.CachingConnectionFactory - Channel shutdown: connection error
2016-03-04T03:42:04-0500 1.2.1.RELEASE ERROR AMQP Connection yyy:5672 connection.CachingConnectionFactory - Channel shutdown: connection error
Updated
I found the trigger for the message loss: when this exception happens, I see a lot of messages lost. I tested this pattern multiple times; every time this exception happens I see messages lost. Also, bumping up the concurrency makes the issue occur more often.
2016-03-05T13:59:41-0500 1.2.1.RELEASE ERROR AMQP Connection host1:5672 connection.CachingConnectionFactory - Channel shutdown: connection error
Rabbit configuration:
spring:
  rabbitmq:
    addresses: host1:5672,host2:5672,host3:5672
    adminAddresses: http://host1:15672,http://host2:15672,http://host3:15672
    nodes: rabbit#host1.test.com,rabbit#host2.test.com,rabbit#host2.test.com
    username: test
    password: test
    virtual_host: /
    useSSL: false
    sslProperties:
Updated after increasing the cache size to 200
I added the XML you provided and increased the cache size to 200. This is what happens when processing 1 million and 80k messages. Only my http-client concurrency is 100; all the others are 1. Processing slowly stopped; the messages are still there in the queue before http-client, with the same count. But the message count in my named channel is slowly increasing, about 10 msg per minute, which is very slow.
s3-poller|processor|http-client>queue:batchCacheQueue
The message count in the queue before http-client is not decreasing (186,174), but messages are slowly coming into batchCacheQueue.
Test case to simulate:
1) I was using a Spring Integration aws-s3 source with a splitter in a composite module | a processor doing XML parsing | http-client with concurrency 100 > named channel.
2) I think a file source might also work: create a single file of a million records and try to pull it from that file.
3) After some 4 to 5 runs we see this exception happening:
Caused by: com.rabbitmq.client.impl.UnknownChannelException: Unknown channel number 23364
We found an issue when channels are churned a lot; you need to increase the channel cache size in the rabbit caching connection factory.
See this answer for a work-around.
I opened a JIRA issue so that the next version of Spring XD will expose this setting in servers.yml, so you don't have to override the bus configuration file.
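Until that setting is exposed, here is a rough sketch of the equivalent knob in plain Spring AMQP (the addresses are copied from the question's Rabbit config and 200 matches the value tried above; in XD itself this is applied by overriding the rabbit bus configuration XML, not by Java code):
import org.springframework.amqp.rabbit.connection.CachingConnectionFactory;

CachingConnectionFactory connectionFactory = new CachingConnectionFactory();
connectionFactory.setAddresses("host1:5672,host2:5672,host3:5672");
connectionFactory.setChannelCacheSize(200);   // a larger channel cache avoids constant channel churn under high concurrency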

OpenSearchServer "this Directory is closed" error

I recently upgraded to OSS 1.3 RC3, and I am having some difficulties while using the scheduler:
Start time | End time | Duration | Task | Info | Error
9/24/12 12:49:00 PM | 9/24/12 12:49:00 PM | 0:00:00 | Index - optimize | Optimize starts | org.apache.lucene.store.AlreadyClosedException: this Directory is closed
9/24/12 12:49:00 PM | 9/24/12 12:49:00 PM | 0:00:00 | Web crawler - stop | Was not running |
9/24/12 12:48:00 PM | 9/24/12 12:48:12 PM | 0:00:12 | Index - optimize | Optimize starts | org.apache.lucene.store.AlreadyClosedException: this Directory is closed
9/24/12 12:38:00 PM | 9/24/12 12:48:00 PM | 0:10:00 | Web crawler - stop | Not stopped after 10 minutes |
I tried to check the log file, which looks like:
00:00:00,001 root - Cannot forcefully unlock a NativeFSLock which is held by another indexer component: /data/test/index/20120922160504/write.lock
org.apache.lucene.store.LockReleaseFailedException: Cannot forcefully unlock a NativeFSLock which is held by another indexer component: /data/test/index/20120922160504/write.lock
at org.apache.lucene.store.NativeFSLock.release(NativeFSLockFactory.java:274)
at org.apache.lucene.index.IndexWriter.unlock(IndexWriter.java:5739)
at com.jaeksoft.searchlib.index.WriterLocal.unlock(Unknown Source)
at com.jaeksoft.searchlib.index.WriterLocal.close(Unknown Source)
at com.jaeksoft.searchlib.index.WriterLocal.optimizeNoLock(Unknown Source)
at com.jaeksoft.searchlib.index.WriterLocal.optimize(Unknown Source)
at com.jaeksoft.searchlib.index.IndexSingle.optimize(Unknown Source)
at com.jaeksoft.searchlib.Client.optimize(Unknown Source)
at com.jaeksoft.searchlib.scheduler.task.TaskOptimizeIndex.execute(Unknown Source)
at com.jaeksoft.searchlib.scheduler.TaskItem.run(Unknown Source)
at com.jaeksoft.searchlib.scheduler.JobItem.run(Unknown Source)
at com.jaeksoft.searchlib.scheduler.TaskManager.execute(Unknown Source)
at org.quartz.core.JobRunShell.run(JobRunShell.java:216)
at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:549)
Thanks in advance.
I suppose you already fixed this issue.
The file "write.lock" is created when OpenSearchServer writes in the index. In some case (server crash, application killed), this file may not be deleted automatically.
Here is the process how to fix this:
- Stop OpensearchServer.
- Delete the file named "write.lock".
- Restart OpenSearchServer.
Sometimes, when using the scheduler, you may have concurrent jobs which will try to update the index at the same time. A typical scheduler job will do that kind of tasks:
- Stop the web crawler.
- Optimize the index.
- Replicate the index.
- Start the web crawler.
I hope this will be useful.
