Intermittent connectionTimeout errors in spark streaming job

Intermittent connectionTimeout errors in spark streaming job - apache-spark

I have a Spark (2.1) streaming job that writes processed data to azure blob storage every batch (with batch interval 1 min). Every now and then (once every couple of hours) I get the 'java.net.ConnectException' with connection timeout message. This does gets retried and eventually succeeds. But this issue is causing delay in the completion of the 1 min streaming batch and is causing it to finish in 2 to 3 min when this error occurs.
Below is the executor log snippet with error message. I have spark.executor.cores=5.
Is there some kind of number of connections limit that might be causing this?
17/10/11 16:09:02 INFO root: {89f867cc-cbd3-4fa9-a549-4e07be3f69b0}: {Starting operation.}
17/10/11 16:09:02 INFO root: {89f867cc-cbd3-4fa9-a549-4e07be3f69b0}: {Starting operation with location 'PRIMARY' per location mode 'PRIMARY_ONLY'.}
17/10/11 16:09:02 INFO root: {89f867cc-cbd3-4fa9-a549-4e07be3f69b0}: {Starting request to 'http://<name>.blob.core.windows.net/rawData/2017/10/11/16/data1.json' at 'Wed, 11 Oct 2017 16:09:02 GMT'.}
17/10/11 16:09:02 INFO root: {89f867cc-cbd3-4fa9-a549-4e07be3f69b0}: {Waiting for response.}
..
17/10/11 16:11:09 WARN root: {89f867cc-cbd3-4fa9-a549-4e07be3f69b0}: {Retryable exception thrown. Class = 'java.net.ConnectException', Message = 'Connection timed out (Connection timed out)'.}
17/10/11 16:11:09 INFO root: {89f867cc-cbd3-4fa9-a549-4e07be3f69b0}: {Checking if the operation should be retried. Retry count = '0', HTTP status code = '-1', Error Message = 'An unknown failure occurred : Connection timed out (Connection timed out)'.}
17/10/11 16:11:09 INFO root: {89f867cc-cbd3-4fa9-a549-4e07be3f69b0}: {The next location has been set to 'PRIMARY', per location mode 'PRIMARY_ONLY'.}
17/10/11 16:11:09 INFO root: {89f867cc-cbd3-4fa9-a549-4e07be3f69b0}: {The retry policy set the next location to 'PRIMARY' and updated the location mode to 'PRIMARY_ONLY'.}
17/10/11 16:11:09 INFO root: {89f867cc-cbd3-4fa9-a549-4e07be3f69b0}: {Operation will be retried after '0'ms.}
17/10/11 16:11:09 INFO root: {89f867cc-cbd3-4fa9-a549-4e07be3f69b0}: {Retrying failed operation.}

Related

Running into RedisTimeoutException and other exceptions with Redisson and Azure Redis Cache

A lot of timeout exceptions and Can't add slave exceptions
Steps to reproduce or test case
intermittent
Redis version
Azure Redis Cache with 5 shards
4.0.14, 3.2.7
Redisson version
3.11.4
Redisson configuration
Default clustered config with the following overrides:
REDIS_ENABLED | true
REDIS_KEEP_ALIVE | true
REDIS_THREADS | 512
REDIS_NETTY_THREADS | 1024
REDIS_MASTER_CONNECTION_MINIMUM_IDLE_SIZE | 5
REDIS_MASTER_CONNECTION_POOL_SIZE | 10
REDIS_SLAVE_CONNECTION_MINIMUM_IDLE_SIZE | 5
REDIS_SLAVE_CONNECTION_POOL_SIZE | 10
REDIS_TIMEOUT | 1000
REDIS_RETRY_INTERVAL | 500
REDIS_TCP_NO_DELAY | true
I see following exceptions in the log:
`
exception: { [-]
class: org.redisson.client.RedisConnectionException
thrownfrom: unknown
}
level: ERROR
logger_name: org.redisson.cluster.ClusterConnectionManager
message: Can't add slave: rediss://:15002
process: 6523
stack_trace: org.redisson.client.RedisTimeoutException: Command execution timeout for command: (READONLY), params: [], Redis client: [addr=rediss://:15002]
at org.redisson.client.RedisConnection.lambda$async$1(RedisConnection.java:207)
at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:680)
at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:755)
at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:483)
... 2 common frames omitted
Wrapped by: org.redisson.client.RedisConnectionException: Unable to connect to Redis server: /:15002
at org.redisson.connection.pool.ConnectionPool$1.lambda$run$0(ConnectionPool.java:160)
at org.redisson.misc.RedissonPromise.lambda$onComplete$0(RedissonPromise.java:183)
at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577)
at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:551)
at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:490)
at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:615)
at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:608)
at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117)
at org.redisson.misc.RedissonPromise.tryFailure(RedissonPromise.java:96)
at org.redisson.connection.pool.ConnectionPool.promiseFailure(ConnectionPool.java:330)
at org.redisson.connection.pool.ConnectionPool.lambda$createConnection$1(ConnectionPool.java:296)
at org.redisson.misc.RedissonPromise.lambda$onComplete$0(RedissonPromise.java:183)
at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577)
at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:570)
at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:549)
at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:490)
at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:615)
at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:608)
at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117)
at org.redisson.misc.RedissonPromise.tryFailure(RedissonPromise.java:96)
at org.redisson.client.RedisClient$2$1.run(RedisClient.java:240)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:510)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:518)
at i.n.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1044)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)
stack_trace: org.redisson.client.RedisTimeoutException: Unable to get connection! Try to increase 'nettyThreads' and/or connection pool size settingsNode source: NodeSource [slot=15393, addr=redis://:15007, redisClient=null, redirect=MOVED, entry=null], command: (PSETEX), params: [some key, 3600000, PooledUnsafeDirectByteBuf(ridx: 0, widx: 457, cap: 512)] after 0 retry attempts
at org.redisson.command.RedisExecutor$2.run(RedisExecutor.java:209)
at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:680)
at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:755)
at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:483)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)
level: ERROR
logger_name: org.redisson.cluster.ClusterConnectionManager
message: Can't add master: rediss://:15007 for slot ranges: [[15019-15564], [12288-13652], [4096-5461]]
process: 6574
thread_name: redisson-netty-2-718
timestamp: 2019-10-29 22:32:15.592
`
My logs are flooded with these exceptions and when I login to Azure portal, I see the CPU metric for Redis spiked to 100%. Any help is appreciated.

Spark job can not acquire resource from mesos cluster

I am using Spark Job Server (SJS) to create context and submit jobs.
My cluster includes 4 servers.
master1: 10.197.0.3
master2: 10.197.0.4
master3: 10.197.0.5
master4: 10.197.0.6
But only master1 has a public ip.
First of all I set up zookeeper for master1, master3 and master3 and zookeeper-id from 1 to 3.
I intend use master1, master2, master3 to be a masters of cluster.
That mean quorum=2 I set for 3 masters.
The zk connect is zk://master1:2181,master2:2181,master3:2181/mesos
each server I also start mesos-slave so I have 4 slaves and 3 masters.
As you can see all slaves are conencted.
But the funny thing is when I create a job to run it can not acquire the resource.
From logs I saw that it's continuing DECLINE the offer. This logs from master.
I0523 15:01:00.116981 32513 master.cpp:3641] Processing DECLINE call for offers: [ dc18c89f-d802-404b-9221-71f0f15b096f-O4264 ] for framework dc18c89f-d802-404b-9221-71f0f15b096f-0001 (sql_context-1) at scheduler-f5196abd-f420-48c6-b2fe-0306595601d4#10.197.0.3:28765
I0523 15:01:00.117086 32513 master.cpp:3641] Processing DECLINE call for offers: [ dc18c89f-d802-404b-9221-71f0f15b096f-O4265 ] for framework dc18c89f-d802-404b-9221-71f0f15b096f-0001 (sql_context-1) at scheduler-f5196abd-f420-48c6-b2fe-0306595601d4#10.197.0.3:28765
I0523 15:01:01.460502 32508 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (914)#127.0.0.1:5050
I0523 15:01:02.117753 32510 master.cpp:5324] Sending 1 offers to framework dc18c89f-d802-404b-9221-71f0f15b096f-0000 (sql_context) at scheduler-9b4637cf-4b27-4629-9a73-6019443ed30b#10.197.0.3:28765
I0523 15:01:02.118099 32510 master.cpp:5324] Sending 1 offers to framework dc18c89f-d802-404b-9221-71f0f15b096f-0001 (sql_context-1) at scheduler-f5196abd-f420-48c6-b2fe-0306595601d4#10.197.0.3:28765
I0523 15:01:02.119299 32508 master.cpp:3641] Processing DECLINE call for offers: [ dc18c89f-d802-404b-9221-71f0f15b096f-O4266 ] for framework dc18c89f-d802-404b-9221-71f0f15b096f-0000 (sql_context) at scheduler-9b4637cf-4b27-4629-9a73-6019443ed30b#10.197.0.3:28765
I0523 15:01:02.119858 32515 master.cpp:3641] Processing DECLINE call for offers: [ dc18c89f-d802-404b-9221-71f0f15b096f-O4267 ] for framework dc18c89f-d802-404b-9221-71f0f15b096f-0001 (sql_context-1) at scheduler-f5196abd-f420-48c6-b2fe-0306595601d4#10.197.0.3:28765
I0523 15:01:02.900946 32509 http.cpp:312] HTTP GET for /master/state from 10.197.0.3:35778 with User-Agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36' with X-Forwarded-For='113.161.38.181'
I0523 15:01:03.118147 32514 master.cpp:5324] Sending 1 offers to framework dc18c89f-d802-404b-9221-71f0f15b096f-0001 (sql_context-1) at scheduler-f5196abd-f420-48c6-b2fe-0306595601d4#10.197.0.3:28765
For 1 of my slave I check
W0523 14:53:15.487599 32681 status_update_manager.cpp:475] Resending status update TASK_FAILED (UUID: 3c3a022c-2032-4da1-bbab-c367d46e07de) for task driver-20160523111535-0003 of framework a9871c4b-ab0c-4ddc-8d96-c52faf0e66f7-0019
W0523 14:53:15.487773 32681 status_update_manager.cpp:475] Resending status update TASK_FAILED (UUID: cfb494b3-6484-4394-bd94-80abf2e11ee8) for task driver-20160523112724-0001 of framework a9871c4b-ab0c-4ddc-8d96-c52faf0e66f7-0020
I0523 14:53:15.487820 32680 slave.cpp:3400] Forwarding the update TASK_FAILED (UUID: 3c3a022c-2032-4da1-bbab-c367d46e07de) for task driver-20160523111535-0003 of framework a9871c4b-ab0c-4ddc-8d96-c52faf0e66f7-0019 to master#10.197.0.3:5050
I0523 14:53:15.488008 32680 slave.cpp:3400] Forwarding the update TASK_FAILED (UUID: cfb494b3-6484-4394-bd94-80abf2e11ee8) for task driver-20160523112724-0001 of framework a9871c4b-ab0c-4ddc-8d96-c52faf0e66f7-0020 to master#10.197.0.3:5050
I0523 15:02:24.120436 32680 http.cpp:190] HTTP GET for /slave(1)/state from 113.161.38.181:63097 with User-Agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'
W0523 15:02:24.165690 32685 slave.cpp:4979] Failed to get resource statistics for executor 'driver-20160523111535-0003' of framework a9871c4b-ab0c-4ddc-8d96-c52faf0e66f7-0019: Container 'cac7667c-3309-4380-9f95-07d9b888e44e' not found
W0523 15:02:24.165771 32685 slave.cpp:4979] Failed to get resource statistics for executor 'driver-20160523112724-0001' of framework a9871c4b-ab0c-4ddc-8d96-c52faf0e66f7-0020: Container '9c661311-bf7f-4ea6-9348-ce8c7f6cfbcb' not found
From SJS Logs
[2016-05-23 15:04:10,305] DEBUG oarseMesosSchedulerBackend [] [] - Declining offer: dc18c89f-d802-404b-9221-71f0f15b096f-O4565 with attributes: Map() mem: 63403.0 cpu: 8
[2016-05-23 15:04:10,305] DEBUG oarseMesosSchedulerBackend [] [] - Declining offer: dc18c89f-d802-404b-9221-71f0f15b096f-O4566 with attributes: Map() mem: 47244.0 cpu: 8
[2016-05-23 15:04:10,305] DEBUG oarseMesosSchedulerBackend [] [] - Declining offer: dc18c89f-d802-404b-9221-71f0f15b096f-O4567 with attributes: Map() mem: 47244.0 cpu: 8
[2016-05-23 15:04:10,366] WARN cheduler.TaskSchedulerImpl [] [akka://JobServer/user/context-supervisor/sql_context] - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
[2016-05-23 15:04:10,505] DEBUG cheduler.TaskSchedulerImpl [] [akka://JobServer/user/context-supervisor/sql_context] - parentName: , name: TaskSet_0, runningTasks: 0
[2016-05-23 15:04:11,306] DEBUG oarseMesosSchedulerBackend [] [] - Declining offer: dc18c89f-d802-404b-9221-71f0f15b096f-O4568 with attributes: Map() mem: 47244.0 cpu: 8
[2016-05-23 15:04:11,306] DEBUG oarseMesosSchedulerBackend [] [] - Declining offer: dc18c89f-d802-404b-9221-71f0f15b096f-O4569 with attributes: Map() mem: 63403.0 cpu: 8
[2016-05-23 15:04:11,505] DEBUG cheduler.TaskSchedulerImpl [] [akka://JobServer/user/context-supervisor/sql_context] - parentName: , name: TaskSet_0, runningTasks: 0
[2016-05-23 15:04:12,308] DEBUG oarseMesosSchedulerBackend [] [] - Declining offer: dc18c89f-d802-404b-9221-71f0f15b096f-O4570 with attributes: Map() mem: 47244.0 cpu: 8
[2016-05-23 15:04:12,505] DEBUG cheduler.TaskSchedulerImpl [] [akka://JobServer/user/context-supervisor/sql_context] - parentName: , name: TaskSet_0, runningTasks: 0
In master2 logs
May 23 08:19:44 ants-vps mesos-master[1866]: E0523 08:19:44.273349 1902 process.cpp:1958] Failed to shutdown socket with fd 28: Transport endpoint is not connected
May 23 08:19:54 ants-vps mesos-master[1866]: I0523 08:19:54.274245 1899 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (1257)#127.0.0.1:5050
May 23 08:19:54 ants-vps mesos-master[1866]: E0523 08:19:54.274533 1902 process.cpp:1958] Failed to shutdown socket with fd 28: Transport endpoint is not connected
May 23 08:20:04 ants-vps mesos-master[1866]: I0523 08:20:04.275291 1897 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (1260)#127.0.0.1:5050
May 23 08:20:04 ants-vps mesos-master[1866]: E0523 08:20:04.275512 1902 process.cpp:1958] Failed to shutdown socket with fd 28: Transport endpoint is not connected
From master3:
May 23 08:21:05 ants-vps mesos-master[22023]: I0523 08:21:05.994082 22042 recover.cpp:193] Received a recover response from a replica in EMPTY status
May 23 08:21:15 ants-vps mesos-master[22023]: I0523 08:21:15.994051 22043 recover.cpp:109] Unable to finish the recover protocol in 10secs, retrying
May 23 08:21:15 ants-vps mesos-master[22023]: I0523 08:21:15.994529 22036 replica.cpp:673] Replica in EMPTY status received a broadcasted recover request from (1282)#127.0.0.1:5050
How to find the reason of that issues and fix it?

How can I drop an empty line in logstash

In my logstash logs I have sometimes empty lines or lines with only spaces.
To drop the empty line I created a dropemptyline filter file
# drop empty lines
filter {
if [message] =~ /^\s*$/ {
drop { }
}
}
But the empty line filter is not working as expected, mainly because this particular filter is inside a chain other filters where there are filter coming afterwards afterwards.
00_input.conf
05_syslogfilter.conf
06_dropemptylines.conf
07_classifier.conf
So I think my particular filter would work if it was the only one but its not.
2015-02-11 15:02:12.347 WARN 1 --- [tp1812226644-23] o.eclipse.jetty.servlet.ServletHandler :
org.springframework.web.util.NestedServletException: Request processing failed; nested exception is org.springframework.dao.DataAccessResourceFailureException: Timed out after 10000 ms while waiting for a server that matches AnyServerSelector{}. Client view of cluster state is {type=Unknown, servers=[{address=mongo:27017, type=Unknown, state=Connecting, exception={com.mongodb.MongoException$Network: Exception opening the socket}, caused by {java.net.UnknownHostException: mongo: unknown error}}]; nested exception is com.mongodb.MongoTimeoutException: Timed out after 10000 ms while waiting for a server that matches AnyServerSelector{}. Client view of cluster state is {type=Unknown, servers=[{address=mongo:27017, type=Unknown, state=Connecting, exception={com.mongodb.MongoException$Network: Exception opening the socket}, caused by {java.net.UnknownHostException: mongo: unknown error}}]
My question is how can I drop out of all filters and go directly to output?

you can just ignore empty lines entirely using a grok filter,
%{GREEDYDATA:1st}(\n{1,})%{GREEDYDATA:2nd}
it will generate,
{
"1st": [
[
"2015-02-11 15:02:12.347 WARN 1 --- [tp1812226644-23] o.eclipse.jetty.servlet.ServletHandler : "
]
],
"2nd": [
[
"org.springframework.web.util.NestedServletException: Request processing failed; nested exception is org.springframework.dao.DataAccessResourceFailureException: Timed out after 10000 ms while waiting for a server that matches AnyServerSelector{}. Client view of cluster state is {type=Unknown, servers=[{address=mongo:27017, type=Unknown, state=Connecting, exception={com.mongodb.MongoException$Network: Exception opening the socket}, caused by {java.net.UnknownHostException: mongo: unknown error}}]; nested exception is com.mongodb.MongoTimeoutException: Timed out after 10000 ms while waiting for a server that matches AnyServerSelector{}. Client view of cluster state is {type=Unknown, servers=[{address=mongo:27017, type=Unknown, state=Connecting, exception={com.mongodb.MongoException$Network: Exception opening the socket}, caused by {java.net.UnknownHostException: mongo: unknown error}}]"
]
]
}
or more elegant way,
(?m)%{GREEDYDATA:log}
Output:
{
"log": [
[
"2015-02-11 15:02:12.347 WARN 1 --- [tp1812226644-23] o.eclipse.jetty.servlet.ServletHandler : \n\n\n\norg.springframework.web.util.NestedServletException: Request processing failed; nested exception is org.springframework.dao.DataAccessResourceFailureException: Timed out after 10000 ms while waiting for a server that matches AnyServerSelector{}. Client view of cluster state is {type=Unknown, servers=[{address=mongo:27017, type=Unknown, state=Connecting, exception={com.mongodb.MongoException$Network: Exception opening the socket}, caused by {java.net.UnknownHostException: mongo: unknown error}}]; nested exception is com.mongodb.MongoTimeoutException: Timed out after 10000 ms while waiting for a server that matches AnyServerSelector{}. Client view of cluster state is {type=Unknown, servers=[{address=mongo:27017, type=Unknown, state=Connecting, exception={com.mongodb.MongoException$Network: Exception opening the socket}, caused by {java.net.UnknownHostException: mongo: unknown error}}]"
]
]
}

Reduce the retry time in CoreData - see below

This is what appears in the log, see the Retrying after delay: 60 seconds
2013-10-09 15:21:59.807 MyApp[4848:6507] -[PFUbiquitySafeSaveFile waitForFileToUpload:](299): CoreData: Ubiquity: <PFUbiquityPeerReceipt: 0x15e09de0>(0)
permanentLocation: <PFUbiquityLocation: 0x15e22100>: /var/mobile/Library/Mobile Documents/HHWT75NS6T~com~xyz~MyApp/CoreData/New Project retry 2/mobile~A3BF465A-15C7-46F2-8C68-AC8F49FD15AB/New Project retry 2/W8MckEJ0x2d~HZZIeUH2be6hs41TEOONzKIrCuLcuP4=/receipt.0.cdt
safeLocation: <PFUbiquityLocation: 0x15ec4660>: /var/mobile/Library/Mobile Documents/HHWT75NS6T~com~xyz~MyApp/CoreData/New Project retry 2/mobile~A3BF465A-15C7-46F2-8C68-AC8F49FD15AB/New Project retry 2/W8MckEJ0x2d~HZZIeUH2be6hs41TEOONzKIrCuLcuP4=/mobile~A3BF465A-15C7-46F2-8C68-AC8F49FD15AB.0.cdt
currentLocation: <PFUbiquityLocation: 0x15ec4660>: /var/mobile/Library/Mobile Documents/HHWT75NS6T~com~xyz~MyApp/CoreData/New Project retry 2/mobile~A3BF465A-15C7-46F2-8C68-AC8F49FD15AB/New Project retry 2/W8MckEJ0x2d~HZZIeUH2be6hs41TEOONzKIrCuLcuP4=/mobile~A3BF465A-15C7-46F2-8C68-AC8F49FD15AB.0.cdt
kv: (null)
Safe save failed for file, error: Error Domain=NSCocoaErrorDomain Code=260 "The operation couldn’t be completed. (Cocoa error 260 - Request for item properties on non-existent item.)" UserInfo=0x15eaedf0 {NSDescription=Request for item properties on non-existent item.}
2013-10-09 15:21:59.809 iProjectiPad[4848:6507] -[PFUbiquitySetupAssistant finishSetupWithRetry:](811): CoreData: Ubiquity: <PFUbiquitySetupAssistant: 0x15e4e520>: Retrying after delay: 60

How do I reconnect to Cassandra using Hector?

I have the following code:
StringSerializer ss = StringSerializer.get();
String cf = "TEST";
CassandraHostConfigurator conf = new CassandraHostConfigurator("localhost:9160");
conf.setCassandraThriftSocketTimeout(40000);
conf.setExhaustedPolicy(ExhaustedPolicy.WHEN_EXHAUSTED_BLOCK);
conf.setRetryDownedHostsDelayInSeconds(5);
conf.setRetryDownedHostsQueueSize(128);
conf.setRetryDownedHosts(true);
conf.setLoadBalancingPolicy(new LeastActiveBalancingPolicy());
String key = Long.toString(System.currentTimeMillis());
Cluster cluster = HFactory.getOrCreateCluster("TestCluster", conf);
Keyspace keyspace = HFactory.createKeyspace("TestCluster", cluster);
Mutator<String> mutator = HFactory.createMutator(keyspace, StringSerializer.get()); int count = 0;
while (!"q".equals(new Scanner( System.in).next())) {
try{
mutator.insert(key, cf, HFactory.createColumn("column_" + count, "v_" + count, ss, ss));
count++;
} catch (Exception e) {
e.printStackTrace();
}
}
and I can write some values using it, but when I restart cassandra, it fails. Here is the log:
[15:11:07] INFO [CassandraHostRetryService ] Downed Host Retry service started with >queue size 128 and retry delay 5s
[15:11:07] INFO [JmxMonitor ] Registering JMX >me.prettyprint.cassandra.service_ASG:ServiceType=hector,MonitorType=hector
[15:11:17] ERROR [HThriftClient ] Could not flush transport (to be expected >if the pool is shutting down) in close for client: CassandraClient
org.apache.thrift.transport.TTransportException: java.net.SocketException: Broken pipe
at >org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:147)
at org.apache.thrift.transport.TFramedTransport.flush(TFramedTransport.java:156)
at >me.prettyprint.cassandra.connection.client.HThriftClient.close(HThriftClient.java:98)
at >me.prettyprint.cassandra.connection.client.HThriftClient.close(HThriftClient.java:26)
at >me.prettyprint.cassandra.connection.HConnectionManager.closeClient(HConnectionManager.java:308)
at >me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:257)
at >me.prettyprint.cassandra.model.ExecutingKeyspace.doExecuteOperation(ExecutingKeyspace.java:97)
at me.prettyprint.cassandra.model.MutatorImpl.execute(MutatorImpl.java:243)
at me.prettyprint.cassandra.model.MutatorImpl.insert(MutatorImpl.java:69)
at com.app.App.main(App.java:40)
Caused by: java.net.SocketException: Broken pipe
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
at >org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:145)
... 9 more
[15:11:17] ERROR [HConnectionManager ] MARK HOST AS DOWN TRIGGERED for host >localhost(127.0.0.1):9160
[15:11:17] ERROR [HConnectionManager ] Pool state on shutdown: >:{localhost(127.0.0.1):9160}; IsActive?: true; Active: 1; Blocked: 0; Idle: 15; NumBeforeExhausted: 49
[15:11:17] INFO [ConcurrentHClientPool ] Shutdown triggered on :{localhost(127.0.0.1):9160}
[15:11:17] INFO [ConcurrentHClientPool ] Shutdown complete on :{localhost(127.0.0.1):9160}
[15:11:17] INFO [CassandraHostRetryService ] Host detected as down was added to retry queue: localhost(127.0.0.1):9160
[15:11:17] WARN [HConnectionManager ] Could not fullfill request on this host CassandraClient
[15:11:17] WARN [HConnectionManager ] Exception:
me.prettyprint.hector.api.exceptions.HectorTransportException: org.apache.thrift.transport.TTransportException: java.net.SocketException: Broken pipe
at >me.prettyprint.cassandra.connection.client.HThriftClient.getCassandra(HThriftClient.java:82)
at >me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:236)
at >me.prettyprint.cassandra.model.ExecutingKeyspace.doExecuteOperation(ExecutingKeyspace.java:97)
at me.prettyprint.cassandra.model.MutatorImpl.execute(MutatorImpl.java:243)
at me.prettyprint.cassandra.model.MutatorImpl.insert(MutatorImpl.java:69)
at com.app.App.main(App.java:40)
Caused by: org.apache.thrift.transport.TTransportException: java.net.SocketException: Broken pipe
at org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:147)
at org.apache.thrift.transport.TFramedTransport.flush(TFramedTransport.java:157)
at org.apache.cassandra.thrift.Cassandra$Client.send_set_keyspace(Cassandra.java:466)
at org.apache.cassandra.thrift.Cassandra$Client.set_keyspace(Cassandra.java:455)
at >me.prettyprint.cassandra.connection.client.HThriftClient.getCassandra(HThriftClient.java:78)
... 5 more
Caused by: java.net.SocketException: Broken pipe
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
at >org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:145)
... 9 more
[15:11:17] INFO [HConnectionManager ] Client CassandraClient released to inactive or dead pool. Closing.
[15:11:17] INFO [HConnectionManager ] Client CassandraClient released to inactive or dead pool. Closing.
[15:11:17] INFO [HConnectionManager ] Added host localhost(127.0.0.1):9160 to pool

You have set -
conf.setRetryDownedHostsDelayInSeconds(5);
Try to to wait after the restart for more than 5 seconds.
Also, you may need to upgrade.
What is the size thrift_max_message_length_in_mb you have set?
Kind regards.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Intermittent connectionTimeout errors in spark streaming job - apache-spark

Related

Running into RedisTimeoutException and other exceptions with Redisson and Azure Redis Cache

Spark job can not acquire resource from mesos cluster

How can I drop an empty line in logstash

Reduce the retry time in CoreData - see below

How do I reconnect to Cassandra using Hector?

Categories

Resources