Nodetool repair erroring on cassandra 3.9 cluster complaining of dead nodes - cassandra

I have a cassandra 3.9 cluster. I initiated a repair from one of the nodes in the cluster. The repair went nowhere. I see the logs on that initiated node filled with errors like this.
ERROR [GossipTasks:1] 2018-02-16 23:27:36,949 RepairSession.java:347 - [repair #cadf6f11-1342-11e8-8d73-6767c6890f70] session completed with the following error
java.io.IOException: Endpoint /**.**.**.52 died
at org.apache.cassandra.repair.RepairSession.convict(RepairSession.java:346) ~[apache-cassandra-3.9.jar:3.9]
at org.apache.cassandra.gms.FailureDetector.interpret(FailureDetector.java:306) [apache-cassandra-3.9.jar:3.9]
at org.apache.cassandra.gms.Gossiper.doStatusCheck(Gossiper.java:782) [apache-cassandra-3.9.jar:3.9]
at org.apache.cassandra.gms.Gossiper.access$800(Gossiper.java:66) [apache-cassandra-3.9.jar:3.9]
at org.apache.cassandra.gms.Gossiper$GossipTask.run(Gossiper.java:181) [apache-cassandra-3.9.jar:3.9]
at org.apache.cassandra.concurrent.DebuggableScheduledThreadPoolExecutor$UncomplainingRunnable.run(DebuggableScheduledThreadPoolExecutor.java:118) [apache-cassandra-3.9.jar:3.9]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_91]
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [na:1.8.0_91]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_91]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [na:1.8.0_91]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_91]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_91]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_91]
One the other hand if I look at the logs for the nodes claimed to be dead, I see one of 3 symptoms.
Either the node claims to have successfully sent the requested
merkle tree over.
The node does not have any trace of the repair session, and thus doesn't appear to have received any repair request.
The node shows exception like this.
ERROR [ValidationExecutor:3] 2018-02-16 23:29:06,548 Validator.java:261 - Failed creating a merkle tree for [repair #cac2bf50-1342-11e8-8d73-6767c6890f70 on somekeyspace/sometable, [(-3531087107126953137,-3495591103116433105], (1424707151780052485,1425479237398192865], (-3533012126945497873,-3531087107126953137], (1425479237398192865,1429220273719165251], (-4991682772598302168,-4984938905452900436], (-7686750611814623539,-7685228552629222537], (7554301216433235881,7559623046999138658], (334796420453180909,342318143371667659], (-3538876023288368831,-3533012126945497873], (1409514567521922418,1424707151780052485], (5391546013321073004,5393284101537339558], (590921410556013711,593440512568877190]]], /..**.43 (see log for details)
ERROR [ValidationExecutor:3] 2018-02-16 23:29:06,549 CassandraDaemon.java:226 - Exception in thread Thread[ValidationExecutor:3,1,main]
java.lang.RuntimeException: Parent repair session with id = c8bf7540-1342-11e8-8d73-6767c6890f70 has failed.
at org.apache.cassandra.service.ActiveRepairService.getParentRepairSession(ActiveRepairService.java:377) ~[apache-cassandra-3.9.jar:3.9]
at org.apache.cassandra.db.compaction.CompactionManager.getSSTablesToValidate(CompactionManager.java:1313) ~[apache-cassandra-3.9.jar:3.9]
at org.apache.cassandra.db.compaction.CompactionManager.doValidationCompaction(CompactionManager.java:1222) ~[apache-cassandra-3.9.jar:3.9]
at org.apache.cassandra.db.compaction.CompactionManager.access$700(CompactionManager.java:81) ~[apache-cassandra-3.9.jar:3.9]
at org.apache.cassandra.db.compaction.CompactionManager$11.call(CompactionManager.java:844) ~[apache-cassandra-3.9.jar:3.9]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_91]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_91]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_91]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_91]
Is this a known issue?

Related

Using pubsub lite library in spark getting error

I am getting error while publishing message to gcp pubsub lite using spark structured streaming.
I cannot use writestream as I want to use it in forEachBatch sink in spark so I am using foreachpartition and foreach and publishing message inside foreach for each dataframe row.
Below is error I get , some messages get published but in some I can see below exception:
2022-06-07 10:08:17 WARN PartitionCountWatcherImpl:101 - Failed to refresh partition count
com.google.api.gax.rpc.ApiException:
at com.google.cloud.pubsublite.internal.CheckedApiException.<init>(CheckedApiException.java:51)
at com.google.cloud.pubsublite.internal.CheckedApiException.<init>(CheckedApiException.java:55)
at com.google.cloud.pubsublite.internal.ExtractStatus.toCanonical(ExtractStatus.java:49)
at com.google.cloud.pubsublite.internal.wire.PartitionCountWatcherImpl.pollTopicConfig(PartitionCountWatcherImpl.java:92)
at com.google.cloud.pubsublite.internal.wire.PartitionCountWatcherImpl.onAlarm(PartitionCountWatcherImpl.java:71)
at com.google.cloud.pubsublite.internal.AlarmFactory.lambda$null$0(AlarmFactory.java:41)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.InterruptedException
at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:456)
at com.google.common.util.concurrent.FluentFuture$TrustedFuture.get(FluentFuture.java:100)
at com.google.common.util.concurrent.ForwardingFuture.get(ForwardingFuture.java:73)
at com.google.cloud.pubsublite.internal.wire.PartitionCountWatcherImpl.pollTopicConfig(PartitionCountWatcherImpl.java:81)
... 9 more

Logstash + Azure Events Hubs

Trying to follow the link to add azure event into logstash, I have the below issue:
[2020-02-13T14:06:28,886][ERROR][com.microsoft.azure.eventprocessorhost.PartitionManager] host logstash-5fdbcee8-e368-44de-bc13-c640a36f646f: Exception while initializing stores, not starting partition manager com.microsoft.azure.eventhubs.IllegalEntityException: Failure getting partition ids for event hub
at com.microsoft.azure.eventprocessorhost.PartitionManager.lambda$cachePartitionIds$4(PartitionManager.java:80) ~[azure-eventhubs-eph-2.1.0.jar:?]
at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836) ~[?:1.8.0_242]
at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811) ~[?:1.8.0_242]
at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456) [?:1.8.0_242]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_242]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_242]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_242]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [?:1.8.0_242]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_242]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_242]
Can someone help ?
I got the hint from this question. It appears the consumer SAS policy still needs manage privileges.

When would ShuffleBlockFetcherIterator throw "Failed to get block(s)" exceptions?

In my spark application which is run in a cluster mode, I get below exception. I know somehow this coud be due to emery issue. But as the error says, it can not connect to a node. But I ma sure the node is available and it can be connected. Can anyone know what is the main cause of this error and how to resolve it?
17/10/31 17:10:54 ERROR ShuffleBlockFetcherIterator: Failed to get block(s) from AUPER01-02-10-12-0.prod.vroc.com.au:36787
java.io.IOException: Failed to connect to AUPER01-02-10-12-0.prod.vroc.com.au/192.168.11.22:36787
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:232)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:182)
at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:97)
at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:141)
at org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
at org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:171)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
at java.lang.Thread.run(Thread.java:745)
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: AUPER01-02-10-12-0.prod.vroc.com.au/192.168.11.22:36787
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:257)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:291)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:631)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
... 2 more
It appears that one of the executors died while the other executors tried to pull blocks from earlier shuffle stages to complete a Spark job.
Right after you've spark-submited a Spark application to a cluster, the application gets a set of machines for executors. They are responsible for executing tasks and caching their results (in memory and/or disk).
Every executor has its own BlockManager that is responsible for managing datasets (as blocks).
The BlockManagers in a Spark application have all to be available or the Spark application will re-trigger task execution.
ShuffleBlockFetcherIterator is a Scala Iterator that fetches multiple shuffle blocks (aka shuffle map outputs) from local and remote BlockManagers.

After creating a new jhipster project, unable to launch the application

After creating a jhipster project, tried with the following command.
**mvnw
I am getting the following error. For existing project also, i am facing the same issue.
Error :
com.netflix.discovery.shared.transport.TransportException: Cannot execute request on any known server
at com.netflix.discovery.shared.transport.decorator.RetryableEurekaHttpClient.execute(RetryableEurekaHttpClient.java:111)
at com.netflix.discovery.shared.transport.decorator.EurekaHttpClientDecorator.register(EurekaHttpClientDecorator.java:56)
at com.netflix.discovery.shared.transport.decorator.EurekaHttpClientDecorator$1.execute(EurekaHttpClientDecorator.java:59)
at com.netflix.discovery.shared.transport.decorator.SessionedEurekaHttpClient.execute(SessionedEurekaHttpClient.java:77)
at com.netflix.discovery.shared.transport.decorator.EurekaHttpClientDecorator.register(EurekaHttpClientDecorator.java:56)
at com.netflix.discovery.DiscoveryClient.register(DiscoveryClient.java:815)
at com.netflix.discovery.InstanceInfoReplicator.run(InstanceInfoReplicator.java:104)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Your question is not detailed enough but it is obvious that you are using a microservice achitecture and did not start the registry, check documentation.

Dataproc Spark Streaming Kafka checkpointing warning on Google Cloud Storage

I've got a lot of warning when using Dataproc 1.1 (Spark 2.0.2) with Kafka checkpointing on Google Cloud Storage. I've got the following warn :
16/12/11 01:36:02 WARN HttpTransport: exception thrown while executing request
java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:170)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
at sun.security.ssl.InputRecord.read(InputRecord.java:503)
at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973)
at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:930)
at sun.security.ssl.AppInputStream.read(AppInputStream.java:105)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:704)
at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:647)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1569)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1474)
at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
at sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:338)
at com.google.api.client.http.javanet.NetHttpResponse.<init>(NetHttpResponse.java:37)
at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:94)
at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:972)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:419)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.listStorageObjectsAndPrefixes(GoogleCloudStorageImpl.java:1069)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.listObjectNames(GoogleCloudStorageImpl.java:1173)
at com.google.cloud.hadoop.gcsio.ForwardingGoogleCloudStorage.listObjectNames(ForwardingGoogleCloudStorage.java:182)
at com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage.listObjectNames(CacheSupplementedGoogleCloudStorage.java:381)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.getInferredItemInfo(GoogleCloudStorageFileSystem.java:1286)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.getInferredItemInfos(GoogleCloudStorageFileSystem.java:1311)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.getFileInfos(GoogleCloudStorageFileSystem.java:1212)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.rename(GoogleCloudStorageFileSystem.java:640)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.rename(GoogleHadoopFileSystemBase.java:1091)
at org.apache.spark.streaming.CheckpointWriter$CheckpointWriteHandler.run(Checkpoint.scala:241)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
This goes on several times and eventually just block our spark streaming job on a task that goes on. I've got other warning too before :
16/12/10 18:05:23 WARN ReceivedBlockTracker: Exception thrown while writing record: BatchCleanupEvent(ArrayBuffer()) to the WriteAheadLog.
org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:194)
at org.apache.spark.streaming.util.BatchedWriteAheadLog.write(BatchedWriteAheadLog.scala:83)
at org.apache.spark.streaming.scheduler.ReceivedBlockTracker.writeToLog(ReceivedBlockTracker.scala:234)
at org.apache.spark.streaming.scheduler.ReceivedBlockTracker.cleanupOldBatches(ReceivedBlockTracker.scala:171)
at org.apache.spark.streaming.scheduler.ReceiverTracker.cleanupOldBlocksAndBatches(ReceiverTracker.scala:226)
at org.apache.spark.streaming.scheduler.JobGenerator.clearCheckpointData(JobGenerator.scala:287)
at org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:187)
at org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:89)
at org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:88)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [5000 milliseconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:190)
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:190)
... 9 more
16/12/10 18:05:23 WARN ReceivedBlockTracker: Failed to acknowledge batch clean up in the Write Ahead Log.
Does anyone have the same issues ?
Regards,
I faced similar errors in checkpointing to google storage recently. I started checkpointing to hdfs in dataproc rather than google storage as a temporary workaround.

Resources