Getting an error when using the Pub/Sub Lite library in Spark - apache-spark

I am getting an error while publishing messages to GCP Pub/Sub Lite using Spark Structured Streaming.
I cannot use the writeStream connector because I want to do the publishing inside a foreachBatch sink in Spark, so I am using foreachPartition and foreach, and publishing a message inside foreach for each DataFrame row.
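Roughly, the pattern looks like the sketch below (simplified; RowPublisher and createPublisher are placeholders for my actual Pub/Sub Lite publisher setup, not real client code, and the rate source just stands in for my real input):

import org.apache.spark.sql.{DataFrame, Row, SparkSession}

// Placeholder for the Pub/Sub Lite publisher client used in the real job.
trait RowPublisher extends AutoCloseable {
  def publish(row: Row): Unit
}
def createPublisher(): RowPublisher = ??? // builds and starts the publisher

val spark = SparkSession.builder().appName("pubsub-lite-sink").getOrCreate()
val df: DataFrame = spark.readStream.format("rate").load() // stand-in source

// One publisher per partition, closed only after every row has been published.
val publishPartition = (rows: Iterator[Row]) => {
  val publisher = createPublisher()
  try rows.foreach(publisher.publish)
  finally publisher.close()
}

val publishBatch = (batch: DataFrame, batchId: Long) =>
  batch.foreachPartition(publishPartition)

df.writeStream.foreachBatch(publishBatch).start().awaitTermination()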
Below is the error I get. Some messages get published, but for others I see the following exception:
2022-06-07 10:08:17 WARN PartitionCountWatcherImpl:101 - Failed to refresh partition count
com.google.api.gax.rpc.ApiException:
at com.google.cloud.pubsublite.internal.CheckedApiException.<init>(CheckedApiException.java:51)
at com.google.cloud.pubsublite.internal.CheckedApiException.<init>(CheckedApiException.java:55)
at com.google.cloud.pubsublite.internal.ExtractStatus.toCanonical(ExtractStatus.java:49)
at com.google.cloud.pubsublite.internal.wire.PartitionCountWatcherImpl.pollTopicConfig(PartitionCountWatcherImpl.java:92)
at com.google.cloud.pubsublite.internal.wire.PartitionCountWatcherImpl.onAlarm(PartitionCountWatcherImpl.java:71)
at com.google.cloud.pubsublite.internal.AlarmFactory.lambda$null$0(AlarmFactory.java:41)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.InterruptedException
at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:456)
at com.google.common.util.concurrent.FluentFuture$TrustedFuture.get(FluentFuture.java:100)
at com.google.common.util.concurrent.ForwardingFuture.get(ForwardingFuture.java:73)
at com.google.cloud.pubsublite.internal.wire.PartitionCountWatcherImpl.pollTopicConfig(PartitionCountWatcherImpl.java:81)
... 9 more

Related

Got TimeoutException when trying to download a file from Azure Blob Storage

I'm trying to download a file from Azure Blob Storage with the following code:
BlobServiceClient blobServiceClient = new BlobServiceClientBuilder()
        .connectionString(connectionString)
        .buildClient();
// containerName: name of the target container (not shown in the original snippet)
BlobContainerClient blobContainerClient = blobServiceClient.getBlobContainerClient(containerName);
BlobClient b = blobContainerClient.getBlobClient(remotePath);
b.downloadToFile(localPath, true);
But sometimes I get this exception:
Caused by: java.util.concurrent.TimeoutException: Did not observe any item or terminal signal within 60000ms in 'map' (and no fallback has been configured)
at reactor.core.publisher.FluxTimeout$TimeoutMainSubscriber.handleTimeout(FluxTimeout.java:288)
at reactor.core.publisher.FluxTimeout$TimeoutMainSubscriber.doTimeout(FluxTimeout.java:273)
at reactor.core.publisher.FluxTimeout$TimeoutTimeoutSubscriber.onNext(FluxTimeout.java:390)
at reactor.core.publisher.StrictSubscriber.onNext(StrictSubscriber.java:89)
at reactor.core.publisher.FluxOnErrorResume$ResumeSubscriber.onNext(FluxOnErrorResume.java:73)
at reactor.core.publisher.MonoDelay$MonoDelayRunnable.run(MonoDelay.java:117)
at reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:50)
at reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:27)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Do we have any solution to make it stable?
Version:
<azure-storage-blob.version>12.6.0</azure-storage-blob.version>
<azure-core.version>1.3.0</azure-core.version>

How to define a local connection to Spark Thrift in Power BI

I am trying to configure the local connection to the Spark Thrift in Power BI. I am able to connect using Spark ODBC (localhost:10000 with mechanism User Name and Thrift transport SASL). But I would like to use Spark connector as it supports Direct Query.
I couldn't find how to define the connection string. I tried several things like localhost:10000/default/;transportMode=http;ssl=true;user=... but I always get the error
ERROR TThreadPoolServer:297 - Error occurred during processing of message.
java.lang.RuntimeException: org.apache.thrift.transport.TTransportException: Invalid status 80
at org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:219)
at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:269)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Caused by: org.apache.thrift.transport.TTransportException: Invalid status 80
at org.apache.thrift.transport.TSaslTransport.sendAndThrowMessage(TSaslTransport.java:232)
at org.apache.thrift.transport.TSaslTransport.receiveSaslMessage(TSaslTransport.java:184)
at org.apache.thrift.transport.TSaslServerTransport.handleSaslStartMessage(TSaslServerTransport.java:125)
at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:271)
at org.apache.thrift.transport.TSaslServerTransport.open(TSaslServerTransport.java:41)
at org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:216)
... 4 more
Any hints would be appreciated!
Solved. As explained here https://community.powerbi.com/t5/Desktop/Connect-Power-BI-to-Hadoop-Direct-query-HDFS-vs-Spark-vs-custom/td-p/374625
it just doesn't work in the Power BI app from the Microsoft Store. It works in the app downloaded from the website.

Could not get a Transport from the Transport Pool for host

I'm trying to write to an IBM Compose Elasticsearch sink from Spark Structured Streaming on IBM Analytics Engine. My Spark code:
dataDf
.writeStream
.outputMode(OutputMode.Append)
.format("org.elasticsearch.spark.sql")
.queryName("ElasticSink")
.option("checkpointLocation", s"${s3Url}/checkpoint_elasticsearch")
.option("es.nodes", "xxx1.composedb.com,xxx2.composedb.com")
.option("es.port", "xxxx")
.option("es.net.http.auth.user", "admin")
.option("es.net.http.auth.pass", "xxxx")
.option("es.net.ssl", true)
.option("es.nodes.wan.only", true)
.option("es.net.ssl.truststore.location", SparkFiles.getRootDirectory() + "/my.jks")
.option("es.net.ssl.truststore.pass", "xxxx")
.start("test/broadcast")
However, I'm receiving the following exception:
org.elasticsearch.hadoop.EsHadoopException: Could not get a Transport from the Transport Pool for host [xxx2.composedb.com:xxxx]
at org.elasticsearch.hadoop.rest.pooling.PooledHttpTransportFactory.borrowFrom(PooledHttpTransportFactory.java:106)
at org.elasticsearch.hadoop.rest.pooling.PooledHttpTransportFactory.create(PooledHttpTransportFactory.java:55)
at org.elasticsearch.hadoop.rest.NetworkClient.selectNextNode(NetworkClient.java:99)
at org.elasticsearch.hadoop.rest.NetworkClient.<init>(NetworkClient.java:82)
at org.elasticsearch.hadoop.rest.NetworkClient.<init>(NetworkClient.java:59)
at org.elasticsearch.hadoop.rest.RestClient.<init>(RestClient.java:94)
at org.elasticsearch.hadoop.rest.InitializationUtils.discoverEsVersion(InitializationUtils.java:317)
at org.elasticsearch.hadoop.rest.RestService.createWriter(RestService.java:576)
at org.elasticsearch.spark.rdd.EsRDDWriter.write(EsRDDWriter.scala:58)
at org.elasticsearch.spark.sql.streaming.EsStreamQueryWriter.run(EsStreamQueryWriter.scala:41)
at org.elasticsearch.spark.sql.streaming.EsSparkSqlStreamingSink$$anonfun$addBatch$2$$anonfun$2.apply(EsSparkSqlStreamingSink.scala:52)
at org.elasticsearch.spark.sql.streaming.EsSparkSqlStreamingSink$$anonfun$addBatch$2$$anonfun$2.apply(EsSparkSqlStreamingSink.scala:51)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Any ideas?
I modified the Elasticsearch Hadoop library to output the exception, and the underlying problem was the truststore not being found:
org.elasticsearch.hadoop.EsHadoopIllegalStateException: Cannot initialize SSL - Expected to find keystore file at [/tmp/spark-e2203f9c-4f0f-4929-870f-d491fce0ad06/userFiles-62df70b0-7b76-403d-80a1-8845fd67e6a0/my.jks] but was unable to. Make sure that it is available on the classpath, or if not, that you have specified a valid URI.
at org.elasticsearch.hadoop.rest.pooling.PooledHttpTransportFactory.borrowFrom(PooledHttpTransportFactory.java:106)
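In other words, the path built on the driver via SparkFiles.getRootDirectory() does not exist on the executors where the connection is actually opened. One way to address this (just a sketch, reusing the names from the snippet above, with a placeholder path) is to place the truststore at the same fixed location on every executor node and point the option at a file:// URI for it:

dataDf
  .writeStream
  .format("org.elasticsearch.spark.sql")
  .option("checkpointLocation", s"${s3Url}/checkpoint_elasticsearch")
  .option("es.nodes", "xxx1.composedb.com,xxx2.composedb.com")
  .option("es.port", "xxxx")
  .option("es.net.http.auth.user", "admin")
  .option("es.net.http.auth.pass", "xxxx")
  .option("es.net.ssl", true)
  .option("es.nodes.wan.only", true)
  // /etc/security/my.jks is a placeholder: the truststore must exist at this
  // exact path on every executor node, or the SSL setup fails as above.
  .option("es.net.ssl.truststore.location", "file:///etc/security/my.jks")
  .option("es.net.ssl.truststore.pass", "xxxx")
  .start("test/broadcast")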

When would ShuffleBlockFetcherIterator throw "Failed to get block(s)" exceptions?

In my Spark application, which runs in cluster mode, I get the exception below. I suspect this could somehow be due to a memory issue, but the error says it cannot connect to a node. I am sure the node is available and can be connected to. Does anyone know what the main cause of this error is and how to resolve it?
17/10/31 17:10:54 ERROR ShuffleBlockFetcherIterator: Failed to get block(s) from AUPER01-02-10-12-0.prod.vroc.com.au:36787
java.io.IOException: Failed to connect to AUPER01-02-10-12-0.prod.vroc.com.au/192.168.11.22:36787
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:232)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:182)
at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:97)
at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:141)
at org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
at org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:171)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
at java.lang.Thread.run(Thread.java:745)
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: AUPER01-02-10-12-0.prod.vroc.com.au/192.168.11.22:36787
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:257)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:291)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:631)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
... 2 more
It appears that one of the executors died while the other executors tried to pull blocks from earlier shuffle stages to complete a Spark job.
Right after you've spark-submitted a Spark application to a cluster, the application gets a set of machines for executors. They are responsible for executing tasks and caching their results (in memory and/or disk).
Every executor has its own BlockManager that is responsible for managing datasets (as blocks).
All the BlockManagers in a Spark application have to be available, or the Spark application will re-trigger task execution.
ShuffleBlockFetcherIterator is a Scala Iterator that fetches multiple shuffle blocks (aka shuffle map outputs) from local and remote BlockManagers.
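If the executor died from memory pressure, or the connection failures are transient, giving shuffle fetches more retries and the executors more memory overhead sometimes helps. The values below are only illustrative, not numbers tuned for this particular cluster:

import org.apache.spark.sql.SparkSession

// Illustrative values only; the right numbers depend on the workload and cluster.
val spark = SparkSession.builder()
  .appName("shuffle-tuning-sketch")
  .config("spark.shuffle.io.maxRetries", "10")   // default 3
  .config("spark.shuffle.io.retryWait", "30s")   // default 5s
  .config("spark.network.timeout", "300s")       // default 120s
  // On older YARN deployments this key is spark.yarn.executor.memoryOverhead.
  .config("spark.executor.memoryOverhead", "2g")
  .getOrCreate()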

Dataproc Spark Streaming Kafka checkpointing warning on Google Cloud Storage

I get a lot of warnings when using Dataproc 1.1 (Spark 2.0.2) with Kafka checkpointing on Google Cloud Storage. I get the following warning:
16/12/11 01:36:02 WARN HttpTransport: exception thrown while executing request
java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:170)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
at sun.security.ssl.InputRecord.read(InputRecord.java:503)
at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973)
at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:930)
at sun.security.ssl.AppInputStream.read(AppInputStream.java:105)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:704)
at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:647)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1569)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1474)
at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
at sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:338)
at com.google.api.client.http.javanet.NetHttpResponse.<init>(NetHttpResponse.java:37)
at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:94)
at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:972)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:419)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.listStorageObjectsAndPrefixes(GoogleCloudStorageImpl.java:1069)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.listObjectNames(GoogleCloudStorageImpl.java:1173)
at com.google.cloud.hadoop.gcsio.ForwardingGoogleCloudStorage.listObjectNames(ForwardingGoogleCloudStorage.java:182)
at com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage.listObjectNames(CacheSupplementedGoogleCloudStorage.java:381)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.getInferredItemInfo(GoogleCloudStorageFileSystem.java:1286)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.getInferredItemInfos(GoogleCloudStorageFileSystem.java:1311)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.getFileInfos(GoogleCloudStorageFileSystem.java:1212)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.rename(GoogleCloudStorageFileSystem.java:640)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.rename(GoogleHadoopFileSystemBase.java:1091)
at org.apache.spark.streaming.CheckpointWriter$CheckpointWriteHandler.run(Checkpoint.scala:241)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
This repeats several times and eventually just blocks our Spark Streaming job on a task that never finishes. I also got other warnings before that:
16/12/10 18:05:23 WARN ReceivedBlockTracker: Exception thrown while writing record: BatchCleanupEvent(ArrayBuffer()) to the WriteAheadLog.
org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:194)
at org.apache.spark.streaming.util.BatchedWriteAheadLog.write(BatchedWriteAheadLog.scala:83)
at org.apache.spark.streaming.scheduler.ReceivedBlockTracker.writeToLog(ReceivedBlockTracker.scala:234)
at org.apache.spark.streaming.scheduler.ReceivedBlockTracker.cleanupOldBatches(ReceivedBlockTracker.scala:171)
at org.apache.spark.streaming.scheduler.ReceiverTracker.cleanupOldBlocksAndBatches(ReceiverTracker.scala:226)
at org.apache.spark.streaming.scheduler.JobGenerator.clearCheckpointData(JobGenerator.scala:287)
at org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:187)
at org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:89)
at org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:88)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [5000 milliseconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:190)
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:190)
... 9 more
16/12/10 18:05:23 WARN ReceivedBlockTracker: Failed to acknowledge batch clean up in the Write Ahead Log.
Does anyone have the same issues?
Regards,
I faced similar errors when checkpointing to Google Cloud Storage recently. As a temporary workaround, I started checkpointing to HDFS in Dataproc rather than Google Cloud Storage.
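The change itself is just where the checkpoint directory points. A minimal sketch of that idea for a DStream job (the HDFS path is a placeholder, not the one I actually use):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("kafka-streaming")
val ssc = new StreamingContext(conf, Seconds(10))

// Checkpoint to HDFS on the Dataproc cluster instead of a gs:// URI;
// the Kafka DStream setup itself is unchanged and omitted here.
ssc.checkpoint("hdfs:///user/spark/checkpoints/kafka-streaming")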
