Dataflow job stuck at reading from Pub/Sub - python-3.x

Our SDK version is Apache Beam Python 3.7 SDK 2.25.0
There is a pipeline which reads data from Pub/Sub, transforms it and saves results to GCS.
Usually it works fine for 1-2 weeks. After that it gets stuck:
"Operation ongoing in step s01 for at least 05m00s without outputting or completing in state process
at sun.misc.Unsafe.park(Native Method)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
at org.apache.beam.runners.dataflow.worker.fn.data.RemoteGrpcPortWriteOperation.maybeWait(RemoteGrpcPortWriteOperation.java:175)
at org.apache.beam.runners.dataflow.worker.fn.data.RemoteGrpcPortWriteOperation.process(RemoteGrpcPortWriteOperation.java:196)
at org.apache.beam.runners.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:49)
at org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:201)
at org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.start(ReadOperation.java:159)
at org.apache.beam.runners.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:77)
at org.apache.beam.runners.dataflow.worker.fn.control.BeamFnMapTaskExecutor.execute(BeamFnMapTaskExecutor.java:123)
at org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:1400)
at org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.access$1100(StreamingDataflowWorker.java:156)
at org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker$7.run(StreamingDataflowWorker.java:1101)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Step 01 is just a "Read PubSub Messages" >> beam.io.ReadFromPubSub(subscription=subscription)
After this, Dataflow increases the number of workers and stops processing any new data, while the job stays in the RUNNING state.
Restarting the job resolves it, but this happens roughly every two weeks.
How can we fix it?

This looks like an issue with the legacy "Java Runner Harness." I would suggest running your pipeline with Dataflow Runner v2 to avoid these kinds of issues. You could also wait until it becomes the default (it is currently rolling out).
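For reference, here is a minimal sketch of opting a Python pipeline into Runner v2 via the documented use_runner_v2 experiment; the project, region, bucket, and subscription names are placeholders rather than the asker's real values:
# Hedged sketch: enabling Dataflow Runner v2 for a streaming Python pipeline.
# All resource names below are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import DebugOptions, PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    streaming=True,
    project="my-project",                # placeholder
    region="us-central1",                # placeholder
    temp_location="gs://my-bucket/tmp",  # placeholder
)
options.view_as(DebugOptions).add_experiment("use_runner_v2")

subscription = "projects/my-project/subscriptions/my-sub"  # placeholder

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read PubSub Messages" >> beam.io.ReadFromPubSub(subscription=subscription)
        | "Decode" >> beam.Map(lambda msg: msg.decode("utf-8"))
        # ... the original transform and GCS-write steps would follow here ...
    )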

Related

Spark 3.2 on Kubernetes keeps throwing okhttp3/okio EOFException

I'm using a Spark 3.2.1 image that was built from the official distribution via `docker-image-tool.sh`, on a Kubernetes 1.18 cluster. Everything works fine, except for this error message every 90 seconds:
WARN WatcherWebSocketListener: Exec Failure
java.io.EOFException
at okio.RealBufferedSource.require(RealBufferedSource.java:61)
at okio.RealBufferedSource.readByte(RealBufferedSource.java:74)
at okhttp3.internal.ws.WebSocketReader.readHeader(WebSocketReader.java:117)
at okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:101)
at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274)
at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214)
at okhttp3.RealCall$AsyncCall.execute(RealCall.java:203)
at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
This error message does not affect the application, but it's really annoying, especially for Jupyter users, and the lack of detail makes it very hard to debug.
It appears with any submit variation - spark-submit, pyspark, spark-shell - and regardless of whether dynamic allocation is enabled or disabled.
I've found traces of it on the internet, but all occurrences were from older versions of Spark and were resolved by using a "newer" version of fabric8 (4.x). Spark 3.2.1 already uses fabric8 5.4.1.
I wonder if anyone else still sees this error in Spark 3.x and has a resolution.
Thanks.
Update:
This seems to be related to the Kubernetes cluster itself. After migrating to a new cluster this error was gone.

How to disable WARN WatcherWebSocketListener: Exec Failure java.io.EOFException when job finished in spark on kubernetes

I have set up a Jupyter notebook which runs a PySpark job on our Kubernetes cluster in Spark client mode; after some tinkering I managed to get the job to run.
The communication between the driver and the executors, however, does not terminate cleanly; more specifically, when the executors are finished, they do not seem to maintain communication with the driver - the driver in the Jupyter notebook keeps printing
22/01/05 09:55:53 WARN WatcherWebSocketListener: Exec Failure
java.io.EOFException
at okio.RealBufferedSource.require(RealBufferedSource.java:61)
at okio.RealBufferedSource.readByte(RealBufferedSource.java:74)
at okhttp3.internal.ws.WebSocketReader.readHeader(WebSocketReader.java:117)
at okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:101)
at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274)
at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214)
at okhttp3.RealCall$AsyncCall.execute(RealCall.java:203)
at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
One solution seems to be to kill the Spark context at the end of the notebook, but I guess this means the containers will need to be created anew each time, which does not really support a very interactive way of working.
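For reference, a minimal sketch of that workaround, assuming the notebook has already configured the Kubernetes master and container image elsewhere; it only tears the executors down and does not address the warning itself:
# Hedged sketch of the "stop the context" workaround mentioned above.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes the k8s master/image are configured elsewhere

# ... interactive work in the notebook ...

spark.stop()  # tears down the executor pods and their watch connections; a new session is needed afterwards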
Is it possible to disable this warning somehow?
I'm working with Spark v3.2.0.
Thanks for any tips.

Flink-Cassandra connector throws exception (flink-connector-cassandra_2.11-1.10.0)

I am trying to upgrade from Flink 1.7.2 to Flink 1.10 and I am having a problem with the Cassandra connector. Every time I start a job that uses it, the following exception is thrown:
com.datastax.driver.core.exceptions.TransportException: [/xx.xx.xx.xx] Error writing
at com.datastax.driver.core.Connection$10.operationComplete(Connection.java:550)
at com.datastax.driver.core.Connection$10.operationComplete(Connection.java:534)
at com.datastax.shaded.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:680)
at com.datastax.shaded.netty.util.concurrent.DefaultPromise.notifyLateListener(DefaultPromise.java:621)
at com.datastax.shaded.netty.util.concurrent.DefaultPromise.addListener(DefaultPromise.java:138)
at com.datastax.shaded.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:93)
at com.datastax.shaded.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:28)
at com.datastax.driver.core.Connection$Flusher.run(Connection.java:870)
at com.datastax.shaded.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:358)
at com.datastax.shaded.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
at com.datastax.shaded.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:112)
at java.lang.Thread.run(Thread.java:748)
Caused by: com.datastax.shaded.netty.handler.codec.EncoderException: java.lang.OutOfMemoryError: Direct buffer memory
at com.datastax.shaded.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:107)
at com.datastax.shaded.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:643)
Also the following message was printed when the job was run locally (not in YARN):
13:57:54,490 ERROR com.datastax.shaded.netty.util.ResourceLeakDetector - LEAK: You are creating too many HashedWheelTimer instances. HashedWheelTimer is a shared resource that must be reused across the JVM,so that only a few instances are created.
All jobs that do not use the Cassandra connector are working properly.
Can someone help?
UPDATE: The bug is still reproducible and I think this is the reason: https://issues.apache.org/jira/browse/FLINK-17493.
I had an old configuration (from Flink 1.7) where classloader.parent-first-patterns.additional: com.datastax. was set and my Cassandra-Flink connector jar was in the flink/lib folder (this was done because of other problems with shaded Netty I had with the Cassandra-Flink connector). After the migration to Flink 1.10, this setup triggered the problem above. Once I removed the classloader.parent-first-patterns.additional: com.datastax. setting, bundled flink-connector-cassandra_2.12-1.10.0.jar in my job jar, and removed it from /usr/lib/flink/lib/, the problem was no longer reproducible.

Spark on Google's Dataproc failed due to java.io.FileNotFoundException: /hadoop/yarn/nm-local-dir/usercache/root/appcache/

I've been using Spark/Hadoop on Dataproc for months, both via Zeppelin and the Dataproc console, but just recently I got the following error:
Caused by: java.io.FileNotFoundException: /hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1530998908050_0001/blockmgr-9d6a2308-0d52-40f5-8ef3-0abce2083a9c/21/temp_shuffle_3f65e1ca-ba48-4cb0-a2ae-7a81dcdcf466 (No such file or directory)
at java.io.FileOutputStream.open0(Native Method)
at java.io.FileOutputStream.open(FileOutputStream.java:270)
at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
at org.apache.spark.storage.DiskBlockObjectWriter.initialize(DiskBlockObjectWriter.scala:103)
at org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:116)
at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:237)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
First, I got this type of error in a Zeppelin notebook and thought it was a Zeppelin issue. The error, however, seems to occur randomly. I suspect it has something to do with one of the Spark workers not being able to write to that path. So I googled, and it was suggested to delete the files under /hadoop/yarn/nm-local-dir/usercache/ on each Spark worker and to check that there is available disk space on each worker. After doing so, I still sometimes had this error. I also ran a Spark job directly on Dataproc, and a similar error occurred. I'm on Dataproc image version 1.2.
thanks
Peeranat F.
OK. We faced the same issue on GCP, and the reason for it is resource preemption.
In GCP, resource preemption can happen via two strategies:
Node preemption - removing nodes from the cluster and replacing them
Container preemption - removing YARN containers
This is configured in GCP by your admin/DevOps person to optimize the cost and resource utilization of the cluster, especially if it is shared.
What your stack trace tells me is that it's node preemption. The error occurs randomly because sometimes the node that gets preempted is your driver node, which causes the app to fail altogether.
You can see which nodes are preemptible in your GCP console.
The following could be other possible causes:
The cluster uses preemptible workers (they can be deleted at any time), so their work is not completed, which can cause inconsistent behavior.
The nodes are resized during the Spark job execution, which causes tasks/containers/executors to restart.
Memory issues. Shuffle operations are usually done in memory, but if the memory resources are exceeded, data spills over to disk.
Disk space on the workers is full because of a large amount of shuffle operations, or because of any other process that uses disk on the workers, for example logs.
YARN kills tasks to make room for failed attempts.
So, I would summarize the following actions as possible workarounds:
1. Increase the memory of the workers and master; this will rule out memory problems.
2. Change the Dataproc image version.
3. Change the cluster properties to tune your cluster, especially for MapReduce and Spark (see the sketch below).
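As an illustration of point 3, a hedged sketch of setting a few Spark properties from the job side; the keys are standard Spark properties, but the values are placeholders, not recommendations for this particular cluster:
# Hedged sketch for workaround 3: tuning Spark properties programmatically.
# Values are illustrative placeholders. On Dataproc the same keys can also be
# set at cluster-creation time via --properties with the "spark:" prefix,
# or at submit time with --conf.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("shuffle-tuning-sketch")
    .config("spark.executor.memory", "8g")        # workaround 1: more executor memory
    .config("spark.shuffle.io.maxRetries", "10")  # retry longer before giving up on shuffle blocks
    .config("spark.shuffle.io.retryWait", "15s")
    .getOrCreate()
)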

Neo4J InvalidEpochException when writing to a multi-node cluster

I have a Neo4J Enterprise cluster with 6 nodes (1 master and 5 slaves) hosted in separate Linux (CentOS 6.4) VMs that I am testing an application with. The VMs are hosted in Azure. I have an object that manages connections between all 6 nodes using a simple round-robin like technique. I noticed that when writing to slave nodes, the following error message occurs several times in the messages.log file:
ERROR [o.n.k.h.c.m.MasterServer]: Could not finish off dead channel
org.neo4j.kernel.ha.com.master.InvalidEpochException: Invalid epoch 282880438682249, correct epoch is 282880443056723
at org.neo4j.kernel.ha.com.master.MasterImpl.assertCorrectEpoch(MasterImpl.java:218) ~[neo4j-ha-2.1.2.jar:2.1.2]
at org.neo4j.kernel.ha.com.master.MasterImpl.finishTransaction(MasterImpl.java:363) ~[neo4j-ha-2.1.2.jar:2.1.2]
at org.neo4j.kernel.ha.com.master.MasterServer.finishOffChannel(MasterServer.java:70) ~[neo4j-ha-2.1.2.jar:2.1.2]
at org.neo4j.com.Server.tryToFinishOffChannel(Server.java:411) ~[neo4j-com-2.1.2.jar:2.1.2]
at org.neo4j.com.Server$4.run(Server.java:589) [neo4j-com-2.1.2.jar:2.1.2]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [na:1.7.0_60]
at java.util.concurrent.FutureTask.run(FutureTask.java:262) [na:1.7.0_60]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_60]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_60]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_60]
I am at a loss. Why does this error occur? When this error occurs, the write fails. I have the servers synchronizing their time via NTP. The application writing to Neo4J is a .Net application using the Neo4JClient library. Thank you for any guidance you can provide.
Amir.
The InvalidEpochException can occur when a master switch happens while a slave is in the middle of committing a transaction. The main reason for unexpected master switches is lengthy GC pauses that exceed the cluster timeout settings.
So you need to analyze GC behaviour and optimize it, or tweak your cluster timeout settings.
