Spark Structured Streaming Blue/Green Deployments - apache-spark

We'd like to be able to deploy our Spark jobs such that there isn't any downtime in processing data during deployments (currently there's about a 2-3 minute window). In my mind, the easiest way to do this is to simulate the "blue/green deployment" philosophy, which is to spin up the new version of the Spark job, let it warm up, then shut down the old job. However, with structured streaming & checkpointing, we cannot do this because the new Spark job sees that the latest checkpoint file already exists (from the old job). I've attached a sample error below. Does anyone have any thoughts on a potential workaround?
I thought about copying over the existing checkpoint directory to another checkpoint directory for the newly created job - while that should work as a workaround (some data might get reprocessed, but our DB should deduplicate), this seems super hacky and something I'd rather not pursue.
Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: rename destination /user/checkpoint/job/offsets/3472939 already exists
at org.apache.hadoop.hdfs.server.namenode.FSDirRenameOp.validateOverwrite(FSDirRenameOp.java:520)
at org.apache.hadoop.hdfs.server.namenode.FSDirRenameOp.unprotectedRenameTo(FSDirRenameOp.java:364)
at org.apache.hadoop.hdfs.server.namenode.FSDirRenameOp.renameTo(FSDirRenameOp.java:282)
at org.apache.hadoop.hdfs.server.namenode.FSDirRenameOp.renameToInt(FSDirRenameOp.java:247)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renameTo(FSNamesystem.java:3677)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.rename2(NameNodeRpcServer.java:914)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.rename2(ClientNamenodeProtocolServerSideTranslatorPB.java:587)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2045)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
at org.apache.hadoop.hdfs.DFSClient.rename(DFSClient.java:1991)
at org.apache.hadoop.fs.Hdfs.renameInternal(Hdfs.java:335)
at org.apache.hadoop.fs.AbstractFileSystem.rename(AbstractFileSystem.java:678)
at org.apache.hadoop.fs.FileContext.rename(FileContext.java:958)
at org.apache.spark.sql.execution.streaming.HDFSMetadataLog$FileContextManager.rename(HDFSMetadataLog.scala:356)
at org.apache.spark.sql.execution.streaming.HDFSMetadataLog.org$apache$spark$sql$execution$streaming$HDFSMetadataLog$$writeBatch(HDFSMetadataLog.scala:160)
... 20 more
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.fs.FileAlreadyExistsException): rename destination /user/checkpoint/job/offsets/3472939 already exists

It is possible, but it will add some complexity to your application. Starting streams is in general fast, so it is fair to assume, that delay is caused by initialization of static objects and dependencies. In that case you'll need only SparkContext / SparkSession, and no streaming dependencies so process can be described as:
Start new Spark application.
Initialize batch-oriented objects.
Pass message to the previous application to step down.
Wait for confirmation.
Start streams.
At the very high level, the happy path could be visualized as:
Since it is very generic pattern it could be implemented in a different ways, depending on a language and infrastructure:
Lightweight messaging queue like ØMQ.
Passing messages through distributed file system.
Placing applications in an interactive context (Apache Toree, Apache Livy) and using external client for orchestration.

Related

Spark structures streaming too many threads with checkpointing on S3

Spark 3.0.1
hadoop-aws 3.2.0
I have a simple spark streaming application that reads messages from Kafka topic, aggregates them and writes into Elasticsearch. I am using checkpointing and an S3 bucket to store them.
After some time application started to fail with the following exception:
[476.099s][warning][os,thread] Failed to start thread - pthread_create failed (EAGAIN) for attributes: stacksize: 1024k, guardsize: 0k, detached.
Error in TaskCompletionListener
java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
at java.base/java.lang.Thread.start0(Native Method)
at java.base/java.lang.Thread.start(Thread.java:801)
at java.base/java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:939)
at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1345)
at com.google.common.util.concurrent.MoreExecutors$ListeningDecorator.execute(MoreExecutors.java:480)
at com.google.common.util.concurrent.AbstractListeningExecutorService.submit(AbstractListeningExecutorService.java:61)
at com.google.common.util.concurrent.ForwardingListeningExecutorService.submit(ForwardingListeningExecutorService.java:40)
at org.apache.hadoop.util.SemaphoredDelegatingExecutor.submit(SemaphoredDelegatingExecutor.java:112)
at com.google.common.util.concurrent.ForwardingListeningExecutorService.submit(ForwardingListeningExecutorService.java:40)
at org.apache.hadoop.util.SemaphoredDelegatingExecutor.submit(SemaphoredDelegatingExecutor.java:112)
at org.apache.hadoop.fs.s3a.S3ABlockOutputStream.putObject(S3ABlockOutputStream.java:434)
at org.apache.hadoop.fs.s3a.S3ABlockOutputStream.close(S3ABlockOutputStream.java:365)
at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:101)
at org.apache.spark.sql.execution.streaming.CheckpointFileManager$RenameBasedFSDataOutputStream.cancel(CheckpointFileManager.scala:163)
at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$cancelDeltaFile(HDFSBackedStateStoreProvider.scala:507)
at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$HDFSBackedStateStore.abort(HDFSBackedStateStoreProvider.scala:150)
at org.apache.spark.sql.execution.streaming.state.package$StateStoreOps.$anonfun$mapPartitionsWithStateStore$2(package.scala:65)
at org.apache.spark.sql.execution.streaming.state.package$StateStoreOps.$anonfun$mapPartitionsWithStateStore$2$adapted(package.scala:64)
at org.apache.spark.TaskContext$$anon$1.onTaskCompletion(TaskContext.scala:125)
at org.apache.spark.TaskContextImpl.$anonfun$markTaskCompleted$1(TaskContextImpl.scala:124)
at org.apache.spark.TaskContextImpl.$anonfun$markTaskCompleted$1$adapted(TaskContextImpl.scala:124)
at org.apache.spark.TaskContextImpl.$anonfun$invokeListeners$1(TaskContextImpl.scala:137)
at org.apache.spark.TaskContextImpl.$anonfun$invokeListeners$1$adapted(TaskContextImpl.scala:135)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.TaskContextImpl.invokeListeners(TaskContextImpl.scala:135)
at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:124)
at org.apache.spark.scheduler.Task.run(Task.scala:143)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
at java.base/java.lang.Thread.run(Thread.java:832)
VisualVM shows, that amount of threads rising from the beginning until it reaches the max (~4.8K):
image
And the majority of them are:
s3a-transfer-unbounded-poolXXX-tXX
s3a-transfer-shared-poolXXX-tXX
As I understood, the only place where these threads pools are created is
org.apache.hadoop.fs.s3a.S3AFileSystem#initialize
and Spark creates new filesystem every time
org.apache.spark.sql.execution.streaming.StreamMetadata#write
is called.
Why it is so? How can I prevent this thread creation?
you can't stop those threads being created as the thread pool is needed for the AWS transfer manager, which is in the AWS library. When S3A's close() method is called it shuts down the transfer manager, and the thread pool. Which means: the problem is that spark isn't closing down the FS instances.
Make sure you don't have caching of the FS instances disabled, e.g. fs.s3a.impl.disable.cache MUST be false. That is the default -so work out where it's being change and stop it.
spark.hadoop.fs.s3a.impl.disable.cache false

Elasticsearch write using Spark

I'm creating a document collection in Spark as an RDD and using the Spark read/write library from Elasticsearch. The Cluster that creates the collection is large so when it writes to ES I get the errors below indicating ES is overloaded, which does not surprise me. This does not seem to fail the job. The tasks may being retried and eventually succeed. In the Spark GUI the job is reported as having finishing successfully.
is there a way to somehow throttle the ES writing lib to avoid the retries (I can't change the cluster size)?
Do these errors mean that some data was not written to the index?
Here is one of many reported task failure errors, but again no job failure is reported:
2017-03-20 10:48:27,745 WARN org.apache.spark.scheduler.TaskSetManager [task-result-getter-2] - Lost task 568.1 in stage 81.0 (TID 18982, ip-172-16-2-76.ec2.internal): org.apache.spark.util.TaskCompletionListenerException: Could not write all entries [41/87360] (maybe ES was overloaded?). Bailing out...
at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:112)
at org.apache.spark.scheduler.Task.run(Task.scala:102)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
The lib I'm using is
org.elasticsearch" % "elasticsearch-spark_2.10" % "2.1.2"
Can you follow this link - https://www.elastic.co/guide/en/elasticsearch/hadoop/current/spark.html
In the Spark conf property or in your elastic search property you need to increase the max number of records which can be dumped in a single post, and that should solve your problem.

FileNotFoundException in apache spark(1.6) job during shuffle files

I am working on spark 1.6, it is failing my job with following error
java.io.FileNotFoundException: /data/05/dfs/dn/yarn/nm/usercache/willir31/appcache/application_1413512480649_0108/spark-local-20141028214722-43f1/26/shuffle_0_312_0.index (No such file or directory)
java.io.FileOutputStream.open(Native Method)
java.io.FileOutputStream.(FileOutputStream.java:221)
org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:123)
org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:192)
org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$4$$anonfun$apply$2.apply(ExternalSorter.scala:733)
org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$4$$anonfun$apply$2.apply(ExternalSorter.scala:732)
scala.collection.Iterator$class.foreach(Iterator.scala:727)
org.apache.spark.util.collection.ExternalSorter$IteratorForPartition.foreach(ExternalSorter.scala:790)
org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$4.apply(ExternalSorter.scala:732)
org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$4.apply(ExternalSorter.scala:728)
scala.collection.Iterator$class.foreach(Iterator.scala:727)
scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:728)
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:70)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
I am performing join operations. When i carefully look into the error and check my code i found it is failing while it is writing back to CSV from dataFrame. But i am not able to get rid of it. I am not using hdp, i have separate installation for all components.
This types of errors typically occur when there are deeper problems with some tasks, like significant data skew. Since you don't provide enough details (please be sure to read How To Ask and How to create a Minimal, Complete, and Verifiable example) and job statistics the only approach that I can think off is to significantly increase number of shuffle partitions:
sqlContext.setConf("spark.sql.shuffle.partitions", 2048)

Decommission cassandra node times out with "received only 0 responses"

When I try to decommission a node in my Cassandra cluster, the process starts (I see active streams flowing from the node to decommission to the other nodes in the cluster (using vnodes)), but then after a little delay nodetool decommission exists with the following error message.
I can repeatedly run nodetool decommission and it will start streaming data to other nodes, but so far always exists with the below error.
Why am I seeing this, and is there a way I can safely decommission this node?
Exception in thread "main" java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
at org.apache.cassandra.db.HintedHandOffManager.getHintsSlice(HintedHandOffManager.java:578)
at org.apache.cassandra.db.HintedHandOffManager.listEndpointsPendingHints(HintedHandOffManager.java:528)
at org.apache.cassandra.service.StorageService.streamHints(StorageService.java:2854)
at org.apache.cassandra.service.StorageService.unbootstrap(StorageService.java:2834)
at org.apache.cassandra.service.StorageService.decommission(StorageService.java:2795)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:93)
at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:27)
at com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(MBeanIntrospector.java:208)
at com.sun.jmx.mbeanserver.PerInterface.invoke(PerInterface.java:120)
at com.sun.jmx.mbeanserver.MBeanSupport.invoke(MBeanSupport.java:262)
at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanServerInterceptor.java:836)
at com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:761)
at javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1454)
at javax.management.remote.rmi.RMIConnectionImpl.access$300(RMIConnectionImpl.java:74)
at javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1295)
at javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1387)
at javax.management.remote.rmi.RMIConnectionImpl.invoke(RMIConnectionImpl.java:818)
at sun.reflect.GeneratedMethodAccessor28.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:303)
at sun.rmi.transport.Transport$1.run(Transport.java:159)
at java.security.AccessController.doPrivileged(Native Method)
at sun.rmi.transport.Transport.serviceCall(Transport.java:155)
at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:535)
at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:790)
at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:649)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
at org.apache.cassandra.service.ReadCallback.get(ReadCallback.java:100)
at org.apache.cassandra.service.StorageProxy.getRangeSlice(StorageProxy.java:1213)
at org.apache.cassandra.db.HintedHandOffManager.getHintsSlice(HintedHandOffManager.java:573)
... 33 more
The hinted handoff manager is checking for hints to see if it needs to pass those off during
the decommission so that the hints don't get lost. You most likely have a lot of hints, or
a bunch of tombstones, or something in the table causing the query to timeout. You aren't
seeing any other exceptions in your logs before the timeout are you? Raising the read timeout
period on your nodes before you decommission them, or manually deleting the hints CF, should
most likely get your past this. If you delete them, you would then want to make sure you
ran a full cluster repair when you are done with all of your decommissions, to propagate data
from any hints you deleted.
The short answer is that the node I was trying to decommission was underpowered for the amount of data it held. As of this writing there seems to be a reasonable hard minimum of resources needed to handle nodes with arbitrary amounts of data, which seems to be somewhere in the neighborhood of what an AWS i2.2xlarge provides. In particular, the old m1 instances let you get into trouble by allowing you to store far more data on each node than the memory and compute resources available can support on it.

Cassandra 1.1 or 1.2 for production usage?

We are encountering random SSTable corruptions with 1.2.3/1.2.4 (Datastax Community Edition) on single node development machines with a mixed read/write load using a data model with wide rows from a number of columns POV. Writes are more frequent than reads though. The problems manifests with stack traces like:
ERROR [ReadStage:13899] 2013-04-24 07:09:00,770 CassandraDaemon.java (line 132) Exception in thread Thread[ReadStage:13899,5,main]
java.lang.RuntimeException: org.apache.cassandra.io.sstable.CorruptSSTableException: java.io.EOFException
at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:1582)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Caused by: org.apache.cassandra.io.sstable.CorruptSSTableException: java.io.EOFException
at org.apache.cassandra.db.columniterator.SimpleSliceReader.computeNext(SimpleSliceReader.java:106)
... many more
Caused by: java.io.EOFException
at java.io.RandomAccessFile.readFully(Unknown Source)
... many more
or
java.lang.RuntimeException: org.apache.cassandra.io.sstable.CorruptSSTableException: org.apache.cassandra.db.ColumnSerializer$CorruptColumnException: invalid column name length 0
Unfortunately, we don't have a reproducible test case yet, because this happens randomly (e.g. after a few days) and not immediately.
I have also researched similar issues with 1.2 in this/other forum(s).
The question is: What is your experience with Cassandra 1.2 in production or would you recommend 1.1 being 1.2.4 the most recent release to date in the 1.2 series?
While we encounter these issues on single node development environments, things might get backed up when running the whole stuff in a cluster served by several nodes, but in our opinion things should run on a single node without corruption as well.
Any hints are much appreciated. Thanks.
I have better experience with cassandra-1.1 in production. Current version 1.2.6 still do not passes our heavy preproduction testing.

Resources