Failure when creating a table during job run on Databricks - apache-spark

I have been doing a left join on two tables in SQL (table A contains billions of rows and table B contains millions of rows) and creating a new table (table C) from the result of the join. I am using a XXL SQL Warehouse on Databricks with two workers to handle the compute of this operation.
The main issue is when I schedule this job to be performed it seems to fail every time. When I run it manually, the job finishes almost every time.
DROP TABLE IF EXISTS tableC;
CREATE TABLE tableC
AS (
SELECT *
FROM tableA
LEFT JOIN tableB
ON tableA.id = tableB.id
)
After running the SQL query as a scheduled job, I seem to get the error below. From my understanding the error occurs in the final stage of the job when it is writing the rows into a delta table, however I can't seem to figure out the reason for this failure. I have checked to make sure the data is not corrupted and I have enough memory. The main part that puzzles me is the fact that it works when the SQL command is run manually but always seems to fail when the query is apart of a scheduled job run.
Job aborted due to stage failure: ResultStage 9 (write at WriteIntoDeltaCommand.scala:70) has failed the maximum allowable number of times: 4. Most recent failure reason:
org.apache.spark.shuffle.FetchFailedException
at org.apache.spark.errors.SparkCoreErrors$.fetchFailedError(SparkCoreErrors.scala:315)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:1167)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.$anonfun$next$1(ShuffleBlockFetcherIterator.scala:905)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:736)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:85)
at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.sql.execution.RowToColumnarExec$$anon$1.hasNext(Columnar.scala:491)
at com.databricks.photon.CloseableIterator$$anon$5.hasNext(CloseableIterator.scala:62)
at com.databricks.photon.NativeColumnBatchIterator.hasNext(NativeColumnBatchIterator.java:47)
at 0x8235362 <photon>.HasNext(external/workspace_spark_3_3/photon/jni-wrappers/jni-native-column-batch-iterator.cc:62)
at 0x4707a79 <photon>.OpenImpl(external/workspace_spark_3_3/photon/exec-nodes/file-writer-node.cc:161)
at com.databricks.photon.JniApiImpl.open(Native Method)
at com.databricks.photon.JniApi.open(JniApi.scala)
at com.databricks.photon.JniExecNode.open(JniExecNode.java:64)
at com.databricks.photon.PhotonWriteStageExec.$anonfun$executeWrite$5(PhotonWriteStageExec.scala:87)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.photon.PhotonExec.timeit(PhotonExec.scala:300)
at com.databricks.photon.PhotonExec.timeit$(PhotonExec.scala:298)
at com.databricks.photon.PhotonWriteStageExec.timeit(PhotonWriteStageExec.scala:38)
at com.databricks.photon.PhotonWriteStageExec.$anonfun$executeWrite$4(PhotonWriteStageExec.scala:87)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1730)
at com.databricks.photon.PhotonWriteStageExec.$anonfun$executeWrite$2(PhotonWriteStageExec.scala:84)
at com.databricks.photon.PhotonExec.$anonfun$executePhoton$2(PhotonExec.scala:484)
at com.databricks.photon.PhotonExec.$anonfun$executePhoton$2$adapted(PhotonExec.scala:353)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndexInternal$2(RDD.scala:882)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndexInternal$2$adapted(RDD.scala:882)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:372)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:336)
at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:75)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:75)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:55)
at org.apache.spark.scheduler.Task.doRunTask(Task.scala:169)
at org.apache.spark.scheduler.Task.$anonfun$run$4(Task.scala:137)
at com.databricks.unity.EmptyHandle$.runWithAndClose(UCSHandle.scala:104)
at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:137)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.Task.run(Task.scala:96)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$13(Executor.scala:902)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1696)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:905)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:760)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.nio.channels.ClosedChannelException
at org.apache.spark.network.client.StreamInterceptor.channelInactive(StreamInterceptor.java:62)
at org.apache.spark.network.util.TransportFrameDecoder.channelInactive(TransportFrameDecoder.java:223)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1405)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:901)
at io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:813)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:469)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
... 1 more

Based on stacktrace the issue seems to be for two reasons
org.apache.spark.shuffle.FetchFailedException (operation failed)
Caused by: java.nio.channels.ClosedChannelException ( root cause )
The problem possibly is the default retry configuration / or / timeout configuration is not suitable for complex workloads.
Checkout this article and try the recommendations mainly under 'Network Timeout': section https://towardsdatascience.com/fetch-failed-exception-in-apache-spark-decrypting-the-most-common-causes-b8dff21075c

Related

What might cause a connection issue with Spark external shuffle service?

I'm running a Spark job on dataproc 1.4. with external shuffle service.
Some details to follow up on the comments :
The cluster doesn't use preemptible VMs
Autoscaling is off on this cluster and no manual scale down happened.
The job fails with the error below.
What might be the cause ?
org.apache.spark.shuffle.FetchFailedException: Connection from xxx.internal/10.81.0.46:7337 closed
at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:554)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:485)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:64)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:248)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:246)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:252)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:174)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:173)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:411)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:417)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Connection from xxx.internal/10.81.0.46:7337 closed
at org.apache.spark.network.client.TransportResponseHandler.channelInactive(TransportResponseHandler.java:146)
at org.apache.spark.network.server.TransportChannelHandler.channelInactive(TransportChannelHandler.java:108)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:81)
at io.netty.handler.timeout.IdleStateHandler.channelInactive(IdleStateHandler.java:277)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:81)
at org.apache.spark.network.util.TransportFrameDecoder.channelInactive(TransportFrameDecoder.java:178)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1405)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:901)
at io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:818)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
... 1 more
edit August, 21
It seems the error below is in fact a consequence of this one (7337 is the default port of the external shuffle service). It reinforces the hypothesis of #dagang : the Spark shuffle service may be overloaded.
"Connection to xxx:7337 has been quiet for 120000 ms while there are outstanding requests. Assuming connection is dead; please adjust spark.network.timeout if this is wrong."
The number of partitions is already minimized. To give an idea, there are 30,000 partitions and the cluster has 100 nodes.
So I'm thinking about 3 ways to mitigate the issue
Adding more nodes in the cluster, so each shuffle service will have fewer work to do
Stopping using the external shuffle service (as suggested here : https://cloud.google.com/architecture/hadoop/migrating-apache-spark-jobs-to-cloud-dataproc)
Increasing the configuration of spark.network.timeout
What do you think ?
Clusters with preemptible VMs or an autoscaling policy may receive FetchFailedException when workers are preempted or removed before they finish serving shuffle data to reducers.
Possible causes and fixes:
Preemptible worker VMs were reclaimed or non-preemptible worker VMs were removed by the autoscaler. Solution: Use Enhanced Flexibility Mode to make secondary workers safely preemptible or scalable.
Executor or mapper crashed due to OutOfMemory error. Solution: increase the memory of executor or mapper.
The Spark shuffle service may be overloaded. Solution: decrease the number of job partitions.
See more details in this doc.

Last task stuck forever... only sometimes

I have pySpark Job (spark 2.4.1) that seems to be working fine in 10% of the cases, and the other times seemingly stuck forever on a single task, where I can't really understand what is happening.
Here's what I'm doing in my pyspark code:
df = ss.read.parquet(...)
df2 = df.withColumn("A", my_python_udf(sf.col("position.latitude"))
print(df2.groupBy(sf.spark_partition_id()).count().agg(sf.min("count"), sf.max("count"), sf.avg("count")).toPandas())
I seem to be forever stuck in the evaluation of the "toPandas" call.
When I check the executors tab, only one executor is runnable with the following call stack:
java.net.SocketInputStream.socketRead0(Native Method)
java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
java.net.SocketInputStream.read(SocketInputStream.java:171)
java.net.SocketInputStream.read(SocketInputStream.java:141)
java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
java.io.BufferedInputStream.read(BufferedInputStream.java:345) => holding Monitor(java.io.BufferedInputStream#1118259716})
java.io.DataInputStream.readFully(DataInputStream.java:195)
java.io.DataInputStream.readFully(DataInputStream.java:169)
org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$1.read(PythonUDFRunner.scala:74)
org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$1.read(PythonUDFRunner.scala:64)
org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:406)
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.agg_doAggregateWithKeys_0$(Unknown Source)
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:187)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
org.apache.spark.scheduler.Task.run(Task.scala:121)
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:403)
org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:409)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:748)
I have several questions:
Why does it seems like the call stack is doing stuff related to the UDF evaluatin, which is not needed for my computation?
What is actually going on? I'm not sure if that thread is deadlocked or live from the call stack
How to fix this?
edit :
I also have 2 executors that are failing with the following error:
java.io.IOException: expected more bytes in input stream
at net.razorvine.pickle.PickleUtils.readbytes_into(PickleUtils.java:75)
at net.razorvine.pickle.PickleUtils.readbytes(PickleUtils.java:55)
at net.razorvine.pickle.Unpickler.load_binunicode(Unpickler.java:473)
at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:190)
at net.razorvine.pickle.Unpickler.load(Unpickler.java:99)
at net.razorvine.pickle.Unpickler.loads(Unpickler.java:112)
at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$evaluate$1.apply(BatchEvalPythonExec.scala:90)
at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$evaluate$1.apply(BatchEvalPythonExec.scala:89)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.agg_doAggregateWithKeys_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:187)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:403)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:409)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Which makes me believe that something outside of my code is going wrong
In my case, my PySpark jobs always (not randomly) got stuck in PythonUDFRunner waiting for data.
I found out it is waiting for data coming from a PySpark daemon which is used for launching python code (Python UDFs) without wasting loads of memory while forking a Python process (UDF) from a Java process (Spark) More information in this link: pyspark daemon
So, roughly speaking, the Python UDF is serialized by Spark and sent to this daemon which is in charge of running the Python code.
This daemon is run by means of Python code located in a file called pyspark.zip
In my case, for reasons that I prefer not to tell ;) ;) the pyspark.zip was coming from Spark 2.3.3 but Spark itself was running with Spark 2.4.5. I replaced pyspark.zip with the one coming from Spark 2.4.5 and everything started to run with success.
I do not think you will have the same problem as I had but perhaps it could give you some ideas about what is going on in your set up.

Error GC life time is shorter than transaction duration while writing to TiDB using Spark

I am using Apache Spark to write data in batches. The batches are of 1 day. While running the spark job I get this error. I am using MySQL java connector to connect to TiDB cluster. Spark creates 144 parallel tasks for writing.
java.sql.SQLException: GC life time is shorter than transaction duration
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1055)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:956)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3536)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3468)
at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:1957)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2107)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2642)
at com.mysql.jdbc.ConnectionImpl.commit(ConnectionImpl.java:1610)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.mysql.jdbc.LoadBalancingConnectionProxy.invoke(LoadBalancingConnectionProxy.java:359)
at com.sun.proxy.$Proxy13.commit(Unknown Source)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:665)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:821)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:821)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:929)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:929)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2067)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2067)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
This error means, the Spark task Transaction Time is exceed than the TiDB's GC Life Time that means:
The current read data has been deleted by TiDB, since the data's life time exceed than the configed GC Life Time
so solution maybe try to increase the tikv_gc_life_time by:
update mysql.tidb set variable_value='30m' where variable_name='tikv_gc_life_time';
see more:
https://github.com/pingcap/docs/blob/master/op-guide/gc.md#configuration-and-monitor

NoHostAvailableException Found at the Time of data reading from Cassandra

I am using Cassandra 2.1.5, and Cassandra-java-driver 2.0.10. I am facing below Exception when i am fetching Data from Cassandra Table.
com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /127.0.0.1:9042 (com.datastax.driver.core.TransportException: [/127.0.0.1:9042] Connection has been closed))
at com.datastax.driver.core.exceptions.NoHostAvailableException.copy(NoHostAvailableException.java:84)
at com.datastax.driver.core.DefaultResultSetFuture.extractCauseFromExecutionException(DefaultResultSetFuture.java:265)
at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:179)
at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:52)
at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:36)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at com.datastax.spark.connector.cql.SessionProxy.invoke(SessionProxy.scala:33)
at com.sun.proxy.$Proxy8.execute(Unknown Source)
at com.exportstagging.SparkTest.ProductDataLoader.dbQuery(ProductDataLoader.java:418)
at com.exportstagging.SparkTest.ProductDataLoader.main(ProductDataLoader.java:442)
Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /127.0.0.1:9042 (com.datastax.driver.core.TransportException: [/127.0.0.1:9042] Connection has been closed))
at com.datastax.driver.core.RequestHandler.reportNoMoreHosts(RequestHandler.java:216)
at com.datastax.driver.core.RequestHandler.access$900(RequestHandler.java:45)
at com.datastax.driver.core.RequestHandler$SpeculativeExecution.sendRequest(RequestHandler.java:276)
at com.datastax.driver.core.RequestHandler$SpeculativeExecution$1.run(RequestHandler.java:374)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
In my cassandra Table there is 50000 columns and 380000 rows. when i am fire this query i am getting above error.
Select * from mykeyspace.productdata Where id In (1,...,6000).
I have used token to create batching for fetching data from Cassandra.
Trying to fectch 6000 partitions at once, on 380000 rows with a table with 50000 columns seems totally overkill.
Multi partition select should be done using asynchronous queries, using one query per partition.
Having 50000 columns in a table looks like a data modeling problem. What is your use case ?
Fetching 380000 rows at once shouldn't be necessary (and will take a very long time anyway), and it looks like an analytical query which should be way better handled through Spark.
Trying to get so much data at once, you are most likely getting an OOM on your cassandra node, which could explain why you're getting the message "Connection has been closed"
My advice is to review your model, and try to split the load the way you should (async queries) and use proper paging.
If you want to crunch so much data at once, you have to do it with Spark through batch processing, output the result in another cassandra table, and access that result table through smaller, faster interactive queries.

FileNotFoundException during compaction

All of my nodes are throwing a FileNotFoundException during compaction. As such, not a single compaction (auto, manual) can finish and my SSTable count is now in the thousands for a single CF (CQL3).
nodetool compactionstats shows hundreds of pending tasks in each node but nothing is being processed.
Below is an example log of the exception:
Error occurred during compaction
java.util.concurrent.ExecutionException: java.lang.RuntimeException: java.io.FileNotFoundException: /home/cassandra/data/mtg_keywords_v5/keyword_organic_results/mtg_keywords_v5-keyword_organic_results-jb-31111-Data.db (No such file or directory)
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:188)
at org.apache.cassandra.db.compaction.CompactionManager.performMaximal(CompactionManager.java:281)
at org.apache.cassandra.db.ColumnFamilyStore.forceMajorCompaction(ColumnFamilyStore.java:1935)
at org.apache.cassandra.service.StorageService.forceKeyspaceCompaction(StorageService.java:2210)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at sun.reflect.misc.Trampoline.invoke(MethodUtil.java:75)
at sun.reflect.GeneratedMethodAccessor14.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at sun.reflect.misc.MethodUtil.invoke(MethodUtil.java:279)
at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:112)
at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:46)
at com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(MBeanIntrospector.java:237)
at com.sun.jmx.mbeanserver.PerInterface.invoke(PerInterface.java:138)
at com.sun.jmx.mbeanserver.MBeanSupport.invoke(MBeanSupport.java:252)
at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanServerInterceptor.java:819)
at com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:801)
at javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1487)
at javax.management.remote.rmi.RMIConnectionImpl.access$300(RMIConnectionImpl.java:97)
at javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1328)
at javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1420)
at javax.management.remote.rmi.RMIConnectionImpl.invoke(RMIConnectionImpl.java:848)
at sun.reflect.GeneratedMethodAccessor40.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:322)
at sun.rmi.transport.Transport$1.run(Transport.java:177)
at sun.rmi.transport.Transport$1.run(Transport.java:174)
at java.security.AccessController.doPrivileged(Native Method)
at sun.rmi.transport.Transport.serviceCall(Transport.java:173)
at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:556)
at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:811)
at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:670)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: java.io.FileNotFoundException: /home/cassandra/data/mtg_keywords_v5/keyword_organic_results/mtg_keywords_v5-keyword_organic_results-jb-31111-Data.db (No such file or directory)
at org.apache.cassandra.io.compress.CompressedThrottledReader.open(CompressedThrottledReader.java:52)
at org.apache.cassandra.io.sstable.SSTableReader.openDataReader(SSTableReader.java:1355)
at org.apache.cassandra.io.sstable.SSTableScanner.<init>(SSTableScanner.java:67)
at org.apache.cassandra.io.sstable.SSTableReader.getScanner(SSTableReader.java:1161)
at org.apache.cassandra.io.sstable.SSTableReader.getScanner(SSTableReader.java:1173)
at org.apache.cassandra.db.compaction.AbstractCompactionStrategy.getScanners(AbstractCompactionStrategy.java:252)
at org.apache.cassandra.db.compaction.AbstractCompactionStrategy.getScanners(AbstractCompactionStrategy.java:258)
at org.apache.cassandra.db.compaction.CompactionTask.runWith(CompactionTask.java:126)
at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48)
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
at org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:60)
at org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:59)
at org.apache.cassandra.db.compaction.CompactionManager$6.runMayThrow(CompactionManager.java:296)
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
... 3 more
Caused by: java.io.FileNotFoundException: /home/cassandra/data/mtg_keywords_v5/keyword_organic_results/mtg_keywords_v5-keyword_organic_results-jb-31111-Data.db (No such file or directory)
at java.io.RandomAccessFile.open(Native Method)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:241)
at org.apache.cassandra.io.util.RandomAccessReader.<init>(RandomAccessReader.java:58)
at org.apache.cassandra.io.compress.CompressedRandomAccessReader.<init>(CompressedRandomAccessReader.java:76)
at org.apache.cassandra.io.compress.CompressedThrottledReader.<init>(CompressedThrottledReader.java:34)
at org.apache.cassandra.io.compress.CompressedThrottledReader.open(CompressedThrottledReader.java:48)
... 18 more
I'm currently in the middle of migrating 4.8 billion rows from MySQL which I do via sstableloader in batches of 1 to 4 million rows. Does the exception mean that I've already lost data and must repeat the migration from scratch? So far I don't see any stream error in my logs.
My environment is as follows:
DSE 4.0.1 (Cassandra 2.0.5)
CentOS 6.x x86_64
Java 1.7.0_5x
EDIT:
Some additional info:
During the bulk loading process, I devised a mechanism to kill the sstableloader when the total progress reaches 100%. I also issue a "nodetool stop INDEX_BUILD" to all nodes. The reason for this is because sstableloader waits for the secondary index build to finish and this takes hours to complete (whereas the actual import time is just a fraction of the index build time). I figured out that the imported data remains intact after killing the sstableloader process and cancelling the secondary index build so I wrote a script to automate the mechanism. So far, I have completed more than 200 bulk loads with this trick.
I have paused the migration and restarted the nodes several times in the past week because the OS load reaches high levels (yellow or red in OpsCenter) after finishing several cycles of note #1. It's possible that a compaction is in progress when I restarted the nodes via dse cassandra-stop (yes, we are running DSE as a standalone process)
Could any of these be the cause? How do I get out of this situation? Manual compaction/repair doesn't work because they always throw exceptions. For repair, the exception is different but the meaning is the same - some sstable files are missing:
ERROR [MiscStage:2] 2014-05-03 00:42:10,386 CassandraDaemon.java (line 196) Exception in thread Thread[MiscStage:2,5,main]
java.lang.RuntimeException: Tried to hard link to file that does not exist /home/cassandra/data/mtg_keywords_v5/keyword_organic_results/mtg_keywords_v5-keyword_organic_results-jb-23797-Summary.db
at org.apache.cassandra.io.util.FileUtils.createHardLink(FileUtils.java:76)
at org.apache.cassandra.io.sstable.SSTableReader.createLinks(SSTableReader.java:1215)
at org.apache.cassandra.db.ColumnFamilyStore.snapshotWithoutFlush(ColumnFamilyStore.java:1816)
at org.apache.cassandra.db.ColumnFamilyStore.snapshot(ColumnFamilyStore.java:1849)
at org.apache.cassandra.service.SnapshotVerbHandler.doVerb(SnapshotVerbHandler.java:40)
at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:60)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Have you dropped and recreated the keyspace? If so, it's probably this:
https://issues.apache.org/jira/browse/CASSANDRA-4857
Restart your nodes to clear the bad filename out of memory.

Resources