Error "GC life time is shorter than transaction duration" while writing to TiDB using Spark

I am using Apache Spark to write data in batches, where each batch covers one day of data. While running the Spark job I get the error below. I am using the MySQL Java connector to connect to the TiDB cluster. Spark creates 144 parallel tasks for writing.
java.sql.SQLException: GC life time is shorter than transaction duration
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1055)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:956)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3536)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3468)
at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:1957)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2107)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2642)
at com.mysql.jdbc.ConnectionImpl.commit(ConnectionImpl.java:1610)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.mysql.jdbc.LoadBalancingConnectionProxy.invoke(LoadBalancingConnectionProxy.java:359)
at com.sun.proxy.$Proxy13.commit(Unknown Source)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:665)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:821)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:821)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:929)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:929)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2067)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2067)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

This error means the Spark task's transaction duration exceeded TiDB's GC life time: the snapshot the transaction was reading has already been garbage-collected by TiDB, because its age exceeded the configured GC life time.
One solution is to increase tikv_gc_life_time, for example:
update mysql.tidb set variable_value='30m' where variable_name='tikv_gc_life_time';
See more:
https://github.com/pingcap/docs/blob/master/op-guide/gc.md#configuration-and-monitor
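A complementary approach (not from the answer above, and the connection details below are hypothetical) is to keep each JDBC transaction shorter than the GC life time by letting batches commit as they go instead of one long commit per partition. A minimal PySpark sketch:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tidb-daily-batch-write").getOrCreate()

df = spark.read.parquet("/data/day=2019-01-01")  # hypothetical input for one daily batch

(
    df.repartition(144)  # matches the 144 write tasks mentioned above; tune as needed
      .write
      .format("jdbc")
      .option("url", "jdbc:mysql://tidb-host:4000/mydb?rewriteBatchedStatements=true")  # assumed endpoint
      .option("dbtable", "my_table")        # hypothetical target table
      .option("user", "writer")
      .option("password", "secret")
      .option("batchsize", 10000)           # rows sent per JDBC batch
      .option("isolationLevel", "NONE")     # Spark then skips the single per-partition transaction,
                                            # so batches commit as they go (shorter transactions,
                                            # at the cost of per-partition atomicity)
      .mode("append")
      .save()
)
Whether trading per-partition atomicity for shorter commits is acceptable depends on how the job handles retries; otherwise, raising tikv_gc_life_time as shown above is the simpler fix.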

Related

Failure when creating a table during job run on Databricks

I have been doing a left join on two tables in SQL (table A contains billions of rows and table B contains millions of rows) and creating a new table (table C) from the result of the join. I am using an XXL SQL Warehouse on Databricks with two workers to handle the compute for this operation.
The main issue is that when I schedule this job it seems to fail every time, whereas when I run it manually the job finishes almost every time.
DROP TABLE IF EXISTS tableC;
CREATE TABLE tableC
AS (
  SELECT *
  FROM tableA
  LEFT JOIN tableB
    ON tableA.id = tableB.id
);
After running the SQL query as a scheduled job, I get the error below. From my understanding the error occurs in the final stage of the job, when it is writing the rows into a Delta table, but I can't figure out the reason for the failure. I have checked that the data is not corrupted and that I have enough memory. The part that puzzles me most is that the query works when run manually but always seems to fail when it is part of a scheduled job run.
Job aborted due to stage failure: ResultStage 9 (write at WriteIntoDeltaCommand.scala:70) has failed the maximum allowable number of times: 4. Most recent failure reason:
org.apache.spark.shuffle.FetchFailedException
at org.apache.spark.errors.SparkCoreErrors$.fetchFailedError(SparkCoreErrors.scala:315)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:1167)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.$anonfun$next$1(ShuffleBlockFetcherIterator.scala:905)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:736)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:85)
at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.sql.execution.RowToColumnarExec$$anon$1.hasNext(Columnar.scala:491)
at com.databricks.photon.CloseableIterator$$anon$5.hasNext(CloseableIterator.scala:62)
at com.databricks.photon.NativeColumnBatchIterator.hasNext(NativeColumnBatchIterator.java:47)
at 0x8235362 <photon>.HasNext(external/workspace_spark_3_3/photon/jni-wrappers/jni-native-column-batch-iterator.cc:62)
at 0x4707a79 <photon>.OpenImpl(external/workspace_spark_3_3/photon/exec-nodes/file-writer-node.cc:161)
at com.databricks.photon.JniApiImpl.open(Native Method)
at com.databricks.photon.JniApi.open(JniApi.scala)
at com.databricks.photon.JniExecNode.open(JniExecNode.java:64)
at com.databricks.photon.PhotonWriteStageExec.$anonfun$executeWrite$5(PhotonWriteStageExec.scala:87)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.photon.PhotonExec.timeit(PhotonExec.scala:300)
at com.databricks.photon.PhotonExec.timeit$(PhotonExec.scala:298)
at com.databricks.photon.PhotonWriteStageExec.timeit(PhotonWriteStageExec.scala:38)
at com.databricks.photon.PhotonWriteStageExec.$anonfun$executeWrite$4(PhotonWriteStageExec.scala:87)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1730)
at com.databricks.photon.PhotonWriteStageExec.$anonfun$executeWrite$2(PhotonWriteStageExec.scala:84)
at com.databricks.photon.PhotonExec.$anonfun$executePhoton$2(PhotonExec.scala:484)
at com.databricks.photon.PhotonExec.$anonfun$executePhoton$2$adapted(PhotonExec.scala:353)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndexInternal$2(RDD.scala:882)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndexInternal$2$adapted(RDD.scala:882)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:372)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:336)
at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:75)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:75)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:55)
at org.apache.spark.scheduler.Task.doRunTask(Task.scala:169)
at org.apache.spark.scheduler.Task.$anonfun$run$4(Task.scala:137)
at com.databricks.unity.EmptyHandle$.runWithAndClose(UCSHandle.scala:104)
at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:137)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.Task.run(Task.scala:96)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$13(Executor.scala:902)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1696)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:905)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:760)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.nio.channels.ClosedChannelException
at org.apache.spark.network.client.StreamInterceptor.channelInactive(StreamInterceptor.java:62)
at org.apache.spark.network.util.TransportFrameDecoder.channelInactive(TransportFrameDecoder.java:223)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1405)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:901)
at io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:813)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:469)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
... 1 more
Based on the stack trace, the failure has two parts:
org.apache.spark.shuffle.FetchFailedException (the operation that failed)
Caused by: java.nio.channels.ClosedChannelException (the root cause)
The problem is possibly that the default retry or timeout configuration is not suitable for complex workloads.
Check out this article and try the recommendations, mainly under the 'Network Timeout' section: https://towardsdatascience.com/fetch-failed-exception-in-apache-spark-decrypting-the-most-common-causes-b8dff21075c
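As a rough illustration of that advice, here is a minimal sketch of relaxing the shuffle retry and network timeout settings; the property names are standard Spark ones, but the values are assumptions, and on Databricks they would normally be set in the cluster's Spark config rather than in code:
from pyspark.sql import SparkSession

# Illustrative values only; these need to be in place before executors start,
# so set them at cluster/session creation time.
spark = (
    SparkSession.builder
    .appName("large-left-join")
    .config("spark.network.timeout", "600s")      # default 120s; tolerate long pauses before declaring failure
    .config("spark.shuffle.io.maxRetries", "10")  # default 3; retry failed shuffle fetches more times
    .config("spark.shuffle.io.retryWait", "30s")  # default 5s; wait longer between retries
    .getOrCreate()
)
Raising these only buys time for transient executor loss; if failures persist, the join itself may need more memory or a different join strategy.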

SOLR Read timed out (socket connection timeout) when indexing large amount of data in a collection

We are trying to index around 5 billion records that live in HDFS (Parquet files) into a collection on Solr. We are using Solr 7.2.1. We have spawned an EMR cluster of 7 data nodes (16 vCores, 128 GB each) and are using the same HDFS nodes for storing the Solr data as well (the Solr collection was created with 7 shards). After running for some time we see that the rate of indexing slows down and some tasks start failing with 2 main errors (listed below). After indexing around 550 million records or so, the job failed because some tasks failed too many times.
Error 1:
scala.MatchError: java.net.SocketTimeoutException: Read timed out (of class java.net.SocketTimeoutException)
at com.lucidworks.spark.util.SolrSupport$.sendBatchToSolrWithRetry(SolrSupport.scala:358)
at com.lucidworks.spark.util.SolrSupport$$anonfun$indexDocs$1.apply(SolrSupport.scala:333)
at com.lucidworks.spark.util.SolrSupport$$anonfun$indexDocs$1.apply(SolrSupport.scala:322)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Error 2:
scala.MatchError: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://solr1:8983/solr/connection-2637-config6: 2 Async exceptions during distributed update:
Broken pipe (Write failed)
Broken pipe (Write failed) (of class org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException)
at com.lucidworks.spark.util.SolrSupport$.sendBatchToSolrWithRetry(SolrSupport.scala:358)
at com.lucidworks.spark.util.SolrSupport$$anonfun$indexDocs$1.apply(SolrSupport.scala:333)
at com.lucidworks.spark.util.SolrSupport$$anonfun$indexDocs$1.apply(SolrSupport.scala:322)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
We then tried to divide the larger file into files of 300 million records each and index them with a cooling-off period in between. We saw that each subsequent chunk takes more time (3 hrs for the first 300 million, 4.5 hrs for the second, 6 hrs for the third, etc.).
Now that we have indexed over 2 billion records, even after waiting 5-6 hours we still get a lot of socket timeout exceptions and the job fails. We believe this might be happening because, even after the job shows as completed, the merging of indexes continues in the background (we have seen the number of files in the respective core directories go down, so that's our guess). Is this true? Either way, is there any way to solve this and improve performance?
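For illustration only, a minimal sketch of this kind of chunked indexing with the spark-solr writer; the paths, ZooKeeper string, and option values are assumptions, not the exact job described here:
import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("chunked-solr-indexing").getOrCreate()

df = spark.read.parquet("hdfs:///data/records.parquet")   # hypothetical input
chunks = df.randomSplit([1.0] * 17)                       # ~300M-record slices of a ~5B-row dataset

for i, chunk in enumerate(chunks):
    (
        chunk.write.format("solr")
        .option("zkhost", "zk1:2181,zk2:2181,zk3:2181/solr")  # assumed ZooKeeper connect string
        .option("collection", "connection-2637-config6")
        .option("batch_size", "5000")       # spark-solr batch size (assumed value)
        .option("commit_within", "60000")   # ask Solr to commit within 60s rather than per batch
        .mode("append")
        .save()
    )
    print(f"chunk {i} indexed; cooling off so background merges can catch up")
    time.sleep(30 * 60)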
These are some other Solr configs we have set:
SOLR_JAVA_MEM="-Xms30g -Xmx30g"
GC_TUNE="-XX:+UseG1GC -XX:+PerfDisableSharedMem -XX:+ParallelRefProcEnabled -XX:G1HeapRegionSize=8m -XX:MaxGCPauseMillis=200 -XX:InitiatingHeapOccupancyPercent=75 -XX:+UseLargePages -XX:+AggressiveOpts -XX:MaxDirectMemorySize=20g"
solrconfig.xml
<bool name="solr.hdfs.blockcache.enabled">true</bool>
<int name="solr.hdfs.blockcache.slab.count">1</int>
<bool name="solr.hdfs.blockcache.direct.memory.allocation">true</bool>
<int name="solr.hdfs.blockcache.blocksperbank">16384</int>
<bool name="solr.hdfs.blockcache.read.enabled">true</bool>
<bool name="solr.hdfs.nrtcachingdirectory.enable">true</bool>
<int name="solr.hdfs.nrtcachingdirectory.maxmergesizemb">16</int>
<int name="solr.hdfs.nrtcachingdirectory.maxcachedmb">192</int>
<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
  <int name="maxMergeAtOnce">25</int>
  <int name="segmentsPerTier">25</int>
  <double name="noCFSRatio">0.1</double>
</mergePolicyFactory>
<autoCommit>
  <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>${solr.autoSoftCommit.maxTime:1200000}</maxTime>
</autoSoftCommit>
Would really appreciate any help at all.
Edit: I checked the executor logs and found a more detailed stack trace:
21/08/31 11:35:53 ERROR BaseCloudSolrClient: Request to collection [connection-2637-config6-65536] failed due to (0) java.net.SocketTimeoutException: Read timed out, retry=0 commError=false errorCode=0
21/08/31 11:35:53 INFO BaseCloudSolrClient: request was not communication error it seems
21/08/31 11:35:53 ERROR SolrSupport$: Send batch to collection connection-2637-config6-65536 failed due to: org.apache.solr.client.solrj.SolrServerException: Timeout occurred while waiting response from server at: http://solr4:8983/solr/connection-2637-config6-65536
org.apache.solr.client.solrj.SolrServerException: Timeout occurred while waiting response from server at: http://solr4:8983/solr/connection-2637-config6-65536
at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:676)
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:265)
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:248)
at org.apache.solr.client.solrj.impl.LBSolrClient.doRequest(LBSolrClient.java:368)
at org.apache.solr.client.solrj.impl.LBSolrClient.request(LBSolrClient.java:296)
at org.apache.solr.client.solrj.impl.BaseCloudSolrClient.sendRequest(BaseCloudSolrClient.java:1128)
at org.apache.solr.client.solrj.impl.BaseCloudSolrClient.requestWithRetryOnStaleState(BaseCloudSolrClient.java:897)
at org.apache.solr.client.solrj.impl.BaseCloudSolrClient.request(BaseCloudSolrClient.java:829)
at org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1290)
at com.lucidworks.spark.util.SolrSupport$.sendBatchToSolr(SolrSupport.scala:389)
at com.lucidworks.spark.util.SolrSupport$.sendBatchToSolrWithRetry(SolrSupport.scala:354)
at com.lucidworks.spark.util.SolrSupport$$anonfun$indexDocs$1.apply(SolrSupport.scala:333)
at com.lucidworks.spark.util.SolrSupport$$anonfun$indexDocs$1.apply(SolrSupport.scala:322)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:171)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at shaded.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137)
at shaded.apache.http.impl.io.SessionInputBufferImpl.fillBuffer(SessionInputBufferImpl.java:153)
at shaded.apache.http.impl.io.SessionInputBufferImpl.readLine(SessionInputBufferImpl.java:282)
at shaded.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:138)
at shaded.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:56)
at shaded.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:259)
at shaded.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:163)
at shaded.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:165)
at shaded.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:273)
at shaded.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125)
at shaded.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:272)
at shaded.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185)
at shaded.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
at shaded.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
at shaded.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
at shaded.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
at shaded.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:564)
... 24 more
21/08/31 11:35:53 ERROR Executor: Exception in task 75.0 in stage 1.0 (TID 77)
scala.MatchError: java.net.SocketTimeoutException: Read timed out (of class java.net.SocketTimeoutException)
at com.lucidworks.spark.util.SolrSupport$.sendBatchToSolrWithRetry(SolrSupport.scala:358)
at com.lucidworks.spark.util.SolrSupport$$anonfun$indexDocs$1.apply(SolrSupport.scala:333)
at com.lucidworks.spark.util.SolrSupport$$anonfun$indexDocs$1.apply(SolrSupport.scala:322)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Last task stuck forever... only sometimes

I have a PySpark job (Spark 2.4.1) that seems to work fine in about 10% of cases; the rest of the time it appears stuck forever on a single task, and I can't really understand what is happening.
Here's what I'm doing in my PySpark code:
from pyspark.sql import functions as sf  # `ss` is an existing SparkSession; my_python_udf is a Python UDF defined elsewhere

df = ss.read.parquet(...)
df2 = df.withColumn("A", my_python_udf(sf.col("position.latitude")))
print(df2.groupBy(sf.spark_partition_id()).count().agg(sf.min("count"), sf.max("count"), sf.avg("count")).toPandas())
I seem to be forever stuck in the evaluation of the "toPandas" call.
When I check the executors tab, only one executor is runnable with the following call stack:
java.net.SocketInputStream.socketRead0(Native Method)
java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
java.net.SocketInputStream.read(SocketInputStream.java:171)
java.net.SocketInputStream.read(SocketInputStream.java:141)
java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
java.io.BufferedInputStream.read(BufferedInputStream.java:345) => holding Monitor(java.io.BufferedInputStream#1118259716})
java.io.DataInputStream.readFully(DataInputStream.java:195)
java.io.DataInputStream.readFully(DataInputStream.java:169)
org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$1.read(PythonUDFRunner.scala:74)
org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$1.read(PythonUDFRunner.scala:64)
org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:406)
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.agg_doAggregateWithKeys_0$(Unknown Source)
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:187)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
org.apache.spark.scheduler.Task.run(Task.scala:121)
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:403)
org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:409)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:748)
I have several questions:
Why does it seem like the call stack is doing work related to the UDF evaluation, which is not needed for my computation?
What is actually going on? From the call stack I can't tell whether that thread is deadlocked or live.
How can I fix this?
Edit:
I also have 2 executors that are failing with the following error:
java.io.IOException: expected more bytes in input stream
at net.razorvine.pickle.PickleUtils.readbytes_into(PickleUtils.java:75)
at net.razorvine.pickle.PickleUtils.readbytes(PickleUtils.java:55)
at net.razorvine.pickle.Unpickler.load_binunicode(Unpickler.java:473)
at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:190)
at net.razorvine.pickle.Unpickler.load(Unpickler.java:99)
at net.razorvine.pickle.Unpickler.loads(Unpickler.java:112)
at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$evaluate$1.apply(BatchEvalPythonExec.scala:90)
at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$evaluate$1.apply(BatchEvalPythonExec.scala:89)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.agg_doAggregateWithKeys_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:187)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:403)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:409)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
This makes me believe that something outside of my code is going wrong.
In my case, my PySpark jobs always (not randomly) got stuck in PythonUDFRunner waiting for data.
I found out it was waiting for data coming from the PySpark daemon, which is used to launch Python code (Python UDFs) without wasting lots of memory when forking a Python process (the UDF) from a Java process (Spark). More information in this link: pyspark daemon
So, roughly speaking, the Python UDF is serialized by Spark and sent to this daemon, which is in charge of running the Python code.
This daemon is run by means of Python code located in a file called pyspark.zip.
In my case, for reasons that I prefer not to tell ;) ;) the pyspark.zip was coming from Spark 2.3.3, but Spark itself was running as Spark 2.4.5. I replaced pyspark.zip with the one from Spark 2.4.5 and everything started to run successfully.
I do not think you will have exactly the same problem I had, but perhaps this gives you some ideas about what is going on in your setup.
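If it helps, a minimal check (assuming a live SparkSession named spark) to compare the JVM-side Spark version with the Python-side version shipped in pyspark.zip:
import pyspark

print("JVM Spark version:   ", spark.version)        # version of the running Spark runtime
print("Python-side version: ", pyspark.__version__)  # version bundled with the pyspark on the path
print("pyspark loaded from: ", pyspark.__file__)     # reveals which pyspark.zip/install is in use
# If the two versions differ, the daemon protocol may mismatch and UDF execution can hang.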

Executor heartbeat timed out Spark on DataProc

I am trying to fit an ML model in Spark (2.0.0) on a Google Dataproc cluster. When fitting the model I receive an "Executor heartbeat timed out" error. How can I resolve this?
Other answers indicate this is probably due to one of the executors running out of memory. The solutions I have read are: use the right settings, repartition, cache, and get a bigger cluster. What can I do, preferably without setting up a larger cluster? (Make more/fewer partitions? Cache less? Adjust settings?)
My setup:
Spark 2.0.0 on a Google DataProc Cluster:
1 Master and 2 workers all with the same specs: n1-highmem-8 -> 8 vCPUs, 52.0 GB memory - 500GB disk
Settings:
spark\:spark.executor.cores=1
distcp\:mapreduce.map.java.opts=-Xmx2457m
spark\:spark.driver.maxResultSize=1920m
mapred\:mapreduce.map.java.opts=-Xmx2457m
yarn\:yarn.nodemanager.resource.memory-mb=6144
mapred\:mapreduce.reduce.memory.mb=6144
spark\:spark.yarn.executor.memoryOverhead=384
mapred\:mapreduce.map.cpu.vcores=1
distcp\:mapreduce.reduce.memory.mb=6144
mapred\:yarn.app.mapreduce.am.resource.mb=6144
mapred\:mapreduce.reduce.java.opts=-Xmx4915m
yarn\:yarn.scheduler.maximum-allocation-mb=6144
dataproc\:dataproc.scheduler.max-concurrent-jobs=11
dataproc\:dataproc.heartbeat.master.frequency.sec=30
mapred\:mapreduce.reduce.cpu.vcores=2
distcp\:mapreduce.reduce.java.opts=-Xmx4915m
distcp\:mapreduce.map.memory.mb=3072
spark\:spark.driver.memory=3840m
mapred\:mapreduce.map.memory.mb=3072
yarn\:yarn.scheduler.minimum-allocation-mb=512
mapred\:yarn.app.mapreduce.am.resource.cpu-vcores=2
spark\:spark.yarn.am.memoryOverhead=384
spark\:spark.executor.memory=2688m
spark\:spark.yarn.am.memory=2688m
mapred\:yarn.app.mapreduce.am.command-opts=-Xmx4915m
Full Error:
Py4JJavaError: An error occurred while calling o4973.fit.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 151 in stage 16964.0 failed 4 times, most recent failure: Lost task 151.3 in stage 16964.0 (TID 779444, reco-test-w-0.c.datasetredouteasvendor.internal): ExecutorLostFailure (executor 14 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 175122 ms
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1450)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1438)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1437)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1437)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1659)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1618)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1607)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1871)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1884)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1897)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1911)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:893)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
at org.apache.spark.rdd.RDD.collect(RDD.scala:892)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$countByKey$1.apply(PairRDDFunctions.scala:372)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$countByKey$1.apply(PairRDDFunctions.scala:372)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
at org.apache.spark.rdd.PairRDDFunctions.countByKey(PairRDDFunctions.scala:371)
at org.apache.spark.rdd.RDD$$anonfun$countByValue$1.apply(RDD.scala:1156)
at org.apache.spark.rdd.RDD$$anonfun$countByValue$1.apply(RDD.scala:1156)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
at org.apache.spark.rdd.RDD.countByValue(RDD.scala:1155)
at org.apache.spark.ml.feature.StringIndexer.fit(StringIndexer.scala:91)
at org.apache.spark.ml.feature.StringIndexer.fit(StringIndexer.scala:66)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:211)
at java.lang.Thread.run(Thread.java:745)
To summarize, since this question otherwise had no answer: the issue appears to have been related to spark.executor.memory being set too low, causing occasional out-of-memory failures on an executor.
The suggested fix was to first try starting with the default Dataproc config, which tries to fully use all cores and memory available on the instance. If issues continue, then adjust spark.executor.memory and spark.executor.cores to increase the amount of memory available per task (essentially spark.executor.memory / spark.executor.cores).
Dennis also gives more details about the Spark memory config on Dataproc in the following answer:
Google Cloud Dataproc configuration issues
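As a rough sketch of that suggestion (the values are illustrative, not the poster's actual configuration), per-task memory can be raised by giving each executor more heap and fewer cores; these properties must be in place before executors launch, e.g. at job submission or session creation:
from pyspark.sql import SparkSession

# Roughly, memory per concurrent task is spark.executor.memory / spark.executor.cores.
spark = (
    SparkSession.builder
    .appName("stringindexer-fit")
    .config("spark.executor.memory", "6g")                 # assumed value; raise executor heap
    .config("spark.executor.cores", "2")                   # fewer concurrent tasks per executor
    .config("spark.yarn.executor.memoryOverhead", "1024")  # Spark 2.x property name; extra off-heap headroom in MB
    .getOrCreate()
)
On Dataproc the same properties can instead be passed as job properties at submission time.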

FileNotFoundException during compaction

All of my nodes are throwing a FileNotFoundException during compaction. As such, not a single compaction (auto, manual) can finish and my SSTable count is now in the thousands for a single CF (CQL3).
nodetool compactionstats shows hundreds of pending tasks in each node but nothing is being processed.
Below is an example log of the exception:
Error occurred during compaction
java.util.concurrent.ExecutionException: java.lang.RuntimeException: java.io.FileNotFoundException: /home/cassandra/data/mtg_keywords_v5/keyword_organic_results/mtg_keywords_v5-keyword_organic_results-jb-31111-Data.db (No such file or directory)
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:188)
at org.apache.cassandra.db.compaction.CompactionManager.performMaximal(CompactionManager.java:281)
at org.apache.cassandra.db.ColumnFamilyStore.forceMajorCompaction(ColumnFamilyStore.java:1935)
at org.apache.cassandra.service.StorageService.forceKeyspaceCompaction(StorageService.java:2210)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at sun.reflect.misc.Trampoline.invoke(MethodUtil.java:75)
at sun.reflect.GeneratedMethodAccessor14.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at sun.reflect.misc.MethodUtil.invoke(MethodUtil.java:279)
at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:112)
at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:46)
at com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(MBeanIntrospector.java:237)
at com.sun.jmx.mbeanserver.PerInterface.invoke(PerInterface.java:138)
at com.sun.jmx.mbeanserver.MBeanSupport.invoke(MBeanSupport.java:252)
at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanServerInterceptor.java:819)
at com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:801)
at javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1487)
at javax.management.remote.rmi.RMIConnectionImpl.access$300(RMIConnectionImpl.java:97)
at javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1328)
at javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1420)
at javax.management.remote.rmi.RMIConnectionImpl.invoke(RMIConnectionImpl.java:848)
at sun.reflect.GeneratedMethodAccessor40.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:322)
at sun.rmi.transport.Transport$1.run(Transport.java:177)
at sun.rmi.transport.Transport$1.run(Transport.java:174)
at java.security.AccessController.doPrivileged(Native Method)
at sun.rmi.transport.Transport.serviceCall(Transport.java:173)
at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:556)
at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:811)
at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:670)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: java.io.FileNotFoundException: /home/cassandra/data/mtg_keywords_v5/keyword_organic_results/mtg_keywords_v5-keyword_organic_results-jb-31111-Data.db (No such file or directory)
at org.apache.cassandra.io.compress.CompressedThrottledReader.open(CompressedThrottledReader.java:52)
at org.apache.cassandra.io.sstable.SSTableReader.openDataReader(SSTableReader.java:1355)
at org.apache.cassandra.io.sstable.SSTableScanner.<init>(SSTableScanner.java:67)
at org.apache.cassandra.io.sstable.SSTableReader.getScanner(SSTableReader.java:1161)
at org.apache.cassandra.io.sstable.SSTableReader.getScanner(SSTableReader.java:1173)
at org.apache.cassandra.db.compaction.AbstractCompactionStrategy.getScanners(AbstractCompactionStrategy.java:252)
at org.apache.cassandra.db.compaction.AbstractCompactionStrategy.getScanners(AbstractCompactionStrategy.java:258)
at org.apache.cassandra.db.compaction.CompactionTask.runWith(CompactionTask.java:126)
at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48)
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
at org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:60)
at org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:59)
at org.apache.cassandra.db.compaction.CompactionManager$6.runMayThrow(CompactionManager.java:296)
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
... 3 more
Caused by: java.io.FileNotFoundException: /home/cassandra/data/mtg_keywords_v5/keyword_organic_results/mtg_keywords_v5-keyword_organic_results-jb-31111-Data.db (No such file or directory)
at java.io.RandomAccessFile.open(Native Method)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:241)
at org.apache.cassandra.io.util.RandomAccessReader.<init>(RandomAccessReader.java:58)
at org.apache.cassandra.io.compress.CompressedRandomAccessReader.<init>(CompressedRandomAccessReader.java:76)
at org.apache.cassandra.io.compress.CompressedThrottledReader.<init>(CompressedThrottledReader.java:34)
at org.apache.cassandra.io.compress.CompressedThrottledReader.open(CompressedThrottledReader.java:48)
... 18 more
I'm currently in the middle of migrating 4.8 billion rows from MySQL, which I'm doing via sstableloader in batches of 1 to 4 million rows. Does the exception mean that I've already lost data and must repeat the migration from scratch? So far I don't see any stream errors in my logs.
My environment is as follows:
DSE 4.0.1 (Cassandra 2.0.5)
CentOS 6.x x86_64
Java 1.7.0_5x
EDIT:
Some additional info:
1. During the bulk loading process, I devised a mechanism to kill the sstableloader when the total progress reaches 100%. I also issue a "nodetool stop INDEX_BUILD" to all nodes. The reason for this is that sstableloader waits for the secondary index build to finish, and this takes hours to complete (whereas the actual import time is just a fraction of the index build time). I figured out that the imported data remains intact after killing the sstableloader process and cancelling the secondary index build, so I wrote a script to automate the mechanism. So far, I have completed more than 200 bulk loads with this trick.
2. I have paused the migration and restarted the nodes several times in the past week because the OS load reaches high levels (yellow or red in OpsCenter) after finishing several cycles of note #1. It's possible that a compaction was in progress when I restarted the nodes via dse cassandra-stop (yes, we are running DSE as a standalone process).
Could any of these be the cause? How do I get out of this situation? Manual compaction/repair doesn't work because they always throw exceptions. For repair, the exception is different but the meaning is the same - some sstable files are missing:
ERROR [MiscStage:2] 2014-05-03 00:42:10,386 CassandraDaemon.java (line 196) Exception in thread Thread[MiscStage:2,5,main]
java.lang.RuntimeException: Tried to hard link to file that does not exist /home/cassandra/data/mtg_keywords_v5/keyword_organic_results/mtg_keywords_v5-keyword_organic_results-jb-23797-Summary.db
at org.apache.cassandra.io.util.FileUtils.createHardLink(FileUtils.java:76)
at org.apache.cassandra.io.sstable.SSTableReader.createLinks(SSTableReader.java:1215)
at org.apache.cassandra.db.ColumnFamilyStore.snapshotWithoutFlush(ColumnFamilyStore.java:1816)
at org.apache.cassandra.db.ColumnFamilyStore.snapshot(ColumnFamilyStore.java:1849)
at org.apache.cassandra.service.SnapshotVerbHandler.doVerb(SnapshotVerbHandler.java:40)
at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:60)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Have you dropped and recreated the keyspace? If so, it's probably this:
https://issues.apache.org/jira/browse/CASSANDRA-4857
Restart your nodes to clear the bad filename out of memory.
