Problem running example topology with storm-crawler 2.3-SNAPSHOT

I'm building SC 2.3-SNAPSHOT from source and generating a project from the archetype, then running the example Flux topology. Seeds are injected properly; I can see all of them in the ES index with the status DISCOVERED. My problem is that no fetching seems to happen after the injection, so I'm looking for ideas on what to investigate.
All the Storm components look fine, and so does ES.
In the logs, I can see this kind of error for my single worker:
2022-02-28 08:41:48.852 c.d.s.e.p.AggregationSpout I/O dispatcher 13 [ERROR] [spout #2] Exception with ES query
java.io.IOException: Unable to parse response body for Response{requestLine=POST /status/_search?typed_keys=true&max_concurrent_shard_requests=5&search_type=query_then_fetch&batched_reduce_size=512&preference=_shards%3A2%7C_local HTTP/1.1, host=http://node-1:9200, response=HTTP/1.1 200 OK}
at org.elasticsearch.client.RestHighLevelClient$1.onSuccess(RestHighLevelClient.java:2351) [stormjar.jar:?]
at org.elasticsearch.client.RestClient$FailureTrackingResponseListener.onSuccess(RestClient.java:660) [stormjar.jar:?]
at org.elasticsearch.client.RestClient$1.completed(RestClient.java:394) [stormjar.jar:?]
at org.elasticsearch.client.RestClient$1.completed(RestClient.java:388) [stormjar.jar:?]
at org.apache.http.concurrent.BasicFuture.completed(BasicFuture.java:122) [stormjar.jar:?]
at org.apache.http.impl.nio.client.DefaultClientExchangeHandlerImpl.responseCompleted(DefaultClientExchangeHandlerImpl.java:181) [stormjar.jar:?]
at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.processResponse(HttpAsyncRequestExecutor.java:448) [stormjar.jar:?]
at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.inputReady(HttpAsyncRequestExecutor.java:338) [stormjar.jar:?]
at org.apache.http.impl.nio.DefaultNHttpClientConnection.consumeInput(DefaultNHttpClientConnection.java:265) [stormjar.jar:?]
at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:81) [stormjar.jar:?]
at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:39) [stormjar.jar:?]
at org.apache.http.impl.nio.reactor.AbstractIODispatch.inputReady(AbstractIODispatch.java:114) [stormjar.jar:?]
at org.apache.http.impl.nio.reactor.BaseIOReactor.readable(BaseIOReactor.java:162) [stormjar.jar:?]
at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvent(AbstractIOReactor.java:337) [stormjar.jar:?]
at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvents(AbstractIOReactor.java:315) [stormjar.jar:?]
at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:276) [stormjar.jar:?]
at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104) [stormjar.jar:?]
at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:591) [stormjar.jar:?]
at java.lang.Thread.run(Thread.java:750) [?:1.8.0_322]
Caused by: java.lang.NullPointerException
at com.digitalpebble.stormcrawler.persistence.AbstractQueryingSpout$InProcessMap.containsKey(AbstractQueryingSpout.java:158) ~[stormjar.jar:?]
at com.digitalpebble.stormcrawler.elasticsearch.persistence.AggregationSpout.onResponse(AggregationSpout.java:252) ~[stormjar.jar:?]
at com.digitalpebble.stormcrawler.elasticsearch.persistence.AggregationSpout.onResponse(AggregationSpout.java:63) [stormjar.jar:?]
at org.elasticsearch.client.RestHighLevelClient$1.onSuccess(RestHighLevelClient.java:2349) [stormjar.jar:?]
... 18 more

This was fixed recently in https://github.com/DigitalPebble/storm-crawler/commit/88784c1af9a35fd45df3b68ace279a0b73e1e856
Please git pull and run mvn clean install on StormCrawler before rebuilding the topology.
Regarding
"WARN o.a.s.u.Utils - Topology crawler contains unreachable components "__system"" - what does that refer to?
No idea, but it shouldn't be a big issue.

Related

Is it possible to create broadcast variables within a Spark Streaming transformation function?

I tried to create a recoverable Spark Streaming job with some arguments fetched from a database, but I ran into a problem: it always gives me a serialization error when I try to restart the job from a checkpoint.
18/10/18 09:54:33 ERROR Executor: Exception in task 1.0 in stage 56.0 (TID 132)
java.lang.ClassCastException: org.apache.spark.util.SerializableConfiguration cannot be cast to scala.collection.MapLike
at com.ptnj.streaming.alertJob.InputDataParser$.kafka_stream_handle(InputDataParser.scala:37)
at com.ptnj.streaming.alertJob.InstanceAlertJob$$anonfun$1.apply(InstanceAlertJob.scala:38)
at com.ptnj.streaming.alertJob.InstanceAlertJob$$anonfun$1.apply(InstanceAlertJob.scala:38)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I followed the advice by maxime G in this existing SO question, and it seems to help.
But now there is another exception. Because of that issue, I have to create broadcast variables while transforming the stream, like:
val kafka_data_streaming = stream.map(x => DstreamHandle.kafka_stream_handle(url, x.value(), sc))
That means I have to pass the SparkContext as a parameter into the transformation function, and then this occurs:
Exception in thread "main" org.apache.spark.SparkException: Task not serializable at
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
at
org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at
org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2094) at
org.apache.spark.streaming.dstream.DStream$$anonfun$map$1.apply(DStream.scala:546)
at
org.apache.spark.streaming.dstream.DStream$$anonfun$map$1.apply(DStream.scala:546)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.SparkContext.withScope(SparkContext.scala:701)
at
org.apache.spark.streaming.StreamingContext.withScope(StreamingContext.scala:264)
at org.apache.spark.streaming.dstream.DStream.map(DStream.scala:545)
at
com.ptnj.streaming.alertJob.InstanceAlertJob$.streaming_main(InstanceAlertJob.scala:38)
at com.ptnj.streaming.AlarmMain$.create_ssc(AlarmMain.scala:36) at
com.ptnj.streaming.AlarmMain$.main(AlarmMain.scala:14) at
com.ptnj.streaming.AlarmMain.main(AlarmMain.scala) Caused by:
java.io.NotSerializableException: org.apache.spark.SparkContext
Serialization stack:
- object not serializable (class: org.apache.spark.SparkContext, value: org.apache.spark.SparkContext#5fb7183b)
- field (class: com.ptnj.streaming.alertJob.InstanceAlertJob$$anonfun$1, name: sc$1,
type: class org.apache.spark.SparkContext)
- object (class com.ptnj.streaming.alertJob.InstanceAlertJob$$anonfun$1, )
at
org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at
org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at
org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
... 14 more
I have never seen this situation before. Every example I have seen creates broadcast variables in an output operation function, not in a transformation function, so is that possible?
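One commonly used way to handle this is not to pass the SparkContext into the transformation at all, but to keep the broadcast in a lazily initialised singleton and build it from the RDD's own SparkContext inside transform/foreachRDD, so the context is never captured by the serialized closure. Below is a minimal sketch only; ArgsBroadcast and loadArgsFromDb are made-up names, and the kafka_stream_handle call is adapted from the snippet above rather than taken from the real code:

import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast

object ArgsBroadcast {
  // lazily initialised broadcast; it is rebuilt on first use after a restart
  // from checkpoint instead of being serialized with the DStream graph
  @volatile private var instance: Broadcast[Map[String, String]] = _

  def getInstance(sc: SparkContext, loadArgsFromDb: () => Map[String, String]): Broadcast[Map[String, String]] = {
    if (instance == null) {
      synchronized {
        if (instance == null) {
          // e.g. the arguments read from the database (hypothetical loader)
          instance = sc.broadcast(loadArgsFromDb())
        }
      }
    }
    instance
  }
}

// usage inside the streaming job: take the SparkContext from the RDD itself,
// never pass it into the closure from the driver code
val kafka_data_streaming = stream.transform { rdd =>
  val args = ArgsBroadcast.getInstance(rdd.sparkContext, () => loadArgsFromDb())
  rdd.map(x => DstreamHandle.kafka_stream_handle(args.value, x.value()))
}

The key point of the sketch is that the broadcast is created from rdd.sparkContext inside the transformation, so nothing non-serializable ends up in the executor-side closure; whether this fits the recovery-from-checkpoint scenario would still need to be verified for the actual job.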

Not able to print stack trace in Spark

I am running the code below; in the Failure part, I print the exception stack trace as INFO in the YARN log. My SQL contains a deliberate syntax error so that an exception will be generated. But when I look at the YARN log, it shows something unexpected, saying "call methods on a stopped SparkContext" (see below). I need help with this, in case I am doing something wrong.
Code snippet:
var ret: String = Try {
  DbUtil.dropTable("cls_mkt_tracker_split_rownum", batchDatabase)
  SparkEnvironment.hiveContext.sql(
    s"""CREATE TABLE ${batchDatabase}.CLS_MKT_TRACKER_SPLIT_ROWNUM
        AS SELECT ROW_NUMBER() OVER(PARTITION BY XREF_IMS_PAT_NBR, MOLECULE ORER BY IMS_DSPNSD_DT) AS ROWNUM, *
        FROM ${batchDatabase}.CLS_MKT_TRACKER_SPLIT
     """)
  true
} match {
  case Success(b: Boolean)   => ""
  case Failure(t: Throwable) => logger.info("I am in failure" + t.getMessage + t.getStackTraceString); "failure return"
}
YARN log:
16/12/13 11:19:42 INFO SessionState: No Tez session required at this point. hive.execution.engine=mr.
16/12/13 11:19:43 INFO DateAdjustment: I am in failureCannot call methods on a stopped SparkContext.
This stopped SparkContext was created at:
org.apache.spark.SparkContext.<init>(SparkContext.scala:83)
SparkEnvironment$.<init>(SparkEnvironment.scala:12)
SparkEnvironment$.<clinit>(SparkEnvironment.scala)
DbUtil$.dropTable(DbUtil.scala:8)
DateAdjustment$$anonfun$1.apply$mcZ$sp(DateAdjustment.scala:126)
DateAdjustment$$anonfun$1.apply(DateAdjustment.scala:125)
DateAdjustment$$anonfun$1.apply(DateAdjustment.scala:125)
scala.util.Try$.apply(Try.scala:161)
DateAdjustment$delayedInit$body.apply(DateAdjustment.scala:125)
scala.Function0$class.apply$mcV$sp(Function0.scala:40)
scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
scala.App$$anonfun$main$1.apply(App.scala:71)
scala.App$$anonfun$main$1.apply(App.scala:71)
scala.collection.immutable.List.foreach(List.scala:318)
scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
scala.App$class.main(App.scala:71)
DateAdjustment$.main(DateAdjustment.scala:14)
DateAdjustment.main(DateAdjustment.scala)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
The currently active SparkContext was created at:
(No active SparkContext.)
org.apache.spark.SparkContext.org$apache$spark$SparkContext$$assertNotStopped(SparkContext.scala:107)
I investigated and found that, at one place in my application, sc.stop() had been called, and that was causing the issue.
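As a side note on the original goal of printing the stack trace from the Failure branch: rather than concatenating t.getMessage with the Scala-specific getStackTraceString (deprecated in newer Scala versions), it is usually simpler to hand the Throwable itself to the logger, or to render it with printStackTrace into a StringWriter, so the full chain of causes lands in the YARN log. A small sketch, assuming an SLF4J-style logger like the one used in the snippet above:

import java.io.{PrintWriter, StringWriter}
import scala.util.{Failure, Success, Try}

def stackTraceOf(t: Throwable): String = {
  val sw = new StringWriter()
  t.printStackTrace(new PrintWriter(sw)) // includes every "Caused by" section
  sw.toString
}

val ret: String = Try {
  // ... dropTable + hiveContext.sql(...) as in the snippet above ...
  true
} match {
  case Success(_) => ""
  case Failure(t) =>
    // passing the Throwable as the last argument makes the logging framework
    // print the complete stack trace, not just the message
    logger.error("I am in failure: " + t.getMessage, t)
    logger.info(stackTraceOf(t)) // or render it manually if it must go out at INFO
    "failure return"
}

This does not change the root cause found above (the context had already been stopped), but it makes failures like this easier to diagnose from the logs.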

Cassandra: data not replicated on new node

I added a new node to my Cassandra cluster (the new node is not a seed node). I now have 3 nodes in my cluster:
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address          Load       Tokens  Owns (effective)  Host ID  Rack
UN  XXX.XXX.XXX.XXX  52.25 GB   256     100.0%            XXX      rack1
UN  XXX.XXX.XXX.XXX  63.65 GB   256     100.0%            XXX      rack1
UN  XXX.XXX.XXX.XXX  314.72 MB  256     100.0%            XXX      rack1
I have a replication factor of 3:
DESCRIBE KEYSPACE mykeyspace
CREATE KEYSPACE mykeyspace WITH replication = {'class': 'NetworkTopologyStrategy', 'datacenter1': '3'} AND durable_writes = true;
but the data is not replicated to the new node (the one with 314 MB of data).
I tried to use nodetool rebuild:
ERROR [STREAM-IN-/XXX.XXX.XXX.XXX] 2016-11-11 08:28:42,765 StreamSession.java:520 - [Stream #0e7a0580-a81b-11e6-9a1c-6d75503d5d02] Streaming error occurred
java.lang.IllegalArgumentException: Unknown type 0
at org.apache.cassandra.streaming.messages.StreamMessage$Type.get(StreamMessage.java:97) ~[apache-cassandra-3.1.1.jar:3.1.1]
at org.apache.cassandra.streaming.messages.StreamMessage.deserialize(StreamMessage.java:58) ~[apache-cassandra-3.1.1.jar:3.1.1]
at org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:261) ~[apache-cassandra-3.1.1.jar:3.1.1]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_74]
ERROR [Thread-16] 2016-11-11 08:28:42,765 CassandraDaemon.java:195 - Exception in thread Thread[Thread-16,5,RMI Runtime]
java.lang.RuntimeException: java.lang.InterruptedException
at com.google.common.base.Throwables.propagate(Throwables.java:160) ~[guava-18.0.jar:na]
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:32) ~[apache-cassandra-3.1.1.jar:3.1.1]
at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_74]
Caused by: java.lang.InterruptedException: null
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014) ~[na:1.8.0_74]
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2048) ~[na:1.8.0_74]
at java.util.concurrent.ArrayBlockingQueue.put(ArrayBlockingQueue.java:353) ~[na:1.8.0_74]
at org.apache.cassandra.streaming.compress.CompressedInputStream$Reader.runMayThrow(CompressedInputStream.java:184) ~[apache-cassandra-3.1.1.jar:3.1.1]
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) ~[apache-cassandra-3.1.1.jar:3.1.1]
... 1 common frames omitted
INFO [STREAM-IN-/XXX.XXX.XXX.XXX] 2016-11-11 08:28:42,805 StreamResultFuture.java:182 - [Stream #0e7a0580-a81b-11e6-9a1c-6d75503d5d02] Session with /XXX.XXX.XXX.XXX is complete
WARN [STREAM-IN-/XXX.XXX.XXX.XXX] 2016-11-11 08:28:42,807 StreamResultFuture.java:209 - [Stream #0e7a0580-a81b-11e6-9a1c-6d75503d5d02] Stream failed
ERROR [RMI TCP Connection(14)-127.0.0.1] 2016-11-11 08:28:42,808 StorageService.java:1128 - Error while rebuilding node
org.apache.cassandra.streaming.StreamException: Stream failed
at org.apache.cassandra.streaming.management.StreamEventJMXNotifier.onFailure(StreamEventJMXNotifier.java:85) ~[apache-cassandra-3.1.1.jar:3.1.1]
at com.google.common.util.concurrent.Futures$6.run(Futures.java:1310) ~[guava-18.0.jar:na]
at com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:457) ~[guava-18.0.jar:na]
at com.google.common.util.concurrent.ExecutionList.executeListener(ExecutionList.java:156) ~[guava-18.0.jar:na]
at com.google.common.util.concurrent.ExecutionList.execute(ExecutionList.java:145) ~[guava-18.0.jar:na]
at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:202) ~[guava-18.0.jar:na]
at org.apache.cassandra.streaming.StreamResultFuture.maybeComplete(StreamResultFuture.java:210) ~[apache-cassandra-3.1.1.jar:3.1.1]
at org.apache.cassandra.streaming.StreamResultFuture.handleSessionComplete(StreamResultFuture.java:186) ~[apache-cassandra-3.1.1.jar:3.1.1]
at org.apache.cassandra.streaming.StreamSession.closeSession(StreamSession.java:430) ~[apache-cassandra-3.1.1.jar:3.1.1]
at org.apache.cassandra.streaming.StreamSession.onError(StreamSession.java:525) ~[apache-cassandra-3.1.1.jar:3.1.1]
at org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:279) ~[apache-cassandra-3.1.1.jar:3.1.1]
at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_74]
I also tried changing the following option, but the data is still not copied to the new node:
auto_bootstrap: true
Could you please help me understand why the data is not replicated to the new node? Please let me know if you need further information about my configuration.
Thank you for your help.
It appears (from https://issues.apache.org/jira/browse/CASSANDRA-10448) that this is due to CASSANDRA-10961. Applying that fix should address it.

Saving a dataframe using spark-csv package throws exceptions and crashes (pyspark)

I am running a script on Spark 1.5.2 in standalone mode (using 8 cores), and at the end of the script I attempt to serialize a very large DataFrame to disk using the spark-csv package. The code snippet that throws the exception is:
numfileparts = 16
data = data.repartition(numfileparts)

# Save the files as a bunch of csv files
datadir = "~/tempdatadir.csv/"
try:
    (data
     .write
     .format('com.databricks.spark.csv')
     .save(datadir,
           mode="overwrite",
           codec="org.apache.hadoop.io.compress.GzipCodec"))
except:
    sys.exit("Could not save files.")
where data is a Spark DataFrame. At execution time, I get the following stack trace:
16/04/19 20:16:24 WARN QueuedThreadPool: 8 threads could not be stopped
16/04/19 20:16:24 ERROR TaskSchedulerImpl: Exception in statusUpdate
java.util.concurrent.RejectedExecutionException: Task org.apache.spark.scheduler.TaskResultGetter$$anon$2@70617ec1 rejected from java.util.concurrent.ThreadPoolExecutor@1bf5370e[Shutting down, pool size = 3, active threads = 3, queued tasks = 0, completed tasks = 2859]
at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2047)
at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823)
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369)
at org.apache.spark.scheduler.TaskResultGetter.enqueueSuccessfulTask(TaskResultGetter.scala:49)
at org.apache.spark.scheduler.TaskSchedulerImpl.liftedTree2$1(TaskSchedulerImpl.scala:347)
at org.apache.spark.scheduler.TaskSchedulerImpl.statusUpdate(TaskSchedulerImpl.scala:330)
at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalBackend.scala:65)
at org.apache.spark.rpc.akka.AkkaRpcEnv.org$apache$spark$rpc$akka$AkkaRpcEnv$$processMessage(AkkaRpcEnv.scala:177)
at org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1$$anonfun$receiveWithLogging$1$$anonfun$applyOrElse$4.apply$mcV$sp(AkkaRpcEnv.scala:126)
at org.apache.spark.rpc.akka.AkkaRpcEnv.org$apache$spark$rpc$akka$AkkaRpcEnv$$safelyCall(AkkaRpcEnv.scala:197)
at org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1$$anonfun$receiveWithLogging$1.applyOrElse(AkkaRpcEnv.scala:125)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:59)
at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
at org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42)
at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
at org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1.aroundReceive(AkkaRpcEnv.scala:92)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
This leads to a bunch of these:
16/04/19 20:16:24 ERROR DiskBlockObjectWriter: Uncaught exception while reverting partial writes to file /tmp/blockmgr-84d7d0a6-a3e5-4f48-bde0-0f6610e44e16/38/temp_shuffle_b9886819-be46-4e28-b57f-e592ea37ab95
java.io.FileNotFoundException: /tmp/blockmgr-84d7d0a6-a3e5-4f48-bde0-0f6610e44e16/38/temp_shuffle_b9886819-be46-4e28-b57f-e592ea37ab95 (No such file or directory)
at java.io.FileOutputStream.open0(Native Method)
at java.io.FileOutputStream.open(FileOutputStream.java:270)
at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
at org.apache.spark.storage.DiskBlockObjectWriter.revertPartialWritesAndClose(DiskBlockObjectWriter.scala:160)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.stop(BypassMergeSortShuffleWriter.java:174)
at org.apache.spark.shuffle.sort.SortShuffleWriter.stop(SortShuffleWriter.scala:104)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
16/04/19 20:16:24 ERROR BypassMergeSortShuffleWriter: Error while deleting file for block temp_shuffle_b9886819-be46-4e28-b57f-e592ea37ab95
16/04/19 20:16:24 ERROR DiskBlockObjectWriter: Uncaught exception while reverting partial writes to file /tmp/blockmgr-84d7d0a6-a3e5-4f48-bde0-0f6610e44e16/29/temp_shuffle_e474bcb1-5ead-4d7c-a58f-5398f32892f2
java.io.FileNotFoundException: /tmp/blockmgr-84d7d0a6-a3e5-4f48-bde0-0f6610e44e16/29/temp_shuffle_e474bcb1-5ead-4d7c-a58f-5398f32892f2 (No such file or directory)
at java.io.FileOutputStream.open0(Native Method)
...and so on (I have intentionally left out some of the last lines.)
I understand (roughly) what is happening, but I am very uncertain what to do about it - is it a memory issue?
I am looking for advice on what to do - is there some setting I can change or add?

hive on spark - ArrayIndexOutOfBoundsException

I want to run 'select count(*) from test' on Hive on Spark through Beeline, but it crashed.
INFO client.SparkClientUtilities: Added jar[file:/home/.../hive/lib/zookeeper-3.4.6.jar] to classpath
INFO client.RemoteDriver: Failed to run job 41dc814c-deb7-4743-9b6a-b6cace2eae19
com.esotericsoftware.kryo.KryoException: java.lang.ArrayIndexOutOfBoundsException: 1
Serialization trace:
dummyOps (org.apache.hadoop.hive.ql.plan.ReduceWork)
left (org.apache.commons.lang3.tuple.ImmutablePair)
edgeProperties (org.apache.hadoop.hive.ql.plan.SparkWork)
at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626)
at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:648)
at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605)
at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
at com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:126)
at com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:17)
at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:648)
at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605)
at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:626)
at org.apache.hadoop.hive.ql.exec.spark.KryoSerializer.deserialize(KryoSerializer.java:49)
at org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient$JobStatusJob.call(RemoteHiveSparkClient.java:235)
at org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:366)
at org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:335)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 1
at com.esotericsoftware.kryo.serializers.MapSerializer.setGenerics(MapSerializer.java:53)
at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:604)
... 19 more
My Hive on Spark configuration is:
set spark.home=/home/.../spark;
set hive.execution.engine=spark;
set spark.master=yarn;
set spark.eventLog.enabled=true;
set spark.eventLog.dir=hdfs://...:9000/user/.../spark-history-server-records;
set spark.executor.memory=1024m;
set spark.serializer=org.apache.spark.serializer.KryoSerializer;
What's worse, when I change KryoSerializer to JavaSerializer, the issue still persists.
Thanks.
