I am running a Spark job that reads from a Kafka queue and writes the data into Avro files with Spark.
But on the cluster, after it has been running for roughly 50 minutes, it gets this exception:
20/01/27 16:21:23 ERROR de.rewe.eem.spark.util.FileBasedShutdown$: Not possible to await termination or timout - exit checkRepeatedlyShutdownAndStop with exception
20/01/27 16:21:23 ERROR org.apache.spark.util.Utils: Uncaught exception in thread Thread-2
java.io.IOException: Filesystem closed
at com.mapr.fs.MapRFileSystem.checkOpen(MapRFileSystem.java:1660)
at com.mapr.fs.MapRFileSystem.lookupClient(MapRFileSystem.java:633)
The Spark configuration and properties are the following:
SPARK_OPTIONS="\--driver-memory 4G--executor-memory 4G--num-executors 4--executor-cores 4--conf spark.driver.memoryOverhead=768--conf spark.driver.maxResultSize=0--conf spark.executor.memory=4g--master yarn--deploy-mode cluster"
For the first 50 minutes or so, the job runs fine.
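The question does not include the job code, but a minimal Structured Streaming sketch of the kind of job described (Kafka source writing to an Avro file sink) might look roughly like the following, assuming Spark 2.4+ with the built-in Avro data source; the broker address, topic name, and paths are placeholders, not values from the actual job:

import org.apache.spark.sql.SparkSession

object KafkaToAvro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("KafkaToAvro").getOrCreate()

    // Read the raw Kafka records as a streaming DataFrame (placeholder broker and topic).
    val kafkaDf = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "events")
      .load()

    // Write the key/value pairs out as Avro files (placeholder output and checkpoint paths).
    val query = kafkaDf
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
      .writeStream
      .format("avro")
      .option("path", "/data/avro/events")
      .option("checkpointLocation", "/data/checkpoints/events")
      .start()

    query.awaitTermination()
  }
}

Note that the stack trace above comes from a shutdown path (checkRepeatedlyShutdownAndStop), so the sketch only illustrates the normal read/write flow described in the question, not the shutdown handling.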
Related
My Spark Structured Streaming job fails with the following exception after running for more than 24 hours.
Exception in thread "spark-listener-group-eventLog" java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.math.BigInteger.<init>(BigInteger.java:1114)
at java.math.BigInteger.valueOf(BigInteger.java:1098)
at scala.math.BigInt$.apply(BigInt.scala:49)
at scala.math.BigInt$.long2bigInt(BigInt.scala:101)
at org.json4s.Implicits$class.long2jvalue(JsonDSL.scala:45)
at org.json4s.JsonDSL$.long2jvalue(JsonDSL.scala:61)
Quick background:
My Structured Streaming job ingests events that arrive as new Parquet files into a Solr collection. The sources are 8 different Hive tables (8 different HDFS locations) receiving events, and the sink is a single Solr collection.
Configuration (rendered as spark-submit flags in the sketch after this list):
Number Executors: 30
Executor Memory: 20 G
Driver memory: 20 G
Executor cores: 5
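As spark-submit flags, in the same SPARK_OPTIONS style used for the first question above (only a sketch; the actual submit command was not shared, and only the four values listed above come from the question):

SPARK_OPTIONS="--num-executors 30 --executor-memory 20G --driver-memory 20G --executor-cores 5"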
I generated an hprof dump file and loaded it into MAT to understand the cause (the MAT screenshot of the dump is not reproduced here). This is a test environment and the data stream's transaction rate is very low, with sometimes no transactions at all.
Any clue on what is causing this? Unfortunately, I'm unable to share a code snippet; sorry about that.
This question already has answers here: S3 SlowDown error in Spark on EMR (2 answers). Closed 4 years ago.
I have a simple Spark program running in an EMR cluster that tries to convert a 60 GB CSV file into Parquet. When I submit the job, I get the exception below.
391, ip-172-31-36-116.us-west-2.compute.internal, executor 96): org.apache.spark.SparkException: Task failed while writing rows.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)Caused by: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Slow Down (Service: Amazon S3; Status Code: 503; Error Code: 503 Slow Down; Request ID: D13A3F4D7DD970FA; S3 Extended Request ID: gj3cPalkkOwtaf9XN/P+sb3jX0CNHu/QF9WTabkgP2ISuXcXdbvYO1Irg0O54OCvKlLz8WoR8E4=), S3 Extended Request ID: gj3cPalkkOwtaf9XN/P+sb3jX0CNHu/QF9WTabkgP2ISuXcXdbvYO1Irg0O54OCvKlLz8WoR8E4=
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1639)
503 Slow Down is a generic response from AWS services when you're doing too many requests per second.
Possible solutions:
Copy your file to HDFS first.
Do you have one 60 GB file, or a lot of files that sum up to 60 GB? If you have a lot of small files, try to combine them first.
Try to decrease the number of partitions in your Parquet output, if you can.
df.repartition(100)
Try using fewer Spark workers.
val spark = SparkSession.builder.appName("Simple Application").master("local[1]").getOrCreate()
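Putting the suggestions above together, a minimal sketch of the conversion job might look like the following; the paths, header option, and partition count are assumptions for illustration, not the asker's actual code:

import org.apache.spark.sql.SparkSession

object CsvToParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("CsvToParquet").getOrCreate()

    // Read the CSV input, ideally after copying it to HDFS first (placeholder path).
    val df = spark.read
      .option("header", "true")
      .csv("hdfs:///data/input/")

    // Write fewer, larger Parquet files to keep the S3 request rate down (placeholder bucket).
    df.repartition(100)
      .write
      .mode("overwrite")
      .parquet("s3a://my-bucket/output/")
  }
}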
I'm surprised that things failed; the Apache s3a client backs off when it sees a problem like this: your work is done, just more slowly.
All of Sergey's advice is good. I'd start by coalescing small files and reducing workers: a smaller cluster can deliver more performance, and save money.
One more: if you are using SSE-KMS to encrypt the data, accessing that key can trigger throttle events too; that throttling is shared across all applications trying to use the KMS store.
I have Kafka producer code in Java that watches a directory for new files using the java.nio WatchService API, picks up any new file, and pushes it to a Kafka topic. A Spark Streaming consumer reads from the Kafka topic. I am getting the following error after the Kafka producer job has been running for about a day. The producer pushes about 500 files every 2 minutes. My Kafka topic has 1 partition and a replication factor of 2. Can someone please help?
org.apache.kafka.common.KafkaException: Failed to construct kafka producer
at org.apache.kafka.clients.producer.KafkaProducer.<init>(KafkaProducer.java:342)
at org.apache.kafka.clients.producer.KafkaProducer.<init>(KafkaProducer.java:166)
at com.hp.hawkeye.HawkeyeKafkaProducer.Sender.createProducer(Sender.java:60)
at com.hp.hawkeye.HawkeyeKafkaProducer.Sender.<init>(Sender.java:38)
at com.hp.hawkeye.HawkeyeKafkaProducer.HawkeyeKafkaProducer.<init>(HawkeyeKafkaProducer.java:54)
at com.hp.hawkeye.HawkeyeKafkaProducer.myKafkaTestJob.main(myKafkaTestJob.java:81)
Caused by: org.apache.kafka.common.KafkaException: java.io.IOException: Too many open files
at org.apache.kafka.common.network.Selector.<init>(Selector.java:125)
at org.apache.kafka.common.network.Selector.<init>(Selector.java:147)
at org.apache.kafka.clients.producer.KafkaProducer.<init>(KafkaProducer.java:306)
... 7 more
Caused by: java.io.IOException: Too many open files
at sun.nio.ch.EPollArrayWrapper.epollCreate(Native Method)
at sun.nio.ch.EPollArrayWrapper.<init>(EPollArrayWrapper.java:130)
at sun.nio.ch.EPollSelectorImpl.<init>(EPollSelectorImpl.java:69)
at sun.nio.ch.EPollSelectorProvider.openSelector(EPollSelectorProvider.java:36)
at java.nio.channels.Selector.open(Selector.java:227)
at org.apache.kafka.common.network.Selector.<init>(Selector.java:123)
... 9 more
Check ulimit -aH, and ask your admin to increase the open-files limit, for example:
open files (-n) 655536
Otherwise, I suspect there might be file-descriptor leaks in your code; see:
http://mail-archives.apache.org/mod_mbox/spark-user/201504.mbox/%3CCAKWX9VVJZObU9omOVCfPaJ_bPAJWiHcxeE7RyeqxUHPWvfj7WA#mail.gmail.com%3E
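One common source of such a leak, given that the failure happens inside the KafkaProducer constructor when it opens a new Selector, is constructing a new producer for every file instead of reusing one. A hedged sketch of the reuse pattern follows (in Scala, though the asker's producer is in Java; the broker address, topic, and serializers are placeholders and the asker's actual producer code is not shown):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object SingleProducerSketch {
  // Build one producer and reuse it for every file; creating (and never closing)
  // a new KafkaProducer per file leaks selectors and file descriptors until the
  // process hits "Too many open files".
  def buildProducer(): KafkaProducer[String, String] = {
    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092") // placeholder broker
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    new KafkaProducer[String, String](props)
  }

  def main(args: Array[String]): Unit = {
    val producer = buildProducer()
    try {
      // One record per detected file; the payload here is a placeholder.
      producer.send(new ProducerRecord("my-topic", "file-name", "file-contents"))
    } finally {
      producer.close() // always close so sockets and selectors are released
    }
  }
}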
This was cross-posted in SPARK-22685.
TL;DR: if shard checkpoints don't exist in DynamoDB (i.e. a completely fresh start), the Spark Streaming application reading from Kinesis works flawlessly. However, if the checkpoints do exist (e.g. after an app restart), it fails most of the time.
The app uses Spark Streaming 2.2.0 and spark-streaming-kinesis-asl_2.11.
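The question does not show how the receiver is created, but with spark-streaming-kinesis-asl 2.2.0 it is typically done along these lines (application name, stream name, endpoint, region, and intervals below are placeholders); note that the application name is also the name of the DynamoDB table the KCL uses for the shard leases and checkpoints discussed below:

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kinesis.KinesisUtils
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream

object KinesisReceiverSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("my-kinesis-app")
    val ssc = new StreamingContext(conf, Seconds(10))

    // The application name doubles as the DynamoDB lease/checkpoint table name.
    val stream = KinesisUtils.createStream(
      ssc,
      "my-kinesis-app",
      "my-stream",
      "https://kinesis.us-east-1.amazonaws.com",
      "us-east-1",
      InitialPositionInStream.LATEST,
      Seconds(10),                      // KCL checkpoint interval
      StorageLevel.MEMORY_AND_DISK_2)

    // A trivial output operation so the context has something to run.
    stream.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}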
When starting the app with checkpointed shard data (written by KCL to DynamoDB), after a few successful batches (number varies), this is what I can see in the logs:
First, leases are lost:
17/12/01 05:16:50 INFO LeaseRenewer: Worker 10.0.182.119:9781acd5-6cb3-4a39-a235-46f1254eb885 lost lease with key shardId-000000000515
Then, in random order, "Can't update checkpoint - instance doesn't hold the lease for this shard" and "com.amazonaws.SdkClientException: Unable to execute HTTP request: The target server failed to respond" follow, bringing down the whole app within a few batches:
17/12/01 05:17:10 ERROR ProcessTask: ShardId shardId-000000000394: Caught exception:
com.amazonaws.SdkClientException: Unable to execute HTTP request: The target server failed to respond
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleRetryableException(AmazonHttpClient.java:1069)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1035)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:742)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:716)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:699)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:667)
at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:649)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:513)
at com.amazonaws.services.kinesis.AmazonKinesisClient.doInvoke(AmazonKinesisClient.java:1948)
at com.amazonaws.services.kinesis.AmazonKinesisClient.invoke(AmazonKinesisClient.java:1924)
at com.amazonaws.services.kinesis.AmazonKinesisClient.executeGetRecords(AmazonKinesisClient.java:969)
at com.amazonaws.services.kinesis.AmazonKinesisClient.getRecords(AmazonKinesisClient.java:945)
at com.amazonaws.services.kinesis.clientlibrary.proxies.KinesisProxy.get(KinesisProxy.java:156)
at com.amazonaws.services.kinesis.clientlibrary.proxies.MetricsCollectingKinesisProxyDecorator.get(MetricsCollectingKinesisProxyDecorator.java:74)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisDataFetcher.getRecords(KinesisDataFetcher.java:68)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ProcessTask.getRecordsResultAndRecordMillisBehindLatest(ProcessTask.java:291)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ProcessTask.getRecordsResult(ProcessTask.java:256)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ProcessTask.call(ProcessTask.java:127)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:49)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:24)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.http.NoHttpResponseException: The target server failed to respond
at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:143)
at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57)
at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:261)
at org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:165)
at org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:167)
at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:272)
at com.amazonaws.http.protocol.SdkHttpRequestExecutor.doReceiveResponse(SdkHttpRequestExecutor.java:82)
at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:124)
at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:271)
at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184)
at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
at com.amazonaws.http.apache.client.impl.SdkHttpClient.execute(SdkHttpClient.java:72)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1190)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1030)
... 22 more
and
17/12/01 05:20:59 ERROR KinesisRecordProcessor: ShutdownException: Caught shutdown exception, skipping checkpoint.
com.amazonaws.services.kinesis.clientlibrary.exceptions.ShutdownException: Can't update checkpoint - instance doesn't hold the lease for this shard
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisClientLibLeaseCoordinator.setCheckpoint(KinesisClientLibLeaseCoordinator.java:173)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.RecordProcessorCheckpointer.advancePosition(RecordProcessorCheckpointer.java:216)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.RecordProcessorCheckpointer.checkpoint(RecordProcessorCheckpointer.java:137)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.RecordProcessorCheckpointer.checkpoint(RecordProcessorCheckpointer.java:103)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1$$anonfun$apply$1.apply$mcV$sp(KinesisCheckpointer.scala:94)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1$$anonfun$apply$1.apply(KinesisCheckpointer.scala:94)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1$$anonfun$apply$1.apply(KinesisCheckpointer.scala:94)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.streaming.kinesis.KinesisRecordProcessor$.retryRandom(KinesisRecordProcessor.scala:158)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1.apply(KinesisCheckpointer.scala:94)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1.apply(KinesisCheckpointer.scala:88)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer.checkpoint(KinesisCheckpointer.scala:88)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer.org$apache$spark$streaming$kinesis$KinesisCheckpointer$$checkpointAll(KinesisCheckpointer.scala:116)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$1.apply$mcVJ$sp(KinesisCheckpointer.scala:130)
at org.apache.spark.streaming.util.RecurringTimer.triggerActionForNextInterval(RecurringTimer.scala:94)
at org.apache.spark.streaming.util.RecurringTimer.org$apache$spark$streaming$util$RecurringTimer$$loop(RecurringTimer.scala:106)
at org.apache.spark.streaming.util.RecurringTimer$$anon$1.run(RecurringTimer.scala:29)
Right now, the workaround is to delete all the checkpoint data for all shards from DynamoDB so that the app starts from InitialPositionInStream.LATEST. Obviously, the downside is that the checkpoint information is not used at all and data is lost.
I may have missed something obvious, so any help would be appreciated.
I am restarting a Spark streaming job that is checkpointed in HDFS. I am purposely killing the job after 5 minutes and restarting it to test the recovery. I receive this error once ssc.start() is invoked.
INFO WriteAheadLogManager : Recovered 1 write ahead log files from hdfs://...receivedBlockMetadata
INFO WriteAheadLogManager : Reading from the logs:
Exception in thread "main" org.apache.spark.SparkException: org.apache.spark.streaming.dstream.ReducedWindowedDStream#65600fb3 has not been initialized
at org.apache.spark.streaming.dstream.DStream.isTimeValid(DStream.scala:321)
I am starting the job using: StreamingContext.getOrCreate(checkpointDir,...
The job has three windowed operations that are sliding windows of 5 minutes, 1 hour, and 1 day, but the job was stopped after 5 minutes. In order for the recovery from the checkpoint to work, does the maximum windowed duration need to pass for all the windowed ops to initialize?
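For reference, the standard checkpoint-recovery pattern with StreamingContext.getOrCreate (as documented in the Spark Streaming programming guide) looks roughly like the sketch below; the checkpoint path, batch interval, and window definitions are placeholders rather than the actual job's code:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointedJobSketch {
  val checkpointDir = "hdfs:///checkpoints/my-job" // placeholder path

  // Everything, including the 5-minute / 1-hour / 1-day windowed operations,
  // must be defined inside this factory; on restart the DStream graph is
  // rebuilt from the checkpoint and this function is not called.
  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("CheckpointedJobSketch")
    val ssc = new StreamingContext(conf, Seconds(30)) // placeholder batch interval
    ssc.checkpoint(checkpointDir)
    // ... define sources and the windowed operations here ...
    ssc
  }

  def main(args: Array[String]): Unit = {
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}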
I encountered the same problem, and I deleted the checkpoint path on HDFS to avoid the exception.