I have some schedule spark jobs running on k8s,
sometimes the driver pod is stacking on "Running" status regarding failures in the sparkSession.getOrCreate function, everytime regarding another reason.
failure examples:
Exception in thread "main"
io.fabric8.kubernetes.client.KubernetesClientException: Failed to
start websocket
Exception in thread "main"
io.fabric8.kubernetes.client.KubernetesClientException: Failure
executing: POST at:
https://kubernetes.default.svc/api/v1/namespaces/xxxxx/configmaps.
Message: Internal error occurred: failed calling webhook
"objects.hnc.x-k8s.io": Post
"https://webhook-service.system.svc/validate-objects?timeout=2s":
context deadline exceeded.
For now I don't care about why those exception are occurring (unless you know it, which can be greate bonus for here) - but I care about how to make the driver exit and close.
I am afraid that during the creation of the context some of the resources already created but not being free.
Thanks.
Related
When I submit my spark program, it fails at the end but with a ExitCode:0 as shown in the picture.
The program should write a table on hive and despite the failure, the table was created successfully.
But I can't figure out the origin of the error. Can you help please.
Yarn logs -appID gives the following output here
I finally solved my problem. In fact today I suprisingly got another error from the same program saying
ERROR CoarseGrainedExecutorBackend: Executor self-exiting due to : Driver X:37478 disassociated! Shutting down.
And I found so many solutions talking about memory or timeOut.
What did the trick is just that I forgot to close my sparkSession (spark.close()).
I am running a application I made in service fabric on the local dev cluster.
when debugging and seeing the output, its filled with
Exception thrown: 'System.InvalidOperationException' in Microsoft.ServiceFabric.Data.Impl.dll
Exception thrown: 'System.InvalidOperationException' in Microsoft.ServiceFabric.Data.Impl.dll
Exception thrown: 'System.InvalidOperationException' in Microsoft.ServiceFabric.Data.Impl.dll
Exception thrown: 'System.InvalidOperationException' in Microsoft.ServiceFabric.Data.Impl.dll
Exception thrown: 'System.InvalidOperationException' in Microsoft.ServiceFabric.Data.Impl.dll
Exception thrown: 'System.InvalidOperationException' in Microsoft.ServiceFabric.Data.Impl.dll
Exception thrown: 'System.InvalidOperationException' in Microsoft.ServiceFabric.Data.Impl.dll
Exception thrown: 'System.InvalidOperationException' in Microsoft.ServiceFabric.Data.Impl.dll
everything is green and I havent found anything that dont work as expected. So what exactly do I do to find out what it is that is an InvalidOperation :) ?
Checked event logs and diagnostic events and no errors are shown.
It is probably catched and handled inside SFA, if it would be serious it would be rethrown so I would not worry about it too much.
Might have to do with the fact that it is the local dev environment and certain operations are not valid in a non-azure / on-premises installation environment.
A local dev environment for example lets you simulate a 5 node cluster on a single machine. Any exceptions caused by that might be catched and if it is detected that it is a non-real world environment it could be handled in a different way.
I'm seeing exactly the same thing, but in production as well and agree that I don't think it's a problem, though it will impact performance.
I'll put a few more things in context to hopefully help other people who see the same thing.
Here's my code which works correctly and I don't see the exception being thrown all the way up.
Log.Information("About to record activity for {Machine}", machineId);
using (var tx = StateManager.CreateTransaction())
{
await map.AddOrUpdateAsync(tx, machineId, _ => new SystemInfo { LastSeenUtc = timestampUtc }, (_, info) =>
{
info.LastSeenUtc = timestampUtc;
return info;
}).ConfigureAwait(false);
await tx.CommitAsync().ConfigureAwait(false);
Log.Information("Just committed transaction {Tx}", tx.TransactionId);
}
Log.Information("Recorded activity for {Machine}", machineId);
And with the debug output window. I'm using serilog with the Serilog.Sinks.Debug sink to capture the output.
[07:48:27 INF] About to record activity for d6441d45e1834db7860c00a8074652f9
[07:48:27 INF] Just committed transaction 131489813076754311
[07:48:27 INF] Recorded activity for d6441d45e1834db7860c00a8074652f9
Exception thrown: 'System.InvalidOperationException' in Microsoft.ServiceFabric.Data.Impl.dll
Transaction 131489813076754311 is committing or rolling back or has already committed or rolled back
And I see the stack trace as
Everything else appears to be fine - but throwing exceptions is not great for performance.
When attempting to call H2OContext.getOrCreate with a valid SparkContext, randomly we keep seeing failures to deploy:
17/04/21 17:21:32 ERROR TaskSchedulerImpl: Lost executor 0 on 172.17.0.4: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/04/21 17:21:38 ERROR LiveListenerBus: Listener ExecutorAddNotSupportedListener threw an exception
java.lang.IllegalArgumentException: Executor without H2O instance discovered, killing the cloud!
at org.apache.spark.listeners.ExecutorAddNotSupportedListener.onExecutorAdded(H2OSparkListener.scala:27)
at org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:61)
at org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36)
at org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36)
at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63)
at org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:36)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(LiveListenerBus.scala:94)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:78)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1252)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:77)
The H2OContext.getOrCreate causes the error:
Context.spark_session = SparkSession.builder.getOrCreate()
Context.h2o_context = H2OContext.getOrCreate(Context.spark_session)
Any thoughts from the H2O Crew?
this is a known behaviour of Sparkling Water internal backend at the moment. To avoid this, the external Sparkling Water backend can be used. More information about this can be found here https://github.com/h2oai/sparkling-water/blob/master/doc/backends.md
I'm currently working on this JIRA which should eliminate the behaviour above as well. It's work in progress, this JIRA https://0xdata.atlassian.net/browse/SW-369 can be tracked to get the status of the task.
I am using Spark 2.0 and sometimes my job fails due to problems with input. For example, I am reading CSV files off from a S3 folder based on the date, and if there's no data for the current date, my job has nothing to process so it throws an exception as follows. This gets printed in the driver's logs.
Exception in thread "main" org.apache.spark.sql.AnalysisException: Path does not exist: s3n://data/2016-08-31/*.csv;
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:58)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:174)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
...
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:729)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
16/09/03 10:51:54 INFO SparkContext: Invoking stop() from shutdown hook
16/09/03 10:51:54 INFO SparkUI: Stopped Spark web UI at http://192.168.1.33:4040
16/09/03 10:51:54 INFO StandaloneSchedulerBackend: Shutting down all executors
16/09/03 10:51:54 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking each executor to shut down
Spark App app-20160903105040-0007 state changed to FINISHED
However, despite this uncaught exception, my Spark Job status is 'FINISHED'. I would expect it to be in 'FAILED' status because there was an exception. Why is it marked as FINISHED? How can I find out whether the job failed or not?
Note: I am spawning the Spark jobs using SparkLauncher, and listening to state changes through AppHandle. But the state change I receive is FINISHED whereas I am expecting FAILED.
The one FINISHED you see is for Spark application not a job. It is FINISHED since the Spark context was able to start and stop properly.
You can see any job information using JavaSparkStatusTracker.
For active jobs nothing additional should be done, since it has ".getActiveJobIds" method.
For getting finished/failed you will need to setup the job group ID in the thread from which you are calling for a spark execution:
JavaSparkContext sc;
...
sc.setJobGroup(MY_JOB_ID, "Some description");
Then whenever you need, you can read the status of each job with in specified job group:
JavaSparkStatusTracker statusTracker = sc.statusTracker();
for (int jobId : statusTracker.getJobIdsForGroup(JOB_GROUP_ALL)) {
final SparkJobInfo jobInfo = statusTracker.getJobInfo(jobId);
final JobExecutionStatus status = jobInfo.status();
}
The JobExecutionStatus can be one of RUNNING, SUCCEEDED, FAILED, UNKNOWN; The last one is for case of job is submitted, but not actually started.
Note: all this is available from Spark driver, which is jar you are launching using SparkLauncher. So above code should be placed into the jar.
If you want to check in general is there any failures from the side of Spark Launcher, you can exit the application started by Jar with exit code different than 0 using kind of System.exit(1), if detected a job failure. The Process returned by SparkLauncher::launch contains exitValue method, so you can detect is it failed or no.
you can always go to spark history server and click on your job id to
get the job details.
If you are using yarn then you can go to resource manager web UI to
track your job status.
I am using Spring to configure a (lazy loaded) Hazelcast Client that connects to a 2 member cluster.
<hz:client id="hazelcast" lazy-init="true">
<hz:group name="${HzName}" password="${HzPassword}"/>
<hz:properties>
<hz:property name="hazelcast.client.connection.timeout">10000</hz:property>
<hz:property name="hazelcast.client.retry.count">600</hz:property>
<hz:property name="hazelcast.jmx">true</hz:property>
<hz:property name="hazelcast.logging.type">slf4j</hz:property>
</hz:properties>
<hz:network smart-routing="true" redo-operation="true" connection-attempt-period="5000"
connection-attempt-limit="2">
<hz:member>${HzMember1}</hz:member>
<hz:member>${HzMember2}</hz:member>
</hz:network>
</hz:client>
My issue is:
If, at the time of my application starting, BOTH of the cluster members happen to be unavailable then I am seeing ClusterListenerThread throwing a SEVERE exception:
WARNING: Unable to get alive cluster connection, try in 4945 ms later, attempt 1 of 2.
16-Apr-2015 14:57:34 com.hazelcast.client.spi.impl.ClusterListenerThread
WARNING: Unable to get alive cluster connection, try in 4987 ms later, attempt 2 of 2.
16-Apr-2015 14:57:39 com.hazelcast.client.spi.impl.ClusterListenerThread
SEVERE: Error while connecting to cluster!
java.lang.IllegalStateException: Unable to connect to any address in the config!
at com.hazelcast.client.spi.impl.ClusterListenerThread.connectToOne(ClusterListenerThread.java:273)
at com.hazelcast.client.spi.impl.ClusterListenerThread.run(ClusterListenerThread.java:79)
Caused by: com.hazelcast.spi.exception.RetryableIOException: java.util.concurrent.ExecutionException: com.hazelcast.core.HazelcastException: java.net.ConnectException: Connection refused
at com.hazelcast.client.connection.nio.ClientConnectionManagerImpl$OwnerConnectionFuture.createNew(ClientConnectionManagerImpl.java:649)
at com.hazelcast.client.connection.nio.ClientConnectionManagerImpl$OwnerConnectionFuture.access$300(ClientConnectionManagerImpl.java:605)
at com.hazelcast.client.connection.nio.ClientConnectionManagerImpl.ownerConnection(ClientConnectionManagerImpl.java:268)
at com.hazelcast.client.spi.impl.ClusterListenerThread.connectToOne(ClusterListenerThread.java:245)
...which subsequently results in my Hazelcast Client bean not being instantiated and therefore everything downstream that relies upon it's existence blowing up as well.
SEVERE: Context initialization failed
org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'hazelcast': Instantiation of bean failed; nested exception is org.springframework.beans.factory.BeanDefinitionStoreException: Factory method [public static com.hazelcast.core.HazelcastInstance com.hazelcast.client.HazelcastClient.newHazelcastClient(com.hazelcast.client.config.ClientConfig)] threw exception; nested exception is java.lang.IllegalStateException: Cannot get initial partitions!
How can I detect and handle the error being thrown by ClusterListenerThread during the instantiation (by Spring) of my Hazelcast Client bean?
nb. I'm frustrated by the fact that the Client bean is not being constructed (despite there being no available members) because I know for a fact that an already constructed Client can handle having all of it's member's become unavailable and will happily start working again when one or more of those members become available again.
What you need to do is configure connectionAttemptLimit and connectionAttemptPeriod.
If you set connectionAttemptLimit to INT_MAX and pick a reasonable connectiontAttemptPeriod, the client will practically try to connect to given memberlist forever every X seconds. This is currently more like a workaround, because when you first open HazelcastClient it will block until one of the members become available. A further workaround, you can try to open the client in another dedicated thread.
About making a fully supported feature, there was a discussion going on by the hazelcast team here.
https://github.com/hazelcast/hazelcast/issues/552
I am adding a question as a reference to the issue.