Databricks error: Internal error, sorry. Attach your notebook to a different cluster or restart the current cluster

I'm working in Databricks 8.1. If I run my PySpark code cell by cell using Shift + Enter, it runs fine, but if I run the entire notebook using the Run All button, I get an internal error message. The complete error message is:
Internal error, sorry. Attach your notebook to a different cluster or restart the current cluster.
com.databricks.rpc.RPCResponseTooLarge: rpc response (of 20984709 bytes) exceeds limit of 20971520 bytes
    at com.databricks.rpc.Jetty9Client$$anon$1.onContent(Jetty9Client.scala:370)
    at shaded.v9_4.org.eclipse.jetty.client.api.Response$Listener$Adapter.onContent(Response.java:248)
    at shaded.v9_4.org.eclipse.jetty.client.ResponseNotifier.notifyContent(ResponseNotifier.java:135)
    at shaded.v9_4.org.eclipse.jetty.client.ResponseNotifier.notifyContent(ResponseNotifier.java:126)
    at shaded.v9_4.org.eclipse.jetty.client.HttpReceiver.responseContent(HttpReceiver.java:340)
    at shaded.v9_4.org.eclipse.jetty.client.http.HttpReceiverOverHTTP.content(HttpReceiverOverHTTP.java:283)
    at shaded.v9_4.org.eclipse.jetty.http.HttpParser.parseContent(HttpParser.java:1762)
    at shaded.v9_4.org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:1490)
    at shaded.v9_4.org.eclipse.jetty.client.http.HttpReceiverOverHTTP.parse(HttpReceiverOverHTTP.java:172)
    at shaded.v9_4.org.eclipse.jetty.client.http.HttpReceiverOverHTTP.process(HttpReceiverOverHTTP.java:135)
    at shaded.v9_4.org.eclipse.jetty.client.http.HttpReceiverOverHTTP.receive(HttpReceiverOverHTTP.java:73)
    at shaded.v9_4.org.eclipse.jetty.client.http.HttpChannelOverHTTP.receive(HttpChannelOverHTTP.java:133)
    at shaded.v9_4.org.eclipse.jetty.client.http.HttpConnectionOverHTTP.onFillable(HttpConnectionOverHTTP.java:151)
    at shaded.v9_4.org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
    at shaded.v9_4.org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
    at shaded.v9_4.org.eclipse.jetty.io.ssl.SslConnection$DecryptedEndPoint.onFillable(SslConnection.java:426)
    at shaded.v9_4.org.eclipse.jetty.io.ssl.SslConnection.onFillable(SslConnection.java:320)
    at shaded.v9_4.org.eclipse.jetty.io.ssl.SslConnection$2.succeeded(SslConnection.java:158)
    at shaded.v9_4.org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
    at shaded.v9_4.org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117)
    at shaded.v9_4.org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:336)
    at shaded.v9_4.org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:313)
    at shaded.v9_4.org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:171)
    at shaded.v9_4.org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:129)
    at shaded.v9_4.org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:367)
    at shaded.v9_4.org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:782)
    at shaded.v9_4.org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:914)
    at java.base/java.lang.Thread.run(Thread.java:834)
I have restarted the cluster several times, but I still get the same issue.
Can you suggest the steps to resolve this issue?
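For context, a simplified stand-in for the kind of cell my notebook runs (not my actual code): as far as I can tell, the oversized RPC response is the command output sent back to the notebook UI, so I suspect cells that render very large results are what push it over the 20 MB limit when everything runs under Run All.

df = spark.range(0, 10_000_000)   # hypothetical stand-in data; `spark` is predefined in Databricks notebooks

# Likely to produce a very large response when rendered in the notebook:
# display(df)        # renders millions of rows back to the UI
# df.toPandas()      # pulls everything to the driver and into the cell output

# Keeps the response small:
display(df.limit(1000))           # render only a sample
print(df.count())                 # return aggregates instead of raw rows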

Related

"Error: Key not loaded" in h2o deployed through a K3s cluster, using python3 client

I can confirm the 3-replica h2o cluster inside K3s is correctly deployed, as running h2o.init(ip="x.x.x.x") in the Python3 interpreter works as expected. I followed the instructions noted here: https://www.h2o.ai/blog/running-h2o-cluster-on-a-kubernetes-cluster/
Nevertheless, I had to modify the service.yaml and comment out the line which says clusterIP: None, as K3s was complaining about its inability to set the clusterIP to None. Even so, I can certify it is working correctly, and I am able to use an external IP to connect to the cluster.
If I try to load the dataset on the h2o cluster inside K3s, following the exact same steps described here http://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html, this is the output I get:
>>> train = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
...
h2o.exceptions.H2OResponseError: Server error java.lang.IllegalArgumentException:
Error: Key not loaded: Key<Frame> https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv
Request: POST /3/ParseSetup
data: {'check_header': '0', 'source_frames': '["https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv"]'}
The same error occurs if I use the h2o.upload_file("x.csv") method.
There is a clue about what may be happening here: Key not loaded: Key<Frame> while POSTing source frame through ParseSetup in H2O API call, but I am not using curl, and I cannot find any parameter that could help me overcome this issue: http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/h2o.html?highlight=import_file#h2o.import_file
I need to use the Python client inside the same K3s cluster for various technical reasons, so I am not able to launch either Flow or Firebug to find out what may be happening.
I can confirm it works correctly when I simply issue an h2o.init(), using the local Java instance.
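For clarity, these are the two connection modes I am comparing; the address and port are placeholders, not the values from my actual deployment.

import h2o

# Connect to the h2o cluster exposed by the K3s service (placeholder address).
h2o.init(ip="x.x.x.x", port=54321)

# Versus the local fallback that does work: with no arguments, h2o.init()
# starts (or attaches to) a local Java instance.
# h2o.init()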
UPDATE 1:
I have tried different K3s clusters without success. I changed the service.yaml to a NodePort, and now this is the error traceback:
>>> train = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
...
h2o.exceptions.H2OResponseError: Server error java.lang.IllegalArgumentException:
Error: Job is missing
Request: GET /3/Jobs/$03010a2a016132d4ffffffff$_a2366be93ec99a78d7bc161de8c54d67
UPDATE 2:
I have tried using different services (NodePort, LoadBalancer, ClusterIP) and none of them work. I have also tried using Minikube with the official image, and with a custom image I made, without success. I suspect this is related either to h2o itself or to the clustering between pods. I will keep digging and hope there is some gold in it.
UPDATE 3:
I also found out that the post about running H2O in Docker (https://www.h2o.ai/blog/h2o-docker/) is really outdated, and the Dockerfile on GitHub is not working either (I changed it to uncomment the ENTRYPOINT section, without success): https://github.com/h2oai/h2o-3/blob/master/Dockerfile
Even so, I tried the custom image I built for h2o-k8s and it works seamlessly in pure Docker. I am wondering why it still does not work in K8s...
UPDATE 4:
I have tried modifying the environment variable called H2O_KUBERNETES_SERVICE_DNS without success.
In the meantime, the cluster started to become unavailable, that is, the readinessProbes would not complete successfully. No matter what I change now, it does not work.
I spun up a K3d cluster locally to see what happened, and surprisingly, the readinessProbes were not failing, using v3.30.0.6. But then I started testing it with R instead of Python. I am glad I tried, because I may have pinpointed what was wrong: there is a version mismatch between the client and the server. So I updated the image accordingly, to v3.30.0.1.
But now again, the readinessProbe is not working in my k3d cluster, so I am unable to test it.
It seems it is working now. R client version 3.30.0.1 with server version 3.30.0.1. I also tried Python client version 3.30.0.7 with server version 3.30.0.7 and it started working. Marvelous. The problem was caused by a version mismatch between the client and the server, as the Python client had been updated to 3.30.0.7 while the latest server image for Docker was 3.30.0.6.
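For reference, a quick way to compare the two versions from the Python client (the address is a placeholder; a hedged sketch rather than anything official):

import h2o

h2o.init(ip="x.x.x.x", port=54321)        # placeholder address of the K8s-hosted cluster

# Compare the client library version with the version the server reports;
# the errors above disappeared once these two matched.
print("client:", h2o.__version__)
print("server:", h2o.cluster().version)

If they differ, pinning the client (for example, pip install h2o==3.30.0.6 to match the server image) or rebuilding the server image at the client's version should line them up.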

Application failed 2 times due to AM container for appattempt_ exited with exitCode: 0

When I submit my Spark program, it fails at the end, but with exitCode: 0, as shown in the picture.
The program should write a table to Hive, and despite the failure, the table was created successfully.
But I can't figure out the origin of the error. Can you help, please?
Yarn logs -appID gives the following output here
I finally solved my problem. In fact, today I surprisingly got another error from the same program saying
ERROR CoarseGrainedExecutorBackend: Executor self-exiting due to : Driver X:37478 disassociated! Shutting down.
I found many solutions talking about memory or timeouts.
What did the trick was simply that I had forgotten to close my SparkSession (spark.close()).
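In PySpark terms (where the equivalent call is spark.stop()), a minimal hedged sketch of what closing the session at the end of the job looks like; the app name and Hive table name are placeholders:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("daily-hive-load")                 # placeholder app name
         .enableHiveSupport()
         .getOrCreate())

try:
    df = spark.sql("SELECT 1 AS id")                 # stand-in for the real job logic
    df.write.mode("overwrite").saveAsTable("db.my_table")   # placeholder Hive table
finally:
    spark.stop()   # close the session so YARN sees a clean application shutdown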

Unable to start Kudu master

While starting kudu-master, I get the error below and am unable to start the Kudu cluster.
F0706 10:21:33.464331 27576 master_main.cc:71] Check failed: _s.ok() Bad status: Invalid argument: Unable to initialize catalog manager: Failed to initialize sys tables async: on-disk master list (hadoop-master:7051, slave2:7051, slave3:7051) and provided master list (:0) differ. Their symmetric difference is: :0, hadoop-master:7051, slave2:7051, slave3:7051
It is a cluster of 8 nodes, and I have provided 3 masters, as given below, in master.gflagfile on the master nodes.
--master_addresses=hadoop-master,slave2,slave3
TL;DR
If this is a new installation, working under the assumption that the master IP addresses are correct, I believe the easiest solution is to:
Stop kudu masters
Nuke the <kudu-data-dir>/master directory
Start kudu masters
Explanation
I believe the most common (if not the only) cause of this error (Failed to initialize sys tables async: on-disk master list (hadoop-master:7051, slave2:7051, slave3:7051) and provided master list (:0) differ.) is a kudu master node being added incorrectly. The error suggests that kudu-master thinks it is running on a single node rather than a 3-node cluster.
Maybe you did not intend to "add a node", but that is most likely what happened. I am saying this because I had the same problem; after some googling and debugging, I discovered that during the installation I had started kudu-master before putting the correct IP addresses in master.gflagfile, so kudu-master was spun up thinking it was running on a single node, not a 3-node cluster. Using the steps above to cleanly reinstall kudu-master, my problem was solved.
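A hedged sketch of the TL;DR steps above; the service name and data directory are assumptions (packaged installs often use kudu-master and /var/lib/kudu), so adjust them to your layout, and run this on every master node:

sudo systemctl stop kudu-master                          # 1. stop the master
sudo mv /var/lib/kudu/master /var/lib/kudu/master.bak    # 2. "nuke" (here: set aside) the stale master metadata
sudo systemctl start kudu-master                         # 3. start again with the corrected master.gflagfile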

Errors trying to resize instance group with aws cli

I have a standing EMR cluster and a daily job I want to run. I was trying to use the aws cli to resize the cluster, with the plan of adding this to the crontab so the cluster would grow and then shrink later. (I don't have the ability to use auto-scaling, so that's out.)
I have read the Amazon documentation, and the examples they give don't work. I've tried the natural variations, but I end up getting nowhere.
According to the documentation, the command is
aws emr modify-instance-groups --instance-groups InstanceGroupId=ig-31JXXXXXXBTO,InstanceCount=4
However, when I try this with my own instance group ID, I get:
Error parsing parameter '--instance-groups': Expected: '<second>', received: '<none>' for input:InstanceGroupId=ig-31JXXXXXXBTO,
I've tried doing things like removing the instance count, hoping for a more informative error...
aws emr modify-instance-groups --instance-groups InstanceGroupId=ig-WCXEP0AXCGJS
which gives the response
An error occurred (ValidationException) when calling the ModifyInstanceGroups operation: Please provide either an instance count or a list of EC2 instance ids to terminate.
I've tried several variations without luck. Any ideas? Thanks.
I ended up submitting a trouble ticket through Amazon.
The resize command requires that no space occur after the comma. The troubleshooter has reported this behavior and the unhelpful error to the developers.
aws emr modify-instance-groups --instance-groups InstanceGroupId=ig-31JXXXXXXBTO,InstanceCount=4
will work, as long as there is no space after the comma. Hopefully they'll either fix that or provide a better error message.
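Since the plan was to drive this from cron, a hedged sketch of what the crontab entries could look like; the instance group ID, counts, and times are placeholders, and note, again, that there is no space after the comma:

0 6 * * *  aws emr modify-instance-groups --instance-groups InstanceGroupId=ig-XXXXXXXXXXXX,InstanceCount=10
0 20 * * * aws emr modify-instance-groups --instance-groups InstanceGroupId=ig-XXXXXXXXXXXX,InstanceCount=2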

Getting "Error initializing cluster data" in OpsCenter

I am getting the following error in OpsCenter; after some time, the issue resolved itself.
Error initializing cluster data: The request to /APP_Live/keyspaces?ksfields=column_families%2Creplica_placement_strategy%2Cstrategy_options%2Cis_system%2Cdurable_writes%2Cskip_repair%2Cuser_types%2Cuser_functions%2Cuser_aggregates&cffields=solr_core%2Ccreate_query%2Cis_in_memory%2Ctiers timed out after 10 seconds.
If you continue to see this error message, you can workaround this timeout by setting [ui].default_api_timeout to a value larger than 10 in opscenterd.conf and restarting opscenterd. Note that this is a workaround and you should also contact DataStax Support to follow up.
The workaround for this timeout is to set [ui].default_api_timeout to a value larger than 10 in opscenterd.conf and restart opscenterd.
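For concreteness, this is what the workaround looks like in opscenterd.conf; the value 60 is just an example of something larger than the 10-second default:

[ui]
default_api_timeout = 60

Then restart opscenterd for the change to take effect.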
