Spark JobServer NullPointerException - apache-spark

I'm trying to start a Spark JobServer; these are the steps I'm following:
Configure local.sh based on the template.
Run ./bin/server_deploy.sh; it finishes without any error.
Configure local.conf.
Run ./bin/server_start.sh on the deploy server.
But when I do the last step I get the following error:
Error: Exception thrown by the agent : java.lang.NullPointerException
Note: I'm using Spark 1.4.1 with version 0.5.2 of the jobserver (https://github.com/spark-jobserver/spark-jobserver/tree/v0.5.2).
Any idea how I can fix this (or at least debug it)?
Thanks

The error log does not provide much information.
I encountered the same error. In my case, another instance of the JobServer was still running (and somehow ./bin/server_stop.sh did not catch it). It worked after I manually killed the other process.
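If you want to check for a leftover process, something like the sketch below works (the process name is an assumption; it depends on how the server was packaged and started):
ps aux | grep -i [s]park-job-server    # look for a leftover JobServer JVM
kill <pid>                             # replace <pid> with the process id found above
# kill -9 <pid>                        # only if it refuses to stop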
Hint: see "Error: Exception thrown by the agent : java.lang.NullPointerException when starting Java application".

Related

Unable to save model in Apache Spark -- Py4JJavaError

We're getting an error while trying to save a model with model.save('DT'):
Py4JJavaError: An error occurred while calling o822.save.
: org.apache.spark.SparkException: Job aborted.
Complete Error Stack --> http://dpaste.com/16Y07B9
Anything we missed here? It creates the folder but doesn't write anything into it.
OS: Windows 10
TIA
It turns out I was using Spark 3.0.0-preview, and that was the cause of the trouble. Switching to 2.4.5 resolved it.
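For anyone hitting the same thing, a quick way to check and pin the version if Spark was installed through pip (sketch only; adjust if you use a standalone distribution):
pip show pyspark                 # check which version is currently installed
pip install "pyspark==2.4.5"     # pin to the version that worked
spark-submit --version           # confirm which runtime spark-submit picks up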

Spark 2.4: Got an error when resolving hostNames, falling back to /default-rack

When running an application in client mode, the driver logs print the INFO message below. Any idea how to resolve this? Are there any Spark configs that need to be updated, or that are missing?
[INFO ][dispatcher-event-loop-29][SparkRackResolver:54] Got an error when resolving hostNames. Falling back to /default-rack for all
The job runs fine, and this message does not appear in the executor logs.
Check this bug:
https://issues.apache.org/jira/browse/SPARK-28005
If you want to suppress this message in the logs, you can try adding this to your log4j.properties:
log4j.logger.org.apache.spark.deploy.yarn.SparkRackResolver=ERROR
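If the driver isn't picking up your log4j.properties, one way to point it at a custom file is via spark-submit (sketch only; the paths and application name are placeholders):
spark-submit \
  --master yarn \
  --driver-java-options "-Dlog4j.configuration=file:/path/to/log4j.properties" \
  my_app.py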
This can happen when using spark-submit with master yarn and the driver running locally (i.e. not using --deploy-mode cluster), while the path to the topology.py script in your core-site.xml is incorrect.
The path to core-site.xml can be set via the HADOOP_CONF_DIR (or YARN_CONF_DIR) environment variable.
Check the path given as the value of the net.topology.script.file.name parameter in core-site.xml; see the sketch after the log excerpt below.
If the path is incorrect, deploying the driver locally will fail to execute the script, producing the following warning:
23/01/15 18:39:43 WARN ScriptBasedMapping: Exception running /home/alexander/xxx/.conf/topology.py 10.15.21.199
java.io.IOException: Cannot run program "/etc/hadoop/conf.cloudera.yarn/topology.py" (in directory "/home/john"): error=2, No such file or directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
...
23/01/15 18:39:43 INFO SparkRackResolver: Got an error when resolving hostNames. Falling back to /default-rack for all
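A quick way to verify the configured script path (sketch; HADOOP_CONF_DIR and the script location below are illustrative):
grep -A1 "net.topology.script.file.name" "$HADOOP_CONF_DIR/core-site.xml"
# expect something like <value>/etc/hadoop/conf.cloudera.yarn/topology.py</value>
ls -l /etc/hadoop/conf.cloudera.yarn/topology.py   # must exist (and be executable) on the driver host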

CDIIntegrationService errors found when upgrading WebLogic from 12.1.2 to 12.1.3

I upgraded the WebLogic version on a Linux server by changing the wl_home path in setDomainEnv.sh from 12.1.2 to 12.1.3 and restarting. On restart it gives the errors below.
I'd appreciate any ideas about this.
java.lang.IllegalAccessError: tried to access method com.bea.logging.LogBufferHandler.bufferLogObject(Ljava/lang/Object;)V from class weblogic.logging.log4j.WLLog4jMemoryBufferAppender
java.lang.IllegalStateException: Unable to perform operation: post construct on weblogic.diagnostics.lifecycle.LoggingServerService
java.lang.IllegalArgumentException: While attempting to resolve the dependencies of weblogic.diagnostics.lifecycle.DiagnosticFoundationService errors were found
java.lang.IllegalStateException: Unable to perform operation: resolve on weblogic.diagnostics.lifecycle.DiagnosticFoundationService
java.lang.IllegalArgumentException: While attempting to resolve the dependencies of com.oracle.injection.integration.CDIIntegrationService errors were found
java.lang.IllegalStateException: Unable to perform operation: resolve on com.oracle.injection.integration.CDIIntegrationService
Use the wllog4j.jar that is part of the WLS 12.1.3 build, i.e. the one in the WLS_HOME/server/lib directory.
If your domain-home/lib folder holds the older wllog4j.jar that was part of 12.1.2, you will face this issue.
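A rough sketch of the check and fix (WLS_HOME and DOMAIN_HOME stand in for your actual install and domain paths):
ls -l $DOMAIN_HOME/lib/wllog4j.jar       # is a stale 12.1.2 copy still present?
mv $DOMAIN_HOME/lib/wllog4j.jar /tmp/    # move the old jar out of the way
ls -l $WLS_HOME/server/lib/wllog4j.jar   # the 12.1.3 jar that should be picked up instead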

Stopping a job in spark

I'm using Spark version 1.3. I have a job that's taking forever to finish.
To fix it, I made some optimizations to the code, and started the job again. Unfortunately, I launched the optimized code before stopping the earlier version, and now I cannot stop the earlier job.
Here are the things I've tried to kill this app:
Through the Web UI
result: the Spark UI has no "kill" option for apps (I'm assuming "spark.ui.killEnabled" has not been enabled; I'm not the owner of this cluster).
Through the command line: spark-class org.apache.spark.deploy.Client kill mymasterURL app-XXX
result: I get this message:
Driver app-XXX has already finished or does not exist
But I see in the web UI that it is still running, and the resources are still occupied.
Through the command line via spark-submit: spark-submit --master mymasterURL --deploy-mode cluster --kill app-XXX
result: I get this error:
Error: Killing submissions is only supported in standalone mode!
I tried to retrieve the Spark context to stop it (via SparkContext.stop() or cancelAllJobs()), but was unsuccessful, as ".getOrCreate" is not available in 1.3. I have not been able to retrieve the Spark context of the initial app.
I'd appreciate any ideas!
Edit: I've also tried killing the app through yarn by executing: yarn application -kill app-XXX
result: I got this error:
Exception in thread "main" java.lang.IllegalArgumentException:
Invalid ApplicationId prefix: app-XX. The valid ApplicationId should
start with prefix application
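For reference, yarn only accepts its own application ids (application_<timestamp>_<sequence>), not Spark's app-XXX ids, so the kill would have to look something like this (the id below is a placeholder):
yarn application -list -appStates RUNNING        # find the YARN application id
yarn application -kill application_1234567890123_0042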

Cassandra 2.1.8: Node refuses to start with NPE in removeUnfinishedCompactionLeftovers

I am trying to upgrade a Cassandra 2.1.0 cluster to 2.1.8 (latest release).
When I start the first node with the 2.1.8 runtime, I get an error and the node refuses to start.
This is the error's stack trace:
org.apache.cassandra.io.FSReadError: java.lang.NullPointerException
at org.apache.cassandra.db.ColumnFamilyStore.removeUnfinishedCompactionLeftovers(ColumnFamilyStore.java:642) ~[apache-cassandra-2.1.8.jar:2.1.8]
at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:302) [apache-cassandra-2.1.8.jar:2.1.8]
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:524) [apache-cassandra-2.1.8.jar:2.1.8]
at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:613) [apache-cassandra-2.1.8.jar:2.1.8]
Caused by: java.lang.NullPointerException: null
at org.apache.cassandra.db.ColumnFamilyStore.removeUnfinishedCompactionLeftovers(ColumnFamilyStore.java:634) ~[apache-cassandra-2.1.8.jar:2.1.8]
... 3 common frames omitted
FSReadError in Failed to remove unfinished compaction leftovers (file: /home/nudgeca2/datas/data/main/segment-97b5ba00571011e49a928bffe429b6b5/main-segment-ka-15432-Statistics.db). See log for details.
at org.apache.cassandra.db.ColumnFamilyStore.removeUnfinishedCompactionLeftovers(ColumnFamilyStore.java:642)
at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:302)
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:524)
at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:613)
Caused by: java.lang.NullPointerException
at org.apache.cassandra.db.ColumnFamilyStore.removeUnfinishedCompactionLeftovers(ColumnFamilyStore.java:634)
... 3 more
Exception encountered during startup: java.lang.NullPointerException
The cluster has 7 nodes and runs on AWS Linux EC2 instances.
The node I'm trying to upgrade was stopped after a nodetool drain.
I then tried to go back to the 2.1.0 runtime, but I now get a similar error.
I also tried to stop and start another node, and everything was OK: that node restarted without any problem.
I tried to touch the missing file (since it is supposed to be removed, I thought it might not need any particular content). Two other files produced the same error, so I touched them as well. In the end the node fails further on, while trying to read these files.
Does anyone have any idea what I should do?
Thank you for any help.
It might be worth opening a Jira for that issue, so if nothing else, they can catch the NPE and provide a better error message.
It looks like it's trying to open:
file: /home/nudgeca2/datas/data/main/segment-97b5ba00571011e49a928bffe429b6b5/main-segment-ka-15432-Statistics.db
It's possible that it's trying to read that file because it finds the associated data file (/home/nudgeca2/datas/data/main/segment-97b5ba00571011e49a928bffe429b6b5/main-segment-ka-15432-Data.db). Does that data file exist? I'd be tempted to move it out of the way and see if it starts properly.
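Something along these lines, as a sketch (the backup directory is a placeholder; the data path is taken from the error above):
DATA_DIR=/home/nudgeca2/datas/data/main/segment-97b5ba00571011e49a928bffe429b6b5
mkdir -p ~/sstable-backup
mv "$DATA_DIR/main-segment-ka-15432-Data.db" ~/sstable-backup/
# you may prefer to move the whole main-segment-ka-15432-* set together, so the SSTable's components stay consistent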
