DSE 4.8.8 upgrade throws Spark Cassandra Connector config variables error

I have recently upgraded my DSE version from 4.8.4 to 4.8.8 and I am seeing the following unexpected result.
I am using:
spark-cassandra-connector-java_2.10: 1.4.2
spark-streaming_2.10: 1.4.1
spark-core_2.10: 1.4.1
WARN 2016-06-23 21:28:50,132 org.apache.spark.scheduler.TaskSetManager: Lost task 3.0 in stage 3365.0 (TID 6731, 172.31.17.116): java.lang.ExceptionInInitializerError
at com.dynosense.dynospark.SparkApp$2.call(SparkApp.java:107)
at com.dynosense.dynospark.SparkApp$2.call(SparkApp.java:103)
at org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.apply(JavaPairRDD.scala:1027)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1508)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1099)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1099)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1792)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1792)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: com.datastax.spark.connector.util.ConfigCheck$ConnectorConfigurationException: Invalid Config Variables
Only known spark.cassandra.* variables are allowed when using the Spark Cassandra Connector.
spark.cassandra.connection.host.autoUpdate is not a valid Spark Cassandra Connector variable.
No likely matches found.
at com.datastax.spark.connector.util.ConfigCheck$.checkConfig(ConfigCheck.scala:46)
at com.datastax.spark.connector.cql.CassandraConnectorConf$.apply(CassandraConnectorConf.scala:171)
at com.datastax.spark.connector.cql.CassandraConnector$.apply(CassandraConnector.scala:192)
at com.datastax.spark.connector.cql.CassandraConnector.apply(CassandraConnector.scala)
at com.dynosense.dynospark.SparkApp.<clinit>(SparkApp.java:62)
... 15 more
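
For context: the connector validates every spark.cassandra.* key against its known-parameter list (the ConfigCheck.checkConfig call in the trace), and spark.cassandra.connection.host.autoUpdate is not one the OSS 1.4.2 connector knows, which suggests it is injected by the DSE 4.8.8 environment rather than by the application code. A minimal sketch of one possible workaround, assuming the property really is injected from outside, is to drop it from the SparkConf before the connector reads the configuration:

import org.apache.spark.SparkConf
import com.datastax.spark.connector.cql.CassandraConnector

// Sketch of a possible workaround, not a confirmed fix: strip the unrecognized
// property before the connector's ConfigCheck validates the configuration.
val conf = new SparkConf()
conf.remove("spark.cassandra.connection.host.autoUpdate") // key taken from the stack trace above
val connector = CassandraConnector(conf)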

Related

Cannot run livy in sparkR mode on zeppelin 0.8

We've installed Zeppelin, which is one of the HDP 3.1.5 services, and installed a suitable R version (from the RHEL 7 repos) compatible with the Zeppelin version (0.8).
At first, when scripting and running Spark code using the spark interpreter, especially the sparkR one, it ran well, but when we change the interpreter to livy we always get an error with the message "Fail to start interpreter".
When we check the YARN UI, the application is created and running, but when we check the Spark UI we get a stderr that says:
22/07/28 10:21:17 WARN SparkRInterpreter$: Fail to init Spark RBackend, using different method signature
java.lang.ClassCastException: scala.Tuple2 cannot be cast to java.lang.Integer
at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:101)
at org.apache.livy.repl.SparkRInterpreter$$anon$1.run(SparkRInterpreter.scala:88)
22/07/28 10:21:17 WARN Session: Fail to start interpreter sparkr
java.io.IOException: Cannot run program "R": error=2, No such file or directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
at org.apache.livy.repl.SparkRInterpreter$.apply(SparkRInterpreter.scala:143)
at org.apache.livy.repl.Session.liftedTree1$1(Session.scala:107)
at org.apache.livy.repl.Session.interpreter(Session.scala:98)
at org.apache.livy.repl.Session$$anonfun$execute$1.apply$mcV$sp(Session.scala:168)
at org.apache.livy.repl.Session$$anonfun$execute$1.apply(Session.scala:163)
at org.apache.livy.repl.Session$$anonfun$execute$1.apply(Session.scala:163)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: error=2, No such file or directory
at java.lang.UNIXProcess.forkAndExec(Native Method)
at java.lang.UNIXProcess.<init>(UNIXProcess.java:247)
at java.lang.ProcessImpl.start(ProcessImpl.java:134)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
... 11 more
Is there any step or configuration that we skipped?
-Thanks-
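
One thing worth checking, given the root cause in the trace (Cannot run program "R": error=2, No such file or directory): whether an R executable is actually on the PATH of the YARN node where the Livy session runs, since installing R only on the Zeppelin host is not enough when the interpreter runs inside a YARN container. A minimal sketch of that check, runnable from a Livy Scala session (assumes the node has `which`):

import scala.sys.process._

// Exit code 0 means an "R" executable is on the PATH of the node running this code.
val exitCode = Seq("which", "R").!
println(if (exitCode == 0) "R found on PATH" else "R not found on PATH")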

Elasticsearch SSL trustAnchor issue

I am trying to connect to Elasticsearch 5.6.6 from Spark 1.6.
The following are configured in my code. Is there anything else I am missing?
"xpack.security.user"
"xpack.security.transport.ssl.enabled"
"xpack.security.transport.ssl.keystore.path"
"xpack.security.transport.ssl.keystore.password"
"xpack.security.transport.ssl.truststore.path"
"xpack.security.transport.ssl.truststore.password"
I am getting the following issue:
2018-03-07 07:33:01,509 WARN [task-result-getter-2] scheduler.TaskSetManager(70): Lost task 2.0 in stage 6.0 (TID 150, ip-172-30-4-42.ec2.internal, executor 3): org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot detect ES version - typically this happens if the network/Elasticsearch cluster is not accessible or when targeting a WAN/Cloud instance without the proper setting 'es.nodes.wan.only'
at org.elasticsearch.hadoop.rest.InitializationUtils.discoverEsVersion(InitializationUtils.java:250)
at org.elasticsearch.hadoop.rest.RestService.createWriter(RestService.java:546)
at org.elasticsearch.spark.rdd.EsRDDWriter.write(EsRDDWriter.scala:58)
at org.elasticsearch.spark.rdd.EsSpark$$anonfun$doSaveToEs$1.apply(EsSpark.scala:102)
at org.elasticsearch.spark.rdd.EsSpark$$anonfun$doSaveToEs$1.apply(EsSpark.scala:102)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:242)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.elasticsearch.hadoop.rest.EsHadoopTransportException: javax.net.ssl.SSLException: java.lang.RuntimeException: Unexpected error: java.security.InvalidAlgorithmParameterException: the trustAnchors parameter must be non-empty
at org.elasticsearch.hadoop.rest.NetworkClient.execute(NetworkClient.java:124)
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:444)
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:424)
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:428)
at org.elasticsearch.hadoop.rest.RestClient.get(RestClient.java:154)
at org.elasticsearch.hadoop.rest.RestClient.remoteEsVersion(RestClient.java:609)
at org.elasticsearch.hadoop.rest.InitializationUtils.discoverEsVersion(InitializationUtils.java:243)
As we were executing the job in Spark, the file paths needed to be file:///<yourdirectory>/<filename>
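
To make that concrete, here is a hedged sketch of the client side. Note that the xpack.* keys listed in the question are Elasticsearch server settings; the es-hadoop connector reads its own es.net.ssl.* keys, with store locations in the file:/// form mentioned above. Host, paths, and passwords below are placeholders:

import org.apache.spark.SparkConf

// Sketch, not a verified configuration: es-hadoop client-side SSL settings.
val conf = new SparkConf()
  .set("es.nodes", "your-es-host")
  .set("es.port", "9200")
  .set("es.net.ssl", "true")
  .set("es.net.ssl.keystore.location", "file:///yourdirectory/keystore.jks")
  .set("es.net.ssl.keystore.pass", "********")
  .set("es.net.ssl.truststore.location", "file:///yourdirectory/truststore.jks")
  .set("es.net.ssl.truststore.pass", "********")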

spark application completes with SUCCESS status when an exception is thrown

I am running a Spark application on YARN; my goal is to do some ETL from JDBC to Elasticsearch.
However, when I check the log there are errors like the one below, due to a network problem:
17/12/01 00:35:19 WARN scheduler.TaskSetManager: Lost task 1317.0 in stage 0.0 (TID 1381, worker50.hadoop, executor 1): org.apache.spark.util.TaskCompletionListenerException: Connection error (check network and/or proxy settings)- all nodes failed; tried [[192.168.200.154:8201, 192.168.200.156:9200, 192.168.200.155:8201]]
at org.apache.spark.TaskContextImpl.invokeListeners(TaskContextImpl.scala:138)
at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:116)
at org.apache.spark.scheduler.Task.run(Task.scala:124)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
This means the connection failed and some data was lost in the process. The job's finalStatus should be FAILED, but Spark returned {"state":"FINISHED","finalStatus":"SUCCEEDED"}.
Why? My Spark version is 2.2.0.
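
For context on why this can happen: in yarn-cluster mode the finalStatus reflects whether the driver's main method completed normally, not whether individual tasks failed, so a task failure that Spark successfully retries, or an exception that is caught and swallowed in the driver, still ends in SUCCEEDED. A hedged sketch of the driver-side pattern that keeps failures visible (class and app names here are illustrative):

import org.apache.spark.sql.SparkSession

object JdbcToEsJob { // illustrative name
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("jdbc-to-es").getOrCreate()
    try {
      // ... ETL from JDBC to Elasticsearch ...
    } catch {
      case e: Exception =>
        // Catching and swallowing the exception here would let main() return
        // normally, and YARN would report finalStatus=SUCCEEDED despite the error.
        throw e
    } finally {
      spark.stop()
    }
  }
}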

Spark Streaming Error with snappy compression codec

I am getting the following exception when running my Spark streaming job.
The same job had been running fine for a long time, but after I added two new machines to the cluster it fails with the exception below.
16/02/22 19:23:01 ERROR Executor: Exception in task 2.0 in stage 4229.0 (TID 22594)
java.io.IOException: java.lang.reflect.InvocationTargetException
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1257)
at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:165)
at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:88)
at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:59)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:744)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:408)
at org.apache.spark.io.CompressionCodec$.createCodec(CompressionCodec.scala:68)
at org.apache.spark.io.CompressionCodec$.createCodec(CompressionCodec.scala:60)
at org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$setConf(TorrentBroadcast.scala:73)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:167)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1254)
... 11 more
Caused by: java.lang.IllegalArgumentException
at org.apache.spark.io.SnappyCompressionCodec.<init>(CompressionCodec.scala:152)
... 20 more
After changing the default compression codec from snappy to lzf, the errors are gone!
How can I fix this while keeping snappy as the codec? Is there any downside to using lzf, given that snappy is the default codec that ships with Spark?
The Spark version is 1.4.0.
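
For reference, the workaround described above is a one-line configuration change. Whether snappy itself can be kept likely depends on the snappy native library on the two new machines (a plausible cause of the IllegalArgumentException in SnappyCompressionCodec's constructor, though the trace does not confirm it). A sketch of the codec switch:

import org.apache.spark.SparkConf

// The workaround from the question: switch the IO compression codec from the
// default snappy to lzf. This avoids the error but does not repair snappy.
val conf = new SparkConf()
  .set("spark.io.compression.codec", "lzf")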

Zeppelin Pyspark on HDP 2.3 giving error

I am trying to configure Zeppelin to work with HDP 2.3 (Spark 1.3). I have successfully installed Zeppelin via Ambari and the Zeppelin service is running.
But when I try to run any %pyspark command I get the error below.
I read a few blogs, and it seems there is some issue with jars compiled on Java 6 versus Java 7 being shared between Python and Spark.
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 7, sandbox.hortonworks.com): org.apache.spark.SparkException:
Error from python worker:
/usr/bin/python: No module named pyspark
PYTHONPATH was:
/opt/incubator-zeppelin/interpreter/spark/zeppelin-spark-0.6.0-incubating-SNAPSHOT.jar
java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:163)
at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:86)
at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:105)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
(<class 'py4j.protocol.Py4JJavaError'>, Py4JJavaError(u'An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.\n', JavaObject id=o68), <traceback object at 0x2618bd8>)
Can you check in your zeppelin-env.sh if you have the below line?
export PYTHONPATH=${SPARK_HOME}/python
If missing, this can be added via Ambari under Zeppelin > Configs > Advanced zeppelin-env > zeppelin-env template
Although, if you installed using the latest version of the Ambari service for Zeppelin, it should have done this for you:
https://github.com/hortonworks-gallery/ambari-zeppelin-service/blob/master/configuration/zeppelin-env.xml#L63
I just set up a fresh HDP 2.3 install (2.3.0.0-2557) on CentOS 6.5 using Ambari 2.1 and installed Zeppelin using the Ambari Zeppelin service (with default configs). Pyspark works fine for me.
Based on your error, it sounds like PYTHONPATH is not getting set to the correct value:
PYTHONPATH was:
/opt/incubator-zeppelin/interpreter/spark/zeppelin-spark-0.6.0-incubating-SNAPSHOT.jar
In Zeppelin, can you enter the lines below in a cell, run them, and provide the output?
System.getenv().get("MASTER")
System.getenv().get("SPARK_YARN_JAR")
System.getenv().get("HADOOP_CONF_DIR")
System.getenv().get("JAVA_HOME")
System.getenv().get("SPARK_HOME")
System.getenv().get("PYSPARK_PYTHON")
System.getenv().get("PYTHONPATH")
System.getenv().get("ZEPPELIN_JAVA_OPTS")
Here is the output on my setup:
res41: String = yarn-client
res42: String = hdfs:///apps/zeppelin/zeppelin-spark-0.6.0-SNAPSHOT.jar
res43: String = /etc/hadoop/conf
res44: String = /usr/java/default
res45: String = /usr/hdp/current/spark-client/
res46: String = null
res47: String = /usr/hdp/current/spark-client//python:/usr/hdp/current/spark-client//python/lib/pyspark.zip:/usr/hdp/current/spark-client//python/lib/py4j-0.8.2.1-src.zip
res48: String = -Dhdp.version=2.3.0.0-2557 -Dspark.executor.memory=512m -Dspark.yarn.queue=default
