%spark.r interpreter is not working in Zeppelin 0.6.1 - apache-spark

I am having Spark 1.6.2 cluster with Hadoop YARN, Oozie. I have installed Zeppelin 0.6.1(Binary package with all interpreters: zeppelin-0.6.1-bin-all.tgz). When I am trying to use SparkR script with %spark.r interpreter,
%spark.r
# Creating SparkConext and connecting to Cloudant DB
sc1 <- sparkR.init(sparkEnv = list("cloudant.host"="host_name","cloudant.username"="user_name","cloudant.password"="password", "jsonstore.rdd.schemaSampleSize"="-1"))
# Database to be connected to extract the data
database <- "sensordata"
# Creating Spark SQL Context
sqlContext <- sparkRSQL.init(sc)
# Creating DataFrame for the "sensordata" Cloudant DB
sensorDataDF <- read.df(sqlContext, database, header='true', source = "com.cloudant.spark",inferSchema='true')
# Get basic information about the DataFrame(sensorDataDF)
printSchema(sensorDataDF)
I am getting the following error(log):
ERROR [2016-08-25 03:28:37,336] (
{Thread-77}
JobProgressPoller.java[run]:54) - Can not get or update progress
org.apache.zeppelin.interpreter.InterpreterException: org.apache.thrift.transport.TTransportException
at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.getProgress(RemoteInterpreter.java:373)
at org.apache.zeppelin.interpreter.LazyOpenInterpreter.getProgress(LazyOpenInterpreter.java:111)
at org.apache.zeppelin.notebook.Paragraph.progress(Paragraph.java:237)
at org.apache.zeppelin.scheduler.JobProgressPoller.run(JobProgressPoller.java:51)
Caused by: org.apache.thrift.transport.TTransportException
at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.recv_getProgress(RemoteInterpreterService.java:296)
at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.getProgress(RemoteInterpreterService.java:281)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.getProgress(RemoteInterpreter.java:370)
... 3 more
Help would be much appreciated.

I faced the same similar issue after migrating to 0.6.1. The issue is Zeppelin is built with scala 2.11 and Apache Spark 1.6.2 is built with scala 2.10.
You need to build spark 1.6.x with scala 2.11 or migrate your spark code to 2.0.0

Setting local[2] in the interpreter section fixed my issues. This was originally suggested by vgunnu
"Try setting spark master as local[2], if that works, you might be missing few environmental variables in env file – vgunnu Aug 25 at 4:37"

Related

Unrecognized Hadoop major version number

I am trying to initialize an Apache Spark instance on Windows 10 to run a local test. My problem is during the initialization of the Spark instance, I get an error message. This code has worked for me a lot of times previously, so I am guessing something might have changed in the dependencies or the configuration. I am running using JDK version 1.8.0_192, Hadoop should be 3.0.0 and Spark version is 2.4.0. I am also using Maven as a build tool if that is relevant.
Here is the way I am setting up the session:
def withSparkSession(testMethod: SparkSession => Any) {
val uuid = UUID.randomUUID().toString
val pathRoot = s"C:/data/temp/spark-testcase/$uuid" // TODO: make this independent from Windows
val derbyRoot = s"C:/data/temp/spark-testcase/derby_system_root"
// TODO: clear me up -- Derby based metastore should be cleared up
System.setProperty("derby.system.home", s"${derbyRoot}")
val conf = new SparkConf()
.set("testcase.root.dir", s"${pathRoot}")
.set("spark.sql.warehouse.dir", s"${pathRoot}/test-hive-dwh")
.set("spark.sql.catalogImplementation", "hive")
.set("hive.exec.scratchdir", s"${pathRoot}/hive-scratchdir")
.set("hive.exec.dynamic.partition.mode", "nonstrict")
.setMaster("local[*]")
.setAppName("Spark Hive Test case")
val spark = SparkSession.builder()
.config(conf)
.enableHiveSupport()
.getOrCreate()
try {
testMethod(spark)
}
finally {
spark.sparkContext.stop()
println(s"Deleting test case root directory: $pathRoot")
deleteRecursively(nioPaths.get(pathRoot))
}
}
And this is the error message I receive:
An exception or error caused a run to abort.
java.lang.ExceptionInInitializerError
at org.apache.hadoop.hive.conf.HiveConf.<clinit>(HiveConf.java:105)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:238)
at org.apache.spark.sql.SparkSession$.hiveClassesArePresent(SparkSession.scala:1117)
at org.apache.spark.sql.SparkSession$Builder.enableHiveSupport(SparkSession.scala:866)
.
.
.
at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
at org.scalatest.Transformer.apply(Transformer.scala:22)
at org.scalatest.Transformer.apply(Transformer.scala:20)
at org.scalatest.FunSpecLike$$anon$1.apply(FunSpecLike.scala:454)
at org.scalatest.TestSuite$class.withFixture(TestSuite.scala:196)
at org.scalamock.scalatest.AbstractMockFactory$$anonfun$withFixture$1.apply(AbstractMockFactory.scala:35)
at org.scalamock.scalatest.AbstractMockFactory$$anonfun$withFixture$1.apply(AbstractMockFactory.scala:34)
at org.scalamock.MockFactoryBase$class.withExpectations(MockFactoryBase.scala:41)
at org.scalamock.scalatest.AbstractMockFactory$class.withFixture(AbstractMockFactory.scala:34)
at org.scalatest.FunSpecLike$class.invokeWithFixture$1(FunSpecLike.scala:451)
at org.scalatest.FunSpecLike$$anonfun$runTest$1.apply(FunSpecLike.scala:464)
at org.scalatest.FunSpecLike$$anonfun$runTest$1.apply(FunSpecLike.scala:464)
at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
at org.scalatest.FunSpecLike$class.runTest(FunSpecLike.scala:464)
at org.scalatest.FunSpec.runTest(FunSpec.scala:1630)
at org.scalatest.FunSpecLike$$anonfun$runTests$1.apply(FunSpecLike.scala:497)
at org.scalatest.FunSpecLike$$anonfun$runTests$1.apply(FunSpecLike.scala:497)
at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396)
at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:373)
at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:410)
at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:379)
at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461)
at org.scalatest.FunSpecLike$class.runTests(FunSpecLike.scala:497)
at org.scalatest.FunSpec.runTests(FunSpec.scala:1630)
at org.scalatest.Suite$class.run(Suite.scala:1147)
at org.scalatest.FunSpec.org$scalatest$FunSpecLike$$super$run(FunSpec.scala:1630)
at org.scalatest.FunSpecLike$$anonfun$run$1.apply(FunSpecLike.scala:501)
at org.scalatest.FunSpecLike$$anonfun$run$1.apply(FunSpecLike.scala:501)
at org.scalatest.SuperEngine.runImpl(Engine.scala:521)
at org.scalatest.FunSpecLike$class.run(FunSpecLike.scala:501)
at org.scalatest.FunSpec.run(FunSpec.scala:1630)
at org.scalatest.tools.SuiteRunner.run(SuiteRunner.scala:45)
at org.scalatest.tools.Runner$$anonfun$doRunRunRunDaDoRunRun$1.apply(Runner.scala:1346)
at org.scalatest.tools.Runner$$anonfun$doRunRunRunDaDoRunRun$1.apply(Runner.scala:1340)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.scalatest.tools.Runner$.doRunRunRunDaDoRunRun(Runner.scala:1340)
at org.scalatest.tools.Runner$$anonfun$runOptionallyWithPassFailReporter$2.apply(Runner.scala:1011)
at org.scalatest.tools.Runner$$anonfun$runOptionallyWithPassFailReporter$2.apply(Runner.scala:1010)
at org.scalatest.tools.Runner$.withClassLoaderAndDispatchReporter(Runner.scala:1506)
at org.scalatest.tools.Runner$.runOptionallyWithPassFailReporter(Runner.scala:1010)
at org.scalatest.tools.Runner$.run(Runner.scala:850)
at org.scalatest.tools.Runner.run(Runner.scala)
at org.jetbrains.plugins.scala.testingSupport.scalaTest.ScalaTestRunner.runScalaTest2or3(ScalaTestRunner.java:43)
at org.jetbrains.plugins.scala.testingSupport.scalaTest.ScalaTestRunner.main(ScalaTestRunner.java:26)
Caused by: java.lang.IllegalArgumentException: Unrecognized Hadoop major version number: 3.0.0-cdh6.3.4
at org.apache.hadoop.hive.shims.ShimLoader.getMajorVersion(ShimLoader.java:174)
at org.apache.hadoop.hive.shims.ShimLoader.loadShims(ShimLoader.java:139)
at org.apache.hadoop.hive.shims.ShimLoader.getHadoopShims(ShimLoader.java:100)
at org.apache.hadoop.hive.conf.HiveConf$ConfVars.<clinit>(HiveConf.java:368)
... 64 more
Process finished with exit code 2
So far I tried changing up the JDK versions to jdk1.8.0_181 and jdk11+28-x64. I also tried deleting the HADOOP_HOME environment variables from the system, but they didn't help. (Currently they are set to C:\Data\devtools\hadoop-win\3.0.0)
If you're on windows, you shouldn't be pulling CDH dependencies (3.0.0-cdh6.3.4), as Cloudera doesn't support Windows, last I checked.
But, you should be using Spark3, if you have Hadoop3+, and keep HADOOP_HOME, as that is definitely necessary.
Also, only Hadoop 3.3.4 has introduced Java 11 runtime support, so Java 8 is what you should stick with.
I have solved the problem. During the project development we also added HBase to the build, which pulled in a different Hadoop version from Cloudera as its dependency, so the versions got mixed up. Taking it out HBase dependency from the pom.xml solved the problem.

java.lang.NoClassDefFoundError: org/apache/spark/sql/sources/v2/ReadSupport - error while reading data from datalake to spark

I have created a new databricks cluster with the following libraries installed.
Cluster version - 7.6 (includes Apache Spark 3.0.1, Scala 2.12)
spark_cdm_connector_0_19_1.jar ,
neo4j_connector_apache_spark_2_12_4_0_1_for_spark_3.jar ,
com.microsoft.azure:spark-cdm-connector:0.19.1 - maven file ,
neo4j-contrib:neo4j-connector-apache-spark_2.12:4.0.1_for_spark_3 - maven file ,
But while loaing CDM data from datalake to databricks I'm facing the subjected error
java.lang.NoClassDefFoundError: org/apache/spark/sql/sources/v2/ReadSupport
Should I need to change the configuration of databricks cluster or update the configuration file?
Thanks

Zeppelin Null Pointer Exception

I wrote this simple code in my zeppelin notebook
import org.apache.spark.sql.SQLContext
val sqlConext = new SQLContext(sc)
val df = sqlContext.read.format("csv").option("header", "true").load("hdfs:///user/admin/foo/2018.csv")
df.printSchema()
Earlier it was not able to find spark-csv. so I added it as a dependency to spark1 and spark2 interpreters. But when I run this code I get an error
java.lang.NullPointerException
at org.apache.zeppelin.spark.Utils.invokeMethod(Utils.java:38)
at org.apache.zeppelin.spark.Utils.invokeMethod(Utils.java:33)
at org.apache.zeppelin.spark.SparkInterpreter.open(SparkInterpreter.java:614)
at org.apache.zeppelin.interpreter.LazyOpenInterpreter.open(LazyOpenInterpreter.java:69)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:493)
at org.apache.zeppelin.scheduler.Job.run(Job.java:175)
at org.apache.zeppelin.scheduler.FIFOScheduler$1.run(FIFOScheduler.java:139)
This file has just 300 rows. So I don't think it causes any memory issues. I have a 4 node cluster, so how can I determine where is the log file where a more detailed error may reside?
OK. I resolved it. It seems Zeppelin uses Scala 2.10 I had added dependency of Scala csv for version 2.11 that caused the null pointer error.
I went and changed my dependency to 2.10 and restarted the interpreter and now it works fine.

When Running Spark job in hadoop cluster i am getting java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/HBaseConfiguration

When i tried to run my scala code which connects hbase database it works perfectly in my local IDE . But when i run the same in hadoop cluster i am getting "Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/HBaseConfiguration" error .
Please help me in this
Add all the HBase library jars to HADOOP_CLASSPATH -
export HBASE_HOME="YOUR_HBASE_HOME_PATH"
export HADOOP_CLASSPATH="$HADOOP_CLASSPATH:$HBASE_HOME/lib/*"
You can append any external jar needed to HADOOP_CLASSPATH, so that you don't need to explicitly set it in spark-submit command. All dependent jars will be loaded and provided to your Spark application.

Why Zeppelin notebook is not able to connect to S3

I have installed Zeppelin, on my aws EC2 machine to connect to my spark cluster.
Spark Version:
Standalone: spark-1.2.1-bin-hadoop1.tgz
I am able to connect to spark cluster but getting following error, when trying to access the file in S3 in my usecase.
Code:
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_KEY_ID")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey","YOUR_SEC_KEY")
val file = "s3n://<bucket>/<key>"
val data = sc.textFile(file)
data.count
file: String = s3n://<bucket>/<key>
data: org.apache.spark.rdd.RDD[String] = s3n://<bucket>/<key> MappedRDD[1] at textFile at <console>:21
ava.lang.NoSuchMethodError: org.jets3t.service.impl.rest.httpclient.RestS3Service.<init>(Lorg/jets3t/service/security/AWSCredentials;)V
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.initialize(Jets3tNativeFileSystemStore.java:55)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:85)
I have build the Zeppelin by following command:
mvn clean package -Pspark-1.2.1 -Dhadoop.version=1.0.4 -DskipTests
when I trying to build with hadoop profile "-Phadoop-1.0.4", it is giving warning that it doesn't exist.
I have also tried -Phadoop-1 mentioned in this spark website. but got the same error.
1.x to 2.1.x hadoop-1
Please let me know what I am missing here.
The following installation worked for me (spent also many days with the problem):
Spark 1.3.1 prebuild for Hadoop 2.3 setup on EC2-cluster
git clone https://github.com/apache/incubator-zeppelin.git (date: 25.07.2015)
installed zeppelin via the following command (belonging to instructions on https://github.com/apache/incubator-zeppelin):
mvn clean package -Pspark-1.3 -Dhadoop.version=2.3.0 -Phadoop-2.3 -DskipTests
Port change via "conf/zeppelin-site.xml" to 8082 (Spark uses Port 8080)
After this installation steps my notebook worked with S3 files:
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "xxx")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey","xxx")
val file = "s3n://<<bucket>>/<<file>>"
val data = sc.textFile(file)
data.first
I think that the S3 problem is not resolved completely in Zeppelin Version 0.5.0, so cloning the actual git-repo did it for me.
Important Information: The job only worked for me with zeppelin spark-interpreter setting master=local[*] (instead of using spark://master:7777)
For me it worked in one two steps-
1. creating sqlContext -
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
2. reading s3 files like this. -
val performanceFactor = sqlContext.
read. parquet("s3n://<accessKey>:<secretKey>#mybucket/myfile/")
where access key and secret key you need to supply.
in #2 I am using s3n protocol and access and secret keys in path itself.

Resources