Why Zeppelin notebook is not able to connect to S3 - apache-spark

I have installed Zeppelin, on my aws EC2 machine to connect to my spark cluster.
Spark Version:
Standalone: spark-1.2.1-bin-hadoop1.tgz
I am able to connect to spark cluster but getting following error, when trying to access the file in S3 in my usecase.
Code:
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_KEY_ID")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey","YOUR_SEC_KEY")
val file = "s3n://<bucket>/<key>"
val data = sc.textFile(file)
data.count
file: String = s3n://<bucket>/<key>
data: org.apache.spark.rdd.RDD[String] = s3n://<bucket>/<key> MappedRDD[1] at textFile at <console>:21
ava.lang.NoSuchMethodError: org.jets3t.service.impl.rest.httpclient.RestS3Service.<init>(Lorg/jets3t/service/security/AWSCredentials;)V
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.initialize(Jets3tNativeFileSystemStore.java:55)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:85)
I have build the Zeppelin by following command:
mvn clean package -Pspark-1.2.1 -Dhadoop.version=1.0.4 -DskipTests
when I trying to build with hadoop profile "-Phadoop-1.0.4", it is giving warning that it doesn't exist.
I have also tried -Phadoop-1 mentioned in this spark website. but got the same error.
1.x to 2.1.x hadoop-1
Please let me know what I am missing here.

The following installation worked for me (spent also many days with the problem):
Spark 1.3.1 prebuild for Hadoop 2.3 setup on EC2-cluster
git clone https://github.com/apache/incubator-zeppelin.git (date: 25.07.2015)
installed zeppelin via the following command (belonging to instructions on https://github.com/apache/incubator-zeppelin):
mvn clean package -Pspark-1.3 -Dhadoop.version=2.3.0 -Phadoop-2.3 -DskipTests
Port change via "conf/zeppelin-site.xml" to 8082 (Spark uses Port 8080)
After this installation steps my notebook worked with S3 files:
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "xxx")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey","xxx")
val file = "s3n://<<bucket>>/<<file>>"
val data = sc.textFile(file)
data.first
I think that the S3 problem is not resolved completely in Zeppelin Version 0.5.0, so cloning the actual git-repo did it for me.
Important Information: The job only worked for me with zeppelin spark-interpreter setting master=local[*] (instead of using spark://master:7777)

For me it worked in one two steps-
1. creating sqlContext -
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
2. reading s3 files like this. -
val performanceFactor = sqlContext.
read. parquet("s3n://<accessKey>:<secretKey>#mybucket/myfile/")
where access key and secret key you need to supply.
in #2 I am using s3n protocol and access and secret keys in path itself.

Related

Unrecognized Hadoop major version number

I am trying to initialize an Apache Spark instance on Windows 10 to run a local test. My problem is during the initialization of the Spark instance, I get an error message. This code has worked for me a lot of times previously, so I am guessing something might have changed in the dependencies or the configuration. I am running using JDK version 1.8.0_192, Hadoop should be 3.0.0 and Spark version is 2.4.0. I am also using Maven as a build tool if that is relevant.
Here is the way I am setting up the session:
def withSparkSession(testMethod: SparkSession => Any) {
val uuid = UUID.randomUUID().toString
val pathRoot = s"C:/data/temp/spark-testcase/$uuid" // TODO: make this independent from Windows
val derbyRoot = s"C:/data/temp/spark-testcase/derby_system_root"
// TODO: clear me up -- Derby based metastore should be cleared up
System.setProperty("derby.system.home", s"${derbyRoot}")
val conf = new SparkConf()
.set("testcase.root.dir", s"${pathRoot}")
.set("spark.sql.warehouse.dir", s"${pathRoot}/test-hive-dwh")
.set("spark.sql.catalogImplementation", "hive")
.set("hive.exec.scratchdir", s"${pathRoot}/hive-scratchdir")
.set("hive.exec.dynamic.partition.mode", "nonstrict")
.setMaster("local[*]")
.setAppName("Spark Hive Test case")
val spark = SparkSession.builder()
.config(conf)
.enableHiveSupport()
.getOrCreate()
try {
testMethod(spark)
}
finally {
spark.sparkContext.stop()
println(s"Deleting test case root directory: $pathRoot")
deleteRecursively(nioPaths.get(pathRoot))
}
}
And this is the error message I receive:
An exception or error caused a run to abort.
java.lang.ExceptionInInitializerError
at org.apache.hadoop.hive.conf.HiveConf.<clinit>(HiveConf.java:105)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:238)
at org.apache.spark.sql.SparkSession$.hiveClassesArePresent(SparkSession.scala:1117)
at org.apache.spark.sql.SparkSession$Builder.enableHiveSupport(SparkSession.scala:866)
.
.
.
at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
at org.scalatest.Transformer.apply(Transformer.scala:22)
at org.scalatest.Transformer.apply(Transformer.scala:20)
at org.scalatest.FunSpecLike$$anon$1.apply(FunSpecLike.scala:454)
at org.scalatest.TestSuite$class.withFixture(TestSuite.scala:196)
at org.scalamock.scalatest.AbstractMockFactory$$anonfun$withFixture$1.apply(AbstractMockFactory.scala:35)
at org.scalamock.scalatest.AbstractMockFactory$$anonfun$withFixture$1.apply(AbstractMockFactory.scala:34)
at org.scalamock.MockFactoryBase$class.withExpectations(MockFactoryBase.scala:41)
at org.scalamock.scalatest.AbstractMockFactory$class.withFixture(AbstractMockFactory.scala:34)
at org.scalatest.FunSpecLike$class.invokeWithFixture$1(FunSpecLike.scala:451)
at org.scalatest.FunSpecLike$$anonfun$runTest$1.apply(FunSpecLike.scala:464)
at org.scalatest.FunSpecLike$$anonfun$runTest$1.apply(FunSpecLike.scala:464)
at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
at org.scalatest.FunSpecLike$class.runTest(FunSpecLike.scala:464)
at org.scalatest.FunSpec.runTest(FunSpec.scala:1630)
at org.scalatest.FunSpecLike$$anonfun$runTests$1.apply(FunSpecLike.scala:497)
at org.scalatest.FunSpecLike$$anonfun$runTests$1.apply(FunSpecLike.scala:497)
at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396)
at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:373)
at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:410)
at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:379)
at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461)
at org.scalatest.FunSpecLike$class.runTests(FunSpecLike.scala:497)
at org.scalatest.FunSpec.runTests(FunSpec.scala:1630)
at org.scalatest.Suite$class.run(Suite.scala:1147)
at org.scalatest.FunSpec.org$scalatest$FunSpecLike$$super$run(FunSpec.scala:1630)
at org.scalatest.FunSpecLike$$anonfun$run$1.apply(FunSpecLike.scala:501)
at org.scalatest.FunSpecLike$$anonfun$run$1.apply(FunSpecLike.scala:501)
at org.scalatest.SuperEngine.runImpl(Engine.scala:521)
at org.scalatest.FunSpecLike$class.run(FunSpecLike.scala:501)
at org.scalatest.FunSpec.run(FunSpec.scala:1630)
at org.scalatest.tools.SuiteRunner.run(SuiteRunner.scala:45)
at org.scalatest.tools.Runner$$anonfun$doRunRunRunDaDoRunRun$1.apply(Runner.scala:1346)
at org.scalatest.tools.Runner$$anonfun$doRunRunRunDaDoRunRun$1.apply(Runner.scala:1340)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.scalatest.tools.Runner$.doRunRunRunDaDoRunRun(Runner.scala:1340)
at org.scalatest.tools.Runner$$anonfun$runOptionallyWithPassFailReporter$2.apply(Runner.scala:1011)
at org.scalatest.tools.Runner$$anonfun$runOptionallyWithPassFailReporter$2.apply(Runner.scala:1010)
at org.scalatest.tools.Runner$.withClassLoaderAndDispatchReporter(Runner.scala:1506)
at org.scalatest.tools.Runner$.runOptionallyWithPassFailReporter(Runner.scala:1010)
at org.scalatest.tools.Runner$.run(Runner.scala:850)
at org.scalatest.tools.Runner.run(Runner.scala)
at org.jetbrains.plugins.scala.testingSupport.scalaTest.ScalaTestRunner.runScalaTest2or3(ScalaTestRunner.java:43)
at org.jetbrains.plugins.scala.testingSupport.scalaTest.ScalaTestRunner.main(ScalaTestRunner.java:26)
Caused by: java.lang.IllegalArgumentException: Unrecognized Hadoop major version number: 3.0.0-cdh6.3.4
at org.apache.hadoop.hive.shims.ShimLoader.getMajorVersion(ShimLoader.java:174)
at org.apache.hadoop.hive.shims.ShimLoader.loadShims(ShimLoader.java:139)
at org.apache.hadoop.hive.shims.ShimLoader.getHadoopShims(ShimLoader.java:100)
at org.apache.hadoop.hive.conf.HiveConf$ConfVars.<clinit>(HiveConf.java:368)
... 64 more
Process finished with exit code 2
So far I tried changing up the JDK versions to jdk1.8.0_181 and jdk11+28-x64. I also tried deleting the HADOOP_HOME environment variables from the system, but they didn't help. (Currently they are set to C:\Data\devtools\hadoop-win\3.0.0)
If you're on windows, you shouldn't be pulling CDH dependencies (3.0.0-cdh6.3.4), as Cloudera doesn't support Windows, last I checked.
But, you should be using Spark3, if you have Hadoop3+, and keep HADOOP_HOME, as that is definitely necessary.
Also, only Hadoop 3.3.4 has introduced Java 11 runtime support, so Java 8 is what you should stick with.
I have solved the problem. During the project development we also added HBase to the build, which pulled in a different Hadoop version from Cloudera as its dependency, so the versions got mixed up. Taking it out HBase dependency from the pom.xml solved the problem.

java.lang.ClassNotFoundException: Failed to find data source: hudi. Please find packages at http://spark.apache.org/third-party-projects.html

I am trying to read data from hudi but getting below error
Caused by: java.lang.ClassNotFoundException: Failed to find data source: hudi. Please find packages at http://spark.apache.org/third-party-projects.html
I am able to read the data from Hudi using my jupyter notebook using below commands
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder.config(
"spark.sql.catalogImplementation", "hive"
).config(
"spark.serializer", "org.apache.spark.serializer.KryoSerializer"
).enableHiveSupport().getOrCreate
import org.apache.hudi.DataSourceReadOptions
val hudiIncQueryDF = spark.read.format("hudi").load(
"path"
)
import org.apache.spark.sql.functions._
hudiIncQueryDF.filter(col("column_name")===lit("2022-06-01")).show(10,false)
This jupyter notebook was opened using a cluster which was created with one of the below properties
--properties spark:spark.jars="gs://rdl-stage-lib/hudi-spark3-bundle_2.12-0.10.0.jar" \
however, when I try to run the job using spark-submit with the same cluster, I get the error above.
I have also added spark.serializer=org.apache.spark.serializer.KryoSerializer in my job properties. Not sure what's the issue.
As your application is dependent on hudi jar, hudi itself has some dependencies, when you add the maven package to your session, spark will install hudi jar and its dependencies, but in your case, you provide only the hudi jar file from a GCS bucket.
You can try this property instead:
--properties spark:spark.jars.packages="org.apache.hudi:hudi-spark3.3-bundle_2.12:0.12.0" \
Or directly from you notebook:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder.config(
"spark.sql.catalogImplementation", "hive"
).config(
"spark.serializer", "org.apache.spark.serializer.KryoSerializer"
).config(
"spark.sql.catalog.spark_catalog", "org.apache.spark.sql.hudi.catalog.HoodieCatalog"
).config(
"spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension"
).config(
"spark.jars.package", "org.apache.hudi:hudi-spark3.3-bundle_2.12:0.12.0"
).enableHiveSupport().getOrCreate

How to add hbase-site.xml config file using spark-shell

I have the following simple code:
import org.apache.hadoop.hbase.client.ConnectionFactory
import org.apache.hadoop.hbase.HBaseConfiguration
val hbaseconfLog = HBaseConfiguration.create()
val connectionLog = ConnectionFactory.createConnection(hbaseconfLog)
Which I'm running on spark-shell, and I'm getting the following error:
14:23:42 WARN zookeeper.ClientCnxn: Session 0x0 for server null, unexpected
error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:30)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
Many these errors actually, and a few of these every now and then:
14:23:46 WARN client.ZooKeeperRegistry: Can't retrieve clusterId from
Zookeeper org.apache.zookeeper.KeeperException$ConnectionLossException:
KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
Through Cloudera's VM I'm able to solve this by simply restarting the hbase-master, regionserver and thrift, but here in my company I'm not allowed to do it, I also solved it once by copying the file hbase-site.xml to spark conf directory but I can't to it either, is there a way to set the path for this specific file in the spark-shell parameters?
1) make sure that your zookeeper is running
2) need to copy hbase-site.xml to /etc/spark/conf folder just like we copy hive-site.xml to /etc/spark/conf to access the Hive tables.
3) export SPARK_CLASSPATH=/a/b/c/hbase-site.xml;/d/e/f/hive-site.xml
just as described in hortonworks forum.. like this
or
open spark-shell with out adding hbase-site.xml
3 commands to execute in spark-shell
val conf = HBaseConfiguration.create()
conf.addResource(new Path("/home/spark/development/hbase/conf/hbase-site.xml"))
conf.set(TableInputFormat.INPUT_TABLE, table_name)

Connect to Spark running on VM

I have a Spark enviroment running on Ubuntu 16.2 over VirtualBox. Its configured to run locally and when I start Spark with
./start-all
I can access to it on VM via web-ui using the URL: http://localhost:8080
From the host machine (windows), I can access it too using the VM IP: http://192.168.x.x:8080.
The problem appears when I try to create a context from my host machine. I have a project in eclipse that uses maven, and I try to run the following code:
ConfigLoader.masterEndpoint = "spark://192.168.1.132:7077"
val conf = new SparkConf().setMaster(ConfigLoader.masterEndpoint).setAppName("SimpleApp")
val sc = new SparkContext(conf)
I got this error:
16/12/21 00:52:05 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://192.168.1.132:8080...
16/12/21 00:52:06 WARN StandaloneAppClient$ClientEndpoint: Failed to connect to master 192.168.1.132:8080
org.apache.spark.SparkException: Exception thrown in awaitResult
at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
I've tried changing the URL for:
ConfigLoader.masterEndpoint = "spark://192.168.1.132:7077"
Unsuccessfully.
Also, if I try to access directly to the master URL via web (http://localhost:7077 in VM), I don't get anything. I don't know if its normal.
What am I missing?
In your VM go to spark-2.0.2-bin-hadoop2.7/conf directory and create spark-env.sh file using below command.
cp spark-env.sh.template spark-env.sh
Open spark-env.sh file in vi editor and add below line.
SPARK_MASTER_HOST=192.168.1.132
Stop and start Spark using stop-all.sh and start-all.sh. Now in your program you can set the master like below.
val spark = SparkSession.builder()
.appName("SparkSample")
.master("spark://192.168.1.132:7077")
.getOrCreate()

%spark.r interpreter is not working in Zeppelin 0.6.1

I am having Spark 1.6.2 cluster with Hadoop YARN, Oozie. I have installed Zeppelin 0.6.1(Binary package with all interpreters: zeppelin-0.6.1-bin-all.tgz). When I am trying to use SparkR script with %spark.r interpreter,
%spark.r
# Creating SparkConext and connecting to Cloudant DB
sc1 <- sparkR.init(sparkEnv = list("cloudant.host"="host_name","cloudant.username"="user_name","cloudant.password"="password", "jsonstore.rdd.schemaSampleSize"="-1"))
# Database to be connected to extract the data
database <- "sensordata"
# Creating Spark SQL Context
sqlContext <- sparkRSQL.init(sc)
# Creating DataFrame for the "sensordata" Cloudant DB
sensorDataDF <- read.df(sqlContext, database, header='true', source = "com.cloudant.spark",inferSchema='true')
# Get basic information about the DataFrame(sensorDataDF)
printSchema(sensorDataDF)
I am getting the following error(log):
ERROR [2016-08-25 03:28:37,336] (
{Thread-77}
JobProgressPoller.java[run]:54) - Can not get or update progress
org.apache.zeppelin.interpreter.InterpreterException: org.apache.thrift.transport.TTransportException
at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.getProgress(RemoteInterpreter.java:373)
at org.apache.zeppelin.interpreter.LazyOpenInterpreter.getProgress(LazyOpenInterpreter.java:111)
at org.apache.zeppelin.notebook.Paragraph.progress(Paragraph.java:237)
at org.apache.zeppelin.scheduler.JobProgressPoller.run(JobProgressPoller.java:51)
Caused by: org.apache.thrift.transport.TTransportException
at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.recv_getProgress(RemoteInterpreterService.java:296)
at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.getProgress(RemoteInterpreterService.java:281)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.getProgress(RemoteInterpreter.java:370)
... 3 more
Help would be much appreciated.
I faced the same similar issue after migrating to 0.6.1. The issue is Zeppelin is built with scala 2.11 and Apache Spark 1.6.2 is built with scala 2.10.
You need to build spark 1.6.x with scala 2.11 or migrate your spark code to 2.0.0
Setting local[2] in the interpreter section fixed my issues. This was originally suggested by vgunnu
"Try setting spark master as local[2], if that works, you might be missing few environmental variables in env file – vgunnu Aug 25 at 4:37"

Resources