How to load hiveContext in Zeppelin? - apache-spark

I am new to Zeppelin notebooks. I noticed that, unlike spark-shell, Zeppelin does not automatically create a hiveContext when I start the notebook.
When I tried to create the hiveContext manually in Zeppelin like this:
import org.apache.spark.sql.hive._
import org.apache.spark.sql.hive.HiveContext
val hiveContext = new HiveContext(sc)
I get this error:
java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
at org.apache.spark.sql.hive.client.ClientWrapper.<init>(ClientWrapper.scala:204)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
at java.lang.reflect.Constructor.newInstance(Unknown Source)
at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:249)
at org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:327)
at org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:237)
at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:441)
at org.apache.spark.sql.hive.HiveContext.defaultOverrides(HiveContext.scala:226)
at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:229)
at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:101)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:33)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:38)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:40)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:42)
I think the error means that the previous metastore_db is blocking the creation of a new one.
I am using Spark 1.6.1.
Any help would be appreciated.

Check the permissions on your metastore_db directory.
Then test in REPL mode (spark-shell).
Once that works, move on to Zeppelin.
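The permission check suggested above can be scripted. This is only a minimal sketch, assuming the metastore is the default embedded Derby database in a metastore_db directory under the working directory of the Spark process. As far as I know, Derby keeps db.lck and dbex.lck files in that directory while a process has the database open. The function name and the keys of the returned dictionary are made up for illustration.

```python
import os

# Lock files that embedded Derby keeps while a process has the database open
# (assumption: default embedded Derby metastore).
DERBY_LOCK_FILES = ("db.lck", "dbex.lck")


def inspect_metastore(path="metastore_db"):
    """Report whether the metastore directory exists, is writable, and looks locked."""
    exists = os.path.isdir(path)
    return {
        "exists": exists,
        # Creating files in a directory needs both write and execute permission.
        "writable": exists and os.access(path, os.W_OK | os.X_OK),
        "lock_files": [name for name in DERBY_LOCK_FILES
                       if os.path.isfile(os.path.join(path, name))],
    }
```

Calling inspect_metastore() from the directory where Zeppelin starts its Spark interpreter shows whether a leftover lock or a permission problem is a plausible cause. Lock files that remain when no Spark process is running would point to a stale lock.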

Can you please try to connect to Hive from the shell? I just want you to check whether Hive is installed properly, because I had a similar issue some time back. Also try to connect to Hive from the Scala shell. If it works there, it should work from Zeppelin.

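Try creating the Hive context as follows (PySpark code):
from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext
conf = SparkConf()  # assumption: default settings; the original snippet left conf undefined
sc = SparkContext(conf=conf)  # Zeppelin normally provides sc already; skip these two lines there
sc._jvm.org.apache.hadoop.hive.conf.HiveConf()
hiveContext = HiveContext(sc)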
Hope it Helps.
Regards,
Neeraj

Related

How to use Databricks S3-SQS connector to read SQS messages in Structured Streaming?

I am trying to read messages from SQS using Spark Structured Streaming with the code below:
import org.apache.spark.sql.streaming._
import org.apache.spark.sql.types._
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
val df = spark.readStream.format("s3-sqs").option("queueUrl", "https://sqs.us-east-1.amazonaws.com/XXXX").option("region","us-east-1").option("awsAccessKey","xxxxx").option("fileFormat", "json").option("sqsFetchInterval", "1m").load()
I start the shell with the following jars:
spark2-shell --jars /jars_aws/hadoop-aws-2.7.3.jar,/jars_aws/aws-java-sdk-1.11.582.jar,/jars_aws/aws-java-sdk-s3-1.11.584.jar,/jars_aws/aws-java-sdk-sqs-1.11.584.jar
I am getting the ClassNotFoundException below:
java.lang.ClassNotFoundException: Failed to find data source: s3-sqs. Please find packages at http://spark.apache.org/third-party-projects.html
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:635)
at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:159)
... 53 elided
Caused by: java.lang.ClassNotFoundException: s3-sqs.DefaultSource
at scala.reflect.internal.util.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:62)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23$$anonfun$apply$15.apply(DataSource.scala:618)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23$$anonfun$apply$15.apply(DataSource.scala:618)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23.apply(DataSource.scala:618)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23.apply(DataSource.scala:618)
at scala.util.Try.orElse(Try.scala:84)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:618)
... 54 more
Please help.
I have already added the jars I believed were required.

That error says that no jar passed via --jars contains the classes required for the s3-sqs data source.
After a bit of googling and reading "Optimized S3 File Source with SQS" (which seems to be the official documentation), I think the s3-sqs data source (aka the Databricks S3-SQS connector) is part of Databricks Runtime (DBR) and is Databricks-specific.
In other words, I think the connector is only available in Databricks notebooks, and there seems to be no way to use it outside Databricks.

NoClassDefFoundError when using avro in spark-shell

I keep getting
java.lang.NoClassDefFoundError: org/apache/avro/mapred/AvroWrapper
when calling show() on a DataFrame object. I'm attempting to do this through the shell (spark-shell --master yarn). I can see that the shell recognizes the schema when creating the DataFrame object, but if I execute any action on the data it always throws the NoClassDefFoundError when trying to instantiate the AvroWrapper. I've tried adding avro-mapred-1.8.0.jar to my $HDFS_USER/lib directory on the cluster and even included it using the --jars option when launching the shell. Neither of these options worked. Any advice would be greatly appreciated. Below is example code:
scala> import org.apache.spark.sql._
scala> import com.databricks.spark.avro._
scala> val sqc = new SQLContext(sc)
scala> val df = sqc.read.avro("my_avro_file") // recognizes the schema and creates the DataFrame object
scala> df.show // this is where I get NoClassDefFoundError

The DataFrame object itself is created at the val df =... line, but the data is not read yet. Spark only starts reading and processing the data when you ask for some kind of output (like df.count() or df.show()).
So the original issue is that the avro-mapred package is missing.
Try launching your Spark Shell like this:
spark-shell --packages org.apache.avro:avro-mapred:1.7.7,com.databricks:spark-avro_2.10:2.0.1
The Spark Avro package marks the Avro Mapred package as provided, but it is not available on your system (or classpath) for one reason or another.
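The deferred reading described above explains why the shell "recognizes the schema" but fails on show(). Here is a toy sketch of that behaviour in plain Python. It is an analogy, not Spark code: LazyFrame and broken_reader are invented names, and the ImportError merely stands in for the JVM's NoClassDefFoundError.

```python
class LazyFrame:
    """Toy stand-in for a DataFrame: remembers what to read, reads nothing up front."""

    def __init__(self, path, reader):
        self.path = path        # only recorded here, not opened
        self._reader = reader   # callable that does the real work later

    def show(self):
        # The reader (and every class it depends on) is touched only now.
        return self._reader(self.path)


def broken_reader(path):
    # Simulates a dependency that is missing at execution time.
    raise ImportError("org/apache/avro/mapred/AvroWrapper")


df = LazyFrame("my_avro_file", broken_reader)   # succeeds: nothing is read yet
try:
    df.show()                                   # fails: the read happens here
    failed = False
except ImportError:
    failed = True
```

The constructor call succeeds and the failure appears only inside show(), which mirrors the situation in the question: the missing class is needed by the read itself, not by the definition of the DataFrame.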

If anyone else runs into this problem: I finally solved it. I removed the CDH Spark package and downloaded Spark from http://spark.apache.org/downloads.html. After that everything worked fine. Not sure what the issue was with the CDH version, but I'm not going to waste any more time trying to figure it out.

Exception when trying to write a file to HDFS from Zeppelin

When trying to write to HDFS from Spark within Zeppelin, I am receiving this ClassNotFoundException for org.apache.hadoop.mapred.DirectFileOutputCommitter:
java.lang.RuntimeException: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.mapred.DirectFileOutputCommitter not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2106)
at org.apache.hadoop.mapred.JobConf.getOutputCommitter(JobConf.java:725)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply$mcV$sp(PairRDDFunctions.scala:983)
Code that is trying to run:
val model = LinearRegressionWithSGD.train(someRDD, numIterations)
val modelPath = "hdfs:///some_path/LinearRegressionWithSGD"
model.save(sc, modelPath)
When searching for this class, I cannot even find it. The closest I can find is org.apache.hadoop.mapred.FileOutputCommitter in Hadoop.
I am using commit 18c8c9ea512a0d87699a73e2ca26192d03748661 (Oct 9) of Zeppelin, Spark 1.5.0 on YARN, and Hadoop 2.6.

I had the same problem. I looked for that class in "hadoop-mapreduce-client-core.X.X.X.jar" but couldn't find it there.
I fixed the problem by adding org.apache.hadoop.mapred.DirectFileOutputCommitter to my repository. The source of that file can be found here: https://gist.github.com/apivovarov
I'm not sure yet what the root cause of this issue is. I'm digging into it and will update here once I have the answer.

Loading a Parquet file into a Vertica database using Spark

How can I load a Parquet file into a Vertica database using Spark?
link (http://www.sparkexpert.com/2015/04/17/save-apache-spark-dataframe-to-database/)
Using the link above, I was able to load a DataFrame (Parquet files) into MySQL. But when I tried to load it into a Vertica database, I got the error below. It occurs because Vertica doesn't support the data type (String) used in the DataFrame (Parquet file). I do not want to cast the columns, since that would be a performance issue: we are looking to load around 280 million rows. Could you please suggest the best way to load the data into Vertica?
Exception in thread "main" java.sql.SQLSyntaxErrorException: [Vertica][VJDBC](5108) ERROR: Type "TEXT" does not exist
at com.vertica.util.ServerErrorData.buildException(Unknown Source)
at com.vertica.io.ProtocolStream.readExpectedMessage(Unknown Source)
at com.vertica.dataengine.VDataEngine.prepareImpl(Unknown Source)
at com.vertica.dataengine.VDataEngine.prepare(Unknown Source)
at com.vertica.dataengine.VDataEngine.prepare(Unknown Source)
at com.vertica.jdbc.common.SPreparedStatement.<init>(Unknown Source)
at com.vertica.jdbc.jdbc4.S4PreparedStatement.<init>(Unknown Source)
at com.vertica.jdbc.VerticaJdbc4PreparedStatementImpl.<init>(Unknown Source)
at com.vertica.jdbc.VJDBCObjectFactory.createPreparedStatement(Unknown Source)
at com.vertica.jdbc.common.SConnection.prepareStatement(Unknown Source)
at org.apache.spark.sql.DataFrameWriter.jdbc(DataFrameWriter.scala:275)
at org.apache.spark.sql.DataFrame.createJDBCTable(DataFrame.scala:1611)
at com.sparkread.SparkVertica.JdbctoVertica.main(JdbctoVertica.java:51)
Caused by: com.vertica.support.exceptions.SyntaxErrorException: [Vertica][VJDBC](5108) ERROR: Type "TEXT" does not exist
... 13 more

Since you are getting the error in createJDBCTable, you could just create the table yourself and use insertIntoJDBC instead.
Another idea would be to try setting spark.sql.dialect to Postgres, since I noticed registerDialect(PostgresDialect) in Spark. That said, I don't know how to do this other than by using jdbc:postgresql, and with that driver you would lose the optimized inserts that Vertica's JDBC driver gives you. You might need to modify here to allow it to use that dialect for jdbc:vertica. If for some reason that doesn't work, you'd need to add a new dialect.
Personally I think the first option is simpler.

When the Vertica table already exists with the same column names as the DataFrame (and the corresponding types, VARCHAR), the following has worked for me while keeping Vertica's JDBC driver:
myDataFrame.write().mode(SaveMode.Append).jdbc(url, "MY_VERTICA_TABLE", new Properties());
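Both answers rely on the table existing in Vertica before Spark writes to it, so the CREATE TABLE statement has to come from somewhere. Below is a rough sketch of generating it from a list of column names and types, in plain Python with no Spark dependency. The mapping table, the function name and the VARCHAR length are my own assumptions, not something Spark or Vertica ships; adjust the types to your data.

```python
# Hypothetical mapping from Spark SQL type names (illustrative keys, in the
# style PySpark's df.dtypes uses) to Vertica column types.
SPARK_TO_VERTICA = {
    "string": "VARCHAR(65000)",   # assumption: widest VARCHAR; shrink if you know the data
    "int": "INTEGER",
    "bigint": "INTEGER",          # Vertica INTEGER is 64-bit
    "double": "FLOAT",
    "boolean": "BOOLEAN",
    "timestamp": "TIMESTAMP",
    "date": "DATE",
}


def vertica_create_table(table, columns):
    """Build a CREATE TABLE statement from (column name, Spark type name) pairs."""
    parts = []
    for name, spark_type in columns:
        if spark_type not in SPARK_TO_VERTICA:
            raise ValueError("no Vertica mapping for Spark type: " + spark_type)
        parts.append("{} {}".format(name, SPARK_TO_VERTICA[spark_type]))
    return "CREATE TABLE {} ({})".format(table, ", ".join(parts))
```

For example, vertica_create_table("MY_VERTICA_TABLE", [("id", "bigint"), ("name", "string")]) produces a statement you can run once through any Vertica client. After that the append-mode write above only has to insert rows, so Spark never tries to create a TEXT column.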

Reading Avro into spark using spark-avro

I'm not able to read Avro files using the spark-avro library. Here are the steps I took:
Got the jar from: http://mvnrepository.com/artifact/com.databricks/spark-avro_2.10/0.1
Invoked spark-shell using spark-shell --jars avro/spark-avro_2.10-0.1.jar
Executed the commands given in the git readme:
import com.databricks.spark.avro._
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
val episodes = sqlContext.avroFile("episodes.avro")
The action sqlContext.avroFile("episodes.avro") fails with the following error:
scala> val episodes = sqlContext.avroFile("episodes.avro")
java.lang.IncompatibleClassChangeError: class com.databricks.spark.avro.AvroRelation has interface org.apache.spark.sql.sources.TableScan as super class
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:760)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:368)

My bad. The readme clearly says:
Versions
Spark changed how it reads / writes data in 1.4, so please use the correct version of this dedicated for your spark version
1.3 -> 1.0.0
1.4+ -> 1.1.0-SNAPSHOT
I used Spark 1.3.1 with spark-avro 1.1.0. When I switched to spark-avro 1.0.0, it worked.
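The version table quoted from the readme is easy to get wrong by hand, so here is a small sketch that encodes exactly those two rows. The function name is made up, and the mapping covers only what the readme excerpt above states; later spark-avro releases may pair differently.

```python
def spark_avro_version(spark_version):
    """Pick the spark-avro release that the quoted readme pairs with a Spark version."""
    major, minor = (int(part) for part in spark_version.split(".")[:2])
    if (major, minor) < (1, 3):
        # The quoted readme lists nothing below Spark 1.3.
        raise ValueError("no spark-avro release listed for Spark " + spark_version)
    if (major, minor) == (1, 3):
        return "1.0.0"
    return "1.1.0-SNAPSHOT"  # readme entry for 1.4+
```

With Spark 1.3.1 this returns "1.0.0", which matches the combination that finally worked.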

Since the spark-avro module is external, there is no .avro API in DataFrameReader or DataFrameWriter.
To load or save data in Avro format, you need to specify the data source format as avro.
Example:
val usersDF = spark.read.format("avro").load("examples/src/main/resources/users.avro")
usersDF.select("name", "favorite_color").write.format("avro").save("namesAndFavColors.avro")

import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
  .appName(appName).master(master).getOrCreate()
val sqlContext = spark.sqlContext
val episodes = sqlContext.read.format("com.databricks.spark.avro")
  .option("header","true")
  .option("inferSchema","true")
  .load("episodes.avro")
episodes.show(10)

Resources