Py4JJavaError An error occurred while calling o37.load. : net.snowflake.client.jdbc.SnowflakeSQLException: JDBC driver encountered communication error - apache-spark

I am trying to connect PySpark to Snowflake, but I keep getting the error below.
I am using PySpark 3.1, and the Snowflake JDBC and spark-snowflake JAR files are on the classpath. I am not sure why I am getting this error.
Code:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *
sc = SparkContext("local", "Test App")
spark = SQLContext(sc)
spark_conf = SparkConf().setMaster('local').setAppName('Testing Spark SF')
sfOptions = {
"sfURL" : "<account_identifier>.snowflakecomputing.com",
"sfUser" : "<user_name>",
"sfPassword" : "<password>",
"sfDatabase" : "<database>",
"sfSchema" : "<schema>",
"sfWarehouse" : "<warehouse>"
}
SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"
df = spark.read.format(SNOWFLAKE_SOURCE_NAME) \
.options(**sfOptions) \
.option("query", "select 1 as my_num union all select 2 as my_num") \
.load()
df.show()
Error:
Py4JJavaError: An error occurred while calling o37.load. :
net.snowflake.client.jdbc.SnowflakeSQLException: JDBC driver
encountered communication error. Message: Exception encountered for
HTTP request: sun.security.validator.ValidatorException: No trusted
certificate found. at
net.snowflake.client.jdbc.RestRequest.execute(RestRequest.java:284)
at
net.snowflake.client.core.HttpUtil.executeRequestInternal(HttpUtil.java:639)
at
net.snowflake.client.core.HttpUtil.executeRequest(HttpUtil.java:584)
at
net.snowflake.client.core.HttpUtil.executeGeneralRequest(HttpUtil.java:551)
at
net.snowflake.client.core.SessionUtil.newSession(SessionUtil.java:587)
at
net.snowflake.client.core.SessionUtil.openSession(SessionUtil.java:285)
at net.snowflake.client.core.SFSession.open(SFSession.java:446) at
net.snowflake.client.jdbc.DefaultSFConnectionHandler.initialize(DefaultSFConnectionHandler.java:104)
at
net.snowflake.client.jdbc.DefaultSFConnectionHandler.initializeConnection(DefaultSFConnectionHandler.java:79)
at
net.snowflake.client.jdbc.SnowflakeConnectionV1.initConnectionWithImpl(SnowflakeConnectionV1.java:116)
at
net.snowflake.client.jdbc.SnowflakeConnectionV1.<init>(SnowflakeConnectionV1.java:96)
at
net.snowflake.client.jdbc.SnowflakeDriver.connect(SnowflakeDriver.java:172)
at java.sql.DriverManager.getConnection(DriverManager.java:664) at
java.sql.DriverManager.getConnection(DriverManager.java:208) at
net.snowflake.spark.snowflake.JDBCWrapper.getConnector(SnowflakeJDBCWrapper.scala:209)
at
net.snowflake.spark.snowflake.SnowflakeRelation.$anonfun$schema$1(SnowflakeRelation.scala:60)
at
net.snowflake.spark.snowflake.SnowflakeRelation$$Lambda$866/22415031.apply(Unknown
Source) at scala.Option.getOrElse(Option.scala:189) at
net.snowflake.spark.snowflake.SnowflakeRelation.schema$lzycompute(SnowflakeRelation.scala:57)
at
net.snowflake.spark.snowflake.SnowflakeRelation.schema(SnowflakeRelation.scala:56)
at
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:449)
at
org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)
at
org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307)
at
org.apache.spark.sql.DataFrameReader$$Lambda$858/5135046.apply(Unknown
Source) at scala.Option.getOrElse(Option.scala:189) at
org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307)
at
org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:225)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497) at
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282) at
py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79) at
py4j.GatewayConnection.run(GatewayConnection.java:238) at
java.lang.Thread.run(Thread.java:745) Caused by:
javax.net.ssl.SSLHandshakeException:
sun.security.validator.ValidatorException: No trusted certificate
found at sun.security.ssl.Alerts.getSSLException(Alerts.java:198)
at sun.security.ssl.SSLSocketImpl.fatal(SSLSocketImpl.java:1958)
at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:322) at
sun.security.ssl.Handshaker.fatalSE(Handshaker.java:316) at
sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1526)
at
sun.security.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:215)
at sun.security.ssl.Handshaker.processLoop(Handshaker.java:1024)
at sun.security.ssl.Handshaker.process_record(Handshaker.java:954)
at
sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:1065)
at
sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1384)
at
sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1412)
at
sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1396)
at
net.snowflake.client.jdbc.internal.apache.http.conn.ssl.SSLConnectionSocketFactory.createLayeredSocket(SSLConnectionSocketFactory.java:436)

The stack trace ("No trusted certificate found") shows that the JVM does not trust the SSL certificate it received. You can override that temporarily:
sfOptions = {
...
"sfSSL" : "false",
}
However, check whether you access Snowflake through a proxy.
If you do, you will need to import the proxy's certificate into your cacerts file.
The default cacerts file belongs to the Java version you are running; you can find it inside the Java home directory under lib/security.
keytool -import -trustcacerts -alias cert_ssl -file proxy.cer -noprompt -storepass changeit -keystore cacerts
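The keytool command above assumes that the current directory contains cacerts. As a minimal sketch, the following Python helper looks for the file under a given Java home and builds the same command with an explicit keystore path. The function names are hypothetical, and the jre/lib/security candidate is an assumption added here for JDK 8 style installs (the stack trace in the question suggests Java 8); the answer itself only mentions lib/security.

```python
import os
from pathlib import Path


def candidate_cacerts_paths(java_home):
    """Return the usual cacerts locations under a Java home directory."""
    home = Path(java_home)
    return [
        home / "lib" / "security" / "cacerts",          # location named in the answer
        home / "jre" / "lib" / "security" / "cacerts",  # assumption: JDK 8 style layout
    ]


def find_cacerts(java_home):
    """Return the first existing cacerts path, or None if none is found."""
    for path in candidate_cacerts_paths(java_home):
        if path.is_file():
            return path
    return None


def keytool_import_command(cacerts, cert_file, alias="cert_ssl", storepass="changeit"):
    """Build the keytool argument list from the answer with an explicit keystore."""
    return [
        "keytool", "-import", "-trustcacerts",
        "-alias", alias,
        "-file", str(cert_file),
        "-noprompt",
        "-storepass", storepass,
        "-keystore", str(cacerts),
    ]


if __name__ == "__main__":
    java_home = os.environ.get("JAVA_HOME")
    cacerts = find_cacerts(java_home) if java_home else None
    if cacerts is None:
        print("Could not locate cacerts; check JAVA_HOME")
    else:
        # Only prints the command; run it yourself once the path looks right.
        print(" ".join(keytool_import_command(cacerts, "proxy.cer")))
```

Make sure the Java home you inspect is the one Spark actually runs with, since the certificate has to go into that JVM's trust store.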

Related

pyspark configuration for connecting google cloud platform

I can't connect to Google Cloud Platform from PySpark. Can anyone help?
I am not using Dataproc, just a local Spark instance.
Background:
I have downloaded all the JAR files into $SPARK_HOME/jars, including:
google-api-client-2.0.0.jar
google-auth-library-credentials-1.12.1.jar
google-auth-library-oauth2-http-1.12.1.jar
google-http-client-1.42.2.jar
gcs-connector-hadoop3-2.2.8.jar
guava-14.0.1.jar
guava-31.1-jre.jar
I am using the Docker image jupyter-notebook.
Code:
from pyspark.sql import SparkSession
builder = SparkSession.builder.appName('GCSFilesRead').config("google.cloud.auth.service.account.enable", "true")\
.config("google.cloud.auth.service.account.json.keyfile","/home/jovyan/work/gcs_admin.json")\
.config('fs.gs.auth.type','SERVICE_ACCOUNT_JSON_KEYFILE')
spark = builder.getOrCreate()
import os
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/home/jovyan/work/gcs_admin.json'
spark._jsc.hadoopConfiguration().set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
spark._jsc.hadoopConfiguration().set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
bucket_name="mybucket"
path=f"gs://{bucket_name}/my_file.csv"
df=spark.read.option("header",True).csv(path, header=True)
df.show()
Py4JJavaError: An error occurred while calling o59.csv.
: java.lang.NoClassDefFoundError: com/google/api/client/auth/oauth2/Credential
at java.base/java.lang.ClassLoader.defineClass1(Native Method)
at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1012)
at java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150)
at java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:862)
at java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:760)
at java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:681)
at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:639)
at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:520)
at java.base/java.lang.Class.forName0(Native Method)
at java.base/java.lang.Class.forName(Class.java:467)
at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2625)
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2590)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2686)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3431)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3466)
at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:53)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:370)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:228)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:210)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:210)
at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:537)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:568)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: java.lang.ClassNotFoundException: com.google.api.client.auth.oauth2.Credential
at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:641)
at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:520)
... 40 more
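A NoClassDefFoundError like the one above means that no jar on the classpath supplied com.google.api.client.auth.oauth2.Credential. One way to check this is to scan the jar directory for the class file. The sketch below does that with Python's zipfile module; the function names are hypothetical, and it assumes the jars live in a single flat directory such as $SPARK_HOME/jars.

```python
import zipfile
from pathlib import Path


def class_entry(class_name):
    """Convert a dotted class name to its path inside a jar."""
    return class_name.replace(".", "/") + ".class"


def jars_containing(class_name, jar_dir):
    """Return the names of the jars in jar_dir that contain the given class."""
    entry = class_entry(class_name)
    hits = []
    for jar in sorted(Path(jar_dir).glob("*.jar")):
        try:
            with zipfile.ZipFile(jar) as zf:
                if entry in zf.namelist():
                    hits.append(jar.name)
        except zipfile.BadZipFile:
            # Skip truncated or corrupt downloads instead of aborting the scan.
            continue
    return hits


# Example call (path is an assumption about the local setup):
# import os
# print(jars_containing("com.google.api.client.auth.oauth2.Credential",
#                       os.path.join(os.environ["SPARK_HOME"], "jars")))
```

An empty result means the jar that ships this class is missing from the directory and still has to be added.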

How to authenticate with BigQuery from Apache Spark (pyspark)?

I have created a client ID and client secret for my BigQuery project, but I don't know how to use them to save a DataFrame from a PySpark script to my BigQuery table. My Python code below results in the following error. Is there a way to connect to BigQuery using the save options on the PySpark DataFrame?
Code
df.write \
.format("bigquery") \
.option("client_id", "<MY_CLIENT_ID>") \
.option("client_secret", "<MY_CLIENT_SECRET>") \
.option("project", "bigquery-project-id") \
.option("table", "dataset.table") \
.save()
Error
py4j.protocol.Py4JJavaError: An error occurred while calling o93.save.
:
com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryException:
400 Bad Request { "error": "invalid_grant", "error_description":
"Bad Request" } at
com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.spi.v2.HttpBigQueryRpc.translate(HttpBigQueryRpc.java:106)
at
com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.spi.v2.HttpBigQueryRpc.getTable(HttpBigQueryRpc.java:268)
at
com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryImpl$17.call(BigQueryImpl.java:664)
at
com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryImpl$17.call(BigQueryImpl.java:661)
at
com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.retrying.DirectRetryingExecutor.submit(DirectRetryingExecutor.java:105)
at
com.google.cloud.spark.bigquery.repackaged.com.google.cloud.RetryHelper.run(RetryHelper.java:76)
at
com.google.cloud.spark.bigquery.repackaged.com.google.cloud.RetryHelper.runWithRetries(RetryHelper.java:50)
at
com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryImpl.getTable(BigQueryImpl.java:660)
at
com.google.cloud.spark.bigquery.BigQueryInsertableRelation.getTable(BigQueryInsertableRelation.scala:68)
at
com.google.cloud.spark.bigquery.BigQueryInsertableRelation.exists(BigQueryInsertableRelation.scala:54)
at
com.google.cloud.spark.bigquery.BigQueryRelationProvider.createRelation(BigQueryRelationProvider.scala:86)
at
org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
at
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at
org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at
org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
at
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
at
org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
at
org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
at
org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
at
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at
org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
at
org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
at
org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498) at
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at
py4j.Gateway.invoke(Gateway.java:282) at
py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79) at
py4j.GatewayConnection.run(GatewayConnection.java:238) at
java.lang.Thread.run(Thread.java:748) Caused by:
com.google.cloud.spark.bigquery.repackaged.com.google.api.client.http.HttpResponseException:
400 Bad Request { "error": "invalid_grant", "error_description":
"Bad Request" }
From the spark-bigquery-connector documentation:
How do I authenticate outside GCE / Dataproc?
Use a service account JSON key and GOOGLE_APPLICATION_CREDENTIALS as
described here.
Credentials can also be provided explicitly either as a parameter or
from Spark runtime configuration. It can be passed in as a
base64-encoded string directly, or a file path that contains the
credentials (but not both).
So you should be using this:
spark.read.format("bigquery").option("credentialsFile", "</path/to/key/file>")
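The quoted passage describes two mutually exclusive ways of passing the service account key: a file path or a base64-encoded string. The sketch below builds the matching option dictionary. The helper name is hypothetical. "credentialsFile" is the option named in this answer, while "credentials" for the base64 form is assumed here from the connector's documentation and should be checked against the version you use.

```python
import base64


def bigquery_auth_options(key_file=None, key_json_bytes=None):
    """Return connector options for exactly one of the two credential forms."""
    if (key_file is None) == (key_json_bytes is None):
        raise ValueError("Provide exactly one of key_file or key_json_bytes")
    if key_file is not None:
        # Path to the service account JSON key on the driver's filesystem.
        return {"credentialsFile": key_file}
    # Assumed option name for the base64-encoded key contents.
    return {"credentials": base64.b64encode(key_json_bytes).decode("ascii")}


# Hypothetical usage with the DataFrame from the question:
# opts = bigquery_auth_options(key_file="</path/to/key/file>")
# df.write.format("bigquery").options(**opts).option("table", "dataset.table").save()
```

Rejecting the case where both forms are supplied mirrors the "but not both" rule in the quoted text.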

Get a java.lang.LinkageError: ClassCastException when use spark sql hivesql on yarn

This is the driver I upload to the YARN cluster:
package com.baidu.spark.forhivetest
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.sql.hive._
import org.apache.spark.SparkContext
object ForTest {
def main(args : Array[String]){
val sc = new SparkContext()
val sqlc = new SQLContext(sc)
val hivec = new HiveContext(sc)
hivec.sql("CREATE TABLE IF NOT EXISTS newtest (time TIMESTAMP,word STRING,current_city_name STRING,content_src_name STRING,content_name STRING)")
val schema = hivec.table("newtest").schema
println(schema)
}
}
In the Hive config file, I set hive.metastore.uris and hive.metastore.warehouse.dir.
With spark-submit I added these jars:
datanucleus-api-jdo-3.2.6.jar
datanucleus-core-3.2.10.jar
datanucleus-rdbms-3.2.9.jar
Even after adding mysql-connector-java-5.1.38-bin.jar and spark-1.6.0-bin-hadoop2.6/lib/guava-14.0.1.jar, I still get this error.
But when I run this Spark job in the IDE, it works successfully.
I hope someone can help me. Thanks a lot!
This is the error information:
java.lang.LinkageError: ClassCastException: attempting to castjar:file:/mnt/hadoop/yarn/local/filecache/18/spark-assembly-1.6.0-hadoop2.6.0.jar!/javax/ws/rs/ext/RuntimeDelegate.classtojar:file:/mnt/hadoop/yarn/local/filecache/18/spark-assembly-1.6.0-hadoop2.6.0.jar!/javax/ws/rs/ext/RuntimeDelegate.class
at javax.ws.rs.ext.RuntimeDelegate.findDelegate(RuntimeDelegate.java:116)
at javax.ws.rs.ext.RuntimeDelegate.getInstance(RuntimeDelegate.java:91)
at javax.ws.rs.core.MediaType.<clinit>(MediaType.java:44)
at com.sun.jersey.core.header.MediaTypes.<clinit>(MediaTypes.java:64)
at com.sun.jersey.core.spi.factory.MessageBodyFactory.initReaders(MessageBodyFactory.java:182)
at com.sun.jersey.core.spi.factory.MessageBodyFactory.initReaders(MessageBodyFactory.java:175)
at com.sun.jersey.core.spi.factory.MessageBodyFactory.init(MessageBodyFactory.java:162)
at com.sun.jersey.api.client.Client.init(Client.java:342)
at com.sun.jersey.api.client.Client.access$000(Client.java:118)
at com.sun.jersey.api.client.Client$1.f(Client.java:191)
at com.sun.jersey.api.client.Client$1.f(Client.java:187)
at com.sun.jersey.spi.inject.Errors.processWithErrors(Errors.java:193)
at com.sun.jersey.api.client.Client.<init>(Client.java:187)
at com.sun.jersey.api.client.Client.<init>(Client.java:170)
at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.serviceInit(TimelineClientImpl.java:268)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at org.apache.hadoop.hive.ql.hooks.ATSHook.<init>(ATSHook.java:67)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at java.lang.Class.newInstance(Class.java:374)
at org.apache.hadoop.hive.ql.hooks.HookUtils.getHooks(HookUtils.java:60)
at org.apache.hadoop.hive.ql.Driver.getHooks(Driver.java:1309)
at org.apache.hadoop.hive.ql.Driver.getHooks(Driver.java:1293)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1347)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1195)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1059)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1049)
at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$runHive$1.apply(ClientWrapper.scala:484)
at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$runHive$1.apply(ClientWrapper.scala:473)
at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:279)
at org.apache.spark.sql.hive.client.ClientWrapper.liftedTree1$1(ClientWrapper.scala:226)
at org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:225)
at org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:268)
at org.apache.spark.sql.hive.client.ClientWrapper.runHive(ClientWrapper.scala:473)
at org.apache.spark.sql.hive.client.ClientWrapper.runSqlHive(ClientWrapper.scala:463)
at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:605)
at org.apache.spark.sql.hive.execution.HiveNativeCommand.run(HiveNativeCommand.scala:33)
at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:145)
at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:130)
at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817)
at com.baidu.spark.forhivetest.ForTest$.main(ForTest.scala:12)
at com.baidu.spark.forhivetest.ForTest.main(ForTest.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:542)
16/03/22 17:04:32 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: java.lang.LinkageError: ClassCastException: attempting to castjar:file:/mnt/hadoop/yarn/local/filecache/18/spark-assembly-1.6.0-hadoop2.6.0.jar!/javax/ws/rs/ext/RuntimeDelegate.classtojar:file:/mnt/hadoop/yarn/local/filecache/18/spark-assembly-1.6.0-hadoop2.6.0.jar!/javax/ws/rs/ext/RuntimeDelegate.class)
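This has to do with your classpath: the error says javax.ws.rs.ext.RuntimeDelegate could not be cast to itself, which happens when the same class is loaded by two different class loaders. Try not to build a fat jar.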

Unable to find CassandraSQLContext

I am using Spark 1.6 prebuilt with Hadoop 2.6,
Spark Cassandra Connector 1.6, and
Cassandra 2.1.12.
I wrote a simple Scala program to run a simple select count(*) query on Cassandra. Here is my code:
import org.apache.spark.{SparkContext, SparkConf}
import com.datastax.spark.connector._
import org.apache.spark.sql.cassandra.CassandraSQLContext
import org.apache.spark.sql._
object Hi {
def main(args: Array[String])
{
val conf = new SparkConf(true).set("spark.cassandra.connection.host", "172.16.4.196")
val sc = new SparkContext("spark://naresh-pc:7077", "test", conf)
val csc = new CassandraSQLContext(sc)
val rdd1 = csc.sql("SELECT count(*) from cw.testdata")
println(rdd1.count)
println(rdd1.first)
}
}
It builds successfully with sbt assembly and creates a jar.
When I submit it using spark-submit,
it gives the following error:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/sql/cassandra/CassandraSQLContext
at Hi$.main(trySpark.scala:15)
at Hi.main(trySpark.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.cassandra.CassandraSQLContext
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
Any help on this?
Moreover, when I run it with spark-shell, it works fine.

Spark Tutorial for Avro

I've started with Spark, and my use case is to read an Avro file (the data source) and perform ETL based on rules. As a start, I just wanted to try reading the Avro file and creating an RDD. Based on a recommendation in a Stack Overflow answer, I tried the following:
import org.apache.avro.generic.GenericRecord
import org.apache.hadoop.mapred.JobConf
import org.apache.spark.{SparkConf, SparkContext}
object abc {
def main(args: Array[String]): Unit =
{
//val master = Properties.envOrElse("MASTER",args(0))
val path = args(0)
val sparkContext = new SparkContext(new SparkConf().setAppName("My-spark-app"))
val jobConf = new JobConf(sparkContext.hadoopConfiguration)
val rdd = sparkContext.hadoopFile (
path,
classOf[org.apache.avro.mapred.AvroInputFormat[GenericRecord]],
classOf[org.apache.avro.mapred.AvroWrapper[GenericRecord]],
classOf[org.apache.hadoop.io.NullWritable],
10)
println(rdd.first)
}
}
My environment is CDH 5.1.3. I'm getting the following error.
15/03/17 08:53:58 INFO YarnClientClusterScheduler: YarnClientClusterScheduler.postStartHook done
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/avro/mapred/AvroInputFormat
at com.scif.afw.abc$.main(abc.scala:30)
at com.scif.afw.abc.main(abc.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.avro.mapred.AvroInputFormat
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
I've built the project using Maven and my POM has the Avro jar and I can see the class in the jar.
Appreciate any help
If you are running on a YARN cluster, an Avro jar may already be present via yarn.application.classpath. A NoClassDefFoundError can be caused by multiple copies of the same class on the classpath (one from your jar and a second from the default YARN application classpath).
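To check whether the same class really is shipped by more than one jar, the jars can be compared entry by entry. The sketch below is one way to do that in Python. The function name and the example jar paths are hypothetical, and the list of jars to compare (your application jar plus whatever yarn.application.classpath resolves to on the cluster) has to be gathered by hand.

```python
import zipfile
from collections import defaultdict


def duplicate_classes(jar_paths, prefix=""):
    """Map each .class entry found in more than one jar to the jars holding it."""
    owners = defaultdict(list)
    for jar in jar_paths:
        with zipfile.ZipFile(jar) as zf:
            # set() guards against a jar that lists the same entry twice.
            for name in set(zf.namelist()):
                if name.endswith(".class") and name.startswith(prefix):
                    owners[name].append(str(jar))
    return {name: jars for name, jars in owners.items() if len(jars) > 1}


# Hypothetical usage, restricted to the Avro packages:
# dupes = duplicate_classes(["target/my-app.jar", "/opt/hadoop/lib/avro-mapred.jar"],
#                           prefix="org/apache/avro/")
# for name, jars in sorted(dupes.items()):
#     print(name, "->", jars)
```

Restricting the scan with a prefix keeps the output short when the jars are large.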
