com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryException: Connection reset - apache-spark

I'm trying to read a GBQ table using the below approach,
from pyspark.sql.functions import col, current_timestamp
def analyze(spark, config, query = None):
df = spark.read \
.format("bigquery") \
.load("projectName.dataset.tablename")
resultDf = df \
.filter(col('colName')=='123') \
.withColumn('processedTS',current_timestamp()) resultDf.write.mode('overwrite').option("header",True).csv("output/claims")
My spark-submit command is as follows,
spark-submit --name "ApplicationName" --master "local[2]" --deploy-mode "client" --jars jars/spark-bigquery-with-dependencies_2.12-0.25.0.jar,jars/netty-tcnative-2.0.52.Final.jar --verbose main.py --jobName jobnameToRun
Environment variable is set to point to the location where the key Json is located,
GOOGLE_APPLICATION_CREDENTIALS=/PATH_TO_KEY/file.json
When i submit my application I get the below error,
py4j.protocol.Py4JJavaError: An error occurred while calling o26.load.
: com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryException: Connection reset
at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.spi.v2.HttpBigQueryRpc.translate(HttpBigQueryRpc.java:115)
at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.spi.v2.HttpBigQueryRpc.getTable(HttpBigQueryRpc.java:299)
at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryImpl$18.call(BigQueryImpl.java:778)
at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryImpl$18.call(BigQueryImpl.java:775)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.retrying.DirectRetryingExecutor.submit(DirectRetryingExecutor.java:103)
at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.RetryHelper.run(RetryHelper.java:76)
at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.RetryHelper.runWithRetries(RetryHelper.java:50)
at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryImpl.getTable(BigQueryImpl.java:774)
at com.google.cloud.bigquery.connector.common.BigQueryClient.getTable(BigQueryClient.java:119)
at com.google.cloud.bigquery.connector.common.BigQueryClient.getReadTable(BigQueryClient.java:240)
at com.google.cloud.spark.bigquery.BigQueryRelationProvider.createRelationInternal(BigQueryRelationProvider.scala:76)
at com.google.cloud.spark.bigquery.BigQueryRelationProvider.createRelation(BigQueryRelationProvider.scala:45)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:350)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:274)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:245)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:245)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:188)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(SocketInputStream.java:210)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
at sun.security.ssl.InputRecord.read(InputRecord.java:503)
at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:975)
at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1367)
at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1395)
at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1379)
at sun.net.www.protocol.https.HttpsClient.afterConnect(HttpsClient.java:559)
at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:185)
at sun.net.www.protocol.https.HttpsURLConnectionImpl.connect(HttpsURLConnectionImpl.java:162)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:151)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:84)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1012)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:514)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:455)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:565)
at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.spi.v2.HttpBigQueryRpc.getTable(HttpBigQueryRpc.java:297)
My system is uses Apple M1 Pro chip.
How can I resolve this issue?

I think the Issue is with --jars you are submitting Please submit jar with full path like "--jars /fullpath/first.jar,/fullpath/second.jar"

Related

Invalid configuration value detected for fs.azure.account.key with com.crealytics:spark-excel

I have setup my Databricks notebook to use Service Principal to access ADLS using below confguration.
service_credential = dbutils.secrets.get(scope="<scope>",key="<service-credential-key>")
spark.conf.set("fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.<storage-account>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.<storage-account>.dfs.core.windows.net", "<application-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret.<storage-account>.dfs.core.windows.net", service_credential)
spark.conf.set("fs.azure.account.oauth2.client.endpoint.<storage-account>.dfs.core.windows.net", "https://login.microsoftonline.com/<directory-id>/oauth2/token")
I am able to read csv file from ADLS however getting Invalid configuration value detected for fs.azure.account.key with excel file. Below is the code to read excel file.
#libaray used com.crealytics:spark-excel_2.12:3.2.2_0.18.0
df = spark.read.format("com.crealytics.spark.excel") \
.option("header", "true") \
.option("dataAddress", "'Sheet1'!A1:BA100000")\
.option("delimiter", ",") \
.option("inferSchema", "true") \
.option("multiline", "true") \
.load(file_path_full)
Stack Trace
Failure to initialize configurationInvalid configuration value detected for fs.azure.account.key
at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.services.SimpleKeyProvider.getStorageAccountKey(SimpleKeyProvider.java:51)
at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.AbfsConfiguration.getStorageAccountKey(AbfsConfiguration.java:577)
at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.initializeClient(AzureBlobFileSystemStore.java:1832)
at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.<init>(AzureBlobFileSystemStore.java:224)
at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.initialize(AzureBlobFileSystem.java:142)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3469)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:537)
at com.crealytics.spark.excel.WorkbookReader$.readFromHadoop$1(WorkbookReader.scala:60)
at com.crealytics.spark.excel.WorkbookReader$.$anonfun$apply$4(WorkbookReader.scala:79)
at com.crealytics.spark.excel.DefaultWorkbookReader.$anonfun$openWorkbook$3(WorkbookReader.scala:102)
at scala.Option.fold(Option.scala:251)
at com.crealytics.spark.excel.DefaultWorkbookReader.openWorkbook(WorkbookReader.scala:102)
at com.crealytics.spark.excel.WorkbookReader.withWorkbook(WorkbookReader.scala:33)
at com.crealytics.spark.excel.WorkbookReader.withWorkbook$(WorkbookReader.scala:32)
at com.crealytics.spark.excel.DefaultWorkbookReader.withWorkbook(WorkbookReader.scala:87)
at com.crealytics.spark.excel.ExcelRelation.excerpt$lzycompute(ExcelRelation.scala:48)
at com.crealytics.spark.excel.ExcelRelation.excerpt(ExcelRelation.scala:48)
at com.crealytics.spark.excel.ExcelRelation.headerColumns$lzycompute(ExcelRelation.scala:121)
at com.crealytics.spark.excel.ExcelRelation.headerColumns(ExcelRelation.scala:120)
at com.crealytics.spark.excel.ExcelRelation.$anonfun$inferSchema$1(ExcelRelation.scala:189)
at scala.Option.getOrElse(Option.scala:189)
at com.crealytics.spark.excel.ExcelRelation.inferSchema(ExcelRelation.scala:188)
at com.crealytics.spark.excel.ExcelRelation.<init>(ExcelRelation.scala:52)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:52)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:29)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:24)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:385)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:356)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:323)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:323)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:236)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
at py4j.Gateway.invoke(Gateway.java:295)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:251)
at java.lang.Thread.run(Thread.java:748)
Caused by: Invalid configuration value detected for fs.azure.account.key
at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.diagnostics.ConfigurationBasicValidator.validate(ConfigurationBasicValidator.java:49)
at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.diagnostics.Base64StringConfigurationBasicValidator.validate(Base64StringConfigurationBasicValidator.java:40)
at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.services.SimpleKeyProvider.validateStorageAccountKey(SimpleKeyProvider.java:70)
at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.services.SimpleKeyProvider.getStorageAccountKey(SimpleKeyProvider.java:49)
... 42 more
ok, found the solution.
Need to add following configuration as well.
spark._jsc.hadoopConfiguration().set("fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net", "OAuth")
spark._jsc.hadoopConfiguration().set("fs.azure.account.oauth.provider.type.<storage-account>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark._jsc.hadoopConfiguration().set("fs.azure.account.oauth2.client.id.<storage-account>.dfs.core.windows.net", "<application-id>")
spark._jsc.hadoopConfiguration().set("fs.azure.account.oauth2.client.secret.<storage-account>.dfs.core.windows.net", service_credential)
spark._jsc.hadoopConfiguration().set("fs.azure.account.oauth2.client.endpoint.<storage-account>.dfs.core.windows.net", "https://login.microsoftonline.com/<directory-id>/oauth2/token")

py4j.protocol.Py4JJavaError: Class org.apache.hadoop.fs.azure.NativeAzureFileSystem not found

I am trying to read the csv file from pyspark, while reading it is throwing
py4j.protocol.Py4JJavaError: An error
occurred while calling o30.csv. : java.lang.RuntimeException:
java.lang.ClassNotFoundException: Class
org.apache.hadoop.fs.azure.NativeAzureFileSystem not found at
org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2595)
at
org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3269)
at
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3301)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
at
org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479) at
org.apache.hadoop.fs.Path.getFileSystem(Path.java:361) at
org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46)
at
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:376)
at
org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:326)
at
org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:308)
at scala.Option.getOrElse(Option.scala:189) at
org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:308)
at
org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:796)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498) at
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at
py4j.Gateway.invoke(Gateway.java:282) at
py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79) at
py4j.GatewayConnection.run(GatewayConnection.java:238) at
java.lang.Thread.run(Thread.java:748) Caused by:
java.lang.ClassNotFoundException: Class
org.apache.hadoop.fs.azure.NativeAzureFileSystem not found at
org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2499)
at
org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2593)
... 25 more
Process finished with exit code 1
Error, So could anyone please suggest me where am i doing wrong in below code.
from pyspark.sql import SparkSession
SECRET_ACCESS_KEY = "XXXXXXXXXXX"
STORAGE_NAME = "azuresvkstorageaccount11123"
CONTAINER = "inputstorage1"
FILE_NAME = "movies.csv"
spark = SparkSession.builder.appName("Azure_PySpark_Connectivity")\
.master("local[*]")\
.getOrCreate()
fs_acc_key = "fs.azure.account.key." + STORAGE_NAME + ".blob.core.windows.net"
spark.conf.set("spark.hadoop.fs.wasb.impl", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
spark.conf.set("fs.azure", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
spark.conf.set(fs_acc_key, SECRET_ACCESS_KEY)
file_path = "wasb://inputstorage1#azuresvkstorageaccount.blob.core.windows.net/movies.csv"
print(file_path)
Df = spark.read.csv(path=file_path,header=True,inferSchema=True) #Error Coming from this line it is unable to read the csv file
#Df.show(20,True)
I have figured out the problem, it is coming from maven jars, the solution is
Download hadoop-azure and azure-storage jars from the maven portal manually and copy these jars to spark/jars/. folder.
Do the same for the jetty-utils jar, add this jar to spark/jars/. folder
Then refresh and run the script again, it works perfectly.

DB2 Connection with Pyspark local

I am trying to connect to db2 via pyspark, below is my connection string.
from pyspark import SparkConf, SparkContext, SQLContext
conf = SparkConf().setAppName("test").setMaster("local").set("spark.jars","\IBM\IBM_DATA_SERVER_DRIVER\java\db2jcc4.jar")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
df = (sqlContext.read.format('jdbc')\
.option('url', 'jdbc:db2://********.COM:*****/*****')\
.option('driver', 'com.ibm.db2.jcc.DB2Driver')\
.option('dbtable', "(SELECT * FROM table.table limit 100) as t")\
.option('user', 'user')\
.option('password', 'password')).load()
However, I am getting an error as below
Py4JJavaError: An error occurred while calling o161.load.
: java.lang.ClassNotFoundException: com.ibm.db2.jcc.DB2Driver
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
at org.apache.spark.sql.execution.datasources.jdbc.DriverRegistry$.register(DriverRegistry.scala:45)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$5.apply(JDBCOptions.scala:99)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$5.apply(JDBCOptions.scala:99)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:99)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:35)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:32)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:332)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:242)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:230)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:186)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
I have downloaded latest driver and have it specified.
Could you help me resolve this issue to connect to db2 via pyspark.
Try not specifying the driver. Putting the jar file for db2 in $SPARK_HOME/jars should be enough.
Also use SparkSession to read input files. SQLContext is deprecated.

Access Vertica with kerberos in PySpark

I am running PySpark 2 & trying to access data from Vertica authenticated by kerberos.
I am using following mechanism with JDBC driver:
# PySpark python 3.5
krb_url = "jdbc:vertica://vertica.example.net:5433/DB;" + \
"integratedSecurity=true;authenticationSchema=JavaKeberos;principal=userx;keytab=/home/userx/userx.keytab;"
query = " (SELECT id FROM schema.table limit 10) as t1"
kdf = spark.read.format("jdbc") \
.option("driver", "com.vertica.jdbc.Driver") \
.option("url", krb_url) \
.option("dbtable", query) \
.load()
However, I'm getting following error:
Py4JJavaError: An error occurred while calling o68.load.
: java.lang.NullPointerException
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:72)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:113)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:47)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:306)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:146)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)

query apache drill form within apache spark

I am querying Apache drill from within apache spark. My question is, how to send sql commands other than select * from from spark to drill. By default, spark is sending the queries inside select * from. Also, when I am querying schema other than dfs, I am getting NullPointerException. Please help!
My spark version is 2.2.0
Here are my codes:
1. schema = dfs:
dataframe_mysql = spark.read.format("jdbc").option("url", "jdbc:drill:zk=%s;schema=%s;" % (foreman,schema)).option("driver","org.apache.drill.jdbc.Driver").option("dbtable","\"/user/titanic_data/test.csv\"").load()
Schema = MySQL
dataframe_mysql = spark.read.format("jdbc").option("url", "jdbc:drill:zk=%s;schema=MySQL;" % (foreman)).option("driver","org.apache.drill.jdbc.Driver").option("dbtable","MySQL.\"spark3\"").load()
This is the complete error:
Name: org.apache.toree.interpreter.broker.BrokerException
Message: Py4JJavaError: An error occurred while calling o40.load.
: java.sql.SQLException: Failed to create prepared statement: SYSTEM ERROR: NullPointerException
[Error Id: d1e4b310-f4df-4e7c-90ae-983cc5c89f94 on inpunpclx1825e.kih.kmart.com:31010]
at org.apache.drill.jdbc.impl.DrillJdbc41Factory.newServerPreparedStatement(DrillJdbc41Factory.java:147)
at org.apache.drill.jdbc.impl.DrillJdbc41Factory.newPreparedStatement(DrillJdbc41Factory.java:108)
at org.apache.drill.jdbc.impl.DrillJdbc41Factory.newPreparedStatement(DrillJdbc41Factory.java:50)
at oadd.org.apache.calcite.avatica.AvaticaConnection.prepareStatement(AvaticaConnection.java:278)
at org.apache.drill.jdbc.impl.DrillConnectionImpl.prepareStatement(DrillConnectionImpl.java:389)
at oadd.org.apache.calcite.avatica.AvaticaConnection.prepareStatement(AvaticaConnection.java:119)
at org.apache.drill.jdbc.impl.DrillConnectionImpl.prepareStatement(DrillConnectionImpl.java:422)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:60)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:113)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:47)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:306)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:146)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
(<class 'py4j.protocol.Py4JJavaError'>, Py4JJavaError('An error occurred while calling o40.load.\n', JavaObject id=o41), <traceback object at 0x7f00106d6488>)
StackTrace: org.apache.toree.interpreter.broker.BrokerState$$anonfun$markFailure$1.apply(BrokerState.scala:163)
org.apache.toree.interpreter.broker.BrokerState$$anonfun$markFailure$1.apply(BrokerState.scala:163)
scala.Option.foreach(Option.scala:257)
org.apache.toree.interpreter.broker.BrokerState.markFailure(BrokerState.scala:162)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:498)
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
py4j.Gateway.invoke(Gateway.java:280)
py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
py4j.commands.CallCommand.execute(CallCommand.java:79)
py4j.GatewayConnection.run(GatewayConnection.java:214)
java.lang.Thread.run(Thread.java:748)
I have changed the default drill quote from `` to "" so that there won't be any quoting identifier issue between spark and drill.

Resources