Spark BigQuery Connector: BaseEncoding$DecodingException: Unrecognized character: 0xa

I am trying to save a Spark dataframe to a BigQuery table from an AWS EMR cluster, using the spark-bigquery-connector. I encoded my gcloud service-account credentials JSON file to Base64 from the command line and am simply pasting the resulting string into the credentials option. However, this does not work and causes the encoding error below. I know my JSON file is correct because I use it when running my script locally. What is causing this issue?
GCLOUD Service Credentials JSON File Structure
{
"type": "service_account",
"project_id": "<MY_PROJECT_NAME>",
"private_key_id": "<PRIVATE_KEY_ID>",
"private_key": "-----BEGIN PRIVATE KEY-----<LONG LIST OF CHARS>-----END PRIVATE KEY-----\n",
"client_email": "service#project.iam.gserviceaccount.com",
"client_id": "<CLIENT_ID>",
"auth_uri": "https://accounts.google.com/o/oauth2/auth",
"token_uri": "https://oauth2.googleapis.com/token",
"auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
"client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/<service>%40<project>.iam.gserviceaccount.com"
}
Spark Code
df \
.drop(*cols_to_drop) \
.write \
.format("bigquery") \
.option("temporaryGcsBucket", "emr_spark") \
.option("credentials", "<LONG_BASE64_STRING>") \
.option("project", "<MY_PROJECT_NAME>") \
.option("parentProject", "<MY_PROJECT_NAME>") \
.option("table", "<MY_PROJECT_NAME>:dataset.table") \
.mode("overwrite") \
.save()
Error:
py4j.protocol.Py4JJavaError: An error occurred while calling o137.save.
: java.lang.IllegalArgumentException: com.google.cloud.spark.bigquery.repackaged.com.google.common.io.BaseEncoding$DecodingException: Unrecognized character: 0xa
at com.google.cloud.spark.bigquery.repackaged.com.google.common.io.BaseEncoding.decode(BaseEncoding.java:219)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.client.util.Base64.decodeBase64(Base64.java:104)
at com.google.cloud.spark.bigquery.SparkBigQueryOptions.createCredentials(SparkBigQueryOptions.scala:47)
at com.google.cloud.spark.bigquery.BigQueryRelationProvider$.createBigQuery(BigQueryRelationProvider.scala:125)
at com.google.cloud.spark.bigquery.BigQueryRelationProvider$$anonfun$getOrCreateBigQuery$1.apply(BigQueryRelationProvider.scala:107)
at com.google.cloud.spark.bigquery.BigQueryRelationProvider$$anonfun$getOrCreateBigQuery$1.apply(BigQueryRelationProvider.scala:107)
at scala.Option.getOrElse(Option.scala:121)
at com.google.cloud.spark.bigquery.BigQueryRelationProvider.getOrCreateBigQuery(BigQueryRelationProvider.scala:107)
at com.google.cloud.spark.bigquery.BigQueryRelationProvider.createRelation(BigQueryRelationProvider.scala:79)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: com.google.cloud.spark.bigquery.repackaged.com.google.common.io.BaseEncoding$DecodingException: Unrecognized character: 0xa
at com.google.cloud.spark.bigquery.repackaged.com.google.common.io.BaseEncoding$Alphabet.decode(BaseEncoding.java:490)
at com.google.cloud.spark.bigquery.repackaged.com.google.common.io.BaseEncoding$Base64Encoding.decodeTo(BaseEncoding.java:974)
at com.google.cloud.spark.bigquery.repackaged.com.google.common.io.BaseEncoding.decodeChecked(BaseEncoding.java:233)
at com.google.cloud.spark.bigquery.repackaged.com.google.common.io.BaseEncoding.decode(BaseEncoding.java:217)
... 39 more

I believe the issue arises because the connector expects a string, but Base64-encoding in Python produces a bytes object. Simply UTF-8-decoding the Base64-encoded bytes resolved my DecodingException:
import json
import base64

# Load the service-account JSON credentials as a dict
with open("service_account.json") as f:
    creds = json.load(f)

# Dump the dict to a JSON string, encode it to UTF-8 bytes,
# Base64-encode those bytes, then decode back to a str so the
# connector receives a plain string rather than bytes.
creds64 = base64.b64encode(json.dumps(creds).encode('utf-8')).decode('utf-8')
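As a quick sanity check (0xa is the ASCII newline, the exact character the exception complains about), you can assert that the resulting string contains no line breaks before handing it to the connector:
# The connector's Base64 decoder rejects newlines (0xa),
# so the credentials string must be a single line.
assert "\n" not in creds64 and "\r" not in creds64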

The issue was how the string was represented in my Python script.
Spreading the string across multiple lines in a triple-quoted literal embeds newline characters (0xa), which caused the encoding error:
credentials="""
ewogICJ0eXBlIjogInNlcnZpY2VfYWNjb3VudCIsCiAgInByb2plY3RfaWQiOiAiPE1ZX1BST0pFQ1RfTkFNRT4iL
AogICJwcml2YXRlX2tleV9pZCI6ICI8UFJJVkFURV9LRVlfSUQ+IiwKICAicHJpdmF0ZV9rZXkiOiAiLS0tLS1CRU
dJTiBQUklWQVRFIEtFWS0tLS0tPExPTkcgTElTVCBPRiBDSEFSUz4tLS0tLUVORCBQUklWQVRFIEtFWS0tLS0tXG4
iLAogICJjbGllbnRfZW1haWwiOiAic2VydmljZUBwcm9qZWN0LmlhbS5nc2VydmljZWFjY291bnQuY29tIiwKICAi
Y2xpZW50X2lkIjogIjxDTElFTlRfSUQ+IiwKICAiYXV0aF91cmkiOiAiaHR0cHM6Ly9hY2NvdW50cy5nb29nbGUuY
29tL28vb2F1dGgyL2F1dGgiLAogICJ0b2tlbl91cmkiOiAiaHR0cHM6Ly9vYXV0aDIuZ29vZ2xlYXBpcy5jb20vdG
9rZW4iLAogICJhdXRoX3Byb3ZpZGVyX3g1MDlfY2VydF91cmwiOiAiaHR0cHM6Ly93d3cuZ29vZ2xlYXBpcy5jb20
vb2F1dGgyL3YxL2NlcnRzIiwKICAiY2xpZW50X3g1MDlfY2VydF91cmwiOiAiaHR0cHM6Ly93d3cuZ29vZ2xlYXBp
cy5jb20vcm9ib3QvdjEvbWV0YWRhdGEveDUwOS88c2VydmljZT4lNDA8cHJvamVjdD4uaWFtLmdzZXJ2aWNlYWNjb
3VudC5jb20iCn0=
"""
Representing the string on a single line fixed my issue:
credentials="ewogICJ0eXBlIjogInNlcnZpY2VfYWNjb3VudCIsCiAgInByb2plY3RfaWQiOiAiPE1ZX1BST0pFQ1RfTkFNRT4iLAogICJwcml2YXRlX2tleV9pZCI6ICI8UFJJVkFURV9LRVlfSUQ+IiwKICAicHJpdmF0ZV9rZXkiOiAiLS0tLS1CRUdJTiBQUklWQVRFIEtFWS0tLS0tPExPTkcgTElTVCBPRiBDSEFSUz4tLS0tLUVORCBQUklWQVRFIEtFWS0tLS0tXG4iLAogICJjbGllbnRfZW1haWwiOiAic2VydmljZUBwcm9qZWN0LmlhbS5nc2VydmljZWFjY291bnQuY29tIiwKICAiY2xpZW50X2lkIjogIjxDTElFTlRfSUQ+IiwKICAiYXV0aF91cmkiOiAiaHR0cHM6Ly9hY2NvdW50cy5nb29nbGUuY29tL28vb2F1dGgyL2F1dGgiLAogICJ0b2tlbl91cmkiOiAiaHR0cHM6Ly9vYXV0aDIuZ29vZ2xlYXBpcy5jb20vdG9rZW4iLAogICJhdXRoX3Byb3ZpZGVyX3g1MDlfY2VydF91cmwiOiAiaHR0cHM6Ly93d3cuZ29vZ2xlYXBpcy5jb20vb2F1dGgyL3YxL2NlcnRzIiwKICAiY2xpZW50X3g1MDlfY2VydF91cmwiOiAiaHR0cHM6Ly93d3cuZ29vZ2xlYXBpcy5jb20vcm9ib3QvdjEvbWV0YWRhdGEveDUwOS88c2VydmljZT4lNDA8cHJvamVjdD4uaWFtLmdzZXJ2aWNlYWNjb3VudC5jb20iCn0="
As mentioned in the comments, you can also keep a multi-line layout by ending each line with a backslash, which escapes the newline so it never ends up in the string.
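A minimal sketch of that variant (the Base64 content here is truncated for illustration):
# Inside a non-raw string literal, a backslash immediately before the line
# break escapes it, so no newline (0xa) characters end up in the value.
credentials = "ewogICJ0eXBlIjogInNlcnZpY2VfYWNjb3VudCIs\
CiAgInByb2plY3RfaWQiOiAiPE1ZX1BST0pFQ1RfTkFNRT4iLA..."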

Related

pyspark configuration for connecting google cloud platform

I can't connect to Google Cloud Platform via PySpark; can anyone help?
I am not using Dataproc, just a local Spark instance.
Background:
I have downloaded all the jar files into $SPARK_HOME/jars, including:
google-api-client-2.0.0.jar
google-auth-library-credentials-1.12.1.jar
google-auth-library-oauth2-http-1.12.1.jar
google-http-client-1.42.2.jar
gcs-connector-hadoop3-2.2.8.jar
guava-14.0.1.jar
guava-31.1-jre.jar
I am using the Docker image: jupyter-notebook
Code:
from pyspark.sql import SparkSession

builder = SparkSession.builder.appName('GCSFilesRead') \
    .config("google.cloud.auth.service.account.enable", "true") \
    .config("google.cloud.auth.service.account.json.keyfile", "/home/jovyan/work/gcs_admin.json") \
    .config('fs.gs.auth.type', 'SERVICE_ACCOUNT_JSON_KEYFILE')
spark = builder.getOrCreate()

import os
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/home/jovyan/work/gcs_admin.json'

spark._jsc.hadoopConfiguration().set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
spark._jsc.hadoopConfiguration().set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")

bucket_name = "mybucket"
path = f"gs://{bucket_name}/my_file.csv"
df = spark.read.option("header", True).csv(path)
df.show()
Py4JJavaError: An error occurred while calling o59.csv.
: java.lang.NoClassDefFoundError: com/google/api/client/auth/oauth2/Credential
at java.base/java.lang.ClassLoader.defineClass1(Native Method)
at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1012)
at java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150)
at java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:862)
at java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:760)
at java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:681)
at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:639)
at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:520)
at java.base/java.lang.Class.forName0(Native Method)
at java.base/java.lang.Class.forName(Class.java:467)
at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2625)
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2590)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2686)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3431)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3466)
at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:53)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:370)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:228)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:210)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:210)
at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:537)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:568)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: java.lang.ClassNotFoundException: com.google.api.client.auth.oauth2.Credential
at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:641)
at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:520)
... 40 more

Py4JJavaError An error occurred while calling o37.load. : net.snowflake.client.jdbc.SnowflakeSQLException: JDBC driver encountered communication error

I am trying to connect PySpark to Snowflake and have been continuously getting the error below.
I am using PySpark 3.1; the JDBC and spark-snowflake jar files are also placed on the classpath. I am not sure why I am getting the following error.
Code:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *

sc = SparkContext("local", "Test App")
spark = SQLContext(sc)
spark_conf = SparkConf().setMaster('local').setAppName('Testing Spark SF')

sfOptions = {
    "sfURL": "<account_identifier>.snowflakecomputing.com",
    "sfUser": "<user_name>",
    "sfPassword": "<password>",
    "sfDatabase": "<database>",
    "sfSchema": "<schema>",
    "sfWarehouse": "<warehouse>"
}
SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"
df = spark.read.format(SNOWFLAKE_SOURCE_NAME) \
.options(**sfOptions) \
.option("query", "select 1 as my_num union all select 2 as my_num") \
.load()
df.show()
Error:
Py4JJavaError: An error occurred while calling o37.load. :
net.snowflake.client.jdbc.SnowflakeSQLException: JDBC driver encountered communication error. Message: Exception encountered for HTTP request: sun.security.validator.ValidatorException: No trusted certificate found.
at net.snowflake.client.jdbc.RestRequest.execute(RestRequest.java:284)
at net.snowflake.client.core.HttpUtil.executeRequestInternal(HttpUtil.java:639)
at net.snowflake.client.core.HttpUtil.executeRequest(HttpUtil.java:584)
at net.snowflake.client.core.HttpUtil.executeGeneralRequest(HttpUtil.java:551)
at net.snowflake.client.core.SessionUtil.newSession(SessionUtil.java:587)
at net.snowflake.client.core.SessionUtil.openSession(SessionUtil.java:285)
at net.snowflake.client.core.SFSession.open(SFSession.java:446)
at net.snowflake.client.jdbc.DefaultSFConnectionHandler.initialize(DefaultSFConnectionHandler.java:104)
at net.snowflake.client.jdbc.DefaultSFConnectionHandler.initializeConnection(DefaultSFConnectionHandler.java:79)
at net.snowflake.client.jdbc.SnowflakeConnectionV1.initConnectionWithImpl(SnowflakeConnectionV1.java:116)
at net.snowflake.client.jdbc.SnowflakeConnectionV1.<init>(SnowflakeConnectionV1.java:96)
at net.snowflake.client.jdbc.SnowflakeDriver.connect(SnowflakeDriver.java:172)
at java.sql.DriverManager.getConnection(DriverManager.java:664)
at java.sql.DriverManager.getConnection(DriverManager.java:208)
at net.snowflake.spark.snowflake.JDBCWrapper.getConnector(SnowflakeJDBCWrapper.scala:209)
at net.snowflake.spark.snowflake.SnowflakeRelation.$anonfun$schema$1(SnowflakeRelation.scala:60)
at net.snowflake.spark.snowflake.SnowflakeRelation$$Lambda$866/22415031.apply(Unknown Source)
at scala.Option.getOrElse(Option.scala:189)
at net.snowflake.spark.snowflake.SnowflakeRelation.schema$lzycompute(SnowflakeRelation.scala:57)
at net.snowflake.spark.snowflake.SnowflakeRelation.schema(SnowflakeRelation.scala:56)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:449)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307)
at org.apache.spark.sql.DataFrameReader$$Lambda$858/5135046.apply(Unknown Source)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:225)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:745)
Caused by: javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: No trusted certificate found
at sun.security.ssl.Alerts.getSSLException(Alerts.java:198)
at sun.security.ssl.SSLSocketImpl.fatal(SSLSocketImpl.java:1958)
at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:322)
at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:316)
at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1526)
at sun.security.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:215)
at sun.security.ssl.Handshaker.processLoop(Handshaker.java:1024)
at sun.security.ssl.Handshaker.process_record(Handshaker.java:954)
at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:1065)
at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1384)
at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1412)
at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1396)
at net.snowflake.client.jdbc.internal.apache.http.conn.ssl.SSLConnectionSocketFactory.createLayeredSocket(SSLConnectionSocketFactory.java:436)
Clearly the SSL certificate validation is failing. You can override it temporarily:
sfOptions = {
    ...
    "sfSSL": "false",
}
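For example, a sketch of re-running the read with the override applied, reusing sfOptions and SNOWFLAKE_SOURCE_NAME from the question (for diagnosis only, not production):
# Temporarily disable SSL to confirm the trust-store problem.
sfOptions["sfSSL"] = "false"
df = spark.read.format(SNOWFLAKE_SOURCE_NAME) \
    .options(**sfOptions) \
    .option("query", "select 1 as my_num") \
    .load()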
However, you should also check whether you access Snowflake through a proxy.
If so, you will need to import the proxy's certificate into your cacerts trust store.
The default cacerts of the Java version you are running is located inside the Java home directory under lib/security.
keytool -import -trustcacerts -alias cert_ssl -file proxy.cer -noprompt -storepass changeit -keystore cacerts
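A minimal sketch of locating that default trust store from Python, assuming JAVA_HOME points at the JVM Spark runs on (on JDK 8 the path may be jre/lib/security instead):
import os

# Assumes JAVA_HOME is set; pass the result to keytool's -keystore argument.
cacerts = os.path.join(os.environ["JAVA_HOME"], "lib", "security", "cacerts")
print(cacerts)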

Databricks checkpoint java.io.FileNotFoundException: No such file or directory:

I am trying to execute this writeStream:
def _write_stream(data_frame, checkpoint_path, write_stream_path):
    data_frame.writeStream.format("delta") \
        .option("checkpointLocation", checkpoint_path) \
        .trigger(processingTime="1 second") \
        .option("mergeSchema", "true") \
        .outputMode("append") \
        .table(write_stream_path)
but I get this error:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:428)
at org.apache.spark.util.ThreadUtils$.parallelMap(ThreadUtils.scala:399)
at com.databricks.sql.streaming.state.RocksDBFileManager.loadImmutableFilesFromDbfs(RocksDBFileManager.scala:433)
at com.databricks.sql.streaming.state.RocksDBFileManager.loadCheckpointFromDbfs(RocksDBFileManager.scala:202)
at com.databricks.sql.rocksdb.CloudRocksDB.$anonfun$open$5(CloudRocksDB.scala:437)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.util.Utils$.timeTakenMs(Utils.scala:627)
at com.databricks.sql.rocksdb.CloudRocksDB.timeTakenMs(CloudRocksDB.scala:523)
at com.databricks.sql.rocksdb.CloudRocksDB.$anonfun$open$2(CloudRocksDB.scala:435)
at com.databricks.logging.UsageLogging.$anonfun$recordOperation$1(UsageLogging.scala:369)
at com.databricks.logging.UsageLogging.executeThunkAndCaptureResultTags$1(UsageLogging.scala:457)
at com.databricks.logging.UsageLogging.$anonfun$recordOperationWithResultTags$4(UsageLogging.scala:477)
at com.databricks.logging.UsageLogging.$anonfun$withAttributionContext$1(UsageLogging.scala:240)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
at com.databricks.logging.UsageLogging.withAttributionContext(UsageLogging.scala:235)
at com.databricks.logging.UsageLogging.withAttributionContext$(UsageLogging.scala:232)
at com.databricks.spark.util.PublicDBLogging.withAttributionContext(DatabricksSparkUsageLogger.scala:20)
at com.databricks.logging.UsageLogging.withAttributionTags(UsageLogging.scala:279)
at com.databricks.logging.UsageLogging.withAttributionTags$(UsageLogging.scala:271)
at com.databricks.spark.util.PublicDBLogging.withAttributionTags(DatabricksSparkUsageLogger.scala:20)
at com.databricks.logging.UsageLogging.recordOperationWithResultTags(UsageLogging.scala:452)
at com.databricks.logging.UsageLogging.recordOperationWithResultTags$(UsageLogging.scala:378)
at com.databricks.spark.util.PublicDBLogging.recordOperationWithResultTags(DatabricksSparkUsageLogger.scala:20)
at com.databricks.logging.UsageLogging.recordOperation(UsageLogging.scala:369)
at com.databricks.logging.UsageLogging.recordOperation$(UsageLogging.scala:341)
at com.databricks.spark.util.PublicDBLogging.recordOperation(DatabricksSparkUsageLogger.scala:20)
at com.databricks.spark.util.PublicDBLogging.recordOperation0(DatabricksSparkUsageLogger.scala:57)
at com.databricks.spark.util.DatabricksSparkUsageLogger.recordOperation(DatabricksSparkUsageLogger.scala:125)
at com.databricks.spark.util.UsageLogger.recordOperation(UsageLogger.scala:70)
at com.databricks.spark.util.UsageLogger.recordOperation$(UsageLogger.scala:57)
at com.databricks.spark.util.DatabricksSparkUsageLogger.recordOperation(DatabricksSparkUsageLogger.scala:86)
at com.databricks.spark.util.UsageLogging.recordOperation(UsageLogger.scala:402)
at com.databricks.spark.util.UsageLogging.recordOperation$(UsageLogger.scala:381)
at com.databricks.sql.rocksdb.CloudRocksDB.recordOperation(CloudRocksDB.scala:52)
at com.databricks.sql.rocksdb.CloudRocksDB.recordRocksDBOperation(CloudRocksDB.scala:542)
at com.databricks.sql.rocksdb.CloudRocksDB.$anonfun$open$1(CloudRocksDB.scala:427)
at com.databricks.backend.daemon.driver.ProgressReporter$.withStatusCode(ProgressReporter.scala:377)
at com.databricks.backend.daemon.driver.ProgressReporter$.withStatusCode(ProgressReporter.scala:363)
at com.databricks.spark.util.SparkDatabricksProgressReporter$.withStatusCode(ProgressReporter.scala:34)
at com.databricks.sql.rocksdb.CloudRocksDB.open(CloudRocksDB.scala:427)
at com.databricks.sql.rocksdb.CloudRocksDB.<init>(CloudRocksDB.scala:80)
at com.databricks.sql.rocksdb.CloudRocksDB$.open(CloudRocksDB.scala:595)
at com.databricks.sql.fileNotification.autoIngest.CloudFilesSource.<init>(CloudFilesSource.scala:82)
at com.databricks.sql.fileNotification.autoIngest.CloudFilesNotificationSource.<init>(CloudFilesNotificationSource.scala:44)
at com.databricks.sql.fileNotification.autoIngest.CloudFilesSourceProvider.createSource(CloudFilesSourceProvider.scala:172)
at org.apache.spark.sql.execution.datasources.DataSource.createSource(DataSource.scala:326)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$1.$anonfun$applyOrElse$1(MicroBatchExecution.scala:100)
at scala.collection.mutable.HashMap.getOrElseUpdate(HashMap.scala:86)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$1.applyOrElse(MicroBatchExecution.scala:97)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$1.applyOrElse(MicroBatchExecution.scala:95)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:484)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:86)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:484)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:262)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:258)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:460)
at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:428)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.planQuery(MicroBatchExecution.scala:95)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.logicalPlan$lzycompute(MicroBatchExecution.scala:165)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.logicalPlan(MicroBatchExecution.scala:165)
at org.apache.spark.sql.execution.streaming.StreamExecution.$anonfun$runStream$1(StreamExecution.scala:349)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:852)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:341)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:268)
Caused by: java.io.FileNotFoundException: No such file or directory: s3:///**/*/checkpoint/sources/0/rocksdb/SSTs/.sst
at shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:3254)
at shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:3137)
at shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:3076)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:337)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:289)
at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2034)
at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2003)
at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1979)
at com.databricks.sql.streaming.state.RocksDBFileManager.$anonfun$loadImmutableFilesFromDbfs$6(RocksDBFileManager.scala:442)
at com.databricks.sql.streaming.state.RocksDBFileManager.$anonfun$loadImmutableFilesFromDbfs$6$adapted(RocksDBFileManager.scala:433)
at org.apache.spark.util.ThreadUtils$.$anonfun$parallelMap$2(ThreadUtils.scala:397)
at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659)
at scala.util.Success.$anonfun$map$1(Try.scala:255)
at scala.util.Success.map(Try.scala:213)
at scala.concurrent.Future.$anonfun$map$1(Future.scala:292)
at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33)
at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
at org.apache.spark.util.threads.SparkThreadLocalCapturingRunnable.$anonfun$run$1(SparkThreadLocalForwardingThreadPoolExecutor.scala:104)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.util.threads.SparkThreadLocalCapturingHelper.runWithCaptured(SparkThreadLocalForwardingThreadPoolExecutor.scala:68)
at org.apache.spark.util.threads.SparkThreadLocalCapturingHelper.runWithCaptured$(SparkThreadLocalForwardingThreadPoolExecutor.scala:54)
at org.apache.spark.util.threads.SparkThreadLocalCapturingRunnable.runWithCaptured(SparkThreadLocalForwardingThreadPoolExecutor.scala:101)
at org.apache.spark.util.threads.SparkThreadLocalCapturingRunnable.run(SparkThreadLocalForwardingThreadPoolExecutor.scala:104)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Please check whether the checkpoint_path location exists. The error log clearly says the path does not exist:
Caused by: java.io.FileNotFoundException: No such file or directory: s3:///**/*/checkpoint/sources/0/rocksdb/SSTs/.sst
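A minimal sketch of a pre-flight check, assuming this runs in a Databricks notebook (where dbutils is available) and checkpoint_path is the same URI passed to the function above:
# dbutils.fs.ls raises if the path does not exist, so create it up front.
try:
    dbutils.fs.ls(checkpoint_path)
except Exception:
    dbutils.fs.mkdirs(checkpoint_path)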

Issue while reading delta file placed under wasb storage from mapr cluster

I am trying to read a Delta-format file from Azure storage using the code below in a Jupyter notebook running on a MapR cluster.
When I run this code it throws java.lang.NoSuchMethodException: org.apache.hadoop.fs.azure.NativeAzureFileSystem.<init>(java.net.URI, org.apache.hadoop.conf.Configuration)
%%configure -f
{ "conf": {"spark.jars.packages": "io.delta:delta-core_2.11:0.5.0,org.apache.hadoop:hadoop-azure:3.3.1"}, "kind": "spark",
"driverMemory" : "4g", "executorMemory": "2g", "executorCores": 3, "numExecutors":4}
val storage_account_name_dim = "storageAcc"
val sas_access_key_dim = "saskey"
spark.conf.set("spark.delta.logStore.class","org.apache.spark.sql.delta.storage.AzureLogStore")
spark.conf.set("fs.AbstractFileSystem.wasbs.impl", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
spark.conf.set(s"fs.azure.sas.dsr.$storage_account_name_dim.blob.core.windows.net", sas_access_key_dim)
val refdf = spark.read.format("delta").load("wasbs://dsr@storageAcc.blob.core.windows.net/dim/ds/DS_CUSTOM_WAG_VENDOR_UPC")
refdf.show(1)
It gives me the issue below:
java.lang.RuntimeException: java.lang.NoSuchMethodException: org.apache.hadoop.fs.azure.NativeAzureFileSystem.<init>(java.net.URI, org.apache.hadoop.conf.Configuration)
at org.apache.hadoop.fs.AbstractFileSystem.newInstance(AbstractFileSystem.java:135)
at org.apache.hadoop.fs.AbstractFileSystem.createFileSystem(AbstractFileSystem.java:164)
at org.apache.hadoop.fs.AbstractFileSystem.get(AbstractFileSystem.java:249)
at org.apache.hadoop.fs.FileContext$2.run(FileContext.java:334)
at org.apache.hadoop.fs.FileContext$2.run(FileContext.java:331)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1669)
at org.apache.hadoop.fs.FileContext.getAbstractFileSystem(FileContext.java:331)
at org.apache.hadoop.fs.FileContext.getFileContext(FileContext.java:448)
at org.apache.spark.sql.delta.storage.HDFSLogStore.getFileContext(HDFSLogStore.scala:53)
at org.apache.spark.sql.delta.storage.HDFSLogStore.listFrom(HDFSLogStore.scala:137)
at org.apache.spark.sql.delta.Checkpoints$class.findLastCompleteCheckpoint(Checkpoints.scala:174)
at org.apache.spark.sql.delta.DeltaLog.findLastCompleteCheckpoint(DeltaLog.scala:58)
at org.apache.spark.sql.delta.Checkpoints$class.loadMetadataFromFile(Checkpoints.scala:156)
at org.apache.spark.sql.delta.Checkpoints$class.loadMetadataFromFile(Checkpoints.scala:150)
at org.apache.spark.sql.delta.Checkpoints$class.loadMetadataFromFile(Checkpoints.scala:150)
at org.apache.spark.sql.delta.Checkpoints$class.loadMetadataFromFile(Checkpoints.scala:150)
at org.apache.spark.sql.delta.Checkpoints$class.lastCheckpoint(Checkpoints.scala:133)
at org.apache.spark.sql.delta.DeltaLog.lastCheckpoint(DeltaLog.scala:58)
at org.apache.spark.sql.delta.DeltaLog.<init>(DeltaLog.scala:139)
at org.apache.spark.sql.delta.DeltaLog$$anon$3$$anonfun$call$1$$anonfun$apply$10.apply(DeltaLog.scala:744)
at org.apache.spark.sql.delta.DeltaLog$$anon$3$$anonfun$call$1$$anonfun$apply$10.apply(DeltaLog.scala:744)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194)
at org.apache.spark.sql.delta.DeltaLog$$anon$3$$anonfun$call$1.apply(DeltaLog.scala:743)
at org.apache.spark.sql.delta.DeltaLog$$anon$3$$anonfun$call$1.apply(DeltaLog.scala:743)
at com.databricks.spark.util.DatabricksLogging$class.recordOperation(DatabricksLogging.scala:77)
at org.apache.spark.sql.delta.DeltaLog$.recordOperation(DeltaLog.scala:671)
at org.apache.spark.sql.delta.metering.DeltaLogging$class.recordDeltaOperation(DeltaLogging.scala:103)
at org.apache.spark.sql.delta.DeltaLog$.recordDeltaOperation(DeltaLog.scala:671)
at org.apache.spark.sql.delta.DeltaLog$$anon$3.call(DeltaLog.scala:742)
at org.apache.spark.sql.delta.DeltaLog$$anon$3.call(DeltaLog.scala:740)
at com.google.common.cache.LocalCache$LocalManualCache$1.load(LocalCache.java:4792)
at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2257)
at com.google.common.cache.LocalCache.get(LocalCache.java:4000)
at com.google.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4789)
at org.apache.spark.sql.delta.DeltaLog$.apply(DeltaLog.scala:740)
at org.apache.spark.sql.delta.DeltaLog$.forTable(DeltaLog.scala:712)
at org.apache.spark.sql.delta.sources.DeltaDataSource.createRelation(DeltaDataSource.scala:169)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
... 50 elided
Caused by: java.lang.NoSuchMethodException: org.apache.hadoop.fs.azure.NativeAzureFileSystem.<init>(java.net.URI, org.apache.hadoop.conf.Configuration)
at java.lang.Class.getConstructor0(Class.java:3082)
at java.lang.Class.getDeclaredConstructor(Class.java:2178)
at org.apache.hadoop.fs.AbstractFileSystem.newInstance(AbstractFileSystem.java:129)
... 95 more
The same code works for Parquet files; any help will be much appreciated.
I resolved it by adding spark.delta.logStore.class to the conf parameter:
{ "conf": {"spark.jars.packages": "io.delta:delta-core_2.11:0.5.0,org.apache.hadoop:hadoop-azure:3.3.1","spark.delta.logStore.class":"org.apache.spark.sql.delta.storage.AzureLogStore"}, "kind": "spark","driverMemory" : "4g", "executorMemory": "2g", "executorCores": 3, "numExecutors":4}

How to authenticate with BigQuery from Apache Spark (pyspark)?

I have created a client ID and client secret for my BigQuery project, but I don't know how to use them to save a dataframe from a PySpark script to my BigQuery table. My Python code below results in the following error. Is there a way I can connect to BigQuery using the save options on the PySpark dataframe?
Code
df.write \
.format("bigquery") \
.option("client_id", "<MY_CLIENT_ID>") \
.option("client_secret", "<MY_CLIENT_SECRET>") \
.option("project", "bigquery-project-id") \
.option("table", "dataset.table") \
.save()
Error
py4j.protocol.Py4JJavaError: An error occurred while calling o93.save.
: com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryException: 400 Bad Request { "error": "invalid_grant", "error_description": "Bad Request" }
at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.spi.v2.HttpBigQueryRpc.translate(HttpBigQueryRpc.java:106)
at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.spi.v2.HttpBigQueryRpc.getTable(HttpBigQueryRpc.java:268)
at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryImpl$17.call(BigQueryImpl.java:664)
at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryImpl$17.call(BigQueryImpl.java:661)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.retrying.DirectRetryingExecutor.submit(DirectRetryingExecutor.java:105)
at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.RetryHelper.run(RetryHelper.java:76)
at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.RetryHelper.runWithRetries(RetryHelper.java:50)
at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryImpl.getTable(BigQueryImpl.java:660)
at com.google.cloud.spark.bigquery.BigQueryInsertableRelation.getTable(BigQueryInsertableRelation.scala:68)
at com.google.cloud.spark.bigquery.BigQueryInsertableRelation.exists(BigQueryInsertableRelation.scala:54)
at com.google.cloud.spark.bigquery.BigQueryRelationProvider.createRelation(BigQueryRelationProvider.scala:86)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: com.google.cloud.spark.bigquery.repackaged.com.google.api.client.http.HttpResponseException: 400 Bad Request { "error": "invalid_grant", "error_description": "Bad Request" }
From the spark-bigquery-connector documentation:
How do I authenticate outside GCE / Dataproc?
Use a service account JSON key and GOOGLE_APPLICATION_CREDENTIALS as described here.
Credentials can also be provided explicitly, either as a parameter or from Spark runtime configuration. It can be passed in as a base64-encoded string directly, or a file path that contains the credentials (but not both).
So you should be using this:
spark.read.format("bigquery").option("credentialsFile", "</path/to/key/file>")
