How to authenticate with BigQuery from Apache Spark (pyspark)? - apache-spark

I have created a client ID and client secret for my BigQuery project, but I don't know how to use them to save a DataFrame from a PySpark script to my BigQuery table. My Python code below results in the following error. Is there a way I can connect to BigQuery using the save options on the PySpark DataFrame?
Code
df.write \
    .format("bigquery") \
    .option("client_id", "<MY_CLIENT_ID>") \
    .option("client_secret", "<MY_CLIENT_SECRET>") \
    .option("project", "bigquery-project-id") \
    .option("table", "dataset.table") \
    .save()
Error
py4j.protocol.Py4JJavaError: An error occurred while calling o93.save.
: com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryException: 400 Bad Request
{ "error": "invalid_grant", "error_description": "Bad Request" }
at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.spi.v2.HttpBigQueryRpc.translate(HttpBigQueryRpc.java:106)
at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.spi.v2.HttpBigQueryRpc.getTable(HttpBigQueryRpc.java:268)
at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryImpl$17.call(BigQueryImpl.java:664)
at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryImpl$17.call(BigQueryImpl.java:661)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.retrying.DirectRetryingExecutor.submit(DirectRetryingExecutor.java:105)
at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.RetryHelper.run(RetryHelper.java:76)
at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.RetryHelper.runWithRetries(RetryHelper.java:50)
at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryImpl.getTable(BigQueryImpl.java:660)
at com.google.cloud.spark.bigquery.BigQueryInsertableRelation.getTable(BigQueryInsertableRelation.scala:68)
at com.google.cloud.spark.bigquery.BigQueryInsertableRelation.exists(BigQueryInsertableRelation.scala:54)
at com.google.cloud.spark.bigquery.BigQueryRelationProvider.createRelation(BigQueryRelationProvider.scala:86)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: com.google.cloud.spark.bigquery.repackaged.com.google.api.client.http.HttpResponseException: 400 Bad Request
{ "error": "invalid_grant", "error_description": "Bad Request" }

From the spark-bigquery-connector documentation:
How do I authenticate outside GCE / Dataproc?
Use a service account JSON key and GOOGLE_APPLICATION_CREDENTIALS as
described here.
Credentials can also be provided explicitly, either as a parameter or from the Spark runtime configuration. They can be passed in as a base64-encoded string directly, or as a path to a file that contains the credentials (but not both).
So you should be using this:
spark.read.format("bigquery").option("credentialsFile", "</path/to/key/file>")
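
Applied to the write in the question, that looks roughly like the sketch below. The key file path and the temporaryGcsBucket value are placeholders (and, depending on the connector version, a temporary GCS bucket is needed for the indirect write path); setting GOOGLE_APPLICATION_CREDENTIALS to the same key file is an equivalent alternative. The connector authenticates with a service-account key (or application default credentials) rather than an OAuth client id/secret pair, which is likely why the client_id/client_secret options lead to the invalid_grant error.
df.write \
    .format("bigquery") \
    .option("credentialsFile", "/path/to/key.json") \
    .option("project", "bigquery-project-id") \
    .option("table", "dataset.table") \
    .option("temporaryGcsBucket", "some-staging-bucket") \
    .save()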

Related

PySpark configuration for connecting to Google Cloud Platform

I can't connect to Google Cloud Platform via PySpark; can anyone help? I am not using Dataproc, just a local Spark instance.
Background:
I have downloaded all the JAR files into $SPARK_HOME/jars, including:
google-api-client-2.0.0.jar
google-auth-library-credentials-1.12.1.jar
google-auth-library-oauth2-http-1.12.1.jar
google-http-client-1.42.2.jar
gcs-connector-hadoop3-2.2.8.jar
guava-14.0.1.jar
guava-31.1-jre.jar
I am using the Docker image: jupyter-notebook
Code:
from pyspark.sql import SparkSession
builder = SparkSession.builder.appName('GCSFilesRead') \
    .config("google.cloud.auth.service.account.enable", "true") \
    .config("google.cloud.auth.service.account.json.keyfile", "/home/jovyan/work/gcs_admin.json") \
    .config('fs.gs.auth.type', 'SERVICE_ACCOUNT_JSON_KEYFILE')
spark = builder.getOrCreate()
import os
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/home/jovyan/work/gcs_admin.json'
spark._jsc.hadoopConfiguration().set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
spark._jsc.hadoopConfiguration().set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
bucket_name = "mybucket"
path = f"gs://{bucket_name}/my_file.csv"
df = spark.read.option("header", True).csv(path)
df.show()
Py4JJavaError: An error occurred while calling o59.csv.
: java.lang.NoClassDefFoundError: com/google/api/client/auth/oauth2/Credential
at java.base/java.lang.ClassLoader.defineClass1(Native Method)
at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1012)
at java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150)
at java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:862)
at java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:760)
at java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:681)
at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:639)
at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:520)
at java.base/java.lang.Class.forName0(Native Method)
at java.base/java.lang.Class.forName(Class.java:467)
at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2625)
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2590)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2686)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3431)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3466)
at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:53)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:370)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:228)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:210)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:210)
at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:537)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:568)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: java.lang.ClassNotFoundException: com.google.api.client.auth.oauth2.Credential
at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:641)
at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:520)
... 40 more

Py4JJavaError An error occurred while calling o37.load. : net.snowflake.client.jdbc.SnowflakeSQLException: JDBC driver encountered communication error

I am trying to connect PySpark to Snowflake but keep getting the error below.
I am using PySpark 3.1; the JDBC and Spark-Snowflake JAR files are also placed on the classpath. I am not sure why I am getting the following error.
Code:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *

sc = SparkContext("local", "Test App")
spark = SQLContext(sc)
spark_conf = SparkConf().setMaster('local').setAppName('Testing Spark SF')
sfOptions = {
    "sfURL": "<account_identifier>.snowflakecomputing.com",
    "sfUser": "<user_name>",
    "sfPassword": "<password>",
    "sfDatabase": "<database>",
    "sfSchema": "<schema>",
    "sfWarehouse": "<warehouse>"
}
SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"
df = spark.read.format(SNOWFLAKE_SOURCE_NAME) \
    .options(**sfOptions) \
    .option("query", "select 1 as my_num union all select 2 as my_num") \
    .load()
df.show()
Error:
Py4JJavaError: An error occurred while calling o37.load.
: net.snowflake.client.jdbc.SnowflakeSQLException: JDBC driver encountered communication error. Message: Exception encountered for HTTP request: sun.security.validator.ValidatorException: No trusted certificate found.
at net.snowflake.client.jdbc.RestRequest.execute(RestRequest.java:284)
at net.snowflake.client.core.HttpUtil.executeRequestInternal(HttpUtil.java:639)
at net.snowflake.client.core.HttpUtil.executeRequest(HttpUtil.java:584)
at net.snowflake.client.core.HttpUtil.executeGeneralRequest(HttpUtil.java:551)
at net.snowflake.client.core.SessionUtil.newSession(SessionUtil.java:587)
at net.snowflake.client.core.SessionUtil.openSession(SessionUtil.java:285)
at net.snowflake.client.core.SFSession.open(SFSession.java:446)
at net.snowflake.client.jdbc.DefaultSFConnectionHandler.initialize(DefaultSFConnectionHandler.java:104)
at net.snowflake.client.jdbc.DefaultSFConnectionHandler.initializeConnection(DefaultSFConnectionHandler.java:79)
at net.snowflake.client.jdbc.SnowflakeConnectionV1.initConnectionWithImpl(SnowflakeConnectionV1.java:116)
at net.snowflake.client.jdbc.SnowflakeConnectionV1.<init>(SnowflakeConnectionV1.java:96)
at net.snowflake.client.jdbc.SnowflakeDriver.connect(SnowflakeDriver.java:172)
at java.sql.DriverManager.getConnection(DriverManager.java:664)
at java.sql.DriverManager.getConnection(DriverManager.java:208)
at net.snowflake.spark.snowflake.JDBCWrapper.getConnector(SnowflakeJDBCWrapper.scala:209)
at net.snowflake.spark.snowflake.SnowflakeRelation.$anonfun$schema$1(SnowflakeRelation.scala:60)
at net.snowflake.spark.snowflake.SnowflakeRelation$$Lambda$866/22415031.apply(Unknown Source)
at scala.Option.getOrElse(Option.scala:189)
at net.snowflake.spark.snowflake.SnowflakeRelation.schema$lzycompute(SnowflakeRelation.scala:57)
at net.snowflake.spark.snowflake.SnowflakeRelation.schema(SnowflakeRelation.scala:56)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:449)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307)
at org.apache.spark.sql.DataFrameReader$$Lambda$858/5135046.apply(Unknown Source)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:225)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:745)
Caused by: javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: No trusted certificate found
at sun.security.ssl.Alerts.getSSLException(Alerts.java:198)
at sun.security.ssl.SSLSocketImpl.fatal(SSLSocketImpl.java:1958)
at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:322)
at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:316)
at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1526)
at sun.security.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:215)
at sun.security.ssl.Handshaker.processLoop(Handshaker.java:1024)
at sun.security.ssl.Handshaker.process_record(Handshaker.java:954)
at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:1065)
at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1384)
at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1412)
at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1396)
at net.snowflake.client.jdbc.internal.apache.http.conn.ssl.SSLConnectionSocketFactory.createLayeredSocket(SSLConnectionSocketFactory.java:436)
You clearly have an issue with the SSL certificate. You can override the check temporarily:
sfOptions = {
    ...
    "sfSSL": "false",
}
However, you should check whether you are accessing Snowflake through a proxy.
If so, you need to import the proxy's certificate into your cacerts file.
The default cacerts location for the Java version you are running is inside the Java home directory under lib/security.
keytool -import -trustcacerts -alias cert_ssl -file proxy.cer -noprompt -storepass changeit -keystore cacerts
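If you import the certificate into a custom truststore, the Spark JVMs also need to be told to use it. A minimal PySpark sketch, assuming placeholder values for the truststore path and the default changeit password:
from pyspark.sql import SparkSession

# Placeholder path: the cacerts/truststore file the proxy certificate was imported into
trust_opts = ("-Djavax.net.ssl.trustStore=/path/to/cacerts "
              "-Djavax.net.ssl.trustStorePassword=changeit")

spark = SparkSession.builder \
    .appName("Testing Spark SF") \
    .config("spark.driver.extraJavaOptions", trust_opts) \
    .config("spark.executor.extraJavaOptions", trust_opts) \
    .getOrCreate()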

Local Spark connect to remote Hive and Hadoop

I have a single local Spark node and want to connect to a remote Hive, so I copied hive-site.xml to the spark/conf path. These are the main properties of this file:
<property>
  <name>hive.metastore.db.type</name>
  <value>DERBY</value>
  <description>
    Expects one of [derby, oracle, mysql, mssql, postgres].
    Type of database used by the metastore. Information schema & JDBCStorageHandler depend on it.
  </description>
</property>
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://remote_host:9083</value>
  <description>Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.</description>
</property>
<property>
  <name>spark.sql.warehouse.dir</name>
  <value>hdfs://remote_host:9000/user/hive/warehouse</value>
</property>
<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>hdfs://remote_host:9000/user/hive/warehouse</value>
  <description>location of default database for the warehouse</description>
</property>
Also, environment variables are set in the .bashrc file:
export SPARK_HOME=/opt/spark
export SPARK_CONF_DIR=/opt/spark/conf
Now, when running the code below, I encounter an error:
sc = SparkSession \
    .builder \
    .appName(config.spark["name"]) \
    .master(config.spark["master"]) \
    .config("spark.sql.debug.maxToStringFields", config.spark["maxToStringFields"]) \
    .config("spark.hadoop.fs.defaultFS", "hdfs://remote_host:9000") \
    .config("spark.sql.warehouse.dir", "hdfs://remote_host:9000/user/hive/warehouse") \
    .config("spark.hive.metastore.uris", "thrift://remote_host:9083") \
    .enableHiveSupport() \
    .getOrCreate()

sc.sql("use database1")
data_df = sc.sql("select * from table1")
data_df.show()
This is the error:
py4j.protocol.Py4JJavaError: An error occurred while calling o51.showString.
: java.net.ConnectException: Call From my-pc/127.0.0.1 to 0.0.0.0:9000 failed on
connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:732)
at org.apache.hadoop.ipc.Client.call(Client.java:1480)
at org.apache.hadoop.ipc.Client.call(Client.java:1413)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy22.getFileInfo(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:776)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy23.getFileInfo(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2108)
at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1305)
at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1317)
at org.apache.hadoop.fs.Globber.getFileStatus(Globber.java:57)
at org.apache.hadoop.fs.Globber.glob(Globber.java:252)
at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1676)
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:259)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:205)
at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:437)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:420)
at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:47)
at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3625)
at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2695)
at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3616)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3614)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2695)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2902)
at org.apache.spark.sql.Dataset.getRows(Dataset.scala:300)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:337)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.ConnectException: Connection refused
I don't understand where 0.0.0.0:9000 comes from.
How can I resolve this issue? Any help is much appreciated.
Based on the config shown, that's your HDFS NameNode on port 9000, which would be configured in core-site.xml.
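To see which value the client actually resolved, you can print the effective Hadoop configuration from the session created in the question (a minimal sketch; sc is the SparkSession above, and _jsc is PySpark's internal handle to the JVM context):
# If fs.defaultFS comes back as 0.0.0.0:9000, the value is being picked up from a
# core-site.xml (or a default) on the local Spark side rather than the remote_host setting.
hadoop_conf = sc.sparkContext._jsc.hadoopConfiguration()
print(hadoop_conf.get("fs.defaultFS"))
print(hadoop_conf.get("hive.metastore.uris"))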

Spark BigQuery Connector: BaseEncoding$DecodingException: Unrecognized character: 0xa

I am trying to save a Spark DataFrame to a BigQuery table from an AWS EMR cluster, using the spark-bigquery-connector. I have Base64-encoded my gcloud service-account credentials JSON file from the command line and am simply pasting the string into the credentials option. However, this does not work and causes the encoding error below. I know my JSON file is correct because I use it when running my script locally. What is causing this issue?
GCLOUD Service Credentials JSON File Structure
{
  "type": "service_account",
  "project_id": "<MY_PROJECT_NAME>",
  "private_key_id": "<PRIVATE_KEY_ID>",
  "private_key": "-----BEGIN PRIVATE KEY-----<LONG LIST OF CHARS>-----END PRIVATE KEY-----\n",
  "client_email": "service@project.iam.gserviceaccount.com",
  "client_id": "<CLIENT_ID>",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://oauth2.googleapis.com/token",
  "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
  "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/<service>%40<project>.iam.gserviceaccount.com"
}
Spark Code
df \
    .drop(*cols_to_drop) \
    .write \
    .format("bigquery") \
    .option("temporaryGcsBucket", "emr_spark") \
    .option("credentials", "<LONG_BASE64_STRING>") \
    .option("project", "<MY_PROJECT_NAME>") \
    .option("parentProject", "<MY_PROJECT_NAME>") \
    .option("table", "<MY_PROJECT_NAME>:dataset.table") \
    .mode("overwrite") \
    .save()
Error:
py4j.protocol.Py4JJavaError: An error occurred while calling o137.save.
: java.lang.IllegalArgumentException: com.google.cloud.spark.bigquery.repackaged.com.google.common.io.BaseEncoding$DecodingException: Unrecognized character: 0xa
at com.google.cloud.spark.bigquery.repackaged.com.google.common.io.BaseEncoding.decode(BaseEncoding.java:219)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.client.util.Base64.decodeBase64(Base64.java:104)
at com.google.cloud.spark.bigquery.SparkBigQueryOptions.createCredentials(SparkBigQueryOptions.scala:47)
at com.google.cloud.spark.bigquery.BigQueryRelationProvider$.createBigQuery(BigQueryRelationProvider.scala:125)
at com.google.cloud.spark.bigquery.BigQueryRelationProvider$$anonfun$getOrCreateBigQuery$1.apply(BigQueryRelationProvider.scala:107)
at com.google.cloud.spark.bigquery.BigQueryRelationProvider$$anonfun$getOrCreateBigQuery$1.apply(BigQueryRelationProvider.scala:107)
at scala.Option.getOrElse(Option.scala:121)
at com.google.cloud.spark.bigquery.BigQueryRelationProvider.getOrCreateBigQuery(BigQueryRelationProvider.scala:107)
at com.google.cloud.spark.bigquery.BigQueryRelationProvider.createRelation(BigQueryRelationProvider.scala:79)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: com.google.cloud.spark.bigquery.repackaged.com.google.common.io.BaseEncoding$DecodingException: Unrecognized character: 0xa
at com.google.cloud.spark.bigquery.repackaged.com.google.common.io.BaseEncoding$Alphabet.decode(BaseEncoding.java:490)
at com.google.cloud.spark.bigquery.repackaged.com.google.common.io.BaseEncoding$Base64Encoding.decodeTo(BaseEncoding.java:974)
at com.google.cloud.spark.bigquery.repackaged.com.google.common.io.BaseEncoding.decodeChecked(BaseEncoding.java:233)
at com.google.cloud.spark.bigquery.repackaged.com.google.common.io.BaseEncoding.decode(BaseEncoding.java:217)
... 39 more
I believe the issue arises because the connector expects string input, but Base64 encoding produces a bytes object. Simply UTF-8 decoding the Base64-encoded bytes solved my DecodingException.
import json
import base64

# Load the service-account key file into a dict (the path here is a placeholder)
with open("/path/to/service_account.json") as f:
    creds = json.load(f)

# Dump the JSON to a str, encode it to UTF-8 bytes, Base64-encode those bytes,
# then decode the result back into a plain string
creds64 = base64.b64encode(json.dumps(creds).encode('utf-8')).decode('utf-8')
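The resulting creds64 value is a plain single-line string that can be passed straight to the connector, e.g. (a sketch reusing the placeholders from the question):
df.write \
    .format("bigquery") \
    .option("credentials", creds64) \
    .option("temporaryGcsBucket", "emr_spark") \
    .option("table", "<MY_PROJECT_NAME>:dataset.table") \
    .mode("overwrite") \
    .save()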
The issue was how the string was represented in my Python script.
Representing the string across multiple lines caused the encoding error:
credentials="""
ewogICJ0eXBlIjogInNlcnZpY2VfYWNjb3VudCIsCiAgInByb2plY3RfaWQiOiAiPE1ZX1BST0pFQ1RfTkFNRT4iL
AogICJwcml2YXRlX2tleV9pZCI6ICI8UFJJVkFURV9LRVlfSUQ+IiwKICAicHJpdmF0ZV9rZXkiOiAiLS0tLS1CRU
dJTiBQUklWQVRFIEtFWS0tLS0tPExPTkcgTElTVCBPRiBDSEFSUz4tLS0tLUVORCBQUklWQVRFIEtFWS0tLS0tXG4
iLAogICJjbGllbnRfZW1haWwiOiAic2VydmljZUBwcm9qZWN0LmlhbS5nc2VydmljZWFjY291bnQuY29tIiwKICAi
Y2xpZW50X2lkIjogIjxDTElFTlRfSUQ+IiwKICAiYXV0aF91cmkiOiAiaHR0cHM6Ly9hY2NvdW50cy5nb29nbGUuY
29tL28vb2F1dGgyL2F1dGgiLAogICJ0b2tlbl91cmkiOiAiaHR0cHM6Ly9vYXV0aDIuZ29vZ2xlYXBpcy5jb20vdG
9rZW4iLAogICJhdXRoX3Byb3ZpZGVyX3g1MDlfY2VydF91cmwiOiAiaHR0cHM6Ly93d3cuZ29vZ2xlYXBpcy5jb20
vb2F1dGgyL3YxL2NlcnRzIiwKICAiY2xpZW50X3g1MDlfY2VydF91cmwiOiAiaHR0cHM6Ly93d3cuZ29vZ2xlYXBp
cy5jb20vcm9ib3QvdjEvbWV0YWRhdGEveDUwOS88c2VydmljZT4lNDA8cHJvamVjdD4uaWFtLmdzZXJ2aWNlYWNjb
3VudC5jb20iCn0=
"""
Representing the string on one line fixed my issue:
credentials="ewogICJ0eXBlIjogInNlcnZpY2VfYWNjb3VudCIsCiAgInByb2plY3RfaWQiOiAiPE1ZX1BST0pFQ1RfTkFNRT4iLAogICJwcml2YXRlX2tleV9pZCI6ICI8UFJJVkFURV9LRVlfSUQ+IiwKICAicHJpdmF0ZV9rZXkiOiAiLS0tLS1CRUdJTiBQUklWQVRFIEtFWS0tLS0tPExPTkcgTElTVCBPRiBDSEFSUz4tLS0tLUVORCBQUklWQVRFIEtFWS0tLS0tXG4iLAogICJjbGllbnRfZW1haWwiOiAic2VydmljZUBwcm9qZWN0LmlhbS5nc2VydmljZWFjY291bnQuY29tIiwKICAiY2xpZW50X2lkIjogIjxDTElFTlRfSUQ+IiwKICAiYXV0aF91cmkiOiAiaHR0cHM6Ly9hY2NvdW50cy5nb29nbGUuY29tL28vb2F1dGgyL2F1dGgiLAogICJ0b2tlbl91cmkiOiAiaHR0cHM6Ly9vYXV0aDIuZ29vZ2xlYXBpcy5jb20vdG9rZW4iLAogICJhdXRoX3Byb3ZpZGVyX3g1MDlfY2VydF91cmwiOiAiaHR0cHM6Ly93d3cuZ29vZ2xlYXBpcy5jb20vb2F1dGgyL3YxL2NlcnRzIiwKICAiY2xpZW50X3g1MDlfY2VydF91cmwiOiAiaHR0cHM6Ly93d3cuZ29vZ2xlYXBpcy5jb20vcm9ib3QvdjEvbWV0YWRhdGEveDUwOS88c2VydmljZT4lNDA8cHJvamVjdD4uaWFtLmdzZXJ2aWNlYWNjb3VudC5jb20iCn0="
As mentioned in the comments, you can also use a backslash at the end of each line.
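For example, a trailing backslash plus adjacent string literals keeps the value as one logical line with no embedded newlines (the fragments below are placeholders for pieces of the real Base64 string):
credentials = "<BASE64_PART_1>" \
              "<BASE64_PART_2>" \
              "<BASE64_PART_3>"
# Adjacent literals are concatenated, so this is equivalent to one long single-line string.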

Spark 2.3.0 giving error : Provider org.apache.spark.ml.source.libsvm.LibSVMFileFormat not a subtype

I am running Spark 2.3.0 and trying to load a dataset from MySQL (MariaDB)
using mysql-connector-java version 6.0.6:
val dataSet: Dataset[Row] = spark
  .read
  .format("jdbc")
  .options(Map(
    "url" -> jdbcUrl,
    "user" -> username,
    "password" -> password,
    "dbtable" -> dataSourceTableName,
    "driver" -> driver
  ))
  .load()
When I run it using spark-submit .. .., I get the following error:
Exception in thread "main" java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider org.apache.spark.ml.source.libsvm.LibSVMFileFormat not a subtype
at java.util.ServiceLoader.fail(ServiceLoader.java:239)
at java.util.ServiceLoader.access$300(ServiceLoader.java:185)
at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:376)
at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
at scala.collection.Iterator$class.foreach(Iterator.scala:742)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at scala.collection.TraversableLike$class.filterImpl(TraversableLike.scala:258)
at scala.collection.TraversableLike$class.filter(TraversableLike.scala:270)
at scala.collection.AbstractTraversable.filter(Traversable.scala:104)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:614)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:190)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:164)
at analytics.utils.QuantumDataSets$.loadDataSet(QuantumDataSets.scala:93)
at analytics.utils.QuantumDataSets$.<init>(QuantumDataSets.scala:50)
at analytics.utils.QuantumDataSets$.<clinit>(QuantumDataSets.scala)
at analytics.utils.QuantumDataSets.main(QuantumDataSets.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:879)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
