I'm currently using GCP and Dataproc, and I'm new to Apache Spark, PySpark, and Debian VMs. I'm trying to replicate inside a Dataproc cluster (Debian VM) a Spark job that runs perfectly on my local machine (Windows 10, VS Code, Spark 3.3.1): ingestion from SQL Server into a Spark dataframe via the JDBC driver.
Inside this Debian VM, the spark.read call works correctly, but dataframe.show() does not.
Debian VM Configuration:
Debian 10 with Hadoop 3.2 and Spark 3.1.3.
JDBC driver: mssql-jdbc-11.2.1.jre8.jar
Java version: openjdk version "1.8.0_352"
To get spark.read to work at all, I had to delete the java.security file at this path on the Debian VM: $JAVA_HOME/jre/lib/security/java.security
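For reference, this is roughly the step I ran on the VM (a sketch; renaming the file instead of deleting it makes it easy to restore later):
sudo mv $JAVA_HOME/jre/lib/security/java.security $JAVA_HOME/jre/lib/security/java.security.bak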
Pyspark launch
pyspark --jars gs://bucket/mssql-jdbc-11.2.1.jre8.jar
Parameters
server_name = "jdbc:sqlserver://sqlhost:1433;"
database_name = "dbname"
url = server_name + "databaseName=" + database_name + ";encrypt=false;"
query = "SELECT * FROM dbo.tablename"
username = "user"
password = "pass"
spark.read
dataFrame = spark.read \
    .format("jdbc") \
    .option("url", url) \
    .option("user", username) \
    .option("password", password) \
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
    .option("query", query) \
    .load()
Results in the Debian VM:
dataFrame.show(5)
22/11/17 13:08:16 WARN org.apache.spark.sql.catalyst.util.package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
22/11/17 13:09:57 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0) (cluster-name.internal executor 2): com.microsoft.sqlserver.jdbc.SQLServerException: The driver could not establish a secure connection to SQL Server by using Secure Sockets Layer (SSL) encryption. Error: "Connection reset ClientConnectionId:".
at com.microsoft.sqlserver.jdbc.SQLServerConnection.terminate(SQLServerConnection.java:3806)
at com.microsoft.sqlserver.jdbc.TDSChannel.enableSSL(IOBuffer.java:1906)
at com.microsoft.sqlserver.jdbc.SQLServerConnection.connectHelper(SQLServerConnection.java:3329)
at com.microsoft.sqlserver.jdbc.SQLServerConnection.login(SQLServerConnection.java:2950)
at com.microsoft.sqlserver.jdbc.SQLServerConnection.connectInternal(SQLServerConnection.java:2790)
at com.microsoft.sqlserver.jdbc.SQLServerConnection.connect(SQLServerConnection.java:1663)
at com.microsoft.sqlserver.jdbc.SQLServerDriver.connect(SQLServerDriver.java:1064)
at org.apache.spark.sql.execution.datasources.jdbc.connection.BasicConnectionProvider.getConnection(BasicConnectionProvider.scala:49)
at org.apache.spark.sql.execution.datasources.jdbc.connection.ConnectionProvider$.create(ConnectionProvider.scala:77)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$createConnectionFactory$1(JdbcUtils.scala:62)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:272)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:505)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:508)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.io.IOException: Connection reset ClientConnectionId:700149ae-3483-4315-8c2e-de1bc11ce6b3
at com.microsoft.sqlserver.jdbc.TDSChannel$SSLHandshakeInputStream.readInternal(IOBuffer.java:974)
at com.microsoft.sqlserver.jdbc.TDSChannel$SSLHandshakeInputStream.read(IOBuffer.java:961)
at com.microsoft.sqlserver.jdbc.TDSChannel$ProxyInputStream.readInternal(IOBuffer.java:1207)
at com.microsoft.sqlserver.jdbc.TDSChannel$ProxyInputStream.read(IOBuffer.java:1194)
at org.conscrypt.ConscryptEngineSocket$SSLInputStream.readFromSocket(ConscryptEngineSocket.java:920)
at org.conscrypt.ConscryptEngineSocket$SSLInputStream.processDataFromSocket(ConscryptEngineSocket.java:884)
at org.conscrypt.ConscryptEngineSocket$SSLInputStream.access$100(ConscryptEngineSocket.java:706)
at org.conscrypt.ConscryptEngineSocket.doHandshake(ConscryptEngineSocket.java:230)
at org.conscrypt.ConscryptEngineSocket.startHandshake(ConscryptEngineSocket.java:209)
at com.microsoft.sqlserver.jdbc.TDSChannel.enableSSL(IOBuffer.java:1795)
... 25 more
But, like I said, this happens only with .show() (and also with .write()).
.printSchema() works fine.
spark.read itself works fine after deleting java.security; before that, I got the same SSL exception.
Why is the SSL exception showing up again? Any clues?
It seems that this is caused by the fact that the MS SQL JDBC connector jar is not compatible with the Conscrypt library that Dataproc uses to improve performance. I would advise filing a bug with the MS SQL JDBC connector owners to fix this compatibility issue.
As a workaround you can disable Conscrypt when creating a Dataproc cluster using the dataproc:dataproc.conscrypt.provider.enable=false property, but it may have a negative impact on performance:
gcloud dataproc clusters create ${CLUSTER_NAME} \
. . . \
--properties=dataproc:dataproc.conscrypt.provider.enable=false \
. . .
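For reference, a fuller sketch of what that command could look like (the cluster name and region are placeholders; the image version shown matches the Debian 10 / Spark 3.1 setup described above, and you would keep whatever other flags you already use):
gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --image-version=2.0-debian10 \
    --properties=dataproc:dataproc.conscrypt.provider.enable=false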
Related
Although the latest spark-cassandra-connector from DataStax states it supports reading/writing TTL and WRITETIME I am still receiving a SQL undefined function error.
Using Databricks with library com.datastax.spark:spark-cassandra-connector-assembly_2.12:3.1.0 and a Spark Config for CassandraSparkExtensions on a 9.1 LTS ML (includes Apache Spark 3.1.2, Scala 2.12) Cluster. CQL version 3.4.5.
spark.sql.extensions com.datastax.spark.connector.CassandraSparkExtensions
Confirmed the config with Notebook code:
spark.conf.get("spark.sql.extensions")
Out[7]: 'com.datastax.spark.connector.CassandraSparkExtensions'
# Cassandra connection configs using Data Source API V2
spark.conf.set("spark.sql.catalog.cassandrauat.spark.cassandra.connection.host", "10.1.4.4")
spark.conf.set("spark.sql.catalog.cassandrauat.spark.cassandra.connection.port", "9042")
spark.conf.set("spark.sql.catalog.cassandrauat.spark.cassandra.auth.username", dbutils.secrets.get(scope = "myScope", key = "CassUsername"))
spark.conf.set("spark.sql.catalog.cassandrauat.spark.cassandra.auth.password", dbutils.secrets.get(scope = "myScope", key = "CassPassword"))
spark.conf.set("spark.sql.catalog.cassandrauat.spark.cassandra.connection.ssl.enabled", True)
spark.conf.set("spark.sql.catalog.cassandrauat.spark.cassandra.connection.ssl.trustStore.path", "/dbfs/user/client-truststore.jks")
spark.conf.set("spark.sql.catalog.cassandrauat.spark.cassandra.connection.ssl.trustStore.password", dbutils.secrets.get("key-vault-secrets", "cassTrustPassword"))
spark.conf.set("spark.sql.catalog.cassandrauat.spark.dse.continuous_paging_enabled", False)
# catalog name will be "cassandrauat" for Cassandra
spark.conf.set("spark.sql.catalog.cassandrauat", "com.datastax.spark.connector.datasource.CassandraCatalog")
spark.conf.set("spark.sql.catalog.cassandrauat.prop", "key")
spark.conf.set("spark.sql.defaultCatalog", "cassandrauat") # will override Spark to use Cassandra for all databases
%sql
select id, did, ts, val, ttl(val)
from cassandrauat.myKeyspace.myTable
Error in SQL statement: AnalysisException: Undefined function: 'ttl'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 25
When running this same CQL query on the Cassandra cluster directly, it produces a result.
Any help with why the CassandraSparkExtensions aren't loading would be appreciated.
Adding the full stack trace for the NoSuchMethodError that occurred after pre-loading the library:
com.databricks.backend.common.rpc.DatabricksExceptions$SQLExecutionException: java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(Lscala/PartialFunction;)Lorg/apache/spark/sql/catalyst/plans/logical/LogicalPlan;
at org.apache.spark.sql.cassandra.CassandraMetaDataRule$.replaceMetadata(CassandraMetadataFunctions.scala:152)
at org.apache.spark.sql.cassandra.CassandraMetaDataRule$$anonfun$apply$1.$anonfun$applyOrElse$2(CassandraMetadataFunctions.scala:187)
at scala.collection.immutable.Stream.foldLeft(Stream.scala:549)
at org.apache.spark.sql.cassandra.CassandraMetaDataRule$$anonfun$apply$1.applyOrElse(CassandraMetadataFunctions.scala:186)
at org.apache.spark.sql.cassandra.CassandraMetaDataRule$$anonfun$apply$1.applyOrElse(CassandraMetadataFunctions.scala:183)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:484)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:86)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:484)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:262)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:258)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:460)
at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:428)
at org.apache.spark.sql.cassandra.CassandraMetaDataRule$.apply(CassandraMetadataFunctions.scala:183)
at org.apache.spark.sql.cassandra.CassandraMetaDataRule$.apply(CassandraMetadataFunctions.scala:90)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$3(RuleExecutor.scala:221)
at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:80)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:221)
at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
at scala.collection.immutable.List.foldLeft(List.scala:89)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:218)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:210)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:210)
at org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$executeSameContext(Analyzer.scala:271)
at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:264)
at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:191)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:188)
at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:109)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.executeAndTrack(RuleExecutor.scala:188)
at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:246)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:347)
at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:245)
at org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:96)
at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:80)
at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:134)
at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:180)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:854)
at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:180)
at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:97)
at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:94)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:86)
at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:103)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:854)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:101)
at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:689)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:854)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:684)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:694)
at com.databricks.backend.daemon.driver.SQLDriverLocal.$anonfun$executeSql$1(SQLDriverLocal.scala:91)
at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.collection.TraversableLike.map(TraversableLike.scala:238)
at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
at scala.collection.immutable.List.map(List.scala:298)
at com.databricks.backend.daemon.driver.SQLDriverLocal.executeSql(SQLDriverLocal.scala:37)
at com.databricks.backend.daemon.driver.SQLDriverLocal.repl(SQLDriverLocal.scala:144)
at com.databricks.backend.daemon.driver.DriverLocal.$anonfun$execute$13(DriverLocal.scala:541)
at com.databricks.logging.UsageLogging.$anonfun$withAttributionContext$1(UsageLogging.scala:266)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
at com.databricks.logging.UsageLogging.withAttributionContext(UsageLogging.scala:261)
at com.databricks.logging.UsageLogging.withAttributionContext$(UsageLogging.scala:258)
at com.databricks.backend.daemon.driver.DriverLocal.withAttributionContext(DriverLocal.scala:50)
at com.databricks.logging.UsageLogging.withAttributionTags(UsageLogging.scala:305)
at com.databricks.logging.UsageLogging.withAttributionTags$(UsageLogging.scala:297)
at com.databricks.backend.daemon.driver.DriverLocal.withAttributionTags(DriverLocal.scala:50)
at com.databricks.backend.daemon.driver.DriverLocal.execute(DriverLocal.scala:518)
at com.databricks.backend.daemon.driver.DriverWrapper.$anonfun$tryExecutingCommand$1(DriverWrapper.scala:689)
at scala.util.Try$.apply(Try.scala:213)
at com.databricks.backend.daemon.driver.DriverWrapper.tryExecutingCommand(DriverWrapper.scala:681)
at com.databricks.backend.daemon.driver.DriverWrapper.getCommandOutputAndError(DriverWrapper.scala:522)
at com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:634)
at com.databricks.backend.daemon.driver.DriverWrapper.runInnerLoop(DriverWrapper.scala:427)
at com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:370)
at com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:221)
at java.lang.Thread.run(Thread.java:748)
at com.databricks.backend.daemon.driver.SQLDriverLocal.executeSql(SQLDriverLocal.scala:129)
at com.databricks.backend.daemon.driver.SQLDriverLocal.repl(SQLDriverLocal.scala:144)
at com.databricks.backend.daemon.driver.DriverLocal.$anonfun$execute$13(DriverLocal.scala:541)
at com.databricks.logging.UsageLogging.$anonfun$withAttributionContext$1(UsageLogging.scala:266)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
at com.databricks.logging.UsageLogging.withAttributionContext(UsageLogging.scala:261)
at com.databricks.logging.UsageLogging.withAttributionContext$(UsageLogging.scala:258)
at com.databricks.backend.daemon.driver.DriverLocal.withAttributionContext(DriverLocal.scala:50)
at com.databricks.logging.UsageLogging.withAttributionTags(UsageLogging.scala:305)
at com.databricks.logging.UsageLogging.withAttributionTags$(UsageLogging.scala:297)
at com.databricks.backend.daemon.driver.DriverLocal.withAttributionTags(DriverLocal.scala:50)
at com.databricks.backend.daemon.driver.DriverLocal.execute(DriverLocal.scala:518)
at com.databricks.backend.daemon.driver.DriverWrapper.$anonfun$tryExecutingCommand$1(DriverWrapper.scala:689)
at scala.util.Try$.apply(Try.scala:213)
at com.databricks.backend.daemon.driver.DriverWrapper.tryExecutingCommand(DriverWrapper.scala:681)
at com.databricks.backend.daemon.driver.DriverWrapper.getCommandOutputAndError(DriverWrapper.scala:522)
at com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:634)
at com.databricks.backend.daemon.driver.DriverWrapper.runInnerLoop(DriverWrapper.scala:427)
at com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:370)
at com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:221)
at java.lang.Thread.run(Thread.java:748)
If you just added the Spark Cassandra Connector via the Clusters UI, then it will not work. The reason is that libraries are installed into the cluster after Spark has already started, so the class specified in spark.sql.extensions isn't found.
To fix this you need to put the jar file onto the cluster nodes before Spark starts. You can do this with a cluster init script that either downloads the jar directly with something like this (but it will download multiple copies, one per node):
#!/bin/bash
wget -q -O /databricks/jars/spark-cassandra-connector-assembly_2.12-3.1.0.jar \
https://repo1.maven.org/maven2/com/datastax/spark/spark-cassandra-connector-assembly_2.12/3.1.0/spark-cassandra-connector-assembly_2.12-3.1.0.jar
or, better, download the assembly jar, put it onto DBFS, and then copy it from DBFS into the destination directory (for example, if it's uploaded to /FileStore/jars/spark-cassandra-connector-assembly_2.12-3.1.0.jar):
#!/bin/bash
cp /dbfs/FileStore/jars/spark-cassandra-connector-assembly_2.12-3.1.0.jar \
/databricks/jars/
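To get the assembly jar onto DBFS in the first place, a sketch using the Databricks CLI (the local file name and DBFS path are just examples):
databricks fs cp spark-cassandra-connector-assembly_2.12-3.1.0.jar \
  dbfs:/FileStore/jars/spark-cassandra-connector-assembly_2.12-3.1.0.jar
After that, attach the init script under the cluster's advanced options and restart the cluster so the jar is on the classpath before Spark starts.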
Update (13.11.2021): SCC 3.1.0 isn't fully compatible with Spark 3.2.0 (parts of it are already in DBR 9.1). See SPARKC-670 for more details.
I am migrating a proof of concept from AWS / EMR to Azure.
It’s written in python and uses Spark, Hadoop and Cassandra on AWS EMR and S3. It calculates Potential Forward Exposure for a small set of OTC derivatives.
I have one roadblock at present: How do I save a pyspark dataframe to Azure storage?
In AWS / S3 this is quite simple, however I’ve yet to make it work on Azure. I may be doing something stupid!
I've tested out writing files to blob and file storage on Azure, but have yet to find pointers to dataframes.
On AWS, I currently use the following:
npv_dataframe.coalesce(1).saveAsTextFile(output_dir + '/exposure_scenarios/' + str(counterparty))
where output_dir is in the format s3://s3_bucket_name/directory_name
I set up a Data Lake Storage Gen2 storage account and container. I have enabled public access to it.
I have explored various methods e.g:
https://learn.microsoft.com/en-us/python/api/overview/azure/storage-blob-readme?view=azure-python
https://learn.microsoft.com/en-us/azure/storage/common/storage-samples-python?toc=/azure/storage/blobs/toc.json
https://docs.databricks.com/_static/notebooks/data-import/azure-blob-store.html
Write data from pyspark to azure blob? (I believe this is old and that hadoop 3.2.1 comes with abfs support)
Some of these examples use a file-upload pattern but what I wanted was a direct save from a pyspark dataframe.
The test code I used was:
import traceback
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType
try:
spark = SparkSession.builder.getOrCreate()
conf = spark.sparkContext._jsc.hadoopConfiguration()
conf.set("fs.wasbs.impl", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
spark.conf.set('fs.azure.account.key.#myaccount#.blob.core.windows.net', '#mykey#')
df = spark.createDataFrame(["10", "11", "13"], StringType()).toDF("age")
df.show()
df \
.coalesce(1) \
.write.format('csv') \
.option('header', True) \
.mode('overwrite') \
.save('wasbs://#mycontainer#@#myaccount#.blob.core.windows.net/result_csv')
print("Hadoop version: " + spark.sparkContext._gateway.jvm.org.apache.hadoop.util.VersionInfo.getVersion())
except Exception as exp:
print("Exception occurred")
print(traceback.format_exc())
The example above fails at the df.write - the error is
py4j.protocol.Py4JJavaError: An error occurred while calling o48.save.
: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.azure.NativeAzureFileSystem not found
I receive the same error when using spark-submit
spark-submit --packages org.apache.hadoop:hadoop-azure:3.2.1,com.microsoft.azure:azure-storage:8.6.3 ./test.py
I believe this may be a version compatibility problem. I noticed that the Hadoop jars bundled with PySpark were all version 2.7.4, whereas I was referencing the 3.2.1 installation.
I am / was using:
Java 8 (1.8.0_265)
Spark 3.0.0
Hadoop 3.2.1
Python 3.6
Ubuntu 18.04
I ensured all hadoop jars in the Spark directory were the same as in the Hadoop jar directory.
After following another stack trace error I updated the command to: spark-submit --packages org.apache.hadoop:hadoop-azure:3.2.1,com.microsoft.azure:azure-storage:8.6.5 test.py
I then received a different Java error, which looks like it might be a problem with the key:
py4j.protocol.Py4JJavaError: An error occurred while calling o48.save.
: java.lang.NoSuchMethodError: 'org.apache.hadoop.conf.Configuration org.apache.hadoop.security.ProviderUtils.excludeIncompatibleCredentialProviders(org.apache.hadoop.conf.Configuration, java.lang.Class)'
at org.apache.hadoop.fs.azure.SimpleKeyProvider.getStorageAccountKey(SimpleKeyProvider.java:45)
at org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.getAccountKeyFromConfiguration(AzureNativeFileSystemStore.java:989)
at org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.createAzureStorageSession(AzureNativeFileSystemStore.java:1078)
at org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.initialize(AzureNativeFileSystemStore.java:543)
at org.apache.hadoop.fs.azure.NativeAzureFileSystem.initialize(NativeAzureFileSystem.java:1344)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.spark.sql.execution.datasources.DataSource.planForWritingFileFormat(DataSource.scala:424)
at org.apache.spark.sql.execution.datasources.DataSource.planForWriting(DataSource.scala:524)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:290)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:564)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.base/java.lang.Thread.run(Thread.java:832)
Also, after adding the Azure account secure key to the hadoop config, if I try:
hdfs dfs -ls wasbs://CONTAINER@ACCOUNT.blob.core.windows.net/
I receive the error: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.azure.NativeAzureFileSystem$Secure not found
Any help appreciated! Bit stuck for ideas. It also seems that, relative to AWS, there are few solved posts about Azure storage / Dataframe issues.
According to my test, we can use the package com.microsoft.azure:azure-storage:8.6.3 to upload files to Azure Blob storage in Spark.
For example
I am using:
Java 8 (1.8.0_265)
Spark 3.0.0
Hadoop 3.2.0
Python 3.6.9
Ubuntu 18.04
My code
import traceback
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType
try:
spark = SparkSession.builder.getOrCreate()
conf = spark.sparkContext._jsc.hadoopConfiguration()
conf.set("fs.wasbs.impl", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
spark.conf.set('fs.azure.account.key.jimtestdiag924.blob.core.windows.net', '')
df = spark.createDataFrame(["10", "11", "13"], StringType()).toDF("age")
df.show()
df \
.coalesce(1) \
.write.format('csv') \
.option('header', True) \
.mode('overwrite') \
.save('wasbs://testupload@<account name>.blob.core.windows.net/result_csv')
print("Hadoop version: " + spark.sparkContext._gateway.jvm.org.apache.hadoop.util.VersionInfo.getVersion())
except Exception as exp:
print("Exception occurred")
print(traceback.format_exc())
My command
spark-submit --packages org.apache.hadoop:hadoop-azure:3.2.0,com.microsoft.azure:azure-storage:8.6.3 spark.py
I resolved the issue by changing the storage account to a Blob storage type, rather than Storage Gen2. The Windows Azure Storage Blob (WASB) driver is unsupported with Data Lake Storage Gen2.
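Alternatively, if you want to keep a Data Lake Storage Gen2 account, here is a sketch of writing through the ABFS driver instead of WASB, reusing the DataFrame from the test code above (assumes hadoop-azure 3.2.x on the classpath; account, container, and key are placeholders):
spark.conf.set("fs.azure.account.key.<myaccount>.dfs.core.windows.net", "<mykey>")
df \
    .coalesce(1) \
    .write.format('csv') \
    .option('header', True) \
    .mode('overwrite') \
    .save('abfss://<mycontainer>@<myaccount>.dfs.core.windows.net/result_csv')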
I'm running a Python script using PySpark that connects to a Kubernetes cluster to run jobs using executor pods. The idea of the script is to create an SQLContext that queries a Snowflake database. However, I'm getting the following exception, which is not descriptive enough:
20/07/15 12:10:39 ERROR executor.Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.UnsupportedOperationException: sun.misc.Unsafe or java.nio.DirectByteBuffer.<init>(long, int) not available
at net.snowflake.client.jdbc.internal.io.netty.util.internal.PlatformDependent.directBuffer(PlatformDependent.java:399)
at net.snowflake.client.jdbc.internal.io.netty.buffer.NettyArrowBuf.getDirectBuffer(NettyArrowBuf.java:243)
at net.snowflake.client.jdbc.internal.io.netty.buffer.NettyArrowBuf.nioBuffer(NettyArrowBuf.java:233)
at net.snowflake.client.jdbc.internal.io.netty.buffer.ArrowBuf.nioBuffer(ArrowBuf.java:247)
at net.snowflake.client.jdbc.internal.apache.arrow.vector.ipc.ReadChannel.readFully(ReadChannel.java:81)
at net.snowflake.client.jdbc.internal.apache.arrow.vector.ipc.message.MessageSerializer.readMessageBody(MessageSerializer.java:696)
at net.snowflake.client.jdbc.internal.apache.arrow.vector.ipc.message.MessageChannelReader.readNext(MessageChannelReader.java:68)
at net.snowflake.client.jdbc.internal.apache.arrow.vector.ipc.ArrowStreamReader.loadNextBatch(ArrowStreamReader.java:106)
at net.snowflake.client.jdbc.ArrowResultChunk.readArrowStream(ArrowResultChunk.java:117)
at net.snowflake.client.core.SFArrowResultSet.buildFirstChunk(SFArrowResultSet.java:352)
at net.snowflake.client.core.SFArrowResultSet.<init>(SFArrowResultSet.java:230)
at net.snowflake.client.jdbc.SnowflakeResultSetSerializableV1.getResultSet(SnowflakeResultSetSerializableV1.java:1079)
at net.snowflake.spark.snowflake.io.ResultIterator.liftedTree1$1(SnowflakeResultSetRDD.scala:85)
at net.snowflake.spark.snowflake.io.ResultIterator.<init>(SnowflakeResultSetRDD.scala:78)
at net.snowflake.spark.snowflake.io.SnowflakeResultSetRDD.compute(SnowflakeResultSetRDD.scala:41)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:127)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:464)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:467)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
Has anyone run into a similar case? If so, how did you fix it?
I ran into the same problem and was able to fix it. I figured out that io.netty.tryReflectionSetAccessible needs to be explicitly set to true on Java >= 9 for the Spark-Snowflake connector to be able to read the data returned from Snowflake in the Kubernetes executor pods.
Now, since the io.netty packages are shaded in snowflake-jdbc, we need to qualify the property with the full package name, i.e. net.snowflake.client.jdbc.internal.io.netty.tryReflectionSetAccessible=true.
This property needs to be set as a JVM option of the Spark executor pods, which can be accomplished by setting the executor extra JVM options Spark property. For example:
Property Name: spark.executor.extraJavaOptions
Value: -Dnet.snowflake.client.jdbc.internal.io.netty.tryReflectionSetAccessible=true
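For example, if the job is launched with spark-submit, the property could be passed like this (a sketch; keep your other flags as they are):
spark-submit \
  --conf 'spark.executor.extraJavaOptions=-Dnet.snowflake.client.jdbc.internal.io.netty.tryReflectionSetAccessible=true' \
  ...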
Note: the changes below were done on my local system. They may or may not work for you.
Posting the steps for how I fixed the issue.
I had a similar issue where my brew-installed Spark used openjdk@11 by default. To fix it, I changed the Java version from openjdk@11 to Oracle JDK 1.8 (you can use OpenJDK 1.8 instead of Oracle JDK 1.8).
> cat spark-submit
#!/bin/bash
JAVA_HOME="/root/.linuxbrew/opt/openjdk#11" exec "/root/.linuxbrew/Cellar/apache-spark/3.0.0/libexec/bin/pyspark" "$#"
I changed the Java version from openjdk@11 to Oracle JDK 1.8. Now my spark-submit command looks like this:
> cat spark-submit
#!/bin/bash
JAVA_HOME="/usr/share/jdk1.8.0_202" exec "/root/.linuxbrew/Cellar/apache-spark/3.0.0/libexec/bin/pyspark" "$#"
Another workaround for this issue is to try setting the below conf on your spark-submit:
spark-submit \
--conf 'spark.executor.extraJavaOptions=-Dio.netty.tryReflectionSetAccessible=true' \
--conf 'spark.driver.extraJavaOptions=-Dio.netty.tryReflectionSetAccessible=true' \
...
I am trying to install some Maven libraries on an existing Azure Databricks cluster (or a newly created cluster) through the API from Python.
Cluster details:
Python 3
5.5 LTS (includes Apache Spark 2.4.3, Scala 2.11)
Node type: Standard_D3_v2
spark_submit_packages = "org.apache.spark:spark-streaming-kafka-0-8-assembly_2.11:2.4.3," \
"com.databricks:spark-redshift_2.11:3.0.0-preview1," \
"org.postgresql:postgresql:9.3-1103-jdbc3," \
"com.amazonaws:aws-java-sdk:1.11.98," \
"com.amazonaws:aws-java-sdk-core:1.11.98," \
"com.amazonaws:aws-java-sdk-sns:1.11.98," \
"org.apache.hadoop:hadoop-aws:2.7.3," \
"com.amazonaws:aws-java-sdk-s3:1.11.98," \
"com.databricks:spark-avro_2.11:4.0.0," \
"com.microsoft.azure:azure-data-lake-store-sdk:2.0.11," \
"org.apache.hadoop:hadoop-azure-datalake:3.0.0-alpha2," \
"com.microsoft.azure:azure-storage:3.1.0," \
"org.apache.hadoop:hadoop-azure:2.7.2"
install_lib_url = "https://<region>.azuredatabricks.net/api/2.0/libraries/install"
packages = spark_submit_packages.split(",")
maven_packages = []
for pack in packages:
maven_packages.append({"maven": {"coordinates": pack}})
headers = {"Authorization": "Bearer {}".format(TOKEN)}
headers['Content-type'] = 'application/json'
data = {
"cluster_id": cluster_id,
"libraries": maven_packages
}
res = requests.post(install_lib_url, headers=headers, data=json.dumps(data))
_response = res.json()
print(json.dumps(_response))
The response is an empty JSON object, which is as expected.
But sometimes this API call results in the following error in the UI and the library installation fails:
Library resolution failed. Cause: java.lang.RuntimeException: commons-httpclient:commons-httpclient download failed.
at com.databricks.libraries.server.MavenInstaller.$anonfun$resolveDependencyPaths$5(MavenLibraryResolver.scala:253)
at scala.collection.MapLike.getOrElse(MapLike.scala:131)
at scala.collection.MapLike.getOrElse$(MapLike.scala:129)
at scala.collection.AbstractMap.getOrElse(Map.scala:63)
at com.databricks.libraries.server.MavenInstaller.$anonfun$resolveDependencyPaths$4(MavenLibraryResolver.scala:253)
at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:75)
at scala.collection.TraversableLike.map(TraversableLike.scala:238)
at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
at scala.collection.AbstractTraversable.map(Traversable.scala:108)
at com.databricks.libraries.server.MavenInstaller.resolveDependencyPaths(MavenLibraryResolver.scala:249)
at com.databricks.libraries.server.MavenInstaller.doDownloadMavenPackages(MavenLibraryResolver.scala:455)
at com.databricks.libraries.server.MavenInstaller.$anonfun$downloadMavenPackages$2(MavenLibraryResolver.scala:381)
at com.databricks.backend.common.util.FileUtils$.withTemporaryDirectory(FileUtils.scala:431)
at com.databricks.libraries.server.MavenInstaller.$anonfun$downloadMavenPackages$1(MavenLibraryResolver.scala:380)
at com.databricks.logging.UsageLogging.$anonfun$recordOperation$4(UsageLogging.scala:417)
at com.databricks.logging.UsageLogging.$anonfun$withAttributionContext$1(UsageLogging.scala:239)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
at com.databricks.logging.UsageLogging.withAttributionContext(UsageLogging.scala:234)
at com.databricks.logging.UsageLogging.withAttributionContext$(UsageLogging.scala:231)
at com.databricks.libraries.server.MavenInstaller.withAttributionContext(MavenLibraryResolver.scala:57)
at com.databricks.logging.UsageLogging.withAttributionTags(UsageLogging.scala:276)
at com.databricks.logging.UsageLogging.withAttributionTags$(UsageLogging.scala:269)
at com.databricks.libraries.server.MavenInstaller.withAttributionTags(MavenLibraryResolver.scala:57)
at com.databricks.logging.UsageLogging.recordOperation(UsageLogging.scala:398)
at com.databricks.logging.UsageLogging.recordOperation$(UsageLogging.scala:337)
at com.databricks.libraries.server.MavenInstaller.recordOperation(MavenLibraryResolver.scala:57)
at com.databricks.libraries.server.MavenInstaller.downloadMavenPackages(MavenLibraryResolver.scala:379)
at com.databricks.libraries.server.MavenInstaller.downloadMavenPackagesWithRetry(MavenLibraryResolver.scala:137)
at com.databricks.libraries.server.MavenInstaller.resolveMavenPackages(MavenLibraryResolver.scala:113)
at com.databricks.libraries.server.MavenLibraryResolver.resolve(MavenLibraryResolver.scala:44)
at com.databricks.libraries.server.ManagedLibraryManager$GenericManagedLibraryResolver.resolve(ManagedLibraryManager.scala:263)
at com.databricks.libraries.server.ManagedLibraryManagerImpl.$anonfun$resolvePrimitives$1(ManagedLibraryManagerImpl.scala:193)
at com.databricks.libraries.server.ManagedLibraryManagerImpl.$anonfun$resolvePrimitives$1$adapted(ManagedLibraryManagerImpl.scala:188)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at com.databricks.libraries.server.ManagedLibraryManagerImpl.resolvePrimitives(ManagedLibraryManagerImpl.scala:188)
at com.databricks.libraries.server.ManagedLibraryManagerImpl$ClusterStatus.installLibs(ManagedLibraryManagerImpl.scala:772)
at com.databricks.libraries.server.ManagedLibraryManagerImpl$InstallLibTask$1.run(ManagedLibraryManagerImpl.scala:473)
at com.databricks.threading.NamedExecutor$$anon$1.$anonfun$run$1(NamedExecutor.scala:317)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.logging.UsageLogging.$anonfun$withAttributionContext$1(UsageLogging.scala:239)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
at com.databricks.logging.UsageLogging.withAttributionContext(UsageLogging.scala:234)
at com.databricks.logging.UsageLogging.withAttributionContext$(UsageLogging.scala:231)
at com.databricks.threading.NamedExecutor.withAttributionContext(NamedExecutor.scala:256)
at com.databricks.threading.NamedExecutor$$anon$1.run(NamedExecutor.scala:317)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Is it due to installing multiple Maven libraries in a single API call? (But the API does expect a list :| )
EDIT: This issue occurs when restarting the cluster too. Let's say I have manually installed some 10 Maven libraries on a cluster and all the installations are successful. But when I restart the cluster, even these previously successful installations fail.
Got the following response from Azure support team:
Seems there is a problem with a particular Maven jar (org.apache.hadoop:hadoop-azure-datalake:3.0.0-alpha2).
Workaround:
1. Download the jar from the Maven repository.
2. Upload it to DBFS.
3. Use the jar from DBFS when creating the library.
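Building on the API call above, a sketch of installing that one jar from DBFS instead of via Maven coordinates (the DBFS path is a placeholder for wherever you uploaded it):
data = {
    "cluster_id": cluster_id,
    "libraries": [
        {"jar": "dbfs:/FileStore/jars/hadoop-azure-datalake-3.0.0-alpha2.jar"}
    ]
}
res = requests.post(install_lib_url, headers=headers, data=json.dumps(data))
print(json.dumps(res.json()))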
We are running a few dataproc jobs with dataproc image 1.0 and spark-redshift.
We have two clusters, here are some details:
Cluster A -> Runs PySpark Streaming job, last created 2016. Jul 15. 11:27:12 AEST
Cluster B -> Runs PySpark batch jobs; the cluster is created every time a job is run and torn down afterwards.
A & B run the same code base, use the same init script, same node types, etc.
Since sometime last Friday (2016-08-05 AEST), our code stopped working on cluster B with the following error, while cluster A is running without issues.
The following code can reproduce the issue on Cluster B (or any new cluster with image v1.0.0) while it runs fine on cluster A.
Sample PySpark Code:
from pyspark import SparkContext, SQLContext
sc = SparkContext()
sql_context = SQLContext(sc)
rdd = sc.parallelize([{'user_id': 'test'}])
df = rdd.toDF()
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "FOO")
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "BAR")
df\
.write\
.format("com.databricks.spark.redshift") \
.option("url", "jdbc:redshift://foo.ap-southeast-2.redshift.amazonaws.com/bar") \
.option("dbtable", 'foo') \
.option("tempdir", "s3n://bar") \
.option("extracopyoptions", "TRUNCATECOLUMNS") \
.mode("append") \
.save()
The above code fails in both of the following situations on Cluster B, while running fine on A. Note that the RedshiftJDBC41-1.1.10.1010.jar is put in place via the cluster init script.
Running in interactive mode on the master node:
PYSPARK_DRIVER_PYTHON=ipython pyspark \
--verbose \
--master "local[*]"\
--jars /usr/lib/hadoop/lib/RedshiftJDBC41-1.1.10.1010.jar \
--packages com.databricks:spark-redshift_2.10:1.0.0
Submitting the job via gcloud dataproc:
gcloud --project foo \
dataproc jobs submit pyspark \
--cluster bar \
--properties ^#^spark.jars.packages=com.databricks:spark-redshift_2.10:1.0.0#spark.jars=/usr/lib/hadoop/lib/RedshiftJDBC41-1.1.10.1010.jar \
foo.bar.py
The error it produces (Trace):
2016-08-08 06:12:23 WARN TaskSetManager:70 - Lost task 6.0 in stage 45.0 (TID 121275, foo.bar.internal):
java.lang.NoSuchMethodError: org.apache.avro.generic.GenericData.createDatumWriter(Lorg/apache/avro/Schema;)Lorg/apache/avro/io/DatumWriter;
at org.apache.avro.mapreduce.AvroKeyRecordWriter.<init>(AvroKeyRecordWriter.java:55)
at org.apache.avro.mapreduce.AvroKeyOutputFormat$RecordWriterFactory.create(AvroKeyOutputFormat.java:79)
at org.apache.avro.mapreduce.AvroKeyOutputFormat.getRecordWriter(AvroKeyOutputFormat.java:105)
at com.databricks.spark.avro.AvroOutputWriter.<init>(AvroOutputWriter.scala:82)
at com.databricks.spark.avro.AvroOutputWriterFactory.newInstance(AvroOutputWriterFactory.scala:31)
at org.apache.spark.sql.execution.datasources.BaseWriterContainer.newOutputWriter(WriterContainer.scala:129)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:255)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:148)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:148)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
2016-08-08 06:12:24 ERROR YarnScheduler:74 - Lost executor 63 on kinesis-ma-sw-o7he.c.bupa-ma.internal: Container marked as failed: container_1470632577663_0003_01_000065 on host: kinesis-ma-sw-o7he.c.bupa-ma.internal. Exit status: 50. Diagnostics: Exception from container-launch.
Container id: container_1470632577663_0003_01_000065
Exit code: 50
Stack trace: ExitCodeException exitCode=50:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
at org.apache.hadoop.util.Shell.run(Shell.java:456)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
spark-redshift 1.0.0 requires com.databricks:spark-avro:2.0.1, which requires org.apache.avro:1.7.6.
Upon checking the version of org.apache.avro.generic.GenericData on Cluster A:
root@foo-bar-m:/home/foo# spark-shell \
> --verbose \
> --master "local[*]" \
> --deploy-mode client \
> --packages com.databricks:spark-redshift_2.10:1.0.0 \
> --jars "/usr/lib/hadoop/lib/RedshiftJDBC41-1.1.10.1010.jar"
It produces (Trace):
scala> import org.apache.avro.generic._
import org.apache.avro.generic._
scala> val c = GenericData.get()
c: org.apache.avro.generic.GenericData = org.apache.avro.generic.GenericData@496a514f
scala> c.getClass.getProtectionDomain().getCodeSource()
res0: java.security.CodeSource = (file:/usr/lib/hadoop/lib/bigquery-connector-0.7.5-hadoop2.jar <no signer certificates>)
While running the same command on Cluster B:
scala> import org.apache.avro.generic._
import org.apache.avro.generic._
scala> val c = GenericData.get()
c: org.apache.avro.generic.GenericData = org.apache.avro.generic.GenericData@72bec302
scala> c.getClass.getProtectionDomain().getCodeSource()
res0: java.security.CodeSource = (file:/usr/lib/hadoop/lib/bigquery-connector-0.7.7-hadoop2.jar <no signer certificates>)
Screenshot of Env on Cluster B. (Apologies for all the redactions).
We've tried the methods described here and here without any success.
This is really frustrating, as Dataproc updates the image content without bumping the release version, the complete opposite of immutable releases. Now our code is broken and there is no way we can roll back to the previous version.
Sorry for the trouble! It's certainly not intended for breaking changes to occur within an image version. Note that subminor versions are rolled out "under the hood" for non-breaking bug fixes and Dataproc-specific patches.
You can revert to using the 1.0.* version from before last week by simply specifying --image-version 1.0.8 when deploying clusters from the command-line:
gcloud dataproc clusters create ${CLUSTER_NAME} --image-version 1.0.8
Edit: For additional clarification, we've investigated the Avro versions in question and verified that the Avro version numbers actually did not change in any recent subminor Dataproc release. The core issue is that Hadoop itself has had a latent bug where it brings avro-1.7.4 under /usr/lib/hadoop/lib/ while Spark uses avro-1.7.7. Coincidentally, Google's BigQuery connector also uses avro-1.7.7, but this turns out to be orthogonal to the known Spark/Hadoop problem with 1.7.4 vs 1.7.7. The recent image update was deemed non-breaking because the versions in fact did not change, but the classloading order changed in a nondeterministic way: Hadoop's bad Avro version used to be hidden from the Spark job by pure luck, and is no longer accidentally hidden in the latest image.
Dataproc's preview image currently includes a fix to the Avro version in the Hadoop layer, which should make it into any future Dataproc 1.1 version when it comes out; you might want to consider trying the preview version to see if Spark 2.0 is a seamless transition.
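If you want to try that, the preview image can be selected the same way (a sketch; add your usual flags):
gcloud dataproc clusters create ${CLUSTER_NAME} --image-version preview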