How to run Spark structured streaming using local JAR files - apache-spark

I'm using one of the Docker images of EMR on EKS (emr-6.5.0:20211119) and investigating how to work with Kafka using Spark Structured Streaming (pyspark). As per the integration guide, I run a Python script as follows.
$SPARK_HOME/bin/spark-submit \
--deploy-mode client \
--master local \
--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2 \
<myscript>.py
It downloads the package from Maven Central, and I see the following JAR files downloaded into ~/.ivy2/jars:
com.github.luben_zstd-jni-1.4.8-1.jar
org.apache.commons_commons-pool2-2.6.2.jar
org.apache.kafka_kafka-clients-2.6.0.jar
org.apache.spark_spark-sql-kafka-0-10_2.12-3.1.2.jar
org.apache.spark_spark-token-provider-kafka-0-10_2.12-3.1.2.jar
org.lz4_lz4-java-1.7.1.jar
org.slf4j_slf4j-api-1.7.30.jar
org.spark-project.spark_unused-1.0.0.jar
org.xerial.snappy_snappy-java-1.1.8.2.jar
However, I find that the main JAR file (spark-sql-kafka-0-10.jar) is already available in $SPARK_HOME/external/lib, and I wonder how to make use of it instead of downloading it:
spark-avro.jar
spark-avro_2.12-3.1.2-amzn-1.jar
spark-ganglia-lgpl.jar
spark-ganglia-lgpl_2.12-3.1.2-amzn-1.jar
spark-sql-kafka-0-10.jar
spark-sql-kafka-0-10_2.12-3.1.2-amzn-1.jar
spark-streaming-kafka-0-10-assembly.jar
spark-streaming-kafka-0-10-assembly_2.12-3.1.2-amzn-1.jar
spark-streaming-kinesis-asl-assembly.jar
spark-streaming-kinesis-asl-assembly_2.12-3.1.2-amzn-1.jar
spark-token-provider-kafka-0-10.jar
spark-token-provider-kafka-0-10_2.12-3.1.2-amzn-1.jar
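For reference, <myscript>.py is a minimal Structured Streaming read from Kafka along the lines of the sketch below (the broker address, topic and checkpoint path are placeholders, not my actual values):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-stream-check").getOrCreate()

# Reading a Kafka topic as a streaming DataFrame is what requires the
# spark-sql-kafka-0-10 connector (and its transitive dependencies) on the classpath.
df = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder
    .option("subscribe", "my-topic")                   # placeholder
    .load()
)

query = (
    df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
    .writeStream.format("console")
    .option("checkpointLocation", "/tmp/checkpoint")   # placeholder
    .start()
)
query.awaitTermination()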
UPDATE 2022-03-09
I tried again after updating spark-defaults.conf as shown below, adding the external lib folder to the class paths.
spark.driver.extraClassPath /usr/lib/spark/external/lib/*:...
spark.driver.extraLibraryPath ...
spark.executor.extraClassPath /usr/lib/spark/external/lib/*:...
spark.executor.extraLibraryPath ...
I can run it without --packages, but it fails with the following error.
22/03/09 05:37:25 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.NoClassDefFoundError: org/apache/commons/pool2/impl/GenericKeyedObjectPoolConfig
at org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer$.<init>(KafkaDataConsumer.scala:623)
at org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer$.<clinit>(KafkaDataConsumer.scala)
at org.apache.spark.sql.kafka010.KafkaBatchPartitionReader.<init>(KafkaBatchPartitionReader.scala:52)
at org.apache.spark.sql.kafka010.KafkaBatchReaderFactory$.createReader(KafkaBatchPartitionReader.scala:40)
at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD.compute(DataSourceRDD.scala:60)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: org.apache.commons.pool2.impl.GenericKeyedObjectPoolConfig
at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
... 33 more
Adding --packages org.apache.commons:commons-pool2:2.6.2 didn't help either.

You would use --jars to refer to JAR files on the local filesystem in place of --packages.
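For example, the connector JAR under external/lib does not appear to be an assembly, so its transitive dependencies (kafka-clients, commons-pool2, etc.) have to be supplied as well, which is consistent with the NoClassDefFoundError in the update. Below is a sketch from PySpark, where spark.jars is the programmatic equivalent of --jars; the kafka-clients and commons-pool2 paths are assumptions and may need to point at separately downloaded JARs:
from pyspark.sql import SparkSession

# Comma-separated local JARs; equivalent to passing --jars to spark-submit.
# The kafka-clients and commons-pool2 locations are assumptions about where
# the transitive dependencies might live on this image.
local_jars = ",".join([
    "/usr/lib/spark/external/lib/spark-sql-kafka-0-10.jar",
    "/usr/lib/spark/external/lib/spark-token-provider-kafka-0-10.jar",
    "/usr/lib/spark/jars/kafka-clients-2.6.0.jar",   # assumed path
    "/usr/lib/spark/jars/commons-pool2-2.6.2.jar",   # assumed path
])

spark = (
    SparkSession.builder
    .appName("kafka-local-jars")
    .config("spark.jars", local_jars)
    .getOrCreate()
)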

Unfortunately I cannot submit an app with only the JAR files in $SPARK_HOME/external/lib due to an error; the details of the error have been added to the question as an update. Instead, I ended up pre-downloading the package JAR files and using those.
I first ran the following command. Here foo.py is an empty file, and the run downloads the package JAR files into /home/hadoop/.ivy2/jars.
$SPARK_HOME/bin/spark-submit \
--deploy-mode client \
--master local \
--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2 \
foo.py
Then I updated spark-defaults.conf as follows.
spark.driver.extraClassPath /home/hadoop/.ivy2/jars/*:...
spark.driver.extraLibraryPath ...
spark.executor.extraClassPath /home/hadoop/.ivy2/jars/*:...
spark.executor.extraLibraryPath ...
After that, I ran the submit command without --packages and it worked without an error.
$SPARK_HOME/bin/spark-submit \
--deploy-mode client \
--master local \
<myscript>.py
This approach is likely to be useful when downloading the package JAR files takes a long time, as they can be pre-downloaded. Note that EMR on EKS supports using a custom image from ECR.

Related

dataframe.show() not working in Pyspark inside a Debian VM (Dataproc)

I'm currently using GCP and Dataproc, and I'm new to Apache Spark, PySpark and Debian VMs. I'm trying to replicate inside a Dataproc cluster (Debian VM) a Spark job that runs perfectly on my local machine (Windows 10, VS Code, Spark 3.3.1): ingestion from SQL Server into a Spark DataFrame via the JDBC driver.
When I try it inside this Debian VM, SparkSession.read() works correctly but dataframe.show() does not.
Debian VM Configuration:
Debian 10 with Hadoop 3.2 and Spark 3.1.3.
JDBC driver: mssql-jdbc-11.2.1.jre8.jar from here
Java version: openjdk version "1.8.0_352"
To get SparkSession.read() running correctly, I had to delete java.security at this path in the Debian VM: $JAVA_HOME/jre/lib/security/java.security
Pyspark launch
pyspark --jars gs://bucket/mssql-jdbc-11.2.1.jre8.jar
Parameters
server_name = "jdbc:sqlserver://sqlhost:1433;"
database_name = "dbname"
url = server_name + ";" + "databaseName=" + database_name + ";encrypt=false;"
query = "SELECT * FROM dbo.tablename"
username = "user"
password = "pass"
sparkSession.read()
dataFrame = spark.read \
.format("jdbc") \
.option("url", url) \
.option("user", username) \
.option("password", password) \
.option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
.option("query", query) \
.load()
Results in the Debian VM:
dataFrame.show(5)
22/11/17 13:08:16 WARN org.apache.spark.sql.catalyst.util.package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
22/11/17 13:09:57 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0) (cluster-name.internal executor 2): com.microsoft.sqlserver.jdbc.SQLServerException: The driver could not establish a secure connection to SQL Server by using Secure Sockets Layer (SSL) encryption. Error: "Connection reset ClientConnectionId:".
at com.microsoft.sqlserver.jdbc.SQLServerConnection.terminate(SQLServerConnection.java:3806)
at com.microsoft.sqlserver.jdbc.TDSChannel.enableSSL(IOBuffer.java:1906)
at com.microsoft.sqlserver.jdbc.SQLServerConnection.connectHelper(SQLServerConnection.java:3329)
at com.microsoft.sqlserver.jdbc.SQLServerConnection.login(SQLServerConnection.java:2950)
at com.microsoft.sqlserver.jdbc.SQLServerConnection.connectInternal(SQLServerConnection.java:2790)
at com.microsoft.sqlserver.jdbc.SQLServerConnection.connect(SQLServerConnection.java:1663)
at com.microsoft.sqlserver.jdbc.SQLServerDriver.connect(SQLServerDriver.java:1064)
at org.apache.spark.sql.execution.datasources.jdbc.connection.BasicConnectionProvider.getConnection(BasicConnectionProvider.scala:49)
at org.apache.spark.sql.execution.datasources.jdbc.connection.ConnectionProvider$.create(ConnectionProvider.scala:77)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$createConnectionFactory$1(JdbcUtils.scala:62)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:272)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:505)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:508)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.io.IOException: Connection reset ClientConnectionId:700149ae-3483-4315-8c2e-de1bc11ce6b3
at com.microsoft.sqlserver.jdbc.TDSChannel$SSLHandshakeInputStream.readInternal(IOBuffer.java:974)
at com.microsoft.sqlserver.jdbc.TDSChannel$SSLHandshakeInputStream.read(IOBuffer.java:961)
at com.microsoft.sqlserver.jdbc.TDSChannel$ProxyInputStream.readInternal(IOBuffer.java:1207)
at com.microsoft.sqlserver.jdbc.TDSChannel$ProxyInputStream.read(IOBuffer.java:1194)
at org.conscrypt.ConscryptEngineSocket$SSLInputStream.readFromSocket(ConscryptEngineSocket.java:920)
at org.conscrypt.ConscryptEngineSocket$SSLInputStream.processDataFromSocket(ConscryptEngineSocket.java:884)
at org.conscrypt.ConscryptEngineSocket$SSLInputStream.access$100(ConscryptEngineSocket.java:706)
at org.conscrypt.ConscryptEngineSocket.doHandshake(ConscryptEngineSocket.java:230)
at org.conscrypt.ConscryptEngineSocket.startHandshake(ConscryptEngineSocket.java:209)
at com.microsoft.sqlserver.jdbc.TDSChannel.enableSSL(IOBuffer.java:1795)
... 25 more
But, like I said, it only fails with .show() (and also with .write()).
.printSchema() works fine.
sparkSession.read() works fine after deleting java.security; before that, I got the same SSL exception.
Why is the SSL exception showing up again? Any clues?
It seems that this is caused by the fact that the MS SQL JDBC connector JAR is not compatible with the Conscrypt library that Dataproc uses to improve performance. I would advise filing a bug with the MS SQL JDBC connector owners to fix this compatibility issue.
As a workaround you can disable Conscrypt when creating a Dataproc cluster using the dataproc:dataproc.conscrypt.provider.enable=false property, but it may have a negative impact on performance:
gcloud dataproc clusters create ${CLUSTER_NAME} \
. . . \
--properties=dataproc:dataproc.conscrypt.provider.enable=false \
. . .

Pyspark - java.lang.OutOfMemoryError: Java heap space while writing to csv file

When trying to write to a CSV file using the code below
DF.coalesce(1).write \
  .option("header", "false") \
  .option("sep", ",") \
  .option("escape", '"') \
  .option("ignoreTrailingWhiteSpace", "false") \
  .option("ignoreLeadingWhiteSpace", "false") \
  .mode("overwrite") \
  .csv(filename)
I am getting the below error
ileFormatWriter.scala:169)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
... 1 more
Caused by: java.lang.OutOfMemoryError: Java heap space
Could someone advise a workaround?
Try increasing the executor memory in your spark-submit command.
Something like this:
spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://207.184.161.138:7077 \
--executor-memory 20G \
--total-executor-cores 100 \
/path/to/examples.jar \
1000
For me, adding the Spark config below fixed the issue:
spark = SparkSession.builder.master('local[*]').config("spark.driver.memory", "15g").appName('sl-app').getOrCreate()
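Putting the two suggestions together, here is a sketch of a session with both driver and executor memory raised (the memory values are illustrative and the output path is a placeholder; size these to your data and cluster):
from pyspark.sql import SparkSession

# Illustrative memory settings; executor memory matters on a real cluster,
# driver memory is what counts in local mode.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("csv-writer")
    .config("spark.driver.memory", "15g")
    .config("spark.executor.memory", "20g")
    .getOrCreate()
)

# coalesce(1) funnels every partition through a single task, which is often
# what triggers the heap error in the first place; a larger partition count
# reduces the memory pressure on that one task.
df = spark.range(10)  # stand-in for the question's DataFrame
df.coalesce(1).write.option("header", "false").mode("overwrite").csv("/tmp/out")  # placeholder path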

got java.lang.UnsupportedOperationException on executor pod

I'm running a Python script using pyspark that connects to a Kubernetes cluster to run jobs on executor pods. The idea of the script is to create an SQLContext that queries a Snowflake database. However, I'm getting the following exception, which is not descriptive enough:
20/07/15 12:10:39 ERROR executor.Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.UnsupportedOperationException: sun.misc.Unsafe or java.nio.DirectByteBuffer.<init>(long, int) not available
at net.snowflake.client.jdbc.internal.io.netty.util.internal.PlatformDependent.directBuffer(PlatformDependent.java:399)
at net.snowflake.client.jdbc.internal.io.netty.buffer.NettyArrowBuf.getDirectBuffer(NettyArrowBuf.java:243)
at net.snowflake.client.jdbc.internal.io.netty.buffer.NettyArrowBuf.nioBuffer(NettyArrowBuf.java:233)
at net.snowflake.client.jdbc.internal.io.netty.buffer.ArrowBuf.nioBuffer(ArrowBuf.java:247)
at net.snowflake.client.jdbc.internal.apache.arrow.vector.ipc.ReadChannel.readFully(ReadChannel.java:81)
at net.snowflake.client.jdbc.internal.apache.arrow.vector.ipc.message.MessageSerializer.readMessageBody(MessageSerializer.java:696)
at net.snowflake.client.jdbc.internal.apache.arrow.vector.ipc.message.MessageChannelReader.readNext(MessageChannelReader.java:68)
at net.snowflake.client.jdbc.internal.apache.arrow.vector.ipc.ArrowStreamReader.loadNextBatch(ArrowStreamReader.java:106)
at net.snowflake.client.jdbc.ArrowResultChunk.readArrowStream(ArrowResultChunk.java:117)
at net.snowflake.client.core.SFArrowResultSet.buildFirstChunk(SFArrowResultSet.java:352)
at net.snowflake.client.core.SFArrowResultSet.<init>(SFArrowResultSet.java:230)
at net.snowflake.client.jdbc.SnowflakeResultSetSerializableV1.getResultSet(SnowflakeResultSetSerializableV1.java:1079)
at net.snowflake.spark.snowflake.io.ResultIterator.liftedTree1$1(SnowflakeResultSetRDD.scala:85)
at net.snowflake.spark.snowflake.io.ResultIterator.<init>(SnowflakeResultSetRDD.scala:78)
at net.snowflake.spark.snowflake.io.SnowflakeResultSetRDD.compute(SnowflakeResultSetRDD.scala:41)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:127)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:464)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:467)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
Has anyone run into a similar case? If so, how did you fix it?
I ran into the same problem and was able to fix it. I figured out that io.netty.tryReflectionSetAccessible needs to be explicitly set to true on Java >= 9 for the Spark-Snowflake connector to be able to read data returned from Snowflake in Kubernetes executor pods.
Now, since the io.netty packages are shaded in snowflake-jdbc, we need to qualify the property with the full package name, i.e. net.snowflake.client.jdbc.internal.io.netty.tryReflectionSetAccessible=true.
This property needs to be set as a JVM option of the Spark executor pods. This can be accomplished by setting the executor extra JVM options Spark property. For example:
Property Name: spark.executor.extraJavaOptions
Value: -Dnet.snowflake.client.jdbc.internal.io.netty.tryReflectionSetAccessible=true
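If the session is built from PySpark rather than configured at submit time, the same property can go on the builder; a minimal sketch (only the relevant config is shown, the Snowflake connection options are omitted):
from pyspark.sql import SparkSession

# Pass the shaded property to every executor JVM (needed on Java 9+).
spark = (
    SparkSession.builder
    .appName("snowflake-read")
    .config(
        "spark.executor.extraJavaOptions",
        "-Dnet.snowflake.client.jdbc.internal.io.netty.tryReflectionSetAccessible=true",
    )
    .getOrCreate()
)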
Note: the changes below were done on a local system; they may or may not work for you.
Posting the steps for how I fixed the issue.
I had a similar issue where my Homebrew-installed Spark used openjdk@11 by default. To fix it, I changed the Java version from openjdk@11 to Oracle JDK 1.8 (you can use OpenJDK 1.8 instead of Oracle JDK 1.8).
> cat spark-submit
#!/bin/bash
JAVA_HOME="/root/.linuxbrew/opt/openjdk@11" exec "/root/.linuxbrew/Cellar/apache-spark/3.0.0/libexec/bin/pyspark" "$@"
I changed the Java version from openjdk@11 to Oracle JDK 1.8. Now my spark-submit script looks like this:
> cat spark-submit
#!/bin/bash
JAVA_HOME="/usr/share/jdk1.8.0_202" exec "/root/.linuxbrew/Cellar/apache-spark/3.0.0/libexec/bin/pyspark" "$@"
Another workaround to fix this issue: try adding the conf below to your spark-submit command.
spark-submit \
--conf 'spark.executor.extraJavaOptions=-Dio.netty.tryReflectionSetAccessible=true' \
--conf 'spark.driver.extraJavaOptions=-Dio.netty.tryReflectionSetAccessible=true' \
...

Spark + s3 - error - java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found

I have a Spark EC2 cluster where I am submitting a pyspark program from a Zeppelin notebook. I have downloaded hadoop-aws-2.7.3.jar and aws-java-sdk-1.11.179.jar and placed them in the /opt/spark/jars directory of the Spark instances. I get a java.lang.NoClassDefFoundError: com/amazonaws/AmazonServiceException.
Why is Spark not seeing the JARs? Do I have to have the JARs on all the slaves and specify a spark-defaults.conf for the master and slaves? Is there something that needs to be configured in Zeppelin to recognize the new JAR files?
I have placed the JAR files in /opt/spark/jars on the Spark master. I have created a spark-defaults.conf and added the lines:
spark.hadoop.fs.s3a.access.key [ACCESS KEY]
spark.hadoop.fs.s3a.secret.key [SECRET KEY]
spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
spark.driver.extraClassPath /opt/spark/jars/hadoop-aws-2.7.3.jar:/opt/spark/jars/aws-java-sdk-1.11.179.jar
The Zeppelin interpreter sends a spark-submit to the Spark master.
I have also placed the JARs in /opt/spark/jars on the slaves, but did not create a spark-defaults.conf there.
%spark.pyspark
#importing necessary libraries
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import StringType
from pyspark import SQLContext
from itertools import islice
from pyspark.sql.functions import col
# add aws credentials
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "[ACCESS KEY]")
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "[SECRET KEY]")
sc._jsc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
#creating the context
sqlContext = SQLContext(sc)
#reading the first csv file and store it in an RDD
rdd1= sc.textFile("s3a://filepath/baby-names.csv").map(lambda line: line.split(","))
#removing the first row as it contains the header
rdd1 = rdd1.mapPartitionsWithIndex(
lambda idx, it: islice(it, 1, None) if idx == 0 else it
)
#converting the RDD into a dataframe
df1 = rdd1.toDF(['year','name', 'percent', 'sex'])
#print the dataframe
df1.show()
Error thrown:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 7, 10.11.93.90, executor 1): java.lang.NoClassDefFoundError: com/amazonaws/AmazonServiceException
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2134)
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2099)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:108)
at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:267)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:266)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:224)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:95)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: com.amazonaws.AmazonServiceException
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 34 more
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
at org.apache.spark.api.python.PythonRDD$.runJob(PythonRDD.scala:153)
at org.apache.spark.api.python.PythonRDD.runJob(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NoClassDefFoundError: com/amazonaws/AmazonServiceException
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2134)
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2099)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:108)
at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:267)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:266)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:224)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:95)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
... 1 more
Caused by: java.lang.ClassNotFoundException: com.amazonaws.AmazonServiceException
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 34 more
I was able to address the above by making sure I had the correct version of the hadoop-aws JAR for the Hadoop version my Spark build uses, downloading the correct version of the aws-java-sdk, and lastly downloading the jets3t dependency library.
In /opt/spark/jars:
sudo wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/1.11.30/aws-java-sdk-1.11.30.jar
sudo wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.7.3/hadoop-aws-2.7.3.jar
sudo wget https://repo1.maven.org/maven2/net/java/dev/jets3t/jets3t/0.9.4/jets3t-0.9.4.jar
Testing it out
scala> sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", [ACCESS KEY ID])
scala> sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", [SECRET ACCESS KEY] )
scala> val myRDD = sc.textFile("s3n://adp-px/baby-names.csv")
scala> myRDD.count()
res2: Long = 49
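The equivalent smoke test from PySpark (a sketch mirroring the Scala snippet above; the bucket path is the question's example, and sc is the SparkContext provided by the shell or notebook):
# Same check as the Scala snippet, but from PySpark.
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "[ACCESS KEY ID]")
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "[SECRET ACCESS KEY]")

myRDD = sc.textFile("s3n://adp-px/baby-names.csv")
print(myRDD.count())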
If S3 access is via assume_role from a local cluster, then the below worked for me.
import boto3
import pyspark as pyspark
from pyspark import SparkContext
session = boto3.session.Session(profile_name='profile_name')
sts_connection = session.client('sts')
response = sts_connection.assume_role(RoleArn='arn:aws:iam:::role/role_name', RoleSessionName='role_name',DurationSeconds=3600)
credentials = response['Credentials']
conf = pyspark.SparkConf()
conf.set('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:3.2.0')  # cross-check the version
sc = SparkContext(conf=conf)
sc._jsc.hadoopConfiguration().set('fs.s3a.aws.credentials.provider', 'org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider')
sc._jsc.hadoopConfiguration().set('fs.s3a.access.key', credentials['AccessKeyId'])
sc._jsc.hadoopConfiguration().set('fs.s3a.secret.key', credentials['SecretAccessKey'])
sc._jsc.hadoopConfiguration().set('fs.s3a.session.token', credentials['SessionToken'])
url = str('s3a://data.csv')
l1 = sc.textFile(url).collect()
for each in l1:
print(str(each))
break
Also keep the proper versions of the JARs below in $SPARK_HOME/jars (a quick way to check which Hadoop version you are on follows the list):
jets3t
aws-java-sdk
hadoop-aws
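A quick check from the PySpark shell to see which Spark and Hadoop versions your installation uses, so you can pick matching hadoop-aws and aws-java-sdk JARs (this assumes the shell's spark and sc objects):
# Spark and Hadoop versions bundled with this installation.
print(spark.version)
print(sc._jvm.org.apache.hadoop.util.VersionInfo.getVersion())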
I prefer to delete unwanted jars from ~/.ivy2/jars
From the official Hadoop troubleshooting documentation:
ClassNotFoundException: org.apache.hadoop.fs.s3a.S3AFileSystem
These are Hadoop filesystem client classes, found in the `hadoop-aws`
JAR. An exception reporting this class as missing means that this JAR
is not on the classpath.
To solve this problem, you first need to know what org.apache.hadoop.fs.s3a is.
The Hadoop website explains in detail what the Hadoop-AWS module (Integration with Amazon Web Services) is, and the prerequisite for using it is having these two JARs installed under the Spark jars directory:
hadoop-aws Jar
aws-java-sdk-bundle Jar
When downloading these JARs, make sure of two things (a sketch of letting Spark resolve a matching pair follows the list):
The Hadoop version matches the hadoop-aws version; a hadoop-aws-3.x.x JAR works with Hadoop 3.x.x.
The AWS SDK for Java matches the Java version installed. Check the official AWS document for exact version requirements.
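One way to keep the pair consistent is to let Spark resolve it via spark.jars.packages, so the matching aws-java-sdk-bundle is pulled in transitively; a sketch (the 3.2.0 version and the bucket path are examples only, substitute your own Hadoop version and path):
from pyspark.sql import SparkSession

# hadoop-aws pulls a compatible aws-java-sdk-bundle in transitively; the
# version must match the Hadoop version your Spark build uses. Set this
# before the session is created.
spark = (
    SparkSession.builder
    .appName("s3a-read")
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.2.0")  # example version
    .getOrCreate()
)

df = spark.read.csv("s3a://bucket/path/file.csv", header=True)  # hypothetical path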
For more troubleshooting, you can always refer to the official Hadoop troubleshooting documentation.
The following worked for me.
My system config:
Ubuntu 16.04.6 LTS
python3.7.7
openjdk version 1.8.0_252
spark-2.4.5-bin-hadoop2.7
1. Configure the PYSPARK_PYTHON path:
Add the following line in $SPARK_HOME/conf/spark-env.sh
export PYSPARK_PYTHON=python_env_path/bin/python
2. Start pyspark:
pyspark --packages com.amazonaws:aws-java-sdk-pom:1.11.760,org.apache.hadoop:hadoop-aws:2.7.0 --conf spark.hadoop.fs.s3a.endpoint=s3.us-west-2.amazonaws.com
com.amazonaws:aws-java-sdk-pom:1.11.760: depends on your JDK version
org.apache.hadoop:hadoop-aws:2.7.0: depends on your Hadoop version
s3.us-west-2.amazonaws.com: depends on your S3 location
3. Read data from S3:
df2=spark.read.parquet("s3a://s3location_file_path")
Each Hadoop version should match its aws-java-sdk-*.jar and hadoop-aws-*.jar versions.
And every aws-java-sdk version is matched with a particular hadoop-aws-*.jar (it does not mean the same version number).
For example, aws-java-sdk-bundle-1.11.375.jar and hadoop-aws-3.2.0.jar are a matching pair.
Lastly, you should enroll the S3 domain in the hive.cnf configuration file.
If nothing above works, then cat and grep the JARs for the missing class; there is a high possibility that the JAR is corrupted.
For example, if you get class AmazonServiceException not found, then grep in the directory where the JAR is already present, as shown below.
grep "AmazonServiceException" *.jar
Add the following to the file hadoop/etc/hadoop/core-site.xml:
<property>
<name>fs.s3.awsAccessKeyId</name>
<value>***</value>
</property>
<property>
<name>fs.s3.awsSecretAccessKey</name>
<value>***</value>
</property>
Inside the Hadoop installation directory, find the AWS JARs; for a macOS installation the directory is /usr/local/Cellar/hadoop/.
find . -type f -name "*aws*"
sudo cp hadoop/share/hadoop/tools/lib/aws-java-sdk-1.7.4.jar hadoop/share/hadoop/common/lib/
sudo cp hadoop/share/hadoop/tools/lib/hadoop-aws-2.7.5.jar hadoop/share/hadoop/common/lib/

Dataproc Spark returns java.lang.UnsatisfiedLinkError: org.xerial.snappy.SnappyNative.uncompressedLength(Ljava/nio/ByteBuffer;II) when accessing Hive

I'm moving from Dataproc 1.2 to 1.3. When I created a new Spark cluster on Dataproc using image version 1.3, I got
HiveMetaException: Metastore schema version is not compatible. Hive Version: 2.3.0, Database Schema Version: 2.1.0
because of database schema incompatibility. So I SSH-ed into the Dataproc master instance and ran
schematool -dbType mysql -upgradeSchemaFrom 2.1.0
and everything worked as expected. I then recreated the Spark cluster to make sure it wouldn't throw this exception again. However, when I ran
val df = spark.sql("select * from daily_active_user_trx")
df.show
on Zeppelin notebook and spark-shell, I got the following error.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 2.0 failed 4 times, most recent failure: Lost task 2.3 in stage 2.0 (TID 249, development-cluster-w-3.c.true-dmp.internal, executor 70): java.lang.UnsatisfiedLinkError: org.xerial.snappy.SnappyNative.uncompressedLength(Ljava/nio/ByteBuffer;II)I
at org.xerial.snappy.SnappyNative.uncompressedLength(Native Method)
at org.xerial.snappy.Snappy.uncompressedLength(Snappy.java:565)
at org.apache.parquet.hadoop.codec.SnappyDecompressor.decompress(SnappyDecompressor.java:62)
at org.apache.parquet.hadoop.codec.NonBlockedDecompressorStream.read(NonBlockedDecompressorStream.java:51)
at java.io.DataInputStream.readFully(DataInputStream.java:195)
at java.io.DataInputStream.readFully(DataInputStream.java:169)
at org.apache.parquet.bytes.BytesInput$StreamBytesInput.toByteArray(BytesInput.java:205)
at org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary.<init>(PlainValuesDictionary.java:89)
at org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary.<init>(PlainValuesDictionary.java:72)
at org.apache.parquet.column.Encoding$1.initDictionary(Encoding.java:90)
at org.apache.parquet.column.Encoding$4.initDictionary(Encoding.java:149)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.<init>(VectorizedColumnReader.java:114)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:312)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:258)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:161)
at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:106)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:182)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:106)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.scan_nextBatch$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
at org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.hasNext(InMemoryRelation.scala:139)
at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1092)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1083)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1018)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1083)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:809)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
After googling, I found a similar thread but it's on CDH
http://community.cloudera.com/t5/CDH-Manual-Installation/spark2-upgrade-to-2-3-0-from-2-2-0-wont-read-or-write-snappy/td-p/66735
I tried adding snappy-java-1.1.4.jar to /usr/lib/spark/jars on the master node as suggested, but it didn't work.
thanks
Peernat F.
This is SPARK-24018, which the Dataproc team is currently working on addressing.
I believe that to fix it you need the JAR on all workers, not just the master, which is why your fix did not work.
I would recommend a simple initialization action of:
rm -f /usr/lib/spark/jars/snappy*
wget https://repo1.maven.org/maven2/org/xerial/snappy/snappy-java/1.1.2.6/snappy-java-1.1.2.6.jar \
-P /usr/lib/spark/jars
This should roll out to new Dataproc 1.3 images in a couple of weeks, once we are sure we fully understand the issue.
