Class org.apache.hadoop.fs.s3a.auth.IAMInstanceCredentialsProvider not found when trying to write data to an S3 bucket from Spark

I am trying to write data to an S3 bucket from my local computer:
spark = SparkSession.builder \
    .appName('application') \
    .config("spark.hadoop.fs.s3a.access.key", configuration.AWS_ACCESS_KEY_ID) \
    .config("spark.hadoop.fs.s3a.secret.key", configuration.AWS_ACCESS_SECRET_KEY) \
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .getOrCreate()

lines = spark.readStream \
    .format('kafka') \
    .option('kafka.bootstrap.servers', kafka_server) \
    .option('subscribe', kafka_topic) \
    .option("startingOffsets", "earliest") \
    .load()

streaming_query = lines.writeStream \
    .format('parquet') \
    .outputMode('append') \
    .option('path', configuration.S3_PATH) \
    .start()

streaming_query.awaitTermination()
Hadoop version: 3.2.1
Spark version: 3.2.1
I have added the dependency jars to the pyspark jars folder:
spark-sql-kafka-0-10_2.12:3.2.1
aws-java-sdk-s3:1.11.375
hadoop-aws:3.2.1
I get the following error when executing:
py4j.protocol.Py4JJavaError: An error occurred while calling o68.start.
: java.io.IOException: From option fs.s3a.aws.credentials.provider
java.lang.ClassNotFoundException: Class
org.apache.hadoop.fs.s3a.auth.IAMInstanceCredentialsProvider not found

In my case, it worked in the end by adding the following statement:
.config('spark.hadoop.fs.s3a.aws.credentials.provider', 'org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider')
Also, all the Hadoop jars in site-packages/pyspark/jars must be the same version: hadoop-aws-3.2.2, hadoop-client-api-3.2.2, hadoop-client-runtime-3.2.2, hadoop-yarn-server-web-proxy-3.2.2.
For version 3.2.2 of hadoop-aws, the aws-java-sdk-s3:1.11.563 package is needed.
Also, I replaced guava-14.0.jar with guava-23.0.jar.
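Putting the answer together, a minimal sketch of the full builder with the credentials provider set explicitly (untested; the two key variables are placeholders, not anything from the original post):
from pyspark.sql import SparkSession

# Sketch only: AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are placeholder variables.
spark = SparkSession.builder \
    .appName('application') \
    .config("spark.hadoop.fs.s3a.access.key", AWS_ACCESS_KEY_ID) \
    .config("spark.hadoop.fs.s3a.secret.key", AWS_SECRET_ACCESS_KEY) \
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .config("spark.hadoop.fs.s3a.aws.credentials.provider",
            "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider") \
    .getOrCreate()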

I used the same packages as you. In my case, when I added the line below:
config('spark.hadoop.fs.s3a.aws.credentials.provider', 'org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider')
I got this error:
py4j.protocol.Py4JJavaError: An error occurred while calling o56.parquet.
: java.lang.NoSuchMethodError: 'void com.google.common.base.Preconditions.checkArgument(boolean, java.lang.String, java.lang.Object, java.lang.Object)'
....
To solve this, I installed guava-30.0.
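If it helps to see what is actually on the classpath, here is a small sketch that lists the hadoop-* and guava jars bundled with a pip-installed pyspark (it assumes the standard site-packages/pyspark/jars layout mentioned above):
import glob
import os

import pyspark

# Locate the jars folder of the pip-installed pyspark distribution.
jars_dir = os.path.join(os.path.dirname(pyspark.__file__), "jars")

# Print every hadoop-* and guava jar so mismatched versions are easy to spot.
for jar in sorted(glob.glob(os.path.join(jars_dir, "*.jar"))):
    name = os.path.basename(jar)
    if name.startswith("hadoop-") or name.startswith("guava"):
        print(name)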

Try downloading these jars and putting them into spark/jars:
https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.2.2/hadoop-aws-3.2.2.jar
https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.563/aws-java-sdk-bundle-1.11.563.jar
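If you would rather script the download, here is a small sketch using only the standard library (the destination directory is an assumption; point it at wherever your Spark jars actually live):
import os
import urllib.request

import pyspark

# Assumed destination: the jars folder of a pip-installed pyspark; adjust as needed.
jars_dir = os.path.join(os.path.dirname(pyspark.__file__), "jars")

urls = [
    "https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.2.2/hadoop-aws-3.2.2.jar",
    "https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.563/aws-java-sdk-bundle-1.11.563.jar",
]
for url in urls:
    # Save each jar under its original filename next to the other Spark jars.
    urllib.request.urlretrieve(url, os.path.join(jars_dir, url.rsplit("/", 1)[-1]))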

Related

Failed to run PySpark's .withColumn function

I am trying to run PySpark in PyCharm on Windows 10, but I keep getting a weird JVM-related error message on Node 81 when trying to execute the simple functions .withColumn() and .withColumnRenamed(). I have a tmp folder on my desktop (see the attached image), and I set all the environment variables for HADOOP_PATH, JAVA_HOME, PATH, PYTHON_PATH and SPARK_HOME. I was also able to create the Spark object with the following lines of code:
spark = SparkSession \
    .builder \
    .master("local[*]") \
    .appName("Data Est") \
    .config("spark.driver.memory", "4g") \
    .config("spark.sql.shuffle.partitions", 400) \
    .config("spark.sql.broadcastTimeout", -1) \
    .config("spark.sql.session.timezone", "UTC") \
    .config("spark.local.dir", "[some directory path on desktop]") \
    .getOrCreate()
System Environment Variables - Windows 10 64-bit

PySpark is unable to find the bigquery data source

This is my PySpark configuration. I've followed the steps mentioned here and didn't create a SparkContext.
spark = SparkSession \
    .builder \
    .appName(appName) \
    .config(conf=spark_conf) \
    .config('spark.jars.packages', 'com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.22.0') \
    .config('spark.jars.packages', 'com.google.cloud.bigdataoss:gcsio:1.5.4') \
    .config('spark.jars', 'gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar,spark-bigquery-with-dependencies_2.12-0.21.1.jar,spark-bigquery-latest_2.11.jar') \
    .config('spark.jars', 'postgresql-42.2.23.jar,bigquery-connector-hadoop2-latest.jar') \
    .getOrCreate()
Then, when I try to write a demo Spark dataframe to BigQuery:
df.write.format('bigquery') \
    .mode(mode) \
    .option("credentialsFile", "creds.json") \
    .option('table', table) \
    .option("temporaryGcsBucket", bucket) \
    .save()
It throws an error:
File "c:\sparktest\vnenv\lib\site-packages\py4j\protocol.py", line 326, in get_return_value
raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o60.save.
: java.lang.ClassNotFoundException: Failed to find data source: bigquery. Please find packages at http://spark.apache.org/third-party-projects.html
My problem was with faulty jar versions. I am using Spark 3.1.2 and Hadoop 3.2; these are the Maven packages and code that worked for me.
# Note: the jars listed under spark.jars have to be downloaded manually.
spark = SparkSession \
    .builder \
    .master('local') \
    .appName('spark-read-from-bigquery') \
    .config('spark.jars.packages', 'com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.22.0,com.google.cloud.bigdataoss:gcs-connector:hadoop3-1.9.5,com.google.guava:guava:r05') \
    .config('spark.jars', 'guava-11.0.1.jar,gcsio-1.9.0-javadoc.jar') \
    .getOrCreate()
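As a usage sketch on top of that session (hedged; the table reference below is a placeholder, and writes additionally need the credentialsFile and temporaryGcsBucket options shown in the question):
# Sketch only: 'project.dataset.table' is a placeholder BigQuery table reference.
df = spark.read.format('bigquery') \
    .option('table', 'project.dataset.table') \
    .load()

df.show()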

Not able to configure custom hive-metastore-client in spark

We are facing some challenges working with Spark and Hive.
We need to connect to the Hive metastore from Spark, and we have to use a custom hive-metastore client inside Spark.
code snippet:
spark = SparkSession \
    .builder \
    .config('spark.sql.hive.metastore.version', '3.1.2') \
    .config('spark.sql.hive.metastore.jars', 'hive-standalone-metastore-3.1.2-sqlquery.jar') \
    .config("spark.yarn.dist.jars", "hive-exec-3.1.2.jar") \
    .config('hive.metastore.uris', "thrift://localhost:9090") \
The above code works with the built-in hive-metastore client but fails with the custom one, with this error:
py4j.protocol.Py4JJavaError: An error occurred while calling o79.databaseExists.
: java.lang.NoClassDefFoundError: org/apache/hadoop/hive/ql/metadata/HiveException
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hive.ql.metadata.HiveException
To resolve this issue, we have configured the custom hive-exec jar in code and also tried to pass it on the command line:
/usr/local/Cellar/apache-spark/3.1.2/libexec/bin/spark-submit spark-session.py --jars hive-exec-3.1.2.jar
/usr/local/Cellar/apache-spark/3.1.2/libexec/bin/spark-submit spark-session.py --conf spark.executor.extraClassPath hive-exec-3.1.2.jar
/usr/local/Cellar/apache-spark/3.1.2/libexec/bin/spark-submit spark-session.py --driver-class-path hive-exec-3.1.2.jar
But the issue is still not resolved; any suggestion would help.
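One more untested variant to try, sketched below: attach the custom jars directly in the builder (the jar paths are assumed to be relative to the working directory, and Hive support is enabled on the session). Note also that spark-submit normally expects options such as --jars and --conf to come before the application file; arguments placed after spark-session.py are passed to the script itself.
from pyspark.sql import SparkSession

# Sketch only: jar file names taken from the question, paths assumed relative.
spark = SparkSession \
    .builder \
    .config('spark.sql.hive.metastore.version', '3.1.2') \
    .config('spark.sql.hive.metastore.jars', 'hive-standalone-metastore-3.1.2-sqlquery.jar') \
    .config('spark.jars', 'hive-exec-3.1.2.jar') \
    .config('hive.metastore.uris', 'thrift://localhost:9090') \
    .enableHiveSupport() \
    .getOrCreate()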

Spark Cassandra Connector Error: java.lang.NoClassDefFoundError: com/datastax/spark/connector/TableRef

Spark version: 3.0.0
Scala: 2.12
Cassandra: 3.11.4
spark-cassandra-connector_2.12-3.0.0-alpha2.jar
I am not using DSE. Below is my test code to write the dataframe into my Cassandra database.
spark = SparkSession \
    .builder \
    .config("spark.jars", "spark-streaming-kafka-0-10_2.12-3.0.0.jar,spark-sql-kafka-0-10_2.12-3.0.0.jar,kafka-clients-2.5.0.jar,commons-pool2-2.8.0.jar,spark-token-provider-kafka-0-10_2.12-3.0.0.jar,spark-cassandra-connector_2.12-3.0.0-alpha2.jar") \
    .config("spark.cassandra.connection.host", "127.0.0.1") \
    .config('spark.cassandra.output.consistency.level', 'ONE') \
    .appName("StructuredNetworkWordCount") \
    .getOrCreate()
streamingInputDF = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "192.168.56.1:9092") \
    .option("subscribe", "def") \
    .load()
## Dataset operations
def write_to_cassandra(streaming_df, E):
    streaming_df \
        .write \
        .format("org.apache.spark.sql.cassandra") \
        .options(table="a", keyspace="abc") \
        .save()

q1 = sites_flat.writeStream \
    .outputMode('update') \
    .foreachBatch(write_to_cassandra) \
    .start()

q1.awaitTermination()
I am able to do some operations on the dataframe and print it to the console, but I am not able to save it to, or even read it from, my Cassandra database. The error I am getting is:
File "C:\opt\spark-3.0.0-bin-hadoop2.7\python\lib\py4j-0.10.9-src.zip\py4j\protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o70.load.
: java.lang.NoClassDefFoundError: com/datastax/spark/connector/TableRef
at org.apache.spark.sql.cassandra.DefaultSource$.TableRefAndOptions(DefaultSource.scala:142)
at org.apache.spark.sql.cassandra.DefaultSource.createRelation(DefaultSource.scala:56)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:339)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:279)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:268)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:268)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:203)
I have tried another Cassandra connector version (2.5) but am getting the same error.
Please help!
The problem is that you're using the spark.jars option, which puts only the listed jars on the classpath, without their transitive dependencies. But the TableRef case class is in the spark-cassandra-connector-driver package, which is a dependency of spark-cassandra-connector. To fix this problem, it's better to start pyspark or spark-submit with --packages com.datastax.spark:spark-cassandra-connector_2.12:3.0.0-alpha2 (same for Kafka support) - in this case Spark will fetch all necessary dependencies & put them on the classpath.
P.S. With the alpha2 release you may get problems fetching some dependencies, like ffi, groovy, etc. - this is a known bug (mostly in Spark), SPARKC-599, that is already fixed, and we'll hopefully get a beta release very soon.
Update (14.03.2021): it's better to use the assembly version of SCC, which includes all necessary dependencies.
P.P.S. For writing to Cassandra from Spark Structured Streaming, don't use foreachBatch; just use it as a normal data sink:
val query = streamingCountsDF.writeStream
  .outputMode(OutputMode.Update)
  .format("org.apache.spark.sql.cassandra")
  .option("checkpointLocation", "webhdfs://192.168.0.10:5598/checkpoint")
  .option("keyspace", "test")
  .option("table", "sttest_tweets")
  .start()
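Since the question is in PySpark, here is a hedged sketch of the same direct-sink approach in Python. The keyspace and table reuse the names from the question's write_to_cassandra function, streaming_df stands in for whatever dataframe you are writing, and the checkpoint path is just a placeholder:
# Sketch: write the streaming dataframe straight to Cassandra, no foreachBatch needed.
query = streaming_df.writeStream \
    .outputMode('update') \
    .format("org.apache.spark.sql.cassandra") \
    .option("checkpointLocation", "/tmp/checkpoint") \
    .option("keyspace", "abc") \
    .option("table", "a") \
    .start()

query.awaitTermination()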
I ran into the same problem; try this:
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.12</artifactId>
    <version>2.4.3</version>
</dependency>
Version compatibility is presumed to be the cause.

Why does spark-submit ignore the package that I include as part of the configuration of my spark session?

I am trying to include the org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.3 package as part of my Spark code (via the SparkSession builder). I understand that I can download the JAR myself and include it, but I would like to figure out why the following is not working as expected:
from pyspark.sql import SparkSession
import pyspark
import json
if __name__ == "__main__":
    spark = SparkSession.builder \
        .master("local") \
        .appName("App Name") \
        .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.3") \
        .getOrCreate()

    df = spark \
        .readStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", "localhost:9092") \
        .option("subscribe", "first_topic") \
        .load() \
        .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

    query = df \
        .writeStream \
        .format("console") \
        .outputMode("update") \
        .start()
When I run the job:
spark-submit main.py
I receive the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o48.load.
: org.apache.spark.sql.AnalysisException: Failed to find data source: kafka. Please deploy the application as per the deployment section of "Structured Streaming + Kafka Integration Guide".;
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:652)
at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:161)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
If I instead include the packages via the --packages flag, the dependencies are downloaded and the code runs as expected:
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.3 main.py
The code also works if I open the PySpark shell and paste the code above. Is there a reason that spark-submit ignores the configuration?
I think that configurations like "spark.jars.packages" should be set either in spark-defaults or passed as command-line arguments; setting them at runtime shouldn't work.
Against better judgement: I remember some people claimed that something like this worked for them, but I would say the dependency was already there (installed in the local repo) and merely got loaded.
conf = pyspark.SparkConf()
conf.set("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.3")

spark = SparkSession.builder \
    .master("local") \
    .appName("App Name") \
    .config(conf=conf) \
    .getOrCreate()
When you run spark-submit, it already creates a SparkSession that is reused by your code - thus you have to provide everything through spark-submit.
However, you do not need to actually use spark-submit to run your Spark code. Assuming your main method looks like this:
def main():
    spark = SparkSession.builder.config(...).getOrCreate()
    # your spark code below
    ...
You can run this code just via python:
> python ./my_spark_script.py
This will run your program correctly.
I faced the same problem; after googling, I found this link:
https://issues.apache.org/jira/browse/SPARK-21752
According to Sean R. Owen (srowen): "At that point, your app has already launched. You can't change the driver classpath."
