Why does spark-submit ignore the package that I include as part of the configuration of my spark session? - apache-spark

I am trying to include the org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.3 package as part of my Spark code (via the SparkSession builder). I understand that I can download the JAR myself and include it, but I would like to figure out why the following is not working as expected:
from pyspark.sql import SparkSession
import pyspark
import json

if __name__ == "__main__":
    spark = SparkSession.builder \
        .master("local") \
        .appName("App Name") \
        .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.3") \
        .getOrCreate()

    df = spark \
        .readStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", "localhost:9092") \
        .option("subscribe", "first_topic") \
        .load() \
        .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

    query = df \
        .writeStream \
        .format("console") \
        .outputMode("update") \
        .start()
When I run the job:
spark-submit main.py
I receive the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o48.load.
: org.apache.spark.sql.AnalysisException: Failed to find data source: kafka. Please deploy the application as per the deployment section of "Structured Streaming + Kafka Integration Guide".;
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:652)
at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:161)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
If I instead include the packages via the --packages flag, the dependencies are downloaded and the code runs as expected:
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.3 main.py
The code also works if I open the PySpark shell and paste the code above. Is there a reason that spark-submit ignores the configuration?

I think that configurations like "spark.jars.packages" should be set either in spark-defaults.conf or passed as command-line arguments; setting them at runtime shouldn't work.
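For example, the spark-defaults.conf entry would be something like this (using the coordinate from the question):
spark.jars.packages  org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.3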
Against my better judgement, I will add that I remember some people claiming that something like this worked for them, but I would say the dependency was already present locally (e.g. already downloaded into the local repo) and was simply being loaded from there:
conf = pyspark.SparkConf()
conf.set("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.3")

spark = SparkSession.builder \
    .master("local") \
    .appName("App Name") \
    .config(conf=conf) \
    .getOrCreate()

When you run spark-submit, it already creates a SparkSession that is reused by your code - thus you have to provide everything through spark-submit.
However, you do not need to actually use spark-submit to run your Spark code. Assuming your main method looks like this:
def main():
    spark = SparkSession.builder.config(...).getOrCreate()
    # your spark code below
    ...
You can run this code just via python:
> python ./my_spark_script.py
This will run your program correctly.

I faced the same problem; after googling, I found this link:
https://issues.apache.org/jira/browse/SPARK-21752
According to Sean R. Owen (srowen): "At that point, your app has already launched. You can't change the driver classpath."
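In other words, the driver classpath is fixed once the JVM is up. A minimal sketch of a workaround, assuming the script is launched with plain python rather than spark-submit, is to inject the package through PYSPARK_SUBMIT_ARGS before the first SparkSession is created:
import os

# Must run before the first SparkSession (and hence the JVM) is created.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.3 pyspark-shell"
)

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local") \
    .appName("App Name") \
    .getOrCreate()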

Related

Class org.apache.hadoop.fs.s3a.auth.IAMInstanceCredentialsProvider not found when trying to write data on S3 bucket from Spark

I am trying to write data to an S3 bucket from my local computer:
spark = SparkSession.builder \
    .appName('application') \
    .config("spark.hadoop.fs.s3a.access.key", configuration.AWS_ACCESS_KEY_ID) \
    .config("spark.hadoop.fs.s3a.secret.key", configuration.AWS_ACCESS_SECRET_KEY) \
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .getOrCreate()

lines = spark.readStream \
    .format('kafka') \
    .option('kafka.bootstrap.servers', kafka_server) \
    .option('subscribe', kafka_topic) \
    .option("startingOffsets", "earliest") \
    .load()

streaming_query = lines.writeStream \
    .format('parquet') \
    .outputMode('append') \
    .option('path', configuration.S3_PATH) \
    .start()

streaming_query.awaitTermination()
Hadoop version: 3.2.1
Spark version: 3.2.1
I have added the dependency jars to the pyspark jars directory:
spark-sql-kafka-0-10_2.12:3.2.1
aws-java-sdk-s3:1.11.375
hadoop-aws:3.2.1
I get the following error when executing:
py4j.protocol.Py4JJavaError: An error occurred while calling o68.start.
: java.io.IOException: From option fs.s3a.aws.credentials.provider
java.lang.ClassNotFoundException: Class
org.apache.hadoop.fs.s3a.auth.IAMInstanceCredentialsProvider not found
In my case, it worked in the end by adding the following statement:
.config('spark.hadoop.fs.s3a.aws.credentials.provider', 'org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider')
Also, all the Hadoop jars in site-packages/pyspark/jars must be the same version: hadoop-aws-3.2.2, hadoop-client-api-3.2.2, hadoop-client-runtime-3.2.2, hadoop-yarn-server-web-proxy-3.2.2.
For version 3.2.2 of hadoop-aws, the aws-java-sdk-s3:1.11.563 package is needed.
Also I replaced guava-14.0.jar with guava-23.0.jar.
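Putting the credentials-provider fix together with the builder from the question, the relevant part might look like this (a sketch that reuses the question's configuration object):
spark = SparkSession.builder \
    .appName('application') \
    .config("spark.hadoop.fs.s3a.access.key", configuration.AWS_ACCESS_KEY_ID) \
    .config("spark.hadoop.fs.s3a.secret.key", configuration.AWS_ACCESS_SECRET_KEY) \
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .config("spark.hadoop.fs.s3a.aws.credentials.provider",
            "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider") \
    .getOrCreate()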
I used the same packages as you. In my case, when I added the line below:
.config('spark.hadoop.fs.s3a.aws.credentials.provider', 'org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider')
I got this error:
py4j.protocol.Py4JJavaError: An error occurred while calling o56.parquet.
: java.lang.NoSuchMethodError: 'void com.google.common.base.Preconditions.checkArgument(boolean, java.lang.String, java.lang.Object, java.lang.Object)'
....
To solve this, I installed guava-30.0.
Try downloading the jar libraries below and putting them into spark/jars:
https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.2.2/hadoop-aws-3.2.2.jar
https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.563/aws-java-sdk-bundle-1.11.563.jar
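If you want to script that download, here is a minimal sketch (it assumes SPARK_HOME is set and points at your Spark installation):
import os
import urllib.request

# Download the two jars listed above into $SPARK_HOME/jars.
jars_dir = os.path.join(os.environ["SPARK_HOME"], "jars")
urls = [
    "https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.2.2/hadoop-aws-3.2.2.jar",
    "https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.563/aws-java-sdk-bundle-1.11.563.jar",
]
for url in urls:
    urllib.request.urlretrieve(url, os.path.join(jars_dir, url.rsplit("/", 1)[-1]))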

Set up jupyter on EMR to read from cassandra using cql?

When I try to set the spark context in jupyter with
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages datastax:spark-cassandra-connector:2.4.0-s_2.11 --conf spark.cassandra.connection.host=x.x.x.x pyspark-shell'
or
spark = SparkSession.builder \
    .appName('SparkCassandraApp') \
    .config('spark.cassandra.connection.host', 'x.x.x.x') \
    .config('spark.cassandra.connection.port', 'xxxx') \
    .config('spark.cassandra.output.consistency.level', 'ONE') \
    .master('local[2]') \
    .getOrCreate()
I still cannot make a connection to the cassandra cluster with the code
dataFrame = spark.read.format("org.apache.spark.sql.cassandra").option("keyspace", "keyspace").option("table", "table").load()
dataFrame = dataFrame.limit(100)
dataFrame.show()
Comes up with error:
An error was encountered:
An error occurred while calling o103.load.
: java.lang.ClassNotFoundException: Failed to find data source: org.apache.spark.sql.cassandra.
Please find packages at http://spark.apache.org/third-party-projects.html
A similar question was asked here: modify jupyter kernel to add cassandra connection in spark, but I do not see a valid answer.

Cannot run spark-nlp due to Exception: Java gateway process exited before sending its port number

I have a working PySpark installation running through Jupyter on an Ubuntu VM.
There is only one Java version installed (openjdk version "1.8.0_265"), and I can run a local Spark (v2.4.4) session like this without problems:
import pyspark
from pyspark.sql import SparkSession

memory_gb = 24
conf = (
    pyspark.SparkConf()
    .setMaster('local[*]')
    .set('spark.driver.memory', '{}g'.format(memory_gb))
)

spark = SparkSession \
    .builder \
    .appName("My Name") \
    .config(conf=conf) \
    .getOrCreate()
Now I want to use spark-nlp. I've installed spark-nlp using pip install spark-nlp in the same virtual environment my Pyspark is in.
However, when I try to use it, I get the error Exception: Java gateway process exited before sending its port number.
I've tried to follow the instructions in the documentation here, but without success.
So doing
spark = SparkSession \
    .builder \
    .appName("RevDNS Stats") \
    .config(conf=conf) \
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.11:2.5.5") \
    .getOrCreate()
only results in the error mentioned above.
How do I fix this?

Spark Cassandra Connector Error: java.lang.NoClassDefFoundError: com/datastax/spark/connector/TableRef

Spark version: 3.0.0
Scala: 2.12
Cassandra: 3.11.4
spark-cassandra-connector_2.12-3.0.0-alpha2.jar
I am not using DSE. Below is my test code to write the dataframe into my Cassandra database.
spark = SparkSession \
    .builder \
    .config("spark.jars", "spark-streaming-kafka-0-10_2.12-3.0.0.jar,spark-sql-kafka-0-10_2.12-3.0.0.jar,kafka-clients-2.5.0.jar,commons-pool2-2.8.0.jar,spark-token-provider-kafka-0-10_2.12-3.0.0.jar,spark-cassandra-connector_2.12-3.0.0-alpha2.jar") \
    .config("spark.cassandra.connection.host", "127.0.0.1") \
    .config('spark.cassandra.output.consistency.level', 'ONE') \
    .appName("StructuredNetworkWordCount") \
    .getOrCreate()

streamingInputDF = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "192.168.56.1:9092") \
    .option("subscribe", "def") \
    .load()

## Dataset operations

def write_to_cassandra(streaming_df, E):
    streaming_df \
        .write \
        .format("org.apache.spark.sql.cassandra") \
        .options(table="a", keyspace="abc") \
        .save()

q1 = sites_flat.writeStream \
    .outputMode('update') \
    .foreachBatch(write_to_cassandra) \
    .start()

q1.awaitTermination()
I am able to perform some operations on the dataframe and print it to the console, but I am not able to save it to, or even read from, my Cassandra database. The error I am getting is:
File "C:\opt\spark-3.0.0-bin-hadoop2.7\python\lib\py4j-0.10.9-src.zip\py4j\protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o70.load.
: java.lang.NoClassDefFoundError: com/datastax/spark/connector/TableRef
at org.apache.spark.sql.cassandra.DefaultSource$.TableRefAndOptions(DefaultSource.scala:142)
at org.apache.spark.sql.cassandra.DefaultSource.createRelation(DefaultSource.scala:56)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:339)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:279)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:268)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:268)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:203)
I have tried another Cassandra connector version (2.5) but I get the same error. Please help!
The problem is that you're using the spark.jars option, which puts only the listed jars onto the classpath. But the TableRef case class lives in the spark-cassandra-connector-driver package, which is a dependency of spark-cassandra-connector. To fix this, it's better to start pyspark or spark-submit with --packages com.datastax.spark:spark-cassandra-connector_2.12:3.0.0-alpha2 (and do the same for the Kafka support); in this case Spark will fetch all necessary dependencies and put them onto the classpath.
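For example (the script name is just a placeholder; the Kafka coordinate matches the 3.0.0 jars listed in the question):
spark-submit --packages com.datastax.spark:spark-cassandra-connector_2.12:3.0.0-alpha2,org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0 your_script.py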
P.S. With the alpha2 release you may get problems with fetching some dependencies, like ffi, groovy, etc.; this is a known bug (mostly in Spark), SPARKC-599, which is already fixed, and we'll hopefully get a beta drop very soon.
Update (14.03.2021): It's better to use the assembly version of SCC, which includes all necessary dependencies.
P.P.S. For writing to Cassandra from Spark Structured Streaming, don't use foreachBatch; just use it as a normal data sink:
val query = streamingCountsDF.writeStream
  .outputMode(OutputMode.Update)
  .format("org.apache.spark.sql.cassandra")
  .option("checkpointLocation", "webhdfs://192.168.0.10:5598/checkpoint")
  .option("keyspace", "test")
  .option("table", "sttest_tweets")
  .start()
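Since the question uses PySpark, a rough Python equivalent of the same sink might look like this (assuming a streaming DataFrame named streaming_counts_df; the checkpoint path, keyspace and table are the example values from the Scala snippet above):
query = streaming_counts_df.writeStream \
    .outputMode("update") \
    .format("org.apache.spark.sql.cassandra") \
    .option("checkpointLocation", "webhdfs://192.168.0.10:5598/checkpoint") \
    .option("keyspace", "test") \
    .option("table", "sttest_tweets") \
    .start()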
I ran into the same problem; try this:
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.12</artifactId>
    <version>2.4.3</version>
</dependency>
Version compatibility is presumed to be the cause.

Reading data from S3 using pyspark throws java.lang.NumberFormatException: For input string: "100M"

I am using the following code to read some json data from S3:
df = spark_sql_context.read.json("s3a://test_bucket/test.json")
df.show()
The above code throws the following exception:
py4j.protocol.Py4JJavaError: An error occurred while calling o64.json.
: java.lang.NumberFormatException: For input string: "100M"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Long.parseLong(Long.java:589)
at java.lang.Long.parseLong(Long.java:631)
at org.apache.hadoop.conf.Configuration.getLong(Configuration.java:1538)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:248)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3303)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:547)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:545)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.immutable.List.flatMap(List.scala:355)
at org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:545)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:359)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:391)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
I have read several other SO posts on this topic (like this one or this) and have done everything they mention, but nothing seems to fix my issue.
I am using spark-2.4.4-bin-without-hadoop and hadoop-3.1.2. As for the jar files, I've got:
aws-java-sdk-bundle-1.11.199.jar
hadoop-aws-3.0.0.jar
hadoop-common-3.0.0.jar
Also, using the following spark-submit command to run the code:
/opt/spark-2.4.4-bin-without-hadoop/bin/spark-submit
--conf spark.app.name=read_json --master yarn --deploy-mode client --num-executors 2
--executor-cores 2 --executor-memory 2G --driver-cores 2 --driver-memory 1G
--jars /home/my_project/jars/aws-java-sdk-bundle-1.11.199.jar,
/home/my_project/jars/hadoop-aws-3.0.0.jar,/home/my_project/jars/hadoop-common-3.0.0.jar
--conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" --conf "spark.rpc.askTimeout=600s" /home/my_project/read_json.py
Anything I might be missing here?
From the stack trace, the error is thrown while reading one of the configuration options, so the issue is with one of the default configuration options that now requires a numeric format.
In my case the error was resolved after I added the following configuration parameter to the spark-submit command:
--conf fs.s3a.multipart.size=104857600
See Tuning S3A Uploads.
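If you would rather set this in code than on the command line, a sketch of the equivalent builder configuration (the spark.hadoop. prefix forwards the option to the Hadoop configuration; the app name is taken from the spark-submit command in the question):
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("read_json") \
    .config("spark.hadoop.fs.s3a.multipart.size", "104857600") \
    .getOrCreate()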
I am posting what I ended up doing to fix the issue, for anyone who might see the same exception:
I added hadoop-aws to HADOOP_OPTIONAL_TOOLS in hadoop-env.sh. I also removed all the s3a configurations in Spark except the access/secret keys, and everything worked. My code before the changes:
# Setup the Spark Process
conf = SparkConf() \
    .setAppName(app_name) \
    .set("spark.hadoop.mapred.output.compress", "true") \
    .set("spark.hadoop.mapred.output.compression.codec", "true") \
    .set("spark.hadoop.mapred.output.compression.codec", "org.apache.hadoop.io.compress.GzipCodec") \
    .set("spark.hadoop.mapred.output.compression.type", "BLOCK") \
    .set("spark.speculation", "false") \
    .set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.BasicAWSCredentialsProvider") \
    .set("com.amazonaws.services.s3.enableV4", "true")

# Some other configs
spark_context._jsc.hadoopConfiguration().set(
    "fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem"
)
spark_context._jsc.hadoopConfiguration().set(
    "fs.s3a.access.key", s3_key
)
spark_context._jsc.hadoopConfiguration().set(
    "fs.s3a.secret.key", s3_secret
)
spark_context._jsc.hadoopConfiguration().set(
    "fs.s3a.multipart.size", "104857600"
)
And after:
# Setup the Spark Process
conf = SparkConf() \
    .setAppName(app_name) \
    .set("spark.hadoop.mapred.output.compress", "true") \
    .set("spark.hadoop.mapred.output.compression.codec", "true") \
    .set("spark.hadoop.mapred.output.compression.codec", "org.apache.hadoop.io.compress.GzipCodec") \
    .set("spark.hadoop.mapred.output.compression.type", "BLOCK") \
    .set("spark.speculation", "false")

# Some other configs
spark_context._jsc.hadoopConfiguration().set(
    "fs.s3a.access.key", s3_key
)
spark_context._jsc.hadoopConfiguration().set(
    "fs.s3a.secret.key", s3_secret
)
That probably means it was a classpath issue: hadoop-aws wasn't getting added to the classpath, so under the covers it was defaulting to some other implementation of S3AFileSystem.java. Hadoop and Spark are a huge pain in this area because there are so many different places and ways to load things, and Java is particular about the order as well; if loading doesn't happen in the right order, it will just go with whatever was loaded last. I hope this helps others facing the same issue.
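For reference, the hadoop-env.sh change mentioned above would presumably be a line along these lines (adjust to include whatever optional tools you already enable):
export HADOOP_OPTIONAL_TOOLS="hadoop-aws"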
