Running a PySpark app on a Hadoop cluster yields java.lang.NoClassDefFoundError - apache-spark

Whenever I run it, this error shows up:
Traceback (most recent call last):
File "~/test-tung/spark_tf.py", line 69, in <module>
'spark_tf').master('yarn').getOrCreate()
File "~/main-projects/spark/spark-3.0.0-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/sql/session.py", line 186, in getOrCreate
File "~/main-projects/spark/spark-3.0.0-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/context.py", line 371, in getOrCreate
File "~/main-projects/spark/spark-3.0.0-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/context.py", line 131, in __init__
File "~/main-projects/spark/spark-3.0.0-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/context.py", line 193, in _do_init
File "~/main-projects/spark/spark-3.0.0-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/context.py", line 310, in _initialize_context
File "~/main-projects/spark/spark-3.0.0-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1569, in __call__
File "~/main-projects/spark/spark-3.0.0-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.NoClassDefFoundError: org/spark_project/guava/base/Preconditions
Part of my Python app spark_tf.py:
spark = SparkSession.builder.appName(
'spark_tf').master('yarn').getOrCreate()
model = tf.keras.models.load_model('./model/kdd_binary.h5')
weights = model.get_weights()
config = model.get_config()
bc_weights = spark.sparkContext.broadcast(weights)
bc_config = spark.sparkContext.broadcast(config)
scheme = StructType().add('#timestamp', StringType()).add('#address', StringType())
stream = spark.readStream.format('kafka') \
.option('kafka.bootstrap.servers', 'my-host:9092') \
.option('subscribe', 'dltest') \
.load() \
.selectExpr("CAST(value AS STRING)") \
.select(from_json('value', scheme).alias('json'),
online_predict('value').alias('result')) \
.select(to_json(struct('result', 'json.#timestamp', 'json.#address'))
.alias('value'))
x = stream.writeStream \
.format('kafka') \
.option("kafka.bootstrap.servers", 'my-host:9092') \
.option('topic', 'dlpred') \
.option('checkpointLocation', './kafka_checkpoint') \
.start()
x.awaitTermination()
My submit line: spark-submit --deploy-mode client --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0 spark_tf.py
I think it's probably caused by an improper Spark setup, but I don't know what exactly.
EDIT: I think this code apparently runs the driver on the client instead of the Hadoop cluster, but running it on the cluster yields the same error.
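No answer is posted for this one, but org/spark_project/guava is the Guava relocation used by Spark 2.x, so a Spark 3.0.0 client hitting it usually means YARN is mixing in jars from another Spark installation. A minimal sketch, assuming you first upload the 3.0.0 jars from $SPARK_HOME/jars to an HDFS directory of your choosing (the path below is hypothetical), is to pin spark.yarn.jars so the containers use exactly that jar set:
from pyspark.sql import SparkSession

# Assumed HDFS location: populate it with $SPARK_HOME/jars/* from the
# Spark 3.0.0 install before submitting.
spark = (SparkSession.builder
         .appName('spark_tf')
         .master('yarn')
         .config('spark.yarn.jars', 'hdfs:///spark/jars/3.0.0/*')
         .getOrCreate())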

Related

Spark expecting HDFS location instead of Local Dir

I am trying to run a Spark streaming job, but I am getting the issue below. Please help.
from pyspark.sql import SparkSession
if __name__ == "__main__":
print("Application started")
spark = SparkSession \
.builder \
.appName("Socker streaming demo") \
.master("local[*]")\
.getOrCreate()
# The stream will return an unbounded table
stream_df = spark\
.readStream\
.format("socket")\
.option("host","localhost")\
.option("port","1100")\
.load()
print(stream_df.isStreaming)
stream_df.printSchema()
write_query = stream_df \
.writeStream\
.format("console")\
.start()
# this line keeps the streaming application running indefinitely
write_query.awaitTermination()
print("Application Completed")
The error I am getting:
22/07/31 00:13:16 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: C:\Users\786000702\AppData\Local\Temp\temporary-9bfc22f8-6f1a-49e5-a3fb-3e4ac2c1de54. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.
Traceback (most recent call last):
File "D:\PySparkProject\pySparkStream\socker_streaming.py", line 23, in <module>
write_query = stream_df \
File "D:\PySparkProject\venv\lib\site-packages\pyspark\sql\streaming.py", line 1202, in start
return self._sq(self._jwrite.start())
File "D:\PySparkProject\venv\lib\site-packages\py4j\java_gateway.py", line 1321, in __call__
return_value = get_return_value(
File "D:\PySparkProject\venv\lib\site-packages\pyspark\sql\utils.py", line 111, in deco
return f(*a, **kw)
File "D:\PySparkProject\venv\lib\site-packages\py4j\protocol.py", line 326, in get_return_value
raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o36.start.
: org.apache.hadoop.fs.InvalidPathException: Invalid path name Path part /C:/Users/786000702/AppData/Local/Temp/temporary-9bfc22f8-6f1a-49e5-a3fb-3e4ac2c1de54 from URI hdfs://0.0.0.0:19000/C:/Users/786000702/AppData/Local/Temp/temporary-9bfc22f8-6f1a-49e5-a3fb-3e4ac2c1de54 is not a valid filename.
at org.apache.hadoop.fs.AbstractFileSystem.getUriPath(AbstractFileSystem.java:427)
at org.apache.hadoop.fs.Hdfs.mkdir(Hdfs.java:366)
at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:809)
at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:805)
at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:812)
at org.apache.spark.sql.execution.streaming.FileContextBasedCheckpointFileManager.createCheckpointDirectory(CheckpointFileManager.scala:368)
at org.apache.spark.sql.execution.streaming.ResolveWriteToStream$.resolveCheckpointLocation(ResolveWriteToStream.scala:121)
at org.apache.spark.sql.execution.streaming.ResolveWriteToStream$$anonfun$apply$1.applyOrElse(ResolveWriteToStream.scala:42)
at
You can change the filesystem that Spark resolves paths against by editing fs.defaultFS in the core-site.xml file located in either your Spark or Hadoop conf directory.
Based on the error, you seem to have set it to hdfs://0.0.0.0:19000/ rather than some file:// URI path.
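If changing fs.defaultFS is not an option, a hedged alternative is to skip the temporary checkpoint entirely and give the query an explicit file:// checkpoint location (the directory below is just an example), reusing stream_df from the question:
# Sketch: an explicit local checkpoint directory keeps Spark from resolving
# the temporary checkpoint path against the HDFS default filesystem.
write_query = stream_df \
    .writeStream \
    .format("console") \
    .option("checkpointLocation", "file:///D:/PySparkProject/checkpoints/socket_demo") \
    .start()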

Error while connecting to BigQuery in GCP using Spark

I was trying to connect to Google BigQuery from PySpark using the code below:
from pyspark.sql import SparkSession
from pyspark import SparkConf, SparkContext
conf = SparkConf().setAppName("GCP")
sc = SparkContext(conf=conf)
master = "yarn"
spark = SparkSession.builder \
.master("local")\
.appName("GCP") \
.getOrCreate()
spark._jsc.hadoopConfiguration().set("google.cloud.auth.service.account.json.keyfile","key.json")
df = spark.read.format('bigquery') \
.option("parentProject", "project_name") \
.option('table', 'project_name.table_name') \
.load()
df.show()
My Spark version is 2.3 and the BigQuery jar is spark-bigquery-latest_2.12.
Although my service account has the "BigQuery Job User" permission at the project level and BigQuery Data Viewer and BigQuery User at the dataset level, I am still getting the error below when trying to execute the above code:
Traceback (most recent call last):
File "/home/lo815/GCP/gcp.py", line 23, in <module>
df.show()
File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 350, in show
File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o93.showString.
: com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.PermissionDeniedException: com.google.cloud.spark.bigquery.repackaged.io.grpc.StatusRuntimeException: PERMISSION_DENIED: request failed: the user does not have 'bigquery.readsessions.create' permission for 'projects/GCP'
at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:53)
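No answer was posted here. The spark-bigquery connector reads through the BigQuery Storage Read API, and bigquery.readsessions.create is part of the BigQuery Read Session User role, which would have to be granted on the project used as parentProject. A small hedged check (assuming the google-auth package is installed) to confirm which principal and project the key file actually represents before granting that role:
from google.oauth2 import service_account

# Load the same key the Spark job uses and print the identity it represents;
# that service account is the one that would need roles/bigquery.readSessionUser
# on the parentProject (an inference from the PERMISSION_DENIED message, not
# something confirmed in the question).
creds = service_account.Credentials.from_service_account_file("key.json")
print(creds.service_account_email)
print(creds.project_id)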

ModuleNotFoundError in PySpark caused in serializers.py

I am trying to submit a Spark Application to the local Kubernetes cluster on my machine (created via Docker Dashboard). The application depends on a python package, let's call it X.
Here is the application code:
import sys
from pyspark import SparkContext
from pyspark.sql import SparkSession
datafolder = "/opt/spark/data" # Folder created in container by spark's docker file
sys.path.append(datafolder) # X is contained inside of datafolder
from X.predictor import * # import functionality from X
def apply_x_functionality_on(item):
predictor = Predictor() # class from X.predictor
predictor.predict(item)
def main():
spark = SparkSession\
.builder\
.appName("AppX")\
.getOrCreate()
sc = spark.sparkContext
data = []
# Read data: [no problems there]
...
data_rdd = sc.parallelize(data) # create RDD
data_rdd.foreach(lambda item: apply_x_functionality_on(item)) # call the function defined above
if __name__ == "__main__":
main()
Initially I hoped to avoid such problems by putting the X folder into Spark's data folder. When the container is built, all the contents of the data folder are copied to /opt/spark/data. My Spark application appends the data folder to the system path, thereby consuming the package X. Or so I thought.
Everything works fine until the .foreach function is called. Here is a snippet from the logs with the error description:
20/11/25 16:13:54 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 10.1.0.60, executor 2): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 587, in main
func, profiler, deserializer, serializer = read_command(pickleSer, infile)
File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 74, in read_command
command = serializer._read_with_length(file)
File "/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 172, in _read_with_length
return self.loads(obj)
File "/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 458, in loads
return pickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named 'X'
There are a lot of similar questions here: one, two, three, but none of the answers to them have helped me so far.
What I have tried:
I submitted the application with a zipped X (I zip it in the container by applying zip to X):
$SPARK_HOME/bin/spark-submit \
--master k8s://https://kubernetes.docker.internal:6443 \
--deploy-mode cluster \
--conf spark.executor.instances=5 \
--conf spark.kubernetes.container.image=kostjaigin/spark-py:v3.0.1-X_0.0.1 \
--py-files "local:///opt/spark/data/X.zip" \
local:///opt/spark/data/MyApp.py
I added the zipped X to the SparkContext:
sc.addPyFile("opt/spark/data/X.zip")
I have resolved the issue:
Created a dependencies folder under /opt/spark/data
Put X into dependencies
Inside my Dockerfile I pack the dependencies folder into a zip archive to submit it later via --py-files: cd /opt/spark/data/dependencies && zip -r ../dependencies.zip .
In Application:
...
from X.predictor import * # import functionality from X
...
# zipped package
zipped_pkg = os.path.join(datafolder, "dependencies.zip")
assert os.path.exists(zipped_pkg)
sc.addPyFile(zipped_pkg)
...
Added the --py-files flag to the submit command:
$SPARK_HOME/bin/spark-submit \
--master k8s://https://kubernetes.docker.internal:6443 \
--deploy-mode cluster \
--conf spark.executor.instances=5 \
--py-files "local:///opt/spark/data/dependencies.zip" \
local:///opt/spark/data/MyApp.py
Run it
Basically it is all about adding a dependencies.zip archive with all the required dependencies in it.
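Put together, a minimal sketch of that pattern (reusing the paths from the answer; Predictor and the dummy data are placeholders): ship the zip with addPyFile and import X only inside the function that runs on the executors.
import os
from pyspark.sql import SparkSession

datafolder = "/opt/spark/data"  # folder populated by the Docker image

spark = SparkSession.builder.appName("AppX").getOrCreate()
sc = spark.sparkContext

zipped_pkg = os.path.join(datafolder, "dependencies.zip")
assert os.path.exists(zipped_pkg)
sc.addPyFile(zipped_pkg)  # executors add the zip to their sys.path

def apply_x_functionality_on(item):
    # Import on the worker, after the zip has been distributed.
    from X.predictor import Predictor
    Predictor().predict(item)

sc.parallelize(range(10)).foreach(apply_x_functionality_on)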

How to read Druid data using JDBC driver with spark?

How can I read data from Druid using Spark and the Avatica JDBC driver?
This is the Avatica JDBC documentation.
Reading data from Druid using Python and the JayDeBeApi module succeeds, as in the code below.
$ python
import jaydebeapi
conn = jaydebeapi.connect("org.apache.calcite.avatica.remote.Driver",
"jdbc:avatica:remote:url=http://0.0.0.0:8082/druid/v2/sql/avatica/",
{"user": "druid", "password":"druid"},
"/root/avatica-1.17.0.jar",
)
cur = conn.cursor()
cur.execute("SELECT * FROM INFORMATION_SCHEMA.TABLES")
cur.fetchall()
output is:
[('druid', 'druid', 'wikipedia', 'TABLE'),
('druid', 'INFORMATION_SCHEMA', 'COLUMNS', 'SYSTEM_TABLE'),
('druid', 'INFORMATION_SCHEMA', 'SCHEMATA', 'SYSTEM_TABLE'),
('druid', 'INFORMATION_SCHEMA', 'TABLES', 'SYSTEM_TABLE'),
('druid', 'sys', 'segments', 'SYSTEM_TABLE'),
('druid', 'sys', 'server_segments', 'SYSTEM_TABLE'),
('druid', 'sys', 'servers', 'SYSTEM_TABLE'),
('druid', 'sys', 'supervisors', 'SYSTEM_TABLE'),
('druid', 'sys', 'tasks', 'SYSTEM_TABLE')] -> default tables
But I want to read the data using Spark and JDBC.
I tried, but I run into a problem with Spark, as shown in the code below.
$ pyspark --jars /root/avatica-1.17.0.jar
df = spark.read.format('jdbc') \
.option('url', 'jdbc:avatica:remote:url=http://0.0.0.0:8082/druid/v2/sql/avatica/') \
.option("dbtable", 'INFORMATION_SCHEMA.TABLES') \
.option('user', 'druid') \
.option('password', 'druid') \
.option('driver', 'org.apache.calcite.avatica.remote.Driver') \
.load()
output is:
Traceback (most recent call last):
File "<stdin>", line 8, in <module>
File "/root/spark-2.4.4-bin-hadoop2.7/python/pyspark/sql/readwriter.py", line 172, in load
return self._df(self._jreader.load())
File "/root/spark-2.4.4-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/root/spark-2.4.4-bin-hadoop2.7/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/root/spark-2.4.4-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o2999.load.
: java.sql.SQLException: While closing connection
...
Caused by: java.lang.RuntimeException: com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException: Unrecognized field "rpcMetadata" (class org.apache.calcite.avatica.remote.Service$CloseConnectionResponse), not marked as ignorable (0 known properties: ])
at [Source: {"response":"closeConnection","rpcMetadata":{"response":"rpcMetadata","serverAddress":"172.18.0.7:8082"}}
; line: 1, column: 46]
...
Caused by: com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException: Unrecognized field "rpcMetadata" (class org.apache.calcite.avatica.remote.Service$CloseConnectionResponse), not marked as ignorable (0 known properties: ])
at [Source: {"response":"closeConnection","rpcMetadata":{"response":"rpcMetadata","serverAddress":"172.18.0.7:8082"}}
; line: 1, column: 46]
...
Note:
I downloaded the Avatica jar file (avatica-1.17.0.jar) from the Maven repository.
I installed the Druid server using docker-compose with the default settings.
I found another way to solve this problem: I used spark-druid-connector to connect Druid with Spark.
However, I changed some of its code to make it work in my environment.
This is my environment:
spark: 2.4.4
scala: 2.11.12
python: python 3.6.8
druid:
zookeeper: 3.5
druid: 0.17.0
However, it has a problem:
once you have used spark-druid-connector, every subsequent SQL query such as spark.sql("select * from temp_view") is routed into this connector's planner,
whereas the DataFrame API, e.g. df.distinct().count(), works without problems. I haven't solved that yet.
I tried with spark-shell:
./bin/spark-shell --driver-class-path avatica-1.17.0.jar --jars avatica-1.17.0.jar
val jdbcDF = spark.read.format("jdbc")
.option("url", "jdbc:avatica:remote:url=http://0.0.0.0:8082/druid/v2/sql/avatica/")
.option("dbtable", "INFORMATION_SCHEMA.TABLES")
.option("user", "druid")
.option("password", "druid")
.load()
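If the Avatica JDBC data source keeps failing, a hedged workaround that builds only on what already works in the question is to fetch the rows with JayDeBeApi on the driver and hand them to Spark; this is only reasonable for small result sets:
import jaydebeapi
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("druid-jdbc-workaround").getOrCreate()

# Same connection parameters as the working JayDeBeApi snippet above.
conn = jaydebeapi.connect(
    "org.apache.calcite.avatica.remote.Driver",
    "jdbc:avatica:remote:url=http://0.0.0.0:8082/druid/v2/sql/avatica/",
    {"user": "druid", "password": "druid"},
    "/root/avatica-1.17.0.jar",
)
cur = conn.cursor()
cur.execute("SELECT * FROM INFORMATION_SCHEMA.TABLES")
rows = cur.fetchall()
columns = [d[0] for d in cur.description]  # column names from the DB-API cursor
cur.close()
conn.close()

# Build a driver-side DataFrame from the fetched rows.
df = spark.createDataFrame(rows, columns)
df.show()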

How to connect to a secured Kafka cluster from Zeppelin ("Failed to construct kafka consumer")?

I am trying to read some data from a Kafka broker using structured streaming to display it in a Zeppelin note. I am using Spark 2.4.3, Scala 2.11, Python 2.7, Java 9 and Kafka 2.2 with SSL enabled, hosted on Heroku, but I get StreamingQueryException: 'Failed to construct kafka consumer'.
I am using the following dependencies (set in the Spark interpreter settings):
org.apache.spark:spark-streaming-kafka-0-10_2.11:2.4.3
org.apache.spark:spark-streaming_2.11:2.4.3
org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.3
I have tried older and newer versions, but these should match Spark/Scala versions I am using.
I have successfully written and read from Kafka using simple Python producer and consumer.
The code I am using:
%pyspark
from pyspark.sql.functions import from_json
from pyspark.sql.types import *
from pyspark.sql.functions import col, expr, when
schema = StructType().add("power", IntegerType()).add("colorR", IntegerType()).add("colorG",IntegerType()).add("colorB",IntegerType()).add("colorW",IntegerType())
df = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", brokers) \
.option("kafka.security.protocol", "SSL") \
.option("kafka.ssl.truststore.location", "/home/ubuntu/kafka/truststore.jks") \
.option("kafka.ssl.keystore.location", "/home/ubuntu/kafka/keystore.jks") \
.option("kafka.ssl.keystore.password", password) \
.option("kafka.ssl.truststore.password", password) \
.option("kafka.ssl.endpoint.identification.algorithm", "") \
.option("startingOffsets", "earliest") \
.option("subscribe", topic) \
.load()
schema = ArrayType(
StructType([StructField("power", IntegerType()),
StructField("colorR", IntegerType()),
StructField("colorG", IntegerType()),
StructField("colorB", IntegerType()),
StructField("colorW", IntegerType())]))
readDF = df.select( \
col("key").cast("string"),
from_json(col("value").cast("string"), schema))
query = readDF.writeStream.format("console").start()
query.awaitTermination()
And the error I get:
Fail to execute line 43: query.awaitTermination()
Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-2171412221151055324.py", line 380, in <module>
exec(code, _zcUserQueryNameSpace)
File "<stdin>", line 43, in <module>
File "/home/ubuntu/spark/python/lib/pyspark.zip/pyspark/sql/streaming.py", line 103, in awaitTermination
return self._jsq.awaitTermination()
File "/home/ubuntu/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/home/ubuntu/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 75, in deco
raise StreamingQueryException(s.split(': ', 1)[1], stackTrace)
StreamingQueryException: u'Failed to construct kafka consumer\n=== Streaming Query ===\nIdentifier: [id = 2ee20c47-8293-469a-bc0b-ef71a1f118bc, runId = 72422290-090a-4b6d-bd66-088a5a534240]\nCurrent Committed Offsets: {}\nCurrent Available Offsets: {}\n\nCurrent State: ACTIVE\nThread State: RUNNABLE\n\nLogical Plan:\nProject [cast(key#7 as string) AS key#22, jsontostructs(ArrayType(StructType(StructField(power,IntegerType,true), StructField(colorR,IntegerType,true), StructField(colorG,IntegerType,true), StructField(colorB,IntegerType,true), StructField(colorW,IntegerType,true)),true), cast(value#8 as string), Some(Etc/UTC)) AS jsontostructs(CAST(value AS STRING))#21]\n+- StreamingExecutionRelation KafkaV2[Subscribe[tanana-44614.lightbulb]], [key#7, value#8, topic#9, partition#10, offset#11L, timestamp#12, timestampType#13]\n'
When I use read and write instead of readStream and writeStream I do not get any errors, but nothing appears on the console when I send some data to Kafka.
What else should I try?
It looks like the Kafka Consumer cannot access ~/kafka/truststore.jks and hence the exception. Replace ~ with the fully-specified path (without the tilde) and the issue should go away.
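A short sketch of that fix with the paths resolved on the Python side before they reach the JVM (brokers, password and topic are placeholders for the values defined elsewhere in the note):
import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-ssl-check").getOrCreate()

# Placeholders for the values defined elsewhere in the Zeppelin note.
brokers = "broker1:9096,broker2:9096"
password = "changeit"
topic = "tanana-44614.lightbulb"

# Resolve "~" here; the Kafka consumer inside the JVM will not expand it,
# which is what keeps it from reading the truststore.
truststore = os.path.abspath(os.path.expanduser("~/kafka/truststore.jks"))
keystore = os.path.abspath(os.path.expanduser("~/kafka/keystore.jks"))
assert os.path.exists(truststore), truststore
assert os.path.exists(keystore), keystore

df = spark.readStream.format("kafka") \
    .option("kafka.bootstrap.servers", brokers) \
    .option("kafka.security.protocol", "SSL") \
    .option("kafka.ssl.truststore.location", truststore) \
    .option("kafka.ssl.keystore.location", keystore) \
    .option("kafka.ssl.truststore.password", password) \
    .option("kafka.ssl.keystore.password", password) \
    .option("kafka.ssl.endpoint.identification.algorithm", "") \
    .option("startingOffsets", "earliest") \
    .option("subscribe", topic) \
    .load()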
