KinesisUtils.createStream error in Spark streaming + Kinesis - apache-spark

I'm trying to stream data from AWS Kinesis using the Spark Streaming + Kinesis integration.
My code looks like this:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream

sc = SparkContext('local[*]', 'app_name')
ssc = StreamingContext(sc, 10)
kinesisStream = KinesisUtils.createStream(
    ssc,
    kinesisAppName='kinesis_app_name',
    streamName='kinesis_stream_name',
    endpointUrl='https://kinesis.ap-southeast-2.amazonaws.com',
    regionName='ap-southeast-2',
    initialPositionInStream=InitialPositionInStream.TRIM_HORIZON,
    checkpointInterval=10)
The command to run the script is: spark-submit --packages org.apache.spark:spark-streaming-kinesis-asl_2.11:2.2.0 script.py. I'm using Spark 2.2.0 with PySpark.
The error I got:
ERROR:root:Exception while sending command.
Traceback (most recent call last):
File "/home/ubuntu/transformer/env/lib/python3.5/site-packages/py4j/java_gateway.py", line 1035, in send_command
raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ubuntu/transformer/env/lib/python3.5/site-packages/py4j/java_gateway.py", line 883, in send_command
response = connection.send_command(command)
File "/home/ubuntu/transformer/env/lib/python3.5/site-packages/py4j/java_gateway.py", line 1040, in send_command
"Error while receiving", e, proto.ERROR_ON_RECEIVE)
py4j.protocol.Py4JNetworkError: Error while receiving
Traceback (most recent call last):
File "kinesis_to_s3.py", line 63, in
checkpointInterval=streaming_interval)
File "/home/ubuntu/transformer/env/lib/python3.5/site-packages/pyspark/streaming/kinesis.py", line 92, in createStream
stsSessionName, stsExternalId)
File "/home/ubuntu/transformer/env/lib/python3.5/site-packages/py4j/java_gateway.py", line 1133, in call
answer, self.gateway_client, self.target_id, self.name)
File "/home/ubuntu/transformer/env/lib/python3.5/site-packages/py4j/protocol.py", line 327, in get_return_value
format(target_id, ".", name))
py4j.protocol.Py4JError: An error occurred while calling o27.createStream
Exception in thread "Thread-2" java.lang.NoClassDefFoundError: com/amazonaws/services/kinesis/clientlibrary/lib/worker/InitialPositionInStream
at java.lang.Class.getDeclaredMethods0(Native Method)
at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
at java.lang.Class.privateGetPublicMethods(Class.java:2902)
at java.lang.Class.getMethods(Class.java:1615)
at py4j.reflection.ReflectionEngine.getMethodsByNameAndLength(ReflectionEngine.java:345)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:305)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:272)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 12 more
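There is no accepted fix quoted here, but the root cause in the trace is the ClassNotFoundException for com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream, i.e. the Amazon Kinesis Client Library (KCL) classes never made it onto the driver classpath. A hedged sketch of two common workarounds follows; the artifact names and versions are assumptions that need to be matched to your Spark build:
# Option 1 (assumption): add the KCL artifact explicitly next to the ASL package,
# using whatever KCL version your Spark 2.2.0 build expects.
spark-submit \
  --packages org.apache.spark:spark-streaming-kinesis-asl_2.11:2.2.0,com.amazonaws:amazon-kinesis-client:<kcl-version> \
  script.py

# Option 2 (assumption): build or download the kinesis-asl assembly jar, which
# bundles the KCL and AWS SDK classes, and put it on the classpath directly.
spark-submit --jars spark-streaming-kinesis-asl-assembly_2.11-2.2.0.jar script.py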

Related

Dynamically creating a structured streaming pipeline with python

I'm trying to create a Structured Streaming pipeline that will read N Kafka topics, do some payload validation, explode the payload, and write to:
N Kafka topics
Amazon S3
I've followed this article to generate the pipeline.
The shape of my pipeline can either be:
Subscription
|
---process---
| | | | |
N outputs
or
N Subscriptions
| | | | |
N Processes
| | | | |
N outputs
This is the code I'm using:
import threading

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName(f"ingest") \
    .master("local[*]") \
    .getOrCreate()

def start_job(spark, topic):
    # pipeline logic here
    ...

for topic in pipeline_config.list_topics():
    thread = threading.Thread(target=start_job, args=(spark, topic))
    thread.start()

spark.streams.awaitAnyTermination()
Whenever I run this I get java.util.ConcurrentModificationException: Another instance of this query was just started by a concurrent session.:
Exception in thread Thread-2 (start_job):
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/threading.py", line 1009, in _bootstrap_inner
Exception in thread Thread-3 (start_job):
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/threading.py", line 1009, in _bootstrap_inner
self.run()
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/threading.py", line 946, in run
self.run()
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/threading.py", line 946, in run
self._target(*self._args, **self._kwargs) self._target(*self._args, **self._kwargs)
File "/Users/USER/git/schema-tools/pipeline2.py", line 164, in start_job
File "/Users/USER/git/schema-tools/pipeline2.py", line 164, in start_job
parsed_with_metadata \parsed_with_metadata \
File "/Users/USER/venvs/schema-tools-310/lib/python3.10/site-packages/pyspark/sql/streaming.py", line 1491, in start
File "/Users/USER/venvs/schema-tools-310/lib/python3.10/site-packages/pyspark/sql/streaming.py", line 1491, in start
return self._sq(self._jwrite.start())
File "/Users/USER/venvs/schema-tools-310/lib/python3.10/site-packages/py4j/java_gateway.py", line 1304, in __call__
return self._sq(self._jwrite.start())
File "/Users/USER/venvs/schema-tools-310/lib/python3.10/site-packages/py4j/java_gateway.py", line 1304, in __call__
return_value = get_return_value(
return_value = get_return_value( File "/Users/USER/venvs/schema-tools-310/lib/python3.10/site-packages/pyspark/sql/utils.py", line 111, in deco
File "/Users/USER/venvs/schema-tools-310/lib/python3.10/site-packages/pyspark/sql/utils.py", line 111, in deco
return f(*a, **kw)
return f(*a, **kw)
File "/Users/USER/venvs/schema-tools-310/lib/python3.10/site-packages/py4j/protocol.py", line 326, in get_return_value
File "/Users/USER/venvs/schema-tools-310/lib/python3.10/site-packages/py4j/protocol.py", line 326, in get_return_value
raise Py4JJavaError(
py4j.protocol.Py4JJavaError raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o314.start.
: java.util.ConcurrentModificationException: Another instance of this query was just started by a concurrent session.
at org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:411)
at org.apache.spark.sql.streaming.DataStreamWriter.startQuery(DataStreamWriter.scala:466)
at org.apache.spark.sql.streaming.DataStreamWriter.startInternal(DataStreamWriter.scala:456)
at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:301)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
: An error occurred while calling o312.start.
: java.util.ConcurrentModificationException: Another instance of this query was just started by a concurrent session.
Is this really not possible?
Running on:
py3.10
spark 3.1.2
packages org.apache.spark:spark-streaming-kafka-0-10_2.12:3.1.2, org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2, org.apache.commons:commons-pool2:2.11.1
Mac M1 pro
PS: I can do something similar using foreachBatch, but I really don't like that approach.
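One observation that may help (a sketch, not a confirmed answer): DataStreamWriter.start() is non-blocking, so Python threads aren't required to run several queries in parallel from one session. Also, the duplicate-query check behind "Another instance of this query was just started by a concurrent session" is based on the query id recorded in the checkpoint metadata, so two queries sharing a checkpointLocation will collide. A minimal sketch, where pipeline_config and build_query are hypothetical stand-ins for the per-topic configuration and the read/validate/explode logic:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ingest").master("local[*]").getOrCreate()

for topic in pipeline_config.list_topics():
    (build_query(spark, topic)          # hypothetical: returns a streaming DataFrame
        .writeStream
        .queryName(f"ingest_{topic}")   # distinct name per query
        .option("checkpointLocation", f"/tmp/checkpoints/{topic}")  # distinct checkpoint per query
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("topic", f"{topic}_out")
        .start())                       # returns immediately with a StreamingQuery handle

spark.streams.awaitAnyTermination()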

Pandas UDF for pyspark - Package not found error

I am using the pandas UDF approach to scale my models. However, I am getting an error that the pmdarima package is not found. The code works fine when I run it in my notebook on a pandas DataFrame directly, so the package is available in the notebook. From a few answers online, the error seems to be that the package is not available on the worker nodes where the code is parallelized. Can someone help with how to resolve this? How can I install the package on my worker nodes, if that's the case?
FYI - I am working on Azure Databricks.
import pmdarima as pm

def funct1(grp_keys, df):
    # ... other statements ...
    model = pm.auto_arima(train_data['sum_hlqty'], X=x,
                          test='adf', trace=False,
                          maxiter=12, max_p=5, max_q=5,
                          njobs=-1)
    # ... build and return the forecast pandas DataFrame ...

forecast_df = sales.groupby('Col1', 'Col2').applyInPandas(
    funct1, schema="C1 string, C2 string, C3 date, C4 float, C5 float")
Py4JJavaError: An error occurred while calling o256.sql.
: org.apache.spark.SparkException: Job aborted.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:230)
at com.databricks.sql.transaction.tahoe.files.TransactionalWriteEdge.$anonfun$writeFiles$5(TransactionalWriteEdge.scala:183)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$5(SQLExecution.scala:116)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:249)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$1(SQLExecution.scala:101)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:845)
at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:77)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:199)
at com.databricks.sql.transaction.tahoe.files.TransactionalWriteEdge.$anonfun$writeFiles$1(TransactionalWriteEdge.scala:135)
at com.databricks.logging.UsageLogging.$anonfun$recordOperation$4(UsageLogging.scala:431)
at com.databricks.logging.UsageLogging.$anonfun$withAttributionContext$1(UsageLogging.scala:239)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
at com.databricks.logging.UsageLogging.withAttributionContext(UsageLogging.scala:234)
at com.databricks.logging.UsageLogging.withAttributionContext$(UsageLogging.scala:231)
at com.databricks.spark.util.PublicDBLogging.withAttributionContext(DatabricksSparkUsageLogger.scala:19)
at com.databricks.logging.UsageLogging.withAttributionTags(UsageLogging.scala:276)
at com.databricks.logging.UsageLogging.withAttributionTags$(UsageLogging.scala:269)
at com.databricks.spark.util.PublicDBLogging.withAttributionTags(DatabricksSparkUsageLogger.scala:19)
at com.databricks.logging.UsageLogging.recordOperation(UsageLogging.scala:412)
at com.databricks.logging.UsageLogging.recordOperation$(UsageLogging.scala:338)
at com.databricks.spark.util.PublicDBLogging.recordOperation(DatabricksSparkUsageLogger.scala:19)
at com.databricks.spark.util.PublicDBLogging.recordOperation0(DatabricksSparkUsageLogger.scala:56)
at com.databricks.spark.util.DatabricksSparkUsageLogger.recordOperation(DatabricksSparkUsageLogger.scala:129)
at com.databricks.spark.util.UsageLogger.recordOperation(UsageLogger.scala:71)
at com.databricks.spark.util.UsageLogger.recordOperation$(UsageLogger.scala:58)
at com.databricks.spark.util.DatabricksSparkUsageLogger.recordOperation(DatabricksSparkUsageLogger.scala:85)
at com.databricks.spark.util.UsageLogging.recordOperation(UsageLogger.scala:401)
at com.databricks.spark.util.UsageLogging.recordOperation$(UsageLogger.scala:380)
at com.databricks.sql.transaction.tahoe.OptimisticTransaction.recordOperation(OptimisticTransaction.scala:84)
at com.databricks.sql.transaction.tahoe.metering.DeltaLogging.recordDeltaOperation(DeltaLogging.scala:108)
at com.databricks.sql.transaction.tahoe.metering.DeltaLogging.recordDeltaOperation$(DeltaLogging.scala:94)
at com.databricks.sql.transaction.tahoe.OptimisticTransaction.recordDeltaOperation(OptimisticTransaction.scala:84)
at com.databricks.sql.transaction.tahoe.files.TransactionalWriteEdge.writeFiles(TransactionalWriteEdge.scala:92)
at com.databricks.sql.transaction.tahoe.files.TransactionalWriteEdge.writeFiles$(TransactionalWriteEdge.scala:88)
at com.databricks.sql.transaction.tahoe.OptimisticTransaction.writeFiles(OptimisticTransaction.scala:84)
at com.databricks.sql.transaction.tahoe.files.TransactionalWrite.writeFiles(TransactionalWrite.scala:112)
at com.databricks.sql.transaction.tahoe.files.TransactionalWrite.writeFiles$(TransactionalWrite.scala:111)
at com.databricks.sql.transaction.tahoe.OptimisticTransaction.writeFiles(OptimisticTransaction.scala:84)
at com.databricks.sql.transaction.tahoe.commands.WriteIntoDelta.write(WriteIntoDelta.scala:112)
at com.databricks.sql.transaction.tahoe.commands.WriteIntoDelta.$anonfun$run$2(WriteIntoDelta.scala:71)
at com.databricks.sql.transaction.tahoe.commands.WriteIntoDelta.$anonfun$run$2$adapted(WriteIntoDelta.scala:70)
at com.databricks.sql.transaction.tahoe.DeltaLog.withNewTransaction(DeltaLog.scala:203)
at com.databricks.sql.transaction.tahoe.commands.WriteIntoDelta.$anonfun$run$1(WriteIntoDelta.scala:70)
at com.databricks.sql.acl.CheckPermissions$.trusted(CheckPermissions.scala:1128)
at com.databricks.sql.transaction.tahoe.commands.WriteIntoDelta.run(WriteIntoDelta.scala:69)
at com.databricks.sql.transaction.tahoe.catalog.WriteIntoDeltaBuilder$$anon$1.insert(DeltaTableV2.scala:193)
at org.apache.spark.sql.execution.datasources.v2.SupportsV1Write.writeWithV1(V1FallbackWriters.scala:118)
at org.apache.spark.sql.execution.datasources.v2.SupportsV1Write.writeWithV1$(V1FallbackWriters.scala:116)
at org.apache.spark.sql.execution.datasources.v2.AppendDataExecV1.writeWithV1(V1FallbackWriters.scala:38)
at org.apache.spark.sql.execution.datasources.v2.AppendDataExecV1.run(V1FallbackWriters.scala:44)
at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result$lzycompute(V2CommandExec.scala:39)
at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result(V2CommandExec.scala:39)
at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.executeCollect(V2CommandExec.scala:45)
at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:234)
at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3709)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$5(SQLExecution.scala:116)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:249)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$1(SQLExecution.scala:101)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:845)
at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:77)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:199)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3707)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:234)
at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:104)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:845)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:101)
at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:680)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:845)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:675)
at sun.reflect.GeneratedMethodAccessor655.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
at py4j.Gateway.invoke(Gateway.java:295)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:251)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 98 in stage 7774.0 failed 4 times, most recent failure: Lost task 98.3 in stage 7774.0 (TID 177293, 10.240.138.10, executor 133): org.apache.spark.api.python.PythonException: 'pyspark.serializers.SerializationError: Caused by Traceback (most recent call last):
File "/databricks/spark/python/pyspark/serializers.py", line 177, in _read_with_length
return self.loads(obj)
File "/databricks/spark/python/pyspark/serializers.py", line 466, in loads
return pickle.loads(obj, encoding=encoding)
File "/databricks/spark/python/pyspark/cloudpickle.py", line 1110, in subimport
__import__(name)
ModuleNotFoundError: No module named 'pmdarima'
Full traceback below:
Traceback (most recent call last):
File "/databricks/spark/python/pyspark/serializers.py", line 177, in _read_with_length
return self.loads(obj)
File "/databricks/spark/python/pyspark/serializers.py", line 466, in loads
return pickle.loads(obj, encoding=encoding)
File "/databricks/spark/python/pyspark/cloudpickle.py", line 1110, in subimport
__import__(name)
ModuleNotFoundError: No module named 'pmdarima'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/databricks/spark/python/pyspark/worker.py", line 638, in main
func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
File "/databricks/spark/python/pyspark/worker.py", line 438, in read_udfs
arg_offsets, f = read_single_udf(pickleSer, infile, eval_type, runner_conf, udf_index=0)
File "/databricks/spark/python/pyspark/worker.py", line 255, in read_single_udf
f, return_type = read_command(pickleSer, infile)
File "/databricks/spark/python/pyspark/worker.py", line 75, in read_command
command = serializer._read_with_length(file)
File "/databricks/spark/python/pyspark/serializers.py", line 180, in _read_with_length
raise SerializationError("Caused by " + traceback.format_exc())
pyspark.serializers.SerializationError: Caused by Traceback (most recent call last):
File "/databricks/spark/python/pyspark/serializers.py", line 177, in _read_with_length
return self.loads(obj)
File "/databricks/spark/python/pyspark/serializers.py", line 466, in loads
return pickle.loads(obj, encoding=encoding)
File "/databricks/spark/python/pyspark/cloudpickle.py", line 1110, in subimport
__import__(name)
ModuleNotFoundError: No module named 'pmdarima'
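Not quoted from an answer, but since the question asks how to get the package onto the worker nodes: on Databricks the usual options are installing pmdarima as a cluster-scoped library (Cluster > Libraries > PyPI) or, on recent runtimes, a notebook-scoped install, which is also made available to the executors running the pandas UDF. A minimal sketch, assuming a notebook cell on a runtime that supports %pip:
# Notebook-scoped install; the library is propagated to the executors for this notebook,
# so the UDF workers can import pmdarima as well.
%pip install pmdarima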

How to lazy load log messages in Glue python?

Running an AWS Glue job (glueetl, Glue version 2.0, Python 3), I am trying to use the recommended Python lazy log-message formatting with
logger.info("Attempting to run python module 1 {entrypoint}".format(entrypoint=entrypoint))
logger.info("Attempting to run python module 2 %s", entrypoint)
This produces an error on the second line but the first line succeeds and prints the string.
2021-06-25 04:12:58,782 INFO [Thread-7] log.GlueLogger (GlueLogger.scala:info(8)): Attempting to run python module 1 main.test_program
2021-06-25 04:12:58,818 ERROR [main] glue.ProcessLauncher (Logging.scala:logError(70)): Error from Python:Traceback (most recent call last):
File "/tmp/main_etl_script.py", line 103, in <module>
spark=spark)
File "/tmp/main_etl_script.py", line 32, in main_handler
logger.info("Attempting to run python module 2 %s", entrypoint)
File "/opt/amazon/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/opt/amazon/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 332, in get_return_value
format(target_id, ".", name, value))
py4j.protocol.Py4JError: An error occurred while calling o85.info. Trace:
py4j.Py4JException: Method info([class java.lang.String, class java.lang.String]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:274)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
My logger setup is simple:
import logging
...
glue_context = GlueContext(SparkContext())
logger = glue_context.get_logger()
Why does this occur?
It doesn't like the separate argument after the "%s" format string:
logger.info("Attempting to run python module 2 %s", entrypoint)
Try it with an f-string instead:
logger.info(f"Attempting to run python module 2 {entrypoint}")

java.io.IOException: No FileSystem for scheme: C and WinError 10054: An existing connection was forcibly closed by the remote host

I was trying to connect to and fetch data from a BigQuery dataset into local PyCharm using PySpark.
I ran the script below in PyCharm:
from pyspark.sql import SparkSession

spark = SparkSession.builder\
    .config('spark.jars', "C:/Users/PycharmProjects/pythonProject/spark-bigquery-latest.jar")\
    .getOrCreate()

conn = spark.read.format("bigquery")\
    .option("credentialsFile", "C:/Users/PycharmProjects/pythonProject/google-bq-api.json")\
    .option("parentProject", "Google-Project-ID")\
    .option("project", "Dataset-Name")\
    .option("table", "dataset.schema.tablename")\
    .load()

conn.show()
This gave me the error below:
Exception in thread "main" java.io.IOException: No FileSystem for scheme: C
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.spark.deploy.DependencyUtils$.resolveGlobPath(DependencyUtils.scala:191)
at org.apache.spark.deploy.DependencyUtils$.$anonfun$resolveGlobPaths$2(DependencyUtils.scala:147)
at org.apache.spark.deploy.DependencyUtils$.$anonfun$resolveGlobPaths$2$adapted(DependencyUtils.scala:145)
at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
at scala.collection.TraversableLike.flatMap(TraversableLike.scala:245)
at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:242)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
at org.apache.spark.deploy.DependencyUtils$.resolveGlobPaths(DependencyUtils.scala:145)
at org.apache.spark.deploy.SparkSubmit.$anonfun$prepareSubmitEnvironment$4(SparkSubmit.scala:363)
at scala.Option.map(Option.scala:230)
at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:363)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:871)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Traceback (most recent call last):
File "C:\Users\naveen.chandar\PycharmProjects\pythonProject\BigQueryConnector.py", line 4, in <module>
spark = SparkSession.builder.config('spark.jars', 'C:/Users/naveen.chandar/PycharmProjects/pythonProject/spark-bigquery-latest.jar').getOrCreate()
File "C:\Users\naveen.chandar\AppData\Local\Programs\Python\Python39\lib\site-packages\pyspark\sql\session.py", line 186, in getOrCreate
sc = SparkContext.getOrCreate(sparkConf)
File "C:\Users\naveen.chandar\AppData\Local\Programs\Python\Python39\lib\site-packages\pyspark\context.py", line 376, in getOrCreate
SparkContext(conf=conf or SparkConf())
File "C:\Users\naveen.chandar\AppData\Local\Programs\Python\Python39\lib\site-packages\pyspark\context.py", line 133, in __init__
SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
File "C:\Users\naveen.chandar\AppData\Local\Programs\Python\Python39\lib\site-packages\pyspark\context.py", line 325, in _ensure_initialized
SparkContext._gateway = gateway or launch_gateway(conf)
File "C:\Users\naveen.chandar\AppData\Local\Programs\Python\Python39\lib\site-packages\pyspark\java_gateway.py", line 105, in launch_gateway
raise Exception("Java gateway process exited before sending its port number")
Exception: Java gateway process exited before sending its port number
So I researched and tried running it from a different directory (the D drive), and also tried to fix a static port with set PYSPARK_SUBMIT_ARGS="--master spark://<IP_Address>:<Port>", but I still got the same error in PyCharm.
Then I tried the same script in a local Command Prompt under PySpark and got this error:
failed to find class org/conscrypt/CryptoUpcalls
ERROR:root:Exception while sending command.
Traceback (most recent call last):
File "D:\spark-2.4.7-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\java_gateway.py", line 1152, in send_command
answer = smart_decode(self.stream.readline()[:-1])
File "C:\Users\naveen.chandar\AppData\Local\Programs\Python\Python37\lib\socket.py", line 589, in readinto
return self._sock.recv_into(b)
ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "D:\spark-2.4.7-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\java_gateway.py", line 985, in send_command
response = connection.send_command(command)
File "D:\spark-2.4.7-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\java_gateway.py", line 1164, in send_command
"Error while receiving", e, proto.ERROR_ON_RECEIVE)
py4j.protocol.Py4JNetworkError: Error while receiving
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "D:\spark-2.4.7-bin-hadoop2.7\python\pyspark\sql\dataframe.py", line 381, in show
print(self._jdf.showString(n, 20, vertical))
File "D:\spark-2.4.7-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\java_gateway.py", line 1257, in __call__
File "D:\spark-2.4.7-bin-hadoop2.7\python\pyspark\sql\utils.py", line 63, in deco
return f(*a, **kw)
File "D:\spark-2.4.7-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\protocol.py", line 336, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling o42.showString
My Python version is 3.7.9 and my Spark version is 2.4.7.
Either way I have run out of ideas, and I would appreciate help with either of the situations I'm facing.
Thanks in advance!
Start your file system references with file:///c:/...
You need to replace / with \ for the path to work
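Following up on the first suggestion (a sketch under that assumption, not a verified fix): "No FileSystem for scheme: C" means Hadoop is parsing the drive letter C: as a URI scheme, so qualifying the local jar path with the file scheme should sidestep it:
from pyspark.sql import SparkSession

# Same builder as in the question, but the local jar is given as a file:/// URI
# so the drive letter "C" is not interpreted as a filesystem scheme.
spark = SparkSession.builder \
    .config('spark.jars', "file:///C:/Users/PycharmProjects/pythonProject/spark-bigquery-latest.jar") \
    .getOrCreate()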

Spark 2.0 with Zeppelin 0.6.1 - SQLContext not available

I am running Spark 2.0 and zeppelin-0.6.1-bin-all on a Linux server. The default Spark notebook runs just fine, but when I try to create and run a new notebook in PySpark using sqlContext, I get the error "py4j.Py4JException: Method createDataFrame([class java.util.ArrayList, class java.util.ArrayList, null]) does not exist".
I tried running some simple code:
%pyspark
wordsDF = sqlContext.createDataFrame([('cat',), ('elephant',), ('rat',), ('rat',), ('cat', )], ['word'])
wordsDF.show()
print type(wordsDF)
wordsDF.printSchema()
I get the error,
Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-7635635698598314374.py", line 266, in
raise Exception(traceback.format_exc())
Exception: Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-7635635698598314374.py", line 259, in
exec(code)
File "", line 1, in
File "/spark/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/context.py", line 299, in createDataFrame
return self.sparkSession.createDataFrame(data, schema, samplingRatio)
File "/spark/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/spark/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/spark/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 316, in get_return_value
format(target_id, ".", name, value))
Py4JError: An error occurred while calling o48.createDataFrame. Trace:
py4j.Py4JException: Method createDataFrame([class java.util.ArrayList, class java.util.ArrayList, null]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:272)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:211)
at java.lang.Thread.run(Thread.java:745)
When I try the same code after creating the context myself with "sqlContext = SQLContext(sc)", it works just fine.
I have tried setting the interpreter configuration "zeppelin.spark.useHiveContext false", but it did not help.
I must obviously be missing something, since this is such a simple operation. Please advise if there is any other configuration to be set, or what I am missing.
I tested the same piece of code with Zeppelin 0.6.0 and it works fine.
SparkSession is the default entry-point for Spark 2.0.0, which is mapped to spark in Zeppelin 0.6.1 (as it is in the Spark shell). Have you tried spark.createDataFrame(...)?
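Following that suggestion, a minimal sketch of the same paragraph using the spark entry point instead of the injected sqlContext (same data, just the SparkSession API):
%pyspark
wordsDF = spark.createDataFrame([('cat',), ('elephant',), ('rat',), ('rat',), ('cat', )], ['word'])
wordsDF.show()
wordsDF.printSchema()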
