StreamingQueryException: Option 'basePath' must be a directory

I wrote the code below and got this error: StreamingQueryException: Option 'basePath' must be a directory. My goal is to write the stream to a CSV file sink. The directories output_path/ and checkpoint/ were created but remain empty.
pipe = Pipeline(stages=indexers)
pipe_model = pipe.fit(dataset)
dataset = pipe_model.transform(dataset)
pipe_model.save("pipe_model")

df = spark \
    .readStream \
    .option("header", "true") \
    .schema(schema) \
    .csv("KDDTrain+.txt")

model = PipelineModel.load("pipe_model")
dataset = model.transform(df)

q = dataset.writeStream \
    .format("csv") \
    .option("header", "true") \
    .option("format", "append") \
    .queryName("okk") \
    .trigger(processingTime="10 seconds") \
    .option("checkpointLocation", "checkpoint/") \
    .option("path", "output_path/") \
    .outputMode("append") \
    .start()

q.awaitTermination()
I got this error:
---------------------------------------------------------------------------
StreamingQueryException Traceback (most recent call last)
Input In [7], in <cell line: 1>()
----> 1 q.awaitTermination()
File /usr/local/spark/python/pyspark/sql/streaming.py:101, in StreamingQuery.awaitTermination(self, timeout)
99 return self._jsq.awaitTermination(int(timeout * 1000))
100 else:
--> 101 return self._jsq.awaitTermination()
File /usr/local/spark/python/lib/py4j-0.10.9.3-src.zip/py4j/java_gateway.py:1321, in JavaMember.__call__(self, *args)
1315 command = proto.CALL_COMMAND_NAME +\
1316 self.command_header +\
1317 args_command +\
1318 proto.END_COMMAND_PART
1320 answer = self.gateway_client.send_command(command)
-> 1321 return_value = get_return_value(
1322 answer, self.gateway_client, self.target_id, self.name)
1324 for temp_arg in temp_args:
1325 temp_arg._detach()
File /usr/local/spark/python/pyspark/sql/utils.py:117, in capture_sql_exception.<locals>.deco(*a, **kw)
113 converted = convert_exception(e.java_exception)
114 if not isinstance(converted, UnknownException):
115 # Hide where the exception came from that shows a non-Pythonic
116 # JVM exception message.
--> 117 raise converted from None
118 else:
119 raise
StreamingQueryException: Option 'basePath' must be a directory
=== Streaming Query ===
Identifier: okk [id = a5a4ac3d-a533-409b-be0b-015ead8d2f4a, runId = 3e3b747d-ba4f-4948-bc9f-4ef360e08979]
Current Committed Offsets: {}
Current Available Offsets: {FileStreamSource[file:/home/jovyan/work/KDDTrain+.txt]: {"logOffset":0}}
Current State: ACTIVE
Thread State: RUNNABLE
Where is the problem, and how can I fix it?

I fixed the issue by reading from a directory rather than a single file, i.e.:
df = spark \
    .readStream \
    .option("header", "true") \
    .schema(schema) \
    .csv("KDD/")

instead of:

df = spark \
    .readStream \
    .option("header", "true") \
    .schema(schema) \
    .csv("KDDTrain+.txt")

Related

How can I handle or escape special characters in Spark 3.3.0?

Previously I was using Spark 3.2.1 to read data from an SAP database through the JDBC connector, and I had no issues performing the following steps:
df_1 = spark.read.format("jdbc") \
.option("url", "URL_LINK") \
.option("dbtable", 'DATABASE."/ABC/TABLE"') \
.option("user", "USER_HERE") \
.option("password", "PW_HERE") \
.option("driver", "com.sap.db.jdbc.Driver") \
.load()
display(df_1)
df_2 = df_1.filter("`/ABC/COLUMN` = 'ID_HERE'")
display(df_2)
The code above runs as it should, returning the expected rows.
Since I upgraded to Spark 3.3.0 (I need the new 'availableNow' streaming trigger), the process above fails and does not run at all.
Please see the error message below.
---------------------------------------------------------------------------
ParseException Traceback (most recent call last)
<command-963568451378752> in <cell line: 3>()
1 df_2 = df_1.filter("`/ABC/COLUMN` = 'ID_HERE'")
2
----> 3 display(df_2)
/databricks/python_shell/dbruntime/display.py in display(self, input, *args, **kwargs)
81 raise Exception('Triggers can only be set for streaming queries.')
82
---> 83 self.add_custom_display_data("table", input._jdf)
84
85 elif isinstance(input, list):
/databricks/python_shell/dbruntime/display.py in add_custom_display_data(self, data_type, data)
34 def add_custom_display_data(self, data_type, data):
35 custom_display_key = str(uuid.uuid4())
---> 36 return_code = self.entry_point.addCustomDisplayData(custom_display_key, data_type, data)
37 ip_display({
38 "application/vnd.databricks.v1+display": custom_display_key,
/databricks/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py in __call__(self, *args)
1319
1320 answer = self.gateway_client.send_command(command)
-> 1321 return_value = get_return_value(
1322 answer, self.gateway_client, self.target_id, self.name)
1323
/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
200 # Hide where the exception came from that shows a non-Pythonic
201 # JVM exception message.
--> 202 raise converted from None
203 else:
204 raise
ParseException:
[PARSE_SYNTAX_ERROR] Syntax error at or near '/': extra input '/' (line 1, pos 0)
== SQL ==
/ABC/COLUMN
^^^
I've already tried formatting the expression in many different ways, following the instructions at https://spark.apache.org/docs/latest/sql-ref-literals.html. I've tried building the string with string formatting and with raw strings, but nothing works as expected.
Another important detail: I tried to create a dummy reproduction for you, but when I create tables with slashes ('/ABC/TABLE') containing columns with slashes ('/ABC/COLUMN') directly in PySpark, instead of reading them through the JDBC connector, the filter actually works.
As described above, I expect the filter to work regardless of the Spark version.
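A minimal sketch of the kind of dummy reproduction described above (a column whose name contains slashes, created directly in PySpark rather than read over JDBC, where the same backticked filter works):

# Sketch only: build a small DataFrame with a slash-containing column name
# and apply the same filter expression that fails on the JDBC-read DataFrame.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("slash-column-repro").getOrCreate()

df = spark.createDataFrame(
    [("ID_HERE", 1), ("OTHER_ID", 2)],
    ["/ABC/COLUMN", "value"],      # column name with slashes
)

df.filter("`/ABC/COLUMN` = 'ID_HERE'").show()   # works on this dummy table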

I cannot connect to S3 using PySpark from local machine (Class org.apache.hadoop.fs.s3a.auth.IAMInstanceCredentialsProvider not found)

I'm reading streaming data from Kafka using PySpark but when I want to write the streaming data to S3 I receive an error message: Class org.apache.hadoop.fs.s3a.auth.IAMInstanceCredentialsProvider not found
This is part of my code:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split, col, from_json, from_unixtime, unix_timestamp
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType, TimestampType, MapType, ArrayType
from sparknlp.pretrained import PretrainedPipeline
spark = SparkSession.builder.appName('twitter_app')\
.master("local[*]")\
.config('spark.jars.packages', 'org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1,com.amazonaws:aws-java-sdk:1.11.563,org.apache.hadoop:hadoop-aws:3.2.2,org.apache.hadoop:hadoop-client-api:3.2.2,org.apache.hadoop:hadoop-client-runtime:3.2.2,org.apache.hadoop:hadoop-yarn-server-web-proxy:3.2.2')\
.config('spark.streaming.stopGracefullyOnShutdown', 'true')\
.config('spark.hadoop.fs.s3a.awsAccessKeyId', ACCESS_KEY) \
.config('spark.hadoop.fs.s3a.awsSecretAccessKey', SECRET_ACCESS_KEY) \
.config("spark.hadoop.fs.s3a.impl","org.apache.hadoop.fs.s3a.S3AFileSystem") \
.getOrCreate()
schema = StructType() \
.add("data", StructType() \
.add("created_at", TimestampType())
.add("id", StringType()) \
.add("text", StringType())) \
.add("matching_rules", ArrayType(StructType() \
.add('id', StringType()) \
.add('tag', StringType())))
df = spark.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "localhost:9092,localhost:9093,localhost:9094") \
.option("subscribe", "Zelensky,Putin,Biden,NATO,NoFlyZone") \
.option("startingOffsets", "latest") \
.load() \
.select((from_json(col("value").cast("string"), schema)).alias('text'),
col('topic'), col('key').cast('string'))
df.writeStream \
.format("parquet") \
.option("checkpointLocation", "s3a://data-lake-twitter-app/checkpoint/") \
.option("path", "s3a://data-lake-twitter-app/raw-datalake/") \
.start()
And this is the full error message:
22/03/28 21:27:37 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
22/03/28 21:27:37 WARN FileSystem: Failed to initialize fileystem s3a://data-lake-twitter-app/raw-datalake: java.io.IOException: From option fs.s3a.aws.credentials.provider java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.auth.IAMInstanceCredentialsProvider not found
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
Input In [7], in <cell line: 1>()
----> 1 df.writeStream \
2 .format("parquet") \
3 .option("checkpointLocation", "s3a://data-lake-twitter-app/checkpoint/") \
4 .option("path", "s3a://data-lake-twitter-app/raw-datalake/") \
5 .start()
File ~/.local/share/virtualenvs/spark_home_lab-iuwyZNhT/lib/python3.9/site-packages/pyspark/sql/streaming.py:1202, in DataStreamWriter.start(self, path, format, outputMode, partitionBy, queryName, **options)
1200 self.queryName(queryName)
1201 if path is None:
-> 1202 return self._sq(self._jwrite.start())
1203 else:
1204 return self._sq(self._jwrite.start(path))
File ~/.local/share/virtualenvs/spark_home_lab-iuwyZNhT/lib/python3.9/site-packages/py4j/java_gateway.py:1321, in JavaMember.__call__(self, *args)
1315 command = proto.CALL_COMMAND_NAME +\
1316 self.command_header +\
1317 args_command +\
1318 proto.END_COMMAND_PART
1320 answer = self.gateway_client.send_command(command)
-> 1321 return_value = get_return_value(
1322 answer, self.gateway_client, self.target_id, self.name)
1324 for temp_arg in temp_args:
1325 temp_arg._detach()
File ~/.local/share/virtualenvs/spark_home_lab-iuwyZNhT/lib/python3.9/site-packages/pyspark/sql/utils.py:111, in capture_sql_exception.<locals>.deco(*a, **kw)
109 def deco(*a, **kw):
110 try:
--> 111 return f(*a, **kw)
112 except py4j.protocol.Py4JJavaError as e:
113 converted = convert_exception(e.java_exception)
File ~/.local/share/virtualenvs/spark_home_lab-iuwyZNhT/lib/python3.9/site-packages/py4j/protocol.py:326, in get_return_value(answer, gateway_client, target_id, name)
324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
325 if answer[1] == REFERENCE_TYPE:
--> 326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(
331 "An error occurred while calling {0}{1}{2}. Trace:\n{3}\n".
332 format(target_id, ".", name, value))
Py4JJavaError: An error occurred while calling o81.start.
: java.io.IOException: From option fs.s3a.aws.credentials.provider java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.auth.IAMInstanceCredentialsProvider not found
at org.apache.hadoop.fs.s3a.S3AUtils.loadAWSProviderClasses(S3AUtils.java:631)
at org.apache.hadoop.fs.s3a.S3AUtils.createAWSCredentialProviderSet(S3AUtils.java:597)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:257)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3469)
at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
at org.apache.spark.sql.execution.streaming.FileStreamSink.<init>(FileStreamSink.scala:135)
at org.apache.spark.sql.execution.datasources.DataSource.createSink(DataSource.scala:326)
at org.apache.spark.sql.streaming.DataStreamWriter.createV1Sink(DataStreamWriter.scala:432)
at org.apache.spark.sql.streaming.DataStreamWriter.startInternal(DataStreamWriter.scala:399)
at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:248)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.auth.IAMInstanceCredentialsProvider not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2571)
at org.apache.hadoop.conf.Configuration.getClasses(Configuration.java:2642)
at org.apache.hadoop.fs.s3a.S3AUtils.loadAWSProviderClasses(S3AUtils.java:628)
... 25 more
After checking other questions on Stack Overflow, it might be that I'm not using the correct versions of those jar packages; how can I check which ones I need to install? I'm also using a Pipenv environment; I don't know if that's relevant.
Try adding
.config("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")
to your SparkSession builder to avoid the java.lang.ClassNotFoundException.
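A minimal sketch of how that might look in the session builder, assuming the keys are supplied via the standard fs.s3a.access.key / fs.s3a.secret.key options rather than the awsAccessKeyId / awsSecretAccessKey names used in the question:

# Sketch only: pin the S3A credentials provider to SimpleAWSCredentialsProvider
# so the connector does not try to load IAMInstanceCredentialsProvider.
# ACCESS_KEY and SECRET_ACCESS_KEY are assumed to be defined elsewhere.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("twitter_app") \
    .master("local[*]") \
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .config("spark.hadoop.fs.s3a.aws.credentials.provider",
            "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider") \
    .config("spark.hadoop.fs.s3a.access.key", ACCESS_KEY) \
    .config("spark.hadoop.fs.s3a.secret.key", SECRET_ACCESS_KEY) \
    .getOrCreate()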

com.microsoft.sqlserver.jdbc.SQLServerException: Error when reading Azure SQLDB from Apache Spark Databricks

I answered one of my previous questions; however, having fixed that issue, I am now faced with another issue regarding a SQLServerException.
I am trying to read in data on an Azure SQLDB.
I have successfully authenticated to the server; however, when I apply the function that reads in the data, I get the following error:
com.microsoft.sqlserver.jdbc.SQLServerException: Incorrect syntax near
'5'
The following gives more detail on the error:
Py4JJavaError Traceback (most recent call last)
<command-3741352302548628> in readFromDb(processId, query)
3 try:
----> 4 jdbcDF = (spark.read
5 .format("jdbc")
/databricks/spark/python/pyspark/sql/readwriter.py in load(self, path, format, schema, **options)
209 else:
--> 210 return self._df(self._jreader.load())
211
/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
1303 answer = self.gateway_client.send_command(command)
-> 1304 return_value = get_return_value(
1305 answer, self.gateway_client, self.target_id, self.name
The code is as follows:
def readFromDb(processId, query):
    try:
        jdbcDF = (spark.read
            .format("jdbc")
            .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
            .option("url", f"jdbc:sqlserver://{DBServer}.database.windows.net;database={DBDatabase}")
            .option("user", DBUser)
            .option("query", query)
            .option("password", DBPword)
            .load()
        )
        return jdbcDF
    except Exception as e:
        writeToLogs(processId, LogType.Error, EventType.FailReadFromDb, LogMessage.FailReadFromDb, errorType=ErrorType.FailReadFromDb)
        raise Error(f"{LogMessage.FailReadFromDb.value} ERROR: {e}")
    except:
        writeToLogs(processId, LogType.FailReadFromDb, EventType.FailReadFromDb, LogMessage.FailReadFromDb, errorType=ErrorType.FailReadFromDb)
        raise Error(f"{LogMessage.FailReadFromDb.value}")
Can someone let me know what this error generally means and the best approach to fixing it?
As @mac and Alex Ott said, the error is most likely caused by the query statement, and we are glad to hear it was resolved by modifying the query.
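As a rough, hypothetical illustration of that point: the "query" option passed to readFromDb has to contain a complete, valid T-SQL SELECT statement; errors like "Incorrect syntax near '5'" typically point at a malformed query string. Table and column names below are invented for illustration.

# Hypothetical usage of readFromDb with a well-formed T-SQL query.
sample_query = """
SELECT TOP (100) Id, Name, CreatedDate
FROM dbo.SampleTable
WHERE CreatedDate >= '2021-01-01'
"""

df = readFromDb(processId="demo-process", query=sample_query)
df.show(5)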

Getting an error while reading data from Amazon Redshift using Spark (PySpark): The bucket name parameter must be specified when requesting a bucket's location

Can you help me with reading data using Spark + Redshift + the Databricks driver? For now I am getting an error when calling the read method. Below is my piece of code:
df = spark.read.format("com.databricks.spark.redshift") \
    .option("url", redshifturl) \
    .option("dbtable", "PG_TABLE_DEF") \
    .option("tempdir", "s3n://KEY_ID:SECRET_KEY_ID#/S2_BUCKET_NAME/TEMP_FOLDER_UNDER_S3_BUCKET/") \
    .option("aws_iam_role", "AWS_IAM_ROLE") \
    .load()
Below is the error log I am getting:
IllegalArgumentException: u"The bucket name parameter must be specified when requesting a bucket's location"
---------------------------------------------------------------------------
IllegalArgumentException Traceback (most recent call last)
<command-3255625043609925> in <module>()
----> 1 df = spark.read .format("com.databricks.spark.redshift") .option("url", redshifturl) .option("dbtable", "pg_table_def") .option("tempdir", "s3n://AKIAJXVW3IESJSQUTCUA:kLHR85WfcieNrd7B7Rm/1FK1JU4NeKTrpe8BkLbx#/weatherpattern/temp/") .option("aws_iam_role", "arn:aws:iam::190137980335:user/user1") .load()
/databricks/spark/python/pyspark/sql/readwriter.py in load(self, path, format, schema, **options)
163 return self._df(self._jreader.load(self._spark._sc._jvm.PythonUtils.toSeq(path)))
164 else:
--> 165 return self._df(self._jreader.load())
166
167 #since(1.4)
/databricks/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py in __call__(self, *args)
1131 answer = self.gateway_client.send_command(command)
1132 return_value = get_return_value(
-> 1133 answer, self.gateway_client, self.target_id, self.name)
1134
1135 for temp_arg in temp_args:
/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
77 raise QueryExecutionException(s.split(': ', 1)[1], stackTrace)
78 if s.startswith('java.lang.IllegalArgumentException: '):
---> 79 raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
80 raise
81 return deco
IllegalArgumentException: u"The bucket name parameter must be specified when requesting a bucket's location"
I think there is some problem with the s3n path, but the way I have given it in the .option method looks correct with my real credentials.
Any suggestion would be appreciated.
Thanks
Imran :)
Your path URL is incorrect. The format should be:
s3n://ACCESSKEY:SECRETKEY#bucket/path/to/temp/dir
df = spark.read.format("com.databricks.spark.redshift") \
    .option("url", redshifturl) \
    .option("dbtable", "PG_TABLE_DEF") \
    .option("tempdir", "s3n://KEY_ID:SECRET_KEY_ID#S2_BUCKET_NAME/TEMP_FOLDER_UNDER_S3_BUCKET/") \
    .option("aws_iam_role", "AWS_IAM_ROLE") \
    .load()
Documentation:
https://github.com/databricks/spark-redshift
Hope it helps.
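As an alternative sketch (my own suggestion, not part of the answer above), the AWS keys can also be set on the Hadoop configuration instead of being embedded in the tempdir URL, which additionally avoids trouble when the secret key contains characters such as '/':

# Alternative sketch: supply s3n credentials via the Hadoop configuration
# and keep the tempdir URL free of embedded keys. KEY_ID / SECRET_KEY_ID
# are placeholders; redshifturl is assumed to be defined as in the question.
sc = spark.sparkContext
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "KEY_ID")
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "SECRET_KEY_ID")

df = spark.read.format("com.databricks.spark.redshift") \
    .option("url", redshifturl) \
    .option("dbtable", "PG_TABLE_DEF") \
    .option("tempdir", "s3n://S2_BUCKET_NAME/TEMP_FOLDER_UNDER_S3_BUCKET/") \
    .option("aws_iam_role", "AWS_IAM_ROLE") \
    .load()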

I am getting IllegalArgumentException when creating a SparkSession

I am using PySpark and a Jupyter notebook with Spark 2.1.0 and Python 2.7. I am trying to create a new SparkSession using the code below:
from pyspark import SparkContext
from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
spark = SparkSession\
.builder\
.appName("Bank Service Classifier")\
.config("spark.sql.crossJoin.enabled","true")\
.getOrCreate()
sc = SparkContext()
sqlContext = SQLContext(sc)
However, I am getting the following error:
IllegalArgumentException Traceback (most recent call last)
<ipython-input-40-2683a8d0ffcf> in <module>()
4 from pyspark.sql import SQLContext
5
----> 6 spark = SparkSession .builder .appName("example-spark") .config("spark.sql.crossJoin.enabled","true") .getOrCreate()
7
8 sc = SparkContext()
/srv/spark/python/pyspark/sql/session.py in getOrCreate(self)
177 session = SparkSession(sc)
178 for key, value in self._options.items():
--> 179 session._jsparkSession.sessionState().conf().setConfString(key, value)
180 for key, value in self._options.items():
181 session.sparkContext._conf.set(key, value)
/srv/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py in __call__(self, *args)
1131 answer = self.gateway_client.send_command(command)
1132 return_value = get_return_value(
-> 1133 answer, self.gateway_client, self.target_id, self.name)
1134
1135 for temp_arg in temp_args:
/srv/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
77 raise QueryExecutionException(s.split(': ', 1)[1], stackTrace)
78 if s.startswith('java.lang.IllegalArgumentException: '):
---> 79 raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
80 raise
81 return deco
IllegalArgumentException: u"Error while instantiating 'org.apache.spark.sql.hive.HiveSessionState':"
How do I fix this?
I ran into this same error. Downloading Spark pre-built for Hadoop 2.6 instead of 2.7 worked for me.
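A side note on the snippet in the question, separate from the HiveSessionState error: once getOrCreate() has returned a session, calling SparkContext() again will fail because a context already exists. The usual pattern looks roughly like this:

# Sketch of the usual pattern: reuse the context owned by the session
# instead of constructing a second SparkContext.
from pyspark.sql import SparkSession, SQLContext

spark = SparkSession.builder \
    .appName("Bank Service Classifier") \
    .config("spark.sql.crossJoin.enabled", "true") \
    .getOrCreate()

sc = spark.sparkContext          # reuse the existing context
sqlContext = SQLContext(sc)      # only needed for legacy APIs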
