Databricks on Apache Spark AttributeError: 'str' object has no attribute '_jvm' - apache-spark

When attempting to readStream data from Azure Event Hubs with Databricks on Apache Spark, I get the error
AttributeError: 'str' object has no attribute '_jvm'
The details of the error are as follows:
----> 8 ehConf['eventhubs.connectionString'] = sparkContext._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connectionString)
The code is as follows:
sparkContext = ""
connectionString = 'Endpoint=sb://namespace.servicebus.windows.net/;SharedAccessKeyName=both4;SharedAccessKey=adfdMyKeyIGBKYBs=;EntityPath=hubv5'
# Source with default settings
connectionString = connectionString
ehConf = {}
ehConf['eventhubs.connectionString'] = sparkContext._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connectionString)
streaming_df = spark \
.readStream \
.format("eventhubs") \
.options(**ehConf) \
.load()
Has anyone come across this error and found a solution?

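The variable sparkContext was assigned an empty string (sparkContext = ""), so ._jvm is looked up on a str. Use the predefined sc (or spark.sparkContext) instead: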
ehConf['eventhubs.connectionString'] = sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connectionString)
P.S. It is usually easier to use the built-in Kafka connector with Event Hubs: you don't need to install anything, and it's more performant.
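As a rough sketch of the Kafka route mentioned above, the options for Spark's kafka source can be built from the same connection string. The helper name eventhubs_kafka_options and the placeholder values are hypothetical. The settings assume the Event Hubs Kafka-compatible endpoint on port 9093 with SASL PLAIN, where the username is the literal string $ConnectionString. On Databricks the Kafka client classes are shaded, so the login module name typically needs a kafkashaded. prefix; treat that as an assumption to verify for your runtime.

```python
def eventhubs_kafka_options(
    namespace,
    event_hub,
    connection_string,
    login_module="org.apache.kafka.common.security.plain.PlainLoginModule",
):
    """Build options for Spark's built-in 'kafka' source against Event Hubs.

    Hypothetical helper: assumes the Kafka-compatible endpoint of the
    namespace listens on port 9093 and authenticates with SASL PLAIN.
    """
    # The username is the literal text "$ConnectionString"; the password
    # is the full Event Hubs connection string.
    jaas = (
        f"{login_module} required "
        f'username="$ConnectionString" password="{connection_string}";'
    )
    return {
        "kafka.bootstrap.servers": f"{namespace}.servicebus.windows.net:9093",
        "subscribe": event_hub,  # the event hub name plays the role of the topic
        "kafka.security.protocol": "SASL_SSL",
        "kafka.sasl.mechanism": "PLAIN",
        "kafka.sasl.jaas.config": jaas,
    }


# Placeholder values; substitute your own namespace, hub and connection string.
opts = eventhubs_kafka_options("namespace", "hubv5", "<connection-string>")

# With an active SparkSession named spark, the stream would then be read as:
# streaming_df = spark.readStream.format("kafka").options(**opts).load()
```

Keeping the options in a plain dictionary mirrors the ehConf pattern from the question, so only the format name and the option keys change.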

Related

Running a Spark Streaming job in Zeppelin throws connection refused 8998 error

I'm working in a virtual machine. I run a Spark Streaming job which I basically copied from a Databricks tutorial.
%pyspark
query = (
streamingCountsDF
.writeStream
.format("memory") # memory = store in-memory table
.queryName("counts") # counts = name of the in-memory table
.outputMode("complete") # complete = all the counts should be in the table
.start()
)
Py4JJavaError: An error occurred while calling o101.start.
: java.net.ConnectException: Call From VirtualBox/127.0.1.1 to localhost:8998 failed on connection exception: java.net.ConnectException:
I checked and there is no service listening on port 8998. I learned that this port is associated with the Apache Livy server, which I am not using. Can someone point me in the right direction?
Ok, so I fixed this issue. First, I added 'file://' when specifying the input folder. Second, I added a checkpoint location. See code below:
from pyspark.sql.functions import window  # needed for the windowed groupBy below
inputFolder = 'file:///home/sallos/tmp/'
streamingInputDF = (
spark
.readStream
.schema(schema)
.option("maxFilesPerTrigger", 1) # Treat a sequence of files as a stream by picking one file at a time
.csv(inputFolder)
)
streamingCountsDF = (
streamingInputDF
.groupBy(
streamingInputDF.SrcIPAddr,
window(streamingInputDF.Datefirstseen, "30 seconds"))
.sum('Bytes').withColumnRenamed("sum(Bytes)", "sum_bytes")
)
query = (
streamingCountsDF
.writeStream
.format("memory")
.queryName("sumbytes")
.outputMode("complete")
.option("checkpointLocation", "file:///home/sallos/tmp_checkpoint/")
.start()
)
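The fix above relies on giving Spark an explicit file:// scheme, presumably because a path without a scheme is resolved against the configured default filesystem rather than the local disk. A minimal sketch of building such URIs from absolute local paths follows; the helper name to_file_uri is hypothetical and it assumes POSIX-style paths.

```python
from urllib.parse import quote


def to_file_uri(absolute_path):
    """Turn an absolute POSIX path into a file:// URI (hypothetical helper)."""
    if not absolute_path.startswith("/"):
        raise ValueError("expected an absolute path, got: " + absolute_path)
    # quote() percent-encodes characters such as spaces but keeps '/' intact.
    return "file://" + quote(absolute_path)


input_folder = to_file_uri("/home/sallos/tmp/")
checkpoint_location = to_file_uri("/home/sallos/tmp_checkpoint/")
```

The resulting strings can be passed to .csv(...) and to the checkpointLocation option in place of the hard-coded literals.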

PySpark Kafka - NoClassDefFound: org/apache/commons/pool2

I am having a problem printing data from a Kafka topic to the console.
The error message I get is shown in the image below.
As the image shows, nothing is processed after batch 0.
All of these are snapshots of the error messages. I don't understand the root cause of the errors. Please help me.
The Kafka and Spark versions are:
spark version: spark-3.1.1-bin-hadoop2.7
kafka version: kafka_2.13-2.7.0
I am using the following jars:
kafka-clients-2.7.0.jar
spark-sql-kafka-0-10_2.12-3.1.1.jar
spark-token-provider-kafka-0-10_2.12-3.1.1.jar
Here is my code:
spark = SparkSession \
.builder \
.appName("Pyspark structured streaming with kafka and cassandra") \
.master("local[*]") \
.config("spark.jars","file:///C://Users//shivani//Desktop//Spark//kafka-clients-2.7.0.jar,file:///C://Users//shivani//Desktop//Spark//spark-sql-kafka-0-10_2.12-3.1.1.jar,file:///C://Users//shivani//Desktop//Spark//spark-cassandra-connector-2.4.0-s_2.11.jar,file:///D://mysql-connector-java-5.1.46//mysql-connector-java-5.1.46.jar,file:///C://Users//shivani//Desktop//Spark//spark-token-provider-kafka-0-10_2.12-3.1.1.jar")\
.config("spark.executor.extraClassPath","file:///C://Users//shivani//Desktop//Spark//kafka-clients-2.7.0.jar,file:///C://Users//shivani//Desktop//Spark//spark-sql-kafka-0-10_2.12-3.1.1.jar,file:///C://Users//shivani//Desktop//Spark//spark-cassandra-connector-2.4.0-s_2.11.jar,file:///D://mysql-connector-java-5.1.46//mysql-connector-java-5.1.46.jar,file:///C://Users//shivani//Desktop//Spark//spark-token-provider-kafka-0-10_2.12-3.1.1.jar")\
.config("spark.executor.extraLibrary","file:///C://Users//shivani//Desktop//Spark//kafka-clients-2.7.0.jar,file:///C://Users//shivani//Desktop//Spark//spark-sql-kafka-0-10_2.12-3.1.1.jar,file:///C://Users//shivani//Desktop//Spark//spark-cassandra-connector-2.4.0-s_2.11.jar,file:///D://mysql-connector-java-5.1.46//mysql-connector-java-5.1.46.jar,file:///C://Users//shivani//Desktop//Spark//spark-token-provider-kafka-0-10_2.12-3.1.1.jar")\
.config("spark.driver.extraClassPath","file:///C://Users//shivani//Desktop//Spark//kafka-clients-2.7.0.jar,file:///C://Users//shivani//Desktop//Spark//spark-sql-kafka-0-10_2.12-3.1.1.jar,file:///C://Users//shivani//Desktop//Spark//spark-cassandra-connector-2.4.0-s_2.11.jar,file:///D://mysql-connector-java-5.1.46//mysql-connector-java-5.1.46.jar,file:///C://Users//shivani//Desktop//Spark//spark-token-provider-kafka-0-10_2.12-3.1.1.jar")\
.getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
#streaming dataframe that reads from kafka topic
df_kafka=spark.readStream\
.format("kafka")\
.option("kafka.bootstrap.servers",kafka_bootstrap_servers)\
.option("subscribe",kafka_topic_name)\
.option("startingOffsets", "latest") \
.load()
print("Printing schema of df_kafka:")
df_kafka.printSchema()
#converting data from kafka broker to string type
df_kafka_string=df_kafka.selectExpr("CAST(value AS STRING) as value")
# schema to read json format data
ts_schema = StructType() \
.add("id_str", StringType()) \
.add("created_at", StringType()) \
.add("text", StringType())
#parse json data
df_kafka_string_parsed=df_kafka_string.select(from_json(col("value"),ts_schema).alias("twts"))
df_kafka_string_parsed_format=df_kafka_string_parsed.select("twts.*")
df_kafka_string_parsed_format.printSchema()
df=df_kafka_string_parsed_format.writeStream \
.trigger(processingTime="1 seconds") \
.outputMode("update")\
.option("truncate","false")\
.format("console")\
.start()
df.awaitTermination()
The error (NoClassDefFoundError, followed by the kafka010 package) says that spark-sql-kafka-0-10 is missing its transitive dependency org.apache.commons:commons-pool2:2.6.2.
You can either download that JAR as well, or change your code to use --packages instead of the spark.jars option and let Ivy download the transitive dependencies:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache...'
spark = SparkSession.builder...
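A slightly fuller sketch of the --packages approach is shown here. The coordinate is an assumption derived from the JAR name in the question (spark-sql-kafka-0-10_2.12-3.1.1.jar), and the helper name build_submit_args is hypothetical. When PySpark is started from a plain Python process, the argument string is generally expected to end with pyspark-shell, and the variable has to be set before the first SparkSession is created.

```python
import os


def build_submit_args(packages):
    """Join Maven coordinates into a PYSPARK_SUBMIT_ARGS value (hypothetical helper)."""
    # --packages takes a comma-separated list of group:artifact:version coordinates.
    return "--packages " + ",".join(packages) + " pyspark-shell"


# Assumed coordinate: Scala 2.12 build of the Kafka source for Spark 3.1.1.
kafka_package = "org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1"

submit_args = build_submit_args([kafka_package])
os.environ["PYSPARK_SUBMIT_ARGS"] = submit_args

# Create the SparkSession only after this point, for example:
# spark = SparkSession.builder.appName("kafka-console").getOrCreate()
```

With this in place the spark.jars and extraClassPath settings for the Kafka JARs should no longer be needed, since the dependency resolver fetches commons-pool2 and the other transitive dependencies.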

Unable to read Azure Eventhub topics from spark

Environment details
spark version : 3.x
Python version 3.8 and java version 8
azure-eventhubs-spark_2.12-2.3.17.jar
import json
from pyspark.sql import SparkSession
#the below command getOrCreate() uses the SparkSession shared across the jobs instead of using one SparkSession per job.
spark = SparkSession.builder.appName('ntorq_eventhub_load').getOrCreate()
#ntorq adls checkpoint location.
ntorq_connection_string = "connection-string"
ehConf = {}
ehConf['eventhubs.connectionString'] = spark.sparkContext._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(ntorq_connection_string)
# ehConf['eventhubs.connectionString'] = ntorq_connection_string
ehConf['eventhubs.consumerGroup'] = "$default"
OFFSET_START = "-1" # the beginning
OFFSET_END = "#latest"
# Create the positions
startingEventPosition = {
"offset": OFFSET_START ,
"seqNo": -1, #not in use
"enqueuedTime": None, #not in use
"isInclusive": True
}
endingEventPosition = {
"offset": OFFSET_END, #not in use
"seqNo": -1, #not in use
"enqueuedTime": None,
"isInclusive": True
}
# Put the positions into the Event Hub config dictionary
ehConf["eventhubs.startingPosition"] = json.dumps(startingEventPosition)
ehConf["eventhubs.endingPosition"] = json.dumps(endingEventPosition)
df = spark \
.readStream \
.format("eventhubs") \
.options(**ehConf) \
.load() \
.selectExpr("cast(body as string) as body_str")
df.writeStream \
.format("console") \
.start()
error
21/04/25 20:17:53 WARN Utils: Your hostname,resolves to a loopback address: 127.0.0.1; using 192.168.1.202 instead (on interface en0)
21/04/25 20:17:53 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
21/04/25 20:17:53 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Traceback (most recent call last):
File "/Users/PycharmProjects/pythonProject/test.py", line 12, in <module>
ehConf['eventhubs.connectionString'] = spark.sparkContext._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(ntorq_connection_string)
TypeError: 'JavaPackage' object is not callable
The code works fine in the Databricks environment but does not consume all messages from the Event Hub. I tried clearing the default checkpoint folders before every run but still face the issue, so I want to try it on my local system.
On the local environment I get the JavaPackage error shown above.
I'd appreciate any help.
Thank you.
You need to add the EventHubs package when creating the session:
spark = SparkSession.builder.appName('ntorq_eventhub_load')\
.config("spark.jars.packages", "com.microsoft.azure:azure-eventhubs-spark_2.12:2.3.18")\
.getOrCreate()

Apache Spark 2.0 (PySpark) - DataFrame Error Multiple sources found for csv

I am trying to create a DataFrame using the following code in Spark 2.0. When I run the code in Jupyter or the console, I get the error below. Can someone help me get rid of it?
Error:
Py4JJavaError: An error occurred while calling o34.csv.
: java.lang.RuntimeException: Multiple sources found for csv (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat, com.databricks.spark.csv.DefaultSource15), please specify the fully qualified class name.
at scala.sys.package$.error(package.scala:27)
Code:
from pyspark.sql import SparkSession
if __name__ == "__main__":
    session = SparkSession.builder.master('local') \
        .appName("RealEstateSurvey").getOrCreate()
    df = session \
        .read \
        .option("inferSchema", value = True) \
        .option('header','true') \
        .csv("/home/senthiljdpm/RealEstate.csv")
    print("=== Print out schema ===")
    session.stop()
The error occurs because both libraries (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat and com.databricks.spark.csv.DefaultSource) are on your classpath, and Spark cannot decide which one to use.
All you need to do is tell Spark to use com.databricks.spark.csv.DefaultSource by setting the format:
df = session \
.read \
.format("com.databricks.spark.csv") \
.option("inferSchema", value = True) \
.option('header','true') \
.csv("/home/senthiljdpm/RealEstate.csv")
Another alternative is to use load. Because csv() sets the format to "csv" internally, load is the safer choice if the error persists:
df = session \
.read \
.format("com.databricks.spark.csv") \
.option("inferSchema", value = True) \
.option('header','true') \
.load("/home/senthiljdpm/RealEstate.csv")
If anyone faced a similar issue in Spark Java, it could be because you have multiple versions of the spark-sql jar in your classpath. Just FYI.
I faced the same issue, and it was fixed when I changed the Hudi version used in pom.xml from 9.0 to 11.1.

Consume events from EventHub In Azure Databricks using pySpark

I can see Spark connectors and guidelines for consuming events from Event Hubs using Scala in Azure Databricks.
But how can we consume events from Event Hubs in Azure Databricks using PySpark?
Any suggestions or documentation would help. Thanks.
Below is a snippet for reading events from Event Hubs with PySpark on Azure Databricks.
# Connection string format, with an entity path:
# "Endpoint=sb://SAMPLE;SharedAccessKeyName=KEY_NAME;SharedAccessKey=KEY;EntityPath=EVENTHUB_NAME"
# Source with default settings
connectionString = "Valid EventHubs connection string."
ehConf = {
'eventhubs.connectionString' : connectionString
}
df = spark \
.readStream \
.format("eventhubs") \
.options(**ehConf) \
.load()
readInStreamBody = df.withColumn("body", df["body"].cast("string"))
display(readInStreamBody)
A slight modification is required if you are using Spark 2.4.5 or later with version 2.3.15 or later of the Azure Event Hubs connector.
From version 2.3.15 on, the configuration dictionary requires the connection string to be encrypted, so you need to pass it as shown in the code snippet below.
connectionString = "Endpoint=sb://SAMPLE;SharedAccessKeyName=KEY_NAME;SharedAccessKey=KEY;EntityPath=EVENTHUB_NAME"
ehConf = {}
ehConf['eventhubs.connectionString'] = sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connectionString)
df = spark \
.readStream \
.format("eventhubs") \
.options(**ehConf) \
.load()
readInStreamBody = df.withColumn("body", df["body"].cast("string"))
display(readInStreamBody)

Resources