Unable to send Pyspark data frame to Kafka topic - apache-spark

I am trying to send data from a daily batch to a Kafka topic using pyspark, but I currently receive the following error:
Traceback (most recent call last):
  File "", line 5, in
  File "/usr/local/rms/lib/hdp26_c5000/spark2/python/pyspark/sql/readwriter.py", line 548, in save
    self._jwrite.save()
  File "/usr/local/rms/lib/hdp26_c5000/spark2/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/usr/local/rms/lib/hdp26_c5000/spark2/python/pyspark/sql/utils.py", line 71, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u"Invalid call to toAttribute on unresolved object, tree: unresolvedalias('shop_id, None)"
The code I am using is as follows:
from pyspark.sql import SparkSession
from pyspark.sql import functions
spark = SparkSession \
.builder \
.appName("Python Spark SQL basic example") \
.config("spark.debug.maxToStringFields", 100000) \
.getOrCreate()
df = spark.sql('''select distinct shop_id, item_id
from sale.data
''')
df.selectExpr("shop_id", "item_id") \
.write \
.format("kafka") \
.option("kafka.bootstrap.servers", "myserver.local:443") \
.option("topic","test_topic_01") \
.save()
Currently used versions are:
-Spark 2.1.1.2.6.2.0-205
-Kafka Broker 0.11

Kafka expects a key and a value to be written to its topic, although the key is not mandatory. The Kafka sink finds them by looking at the DataFrame column names, which must be "key" and "value".
In your query neither selected column is named "key" or "value", so no key or value column exists. The error message "unresolvedalias('shop_id, None)" tells you that the column "shop_id" is picked up first (as it is the first column), but nothing can be interpreted as the mandatory value.
You can solve the issue by providing a column named "value", for example:
df = spark.sql('''select distinct shop_id, item_id
from sale.data
''')
df.withColumn("value", df["shop_id"].cast("string")) \
    .write \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "myserver.local:443") \
    .option("topic", "test_topic_01") \
    .save()
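If you want to keep both shop_id and item_id in the message, a common pattern is to serialize the columns into the "value" column, for example as JSON. A minimal sketch only, reusing the broker and topic from the question:
from pyspark.sql.functions import col, struct, to_json
df.select(col("shop_id").cast("string").alias("key"),          # optional message key
          to_json(struct("shop_id", "item_id")).alias("value")) \
    .write \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "myserver.local:443") \
    .option("topic", "test_topic_01") \
    .save()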

Related

Python Proxy Error when streaming XML files from Azure Event Hub using Databricks

I've got the below piece of code to retrieve XML files, extract some of the tags and save them as CSV files. As the tag values need to be saved as separate files, I'm using the foreachBatch method of df.writeStream to extract and save them separately. See below the environment/version, the code used and the error returned when executed on Azure Databricks.
Environment:
Databricks Runtime version: 10.4 LTS
Apache Spark 3.2.1,
Scala 2.12
Event hubs library from maven: com.microsoft.azure:azure-eventhubs-spark_2.12:2.3.22
Code:
# Databricks notebook source
import lxml.etree as ET
import pyspark.sql.types as T
from pyspark.sql.functions import udf
from os.path import dirname, join
# Define namespaces found in the xml files to pick elements from the default ("") namespace or a specific namespace
namespaces = {
    "": "http://www.fpml.org/FpML-5/reporting",
    "xsi": "http://www.w3.org/2001/XMLSchema-instance"
}
# trade date **********************************
trade_header = T.StructType([
    T.StructField("messageId", T.StringType(), False),
    T.StructField("tradeDate", T.StringType(), False)
])
def to_xml_message_trade_date(xml_string):
    root = ET.fromstring(xml_string)
    messageId = root.find(".//messageId", namespaces).text
    tradeDate = root.find(".//tradeDate", namespaces).text
    return [messageId, tradeDate]
extract_udf = udf(to_xml_message_trade_date, trade_header)
# **********************************************
connectionString = "Endpoint=sb://xxxxxx.servicebus.windows.net/;SharedAccessKeyName=xxxx;SharedAccessKey=xxxxxxx;EntityPath=xxxxx"
ehConf = {
    'eventhubs.connectionString' : sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connectionString)
}
stream_data = spark \
    .readStream \
    .format("eventhubs") \
    .options(**ehConf) \
    .option('multiLine', True) \
    .option('mode', 'PERMISSIVE') \
    .load()
df_str = stream_data.withColumn("data", stream_data["body"].cast("string"))
def write2csv(df, epoch_id):
    df.persist()
    df_tuples = df.select(extract_udf("data").alias("extracted_data"))
    df_parsed = df_tuples.select("extracted_data.*")
    df_parsed \
        .write \
        .format("csv") \
        .mode(SaveMode.Append) \
        .option("header", True) \
        .save("dbfs:/FileStore/Incoming/trade_date/")
    df.unpersist()
query = df_str \
    .writeStream \
    .outputMode("append") \
    .foreachBatch(write2csv) \
    .trigger(processingTime="1 seconds") \
    .start()
query.awaitTermination()
Error returned:
StreamingQueryException: An exception was raised by the Python Proxy. Return Message: Traceback (most recent call last):
StreamingQueryException Traceback (most recent call last)
<command-1879221600357983> in <module>
6 .start()
7
----> 8 query.awaitTermination()
9
10 # .format("csv") \
/databricks/spark/python/pyspark/sql/streaming.py in awaitTermination(self, timeout)
101 return self._jsq.awaitTermination(int(timeout * 1000))
102 else:
--> 103 return self._jsq.awaitTermination()
104
105 #property
/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
1303 answer = self.gateway_client.send_command(command)
1304 return_value = get_return_value(
-> 1305 answer, self.gateway_client, self.target_id, self.name)
I can normally stream and save the tag values in a single file using the below code snippet, but the issue occurs when I use foreachBatch to save the tag values in separate files.
df_trade_date \
.writeStream \
.format("csv") \
.trigger(processingTime="30 seconds") \
.option("checkpointLocation", "dbfs:/FileStore/checkpoint/") \
.option("path", "dbfs:/FileStore/Incoming/trade_date/") \
.option("header", True) \
.outputMode("append") \
.start() \
.awaitTermination()
What am I missing here? Are there any suggestions?
Changing the write2csv function as shown below fixed the issue:
def write2csv(df, epoch_id):
    df.persist()
    df_tuples = df.select(extract_udf("data").alias("extracted_data"))
    df_parsed = df_tuples.select("extracted_data.*")
    df_parsed \
        .write \
        .format("csv") \
        .mode("append") \
        .option("header", True) \
        .save("dbfs:/FileStore/Incoming/trade_date/")
    df.unpersist()
Note .mode("append") \ line where I replaced Savemode.Append with

Error while connecting big query in GCP using Spark

I was trying to connect to Google BigQuery from PySpark using the code below:
from pyspark.sql import SparkSession
from pyspark import SparkConf, SparkContext
conf = SparkConf().setAppName("GCP")
sc = SparkContext(conf=conf)
master = "yarn"
spark = SparkSession.builder \
.master("local")\
.appName("GCP") \
.getOrCreate()
spark._jsc.hadoopConfiguration().set("google.cloud.auth.service.account.json.keyfile","key.json")
df = spark.read.format('bigquery') \
.option("parentProject", "project_name") \
.option('table', 'project_name.table_name') \
.load()
df.show()
My Spark version is 2.3 and the BigQuery jar is spark-bigquery-latest_2.12.
Although my service account has the "BigQuery Job User" role at the project level, and the BigQuery Data Viewer and BigQuery User roles at the dataset level, I am still getting the below error when trying to execute the above code:
Traceback (most recent call last):
File "/home/lo815/GCP/gcp.py", line 23, in <module>
df.show()
File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 350, in show
File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o93.showString.
: com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.PermissionDeniedException: com.google.cloud.spark.bigquery.repackaged.io.grpc.StatusRuntimeException: PERMISSION_DENIED: request failed: the user does not have 'bigquery.readsessions.create' permission for 'projects/GCP'
at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:53)

IllegalArgumentException: A project ID is required for this service but could not be determined from the builder or the environment

I'm trying to connect a BigQuery dataset to Databricks and run a script using PySpark.
Procedures I've done:
I uploaded the BigQuery JSON API key file to DBFS in Databricks for connection access.
Then I added spark-bigquery-latest.jar to the cluster libraries and ran my script.
When I ran the script below, I didn't face any error:
from pyspark.sql import SparkSession
spark = (
SparkSession.builder
.appName('bq')
.master('local[4]')
.config('parentProject', 'google-project-ID')
.config('spark.jars', 'dbfs:/FileStore/jars/jarlocation.jar') \
.getOrCreate()
)
df = spark.read.format("bigquery").option("credentialsFile", "/dbfs/FileStore/tables/bigqueryapi.json") \
.option("parentProject", "google-project-ID") \
.option("project", "Dataset-Name") \
.option("table","dataset.schema.tablename") \
.load()
df.show()
But instead of calling a single table in that schema, I tried to call all the tables under it using a query, like this:
from pyspark.sql import SparkSession
from google.cloud import bigquery
spark = (
    SparkSession.builder
        .appName('bq')
        .master('local[4]')
        .config('parentProject', 'google-project-ID')
        .config('spark.jars', 'dbfs:/FileStore/jars/jarlocation.jar')
        .getOrCreate()
)
client = bigquery.Client()
table_list = 'dataset.schema'
tables = client.list_tables(table_list)
tlist = []
for table in tables:
    tlist.append(table)
for i in tlist:
    sql_query = "select * from `dataset.schema." + str(i) + "`"
    df = spark.read.format("bigquery").option("credentialsFile", "/dbfs/FileStore/tables/bigqueryapi.json") \
        .option("parentProject", "google-project-ID") \
        .option("project", "Dataset-Name") \
        .option("query", sql_query).load()
    df.show()
OR
This Script:
from pyspark.sql import SparkSession
spark = (
SparkSession.builder
.appName('bq')
.master('local[4]')
.config('parentProject', 'google-project-ID')
.config('spark.jars', 'dbfs:/FileStore/jars/jarlocation.jar') \
.getOrCreate()
)
sql_query = """select * from `dataset.schema.tablename`"""
df = spark.read.format("bigquery").option("credentialsFile", "/dbfs/FileStore/tables/bigqueryapi.json") \
.option("parentProject", "google-project-ID") \
.option("project", "Dataset-Name") \
.option("query", sql_query).load()
df.show()
I get this unusual Error:
IllegalArgumentException: A project ID is required for this service but could not be determined from the builder or the environment. Please set a project ID using the builder.
---------------------------------------------------------------------------
IllegalArgumentException Traceback (most recent call last)
<command-131090852> in <module>
35 .option("parentProject", "google-project-ID") \
36 .option("project", "Dataset-Name") \
---> 37 .option("query", sql_query).load()
38 #df.show()
39
/databricks/spark/python/pyspark/sql/readwriter.py in load(self, path, format, schema, **options)
182 return self._df(self._jreader.load(self._spark._sc._jvm.PythonUtils.toSeq(path)))
183 else:
--> 184 return self._df(self._jreader.load())
185
186 #since(1.4)
/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
1303 answer = self.gateway_client.send_command(command)
1304 return_value = get_return_value(
-> 1305 answer, self.gateway_client, self.target_id, self.name)
1306
1307 for temp_arg in temp_args:
/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
131 # Hide where the exception came from that shows a non-Pythonic
132 # JVM exception message.
--> 133 raise_from(converted)
134 else:
135 raise
/databricks/spark/python/pyspark/sql/utils.py in raise_from(e)
IllegalArgumentException: A project ID is required for this service but could not be determined from the builder or the environment. Please set a project ID using the builder.
It does recognize my project ID when I call it as a table, but when I run it as a query I get this error.
I tried to figure it out and went through many sites for an answer but couldn't get a clear answer for it.
Help is much appreciated... Thanks in Advance...
Can you avoid using queries and just use the table option?
from pyspark.sql import SparkSession
from google.cloud import bigquery
spark = (
    SparkSession.builder
        .appName('bq')
        .master('local[4]')
        .config('parentProject', 'google-project-ID')
        .config('spark.jars', 'dbfs:/FileStore/jars/jarlocation.jar')
        .getOrCreate()
)
client = bigquery.Client()
table_list = 'dataset.schema'
tables = client.list_tables(table_list)
tlist = []
for table in tables:
    tlist.append(table)
for i in tlist:
    df = spark.read.format("bigquery").option("credentialsFile", "/dbfs/FileStore/tables/bigqueryapi.json") \
        .option("parentProject", "google-project-ID") \
        .option("project", "Dataset-Name") \
        .option("table", "dataset.schema." + str(i)) \
        .load()
    df.show()
In my case I had the same exception, but it was because I wasn't specifying the config value parentProject, which is the ID of the BigQuery project I'm connecting to.
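For reference, a minimal sketch of what that looks like (the project ID, credentials path and table name below are placeholders, not values from the question):
from pyspark.sql import SparkSession
spark = (
    SparkSession.builder
        .appName("bq")
        .config("parentProject", "my-gcp-project-id")   # BigQuery project that is billed for the read
        .getOrCreate()
)
df = (
    spark.read.format("bigquery")
        .option("credentialsFile", "/dbfs/FileStore/tables/bigqueryapi.json")
        .option("parentProject", "my-gcp-project-id")
        .option("table", "my-gcp-project-id.dataset.tablename")
        .load()
)
df.show()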

how to write same stream with different dataframes in console format?

I am new to Spark Structured Streaming and am facing an issue with a simple scenario:
I am trying to write one stream with two different dataframes.
from pyspark.sql import functions as f
df = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "localhost:9092") \
.option("subscribe", "topic1") \
.option("failOnDataLoss", False)\
.option("startingOffsets", "earliest") \
.load()
data1 = df.filter(f.col('status') == 'true')
data2 = df.filter(f.col('status') == 'false')
data2 = data2.select(df.id,f.struct(df.col1, df.col2, df.col3).alias('value'))
data2 = data2.groupBy("id").agg(f.collect_set('value').alias('history'))
data1 = data1.writeStream.format("console").option("truncate", "False").trigger(processingTime='15 seconds').start()
data2 = data2.writeStream.format("console").option("truncate", "False").trigger(processingTime='15 seconds').start()
spark.streams.awaitAnyTermination()
I am getting below error for the same:
Traceback (most recent call last):
File "/home/adarshbajpai/Downloads/spark-2.4.4-bin-hadoop2.7/python/lib/pyspark.zip/pyspark /sql/utils.py", line 63, in deco
File "/home/adarshbajpai/Downloads/spark-2.4.4-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o186.start.
: org.apache.spark.sql.AnalysisException: Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark;;
Aggregate [customerid#93L], [customerid#93L, collect_set(hist_value#278, 0, 0) AS histleadstatus#284]
+- Project [customerid#93L, named_struct(islaststatus, islaststatus#46, statusid, statusid#43, status, statusname#187, createdOn, statusCreatedDate#59, updatedOn, statusUpdatedDate#60) AS hist_value#278]
+- Filter (islaststatus#46 = 0)
I think I should not need a watermark, as my stream has no delay or latency.
please suggest ! Thanks in advance.
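As the exception states, append output mode only supports a streaming aggregation when its state can be bounded by a watermark. A sketch of two common workarounds, assuming the Kafka-provided timestamp column can serve as event time (other column names reused from the question, not verified here):
from pyspark.sql import functions as f
# Option 1: add a watermark and window the aggregation on event time so append mode can emit finalized groups.
data2_windowed = (
    df.filter(f.col("status") == "false")
      .withWatermark("timestamp", "10 minutes")
      .groupBy("id", f.window("timestamp", "15 seconds"))
      .agg(f.collect_set(f.struct("col1", "col2", "col3")).alias("history"))
)
# Option 2: keep the aggregation as-is and write it in update mode instead of the default append mode.
# data2.writeStream.format("console").outputMode("update").option("truncate", False).start()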

How to connect to a secured Kafka cluster from Zeppelin ("Failed to construct kafka consumer")?

I am trying to read some data from a Kafka broker using structured streaming to display it in a Zeppelin note. I am using Spark 2.4.3, Scala 2.11, Python 2.7, Java 9 and Kafka 2.2 with SSL enabled hosted on Heroku, but get the StreamingQueryException: 'Failed to construct kafka consumer'.
I am using the following dependencies (set in the Spark interpreter settings):
org.apache.spark:spark-streaming-kafka-0-10_2.11:2.4.3
org.apache.spark:spark-streaming_2.11:2.4.3
org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.3
I have tried older and newer versions, but these should match the Spark/Scala versions I am using.
I have successfully written and read from Kafka using simple Python producer and consumer.
The code I am using:
%pyspark
from pyspark.sql.functions import from_json
from pyspark.sql.types import *
from pyspark.sql.functions import col, expr, when
schema = StructType().add("power", IntegerType()).add("colorR", IntegerType()).add("colorG",IntegerType()).add("colorB",IntegerType()).add("colorW",IntegerType())
df = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", brokers) \
.option("kafka.security.protocol", "SSL") \
.option("kafka.ssl.truststore.location", "/home/ubuntu/kafka/truststore.jks") \
.option("kafka.ssl.keystore.location", "/home/ubuntu/kafka/keystore.jks") \
.option("kafka.ssl.keystore.password", password) \
.option("kafka.ssl.truststore.password", password) \
.option("kafka.ssl.endpoint.identification.algorithm", "") \
.option("startingOffsets", "earliest") \
.option("subscribe", topic) \
.load()
schema = ArrayType(
StructType([StructField("power", IntegerType()),
StructField("colorR", IntegerType()),
StructField("colorG", IntegerType()),
StructField("colorB", IntegerType()),
StructField("colorW", IntegerType())]))
readDF = df.select( \
col("key").cast("string"),
from_json(col("value").cast("string"), schema))
query = readDF.writeStream.format("console").start()
query.awaitTermination()
And the error I get:
Fail to execute line 43: query.awaitTermination()
Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-2171412221151055324.py", line 380, in <module>
exec(code, _zcUserQueryNameSpace)
File "<stdin>", line 43, in <module>
File "/home/ubuntu/spark/python/lib/pyspark.zip/pyspark/sql/streaming.py", line 103, in awaitTermination
return self._jsq.awaitTermination()
File "/home/ubuntu/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/home/ubuntu/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 75, in deco
raise StreamingQueryException(s.split(': ', 1)[1], stackTrace)
StreamingQueryException: u'Failed to construct kafka consumer\n=== Streaming Query ===\nIdentifier: [id = 2ee20c47-8293-469a-bc0b-ef71a1f118bc, runId = 72422290-090a-4b6d-bd66-088a5a534240]\nCurrent Committed Offsets: {}\nCurrent Available Offsets: {}\n\nCurrent State: ACTIVE\nThread State: RUNNABLE\n\nLogical Plan:\nProject [cast(key#7 as string) AS key#22, jsontostructs(ArrayType(StructType(StructField(power,IntegerType,true), StructField(colorR,IntegerType,true), StructField(colorG,IntegerType,true), StructField(colorB,IntegerType,true), StructField(colorW,IntegerType,true)),true), cast(value#8 as string), Some(Etc/UTC)) AS jsontostructs(CAST(value AS STRING))#21]\n+- StreamingExecutionRelation KafkaV2[Subscribe[tanana-44614.lightbulb]], [key#7, value#8, topic#9, partition#10, offset#11L, timestamp#12, timestampType#13]\n'
When I use read and write instead of readStream and writeStream I do not get any errors, but nothing appears on the console when I send some data to Kafka.
What else should I try?
It looks like the Kafka Consumer cannot access ~/kafka/truststore.jks and hence the exception. Replace ~ with the fully-specified path (without the tilde) and the issue should go away.
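A quick way to confirm whether the process running the query can actually read the keystore files before starting the stream (a small sketch; paths taken from the question):
import os
# Check that the JKS files exist and are readable from the driver running the streaming query.
for path in ("/home/ubuntu/kafka/truststore.jks", "/home/ubuntu/kafka/keystore.jks"):
    print(path, "exists:", os.path.exists(path), "readable:", os.access(path, os.R_OK))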
