I'm trying to send a DStream with my computed results to an MQTT broker, but foreachRDD keeps crashing.
I'm running Spark 2.4.3 with Apache Bahir (built from git master) for the MQTT subscribe connector. Everything works up to this point. Before trying to publish my results over MQTT, I tried saveAsTextFiles(), and that worked (but isn't exactly what I want).
def sendPartition(part):
    # code for publishing with MQTT here
    return 0

packets = MQTTUtils.createStream(ssc, brokerUrl, topic)
mydstream = packets.map(change_format) \
    .map(lambda mac: (mac, 1)) \
    .reduceByKey(lambda a, b: a + b)
mydstream.foreachRDD(lambda rdd: rdd.foreachPartition(sendPartition))  # line 56
The resulting error I get is this:
org.apache.spark.SparkException: An exception was raised by Python:
Traceback (most recent call last):
File "/SPARK_HOME/python/lib/pyspark.zip/pyspark/streaming/util.py", line 68, in call
r = self.func(t, *rdds)
File "/SPARK_HOME/python/lib/pyspark.zip/pyspark/streaming/dstream.py", line 161, in <lambda>
func = lambda t, rdd: old_func(rdd)
File "/path/to/my/code.py", line 56, in <lambda>
mydstream.foreachRDD(lambda rdd: rdd.foreachPartition(sendPartition))
File "/SPARK_HOME/python/lib/pyspark.zip/pyspark/rdd.py", line 806, in foreachPartition
self.mapPartitions(func).count() # Force evaluation
File "/SPARK_HOME/python/lib/pyspark.zip/pyspark/rdd.py", line 1055, in count
return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
File "/SPARK_HOME/python/lib/pyspark.zip/pyspark/rdd.py", line 1046, in sum
return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add)
File "/SPARK_HOME/python/lib/pyspark.zip/pyspark/rdd.py", line 917, in fold
vals = self.mapPartitions(func).collect()
File "/SPARK_HOME/python/lib/pyspark.zip/pyspark/rdd.py", line 816, in collect
sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
File "/SPARK_HOME/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/SPARK_HOME/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: java.lang.IllegalArgumentException: Unsupported class file major version 55
with lots of Java errors following, but I suspect the error is in my own code.
Are you able to run other Spark commands? At the end of your stack trace you see java.lang.IllegalArgumentException: Unsupported class file major version 55. Class file major version 55 corresponds to Java 11, so this indicates that you are running Spark on an unsupported Java version.
Spark 2.4 is not compatible with Java 11 (due to limitations imposed by Scala, I think). Try configuring Spark to use Java 8. The specifics vary a bit depending on your platform, but you'll probably need to install Java 8 and change the JAVA_HOME environment variable to point to that installation.
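As a rough sketch (the Java 8 install path below is an assumption; point it at wherever Java 8 lives on your machine), you can set JAVA_HOME from Python before the first SparkContext is created, since PySpark launches the JVM gateway from that variable:

import os

# Must run before any SparkContext/SparkSession exists, because the py4j
# gateway launches the driver JVM from JAVA_HOME. The path is a placeholder.
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"

from pyspark import SparkContext

sc = SparkContext.getOrCreate()
# Should report something like 1.8.0_xxx once Java 8 is picked up
print(sc._jvm.java.lang.System.getProperty("java.version"))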
Related
I'm trying to write a DataFrame to a file, in any format. I have reinstalled Spark several times, in different ways and with different versions, but I receive the same error every time, even on another machine. I'm currently using Spark 3.3.1 on Hadoop 2.7 locally on Windows 11:
data = [[1, 43, 41], [2, 43, 41], [3, 43, 4]]
x = spark.createDataFrame(data)
x.write.csv('qqq')
And I receive this error:
File "D:\venvs\spark2\spark_hw.py", line 77, in <module>
x.write.csv('qqq')
File "D:\venvs\spark2\lib\site-packages\pyspark\sql\readwriter.py", line 1240, in csv
self._jwrite.csv(path)
File "D:\venvs\spark2\lib\site-packages\py4j\java_gateway.py", line 1321, in __call__
return_value = get_return_value(
File "D:\venvs\spark2\lib\site-packages\pyspark\sql\utils.py", line 190, in deco
return f(*a, **kw)
File "D:\venvs\spark2\lib\site-packages\py4j\protocol.py", line 326, in get_return_value
raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o44.csv.
: org.apache.spark.SparkException: Job aborted.
x.write.format("csv").save("path/where/file/should/go")
will write your dataframe to a csv to the path specified in the save method
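For example, a minimal sketch (the output path and the header/mode options here are placeholders, not taken from the question):

# Write the DataFrame as CSV with a header row, overwriting any existing output.
# The output path is a placeholder; on Windows an absolute path such as
# 'D:/output/qqq' works as well.
x.write.format("csv") \
    .mode("overwrite") \
    .option("header", True) \
    .save("output/qqq")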
In Azure Databricks, I tried to create a Kafka stream in a notebook and used it to create a Spark job. Databricks throws an error at the line KafkaUtils.createDirectStream(). The corresponding code is attached below.
from kazoo.client import KazooClient
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils, TopicAndPartition

sc = spark.sparkContext
ssc = StreamingContext(sc, 30)
print('SSC created:: {}'.format(ssc))

zk = KazooClient(hosts=kafka_host)
print(kafka_host)
zk.start()

_offset_directory = "/" + topic + "/" + "DA_DAINT" + "/partitions"
print(_offset_directory)

if zk.exists(_offset_directory):
    partitions = zk.get_children(_offset_directory)
    print(partitions)

    partition_offsets_dict = {}
    for partition in partitions:
        offset, stat = zk.get((_offset_directory + '/' + partition))
        partition_offsets_dict[partition] = offset.decode()
    print(partition_offsets_dict)

    from_offset = {}
    for _partition in partitions:
        offset = partition_offsets_dict[_partition]
        topic_partition = TopicAndPartition(topic, int(_partition))
        from_offset[topic_partition] = int(offset)
    print(from_offset)

    print("\nCreate kafka direct stream ...")
    kafka_stream = KafkaUtils.createDirectStream(ssc, [topic],
                                                 {"metadata.broker.list": broker_list},
                                                 fromOffsets=from_offset)
Attaching the error stack traces.
Traceback (most recent call last):
File "/databricks/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1159, in send_command
raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/databricks/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 985, in send_command
response = connection.send_command(command)
File "/databricks/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1164, in send_command
"Error while receiving", e, proto.ERROR_ON_RECEIVE)
py4j.protocol.Py4JNetworkError: Error while receiving
An error occurred while calling o581.createTopicAndPartition
Traceback (most recent call last):
File "<command-3832551107104577>", line 77, in <module>
fromOffsets=from_offset)
File "/databricks/spark/python/pyspark/streaming/kafka.py", line 141, in createDirectStream
v) for (k, v) in fromOffsets.items()])
File "/databricks/spark/python/pyspark/streaming/kafka.py", line 141, in <listcomp>
v) for (k, v) in fromOffsets.items()])
File "/databricks/spark/python/pyspark/streaming/kafka.py", line 314, in _jTopicAndPartition
return helper.createTopicAndPartition(self._topic, self._partition)
File "/databricks/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/databricks/spark/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/databricks/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 336, in get_return_value
format(target_id, ".", name))
py4j.protocol.Py4JError: An error occurred while calling o581.createTopicAndPartition
In Azure Databricks, when using the Kafka stream in a Python notebook, I have installed the kafka-python and org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.1 libraries and added them as dependencies to the Spark job in Databricks.
Note 1:
Also, I am able to receive data from Kafka when I use a simple Kafka consumer in the Databricks notebook.
from kafka import KafkaConsumer

if __name__ == "__main__":
    consumer_ = KafkaConsumer(group_id='test', bootstrap_servers=['my_kafka_server:9092'])
    print(consumer_.topics())
    consumer_.subscribe(topics=['dev_test'])
    for m in consumer_:
        print(m)
The problem arises only if I try to create a Kafka direct stream using KafkaUtils.createDirectStream() in an Azure Databricks Python notebook.
Here is another minimal set of code that reproduces the issue:
from pyspark.streaming.kafka import KafkaUtils
from pyspark.streaming import StreamingContext
broker = "broker:9092"
topic = "dev_topic"
sc = spark.sparkContext
ssc = StreamingContext(sc, 30)
dks = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": broker})
print("Direct stream created...")
parsed = dks.map(lambda v: v[1])
summary_dstream = parsed.count().map(lambda x: 'Words in this batch: %s' % x)
print(summary_dstream)
Note 2:
Kafka version: 0.10
Scala version: 2.11
Spark version: 2.4.3
I am still unable to get to the root cause of the issue, but using the jar org.apache.spark:spark-streaming-kafka-0-8-assembly_2.11:2.4.3 fixed it.
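For a local PySpark run (outside Databricks, where libraries are attached through the cluster UI instead), a sketch of pinning the matching package could look like this; the coordinates simply mirror the Spark 2.4.3 / Scala 2.11 setup described above:

import os

# Must be set before the JVM gateway starts; PySpark passes these arguments
# to spark-submit when it launches the driver JVM.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages org.apache.spark:spark-streaming-kafka-0-8-assembly_2.11:2.4.3 "
    "pyspark-shell"
)

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext.getOrCreate()
ssc = StreamingContext(sc, 30)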
UPDATE 1:
Got the following update from the Microsoft support team:
Below is the update from Databricks engineering.
We see the customer is using the DStreams API (https://learn.microsoft.com/en-us/azure/databricks/spark/latest/rdd-streaming/), which is outdated and we don't support it anymore. Also, we strongly recommend they switch to Structured Streaming; you can follow this doc for doing it:
https://learn.microsoft.com/en-us/azure/databricks/spark/latest/structured-streaming/kafka
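For reference, here is a minimal Structured Streaming sketch for reading the same kind of topic (the broker address, topic name, and checkpoint path are placeholders, and the spark-sql-kafka-0-10 connector must be available on the cluster):

from pyspark.sql.functions import col

# Read from Kafka with Structured Streaming; offsets are tracked in the
# checkpoint location rather than in ZooKeeper.
stream_df = (spark.readStream
             .format("kafka")
             .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
             .option("subscribe", "dev_topic")                  # placeholder topic
             .option("startingOffsets", "earliest")
             .load())

# Kafka keys and values arrive as binary; cast them to strings before use.
parsed = stream_df.select(col("key").cast("string").alias("key"),
                          col("value").cast("string").alias("value"))

query = (parsed.writeStream
         .format("console")
         .option("checkpointLocation", "/tmp/checkpoints/dev_topic")  # placeholder path
         .start())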
I am using PySpark (Python 3.7 with Spark 2.4) and have a small line of code to collect a date from one of the attributes in a DataFrame. I can run the same code from the pyspark command line; however, in my production code it errors out.
Here is the line of code where I read a DataFrame "df" and collect the date from the field "job_id":
>>> run_dt = map( lambda r:r[0], df.filter((df['delivery_date'] == '2017-12-31')).select(max(substring(df['job_id'], 9, 10).cast("integer")).alias('last_run')).collect())[0]
>>> print(run_dt)
2017123101
The same line of code gives me an error in my production code during evaluation. The error message is:
File "C:\Users\spark-2.4.2-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\sql\dataframe.py", line 533, in collect
File "C:\Users\spark-2.4.2-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\java_gateway.py", line 1257, in __call__
File "C:\Users\spark-2.4.2-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\sql\utils.py", line 63, in deco
File "C:\Users\spark-2.4.2-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o68.collectToPython.
I am running Spark 2.0 and zeppelin-0.6.1-bin-all on a Linux server. The default Spark notebook runs just fine, but when I try to create and run a new notebook in PySpark using sqlContext, I get the error "py4j.Py4JException: Method createDataFrame([class java.util.ArrayList, class java.util.ArrayList, null]) does not exist".
I tried running some simple code:
%pyspark
wordsDF = sqlContext.createDataFrame([('cat',), ('elephant',), ('rat',), ('rat',), ('cat', )], ['word'])
wordsDF.show()
print type(wordsDF)
wordsDF.printSchema()
I get this error:
Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-7635635698598314374.py", line 266, in
raise Exception(traceback.format_exc())
Exception: Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-7635635698598314374.py", line 259, in
exec(code)
File "", line 1, in
File "/spark/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/context.py", line 299, in createDataFrame
return self.sparkSession.createDataFrame(data, schema, samplingRatio)
File "/spark/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/spark/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/spark/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 316, in get_return_value
format(target_id, ".", name, value))
Py4JError: An error occurred while calling o48.createDataFrame. Trace:
py4j.Py4JException: Method createDataFrame([class java.util.ArrayList, class java.util.ArrayList, null]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:272)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:211)
at java.lang.Thread.run(Thread.java:745)
When I try the same code with "sqlContext = SQLContext(sc)" it works just fine.
I have tried setting the interpreter configuration "zeppelin.spark.useHiveContext false", but it did not work.
I must obviously be missing something, since this is such a simple operation. Please advise if there is any other configuration to be set or what I am missing.
I tested the same piece of code with Zeppelin 0.6.0 and it is working fine.
SparkSession is the default entry point for Spark 2.0.0, and it is mapped to spark in Zeppelin 0.6.1 (just as it is in the Spark shell). Have you tried spark.createDataFrame(...)?
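For example, a minimal sketch of the same snippet using the spark entry point:

%pyspark
# Use the SparkSession that Zeppelin 0.6.1 exposes as `spark` instead of sqlContext
wordsDF = spark.createDataFrame([('cat',), ('elephant',), ('rat',), ('rat',), ('cat',)], ['word'])
wordsDF.show()
wordsDF.printSchema()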
Today I wanted to try some of the new features in Spark 2.0; here is my program:
#coding:utf-8
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local").appName('test 2.0').config(conf=SparkConf()).getOrCreate()
df = spark.read.json("/Users/lyj/Programs/Apache/Spark2/examples/src/main/resources/people.json")
df.show()
but it errors out as follows:
Traceback (most recent call last):
File "/Users/lyj/Programs/kiseliugit/MyPysparkCodes/test/spark2.0.py", line 5, in <module>
spark = SparkSession.builder.master("local").appName('test 2.0').config(conf=SparkConf()).getOrCreate()
File "/Users/lyj/Programs/Apache/Spark2/python/pyspark/conf.py", line 104, in __init__
SparkContext._ensure_initialized()
File "/Users/lyj/Programs/Apache/Spark2/python/pyspark/context.py", line 243, in _ensure_initialized
SparkContext._gateway = gateway or launch_gateway()
File "/Users/lyj/Programs/Apache/Spark2/python/pyspark/java_gateway.py", line 116, in launch_gateway
java_import(gateway.jvm, "org.apache.spark.SparkConf")
File "/Library/Python/2.7/site-packages/py4j/java_gateway.py", line 90, in java_import
return_value = get_return_value(answer, gateway_client, None, None)
File "/Library/Python/2.7/site-packages/py4j/protocol.py", line 306, in get_return_value
value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
KeyError: u'y'
What's wrong with these few lines of code? Is there a problem with my Java environment? Also, I am using the PyCharm IDE for development.
Try upgrading py4j: pip install py4j --upgrade
It worked for me.