Apache Spark 2.0 MQTT - Error "No module named mqtt"

I get this error when I try to run my application on Spark 2.0. I tried downloading the package from https://github.com/spark-packages/dstream-mqtt but the repository no longer exists. I also searched for the package at https://spark-packages.org/ but couldn't find it. My program is very simple:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from dstream_mqtt import MQTTUtils
#from pyspark.streaming.mqtt import MQTTUtils
sc = SparkContext()
ssc = StreamingContext(sc, 6)
mqttStream = MQTTUtils.createStream(ssc,"tcp://192.168.4.54:1883","/test")
mqttStream.pprint()
mqttStream.saveAsTextFiles("test/status", "txt")
ssc.start()
ssc.awaitTermination()
ssc.stop()
I have downloaded and tried including the JAR files spark-streaming-mqtt-assembly_2.11-1.6.2.jar and spark-streaming-mqtt_2.11-1.6.2.jar, but that does not help.
The same code and packages work fine with Spark 1.6.
Any help will be appreciated.
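For context, this is roughly how those jars would be attached when launching from Python (a sketch only; the jar paths are placeholders for wherever the files were downloaded, and this is not claimed to resolve the Spark 2.0 error):
import os

# Attach the MQTT streaming jars before the JVM starts.
# NOTE: sketch only -- the jar locations below are placeholders.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--jars /path/to/spark-streaming-mqtt-assembly_2.11-1.6.2.jar,"
    "/path/to/spark-streaming-mqtt_2.11-1.6.2.jar pyspark-shell"
)

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext()
ssc = StreamingContext(sc, 6)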

Related

java.lang.ClassNotFoundException: Failed to find data source: hudi. Please find packages at http://spark.apache.org/third-party-projects.html

I am trying to read data from Hudi but getting the below error:
Caused by: java.lang.ClassNotFoundException: Failed to find data source: hudi. Please find packages at http://spark.apache.org/third-party-projects.html
I am able to read the data from Hudi in my Jupyter notebook using the commands below:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .config("spark.sql.catalogImplementation", "hive")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .enableHiveSupport()
  .getOrCreate()

import org.apache.hudi.DataSourceReadOptions

val hudiIncQueryDF = spark.read.format("hudi").load("path")

import org.apache.spark.sql.functions._
hudiIncQueryDF.filter(col("column_name") === lit("2022-06-01")).show(10, false)
This Jupyter notebook was opened on a cluster that was created with the below property, among others:
--properties spark:spark.jars="gs://rdl-stage-lib/hudi-spark3-bundle_2.12-0.10.0.jar" \
However, when I try to run the job using spark-submit on the same cluster, I get the error above.
I have also added spark.serializer=org.apache.spark.serializer.KryoSerializer in my job properties. Not sure what the issue is.
Your application depends on the Hudi jar, and Hudi itself has its own dependencies. When you add the Maven package to your session, Spark installs the Hudi jar and its dependencies; in your case, however, you provide only the Hudi jar file from a GCS bucket.
You can try this property instead:
--properties spark:spark.jars.packages="org.apache.hudi:hudi-spark3.3-bundle_2.12:0.12.0" \
Or directly from your notebook:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .config("spark.sql.catalogImplementation", "hive")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.hudi.catalog.HoodieCatalog")
  .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
  .config("spark.jars.packages", "org.apache.hudi:hudi-spark3.3-bundle_2.12:0.12.0")
  .enableHiveSupport()
  .getOrCreate()
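If the job is launched from PySpark rather than a Scala notebook, the equivalent session configuration looks roughly like this (a sketch; the package coordinate is the one suggested above and should be matched to your Spark and Scala versions):
from pyspark.sql import SparkSession

# Sketch: let Spark resolve the Hudi bundle and its transitive dependencies from Maven.
# The coordinate below is the one suggested above; adjust it to your Spark/Scala versions.
spark = (
    SparkSession.builder
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
    .config("spark.jars.packages", "org.apache.hudi:hudi-spark3.3-bundle_2.12:0.12.0")
    .enableHiveSupport()
    .getOrCreate()
)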

azure pyspark register udf from jar Failed UDFRegistration

I'm having trouble registering some UDFs that live in a Java jar. I've tried a couple of approaches, but they all return:
Failed to execute user defined function(UDFRegistration$$Lambda$6068/1550981127: (double, double) => double)
First I tried this approach:
from pyspark.context import SparkContext, SparkConf
from pyspark.sql import SparkSession
from pyspark.sql.types import *

conf = SparkConf()
conf.set('spark.driver.extraClassPath', 'dbfs:/FileStore/jars/4b129434_12cd_4f2a_ab27_baaefe904857-scala_udf_similarity_0_0_7-35e3b.jar')
conf.set('spark.jars', 'dbfs:/FileStore/jars/4b129434_12cd_4f2a_ab27_baaefe904857-scala_udf_similarity_0_0_7-35e3b.jar')
sc = SparkContext.getOrCreate(conf=conf)
spark = SparkSession(sc)
#spark.sparkContext.addPyFile("dbfs:/FileStore/jars/4b129434_12cd_4f2a_ab27_baaefe904857-scala_udf_similarity_0_0_7-35e3b.jar")

udfs = [
    ('jaro_winkler_sim', 'JaroWinklerSimilarity', DoubleType()),
    ('jaccard_sim', 'JaccardSimilarity', DoubleType()),
    ('cosine_distance', 'CosineDistance', DoubleType()),
    ('Dmetaphone', 'DoubleMetaphone', StringType()),
    ('QgramTokeniser', 'QgramTokeniser', StringType())
]

for a, b, c in udfs:
    spark.udf.registerJavaFunction(a, 'uk.gov.moj.dash.linkage.' + b, c)

linker = Splink(settings, spark, df_l=df_l, df_r=df_r)
df_e = linker.get_scored_comparisons()
Next, I tried moving the jars and extraClassPath settings to the cluster config:
spark.jars dbfs:/FileStore/jars/4b129434_12cd_4f2a_ab27_baaefe904857-scala_udf_similarity_0_0_7-35e3b.jar
spark.driver.extraClassPath dbfs:/FileStore/jars/4b129434_12cd_4f2a_ab27_baaefe904857-scala_udf_similarity_0_0_7-35e3b.jar
Then I registered them in my script as follows:
from pyspark.context import SparkContext, SparkConf
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import *

# java path to class: uk.gov.moj.dash.linkage.scala-udf-similarity.CosineDistance
udfs = [
    ('jaro_winkler_sim', 'JaroWinklerSimilarity', DoubleType()),
    ('jaccard_sim', 'JaccardSimilarity', DoubleType()),
    ('cosine_distance', 'CosineDistance', DoubleType()),
    ('Dmetaphone', 'DoubleMetaphone', StringType()),
    ('QgramTokeniser', 'QgramTokeniser', StringType())
]

for a, b, c in udfs:
    spark.udf.registerJavaFunction(a, 'uk.gov.moj.dash.linkage.' + b, c)

linker = Splink(settings, spark, df_l=df_l, df_r=df_r)
df_e = linker.get_scored_comparisons()
Thanks
Looking at the source code of the UDFs, I see that the library is compiled with Scala 2.11 and uses Spark 2.2.0 as a base. The most probable reason for the error is that you're using this jar with DBR 7.x, which is compiled with Scala 2.12 and based on Spark 3.x, both of which are binary incompatible with your jar. You have the following choices:
Recompile the library with Scala 2.12 and Spark 3.0
Use DBR 6.4 that uses Scala 2.11 and Spark 2.4
P.S. Overriding the classpath on Databricks can be tricky, so it's better to use one of these approaches:
Install your jar as a library on the cluster - this can be done via the UI, the REST API, or other automation such as Terraform (see the sketch after this list)
Use an init script to copy your jar into the default jars location. In the simplest case it could look like the following:
#!/bin/bash
cp /dbfs/FileStore/jars/4b129434_12cd_4f2a_ab27_baaefe904857-scala_udf_similarity_0_0_7-35e3b.jar /databricks/jars/
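For the library-install option above, a minimal sketch of the REST call (Databricks Libraries API) is shown below; the host, token, and cluster id are placeholders you would need to fill in:
import requests

# Sketch: install the UDF jar as a cluster library via the Databricks Libraries API.
# Host, token and cluster id below are placeholders, not values from the question.
host = "https://<databricks-instance>"
token = "<personal-access-token>"

resp = requests.post(
    f"{host}/api/2.0/libraries/install",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "cluster_id": "<cluster-id>",
        "libraries": [
            {"jar": "dbfs:/FileStore/jars/4b129434_12cd_4f2a_ab27_baaefe904857-scala_udf_similarity_0_0_7-35e3b.jar"}
        ],
    },
)
resp.raise_for_status()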

Pyspark Failed to find data source: kafka

I am working on Kafka streaming and trying to integrate it with Apache Spark. However, while running it I am getting the error below.
This is the command I am using.
df_TR = Spark.readStream.format("kafka").option("kafka.bootstrap.servers", "localhost:9092").option("subscribe", "taxirides").load()
ERROR:
Py4JJavaError: An error occurred while calling o77.load.: java.lang.ClassNotFoundException: Failed to find data source: kafka. Please find packages at http://spark.apache.org/third-party-projects.html
How can I resolve this?
NOTE: I am running this in Jupyter Notebook
import findspark
findspark.init('/home/karan/spark-2.1.0-bin-hadoop2.7')
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
Spark = SparkSession.builder.appName('KafkaStreaming').getOrCreate()
Everything runs fine up to here (the code above).
df_TR = Spark.readStream.format("kafka").option("kafka.bootstrap.servers", "localhost:9092").option("subscribe", "taxirides").load()
This is where things go wrong (the line above).
The blog which I am following: https://www.adaltas.com/en/2019/04/18/spark-streaming-data-pipelines-with-structured-streaming/
Edit
Using spark.jars.packages works better than PYSPARK_SUBMIT_ARGS
Ref - PySpark - NoClassDefFoundError: kafka/common/TopicAndPartition
It's not clear how you ran the code. Keep reading the blog and you'll see:
spark-submit \
...
--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0 \
sstreaming-spark-out.py
It seems you missed adding the --packages flag.
In Jupyter, you could add this:
import os
# setup the submit arguments; 'pyspark-shell' must stay at the end so spark-submit
# has a primary resource to launch
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0 pyspark-shell'
# initialize spark
import findspark
findspark.init()
import pyspark
Note: _2.11:2.4.0 needs to align with your Scala and Spark versions... based on the question, yours should be Spark 2.1.0.
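For the spark.jars.packages approach mentioned in the edit above, a minimal sketch (run before any other Spark session has been started) could look like this; the coordinate must match your Scala and Spark versions, e.g. 2.11 / 2.1.0 per the question:
from pyspark.sql import SparkSession

# Sketch: let Spark resolve the Kafka source from Maven when the session starts.
# Only works if no JVM/session is already running; adjust the coordinate to your versions.
Spark = (
    SparkSession.builder
    .appName('KafkaStreaming')
    .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0")
    .getOrCreate()
)

df_TR = (
    Spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "taxirides")
    .load()
)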

Spark : Error Not found value SC

I have just started with Spark. I have CDH5 installed with Spark. However, when I try to use the SparkContext it gives the error below:
<console>:17: error: not found: value sc
val distdata = sc.parallelize(data)
I have researched this and found error: not found: value sc,
and tried to start the Spark context with ./spark-shell. It gives the error No such File or Directory.
You can start spark-shell with ./spark-shell if you're in its exact directory, or with path/to/spark-shell if you're elsewhere.
Also, if you're running a script with spark-submit, you need to initialize sc as SparkContext first:
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
val conf = new SparkConf().setAppName("Simple Application")
val sc = new SparkContext(conf)
There is another Stack Overflow post that answers this question by getting sc (the SparkContext) from the SparkSession. I do it this way:
val spark = SparkSession.builder().appName("app_name").enableHiveSupport().getOrCreate()
val sc = spark.sparkContext
original answer here:
Retrieve SparkContext from SparkSession
Add the Spark directory to your PATH; then you can use spark-shell from anywhere.
Add import org.apache.spark.SparkContext if you are using it in a spark-submit job, to create a SparkContext using:
val sc = new SparkContext(conf)
where conf is already defined.
Starting a new terminal fixed the problem in my case.
You need to run the Hadoop daemons first (run the command "start-all.sh"), then try again.
You can run this command in the Spark (Scala) prompt:
conf.set("spark.driver.allowMultipleContexts","true")

Spark SQLContext does not work on CDH5.3

I am running Spark 1.2 on CDH 5.3 and trying some simple code in spark-shell.
I am failing on:
val sqlContext = new SQLContext(sc)
with the error:
not found : type SQLContext
What is wrong with my environment?
Make sure you import it:
import org.apache.spark.sql.SQLContext
