How to do streaming Kafka->Zeppelin->Spark with current versions - apache-spark

I have a Kafka 2.3 message broker and want to do some processing on the message data within Spark. To begin with, I want to use the Spark 2.4.0 that is integrated in Zeppelin 0.8.1 and use the Zeppelin notebooks for rapid prototyping.
For this streaming task I need "spark-streaming-kafka-0-10" for Spark > 2.3 according to https://spark.apache.org/docs/latest/streaming-kafka-integration.html, which only supports Java and Scala (and not Python). But there are no default Java or Scala interpreters in Zeppelin.
If I try this code (taken from https://www.rittmanmead.com/blog/2017/01/getting-started-with-spark-streaming-with-python-and-kafka/)
%spark.pyspark
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
import json
sc.setLogLevel("WARN")
ssc = StreamingContext(sc, 60)
kafkaStream = KafkaUtils.createStream(ssc, 'localhost:9092', 'spark-streaming', {'test':1})
I get the following error:
Spark Streaming's Kafka libraries not found in class path. Try one of the following.

1. Include the Kafka library and its dependencies with in the spark-submit command as

   $ bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8:2.4.0 ...

2. Download the JAR of the artifact from Maven Central http://search.maven.org/, Group Id = org.apache.spark, Artifact Id = spark-streaming-kafka-0-8-assembly, Version = 2.4.0. Then, include the jar in the spark-submit command as

   $ bin/spark-submit --jars <spark-streaming-kafka-0-8-assembly.jar> ...
Fail to execute line 1: kafkaStream = KafkaUtils.createStream(ssc, 'localhost:9092', 'spark-streaming', {'test':1})
Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark-8982542851842620568.py", line 380, in
    exec(code, _zcUserQueryNameSpace)
  File "", line 1, in
  File "/usr/local/analyse/spark/python/lib/pyspark.zip/pyspark/streaming/kafka.py", line 78, in createStream
    helper = KafkaUtils._get_helper(ssc._sc)
  File "/usr/local/analyse/spark/python/lib/pyspark.zip/pyspark/streaming/kafka.py", line 217, in _get_helper
    return sc._jvm.org.apache.spark.streaming.kafka.KafkaUtilsPythonHelper()
TypeError: 'JavaPackage' object is not callable
So I wonder how to tackle the task (one possible alternative route is sketched after the list):
Should I really use spark-streaming-kafka-0-8, despite it having been deprecated for some months? But spark-streaming-kafka-0-10 seems to be in the default Zeppelin jar directory.
Configure/create an interpreter in Zeppelin for Java/Scala, since spark-streaming-kafka-0-10 only supports these languages?
Ignore Zeppelin and do it on the console using "spark-submit"?
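One possible route (a minimal sketch, not a verified recipe): the Structured Streaming Kafka source (spark-sql-kafka-0-10) does support Python, so it could be pulled into the Zeppelin Spark interpreter via the spark.jars.packages property and used from %spark.pyspark. The broker address and topic name are taken from the code above; the package coordinate assumes Spark 2.4.0 with Scala 2.11.
%spark.pyspark
# Assumes spark.jars.packages=org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0 was added
# to the Spark interpreter settings and the interpreter was restarted.
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "test")
      .load())

# Kafka delivers keys and values as binary; cast them to strings for inspection.
messages = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

# Write each micro-batch to the console (i.e. the interpreter log) just to see data flowing.
query = (messages.writeStream
         .format("console")
         .outputMode("append")
         .start())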

Related

azure pyspark register udf from jar Failed UDFRegistration

I'm having trouble registering some UDFs that are in a Java jar. I've tried a couple of approaches, but they all return:
Failed to execute user defined function(UDFRegistration$$Lambda$6068/1550981127: (double, double) => double)
First I tried this approach:
from pyspark.context import SparkContext, SparkConf
from pyspark.sql import SparkSession
from pyspark.sql.types import *

conf = SparkConf()
conf.set('spark.driver.extraClassPath', 'dbfs:/FileStore/jars/4b129434_12cd_4f2a_ab27_baaefe904857-scala_udf_similarity_0_0_7-35e3b.jar')
conf.set('spark.jars', 'dbfs:/FileStore/jars/4b129434_12cd_4f2a_ab27_baaefe904857-scala_udf_similarity_0_0_7-35e3b.jar')

sc = SparkContext.getOrCreate(conf=conf)
spark = SparkSession(sc)
#spark.sparkContext.addPyFile("dbfs:/FileStore/jars/4b129434_12cd_4f2a_ab27_baaefe904857-scala_udf_similarity_0_0_7-35e3b.jar")

udfs = [
    ('jaro_winkler_sim', 'JaroWinklerSimilarity', DoubleType()),
    ('jaccard_sim', 'JaccardSimilarity', DoubleType()),
    ('cosine_distance', 'CosineDistance', DoubleType()),
    ('Dmetaphone', 'DoubleMetaphone', StringType()),
    ('QgramTokeniser', 'QgramTokeniser', StringType())
]

for a, b, c in udfs:
    spark.udf.registerJavaFunction(a, 'uk.gov.moj.dash.linkage.' + b, c)
linker = Splink(settings, spark, df_l=df_l, df_r=df_r)
df_e = linker.get_scored_comparisons()
Next I tried moving the jars and extraClassPath settings into the cluster config:
spark.jars dbfs:/FileStore/jars/4b129434_12cd_4f2a_ab27_baaefe904857-scala_udf_similarity_0_0_7-35e3b.jar
spark.driver.extraClassPath dbfs:/FileStore/jars/4b129434_12cd_4f2a_ab27_baaefe904857-scala_udf_similarity_0_0_7-35e3b.jar
Then I registered them in my script as follows:
from pyspark.context import SparkContext, SparkConf
from pyspark.sql import SparkSession
from pyspark.sql.types import *

# java path to class uk.gov.moj.dash.linkage.scala-udf-similarity.CosineDistance
udfs = [
    ('jaro_winkler_sim', 'JaroWinklerSimilarity', DoubleType()),
    ('jaccard_sim', 'JaccardSimilarity', DoubleType()),
    ('cosine_distance', 'CosineDistance', DoubleType()),
    ('Dmetaphone', 'DoubleMetaphone', StringType()),
    ('QgramTokeniser', 'QgramTokeniser', StringType())
]

for a, b, c in udfs:
    spark.udf.registerJavaFunction(a, 'uk.gov.moj.dash.linkage.' + b, c)
linker = Splink(settings, spark, df_l=df_l, df_r=df_r)
df_e = linker.get_scored_comparisons()
Thanks
Looking into the source code of the UDFs, I see that it's compiled with Scala 2.11 and uses Spark 2.2.0 as a base. The most probable reason for the error is that you're using this jar with DBR 7.x, which is compiled with Scala 2.12 and based on Spark 3.x; these are binary incompatible with your jar. You have the following choices:
Recompile the library with Scala 2.12 and Spark 3.0
Use DBR 6.4, which uses Scala 2.11 and Spark 2.4
P.S. Overwriting the classpath on Databricks can sometimes be tricky, so it's better to use other approaches:
Install your jar as a library on the cluster - this can be done via the UI, via the REST API, or via other automation such as Terraform (a REST-based sketch follows the init-script example below)
Use an init script to copy your jar into the default jars location. In the simplest case it could look like the following:
#!/bin/bash
cp /dbfs/FileStore/jars/4b129434_12cd_4f2a_ab27_baaefe904857-scala_udf_similarity_0_0_7-35e3b.jar /databricks/jars/
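For the first option (installing the jar as a cluster library via the REST API), a rough sketch might look like the following; the workspace URL, token, and cluster ID are placeholders, and the endpoint and payload shape reflect my understanding of the Databricks Libraries API rather than anything from the original post.
import requests

# Placeholder values - substitute your workspace URL, personal access token and cluster id.
host = "https://<your-workspace>.cloud.databricks.com"
token = "<personal-access-token>"
cluster_id = "<cluster-id>"

# Install the DBFS-hosted jar as a cluster library (assumed Libraries API 2.0 endpoint).
resp = requests.post(
    host + "/api/2.0/libraries/install",
    headers={"Authorization": "Bearer " + token},
    json={
        "cluster_id": cluster_id,
        "libraries": [
            {"jar": "dbfs:/FileStore/jars/4b129434_12cd_4f2a_ab27_baaefe904857-scala_udf_similarity_0_0_7-35e3b.jar"}
        ],
    },
)
resp.raise_for_status()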

Pyspark Structured streaming locally with Kafka-Jupyter

After looking at the other answers I still can't figure it out.
I am able to use KafkaProducer and KafkaConsumer to send and receive messages from within my notebook.
import json
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers=['127.0.0.1:9092'],
                         value_serializer=lambda m: json.dumps(m).encode('ascii'))
consumer = KafkaConsumer('hr', bootstrap_servers=['127.0.0.1:9092'], group_id='abc')
I've tried to connect to the stream with both the Spark context and the Spark session.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext("local[*]", "stream")
ssc = StreamingContext(sc, 1)
Which gives me this error:
Spark Streaming's Kafka libraries not found in class path. Try one of the following.

1. Include the Kafka library and its dependencies with in the spark-submit command as

   $ bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8:2.3.2 ...
It seems that I needed to add the JAR to my spark-submit call, so I tried:
!/usr/local/bin/spark-submit --master local[*] /usr/local/Cellar/apache-spark/2.3.0/libexec/jars/spark-streaming-kafka-0-8-assembly_2.11-2.3.2.jar pyspark-shell
which returns
Error: No main class set in JAR; please specify one with --class
Run with --help for usage help or --verbose for debug output
What class do I put in?
How do I get PySpark to connect to the consumer?
The command you have is trying to run spark-streaming-kafka-0-8-assembly_2.11-2.3.2.jar as the application, and to find pyspark-shell as a Java class inside of it.
As the first error says, you missed a --packages argument after spark-submit, which means you would do something like
spark-submit --packages ... someApp.jar com.example.YourClass
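Note that for a PySpark application there is no main class at all - you just pass the .py file after the options. Inside Jupyter, one common pattern is to set PYSPARK_SUBMIT_ARGS before the SparkContext is created; the sketch below assumes the Scala 2.11 build of Spark 2.3.2 (match the version to your installation) and reuses the 'hr' topic from the question.
import os

# Must be set before the SparkContext is created; the trailing 'pyspark-shell' is required.
# Package coordinate assumed to match the local Spark build (2.3.2, Scala 2.11).
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.3.2 pyspark-shell"
)

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext("local[*]", "stream")
ssc = StreamingContext(sc, 1)
# createStream expects the ZooKeeper quorum (typically port 2181), not the broker list.
kafkaStream = KafkaUtils.createStream(ssc, "127.0.0.1:2181", "spark-streaming", {"hr": 1})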
If you are just running locally in Jupyter, you may want to try kafka-python, for example, rather than PySpark... less overhead, and no Java dependencies.
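A minimal kafka-python sketch of that suggestion, reusing the topic, broker and JSON encoding from the question:
import json
from kafka import KafkaConsumer

# Consume the 'hr' topic directly, with no Spark in the loop.
consumer = KafkaConsumer(
    "hr",
    bootstrap_servers=["127.0.0.1:9092"],
    group_id="abc",
    value_deserializer=lambda m: json.loads(m.decode("ascii")),
)

for message in consumer:
    print(message.value)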

Not able to run the Hive sql via Spark

I am trying to execute Hive SQL via Spark code, but it is throwing the error mentioned below. I can only select data from the Hive table.
My Spark version is 1.6.1
My Hive version is 1.2.1
Command to run spark-submit:
spark-submit --master local[8] --files /srv/data/app/spark/conf/hive-site.xml test_hive.py
Code:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
from pyspark.sql import HiveContext

sc = SparkContext()
sqlContext = SQLContext(sc)
HiveContext = HiveContext(sc)
#HiveContext.setConf("yarn.timeline-service.enabled","false")
#HiveContext.sql("SET spark.sql.crossJoin.enabled=false")
HiveContext.sql("use default")
HiveContext.sql("TRUNCATE TABLE default.test_table")
HiveContext.sql("LOAD DATA LOCAL INPATH '/srv/data/data_files/*' OVERWRITE INTO TABLE default.test_table")
df = HiveContext.sql("select * from version")
for x in df.collect():
    print x
Error:
17386 [Thread-3] ERROR org.apache.spark.sql.hive.client.ClientWrapper -
======================
HIVE FAILURE OUTPUT
======================
SET spark.sql.inMemoryColumnarStorage.compressed=true
SET spark.sql.thriftServer.incrementalCollect=true
SET spark.sql.hive.convertMetastoreParquet=false
SET spark.sql.broadcastTimeout=800
SET spark.sql.hive.thriftServer.singleSession=true
SET spark.sql.inMemoryColumnarStorage.partitionPruning=true
SET spark.sql.crossJoin.enabled=true
SET hive.support.sql11.reserved.keywords=false
SET spark.sql.crossJoin.enabled=false
OK
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. ClassCastException: attempting to cast jar:file:/srv/data/OneClickProvision_1.2.2/files/app/spark/assembly/target/scala-2.10/spark-assembly-1.6.2-SNAPSHOT-hadoop2.6.1.jar!/javax/ws/rs/ext/RuntimeDelegate.class to jar:file:/srv/data/OneClickProvision_1.2.2/files/app/spark/assembly/target/scala-2.10/spark-assembly-1.6.2-SNAPSHOT-hadoop2.6.1.jar!/javax/ws/rs/ext/RuntimeDelegate.class
======================
END HIVE FAILURE OUTPUT
======================
Traceback (most recent call last):
File "/home/iip/hist_load.py", line 10, in <module>
HiveContext.sql("TRUNCATE TABLE default.tbl_wmt_pos_file_test")
File "/srv/data/OneClickProvision_1.2.2/files/app/spark/python/lib/pyspark.zip/pyspark/sql/context.py", line 580, in sql
File "/srv/data/OneClickProvision_1.2.2/files/app/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
File "/srv/data/OneClickProvision_1.2.2/files/app/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 45, in deco
File "/srv/data/OneClickProvision_1.2.2/files/app/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o46.sql.
: org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. ClassCastException: attempting to cast jar:file:/srv/data/OneClickProvision_1.2.2/files/app/spark/assembly/target/scala-2.10/spark-assembly-1.6.2-SNAPSHOT-hadoop2.6.1.jar!/javax/ws/rs/ext/RuntimeDelegate.class to jar:file:/srv/data/OneClickProvision_1.2.2/files/app/spark/assembly/target/scala-2.10/spark-assembly-1.6.2-SNAPSHOT-hadoop2.6.1.jar!/javax/ws/rs/ext/RuntimeDelegate.class
I can only select data from hive table.
It is perfectly normal and expected behavior. Spark SQL is not intended to be fully compatible with HiveQL or to implement the full set of features provided by Hive.
Overall, some compatibility is preserved, but it is not guaranteed to be kept in the future, as Spark SQL converges towards the SQL 2003 standard.
From the post here:
The Spark job fails with a ClassCastException because of a conflict between different versions of the same class in the YARN and Spark jars.
Set the below property in the HiveContext:
hc = new org.apache.spark.sql.hive.HiveContext(sc)
hc.setConf("yarn.timeline-service.enabled","false")
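Since the question uses PySpark, the equivalent there would presumably look like this (a sketch only; the property name is taken from the answer above and the table name from the question):
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext()
hc = HiveContext(sc)
# Same workaround as the Scala snippet above, applied from PySpark before running the DDL.
hc.setConf("yarn.timeline-service.enabled", "false")
hc.sql("TRUNCATE TABLE default.test_table")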

--files option in pyspark not working

I tried the sc.addFile option (works without any issues) and the --files option from the command line (fails).
Run 1 : spark_distro.py
from pyspark import SparkContext, SparkConf
from pyspark import SparkFiles

def import_my_special_package(x):
    from external_package import external
    ext = external()
    return ext.fun(x)

conf = SparkConf().setAppName("Using External Library")
sc = SparkContext(conf=conf)
sc.addFile("/local-path/readme.txt")

with open(SparkFiles.get('readme.txt')) as test_file:
    lines = [line.strip() for line in test_file]
    print(lines)

int_rdd = sc.parallelize([1, 2, 4, 3])
mod_rdd = sorted(int_rdd.filter(lambda z: z % 2 == 1).map(lambda x: import_my_special_package(x)))
external package: external_package.py
class external(object):
    def __init__(self):
        pass

    def fun(self, input):
        return input * 2
readme.txt
MY TEXT HERE
spark-submit command
spark-submit \
--master yarn-client \
--py-files /path to local codelib/external_package.py \
/local-pgm-path/spark_distro.py \
1000
Output: Working as expected
['MY TEXT HERE']
But if I try to pass the file (readme.txt) from the command line using the --files option (instead of sc.addFile), it fails, like below.
Run 2 : spark_distro.py
from pyspark import SparkContext, SparkConf
from pyspark import SparkFiles

def import_my_special_package(x):
    from external_package import external
    ext = external()
    return ext.fun(x)

conf = SparkConf().setAppName("Using External Library")
sc = SparkContext(conf=conf)

with open(SparkFiles.get('readme.txt')) as test_file:
    lines = [line.strip() for line in test_file]
    print(lines)

int_rdd = sc.parallelize([1, 2, 4, 3])
mod_rdd = sorted(int_rdd.filter(lambda z: z % 2 == 1).map(lambda x: import_my_special_package(x)))
external_package.py: same as above
spark-submit command:
spark-submit \
--master yarn-client \
--py-files /path to local codelib/external_package.py \
--files /local-path/readme.txt#readme.txt \
/local-pgm-path/spark_distro.py \
1000
Output:
Traceback (most recent call last):
File "/local-pgm-path/spark_distro.py", line 31, in <module>
with open(SparkFiles.get('readme.txt')) as test_file:
IOError: [Errno 2] No such file or directory: u'/tmp/spark-42dff0d7-c52f-46a8-8323-08bccb412cd6/userFiles-8bd16297-1291-4a37-b080-bbc3836cb512/readme.txt'
Are sc.addFile and --files used for the same purpose? Can someone please share your thoughts?
I have finally figured out the issue, and it is a very subtle one indeed.
As suspected, the two options (sc.addFile and --files) are not equivalent, and this is (admittedly very subtly) hinted at in the documentation (emphasis added):
addFile(path, recursive=False)
Add a file to be downloaded with this Spark job on every node.
--files FILES
Comma-separated list of files to be placed in the working
directory of each executor.
In plain English, while files added with sc.addFile are available to both the executors and the driver, files added with --files are available only to the executors; hence, when trying to access them from the driver (as is the case in the OP), we get a No such file or directory error.
Let's confirm this (getting rid of all the irrelevant --py-files and 1000 stuff in the OP):
test_fail.py:
from pyspark import SparkContext, SparkConf
from pyspark import SparkFiles

conf = SparkConf().setAppName("Use External File")
sc = SparkContext(conf=conf)

with open(SparkFiles.get('readme.txt')) as test_file:
    lines = [line.strip() for line in test_file]
    print(lines)
Test:
spark-submit --master yarn \
--deploy-mode client \
--files /home/ctsats/readme.txt \
/home/ctsats/scripts/SO/test_fail.py
Result:
[...]
17/11/10 15:05:39 INFO yarn.Client: Uploading resource file:/home/ctsats/readme.txt -> hdfs://host-hd-01.corp.nodalpoint.com:8020/user/ctsats/.sparkStaging/application_1507295423401_0047/readme.txt
[...]
Traceback (most recent call last):
File "/home/ctsats/scripts/SO/test_fail.py", line 6, in <module>
with open(SparkFiles.get('readme.txt')) as test_file:
IOError: [Errno 2] No such file or directory: u'/tmp/spark-8715b4d9-a23b-4002-a1f0-63a1e9d3e00e/userFiles-60053a41-472e-4844-a587-6d10ed769e1a/readme.txt'
In the above script test_fail.py, it is the driver program that requests access to the file readme.txt; let's change the script, so that access is requested for the executors (test_success.py):
from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName("Use External File")
sc = SparkContext(conf=conf)
lines = sc.textFile("readme.txt") # run in the executors
print(lines.collect())
Test:
spark-submit --master yarn \
--deploy-mode client \
--files /home/ctsats/readme.txt \
/home/ctsats/scripts/SO/test_success.py
Result:
[...]
17/11/10 15:16:05 INFO yarn.Client: Uploading resource file:/home/ctsats/readme.txt -> hdfs://host-hd-01.corp.nodalpoint.com:8020/user/ctsats/.sparkStaging/application_1507295423401_0049/readme.txt
[...]
[u'MY TEXT HERE']
Notice also that here we don't need SparkFiles.get - the file is readily accessible.
As said above, sc.addFile will work in both cases, i.e. when access is requested either by the driver or by the executors (tested but not shown here).
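For reference, a sketch of what such a check could look like (not the exact script used in the tests above, just the shape of it):
from pyspark import SparkContext, SparkConf
from pyspark import SparkFiles

conf = SparkConf().setAppName("addFile visible to driver and executors")
sc = SparkContext(conf=conf)
sc.addFile("/home/ctsats/readme.txt")   # local path, as in the examples above

# Driver-side access:
with open(SparkFiles.get('readme.txt')) as test_file:
    print([line.strip() for line in test_file])

# Executor-side access:
def first_line(_):
    with open(SparkFiles.get('readme.txt')) as f:
        return f.readline().strip()

print(sc.parallelize([1]).map(first_line).collect())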
Regarding the order of the command line options: as I have argued elsewhere, all Spark-related arguments must be before the script to be executed; arguably, the relative order of --files and --py-files is irrelevant (leaving it as an exercise).
Tested with both Spark 1.6.0 & 2.2.0.
UPDATE (after the comments): Seems that my fs.defaultFS setting points to HDFS, too:
$ hdfs getconf -confKey fs.defaultFS
hdfs://host-hd-01.corp.nodalpoint.com:8020
But let me focus on the forest here (instead of the trees, that is), and explain why this whole discussion is of academic interest only:
Passing files to be processed with the --files flag is bad practice; in hindsight, I can now see why I could find almost no usage references online - probably nobody uses it in practice, and with good reason.
(Notice that I am not talking about --py-files, which serves a different, legitimate role.)
Since Spark is a distributed processing framework, running over a cluster and a distributed file system (HDFS), the best thing to do is to have all the files to be processed already in HDFS - period. The "natural" place for files to be processed by Spark is HDFS, not the local FS - although there are some toy examples using the local FS for demonstration purposes only. What's more, if at some point in the future you want to change the deploy mode to cluster, you'll discover that the cluster, by default, knows nothing of local paths and files, and rightfully so...
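In other words, a sketch of the HDFS-first workflow (paths assumed for illustration):
from pyspark import SparkContext, SparkConf

# Assumes the file was first uploaded to HDFS, e.g. with:
#   hdfs dfs -put /home/ctsats/readme.txt /user/ctsats/
conf = SparkConf().setAppName("Read from HDFS")
sc = SparkContext(conf=conf)

# No --files and no SparkFiles needed: just read the HDFS path directly.
lines = sc.textFile("hdfs:///user/ctsats/readme.txt")
print(lines.collect())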

Spark 2.1 - Error While instantiating HiveSessionState

With a fresh install of Spark 2.1, I am getting an error when executing the pyspark command.
Traceback (most recent call last):
File "/usr/local/spark/python/pyspark/shell.py", line 43, in <module>
spark = SparkSession.builder\
File "/usr/local/spark/python/pyspark/sql/session.py", line 179, in getOrCreate
session._jsparkSession.sessionState().conf().setConfString(key, value)
File "/usr/local/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/usr/local/spark/python/pyspark/sql/utils.py", line 79, in deco
raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.IllegalArgumentException: u"Error while instantiating 'org.apache.spark.sql.hive.HiveSessionState':"
I have Hadoop and Hive on the same machine. Hive is configured to use MySQL for the metastore. I did not get this error with Spark 2.0.2.
Can someone please point me in the right direction?
I was getting the same error in a Windows environment, and the trick below worked for me.
In shell.py the Spark session is defined with .enableHiveSupport():
spark = SparkSession.builder\
    .enableHiveSupport()\
    .getOrCreate()
Remove the Hive support and redefine the Spark session as below:
spark = SparkSession.builder\
    .getOrCreate()
You can find shell.py in your Spark installation folder; for me it's in "C:\spark-2.1.1-bin-hadoop2.7\python\pyspark".
Hope this helps.
I had the same problem. Some of the suggested answers, such as sudo chmod -R 777 /tmp/hive/ or downgrading Spark with Hadoop to 2.6, didn't work for me.
I realized that what caused this problem for me was that I was doing SQL queries using the sqlContext instead of the sparkSession.
sparkSession = SparkSession.builder.master("local[*]") \
    .appName("appName") \
    .config("spark.sql.warehouse.dir", "./spark-warehouse") \
    .getOrCreate()
sqlCtx.registerDataFrameAsTable(..)
df = sparkSession.sql("SELECT ...")
This works perfectly for me now.
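Presumably the temp-table registration should also go through the session API rather than the old sqlCtx; a sketch of the session-only version (the DataFrame and table names below are placeholders):
from pyspark.sql import SparkSession

sparkSession = SparkSession.builder.master("local[*]") \
    .appName("appName") \
    .config("spark.sql.warehouse.dir", "./spark-warehouse") \
    .getOrCreate()

some_df = sparkSession.range(10)               # placeholder DataFrame
some_df.createOrReplaceTempView("some_table")  # instead of sqlCtx.registerDataFrameAsTable
df = sparkSession.sql("SELECT * FROM some_table")
print(df.count())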
Spark 2.1.0 - When I run it with the yarn client option I don't see this issue, but yarn cluster mode gives "Error while instantiating 'org.apache.spark.sql.hive.HiveSessionState':".
Still looking for an answer.
The issue for me was solved by unsetting the HADOOP_CONF_DIR environment variable. It was pointing to the Hadoop configuration directory, and while starting the pyspark shell the variable caused Spark to try to use a Hadoop cluster which hadn't been started.
So if you have the HADOOP_CONF_DIR variable set, then you either have to start the Hadoop cluster before using the Spark shells, or you need to unset the variable.
You are missing the spark-hive jar.
For example, if you are running on Scala 2.11, with Spark 2.1, you can use this jar.
https://mvnrepository.com/artifact/org.apache.spark/spark-hive_2.11/2.1.0
I saw this error on a new (2018) Mac, which came with Java 10. The fix was to set JAVA_HOME to Java 8:
export JAVA_HOME=`/usr/libexec/java_home -v 1.8`
I too was struggling in cluster mode. I added hive-site.xml to the Spark conf directory; if you have an HDP cluster, it should be at /usr/hdp/current/spark2-client/conf. It's working for me.
I was getting this error trying to run pyspark and spark-shell when my HDFS wasn't started.
I removed ".enableHiveSupport()\" from the shell.py file and it's working perfectly.
# Before:
spark = SparkSession.builder\
    .enableHiveSupport()\
    .getOrCreate()

# After:
spark = SparkSession.builder\
    .getOrCreate()
Project location and file permissions could be the issue. I observed this error happening in spite of changes to my pom file. Then I changed my project directory to a user directory where I have full permissions, and this solved my issue.
