How to use external (custom) package in pyspark? - apache-spark

I am trying to replicate the soultion given here https://www.cloudera.com/documentation/enterprise/5-7-x/topics/spark_python.html
to import external packages in pypspark. But it is failing.
My code:
spark_distro.py
from pyspark import SparkContext, SparkConf
def import_my_special_package(x):
from external_package import external
return external.fun(x)
conf = SparkConf()
sc = SparkContext()
int_rdd = sc.parallelize([1, 2, 3, 4])
int_rdd.map(lambda x: import_my_special_package(x)).collect()
external_package.py
class external:
def __init__(self,in):
self.in = in
def fun(self,in):
return self.in*3
spark submit command:
spark-submit \
--master yarn \
/path to script/spark_distro.py \
--py-files /path to script/external_package.py \
1000
Actual Error:
Actual:
vs = list(itertools.islice(iterator, batch))
File "/home/gsurapur/pyspark_examples/spark_distro.py", line 13, in <lambda>
File "/home/gsurapur/pyspark_examples/spark_distro.py", line 6, in import_my_special_package
ImportError: No module named external_package
Expected output:
[3,6,9,12]
I tried sc.addPyFile option too and it is failing with same issue.

I know that, in hindsight, it sounds silly, but the order of the arguments of spark-submit is not in general interchangeable: all Spark-related arguments, including --py-file, must be before the script to be executed:
# your case:
spark-submit --master yarn-client /home/ctsats/scripts/SO/spark_distro.py --py-files /home/ctsats/scripts/SO/external_package.py
[...]
ImportError: No module named external_package
# correct usage:
spark-submit --master yarn-client --py-files /home/ctsats/scripts/SO/external_package.py /home/ctsats/scripts/SO/spark_distro.py
[...]
[3, 6, 9, 12]
Tested with your scripts modified as follows:
spark_distro.py
from pyspark import SparkContext, SparkConf
def import_my_special_package(x):
from external_package import external
return external(x)
conf = SparkConf()
sc = SparkContext()
int_rdd = sc.parallelize([1, 2, 3, 4])
print int_rdd.map(lambda x: import_my_special_package(x)).collect()
external_package.py
def external(x):
return x*3
with the modifications arguably not changing the essence of the question...

Here is the situation regarding addPyFile:
spark_distro2.py
from pyspark import SparkContext, SparkConf
def import_my_special_package(x):
from external_package import external
return external(x)
conf = SparkConf()
sc = SparkContext()
sc.addPyFile("/home/ctsats/scripts/SO/external_package.py") # added
int_rdd = sc.parallelize([1, 2, 3, 4])
print int_rdd.map(lambda x: import_my_special_package(x)).collect()
Test:
spark-submit --master yarn-client /home/ctsats/scripts/SO/spark_distro2.py
[...]
[3, 6, 9, 12]

Related

ModuleNotFoundError in PySpark caused in serializers.py

I am trying to submit a Spark Application to the local Kubernetes cluster on my machine (created via Docker Dashboard). The application depends on a python package, let's call it X.
Here is the application code:
import sys
from pyspark import SparkContext
from pyspark.sql import SparkSession
datafolder = "/opt/spark/data" # Folder created in container by spark's docker file
sys.path.append(datafolder) # X is contained inside of datafolder
from X.predictor import * # import functionality from X
def apply_x_functionality_on(item):
predictor = Predictor() # class from X.predictor
predictor.predict(item)
def main():
spark = SparkSession\
.builder\
.appName("AppX")\
.getOrCreate()
sc = spark.sparkContext
data = []
# Read data: [no problems there]
...
data_rdd = sc.parallelize(data) # create RDD
data_rdd.foreach(lambda item: apply_network(item)) # call function
if __name__ == "__main__":
main()
Initially I hoped to avoid such problems by putting the X folder to the data folder of Spark. When container is built, all the content of data folder is being copied to the /opt/spark/data. My Spark application appends contents of data folder to the system path, as such consuming the package X. Well, I thought it does.
Everything works fine until the .foreach function is called. Here is a snippet from loggs with error description:
20/11/25 16:13:54 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 10.1.0.60, executor 2): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 587, in main
func, profiler, deserializer, serializer = read_command(pickleSer, infile)
File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 74, in read_command
command = serializer._read_with_length(file)
File "/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 172, in _read_with_length
return self.loads(obj)
File "/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 458, in loads
return pickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named 'X'
There are a lot of similar questions here: one, two, three, but none of the answers to them have helped me so far.
What I have tried:
I submitted application with .zip(ed) X (I zip it in container, by applying zip to X):
$SPARK_HOME/bin/spark-submit \
--master k8s://https://kubernetes.docker.internal:6443 \
--deploy-mode cluster \
--conf spark.executor.instances=5 \
--conf spark.kubernetes.container.image=kostjaigin/spark-py:v3.0.1-X_0.0.1 \
--py-files "local:///opt/spark/data/X.zip" \
local:///opt/spark/data/MyApp.py
I added .zip(ed) X to Spark Context:
sc.addPyFile("opt/spark/data/X.zip")
I have resolved the issue:
Created dependencies folder under /opt/spark/data
Put X to dependencies
Inside of my docker file I pack dependencies folder in a zip archive to submit it later as py-files: cd /opt/spark/data/**dependencies** && zip -r ../dependencies.zip .
In Application:
...
from X.predictor import * # import functionality from X
...
# zipped package
zipped_pkg = os.path.join(datafolder, "dependencies.zip")
assert os.path.exists(zipped_pkg)
sc.addPyFile(zipped_pkg)
...
Add --py-files flag to the submit command:
$SPARK_HOME/bin/spark-submit \
--master k8s://https://kubernetes.docker.internal:6443 \
--deploy-mode cluster \
--conf spark.executor.instances=5 \
--py-files "local:///opt/spark/data/dependencies.zip" \
local:///opt/spark/data/MyApp.py
Run it
Basically it is all about adding a dependencies.zip Archive with all the required dependencies in it.

TypeError: 'JavaPackage' object is not callable & Spark Streaming's Kafka libraries not found in class path

I use pyspark streaming to read kafka data, but it went wrong:
import os
from pyspark.streaming.kafka import KafkaUtils
from pyspark.streaming import StreamingContext
from pyspark import SparkContext
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-streaming-kafka-0-8:2.0.2 pyspark-shell'
sc = SparkContext(appName="test")
sc.setLogLevel("WARN")
ssc = StreamingContext(sc, 60)
kafkaStream = KafkaUtils.createStream(ssc, "localhost:2181", "test-id", {'test': 2})
kafkaStream.map(lambda x: x.split(" ")).pprint()
ssc.start()
ssc.awaitTermination()
________________________________________________________________________________________________
Spark Streaming's Kafka libraries not found in class path. Try one of the following.
1. Include the Kafka library and its dependencies with in the
spark-submit command as
$ bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8:2.4.3 ...
2. Download the JAR of the artifact from Maven Central http://search.maven.org/,
Group Id = org.apache.spark, Artifact Id = spark-streaming-kafka-0-8-assembly, Version = 2.4.3.
Then, include the jar in the spark-submit command as
$ bin/spark-submit --jars <spark-streaming-kafka-0-8-assembly.jar> ...
________________________________________________________________________________________________
Traceback (most recent call last):
File "/home/docs/dp_model/dp_algo_platform/dp_algo_core/test/test.py", line 29, in <module>
kafkaStream = KafkaUtils.createStream(ssc, "localhost:2181", "test-id", {'test': 2})
File "/home/softs/spark-2.4.3-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/streaming/kafka.py", line 78, in createStream
File "/home/softs/spark-2.4.3-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/streaming/kafka.py", line 217, in _get_helper
TypeError: 'JavaPackage' object is not callable
My spark version: 2.4.3, kafka version: 2.1.0, and I replace os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-streaming-kafka-0-8:2.0.2 pyspark-shell' with os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-streaming-kafka-0-8:2.4.3 pyspark-shell', it cannot work either. How can I do it?
I think you should move around your imports such that the environment is loaded with the variable before you import and initialize the Spark variables
You also definitely need to be using the same version of packages as your Spark version
import os
sparkVersion = '2.4.3' # update this accordingly
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-streaming-kafka-0-8:{} pyspark-shell'.format(sparkVersion)
# import Spark core
from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext
# import extra packages
from pyspark.streaming.kafka import KafkaUtils
# begin application
spark = SparkSession.builder.appName("test").getOrCreate()
sc = spark.sparkContext
Note: Kafka 0.8 support is deprecated as of Spark 2.3.0

ValueError: Cannot run multiple SparkContexts at once in spark with pyspark

i am new in using spark , i try to run this code on pyspark
from pyspark import SparkConf, SparkContext
import collections
conf = SparkConf().setMaster("local").setAppName("RatingsHistogram")
sc = SparkContext(conf = conf)
but he till me this erore message
Using Python version 3.5.2 (default, Jul 5 2016 11:41:13)
SparkSession available as 'spark'.
>>> from pyspark import SparkConf, SparkContext
>>> import collections
>>> conf = SparkConf().setMaster("local").setAppName("RatingsHistogram")
>>> sc = SparkContext(conf = conf)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\spark\python\pyspark\context.py", line 115, in __init__
SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
File "C:\spark\python\pyspark\context.py", line 275, in _ensure_initialized
callsite.function, callsite.file, callsite.linenum))
ValueError: Cannot run multiple SparkContexts at once; existing SparkContext(app=PySparkShell, master=local[*]) created by getOrCreate at C:\spark\bin\..\python\pyspark\shell.py:43
>>>
i have version spark 2.1.1 and python 3.5.2 , i search and found it is problem in sc ,he could not read it but no when till why , any one have help here
You can try out this
sc = SparkContext.getOrCreate();
You can try:
sc = SparkContext.getOrCreate(conf=conf)
Your previous session is still on. You can run
sc.stop()
it can run through Jupyter lab also. but you have to use as your previous session is still running and local can not run two sessions at a time
sc = SparkContext.getOrCreate( conf =conf)

jupyter notebook NameError: name 'sc' is not defined

I used the jupyter notebook, pyspark, then, my first command was:
rdd = sc.parallelize([2, 3, 4])
Then, it showed that
NameError Traceback (most recent call last)
<ipython-input-1-c540c4a1d203> in <module>()
----> 1 rdd = sc.parallelize([2, 3, 4])
NameError: name 'sc' is not defined.
How to fix this error 'sc' is not defined.
Have you initialized the SparkContext?
You could try this:
#Initializing PySpark
from pyspark import SparkContext, SparkConf
# #Spark Config
conf = SparkConf().setAppName("sample_app")
sc = SparkContext(conf=conf)
Try this
import findspark
findspark.init()
import pyspark # only run after findspark.init()
from pyspark import SparkContext, SparkConf
# #Spark Config
conf = SparkConf().setAppName("sample_app")
sc = SparkContext(conf=conf)
myrdd = sc.parallelize([('roze', 60), ('Mary', 80), ('stella', 34)])

TypeError: 'JavaPackage' object is not callable

when I code the spark sql API hiveContext.sql()
from pyspark import SparkConf,SparkContext
from pyspark.sql import SQLContext,HiveContext
conf = SparkConf().setAppName("spark_sql")
sc = SparkContext(conf = conf)
hc = HiveContext(sc)
#rdd = sc.textFile("test.txt")
sqlContext = SQLContext(sc)
res = hc.sql("use teg_uee_app")
#for each in res.collect():
# print(each[0])
sc.stop()
I got the following error:
enFile "spark_sql.py", line 23, in <module>
res = hc.sql("use teg_uee_app")
File "/spark/python/pyspark/sql/context.py", line 580, in sql
return DataFrame(self._ssql_ctx.sql(sqlQuery), self)
File "/spark/python/pyspark/sql/context.py", line 683, in _ssql_ctx
self._scala_HiveContext = self._get_hive_ctx()
File "/spark/python/pyspark/sql/context.py", line 692, in _get_hive_ctx
return self._jvm.HiveContext(self._jsc.sc())
TypeError: 'JavaPackage' object is not callable
how do I add SPARK_CLASSPATH or SparkContext.addFile?I don't have idea.
Maybe this will help you: When using HiveContext I have to add three jars to the spark-submit arguments:
spark-submit --jars /usr/lib/spark/lib/datanucleus-api-jdo-3.2.6.jar,/usr/lib/spark/lib/datanucleus-core-3.2.10.jar,/usr/lib/spark/lib/datanucleus-rdbms-3.2.9.jar ...
Of course the paths and versions depend on your cluster setup.
In my case this turned out to be a classpath issue - I had a Hadoop jar on the classpath that was a wrong version of Hadoop than I was running.
Make sure you only set the executor and/or driver classpaths in one place and that there's no system-wide default applied somewhere such as .bashrc or Spark's conf/spark-env.sh.

Resources