My environment: Ubuntu 64 bit, Spark 2.4.5, Jupyter Notebook.
With an internet connection this works fine and I don't get any error:
spark = SparkSession.builder \
.appName("Churn Scoring LightGBM") \
.master("local[4]") \
.config("spark.jars.packages","com.microsoft.ml.spark:mmlspark_2.11:0.18.1") \
.getOrCreate()
from mmlspark.lightgbm import LightGBMClassifier
But without an internet connection I use the related jars that I downloaded beforehand (this approach is recommended by the Cloudera docs):
import os
mmlspark_jars_dir = os.path.join(os.environ["SPARK_HOME"], "mmlspark_jars")
mmlspark_jars = [os.path.join(mmlspark_jars_dir, x) for x in os.listdir(mmlspark_jars_dir)]
print(mmlspark_jars)
['/home/erkan/spark/mmlspark_jars/com.jcraft_jsch-0.1.54.jar',
'/home/erkan/spark/mmlspark_jars/com.microsoft.ml.spark_mmlspark_2.11-0.18.1.jar',
'/home/erkan/spark/mmlspark_jars/commons-codec_commons-codec-1.10.jar',
'/home/erkan/spark/mmlspark_jars/org.scalatest_scalatest_2.11-3.0.5.jar',
'/home/erkan/spark/mmlspark_jars/org.apache.httpcomponents_httpcore-4.4.10.jar',
'/home/erkan/spark/mmlspark_jars/org.openpnp_opencv-3.2.0-1.jar',
'/home/erkan/spark/mmlspark_jars/commons-logging_commons-logging-1.2.jar',
'/home/erkan/spark/mmlspark_jars/com.github.vowpalwabbit_vw-jni-8.7.0.2.jar',
'/home/erkan/spark/mmlspark_jars/org.apache.httpcomponents_httpclient-4.5.6.jar',
'/home/erkan/spark/mmlspark_jars/org.scala-lang_scala-reflect-2.11.12.jar',
'/home/erkan/spark/mmlspark_jars/org.scala-lang.modules_scala-xml_2.11-1.0.6.jar',
'/home/erkan/spark/mmlspark_jars/com.microsoft.cntk_cntk-2.4.jar',
'/home/erkan/spark/mmlspark_jars/io.spray_spray-json_2.11-1.3.2.jar',
'/home/erkan/spark/mmlspark_jars/org.scalactic_scalactic_2.11-3.0.5.jar',
'/home/erkan/spark/mmlspark_jars/com.microsoft.ml.lightgbm_lightgbmlib-2.2.350.jar']
And I had to modify the SparkSession like this:
spark = SparkSession.builder \
.appName("Churn Scoring LightGBM") \
.master("local[4]") \
.config("spark.jars", ",".join(mmlspark_jars)) \
.getOrCreate()
I watched the terminal output and everything seemed fine: the SparkSession was created. Then I checked the Spark UI, and the jars were listed there as well.
Then I tried to import:
from mmlspark.lightgbm import LightGBMClassifier
And got this error:
ModuleNotFoundError Traceback (most recent call last)
<ipython-input-10-df498625321c> in <module>
----> 1 from mmlspark.lightgbm import LightGBMClassifier
ModuleNotFoundError: No module named 'mmlspark'
What I don't understand is that, although I see the same jars in the Spark UI, the import doesn't work with the second method.
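My guess (not something I have confirmed) is that spark.jars.packages also exposes the Python half of mmlspark, while spark.jars only puts the jars on the JVM classpath, so the mmlspark Python module never reaches the Python path. A minimal sketch of a workaround, assuming the mmlspark Python package sits at the root of the main mmlspark jar:
import os
from pyspark.sql import SparkSession

# Sketch of a possible workaround, not a verified fix.
mmlspark_jars_dir = os.path.join(os.environ["SPARK_HOME"], "mmlspark_jars")
mmlspark_jars = [os.path.join(mmlspark_jars_dir, x) for x in os.listdir(mmlspark_jars_dir)]

spark = SparkSession.builder \
    .appName("Churn Scoring LightGBM") \
    .master("local[4]") \
    .config("spark.jars", ",".join(mmlspark_jars)) \
    .getOrCreate()

# Assumption: the mmlspark Python package is bundled at the root of this jar, so
# registering it as a py-file also puts it on the Python import path.
main_jar = [j for j in mmlspark_jars if "com.microsoft.ml.spark_mmlspark" in j][0]
spark.sparkContext.addPyFile(main_jar)

from mmlspark.lightgbm import LightGBMClassifier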
Related
I'm trying to build the following setup: Airflow + Apache Spark (in standalone mode) in Docker.
I get an error when running the following code:
from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

# default_args is defined earlier in the DAG file
with DAG(
    dag_id="spark_airflow_dag",
    default_args=default_args,
    schedule_interval="@once",
) as dag:
    transform_2_csv = SparkSubmitOperator(
        application="/usr/local/spark/app/transform_2_csv.py",
        conn_id="SparkLocal",
        task_id="spark_submit_task",
    )

    transform_2_csv
The error in the airflow-scheduler logs:
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.7/site-packages/airflow/providers/apache/spark/operators/spark_submit.py", line 157, in execute
self._hook.submit(self._application)
File "/home/airflow/.local/lib/python3.7/site-packages/airflow/providers/apache/spark/hooks/spark_submit.py", line 427, in submit
f"Cannot execute: {self._mask_cmd(spark_submit_cmd)}. Error code is: {returncode}."
airflow.exceptions.AirflowException: Cannot execute: spark-submit --master spark://spark:7077 --name arrow-spark /usr/local/spark/app/tranform_2_csv.py. Error code is: -9.
I used the airflow-in-docker guide as a starting point and extended it with Spark.
I would be grateful if you could do a basic check of the correctness of these settings:
docker-compose
Dockerfile
After that I created a connection in the Airflow UI.
When I run the following code in a DAG as a PythonOperator, it works:
import os
from pathlib import Path
from pyspark.sql import SparkSession
path = Path("/opt/airflow")
path_resources = path / "resources"
spark = SparkSession.builder.appName("bel").getOrCreate()
raw_json_dataframe = (
spark.read.format("json")
.option("inferSchema", "true")
.load("/usr/local/spark/resources/July.json")
)
raw_json_dataframe.printSchema()
raw_json_dataframe.write.csv(f"{path_resources}/test.csv")
But when I run the same code as a file via the SparkSubmitOperator from this location, it doesn't work:
spark
|__app/
|__transform_2_csv.py
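One hedged observation: a return code of -9 means the spark-submit process was killed with SIGKILL, which inside Docker usually points to the container hitting its memory limit. A minimal sketch under that assumption (the conf value is a placeholder, not something from my setup) would replace the task above with:
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

# Hypothetical tweak: cap the driver memory requested by spark-submit so the Airflow
# worker container's memory limit is not exceeded (the value is a placeholder).
transform_2_csv = SparkSubmitOperator(
    application="/usr/local/spark/app/transform_2_csv.py",
    conn_id="SparkLocal",
    task_id="spark_submit_task",
    conf={"spark.driver.memory": "1g"},
)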
I am building a project in the PyCharm IDE using PySpark.
Spark installed successfully and can be called easily from the command prompt.
The interpreter is also configured correctly in the project settings. I also tried pip install pyspark.
The main.py looks like:
import os
os.environ["SPARK_HOME"] = "/usr/local/spark"
from pyspark import SparkContext
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import *
import pyspark.sql.functions as F
from genericFunc import genericFunction
from config import constants
spark = genericFunction.start_data_pipeline()
inputDf = genericFunction.read_json(constants.INPUT_FOLDER_PATH + "file-000.json")
inputDf1 = genericFunction.read_json(constants.INPUT_FOLDER_PATH + "file-001.json")
and the generic function looks like:
from pyspark.sql import SparkSession

print('w')


def start_data_pipeline():
    # setting up the spark session
    '''
    This function sets up the Spark session and returns it to the __main__
    function.
    '''
    try:
        spark = SparkSession\
            .builder\
            .appName("Nike ETL")\
            .getOrCreate()
        return spark
    except Exception as e:
        raise


def read_json(file_name):
    # reading a JSON file into a DataFrame
    '''
    This function reads the given JSON file into a DataFrame and returns it
    to the caller.
    '''
    try:
        spark = start_data_pipeline()
        df = spark.read \
            .option("header", "true") \
            .option("inferSchema", "true")\
            .json(file_name)
        return df
    except Exception as e:
        raise


def load_as_csv(df, file_name):
    # writing the DataFrame out as CSV
    '''
    This function writes the given DataFrame out as a single CSV file.
    '''
    try:
        df.repartition(1).write.format('com.databricks.spark.csv')\
            .save(file_name, header='true')
    except Exception as e:
        raise
Error:
Unresolved reference 'genericFunc'
"C:\Users\MY PC\PycharmProjects\pythonProject1\venv\Scripts\python.exe" C:/Capgemini/cv/tulsi/test-tulsi/main.py
Traceback (most recent call last):
File "C:/Capgemini/cv/tulsi/test-naveen/main.py", line 6, in <module>
from pyspark import SparkContext
ImportError: No module named pyspark
Process finished with exit code 1
Please help
The problem is that PyCharm creates its own virtual environment (venv) for the project, and that venv does not have the packages installed - in this case pyspark. So you need to point PyCharm to the correct Python interpreter where the packages are available.
You should go to File -> Settings -> Project -> Python Interpreter
and change the Python interpreter to the correct Python that has the packages. To find your Python, run this in your Python shell:
>>> import os
>>> import sys
>>> os.path.dirname(sys.executable)
'C:\\Doc\\'
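An alternative I have not fully verified: keep the PyCharm venv and make Spark visible inside it instead, either by pip-installing pyspark into that venv or by letting findspark locate an existing installation. A rough sketch, assuming SPARK_HOME points at your Spark install:
# Option 1 (shell): install pyspark directly into the project venv
#   venv\Scripts\pip install pyspark
#
# Option 2: reuse an existing Spark install via findspark (pip install findspark)
import os
os.environ.setdefault("SPARK_HOME", "/usr/local/spark")  # adjust to your Spark install

import findspark
findspark.init()  # prepends the pyspark libraries under SPARK_HOME to sys.path

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("venv-check").getOrCreate()
print(spark.version)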
I am trying to submit a Spark Application to the local Kubernetes cluster on my machine (created via Docker Dashboard). The application depends on a python package, let's call it X.
Here is the application code:
import sys
from pyspark import SparkContext
from pyspark.sql import SparkSession

datafolder = "/opt/spark/data"  # folder created in the container by Spark's Dockerfile
sys.path.append(datafolder)     # X is contained inside of datafolder

from X.predictor import *       # import functionality from X

def apply_x_functionality_on(item):
    predictor = Predictor()  # class from X.predictor
    predictor.predict(item)

def main():
    spark = SparkSession\
        .builder\
        .appName("AppX")\
        .getOrCreate()
    sc = spark.sparkContext
    data = []
    # Read data: [no problems there]
    ...
    data_rdd = sc.parallelize(data)  # create RDD
    data_rdd.foreach(lambda item: apply_x_functionality_on(item))  # call function

if __name__ == "__main__":
    main()
Initially I hoped to avoid such problems by putting the X folder into Spark's data folder. When the container is built, all the content of the data folder is copied to /opt/spark/data. My Spark application appends the data folder to the system path and can therefore consume the package X. Or so I thought.
Everything works fine until the .foreach function is called. Here is a snippet from the logs with the error:
20/11/25 16:13:54 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 10.1.0.60, executor 2): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 587, in main
func, profiler, deserializer, serializer = read_command(pickleSer, infile)
File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 74, in read_command
command = serializer._read_with_length(file)
File "/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 172, in _read_with_length
return self.loads(obj)
File "/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 458, in loads
return pickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named 'X'
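As background (my own note, not something stated in the logs): a sys.path change made in the driver process is not visible to the executor Python workers, which is why the import only fails once foreach ships the function to them. A small check that illustrates this, assuming the same /opt/spark/data layout:
import sys
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("path-check").getOrCreate()
sc = spark.sparkContext

sys.path.append("/opt/spark/data")    # driver-only change
print("/opt/spark/data" in sys.path)  # True on the driver

def has_datafolder(_):
    import sys
    return "/opt/spark/data" in sys.path

# Typically [False, False] on the executors, which is why 'import X' fails there.
print(sc.parallelize(range(2), 2).map(has_datafolder).collect())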
There are a lot of similar questions here: one, two, three, but none of the answers to them have helped me so far.
What I have tried:
I submitted the application with a zipped X (I zip it in the container by applying zip to X):
$SPARK_HOME/bin/spark-submit \
--master k8s://https://kubernetes.docker.internal:6443 \
--deploy-mode cluster \
--conf spark.executor.instances=5 \
--conf spark.kubernetes.container.image=kostjaigin/spark-py:v3.0.1-X_0.0.1 \
--py-files "local:///opt/spark/data/X.zip" \
local:///opt/spark/data/MyApp.py
I added the zipped X to the Spark context:
sc.addPyFile("opt/spark/data/X.zip")
I have resolved the issue:
Created a dependencies folder under /opt/spark/data
Put X into dependencies
Inside my Dockerfile I pack the dependencies folder into a zip archive so it can later be submitted via py-files: cd /opt/spark/data/dependencies && zip -r ../dependencies.zip .
In Application:
...
from X.predictor import * # import functionality from X
...
# zipped package
zipped_pkg = os.path.join(datafolder, "dependencies.zip")
assert os.path.exists(zipped_pkg)
sc.addPyFile(zipped_pkg)
...
Add the --py-files flag to the submit command:
$SPARK_HOME/bin/spark-submit \
--master k8s://https://kubernetes.docker.internal:6443 \
--deploy-mode cluster \
--conf spark.executor.instances=5 \
--py-files "local:///opt/spark/data/dependencies.zip" \
local:///opt/spark/data/MyApp.py
Run it
Basically it all comes down to adding a dependencies.zip archive with all the required dependencies in it.
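As a quick sanity check (my addition; the helper name below is made up), you can confirm from inside the job that the executors can actually resolve the package once the zip is attached:
import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("AppX").getOrCreate()
sc = spark.sparkContext
sc.addPyFile(os.path.join("/opt/spark/data", "dependencies.zip"))

def can_import_x(_):
    import X  # resolved from dependencies.zip shipped to each executor
    return X.__name__

print(sc.parallelize([0]).map(can_import_x).collect())  # expect ['X']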
I use PySpark Streaming to read Kafka data, but it goes wrong:
import os
from pyspark.streaming.kafka import KafkaUtils
from pyspark.streaming import StreamingContext
from pyspark import SparkContext
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-streaming-kafka-0-8:2.0.2 pyspark-shell'
sc = SparkContext(appName="test")
sc.setLogLevel("WARN")
ssc = StreamingContext(sc, 60)
kafkaStream = KafkaUtils.createStream(ssc, "localhost:2181", "test-id", {'test': 2})
kafkaStream.map(lambda x: x.split(" ")).pprint()
ssc.start()
ssc.awaitTermination()
________________________________________________________________________________________________
Spark Streaming's Kafka libraries not found in class path. Try one of the following.
1. Include the Kafka library and its dependencies with in the
spark-submit command as
$ bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8:2.4.3 ...
2. Download the JAR of the artifact from Maven Central http://search.maven.org/,
Group Id = org.apache.spark, Artifact Id = spark-streaming-kafka-0-8-assembly, Version = 2.4.3.
Then, include the jar in the spark-submit command as
$ bin/spark-submit --jars <spark-streaming-kafka-0-8-assembly.jar> ...
________________________________________________________________________________________________
Traceback (most recent call last):
File "/home/docs/dp_model/dp_algo_platform/dp_algo_core/test/test.py", line 29, in <module>
kafkaStream = KafkaUtils.createStream(ssc, "localhost:2181", "test-id", {'test': 2})
File "/home/softs/spark-2.4.3-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/streaming/kafka.py", line 78, in createStream
File "/home/softs/spark-2.4.3-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/streaming/kafka.py", line 217, in _get_helper
TypeError: 'JavaPackage' object is not callable
My Spark version is 2.4.3 and my Kafka version is 2.1.0. I also replaced os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-streaming-kafka-0-8:2.0.2 pyspark-shell' with os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-streaming-kafka-0-8:2.4.3 pyspark-shell', but that does not work either. How can I fix this?
I think you should reorder your imports so that the environment variable is set before you import and initialize the Spark modules.
You also definitely need to use the same package version as your Spark version.
import os
sparkVersion = '2.4.3' # update this accordingly
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-streaming-kafka-0-8:{} pyspark-shell'.format(sparkVersion)
# import Spark core
from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext
# import extra packages
from pyspark.streaming.kafka import KafkaUtils
# begin application
spark = SparkSession.builder.appName("test").getOrCreate()
sc = spark.sparkContext
Note: Kafka 0.8 support is deprecated as of Spark 2.3.0
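Since the 0.8 connector is deprecated, a possible alternative (a sketch, not something I have tested against your setup) is the Structured Streaming path via the spark-sql-kafka-0-10 package, which talks to the Kafka brokers directly rather than to ZooKeeper. This assumes Spark 2.4.3 built for Scala 2.11 and a broker at localhost:9092:
import os

# Must be set before any pyspark import, as noted above.
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.3 pyspark-shell'
)

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("test").getOrCreate()

stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")  # broker, not ZooKeeper
          .option("subscribe", "test")
          .load())

# Kafka values arrive as bytes; cast them to strings before processing.
lines = stream.selectExpr("CAST(value AS STRING) AS value")

query = lines.writeStream.outputMode("append").format("console").start()
query.awaitTermination()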
I am facing an issue when connecting to HBase using PySpark; it fails with the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o42.load.
: java.lang.ClassNotFoundException: Failed to find data source: org.apache.spark.sql.execution.datasources.hbase. Please find packages at http://spark.apache.org/third-party-projects.html
HDP Version : 2.6.4.0-91
Spark Ver: 2.2.0.2.6.4.0-91
Python: 2.7.5
Jar used: /usr/hdp/2.6.4.0-91/shc/shc-core-1.1.0.2.6.4.0-91.jar
I tried loading the jar with pyspark --jars /usr/hdp/2.6.4.0-91/shc/shc-core-1.1.0.2.6.4.0-91.jar.
This takes me to the PySpark shell prompt, but when I try to connect to HBase, it fails with the error mentioned above.
Sample Code Executed:
Using Python version 2.7.5 (default, May 31 2018 09:41:32)
SparkSession available as 'spark'.
>>> catalog = ''.join("""{'table': {'namespace': 'default','name': 'books'},'rowkey': 'key','columns': {'title': {'cf': 'rowkey', 'col': 'key', 'type': 'string'},'author': {'cf': 'info', 'col': 'author', 'type': 'string'}}}""".split())
>>>
>>> df = sqlContext.read.options(catalog=catalog).format('org.apache.spark.sql.execution.datasources.hbase').load()
It fails with the error given below:
Traceback (most recent call last):
File "", line 1, in
ImportError: No module named org.apache.spark.sql.execution.datasources.hbase
Try using the --packages and --repositories arguments as mentioned here.
bash$ export SPARK_MAJOR_VERSION=2
bash$ pyspark --packages com.hortonworks:shc-core:1.1.1-2.1-s_2.11 --repositories http://repo.hortonworks.com/content/groups/public/
>>> from pyspark.sql.functions import *
>>> from pyspark.sql.types import *
>>> spark = SparkSession \
.builder \
.enableHiveSupport() \
.getOrCreate()
>>> catalog = ''.join("""{'table': {'namespace': 'default','name': 'books'},'rowkey': 'key','columns': {'title': {'cf': 'rowkey', 'col': 'key', 'type': 'string'},'author': {'cf': 'info', 'col': 'author', 'type': 'string'}}}""".split())
>>> df=spark.read.options(catalog=catalog,newtable=5).format("org.apache.spark.sql.execution.datasources.hbase").load()
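For a non-interactive job, the same dependency resolution can be expressed in code instead of shell flags (a sketch of mine, reusing the package coordinates and catalog from above):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hbase-read")
         .config("spark.jars.packages", "com.hortonworks:shc-core:1.1.1-2.1-s_2.11")
         .config("spark.jars.repositories",
                 "http://repo.hortonworks.com/content/groups/public/")
         .enableHiveSupport()
         .getOrCreate())

catalog = ''.join("""{'table': {'namespace': 'default','name': 'books'},'rowkey': 'key','columns': {'title': {'cf': 'rowkey', 'col': 'key', 'type': 'string'},'author': {'cf': 'info', 'col': 'author', 'type': 'string'}}}""".split())

df = (spark.read
      .options(catalog=catalog, newtable=5)
      .format("org.apache.spark.sql.execution.datasources.hbase")
      .load())
df.show(5)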