spark-submit fails to detect the modules installed with pip - python-3.x

I have Python code with the following third-party dependencies:
import boto3
from warcio.archiveiterator import ArchiveIterator
from warcio.recordloader import ArchiveLoadFailed
import requests
import botocore
from requests_file import FileAdapter
....
I installed the dependencies using pip and confirmed they were installed correctly with pip list. But when I submitted the job to Spark, I received the following errors:
ImportError: No module named 'boto3'
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:395)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
The "No module named" problem occurs not only with 'boto3' but also with other modules.
I tried the following things:
Added SparkContext.addPyFile(".zip files") (see the sketch after this list)
Used spark-submit --py-files
Reinstalled pip
Made sure the PYTHONPATH environment variable was set (export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH) and ran pip install py4j
Used python instead of spark-submit
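Roughly what the addPyFile attempt looked like (the archive name and paths are illustrative, not my exact ones; the zip was built from site-packages, e.g. zip -r deps.zip boto3 botocore warcio requests requests_file):
from pyspark import SparkContext

sc = SparkContext(appName="warc-job")
# register the archive before any executor-side import runs
sc.addPyFile("/home/hadoop/deps.zip")  # hypothetical path to the zipped packages

def probe(_):
    import boto3  # should be resolved from deps.zip on the executor
    return boto3.__version__

print(sc.parallelize([1]).map(probe).collect())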
Software information:
Python version: 3.4.3
Spark version: 2.2.0
Running on EMR-AWS: Linux version 2017.09

Before running spark-submit, open a Python shell and try importing the modules.
Also check which Python shell (i.e., which Python path) opens by default.
If you can import these modules successfully in the Python shell (the same Python version you are trying to use with spark-submit), check the following:
In which mode are you submitting the application? Try standalone, or if on YARN, try client mode.
Also try adding export PYSPARK_PYTHON=(your python path)

All the checks mentioned above passed, but setting PYSPARK_PYTHON is what solved the issue for me.
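A quick way to confirm the setting actually reached the executors is to compare interpreters from inside a job; a minimal sketch (the app name is arbitrary):
import sys
from pyspark import SparkContext

sc = SparkContext(appName="env-check")

def executor_python(_):
    import sys
    return sys.executable

# both paths should point at the interpreter where boto3 and friends are installed
print("driver:", sys.executable)
print("executors:", sc.parallelize([0, 1]).map(executor_python).distinct().collect())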

Related

How to provide a matplotlib Python dependency to spark-submit?

I have inherited a web application that delegates some of its work to other machines via spark-submit.
spark-submit -v --py-files /somedir/python/pyspark_utils*.egg --master yarn /somedir/python/driver.py arg1 gwas-manhattan-plot 58
Our code relies on matplotlib for plotting, so we build using setuptools with
python setup.py bdist_egg.
The setuptools web page says:
When your project is installed (e.g., using pip), all of the
dependencies not already installed will be located (via PyPI),
downloaded, built (if necessary), and installed...
Within the generated pyspark_utils-0.0.1-py3.9.egg is a requires.txt file with the contents:
matplotlib
numpy
Unfortunately, when I run the spark-submit command, it can't find matplotlib:
ModuleNotFoundError: No module named 'matplotlib'
While the egg file indicates a dependency on matplotlib, it doesn't actually include the packaged code. How can I indicate to spark-submit where to find matplotlib? I thought of adding another --py-files argument for matplotlib, but matplotlib is available as a wheel, not as an egg file, and as I understand it spark-submit can't handle wheels.
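For context, a minimal setuptools configuration along these lines produces exactly that requires.txt; install_requires only records dependency names, it does not bundle matplotlib's code into the egg:
from setuptools import setup, find_packages

setup(
    name="pyspark_utils",
    version="0.0.1",
    packages=find_packages(),
    # these names end up in EGG-INFO/requires.txt; the packages themselves are not copied in
    install_requires=["matplotlib", "numpy"],
)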

SparkSession | Ubuntu | Pycharm not working

I'm trying to use PySpark locally on Ubuntu with PyCharm rather than a Jupyter notebook, in order to build an Electron app. However, when I try to set up a SparkSession, it doesn't work. When I try this:
spark = SparkSession.builder.master('local[*]').appName('Search').enableHiveSupport().getOrCreate
df = pd.DataFrame([1,2,3], columns=['Test'])
myschema = StructType([StructField('Test', IntegerType(), True)])
df2 = spark.createDataFrame(df,schema=myschema)
print(type(df2))
the session opens but it tells me
"AttributeError: 'function' object has no attribute 'createDataFrame' "
Then, rewriting the above with ".getOrCreate()", it tells me
"FileNotFoundError: [Errno 2] No such file or directory: 'home/...././bin/spark-submit'"
I guess the setup in PyCharm might be off, but I don't really understand why.
You need to use the method invocation getOrCreate(), not getOrCreate. Also, make sure you install pyspark in the Python interpreter used for your project in PyCharm. You can check it via Preferences -> Python Interpreter in PyCharm.
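A corrected sketch of the snippet above, assuming pyspark and pandas are installed in the project interpreter:
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType

# note the parentheses: getOrCreate() returns the session, getOrCreate is just the method object
spark = SparkSession.builder.master('local[*]').appName('Search').enableHiveSupport().getOrCreate()

df = pd.DataFrame([1, 2, 3], columns=['Test'])
myschema = StructType([StructField('Test', IntegerType(), True)])
df2 = spark.createDataFrame(df, schema=myschema)
print(type(df2))  # <class 'pyspark.sql.dataframe.DataFrame'>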
Update:
Try downloading and extracting the Spark binaries (e.g. Spark 2.4.0) on your local machine, then add the following entries to your .bashrc (and source it). I'm assuming you're using Spark 2.4.0, so the py4j version below is specific to it. For any other version of Spark, check the bundled py4j version and adjust accordingly.
export SPARK_HOME=/<your_path>/spark-2.4.0-bin-hadoop2.7
export PYTHONPATH=${SPARK_HOME}/python:$PYTHONPATH
export PYTHONPATH=${SPARK_HOME}/python/lib/py4j-0.10.7-src.zip:$PYTHONPATH
export PYSPARK_PYTHON=/<location_of_python_interpreter>
Whichever Python interpreter you point PYSPARK_PYTHON to, make sure to use the same interpreter in your PyCharm project.

add python3 library to python3.6 path

Hello, I'm new to Python 3 and Python 3.6. I usually use pip3 install to add my old libraries to my Python 3 path.
I recently picked up Python 3.6 for managing my home servers with its asyncio functionality; however, my Python 3.6 interpreter is unable to locate pwnlibs, so I am unable to reuse my old code.
I tried:
import sys
import os
sys.path.append(os.path.abspath(pwn.__file__))
import pwn
Debugging results:
On Python 3.4, os.path.abspath(pwn.__file__) returns the correct path to the library.
While sys.path.append is a valid means to add to your python path at runtime, you can't get the path from a module you have not yet loaded.
Instead you should install your packages using pip3 or in a specific location and add that to your path either at runtime or via the PYTHONPATH environment variable.
Since you already mentioned pip3, I would guess that you had the pwn package installed when you tried with 3.4, and that your Python 3.6 install is not using the same paths as your Python 3.4 install. Try comparing the paths of 3.4 and 3.6 by comparing their sys.path output.
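A minimal way to do that comparison: run the snippet below under both interpreters (python3.4 and python3.6) and diff the output.
import sys

# which interpreter is running, and where it searches for packages
print(sys.executable)
for path in sys.path:
    print(path)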
Lastly, just as a note, if you are using the pwntools package, it doesn't yet support Python 3, so if you are simply copying the folder, be aware it might not function correctly or at all.

Error in Dask connecting to HDFS

I was trying to connect to HDFS using Dask, following the blog, and then I installed hdfs3 from the docs using conda.
When I import hdfs3, it gives me an error:
ImportError: Can not find the shared library: libhdfs3.so See
installation instructions at
http://hdfs3.readthedocs.io/en/latest/install.html
I found a GitHub link, but the solution specified there does not work for Python 3.5.2,
as libprotobuf 2.5.0 is for Python 2.7, and libprotobuf 2.6.1 also doesn't work. Is there any workaround for connecting to HDFS using Dask?
Thank you.

Anaconda3 libhdf5.so.9: cannot open shared object file [works fine on py2.7 but not on py3.4]

I just tried to use pd.HDFStore in IPython Notebook with a Python 3 kernel (Anaconda 2&3 on Ubuntu 14.04)
import pandas as pd
store = pd.HDFStore('/home/Jian/Downloads/test.h5')
but it throws the following error
ImportError: HDFStore requires PyTables, "libhdf5.so.9: cannot open shared object file: No such file or directory" problem importing
I initially thought it was because pytables was somehow missing, but when I check with $ source activate py34 and $ conda list, pytables 3.2.0 is already installed under the Anaconda Python 3 environment.
Also, if I switch to Python 2 (for example, $ source activate py27) and start ipython notebook, it works properly and no import error is thrown.
I guess I must be missing something in configuring pytables under the Anaconda Python 3 environment, but I cannot figure it out. Any help is highly appreciated.
Update:
I just tried a fresh install of Anaconda3-2.3.0-Linux-x86_64 from the official website, and it ends up with the same error. When I run $ locate libhdf5.so.9 on the command line, nothing shows up.
This is a known issue that we are working on. When it is fixed, conda update --all will update the libraries and fix the issue.
