How to add a module folder /tar.gz to nodes in Pyspark - apache-spark

I am running pyspark in Ipython Notebook after doing following configuration
export PYSPARK_DRIVER_PYTHON=/usr/local/bin/jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook--NotebookApp.open_browser=False --NotebookApp.ip='*' --NotebookApp.port=8880"
export PYSPARK_PYTHON=/usr/bin/python
I am having a custom udf function, which makes use of a module called mzgeohash. But, I am getting module not found error, I guess this module might be missing in workers / nodes .I tried to add sc.addpyfile and all. But, what will be the effective way to add a cloned folder or tar.gz python module in this case , from Ipython .

Here is how I do it, basically the idea is to create a zip of all the files in your module and pass it to sc.addPyFile() :
import dictconfig
import zipfile
def ziplib():
libpath = os.path.dirname(__file__) # this should point to your packages directory
zippath = '/tmp/mylib-' + rand_str(6) + '.zip' # some random filename in writable directory
zf = zipfile.PyZipFile(zippath, mode='w')
try:
zf.debug = 3 # making it verbose, good for debugging
zf.writepy(libpath)
return zippath # return path to generated zip archive
finally:
zf.close()
...
zip_path = ziplib() # generate zip archive containing your lib
sc.addPyFile(zip_path) # add the entire archive to SparkContext
...
os.remove(zip_path) # don't forget to remove temporary file, preferably in "finally" clause

Related

Python class cannot open file when object is created from directory below

I have a app.py file which is using a custom class which I created one level below, ie.
from myfolder import util_file
def get_data():
dataContainer = util_file.MyDataContainer()
# do more stuff
One folder down my util_file I create MyDataContainer as a class that uses data from two local files to instantiate itself. ie.
class MyDataContainer:
def __init__(self):
with open('file1.txt', 'r'):
#get needed init data part 1
#repeat operation with data in file2.txt
The object is created fine when I run from the file itself, ie.
if __name__ == '__main__':
testcont = MyDataContainer()
The issue is that when I run the code in app.py I get:
FileNotFoundError: [Errno 2] No such file or directory: 'file1.txt'
thrown at the line
with open('file1.txt', 'r'):
I am guessing that python is unable to see the file1.txt and file2.txt since it is in another directory at runtime but I checked the my path and the subfolder is in sys.path at runtime, so I believe these should be visible. I really do not understand what is going on here. Is there a different reason python is unable to see these files?

How to get Sphinx to recognize pandas as a module

I am using Python 3.9.2 with sphinx-build 3.5.2. I have created a project with the following directory and file structure
core_utilities
|_core_utilities
| |_ __init__.py
| |_read_files.py
|_docs
|_ sphinx
|_Makefile
|_conf.pg
|_source
| |_conf.py
| |_index.rst
| |_read_files.rst
|_build
The read_files.py file contains the following code. NOTE: I simplified this example so it would not have distracting information. Within this code their is one class that contains one member function to read a text file, look for a key word and read the variable to the right of the keyword as a numpy.float32 variable. The stand-alone function written to read in a csv file with specific data types and save it to a pandas dataframe.
# Import packages here
import os
import sys
import numpy as np
import pandas as pd
from typing import List
class ReadTextFileKeywords:
def __init__(self, file_name: str):
self.file_name = file_name
if not os.path.isfile(file_name):
sys.exit('{}{}{}'.format('FATAL ERROR: ', file_name, ' does not exist'))
# ----------------------------------------------------------------------------
def read_double(self, key_words: str) -> np.float64:
values = self.read_sentence(key_words)
values = values.split()
return np.float64(values[0])
# ================================================================================
# ================================================================================
def read_csv_columns_by_headers(file_name: str, headers: List[str],
data_type: List[type],
if not os.path.isfile(file_name):
sys.exit('{}{}{}'.format('FATAL ERROR: ', file_name, ' does not exist'))
dat = dict(zip(headers, data_type))
df = pd.read_csv(file_name, usecols=headers, dtype=dat, skiprows=skip)
return df
The conf.py file as the following information in it.
# Configuration file for the Sphinx documentation builder.
#
# This file only contains a selection of the most common options. For a full
# list see the documentation:
# https://www.sphinx-doc.org/en/master/usage/configuration.html
# -- Path setup --------------------------------------------------------------
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
import os
import sys
sys.path.insert(0, os.path.abspath('../../../core_utilities'))
# -- Project information -----------------------------------------------------
project = 'Core Utilities'
copyright = 'my copyright'
author = 'my name'
# The full version, including alpha/beta/rc tags
release = '0.1.0'
# -- General configuration ---------------------------------------------------
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = ['sphinx.ext.todo', 'sphinx.ext.viewcode', 'sphinx.ext.autodoc',
'sphinx.ext.autosummary', 'sphinx.ext.githubpages']
autodoc_member_order = 'groupwise'
autodoc_default_flags = ['members', 'show-inheritance']
autosummary_generate = True
autodock_mock_imports = ['numpy']
# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']
# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This pattern also affects html_static_path and html_extra_path.
exclude_patterns = []
# -- Options for HTML output -------------------------------------------------
# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
#
html_theme = 'nature'
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']
The index.rst file has the following information.
. Core Utilities documentation master file, created by
sphinx-quickstart
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.
Welcome to Core Utilities's documentation!
==========================================
.. toctree::
:maxdepth: 2
:caption: Contents:
read_files
Indices and tables
==================
* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`
and the read_files.rst file has the following information
**********
read_files
**********
The ``read_files`` module contains methods and classes that allow a user to read different
types of files in different ways. Within the module a user will find functionality that
allows them to read text files and csv files by columns and keywords. This module
also contains functions to read sqlite databases, xml files, json files, yaml files,
and html files
.. autoclass:: test.ReadTextFileKeywords
:members:
.. autofunction:: test.read_csv_columns_by_headers
In addition, I am using a virtual environment that is active when I try to compile this with command line. I run the following command to compile html files with sphinx.
sphinx-build -b html source build
When I run the above command it fails with the following error.
WARNING: autodoc: failed to import class 'ReadTextFileKeywords' from module 'read_files'; the following exception was raised; No module named pandas.
If I delete the line from pandas import pd and then delete the function read_csv_columns_by_headers along with the call to the function in the read_files.rst file, everything compiles fine. It appears that for some reason sphinx is able to find numpy, but it for some reason does not seem to recognize pandas, although both exist in the virtual environment and were loaded with a pip3 install statement. Does anyone know why sphinx is able to find other modules, but not pandas?
Add pandas to autodoc
autodoc_mock_imports = ['numpy','pandas']

Sphinx cannot find pandas

I am trying to document a Python class using sphinx. I am used Python 3.9.2 and sphinx-build 3.5.2. I am running the command sphinx-build -b html source build to transform the .rst files into .html files. I have already successfully executed this configuration on other classes and functions in the code-suite, so I know their is nothing wrong with the way I have organized my files. When I run the command I get the following error;
Warning: autodoc: failed to import class 'ReadTextFileKeywords' from module 'read_files'; the following exception was raised: No module named pandas
I do have the most recent version of pandas loaded globally and in my virtual environment. Does anyone know why it cannot load pandas and how I can fix this issue?
For reference, the file structure looks like this;
core_utilities
|_core_utilities
|_docs
|_sphinx
|_Makefile
|_build
| |_html files
|_source
|_index.rst
|_Introduction.rst
|_read_files.rst
|_operating_system.rst
|_conf.py
The conf.py file has the following information
# Configuration file for the Sphinx documentation builder.
#
# This file only contains a selection of the most common options. For a full
# list see the documentation:
# https://www.sphinx-doc.org/en/master/usage/configuration.html
# -- Path setup --------------------------------------------------------------
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
import os
import sys
sys.path.insert(0, os.path.abspath('../../../core_utilities'))
# -- Project information -----------------------------------------------------
project = 'Core Utilities'
copyright = '2021, Jonathan A. Webb'
author = 'Jonathan A. Webb'
# The full version, including alpha/beta/rc tags
release = '0.1.0'
# -- General configuration ---------------------------------------------------
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = ['sphinx.ext.todo', 'sphinx.ext.viewcode', 'sphinx.ext.autodoc',
'sphinx.ext.autosummary', 'sphinx.ext.githubpages']
autodoc_member_order = 'groupwise'
autodoc_default_flags = ['members', 'show-inheritance']
autosummary_generate = True
# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']
# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This pattern also affects html_static_path and html_extra_path.
exclude_patterns = []
# -- Options for HTML output -------------------------------------------------
# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
#
html_theme = 'nature'
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']
and the index file has the following format
.. Core Utilities documentation master file, created by
sphinx-quickstart
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.
Welcome to Core Utilities's documentation!
==========================================
.. toctree::
:maxdepth: 2
:caption: Contents:
Introduction
operating_system
read_files
Indices and tables
==================
* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`
Sphinx works just fine when I import docstrings from the operating_system module, so I know that is not the problem. It fails when I try to import from the read_files module. The read_files.rst file has hte following format.
**********
read_files
**********
The ``read_files`` module contains methods and classes that allow a user to read different
types of files in different ways. Within the module a user will find functionality that
allows them to read text files and csv files by columns and keywords. This module
also contains functions to read sqlite databases, xml files, json files, yaml files,
and html files
.. autoclass:: read_files.ReadTextFileKeywords
:members:
And the class it is reading, to include the file has the following format
# Import packages here
import os
import sys
import numpy as np
import pandas as pd
from typing import List
import sqlite3
# ================================================================================
# ================================================================================
# Date: Month Day, Year
# Purpose: Describe the purpose of functions of this file
# Source Code Metadata
__author__ = "Jonathan A. Webb"
__copyright__ = "Copyright 2021, Jon Webb Inc."
__version__ = "1.0"
# ================================================================================
# ================================================================================
# Insert Code here
class ReadTextFileKeywords:
"""
A class to find keywords in a text file and the the variable(s)
to the right of the key word. This class must inherit the
``FileUtilities`` class
:param file_name: The name of the file being read to include the
path-link
For the purposes of demonstrating the use of this class, assume
a text file titled ``test_file.txt`` with the following contents.
.. code-block:: text
sentence: This is a short sentence!
float: 3.1415 # this is a float comment
double: 3.141596235941 # this is a double comment
String: test # this is a string comment
Integer Value: 3 # This is an integer comment
float list: 1.2 3.4 4.5 5.6 6.7
double list: 1.12321 344.3454453 21.434553
integer list: 1 2 3 4 5 6 7
"""
def __init__(self, file_name: str):
self.file_name = file_name
if not os.path.isfile(file_name):
sys.exit('{}{}{}'.format('FATAL ERROR: ', file_name, ' does not exist'))

python3, directory is not correct when import a module in sub-folder

I have a main folder called 'test', the inner structure is:
# folders and files in the main folder 'test'
Desktop\test\use_try.py
Desktop\test\cond\__init__.py # empty file.
Desktop\test\cond\tryme.py
Desktop\test\db\
Now in the file tryme.py. I want to generate a file in the folder of 'db'
# content in the file of tryme.py
import os
def main():
cwd = os.getcwd() # the directory of the folder 'Desktop\test\cond'
folder_test = cwd[:-4] # -4 since 'cond' has 4 letters
folder_db = folder_test + 'db/' # the directory of folder 'db'
with open(folder_db + 'db01.txt', 'w') as wfile:
wfile.writelines(['This is a test.'])
if __name__ == '__main__':
main()
If I directly run this file, no problem, the file 'db01.txt' is in the folder of 'db'.
But if I run the file of use_try.py, it will not work.
# content in the file of use_try.py
from cond import tryme
tryme.main()
The error I got refers to the tryme.py file. In the command of 'with open ...'
FileNotFoundError: [Error 2] No such file or directory: 'Desktop\db\db01.txt'
It seems like the code
'os.getcwd()'
only refers to the file that calls the tryme.py file, not the tryme.py file itself.
Do you know how to fix it, so that I can use the file use_try.py to generate the 'db01.txt' in the 'db' folder? I am using Python3
Thanks
Seems like what you need is not the working directory, but the directory of the tryme.py file.
This can be resolved using the __file__ magic:
curdir = os.path.dirname(__file__)
Use absolute filenames from an environment variable, or expect the db/ directory to be a subdirectory of the current working directory.
This behavior is as expected. The current working directory is where you invoke the code from, not where the code is stored.
folder_test = cwd # assume working directory will have the db/ subdir
or
folder_test = os.getEnv('TEST_DIR') # use ${TEST_DIR}/db/

Loading python modules in Python 3

How do I load a python module, that is not built in. I'm trying to create a plugin system for a small project im working on. How do I load those "plugins" into python? And, instaed of calling "import module", use a string to reference the module.
Have a look at importlib
Option 1: Import an arbitrary file in an arbiatrary path
Assume there's a module at /path/to/my/custom/module.py containing the following contents:
# /path/to/my/custom/module.py
test_var = 'hello'
def test_func():
print(test_var)
We can import this module using the following code:
import importlib.machinery
myfile = '/path/to/my/custom/module.py'
sfl = importlib.machinery.SourceFileLoader('mymod', myfile)
mymod = sfl.load_module()
The module is imported and assigned to the variable mymod. We can then access the module's contents as:
mymod.test_var
# prints 'hello' to the console
mymod.test_func()
# also prints 'hello' to the console
Option 2: Import a module from a package
Use importlib.import_module
For example, if you want to import settings from a settings.py file in your application root folder, you could use
_settings = importlib.import_module('settings')
The popular task queue package Celery uses this a lot, rather than giving you code examples here, please check out their git repository

Resources