Using code_path in mlflow.pyfunc models on Databricks

We are using Databricks over AWS infra, registering models on mlflow.
We write our in-project imports as from src.(module location) import (objects).
Following examples online, I expected that calling mlflow.pyfunc.log_model(..., code_path=['PROJECT_ROOT/src'], ...) would add the entire code tree to the model's runtime environment and thus allow us to keep our imports as they are.
When logging the model, I get a long list of [Errno 95] Operation not supported errors, one for each notebook in our repo. This blocks us from registering the model to mlflow.
We have used several ad-hoc solutions and workarounds, from forcing ourselves to keep all code in one file, to only working with files in the same directory (code_path=['./filename.py']), to adding specific libraries (and changing import paths accordingly), and so on.
However, none of these is optimal. As a result we either duplicate code (killing DRY), or we put some imports inside the wrapper (i.e. imports that cannot run in our working environment, since it differs from the environment the model will see when deployed), etc.
We have not yet tried putting all the notebooks (which we believe cause the [Errno 95] Operation not supported errors) in a separate folder. That would be highly disruptive to our current setup and processes, and we'd like to avoid it as much as we can.
Please advise.

I had a similar struggle with Databricks when using custom model logic from an src directory (similar structure to cookiecutter-data-science). The solution was to log the entire src directory using the relative path.
So if you have the following project structure:
.
├── notebooks
│   └── train.py
└── src
    ├── __init__.py
    └── model.py
Your train.py should look like this (note that AddN comes from the MLflow docs):
import mlflow
from src.model import AddN

model = AddN(n=5)

mlflow.pyfunc.log_model(
    registered_model_name="add_n_model",
    artifact_path="add_n_model",
    python_model=model,
    code_path=["../src"],
)
This will copy all the code in src/ and log it in the MLflow artifact, allowing the model to load all of its dependencies.
If you are not using a notebooks/ directory, set code_path=["src"]. If you are using sub-directories like notebooks/train/train.py, set code_path=["../../src"].
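For reference, a sketch of what src/model.py might contain, based on the AddN example in the MLflow docs:

import mlflow.pyfunc

class AddN(mlflow.pyfunc.PythonModel):
    # Toy pyfunc model from the MLflow docs: adds `n` to each value of the input.
    def __init__(self, n):
        self.n = n

    def predict(self, context, model_input):
        # model_input is passed in as a pandas DataFrame when the model is loaded as pyfunc
        return model_input.apply(lambda column: column + self.n)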

Related

ModuleNotFoundException: No module named utils

In our Airflow dags project, we have created a package called utils. In this package, we have created many Python files.
Recently I created a new file called github_util.py where I have written some code to interact with GitHub APIs.
There is another Python file called mail_forms.py in the utils folder.
I am importing github_util.py in mail_forms.py using
from utils.github_util import task_failed_github_issue
The content of github_util.py is:
def task_failed_github_issue(context):
    print("only function")
In the utils package, we have an empty __init__.py file.
When I deploy this code in Airflow, I get the error below:
ModuleNotFoundException: No module named 'utils'.
We have files called backend_util.py and entity_util.py in the same utils folder. entity_util is also imported into the backend file as below:
from utils.entity_util import read_definitions
This import works correctly. I am not able to understand why that import works and mine does not.
I have referred to many links on Stack Overflow and other websites, but none of the solutions worked for me.
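For reference, the layout described above appears to be roughly the following (file names are taken from the question; the enclosing dags/ folder name is only an assumption):
dags/
└── utils/
    ├── __init__.py        (empty)
    ├── github_util.py
    ├── mail_forms.py
    ├── backend_util.py
    └── entity_util.py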

Export all packages from one file, and then import them from this unique source

I would like to export all my packages from one single file, say exportAll.js, and then across the code import the ones I need per file using something like:
import { package1, package2 } from '../exportAll.js'
However, I am concerned about the performance impact. I believe that after the first import, JavaScript saves the reference for future imports. But I also believe JavaScript will first load every module re-exported by exportAll.js even when I only want package1 and package2, so it may slow down the loading of the first page that imports anything from exportAll.
Can you help me understand whether this is bad for performance, or bad practice for any other reason?

Import ValueError: attempted relative import beyond top-level package

I have a folder structure for a project with a small application and some additional scripts, which are not directly used by the application (they are used for some data science on data in the data folder):
project
├── data
├── doc
└── src
    ├── scripts1
    │   ├── script1_1.py
    │   └── script1_2.py
    ├── scripts2
    │   └── script2_1.py
    └── application
        ├── common
        │   ├── __init__.py
        │   └── config.py
        ├── enrichment
        │   ├── __init__.py
        │   ├── module1.py
        │   └── module2.py
        ├── __init__.py
        └── app.py
Script app.py is always the entry point and is used for starting the application. I want to be able to use relative imports in this folder structure. For example, I want to import the Configuration class, which is located in config.py, inside module1.py. But when I run app.py with this import statement in module1.py:
from ..common.config import Configuration
I get the following error:
File ".../project/src/application/enrichment/module1.py", line 6, in <module>
from ..common.config import Configuration
ValueError: attempted relative import beyond top-level package
I would also need to import the enrichment modules in app.py; I guess it should work similarly:
from .enrichment.module1 import <func or class>
I have read multiple threads on module importing, but I am still not able to achieve the behavior where I can use these relative imports without getting the ValueError. In one older project I used path manipulation in __init__.py files, but I hope this can be solved in a better way, because that felt a bit like magic to me. Thanks a lot for any help.

Is it possible to override the automated assignment of uuid for filenames when writing datasets with pyarrow.parquet?

Say I have a pandas DataFrame df that I would like to store on disk as a dataset using pyarrow parquet. I would do this:
table = pyarrow.Table.from_pandas(df)
pyarrow.parquet.write_to_dataset(table, root_path=some_path, partition_cols=['a',])
On disk the dataset would look something like this:
some_path
├── a=1
│   └── 4498704937d84fe5abebb3f06515ab2d.parquet
└── a=2
    └── 8bcfaed8986c4bdba587aaaee532370c.parquet
Q: Is it possible for me to override the auto-assignment of the long UUID as filename somehow during the dataset writing? My purpose is to be able to overwrite the dataset on disk when I have a new version of df. Currently if I try to write the dataset again, another new uniquely named [UUID].parquet file will be placed next to the old one, with the same, redundant data.
For anyone who's also interested in the development of this issue, it is solved as of pyarrow version 0.15.0, with great thanks to the open source community (Jira issue link).
Following the example used in the question:
pyarrow.parquet.write_to_dataset(
    table,
    some_path,
    ['a',],
    partition_filename_cb=lambda x: '-'.join(x) + '.parquet',
)
would produce a saved dataset like this:
some_path
├── a=1
│   └── 1.parquet
└── a=2
    └── 2.parquet
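For completeness, here is a self-contained sketch of the whole flow under the same assumptions (pyarrow >= 0.15; the DataFrame contents are made up for illustration):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Illustrative data; column 'a' is the partition key
df = pd.DataFrame({'a': [1, 1, 2], 'b': [10, 20, 30]})
table = pa.Table.from_pandas(df)

pq.write_to_dataset(
    table,
    root_path='some_path',
    partition_cols=['a'],
    # Name each partition's file after its partition value(s) instead of a random UUID,
    # so re-writing the dataset overwrites the existing files.
    partition_filename_cb=lambda keys: '-'.join(str(k) for k in keys) + '.parquet',
)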

How to structure my little python framework

I wrote a simple set of Python 3 files for emulating a small set of MongoDB features on a 32-bit platform. I fired up PyCharm and put together a directory that looked like:
minu/
    client.py
    database.py
    collection.py
    test_client.py
    test_database.py
    test_collection.py
My imports are simple. For example, client.py has the following at the top:
from collection import Collection
Basically, client has a Client class, collection has a Collection class, and database has a Database class. Not too tough.
As long as I cd into the minu directory, I can fire up a python3 interpreter and do things like:
>>> from client import Client
>>> c = Client(pathstring='something')
And everything just works. I can run the test_files as well, which use the same sorts of imports.
I'd like to modularize this so I can use it in another project by just dropping the minu directory alongside my application's .py files and having everything work. When I do this, though, and run python3 from another directory, the local imports don't work. I placed an empty __init__.py in the minu directory. That made it possible to import minu, but the other imports broke. I tried using things like from .collection import Collection (adding the dot), but then I can't run things in the original directory anymore, like I could before. What is the simple/right way to do this?
I have looked around a bit with Dr. Google, but none of the examples really clarify it well; feel free to point out the one I missed.
In the package's __init__.py file (.../minu/__init__.py), import the submodules you wish to expose externally.
If __init__.py contains the following lines, and client.py has a variable foo:
from . import client
from . import collection
from . import database
Then from above the minu directory, the following will work:
from minu.client import foo
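For example (assuming client.py itself switches to the relative form from .collection import Collection mentioned in the question), a hypothetical consumer script placed next to the minu directory could then do:

# use_minu.py -- hypothetical script sitting alongside the minu/ directory
from minu.client import Client

c = Client(pathstring='something')  # same usage as in the question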
