Confusion regarding joblib.dump() - scikit-learn

One way to save sklearn models is to use joblib.dump(model, filename). I am confused about the filename argument. One way to call this function is:
joblib.dump(model,"model.joblib")
This saves the model successfully, and the model is also loaded correctly using:
model=joblib.load("model.joblib")
Another way is to use:
joblib.dump(model,"model")
With no ".joblib" extension this time. This also runs successfully, and the model is loaded correctly using:
model=joblib.load("model")
What confuses me is the file extension in the filename. Is there a certain file extension I should use for saving the model, or is it unnecessary to use one, as I did above? If it is unnecessary, why?

There is no file extension that "must" be used to serialize a model. You can specify the compression method by using one of the supported filename extensions (.z, .gz, .bz2, .xz or .lzma); when compression is enabled without an explicit method, joblib defaults to zlib.
Therefore you can use any file extension. However, it is good practice to use the library name as the extension so that you know how to load the file later.
I name my serialized model model.pickle when I am using the pickle library and model.joblib when I am using joblib.
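For example, here is a minimal sketch (the estimator and dataset are placeholders, only the joblib calls matter) showing that the extension is optional and only influences the compression method joblib picks:
import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Placeholder model, just to have something to serialize
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

joblib.dump(model, "model")               # no extension, works fine
joblib.dump(model, "model.joblib")        # conventional extension, still uncompressed
joblib.dump(model, "model.joblib.gz")     # gzip compression inferred from the extension
joblib.dump(model, "model.pkl", compress=3)  # explicit zlib compression level

restored = joblib.load("model.joblib")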

Related

Best way to store XML needed for api calls in a Python Package

I have a question about where to store XML data needed for making API calls in a Python package I am building. For some of the API calls in my package I need to provide a binary XML string in the request. Right now I am reading it in from the same directory that my source code is kept in. What would be the best way to store this XML for use at run time after my project has been packaged and installed through pip? I read about package_data from the setuptools package, but I'm not sure how you would open the files included there when an API call is made.
Below is an example of how I am currently executing one of these API calls, and my current directory structure.
def make_api_call(self):
    # The line below extracts credentials for the API call from the database
    API_CREDENTIALS = self.stage_request()
    # request_input.xml contains the XML needed to make the request
    data = open('./request_input.xml', 'rb').read()
    r = requests.post(url='apiurl', data=data,
                      auth=(API_CREDENTIALS[0], API_CREDENTIALS[1]),
                      headers={'Content-Type': 'text/xml'})
-----Package
-------Packages
---------module1.py
---------module2.py
---------module3.py
---------request_input.xml
-----setup.py
-----README.md
To read the package data, use one of the following (a short sketch with importlib.resources follows the list)...
importlib.resources
pkgutil.get_data()
pkg_resources from setuptools
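For example, a minimal sketch using importlib.resources, assuming the XML ships inside a package named mypackage and is declared via package_data (the package and file names here are placeholders):
import importlib.resources

# setup.py would need e.g. package_data={'mypackage': ['request_input.xml']}
# so the XML is installed alongside the code.
# files() requires Python 3.9+; older versions can use
# importlib.resources.read_binary('mypackage', 'request_input.xml') instead.
data = importlib.resources.files('mypackage').joinpath('request_input.xml').read_bytes()
This loads the bytes no matter where pip installed the package, so the code no longer depends on the current working directory.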

Use images in s3 with SageMaker without .lst files

I am trying to create (what I thought was) a simple image classification pipeline between s3 and SageMaker.
Images are stored in an s3 bucket with their class labels in their file names currently, e.g.
My-s3-bucket-dir
cat-1.jpg
dog-1.jpg
cat-2.jpg
..
I've been trying to leverage several related example .py scripts, but most seem to download data sets that are already in .rec format or that come with special manifest or annotation files I don't have.
All I want is to pass the images from s3 to the SageMaker image classification algorithm that's located in the same region, IAM account, etc. I suppose this means I need a .lst file.
When I try to create the .lst manually, it doesn't seem to be accepted, and doing the work by hand takes too long to be good practice.
How can I automatically generate the .lst file (or otherwise send the images/classes for training)?
Things I read made it sound like im2rec.py was a solution, but I don't see how. The example I'm working with now is
Image-classification-fulltraining-highlevel.ipynb
but it seems to download the data as .rec,
download('http://data.mxnet.io/data/caltech-256/caltech-256-60-train.rec')
download('http://data.mxnet.io/data/caltech-256/caltech-256-60-val.rec')
which just skips working with the .jpeg files. I found another example that converts them to .rec, but again it essentially has the .lst already (as .json) and just converts it.
I have mostly been working in a Python Jupyter notebook within the AWS console (in my browser) but I have also tried using their GUI.
How can I simply and automatically generate the .lst or otherwise get the data/class info into SageMaker without manually creating a .lst file?
Update
It looks like im2rec.py can't be run against s3. You'd have to completely download everything from the s3 buckets into the notebook's storage first...
Please note that [...] im2rec.py is running locally, therefore cannot take input from the S3 bucket. To generate the list file, you need to download the data and then use the im2rec tool. - AWS SageMaker Team
There are 3 options to provide annotated data to the Image Classification algo: (1) packing labels in recordIO files, (2) storing labels in a JSON manifest file (the "augmented manifest" option), (3) storing labels in a list (.lst) file. All options are documented here: https://docs.aws.amazon.com/sagemaker/latest/dg/image-classification.html.
The augmented manifest and .lst options are quick to set up, since each just requires you to create an annotation file, usually with a short for loop. RecordIO requires the im2rec.py tool, which is a little more work.
For example, a .lst file can be written like this:
# assuming train_index, train_class, train_pics hold the picture index, class label and path
with open('train.lst', 'a') as file:
    for index, cl, pic in zip(train_index, train_class, train_pics):
        file.write(str(index) + '\t' + str(cl) + '\t' + pic + '\n')
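To answer the "automatically generate" part, here is a hedged sketch of how those three lists could be built straight from the bucket, assuming the cat-1.jpg / dog-1.jpg naming scheme from the question (the bucket name and label mapping are placeholders):
import boto3

s3 = boto3.client('s3')
bucket = 'my-s3-bucket'          # placeholder bucket name
classes = {'cat': 0, 'dog': 1}   # placeholder mapping from name prefix to numeric label

train_index, train_class, train_pics = [], [], []
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=bucket):
    for obj in page.get('Contents', []):
        key = obj['Key']
        if not key.endswith('.jpg'):
            continue
        prefix = key.rsplit('/', 1)[-1].split('-')[0]   # "cat-1.jpg" -> "cat"
        train_index.append(len(train_index))
        train_class.append(classes[prefix])
        train_pics.append(key)
The resulting train.lst (with paths relative to the S3 prefix used as the training channel) can then be uploaded next to the images.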

How to stream (upload) large amount of data with Requests in Python?

The requests module provides a high-level HTTP API. Using requests, I'd like to send data via HTTP using a POST request. The documentation is very short about this, stating only that a "file-like object" should be provided, without stating clearly what exactly requests would expect from that object. I have some binary data, but unfortunately it is generated data and I do not have a file-like object. How could I implement a "file-like object" myself that conforms to the expectations of requests? The documentation is quite poor in that regard, and I wasn't able to clarify this by looking into the source code of requests myself. Has anyone done this before using the requests API?
"File-like object" is a standard Python term for an object that behaves like a file. This means that if your data is in a file, you can simply open it and pass the resulting file object to Requests. If you have a more complex situation, you will need to give us a full description of the form of your data so we can help you more explicitly.
EDIT: To address your comment, here is the code to send a binary file to a host using Requests.
url = 'http://SomeSite/post'
files = {'files': ('mydata', open('mydata', mode='rb'), 'application/octet-stream')}
r = requests.post(url, files=files)
Opening the file with the Python open command creates the file-like object.
EDIT2: Whenever you open a file on disk, you create a file-like object in the process. However, Python supports other object types that act like files. Some examples include the standard stdin, stdout and stderr streams. In addition, pipes can be accessed using os.pipe and via subprocess.PIPE. These objects behave like files: they can be accessed with a subset of the file API, and that subset behaves the same way as it does for a real file.
This is why they are called file-like: they use the same APIs and act in the same way. You open, close, read and write a pipe in the same way as you do a file.
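Since the asker's data is generated rather than stored on disk, here is a minimal sketch (produce_chunks() is a hypothetical generator of bytes, and the URL is a placeholder): you can either wrap the bytes in io.BytesIO, which provides the read() method Requests expects, or pass the generator itself to data= so Requests streams it with chunked transfer encoding:
import io
import requests

def produce_chunks():
    # hypothetical generator of binary data
    for i in range(3):
        yield f'chunk-{i}'.encode()

url = 'http://SomeSite/post'   # placeholder URL

# Option 1: build a real file-like object from bytes already in memory
payload = io.BytesIO(b''.join(produce_chunks()))
r1 = requests.post(url, data=payload)

# Option 2: stream the generator directly; Requests sends it chunk by chunk
r2 = requests.post(url, data=produce_chunks())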

Load spydata file

I'm coming from R + RStudio. In R, you can save objects to an .RData file using save():
save(object_to_save, file = "C:/path/where/RData/file/will/be/saved.RData")
You can then load() the objects:
load(file = "C:/path/where/RData/file/was/saved.RData")
I'm now using Spyder and Python3, and I was wondering if the same thing is possible.
I'm aware everything in the global environment can be saved to a .spydata file using the buttons in Spyder's Variable Explorer.
But I'm looking for a way to save to a .spydata file from code; basically, just whatever code runs under those buttons.
Bonus points if the answer includes a way to save an object (or multiple objects) and not the whole env.
(Please note I'm not looking for an answer using pickle or shelve, but really something similar to R's load() and save().)
(Spyder developer here) There's no way to do what you ask for with a command in Spyder consoles.
If you'd like to see this in a future Spyder release, please open an issue in our issues tracker about it, so we don't forget to consider it.
Considering the comment here, we can:
rename the file from .spydata to .tar
extract the file (using a file manager, for example); this yields a .pickle file (and maybe a .npy)
load the objects saved from the environment:
import pickle
with open(path, 'rb') as f:
    data_temp = pickle.load(f)
That object will be a dictionary containing the saved objects.
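As a sketch of doing the same steps from code rather than a file manager (the .spydata filename and the name of the pickle inside the archive are assumptions; a .spydata file is just a tar archive):
import pickle
import tarfile

# 'session.spydata' is a placeholder filename
with tarfile.open('session.spydata', 'r') as tar:
    tar.extractall(path='spydata_contents')

# The archive contains a pickle file with a dict of the saved variables;
# the exact filename inside the archive may vary.
with open('spydata_contents/data.pickle', 'rb') as f:
    data_temp = pickle.load(f)

print(list(data_temp.keys()))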

JAXB API to generate Java source files directly to OutputStream

I have a schema file and I want to generate the class files directly into MEMORY instead of the file system. I have searched a lot, but everywhere I am finding APIs that generate the Java files into the file system only.
Can anyone please provide links to an API that generates the Java source files directly into memory?
Thanks,
Harish
I haven't leveraged this code in the way you described, but this fragment might point you in the right direction:
import com.sun.codemodel.*;
import com.sun.tools.xjc.*;
import com.sun.tools.xjc.api.*;

// Drive XJC programmatically instead of via the command-line tool
SchemaCompiler sc = XJC.createSchemaCompiler();
sc.setEntityResolver(new YourEntityResolver());
sc.setErrorListener(new YourErrorListener());
sc.parseSchema(SYSTEM_ID, element);
S2JJAXBModel model = sc.bind();
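A hedged continuation of that fragment (I haven't verified it end to end): S2JJAXBModel can produce a JCodeModel, and codemodel's SingleStreamCodeWriter writes every generated class to a single OutputStream instead of the file system:
// Generate the code model and emit all sources into memory
JCodeModel codeModel = model.generateCode(null, null);
ByteArrayOutputStream out = new ByteArrayOutputStream();
codeModel.build(new com.sun.codemodel.writer.SingleStreamCodeWriter(out));
// out now holds the concatenated generated .java sources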
