In Kaggle, where can I store common code to be imported into multiple Notebooks? - python-3.x

I am working on the Kaggle HuBMAP competition. My application consists of several components -- Preprocessing, Training, Prediction and Scoring -- and there is common code that is used by more than one component. Currently, I put multiple copies of the common code in the Notebooks for each of the components, but I'd like to maintain one copy of the common code that I can import into my application components.
My question is: Where do I store that common code so that it can be imported? Does it go in a separate DataSet? Or a separate Notebook? How do I store it? How do I import it?

Figured out how to do it:
Create a file like common_code.py to hold the code you want to import in Kaggle
Create a Kaggle DataSet like CommonCode and upload common_code.py to it
In the Notebook where you want to import the common code add the DataSet CommonCode
At the top of your Notebook, add the following code:
import sys
sys.path.append("/kaggle/input/CommonCode")
Then, at any subsequent point in the Notebook you can say
from common_code import *
While this shows an example of a single importable file, you can use the same approach for any number of files, and you can update the DataSet whenever you add to or revise the common code.
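For illustration, here is a sketch of how the pieces fit together; the dataset name matches the steps above, but the file contents and the rle_encode helper are just assumptions:
# common_code.py, uploaded to the CommonCode DataSet (rle_encode is a hypothetical helper)
def rle_encode(mask):
    ...
# In any Notebook that has the CommonCode DataSet attached:
import sys
sys.path.append("/kaggle/input/CommonCode")
from common_code import rle_encode   # or: from common_code import *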

Related

Same sklearn pipeline different results

I have created a pipeline based on:
A custom TfidfVectorizer that returns the TF-IDF vectors as a DataFrame (600 features)
A custom feature generator that creates new features (5)
A FeatureUnion to join the two DataFrames; I checked that the output is an array, so there are no feature names (605 features total)
An XGBoost classifier with seed and random_state set (8 classes as label names)
If I fit and use the pipeline in a Jupyter notebook, I obtain good F1 scores.
However, when I save it (using pickle, joblib, or dill) and later load it in another notebook or script, I cannot always reproduce the results. I don't understand why, because the test input and the Python environment are always the same.
Could you help me with some suggestions?
Thanks!
What I have tried: saving the pipeline with different libraries, adding a DenseTransformer at some points, using a ColumnTransformer instead of FeatureUnion, and so on. I cannot use the pmml library due to some restrictions.
The problem remains the same.
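As a first step to narrow this down, here is a minimal self-contained sketch (the data and parameters are made up, and it omits the custom transformers) that checks whether serialization alone changes a seeded XGBoost model's predictions:
import joblib
import numpy as np
from xgboost import XGBClassifier
rng = np.random.RandomState(0)
X = rng.rand(200, 10)
y = rng.randint(0, 8, size=200)           # 8 classes, as in the question
model = XGBClassifier(n_estimators=20, random_state=42)
model.fit(X, y)
joblib.dump(model, "model.joblib")        # save in this session
reloaded = joblib.load("model.joblib")    # load as if in a new session
assert np.array_equal(model.predict(X), reloaded.predict(X))
If this assertion holds but the full pipeline still diverges across sessions, the custom transformers (or the feature ordering inside the FeatureUnion) are the more likely source of the nondeterminism.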

How can I use a model I trained to make predictions in the future without retraining whenever I want to use it

I recently finished training a linear regression algorithm but I don't know how to save it so that in the future, I can use it to make relevant predictions without having to retrain it whenever I want to use it.
Do I save the .py file and call it whenever I need it or create a class or what?
I just want to know how I can save a model I trained so I can use it in the future.
Depending on how you make the linear regression, you should be able to obtain the equation of the regression, as well as the values of the coefficients, most likely by inspecting the workspace.
If you explain what module, function, or code you use to do the regression, it will be easier to give a specific solution.
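For example, a minimal sketch assuming scikit-learn's LinearRegression (the data here is made up): the fitted coefficients and intercept fully describe the model, so once you record them you can reproduce predictions without retraining.
import numpy as np
from sklearn.linear_model import LinearRegression
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.1, 3.9, 6.0, 8.1])
model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)      # the equation y = coef * x + intercept
# Later, predictions can be recomputed from the stored values alone:
# y_new = X_new @ model.coef_ + model.intercept_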
Furthermore, you can probably use the dill package:
https://pypi.org/project/dill/
I saw the solution here:
https://askdatascience.com/441/anyone-knows-workspace-jupyter-python-variables-functions
The steps proposed for using dill are:
Install dill. If you use conda, the command is conda install -c anaconda dill
To save the workspace using dill:
import dill
dill.dump_session('notebook_session.db')
To restore the session:
import dill
dill.load_session('notebook_session.db')
I saw the same package discussed here: How to save all the variables in the current python session?
and I tested it using a model created with the interpretML package, and it worked for me.
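If you only need the fitted model rather than the whole workspace, dill can also serialize a single object. A minimal sketch (the LinearRegression model here is just a stand-in for whatever you trained):
import dill
import numpy as np
from sklearn.linear_model import LinearRegression
model = LinearRegression().fit(np.array([[1.0], [2.0], [3.0]]), np.array([2.0, 4.0, 6.0]))
with open("model.dill", "wb") as f:
    dill.dump(model, f)                   # save just the model
with open("model.dill", "rb") as f:
    restored = dill.load(f)               # restore it in a later session
print(restored.predict(np.array([[4.0]])))  # ~[8.0]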

PyTorch loads old data when using tensorboard

While using TensorBoard, I cleared my data directory and trained a new model, but I am still seeing images from an old model. Why is TensorBoard loading old data, where is it being stored, and how do I remove it?
TensorBoard keeps caches so that if a long training run fails, there are backup-like ("bak") files from which your board can still generate visualizations. Unfortunately, there is no good way to remove these hidden temp files manually; they are not visible even when listing dot-prefixed files in bash, and this storage is self-managed. Two best practices help:
(1) Give each TensorBoard run a dynamic name, for example by combining the datetime library with an f-string in Python so that each run's name carries a timestamp. (This can be done right from Python, say a Jupyter notebook, by launching the bash command through the subprocess package.)
(2) Keep your logdir (log directory) separate from the directory where you run the code.
Together, these two practices should prevent temp files from erroneously polluting new results.
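A sketch of practice (1), assuming PyTorch's SummaryWriter since the question is about PyTorch (the directory names are just examples): each run writes to its own time-stamped subdirectory, so old event files never mix with new ones.
from datetime import datetime
from torch.utils.tensorboard import SummaryWriter
run_dir = f"logs/run_{datetime.now().strftime('%Y%m%d-%H%M%S')}"
writer = SummaryWriter(log_dir=run_dir)   # a fresh directory for this run
writer.add_scalar("loss/train", 0.5, global_step=0)   # example scalar
writer.close()
# Then point TensorBoard at the parent directory: tensorboard --logdir logs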
How to "reset" tensorboard data after killing tensorflow instance

Using my saved ML model to work on a raw and unprocessed dataset

I have created a few ML models and saved them for future use in predicting outcomes. This time I face a scenario that is probably common, but new to me.
I need to provide this model to someone else to test it out on their dataset.
I removed a few redundant columns from my training data, trained a regression model on it, and saved it after validation. However, when I give this model to someone to use on their dataset, how do I tell them which columns to drop? I could manually add the column list to the Python file from which the saved model is called, but that does not seem very clean.
What is the best way to do this in general? Kindly share some suggestions.
You can simply use the pickle library to save the column list and other metadata along with the model. In a new session, use pickle to load those objects back in again.
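A minimal sketch of that idea (the column names and model are placeholders): bundle the fitted model and the list of columns to drop into a single pickle, so whoever receives it can apply both without any hard-coded lists.
import pickle
import numpy as np
from sklearn.linear_model import LinearRegression
model = LinearRegression().fit(np.array([[1.0], [2.0]]), np.array([1.0, 2.0]))
columns_to_drop = ["id", "timestamp"]     # whatever you removed before training
with open("model_bundle.pkl", "wb") as f:
    pickle.dump({"model": model, "columns_to_drop": columns_to_drop}, f)
# On the recipient's side:
with open("model_bundle.pkl", "rb") as f:
    bundle = pickle.load(f)
# df = df.drop(columns=bundle["columns_to_drop"]); preds = bundle["model"].predict(df)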

Possibility to apply online algorithms on big data files with sklearn?

I would like to apply fast online dimensionality reduction techniques such as (online/mini-batch) Dictionary Learning on big text corpora.
My input data naturally does not fit in memory (which is why I want to use an online algorithm), so I am looking for an implementation that can iterate over a file rather than loading everything into memory.
Is it possible to do this with sklearn? Are there alternatives?
Thanks
For some algorithms supporting partial_fit, it would be possible to write an outer loop in a script to do out-of-core, large-scale text classification. However, some elements are missing: a dataset reader that iterates over the data on disk as folders of flat files, a SQL database server, a NoSQL store, or a Solr index with stored fields, for instance. We also lack an online text vectorizer.
Here is a sample integration template to explain how it would fit together.
import numpy as np
import joblib  # used below to snapshot the model during training
from sklearn.linear_model import Perceptron
from mymodule import SomeTextDocumentVectorizer
from mymodule import DataSetReader
dataset_reader = DataSetReader('/path/to/raw/data')
expected_classes = dataset_reader.get_all_classes()  # need to know the possible classes ahead of time
feature_extractor = SomeTextDocumentVectorizer()
classifier = Perceptron()
dataset_reader = DataSetReader('/path/to/raw/data')  # re-create the reader to restart iteration from the beginning
for i, (documents, labels) in enumerate(dataset_reader.iter_chunks()):
    vectors = feature_extractor.transform(documents)
    classifier.partial_fit(vectors, labels, classes=expected_classes)
    if i % 100 == 0:
        # dump model to be able to monitor quality and later analyse convergence externally
        joblib.dump(classifier, 'model_%04d.pkl' % i)
The dataset reader class is application specific and will probably never make it into scikit-learn (except maybe for a folder of flat text files or CSV files, which would not require adding a new dependency to the library).
The text vectorizer part is more problematic. The current vectorizer does not have a partial_fit method because of the way we build the in-memory vocabulary (a python dict that is trimmed depending on max_df and min_df). We could maybe build one using an external store and drop the max_df and min_df features.
Alternatively, we could build a HashingTextVectorizer that would use the hashing trick to drop the dictionary requirement. Neither of those exists at the moment (although we already have some building blocks, such as a murmurhash wrapper and a pull request for hashing features).
In the meantime, I would advise you to have a look at Vowpal Wabbit and possibly its Python bindings.
Edit: The sklearn.feature_extraction.FeatureHasher class has been merged into the master branch of scikit-learn and will be available in the next release (0.13). Have a look at the documentation on feature extraction.
Edit 2: 0.13 is now released with both FeatureHasher and HashingVectorizer, which can directly deal with text data.
Edit 3: there is now an example on out-of-core learning with the Reuters dataset in the official example gallery of the project.
Since scikit-learn 0.13 there is indeed an implementation of HashingVectorizer.
EDIT: Here is a full-fledged example of such an application
Basically, this example demonstrates that you can learn (e.g. classify text) on data that cannot fit in the computer's main memory (but rather on disk / network / ...).
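For example, here is a minimal out-of-core sketch combining HashingVectorizer (stateless, so it needs no vocabulary-building pass over the corpus) with partial_fit; the stream_batches generator is an assumption standing in for your own on-disk reader.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
vectorizer = HashingVectorizer(n_features=2 ** 18)
classifier = SGDClassifier()
all_classes = [0, 1]                      # must be known ahead of time
def stream_batches():
    # placeholder: yield (documents, labels) chunks read from disk or a database
    yield (["some positive text", "some negative text"], [1, 0])
for documents, labels in stream_batches():
    X = vectorizer.transform(documents)   # no fit needed: hashing is stateless
    classifier.partial_fit(X, labels, classes=all_classes)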
In addition to Vowpal Wabbit, gensim might be interesting as well - it too features online Latent Dirichlet Allocation.
