How to Integrate Dedupe Active Learning functinolity (Console_label) with restful api - python-dedupe

below code to train csv model and active learning involved init how to integrate console_label function with restful api(eg fastapi)
Create a new deduper object and pass our data model to it.
deduper = dedupe.Dedupe(fields)
# If we have training data saved from a previous run of dedupe,
# look for it and load it in.
# __Note:__ if you want to train from scratch, delete the training_file
if os.path.exists(training_file):
print('reading labeled examples from ', training_file)
with open(training_file, 'rb') as f:
deduper.prepare_training(data_d, f)
else:
deduper.prepare_training(data_d)
# ## Active learning
# Dedupe will find the next pair of records
# it is least certain about and ask you to label them as duplicates
# or not.
# use 'y', 'n' and 'u' keys to flag duplicates
# press 'f' when you are finished
print('starting active labeling...')
dedupe.console_label(deduper)
# Using the examples we just labeled, train the deduper and learn
# blocking predicates
deduper.train()

Related

How to download pipeline code from Pycaret AutoML into .py files?

PyCaret seems like a great AutoML tool. It works, fast and simple and I would like to download the generated pipeline code into .py files to double check and if needed to customize some parts. Unfortunately, I don't know how to make it real. Reading the documentation have not helped. Is it possible or not?
It is not possible to get the underlying code since PyCaret takes care of this for you. But it is up to you as the user to decide the steps that you want your flow to take e.g.
# Setup experiment with user-defined options for preprocessing, etc.
setup(...)
# Create a model (uses training split only)
model = create_model("lr")
# Tune hyperparameters (user can pass a custom tuning grid if needed)
# Again, uses training split only
tuned = tune_model(model, ...)
# Finalize model (so that the best hyperparameters are retrained on the entire dataset
finalize_model(tuned)
# Any other steps you would like to do.
...
Finally, you can save the entire pipeline as a pkl file for use later
# Saves the model + pipeline as a pkl file
save_model(final, "my_best_model")
You may get a partial answer: incomplete with 'get_config("prep_pipe")' in 2.6.10 or in 3.0.0rc1
Just run a setup like in examples, store as a cdf1, and try cdf.pipeline and you may get a text like this: Pipeline(..)
When working with pycaret=3.0.0rc4, you have two options.
Option 1:
get_config("pipeline")
Option 2:
lb = get_leaderboard()
lb.iloc[0]['Model']
Option 1 will give you the transformations done to the data whilst option 2 will give you the same plus the model and its parameters.
Here's some sample code (from a notebook, based on their documentation on the Binary Classification Tutorial (CLF101) - Level Beginner):
from pycaret.datasets import get_data
from pycaret.classification import *
dataset = get_data('credit')
data = dataset.sample(frac=0.95, random_state=786).reset_index(drop=True)
data_unseen = dataset.drop(data.index).reset_index(drop=True)
exp_clf101 = setup(data = data, target = 'default', session_id=123)
best = compare_models()
evaluate_model(best)
# OPTION 1
get_config("pipeline")
# OPTION 2
lb = get_leaderboard()
lb.iloc[0]['Model']

Loading saved NER back into HuggingFace pipeline?

I am doing some research into HuggingFace's functionalities for transfer learning (specifically, for named entity recognition). To preface, I am a bit new to transformer architectures. I briefly walked through their example off of their website:
from transformers import pipeline
nlp = pipeline("ner")
sequence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very" \
"close to the Manhattan Bridge which is visible from the window."
print(nlp(sequence))
What I would like to do is save and run this locally without having to download the "ner" model every time (which is over 1 GB in size). In their documentation, I see that you can save the pipeline using the "pipeline.save_pretrained()" function to a local folder. The results of this are various files which I am storing into a specific folder.
My question would be how can I load this model back up into a script to continue classifying as in the example above after saving? The output of "pipeline.save_pretrained()" is multiple files.
Here is what I have tried so far:
1: Following the documentation about pipeline
pipe = transformers.TokenClassificationPipeline(model="pytorch_model.bin", tokenizer='tokenizer_config.json')
The error I got was: 'str' object has no attribute "config"
2: Following HuggingFace example on ner:
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch
model = AutoModelForTokenClassification.from_pretrained("path to folder following .save_pretrained()")
tokenizer = AutoTokenizer.from_pretrained("path to folder following .save_pretrained()")
label_list = [
"O", # Outside of a named entity
"B-MISC", # Beginning of a miscellaneous entity right after another miscellaneous entity
"I-MISC", # Miscellaneous entity
"B-PER", # Beginning of a person's name right after another person's name
"I-PER", # Person's name
"B-ORG", # Beginning of an organisation right after another organisation
"I-ORG", # Organisation
"B-LOC", # Beginning of a location right after another location
"I-LOC" # Location
]
sequence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very" \
"close to the Manhattan Bridge."
# Bit of a hack to get the tokens with the special tokens
tokens = tokenizer.tokenize(tokenizer.decode(tokenizer.encode(sequence)))
inputs = tokenizer.encode(sequence, return_tensors="pt")
outputs = model(inputs)[0]
predictions = torch.argmax(outputs, dim=2)
print([(token, label_list[prediction]) for token, prediction in zip(tokens, predictions[0].tolist())])
This yields an error: list index out of range
I also tried printing out just predictions which is not returning the text format of the tokens along with their entities.
Any help would be much appreciated!
Loading a model like this has always worked for me:
from transformers import pipeline
pipe = pipeline('token-classification', model=model_folder, tokenizer=model_folder)
Have a look at here for further examples on how to use pipelines.

How to import custom functions on my experiment script for Azure ML?

I can successfully submit an experiment to processing on a remote compute target on Azure ML.
In my notebook, for submitting the experiment, I have:
# estimator
estimator = Estimator(
source_directory='scripts',
entry_script='exp01.py',
compute_target='pc2',
conda_packages=['scikit-learn'],
inputs=[data.as_named_input('my_dataset')],
)
# Submit
exp = Experiment(workspace=ws, name='my_exp')
# Run the experiment based on the estimator
run = exp.submit(config=estimator)
RunDetails(run).show()
run.wait_for_completion(show_output=True)
However, in order to keep things clean, I want to define my general use functions on an auxiliary script, so the first will import it.
On my script experiment file exp01.py, I wanted:
import custom_functions as custom
# azure experiment start
run = Run.get_context()
# the data from azure datasets/datastorage
df = run.input_datasets['my_dataset'].to_pandas_dataframe()
# prepare data
df_transformed = custom.prepare_data(df)
# split data
X_train, X_test, y_train, y_test = custom.split_data(df_transformed)
# run my models.....
model_name = 'RF'
model = custom.model_x(model_name, a_lot_of_args)
# log the results
run.log(model_name, results)
# azure finish
run.complete()
The thing is: Azure wont let me import the custom_functions.py.
How are you doing it?
TL;DR any files you put inside the source_directory in your case, scripts will be available to the Estimator.
To make this happen, simply create a file called custom_functions.py in the scripts folder that contains your prepare_data(), split_data(), model_x() functions.
I also recommend that you include only exactly what you need in the source_directory folder and make distinct folders for each Estimator because:
the entire folder's contents will be uploaded when you use a remote compute_target, and
when you started using ML Pipeilnes (which are awesome), PythonScriptSteps allow_reuse parameter will look to see if any files in the source_directory have changed when determining if the step needs to run again or not.
Lastly, when you want to share general utility functions across PythonScriptSteps or Estimators without having to copy and paste code, that's when you might want to consider creating a custom python package.

Load existing mf2005 model into flopy and change parameters

I have an existing MODFLOW2005 model that was created in Processing Modflow gui. I would like to import this model into flopy to be able to conduct a sensivitiy analysis on model parameters, something that I believe should be much quicker using flopy.
I can load the existing modflow model using:
ml = flopy.modflow.Modflow.load("modelnamw.nam", model_ws=model_ws,verbose=True,check=False)
And can re-name the model to create a new output using:
ml.name = 'New model'
ml.write_input()
Is there a way I can leave the entire model as is but just change the hydraulic conductivity (hy) parameter (leaving rest of bcf input as is)?
Thank you
The easiest method is probably to create a copy of the model (by changing the model_ws or giving it a new name), and then create a new BCF package with the modified parameters. Be sure to pass all the parameters that are unchanged to the new BCF package.
# get the BCF package
bcf = ml.get_package("BCF6")
# new hy
new_hy = 2.
# don't forget to pass all the unchanged parameters from the old BCF
new_bcf = flopy.modflow.ModflowBcf(ml, laycon=bcf.laycon, hy=new_hy, vcont=bcf.vcont)
new_bcf.write_file() # write file
ml.run_model() # run model with new BCF
Changing only the parameter on the existing object is also possible. To do that replace the existing bcf.hy object with a new Util3d object. Note: in this case it's a Util3d but for other parameters it might be 1D or 2D.
# get the BCF package
bcf = ml.get_package("BCF6")
# create new util3d object
new_hy_util3d = flopy.utils.Util3d(ml, bcf.hy.array.shape, np.float32, new_hy, "hy")
# replace the old hy with the new object
bcf.hy= new_hy_util3d
bcf.write_file() # write file
ml.run_model() # run model with new hy

Inception v3, train test and validation splits in retrain.py

I have been using tensorflow's retrain.py script for the inception v3 model to build a multilabel classification model, I found some code in the script that I don't fully understand, I'm asking for help clarifying what it does:
for file_name in file_list:
base_name = os.path.basename(file_name)
# We want to ignore anything after '_nohash_' in the file name when
# deciding which set to put an image in, the data set creator has a way of
# grouping photos that are close variations of each other. For example
# this is used in the plant disease data set to group multiple pictures of
# the same leaf.
hash_name = re.sub(r'_nohash_.*$', '', file_name)
# This looks a bit magical, but we need to decide whether this file should
# go into the training, testing, or validation sets, and we want to keep
# existing files in the same set even if more files are subsequently
# added.
# To do that, we need a stable way of deciding based on just the file name
# itself, so we do a hash of that and then use that to generate a
# probability value that we use to assign it.
hash_name_hashed = hashlib.sha1(compat.as_bytes(hash_name)).hexdigest()
percentage_hash = ((int(hash_name_hashed, 16) %
(MAX_NUM_IMAGES_PER_CLASS + 1)) *
(100.0 / MAX_NUM_IMAGES_PER_CLASS))
if percentage_hash < validation_percentage:
validation_images.append(base_name)
elif percentage_hash < (testing_percentage + validation_percentage):
testing_images.append(base_name)
else:
training_images.append(base_name)
Is there a way to name the images to decide which ones go to the train set, validation set or test set? If someone has a link for the plant disease dataset mentioned here that would also be helpful, thanks!

Resources