How do I convert an h2o4gpu KMeans object to a sklearn KMeans object? - scikit-learn

I need to convert an h2o4gpu KMeans object to a plain sklearn KMeans object and then save it.
I thought I could just copy the cluster centers over, as below. I expected to be able to save sklearn_model and load it back, but I get this error: AttributeError: 'KMeans' object has no attribute '_n_threads'
from h2o4gpu.solvers import KMeans as GPUKMeans
from sklearn.cluster import KMeans
...
gpu_model = GPUKMeans(n_clusters=num_clusters)
gpu_model.fit(embeddings)
sklearn_model = KMeans(n_clusters=num_clusters)
sklearn_model.cluster_centers_ = gpu_model.cluster_centers_
...

After digging into the source code, I found some code that does a similar thing:
from h2o4gpu.solvers import KMeans as GPUKMeans
from sklearn.cluster import KMeans
from sklearn.utils._openmp_helpers import _openmp_effective_n_threads
...
gpu_model = GPUKMeans(n_clusters=num_clusters)
gpu_model.fit(embeddings)
kmeans_model = KMeans(n_clusters=num_clusters)
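# Copy the fitted state from the GPU model onto the unfitted sklearn estimator.
kmeans_model.cluster_centers_ = gpu_model.cluster_centers_
kmeans_model.labels_ = gpu_model.labels_
kmeans_model.inertia_ = gpu_model.inertia_
# _n_threads is a private attribute that fit() normally sets; predict() needs it.
kmeans_model._n_threads = _openmp_effective_n_threads()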
...
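Setting private attributes such as _n_threads by hand depends on the installed sklearn version. A possible alternative, sketched below under the assumption that only the centroids need to carry over, is to fit a sklearn KMeans on the centroids themselves so that sklearn fills in its own internal state, and then overwrite cluster_centers_ with the exact values. The helper name to_sklearn_kmeans and the gpu_centers array are hypothetical stand-ins for gpu_model.cluster_centers_.

```python
import pickle

import numpy as np
from sklearn.cluster import KMeans


def to_sklearn_kmeans(centers):
    """Build a fitted sklearn KMeans whose centroids equal `centers`."""
    centers = np.asarray(centers, dtype=np.float64)
    # Fitting on the centroids themselves (one sample per cluster) lets sklearn
    # populate whatever private attributes the installed version expects.
    model = KMeans(n_clusters=centers.shape[0], init=centers, n_init=1, max_iter=1)
    model.fit(centers)
    # Overwrite with the exact values so no rounding from the fit remains.
    model.cluster_centers_ = centers
    return model


# Stand-in for gpu_model.cluster_centers_ (hypothetical values).
gpu_centers = np.array([[0.0, 0.0], [10.0, 10.0]])
sklearn_model = to_sklearn_kmeans(gpu_centers)

# Round-trip through pickle and predict with the restored model.
restored = pickle.loads(pickle.dumps(sklearn_model))
predicted = restored.predict(np.array([[1.0, 1.0], [9.0, 9.0]]))
print(predicted)
```

Note that labels_ and inertia_ on this model describe the dummy fit, not the original data; if they matter, they would still have to be copied from the GPU model as in the code above.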

Related

How do I get access to the "last_hidden_state" for code generation models in huggingface?

I'm trying to obtain the "last_hidden_state" (as explained here) for the code generation models listed here. I can't figure out how to proceed, other than manually downloading each code generation model and checking whether its output has that key, using the following code:
import torch
from transformers import AutoTokenizer, AutoModelWithLMHead
tokenizer = AutoTokenizer.from_pretrained("codeparrot/codeparrot")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModelWithLMHead.from_pretrained("codeparrot/codeparrot").to(device)
inputs = tokenizer("def hello_world():", return_tensors="pt")
# Move every input tensor to the same device as the model.
inputs = {k: v.to(device) for k, v in inputs.items()}
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.keys())
So far, I tried this strategy on CodeParrot and InCoder with no success. Perhaps there is a better way to access the values of the hidden layers?
Inside CodeGenForCausalLM.forward, the local variable hidden_states is already the last_hidden_state of the underlying CodeGen model. See: link
There, hidden_states = transformer_outputs[0] is the first output of CodeGenModel (link), and transformer_outputs[0] is the last_hidden_state, as the end of CodeGenModel.forward shows:
if not return_dict:
    return tuple(v for v in [hidden_states, presents, all_hidden_states, all_self_attentions] if v is not None)
return BaseModelOutputWithPast(
    last_hidden_state=hidden_states,
    past_key_values=presents,
    hidden_states=all_hidden_states,
    attentions=all_self_attentions,
)
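To read the last hidden state from a causal LM without digging into the source, one option is to pass output_hidden_states=True and take the final entry of outputs.hidden_states. The sketch below uses a tiny, randomly initialized GPT-2 model built from a config (an assumption made so that nothing has to be downloaded; the sizes are arbitrary), and checks the result against the base model's last_hidden_state. CodeParrot is, as far as I know, GPT-2 based, so the same pattern should apply after from_pretrained.

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel

# Tiny random model so the example runs offline; sizes are arbitrary.
config = GPT2Config(vocab_size=100, n_positions=32, n_embd=16, n_layer=2, n_head=2)
model = GPT2LMHeadModel(config).eval()

input_ids = torch.tensor([[1, 2, 3, 4, 5]])
with torch.no_grad():
    # Ask the LM-head model to also return the hidden states of every layer.
    outputs = model(input_ids=input_ids, output_hidden_states=True)
    # The base transformer (no LM head) exposes last_hidden_state directly.
    base_outputs = model.transformer(input_ids=input_ids)

# hidden_states holds the embedding output plus one entry per layer;
# the final entry is the last hidden state.
last_hidden_state = outputs.hidden_states[-1]
print(last_hidden_state.shape)
```

For models loaded with AutoModelForCausalLM, the attribute holding the base transformer differs by architecture, so the output_hidden_states route is usually the more portable of the two.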

Python - Need help in solving "Load the R data set mtcars as a pandas dataframe." problem

I am working on this problem and am unsure how to proceed.
Load the R data set mtcars as a pandas dataframe.
Build a linear regression model by considering the log of independent variable wt, and log of dependent variable mpg.
Fit the model with data.
Perform ANOVA on the linear model obtained in the previous step. (Hint: Use anova.anova_lm)
Display the F-statistic value.
I saw the solution below in another post, but it doesn't seem to work.
import statsmodels.api as sm
import numpy as np
mtcars = sm.datasets.get_rdataset('mtcars')
mtcars_data = mtcars.data
liner_model = sm.formula.ols('np.log(wt) ~ np.log(mpg)',mtcars_data)
liner_result = liner_model.fit()
print(liner_result.rsquared)
I fixed it with the following:
import statsmodels.api as sm
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats import anova
mtcars = sm.datasets.get_rdataset("mtcars", "datasets", cache=True).data
df = pd.DataFrame(mtcars)
model = smf.ols(formula='np.log(mpg) ~ np.log(wt)', data=mtcars).fit()
print(anova.anova_lm(model))
print(anova.anova_lm(model).F["np.log(wt)"])
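As a quick sanity check on the approach above, the sketch below runs the same log-log regression and ANOVA on synthetic data (the wt and mpg values here are randomly generated stand-ins, not the real mtcars data, so it runs without a network connection). With a single regressor, the F value in the ANOVA table should match the model's overall fvalue.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats import anova

# Synthetic stand-in for mtcars: mpg falls as wt rises, with some noise.
rng = np.random.default_rng(0)
wt = rng.uniform(1.5, 5.5, size=32)
mpg = np.exp(3.9 - 0.85 * np.log(wt) + rng.normal(0, 0.1, size=32))
df = pd.DataFrame({"wt": wt, "mpg": mpg})

# Dependent variable on the left of "~", independent variable on the right.
model = smf.ols("np.log(mpg) ~ np.log(wt)", data=df).fit()
table = anova.anova_lm(model)

# The F-statistic for the wt term, read from the ANOVA table.
f_stat = table.loc["np.log(wt)", "F"]
print(f_stat)
```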

Getting error while trying to save and apply existing machine learning model to new dataset?

I am trying to use this model: https://github.com/aninda052/Disasters-on-social-media-NLP/blob/master/Disasters%20on%20social%20media.ipynb
I searched for a way to save this model and use it with a new dataset in another application, and found that I can use pickle, so I added it to the code like this:
import pickle
model_tfidf = LogisticRegression(C=30.0, class_weight='balanced', solver='newton-cg',
                                 multi_class='multinomial', n_jobs=-1, random_state=5)
model_tfidf.fit(x_train_tfidf, y_train)
predicted_tfidf = model_tfidf.predict(x_test_tfidf)
Pkl_Filename = "Pickle_RL_Model.pkl"
with open(Pkl_Filename, 'wb') as file:
    pickle.dump(model_tfidf, file)
After that, I created a new project to load and use this model. The code is:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import pandas as pd
import pickle
with open('Pickle_RL_Model.pkl', 'rb') as file:
    Pickled_LR_Model = pickle.load(file)
x = ["hi disaster", "flood disaster", "cry sad bad ", "srong storm"]
tfd = TfidfVectorizer()
new_data_vec = tfd.fit_transform(x)
Ypredict = Pickled_LR_Model.predict(new_data_vec)
but I got an error saying:
X has 8 features per sample; expecting 16988
I don't know what I did wrong. Any help, please?
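The feature count in the error matches what happens when a new TfidfVectorizer is fitted on the four new strings: it learns an 8-word vocabulary, while the model was trained on 16988 features. A minimal sketch of one way around this, assuming the original fitted vectorizer can be pickled alongside the model, is shown below; the training texts and labels are made-up placeholders.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Placeholder training data (hypothetical, for illustration only).
train_texts = ["flood disaster in the city", "strong storm warning",
               "lovely sunny day", "great lunch with friends"]
train_labels = [1, 1, 0, 0]

# Fit the vectorizer and the model together on the training data.
vectorizer = TfidfVectorizer()
x_train = vectorizer.fit_transform(train_texts)
model = LogisticRegression().fit(x_train, train_labels)

# Save BOTH objects; the model is only usable with the vocabulary it was trained on.
blob = pickle.dumps({"vectorizer": vectorizer, "model": model})

# In the other application: load both and call transform, not fit_transform.
loaded = pickle.loads(blob)
new_texts = ["hi disaster", "flood disaster"]
x_new = loaded["vectorizer"].transform(new_texts)
predictions = loaded["model"].predict(x_new)
print(x_new.shape, predictions)
```

Wrapping the vectorizer and the classifier in a single sklearn Pipeline and pickling that would achieve the same thing with one object.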

Deploy Keras model on Spark

I have a trained Keras model:
https://github.com/qubvel/efficientnet
I have a large, continuously updated dataset that I want to get predictions on, which means running my Spark job every 2 hours or so.
What is the way to implement this? MLlib does not support EfficientNet.
When searching online I saw this kind of implementation using sparkdl, but it does not support EfficientNet as the modelName parameter.
featurizer = DeepImageFeaturizer(inputCol="image", outputCol="features", modelName="InceptionV3")
rf = RandomForestClassifier(labelCol="label", featuresCol="features")
My naive approach would be:
import efficientnet.keras as efn
model = efn.EfficientNetB0(weights='imagenet')
from sparkdl import readImages
image_df = readImages("flower_photos/sample/")
# efficient_net_udf is the UDF I would still have to write.
image_df.withColumn("modelTags", efficient_net_udf(image_df["image"]["data"]))
and creating a UDF that calls model.predict...
Another method I saw is:
from keras.preprocessing.image import img_to_array, load_img
import numpy as np
import os
from pyspark.sql.types import StringType
from sparkdl import KerasImageFileTransformer
import efficientnet.keras as efn
model = efn.EfficientNetB0(weights='imagenet')
model.save("kerasModel.h5")
def loadAndPreprocessKeras(uri):
    # Load one image from disk and add a leading batch dimension.
    image = img_to_array(load_img(uri, target_size=(299, 299)))
    image = np.expand_dims(image, axis=0)
    return image
transformer = KerasImageFileTransformer(inputCol="uri", outputCol="predictions",
                                        modelFile='path/kerasModel.h5',
                                        imageLoader=loadAndPreprocessKeras,
                                        outputMode="vector")
# Absolute paths of all .jpg files in the image directory.
image_dir = "/data/myimages"
files = [os.path.abspath(os.path.join(image_dir, f)) for f in os.listdir(image_dir) if f.endswith('.jpg')]
uri_df = sqlContext.createDataFrame(files, StringType()).toDF("uri")
keras_pred_df = transformer.transform(uri_df)
What is the correct (and working) way to approach this?
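One pattern that does not depend on sparkdl supporting a given architecture is to run the model inside a function passed to DataFrame.mapInPandas (available in PySpark 3.0 and later), loading the model once per partition. The sketch below shows only the shape of such a function and exercises it with plain pandas; StubModel and stub_load_image are hypothetical stand-ins for the Keras model and the image loader, and make_predict_fn is an illustrative name, not part of any library.

```python
from typing import Callable, Iterator

import numpy as np
import pandas as pd


def make_predict_fn(load_model: Callable[[], object],
                    load_image: Callable[[str], np.ndarray]):
    """Return a function with the shape expected by DataFrame.mapInPandas."""

    def predict_batches(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
        model = load_model()  # loaded once per partition, not once per row
        for pdf in batches:
            if pdf.empty:
                continue
            # Stack the images of this batch into one array and predict in bulk.
            images = np.stack([load_image(uri) for uri in pdf["uri"]])
            scores = np.asarray(model.predict(images))
            yield pd.DataFrame({"uri": pdf["uri"].to_numpy(),
                                "label": scores.argmax(axis=1)})

    return predict_batches


class StubModel:
    """Stand-in for the Keras model so the sketch runs without Spark or Keras."""

    def predict(self, images: np.ndarray) -> np.ndarray:
        # Treat the per-channel mean as a 3-class score vector.
        return images.mean(axis=(1, 2))


def stub_load_image(uri: str) -> np.ndarray:
    # Hypothetical loader: the bright channel depends on the URI length.
    image = np.zeros((4, 4, 3))
    image[:, :, len(uri) % 3] = 1.0
    return image


predict_fn = make_predict_fn(StubModel, stub_load_image)
result = pd.concat(predict_fn(iter([pd.DataFrame({"uri": ["a", "bb", "ccc"]})])))
print(result)
```

On a cluster, the call would look something like uri_df.mapInPandas(predict_fn, schema="uri string, label long"), with load_model replaced by a function that loads the saved Keras file on the executor; that part is an assumption and is not exercised here.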

Error saving a linear regression model with MLLib

When I try to save my linear regression model to disk, I get this error: "TypeError: save() takes 2 positional arguments but 3 were given"
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.ml.regression import LinearRegression
sc= SparkContext()
lr = LinearRegression(featuresCol = 'features', labelCol='NextOrderInDays', maxIter=10, regParam=0.3, elasticNetParam=0.8)
lr_model = lr.fit(train_df)
lr_model.save(sc, "lr_model.model")
Searching the web turns up code similar to what I wrote. What am I missing regarding the third argument?
Thanks
You are using the ml package, not mllib: from pyspark.ml.regression import LinearRegression.
In the ml package, save takes only one argument, the path (cf. documentation), so the call becomes lr_model.save("lr_model.model") without the SparkContext.
