Trying to use fetch_20newsgroups - python-3.x

I am learning Python and have run into a problem fetching the 20newsgroups data with this code:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import *
categories = ['comp_graphics', 'misc_foresale',
'rec.autos', 'sci.space']
twenty_train = fetch_20newsgroups(subset='train',
categories=categories,
shuffle=True,
random_state=42)
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(
twenty_train.data)
print(X_train_counts)
print("BOW shape:", X_train_counts.shape)
caltech_idx = count_vect.vocabulary_['caltech']
print('"Caltech": %i' % X_train_counts[0, caltech_idx])
I have scikit-learn 1.0.2 and Python 3.7.9, although I also have a copy of Python 3.10 installed.
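The question does not include the error message, so this is only a guess at the cause: the 20 newsgroups category names use dots (for example comp.graphics and misc.forsale), while the code above asks for 'comp_graphics' and 'misc_foresale'. Below is a minimal sketch that checks requested names against the dataset's category list without downloading anything. The helper name check_categories is hypothetical and not part of scikit-learn.

```python
import difflib

# The 20 category names as they appear in the dataset's target_names.
VALID_CATEGORIES = [
    "alt.atheism", "comp.graphics", "comp.os.ms-windows.misc",
    "comp.sys.ibm.pc.hardware", "comp.sys.mac.hardware", "comp.windows.x",
    "misc.forsale", "rec.autos", "rec.motorcycles", "rec.sport.baseball",
    "rec.sport.hockey", "sci.crypt", "sci.electronics", "sci.med",
    "sci.space", "soc.religion.christian", "talk.politics.guns",
    "talk.politics.mideast", "talk.politics.misc", "talk.religion.misc",
]


def check_categories(requested):
    """Map each unknown name to its closest valid name (or None).

    Hypothetical helper for illustration; not a scikit-learn function.
    """
    suggestions = {}
    for name in requested:
        if name not in VALID_CATEGORIES:
            matches = difflib.get_close_matches(name, VALID_CATEGORIES, n=1)
            suggestions[name] = matches[0] if matches else None
    return suggestions


categories = ['comp_graphics', 'misc_foresale', 'rec.autos', 'sci.space']
# Prints the names that are not valid categories, with a suggested fix each.
print(check_categories(categories))
```

If the dictionary is empty, the category names are fine and the problem lies elsewhere (for example the download itself, or 'caltech' missing from the vocabulary).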

Related

GPU not used on d3rlpy

I am new to d3rlpy, which I use for offline RL training and which builds on PyTorch. I installed the CUDA 11.6 build of PyTorch as recommended in the PyTorch docs: pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116. I then installed d3rlpy and ran the following sample code:
from d3rlpy.algos import BC,DDPG,CRR,PLAS,PLASWithPerturbation,TD3PlusBC,IQL
import d3rlpy
import numpy as np
import glob
import time
#models
continuous_models = {
"BehaviorCloning": BC,
"DeepDeterministicPolicyGradients": DDPG,
"CriticRegularizedRegression": CRR,
"PolicyLatentActionSpace": PLAS,
"PolicyLatentActionSpacePerturbation": PLASWithPerturbation,
"TwinDelayedPlusBehaviorCloning": TD3PlusBC,
"ImplicitQLearning": IQL,
}
#load dataset data_batch is created as a*.h5 file with d3rlpy
dataset = d3rlpy.dataset.MDPDataset.load(data_batch)
# preprocess
mean = np.mean(dataset.observations, axis=0, keepdims=True)
std = np.std(dataset.observations, axis=0, keepdims=True)
scaler = d3rlpy.preprocessing.StandardScaler(mean=mean, std=std)
# test models
for _model in continuous_models:
    the_model = continuous_models[_model](scaler = scaler)
    the_model.use_gpu = True
    the_model.build_with_dataset(dataset)
    the_model.fit(dataset = dataset.episodes,
                  n_steps_per_epoch = 10800,
                  n_steps = 54000,
                  logdir = './logs',
                  experiment_name = f"{_model}",
                  tensorboard_dir = 'logs',
                  save_interval = 900, # we don't want to save intermediate parameters
                  )
    #save model
    the_timestamp = int(time.time())
    the_model.save_model(f"./models/{_model}/{_model}_{the_timestamp}.pt")
The issue is that none of the models actually use the GPU, despite being set with use_gpu = True. Running some sample PyTorch code and checking torch.cuda.current_device(), I can see that PyTorch is set up properly and detects the GPU. Any idea where to look to solve this? I am not sure this is a bug in d3rlpy, so I won't bother creating an issue on GitHub yet :)

Fastai for time series regression

I have been using the fastai library for a couple of years now. Recently, I came upon tsai, an extension library dedicated to time series analysis.
I am trying to perform a simple regression task on the famous AirPassengers dataset.
I have no idea what I am doing wrong:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in
import numpy as np # linear algebra
import torch
import random
import seaborn as sns
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
# fastai
from fastai import *
from fastai.text import *
from fastai.text.all import *
from tsai.all import *
flight_data = sns.load_dataset("flights")
flight_data.head(20)
scaler = MinMaxScaler(feature_range=(-1, 1))
# flight_data['passengers'] = scaler.fit_transform(flight_data['passengers'].values.reshape(-1, 1)).flatten()
plt.figure(figsize=(10, 4))
plt.plot(flight_data['passengers'])
def create_inout_sequences(input_data, tw):
    inout_seq = []
    label_seq = []
    L = len(input_data)
    for i in range(L-tw):
        train_seq = input_data[i:i+tw]
        train_label = input_data[i+tw:i+tw+1]
        inout_seq.append(train_seq)
        label_seq.append(train_label)
    return np.array(inout_seq), np.array(label_seq)
data = flight_data['passengers'].values
x, y = create_inout_sequences(data, 15)
src = itemify(x, y)
yy = y.reshape(-1)
xx = x.reshape(-1)
tfms = [None, [TSRegression()]]
batch_tfms = TSStandardize(by_sample=True, by_var=True)
dls = get_ts_dls(x, yy, tfms=tfms, bs=64)
dls.show_batch()
dls.one_batch()
dls.c
learn = ts_learner(dls, InceptionTime, metrics=[mae, rmse], cbs=ShowGraph())
learn.lr_find()
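The question does not say what goes wrong, so the following only illustrates the windowing step and the array shapes it produces. It is a minimal sketch, assuming NumPy 1.20 or later (for sliding_window_view) and using a synthetic array of 144 values in place of the flights data. The function name make_windows is hypothetical. The last line adds a singleton "variables" axis, on the assumption that the learner expects input shaped (samples, variables, steps); check that against the tsai documentation before relying on it.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def make_windows(series, tw):
    """Return (x, y): x[i] holds tw consecutive values, y[i] the value right after."""
    series = np.asarray(series)
    # Each row is a window of tw inputs followed by one target value.
    windows = sliding_window_view(series, tw + 1)  # shape (len(series) - tw, tw + 1)
    x = windows[:, :tw]
    y = windows[:, tw]
    return x, y


data = np.arange(144)  # stand-in for the 144 monthly passenger counts
x, y = make_windows(data, 15)

# Assumption: a univariate series needs an explicit "variables" axis of size 1.
x3d = x[:, np.newaxis, :]

print(x.shape, y.shape, x3d.shape)
```

Compared with create_inout_sequences, y is already one-dimensional here, so no reshape(-1) is needed before passing it on.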

Keras Model Training with Azure Machine Learning

I have trained a multiclass-classification model locally using Keras. I am attempting to migrate this so that it can be trained and run in Azure Machine Learning Studio (AML).
I have provided the sections of code below which are used in AML - the Main AML Code and the script to train the model (EnsemblingModel.py). From the Main AML Code, the script to train the model is called via src = (Script Run Config).
Please note that I have also uploaded the dataset which the model should be trained upon to AML directly and is titled 'test_data'.
However an error is returned when executing the line RunDetails(run).show() from the Main AML code section. The error is:
Error occurred: User program failed with FileNotFoundError: [Errno 2] No such file or directory: 'test_data'
This error message refers to the following line from the EnsemblingModel.py script:
dataframe = pd.read_csv("test_data", header=None)
I understand that the script is unable to load the data and I have therefore tried changing the code, for example:
dataframe = dataset.get_by_name(ws, name='test_data')
Which returned the following error:
Error occurred: User program failed with NameError: name 'dataset' is not defined
How do I change this so that the script is able to read and load the data so that training can commence? Maybe I am going about this completely the wrong way, so any advice is welcomed.
I have consulted the various Microsoft documentation pages as well as the Azure guides on GitHub here, but there seem to be few examples.
I am new to AML, so if anyone has any resources for using it alongside Keras, then that would also be appreciated.
Main AML Code:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import os
import azureml
from azureml.core import Experiment
from azureml.core import Environment
from azureml.core import Dataset
from azureml.core import Workspace, Run
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
ws = Workspace.from_config()
print('Workspace name: ' + ws.name,
'Azure region: ' + ws.location,
'Subscription id: ' + ws.subscription_id,
'Resource group: ' + ws.resource_group, sep='\n')
from azureml.core import Experiment
script_folder = './TestingModel1'
os.makedirs(script_folder, exist_ok=True)
exp = Experiment(workspace=ws, name='TestingModel1')
dataset = Dataset.get_by_name(ws, name='test_data')
dataframe = dataset.to_pandas_dataframe()
df = dataframe.values
cluster_name = "cpu-cluster"
try:
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
    print('Found existing compute target')
except ComputeTargetException:
    print('Creating a new compute target...')
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_NC6',
                                                           max_nodes=4)
    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)
    compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)
compute_targets = ws.compute_targets
for name, ct in compute_targets.items():
    print(name, ct.type, ct.provisioning_state)
from azureml.core import Environment
keras_env = Environment.from_conda_specification(name = 'keras-2.3.1', file_path = './conda_dependencies.yml')
# Specify a GPU base image
#keras_env.docker.enabled = True
keras_env.docker.base_image = 'mcr.microsoft.com/azureml/openmpi3.1.2-cuda10.0-cudnn7-ubuntu18.04'
from azureml.core import ScriptRunConfig
src = ScriptRunConfig(source_directory=script_folder,
script='EnsemblingModel.py',
compute_target=compute_target,
environment=keras_env)
run = exp.submit(src)
from azureml.widgets import RunDetails
RunDetails(run).show()
Ensembling Model Code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
#KerasLibraries
from keras import callbacks
from keras.layers.normalization import BatchNormalization
from keras.layers import Activation
from keras.layers import Dropout
from keras.optimizers import SGD
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
from keras.utils import np_utils
#tensorFlow
import tensorflow as tf
#SKLearnLibraries
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import Pipeline
from azureml.core import Run
# In[3]:
dataframe = pd.read_csv("test_data", header=None)
dataframe = dataset.get_by_name(ws, name='test_data')
dataset = dataframe.values
# In[4]:
X = dataset[:,0:22].astype(float)
y = dataset[:,22]
# encode class values as integers
encoder = LabelEncoder()
encoder.fit(y)
encoded_y = encoder.transform(y)
# convert integers to dummy variables (i.e. one hot encoded)
dummy_y = np_utils.to_categorical(encoded_y)
print(dummy_y.shape)
#print(X.shape)
#print(X)
import sys
np.set_printoptions(threshold=sys.maxsize)
dummy_y_new = dummy_y[0:42,:]
print(dummy_y_new)
#dataset
# In[5]:
earlystopping = callbacks.EarlyStopping(monitor ="val_loss",
mode ="min", patience = 125,
restore_best_weights = True)
#define Keras
model1 = Sequential()
model1.add(Dense(50, input_dim=22))
model1.add(BatchNormalization())
model1.add(Activation('relu'))
model1.add(Dropout(0.5,input_shape=(50,)))
model1.add(Dense(50))
model1.add(BatchNormalization())
model1.add(Activation('relu'))
model1.add(Dropout(0.5,input_shape=(50,)))
model1.add(Dense(8, activation='softmax'))
#compile the keras model
model1.compile(loss='categorical_crossentropy', optimizer='Adam', metrics=['accuracy'])
# fit the keras model on the dataset
model1.fit(X, dummy_y, validation_split=0.25, epochs=10000, batch_size=100, verbose=1, callbacks=[earlystopping])
_, accuracy3 = model1.evaluate(X, dummy_y, verbose=0)
print('Accuracy: %.2f' % (accuracy3*100))
predict_dataset = tf.convert_to_tensor([
[1,5,1,0.459,0.322,0.041,0.002,0.103,0.032,0.041,14,0.404,0.284,0.052,0.008,0.128,0.044,0.037,0.043,54,0,155],
])
predictions = model1(predict_dataset, training=False)
predictions2 = predictions.numpy()
print(predictions2)
print(type(predictions2))
I have resolved the above issue by adding an argument to the ScriptRunConfig code:
test_data_ds = Dataset.get_by_name(ws, name='test_data')
src = ScriptRunConfig(source_directory=script_folder,
script='EnsemblingModel.py',
# pass dataset as an input with friendly name 'test_data'
arguments=['--input-data', test_data_ds.as_named_input('test_data')],
compute_target=compute_target,
environment=keras_env)
As well as the following to the modelling script itself:
import argparse
from azureml.core import Dataset, Run
parser = argparse.ArgumentParser()
parser.add_argument("--input-data", type=str)
args = parser.parse_args()
run = Run.get_context()
ws = run.experiment.workspace
# get the input dataset by ID
dataset = Dataset.get_by_id(ws, id=args.input_data)
# load the TabularDataset to pandas DataFrame
df = dataset.to_pandas_dataframe()
dataset = df.values
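To make the hand-off between the two snippets above concrete: the value that follows --input-data on the command line ends up in args.input_data, because argparse turns the hyphen in the option name into an underscore. Here is a minimal sketch of just that parsing step, with no Azure dependency. The ID string is a made-up placeholder and parse_args is a hypothetical wrapper; in a real run the value is supplied by Azure ML.

```python
import argparse


def parse_args(argv):
    """Parse the same --input-data option the training script defines."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--input-data", type=str)
    return parser.parse_args(argv)


# "hypothetical-dataset-id" is a placeholder, not a real dataset ID.
args = parse_args(["--input-data", "hypothetical-dataset-id"])

# The option --input-data is exposed as the attribute input_data.
print(args.input_data)
```

If the argument is omitted, args.input_data is None, so the script could check for that before calling Dataset.get_by_id.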
For anyone curious, more information can be found here:

How to fix 'name 'cross_validation' is not defined' error in python

I am trying to run XGBClassifier parameter tuning and get a "name 'cross_validation' is not defined" error on this line of code:
kfold_5 = cross_validation.KFold(n = len(X), shuffle = True, n_folds = numFolds)
Maybe I didn't import the appropriate library?
First, get your version:
import sklearn
sklearn.__version__
Starting with scikit-learn 0.18, cross_validation.KFold was moved to model_selection.KFold; the old cross_validation module was deprecated in 0.18 and removed in 0.20.
If you have version 0.17 or older, use this:
from sklearn.cross_validation import KFold
kfold_5 = KFold(n= len(X), n_folds = numFolds, shuffle=True)
If you have a version newer than 0.17, use this:
from sklearn.model_selection import KFold
kfold_5 = KFold(n_splits = numFolds, shuffle=True)
Documentation for 0.21 version is here
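Note that the newer KFold no longer takes the number of samples in its constructor; the data is passed to split() instead. Here is a minimal sketch on a toy array, where X, num_folds and random_state=0 are illustrative choices rather than values from the question:

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data: 10 samples with 2 features each.
X = np.arange(20).reshape(10, 2)
num_folds = 5

kfold_5 = KFold(n_splits=num_folds, shuffle=True, random_state=0)

# split() yields one (train_indices, test_indices) pair per fold.
folds = list(kfold_5.split(X))
for train_idx, test_idx in folds:
    print(len(train_idx), len(test_idx))
```

Each sample lands in the test part of exactly one fold, so with 10 samples and 5 folds every fold has 8 training and 2 test indices.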

XGBoost error: /workspace/src/metric/elementwise_metric.cc:28: Check failed: preds.size() == info.labelsSize() (

I am new to machine learning and am trying to solve the housing prices problem from a Kaggle competition. When I run this code to fit the model, it outputs an error. Please help and explain, as I am a novice. Thanks in advance.
I tried searching on Google, but the results are about a multiclass error, which I don't understand, and they suggest "mlogloss" or "merror" as the solution.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from learntools.core import *
from xgboost import XGBRegressor
iowa_file_path = '../input/train.csv'
home_data = pd.read_csv(iowa_file_path)
y = home_data.SalePrice
features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath',
'BedroomAbvGr', 'TotRmsAbvGrd']
X = home_data[features]
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
iowa_model = XGBRegressor(n_estimators=1000,learning_rate=0.05)
iowa_model.fit(train_X, train_y,early_stopping_rounds=5,eval_set=
[(train_X,val_y)],verbose=False)
You have a typo in eval_set: it pairs train_X with val_y, which have different numbers of rows. Try:
iowa_model.fit(train_X, train_y,early_stopping_rounds=5,eval_set= [(val_X,val_y)],verbose=False)
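The size mismatch behind the "Check failed" message can be seen without XGBoost at all. Below is a minimal sketch, assuming synthetic data with 100 rows in place of the Kaggle file; rows_match is a hypothetical helper, not part of XGBoost or scikit-learn.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data: 100 rows, 2 features.
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# Default split: 75% of the rows for training, 25% for validation.
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)


def rows_match(eval_set):
    """For each (features, labels) pair, report whether the row counts agree."""
    return [len(features) == len(labels) for features, labels in eval_set]


print(len(train_X), len(val_X), len(val_y))
print(rows_match([(train_X, val_y)]))  # the pairing from the question
print(rows_match([(val_X, val_y)]))    # the corrected pairing
```

A check like this before calling fit() catches the problem early, since every pair in eval_set needs as many predictions as labels.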
