I have a graph-based PyTorch model and I want to predict the class for 10 graphs.
The data objects (i.e. train_dataset in the code below) look like this:
[Data(x=[10, 5], edge_index=[2, 18], y=[1]), Data(x=[15, 5], edge_index=[2, 28], y=[1]), Data(x=[13, 5], edge_index=[2, 24], y=[1]), Data(x=[18, 5], edge_index=[2, 34], y=[1]), Data(x=[14, 5], edge_index=[2, 26], y=[1]), Data(x=[13, 5], edge_index=[2, 24], y=[1]), Data(x=[15, 5], edge_index=[2, 28], y=[1]), Data(x=[19, 5], edge_index=[2, 36], y=[1]), Data(x=[15, 5], edge_index=[2, 28], y=[1]), Data(x=[27, 5], edge_index=[2, 52], y=[1])]
So I ran this (where model is a model I have built):
predict_dataset = new_dataset[0:10]
for i in predict_dataset:
    prediction = model(i)
    label = torch.argmax(prediction)
    print(prediction)
And my output is:
(tensor(5.5788e-05, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>), tensor(1.))
(tensor(0.0190, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>), tensor(1.))
(tensor(5.0663e-05, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>), tensor(1.))
(tensor(0.0338, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>), tensor(1.))
(tensor(4.7684e-07, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>), tensor(1.))
(tensor(2.9166, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>), tensor(0.))
(tensor(0.1944, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>), tensor(1.))
(tensor(0.0591, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>), tensor(1.))
(tensor(1.9073e-06, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>), tensor(1.))
(tensor(0.0025, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>), tensor(1.))
I'm confused about what the numbers mean. Is the last item in each tuple the predicted class? And then what's the first number?
Thanks; I'm just not sure whether I've predicted properly, so all suggestions and other code examples are appreciated.
I believe that the first tensor is the loss, since it is the output of BinaryCrossEntropyWithLogits, and the second tensor, as you said, is the index of the predicted class. But it would be helpful if you showed the result of print(model) to further understand the model.
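If the model really does return a (loss, predicted class) pair, as those printed tuples suggest, a minimal inference sketch could look like the following. It reuses the question's model and predict_dataset names; the model.eval()/torch.no_grad() part is just the usual way to run inference and is an assumption, not something taken from your code.

import torch

# Assumes model(data) returns a (loss, predicted_label) tuple, as the printed
# output suggests. eval() and no_grad() disable training-time behaviour and
# gradient tracking, which is what you normally want when only predicting.
model.eval()
with torch.no_grad():
    for data in predict_dataset:
        loss, predicted_label = model(data)
        print(f"loss={loss.item():.4f}  predicted class={int(predicted_label)}")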
When I try to run a RandomForestClassifier with Pipeline and param_grid:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([("scaler", StandardScaler()),
                     ("rf", RandomForestClassifier())])

param_grid = {
    'max_depth': [4, 5, 10],
    'max_features': [2, 3],
    'min_samples_leaf': [3, 4, 5],
    'n_estimators': [100, 200, 300]
}

# initialize
grid_pipeline = GridSearchCV(pipeline, param_grid, n_jobs=-1, verbose=1, cv=3, scoring='f1')
# fit
grid_pipeline.fit(X_train, y_train)
grid_pipeline.best_params_
I get the following error:
ValueError: Invalid parameter max_depth for estimator Pipeline(memory=None,
steps=[('scaler',
StandardScaler(copy=True, with_mean=True, with_std=True)),
('rf',
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
class_weight=None, criterion='gini',
max_depth=None, max_features='auto',
max_leaf_nodes=None, max_samples=None,
min_impurity_decrease=0.0,
min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0,
n_estimators=100, n_jobs=None,
oob_score=False, random_state=None,
verbose=0, warm_start=False))],
verbose=False). Check the list of available parameters with `estimator.get_params().keys()`.
Although I have reviewed the scikit-learn documentation and several posts, I can't find the error in my code.
When you use a pipeline with GridSearchCV(), you must prefix each parameter key with the name of the pipeline step it applies to, separated from the parameter name by a double underscore. In your case:
param_grid = {
    'rf__max_depth': [4, 5, 10],
    'rf__max_features': [2, 3],
    'rf__min_samples_leaf': [3, 4, 5],
    'rf__n_estimators': [100, 200, 300]
}
Example from the scikit-learn documentation: https://scikit-learn.org/stable/tutorial/statistical_inference/putting_together.html
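Putting the fix together, a sketch of the corrected search (reusing the pipeline and the X_train/y_train variables from the question) might look like this:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([("scaler", StandardScaler()),
                     ("rf", RandomForestClassifier())])

# Each key is prefixed with the step name "rf" and a double underscore, so
# GridSearchCV routes the value to the RandomForestClassifier inside the pipeline.
param_grid = {
    'rf__max_depth': [4, 5, 10],
    'rf__max_features': [2, 3],
    'rf__min_samples_leaf': [3, 4, 5],
    'rf__n_estimators': [100, 200, 300]
}

grid_pipeline = GridSearchCV(pipeline, param_grid, n_jobs=-1, verbose=1, cv=3, scoring='f1')
grid_pipeline.fit(X_train, y_train)
print(grid_pipeline.best_params_)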
I have applied GridSearchCV with DecisionTreeClassifier, RandomForestClassifier, LogisticRegression and XGBClassifier as estimators, and used all of them in ensemble learning.
The results that GridSearchCV gives with these estimators are different on my system and on my friend's system, even though we use the same training and test data, and I don't know why.
What should be changed so that the searches give the same result on any system?
gs_dt = GridSearchCV(estimator=DecisionTreeClassifier(random_state=42, class_weight={1: 10, 0: 1}),
                     param_grid=[{'max_depth': [2, 4, 6, 8, 10],
                                  'criterion': ['gini', 'entropy'],
                                  'max_features': ['auto', None],
                                  'max_leaf_nodes': [10, 20, 30, 40]}],
                     scoring=scoring,
                     cv=10,
                     refit='recall')

gs_rf = GridSearchCV(estimator=RandomForestClassifier(n_jobs=-1, oob_score=True, class_weight={1: 10/11, 0: 1/11}),
                     param_grid=[{'max_depth': [4, 6, 8, 10, 12, 16, 20, None],
                                  'max_features': ['auto', 'sqrt'],
                                  'min_samples_leaf': [2, 4, 8],
                                  'min_samples_split': [10, 20]}],
                     scoring=scoring,
                     cv=10,
                     n_jobs=4,
                     refit='recall')

gs_lr = GridSearchCV(estimator=LogisticRegression(multi_class='ovr', random_state=42, class_weight={1: 10, 0: 1}),
                     param_grid=[{'C': [0.000001, 0.00001, 0.0001, 0.001, 0.01, 0.1, 1],
                                  'penalty': ['l1', 'l2']}],
                     scoring=scoring,
                     cv=10,
                     refit='recall')

gs_gb = GridSearchCV(estimator=XGBClassifier(n_jobs=-1),
                     param_grid=[{'learning_rate': [0.01, 0.05, 0.1, 0.2],
                                  'max_depth': [4, 6, 8, 10, 12, 16, 20],
                                  'min_samples_leaf': [4, 8, 12, 16, 20],
                                  'max_features': ['auto', 'sqrt']}],
                     scoring=scoring,
                     cv=10,
                     n_jobs=4,
                     refit='recall')
For example, the first GridSearchCV gives this result on my system:
DecisionTreeClassifier(class_weight={1: 10, 0: 1}, criterion='gini',
max_depth=8, max_features=None, max_leaf_nodes=10,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False, random_state=42,
splitter='best')
And on my friend's system it gives:
DecisionTreeClassifier(class_weight={0: 1, 1: 10}, criterion='gini',
max_depth=10, max_features=None, max_leaf_nodes=10,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False,
random_state=42, splitter='best')
Similarly, the other searches give different results on my system and on my friend's system.
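For what it's worth, a common source of this kind of discrepancy is unseeded randomness: the DecisionTreeClassifier and LogisticRegression estimators are given random_state=42, but the RandomForestClassifier and XGBClassifier are not, so their fitted models (and therefore the grid-search winners) can vary between runs and machines. A minimal sketch of seeding them, assuming both machines use the same data and the same library versions:

from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

# Seeded versions of the two unseeded estimators from the searches above;
# these would replace the estimator= arguments of gs_rf and gs_gb.
rf_seeded = RandomForestClassifier(n_jobs=-1, oob_score=True, random_state=42,
                                   class_weight={1: 10/11, 0: 1/11})
xgb_seeded = XGBClassifier(n_jobs=-1, random_state=42)

Even with fixed seeds, different scikit-learn or xgboost versions can still pick different winners, so pinning package versions on both machines is worth checking as well.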
I am trying to make time series predictions using XGBoost (XGBRegressor).
I used GridSearchCV like this:
parameters = {'nthread': [4],
              'objective': ['reg:linear'],
              'learning_rate': [0.01, 0.03, 0.05],
              'max_depth': [3, 4, 5, 6, 7],
              'min_child_weight': [4],
              'silent': [1],
              'subsample': [1],
              'colsample_bytree': [0.7, 0.8],
              'n_estimators': [500]}

xgb_grid = GridSearchCV(xgb, parameters, cv=2, n_jobs=5,
                        verbose=True)

xgb_grid.fit(x_train, y_train,
             eval_set=[(x_train, y_train), (x_test, y_test)],
             early_stopping_rounds=100,
             verbose=True)

print(xgb_grid.best_score_)
print(xgb_grid.best_params_)
And got this:
0.307153826086191
{'colsample_bytree': 0.7, 'learning_rate': 0.03, 'max_depth': 4, 'min_child_weight': 4, 'n_estimators': 500, 'nthread': 4, 'objective': 'reg:linear', 'silent': 1, 'subsample': 1}
I tried using those parameters and calculating the error. I got this:
MSE: 4.579726929529167
MAE: 1.6753722069363144
I know that an error of 1.6 is not very good for predictions; it has to be < 0.9.
I tried to fine-tune the parameters, but I have not managed to reduce the error beyond that.
I found something about the date format; maybe that is the problem? My data looks like this: yyyy-MM-dd HH:mm.
I am new to machine learning, and this is what I managed to do after some examples and tutorials. What should I do to lower the error, or what should I search for to learn?
I should mention that I found various examples like this one, but I didn't understand them, and of course they did not work.
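Regarding the date format: XGBoost cannot split on a raw yyyy-MM-dd HH:mm string, so a common approach is to expand the timestamp into numeric features first. A minimal sketch, assuming the data lives in a pandas DataFrame with a hypothetical 'date' column (the toy frame below stands in for your real loading code):

import pandas as pd

# Toy frame standing in for the real data; only the 'date' column matters here.
df = pd.DataFrame({'date': ['2021-01-01 10:00', '2021-01-01 11:00'],
                   'value': [1.0, 2.0]})

df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d %H:%M')

# Plain numeric columns that XGBRegressor can actually use as features.
df['hour'] = df['date'].dt.hour
df['dayofweek'] = df['date'].dt.dayofweek
df['day'] = df['date'].dt.day
df['month'] = df['date'].dt.month
df['year'] = df['date'].dt.year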
So I ran a very thorough grid search with 10-fold cross-validation in an integrated pipeline, in the following manner:
pipeline_rf = Pipeline([
    ('standardize', MinMaxScaler()),
    ('grid_search_lr', GridSearchCV(
        RandomForestClassifier(),
        param_grid={'bootstrap': [True],
                    'max_depth': [50, 100, 150, 200],
                    'max_features': ['auto', 'sqrt'],
                    'min_samples_leaf': [1, 2, 4],
                    'min_samples_split': [2, 5, 10],
                    'n_estimators': [100, 200, 500, 1000, 1500]},
        cv=10,
        n_jobs=-1,
        scoring='roc_auc',
        verbose=2,
        refit=True
    ))
])

pipeline_rf.fit(X_train, y_train)
How should I go about extracting the best set of parameters?
You first need to get the GridSearchCV object from the pipeline, and then call best_params_ on it. This can be done with:
pipeline_rf.named_steps['grid_search_lr'].best_params_
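The same named_steps lookup also gives you the other standard fitted GridSearchCV attributes, for example:

grid = pipeline_rf.named_steps['grid_search_lr']

print(grid.best_params_)        # best parameter combination found
print(grid.best_score_)         # mean cross-validated roc_auc for that combination
best_rf = grid.best_estimator_  # RandomForestClassifier refitted on all of X_train (refit=True)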
I would like to generate a random 3D array containing random integer coordinates in the interval [0, 100].
So coordinates should have dim (30, 10, 2).
What I have tried:
import random

coordinates = [[random.randint(0, 100), random.randint(0, 100)] for _i in range(30)]
which returns
array([[97, 68],
[11, 23],
[47, 99],
[52, 58],
[95, 60],
[89, 29],
[71, 47],
[80, 52],
[ 7, 83],
[30, 87],
[53, 96],
[70, 33],
[36, 12],
[15, 52],
[30, 76],
[61, 52],
[87, 99],
[19, 74],
[37, 63],
[40, 2],
[ 8, 84],
[70, 32],
[63, 8],
[98, 89],
[27, 12],
[75, 59],
[76, 17],
[27, 12],
[48, 61],
[39, 98]])
of shape (30, 2)
What I'm supposed to get:
dim = (30, 10, 2) rather than (30, 2)
Use the size parameter (and note that the upper bound of np.random.randint is exclusive, so pass 101 to include 100):
import numpy as np
coordinates = np.random.randint(0, 101, size=(30, 10, 2))
will produce a NumPy array with integer values in [0, 100] and of shape (30, 10, 2).
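If you are on a recent NumPy, the same thing can be written with the Generator API, which is the currently recommended interface; endpoint=True makes the upper bound inclusive, so the values again span [0, 100]:

import numpy as np

rng = np.random.default_rng(seed=42)  # seeded only so the example is reproducible
coordinates = rng.integers(0, 100, size=(30, 10, 2), endpoint=True)
print(coordinates.shape)  # (30, 10, 2)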