How to handle an imbalanced dataset in PyTorch? - pytorch

Hi all.
I ran into some problems while handling an imbalanced dataset.
I tried to use WeightedRandomSampler to over-/under-sample, but the result is worse than when I train directly on the imbalanced dataset.
Here are the details of the imbalanced dataset and how I calculate the weights for WeightedRandomSampler.
The number of each class:
Class 0: 1865
Class 1: 2677
Class 2: 3602
Class 3: 916
Class 4: 4354
Class 5: 4061
Class 6: 892
Class 7: 1718
Class 8: 3417
Class 9: 152
My code block:
# One weight per class: the inverse of the class frequency.
weights = list(1.0 / np.array([1865, 2677, 3602, 916, 4354, 4061, 892, 1718, 3417, 152]))
# One weight per sample, looked up by that sample's class label.
sample_weights = [weights[t] for t in train_ds.target]
# Draw len(train_ds) samples per epoch with replacement, following the weights.
sampler = WeightedRandomSampler(sample_weights, len(train_ds), replacement=True)
train_dl = torch.utils.data.DataLoader(dataset=train_ds, batch_size=train_bs, sampler=sampler)
Also, I tried to pass a weight to nn.CrossEntropyLoss(), but got some errors.
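For reference, nn.CrossEntropyLoss expects its weight argument to be a float tensor with one entry per class, on the same device as the logits; passing a plain list or a tensor with the wrong dtype/device is a common source of errors. A minimal sketch of that route (class counts copied from the list above; device, logits, and targets are assumed to be defined elsewhere):
import torch
import torch.nn as nn

counts = torch.tensor([1865, 2677, 3602, 916, 4354, 4061, 892, 1718, 3417, 152],
                      dtype=torch.float)
class_weights = 1.0 / counts                                       # same inverse-frequency idea as the sampler
class_weights = class_weights / class_weights.sum() * len(counts)  # optional rescale around 1

criterion = nn.CrossEntropyLoss(weight=class_weights.to(device))
loss = criterion(logits, targets)  # logits: (N, 10) float, targets: (N,) int64 class indices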
Thanks

Related

Cannot interpret SVM model using Shapash

Currently, I'm exploring machine learning interpretability tools for one of my projects. I found Shapash, quite a new tool, and many people suggest using it to create a few easily interpretable charts for an ML model. When I tried it with RandomForestClassifier it worked fine and generated a webpage full of different charts, but I cannot achieve the same while using SVM (I'm just exploring this library, not focusing on the perfect ML model for a problem).
Note - using Shapash link here
# Fit blackbox model
from sklearn import svm
from sklearn.metrics import f1_score, accuracy_score

svc = svm.SVC()
svc.fit(X_train_smote, y_train_smote)
y_pred = svc.predict(X_test)
print(f"F1 Score {f1_score(y_test, y_pred, average='macro')}")
print(f"Accuracy {accuracy_score(y_test, y_pred)}")

from shapash import SmartExplainer
xpl = SmartExplainer(model=svc)
The error I'm getting:
---------------------------------------------------------------------------
Exception Traceback (most recent call last)
/tmp/ipykernel_13648/1233939729.py in <module>
----> 1 xpl = SmartExplainer(model=svc)
~/Python_AI/ai_env/lib/python3.8/site-packages/shapash/explainer/smart_explainer.py in __init__(self, model, backend, preprocessing, postprocessing, features_groups, features_dict, label_dict, title_story, palette_name, colors_dict, **kwargs)
194 if isinstance(backend, str):
195 backend_cls = get_backend_cls_from_name(backend)
--> 196 self.backend = backend_cls(
197 model=self.model, preprocessing=preprocessing, **kwargs)
198 elif isinstance(backend, BaseBackend):
~/Python_AI/ai_env/lib/python3.8/site-packages/shapash/backend/shap_backend.py in __init__(self, model, preprocessing, explainer_args, explainer_compute_args)
16 self.explainer_args = explainer_args if explainer_args else {}
17 self.explainer_compute_args = explainer_compute_args if explainer_compute_args else {}
---> 18 self.explainer = shap.Explainer(model=model, **self.explainer_args)
19
20 def run_explainer(self, x: pd.DataFrame) -> dict:
~/Python_AI/ai_env/lib/python3.8/site-packages/shap/explainers/_explainer.py in __init__(self, model, masker, link, algorithm, output_names, feature_names, **kwargs)
166 # if we get here then we don't know how to handle what was given to us
167 else:
--> 168 raise Exception("The passed model is not callable and cannot be analyzed directly with the given masker! Model: " + str(model))
169
170 # build the right subclass
Exception: The passed model is not callable and cannot be analyzed directly with the given masker! Model: SVC()
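The exception itself points at a workaround: shap's Explainer cannot introspect an SVC object, but it does accept any callable together with background data. A hedged sketch using shap directly (not Shapash's SmartExplainer; X_train_smote and X_test are the arrays from the question):
import shap

# Wrap the SVC's decision function in a callable that shap can probe.
masker = shap.maskers.Independent(X_train_smote)
explainer = shap.Explainer(svc.decision_function, masker)
shap_values = explainer(X_test)
Alternatively, fitting the SVC with probability=True and passing svc.predict_proba instead of svc.decision_function works the same way.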

Getting class probabilities p(c|x) for each data point using sklearn

How can we get the class probability of each test data point? Some classifiers provide a predict_proba() function that returns the probability of the class each data point belongs to.
But no such function is defined in https://sklearn-lvq.readthedocs.io/en/stable/rslvq.html#
I need to calculate the class probabilities of each class so that a reject option can be applied. The idea is to compute p(c|x) and, if the value is less than a threshold, reject the data point.
You can try this:
Method 1: overriding the predict function as follows
from sklearn_lvq import RslvqModel
from sklearn.utils.validation import check_is_fitted
from sklearn.utils import validation
import numpy as np

class RslvqModel_custom(RslvqModel):
    def predict(self, x):
        """Predict class-membership probabilities for each input sample.

        Parameters
        ----------
        x : array-like, shape = [n_samples, n_features]

        Returns
        -------
        C : array, shape = (n_samples, n_prototypes)
            Normalized cost values, usable as pseudo-probabilities.
        """
        check_is_fitted(self, ['w_', 'c_w_'])
        x = validation.check_array(x)
        if x.shape[1] != self.w_.shape[1]:
            raise ValueError("X has wrong number of features\n"
                             "found=%d\n"
                             "expected=%d" % (x.shape[1], self.w_.shape[1]))

        def foo(e):
            # Evaluate the cost function of sample e against every prototype.
            fun = np.vectorize(lambda w: self._costf(e, w),
                               signature='(n)->()')
            return fun(self.w_)

        predictions = np.vectorize(foo, signature='(n)->(n)')(x)
        # Normalize each row so the values sum to 1.
        total = np.sum(predictions, axis=1).reshape(predictions.shape[0], 1)
        return predictions / total

np.random.seed(1)
nb_ppc = 100
x = np.append(
    np.random.multivariate_normal([0, 0], np.eye(2) / 2, size=nb_ppc),
    np.random.multivariate_normal([5, 0], np.eye(2) / 2, size=nb_ppc), axis=0)
y = np.append(np.zeros(nb_ppc), np.ones(nb_ppc), axis=0)

rslvq = RslvqModel_custom(initial_prototypes=[[5, 0, 0], [0, 0, 1]])  # _custom
model = rslvq.fit(x, y)
predictions = model.predict([[3.67, 6.50], [4.97, 1.49], [1.14, -4.3]])
print('============================================================')
print('Predictions: ', predictions)
print('-------------------------------------------------------------')
OUTPUT:
Predictions: [[0.5977081 0.4022919 ]
[0.945568 0.054432 ]
[0.33533978 0.66466022]]
Method 2: Or you can simply make use of the built-in posterior method:
for data in [[3.67, 6.50], [4.97, 1.49], [1.14, -4.3]]:
    print('Class 0: posterior: ', model.posterior(0, data))
    print('Class 1: posterior: ', model.posterior(1, data))
    print('=' * 100)
OUTPUT:
Class 0: posterior: [[0.5977081]]
Class 1: posterior: [[0.4022919]]
========================================================
Class 0: posterior: [[0.945568]]
Class 1: posterior: [[0.054432]]
========================================================
Class 0: posterior: [[0.33533978]]
Class 1: posterior: [[0.66466022]]
========================================================
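If you want the posteriors as a single (n_samples, n_classes) array, the built-in posterior method can be wrapped into a small predict_proba-style helper. This is a sketch of my own (it assumes the posterior(class, sample) call used above and the binary classes 0 and 1; adjust classes for other problems):
import numpy as np

def predict_proba(model, X, classes=(0, 1)):
    # Stack p(c|x) for every sample and every class into one array.
    X = np.atleast_2d(X)
    return np.array([[float(model.posterior(c, x)) for c in classes] for x in X])

proba = predict_proba(model, [[3.67, 6.50], [4.97, 1.49], [1.14, -4.3]])
# Reject option: refuse to classify when the best posterior is below a threshold.
threshold = 0.6
rejected = proba.max(axis=1) < threshold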

Mask rcnn not working for images with large resolution

I used Mask R-CNN to train on an image set (note: high resolution, e.g. 2400x1920) annotated with the VIA tool, following this reference article: Mask rcnn usage. I edited balloon.py, and the code is as follows:
import os
import sys
import json
import datetime
import numpy as np
import skimage.draw
import skimage.io

# Root directory of the project
ROOT_DIR = os.path.abspath("../../")

# Import Mask RCNN
sys.path.append(ROOT_DIR)  # To find local version of the library
from mrcnn.config import Config
from mrcnn import model as modellib, utils

# Path to trained weights file
COCO_WEIGHTS_PATH = os.path.join(ROOT_DIR, "mask_rcnn_coco.h5")
if COCO_WEIGHTS_PATH is None:
    print('weights not available')
else:
    print('weights available')

DEFAULT_LOGS_DIR = os.path.join(ROOT_DIR, "logs")


# Configurations
class NeuralCodeConfig(Config):
    NAME = "screens"

    # We use a GPU with 12GB memory, which can fit two images.
    # Adjust down if you use a smaller GPU.
    IMAGES_PER_GPU = 1

    # Number of classes (including background)
    NUM_CLASSES = 1 + 10  # Background + other region classes

    # Number of training steps per epoch
    STEPS_PER_EPOCH = 30

    # Skip detections with < 90% confidence
    DETECTION_MIN_CONFIDENCE = 0.9


# Dataset
class NeuralCodeDataset(utils.Dataset):
    def load_screen(self, dataset_dir, subset):
        """Load a subset of the screens dataset.
        dataset_dir: Root directory of the dataset.
        subset: Subset to load: train or val
        """
        # Add classes.
        self.add_class("screens", 1, "logo")
        self.add_class("screens", 2, "slider")
        self.add_class("screens", 3, "navigation")
        self.add_class("screens", 4, "forms")
        self.add_class("screens", 5, "social_media_icons")
        self.add_class("screens", 6, "video")
        self.add_class("screens", 7, "map")
        self.add_class("screens", 8, "pagination")
        self.add_class("screens", 9, "pricing_table_block")
        self.add_class("screens", 10, "gallery")

        # Train or validation dataset?
        assert subset in ["train", "val"]
        dataset_dir = os.path.join(dataset_dir, subset)

        # Load annotations
        # VGG Image Annotator saves each image in the form:
        # { 'filename': '28503151_5b5b7ec140_b.jpg',
        #   'regions': {
        #       '0': {
        #           'region_attributes': {},
        #           'shape_attributes': {
        #               'all_points_x': [...],
        #               'all_points_y': [...],
        #               'name': 'polygon'}},
        #       ... more regions ...
        #   },
        #   'size': 100202
        # }
        # We mostly care about the x and y coordinates of each region
        annotations = json.load(open(os.path.join(dataset_dir, "via_region_data.json")))
        if annotations is None:
            print("region data json not loaded")
        else:
            print("region data json loaded")
        # print(annotations)
        annotations = list(annotations.values())  # don't need the dict keys

        # The VIA tool saves images in the JSON even if they don't have any
        # annotations. Skip unannotated images.
        annotations = [a for a in annotations if a['regions']]

        # Add images
        for a in annotations:
            # Get the x, y coordinates of points of the polygons that make up
            # the outline of each object instance. They are stored in the
            # shape_attributes and region_attributes (see json format above)
            polygons = [r['shape_attributes'] for r in a['regions']]
            screens = [r['region_attributes'] for r in a['regions']]
            # getting the filename by splitting
            class_name = screens[0]['html']
            file_name = a['filename'].split("/")
            file_name = file_name[len(file_name) - 1]
            # getting class_ids with file_name
            class_ids = class_name + "_" + file_name
            # # getting width and height of the images
            # height = [h['height'] for h in polygons]
            # width = [w['width'] for w in polygons]
            # print(height, 'height')
            # print('polygons', polygons)

            # load_mask() needs the image size to convert polygons to masks.
            # Unfortunately, VIA doesn't include it in JSON, so we must read
            # the image. This is only manageable since the dataset is tiny.
            image_path = os.path.join(dataset_dir, file_name)
            image = skimage.io.imread(image_path)
            # resizing images
            # image = utils.resize_image(image, min_dim=800, max_dim=1000, min_scale=None, mode="square")
            # print('image', image)
            height, width = image.shape[:2]
            # print('height', height)
            # print('width', width)
            # height = 800
            # width = 800

            self.add_image(
                "screens",
                image_id=file_name,  # use file name as a unique image id
                path=image_path,
                width=width, height=height,
                polygons=polygons,
                class_ids=class_ids)

    def load_mask(self, image_id):
        """Generate instance masks for an image.
        Returns:
        masks: A bool array of shape [height, width, instance count] with
            one mask per instance.
        class_ids: a 1D array of class IDs of the instance masks.
        """
        # If not a screens dataset image, delegate to parent class.
        image_info = self.image_info[image_id]
        if image_info["source"] != "screens":
            return super(self.__class__, self).load_mask(image_id)

        # Convert polygons to a bitmap mask of shape
        # [height, width, instance_count]
        info = self.image_info[image_id]
        mask = np.zeros([info["height"], info["width"], len(info["polygons"])],
                        dtype=np.uint8)
        for i, p in enumerate(info["polygons"]):
            # Get indexes of pixels inside the polygon and set them to 1
            rr, cc = skimage.draw.polygon(p['y'], p['x'])
            mask[rr, cc, i] = 1

        # Return mask, and array of class IDs of each instance. Since we have
        # one class ID only, we return an array of 1s
        # return mask.astype(np.bool), np.ones([mask.shape[-1]], dtype=np.int32)
        # class_ids = np.array(class_ids, dtype=np.int32)
        # (note: class_ids is not defined in this method's scope)
        return mask, class_ids

    def image_reference(self, image_id):
        """Return the path of the image."""
        info = self.image_info[image_id]
        if info["source"] == "screens":
            return info["path"]
        else:
            super(self.__class__, self).image_reference(image_id)


def train(model):
    # Train the model.
    # Training dataset.
    dataset_train = NeuralCodeDataset()
    dataset_train.load_screen(args.dataset, "train")
    dataset_train.prepare()

    # Validation dataset
    dataset_val = NeuralCodeDataset()
    dataset_val.load_screen(args.dataset, "val")
    dataset_val.prepare()

    # *** This training schedule is an example. Update to your needs ***
    # Since we're using a very small dataset, and starting from
    # COCO trained weights, we don't need to train too long. Also,
    # no need to train all layers, just the heads should do it.
    print("Training network heads")
    model.train(dataset_train, dataset_val,
                learning_rate=config.LEARNING_RATE,
                epochs=30,
                layers='heads')


# Training
if __name__ == '__main__':
    import argparse

    # Parse command line arguments
    parser = argparse.ArgumentParser(
        description='Train Mask R-CNN to detect screens.')
    parser.add_argument("command",
                        metavar="<command>",
                        help="'train' or 'splash'")
    parser.add_argument('--dataset', required=True,
                        metavar="../../datasets/screens",
                        help='Directory of the screens dataset')
    parser.add_argument('--weights', required=True,
                        metavar="/weights.h5",
                        help="Path to weights .h5 file or 'coco'")
    parser.add_argument('--logs', required=False,
                        default=DEFAULT_LOGS_DIR,
                        metavar="../../logs/",
                        help='Logs and checkpoints directory (default=logs/)')
    parser.add_argument('--image', required=False,
                        metavar="path or URL to image",
                        help='Image to apply the color splash effect on')
    parser.add_argument('--video', required=False,
                        metavar="path or URL to video",
                        help='Video to apply the color splash effect on')
    args = parser.parse_args()

    # Validate arguments
    if args.command == "train":
        assert args.dataset, "Argument --dataset is required for training"
    elif args.command == "splash":
        assert args.image or args.video,\
            "Provide --image or --video to apply color splash"

    print("Weights: ", args.weights)
    print("Dataset: ", args.dataset)
    print("Logs: ", args.logs)

    # Configurations
    if args.command == "train":
        config = NeuralCodeConfig()
    else:
        class InferenceConfig(NeuralCodeConfig):
            # Set batch size to 1 since we'll be running inference on
            # one image at a time. Batch size = GPU_COUNT * IMAGES_PER_GPU
            GPU_COUNT = 1
            IMAGES_PER_GPU = 1
        config = InferenceConfig()
    config.display()

    # Create model
    if args.command == "train":
        model = modellib.MaskRCNN(mode="training", config=config,
                                  model_dir=args.logs)
    else:
        model = modellib.MaskRCNN(mode="inference", config=config,
                                  model_dir=args.logs)

    # Select weights file to load
    if args.weights.lower() == "coco":
        weights_path = COCO_WEIGHTS_PATH
        # Download weights file
        if not os.path.exists(weights_path):
            utils.download_trained_weights(weights_path)
    elif args.weights.lower() == "last":
        # Find last trained weights
        weights_path = model.find_last()
    elif args.weights.lower() == "imagenet":
        # Start from ImageNet trained weights
        weights_path = model.get_imagenet_weights()
    else:
        weights_path = args.weights

    # Load weights
    print("Loading weights ", weights_path)
    if args.weights.lower() == "coco":
        # Exclude the last layers because they require a matching
        # number of classes
        model.load_weights(weights_path, by_name=True, exclude=[
            "mrcnn_class_logits", "mrcnn_bbox_fc",
            "mrcnn_bbox", "mrcnn_mask"])
    else:
        model.load_weights(weights_path, by_name=True)

    # Train or evaluate
    if args.command == "train":
        train(model)
    # elif args.command == "splash":
    #     detect_and_color_splash(model, image_path=args.image,
    #                             video_path=args.video)
    else:
        print("'{}' is not recognized. "
              "Use 'train' or 'splash'".format(args.command))
And I am getting the following error when training the dataset starting from the pretrained COCO weights:
UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
"Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
2018-08-09 13:52:27.993239: W tensorflow/core/framework/allocator.cc:108] Allocation of 51380224 exceeds 10% of system memory.
2018-08-09 13:52:28.037704: W tensorflow/core/framework/allocator.cc:108] Allocation of 51380224 exceeds 10% of system memory.
/home/scit/anaconda3/lib/python3.6/site-packages/keras/engine/training.py:2022: UserWarning: Using a generator with `use_multiprocessing=True` and multiple workers may duplicate your data. Please consider using the `keras.utils.Sequence` class.
UserWarning('Using a generator with `use_multiprocessing=True`'
ERROR:root:Error processing image {'id': '487.jpg', 'source': 'screens', 'path': '../../datasets/screens/train/487.jpg', 'width': 1920, 'height': 7007, 'polygons': [{'name': 'rect', 'x': 384, 'y': 5, 'width': 116, 'height': 64}, {'name': 'rect', 'x': 989, 'y': 17, 'width': 516, 'height': 42}, {'name': 'rect', 'x': 984, 'y': 5933, 'width': 565, 'height': 273}, {'name': 'rect', 'x': 837, 'y': 6793, 'width': 238, 'height': 50}], 'class_ids': 'logo_487.jpg'}
Traceback (most recent call last):
File "/home/scit/Desktop/My_work/object_detection/mask_rcnn/mrcnn/model.py", line 1717, in data_generator
use_mini_mask=config.USE_MINI_MASK)
File "/home/scit/Desktop/My_work/object_detection/mask_rcnn/mrcnn/model.py", line 1219, in load_image_gt
mask, class_ids = dataset.load_mask(image_id)
File "neural_code.py", line 235, in load_mask
rr, cc = skimage.draw.polygon(p['y'], p['x'])
File "/home/scit/anaconda3/lib/python3.6/site-packages/skimage/draw/draw.py", line 441, in polygon
return _polygon(r, c, shape)
File "skimage/draw/_draw.pyx", line 217, in skimage.draw._draw._polygon (skimage/draw/_draw.c:4402)
OverflowError: Python int too large to convert to C ssize_t
My laptop graphics specs are as follows:
Nvidia GeForce 830M (2 GB) with 250 CUDA cores
CPU specs:
Intel Core i5 (4th gen), 8 GB RAM
What may be the cause here? Is it the resolution of the images, or is my GPU simply not capable? Should I proceed on the CPU instead?
I am sharing my observations with Mask R-CNN from training my own custom dataset.
My dataset comprises images of various dimensions (the smallest is approx. 1700 x 1600 pixels and the largest approx. 8500 x 4600 pixels).
I am training on an NVIDIA RTX 2080 Ti with 32 GB DDR4 RAM, and while training I get the below-mentioned warnings, but the training process completes.
UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
"Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
2019-05-23 15:25:23.433774: W T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.14GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
A few months back, I tried the Matterport Splash of Color example on my laptop, which has 12 GB RAM and an NVIDIA 920M (2 GB GPU), and encountered similar memory errors.
So we can suspect that the size of the GPU memory is a contributing factor in this error.
Additionally, batch size is another contributing factor, but I see that you have already set IMAGES_PER_GPU = 1. If you search for BATCH_SIZE in the config.py file present in the mrcnn folder, you will find:
self.BATCH_SIZE = self.IMAGES_PER_GPU * self.GPU_COUNT
So in your case the batch size is 1.
In conclusion, I would suggest trying the same code on a more powerful GPU.
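If a more powerful GPU is not an option, one more thing worth trying (my suggestion, relying on the resize options that Matterport's config.py already exposes) is to let mrcnn downscale the inputs itself, so an image such as 2400x1920 never reaches the network at full size:
class NeuralCodeConfig(Config):
    NAME = "screens"
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1  # BATCH_SIZE = IMAGES_PER_GPU * GPU_COUNT = 1
    NUM_CLASSES = 1 + 10
    # Pad/resize every image to at most 1024x1024 before it enters the network.
    IMAGE_RESIZE_MODE = "square"
    IMAGE_MIN_DIM = 512
    IMAGE_MAX_DIM = 1024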

Poor probability results for SVM text classification

I'm fairly new to machine learning technologies and I'm using sklearn and SVC to perform date classification on texts as part of a project, but I'm getting incredibly low probability scores.
I have a corpus of 13 texts, all authored at different dates ranging from 598 to 1358 (which I use as classes), stored in train and test file directories. I use a CountVectorizer and TfidfTransformer to prepare the data and pickle the results for later use:
src = "../data/datasets/pickledWordLists/"
corpus = []
for filename in os.listdir(src):
with (open(os.path.join(src, filename), "rb")) as openfile:
while True:
try:
text = pickle.load(openfile)
text = ' '.join(word for word in text)
corpus.append(text)
except EOFError:
break
src = "../data/datasets/TestsPickled/"
for filename in os.listdir(src):
with (open(os.path.join(src, filename), "rb")) as openfile:
while True:
try:
text = pickle.load(openfile)
text = ' '.join(word for word in text)
corpus.append(text)
except EOFError:
break
vectorizer = CountVectorizer()
vectorizer.fit(corpus)
vector = vectorizer.transform(corpus)
tfidf_transformer = TfidfTransformer()
vector = tfidf_transformer.fit_transform(vector)
with open('../data/datasets/labeledData/training/train_matrices/corpus.pickle', 'wb') as handle:
pickle.dump(vector, handle)
After this I load it back in:
with open('../data/datasets/labeledData/training/train_matrices/corpus.pickle', 'rb') as handle:
    train_sparse = pickle.load(handle)

train_set = np.array(train_sparse.toarray()[0:9])    # rows 0-8: the 9 training texts
test_set = np.array(train_sparse.toarray()[10:14])   # rows 10-13 (note row 9 goes unused)
labels = [1028, 1107, 1358, 598, 707, 875, 884, 890, 988]
I then fit a SVC with the training data and class labels:
clf = SVC(kernel='linear', C=100, cache_size=300, class_weight='balanced', coef0=10.0,
          decision_function_shape='ovo', degree=10, gamma='auto',  # coef0/degree only affect poly/sigmoid kernels
          max_iter=-1, probability=True, random_state=None, shrinking=False,
          tol=1, verbose=False)  # tol=1 is a very loose stopping tolerance (default is 1e-3)
clf.fit(train_set, labels)
Experimenting with the fitted SVC, and to get some confidence, I test it on one of the examples it was actually trained on (clf.predict([train_sparse.toarray()[6]]), class 884), expecting a probability score close to 1.0 for that class. Instead, I get a very poor result with an incorrect classification.
Actual Class: 884
Predicted Class: [988]
Class: 1028 probability: 0.13680521863292022 %
Class: 1107 probability: 0.1372151835630488 %
Class: 1358 probability: 0.09314753496099398 %
Class: 598 probability: 0.11216304253012621 %
Class: 707 probability: 0.07705449472997644 %
Class: 875 probability: 0.07702437742491991 %
Class: 884 probability: 0.11694844959109225 %
Class: 890 probability: 0.12739816653603753 %
Class: 988 probability: 0.1222435320308847 %
Other attempts produce similar results:
Actual Class: 890
Predicted Class: [890]
Class: 1028 probability: 0.13682366473176108 %
Class: 1107 probability: 0.1372180833047104 %
Class: 1358 probability: 0.09312179238345174 %
Class: 598 probability: 0.11286567433780788 %
Class: 707 probability: 0.07636519076484871 %
Class: 875 probability: 0.07712682059152805 %
Class: 884 probability: 0.1169118710112273 %
Class: 890 probability: 0.12735823672708854 %
Class: 988 probability: 0.12220866614757639 %
Is there anything I can do to get these probability scores up (or down), rather than sitting around the 10%-12% mark for everything I try? I tested this on a separate English-language corpus of 9 texts, all about the same size, ranging in date from 900 to 1600, and it gave me very similar scores. I need a probability score because part of my project is to see whether a text can be roughly dated based on a range of class similarity scores from various dates.
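One likely culprit, offered as a hedged aside: with probability=True, SVC calibrates its probabilities by Platt scaling fitted with internal cross-validation on the training data, and with a single document per class (9 training rows, 9 distinct labels) that calibration has nothing to work with, which is consistent with the near-uniform scores above. The sklearn docs also note that predict_proba may be inconsistent with predict, which matches the correct predicted class coexisting with flat probabilities. A quick check (assuming clf and test_set as above):
import numpy as np

proba = clf.predict_proba(test_set)  # calibrated with Platt scaling via internal CV
# predict() is NOT derived from predict_proba(), so the two can disagree,
# especially on tiny datasets; compare them directly:
print(clf.predict(test_set))
print(clf.classes_[np.argmax(proba, axis=1)])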

Set thresholds in PySpark multinomial logistic regression

I would like to perform a multinomial logistic regression but I can't set threshold and thresholds parameters correctly. Consider the following DF:
from pyspark.ml.linalg import DenseVector
test_train_df = (
sqlc
.createDataFrame([(0, DenseVector([-1.0, 1.2, 0.7])),
(0, DenseVector([3.1, -2.0, -2.9])),
(1, DenseVector([1.0, 0.8, 0.3])),
(1, DenseVector([4.2, 1.4, -1.7])),
(0, DenseVector([-1.9, 2.5, -2.3])),
(2, DenseVector([2.6, -0.2, 0.2])),
(1, DenseVector([0.3, -3.4, 1.8])),
(2, DenseVector([-1.0, -3.5, 4.7]))],
['label', 'features'])
)
My label has 3 classes, so I have to set thresholds (plural, whose default is None) rather than threshold (singular, whose default is 0.5). Then I write:
from pyspark.ml import classification as cl
test_logit_abst = (
cl.LogisticRegression()
.setFamily('multinomial')
.setThresholds([.5, .5, .5])
)
Then I would like to fit the model on my DF:
test_logit = test_logit_abst.fit(test_train_df)
but when executing this last command I get an error:
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
~/anaconda3/lib/python3.6/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
~/anaconda3/lib/python3.6/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
318 "An error occurred while calling {0}{1}{2}.\n".
--> 319 format(target_id, ".", name), value)
320 else:
Py4JJavaError: An error occurred while calling o3769.fit.
: java.lang.IllegalArgumentException: requirement failed: Logistic Regression found inconsistent values for threshold and thresholds. Param threshold is set (0.5), indicating binary classification, but Param thresholds is set with length 3. Clear one Param value to fix this problem.
During handling of the above exception, another exception occurred:
IllegalArgumentException Traceback (most recent call last)
<ipython-input-211-8f3443f41b6b> in <module>()
----> 1 test_logit = test_logit_abst.fit(test_train_df)
~/anaconda3/lib/python3.6/site-packages/pyspark/ml/base.py in fit(self, dataset, params)
62 return self.copy(params)._fit(dataset)
63 else:
---> 64 return self._fit(dataset)
65 else:
66 raise ValueError("Params must be either a param map or a list/tuple of param maps, "
~/anaconda3/lib/python3.6/site-packages/pyspark/ml/wrapper.py in _fit(self, dataset)
263
264 def _fit(self, dataset):
--> 265 java_model = self._fit_java(dataset)
266 return self._create_model(java_model)
267
~/anaconda3/lib/python3.6/site-packages/pyspark/ml/wrapper.py in _fit_java(self, dataset)
260 """
261 self._transfer_params_to_java()
--> 262 return self._java_obj.fit(dataset._jdf)
263
264 def _fit(self, dataset):
~/anaconda3/lib/python3.6/site-packages/py4j/java_gateway.py in __call__(self, *args)
1131 answer = self.gateway_client.send_command(command)
1132 return_value = get_return_value(
-> 1133 answer, self.gateway_client, self.target_id, self.name)
1134
1135 for temp_arg in temp_args:
~/anaconda3/lib/python3.6/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
77 raise QueryExecutionException(s.split(': ', 1)[1], stackTrace)
78 if s.startswith('java.lang.IllegalArgumentException: '):
---> 79 raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
80 raise
81 return deco
IllegalArgumentException: 'requirement failed: Logistic Regression found inconsistent values for threshold and thresholds. Param threshold is set (0.5), indicating binary classification, but Param thresholds is set with length 3. Clear one Param value to fix this problem.'
The error says threshold is set. This looks strange, as the documentation says that setting thresholds (plural) clears threshold (singular), so the value of 0.5 should have been cleared.
So, how to clear threshold since no clearThreshold() exists?
In order to achieve this I tried to clear threshold this way:
test_logit_abst = (
    cl.LogisticRegression()
    .setFamily('multinomial')
    .setThresholds([.5, .5, .5])
    .setThreshold(None)
)
test_logit = test_logit_abst.fit(test_train_df)
This time the fit command works, and I even obtain the model intercept and coefficients:
test_logit.interceptVector
DenseVector([65.6445, 31.6369, -97.2814])
test_logit.coefficientMatrix
DenseMatrix(3, 3, [-76.4534, -19.4797, -79.4949, 12.3659, 4.642, 4.1057, 64.0876, 14.8377, 75.3892], 1)
But if I try to get thresholds (plural) from test_logit_abst I get an error:
test_logit_abst.getThresholds()
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-214-fc1c8617ce80> in <module>()
----> 1 test_logit_abst.getThresholds()
~/anaconda3/lib/python3.6/site-packages/pyspark/ml/classification.py in getThresholds(self)
363 if not self.isSet(self.thresholds) and self.isSet(self.threshold):
364 t = self.getOrDefault(self.threshold)
--> 365 return [1.0-t, t]
366 else:
367 return self.getOrDefault(self.thresholds)
TypeError: unsupported operand type(s) for -: 'float' and 'NoneType'
What does this mean?
As a further detail, curiously (and incomprehensibly to me), inverting the order of the parameter settings produces the first error I posted above:
logit_abst = (
cl.LogisticRegression()
.setFamily('multinomial')
.setThreshold(None)
.setThresholds([.5, .5, .5])
)
Why does changing the order of the "set" instructions change the output as well?
It is a messy situation indeed...
The short answer is:
1. setThresholds (plural) not clearing the threshold (singular) seems to be a bug.
2. For multinomial classification (i.e. number of classes > 2), setThresholds does not do what you expect (and arguably you don't need it).
3. If all you need is having some "thresholds" at the "default" value of 0.5, you don't have a problem - simply don't use any relevant argument or setThresholds statement.
4. If you really need to apply different decision thresholds to different classes in multinomial classification, you will have to do it manually, by post-processing the probability column in the transformed dataframe (a sketch follows this list); setThreshold(s) does work OK for binary classification, though.
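For concreteness, here is a minimal sketch of such post-processing (my own illustration, not from the original answer; the per-class cutoffs are hypothetical, and mlorModel/mdf are the multinomial model and data fitted in the long answer below). It applies Spark's documented rule for thresholds, namely predicting the class that maximizes p/t:
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

custom_t = [0.5, 0.2, 0.8]  # hypothetical per-class thresholds

def pick(probability):
    # Spark's documented rule for `thresholds`: predict argmax of p[i] / t[i].
    ratios = [p / t for p, t in zip(probability.toArray(), custom_t)]
    return float(ratios.index(max(ratios)))

pick_udf = udf(pick, DoubleType())

preds = (mlorModel.transform(mdf)
         .withColumn('custom_prediction', pick_udf('probability')))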
And now for the long answer...
Let's start with binary classification, adapting the toy data from the docs:
spark.version
# u'2.2.0'
from pyspark.ml.classification import LogisticRegression
from pyspark.sql import Row
from pyspark.ml.linalg import Vectors
bdf = sc.parallelize([
    Row(label=1.0, features=Vectors.dense(0.0, 5.0)),
    Row(label=0.0, features=Vectors.dense(1.0, 2.0)),
    Row(label=1.0, features=Vectors.dense(2.0, 1.0)),
    Row(label=0.0, features=Vectors.dense(3.0, 3.0))]).toDF()

blor = LogisticRegression(threshold=0.7, thresholds=[0.3, 0.7])
We don't need to set thresholds (plural) here - threshold=0.7 is enough, but it will be useful when illustrating the differences with setThreshold below.
blorModel = blor.fit(bdf) # works OK
blor.getThreshold()
# 0.7
blor.getThresholds()
# [0.3, 0.7]
blorModel.transform(bdf).show(truncate=False) # transform the training data
Here is the result:
+---------+-----+------------------------------------------+----------------------------------------+----------+
|features |label|rawPrediction |probability |prediction|
+---------+-----+------------------------------------------+----------------------------------------+----------+
|[0.0,5.0]|1.0 |[-1.138455151184087,1.138455151184087] |[0.242604109995602,0.757395890004398] |1.0 |
|[1.0,2.0]|0.0 |[-0.6056346859838877,0.6056346859838877] |[0.35305562698104337,0.6469443730189567]|0.0 |
|[2.0,1.0]|1.0 |[0.26586039040308496,-0.26586039040308496]|[0.5660763559614698,0.4339236440385302] |0.0 |
|[3.0,3.0]|0.0 |[1.6453673835702176,-1.6453673835702176] |[0.8382639556951765,0.16173604430482344]|0.0 |
+---------+-----+------------------------------------------+----------------------------------------+----------+
What is the meaning of thresholds=[0.3, 0.7]? The answer lies in the 2nd row, where the prediction is 0.0 despite the fact that the probability is higher for 1.0 (0.65): 0.65 is indeed higher than 0.35, but it is lower than the threshold we have set for this class (0.7), hence it is not classified as such.
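In code terms, Spark's rule for thresholds amounts to predicting the class that maximizes p/t; a quick plain-Python sanity check of that 2nd row (values copied from the table above):
p = [0.35305562698104337, 0.6469443730189567]  # probability vector, 2nd row
t = [0.3, 0.7]                                 # thresholds
ratios = [pi / ti for pi, ti in zip(p, t)]     # [1.18, 0.92]
prediction = ratios.index(max(ratios))         # -> 0, matching prediction=0.0 above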
Let's now try the seemingly identical operation, but with setThreshold(s) instead:
blor2 = (LogisticRegression()
.setThreshold(0.7)
.setThresholds([0.3, 0.7]) ) # works OK
blorModel2 = blor2.fit(bdf)
[...]
IllegalArgumentException: u'requirement failed: Logistic Regression getThreshold found inconsistent values for threshold (0.5) and thresholds (equivalent to 0.7)'
Nice, eh?
setThresholds (plural) seems indeed to have cleared our value of threshold (0.7) set in the previous line, as claimed in the docs, but it seemingly did so only to restore it to its default value of 0.5...
Omitting .setThreshold(0.7) gives the first error you report yourself (not shown).
Inverting the order of the parameter settings resolves the issue (!!!) and, moreover, renders both getThreshold (singular) and getThresholds (plural) operational (in contrast with your case):
blor2 = (LogisticRegression()
.setThresholds([0.3, 0.7])
.setThreshold(0.7) )
blorModel2 = blor2.fit(bdf) # works OK
blor2.getThreshold()
# 0.7
blor2.getThresholds()
# [0.30000000000000004, 0.7]
Let's move now to the multinomial case; we'll stick again to the example in the docs, with data from the Spark Github repo (they should also be available locally, in your $SPARK_HOME/data/mllib/sample_multiclass_classification_data.txt, but I am working on a Databricks notebook); it is a 3-class case, with labels in {0.0, 1.0, 2.0}.
data_path ="/FileStore/tables/sample_multiclass_classification_data.txt"
mdf = spark.read.format("libsvm").load(data_path)
As in the binary case above, where the elements of our thresholds (plural) sum to 1, let's ask for a threshold of 0.8 for class 2:
mlor = (LogisticRegression()
.setFamily("multinomial")
.setThresholds([0, 0.2, 0.8])
.setThreshold(0.8) )
mlorModel= mlor.fit(mdf) # works OK
mlor.getThreshold()
# 0.8
mlor.getThresholds()
# [0.19999999999999996, 0.8]
Looks fine, but let's ask for a prediction in the (training) dataset:
mlorModel.transform(mdf).show(truncate=False)
I have singled out only one row - it should be the 2nd from the end of the full output:
+-----+----------------------------------------------------+---------------------------------------------------------+---------------------------------------------------------------+----------+
|label|features |rawPrediction |probability |prediction|
+-----+----------------------------------------------------+---------------------------------------------------------+---------------------------------------------------------------+----------+
[...]
|0.0 |(4,[0,1,2,3],[0.111111,-0.333333,0.38983,0.166667]) |[36.67790353804905,-74.71196613173531,38.034062593686244]|[0.20486526556822454,8.619113376801409E-50,0.7951347344317755] |2.0 |
[...]
+-----+----------------------------------------------------+---------------------------------------------------------+---------------------------------------------------------------+----------+
Scrolling to the right, you'll see that although the probability for class 2.0 here is below the threshold we have set (0.8), the row is still predicted as 2.0 - in contrast with the binary case demonstrated above...
So, what to do? Simply remove all the threshold-related statements; you don't need them - even setFamily is unnecessary, as the algorithm will detect by itself that you have more than 2 classes. This will give identical results with the above:
mlor = LogisticRegression() # works OK - no family, no threshold(s)
To summarize:
(1) In both the binary & multinomial cases, what is actually returned by the algorithm is a vector of probabilities of length equal to the number of classes, with elements summing up to 1.
(2) In the binary case only, Spark allows you to go one step further and not naively select the highest-probability class as the prediction, but apply a user-defined threshold instead; this setting might be useful e.g. in cases with imbalanced data.
(3) This threshold(s) setting actually has no effect in the multinomial case, where Spark will always return as prediction the class with the highest probability.
Despite the mess in the documentation (about which I have argued elsewhere) and the possibility of some bugs, let me say about (3) that this design choice is not unjustifiable; as it has been nicely argued elsewhere (emphasis in the original):
the statistical component of your exercise ends when you output a probability for each class of your new sample. Choosing a threshold beyond which you classify a new observation as 1 vs. 0 is not part of the statistics any more. It is part of the decision component.
Although the above argument was made for the binary case, it fully holds for the multinomial one, too...
