Error: `data` and `reference` should be factors with the same levels (imbalanced classes, confusion matrix)

I used SMOTE and Tomek-link methods for the imbalanced classes I have, and I'm trying to fit a boosted regression tree. Everything runs smoothly until I create the confusion matrix, where I get this error:
Error: `data` and `reference` should be factors with the same levels.
### SMOTE and Tomek
NOAA_SMOTE <- read.csv("NOAA_SMOTE.csv", TRUE, ",")
train.index <- createDataPartition(NOAA_SMOTE$japon, p = .7, list = FALSE)
train <- NOAA_SMOTE[train.index, ]
test <- NOAA_SMOTE[-train.index, ]
tomek <- ubTomek(train[, -1], train[, 1])
model_train_tomek <- cbind(tomek$X, tomek$Y)
names(model_train_tomek)[1] <- "japon"
removed.index <- tomek$id.rm
train$japon <- as.factor(train$japon)
train_tomek <- train[-removed.index, ]
## SMOTE after Tomek links
traintomeksmote <- SMOTE(japon ~ ., train_tomek, perc.over = 2000, perc.under = 100)
fitControlSmoteTomek <- trainControl(## 10-fold CV
                                     method = "repeatedcv",
                                     number = 10,
                                     repeats = 3,
                                     ## Estimate class probabilities
                                     classProbs = TRUE,
                                     ## Evaluate performance using
                                     ## the following function
                                     summaryFunction = twoClassSummary)
gbmGridSmoteTomek <- expand.grid(interaction.depth = c(3, 4, 5, 6),
                                 n.trees = (1:30) * 50,
                                 shrinkage = c(0.1, 0.001, 0.75, 0.0001),
                                 n.minobsinnode = 10)
gbmFitNOAASMOTETomek <- caret::train(make.names(japon) ~ ., data = traintomeksmote,
                                     method = "gbm",
                                     trControl = fitControlSmoteTomek,
                                     distribution = "bernoulli",
                                     verbose = FALSE,
                                     tuneGrid = gbmGridSmoteTomek,
                                     bag.fraction = 0.5,
                                     ## Specify which metric to optimize
                                     metric = "ROC")
test$japon <- as.factor(test$japon)
PredNOAASMOTETomek <- predict(gbmFitNOAASMOTETomek, newdata = test, type = 'prob')
cmSMOTETomekNOAA <- confusionMatrix(PredNOAASMOTETomek, as.factor(test$japon), mode = "everything")
Part of the data: [screenshot of the data](https://i.stack.imgur.com/jPgI9.png)
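For context, confusionMatrix() compares two factors with identical levels, while predict(..., type = 'prob') returns a data frame of class probabilities, which is what triggers the error. A possible fix (an untested sketch): predict class labels instead, and put the reference on the same levels. Note that make.names(japon) in the training formula relabels the classes, so levels like 0/1 become X0/X1.

## untested sketch: compare class labels on matching factor levels
PredNOAASMOTETomek <- predict(gbmFitNOAASMOTETomek, newdata = test, type = "raw")
reference <- factor(make.names(test$japon), levels = levels(PredNOAASMOTETomek))
cmSMOTETomekNOAA <- confusionMatrix(PredNOAASMOTETomek, reference, mode = "everything")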

Related

How do I extract the x co-ordinate of a point using Python

I'm trying to build an NMF model for topic extraction. To re-train the model, I have to pass a parameter to the NMF function, for which I need the x co-ordinate from a point the algorithm returns. Here is the code for reference:
no_features = 1000
no_topics = 9
print('Old number of topics: ', no_topics)
tfidf_vectorizer = TfidfVectorizer(max_df = 0.95, min_df = 2, max_features = no_features, stop_words = 'english')
tfidf = tfidf_vectorizer.fit_transform(documents)
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
no_topics = tfidf.shape
print('New number of topics :', no_topics)
# nmf = NMF(n_components = no_topics, random_state = 1, alpha = .1, l1_ratio = .5, init = 'nndsvd').fit(tfidf)
On the third-to-last line, tfidf.shape assigns the tuple (3, 1000) to the variable no_topics; however, I want that variable set to only the x co-ordinate, i.e. 3.
How can I extract just the x co-ordinate from the point?
You can select the first value with no_topics[0]:
print('New number of topics : {}'.format(no_topics[0]))
You can also take a slice of tfidf with
topics = tfidf[0,:]
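Since tfidf.shape is an ordinary Python tuple, a minimal alternative sketch is to unpack it into named variables:

shape = (3, 1000)             # what tfidf.shape returns in the question
n_topics, n_features = shape  # tuple unpacking
print(n_topics)               # -> 3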

Select multiple target variables in Keras/TensorFlow

I am trying to follow a Kaggle kernel for a BERT implementation:
https://www.kaggle.com/hiromoon166/save-bert-fine-tuning-model
But I am not able to select the target variables. I have to select multiple target columns as my y variable, since this is a multi-label classification problem.
This is the line of code I am stuck on:
train_lines, train_labels = train_df['comment_text'].values, train_df.target.values

def convert_lines(example, max_seq_length, tokenizer):
    max_seq_length -= 2
    all_tokens = []
    longer = 0
    for i in range(example.shape[0]):
        tokens_a = tokenizer.tokenize(example[i])
        if len(tokens_a) > max_seq_length:
            tokens_a = tokens_a[:max_seq_length]
            longer += 1
        one_token = tokenizer.convert_tokens_to_ids(["[CLS]"] + tokens_a + ["[SEP]"]) + [0] * (max_seq_length - len(tokens_a))
        all_tokens.append(one_token)
    print(longer)
    return np.array(all_tokens)

nb_epochs = 1
bsz = 32
dict_path = os.path.join(BERT_PRETRAINED_DIR, 'vocab.txt')
tokenizer = tokenization.FullTokenizer(vocab_file=dict_path, do_lower_case=True)
print('build tokenizer done')
train_df = pd.read_csv('../input/jigsaw-unintended-bias-in-toxicity-classification/train.csv')
train_df = train_df.sample(frac=0.01, random_state=42)
#train_df['comment_text'] = train_df['comment_text'].replace({r'\s+$': '', r'^\s+': ''}, regex=True).replace(r'\n', ' ', regex=True)
train_lines, train_labels = train_df['comment_text'].values, train_df.target.values
print('sample used', train_lines.shape)
token_input = convert_lines(train_lines, maxlen, tokenizer)
seg_input = np.zeros((token_input.shape[0], maxlen))
mask_input = np.ones((token_input.shape[0], maxlen))
print(token_input.shape)
print(seg_input.shape)
print(mask_input.shape)
print('begin training')
model3.fit([token_input, seg_input, mask_input], train_labels, batch_size=bsz, epochs=nb_epochs)
Please help me understand how to select multiple target variables here.
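One possible approach (a sketch, not taken from the kernel): select several label columns at once with a column list, which yields a 2-D array that Keras accepts as a multi-label y. The auxiliary column names here are assumptions based on the Jigsaw Unintended Bias dataset.

import pandas as pd

# assumed label columns from the Jigsaw Unintended Bias training data
label_cols = ['target', 'severe_toxicity', 'obscene', 'identity_attack', 'insult', 'threat']

train_df = pd.read_csv('../input/jigsaw-unintended-bias-in-toxicity-classification/train.csv')
train_lines = train_df['comment_text'].values   # shape (n_samples,)
train_labels = train_df[label_cols].values      # shape (n_samples, 6)

# the model's output layer would then need len(label_cols) sigmoid units,
# trained with binary_crossentropy, for multi-label classification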

Bokeh – ColumnDataSource not updating whiskered-plot

I’m having issues with the following code (I’ve cut out large pieces, but I can add them back in; these seemed like the important parts). In my main code, I set up a plot (“sectionizePlot”), which is a simple variation on another whiskered plot, and I’m looking to update it on the fly. In the same script, I’m using a heatmap (“ModifiedgenericHeatMap”), which updates fine.
Any ideas how I might update my whiskered plot? Updating the ColumnDataSource doesn’t seem to work (which makes sense). I’m guessing I’m running into issues because I add each circle/point individually onto the plot.
One idea would be to clear the plot each time and manually re-add the points, but I’m unsure how to clear it each time.
Any help would be appreciated. I’m just a lowly scientist trying to utilize Bokeh in Pharma research.
def ModifiedgenericHeatMap(source, maxPct):
    colors = ["#75968f", "#a5bab7", "#c9d9d3", "#e2e2e2", "#dfccce", "#ddb7b1", "#cc7878", "#933b41", "#550b1d"]
    #mapper = LinearColorMapper(palette=colors, low=0, high=data['count'].max())
    mapper = LinearColorMapper(palette=colors, low=0, high=maxPct)
    TOOLS = "hover,save,pan,box_zoom,reset,wheel_zoom"
    globalDist = figure(title="derp",
                        x_range=cols, y_range=list(reversed(rows)),
                        x_axis_location="above", plot_width=1000, plot_height=400,
                        tools=TOOLS, toolbar_location='below')
    globalDist.grid.grid_line_color = None
    globalDist.axis.axis_line_color = None
    globalDist.axis.major_tick_line_color = None
    globalDist.axis.major_label_text_font_size = "5pt"
    globalDist.axis.major_label_standoff = 0
    globalDist.xaxis.major_label_orientation = pi / 3
    globalDist.rect(x="cols", y="rows", width=1, height=1,
                    source=source,
                    fill_color={'field': 'count', 'transform': mapper},
                    line_color=None)
    color_bar = ColorBar(color_mapper=mapper, major_label_text_font_size="5pt",
                         ticker=BasicTicker(desired_num_ticks=len(colors)),
                         # fix this via using a formatter which accounts for
                         formatter=PrintfTickFormatter(format="%d%%"),
                         label_standoff=6, border_line_color=None, location=(0, 0))
    text_props = {"source": source, "text_align": "left", "text_baseline": "middle"}
    x = dodge("cols", -0.4, range=globalDist.x_range)
    r = globalDist.text(x=x, y=dodge("rows", 0.3, range=globalDist.y_range), text="count", **text_props)
    r.glyph.text_font_size = "8pt"
    globalDist.add_layout(color_bar, 'right')
    globalDist.select_one(HoverTool).tooltips = [
        ('Well:', '@rows @cols'),   # '@' is Bokeh's field syntax; '#' would print literally
        ('Count:', '@count'),
    ]
    return globalDist
def sectionizePlot(source, source_error, type, base):
    print("sectionize plot created with type: " + type)
    colors = []
    for x in range(0, len(base)):
        colors.append(getRandomColor())
    title = type + "-wise Intensity Distribution"
    p = figure(plot_width=600, plot_height=300, title=title)
    p.add_layout(
        Whisker(source=source_error, base="base", upper="upper", lower="lower"))
    for i, sec in enumerate(source.data['base']):
        p.circle(x=source_error.data["base"][i], y=sec, color=colors[i])
    p.xaxis.axis_label = type
    p.yaxis.axis_label = "Intensity"
    if type.split()[-1] == "Row":
        print("hit a row")
        conv = dict(enumerate(list("nABCDEFGHIJKLMNOP")))
        conv.pop(0)
        p.xaxis.major_label_overrides = conv
        p.xaxis.ticker = SingleIntervalTicker(interval=1)
    return p
famData = dict()
e1FractSource = ColumnDataSource(dict(count=[], cols=[], rows=[], index=[]))
e1Fract = ModifiedgenericHeatMap(e1FractSource, 100)
rowSectTotSource = ColumnDataSource(data=dict(base=[]))
rowSectTotSource_error = ColumnDataSource(data=dict(base=[], lower=[], upper=[]))
rowSectPlot_tot = sectionizePlot(rowSectTotSource, rowSectTotSource_error, "eSum Row", rowBase)

def update(selected=None):
    global famData
    famData = getFAMData(file_source_dt1, True)
    global e1Stack
    e1Fract = (famData['e1Sub'] / famData['eSum']) * 100
    e1Stack = e1Fract.stack(dropna=False).reset_index()
    e1Stack.columns = ["rows", "cols", "count"]
    e1Stack['count'] = e1Stack['count'].apply(lambda x: round(x, 1))
    e1FractSource.data = dict(cols=e1Stack["cols"], count=(e1Stack["count"]),
                              rows=e1Stack["rows"], index=e1Stack.index.values, codon=wells)
    rowData, colData = sectionize(famData['eSum'], rows, cols)
    rowData_lower, rowData_upper = getLowerUpper(rowData)
    rowBase = list(range(1, 17))
    rowSectTotSource_error.data = dict(base=rowBase, lower=rowData_lower, upper=rowData_upper)
    rowSectTotSource.data = dict(base=rowData)
    rowSectPlot_tot.title.text = "plot changed in update"

layout = column(e1Fract, rowSectPlot_tot)  # the heatmap figure, not its data source
update()
curdoc().add_root(layout)
curdoc().title = "Specs"
print("ok")

Numpy shape not including classes

I want to create a dataset (data.npy) that has the same format as CIFAR-10. The sample Omniglot dataset contains
data = ['0001_01.png', '0001_02.png', '0001_03.png', '0001_04.png', '0002_01.png', '0002_02.png', '0002_03.png']
class_name, filename = data[0].split('_')
Each filename is prefixed with its class. There are 1600 classes and each class has 20 samples. The expected dataset (data.npy) shape is (1600, 20, 784), but the shape I got is (1, 20, 784). Below is the snippet:
classes = []
examples = []
prev = files[0].split('_')[0]
for f in files:
    cur_id = f.split('_')[0]
    cur_pic = misc.imresize(misc.imread('new_data/' + f), [28, 28])
    cur_pic = (np.float32(cur_pic) / 255).flatten()
    if prev == cur_id:
        examples.append(cur_pic)
    else:
        classes.append(examples)
        examples = [cur_pic]
        prev = cur_id
np.save('data', np.asarray(classes))
Any suggestions would be greatly helpful. The above code is taken from https://github.com/zergylord/oneshot
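One thing to check (a hedged sketch, assuming files is sorted so that each class's samples are adjacent): examples is only flushed into classes when the class id changes, so the final class is never appended, and an unsorted listing can collapse classes together. Appending once more after the loop may give the shape you expect; load_flat_image below is a hypothetical stand-in for the imread/imresize/flatten steps.

import numpy as np

files = sorted(files)                 # keep each class's samples adjacent
classes, examples = [], []
prev = files[0].split('_')[0]
for f in files:
    cur_id = f.split('_')[0]
    cur_pic = load_flat_image(f)      # hypothetical helper: read, resize, flatten to 784
    if prev == cur_id:
        examples.append(cur_pic)
    else:
        classes.append(examples)
        examples = [cur_pic]
        prev = cur_id
classes.append(examples)              # without this, the last class is silently dropped
np.save('data', np.asarray(classes))  # -> (1600, 20, 784) if every class has 20 samples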

Is there a workaround for not fusing the observed data into model definition in Pymc3?

Problem definition: consider the "Simpletest" model (from the pymc3 examples), which is similar to the following:
model = Model()
data = np.random.normal(size=(2, 20))
with model:
    x = Normal('x', mu=.5, tau=2. ** -2, shape=(2, 1))
    z = Beta('z', alpha=10, beta=5.5)
    d = Normal('data', mu=x, tau=.75 ** -2, observed=data)
    step = NUTS()
    trace = sample(1000, step)
I'd like to change it so that the model structure stays fixed but the sampling runs for several iterations, each time adding a new data point to the previous (observed) dataset. Since the observed data is embedded inside the model definition, the only way I know to do this is to put the whole model definition inside a loop:
model = Model()
# a set of initial data points
data = getInitPoints((2, 5))
for i in xrange(m):
    with model:
        x = Normal('x', mu=.5, tau=2. ** -2, shape=(2, 1))
        z = Beta('z', alpha=10, beta=5.5)
        d = Normal('data', mu=x, tau=.75 ** -2, observed=data)
        step = NUTS()
        trace = sample(1000, step)
    data = numpy.vstack((data, getnewPoint((2, 1))))
    # use the samples
This may produce unnecessary overhead, especially if the model is large. To avoid repeatedly defining the same model, I wonder if there is a solution that achieves the same results with something similar to the following idea:
with model:
    x = Normal('x', mu=.5, tau=2. ** -2, shape=(2, 1))
    z = Beta('z', alpha=10, beta=5.5)
data = getInitPoints()
for i in xrange(m):
    # only necessary parts are included in the loop
    with model:
        d = Normal('data', mu=x, tau=.75 ** -2, observed=data)
        step = NUTS()
        trace = sample(1000, step)
    data = numpy.vstack((data, getnewPoint()))
or even better:
data = getInitPoints()
dataHandle = magicHandle(data)
with model:
    x = Normal('x', mu=.5, tau=2. ** -2, shape=(2, 1))
    z = Beta('z', alpha=10, beta=5.5)
    d = Normal('data', mu=x, tau=.75 ** -2, observed=dataHandle)
    step = NUTS()
for i in xrange(m):
    with model:
        trace = sample(1000, step)
    dataHandle = numpy.vstack((data, getnewPoint()))
It seems that this is not possible right now, but there is an open issue on this topic, with possible solutions, here: https://github.com/pymc-devs/pymc/issues/10
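That thread suggests keeping the observed data in a Theano shared variable so the model object can be reused. A sketch of that idea, assuming a Theano-backed PyMC3 (get_new_point is a hypothetical stand-in for the question's getnewPoint):

import numpy as np
import pymc3 as pm
import theano

def get_new_point(shape):
    # hypothetical stand-in for the question's getnewPoint helper
    return np.random.normal(size=shape)

data = np.random.normal(size=(2, 5))
data_shared = theano.shared(data)   # plays the role of "magicHandle" above

with pm.Model() as model:           # the model is defined exactly once
    x = pm.Normal('x', mu=.5, tau=2. ** -2, shape=(2, 1))
    d = pm.Normal('data', mu=x, tau=.75 ** -2, observed=data_shared)

for i in range(3):
    with model:
        trace = pm.sample(1000)
    # append a new observation column and push it into the existing model
    data = np.hstack((data, get_new_point((2, 1))))
    data_shared.set_value(data)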
