How to merge NaiveBayesClassifier objects in NLTK - nlp

I am working on a project using the NLTK toolkit. With the hardware I have, I can only run the classifier on a small data set. So I divided the data into smaller chunks, ran the classifier on each of them, and stored each resulting classifier object in a pickle file.
Now, for testing, I need the whole thing as a single object to get better results. So my question is: how can I combine these objects into one?
import pickle

objs = []
with open(picklename, "rb") as f:
    while True:
        try:
            objs.extend(pickle.load(f))
        except EOFError:
            break
This does not work; it raises TypeError: 'NaiveBayesClassifier' object is not iterable.
NaiveBayesClassifier code:
classifier = nltk.NaiveBayesClassifier.train(training_set)

I am not sure about the exact format of your data, but you cannot simply merge different classifiers. The Naive Bayes classifier stores a probability distribution based on the data it was trained on, and you cannot merge probability distributions without access to the original data.
If you look at the source code here: http://www.nltk.org/_modules/nltk/classify/naivebayes.html
an instance of the classifier stores:
self._label_probdist = label_probdist
self._feature_probdist = feature_probdist
These are calculated in the train method using relative frequency counts, e.g. P(L1) = (# of L1 in training set) / (# of labels in training set). To combine two training sets, you would want P(L1) = (# of L1 in Train 1 + Train 2) / (# of labels in Train 1 + Train 2).
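For instance, a minimal sketch of merging label counts with FreqDist (the numbers here are made up purely for illustration):
from nltk.probability import FreqDist

# Hypothetical label counts from two training chunks
fd1 = FreqDist({'pos': 30, 'neg': 70})  # chunk 1: P(pos) = 30/100
fd2 = FreqDist({'pos': 50, 'neg': 50})  # chunk 2: P(pos) = 50/100

fd1 += fd2                   # merge the raw counts
print(fd1['pos'] / fd1.N())  # (30 + 50) / (100 + 100) = 0.4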
However, the Naive Bayes procedure isn't too hard to implement from scratch, especially if you follow the train source code in the link above. Here is an outline, using the NaiveBayesClassifier source code.
Store 'FreqDist' objects for each subset of the data for the labels and features.
from collections import defaultdict
from nltk.probability import FreqDist

label_freqdist = FreqDist()
feature_freqdist = defaultdict(FreqDist)
feature_values = defaultdict(set)
fnames = set()

# Count up how many times each feature value occurred, given
# the label and featurename.
for featureset, label in labeled_featuresets:
    label_freqdist[label] += 1
    for fname, fval in featureset.items():
        # Increment freq(fval|label, fname)
        feature_freqdist[label, fname][fval] += 1
        # Record that fname can take the value fval.
        feature_values[fname].add(fval)
        # Keep a list of all feature names.
        fnames.add(fname)

# If a feature didn't have a value given for an instance, then
# we assume that it gets the implicit value 'None'. This loop
# counts up the number of 'missing' feature values for each
# (label, fname) pair, and increments the count of the fval
# 'None' by that amount.
for label in label_freqdist:
    num_samples = label_freqdist[label]
    for fname in fnames:
        count = feature_freqdist[label, fname].N()
        # Only add a None key when necessary, i.e. if there are
        # any samples with feature 'fname' missing.
        if num_samples - count > 0:
            feature_freqdist[label, fname][None] += num_samples - count
            feature_values[fname].add(None)

# Use pickle to store label_freqdist, feature_freqdist, feature_values
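For example, a minimal sketch of storing the three count structures per chunk (the file name is illustrative):
import pickle

with open("counts_chunk1.pickle", "wb") as f:
    pickle.dump((label_freqdist, dict(feature_freqdist), dict(feature_values)), f)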
Combine those using their built-in addition support (FreqDist objects can be added together). This will give you the relative frequency across all of the data.
all_label_freqdist = FreqDist()
all_feature_freqdist = defaultdict(FreqDist)
all_feature_values = defaultdict(set)

for file in train_labels:
    with open(file, "rb") as f:
        all_label_freqdist += pickle.load(f)
# Combine the defaultdicts for the features similarly
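If each chunk's pickle holds the full (label_freqdist, feature_freqdist, feature_values) triple, the combination could look like this sketch (train_files is a hypothetical list of those pickle files):
for file in train_files:
    with open(file, "rb") as f:
        label_fd, feature_fd, feature_vals = pickle.load(f)
    all_label_freqdist += label_fd
    for key, fd in feature_fd.items():
        # key is a (label, fname) tuple
        all_feature_freqdist[key] += fd
    for fname, vals in feature_vals.items():
        all_feature_values[fname] |= vals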
Use the 'estimator' to create a probability distribution.
from nltk import NaiveBayesClassifier
from nltk.probability import ELEProbDist

estimator = ELEProbDist

label_probdist = estimator(all_label_freqdist)

# Create the P(fval|label, fname) distribution
feature_probdist = {}
for ((label, fname), freqdist) in all_feature_freqdist.items():
    probdist = estimator(freqdist, bins=len(all_feature_values[fname]))
    feature_probdist[label, fname] = probdist

classifier = NaiveBayesClassifier(label_probdist, feature_probdist)
The classifier will now combine the counts across all the data and produce what you need.
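The combined classifier can then be used like any classifier returned by NaiveBayesClassifier.train (the feature dict below is purely hypothetical):
print(classifier.classify({'contains(fast)': True}))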

Related

Keras data generator for multi-task learning with non-image data formats

I am working on a multi-task semantic segmentation problem with three decoders, so I need to feed three inputs and have three outputs. Furthermore, my data sets are not in image formats (.jpg, ...) but in .mat and .npy formats. My labels take three values, 0, 1, 2 (maps with the same shape as my grayscale images). With these two things in mind, I am trying to load the data set using Keras generators, as my data set is very large. Below is what I have tried based on the Keras documentation for generators, but to my knowledge the documentation assumes image data and a single-task network. How can I adjust my code so that I can use generators for multiple tasks and multiple (non-image) data formats?
def batch_generator(X_gen, Y_gen, map1_gen, map2_gen):
    while True:
        yield (X_gen.next(), Y_gen.next(), map1_gen.next(), map2_gen.next())
where map1_gen and map2_gen are supposed to be generators for the other two inputs (maps).
train_images_dir = ''
train_masks_dir = ''
train_map1_dir = ''
train_map2_dir = ''
val_images_dir = ''
val_masks_dir = ''
val_map1_dir = ''
val_map2_dir = ''
datagen = ImageDataGenerator()
train_images_generator = datagen.flow_from_directory(train_images_dir,target_size=(Img_Length,Img_Height),batch_size=batch_size,class_mode=None)
train_mask_generator = datagen.flow_from_directory(train_masks_dir,target_size=(Img_Length,Img_Height, num_classes),batch_size=1,class_mode='categorical')
train_map1_generator = datagen.flow_from_directory(train_map1_dir,target_size=(Img_Length,Img_Height),batch_size=batch_size,class_mode=None)
train_map2_generator = datagen.flow_from_directory(train_map2_dir,target_size=(Img_Length,Img_Height),batch_size=batch_size ,class_mode=None)
# val augmentation
val_images_generator = datagen.flow_from_directory(val_images_dir,target_size=(Img_Length,Img_Height),batch_size=batch_size,class_mode=None)
val_masks_generator = datagen.flow_from_directory(val_masks_dir,target_size=(Img_Length,Img_Height, num_classes),batch_size=1,class_mode='categorical')
val_map1_generator = datagen.flow_from_directory(val_map1_dir,target_size=(Img_Length,Img_Height),batch_size=batch_size,class_mode=None)
val_map2_generator = datagen.flow_from_directory(val_map2_dir,target_size=(Img_Length,Img_Height),batch_size=batch_size,class_mode=None)
model = ...
model.fit_generator(batch_generator(train_images_generator,train_mask_generator, train_map1_generator, train_map2_generator), validation_data=batch_generator(val_images_generator,val_masks_generator, val_map1_generator, val_map2_generator),callbacks=...)
The output of each decoder is supposed to be a segmentation map of size (Img_Length, Img_Height) with the three labels 0, 1, 2; the map1 and map2 outputs are maps of linear values with size (Img_Length, Img_Height).
You could try to implement a custom generator and drop the ImageDataGenerator completely, e.g.:
def batch_generator(batchsize):
    while True:
        inputs1 = []
        inputs2 = []
        inputs3 = []
        outputs1 = []
        outputs2 = []
        outputs3 = []
        for _ in range(batchsize):
            input1 = cv2.imread(img1)  # or whatever
            inputs1.append(input1)
            inputs2.append(...)
            ...
        # you may have to convert the lists into numpy arrays
        yield ([inputs1, inputs2, inputs3], [outputs1, outputs2, outputs3])
Basically, you directly yield a list of all your inputs and outputs, each of them being a batch. This means you have to read the data in manually, but I think that makes sense considering you have some non-image data types.
You can then pass this generator to model.fit_generator (or just model.fit since TensorFlow 2):
model.fit_generator(batch_generator(batchsize))
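For the .npy inputs specifically, a minimal sketch could look like the following (the file lists, the one-input/three-output layout, and the steps_per_epoch value are assumptions, not from the original post):
import numpy as np

def npy_batch_generator(image_files, mask_files, map1_files, map2_files, batchsize):
    # Yields (inputs, outputs) batches for a model with one input
    # and three decoder outputs.
    while True:
        for start in range(0, len(image_files), batchsize):
            sl = slice(start, start + batchsize)
            images = np.array([np.load(f) for f in image_files[sl]])
            masks = np.array([np.load(f) for f in mask_files[sl]])
            map1 = np.array([np.load(f) for f in map1_files[sl]])
            map2 = np.array([np.load(f) for f in map2_files[sl]])
            yield (images, [masks, map1, map2])

# A plain generator has no length, so steps_per_epoch must be passed explicitly:
# model.fit_generator(npy_batch_generator(...), steps_per_epoch=len(image_files) // batchsize)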

How to update PyTorch model parameters (tensors) after averaging them?

I'm currently working on a distributed federated learning infrastructure and am trying to implement it in PyTorch. For this I also need federated averaging, which averages the parameters retrieved from all the nodes and then passes them to the next training round.
The gathering of the parameters looks like this:
def RPC_get_parameters(data, model):
    """
    Get parameters from nodes
    """
    with torch.no_grad():
        for parameters in model.parameters():
            # store parameters in dict
            return {"parameters": parameters}
The averaging function which happens at the central server looks like this:
# stores results from RPC_get_parameters() in results
results = client.get_results(task_id=task.get("id"))

# averaging of returned parameters
global_sum = 0
global_count = 0
for output in results:
    global_sum += output["parameters"]
    global_count += len(global_sum)

averaged_parameters = global_sum / global_count

new_params = {'averaged_parameters': averaged_parameters}
Now my question is: how do you update all the parameters (tensors) in PyTorch from this? I tried a few things and they usually returned errors like ValueError: can't optimize a non-leaf Tensor when passing new_params to the optimizer where model.parameters() usually goes, e.g. optimizerD = optim.SGD(new_params, lr=0.01, momentum=0.5). So how do I actually update the model so that it uses the averaged parameters?
Thank you!
https://github.com/simontkl/torch-vantage6/blob/fed_avg-w/local_dp/v6-ppsdg-py/master.py
I think the most convenient way to work with parameters (outside the SGD context) is using the state_dict of the model.
from collections import OrderedDict

new_params = OrderedDict()
n = len(clients)  # number of clients
for client_model in clients:
    sd = client_model.state_dict()  # get current parameters of one client
    for k, v in sd.items():
        new_params[k] = new_params.get(k, 0) + v / n
After that new_params is a state_dict (you can load it using .load_state_dict) with the average weights of the clients.
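A short usage sketch (global_model is a hypothetical model instance with the same architecture as the clients):
import torch.optim as optim

# Load the averaged weights into the global model
global_model.load_state_dict(new_params)

# Build the optimizer on the model's own (leaf) parameters; this avoids
# the "can't optimize a non-leaf Tensor" error from the question
optimizerD = optim.SGD(global_model.parameters(), lr=0.01, momentum=0.5)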

Efficient way of calculating Monte Carlo results for different impact assessment methods in Brightway

I am trying to do a comparative Monte Carlo calculation with brightway2 using different impact assessment methods. I thought about using the switch_method method to be more efficient, since the technosphere matrix is the same for a given iteration. However, I am getting an assertion error. Code to reproduce it could look like this:
import brightway2 as bw

bw.projects.set_current('ei35')  # project with ecoinvent 3.5
db = bw.Database("ei_35cutoff")

# select two different transport activities to compare
activity_name = 'transport, freight, lorry >32 metric ton, EURO4'
for activity in bw.Database("ei_35cutoff"):
    if activity['name'] == activity_name:
        truckE4 = bw.Database("ei_35cutoff").get(activity['code'])
        print(truckE4['name'])
        break

activity_name = 'transport, freight, lorry >32 metric ton, EURO6'
for activity in bw.Database("ei_35cutoff"):
    if activity['name'] == activity_name:
        truckE6 = bw.Database("ei_35cutoff").get(activity['code'])
        print(truckE6['name'])
        break

demands = [{truckE4: 1}, {truckE6: 1}]

# impact assessment method:
recipe_midpoint = [method for method in bw.methods.keys()
                   if method[0] == "ReCiPe Midpoint (H)"]

mc_mm = bw.MonteCarloLCA(demands[0], recipe_midpoint[0])
next(mc_mm)
If I try switch_method, I get the assertion error:
mc_mm.switch_method(recipe_midpoint[1])
assert mc_mm.method == recipe_midpoint[1]
mc_mm.redo_lcia()
next(mc_mm)
Am I doing something wrong here?
I usually store characterization factor matrices in a temporary dict and multiply these cfs with the LCI resulting from MonteCarloLCA directly.
import brightway2 as bw
import numpy as np
# Generate objects for analysis
bw.projects.set_current("my_mcs")
my_db = bw.Database('db')
my_act = my_db.random()
my_demand = {my_act:1}
my_methods = [bw.methods.random() for _ in range(2)]
I wrote this simple function to get characterization factor matrices for the product system I will generate in the MonteCarloLCA. It uses a temporary "sacrificial LCA" object that will have the same A and B matrices as the MonteCarloLCA.
This may seem like a waste of time, but it is only done once, and it will make the Monte Carlo iterations quicker and simpler.
def get_C_matrices(demand, list_of_methods):
    """Return a dict with {method tuple: cf_matrix} for a list of methods

    Uses a "sacrificial LCA" with exactly the same demand as will be used
    in the MonteCarloLCA
    """
    C_matrices = {}
    sacrificial_LCA = bw.LCA(demand)
    sacrificial_LCA.lci()
    for method in list_of_methods:
        sacrificial_LCA.switch_method(method)
        C_matrices[method] = sacrificial_LCA.characterization_matrix
    return C_matrices
Then:
# Create array that will store mc results.
# Shape is (number of methods, number of iterations)
my_iterations = 10
mc_scores = np.empty(shape=[len(my_methods), my_iterations])

# Instantiate MonteCarloLCA object
my_mc = bw.MonteCarloLCA(my_demand)

# Get characterization factor matrices
my_C_matrices = get_C_matrices(my_demand, my_methods)

# Generate results
for iteration in range(my_iterations):
    next(my_mc)
    for i, m in enumerate(my_methods):
        mc_scores[i, iteration] = (my_C_matrices[m] * my_mc.inventory).sum()
All your results are in mc_scores. Each row corresponds to a method, each column to an MC iteration.
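From there, per-method summary statistics are straightforward (an illustrative addition):
for i, m in enumerate(my_methods):
    print(m, mc_scores[i].mean(), mc_scores[i].std())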
Not very elegant, but try this:
iterations = 10
simulations = []
for _ in range(iterations):
    mc_mm = MonteCarloLCA(demands[0], recipe_midpoint[0])
    next(mc_mm)
    mcresults = []
    for i in demands:
        print(i)
        for m in recipe_midpoint[0:3]:
            mc_mm.switch_method(m)
            print(mc_mm.method)
            mc_mm.redo_lcia(i)
            print(mc_mm.score)
            mcresults.append(mc_mm.score)
    simulations.append(mcresults)

CC_truckE4 = [i[1] for i in simulations]  # Climate Change, truck E4
CC_truckE6 = [i[1 + 3] for i in simulations]  # Climate Change, truck E6

from matplotlib import pyplot as plt
plt.plot(CC_truckE4, CC_truckE6, 'o')
If you then run a test doing the simulation twice for the same demand vector, by setting demands = [{truckE4: 1}, {truckE4: 1}], and plot the results, you should get a straight line. This means that you are doing dependent sampling, re-using the same technosphere matrix for each demand vector and for each LCIA. I am not 100% sure of this, but I hope it answers your question.

Doc2Vec.infer_vector keeps giving different results every time on a particular trained model

I am trying to follow the official Doc2Vec Gensim tutorial mentioned here - https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-lee.ipynb
I modified the code in line 10 to determine the best matching document for a given query, and every time I run it, I get a completely different result set. My new code in line 10 of the notebook is:
inferred_vector = model.infer_vector(['only', 'you', 'can', 'prevent', 'forest', 'fires'])
sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))
rank = [docid for docid, sim in sims]
print(rank)
Every time I run this piece of code, I get a different set of documents matching the query "only you can prevent forest fires". The differences are stark, and the results just do not seem to match.
Is Doc2Vec not a suitable match for querying and information extraction? Or are there bugs?
Looking into the code: in infer_vector you are using parts of the algorithm that are non-deterministic. Initialization of the word vectors is deterministic (see the code of seeded_vector), but looking further, i.e. at the random sampling of words and the negative sampling (updating only a sample of word vectors per iteration), there are steps that can cause non-deterministic output (thanks @gojomo).
def seeded_vector(self, seed_string):
    """Create one 'random' vector (but deterministic by seed_string)"""
    # Note: built-in hash() may vary by Python version or even (in Py3.x) per launch
    once = random.RandomState(self.hashfxn(seed_string) & 0xffffffff)
    return (once.rand(self.vector_size) - 0.5) / self.vector_size
Set negative=0 to avoid randomization:
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

documents = [list('asdf'), list('asfasf')]
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(documents)]
model = Doc2Vec(documents, vector_size=20, window=5, min_count=1, negative=0, workers=6, epochs=10)

a = list('test sample')
b = list('testtesttest')

for s in (a, b):
    v1 = model.infer_vector(s)
    for i in range(100):
        v2 = model.infer_vector(s)
        assert np.all(v1 == v2), "Failed on %s" % (''.join(s))

Grouping arrays with common classes for classification in CNN

I have a data set with three columns: the first two columns are the features and the third column contains the classes. There are 4 classes; part of the data can be seen here.
The data set is big, say 100,000 rows and 3 columns (two feature columns and one class column), so I am using a moving window of length 50 on the data set before training my deep learning model. So far I have tried two different methods to slice the data set, with no good results, and I am pretty sure my data set is good.
I first used a moving window on my entire data set, resulting in 2000 data samples of 50 rows and 2 columns (2000, 50, 2). As some data samples contain mixed classes, I selected only data samples with a common class and took the average of the classes to assign that particular data sample to a single class. I have not gotten results with this. Here is my code:
import itertools as it
import numpy as np

def moving_window(data_, length, step=1):
    streams = it.tee(data_, length)
    return zip(*[it.islice(stream, i, None, step * length)
                 for stream, i in zip(streams, it.count(step=step))])

data = list(moving_window(data_, 50))
data = np.asarray(data)
for i in data:
    label = np.all(i == i[0, 2], axis=0)
    if label[2] == True:
        X.append(i[:, 0:2])
        Y.append(sum(i[:, 2]) / len(i[:, 2]))
I tried another way: collecting only the features corresponding to a particular class, putting the values into separate lists (4 lists, as I have 4 classes), then using a moving window to slice each list separately and assigning each slice to its class. No good results either. Here is my code:
for i in range(5):
    labels.append(i)
yy = pd.get_dummies(labels)
yy = yy.values
yy = yy.astype(np.float32)

def moving_window(x, length, step=1):
    streams = it.tee(x, length)
    return zip(*[it.islice(stream, i, None, step * length)
                 for stream, i in zip(streams, it.count(step=step))])

x_1 = list(moving_window(x1, 50))
x_1 = np.asarray(x_1)
y_1 = [yy[0]] * len(x_1)
X.append(x_1)
Y.append(y_1)

x_2 = list(moving_window(x2, 50))
x_2 = np.asarray(x_2)
y_2 = [yy[1]] * len(x_2)
X.append(x_2)
Y.append(y_2)

x_3 = list(moving_window(x3, 50))
x_3 = np.asarray(x_3)
y_3 = [yy[2]] * len(x_3)
X.append(x_3)
Y.append(y_3)

x_4 = list(moving_window(x4, 50))
x_4 = np.asarray(x_4)
y_4 = [yy[3]] * len(x_4)
X.append(x_4)
Y.append(y_4)
The architecture of the model I am trying to train works perfectly with another data set, so I think I am missing something in how I process the data. What am I missing in my way of processing the data before I start training? Is there any other way? All the work is done in Python.
I finally managed to train my CNN model and achieved good training, validation and testing accuracy. The only thing I added was normalization of my input data with the following lines:
from sklearn import preprocessing

minmax_scale = preprocessing.MinMaxScaler().fit(x)
X = minmax_scale.transform(x)
The rest remains the same.
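One caveat: when splitting into train/test, fit the scaler on the training split only and reuse it on the test split, so both splits share the same scaling. A sketch with hypothetical x_train/x_test splits:
from sklearn import preprocessing

minmax_scale = preprocessing.MinMaxScaler().fit(x_train)
X_train = minmax_scale.transform(x_train)
X_test = minmax_scale.transform(x_test)  # same scaler, no re-fitting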
