Finding the mean per class in PyTorch

I am naively iterating over every sample in the dataset. Is there any way to calculate the per-class mean more efficiently?
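import os
import torch
from torchvision import datasets

my_root = '/mini_imagenet_full_size/train/'
dir_list = os.listdir(my_root)
print(len(dir_list))
# data_transform is defined elsewhere (not shown)
miniImagenet_dataset = datasets.ImageFolder(root=my_root, transform=data_transform)
# one accumulator per class: 64 classes of 3x224x224 images
Clmean = torch.zeros([64, 3, 224, 224])
for t, c in miniImagenet_dataset:
    print(c)
    Clmean[c, :, :, :] += t  # running per-class sum; not yet divided by the class count
print(Clmean)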
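One possible way to speed this up is to read the dataset in batches through a DataLoader and accumulate per-class sums with index_add_, dividing by the per-class counts at the end. The sketch below is only an illustration: per_class_mean is a hypothetical helper name, and a small random TensorDataset stands in for the ImageFolder dataset so that the example runs on its own.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def per_class_mean(dataset, num_classes, batch_size=64):
    """Return a (num_classes, *sample_shape) tensor holding the mean sample of each class."""
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=False)
    sums = None
    counts = torch.zeros(num_classes)
    for images, labels in loader:
        if sums is None:
            # Allocate the accumulator once the sample shape is known.
            sums = torch.zeros((num_classes,) + tuple(images.shape[1:]), dtype=images.dtype)
        # Add every image of the batch to the row of its class in one call.
        sums.index_add_(0, labels, images)
        counts += torch.bincount(labels, minlength=num_classes).to(counts.dtype)
    # Avoid dividing by zero for classes that never occur.
    counts = counts.clamp(min=1)
    return sums / counts.view(-1, *([1] * (sums.dim() - 1)))


# Synthetic stand-in data: 10 small "images" spread over 3 classes.
torch.manual_seed(0)
images = torch.rand(10, 3, 4, 4)
labels = torch.tensor([0, 1, 2, 0, 1, 2, 0, 1, 2, 0])
class_means = per_class_mean(TensorDataset(images, labels), num_classes=3, batch_size=4)
print(class_means.shape)
```

With the real dataset, the call would presumably be per_class_mean(miniImagenet_dataset, num_classes=64); passing num_workers to the DataLoader may help further if image decoding is the bottleneck.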

Related

Measuring sentence similarity between 2 sets of questions using Spacy, and then output the pairs with their similarity score in a dataframe

I am trying to compute similarity scores between two sets of questions using spaCy, but I am finding it hard to loop through both lists.
The questions are in German.
This is what I have so far:
alles_frage = [nlp(row) for row in df1["User_Input"]] #loop through customer questions to compare similarity score
#creating object for comparison
kompare_frage = df2.loc[18, "Frage"] #take a question to compare
kompare_frage_vector = nlp(kompare_frage) #vectorise the sentence
#for loop to compare questions and then append similarity score in the empty lists
similar_frage_list = []
similarity_score_list = []
for f in range(len(alles_frage)):
    similar_frage = alles_frage[f].similarity(kompare_frage_vector)
    similar_frage_list.append(similar_frage)
    similarity_score_list.append(f)
similarity_dataframe = pd.DataFrame(list(zip(similarity_score_list, similar_frage_list)), columns = ["similarity_score_list", "similar_frage_list"])
updated_similarity_dataframe = similarity_dataframe.assign(kompare_frage = kompare_frage)
best_similar_frage = df1.iloc[updated_similarity_dataframe["similarity_score_list"]]
full_matrix = pd.concat([best_similar_frage, updated_similarity_dataframe], axis = 1)
updated_full_matrix = full_matrix.loc[:,['kompare_frage','User_Input','similar_frage_list']]
#sort the similarity score and questions
similarity_dataframe_sorted = updated_full_matrix.sort_values (by = "similar_frage_list", ascending = False)
similarity_dataframe_sorted.to_csv('C:/Users/NQ10040659/OneDrive - Telefonica/Desktop/Zinia/Full Text Indexing/similarity_dataframe.csv', mode='a', header=False)
I am running the code above in a Jupyter notebook, and I have to manually change the row index in kompare_frage = df2.loc[18, "Frage"] before each run.
I would like to compare every question in df2["Frage"] with every question in alles_frage, compute a similarity score for each pair, and output the pairs together with their scores in a dataframe that I can save to a CSV file or a database table.
Any help is much appreciated.
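The manual step can be replaced by a loop over both lists at once. The sketch below is a minimal, library-free illustration of that idea: score_all_pairs and toy_similarity are hypothetical names, and toy_similarity uses difflib purely as a stand-in so the example runs without a spaCy model. In the real notebook the similarity would come from nlp(a).similarity(nlp(b)), ideally with each question parsed only once.

```python
import csv
import io
from difflib import SequenceMatcher
from itertools import product


def toy_similarity(a, b):
    # Stand-in for the spaCy similarity call; this is NOT a semantic score.
    return SequenceMatcher(None, a, b).ratio()


def score_all_pairs(reference_questions, user_questions, similarity):
    """Return (reference, user_input, score) rows for every pair, best score first."""
    rows = [
        (ref, user, similarity(ref, user))
        for ref, user in product(reference_questions, user_questions)
    ]
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows


# Made-up example questions, standing in for df2["Frage"] and df1["User_Input"].
reference_questions = ["Wie kann ich meinen Vertrag kündigen?", "Wo finde ich meine Rechnung?"]
user_questions = ["Wo finde ich meine Rechnung?", "Vertrag kündigen"]

rows = score_all_pairs(reference_questions, user_questions, toy_similarity)

# Write the rows as CSV; an in-memory buffer is used here instead of a file on disk.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["kompare_frage", "User_Input", "similarity"])
writer.writerows(rows)
print(buffer.getvalue())
```

The same rows list can be passed to pd.DataFrame(rows, columns=[...]) if a dataframe is preferred over the csv module.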

keras data generator for multi task learning with non image data format

I am working on a multi-task semantic segmentation problem with three decoders, so I need to feed three inputs and produce three outputs. My datasets are not stored as images (.jpg, ...) but as .mat and .npy files. My labels take three values, 0, 1 and 2 (maps with the same shape as my grayscale images). Because the dataset is very large, I am trying to load it with Keras generators. Below is what I have tried based on the Keras documentation for generators, but as far as I can tell the documentation assumes image data and a single-task network. How can I adjust my code so that I can use generators for multiple tasks and multiple (non-image) data formats?
def batch_generator(X_gen, Y_gen, map1_gen, map2_gen):
    while True:
        yield (X_gen.next(), Y_gen.next(), map1_gen.next(), map2_gen.next())
where map1_gen and map2_gen are supposed to be generators for the other two inputs (maps).
train_images_dir = ''
train_masks_dir = ''
train_map1_dir = ''
train_map2_dir = ''
val_images_dir = ''
val_masks_dir = ''
val_map1_dir = ''
val_map2_dir = ''
datagen = ImageDataGenerator()
train_images_generator = datagen.flow_from_directory(train_images_dir,target_size=(Img_Length,Img_Height),batch_size=batch_size,class_mode=None)
train_mask_generator = datagen.flow_from_directory(train_masks_dir,target_size=(Img_Length,Img_Height, num_classes),batch_size=1,class_mode='categorical')
train_map1_generator = datagen.flow_from_directory(train_map1_dir,target_size=(Img_Length,Img_Height),batch_size=batch_size,class_mode=None)
train_map2_generator = datagen.flow_from_directory(train_map2_dir,target_size=(Img_Length,Img_Height),batch_size=batch_size ,class_mode=None)
#val augmentation.
val_images_generator = datagen.flow_from_directory(val_images_dir,target_size=(Img_Length,Img_Height),batch_size=batch_size,class_mode=None)
val_masks_generator = datagen.flow_from_directory(val_masks_dir,target_size=(Img_Length,Img_Height, num_classes),batch_size=1,class_mode='categorical')
val_map1_generator = datagen.flow_from_directory(val_map1_dir,target_size=(Img_Length,Img_Height),batch_size=batch_size,class_mode=None)
val_map2_generator = datagen.flow_from_directory(val_map2_dir,target_size=(Img_Length,Img_Height),batch_size=batch_size,class_mode=None)
model = ...
model.fit_generator(batch_generator(train_images_generator,train_mask_generator, train_map1_generator, train_map2_generator), validation_data=batch_generator(val_images_generator,val_masks_generator, val_map1_generator, val_map2_generator),callbacks=...)
The first decoder is supposed to output a (Img_Length, Img_Height) segmentation map with three labels (0, 1, 2); the map1 and map2 decoders are each supposed to output a (Img_Length, Img_Height) map of linear (continuous) values.
You could try to implement a custom generator and dismiss the ImageDataGenerator completely. E.g.
def batch_generator(batchsize):
    while True:
        inputs1 = []
        inputs2 = []
        inputs3 = []
        outputs1 = []
        outputs2 = []
        outputs3 = []
        for _ in range(batchsize):
            input1 = cv2.imread(img1)  # or whatever loader fits your data
            inputs1.append(input1)
            inputs2.append(...)
            ...
        # you may have to convert the lists into numpy arrays
        yield ([inputs1, inputs2, inputs3], [outputs1, outputs2, outputs3])
Basically, you directly yield a list of all your inputs and a list of all your outputs, each element being one batch.
This means you have to read the files in manually, but I think that makes sense given that some of your data is not in an image format.
You can then pass this generator to model.fit_generator (or just model.fit in TensorFlow 2):
model.fit_generator(batch_generator(batchsize))
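To make the idea above more concrete for .npy data, here is a minimal sketch of such a generator. It assumes a hypothetical layout in which every model input and output has its own list of .npy paths, aligned by sample index; npy_batch_generator and write_fake_arrays are illustrative names, and the arrays are random placeholders written to a temporary directory. For .mat files, np.load would presumably be swapped for scipy.io.loadmat plus the relevant key.

```python
import os
import tempfile

import numpy as np


def npy_batch_generator(input_files, output_files, batch_size):
    """Yield (inputs, outputs) batches forever.

    input_files / output_files: one list of .npy paths per model input / output,
    all aligned so that index i refers to the same sample everywhere.
    """
    n_samples = len(input_files[0])
    while True:
        order = np.random.permutation(n_samples)  # reshuffle every epoch
        for start in range(0, n_samples, batch_size):
            idx = order[start:start + batch_size]
            inputs = tuple(np.stack([np.load(paths[i]) for i in idx]) for paths in input_files)
            outputs = tuple(np.stack([np.load(paths[i]) for i in idx]) for paths in output_files)
            yield inputs, outputs


# Demo on random placeholder arrays saved to a temporary directory.
rng = np.random.default_rng(0)
tmp_dir = tempfile.mkdtemp()


def write_fake_arrays(prefix, n_samples, shape):
    paths = []
    for i in range(n_samples):
        path = os.path.join(tmp_dir, f"{prefix}_{i}.npy")
        np.save(path, rng.random(shape).astype("float32"))
        paths.append(path)
    return paths


input_files = [write_fake_arrays(name, 5, (8, 8, 1)) for name in ("image", "amp", "phase")]
output_files = [write_fake_arrays(name, 5, (8, 8, 1)) for name in ("mask", "map1", "map2")]

batch_inputs, batch_outputs = next(npy_batch_generator(input_files, output_files, batch_size=2))
print([x.shape for x in batch_inputs], [y.shape for y in batch_outputs])
```

Because the generator never stops, steps_per_epoch has to be given when it is passed to the training call. Only one batch is held in memory at a time, which is the point of using a generator for a large dataset.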

efficient way of calculating Monte Carlo results for different impact assessment methods in Brightway

I am trying to do a comparative monte carlo calculation with brightway2 using different impact assessment methods. I thought about using the switch_method method to be more efficient, since the technosphere matrix is the same for a given iteration. However, I am getting an assertion error. A code to reproduce it could be something like this
import brightway2 as bw
bw.projects.set_current('ei35') # project with ecoinvent 3.5
db = bw.Database("ei_35cutoff")
# select two different transport activities to compare
activity_name = 'transport, freight, lorry >32 metric ton, EURO4'
for activity in bw.Database("ei_35cutoff"):
    if activity['name'] == activity_name:
        truckE4 = bw.Database("ei_35cutoff").get(activity['code'])
        print(truckE4['name'])
        break
activity_name = 'transport, freight, lorry >32 metric ton, EURO6'
for activity in bw.Database("ei_35cutoff"):
    if activity['name'] == activity_name:
        truckE6 = bw.Database("ei_35cutoff").get(activity['code'])
        print(truckE6['name'])
        break
demands = [{truckE4: 1}, {truckE6: 1}]
# impact assessment method:
recipe_midpoint = [method for method in bw.methods.keys()
                   if method[0] == "ReCiPe Midpoint (H)"]
mc_mm = bw.MonteCarloLCA(demands[0], recipe_midpoint[0])
next(mc_mm)
If I try switch method I get the assertion error.
mc_mm.switch_method(recipe_midpoint[1])
assert mc_mm.method==recipe_midpoint[1]
mc_mm.redo_lcia()
next(mc_mm)
Am I doing something wrong here?
I usually store characterization factor matrices in a temporary dict and multiply these cfs with the LCI resulting from MonteCarloLCA directly.
import brightway2 as bw
import numpy as np
# Generate objects for analysis
bw.projects.set_current("my_mcs")
my_db = bw.Database('db')
my_act = my_db.random()
my_demand = {my_act:1}
my_methods = [bw.methods.random() for _ in range(2)]
I wrote this simple function to get the characterization factor matrices for the product system that will be generated in the MonteCarloLCA. It uses a temporary "sacrificial LCA" object that has the same A and B matrices as the MonteCarloLCA.
This may seem like a waste of time, but it is only done once, and will make MonteCarlo quicker and simpler.
def get_C_matrices(demand, list_of_methods):
    """ Return a dict with {method tuple:cf_matrix} for a list of methods
    Uses a "sacrificial LCA" with exactly the same demand as will be used
    in the MonteCarloLCA
    """
    C_matrices = {}
    sacrificial_LCA = bw.LCA(demand)
    sacrificial_LCA.lci()
    for method in list_of_methods:
        sacrificial_LCA.switch_method(method)
        C_matrices[method] = sacrificial_LCA.characterization_matrix
    return C_matrices
Then:
# Create array that will store mc results.
# Shape is (number of methods, number of iteration)
my_iterations = 10
mc_scores = np.empty(shape=[len(my_methods), my_iterations])
# Instantiate MonteCarloLCA object
my_mc = bw.MonteCarloLCA(my_demand)
# Get characterization factor matrices
my_C_matrices = get_C_matrices(my_demand, my_methods)
# Generate results
for iteration in range(my_iterations):
    lci = next(my_mc)
    for i, m in enumerate(my_methods):
        mc_scores[i, iteration] = (my_C_matrices[m]*my_mc.inventory).sum()
All your results are in mc_scores. Each row corresponds to a method, each column to an MC iteration.
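The arithmetic behind that loop can be illustrated without Brightway. The toy sketch below assumes that each characterization matrix is diagonal (one factor per biosphere flow) and that the inventory has one row per flow and one column per activity; the method names, factors and random inventories are made up. It uses @ on dense NumPy arrays, whereas the code above relies on * acting as a matrix product for SciPy sparse matrices.

```python
import numpy as np

rng = np.random.default_rng(42)
n_flows, n_activities, n_iterations = 4, 3, 5

# Hypothetical characterization factors, one value per biosphere flow.
cf_vectors = {
    ("method A",): np.array([1.0, 0.0, 2.5, 0.0]),
    ("method B",): np.array([0.0, 3.0, 0.0, 0.5]),
}
# Built once and reused for every iteration, like my_C_matrices above.
C_matrices = {method: np.diag(cf) for method, cf in cf_vectors.items()}

scores = np.empty((len(C_matrices), n_iterations))
inventories = []
for iteration in range(n_iterations):
    # Stand-in for my_mc.inventory after next(my_mc): flows x activities.
    inventory = rng.lognormal(size=(n_flows, n_activities))
    inventories.append(inventory)
    for row, (method, C) in enumerate(C_matrices.items()):
        # The same sampled inventory is characterized by every method.
        scores[row, iteration] = (C @ inventory).sum()

print(scores)
```

Each column of scores comes from a single sampled inventory, so the results of the different methods are directly comparable within an iteration.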
Not very elegant, but try this:
iterations = 10
simulations = []
for _ in range(iterations):
    mc_mm = bw.MonteCarloLCA(demands[0], recipe_midpoint[0])
    next(mc_mm)
    mcresults = []
    for i in demands:
        print(i)
        for m in recipe_midpoint[0:3]:
            mc_mm.switch_method(m)
            print(mc_mm.method)
            mc_mm.redo_lcia(i)
            print(mc_mm.score)
            mcresults.append(mc_mm.score)
    # mcresults holds 3 scores for truck E4 followed by 3 scores for truck E6
    simulations.append(mcresults)
CC_truckE4 = [i[1] for i in simulations] # Climate Change, truck E4
CC_truckE6 = [i[1+3] for i in simulations] # Climate Change, truck E6
from matplotlib import pyplot as plt
plt.plot(CC_truckE4, CC_truckE6, 'o')
If you then run a test that simulates the same demand vector twice, by setting demands = [{truckE4: 1}, {truckE4: 1}], and plot the result, you should get a straight line. This means that you are doing dependent sampling, re-using the same technosphere matrix for each demand vector and for each LCIA method. I am not 100% sure of this, but I hope it answers your question.

How to speed up for loop execution using multiprocessing in python

I have two lists. List A contains 500 words and List B contains 10000 words. I am trying to find, for each word in one list, the most similar word in the other, using spaCy's similarity function.
The problem is that it takes ages to compute. I am new to multiprocessing, so I would appreciate some help.
How do I speed up the execution of the for loop part through multiprocessing in python?
The following is my code.
ListA =['Dell', 'GPU',......] #500 words lists
ListB = ['Docker','Ec2'.......] #10000 words lists
s_words = []
for token1 in ListB:
    list_to_sort = []
    for token2 in ListA:
        list_to_sort.append((token1, token2,nlp(str(token1)).similarity(nlp(str(token2)))))
    sorted_list = sorted(list_to_sort, key = itemgetter(2), reverse=True)[0][:2]
    s_words.append(sorted_list)
You can use the multiprocessing package. I hope this will reduce your run time significantly.
Have you tried nlp.pipe()?
You could do something like this:
from operator import itemgetter
import spacy
nlp = spacy.load("en_core_web_lg")
ListA = ['Apples', 'Monkey'] # 500 words lists
ListB = ['Grapefruit', 'Ape', 'Oranges', 'Banana'] # 10000 words lists
s_words = []
docs_a = nlp.pipe(ListA)
docs_b = list(nlp.pipe(ListB))
for token1 in docs_a:
    list_to_sort = []
    for token2 in docs_b:
        list_to_sort.append((token1.text, token2.text, token1.similarity(token2)))
    sorted_list = sorted(list_to_sort, key=itemgetter(2), reverse=True)[0][:2]
    s_words.append(sorted_list)
print(s_words)
That should already speed things up for you. The function nlp.pipe() also has the parameter n_process which might be what you're looking for.
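A further option, sketched below under assumptions, is to skip the Python-level double loop altogether: once every word has a vector, all pairwise cosine similarities can be computed with a single matrix product. The function best_matches and the tiny two-dimensional vectors are hypothetical. With spaCy, the rows would presumably come from doc.vector of each parsed word, and the resulting cosine scores may differ slightly from what similarity() reports.

```python
import numpy as np


def best_matches(words_a, vectors_a, words_b, vectors_b):
    """For each word in words_b, return (word_b, most similar word in words_a)."""
    # Normalise rows to unit length; the floor avoids division by zero for all-zero vectors.
    unit_a = vectors_a / np.maximum(np.linalg.norm(vectors_a, axis=1, keepdims=True), 1e-12)
    unit_b = vectors_b / np.maximum(np.linalg.norm(vectors_b, axis=1, keepdims=True), 1e-12)
    similarity = unit_b @ unit_a.T  # shape (len(words_b), len(words_a))
    best = similarity.argmax(axis=1)
    return [(word_b, words_a[i]) for word_b, i in zip(words_b, best)]


# Toy vectors standing in for real word embeddings.
words_a = ["apple", "monkey"]
vectors_a = np.array([[1.0, 0.0], [0.0, 1.0]])
words_b = ["grapefruit", "ape"]
vectors_b = np.array([[0.9, 0.1], [0.2, 0.8]])

matches = best_matches(words_a, vectors_a, words_b, vectors_b)
print(matches)
```

For 500 by 10000 words the similarity matrix has five million entries, which fits comfortably in memory.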

How to merge NaiveBayesClassifier objects in NLTK

I am working on a project using the NLTK toolkit. With the hardware I have, I am only able to train the classifier on a small data set. So I divided the data into smaller chunks, trained a classifier on each chunk, and stored all of these individual objects in a pickle file.
Now, for testing, I need a single classifier covering the whole data set to get better results. So my question is: how can I combine these objects into one?
objs = []
while True:
    try:
        f = open(picklename,"rb")
        objs.extend(pickle.load(f))
        f.close()
    except EOFError:
        break
Doing this does not work. And it gives the error TypeError: 'NaiveBayesClassifier' object is not iterable.
NaiveBayesClassifier code :
classifier = nltk.NaiveBayesClassifier.train(training_set)
I am not sure about the exact format of your data, but you can not simply merge different classifiers. The Naive Bayes classifier stores a probability distribution based on the data it was trained on, and you can not merge probability distributions without access to the original data.
If you look at the source code here: http://www.nltk.org/_modules/nltk/classify/naivebayes.html
an instance of the classifier stores:
self._label_probdist = label_probdist
self._feature_probdist = feature_probdist
These are calculated in the train method using relative frequency counts, e.g. P(L1) = (# of L1 in training set) / (# of labels in training set). To combine the two, you would want (# of L1 in Train 1 + Train 2) / (# of labels in Train 1 + Train 2).
However, the naive bayes procedure isn't too hard to implement from scratch, especially if you follow the 'train' source code in the link above. Here is an outline, using the NaiveBayes source code
Store 'FreqDist' objects for each subset of the data for the labels and features.
from collections import defaultdict
from nltk.probability import FreqDist

label_freqdist = FreqDist()
feature_freqdist = defaultdict(FreqDist)
feature_values = defaultdict(set)
fnames = set()
# Count up how many times each feature value occurred, given
# the label and featurename.
for featureset, label in labeled_featuresets:
    label_freqdist[label] += 1
    for fname, fval in featureset.items():
        # Increment freq(fval|label, fname)
        feature_freqdist[label, fname][fval] += 1
        # Record that fname can take the value fval.
        feature_values[fname].add(fval)
        # Keep a list of all feature names.
        fnames.add(fname)
# If a feature didn't have a value given for an instance, then
# we assume that it gets the implicit value 'None.' This loop
# counts up the number of 'missing' feature values for each
# (label,fname) pair, and increments the count of the fval
# 'None' by that amount.
for label in label_freqdist:
    num_samples = label_freqdist[label]
    for fname in fnames:
        count = feature_freqdist[label, fname].N()
        # Only add a None key when necessary, i.e. if there are
        # any samples with feature 'fname' missing.
        if num_samples - count > 0:
            feature_freqdist[label, fname][None] += num_samples - count
            feature_values[fname].add(None)
# Use pickle to store label_freqdist, feature_freqdist,feature_values
# Use pickle to store label_freqdist, feature_freqdist,feature_values
Combine those using their built-in 'add' method. This will allow you to get the relative frequency across all the data.
all_label_freqdist = FreqDist()
all_feature_freqdist = defaultdict(FreqDist)
all_feature_values = defaultdict(set)
for file in train_labels:
    with open(file, "rb") as f:
        all_label_freqdist += pickle.load(f)
# Combine the default dicts for features similarly
Use the 'estimator' to create a probability distribution.
estimator = ELEProbDist()
label_probdist = estimator(all_label_freqdist)
# Create the P(fval|label, fname) distribution
feature_probdist = {}
for ((label, fname), freqdist) in all_feature_freqdist.items():
    probdist = estimator(freqdist, bins=len(all_feature_values[fname]))
    feature_probdist[label, fname] = probdist
classifier = NaiveBayesClassifier(label_probdist, feature_probdist)
The classifier will now combine the counts across all the data and produce what you need.
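To see why the raw counts have to be merged rather than the trained classifiers, here is a small worked example with made-up numbers. It uses collections.Counter from the standard library as a stand-in for FreqDist (which, as far as I know, is a Counter subclass in NLTK 3 and supports the same addition); the chunk names and label counts are hypothetical.

```python
from collections import Counter, defaultdict

# Label counts from two separately processed chunks of training data.
chunk1_labels = Counter({"pos": 30, "neg": 10})
chunk2_labels = Counter({"pos": 20, "neg": 40})

# Merging the counts reproduces what training on all the data would have counted.
merged_labels = chunk1_labels + chunk2_labels
p_pos_merged = merged_labels["pos"] / sum(merged_labels.values())  # 50 / 100

# Averaging the per-chunk probabilities gives a different (wrong) answer,
# because the chunks have different sizes.
p_pos_averaged = (30 / 40 + 20 / 60) / 2

# Feature counts keyed by (label, feature name) are merged the same way.
chunk1_features = {("pos", "contains(great)"): Counter({True: 12, False: 18})}
chunk2_features = {("pos", "contains(great)"): Counter({True: 5, False: 15})}
merged_features = defaultdict(Counter)
for chunk in (chunk1_features, chunk2_features):
    for key, counts in chunk.items():
        merged_features[key] += counts

print(p_pos_merged, round(p_pos_averaged, 4))
print(dict(merged_features))
```

The merged counts are what would then be fed to the estimator step shown above.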
