I have been trying to figure out how much time one of my algorithms would need to run.
To do so, I built a simple Python script (I reckon my approach here is naive, as it does not test much):
import time
n=0
x=[]
for k in range(1,10):
    begin = time.time()
    while (n<1E7):
        n+=1
    end = time.time()
    x.append(end-begin)
    print(x)
    n=0
print(x)
The result on my computer is:
[2.755953550338745, 2.234074831008911, 2.719917058944702, 2.4802486896514893, 2.8635189533233643, 2.7834832668304443, 4.048354387283325, 3.454935312271118, 3.3593692779541016]
Without doing any further analysis, I can't help but notice a fairly large variance in these results. Where could it be coming from?
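One way to look at the spread more closely is sketched below. It assumes the variation comes mostly from the environment (other processes, CPU frequency changes) rather than from the loop itself. It uses time.perf_counter(), a monotonic high-resolution clock, instead of time.time(), and resets the counter inside the timed function. The name time_counting_loop is made up for this example.

```python
import time

def time_counting_loop(iterations, repeats):
    """Time the same counting loop `repeats` times and return all durations."""
    timings = []
    for _ in range(repeats):
        begin = time.perf_counter()
        n = 0  # reset the counter for every run
        while n < iterations:
            n += 1
        timings.append(time.perf_counter() - begin)
    return timings

if __name__ == "__main__":
    timings = time_counting_loop(10**6, 9)
    # The minimum is usually the run least disturbed by other activity.
    print("min: %.4f s  max: %.4f s" % (min(timings), max(timings)))
```

Comparing the minimum with the maximum gives a rough idea of how much of each measurement is noise from the machine rather than cost of the loop.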
I was implementing a genetic algorithm with tf.keras, where I manually modify the weights, cross over the genes, and so on. I've found that after a few dozen generations the predictions of all the networks are essentially identical, and after a few more generations they are exactly the same. While trying to google the problem I found this page, which mentions the problem at a conceptual level, but I can't understand how this would happen if I'm manually creating genetic diversity every generation.
def model_mutate(weights,var):
    for i in range(len(weights)):
        for j in range(len(weights[i])):
            if( random.uniform(0,1) < 0.2): # mutate this entry with probability 0.2
                change = np.random.uniform(-var,var,weights[i][j].shape)
                weights[i][j] += change
    return weights
def crossover_brains(parent1, parent2):
    global brains
    weight1 = parent1.get_weights()
    weight2 = parent2.get_weights()
    new_weight1 = weight1
    new_weight2 = weight2
    gene = random.randint(0,len(new_weight1)-1) #we change a random weight
                                                #or set of weights
    new_weight1[gene] = weight2[gene]
    new_weight2[gene] = weight1[gene]
    q=np.asarray([new_weight1,new_weight2],dtype=object)
    return q
def evolve(best_fit1,best_fit2):
    global generation
    global best_brain
    global best_brain2
    mutations=[]
    for i in range(total_brains//2):
        cross_weights=model_crossover(best_fit1,best_fit2)
        mutation1=model_mutate(cross_weights[0],0.5)
        mutation2=model_mutate(cross_weights[1],0.5)
        mutations.append(mutation1)
        mutations.append(mutation2)
    for i in range(total_brains):
        brains[i].set_weights(mutations[i])
    generation+=1
def find_best_fit():
    fitness=np.loadtxt("fitness.txt")
    print(f"fitness average {np.mean(fitness)} in generation {generation}")
    print(f"fitness max is {np.max(fitness)} in generation {generation} ")
    fitness_t.append(np.mean(fitness))
    maxfit1=np.max(fitness)
    best_fit1=np.where(fitness==maxfit1)[0]
    fitness[best_fit1]=0
    maxfit2=np.max(fitness)
    best_fit2=np.where(fitness==maxfit2)[0]
    if len(best_fit1)>1: # this is a band-aid for when several individuals are the same;
                         # that would lead to best_fit(1,2) being an array of indices
        best_fit1=best_fit1[0]
    if len(best_fit2)>1:
        best_fit2=best_fit2[0]
    return int(best_fit1),int(best_fit2)
bf1,bf2=find_best_fit()
evolve(bf1,bf2)
This is the code I'm using to set the modified weights on the existing Keras models (mostly not mine; I don't understand it well enough to have written it myself).
If Keras is working the way I think it is, then I don't see how this could converge to anything that does not maximize fitness. Furthermore, the fitness seems to be decreasing over time.
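One detail in crossover_brains that may be worth checking: new_weight1 = weight1 does not copy the list, it only gives the same list a second name. After new_weight1[gene] = weight2[gene], the next line reads weight1[gene], which by then already holds parent 2's gene. The second child therefore comes out identical to parent 2, and nothing is actually swapped. Whether this explains the convergence is only a guess. Below is a minimal sketch of a swap on independent copies. It uses plain nested lists as stand-ins for the weight arrays, and crossover_weights is a made-up name (copy.deepcopy also copies NumPy arrays, so lists of arrays should behave the same way).

```python
import copy
import random

def crossover_weights(weights1, weights2, rng=random):
    """Swap one randomly chosen gene between independent copies of two weight lists."""
    # Deep copies, so the children share no data with the parents or with each other.
    child1 = copy.deepcopy(weights1)
    child2 = copy.deepcopy(weights2)
    gene = rng.randrange(len(child1))
    # Read from the untouched parents, not from the children being modified.
    child1[gene] = copy.deepcopy(weights2[gene])
    child2[gene] = copy.deepcopy(weights1[gene])
    return child1, child2, gene

if __name__ == "__main__":
    parent1 = [[0.0, 0.0], [0.0]]
    parent2 = [[1.0, 1.0], [1.0]]
    print(crossover_weights(parent1, parent2))
```

Reading the swapped gene from the original parents is the important part: it keeps the two assignments from interfering with each other.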
I am still a beginner with neural networks and NLP.
In this code I'm training skip-gram embeddings on cleaned text (some tweets).
But I do not know whether I am doing it correctly.
Can anyone tell me whether this skip-gram training is correct?
Any help is appreciated.
This is my code:
from nltk import word_tokenize
from gensim.models.phrases import Phrases, Phraser
sent = [row.split() for row in X['clean_text']]
phrases = Phrases(sent, max_vocab_size = 50, progress_per=10000)
bigram = Phraser(phrases)
sentences = bigram[sent]
from gensim.models import Word2Vec
w2v_model = Word2Vec(window=5,
                     size = 300,
                     sg=1)
w2v_model.build_vocab(sentences)
w2v_model.train(sentences, total_examples=w2v_model.corpus_count, epochs=25)
del sentences #to reduce memory usage
def get_mat(model, corpus, size):
    vecs = np.zeros((len(corpus), size))
    n = 0
    for i in corpus.index:
        vecs[i] = np.zeros(size).reshape((1, size))
        for word in str(corpus.iloc[i,0]).split():
            try:
                vecs[i] += model[word]
                #n += 1
            except KeyError:
                continue
    return vecs
X_sg = get_mat(w2v_model, X, 300)
del X
X_sg=pd.DataFrame(X_sg)
X_sg.head()
from sklearn import preprocessing
scale = preprocessing.normalize
X_sg=scale(X_sg)
for i in range(len(X_sg)):
    X_sg[i]+=1 #I did this because some weights were negative! So could not
               #apply LSTM on them later
You haven't mentioned if you've received any errors, or unsatisfactory results, so it's hard to know what kind of help you might need.
Your specific lines of code involving the Word2Vec model are roughly correct: plausibly-useful parameters (if you have a dataset large enough to train 300-dimensional vectors), and the proper steps. So the real proof would be whether your results are acceptable.
Regarding your attempted use of Phrases bigram-creation beforehand:
You should get things generally working and with promising results before adding this extra pre-processing complexity.
The parameter max_vocab_size=50 is seriously misguided and may make the phrases-step pointless. The max_vocab_size is a hard cap on how many words/bigrams are tallied by the class, as a way to cap its memory-usage. (Whenever the number of known words/bigrams hits this cap, many lower-frequency words/bigrams are pruned – in practice, a majority of all words/bigrams each pruning, giving up a lot of accuracy in return for capped memory usage.) The max_vocab_size default in gensim is 40,000,000 – but the default in the Google word2phrase.c source on which gensim's method is based was 500,000,000. With a cap of just 50, it's not really going to learn anything useful: it only ever keeps tallies for whichever 50 or so words/bigrams survive the many prunings.
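To get a feel for how much such a cap throws away, here is a toy model of a capped tally. It is not gensim's code, only a simplified illustration of the pruning behaviour described above: whenever the tally grows past the cap, every entry at or below a rising threshold is dropped. The function name count_with_cap and the sample tokens are made up.

```python
from collections import Counter

def count_with_cap(tokens, max_vocab_size):
    """Tally tokens, pruning low counts whenever the tally exceeds max_vocab_size."""
    vocab = Counter()
    min_reduce = 1  # pruning threshold, raised after every pruning pass
    for token in tokens:
        vocab[token] += 1
        if len(vocab) > max_vocab_size:
            # Drop every entry whose count is at or below the current threshold.
            for key in [k for k, c in vocab.items() if c <= min_reduce]:
                del vocab[key]
            min_reduce += 1
    return vocab

if __name__ == "__main__":
    # One frequent token interleaved with 100 tokens that each occur once.
    tokens = []
    for i in range(100):
        tokens += ["the", "w%d" % i]
    print(count_with_cap(tokens, max_vocab_size=5))
```

With a tiny cap, almost everything except the most frequent tokens is forgotten again and again, which is why useful bigram statistics have little chance to accumulate.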
Regarding your get_mat() function & later DataFrame code, I have no idea what you're trying to do with it, so I can't offer any opinion on it.
I've been searching for a long time but can't find any implementation of music feature extraction techniques (such as spectral centroid, spectral bandwidth, etc.) integrated with Apache Spark. I am working with these feature extraction techniques, and the process takes a lot of time for music. I want to parallelize and accelerate it using Spark. I made some attempts but couldn't get any speed-up. I want to get the arithmetic mean and standard deviation of the spectral centroid. This is what I've done so far.
from pyspark import SparkContext
import librosa
import numpy as np
import time
parts=4
print("Parts: ", parts)
sc = SparkContext('local['+str(parts)+']', 'pyspark tutorial')
def spectral(iterator):
    l=list(iterator)
    cent=librosa.feature.spectral_centroid(np.array(l), hop_length=256)
    ort=np.average(cent)
    std=np.std(cent)
    return (ort, std)
y, sr=librosa.load("classical.00080.au") #This loads the song.
start1=time.time()
normal=librosa.feature.spectral_centroid(np.array(y), hop_length=256) #This is normal technique without spark
end1=time.time()
print("\nOrt: \t", np.average(normal))
print("Std: \t", np.std(normal))
print("Time elapsed: %.5f" % (end1-start1))
#This is where my spark implementation appears.
rdd = sc.parallelize(y)
start2=time.time()
result=rdd.mapPartitions(spectral).collect()
end2=time.time()
result=np.array(result)
total_avg, total_std = 0, 0
for i in range(0, parts*2, 2):
    total_avg += result[i]
    total_std += result[i+1]
spark_avg = total_avg/parts
spark_std = total_std/parts
print("\nOrt:", spark_avg)
print("Std:", spark_std)
print("Time elapsed: %.5f" % (end2-start2))
The output of the program is below.
Ort: 971.8843380584146
Std: 306.75410601230413
Time elapsed: 0.17665
Ort: 971.3152955225721
Std: 207.6510740703993
Time elapsed: 4.58174
So, even though I parallelized the array y (the music signal), I can't speed up the process; it actually takes longer, and I don't understand why. I am new to Spark. I also thought about using a GPU for this but couldn't implement that either. Can anyone help me understand what I am doing wrong?
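Separately from the timing, the two outputs above disagree on the standard deviation (306.75 versus 207.65). Averaging per-partition standard deviations is generally not the same as the standard deviation of the whole data set, because it ignores how far the partition means lie from each other. Here is a minimal sketch, in plain Python with made-up numbers, of merging per-chunk summaries so that the combined result matches a single pass. chunk_stats and combine are hypothetical helper names, and the result is the population standard deviation, which is also what np.std computes by default.

```python
import math
from statistics import fmean, pstdev

def chunk_stats(values):
    """Summarize one chunk as (count, mean, sum of squared deviations from the mean)."""
    mean = fmean(values)
    m2 = sum((v - mean) ** 2 for v in values)
    return len(values), mean, m2

def combine(a, b):
    """Merge two chunk summaries into the summary of the concatenated data."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    # The last term accounts for the distance between the two chunk means.
    m2 = m2_a + m2_b + delta ** 2 * n_a * n_b / n
    return n, mean, m2

if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0]
    n, mean, m2 = combine(chunk_stats(data[:4]), chunk_stats(data[4:]))
    print(mean, math.sqrt(m2 / n), pstdev(data))
```

In Spark, a function like combine could presumably be passed to rdd.reduce after mapPartitions has produced one summary per partition (each wrapped in a list, since mapPartitions expects an iterable back). Note also that a centroid computed on separate chunks may differ slightly near the chunk boundaries from one computed over the whole signal.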
I need to iterate over a very large product generated by itertools.product.
for result in product(items, repeat=9):
    # stuff
It takes a lot of time, and I am looking for a way to start from a certain item because I won't be able to do it in one run.
I could do the following:
gen = product(items, repeat=9)
for temp in gen:
    if temp == DESIRED_VALUE:
        break
for result in gen:
    # stuff
But it will take a lot of time, almost the same as if I was just restarting the program. So, is there a way to "skip ahead" without wasting the time on iterating the whole thing?
Although I have serious concerns about brute-forcing a password in the first place, I can offer an answer.
You can use islice to skip a certain number of steps in iteration. This means that you would need to keep track of how many attempts you have done so far to know where to resume later.
START_VALUE = 200
all_combos = itertools.product(letters,repeat=9)
#start at START_VALUE and stop at None (the end)
combos = itertools.islice(all_combos,START_VALUE,None)
for i,password in enumerate(combos,start=START_VALUE):
    ...
Note that this will only work for values below sys.maxsize.
You can also calculate the index of a given password with the same formula to convert bases:
def check_value(password):
    pos = len(letters)
    value = 0
    for i,c in enumerate(reversed(password)):
        value+= (pos**i) * letters.index(c)
    return value
>>> check_value("aaaacbdaa")
29802532
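The base conversion also works in the other direction. islice still has to step through the skipped items internally, so for a very large start position it may be worth building the tuple at a given index directly. The sketch below assumes items is an indexable sequence; nth_product and product_from are made-up names, not itertools functions.

```python
from itertools import islice, product

def nth_product(items, repeat, index):
    """Return the tuple that product(items, repeat=repeat) yields at position `index`."""
    base = len(items)
    if not 0 <= index < base ** repeat:
        raise IndexError("index out of range")
    picked = []
    for _ in range(repeat):
        index, digit = divmod(index, base)
        picked.append(items[digit])
    # Digits come out least-significant first, so reverse them.
    return tuple(reversed(picked))

def product_from(items, repeat, start):
    """Yield the same tuples as product(items, repeat=repeat), beginning at `start`."""
    for index in range(start, len(items) ** repeat):
        yield nth_product(items, repeat, index)

if __name__ == "__main__":
    print(list(islice(product_from("abc", 3, 10), 3)))
```

Each tuple is built in pure Python here, so iteration is slower per item than product itself. The gain is that resuming costs nothing, and only the last processed index has to be saved between runs.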
I'm new to numpy and have googled a lot, but at the moment I find it hard to speed my code up any further. I optimized it as much as I could using #profile and numba, but it is still very slow for a large number of documents and it needs a lot of memory. I'm pretty sure I'm not using numpy the right (fast) way. Because I want to learn, I hope some of you can help me improve my code.
You can find my whole code on Bitbucket.
The very slow part is the log-entropy-weight calculation in the file CreateMatrix.py (create_log_entropy_weight_matrix and __create_np_p_ij_matrix_forLEW).
The profiling results for the two methods can be viewed here.
Here are the two methods:
#profile
#jit
def create_log_entropy_weight_matrix(self, np_freq_matrix_ordered):
    print(' * Create Log-Entropy-Weight-Matrix')
    np_p_ij_matrix = self.__create_np_p_ij_matrix_forLEW(np_freq_matrix_ordered)
    np_p_ij_matrix_sum = np_p_ij_matrix.sum(0)
    np_log_entropy_weight_matrix = np.zeros(np_freq_matrix_ordered.shape, dtype=np.float32)
    n_doc = int(np_freq_matrix_ordered.shape[0])
    row_len, col_len = np_freq_matrix_ordered.shape
    negative_value = False
    for col_i, np_p_ij_matrix_sum_i in enumerate(np_p_ij_matrix_sum):
        for row_i in range(row_len):
            local_weight_i = math.log(np_freq_matrix_ordered[row_i][col_i] + 1)
            if not np_p_ij_matrix[row_i][col_i]:
                np_log_entropy_weight_matrix[row_i][col_i] = local_weight_i
            else:
                global_weight_i = 1 + (np_p_ij_matrix_sum_i / math.log(n_doc))
                np_log_entropy_weight_matrix[row_i][col_i] = local_weight_i * global_weight_i
            # if np_log_entropy_weight_matrix[row_i][col_i] < 0:
            #     negative_value = True
    #print(' - - test negative_value:', negative_value)
    return(np_log_entropy_weight_matrix)
##profile
#jit
def __create_np_p_ij_matrix_forLEW(self, np_freq_matrix_ordered):
    np_freq_matrix_ordered_sum = np_freq_matrix_ordered.sum(0)
    np_p_ij_matrix = np.zeros(np_freq_matrix_ordered.shape, dtype=np.float32)
    row_len, col_len = np_freq_matrix_ordered.shape
    for col_i, ft_freq_sum_i in enumerate(np_freq_matrix_ordered_sum):
        for row_i in range(row_len):
            p_ij = division_lew(np_freq_matrix_ordered[row_i][col_i], ft_freq_sum_i)
            if p_ij:
                np_p_ij_matrix[row_i][col_i] = p_ij * math.log(p_ij)
    return(np_p_ij_matrix)
Hope someone can help me to improve my code :)
Here's a stab at removing one level of iteration:
doc_log = math.log(n_doc)
local_weight = np.log(np_freq_matrix_ordered + 1)
for col_i, np_p_ij_matrix_sum_i in enumerate(np_p_ij_matrix_sum):
    local_weight_j = local_weight[:, col_i]
    # entries of np_p_ij_matrix are p*log(p) <= 0, so test for nonzero rather than > 0
    ind = np_p_ij_matrix[:, col_i] != 0
    # np_p_ij_matrix_sum_i is a scalar (the column sum), so it is not indexed with ind
    local_weight_j[ind] *= 1 + np_p_ij_matrix_sum_i / doc_log
    np_log_entropy_weight_matrix[:, col_i] = local_weight_j
I haven't run any tests; I just read through your code and replaced things that were unnecessarily iterative.
Without fully understanding your code, it looks like it is performing operations that can be done on the whole array at once: *, +, log, etc. The only if is there to avoid log(0). I replaced one if with the ind masking.
The variable names are long and descriptive. At some level that is good, but it often is easier to read code with shorter names. It takes more concentration to distinguish np_p_ij_matrix from np_p_ij_matrix_sum_i than to distinguish x from y.
Notice I also replaced the [][] indexing with [,] style. Not necessarily faster, but easier to read.
But I haven't used numba enough to know where these changes improve its response. numba lets you get by with an iterative style of coding that makes an old-time MATLAB coder blanch.
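Going one step further, both loops can probably be dropped. The sketch below is untested against the original project and rests on two assumptions: that division_lew (not shown in the question) is a plain division returning 0 when the column sum is 0, and that the frequency matrix holds non-negative counts for more than one document. The function name log_entropy_weights is made up.

```python
import numpy as np

def log_entropy_weights(freq):
    """Vectorized log-entropy weighting of a (documents x terms) count matrix."""
    freq = np.asarray(freq, dtype=np.float64)
    n_doc = freq.shape[0]
    col_sum = freq.sum(axis=0)
    # p_ij = f_ij / sum_i f_ij, left at 0 where the column sum is 0 (assumed division_lew behaviour)
    p = np.divide(freq, col_sum, out=np.zeros_like(freq), where=col_sum > 0)
    # p * log(p), computed only where p > 0 so that log(0) is never evaluated
    plogp = np.zeros_like(p)
    mask = p > 0
    plogp[mask] = p[mask] * np.log(p[mask])
    # One global weight per term (column), broadcast over all documents (rows)
    global_weight = 1.0 + plogp.sum(axis=0) / np.log(n_doc)
    return np.log(freq + 1.0) * global_weight

if __name__ == "__main__":
    counts = np.array([[1, 0, 3], [1, 2, 0], [0, 0, 1]])
    print(log_entropy_weights(counts))
```

Multiplying every entry by the global weight should give the same result as the if/else in the loops: a zero count has a local weight of log(1) = 0, and an entry with p = 1 sits in a column whose entropy sum is 0, so its global weight is 1 anyway. The sketch uses float64 for simplicity; casting the result to float32 would match the original dtype and halve the memory.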