Librosa feature extraction methods with PySpark - apache-spark

I've been searching for a long time but can't find any implementation of music feature extraction techniques (spectral centroid, spectral bandwidth, etc.) integrated with Apache Spark. I work with these feature extraction techniques, and the process takes a lot of time per track, so I want to parallelize and accelerate it with Spark. I've made an attempt but got no speedup. I want the arithmetic mean and standard deviation of the spectral centroid. This is what I've done so far.
from pyspark import SparkContext
import librosa
import numpy as np
import time

parts=4
print("Parts: ", parts)
sc = SparkContext('local['+str(parts)+']', 'pyspark tutorial')

def spectral(iterator):
    l=list(iterator)
    cent=librosa.feature.spectral_centroid(np.array(l), hop_length=256)
    ort=np.average(cent)
    std=np.std(cent)
    return (ort, std)
y, sr=librosa.load("classical.00080.au") #This loads the song.
start1=time.time()
normal=librosa.feature.spectral_centroid(np.array(y), hop_length=256) #This is normal technique without spark
end1=time.time()
print("\nOrt: \t", np.average(normal))
print("Std: \t", np.std(normal))
print("Time elapsed: %.5f" % (end1-start1))
#This is where my spark implementation appears.
rdd = sc.parallelize(y)
start2=time.time()
result=rdd.mapPartitions(spectral).collect()
end2=time.time()
result=np.array(result)
total_avg, total_std = 0, 0
for i in range(0, parts*2, 2):
    total_avg += result[i]
    total_std += result[i+1]
spark_avg = total_avg/parts
spark_std = total_std/parts
print("\nOrt:", spark_avg)
print("Std:", spark_std)
print("Time elapsed: %.5f" % (end2-start2))
The output of the program is below (first the plain librosa run, then the Spark run).
Ort: 971.8843380584146
Std: 306.75410601230413
Time elapsed: 0.17665
Ort: 971.3152955225721
Std: 207.6510740703993
Time elapsed: 4.58174
So even though I parallelized the array y (the audio signal), I get no speedup; the Spark version actually takes longer, and I don't understand why. I'm new to Spark. I also thought about using the GPU for this, but couldn't implement that either. Can anyone help me understand what I'm doing wrong?
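For a single, fairly short signal, Spark's task scheduling and serialization overhead can easily exceed the cost of the centroid computation itself, and splitting one waveform into raw sample chunks also breaks the STFT frames at the chunk boundaries. A pattern that usually parallelizes better is to distribute whole files (or large, frame-aligned chunks) and run librosa inside each task. Below is a minimal sketch of that idea; the list of audio paths is hypothetical, and the context setup mirrors the one above.

from pyspark import SparkContext
import numpy as np

sc = SparkContext('local[4]', 'spectral centroid per file')

# Hypothetical list of audio files; each Spark task processes one whole file.
paths = ["classical.00080.au", "classical.00081.au", "classical.00082.au"]

def centroid_stats(path):
    # Import inside the task so the worker processes load librosa themselves.
    import librosa
    y, sr = librosa.load(path)
    cent = librosa.feature.spectral_centroid(y=y, sr=sr, hop_length=256)
    return (path, float(np.mean(cent)), float(np.std(cent)))

stats = sc.parallelize(paths, numSlices=4).map(centroid_stats).collect()
for path, mean_cent, std_cent in stats:
    print(path, mean_cent, std_cent)

This way the only data that pass through Spark are the file paths and the per-file statistics, so the parallelism pays off once there are enough files to keep the workers busy.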

Related

multiprocessing starts off fast and drastically slows down

I'm trying to train a forecasting model on several backtest dates and model parameters. I wrote a custom function that basically takes an average of ARIMA, ETS, and a few other univariate and multivariate forecasting models from a dataset that's about 10 years of quarterly data (40 data points). I want to run this model in parallel on thousands of different combinations.
The custom model I wrote looks like this
def train_test_func(model_params):
    data = read_data_from_pickle()
    data_train, data_test = train_test_split(data, backtestdate)
    model1 = ARIMA.fit(data_train)
    data_pred1 = model1.predict(len(data_test))
    ...
    results = error_eval(data_pred1, ..., data_pred_i, data_test)
    save_to_aws_s3(results)
    logger.info("log steps here")
My multiprocessing script looks like this:
# Custom function I wrote that trains and tests
from my_custom_model import train_test_func
import multiprocessing

commands = []

if __name__ == '__main__':
    for backtest_date in target_backtest_dates:
        for param_a in target_drugs:
            for param_b in param_b_options:
                for param_c in param_c_options:
                    args = {
                        "backtest_date": backtest_date,
                        "param_a": param_a,
                        "param_b": param_b,
                        "param_c": param_c
                    }
                    commands.append(args)

    count = multiprocessing.cpu_count()
    with multiprocessing.get_context("spawn").Pool(processes=count) as pool:
        pool.map(train_test_func, commands)
I can get relatively fast results for the first 200 or so iterations, roughly 50 iterations per min. Then, it drastically slows down to ~1 iteration per minute. For reference, running this on a single core gets me about 5 iterations per minute. Each process is independent and uses a relatively small dataset (40 data points). None of the processes need to depend on each other, either--they are completely standalone.
Can anyone help me understand where I'm going wrong with multiprocessing? Is there enough information here to identify the problem? At the moment, the multiprocessing versions are slower than single core versions.
Attaching performance output
I found the answer. Basically my model uses numpy, which, by default, is configured to use multicore. The clue was in my CPU usage from the top command.
This stackoverflow post led me to the correct answer. I added this code block to the top of my scripts that use numpy:
import os
ncore = "1"
os.environ["OMP_NUM_THREADS"] = ncore
os.environ["OPENBLAS_NUM_THREADS"] = ncore
os.environ["MKL_NUM_THREADS"] = ncore
os.environ["VECLIB_MAXIMUM_THREADS"] = ncore
os.environ["NUMEXPR_NUM_THREADS"] = ncore
import numpy
...
The key being that you have to add these configurations before you import numpy.
Performance increased from 50 cycles/min to 150 cycles/min, and I didn't experience any throttling after a few minutes. CPU usage also improved, with no process exceeding 100%.
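If setting those environment variables before the numpy import is inconvenient (for example, when numpy has already been imported by another module), the threadpoolctl package can apply a similar limit at runtime. A minimal sketch, assuming threadpoolctl is installed:

from threadpoolctl import threadpool_limits
import numpy as np

# Cap the BLAS/OpenMP thread pools at one thread inside this block,
# so each pool worker doesn't oversubscribe the CPU.
with threadpool_limits(limits=1):
    a = np.random.rand(500, 500)
    b = a @ a.T  # runs single-threaded under the limit

Either way, the idea is the same: one BLAS thread per worker process, so the process pool's workers don't fight each other for cores.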

Do you need a for loop for IncrementalPCA in order to keep constant memory usage?

In the past, I've tried to use scikit-learn's IncrementalPCA in order to reduce memory usage. I used this answer as a template for my code. But as @aarslan said in the comment section: "I've noticed that the explained variance seems to decrease at every iteration." I've always suspected the last for loop in the given answer. So my question is: do I need a for loop to keep memory usage constant during the partial_fit step, or is batch_size alone enough? Below you can find the code:
import h5py
import numpy as np
from sklearn.decomposition import IncrementalPCA

h5 = h5py.File('rand-1Mx1K.h5', 'r')
data = h5['data'] # it's ok, the dataset is not fetched to memory yet
n = data.shape[0] # how many rows we have in the dataset
chunk_size = 1000 # how many rows we feed to IPCA at a time, the divisor of n
ipca = IncrementalPCA(n_components=10, batch_size=16)

for i in range(0, n//chunk_size):
    ipca.partial_fit(data[i*chunk_size : (i+1)*chunk_size])
An old question, but yes, the for-loop is needed. The batch_size= parameter is only used with the .fit() method, not with .partial_fit().
Scikit-learn documentation:
batch_size : int, default=None
The number of samples to use for each batch. Only used when calling fit.
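In other words, .fit() does the batching for you using batch_size, while .partial_fit() leaves the batching to the caller, which is what keeps memory constant when the data live on disk. A minimal sketch contrasting the two calls, using a small in-memory array purely for illustration (a real out-of-core workload would keep the explicit partial_fit loop over the HDF5 dataset):

import numpy as np
from sklearn.decomposition import IncrementalPCA

X = np.random.rand(10_000, 1_000)  # stand-in data; loading it all defeats the memory savings

# Option 1: fit() slices X into batches of batch_size internally.
ipca_fit = IncrementalPCA(n_components=10, batch_size=1000)
ipca_fit.fit(X)

# Option 2: partial_fit() processes exactly the chunk you pass it,
# so the caller controls how much data is in memory at once.
ipca_pf = IncrementalPCA(n_components=10)
for start in range(0, X.shape[0], 1000):
    ipca_pf.partial_fit(X[start:start + 1000])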

Is this text training with skip-gram correct?

I am still a beginner with neural networks and NLP.
In this code I'm training cleaned text (some tweets) with skip-gram.
But I do not know whether I am doing it correctly.
Can anyone tell me whether this skip-gram training setup is correct?
Any help is appreciated.
This is my code:
from nltk import word_tokenize
from gensim.models.phrases import Phrases, Phraser
import numpy as np
import pandas as pd

sent = [row.split() for row in X['clean_text']]
phrases = Phrases(sent, max_vocab_size = 50, progress_per=10000)
bigram = Phraser(phrases)
sentences = bigram[sent]

from gensim.models import Word2Vec
w2v_model = Word2Vec(window=5,
                     size = 300,
                     sg=1)
w2v_model.build_vocab(sentences)
w2v_model.train(sentences, total_examples=w2v_model.corpus_count, epochs=25)
del sentences #to reduce memory usage

def get_mat(model, corpus, size):
    vecs = np.zeros((len(corpus), size))
    n = 0
    for i in corpus.index:
        vecs[i] = np.zeros(size).reshape((1, size))
        for word in str(corpus.iloc[i,0]).split():
            try:
                vecs[i] += model[word]
                #n += 1
            except KeyError:
                continue
    return vecs

X_sg = get_mat(w2v_model, X, 300)
del X
X_sg=pd.DataFrame(X_sg)
X_sg.head()

from sklearn import preprocessing
scale = preprocessing.normalize
X_sg=scale(X_sg)
for i in range(len(X_sg)):
    X_sg[i]+=1 #I did this because some weights were negative, so I could not
               #apply LSTM on them later
You haven't mentioned if you've received any errors, or unsatisfactory results, so it's hard to know what kind of help you might need.
Your specific lines of code involving the Word2Vec model are roughly correct: plausibly-useful parameters (if you have a dataset large enough to train 300-dimensional vectors), and the proper steps. So the real proof would be whether your results are acceptable.
Regarding your attempted use of Phrases bigram-creation beforehand:
You should get things generally working and with promising results before adding this extra pre-processing complexity.
The parameter max_vocab_size=50 is seriously misguided and may make the phrases-step pointless. The max_vocab_size is a hard cap on how many words/bigrams are tallied by the class, as a way to cap its memory-usage. (Whenever the number of known words/bigrams hits this cap, many lower-frequency words/bigrams are pruned – in practice, a majority of all words/bigrams each pruning, giving up a lot of accuracy in return for capped memory usage.) The max_vocab_size default in gensim is 40,000,000 – but the default in the Google word2phrase.c source on which gensim's method is based was 500,000,000. By using just 50, it's not really going to learn anything useful about just whatever 50 words/bigrams survive the many prunings.
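For reference, a Phrases setup much closer to the defaults might look like the sketch below; min_count and threshold are the usual knobs for how aggressively bigrams are promoted, and the values shown are illustrative rather than tuned:

from gensim.models.phrases import Phrases, Phraser

# sent is the tokenized corpus: a list of token lists, as in the question.
phrases = Phrases(sent,
                  min_count=5,         # ignore word pairs seen fewer than 5 times
                  threshold=10.0,      # higher values promote fewer, stronger bigrams
                  progress_per=10000)  # leave max_vocab_size at its 40,000,000 default
bigram = Phraser(phrases)
sentences = bigram[sent]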
Regarding your get_mat() function and the later DataFrame code, I have no idea what you're trying to do with it, so I can't offer any opinion on it.

Variance in computer processing speed

I have been trying to figure out how much time one of my algorithms would need to run.
To do so, I built a simple Python script (I admit the approach is naive and doesn't test much):
import time

n=0
x=[]
for k in range(1,10):
    begin = time.time()
    while (n<1E7):
        n+=1
    end = time.time()
    x.append(end-begin)
    print(x)
    n=0
print(x)
The result on my computer is:
[2.755953550338745, 2.234074831008911, 2.719917058944702, 2.4802486896514893, 2.8635189533233643, 2.7834832668304443, 4.048354387283325, 3.454935312271118, 3.3593692779541016]
Without doing any further analysis, I can't help but notice a pretty large variance in these results. Where can it be coming from?
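Variance like this usually comes from everything else the machine is doing: other processes preempting yours, CPU frequency scaling and thermal throttling, and the interpreter's own housekeeping. The usual way to get a more stable number is to repeat the measurement and look at the minimum, which the standard timeit module does for you. A minimal sketch of timing the same counting loop that way:

import timeit

# Time the counting loop several independent times and keep the best run,
# which is the one least disturbed by background load.
durations = timeit.repeat(
    stmt="""
n = 0
while n < 1E7:
    n += 1
""",
    repeat=5,  # five independent timing runs
    number=1,  # one execution of the statement per run
)
print(durations)
print("best:", min(durations))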

How to efficiently calculate 160146 by 160146 matrix inverse in python?

My research is in structural dynamics and I am dealing with large symmetric sparse matrix calculations. Recently I had to compute the inverse of a 160146 by 160146 stiffness matrix with 4813762 non-zero elements. I did compute the inverse of a smaller 15000 by 15000 stiffness matrix, and it came out almost or completely dense. Initially I tried almost all of the scipy.sparse.linalg functions to compute the inverse through the Ax=b form. Currently I am using superlu to compute the L and U factors and then computing the inverse column by column with solve(). Since the inverse is dense and cannot be stored in RAM, I opted for PyTables.
Unfortunately, writing one column of the inverse matrix takes about 16 minutes (the time for each step is shown below, after the code), and the stiffness matrix has a total of 160146 columns. I would like to know how I can boost the writing speed so that this inverse task finishes in a couple of days. The code is as follows:
import os
import datetime
import numpy
import tables
import scipy.sparse.linalg

LU = scipy.sparse.linalg.splu(interior_stiff)
interior_dof_row_ptr = 160146

#---PyTables creation code for interior_stiff_inverse begins--#
if os.path.isfile("HDF5_Interior.h5") == False:
    f = tables.open_file("HDF5_Interior.h5", 'w')
else:
    f = tables.open_file("HDF5_Interior.h5", 'a')

# compression-level and compression library
filters = tables.Filters(complevel=0, complib='blosc')

# f.root -> your default group in the HDF5 file, "DS_interior_stiff_inverse" -> name of the dataset
# tables.Float32Atom() -> the atomic data type
if f.__contains__("/DS_interior_stiff_inverse") == False:
    print("DS_interior_stiff_inverse DOESN'T EXIST!!!!!")
    out = f.create_carray(f.root, "DS_interior_stiff_inverse", tables.Float32Atom(),
                          shape=(interior_dof_row_ptr, interior_dof_row_ptr), filters=filters)
    #out = f.create_earray(f.root, "DS_interior_stiff_inverse", tables.Float32Atom(),
    #                      shape=(interior_dof_row_ptr, 0), filters=filters,
    #                      expectedrows=interior_dof_row_ptr)
else:
    print("DS_interior_stiff_inverse EXISTS!!!!!")
    out = f.get_node("/", "DS_interior_stiff_inverse")

#interior_stiff_inverse = numpy.zeros((interior_dof_row_ptr, interior_dof_row_ptr))
for i in range(0, interior_dof_row_ptr):
    I = numpy.zeros((interior_dof_row_ptr, 1))
    I[i, 0] = 1
    #-- Commented by Libni - interior_stiff_inverse[:,i]=LU.solve(I[:,0])
    #   Only interior_stiff_inverse needs to be stored in pytables.
    print("stating solve() calculation for inverse: ", datetime.datetime.now())
    tmpResult = LU.solve(I[:, 0])
    print("solve() calculation for inverse DONE: ", datetime.datetime.now())
    out[:, i] = tmpResult
    print("Written to hdf5 (pytable) :", datetime.datetime.now())
    #out.append(LU.solve(I[:,0]))
    print(str(i) + "th iteration of " + str(interior_dof_row_ptr) + " Interior Inv done")
    f.flush()
    print("After FLUSH line: ", datetime.datetime.now())

f.close()
#--***PyTables creation code for interior_stiff_inverse ends-***
Time taken for Solve () calculation and writing to hdf5 is as follows,
stating solve() calculation for inverse: 2017-08-26 01:04:20.424447
solve() calculation for inverse DONE: 2017-08-26 01:04:20.596045
Written to hdf5 (pytable) :2017-08-26 01:20:57.228322
After FLUSH line: 01:20:57.555922
This clearly indicates that writing one column of the inverse matrix to HDF5 takes about 16 minutes. At this rate, computing the entire inverse would take 1779 days. I am sure the writing time can be reduced, but I don't know how to achieve it. Please help me boost the writing speed to HDF5 so that the matrix inverse run can finish within a couple of days.
I used compression level 0 when creating the HDF5 file, thinking this would help it read and write faster.
My computer is an i7 with 4 cores and 16 GB of RAM.
Any help will be appreciated.
Thank You,
Paul Thomas
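One pattern that often helps here is to solve for a block of right-hand-side columns at once (the SuperLU object returned by splu accepts a 2-D right-hand side) and to write each block to the CArray in a single slice assignment, with a chunkshape aligned to whole columns so every write touches complete chunks. The sketch below only illustrates that idea and is not a drop-in replacement for the code above: interior_stiff is the sparse stiffness matrix from the question, the dataset is assumed not to exist yet, and the block size is just a tuning knob.

import numpy
import tables
import scipy.sparse.linalg

n = 160146
block = 64  # columns solved and written per iteration; tune for RAM and disk speed

# interior_stiff is the assembled sparse stiffness matrix from the question.
LU = scipy.sparse.linalg.splu(interior_stiff)

f = tables.open_file("HDF5_Interior.h5", 'a')
out = f.create_carray(f.root, "DS_interior_stiff_inverse", tables.Float32Atom(),
                      shape=(n, n),
                      chunkshape=(n, 1),  # one chunk per column (~0.6 MB of float32)
                      filters=tables.Filters(complevel=1, complib='blosc'))

for start in range(0, n, block):
    stop = min(start + block, n)
    # A block of identity columns as the right-hand sides.
    rhs = numpy.zeros((n, stop - start))
    rhs[start:stop, :] = numpy.eye(stop - start)
    cols = LU.solve(rhs)        # solve for all columns of the block in one call
    out[:, start:stop] = cols   # one contiguous write per block
    f.flush()

f.close()

Even so, the full single-precision inverse is on the order of 100 GB, so the run time will ultimately be bounded by how fast the disk can absorb that much data; anything that avoids 160146 separate one-column writes should already be a large improvement.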
