multiprocessing starts off fast and drastically slows down - python-3.x

I'm trying to train a forecasting model on several backtest dates and model parameters. I wrote a custom function that basically takes an average of ARIMA, ETS, and a few other univariate and multivariate forecasting models from a dataset that's about 10 years of quarterly data (40 data points). I want to run this model in parallel on thousands of different combinations.
The custom model I wrote looks like this
def train_test_func(model_params):
    data = read_data_from_pickle()
    data_train, data_test = train_test_split(data, backtestdate)
    model1 = ARIMA.fit(data_train)
    data_pred1 = model1.predict(len(data_test))
    ...
    results = error_eval(data_pred1, ..., data_pred_i, data_test)
    save_to_aws_s3(results)
    logger.info("log steps here")
My multiprocessing script looks like this:
# Custom function I wrote that trains and tests
from my_custom_model import train_test_func
import multiprocessing

commands = []
if __name__ == '__main__':
    for backtest_date in target_backtest_dates:
        for param_a in target_drugs:
            for param_b in param_b_options:
                for param_c in param_c_options:
                    args = {
                        "backtest_date": backtest_date,
                        "param_a": param_a,
                        "param_b": param_b,
                        "param_c": param_c
                    }
                    commands.append(args)
    count = multiprocessing.cpu_count()
    with multiprocessing.get_context("spawn").Pool(processes=count) as pool:
        pool.map(train_test_func, commands)
I can get relatively fast results for the first 200 or so iterations, roughly 50 iterations per min. Then, it drastically slows down to ~1 iteration per minute. For reference, running this on a single core gets me about 5 iterations per minute. Each process is independent and uses a relatively small dataset (40 data points). None of the processes need to depend on each other, either--they are completely standalone.
Can anyone help me understand where I'm going wrong with multiprocessing? Is there enough information here to identify the problem? At the moment, the multiprocessing version ends up slower than the single-core version.
Attaching performance output

I found the answer. Basically my model uses numpy, which, by default, is configured to use multiple cores. The clue was in my CPU usage from the top command.
This stackoverflow post led me to the correct answer. I added this code block to the top of my scripts that use numpy:
import os
ncore = "1"
os.environ["OMP_NUM_THREADS"] = ncore
os.environ["OPENBLAS_NUM_THREADS"] = ncore
os.environ["MKL_NUM_THREADS"] = ncore
os.environ["VECLIB_MAXIMUM_THREADS"] = ncore
os.environ["NUMEXPR_NUM_THREADS"] = ncore
import numpy
...
The key is that you have to set these environment variables before you import numpy.
Performance increased from 50 cycles/min to 150 cycles/min, and I didn't experience any throttling after a few minutes. CPU usage also improved, with no process exceeding 100%.
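If setting environment variables before the numpy import is awkward (for example because numpy is already imported elsewhere), a similar effect can be had at runtime with the threadpoolctl package. This is a minimal sketch, assuming threadpoolctl is installed and a worker function shaped like the one above:
from threadpoolctl import threadpool_limits

def train_test_func(model_params):
    # Cap the native BLAS/OpenMP thread pools at 1 thread for this block,
    # so the process-level parallelism from multiprocessing is not
    # oversubscribed by numpy's internal threading.
    with threadpool_limits(limits=1):
        ...  # fit the models, evaluate errors, and save results as before
Either way, the important part is that the limit is in place before the heavy numpy/BLAS work starts in each worker process.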

Related

How to lower RAM usage using xarray open_mfdataset and the quantile function

I am trying to load multiple years of daily data in nc files (one nc file per year). A single nc file has a dimension of 365 (days) * 720 (lat) * 1440 (lon). All the nc files are in the "data" folder.
import xarray as xr
ds = xr.open_mfdataset('data/*.nc',
                       chunks={'latitude': 10, 'longitude': 10})
# I need the following line (time: -1) in order to do quantile, or it throws a ValueError:
# ValueError: dimension time on 0th function argument to apply_ufunc with dask='parallelized'
# consists of multiple chunks, but is also a core dimension. To fix, either rechunk into a single
# dask array chunk along this dimension, i.e., ``.chunk(time: -1)``, or pass ``allow_rechunk=True``
# in ``dask_gufunc_kwargs`` but beware that this may significantly increase memory usage.
ds = ds.chunk({'time': -1})
# Perform the quantile "computation" (looks more like a reference to the computation, as it's fast)
ds_qt = ds.quantile(0.975, dim="time")
# Verify the shape of the loaded ds
print(ds)
# This shows the expected "concatenation" of the nc files.
# Get a sample for a given location to test the algorithm
print(len(ds.sel(lon = 35.86,lat = 14.375, method='nearest')['variable'].values))
print(ds_qt.sel(lon = 35.86,lat = 14.375, method='nearest')['variable'].values)
The result is correct. My issue is memory usage. I thought that using open_mfdataset, which uses Dask under the hood, would solve this. However, loading "just" 2 years of nc files uses around 8 GB of virtual RAM, and 10 years of data uses my entire virtual RAM (around 32 GB).
Am I missing something to be able to take a given percentile value across a Dask array (I would eventually need 30 nc files)? I apparently have to apply chunk({'time': -1}) to the dataset to be able to use the quantile function; is this what defeats the RAM savings?
This may help somebody in the future: here is the solution I am implementing, even though it is not optimized. I basically break the data into slices based on geolocation, compute each slice, and paste them back together to create the output file.
ds = xr.open_mfdataset('data/*.nc')
step = 10
min_lat = -90
max_lat = min_lat + step
output_ds = None
while max_lat <= 90:
    cropped_ds = ds.sel(lat=slice(min_lat, max_lat))
    cropped_ds = cropped_ds.chunk({'time': -1})
    cropped_ds_quantile = cropped_ds.quantile(0.975, dim="time")
    if output_ds is None:  # first slice: nothing to merge with yet
        output_ds = cropped_ds_quantile
    else:
        output_ds = xr.merge([output_ds, cropped_ds_quantile])
    min_lat += step
    max_lat += step
output_ds.to_netcdf('output.nc')
It's not great, but it limits RAM usage to manageable levels. I am still open to a cleaner/faster solution if it exists (likely).
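For anyone looking for a cleaner variant, a sketch along the following lines may keep memory bounded without the manual latitude loop. It assumes the dimensions really are named lat, lon, and time (as the sel calls suggest); chunk keys that don't match the actual dimension names are typically ignored with a warning, which would leave every block spanning the full grid:
import xarray as xr

# Chunk spatially; modest blocks keep the later time-rechunk cheap.
ds = xr.open_mfdataset('data/*.nc', chunks={'lat': 45, 'lon': 45})

# quantile() needs a single chunk along its core dimension, so merge the
# per-file time chunks within each spatial block.
ds = ds.chunk({'time': -1})

# Still lazy: every (lat, lon) block is reduced independently.
ds_qt = ds.quantile(0.975, dim="time")

# Writing triggers the computation block by block, so peak memory stays
# around a handful of chunks rather than the whole dataset.
ds_qt.to_netcdf('output.nc')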

Fuzzywuzzy match 2 columns... script keeps running

I'm trying to match 2 columns of ~50,000 instances with FuzzyWuzzy.
Column A (companies) contains company names, with some typos. Column B (correct) contains the correct company names.
I'm trying to match the typo'd names to the correct ones. When I run my script below, the kernel keeps executing for hours and doesn't produce a result.
Any ideas on how to improve?
Many thanks!
Update: link to the files: https://fromsmash.com/STLz.VEub2-ct
import pandas as pd
from fuzzywuzzy import process, fuzz
import matplotlib.pyplot as plt
correct = pd.read_excel("correct.xlsx")
companies = pd.read_excel("companies2.xlsx")
actual_comp = []
similarity = []
for i in companies.Customers:
    ratio = process.extract(i, correct.Correct, limit=1)
    actual_comp.append(ratio[0][0])
    similarity.append(ratio[0][1])
companies['actual_company'] = pd.Series(actual_comp)
companies['similarity'] = pd.Series(similarity)
companies.head(10)
There are a couple of things you can change to improve the performance:
Use RapidFuzz instead of FuzzyWuzzy, since it implements the same algorithms but is quite a bit faster (I am the author).
The process functions preprocess all strings you pass to them (lowercasing them, removing non-alphanumeric characters, and trimming whitespace). Right now you're preprocessing correct.Correct len(companies.Customers) times, which costs a lot of time and could instead be done once before the loop.
You're only using the best match, so it is better to use process.extractOne instead of process.extract. This is more readable, and inside extractOne RapidFuzz uses the results of previous comparisons to improve performance.
The following snippet implements these changes for your code. Keep in mind that you're still performing 50k^2 comparisons, so while this should be a lot faster than your current solution, it will still take a while.
import pandas as pd
from rapidfuzz import process, fuzz, utils
import matplotlib.pyplot as plt
correct = pd.read_excel("correct.xlsx")
companies = pd.read_excel("companies2.xlsx")
actual_comp = []
similarity = []
company_mapping = {company: utils.default_process(company) for company in correct.Correct}
for customer in companies.Customers:
    _, score, comp = process.extractOne(
        utils.default_process(customer),
        company_mapping,
        processor=None)
    actual_comp.append(comp)
    similarity.append(score)
companies['actual_company'] = pd.Series(actual_comp)
companies['similarity'] = pd.Series(similarity)
companies.head(10)
Out of interest I performed a quick benchmark calculating the average runtime when using your datasets. On my machine each lookup requires around 1 second with this solution (so a total of around 4.7 hours), while your previous solution took around 55 seconds per lookup (so a total of around 10.8 days).
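If even that is too slow, newer RapidFuzz releases also expose process.cdist, which scores all pairs in a single call and can spread the work across CPU cores. The following is a rough sketch rather than a tested drop-in (it assumes a RapidFuzz version that provides process.cdist with a workers argument, and note that the full 50k x 50k score matrix is memory-hungry):
import numpy as np
import pandas as pd
from rapidfuzz import process, fuzz, utils

correct = pd.read_excel("correct.xlsx")
companies = pd.read_excel("companies2.xlsx")

# Score every customer against every correct name in one call;
# workers=-1 distributes the comparisons over all CPU cores.
scores = process.cdist(
    list(companies.Customers),
    list(correct.Correct),
    scorer=fuzz.WRatio,
    processor=utils.default_process,
    workers=-1)

best = scores.argmax(axis=1)  # column index of the best match per customer
companies['actual_company'] = correct.Correct.to_numpy()[best]
companies['similarity'] = scores[np.arange(len(best)), best]
companies.head(10)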

How can I create a Keras Learning Rate Schedule that updates based upon batches rather than epochs

I'm working with Keras and trying to create a learning rate scheduler that schedules on the basis of the number of batches processed, instead of the number of epochs. To do this, I've inserted the scheduling code into the get_updates method of my Optimizer. For the most part, I've tried to use regular Python variables for values that remain constant during a given training run, and computational graph nodes only for parameters that actually vary.
My 2 Questions are:
Does the code below look like it should behave properly as a learning rate scheduler, if placed within the get_updates method of a Keras Optimizer?
How could one embed this code in a class similar to LearningRateScheduler, but which schedules based on the number of batches rather than the number of epochs?
# Copying graph node that stores original value of learning rate
lr = self.lr
# Checking whether learning rate schedule is to be used
if self.initial_lr_decay > 0:
    # this decay mimics exponential decay from
    # tensorflow/python/keras/optimizer_v2/exponential_decay
    # Get value of current number of processed batches from graph node
    # and convert to numeric value for use in K.pow()
    curr_batch = float(K.get_value(self.iterations))
    # Create graph node containing lr decay factor
    # Note: self.lr_decay_steps is a number, not a node
    #       self.lr_decay is a node, not a number
    decay_factor = K.pow(self.lr_decay, (curr_batch / self.lr_decay_steps))
    # Reassign lr to graph node formed by
    # product of graph node containing decay factor
    # and graph node containing original learning rate.
    lr = lr * decay_factor
# Get product of two numbers to calculate number of batches processed
# in warmup period
num_warmup_batches = self.steps_per_epoch_num * self.warmup_epochs
# Make comparisons between numbers to determine if we're in warmup period
if (self.warmup_epochs > 0) and (curr_batch < num_warmup_batches):
    # Create node with value of learning rate by multiplying a number
    # by a node, and then dividing by a number
    lr = (self.initial_lr *
          K.cast(self.iterations, K.floatx()) / curr_batch)
Rather than messing with the Keras source code (it's possible, but complex and fragile), you could use a callback.
import keras
from keras.callbacks import LambdaCallback

total_batches = 0

def what_to_do_when_batch_ends(batch, logs):
    global total_batches   # we rebind the module-level counter
    total_batches += 1     # or use the "batch" variable,
                           # which is the batch index of the last finished batch

    # change learning rate at will
    if your_condition == True:
        keras.backend.set_value(model.optimizer.lr, newLrValueAsPythonFloat)
When training, use the callback:
lrUpdater = LambdaCallback(on_batch_end = what_to_do_when_batch_ends)
model.fit(........, callbacks = [lrUpdater, ...other callbacks...])
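For the second question, a minimal sketch of a LearningRateScheduler-style class that schedules per batch could look like the following. It assumes tf.keras (adjust the imports for standalone Keras) and a user-supplied schedule function that maps the global batch count to a learning rate; it is not an official Keras API:
from tensorflow import keras
import tensorflow.keras.backend as K

class BatchLearningRateScheduler(keras.callbacks.Callback):
    """Like LearningRateScheduler, but calls the schedule once per batch."""

    def __init__(self, schedule):
        super().__init__()
        self.schedule = schedule   # function: total batch count -> learning rate
        self.total_batches = 0     # counts batches across all epochs

    def on_train_batch_begin(self, batch, logs=None):
        new_lr = self.schedule(self.total_batches)
        K.set_value(self.model.optimizer.lr, new_lr)
        self.total_batches += 1
For example, an exponential decay every 1000 batches could be supplied as schedule=lambda b: 1e-3 * 0.96 ** (b / 1000), and the callback is then passed to model.fit via the callbacks list, exactly like the LambdaCallback above.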

Do you need a for loop for IncrementalPCA in order to keep constant memory usage?

In the past, I've tried to use scikit-learn's IncrementalPCA to reduce memory usage. I used this answer as a template for my code, but as @aarslan said in the comment section: "I've noticed that the explained variance seems to decrease at every iteration." I've always suspected the last for loop in the given answer. So, my question is: do I need a for loop in order to keep constant memory usage during the partial_fit step, or is batch_size alone enough? Below you can find the code:
import h5py
import numpy as np
from sklearn.decomposition import IncrementalPCA

h5 = h5py.File('rand-1Mx1K.h5', 'r')
data = h5['data'] # it's ok, the dataset is not fetched to memory yet
n = data.shape[0] # how many rows we have in the dataset
chunk_size = 1000 # how many rows we feed to IPCA at a time, the divisor of n
ipca = IncrementalPCA(n_components=10, batch_size=16)
for i in range(0, n//chunk_size):
    ipca.partial_fit(data[i*chunk_size : (i+1)*chunk_size])
An old question, but yes, the for-loop is needed. The batch_size= parameter is only used with the .fit() method, not with .partial_fit().
Scikit-learn documentation:
batch_size : int, default=None
The number of samples to use for each batch. Only used when calling fit.
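For reference, a minimal out-of-core sketch that relies only on partial_fit (reusing the file and dataset names from the question, and assuming each chunk has at least n_components rows) keeps memory to roughly one chunk at a time:
import h5py
import numpy as np
from sklearn.decomposition import IncrementalPCA

h5 = h5py.File('rand-1Mx1K.h5', 'r')
data = h5['data']            # lazy h5py dataset; rows are only read when sliced
n = data.shape[0]
chunk_size = 1000            # must be >= n_components for partial_fit

ipca = IncrementalPCA(n_components=10)   # batch_size is irrelevant for partial_fit
for start in range(0, n, chunk_size):
    # Only this slice is pulled from disk into memory.
    ipca.partial_fit(data[start:start + chunk_size])

# Project the data in chunks as well, so the transform also stays out-of-core;
# the reduced array (n rows x 10 components) is small enough to hold in RAM.
reduced = np.vstack([ipca.transform(data[start:start + chunk_size])
                     for start in range(0, n, chunk_size)])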

Librosa feature extraction methods with PySpark

I've been searching for a long time but can't find any implementation of music feature extraction techniques (like spectral centroid, spectral bandwidth, etc.) integrated with Apache Spark. I am working with these feature extraction techniques and the process takes a lot of time for music. I want to parallelize and accelerate this process using Spark. I did some work but couldn't get any speedup. I want to get the arithmetic mean and standard deviation of the spectral centroid. This is what I've done so far.
from pyspark import SparkContext
import librosa
import numpy as np
import time
parts=4
print("Parts: ", parts)
sc = SparkContext('local['+str(parts)+']', 'pyspark tutorial')
def spectral(iterator):
    l=list(iterator)
    cent=librosa.feature.spectral_centroid(np.array(l), hop_length=256)
    ort=np.average(cent)
    std=np.std(cent)
    return (ort, std)
y, sr=librosa.load("classical.00080.au") #This loads the song.
start1=time.time()
normal=librosa.feature.spectral_centroid(np.array(y), hop_length=256) #This is normal technique without spark
end1=time.time()
print("\nOrt: \t", np.average(normal))
print("Std: \t", np.std(normal))
print("Time elapsed: %.5f" % (end1-start1))
#This is where my spark implementation appears.
rdd = sc.parallelize(y)
start2=time.time()
result=rdd.mapPartitions(spectral).collect()
end2=time.time()
result=np.array(result)
total_avg, total_std = 0, 0
for i in range(0, parts*2, 2):
    total_avg += result[i]
    total_std += result[i+1]
spark_avg = total_avg/parts
spark_std = total_std/parts
print("\nOrt:", spark_avg)
print("Std:", spark_std)
print("Time elapsed: %.5f" % (end2-start2))
The output of the program is below.
Ort: 971.8843380584146
Std: 306.75410601230413
Time elapsed: 0.17665
Ort: 971.3152955225721
Std: 207.6510740703993
Time elapsed: 4.58174
So, even though I parallelized the array y (the array of the music signal), I can't speed up the process; it takes longer. I couldn't understand why. I am a newbie with Spark concepts. I thought about using a GPU for this process but couldn't implement that either. Can anyone help me understand what I am doing wrong?
