I am currently doing a proof of concept for one of our business processes that requires logistic regression. I have been using statsmodels GLM to perform classification against our data set (see the code below). The data set consists of ~10M rows and around 80 features (of which 70+ are dummy variables, i.e. "1" or "0", derived from the defined categorical variables). With a smaller data set, GLM works fine; however, if I run it against the full data set, Python throws the error "cannot allocate memory".
glmmodel = smf.glm(formula, data, family=sm.families.Binomial())
glmresult = glmmodel.fit()
resultstring = glmresult.summary().as_csv()
This got me thinking that statsmodels might be designed to use all available CPU cores, with each subprocess creating its own copy of the data set in RAM (please correct me if I am mistaken). The question now is whether there is a way to make GLM use only a minimal number of cores. I am not after performance; I just want to be able to run the GLM against the full data set.
For reference, below is the machine configuration and some more information if needed.
CPU: 10 cores
RAM: 40 GB (usable/free ~25GB as there are other processes running on the
same machine)
swap: 16 GB
dataset size: 1.4 GB (based on pandas' DataFrame.info(memory_usage='deep'))
GLM uses multiprocessing only through the linear algebra libraries.
The following copies my FAQ issue description from https://github.com/statsmodels/statsmodels/issues/2914
It includes some links to other issues where this shows up.
(quote:)
Statsmodels uses joblib in a few places for parallel processing where it is under our control. Current usage is mainly for bootstrapping, and it is not used in the models directly.
However, some of the underlying BLAS/LAPACK libraries in numpy/scipy also use multiple cores. This can be efficient for linear algebra with large arrays, but it can also slow down operations, especially when we want to use parallel processing at a higher level.
How can we restrict the number of cores used by the linear algebra libraries?
This depends on which linear algebra library is used; see this mailing list thread:
https://groups.google.com/d/msg/pystatsmodels/Lz9-In0pgPk/BtcYsj_ABQAJ
openblas: try setting the environment variable OMP_NUM_THREADS=1
Accelerate on OSX, set VECLIB_MAXIMUM_THREADS
mkl in anaconda:
import mkl
mkl.set_num_threads(1)
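For example, a minimal sketch of the environment-variable route (an assumption-laden example: it presumes an OpenBLAS or generic OpenMP backend, and the variables must be set before numpy is first imported):

import os

# Limit BLAS threading; this must happen before numpy/statsmodels are imported.
# Adjust the variable name for MKL or Accelerate as described above.
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"

import statsmodels.api as sm
import statsmodels.formula.api as smf

# glmmodel = smf.glm(formula, data, family=sm.families.Binomial())
# glmresult = glmmodel.fit()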
This is because statsmodels uses IRLS to estimate the GLM, and the IRLS process relies on its WLS regression routine, which in turn uses a QR decomposition (or pseudo-inverse). The decomposition is done directly on X, and your X has 10 million rows and 80 columns, which puts a lot of stress on memory and CPU.
Here is the source code from statsmodels:
if method == 'pinv':
    pinv_wexog = np.linalg.pinv(self.wexog)
    params = pinv_wexog.dot(self.wendog)
elif method == 'qr':
    Q, R = np.linalg.qr(self.wexog)
    params = np.linalg.solve(R, np.dot(Q.T, self.wendog))
else:
    params, _, _, _ = np.linalg.lstsq(self.wexog, self.wendog,
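As a rough back-of-the-envelope sketch (my own estimate, not from the original answer): with 10M rows and 80 columns in float64, the weighted design matrix alone is around 6 GiB, and the pinv/QR step creates at least one more full-size array, so a single IRLS iteration can easily need several times the nominal 1.4 GB data size.

# Rough memory estimate for one IRLS iteration (assumptions: dense float64
# design matrix of 10M x 80, plus at least one full-size solver copy such
# as pinv_wexog or Q).
rows, cols = 10_000_000, 80
bytes_per_float64 = 8
wexog_gib = rows * cols * bytes_per_float64 / 1024**3
print(f"wexog alone: {wexog_gib:.1f} GiB")                   # ~6.0 GiB
print(f"wexog + one solver copy: {2 * wexog_gib:.1f} GiB")   # ~11.9 GiB, before other temporaries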
Related
I'm trying to train on multiple "documents" (here mostly log lines), and Doc2Vec takes longer if I specify more than one core (which I do have).
My data looks like this:
print(len(train_corpus))
7930196
print(train_corpus[:5])
[TaggedDocument(words=['port', 'ssh'], tags=[0]),
TaggedDocument(words=['session', 'initialize', 'by', 'client'], tags=[1]),
TaggedDocument(words=['dfs', 'fsnamesystem', 'block', 'namesystem', 'addstoredblock', 'blockmap', 'update', 'be', 'to', 'blk', 'size'], tags=[2]),
TaggedDocument(words=['appl', 'selfupdate', 'component', 'amd', 'microsoft', 'windows', 'kernel', 'none', 'elevation', 'lower', 'version', 'revision', 'holder'], tags=[3]),
TaggedDocument(words=['ramfs', 'tclass', 'blk', 'file'], tags=[4])]
I have 8 cores available:
print(os.cpu_count())
8
I am using gensim 4.1.2 on CentOS 7. Using this approach (stackoverflow.com/a/37190672/130288), it looks like my BLAS library is OpenBLAS, so I set OPENBLAS_NUM_THREADS=1 in my .bashrc (and it is visible from Jupyter, using !echo $OPENBLAS_NUM_THREADS).
This is my test code:
dict_time_workers = dict()
for workers in range(1, 9):
    model = Doc2Vec(vector_size=20,
                    min_count=1,
                    workers=workers,
                    epochs=1)
    model.build_vocab(train_corpus, update=False)
    t1 = time.time()
    model.train(train_corpus, epochs=1, total_examples=model.corpus_count)
    dict_time_workers[workers] = time.time() - t1
And the variable dict_time_workers ends up equal to:
{1: 224.23211407661438,
2: 273.408652305603,
3: 313.1667754650116,
4: 331.1840877532959,
5: 433.83785605430603,
6: 545.671571969986,
7: 551.6248495578766,
8: 548.430994272232}
As you can see, the time taken increases instead of decreasing. The results seem to be the same with a larger epochs parameter.
Nothing else is running on my CentOS 7 machine except this.
If I look at what's happening on my threads using htop, I see that the right number of threads is used for each training run. But the more threads are used, the lower their utilization: with only one thread, 95% is used; with 2 threads, each uses around 65% of its maximum; with 6 threads, 20-25% each, and so on. I suspected an I/O issue, but iotop showed nothing unusual on the disk.
This now seems related to the post
Not efficiently to use multi-Core CPU for training Doc2vec with gensim.
When getting no benefit from extra cores like that, it's likely that the BLAS library you've got installed is already configured to try to use all cores for every bulk array operation. That means that other attempts to engage more cores, like Gensim's workers specification, just increase the overhead of contention, since each individual worker thread's BLAS callouts also try to use 8 threads.
Depending on the BLAS library in use, its own propensity to use more cores can typically be limited by environment variables named something like OPENBLAS_NUM_THREADS and/or MKL_NUM_THREADS.
If you set these to just 1 before your process launches, you may see different, and possibly better, multithreaded behavior.
Note, though: 1 just restores the assumption that every worker-thread only ever engages a single core. Some other mix of BLAS-cores & Gensim-worker-threads might actually achieve the best training throughput & non-contending core-utilization.
And, at least for Gensim workers, the actual thread count value achieving the best throughput will vary based on other model parameters that influence the relative amount of calculation time in highly-parallelizable code-blocks versus highly-contended blocks, especially window, vector_size, & negative. And, there's not really a shortcut to finding the best workers value except via trial-and-error: observing reported training rates in logs over a few minutes of running. (Though: any rate observed in, say, minutes 2-4 of an abbreviated trial run should be representative of the training rate through the whole corpus over multiple epochs.)
(For any system with at least 4 cores, the optimal value with a classic iterable corpus of TaggedDocuments is usually at least 3, no more than the number of cores, but also rarely more than 8-12 threads, due to other inherent sources of contention due to both Gensim's approach to fanning out the work among worker-threads, and the Python 'GIL'.)
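For reference, a minimal sketch of enabling the progress logging mentioned above; Gensim reports its training rate through the standard logging module at INFO level:

import logging

# Gensim logs per-interval training progress (including words/sec rates)
# at INFO level, which is what you watch when comparing workers values.
logging.basicConfig(
    format="%(asctime)s : %(levelname)s : %(message)s",
    level=logging.INFO,
)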
Other thoughts:
the build_vocab() step is never multi-threaded, so benchmarking alternate workers values will give a truer readout of their effect by only timing the train() step
ensuring your iterable corpus does as little redundant work (like say IO & tokenization) on each pass can help limit any bottlenecks around the single manager thread doing each epoch's iteration & batching texts to the workers
the alternate corpus_file approach can achieve higher core utilization, up to any number of cores, by assigning each thread its own exclusive range of an input file. But it also means (a) your whole corpus must be in one uncompressed space-tokenized plain-text file; (b) your documents only get a single integer tag (their line number); (c) you may be subject to some small bug(s) that have not yet been diagnosed and fixed. (See project issue #2747.) A minimal usage sketch follows this list.
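A sketch of that corpus_file path, assuming the corpus has already been written out as a space-tokenized plain-text file with one document per line (the file name train_corpus.txt is hypothetical):

from gensim.models.doc2vec import Doc2Vec

# Assumes 'train_corpus.txt' is an uncompressed plain-text file with one
# space-tokenized document per line; each document gets its line number as tag.
model = Doc2Vec(vector_size=20, min_count=1, epochs=1, workers=4)
model.build_vocab(corpus_file="train_corpus.txt")
model.train(corpus_file="train_corpus.txt",
            total_examples=model.corpus_count,
            total_words=model.corpus_total_words,
            epochs=model.epochs)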
OK, the best way to fully use the cores is to use the corpus_file parameter of Doc2Vec.
Running the same benchmark, the results look like:
{1: 114.58889961242676, 2: 82.8250150680542, 3: 71.52109575271606, 4: 67.1010684967041, 5: 75.96869373321533, 6: 100.68377351760864, 7: 116.7901406288147, 8: 139.53436756134033}
The extra threads now seem useful; in my case 4 workers is the best.
It is still strange that the "regular" Doc2Vec mode does not parallelize that well.
I am trying to find the fastest way to sum 2 matrices of the same size using Numba. I came up with 3 different approaches but none of them could beat Numpy.
Here is my code:
import numpy as np
from numba import njit,vectorize, prange,float64
import timeit
import time
# function 1:
def sum_numpy(A, B):
    return A + B

# function 2:
sum_numba_simple = njit(cache=True, fastmath=True)(sum_numpy)

# function 3:
@vectorize([float64(float64, float64)])
def sum_numba_vectorized(A, B):
    return A + B

# function 4:
@njit('(float64[:,:],float64[:,:])', cache=True, fastmath=True, parallel=True)
def sum_numba_loop(A, B):
    n = A.shape[0]
    m = A.shape[1]
    C = np.empty((n, m), A.dtype)
    for i in prange(n):
        for j in prange(m):
            C[i, j] = A[i, j] + B[i, j]
    return C
#Test the functions with 2 matrices of size 1,000,000x3:
N=1000000
np.random.seed(123)
A=np.random.uniform(low=-10, high=10, size=(N,3))
B=np.random.uniform(low=-5, high=5, size=(N,3))
t1=min(timeit.repeat(stmt='sum_numpy(A,B)',timer=time.perf_counter,repeat=3, number=100,globals=globals()))
t2=min(timeit.repeat(stmt='sum_numba_simple(A,B)',timer=time.perf_counter,repeat=3, number=100,globals=globals()))
t3=min(timeit.repeat(stmt='sum_numba_vectorized(A,B)',timer=time.perf_counter,repeat=3, number=100,globals=globals()))
t4=min(timeit.repeat(stmt='sum_numba_loop(A,B)',timer=time.perf_counter,repeat=3, number=100,globals=globals()))
print("function 1 (sum_numpy): t1= ",t1,"\n")
print("function 2 (sum_numba_simple): t2= ",t2,"\n")
print("function 3 (sum_numba_vectorized): t3= ",t3,"\n")
print("function 4 (sum_numba_loop): t4= ",t4,"\n")
Here are the results:
function 1 (sum_numpy): t1= 0.1655790419999903
function 2 (sum_numba_simple): t2= 0.3019776669998464
function 3 (sum_numba_vectorized): t3= 0.16486266700030683
function 4 (sum_numba_loop): t4= 0.1862256660001549
As you can see, the results show that there isn't any advantage in using Numba in this case. Therefore, my question is:
Is there any other implementation that would increase the speed of the summation ?
Your code is bound by page faults (see here, here and there for more information about this). Page faults happen because the output array is newly allocated on every call. One solution is to preallocate it and then write into it, so pages do not have to be remapped in physical memory each time. np.add(A, B, out=C) does this, as indicated by @August in the comments. Another solution could be to configure the allocator not to give memory back to the OS, at the expense of a significant memory usage overhead (AFAIK TCMalloc can do that, for example).
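A minimal sketch of the preallocation idea, reusing the question's A and B (the buffer C is new here):

import numpy as np

# Preallocate the output once, outside of the timed/hot loop...
C = np.empty_like(A)

# ...then reuse it: np.add writes into the existing buffer instead of
# allocating a fresh array (and triggering page faults) on every call.
np.add(A, B, out=C)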
There is another issue on most platforms (especially x86 ones): cache-line write allocation in write-back caches is expensive during writes. The typical way to avoid this is to use non-temporal stores (if available on the target processor, which is the case on x86-64 but maybe not on others). That being said, neither Numpy nor Numba can do that yet. For Numba, I filed an issue covering a simple use case. The compilers themselves (GCC for Numpy and Clang for Numba) tend not to generate such instructions because they can hurt performance when the arrays fit in cache, and the compilers do not know the array size at compile time (they could generate specialized code when they can evaluate the amount of data involved, but this is not easy and can slow down other code). AFAIK, the only way to fix this is to write C code using low-level intrinsics or compiler directives. In your case, about 25% of the bandwidth is lost due to this effect, causing a slowdown of up to 33%.
Using multiple threads does not always make memory-bound code faster. In fact, it generally barely scales, because adding cores does not speed up execution once the RAM is already saturated, and only a few cores are usually needed to saturate the RAM on most platforms. Page-fault handling can benefit from multiple cores depending on the target system (Linux parallelizes it quite well, Windows generally does not scale well, and I don't know about macOS).
Finally, there is another issue: the code is not vectorized (at least not on my machine, although it can be). One solution is to flatten the array views and do one big loop that the compiler can vectorize more easily (the j-based loop is too small for SIMD instructions to be effective). The contiguity of the input arrays should also be specified so the compiler can generate fast SIMD code. Here is the resulting Numba code:
@njit('(float64[:,::1], float64[:,::1], float64[:,::1])', cache=True, fastmath=True, parallel=True)
def sum_numba_fast_loop(A, B, C):
    n, m = A.shape
    assert C.shape == A.shape
    A_flat = A.reshape(n*m)
    B_flat = B.reshape(n*m)
    C_flat = C.reshape(n*m)
    for i in prange(n*m):
        C_flat[i] = A_flat[i] + B_flat[i]
    return C
Here are results on my 6-core i5-9600KF processor with a ~42 GiB/s RAM:
sum_numpy: 0.642 s 13.9 GiB/s
sum_numba_simple: 0.851 s 10.5 GiB/s
sum_numba_vectorized: 0.639 s 14.0 GiB/s
sum_numba_loop serial: 0.759 s 11.8 GiB/s
sum_numba_loop parallel: 0.472 s 18.9 GiB/s
Numpy "np.add(A, B, out=C)": 0.281 s 31.8 GiB/s <----
Numba fast: 0.288 s 31.0 GiB/s <----
Optimal time: 0.209 s 32.0 GiB/s
The fast Numba code and the preallocated Numpy one saturate my RAM. Using more cores does not help (in fact it is a bit slower, most likely due to contention on the memory controller). Both are still sub-optimal since they do not use non-temporal store instructions, which would prevent cache-line write allocations (i.e. data being fetched from RAM before being written back). The optimal time is the one expected when using such instructions. Note that only 65-80% of the theoretical RAM bandwidth is expected to be reached because of mixed reads/writes: interleaving reads and writes causes low-level overheads that prevent the RAM from being fully saturated. For more information about how RAM works, please consider reading Introduction to High Performance Scientific Computing -- Chapter 1.3 and What Every Programmer Should Know About Memory (and possibly this).
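As a sanity check on how the bandwidth column above can be derived (my own reconstruction, assuming the benchmark's 100 calls per timing and (1,000,000 x 3) float64 arrays):

calls = 100                      # timeit number=100 in the question's benchmark
array_bytes = 1_000_000 * 3 * 8  # one (1e6, 3) float64 array: 24 MB

# Without non-temporal stores, writing C also fetches its cache lines from RAM
# first (write allocation): 2 reads + 1 fetch + 1 write-back = 4 transfers.
with_write_allocate = calls * 4 * array_bytes / 1024**3   # ~8.9 GiB moved
# With non-temporal stores: 2 reads + 1 write = 3 transfers.
non_temporal = calls * 3 * array_bytes / 1024**3          # ~6.7 GiB moved

print(with_write_allocate / 0.281)  # ~31.8 GiB/s  (np.add with out=C)
print(non_temporal / 0.209)         # ~32.0 GiB/s  (the "optimal time" row)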
I have a function that uses values stored in one array to operate on another array. It behaves similarly to numpy.histogram. For example:
import numpy as np
from numba import jit
@jit(nopython=True)
def array_func(x, y, output_counts, output_weights):
    for row in range(x.size):
        col = int(x[row] * 10)
        output_counts[col] += 1
        output_weights[col] += y[row]
    return (output_counts, output_weights)
# in the current code these arrays exist as PyTorch tensors
# on the GPU and get converted to numpy arrays on the CPU before
# being passed to "array_func"
x = np.random.randint(0, 11, (1000)) / 10
y = np.random.randint(0, 100, (10000))
output_counts, output_weights = array_func(x, y, np.zeros(y.size), np.zeros(y.size))
While this works for arrays it does not work for torch tensors that are on the GPU. This is close to what histogram functions do, but I also need the summation of binned values (i.e., the output_weights array/tensor). The current function requires me to continually pass the data from GPU to CPU, followed by the CPU function being run in series.
Can this function be converted to run in parallel on the GPU?
##EDIT##
The challenge is caused by the following line:
output_weights[col] += y[row]
If it weren't for that line I could just use the torch.histc function.
Here's my thought: GPUs are "fast" because they have hundreds/thousands of threads available and can run parts of a big job (or many smaller jobs) on these threads. However, if I convert the function above to work on torch tensors, there is no benefit to running it on the GPU (it actually kills the performance). I wonder if there is a way to break up x so each value gets sent to a different thread (similar to how apply_async works within multiprocessing)?
I'm open to other options.
In its current form the function is fast, but the GPU-to-CPU data transfer is killing me.
Your computation is indeed a general histogram operation. There are multiple ways to compute this on a GPU, depending on the number of items to scan, the size of the histogram, and the distribution of the values.
For example, one solution consists of building local histograms in each kernel block and then performing a reduction. However, this solution is not well suited to your case since len(x) / len(y) is relatively small.
An alternative solution is to perform atomic updates of the histogram in parallel. This solution only scales well if there are few atomic conflicts, which depends on the actual input data. Indeed, if all values of x are equal, then all updates are serialized, which is slower than doing the accumulation sequentially on a CPU (due to the overhead of the atomic operations). Such a case is frequent with small histograms, but assuming the distribution is close to uniform, this approach can be fine.
This operation can be done with Numba using CUDA (targeting Nvidia GPUs). Here is an example of a kernel solving your problem:
@cuda.jit
def array_func(x, y, output_counts, output_weights):
    tx = cuda.threadIdx.x  # Thread id in a 1D block
    ty = cuda.blockIdx.x   # Block id in a 1D grid
    bw = cuda.blockDim.x   # Block width, i.e. number of threads per block
    pos = tx + ty * bw     # Compute flattened index inside the array
    if pos < x.size:
        col = int(x[pos] * 10)
        cuda.atomic.add(output_counts, col, 1)
        cuda.atomic.add(output_weights, col, y[pos])
For more information about how to run this kernel, please read the documentation. Note that the arrays output_counts and output_weights can be created directly on the GPU to avoid transfers. x and y should also be on the GPU for better performance (otherwise a CPU reduction will almost certainly be faster). Also note that the kernel itself should be pretty fast, so the overhead of launching/synchronizing it and allocating/freeing temporary arrays may be significant, possibly even larger than the kernel itself (but still much cheaper than a round trip of the data to the CPU and back just to compute there, assuming the data is already on the GPU). Note also that such atomic accesses are only fast on fairly recent Nvidia GPUs, which have dedicated hardware units for atomic operations.
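A minimal launch sketch, assuming x and y are NumPy arrays on the host that get copied over once (the block size of 256 and the 11-bin output size, derived from int(x*10) with x in [0, 1], are assumptions based on the question's data):

import numpy as np
from numba import cuda

threads_per_block = 256
blocks = (x.size + threads_per_block - 1) // threads_per_block

# Keep inputs and outputs on the GPU to avoid repeated host/device transfers.
d_x = cuda.to_device(x)
d_y = cuda.to_device(y.astype(np.float64))
d_counts = cuda.to_device(np.zeros(11))   # 11 bins since int(x*10) is in 0..10
d_weights = cuda.to_device(np.zeros(11))

array_func[blocks, threads_per_block](d_x, d_y, d_counts, d_weights)
counts = d_counts.copy_to_host()
weights = d_weights.copy_to_host()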
I have a problem with nested multiprocessing which starts when I use scikit-learn (v. 0.22) Quadratic Discriminant Analysis. The relevant system configuration: a 24-thread Xeon machine running Fedora 30.
I repeatedly run QDA on a randomly selected subset of predictors:
def process(X, y, n_features, i=1):
    comb = np.random.choice(range(X.shape[1]), n_features, replace=False)
    qda = QDA(tol=1e-8)
    qda.fit(X[:, comb], y)
    y_pred = qda.predict(X[:, comb])
    return (accuracy_score(y, y_pred), comb, i)
where n_features is the number of features randomly selected from the full set of possible predictors, and X, y are the explanatory and dependent variables.
When n_features is 18 or less, the process runs in single-threaded mode, which means I can use any tool for parallel processing (I use ray). When n_features is 19 or above, for an unknown reason it (not me) starts all available threads, and the entire calculation takes more time, even compared to a single thread.
tmp = [process(X,y,n_features,i=1) for _ in range(1000)]
Based on my previous experience with other Linux libraries (R's gstat, precisely), the same situation (uncontrolled multithreading) was caused by the Linux BLAS implementation, but here it seemed that could not be the case. In general, the question is: what starts this multithreading, and how can it be controlled, given that LDA/QDA has no n_jobs parameter, so as to avoid nested multiprocessing?
QDA in scikit-learn does not expose n_jobs, meaning that you cannot set anything there. However, it could be due to numpy, which does not restrict the number of threads used by its BLAS backend.
The solutions to limit the number of threads are:
set the environment variable OMP_NUM_THREADS, MKL_NUM_THREADS, or OPENBLAS_NUM_THREADS to make sure the number of threads is limited;
use threadpoolctl, which provides a context manager to set the number of threads (see the sketch below).
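A minimal sketch of the threadpoolctl route, wrapped around the question's process function (limits=1 is just an example value):

from threadpoolctl import threadpool_limits

# Restrict all detected BLAS/OpenMP thread pools to a single thread while the
# QDA fits run, so an outer parallelization layer (e.g. ray) stays in control.
with threadpool_limits(limits=1):
    tmp = [process(X, y, n_features, i=1) for _ in range(1000)]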
I know that this question has been asked many times, but the answers are not applicable here.
This is one answer for a parallelized loop using multiprocessing on Stack Overflow:
import multiprocessing as mp
def processInput(i):
    return i * i

if __name__ == '__main__':
    inputs = range(1000000)
    pool = mp.Pool(processes=4)
    results = pool.map(processInput, inputs)
    print(results)
This code works fine. But if I increase the range to 1000000000, my 16 GB of RAM fills up completely and I get [Errno 12] Cannot allocate memory. It seems as if the map function starts as many processes as possible. How do I limit the number of parallel processes?
The pool.map function starts 4 processes as you instructed it (in the line processes=4 you instruct the pool on how many processes it can use to perform your logic).
There is however a different issue underlying this implementation.
The pool.map function will return a list of objects, in this case numbers.
Python numbers do not act like ints in ANSI C: they are objects with overhead, and they do not overflow (e.g. wrap to -2^31 when exceeding 2^31 - 1 on 32-bit).
Also, Python lists are not arrays and incur their own overhead.
To be more specific, on python 3.6, running the following code will reveal some overhead:
>>>import sys
>>>t = [1,2,3,4]
>>>sys.getsizeof(t)
96
>>>t = [x for x in range(1000)]
>>>sys.getsizeof(t)
9024
So this means 24 bytes per number on small lists and ~9 bytes on large lists.
So for a list the size of 10^9 we get about 8.5GB
EDIT: 1. As tfb mentioned, this is not even the size of the underlying Number objects, just pointers and list overhead, meaning there is much more memory overhead I did not account for in the original answer.
2. The default Python installation on Windows is 32-bit (you can get a 64-bit installation, but you need to look in the full list of downloads on the Python website), so I assumed you are using the 32-bit installation.
In Python 2, range(1000000000) creates a list of 10^9 ints. This is around 8 GB (8 bytes per int on a 64-bit system). You are then asking pool.map to build another list of 10^9 results. A really, really smart implementation might be able to do this on a 16 GB machine, but it's basically a lost cause.
In Python 2 you could try using xrange, which might or might not help. In Python 3, range is already lazy, but pool.map still materializes the full result list, so the memory problem remains.
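Beyond those answers, a common workaround (not from the original answers, just a standard pattern) is to consume results lazily with imap and a chunksize, so neither the inputs nor the outputs are ever held in RAM all at once. A minimal sketch:

import multiprocessing as mp

def processInput(i):
    return i * i

if __name__ == '__main__':
    total = 0
    with mp.Pool(processes=4) as pool:
        # imap yields results as they arrive instead of building one huge list;
        # chunksize batches the work to keep inter-process overhead reasonable.
        for result in pool.imap(processInput, range(1000000000), chunksize=10000):
            total += result   # aggregate instead of storing every result
    print(total)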