I just read about Bayesian optimization and I want to try it.
I installed scikit-optimize and checked the API, and I'm confused:
I read that Bayesian optimization starts with some initial samples.
I can't see where I can change this number in BayesSearchCV?
n_points changes the number of parameter settings sampled in parallel, and n_iter is the number of iterations (and, if I'm not wrong, the iterations can't run in parallel, since the algorithm improves the parameters after every iteration).
I read that we can use different acquisition functions.
I can't see where I can change the acquisition function in BayesSearchCV?
Is this something you are looking for?
BayesSearchCV(..., optimizer_kwargs={'n_initial_points': 20, 'acq_func': 'gp_hedge'}, ...)
skopt.Optimizer is the one actually doing the hyperparameter optimization.
BayesSearchCV will build the Optimizer with the optimizer_kwargs parameters.
https://github.com/scikit-optimize/scikit-optimize/blob/de32b5fd2205a1e58526f3cacd0422a26d315d0f/skopt/searchcv.py#L551
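For context, here is a minimal sketch of how those optimizer_kwargs slot into a full search (the SVC estimator and search space are just placeholders, assuming a recent scikit-optimize):

from skopt import BayesSearchCV
from skopt.space import Real
from sklearn.svm import SVC
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

opt = BayesSearchCV(
    SVC(),
    {'C': Real(1e-3, 1e3, prior='log-uniform'),
     'gamma': Real(1e-4, 1e1, prior='log-uniform')},
    n_iter=32,                                 # total parameter settings evaluated
    n_points=1,                                # settings sampled in parallel per step
    optimizer_kwargs={'n_initial_points': 20,  # random samples before the surrogate model kicks in
                      'acq_func': 'EI'},       # or 'PI', 'LCB', 'gp_hedge'
    cv=3,
)
opt.fit(X, y)
print(opt.best_params_)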
I'm attempting to iteratively train a model on multiple texts I supply myself. However, I keep running into an issue when I train the model more than once:
ValueError: You must specify either total_examples or total_words, for proper learning-rate and progress calculations. If you've just built the vocabulary using the same corpus, using the count cached in the model is sufficient: total_examples=model.corpus_count.
I'm currently initiating my model like this:
model = Word2Vec(sentences, min_count=0, workers=cpu_count())
model.build_vocab(sentences, update=False)
model.save('firstmodel.model')
model = Word2Vec.load('firstmodel.model')
and subsequently training it iteratively like this:
model.build_vocab(sentences, update = True)
model.train(sentences, totalexamples=model.corpus_count, epochs=model.epochs)
What am I missing here?
Somehow it worked when I trained just one additional model, so I'm not sure why it doesn't work beyond two models...
First, the error message says you need to supply either the total_examples or total_words parameter to train() (so that it has an accurate estimate of the total training-corpus size).
Your code, as currently shown, only supplies totalexamples – a parameter name missing the necessary _. Correcting this typo should remedy the immediate error.
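That is, the immediate fix is just the parameter name:

model.train(sentences, total_examples=model.corpus_count, epochs=model.epochs)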
However, some other comments on your usage:
repeatedly calling train() with different data is an expert technique highly subject to error or other problems. It's not the usual way of using Word2Vec, nor the way most published results were reached. You can't count on it to always improve the model with new words; it might make the model worse, as new training sessions update some-but-not-all words, and alter the (usual) property that the vocabulary has one consistent set of word-frequencies from one single corpus. The best course is to train() once, with all available data, so that the full vocabulary, word-frequencies, & equally-trained word-vectors are achieved in a single consistent session.
min_count=0 is almost always a bad idea with word2vec: words with few examples in the corpus should be discarded. Trying to learn word-vectors for them not only gets weak vectors for those words, but dilutes/distracts the model from achieving better vectors for surrounding more-common words.
a count of workers up to your local cpu_count() only reliably helps up to about 4-12 workers, depending on other parameters & the efficiency of your corpus-reading; beyond that, more workers can hurt, due to inefficiencies in the Python GIL & Gensim's corpus-to-worker handoffs. Finding the actual best count for your setup is, unfortunately, still just a matter of trial and error. But if you've got 16 (or more) cores, your setting is almost sure to do worse than a lower workers number.
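To illustrate the single-session approach recommended above, a minimal sketch (the corpus names are hypothetical, and the min_count/workers values are just examples):

from gensim.models import Word2Vec

all_sentences = list(corpus_a) + list(corpus_b) + list(corpus_c)  # hypothetical corpora, combined up front
model = Word2Vec(all_sentences, min_count=5, workers=4)           # one consistent vocabulary & training pass
model.save('combined.model')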
I want to use ReLU1 non-linear activation. ReLU1 is linear in [0,1] but clamps values less than 0 to 0 and clamps values more than 1 to 1.
It will be used only for the last layer of my deep net in PyTorch, which has a really high-resolution output of 2048x4096. Since the code has to be highly optimized in terms of speed and memory, I do not know which of the following will be the best implementation.
Following are the two implementations I can think of for the tensor x:
x.clamp_(min=0.0, max=1.0)
For this I am unable to see the source code linked in its docs, so I do not know if it's the best choice. I would prefer an in-place operation, since backpropagation can happen through it.
The second alternative I have is to use torch.nn.functional.hardtanh_(x, min_val=0.0, max_val=1.0). This is definitely an in-place function, and the source code says that it calls the C++ implementation torch._C._nn.hardtanh(input, min_val, max_val), so I think it will be fast.
Please suggest which is the most efficient implementation, or another one if possible.
Thank you.
Without trying it, my guess is that clamp and hardtanh will have the same speed, and it will be hard to do this operation any faster if you optimize it in isolation. The arithmetic is trivial so this operation will be bottlenecked by GPU memory bandwidth. To run faster, you'd want to fuse this operation with the operation that produced x. If you don't want to write a custom kernel for the combined operation, you can try using TorchScript.
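As a rough illustration of that fusion idea (the multiplication here just stands in for whatever elementwise op produces x; not benchmarked):

import torch

@torch.jit.script
def scaled_relu1(x: torch.Tensor, scale: float) -> torch.Tensor:
    # Two elementwise ops; TorchScript's fuser can merge them into one kernel,
    # so the 2048x4096 tensor is read and written only once.
    return (x * scale).clamp(min=0.0, max=1.0)

x = torch.randn(2048, 4096)
y = scaled_relu1(x, 0.5)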
I have a CNN model. Requests to use this model, for example to classify a picture, arrive about once per second.
I would like to collect the requests as new unsupervised data and keep training my model.
My question is: how can I handle the training task and the classification task effectively?
I will explain why it becomes a problem:
Every training step takes a long time, at least several seconds, uses the GPU and is not interruptible. So, if my classification tasks use the GPU too, I cannot respond to the requests in time. I would like to run the classification tasks on the CPU, but it looks like Theano does not support two different config.device settings in one process.
Using multiple processes is not acceptable, because my memory is limited and Theano costs too much.
Any help or advice would be appreciated.
You could build two separate copies of the same CNN, one on the CPU and one on the GPU. I think this could be done under either the old GPU backend or the new one, but in different ways. Some ideas:
Under the old backend:
Load Theano with device=cpu. Build your inference function and compile it. Then call theano.sandbox.cuda.use('gpu'), build a new copy of your inference function, and take gradients of that one to make any training functions. Now the inference function should execute on the CPU, and the training should happen on the GPU. (I've never done this on purpose, but I've had it happen to me by accident!)
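A rough, untested sketch of that old-backend recipe (the tiny softmax model and shapes are just placeholders):

import numpy as np
import theano
import theano.tensor as T

# Process started with THEANO_FLAGS="device=cpu": this copy compiles for the CPU.
x = T.matrix('x')
y = T.ivector('y')
W_cpu = theano.shared(np.zeros((100, 10), dtype='float32'), name='W_cpu')
infer_cpu = theano.function([x], T.nnet.softmax(x.dot(W_cpu)))

# Now switch to the GPU and build a second copy plus the training function.
theano.sandbox.cuda.use('gpu')
W_gpu = theano.shared(W_cpu.get_value(), name='W_gpu')
p = T.nnet.softmax(x.dot(W_gpu))
loss = T.nnet.categorical_crossentropy(p, y).mean()
train_gpu = theano.function([x, y], loss,
                            updates=[(W_gpu, W_gpu - 0.01 * T.grad(loss, W_gpu))])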
Under the new backend:
As far as I know, you have to tell Theano about any GPUs right when importing, not later. In this case, you could use THEANO_FLAGS="contexts=dev0->cuda0", which doesn't force using one device over another. Then build the inference version of your function like normal, and for the training version, again put all the shared variables on the GPU, and the input variables to any of your training functions should also be GPU variables (e.g. input_var_1.transfer('dev0')). When all your functions are compiled, look at the programs using theano.printing.debugprint(function) to see what's on GPU vs CPU. (When compiling the CPU functions, it might give a warning that it cannot infer the context, and as far as I've seen, that lands it on the CPU...not sure if this behavior is safe to depend on.)
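A similarly untested sketch of the new-backend-specific pieces (run under THEANO_FLAGS="contexts=dev0->cuda0"; same placeholder model as above):

import numpy as np
import theano
import theano.tensor as T

# Training side: shared variables and inputs pinned to the GPU context.
W_gpu = theano.shared(np.zeros((100, 10), dtype='float32'), target='dev0')
x_gpu = T.matrix('x').transfer('dev0')
y = T.ivector('y')
loss = T.nnet.categorical_crossentropy(T.nnet.softmax(x_gpu.dot(W_gpu)), y).mean()
train = theano.function([x_gpu, y], loss.transfer('dev0'),  # keep the output on the GPU
                        updates=[(W_gpu, W_gpu - 0.01 * T.grad(loss, W_gpu))])

theano.printing.debugprint(train)  # inspect which ops landed on GPU vs CPU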
In either case, this depends on your GPU-based functions NOT returning anything to the CPU (make sure the output variables are GPU ones). This should allow the training function to run concurrently with your inference function; later you grab what you need back to the CPU. For example, after a training step, just copy the new values over to your inference network's parameters.
Let us hear what you come up with!
Applying Spark's logistic regression to a specific dataset requires defining a number of iterations. So far I've learned that outputting the result of the cost function on each iteration might be useful information to plot: it can be used to visualize how many iterations a function needs to converge to a minimum. I was wondering if there is a way to output such information in Spark? Looping over a train() function with different iteration numbers sounds like a solution that requires a lot of time on large datasets. It would be nice to know if there is a better one already built in. Thanks for any advice on this topic.
After you've trained a model (call it myModel) that has such a history, you can get the iteration-by-iteration history with
myModel.summary.objectiveHistory.foreach(...)
There's a nice example here in the Spark ML documentation -- once you know the right search terms.
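For example, in PySpark (assuming training is an existing DataFrame with label and features columns):

from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(maxIter=100, regParam=0.01)
model = lr.fit(training)   # training: a (label, features) DataFrame prepared elsewhere

# One objective (loss) value per iteration; plot it to see where training converged.
for i, obj in enumerate(model.summary.objectiveHistory):
    print(i, obj)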
I am trying to predict the inter-arrival time of incoming network packets. I measure the inter-arrival times of network packets and represent this data in the form of binary features: xi = 0,1,1,1,0,..., where xi=0 if the inter-arrival time is less than a break-even time and 1 otherwise. The data has to be mapped into two possible classes C={0,1}, where C=0 represents a short inter-arrival time and C=1 represents a long one. I want to implement the classifier in an online fashion, where as soon as I observe a feature vector xi=0,1,1,0..., I calculate the MAP class. Since I don't have a prior estimate of the conditional and prior probabilities, I initialize them as follows:
p(x=0|c=0)=p(x=1|c=0)=p(x=0|c=1)=p(x=1|c=1)=0.5
p(c=0)=p(c=1)=0.5
For each feature vector (x1=m1,x2=m2,...,xn=mn), when I output a class C, I update the conditional and prior probabilities as follows:
p(xi=mi|y=c)=a+(1-a)*p(xi=mi|y=c)
p(y=c)=b+(1-b)*p(y=c)
The problem is that I am always getting a biased prediction. Since long inter-arrival times are comparatively less frequent than short ones, the posterior of the short class always remains higher than that of the long class. Is there any way to improve this, or am I doing something wrong? Any help will be appreciated.
Since you have a long time series, the best path would probably be to take into account more than a single previous value. The standard way of doing this is to use a time window, i.e. split the long vector Xi into overlapping pieces of a constant length, with the last value treated as the class, and use them as the training set. This can also be done on streaming data in an online manner, by incrementally updating the NB model with new data as it arrives.
Note that using this method, other algorithms might end up being a better choice than NB.
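As a rough sketch of that windowed, incrementally-updated setup, here swapped onto scikit-learn's BernoulliNB with partial_fit (the window length and the bit stream are placeholders):

import numpy as np
from sklearn.naive_bayes import BernoulliNB

window = 5                                    # number of previous indicators used as features
nb = BernoulliNB()
# Seed with one dummy sample per class, mirroring the uniform priors in the question.
nb.partial_fit(np.zeros((2, window)), [0, 1], classes=[0, 1])

stream = [0, 1, 1, 1, 0, 1, 0, 0, 1, 1]       # hypothetical inter-arrival indicators
for t in range(window, len(stream)):
    x = np.array(stream[t - window:t]).reshape(1, -1)
    pred = nb.predict(x)[0]                   # MAP class for the next arrival
    nb.partial_fit(x, [stream[t]])            # incremental update with the observed label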
Weka (version 3.7.3 and up) has a very nice dedicated tool supporting time-series analysis. Alternatively, MOA, which is also based on Weka, supports modeling of streaming data.
EDIT: it might also be a good idea to move from binary features to the real values (maybe normalized) and apply the threshold post-classification. This might give more information to the model (NB or other), allowing better accuracy.