I am trying to develop a lightweight anomaly detection system trained with an unsupervised learning method on system parameters such as CPU and RAM utilization. I could not think of anything beyond a self-organizing map (SOM). Is there any other learning technique I should consider here?
You don't have many options here with a SOM. The only thing you could consider is whether to do batch or sequential training, if the implementation you use offers both options. But this choice mainly affects training time (batch is much quicker) and not the resulting map (in theory at least).
You could also select a distance function other than the Euclidean, but most of the literature doesn't bother with this.
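If you do stick with a SOM, here is a minimal sketch of how the pipeline could look in Python. It assumes the MiniSom package (not mentioned above, just one common implementation) and a matrix X of scaled CPU/RAM readings; a sample's distance to its best matching unit serves as the anomaly score.

```python
import numpy as np
from minisom import MiniSom  # pip install minisom (assumed implementation)

# Placeholder for [cpu_util, ram_util] samples, already scaled to [0, 1].
X = np.random.rand(1000, 2)

som = MiniSom(10, 10, X.shape[1], sigma=1.0, learning_rate=0.5, random_seed=42)
som.random_weights_init(X)
som.train_batch(X, 5000)        # or som.train_random(X, 5000) for sequential training

weights = som.get_weights()

def score(x):
    # Quantization error: distance from the sample to its best matching unit.
    i, j = som.winner(x)
    return np.linalg.norm(x - weights[i, j])

scores = np.array([score(x) for x in X])
threshold = np.percentile(scores, 99)   # flag the top 1% as anomalies (tune as needed)
anomalies = X[scores > threshold]
```

Both train_batch and train_random are available in that implementation, which maps onto the batch vs. sequential choice above, and recent versions also expose a non-Euclidean distance option if you want to experiment with the distance function.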
I'm currently researching image classification with TFF (the Federated Learning for Image Classification tutorial) on EMNIST.
I'm looking at hyperparameters for the model, namely the learning rate and the optimizer. Is grid search a good approach here? In a real-world scenario you would simply sample clients/devices from the overall population, so if I were to do a grid search, would I have to fix my client samples first? And in that case, does it make sense to do the grid search at all?
What would be a typical real-world way of selecting these parameters, i.e. is this more of a heuristic approach?
Colin . . .
I think there is still a lot of open research in these areas for Federated Learning.
Page 6 of https://arxiv.org/abs/1912.04977 describes a cross-device and a cross-silo setting for federated learning.
In cross-device settings, the population is generally very large (hundreds of thousands or millions) and participants are generally only seen once during the entire training process. In this setting, https://arxiv.org/abs/2003.00295 demonstrates that hyper-parameters such as the client learning rate play an outsized role in determining the speed of model convergence and the final model accuracy. To demonstrate that finding, we first performed a large coarse grid search to identify a promising hyper-parameter space, and then ran finer grids in the promising regions. However, this can be expensive depending on the compute resources available for simulation, since the training process must be run to completion to understand these effects.
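As a rough sketch of that coarse-then-fine procedure (the run_federated_training function below is purely hypothetical, a stand-in for whatever simulation loop you run to completion, e.g. with TFF):

```python
import itertools
import numpy as np

def run_federated_training(client_lr, server_lr, clients_per_round=10, rounds=100):
    # Hypothetical placeholder: plug in your federated simulation here and
    # return the final evaluation accuracy.
    return np.random.rand()

def grid_search(client_lrs, server_lrs):
    return {
        (c_lr, s_lr): run_federated_training(c_lr, s_lr)
        for c_lr, s_lr in itertools.product(client_lrs, server_lrs)
    }

# Coarse pass over several orders of magnitude ...
coarse = grid_search(np.logspace(-3, 0, 4), np.logspace(-3, 0, 4))
best_c, best_s = max(coarse, key=coarse.get)

# ... then a finer pass around the most promising region.
fine = grid_search(best_c * np.logspace(-0.5, 0.5, 5),
                   best_s * np.logspace(-0.5, 0.5, 5))
```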
It might be possible to view federated learning as very large mini-batch SGD. In fact the FedSGD algorithm in https://arxiv.org/abs/1602.05629 is exactly this. In this regime, re-using theory from centralized model training may be fruitful.
Finally https://arxiv.org/abs/1902.01046 describes a system used at Google for federated learning, and does have a small discussion on hyper-parameter exploration.
I'm using PyTorch to implement an intensive sequence of matrix operations, using methods such as torch.mm or torch.dot. I was wondering if PyTorch uses multithreading or other optimization mechanisms to speed up the process. I am not using a GPU. I would appreciate it if you could tell me how fast these methods are and whether I need to take any action to help the process along.
PyTorch uses an efficient BLAS implementation and multithreading (OpenMP, if I'm not wrong) to parallelize such operations across multiple cores. Some performance loss comes from Python itself: since it is an interpreted language, no significant compiler-like optimization can be done. You can use the jit module to speed up the "wrapper" code around the matrix multiplies, but for anything more than very small matrices this cost is probably negligible.
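You can check (and cap) the intra-op threading PyTorch uses for CPU linear algebra with the standard torch utilities, for example:

```python
import torch

# Shows how PyTorch was built to parallelize (OpenMP/MKL settings, thread counts).
print(torch.__config__.parallel_info())
print("intra-op threads:", torch.get_num_threads())

# Optionally cap the thread count, e.g. when running several processes per machine.
torch.set_num_threads(4)
```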
One big improvement you may be able to get manually, but which PyTorch doesn't apply automatically, is to properly order the matrix multiplies. As you probably know, depending on matrix shapes, a multiplication ABCD may have different performance computed as A(B(CD)) than if computed as (AB)(CD), etc.
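A small sketch of that effect, with shapes chosen to exaggerate the difference; the product is the same, but the intermediates (and hence the cost) are not:

```python
import time
import torch

A = torch.randn(1000, 10)
B = torch.randn(10, 1000)
C = torch.randn(1000, 10)
D = torch.randn(10, 1000)

def timed(fn):
    t0 = time.perf_counter()
    out = fn()
    return out, time.perf_counter() - t0

# Left-to-right builds two large 1000x1000 intermediates ...
left, t_left = timed(lambda: ((A @ B) @ C) @ D)
# ... while this ordering keeps the intermediates at 10x10 and 10x1000.
right, t_right = timed(lambda: A @ ((B @ C) @ D))

print(f"left-to-right: {t_left:.4f}s, reordered: {t_right:.4f}s")
print("max abs diff:", (left - right).abs().max().item())
```

If your PyTorch version provides torch.linalg.multi_dot, it can pick an efficient ordering for a chain of 2-D matrices automatically.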
Wannabe data scientist here, trying to understand: as a data scientist, when and why would you use a Probability Density Function (PDF)?
Sharing a scenario and a few pointers for learning about this and other such functions, like the CDF and PMF, would be really helpful. Do you know of any book that covers these functions from a practical standpoint?
Why?
Probability theory is very important for modern data-science and machine-learning applications, because (in a lot of cases) it allows one to "open up a black box" and shed some light into the model's inner workings, and with luck find necessary ingredients to transform a poor model into a great model. Without it, a data scientist's work is very much restricted in what they are able to do.
A PDF is a fundamental building block of probability theory, absolutely necessary for any sort of probabilistic reasoning, along with expectation, variance, priors and posteriors, and so on.
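As a tiny concrete illustration (assuming numpy/scipy, with made-up data standing in for, say, observed response times): fit a distribution to the data, then use its PDF and CDF to judge how surprising a new value is.

```python
import numpy as np
from scipy import stats

# Placeholder data, e.g. observed response times in milliseconds.
data = np.random.normal(loc=50.0, scale=5.0, size=1000)

mu, sigma = stats.norm.fit(data)           # maximum-likelihood parameter estimates
x = 65.0
density = stats.norm.pdf(x, mu, sigma)     # relative likelihood of observing x
tail = 1 - stats.norm.cdf(x, mu, sigma)    # P(X > x): how rare a value this large is

print(f"mu={mu:.2f}, sigma={sigma:.2f}, pdf({x})={density:.4f}, P(X>{x})={tail:.4f}")
```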
Some examples here on StackOverflow, from my own experience, where a practical issue boils down to understanding data distribution:
Which loss-function is better than MSE in temperature prediction?
Binary Image Classification with CNN - best practices for choosing “negative” dataset?
How do neural networks account for outliers?
When?
The questions above provide some examples; here are a few more if you're interested, and the list is by no means complete:
What is the 'fundamental' idea of machine learning for estimating parameters?
Role of Bias in Neural Networks
How to find probability distribution and parameters for real data? (Python 3)
I personally try to find a probabilistic interpretation whenever possible (choice of loss function, parameters, regularization, architecture, etc.), because this way I can move from blind guessing to making reasonable decisions.
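For example, the ubiquitous MSE loss is itself a probabilistic choice: if you assume targets are generated as $y = f_\theta(x) + \varepsilon$ with Gaussian noise $\varepsilon \sim \mathcal{N}(0, \sigma^2)$, then maximizing the likelihood of the data is exactly minimizing the squared error:

$$\arg\max_\theta \prod_i p(y_i \mid x_i, \theta) \;=\; \arg\min_\theta \sum_i \big(y_i - f_\theta(x_i)\big)^2$$

Swap the Gaussian for a Laplace distribution and you get the MAE instead, which is one way to reason about the loss-function question linked above.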
Reading
This is very opinion-based, but at least a few books are really worth mentioning: The Elements of Statistical Learning, An Introduction to Statistical Learning: with Applications in R, or Pattern Recognition and Machine Learning (if your primary interest is machine learning). That's just a start; there are dozens of books on more specific topics, like computer vision, natural language processing and reinforcement learning.
Objective: a node.js function that can be passed a news article (title, text, tags, etc.) and will return a category for that article ("Technology", "Fashion", "Food", etc.)
I'm not picky about exactly what categories are returned, as long as the list of possible results is finite and reasonable (10-50).
There are Web APIs that do this (e.g., Alchemy), but I'd prefer not to incur the extra cost (both in terms of external HTTP requests and also $$) if possible.
I've had a look at the node module "natural". I'm a bit new to NLP, but it seems like maybe I could achieve this by training a BayesClassifier on a reasonable word list. Does this seem like a good/logical approach? Can you think of anything better?
I don't know if you are still looking for an answer, but let me put in my two cents for anyone who happens to come back to this question.
Having worked in NLP, I would suggest you look into the following approach to solve the problem.
Don't look for a single-package solution. There are great packages out there, no doubt, for lots of things. But when it comes to active research areas like NLP, ML and optimization, the tools tend to be at least 3 or 4 iterations behind what's there in academia.
Coming to the core problem: what you want to achieve is text classification.
The simplest way to achieve this would be an SVM multiclass classifier.
Simplest, yes, but also with very, very (note the double stress) reasonable classification accuracy, runtime performance and ease of use.
The thing you would need to work on is the feature set used to represent your news article/text/tags. You could use a bag-of-words model, add named entities as additional features, and use article location/time as features (though for simple category classification these might not give you much improvement).
The bottom line is: SVMs work great, they have multiple implementations, and at runtime you don't really need much ML machinery.
Feature engineering, on the other hand, is very task-specific. But given a basic set of features and good labelled data, you can train a very decent classifier.
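Not node.js, but just to show how little machinery the recipe needs, here is a rough Python sketch of the same idea (bag-of-words features plus a linear SVM) using scikit-learn; the articles and labels below are obviously placeholders for your labelled data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Placeholder training data: concatenated title + text + tags per article.
articles = ["Apple unveils a new chip for its laptops",
            "Paris fashion week opens with bold new designs"]
labels   = ["Technology", "Fashion"]

# Bag-of-words (unigrams + bigrams) feeding a linear SVM.
clf = make_pipeline(CountVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(articles, labels)

print(clf.predict(["Samsung releases a new foldable phone"]))
```

For more than two categories, LinearSVC trains one-vs-rest binary classifiers under the hood, which matches the N-classifier setup described further down.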
Here are some resources for you.
http://svmlight.joachims.org/
SVM multiclass is what you would be interested in.
And here is a tutorial by SVM zen himself!
http://www.cs.cornell.edu/People/tj/publications/joachims_98a.pdf
I don't know about the stability of this, but from the code it's a binary SVM classifier, which means that if you have a known set of N tags you want to classify the text into, you will have to train N binary SVM classifiers, one for each of the N category tags.
Hope this helps.
I have written a backpropagation class in VB.NET (it works well) and I'm using it in a C# artificial intelligence project.
But I have an AMD Phenom X3 at home and an Intel i5 at school, and my neural network is not multithreaded.
How can I convert that backpropagation class to a multithreaded algorithm? Or how could I use GPGPU programming in it? Or should I use a third-party library that has a multithreaded backpropagation neural network?
Jeff Heaton has recommended that you use resilient propagation (RPROP) instead of backpropagation. There are examples of how to do multithreaded RPROP (MPROP):
Article on C# multithreaded backpropagation (from Jeff Heaton)
Chapter 7.2.1, "Propagation and Multithreading" (p. 94 of Introduction to Encog 2.5 for C#)
It's difficult to discuss all of the details here, so I would recommend that you read that article and take a look at the relevant chapters of the book I referenced. This, of course, assumes you're familiar with concurrent programming.
Update:
Resilient propagation will typically outperform backpropagation by a considerable factor. Additionally, RPROP has no parameters that must be set. Backpropagation requires that a learning rate and momentum value be specified. Finding an optimal learning rate and momentum value for backpropagation can be difficult. This is not necessary with resilient propagation.
(source: Encog Machine Learning)
I've tried implementing multiple threads for RPROP batch processing, but it always seemed to be slower than using a single thread. I tried to parallelize separately at the loop level ("#pragma omp parallel") and by calculating the errors, gradients and weights in separate threads. In my interpretation, the computation done in each thread is too small to outweigh the overhead of switching threads and synchronizing the results (mutexes). I'm wondering if I've done something wrong? My conclusion is that it would be smarter to run RPROP single-threaded while training multiple neural networks at the same time in separate threads. Most setups usually involve multiple interconnected NNs, so that would make sense.
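For what it's worth, here is a language-agnostic sketch of that conclusion (in Python rather than C#, with train_network as a purely hypothetical stand-in for a single-threaded RPROP routine): parallelize across independent networks instead of inside one training run.

```python
from concurrent.futures import ProcessPoolExecutor

def train_network(seed):
    # Hypothetical placeholder: build a network with this random seed,
    # run single-threaded RPROP to convergence, and return the final error.
    return float(seed)

if __name__ == "__main__":
    seeds = range(8)  # e.g. 8 networks with different initial weights
    with ProcessPoolExecutor(max_workers=4) as pool:
        final_errors = list(pool.map(train_network, seeds))
    print(final_errors)
```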