Dimensionality reduction by specific methods

I'm working on feature extraction from facial images using local binary patterns (LBP). How can I reduce the dimensionality of my feature vector? Which method should I use for feature reduction?

Check out SVD (singular value decomposition). In practice, it replaces the matrix M with a new matrix Mk of lower rank k.
For example
import numpy as np
U, S, V = np.linalg.svd(lena_image)
Here Mk is the product of the truncated factors: keep only the k largest singular values, the corresponding columns of U, and the corresponding rows of V.
You can think of it as a compression that preserves only the most important information, which is exactly a form of feature extraction.
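Here is a minimal, untested sketch of that truncation step (the random matrix M, the choice of k, and the final projection are illustrative stand-ins, not part of the original answer):
import numpy as np

M = np.random.rand(128, 128)                # stand-in for a grayscale image
U, S, V = np.linalg.svd(M, full_matrices=False)
k = 20                                      # number of singular values to keep (illustrative)
Mk = U[:, :k] @ np.diag(S[:k]) @ V[:k, :]   # best rank-k approximation of M
# For an (n_samples, n_features) LBP feature matrix, the same idea yields
# a k-dimensional feature vector per sample:
X_reduced = U[:, :k] * S[:k]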

Related

Create random Numpy array following a given distribution and trend

I want to create data that follows the same distribution and trend as sample data, using numpy.
For example, say I have an array x whose trend is increasing and whose distribution is, say, log-normal. Can I create another random array that follows the same distribution and trend using numpy?
Well, numpy doesn't have the capability to fit distributions to your data. You can either do it manually using the method you prefer (MLE or the method of moments), or you can use scipy, which can fit distributions to your data as shown below:
import scipy.stats as st
# Inferred parameters of the distribution
s, loc, scale = st.lognorm.fit(x)
# Distribution object
dist = st.lognorm(s, loc, scale)
# generate 1000 random samples
samples = dist.rvs(size=1000)
SciPy uses MLE by default.
You will have to explore your data and look into which distributions fit best; numpy and scipy can't do that for you. A rough way to compare candidate fits is sketched after the link below.
Documentation of fit method: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.rv_continuous.fit.html
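As an untested sketch of that exploration (the candidate list and the stand-in data are illustrative), you could compare a few fits with a Kolmogorov-Smirnov test:
import numpy as np
import scipy.stats as st

x = np.random.lognormal(mean=0.0, sigma=0.5, size=500)  # stand-in for your sample

# A lower KS statistic (and higher p-value) indicates a better fit
for name in ['lognorm', 'gamma', 'norm']:
    dist = getattr(st, name)
    params = dist.fit(x)
    stat, p = st.kstest(x, name, args=params)
    print(name, stat, p)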

RandomForestClassifier in Multi-label problem - how it works?

How does the RandomForestClassifier of sklearn handle a multilabel problem (under the hood)?
For example, does it break the problem into distinct one-label problems?
Just to be clear, I have not actually tested it yet, but I see y : array-like, shape = [n_samples] or [n_samples, n_outputs] in the .fit() function of the RandomForestClassifier.
Let me cite scikit-learn. From the random forest section of the user guide:
Like decision trees, forests of trees also extend to multi-output problems (if Y is an array of size [n_samples, n_outputs]).
And from the multi-output problems section of the decision tree user guide:
… to support multi-output problems. This requires the following changes:
Store n output values in leaves, instead of 1;
Use splitting criteria that compute the average reduction across all n outputs.
I hope this answers your question. If not, you can look at the section's reference:
M. Dumont et al., Fast multi-class image annotation with random subwindows and multiple output randomized trees, International Conference on Computer Vision Theory and Applications, 2009.
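To see this multi-output path in action, here is a minimal, untested sketch (the toy data and parameter values are illustrative):
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = rng.integers(0, 2, size=(100, 3))  # multi-output target: [n_samples, n_outputs]

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X, y)                          # no wrapper needed; handled natively

print(clf.predict(X[:2]).shape)        # (2, 3): one prediction per output
print(len(clf.predict_proba(X[:2])))   # 3: one probability array per output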
I was a bit confused when I started using trees. If you refer to the sklearn doc:
https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier
and scroll down the methods to predict_proba, you can see:
"The predicted class probability is the fraction of samples of the same class in a leaf."
So in predict, the class is the mode of the classes in that leaf. This can change if you use weighted classes:
"class_weight : dict, list of dicts, “balanced” or None, default=None
Weights associated with classes in the form {class_label: weight}. If not given, all classes are supposed to have weight one."
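Here is a small, untested illustration of both points (the toy data are made up; with this split the leaf at x=1 holds classes [0, 1, 1]):
from sklearn.tree import DecisionTreeClassifier

X = [[0], [0], [1], [1], [1]]
y = [0, 0, 0, 1, 1]

# Unweighted: predict_proba reports the class fractions in the leaf
tree = DecisionTreeClassifier(max_depth=1).fit(X, y)
print(tree.predict_proba([[1]]))    # [[0.333..., 0.666...]] -> predicts class 1

# Up-weighting class 0 changes the weighted fractions and flips the prediction
tree_w = DecisionTreeClassifier(max_depth=1, class_weight={0: 10, 1: 1}).fit(X, y)
print(tree_w.predict_proba([[1]]))  # class 0 now dominates the leaf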
Hope this helps! :)

Python sensitivity analysis from measured data with SALib toolbox

I would like to understand how to use the SALib python toolbox to perform a Sobol sensitivity analysis (to study the influence of individual parameters and their interactions).
From the original example, I'm supposed to proceed this way:
from SALib.sample import saltelli
from SALib.analyze import sobol
from SALib.test_functions import Ishigami
import numpy as np
problem = {
    'num_vars': 3,
    'names': ['x1', 'x2', 'x3'],
    'bounds': [[-np.pi, np.pi]] * 3
}
# Generate samples
param_values = saltelli.sample(problem, 1000)
# Run model (example)
Y = Ishigami.evaluate(param_values)
# Perform analysis
Si = sobol.analyze(problem, Y, print_to_console=True)
# Returns a dictionary with keys 'S1', 'S1_conf', 'ST', and 'ST_conf'
# (first- and total-order indices with bootstrap confidence intervals)
Because in my case I'm getting data from experiments, I don't have the model linking the Xi and the Yi; I just have an input matrix and an output matrix.
If we assume that my input data were generated from a Latin hypercube (a good space-filling design), how can I use SALib to evaluate the sensitivity of my parameters? From what I see in the code:
Si = sobol.analyze(problem, Y, print_to_console=True)
we only pass the input parameter bounds and the output. With this approach, how can the analysis know which parameter varies between two sample sets?
Thanks for your help!
There is no direct way to compute the Sobol indices with SALib from the data you describe. SALib computes the first- and total-order indices by generating two matrices (A and B) and then using additional values generated by cross-sampling a column from matrix B into matrix A; Saltelli et al. (2010) illustrate this scheme. When the code evaluates the indices, it expects the model output to be in exactly this order. Because this cross-sampling design is not a Latin hypercube, your experimental data will most likely not work.
One possible way to still complete a sensitivity analysis is to build a surrogate (meta) model from your experimental data. In this case you use the experimental data to fit an approximation of your true model, and that approximation can then be analyzed by SALib or another sensitivity package. The surrogate is typically a polynomial or a kriging model; Iooss et al. (2006) describe some methods. Software for this includes UQLab (http://www.uqlab.com/, MATLAB-based) and BASS (https://cran.r-project.org/web/packages/BASS/index.html, an R package), among others, depending on the type of model and fitting technique you want to use. A rough sketch of this route follows.
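For illustration, here is a rough, untested sketch of the surrogate route using a Gaussian-process regressor from sklearn (the synthetic X_exp/Y_exp, the bounds, and the sample size are placeholders for your experimental setup):
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from SALib.sample import saltelli
from SALib.analyze import sobol

rng = np.random.default_rng(0)
X_exp = rng.uniform(0.0, 1.0, size=(200, 3))      # placeholder experimental inputs
Y_exp = np.sin(X_exp[:, 0]) + X_exp[:, 1] ** 2    # placeholder experimental outputs

problem = {
    'num_vars': 3,
    'names': ['x1', 'x2', 'x3'],
    'bounds': [[0.0, 1.0]] * 3,                   # replace with the real input bounds
}

# 1. Fit a surrogate model to the experimental data
surrogate = GaussianProcessRegressor().fit(X_exp, Y_exp)

# 2. Sample the input space in the cross-sampled order SALib expects
param_values = saltelli.sample(problem, 1024)

# 3. Evaluate the surrogate in place of the unknown true model
Y = surrogate.predict(param_values)

# 4. Compute the Sobol indices from the surrogate predictions
Si = sobol.analyze(problem, Y)
print(Si['S1'], Si['ST'])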
Another possibility is to find an estimator that is not based on the Saltelli et al. (2010) scheme. I am not sure such an estimator exists, but it would probably be better to post that question on the Mathematics or Statistics Stack Exchange sites.
References:
Iooss, B., F. Van Dorpe, and N. Devictor (2006). "Response surfaces and sensitivity analyses for an environmental model of dose calculations". Reliability Engineering and System Safety 91:1241-1251.
Saltelli, A., P. Annoni, I. Azzini, F. Campolongo, M. Ratto, and S. Tarantola (2010). "Variance based sensitivity analysis of model output. Design and estimator for the total sensitivity index". Computer Physics Communications 181:259-270.

sklearn likelihood from latent dirichlet allocation

I want to use the latent Dirichlet allocation from sklearn for anomaly detection. I need to obtain the likelihood of new samples, as formally described in the equation here.
How can I get that?
Solution to your problem
You should use the score() method of the model, which returns the log-likelihood of the documents passed in.
Assuming you have created your documents as per the paper and trained an LDA model for each host, you should then take the lowest likelihood over all the training documents and use it as a threshold. Example (untested) code follows:
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Assuming X contains a host's training documents
# and X_unknown contains the test documents
lda = LatentDirichletAllocation()  # ... parameters here ...
lda.fit(X)

# The lowest training log-likelihood serves as the anomaly threshold
threshold = min(lda.score([x]) for x in X)
attacks = [
    i for i, x in enumerate(X_unknown)
    if lda.score([x]) < threshold
]
# attacks now contains the indices of the anomalies
Exactly what you asked
If you want to use the exact equation from the paper you linked, I would advise against trying to do it in scikit-learn, because the expectation-step interface is not clearly exposed.
The parameters θ and φ can be found at lines 112 - 130 of the source as doc_topic_d and norm_phi. The function _update_doc_distribution() returns the doc_topic_distribution and the sufficient statistics, from which you could try to infer θ and φ with the following, again untested, code:
# Normalize the per-document topic counts to get theta
theta = doc_topic_d / doc_topic_d.sum()
# For phi, see the variables exp_doc_topic_d and exp_topic_word_d
# in the source of the function _update_doc_distribution()
phi = np.dot(exp_doc_topic_d, exp_topic_word_d) + EPS
Suggestion for another library
If you want more control over the expectation and maximization steps and the variational parameters, I would suggest you look at LDA++, specifically the EStepInterface (disclaimer: I am one of the authors of LDA++).

How to get feature names while using HashingVectorizer in python?

I want to make a 2D binary array (n_samples, n_features), where each sample is a text string and each feature is a word (unigram).
The problem is that the number of samples is 350,000 and the number of features is 40,000, but my RAM is only 4 GB.
I am getting a memory error when using CountVectorizer. So, is there another way (like mini-batches) to do this?
If I use HashingVectorizer, how do I get the feature names, i.e. which column corresponds to which feature? The get_feature_names() method is not available in HashingVectorizer.
To get feature names for HashingVectorizer you can take a random sample of documents, compute hashes for them, and learn which hash corresponds to which token this way. It is not perfect, because other tokens can map to the same column and there can be collisions, but often this is enough to inspect the vectorization result (or, e.g., the coefficients of a linear classifier that uses hashed features).
A shameless plug: the https://github.com/TeamHG-Memex/eli5 package has this implemented:
from eli5.sklearn import InvertableHashingVectorizer
# vec should be a HashingVectorizer instance
ivec = InvertableHashingVectorizer(vec)
ivec.fit(docs_sample)  # e.g. every 10th or 100th document
names = ivec.get_feature_names()
See also: Debugging Hashing Vectorizer section in eli5 docs.
Mini-batches are not supported by CountVectorizer. However, sklearn's HashingVectorizer is stateless and provides a partial_fit() method, so you can process your data in batches; a sketch follows below.
Quoting the sklearn documentation: "There is no way to compute the inverse transform (from feature indices to string feature names) which can be a problem when trying to introspect which features are most important to a model."
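For completeness, a small untested sketch of the mini-batch route (corpus, n_features, and the batch size are placeholders):
import scipy.sparse as sp
from sklearn.feature_extraction.text import HashingVectorizer

corpus = ["first document", "second document"]           # placeholder texts

vec = HashingVectorizer(n_features=2**16, binary=True)   # stateless: no fit needed

batch_size = 10000
batches = (corpus[i:i + batch_size] for i in range(0, len(corpus), batch_size))

# Transform each mini-batch and stack the sparse results row-wise;
# binary=True gives the 0/1 array asked for in the question
X = sp.vstack([vec.transform(batch) for batch in batches])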
