I am working on a small research project. I am looking to write a program that
a) Takes a large number of short texts (~100 words each, several thousand texts in total)
b) Identifies keywords in the texts
c) Presents all of them to a group of users who indicate whether they find them interesting or not
d) Has the software learn which keywords or combinations of keywords are likely to be preferable. Let's assume that the target group is uniform for this example.
Now, there are two main challenges. The first one I have an answer to, the second one I am looking for help with.
1) Keyword identification.
Reverse frequency analysis seems to be the way to go here: identify those words that occur proportionally more often in a given text than in all the others. This has some drawbacks, though; for example, very common keywords may be overlooked.
2) How to prepare the data set to be numeric. I could map keywords to input neurons and adjust the value based on their relative frequency, but that limits the model and makes it hard to add new keywords. It also quickly becomes computationally expensive if we want to scale beyond a few dozen keywords.
How would this problem commonly be addressed?
Here is a way to start:
clean your input text (remove special tokens etc.)
use n-grams as features (you can start with just 1-grams)
treat the user's feedback "preferable or not" as a binary label
learn a binary classifier (any model is fine: naive Bayes, logistic regression); see the sketch right after this list
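For example, a minimal sketch of this pipeline (the texts and labels are made up for illustration, and logistic regression is just one of the options mentioned above):
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

def clean(text):
    # keep only letters, digits and whitespace
    return re.sub(r"[^\w\s]", " ", text.lower())

texts = ["Great movie about space travel!", "Boring report on tax law."]
liked = [1, 0]  # 1 = user found it interesting, 0 = not

# 1-grams as features; switch to ngram_range=(1, 2) to also add bigrams
vectorizer = CountVectorizer(preprocessor=clean, ngram_range=(1, 1))
X = vectorizer.fit_transform(texts)

clf = LogisticRegression()
clf.fit(X, liked)

# predict the preference for a new, unseen text
print(clf.predict(vectorizer.transform(["A new space movie"])))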
1) Keyword identification. Reverse frequency analysis seems to be the way to go here: identify those words that occur proportionally more often in a given text than in all the others. This has some drawbacks, though; for example, very common keywords may be overlooked.
You can skip this part in the first model you build. Treat the text as a bag of words (n-grams) to simplify the first working model. If you want, you can add this as a feature weight later.
2) How to prepare the data set to be numeric. I could map keywords to input neurons and adjust the value based on their relative frequency, but that limits the model and makes it hard to add new keywords. It also quickly becomes computationally expensive if we want to scale beyond a few dozen keywords.
You can just use a dictionary mapping n-grams to integer ids. For each training example the features will be sparse, so you get training examples like the ones below:
34, 68, 79293, 23232 -> 0 (negative label)
340, 608, 3, 232 -> 1 (positive label)
Imagine you have a dictionary (or vocabulary) mapping:
3: foo
34: movie
68: in-stock
232: bar
340: barz
...
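A minimal sketch of building and using such a dictionary (the ids below are assigned on the fly, so they will not match the illustrative ids above):
vocab = {}  # n-gram -> integer id

def to_ids(tokens):
    # map a list of n-gram strings to integer ids, growing the vocabulary as needed
    ids = []
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)  # assign the next free id
        ids.append(vocab[tok])
    return ids

print(to_ids(["movie", "in-stock", "foo"]))  # [0, 1, 2]
print(to_ids(["foo", "bar"]))                # [2, 3] -- "foo" reuses its id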
To use neural networks, you will need an embedding layer to turn the sparse features into dense features by aggregating (for instance, averaging) the embedding vectors of all features.
Using the same example as above, suppose we use a 4-dimensional embedding:
34 -> [0.1, 0.2, -0.3, 0]
68 -> [0, 0.1, -0.1, 0.2]
79293 -> [0.3, 0.0, 0.12, 0]
23232 -> [0.4, 0.0, 0.0, 0]
------------------------------- sum
sum -> [0.8, 0.3, -0.28, 0.2]
------------------------------- L1-normalize
l1 -> [0.8, 0.3, -0.28, 0.2] ./ (0.8 + 0.3 + 0.28 + 0.2)
-> [0.51,0.19,-0.18,0.13]
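In code, the aggregation step looks like this (a toy numpy sketch with made-up 4-dimensional embedding vectors matching the numbers above):
import numpy as np

embedding = {
    34:    np.array([0.1, 0.2, -0.3, 0.0]),
    68:    np.array([0.0, 0.1, -0.1, 0.2]),
    79293: np.array([0.3, 0.0, 0.12, 0.0]),
    23232: np.array([0.4, 0.0, 0.0, 0.0]),
}
feature_ids = [34, 68, 79293, 23232]
summed = sum(embedding[i] for i in feature_ids)  # [0.8, 0.3, -0.28, 0.2]
l1 = summed / np.abs(summed).sum()               # divide by the L1 norm, 1.58
print(np.round(l1, 2))                           # [ 0.51  0.19 -0.18  0.13]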
At prediction time, you will need to use the same dictionary and the same feature-extraction steps (cleanup / n-gram generation / mapping n-grams to ids) so that your model understands the input.
You can simply use sklearn to learn a TF-IDF bag-of-words model of your texts, which returns a sparse matrix of shape n_samples x n_features, like this:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(smooth_idf=False)
X_train = vectorizer.fit_transform(list_of_texts)
print(X_train.shape)
X_train is a scipy CSR sparse matrix. If your NN implementation doesn't support sparse matrices, you can convert it to a dense numpy matrix, but that might fill your RAM; it is better to use an implementation that supports sparse input (e.g. I know Lasagne/Theano does).
After training, you can use the parameters of the NN to find out which features have a high or low weight and are therefore more or less important for the particular label.
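As a rough illustration of that weight-inspection idea, here is a sketch with a plain linear model (logistic regression) instead of an NN, on made-up toy data; it assumes a recent scikit-learn for get_feature_names_out:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["gpu fast great", "slow boring driver", "great gpu deal", "boring slow queue"]
labels = [1, 0, 1, 0]

vec = TfidfVectorizer()
X = vec.fit_transform(texts)
clf = LogisticRegression().fit(X, labels)

names = vec.get_feature_names_out()
order = np.argsort(clf.coef_[0])  # weights sorted from most negative to most positive
print("most negative features:", names[order[:3]])
print("most positive features:", names[order[-3:]])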
So I've got a simple pytorch example of how to train a ResNet CNN to learn MNIST labeling from this link:
https://zablo.net/blog/post/using-resnet-for-mnist-in-pytorch-tutorial/index.html
It's working great, but I want to hack it a bit so that it does two things. First, instead of predicting digits, it predicts animal shapes/colors for a project I'm working on. That's already working quite well and I'm happy with it.
Second, I'd like to hack the training (and possibly the layers) so that prediction is done in parallel on multiple images at a time. In the MNIST example, prediction (or output) would basically be done for an image that has 10 digits concatenated by me. For clarity, each 10-image input will have the digits 0-9 appearing exactly once each. The key here is that each of the 10 digits gets a unique class/label from the CNN/ResNet and each class gets assigned exactly once, and that digits with high confidence prevent other digits with lower confidence from using that label (a Hungarian-algorithm type of approach).
So in my use case I want to train on concatenated images (not single images) as in Fig A below and force the classifier to learn to predict the best unique label for each of the concatenated images and do this all at once. Such an approach should outperform single image classification - and it's particularly useful for my animal classification because otherwise the CNN can sometimes return the same ID for multiple animals which is impossible in my application.
I can already predict in series as in Fig B below. And indeed, looking at the confidence of each prediction, I am able to implement a Hungarian-algorithm-like approach post-prediction to assign the best (most confident) unique IDs in each batch of 4 animals. But this doesn't always work, and I'm wondering if ResNet can also try to learn the greedy Hungarian assignment.
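Something like this sketch (with a made-up 4 animals x 4 classes confidence matrix) shows the kind of post-prediction assignment I mean:
import numpy as np
from scipy.optimize import linear_sum_assignment

conf = np.array([[0.70, 0.20, 0.05, 0.05],
                 [0.60, 0.30, 0.05, 0.05],   # would also pick class 0 greedily
                 [0.10, 0.10, 0.70, 0.10],
                 [0.10, 0.20, 0.20, 0.50]])

# maximize total confidence under the constraint that each class is used exactly once
rows, cols = linear_sum_assignment(-conf)  # negate because the solver minimizes cost
print(cols)  # one unique class id per image, here [0 1 2 3]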
In particular, it's not clear that simply augmenting the data inputs and labels in the training set will implement A automatically, because I don't know how to penalize or disallow returning the same label twice within each group of images. So for now I can generate these training datasets like this:
print (train_loader.dataset.data.shape)
print (train_loader.dataset.targets.shape)
torch.Size([60000, 28, 28])
torch.Size([60000])
And I guess I would want the targets to be [60000, 10]. And each input image would be [1, 28, 28, 10]? But I'm not sure what the correct approach would be.
Any advice or available links?
I think this is a specific type of training, but I forgot the name.
The question title says it all: How can I make a bag-of-words model smaller? I use a Random Forest and a bag-of-words feature set. My model reaches 30 GB in size and I am sure that most words in the feature set do not contribute to the overall performance.
How to shrink a big bag-of-words model without losing (too much) performance?
Use feature selection. Feature selection removes features from your dataset based on their distribution with regard to your labels, using some scoring function.
Features that rarely occur, or occur randomly with all your labels, for example, are very unlikely to contribute to accurate classification, and get low scores.
Here's an example using sklearn:
from sklearn.feature_selection import SelectPercentile
# Assume some matrix X and labels y
# 10 means only include the 10% best features
selector = SelectPercentile(percentile=10)
# A feature space with only 10% of the features
X_new = selector.fit_transform(X, y)
# See the scores for all features
selector.scores_
As always, be sure to only call fit_transform on your training data. When using dev or test data, only use transform. See here for additional documentation.
Note that there is also SelectKBest, which does the same but lets you specify an absolute number of features to keep instead of a percentage.
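Reusing the hypothetical X and y from above, the SelectKBest variant looks like this (5000 is just an illustrative budget):
from sklearn.feature_selection import SelectKBest

# keep only the 5000 best-scoring features
selector = SelectKBest(k=5000)
X_new = selector.fit_transform(X, y)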
If you don't want to change your model architecture and you are only trying to reduce the memory footprint, a tweak you can make is to reduce the number of terms retained by the CountVectorizer.
From the scikit-learn documentation, there are (at least) three parameters for reducing the vocabulary size:
max_df : float in range [0.0, 1.0] or int, default=1.0
When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.
min_df : float in range [0.0, 1.0] or int, default=1
When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.
max_features : int or None, default=None
If not None, build a vocabulary that only consider the top max_features ordered by term frequency across the corpus.
This parameter is ignored if vocabulary is not None.
As a first step, try playing with max_df and min_df. If the size still doesn't meet your requirements, you can cut it down as much as you like with max_features.
NOTE:
Tuning max_features can drop your classification accuracy by more than the other parameters do. A sketch of all three parameters follows below.
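Here is that sketch on a CountVectorizer (the parameter values and the list_of_texts variable are made up for illustration):
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(
    max_df=0.95,         # drop terms that appear in more than 95% of the documents
    min_df=5,            # drop terms that appear in fewer than 5 documents
    max_features=20000,  # then keep only the 20,000 most frequent remaining terms
)
X = vectorizer.fit_transform(list_of_texts)
print(len(vectorizer.vocabulary_))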
How do I choose the number of the max_features parameter in TfidfVectorizer module? Should I use the maximum number of elements in the data?
The description of the parameter does not give me a clear vision of how to choose the value for it:
max_features : int or None, default=None
If not None, build a vocabulary that only consider the top max_features ordered by term frequency across the corpus.
This parameter is ignored if vocabulary is not None.
This parameter is entirely optional and should be calibrated by reasoning about your data and its structure.
Sometimes it is not effective to transform the whole vocabulary, as the data may contain some exceptionally rare words which, if passed to TfidfVectorizer().fit(), will add unwanted dimensions to future inputs. One appropriate technique in this case would be to print out word frequencies across documents and then set a certain threshold for them. Imagine you have set a threshold of 50, and your data corpus consists of 100 words. After looking at the word frequencies, 20 words occur fewer than 50 times. Thus, you set max_features=80 and you are good to go.
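A sketch of that inspection step, with a made-up toy corpus (count total occurrences, then keep only the words above the threshold):
from collections import Counter

documents = ['gpu processor cpu', 'gpu ram', 'cpu ram jeans']
counts = Counter(word for doc in documents for word in doc.split())

threshold = 2
kept = [word for word, count in counts.items() if count >= threshold]
print(sorted(kept))  # ['cpu', 'gpu', 'ram']
# max_features could then be set to len(kept)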
If max_features is set to None, then the whole corpus is considered during the TF-IDF transformation. Otherwise, if you pass, say, 5 to max_features, that would mean creating a feature matrix out of the 5 most frequent words across the text documents.
Quick example
Assume you work with hardware-related documents. Your raw data is the following:
from sklearn.feature_extraction.text import TfidfVectorizer
data = ['gpu processor cpu performance',
        'gpu performance ram computer',
        'cpu computer ram processor jeans']
You see that the word jeans in the third document is hardly related and occurs only once in the whole dataset. The best way to omit the word, of course, would be to use the stop_words parameter, but imagine there are plenty of such words, or words that are related to the topic but occur scarcely. In the second case, the max_features parameter might help. If you proceed with max_features=None, it will create a 3x7 sparse matrix, while the best-case scenario would be a 3x6 matrix:
tf = TfidfVectorizer(max_features=None).fit(data)
tf.vocabulary_.__len__()  # returns 7, as the corpus contains 7 distinct words
tf.fit_transform(data)    # returns a 3x7 sparse matrix

tf = TfidfVectorizer(max_features=6).fit(data)  # excluding 'jeans'
tf.vocabulary_            # contains every word except 'jeans'
tf.vocabulary_.__len__()  # returns 6
tf.fit_transform(data)    # returns a 3x6 sparse matrix
Newbie to Keras alert!!!
I've got some questions related to Recurrent Layers in Keras (over theano)
How is the input supposed to be formatted with regard to timesteps? Say, for instance, I want a layer with 3 timesteps: 1 in the future, 1 in the past, and 1 current. I see some answers and the API proposing padding and using the embedding layer, or shaping the input using a time window (3 in this case); in any case I can't make heads or tails of the API, and SimpleRNN examples are scarce and don't seem to agree.
How would the input time window formatting work with a masking layer?
Some related answers propose performing masking with an embedding layer. What does masking have to do with embedding layers anyway? Aren't embedding layers basically one-hot word embeddings? (My application would use phonemes or characters as input.)
I can start an answer, but this question is very broad so I would appreciate suggestions on improvement to my answer.
Keras SimpleRNN expects an input of size (num_training_examples, num_timesteps, num_features).
For example, suppose I have sequences of counts of numbers of cars driving by an intersection per hour (small example just to illustrate):
X = np.array([[10, 14, 2, 5], [12, 15, 1, 4], [13, 10, 0, 0]])
Aside: Notice that I was taking observations over four hours, and the last two hours had no cars driving by. That's an example of zero-padding the input, which means making all of the sequences the same length by adding 0s to the end of shorter sequences to match the length of the longest sequence.
Keras would expect the following input shape: (X.shape[0], X.shape[1], 1), which means I could do this:
X_train = np.reshape(X, (X.shape[0], X.shape[1], 1))
And then I could feed that into the RNN:
from keras.models import Sequential
from keras.layers import SimpleRNN
model = Sequential()
model.add(SimpleRNN(units=10, activation='relu', input_shape=(X_train.shape[1], X_train.shape[2])))
You'd add more layers, or add regularization, etc., depending on the nature of your task.
For your specific application, I think you would need to reshape your input so that each row has 3 elements (the previous time step, the current one, and the next); see the sketch below.
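A sketch of that reshaping on a made-up 1-D sequence (each row becomes a window of the previous, current, and next value):
import numpy as np

seq = np.arange(10, dtype=float)  # e.g. one feature observed over 10 time steps
windows = np.array([seq[i - 1:i + 2] for i in range(1, len(seq) - 1)])
X_windows = windows.reshape(windows.shape[0], 3, 1)  # (num_examples, timesteps=3, features=1)
print(X_windows.shape)  # (8, 3, 1)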
I don't know much about the masking layers, but here is a good place to start.
As far as I know, embeddings are independent of masking, but you can mask an embedding.
Hope that provides a good starting point!
I'm using the SVM classifier in the machine learning scikit-learn package for python.
My features are integers. When I call the fit function, I get the user warning "Scaler assumes floating point values as input, got int32"; the SVM still returns its prediction, and I calculate the confusion matrix (I have 2 classes) and the prediction accuracy.
I've tried to get rid of the warning, so I saved the features as floats. The warning did disappear, but I got a completely different confusion matrix and prediction accuracy (surprisingly, much less accurate).
Does someone know why it happens? What is preferable, should I send the features as float or integers?
Thanks!
You should convert them to floats, but how to do it depends on what the integer features actually represent.
What is the meaning of your integers? Are they category membership indicators (for instance: 1 == sport, 2 == business, 3 == media, 4 == people...) or numerical measures with an order relationship (3 is larger than 2, which is in turn larger than 1)? You cannot say that "people" is larger than "media", for instance. It is meaningless, and giving the machine learning algorithm this assumption would confuse it.
Categorical features should hence be transformed by exploding each feature into several boolean features (with value 0.0 or 1.0), one per possible category. Have a look at the DictVectorizer class in scikit-learn to better understand what I mean by categorical features; a short sketch follows below.
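A minimal DictVectorizer sketch, with made-up category names (recent scikit-learn versions expose get_feature_names_out):
from sklearn.feature_extraction import DictVectorizer

rows = [{'topic': 'sport'}, {'topic': 'business'}, {'topic': 'media'}]
vec = DictVectorizer(sparse=False)
X = vec.fit_transform(rows)
print(vec.get_feature_names_out())  # ['topic=business', 'topic=media', 'topic=sport']
print(X)  # one 0.0/1.0 column per category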
If they are numerical values, just convert them to floats and maybe use the Scaler to get them loosely into the range [-1, 1]. If they span several orders of magnitude (e.g. counts of word occurrences), then taking the logarithm of the counts might yield better results; a sketch follows below. More documentation on feature preprocessing and examples can be found in this section of the documentation: http://scikit-learn.org/stable/modules/preprocessing.html
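Here is that sketch on made-up count data, using today's StandardScaler:
import numpy as np
from sklearn.preprocessing import StandardScaler

counts = np.array([[1.0], [10.0], [100.0], [1000.0]])  # spans several orders of magnitude
logged = np.log1p(counts)                              # log(1 + x) keeps zeros finite
scaled = StandardScaler().fit_transform(logged)        # zero mean, unit variance
print(scaled.ravel())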
Edit: also read this guide, which has many more details on feature representation and preprocessing: http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf