How to deal with a target variable containing nominal data? - nlp

Im working on an NLP project whose target variable contains seven unique sentences which are "inspirational and thought-provoking ", "informative", "acknowledgment and appreciations" and 4 others. As for my understanding, the target variable as we can't establish a quantitative comparison between them. So my question is what is the best way to encode such variables? And if I encode it using one Hot-encoding then the problem will be of multi-class classification?

In classification it does not matter what the class actually represents, the learning algorithm treats every class as categorical anyway. In other words whether the names of the classes are strings, characters or numbers does not change anything to the model. This is why the most common choice is to simply represent the classes as integers: 1,2,3,... For example in scikit this can be done with LabelEncoder.
It would be a bad idea to use one hot encoding because this would make the problem multi-label. This would make the problem much more complex for the model and would very likely lead to lower performance, or it would require much more data in order to reach the same performance as regular classification. This is because there are much more combinations possible in the multi-label problem, and in this case this higher level of complexity is pointless since there can be only one class.

Related

Can we compare word vectors from different models using transfer learning?

I want to train two word2vec/GLoVe models on different corpora and then compare the vectors of a single word. I know that it makes no sense to do so as different models start at different random states, but what if we use pre-trained word vectors as the starting point. Can we assume that the two models will continue to build upon the pre-trained vectors by incorporating the respective domain-specific knowledge, and not go into completely different states?
Tried to find some research papers which discuss this problem, but couldn't find any.
Simply starting your models with pre-trained bectors would eliminate some of the randomness, but with each training epoch on your new corpora:
there's still randomness introduced by negative-sampling (if using that default mode), by frequent-word downsampling (if using default values of the sample parameter in word2vec), and by the interplay of different threads
each epoch with your new corpora will be pulling the word-vectors for present words to new, better positions for that corpora, but leaving original words unmoved. The net movements over many epochs could move words arbitrarily far from where they started, in response to the whole-corpus-effects on all words.
So, doing so wouldn't necessarily achieve your goal in a reliable (or theoretically-defensible) way, though it might kinda-work – at least better than starting from purely random initialization – especially if your corpora are small and you do few training epochs. (That's usually a bad idea – you want big varied training data and enough passes for extra passes to make little incremental difference. But doing those things "wrong" could make your results look "better" in this scenario, where you don't want your training to change the original coordinate-space "too much". I wouldn't rely on such an approach.)
Especially if the words you need to compare are a small subset of the total vocabulary, a couple things you could consider:
combine the corpora into one training corpus, shuffled together, but for those words you need to compare, replace them with corpora-specific tokens. For example, replace 'sugar' with 'sugar_c1' and 'sugar_c2' – leaving the vast majority of surrounding words to be the same tokens (and thus learn a single vector across the whole corpus). Then, the two variant tokens for the "same word" will learn different vectors, based on their differing contexts that still share many of the same tokens.
using some "anchor set" of words that you know (or confidently conjecture) either do mean the same across both contexts, or should mean the same, train two models but learn a transformation between the two space based on those guide words. Then, when you apply that transformation to other words, that weren't used to learn the transformation, they'll land in contrasting positions in each others' spaces, maybe achieving the comparison you need. This is a technique that's been used for language-to-language translation, and there's a helper class and example notebook included with the Python gensim library.
There may be other better approaches, these are just two quick ideas that might work without much change to existing libraries. A project like 'HistWords', which used word-vector training to try to track evolving changes in word-meaning over time, might also have ideas for usable techniques.

How Does the Hashing Trick in Machine Learning Work?

I have a large categorical dataset and a feedforward ANN that I am using for classification purposes. I programmed the machine learning model using Excel VBA (the only programming language I have access too currently).
I have 150 categories in my dataset that I need to process. I have tried using Binary Encoding and One-Hot Encoding, however because of the number of categories I need to process, these vectors are often too large for VBA to handle and I end up with a memory error.
I’d like to give the Hashing trick a go, and see if it works any better. I don't understand how to do this with Excel however.
I have reviewed the following links to try and understand it:
https://learn.microsoft.com/en-us/azure/machine-learning/studio-module-reference/feature-hashing
https://medium.com/value-stream-design/introducing-one-of-the-best-hacks-in-machine-learning-the-hashing-trick-bf6a9c8af18f
https://en.wikipedia.org/wiki/Vowpal_Wabbit
I still don’t completely understand it. Here is what I have done so far. I used the following code example to create a hash sequence for my categorical date:
Generate short hash string based using VBA
Using the code above, I have been able to produce collision free numerical hash sequences. However, what do I do now? Does the hash sequence need to be converted to a binary vector now? This is where I get lost.
I provided a small example of my data thus far. Would somebody be able to show me step by step how the hashing trick works (preferably for Excel)?
'CATEGORY 'HASH SEQUENCE
STEEL 37152
PLASTIC 31081
ALUMINUM 2310
BRONZE 9364
So what the hashing trick does is it prevents ~fake words from taking up extra memory. In a regular Bag-Of-Words (BOW) model, you have 1 dimension per word in the vocabulary. This means that a misspelled word and the regular word can both take up separate dimensions - if you have the misspelled word in the model at all. If the misspelled word is not in the model, (depending on your model) you might ignore it completly. This adds up over time. And by misspelled word, I'm just using an example of any word not in the vocabulary you use to create the vectors to train your model with. Meaning any model trained this way cannot adapt to new vocab without being trained all over again.
The hashing method allows you to incorporate out-of-vocab words, with some potential accuracy loss. It also ensures that you can bound your memory. Essentially the hashing method starts by defining a hash function that takes some input (typically a word) and mapping it to an output value Within an Already Determined Range. You would choose your hash function to output somewhere between say 0-2^16. Thus you know your output vectors will always be capped at size 2^16 (arbitrary value really), so you can prevent memory issues. Further, hash functions have "collisions" - what this means is that hash(a) might equal hash(b) - very rarely with an appropriate output range, but its possible. This means that you lose some accuracy - but since the hash function is theoretically able to take any input string, it can work with out of vocabulary words to get a new vector Of the Same Size as the original vectors used to train the model. Since your new data vector is the Same Size as those used to train the model previously, you can use it to refine your model instead of being forced to train a new model.

Natural Language Generation - how to go beyond templates

We've build a system that analyzes some data and outputs some results in plain English (i.e. no charts etc.). The current implementation relies on lots of templates and some randomization in order to give as much diversity to the text as possible.
We'd like to switch to something more advanced with the hope that the produced text is less repetitive and sounds less robotic. I've searched a lot on google but I cannot find something concrete to start from. Any ideas?
EDIT: The data fed to the NLG mechanism are in JSON format. Here is an example about web analytics data. The json file may contain for example a metric (e.g. visits), it's value in the last X days, whether the last value is expected or not and which dimensions (e.g. countries or marketing channels) affected its change.
The current implementation could give something like this:
Overall visits in the UK mainly from ABC email campaign reached 10K (+20% DoD) and were above the expected value by 10%. Users were mainly landing on XXX page while the increase was consistent across devices.
We're looking to finding a way to depend less on templates, sound even more natural and increase the vocabulary.
What you are looking for is a hot research area and a pretty tough task. Currently there is no way to generate 100% meaningful diverse and natural sentences. one approach to generate sentences is using n-grams. using these method you can generate sentences that look more natural and diverse that may look good but probably meaningless and grammatically incorrect.
A more up to date approach is using Deep learning. anyway if you want to generate meaningful sentences, maybe your best way is using your current template based method.
You can find an introduction to basics of n-gram based NLG here:
Generating Random Text with Bigrams
this tool sounds to implement some of the most famous techniques for natural language generation: simplenlg
Have you tried Neural Networks especially LSTM and GRU architectures? These models are the most recent developments in predicting sequences of words. Generating natural language means to generate a sequence of words such that it makes sense with respect to the input and earlier words in the sequence. This is equivalent to predicting time series. LSTM is designed for predicting time series. Hence, it is commonly used to predict a sequence of words, given an input sequence, an input word, or any other input that can be embedded in a vector.
Deep learning libraries such as Tensorflow, Keras, and Torch all have sequence to sequence implementations that can be used for generating natural language by predicting a sequence of words given an input.
Note that usually these models need a huge amount of training data.
You need to meet two criteria in order to benefit from such models:
You should be able to represent your input as a vector.
You need a relatively large amount of input/target pairs.

Classification with array of strings as input vector

I have a question related to the machine learning task. The problem is to predict a value based on the vector of strings. The most straightforward idea that came to mind was to use linear regression. However, since my input is non-numeric, I thought I'd use hashcode of my strings, but I've read somewhere here that the results will be meaningless. Another idea was to encode my strings in base 26 using the letter positions in the alphabet, but I haven't tested it yet, thus asking for advice.
Could someone recommend a good (meaningful) way of encoding strings so that they can be used in linear regression algorithm? Or suggest another machine learning algorithm suitable for the task.
To summarise: the input to the classifier will consist of a fixed size array of strings (arrays are fixed length, not strings), and the output should be an integer in range 0-100. The training data will consist of a collection of such input arrays (x-values) with corresponding numbers (y-values).
Transform each one of your M strings into an N-dimensional vector using a vector space model like word2vec or GloVe. Then concatenate these vectors to one vector with M*N components. Optionally normalize each component to e.g. 0-1. You should then be able to run any regression (or classification) algorithm on the result, e.g. logistic regression.
You might also try a clustering approach, where you cluster all the words in your vocabulary into N clusters, e.g. with k-means on the word vectors or using brown clustering. You could then represent each word in your input array with a one hot vector (i.e. N-1 zeros and a single one at the index of the cluster of that word). Then concatenate them again and run regression on the result.
I did the similar project with strings. I am suggesting one of the way you can implement it.
In machine learning "naive bayes classifier" will make your problem easy. That works on the probability theory. So if you are working with python there is NLTK(toolkit) and Textblob(library on NLTK), those will help you a lot.
Your question is very generic so I can't describe everything here but just feel free to ask anything you are struggling with, I would be happy to answer them.

Document classification using LSA/SVD

I am trying to do document classification using Support Vector Machines (SVM). The documents I have are collection of emails. I have around 3000 documents to train the SVM classifier and have a test document set of around 700 for which I need classification.
I initially used binary DocumentTermMatrix as the input for SVM training. I got around 81% accuracy for the classification with the test data. DocumentTermMatrix was used after removing several stopwords.
Since I wanted to improve the accuracy of this model, I tried using LSA/SVD based dimensional reduction and use the resulting reduced factors as input to the classification model (I tried with 20, 50, 100 and 200 singular values from the original bag of ~ 3000 words). The performance of the classification worsened in each case. (Another reason for using LSA/SVD was to overcome memory issues with one of the response variable that had 65 levels).
Can someone provide some pointers on how to improve the performance of LSA/SVD classification? I realize this is general question without any specific data or code but would appreciate some inputs from the experts on where to start the debugging.
FYI, I am using R for doing the text preprocessing (packages: tm, snowball,lsa) and building classification models (package: kernelsvm)
Thank you.
Here's some general advice - nothing specific to LSA, but it might help improving the results nonetheless.
'binary documentMatrix' seems to imply your data is represented by binary values, i.e. 1 for a term existing in a document, and 0 for non-existing term; moving to other scoring scheme
(e.g. tf/idf) might lead to better results.
LSA is a good metric for dimensional reduction in some cases, but less so in others. So depending in the exact nature of your data, it might be a good idea to consider additional methods, e.g. Infogain.
If the main incentive for reducing the dimensionality is the one parameter with 65 levels, maybe treating this parameter specifically, e.g. by some form of quantization, would lead to a better tradeoff?
This might not be the best tailored answer. Hope these suggestions may help.
Maybe you could use lemmatization over stemming to reduce unacceptable outcomes.
Short and dense: http://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
The goal of both stemming and lemmatization is to reduce inflectional forms and
sometimes derivationally related forms of a word to a common base form.
However, the two words differ in their flavor. Stemming usually refers to a crude
heuristic process that chops off the ends of words in the hope of achieving this
goal correctly most of the time, and often includes the removal of derivational
affixes. Lemmatization usually refers to doing things properly with the use of a
vocabulary and morphological analysis of words, normally aiming to remove
inflectional endings only and to return the base or dictionary form of a word,
which is known as the lemma.
One instance:
go,goes,going ->Lemma: go,go,go ||Stemming: go, goe, go
And use some predefined set of rules; such that short term words are generalized. For instance:
I'am -> I am
should't -> should not
can't -> can not
How to deal with parentheses inside a sentence.
This is a dog(Its name is doggy)
Text inside parentheses often referred to alias names of the entities mentioned. You can either removed them or do correference analysis and treat it as a new sentence.
Try to use Local LSA, which can improve the classification process compared to Global LSA. In addition, LSA's power depends entirely on its parameters, so try to tweak parameters (start with 1, then 2 or more) and compare results to enhance the performance.

Resources