Classification with array of strings as input vector

I have a question related to a machine learning task. The problem is to predict a value based on a vector of strings. The most straightforward idea that came to mind was to use linear regression. However, since my input is non-numeric, I thought I'd use the hashcodes of my strings, but I've read elsewhere here that the results would be meaningless. Another idea was to encode my strings in base 26 using the letter positions in the alphabet, but I haven't tested that yet, so I'm asking for advice first.
Could someone recommend a good (meaningful) way of encoding strings so that they can be used in a linear regression algorithm? Or suggest another machine learning algorithm suitable for the task.
To summarise: the input to the classifier will consist of a fixed-size array of strings (the arrays have fixed length, not the strings), and the output should be an integer in the range 0-100. The training data will consist of a collection of such input arrays (x-values) with corresponding numbers (y-values).

Transform each one of your M strings into an N-dimensional vector using a vector-space model such as word2vec or GloVe. Then concatenate these vectors into one vector with M*N components. Optionally, normalize each component to, e.g., the range 0-1. You should then be able to run any regression (or classification) algorithm on the result, e.g. logistic regression.
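A minimal sketch of this approach with gensim's KeyedVectors and scikit-learn (the vector file path, the toy string arrays, and the target values are just placeholders; plain linear regression is used here simply to keep the toy target numeric):

import numpy as np
from gensim.models import KeyedVectors
from sklearn.linear_model import LinearRegression

# Load pretrained word vectors (path and format are assumptions).
kv = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)

def encode(strings, dim=300):
    # One N-dimensional vector per string, concatenated into an M*N vector;
    # strings missing from the vocabulary fall back to a zero vector.
    parts = [kv[s] if s in kv else np.zeros(dim) for s in strings]
    return np.concatenate(parts)

# X_raw: fixed-length string arrays, y: integers in 0-100 (made-up toy data).
X_raw = [['red', 'apple', 'pie'], ['green', 'pear', 'tart']]
y = [42, 73]
X = np.vstack([encode(row) for row in X_raw])

model = LinearRegression().fit(X, y)
print(model.predict(X))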
You might also try a clustering approach, where you cluster all the words in your vocabulary into N clusters, e.g. with k-means on the word vectors or with Brown clustering. You could then represent each word in your input array as a one-hot vector (i.e. N-1 zeros and a single one at the index of that word's cluster), concatenate these again, and run regression on the result.
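A rough sketch of the clustering variant with scikit-learn's k-means (the vocabulary and the number of clusters are tiny toy values; in practice you would cluster your full vocabulary):

import numpy as np
from gensim.models import KeyedVectors
from sklearn.cluster import KMeans

kv = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)

N_CLUSTERS = 3                                    # toy value; tune for your vocabulary
vocab = ['red', 'green', 'apple', 'pear', 'pie']  # placeholder vocabulary
kmeans = KMeans(n_clusters=N_CLUSTERS, n_init=10).fit(np.vstack([kv[w] for w in vocab]))

def encode_clusters(strings):
    # One-hot encode each word's cluster id, then concatenate the one-hot vectors.
    ids = kmeans.predict(np.vstack([kv[s] for s in strings]))
    return np.concatenate([np.eye(N_CLUSTERS)[i] for i in ids])

print(encode_clusters(['red', 'apple', 'pie']))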

I did a similar project with strings, and I can suggest one way you could implement it.
In machine learning, a naive Bayes classifier would make your problem easy; it works on probability theory. If you are working with Python, NLTK (a toolkit) and TextBlob (a library built on NLTK) will help you a lot.
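As a minimal sketch (the training sentences and labels below are made-up placeholders; TextBlob's NaiveBayesClassifier handles the probability calculations for you):

from textblob.classifiers import NaiveBayesClassifier

# Toy training data: (text, label) pairs - replace with your own strings and values.
train = [
    ("great quality steel frame", "high"),
    ("cheap plastic parts", "low"),
    ("solid aluminum body", "high"),
    ("flimsy and brittle", "low"),
]

cl = NaiveBayesClassifier(train)
print(cl.classify("sturdy steel construction"))   # most likely label
print(cl.accuracy(train))                         # sanity check on the training set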
Your question is very generic, so I can't describe everything here, but feel free to ask about anything you are struggling with and I'd be happy to answer.

Related

How to deal with a target variable containing nominal data?

I'm working on an NLP project whose target variable contains seven unique sentences, which are "inspirational and thought-provoking", "informative", "acknowledgment and appreciations" and 4 others. As far as I understand, the target variable is nominal, since we can't establish a quantitative comparison between the categories. So my question is: what is the best way to encode such a variable? And if I encode it using one-hot encoding, does the problem then become multi-class classification?
In classification it does not matter what the class actually represents; the learning algorithm treats every class as categorical anyway. In other words, whether the names of the classes are strings, characters or numbers changes nothing for the model. This is why the most common choice is simply to represent the classes as integers: 1, 2, 3, ... For example, in scikit-learn this can be done with LabelEncoder.
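A minimal sketch with scikit-learn (the label strings are placeholders taken from the question):

from sklearn.preprocessing import LabelEncoder

labels = ["informative", "inspirational and thought-provoking",
          "acknowledgment and appreciations", "informative"]

le = LabelEncoder()
y = le.fit_transform(labels)   # array([1, 2, 0, 1]) - one integer per unique class
print(le.classes_)             # maps the integers back to the original strings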
It would be a bad idea to use one-hot encoding, because this would turn the problem into multi-label classification. That would make the problem much more complex for the model and would very likely lead to lower performance, or it would require much more data to reach the same performance as regular classification. This is because many more combinations are possible in the multi-label setting, and here this extra complexity is pointless since there can only be one class per instance.

get closest vector from unknown vector with gensim

I am currently implementing a natural text generator for a school project. I have a dataset of sentences of predetermined length and key words, which I convert into vectors with gensim and GoogleNews-vectors-negative300.bin.gz. I train a recurrent neural network to produce a list of vectors that I compare to the list of vectors of the real sentence, so I try to get as close as possible to the "real" vectors.
My problem happens when I have to convert the vectors back into words: my vectors aren't necessarily in the Google set. So I would like to know if there is an efficient way to find the vector in the Google set that is closest to a given output vector.
I work with Python 3 and TensorFlow.
Thanks a lot, feel free to ask any questions about the project
Charles
The gensim method .most_similar() (on KeyedVectors & similar classes) will also accept raw vectors as the 'origin' from which to search.
Just be sure to explicitly name the positive parameter - a list of target words/vectors to combine to find the origin point.
For example:
from gensim.models import KeyedVectors

# GoogleNews vectors ship in binary word2vec format, hence binary=True.
gvecs = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)
target_vec = gvecs['apple']
similars = gvecs.most_similar(positive=[target_vec])

Finding both target and center word2vec matrices

I've read and heard (in Stanford's CS224 course) that the word2vec algorithm actually trains two matrices (that is, two sets of vectors): the U and the V sets, one for words as targets and one for words as context. The final output is the average of these two.
I have two questions in mind. The first is:
Why do we take an average of the two vectors? Why does it make sense? Don't we lose some information?
The second question is: using pre-trained word2vec models, how can I get access to both matrices? Is there any downloadable word2vec release with both sets of vectors? I don't have the resources to train a new one.
Thanks
That relayed description isn't quite right. The word-vectors traditionally retrieved from a word2vec model come from a "projection matrix" which converts individual words to a right-sized input-vector for the shallow neural network.
(You could think of the projection matrix as turning a one-hot encoding into a dense embedding for that word, but libraries typically implement this via a dictionary lookup, e.g.: "what row of the vectors-matrix should I consult for this word-token?")
There's another matrix of weights leading to the model's output nodes, whose interpretation varies based on the training mode. In the common default of negative-sampling, there's one node per known word, so you could also interpret this matrix as having a vector per word. (In hierarchical-softmax mode, the known-words aren't encoded as single output nodes, so it's harder to interpret the relationship of this matrix to individual words.)
However, this second vector per word is rarely made directly available by libraries. Most commonly, the word-vector is considered simply the trained-up input vector, from the projection matrix. For example, the export format from Google's original word2vec.c release only saves-out those vectors, and the large "GoogleNews" vector set they released only has those vectors. (There's no averaging with the other output-side representation.)
Some work, especially that of Mitra et al. of Microsoft Research (in "Dual Embedding Space Models" & associated writeups) has noted those output-side vectors may be of value in some applications as well – but I haven't seen much other work using those vectors. (And, even in that work, they're not averaged with the traditional vectors, but consulted as a separate option for some purposes.)
You'd have to look at the code of whichever libraries you're using to see if you can fetch these from their full post-training model representation. In the Python gensim library, this second matrix in the negative-sampling case is a model property named syn1neg, following the naming of the original word2vec.c.
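For instance, a rough sketch with gensim (assuming gensim 4.x; attribute and parameter names differ slightly in older versions, and the toy corpus is just there to have a trained model):

from gensim.models import Word2Vec

# Tiny toy corpus, trained in the default negative-sampling mode.
sentences = [["the", "quick", "brown", "fox"], ["the", "lazy", "dog"]]
model = Word2Vec(sentences, vector_size=50, negative=5, min_count=1)

input_vecs = model.wv.vectors    # projection-matrix word-vectors (the usual ones)
output_vecs = model.syn1neg      # output-side weights, one row per known word

idx = model.wv.key_to_index["fox"]
print(input_vecs[idx])           # the word-vector normally reported for "fox"
print(output_vecs[idx])          # its output-side counterpart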

How Does the Hashing Trick in Machine Learning Work?

I have a large categorical dataset and a feedforward ANN that I am using for classification purposes. I programmed the machine learning model using Excel VBA (the only programming language I currently have access to).
I have 150 categories in my dataset that I need to process. I have tried Binary Encoding and One-Hot Encoding; however, because of the number of categories, these vectors are often too large for VBA to handle and I end up with a memory error.
I'd like to give the hashing trick a go and see if it works any better, but I don't understand how to do this in Excel.
I have reviewed the following links to try and understand it:
https://learn.microsoft.com/en-us/azure/machine-learning/studio-module-reference/feature-hashing
https://medium.com/value-stream-design/introducing-one-of-the-best-hacks-in-machine-learning-the-hashing-trick-bf6a9c8af18f
https://en.wikipedia.org/wiki/Vowpal_Wabbit
I still don't completely understand it. Here is what I have done so far. I used the following code example to create a hash sequence for my categorical data:
Generate short hash string based using VBA
Using the code above, I have been able to produce collision-free numerical hash sequences. However, what do I do now? Does the hash sequence need to be converted to a binary vector? This is where I get lost.
I have provided a small example of my data so far. Would somebody be able to show me step by step how the hashing trick works (preferably for Excel)?
CATEGORY    HASH SEQUENCE
STEEL       37152
PLASTIC     31081
ALUMINUM    2310
BRONZE      9364
What the hashing trick does is prevent "fake" words from taking up extra memory. In a regular bag-of-words (BOW) model, you have one dimension per word in the vocabulary. This means that a misspelled word and the correctly spelled word can each take up a separate dimension, if the misspelled word is in the model at all. If the misspelled word is not in the model, then (depending on your model) you might ignore it completely. This adds up over time. And by "misspelled word" I just mean any word not in the vocabulary used to create the vectors your model is trained on. In other words, any model trained this way cannot adapt to new vocabulary without being trained all over again.
The hashing method allows you to incorporate out-of-vocabulary words, at the cost of some potential accuracy loss. It also ensures that you can bound your memory. Essentially, the hashing method starts by defining a hash function that takes some input (typically a word) and maps it to an output value within a predetermined range. You might choose your hash function to output values between, say, 0 and 2^16, so you know your output vectors will always be capped at size 2^16 (an arbitrary value, really) and you can prevent memory issues.

Further, hash functions have "collisions": hash(a) might equal hash(b), rarely with an appropriately sized output range, but it is possible. This means you lose some accuracy. However, since the hash function can theoretically take any input string, it can map an out-of-vocabulary word to a vector of the same size as the original vectors used to train the model. Because such a new data vector is the same size as those used to train the model previously, you can use it to refine your model instead of being forced to train a new one.
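To make the steps concrete, here is a minimal sketch of the same idea in Python (the bucket count and the crc32 hash are arbitrary choices; the same steps can be reproduced in VBA with the hash values you already have):

import zlib
import numpy as np

N_BUCKETS = 256   # fixed vector size, chosen up front; more buckets = fewer collisions

def bucket(word):
    # Any deterministic hash works; crc32 is used here purely for illustration.
    return zlib.crc32(word.upper().encode("utf-8")) % N_BUCKETS

def hashed_vector(categories):
    # One fixed-size vector per record: increment the bucket of each category.
    vec = np.zeros(N_BUCKETS)
    for c in categories:
        vec[bucket(c)] += 1.0
    return vec

print(bucket("STEEL"), bucket("PLASTIC"), bucket("ALUMINUM"), bucket("BRONZE"))
print(hashed_vector(["STEEL", "BRONZE"]))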

Variable-length tensors in Theano

This question refers to best practices in Theano. Here is what I am trying to do:
I am building a neural network for an SMT system. In this context, I conceptually represent sentences as variable-length lists of words, and words as fixed-length lists of integers. Ideally, I would like to represent my corpus as a 3D tensor (first dimension = sentences in corpus, second dimension = words in sentence, third dimension = integer features in words). The difficulty is that sentences have variable length and, to my knowledge, tensors in Theano have the strict requirement that all lengths in one dimension must be the same.
Solutions I have thought of include:
Use padding with dummy words so that sentences become equally sized. But this means that whenever I iterate over a sentence, I need to include special code to discard the padding.
Represent the corpus as a vector of matrices. However, this makes it hard to work with certain functions. For instance, if I want to add up the representations of all the words in a sentence, I can't simply use *corpus.sum(axis=1)*. I would have to loop over sentences, do *sentence.sum(axis=0)*, and then gather the results into another tensor.
My question is: which of these alternatives are preferred, or is there a better one?
The first option is probably the best one in most cases. It's what I do, though it does mean passing around a separate vector of sentence lengths and masking certain results to eliminate the padding region when needed.
In general, if you want to perform a consistent operation to all sentences then you'll usually get much better speed applying that operation to a single 3D tensor than sequentially to a series of matrices. This is especially true for operations running on a GPU.
If you're using scan operations, the speed differences become even more pronounced. You'll be better off scanning over a 3D tensor and operating, in your step function, on a per-word matrix that covers all (or a minibatch of) the sentences. Within the step function you may need to know which rows of that matrix are real data and which are padding. As an aside, I find that making the first dimension of a 3D tensor the temporal/sequence dimension helps when using scan, which always scans over the first dimension.
Often, using zero as your padding value will result in the padding having no impact on your operations.
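For example, a minimal sketch of the padding-plus-mask approach (variable names and shapes here are purely illustrative):

import theano
import theano.tensor as T

# corpus: (n_sentences, max_sentence_len, n_features), zero-padded
# mask:   (n_sentences, max_sentence_len), 1.0 for real words, 0.0 for padding
corpus = T.tensor3('corpus')
mask = T.matrix('mask')

# Broadcast the mask over the feature dimension, then sum word representations
# per sentence; zero-padded positions contribute nothing to the sum.
sentence_sums = (corpus * mask.dimshuffle(0, 1, 'x')).sum(axis=1)

f = theano.function([corpus, mask], sentence_sums)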
The other option, looping over the sentences, would mean mixing Theano and Python code, which can make some computations difficult or impossible. For example, computing the gradient of a cost function with respect to some parameters over all (or a batch of) your sentences may not be possible if the data is stored in many separate matrices.
