Get closest text entry of a list from a string - nlp

I am trying to build a RNN model for text classification and I am currently building my dataset.
I am trying to do some of the work automatically and I'm using an API that gets me some information for each text I send to it.
So basically:
For each text in my dataframe, I have a df['label'] that contains a 1 to 3 word string.
I have a vocabulary list (my future classes), and for each df['label'] item I want to assign one item from the vocabulary list, depending on which is closest in meaning.
So I need to measure how close in meaning each label is to the entries in my vocabulary list.
Any help?

Related

Keyword based text classification

I want to classify some texts based on the keywords available for each class. In other words, I have a list of keywords for each category. I need some heuristic methods that use these keywords to determine the most similar categories for each text. I should say that in the current phase of the project, I don't want to use a machine learning-based method for text classification.

Turn list of tuples into binary tensors?

I have a list of tuples in the form below. A given tuple represents a given pair of movies that a given user liked. All tuples, together, capture every combination of movie likes found in my data.
[(movie_a,movie_b),...(movie_a,movie_b)]
My task is to create movie embeddings, similar to word embeddings. The idea is to train a single hidden layer NN to predict the most likely movie which any user might like, given a movie supplied. Much like word embeddings, the task is inconsequential; it's the weight matrix I'm after, which maps movies to vectors.
Reference: https://arxiv.org/vc/arxiv/papers/1603/1603.04259v2.pdf
In total, there are 19,000,000 tuples (training examples.) Likewise, there are 9,000 unique movie IDs in my data. My initial goal was to create an input variable, X where each row represented a unique movie_id, and each column represented a unique observation. In any given column, only one cell would be set to 1, with all other values set to 0.
As an intermediate step, I tried creating a matrix of zeros of the right dimensions
X = np.zeros([9000,19000000])
Understandably, my computer crashed, simply trying to allocate sufficient memory to X.
Is there a memory efficient way to pass my list of values into PyTorch, such that a binary vector is created for every training example?
Likewise, I tried randomly sampling 500,000 observations. But similarly, passing [9000, 500000] to np.zeros() resulted in another crash.
My university has a GPU server available, and that's my next stop. But I'd like to know if there's a memory efficient way that I should be doing this, especially since I'll be using shared resources.
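For a sense of scale, the sizes involved can be worked out directly. The following is only a rough sketch: it assumes NumPy's default float64 dtype (8 bytes per element) for np.zeros, assumes movie IDs have already been mapped to integers in 0..8999, and uses a hypothetical helper name, one_hot_batches, that is not from the question. It keeps the pairs as integer indices and expands to one-hot rows one batch at a time.

```python
# Memory arithmetic for the dense matrices described above.
N_MOVIES = 9_000
N_PAIRS = 19_000_000
FLOAT64_BYTES = 8  # assumption: np.zeros defaults to float64
INT32_BYTES = 4    # assumption: indices stored as 32-bit integers

dense_full_bytes = N_MOVIES * N_PAIRS * FLOAT64_BYTES      # about 1.37 TB
dense_sample_bytes = N_MOVIES * 500_000 * FLOAT64_BYTES    # 36 GB
index_pairs_bytes = N_PAIRS * 2 * INT32_BYTES              # 152 MB


def one_hot_batches(pairs, n_items, batch_size):
    """Yield (inputs, targets) per batch; only one batch is ever dense.

    `pairs` holds (source_index, target_index) tuples of integers.
    """
    for start in range(0, len(pairs), batch_size):
        chunk = pairs[start:start + batch_size]
        inputs = []
        for source, _ in chunk:
            row = [0] * n_items
            row[source] = 1  # single 1 marks the input movie
            inputs.append(row)
        targets = [target for _, target in chunk]
        yield inputs, targets


# Toy data: 3 movies, 3 (input, target) pairs.
toy_pairs = [(0, 1), (2, 0), (1, 2)]
first_inputs, first_targets = next(one_hot_batches(toy_pairs, 3, 2))
print(first_inputs, first_targets)
```

In PyTorch specifically, an embedding layer such as torch.nn.Embedding takes integer indices as input, so the one-hot expansion may not be needed at all.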

Which technique for training should I use for the following dataset?

I have a dataset which is a csv having 2 columns "Text", "Name".
"Text" column contains the news article.
"Name" column contains the extracted name from the corresponding text.
I have to train a model on this dataset, which contains more than 4,000 unique news articles. Once the model is trained and validated, a user should be able to pass any text and the model should fetch the proper name.
What technique should I use, and how should I implement it? Please suggest.
Thanks in advance.
It sounds like you are looking to search for an item by keywords. In a basic case you could use a bag of words approach, in which you tokenise the words in the Text-field and index each document accordingly.
The relevance of each document can then be calculated given some measure (for instance cosine similarity).
You can find an example using the gensim library here: https://radimrehurek.com/gensim/tut3.html
It is quite basic; note, however, that it uses LSI.
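The bag-of-words idea can also be sketched without any library. This is a minimal illustration, not the gensim approach from the link: it uses raw term counts with no LSI, and the function names and sample documents are made up for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(bag_a, bag_b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(bag_a[t] * bag_b[t] for t in bag_a.keys() & bag_b.keys())
    norm_a = math.sqrt(sum(v * v for v in bag_a.values()))
    norm_b = math.sqrt(sum(v * v for v in bag_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(query, documents):
    """Return (document_index, score) pairs, most similar first."""
    query_bag = Counter(tokenize(query))
    scores = [
        (i, cosine_similarity(query_bag, Counter(tokenize(doc))))
        for i, doc in enumerate(documents)
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical stand-ins for rows of the "Text" column.
documents = [
    "Jane Doe was elected mayor of the city on Tuesday",
    "The stock market fell sharply after the announcement",
]
ranking = rank_documents("who was elected mayor", documents)
print(ranking)
```

Raw counts treat every word as equally important; weighting schemes such as TF-IDF are a common refinement if frequent words dominate the scores.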

Polynominal error in Rapidminer when doing n-gram classification

I am trying to classify different concepts in a text using n-grams. My data typically consists of six columns:
1) The word that needs classification
2) The classification
3) First word on the left of 1)
4) Second word on the left of 1)
5) First word on the right of 1)
6) Second word on the right of 1)
When I try to use a SVM in Rapidminer, I get the error that it can not handle polynominal values. I know that this can be done because I have read it in different papers. I set the second column to 'label' and have tried to set the rest to 'text' or 'real', but it seems to have no effect. What am I doing wrong?
You have to use the Support Vector Machine (LibSVM) Operator.
In contrast to the classic SVM which only supports two class problems, the LibSVM implementation (http://www.csie.ntu.edu.tw/~cjlin/papers/libsvm.pdf) supports multi-class classification as well as regression.
One approach could be to create attributes with names equal to the words and values equal to the distance from the word of interest. Of course, all possible words would need to be represented as attributes so the input data would be large.
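Outside RapidMiner, that encoding might look roughly like the sketch below. It is only an illustration under assumptions: the function names are hypothetical, the window of two words on each side mirrors the six-column layout in the question, and 0 is used to mean "word not in the window".

```python
def context_features(tokens, target_index, window=2):
    """Map each word near the target to its signed distance from it.

    Negative offsets are words to the left, positive ones to the right.
    If a word occurs twice in the window, the nearest occurrence wins.
    """
    features = {}
    start = max(0, target_index - window)
    end = min(len(tokens), target_index + window + 1)
    for i in range(start, end):
        if i == target_index:
            continue
        offset = i - target_index
        word = tokens[i]
        if word not in features or abs(offset) < abs(features[word]):
            features[word] = offset
    return features


def to_row(features, vocabulary):
    """Expand the features into one numeric column per vocabulary word."""
    return [features.get(word, 0) for word in vocabulary]


tokens = "the red car drove past the house".split()
features = context_features(tokens, tokens.index("car"))
vocabulary = sorted(set(tokens))
row = to_row(features, vocabulary)
print(features)
print(vocabulary, row)
```

Every row then has one numeric attribute per vocabulary word, which is why the input grows with the vocabulary size.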

Custom Word Tagger

I am new to NLP and am getting to know NLTK, but am having some trouble getting off the ground on something I am trying to accomplish.
I would like to build my own word tagger such that if I pass a string like "The Porsche is red" the function would return ('Porsche','Car', 'red', 'Color').
I already have the dictionaries built that define the categories. I am just struggling with how to get started. Could anyone offer some assistance?
Thanks very much.
UPDATE: The dictionary at this time is a simple two column list in .csv format with the word and its corresponding category.
Example Link: http://www.filedropper.com/carexampledictionary
Sincerely,
Mick
I think a simple lookup in the list might work. First tokenize the text, then iterate through the tokens and look up each token in your lists of categories.
One problem you might have is overlap between the categories. Is there any word which occurs in more than one category list? If so, you'd need a method to disambiguate which category a given token belongs to. If not, a simple list lookup should work.
More precisely, here is what I would do step-by-step:
1) Import the data into a dictionary
2) Tokenize the text
3) For each token, look up whether the token is in the keys of your dictionary
4) Tag the word according to what category it belongs to
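The steps above can be sketched as follows. This is a minimal example under assumptions: the CSV content is a made-up stand-in for the two-column file, the function names are hypothetical, matching is case-insensitive, and a plain regex stands in for a proper tokenizer such as one from NLTK.

```python
import csv
import io
import re

# Hypothetical stand-in for the two-column .csv (word, category).
CSV_TEXT = "Porsche,Car\nFerrari,Car\nred,Color\nblue,Color\n"


def load_categories(csv_file):
    """Read word,category rows into a dict keyed by the lowercased word."""
    return {
        row[0].strip().lower(): row[1].strip()
        for row in csv.reader(csv_file)
        if len(row) >= 2
    }


def tag(text, categories):
    """Return (token, category) pairs for tokens found in the dictionary."""
    tokens = re.findall(r"\w+", text)
    return [
        (token, categories[token.lower()])
        for token in tokens
        if token.lower() in categories
    ]


categories = load_categories(io.StringIO(CSV_TEXT))
tags = tag("The Porsche is red", categories)
print(tags)  # [('Porsche', 'Car'), ('red', 'Color')]
```

With a real file, open(path, newline="") can be passed in place of the io.StringIO object. A plain dict keeps only one category per word, so overlapping categories would need a different structure, for example a dict of lists.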
Hope that helps.
