I want to classify texts using the keywords available for each class. In other words, I have a list of keywords for each category, and I need heuristic methods that use these keywords to determine the most similar categories for each text. In the current phase of the project, I do not want to use a machine learning-based method for text classification.
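One simple heuristic, sketched here under assumptions: score each category by how many of its keywords occur in the text, normalise by text length, and return the top-scoring categories. The category names, keyword lists and the `top_categories` helper are hypothetical and only illustrate the idea.

```python
import re
from collections import Counter

# Hypothetical keyword lists; replace with your own categories.
CATEGORY_KEYWORDS = {
    "sports": {"football", "match", "goal", "team"},
    "finance": {"stock", "market", "bank", "shares"},
    "health": {"doctor", "vaccine", "hospital", "diet"},
}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def top_categories(text, category_keywords=CATEGORY_KEYWORDS, k=2):
    """Return up to k (category, score) pairs, best first.

    Score = keyword occurrences in the text / number of tokens.
    Categories with no matching keyword are dropped.
    """
    tokens = tokenize(text)
    if not tokens:
        return []
    counts = Counter(tokens)
    scores = {}
    for category, keywords in category_keywords.items():
        hits = sum(counts[word] for word in keywords)
        if hits:
            scores[category] = hits / len(tokens)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]

result = top_categories("The team scored a goal in the match while the market fell")
print(result)
```

Exact matching like this misses inflected forms ("goals", "markets"), so stemming the text and the keywords first would probably help.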
Related
I am trying to build an RNN model for text classification and I am currently building my dataset.
I am trying to do some of the work automatically and I'm using an API that gets me some information for each text I send to it.
So basically:
For each text in my dataframe, I have a df['label'] value that contains a one- to three-word string.
I have a vocabulary list (my future classes), and for each df['label'] item I want to assign one item from the vocabulary list, depending on which is closest in meaning.
So I need to measure how close in meaning each label is to the items in my vocabulary list.
Any help?
I am trying to implement sentence similarity in the NLP domain and wanted to know how to use the Orange data mining tool effectively for this.
First, it would be nice if you provided a minimal example with some data and what you are trying to achieve.
I assume you have a text and you wish to inspect how similar sentences in this text are. First, you will have to split your text into sentences and put them in separate rows (one sentence per row). There's a script for that in orange-scripts. Then use any of the clustering approaches, bag of words or document embedding. You can find tutorials on YouTube.
I have a CSV dataset with two columns, "Text" and "Name".
The "Text" column contains a news article.
The "Name" column contains the name extracted from the corresponding text.
I have to train a model on this dataset, which contains more than 4000 unique news articles. Once the model is trained and validated, a user should be able to pass in any text and get back the proper name.
What technique should I use, and how should I implement it? Please suggest.
Thanks in advance.
It sounds like you are looking to search for an item by keywords. In the basic case you could use a bag-of-words approach, in which you tokenise the words in the Text field and index each document accordingly.
The relevance of each document can then be calculated using some measure (for instance, cosine similarity).
You can find an example using the gensim library here: https://radimrehurek.com/gensim/tut3.html
It is quite basic; note, however, that it does use LSI.
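To make the bag-of-words idea concrete, here is a minimal sketch in plain Python rather than gensim, and without LSI. It uses raw word counts as vectors and cosine similarity as the relevance measure. The function names and sample documents are made up for illustration.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Count lowercase alphanumeric tokens in the text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def search(query, documents):
    """Return (index, score) pairs sorted by descending similarity to the query."""
    query_bag = bag_of_words(query)
    scored = [
        (i, cosine_similarity(query_bag, bag_of_words(doc)))
        for i, doc in enumerate(documents)
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

docs = [
    "the cat sat on the mat",
    "stock markets fell sharply",
    "a cat and a dog",
]
for index, score in search("cat on mat", docs):
    print(index, round(score, 3))
```

For a real collection you would build the bags once and keep them in an index instead of recomputing them for every query.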
I am working on an NLP problem: classifying text into four classes.
1. Sports
2. Entertainment
3. Astrology
4. Unknown
I have created a training dataset for Sports, Entertainment and Astrology. But how do I create a training dataset for the "Unknown" category, or how do I classify texts that do not belong to the first three categories into the last one, i.e. "Unknown"?
I would select documents/texts which do not belong to any of the first three classes.
There is an important catch here: the number of such documents is probably going to be very high compared with the number of documents in each of the other classes, so you will probably want to sub-sample the Unknown class (for instance, by randomly choosing a fixed number of documents).
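The sub-sampling step could look like the sketch below. It assumes you want the Unknown class to be no larger than the biggest labelled class, which is only one possible choice. `subsample_unknown` and its parameters are hypothetical names.

```python
import random

def subsample_unknown(unknown_docs, class_sizes, seed=0):
    """Randomly keep as many 'Unknown' documents as the largest labelled class has.

    unknown_docs: list of documents that belong to none of the labelled classes.
    class_sizes:  non-empty list with the number of documents in each labelled class.
    seed:         fixed seed so the sample is reproducible.
    """
    target = min(len(unknown_docs), max(class_sizes))
    rng = random.Random(seed)
    return rng.sample(unknown_docs, target)

unknown = [f"doc{i}" for i in range(1000)]
sample = subsample_unknown(unknown, class_sizes=[120, 95, 110])
print(len(sample))
```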
I want to know the best way to rank sentences based on similarity from a set of documents.
For example, let's say:
1. There are 5 documents.
2. Each document contains many sentences.
3. Let's take Document 1 as the primary document, i.e. the output will contain sentences from this document.
4. The output should be a list of sentences ranked so that the first-ranked sentence is the one most similar across all 5 documents, then the second, then the third, and so on.
Thanks in advance.
I'll cover the basics of textual document matching...
Most document similarity measures work on a word basis, rather than on sentence structure. The first step is usually stemming: words are reduced to their root form, so that different forms of the same word, e.g. "swimming" and "swims", match.
Additionally, you may wish to filter the words you match to avoid noise. In particular, you may wish to ignore occurrences of "the" and "a". In fact, there are many conjunctions and pronouns that you may wish to omit, so usually you will have a long list of such words; this is called a "stop list".
Furthermore, there may be bad words you wish to avoid matching, such as swear words or racial slurs. So you may have another exclusion list with such words in it, a "bad list".
So now you can count similar words in documents. The question becomes how to measure total document similarity. You need to create a score function that takes the matching words as input and gives a "similarity" value. Such a function should give a high value if the same word appears multiple times in both documents. Additionally, matches are usually weighted by how common the word is overall, so that when uncommon words match, they are given more statistical weight.
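As a rough illustration of such a score function (not Lucene's formula), the sketch below drops stop words, counts shared words, and weights each shared word by its inverse document frequency, so that rare words count for more. Stemming is left out to keep it short. The stop list, the `log(N / df)` weighting and all function names are assumptions made for the example.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop list; real ones are much longer.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "in", "on", "is", "it", "to"}

def tokenize(text):
    """Lowercase, split into words, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def inverse_document_frequency(corpus):
    """Map each word to log(N / df), so words found in few documents weigh more."""
    bags = [set(tokenize(doc)) for doc in corpus]
    n = len(bags)
    df = Counter(word for bag in bags for word in bag)
    return {word: math.log(n / count) for word, count in df.items()}

def similarity(doc_a, doc_b, idf):
    """Sum, over shared words, of the shared count times the word's idf weight."""
    a, b = Counter(tokenize(doc_a)), Counter(tokenize(doc_b))
    return sum(min(a[w], b[w]) * idf.get(w, 0.0) for w in a.keys() & b.keys())

corpus = [
    "the swimmer swims in the pool",
    "the pool is closed",
    "the swimmer won a medal",
    "stock prices fell",
]
idf = inverse_document_frequency(corpus)
print(similarity(corpus[0], corpus[1], idf))
print(similarity(corpus[0], corpus[3], idf))
```

Because the shared count is `min(a[w], b[w])`, a word that appears several times in both documents raises the score more than a single match does.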
Apache Lucene is an open-source search engine written in Java that provides practical detail about these steps. For example, here is the information about how they weight query similarity:
http://lucene.apache.org/java/2_9_0/api/all/org/apache/lucene/search/Similarity.html
Lucene combines Boolean model (BM) of Information Retrieval with
Vector Space Model (VSM) of Information Retrieval - documents
"approved" by BM are scored by VSM.
All of this is really just about matching words in documents. You did specify matching sentences. For most people's purposes, matching words is more useful as you can have a huge variety of sentence structures that really mean the same thing. The most useful information of similarity is just in the words. I've talked about document matching, but for your purposes, a sentence is just a very small document.
Now, as an aside, if you don't care about the actual nouns and verbs in the sentence and only care about grammar composition, you need a different approach...
First you need a link grammar parser to interpret the language and build a data structure (usually a tree) that represents the sentence. Then you have to perform inexact graph matching. This is a hard problem, but there are algorithms to do this on trees in polynomial time.
As a starting point, you can compute the Soundex code for each word and then compare documents based on the frequencies of those codes.
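For reference, here is a small sketch of the American Soundex variant together with a per-document frequency profile. The `soundex_profile` helper is a made-up name, and how you compare two profiles (for example with cosine similarity) is left open.

```python
import re
from collections import Counter

# Consonant groups of American Soundex; vowels, y, h and w have no code.
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit

def soundex(word):
    """Return the 4-character Soundex code of a word ('' if it has no letters)."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    previous = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != previous:
            result += code
        previous = code  # a vowel resets the run, so a repeated code is kept
    return (result + "000")[:4]

def soundex_profile(text):
    """Count how often each Soundex code occurs in the text."""
    return Counter(soundex(w) for w in re.findall(r"[A-Za-z]+", text))

print(soundex("Robert"), soundex("Rupert"))
print(soundex_profile("Robert met Rupert"))
```

Note that Soundex was designed for matching surnames, so it groups words by sound rather than by meaning.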
Tim's overview is very nice. I'd just like to add that for your specific use case, you might want to treat the sentences from Doc 1 as documents themselves, and compare their similarity to each of the four remaining documents. This might give you a quick aggregate similarity measure per sentence without forcing you to go down the route of syntax parsing etc.
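A minimal sketch of that idea follows, assuming a naive sentence splitter and a simple word-overlap measure in place of a weighted similarity. Every function name here is hypothetical, and the list of other documents is assumed to be non-empty.

```python
import re

def words(text):
    """Set of distinct lowercase words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def split_sentences(text):
    """Naive splitter on '.', '!' and '?'; it does not handle abbreviations."""
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]

def overlap(sentence_words, document_words):
    """Fraction of the sentence's distinct words that also occur in the document."""
    if not sentence_words:
        return 0.0
    return len(sentence_words & document_words) / len(sentence_words)

def rank_sentences(primary, others):
    """Rank the primary document's sentences by mean overlap with the other documents."""
    other_bags = [words(doc) for doc in others]
    scored = []
    for sentence in split_sentences(primary):
        sentence_words = words(sentence)
        score = sum(overlap(sentence_words, bag) for bag in other_bags) / len(other_bags)
        scored.append((sentence, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

primary = "Cats chase mice. Stocks fell today. Dogs bark loudly."
others = ["cats and mice live here", "mice fear cats", "dogs chase cats"]
for sentence, score in rank_sentences(primary, others):
    print(round(score, 3), sentence)
```

Swapping `overlap` for a frequency-weighted score would bring this closer to what Tim describes.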