I am trying to implement sentence similarity in the NLP domain and want to know how to use the Orange data mining tool effectively.
First, it would be nice if you provided a minimal example with some data and what you are trying to achieve.
I assume you have a text and you wish to inspect how similar the sentences in this text are. First, you will have to split your text into sentences and put them in separate rows (one sentence per row). There's a script for that in orange-scripts; a rough equivalent is sketched below. Then use any of the clustering approaches on a bag-of-words or document-embedding representation. You can find tutorials on YouTube.
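If you'd rather do the splitting outside Orange, here is a minimal sketch using NLTK (assuming NLTK and its tokenizer data are installed); the resulting CSV can then be loaded through Orange's Corpus widget:

```python
# Minimal sketch: split a text into sentences, one per row, with NLTK.
import csv
import nltk

nltk.download("punkt", quiet=True)  # tokenizer data; resource name may vary by NLTK version

text = ("Orange is a data mining tool. It has a Text add-on. "
        "Similar sentences should end up in the same cluster.")

sentences = nltk.sent_tokenize(text)

# Write one sentence per row, ready for Orange's Corpus widget.
with open("sentences.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["sentence"])
    for sentence in sentences:
        writer.writerow([sentence])
```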
Related
I want to classify some texts based on the keywords available for each class. In other words, I have a list of keywords for each category, and I need a heuristic method that uses these keywords to determine the most similar categories for each text. I should add that, in the current phase of the project, I don't want to use a machine-learning-based method for text classification.
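For illustration, here is a minimal sketch of the kind of heuristic I have in mind (the category names and keyword lists are invented):

```python
# Minimal sketch: rank categories by keyword overlap with the text.
categories = {
    "sports": {"match", "team", "score", "league", "player"},
    "finance": {"stock", "market", "bank", "profit", "shares"},
}

def top_categories(text, categories, k=2):
    """Score each category by the fraction of its keywords found in the text."""
    tokens = set(text.lower().split())
    scores = {
        name: len(tokens & keywords) / len(keywords)
        for name, keywords in categories.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]

print(top_categories("the team won the league match", categories))
# [('sports', 0.6), ('finance', 0.0)]
```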
I have a dataset, a CSV with 2 columns: "Text" and "Name".
"Text" column contains the news article.
"Name" column contains the extracted name from the corresponding text.
I have to train a model on this dataset, which contains 4000+ unique news articles. Once the model is trained and validated, a user should be able to pass in any text and have it fetch the proper name.
What technique should I use, and how should I implement it? Please suggest.
Thanks in advance.
It sounds like you are looking to search for an item by keywords. In a basic case you could use a bag-of-words approach, in which you tokenise the words in the "Text" field and index each document accordingly.
The relevance of each document can then be calculated given some measure (for instance cosine similarity).
You can find an example using the gensim library here: https://radimrehurek.com/gensim/tut3.html
It is quite basic; note, however, that it does use LSI.
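The tutorial uses LSI; if you want a plainer bag-of-words baseline first, here is a minimal sketch using scikit-learn instead (the documents are invented for illustration):

```python
# Minimal sketch: index documents with TF-IDF and rank by cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "Apple shares rise after strong earnings",
    "Local team wins the championship match",
    "Bank profits fall amid market turmoil",
]
query = "stock market and bank earnings"

vectorizer = TfidfVectorizer(stop_words="english")
doc_vectors = vectorizer.fit_transform(docs)   # index the documents
query_vector = vectorizer.transform([query])   # vectorise the query

# Rank documents by cosine similarity to the query.
scores = cosine_similarity(query_vector, doc_vectors).ravel()
for i in scores.argsort()[::-1]:
    print(f"{scores[i]:.3f}  {docs[i]}")
```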
I want to extract information from resumes; for this, I have to identify the headings and take the text underneath each heading.
I think you need to be more specific about your issue and the approach you want to take. As of now, for heading extraction, you can first build a corpus of all the headings after reading the document in Beautiful Soup. Once such a corpus is created, you can match it against the headings of the resume and get each section by defining its starting and ending points, and then match skills or whatever else you want to do with it.
This is the simplest approach given your current question; a rough sketch is below. Be more specific so I can suggest a more precise approach.
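For illustration, a rough sketch of that idea, assuming the resume is available as HTML and using an invented heading corpus:

```python
# Minimal sketch: match resume headings against a corpus and grab each section.
from bs4 import BeautifulSoup

HEADING_CORPUS = {"education", "experience", "skills", "projects"}

html = """
<h2>Education</h2><p>BSc Computer Science</p>
<h2>Skills</h2><p>Python, SQL</p>
"""

soup = BeautifulSoup(html, "html.parser")
sections = {}
for heading in soup.find_all(["h1", "h2", "h3"]):
    name = heading.get_text(strip=True).lower()
    if name in HEADING_CORPUS:
        # Collect text until the next heading: that is the section body.
        parts = []
        for sibling in heading.find_next_siblings():
            if sibling.name in ("h1", "h2", "h3"):
                break
            parts.append(sibling.get_text(" ", strip=True))
        sections[name] = " ".join(parts)

print(sections)  # {'education': 'BSc Computer Science', 'skills': 'Python, SQL'}
```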
Best,
I am currently building a chart of the top 10 types of fault. The user keys in what the fault is about, e.g. "light bulb fused". As it is a free-flow text box, the words may not be the same. Is there any way to make Alteryx understand that some words may be the same, allowing me to find the top 10 types of fault? Thank you.
You have a couple of options. You can use the Fuzzy Match tool in the Join category to sort out slight spelling mistakes; you can find Alteryx examples of Fuzzy Match on YouTube.
You can also use the Record ID tool followed by Text To Columns (split to rows on the space character) to get a list of single words.
For what you are trying to do, I would advise building up a bit of a lookup table. You can then use the Find Replace tool to append the category from the lookup depending on the words that are found.
The cleanliness of your data and how distinct each category is will guide how far down the above paths you should go.
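For illustration only, here is the same lookup-plus-fuzzy-matching idea outside Alteryx, as a minimal Python sketch with invented fault categories:

```python
# Minimal sketch: fuzzy-match free-text words against a lookup table.
from difflib import get_close_matches

LOOKUP = {
    "bulb": "Lighting fault",
    "fused": "Lighting fault",
    "leak": "Plumbing fault",
    "socket": "Electrical fault",
}

def categorise(free_text):
    """Split the free text into words and fuzzy-match each against the lookup."""
    found = set()
    for word in free_text.lower().split():
        matches = get_close_matches(word, LOOKUP.keys(), n=1, cutoff=0.7)
        if matches:
            found.add(LOOKUP[matches[0]])
    return found

print(categorise("light blub fused"))  # typo still matches 'bulb' -> {'Lighting fault'}
```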
I want to know the best way to rank sentences by similarity across a set of documents.
For example, let's say:
1. There are 5 documents.
2. Each document contains many sentences.
3. Let's take Document 1 as the primary, i.e. the output will contain sentences from this document.
4. The output should be a list of sentences ranked such that the FIRST-ranked sentence is the most similar across all 5 documents, then the 2nd, then the 3rd...
Thanks in advance.
I'll cover the basics of textual document matching...
Most document similarity measures work on a word basis, rather than sentence structure. The first step is usually stemming. Words are reduced to their root form, so that different forms of similar words, e.g. "swimming" and "swims" match.
Additionally, you may wish to filter the words you match to avoid noise. In particular, you may wish to ignore occurrences of "the" and "a". In fact, there are a lot of conjunctions and pronouns you may wish to omit, so usually you will have a long list of such words - this is called a "stop list".
Furthermore, there may be words you wish to avoid matching altogether, such as swear words or slurs. So you may have another exclusion list with such words in it, a "bad list".
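A minimal sketch of these preprocessing steps (stemming plus stop/bad-list filtering), using NLTK's Porter stemmer:

```python
# Minimal sketch: drop stop/bad words, then reduce the rest to root forms.
from nltk.stem import PorterStemmer

STOP_LIST = {"the", "a", "an", "and", "of", "in", "is", "was", "he", "she", "it"}
BAD_LIST = set()  # add any words you never want to match

stemmer = PorterStemmer()

def preprocess(text):
    """Lowercase, filter out stop/bad words, and stem what remains."""
    words = text.lower().split()
    return [stemmer.stem(w) for w in words
            if w not in STOP_LIST and w not in BAD_LIST]

print(preprocess("He swims in the pool"))        # ['swim', 'pool']
print(preprocess("She was swimming in a pool"))  # ['swim', 'pool'] — the stems match
```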
So now you can count similar words in documents. The question becomes how to measure total document similarity. You need to create a score function that takes the shared words as input and returns a "similarity" value. Such a function should give a high value if the same word appears multiple times in both documents. Additionally, such matches are usually weighted by the inverse of each word's overall frequency, so that matches on uncommon words carry more statistical weight.
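As a minimal sketch, such a score function might sum inverse-document-frequency weights over the shared words (the tiny corpus below is invented):

```python
# Minimal sketch: similarity score that up-weights matches on uncommon words.
import math
from collections import Counter

corpus = [
    ["swim", "pool", "water"],
    ["swim", "race", "win"],
    ["stock", "market", "fall"],
]

def idf(word):
    """Inverse document frequency: rare words get a higher weight."""
    df = sum(1 for doc in corpus if word in doc)
    return math.log(len(corpus) / (1 + df)) + 1

def similarity(doc_a, doc_b):
    """Sum the idf weights of words the two documents share,
    counting repeated matches."""
    counts_a, counts_b = Counter(doc_a), Counter(doc_b)
    return sum(min(counts_a[w], counts_b[w]) * idf(w)
               for w in counts_a.keys() & counts_b.keys())

print(similarity(corpus[0], corpus[1]))  # they share only 'swim', a common word
```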
Apache Lucene is an open-source search engine written in Java that provides practical detail about these steps. For example, here is the information about how they weight query similarity:
http://lucene.apache.org/java/2_9_0/api/all/org/apache/lucene/search/Similarity.html
Lucene combines the Boolean model (BM) of Information Retrieval with the Vector Space Model (VSM) of Information Retrieval - documents "approved" by BM are scored by VSM.
All of this is really just about matching words in documents. You did specify matching sentences, but for most purposes matching words is more useful, as a huge variety of sentence structures can mean essentially the same thing. The most useful similarity information is in the words themselves. I've talked about document matching, but for your purposes, a sentence is just a very small document.
Now, as an aside, if you don't care about the actual nouns and verbs in the sentence and only care about grammar composition, you need a different approach...
First you need a link grammar parser to interpret the language and build a data structure (usually a tree) that represents the sentence. Then you have to perform inexact graph matching. This is a hard problem in general, but there are polynomial-time algorithms for trees.
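As a minimal sketch of the tree-matching step, the third-party zss package implements the Zhang-Shasha tree edit distance; the "parse trees" below are hand-built toys, not real parser output:

```python
# Minimal sketch: inexact tree matching via Zhang-Shasha tree edit distance.
from zss import Node, simple_distance

# "the cat sleeps" vs "the dog sleeps" as tiny mock parse trees
tree_a = Node("S").addkid(
    Node("NP").addkid(Node("the")).addkid(Node("cat"))
).addkid(Node("VP").addkid(Node("sleeps")))

tree_b = Node("S").addkid(
    Node("NP").addkid(Node("the")).addkid(Node("dog"))
).addkid(Node("VP").addkid(Node("sleeps")))

# Minimum number of node insertions/deletions/relabels to turn one into the other.
print(simple_distance(tree_a, tree_b))  # 1 (relabel 'cat' -> 'dog')
```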
As a starting point, you can compute the soundex code for each word and then compare documents based on soundex frequencies.
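For illustration, a minimal sketch using the third-party jellyfish package for soundex (the documents are invented):

```python
# Minimal sketch: compare documents via overlapping soundex code counts.
from collections import Counter
import jellyfish

def soundex_counts(text):
    """Map each word to its soundex code and count code frequencies."""
    return Counter(jellyfish.soundex(w) for w in text.split())

a = soundex_counts("Smith swims in the pool")
b = soundex_counts("Smyth swam in the pool")

# 'Smith' and 'Smyth' share code S530 despite the different spelling.
shared = sum(min(a[c], b[c]) for c in a.keys() & b.keys())
print(shared)
```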
Tim's overview is very nice. I'd just like to add that for your specific use case, you might want to treat the sentences from Doc 1 as documents themselves, and compare their similarity to each of the four remaining documents. This might give you a quick aggregate similarity measure per sentence without forcing you to go down the route of syntax parsing etc.
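A minimal sketch of that with scikit-learn (all texts invented for illustration):

```python
# Minimal sketch: rank each sentence of the primary document by its
# aggregate TF-IDF cosine similarity to the remaining documents.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

primary_sentences = [
    "The league match ended in a draw.",
    "Banks reported record profits this quarter.",
]
other_docs = [
    "Financial markets rallied as bank profits soared.",
    "Quarterly earnings beat expectations across the sector.",
    "The committee will meet next week.",
    "Analysts expect profits to keep rising.",
]

vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(primary_sentences + other_docs)
sent_vecs = matrix[: len(primary_sentences)]
doc_vecs = matrix[len(primary_sentences):]

# Mean similarity of each sentence to the other documents, highest first.
scores = cosine_similarity(sent_vecs, doc_vecs).mean(axis=1)
for i in scores.argsort()[::-1]:
    print(f"{scores[i]:.3f}  {primary_sentences[i]}")
```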