I'm aware that this is kind of a general, open-ended question. I'm essentially looking for help in deciding a way forward, and perhaps for some reading material.
I'm working on an algorithm that does unstructured text mining, trying to extract something specific: the names of bands (solo artists, groups, etc.) from that text. The text itself has no predictable structure, but it is relatively short (one or two lines of text).
Some examples (not real events) might be:
Concert Green Day At Wembley Stadium
Extraordinary representation - Norah Jones in Poland - at the Polish Opera
Now, I'm thinking of trying out a classifier, but the texts seem too small to provide any real training information for it.
There are probably several other text mining techniques, heuristics, or algorithms that may yield good results for this kind of problem (or perhaps no algorithm will).
Because of the structure of your data, a pre-trained model will probably perform poorly. Besides, the general organization, location, and person categories will probably not be useful to you.
I don't think the texts themselves are too small; most NER systems work on one sentence at a time. So providing your own training set to an NER library will probably work well, e.g. Stanford NER: http://nlp.stanford.edu/ner/index.shtml
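For illustration, tagging with a model you trained yourself might look like this from Python via NLTK (the jar/model paths and the ARTIST label are placeholders for whatever your training setup produces, and a Java runtime is required):

```python
# Sketch: tag event strings with a custom-trained Stanford NER model.
from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize

tagger = StanfordNERTagger(
    "band-ner-model.ser.gz",  # hypothetical model trained with an ARTIST class
    "stanford-ner.jar",       # path to the Stanford NER jar
)

text = "Concert Green Day At Wembley Stadium"
for token, label in tagger.tag(word_tokenize(text)):
    if label == "ARTIST":     # whatever label name you chose during training
        print(token)
```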
If you don't want to create a training set, you will need a dictionary of all the bands/artists. Then, obviously, you can't find unknown bands/artists.
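A minimal sketch of that dictionary (gazetteer) approach, with a placeholder artist list and longest-match-first lookup:

```python
# Gazetteer matching: find known artist names inside an event string.
KNOWN_ARTISTS = {"green day", "norah jones"}  # placeholder dictionary
MAX_NAME_WORDS = max(len(name.split()) for name in KNOWN_ARTISTS)

def find_artists(text: str) -> list[str]:
    words = text.lower().split()
    found, i = [], 0
    while i < len(words):
        # Try the longest span first so "green day" beats a shorter match.
        for size in range(min(MAX_NAME_WORDS, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + size])
            if candidate in KNOWN_ARTISTS:
                found.append(candidate)
                i += size
                break
        else:
            i += 1
    return found

print(find_artists("Concert Green Day At Wembley Stadium"))  # ['green day']
```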
There is a simple NER heuristic that could simplify the task a bit: take the words that may (or may not) form a named entity and search for them on Google or Yahoo (via their APIs) twice: once as separate words and once as an exact phrase (i.e. in quotation marks). Divide the two result counts; a threshold on that ratio (e.g. < 30) determines whether the words form a named entity.
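A sketch of that ratio test; hit_count is a stand-in for whatever search API you have access to (the real Google/Yahoo APIs need keys and client libraries), and the threshold of 30 comes straight from the description above:

```python
# Hit-count ratio heuristic for detecting named entities.
def hit_count(query: str) -> int:
    """Stand-in for a web search API call returning the result count."""
    raise NotImplementedError("call your search API of choice here")

def looks_like_named_entity(words: list[str], threshold: float = 30.0) -> bool:
    separate = hit_count(" ".join(words))            # words anywhere on the page
    phrase = hit_count('"' + " ".join(words) + '"')  # exact quoted phrase
    if phrase == 0:
        return False
    # A real named entity mostly occurs as the fixed phrase,
    # which keeps this ratio small.
    return separate / phrase < threshold

# looks_like_named_entity(["Green", "Day"])  -> likely True
```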
Given:
Confessionalized Optics: The Society of Jesus and Early Modern Optics
Author: Purkaple, Brent
University: University of Oklahoma
Year Published: 2022
Abstract:
This dissertation explores the investigation and explanation of optics
among prominent members of the Society of Jesus during the early
modern period. In doing so it aims to explain why it was that optics
became one of the more important scientific subjects among the members
of the Order. In addition to this it aims to explain how it was that
their identity as members of the Order shaped their explanation of
optics at a time when there was no agreed upon meaning of optics. As
argued, this interaction between Jesuit identity and optical theory
may best be understood as an act of confessionalization. The benefit
of this categorization is that it allows for a complex analysis of
optics among the Society of Jesus which avoids any essential
identification of the relationship between science and religion. This
dissertation, then, not only addresses why optics among the Jesuits
should be understood as confessionalized, but also how the category of
confessionalization may provide a path through the complex dynamics of
early modern science and religion.
I would like to have this (i.e. hundreds of strings in this format) converted into a table with the columns Number, Title, Author, University, Year Published, and Abstract (which spans multiple lines!).
I'm not tied to a specific tool, but I failed to do it with Excel. I think I will need to use a RegEx formula.
It can be done in Power Query. Some abstracts were missing; for those I inserted an extra row manually in Excel.
Here you can watch how to do it in Power Query:
https://www.youtube.com/watch?v=0W_0tvPIOng
With the help of a formula I added new rows. One or two manual changes were needed as well.
Here is a copy of the final file:
https://1drv.ms/x/s!AncAhUkdErOkguU9IADpGxeeEOspPQ?e=KyF0oq
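If you'd rather go the RegEx route the question mentions, here is a minimal Python sketch under the assumption that every record follows the sample layout exactly (file names are placeholders, and records with missing fields, such as the absent abstracts noted above, would need extra handling):

```python
# Parse title/author/university/year/abstract records into a CSV table.
import csv
import re

RECORD = re.compile(
    r"(?P<title>.+?)\n"
    r"Author:\s*(?P<author>.+?)\n"
    r"University:\s*(?P<university>.+?)\n"
    r"Year Published:\s*(?P<year>\d{4})\n"
    r"Abstract:\s*\n"
    r"(?P<abstract>.*?)(?=\n[^\n]+\nAuthor:|\Z)",  # stop before the next record
    re.DOTALL,
)

with open("records.txt", encoding="utf8") as f:
    text = f.read()

with open("records.csv", "w", newline="", encoding="utf8") as out:
    writer = csv.writer(out)
    writer.writerow(["Number", "Title", "Author", "University",
                     "Year Published", "Abstract"])
    for number, m in enumerate(RECORD.finditer(text), start=1):
        abstract = " ".join(m.group("abstract").split())  # unwrap hard line breaks
        writer.writerow([number, m.group("title").strip(), m.group("author"),
                         m.group("university"), m.group("year"), abstract])
```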
Is there some sort of algorithm out there or concept that can help with the following problem?
Say I have two snippets of text, snippet 1 and snippet 2.
Snippet 1 reads as follows:
"The dog was too scared to go out into the storm"
Snippet 2 reads as follows:
"The canine was intimidated to venture into the rainy weather"
Is there a way to compare those snippets using some sort of algorithm, or maybe some sort of string theory system? I want to know if there are any kinds of systems that have solved this problem before I tackle it.
UPDATE:
Okay, to give a more specific example: say I wanted to reduce the number of duplicate bugs in a ticketing system, and I wanted to run some sort of scan to see if there are any related or similar tickets. I want to know the best systematic way of determining the issue based on the body of a ticket. The Levenshtein distance algorithm doesn't work particularly well here, since it wouldn't know the difference between wet and dry.
Is there a way to compare those snippets using some sort of algorithm, or maybe some sort of string theory system? I want to know if there are any kinds of systems that have solved this problem before I tackle it.
Well, this is a very famous problem in NLP; to be more precise, you are comparing the semantics of two sentences.
Maybe you can look into libraries like gensim or WordNet::Similarity, which provide ways to retrieve semantically similar documents.
Here's another semantically similar SO question.
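For instance, a small gensim sketch that ranks existing tickets by TF-IDF cosine similarity against a new one (the ticket texts are made up; plain TF-IDF only catches shared vocabulary, so true "dog"/"canine" matching would need word or sentence embeddings on top):

```python
# Rank known tickets by similarity to a new ticket body.
from gensim import corpora, models, similarities

tickets = [
    "The dog was too scared to go out into the storm",
    "Application crashes when saving a file",
    "App crash on file save with large attachments",
]
texts = [t.lower().split() for t in tickets]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
tfidf = models.TfidfModel(corpus)
index = similarities.MatrixSimilarity(tfidf[corpus], num_features=len(dictionary))

query = "crash while saving files".lower().split()
sims = index[tfidf[dictionary.doc2bow(query)]]
for ticket, score in sorted(zip(tickets, sims), key=lambda p: -p[1]):
    print(f"{score:.2f}  {ticket}")
```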
An option here could be the Levenshtein Distance between two strings.
It is a measure of the number of operations required to get from one string to another. So, the larger the distance, the less similar the two strings.
This kind of algorithm is great for spell checking or voice recognition because the given string and expected string generally only differ by just a couple words/characters.
For your example, the Levenshtein Distance is 32 (you can try this calculator) which indicates that the strings are not very similar (since the strings are not much longer than the distance of 32).
This algorithm is not great for context-sensitive comparisons, but your example is kind of an extreme case. Very likely there would be more words in common, which would result in a smaller Levenshtein distance. You could use this algorithm in conjunction with some other methods (see: What are some algorithms for comparing how similar two strings are?) to try to get a better comparison.
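For reference, the distance itself is easy to compute with the textbook dynamic-programming recurrence:

```python
# Row-by-row dynamic-programming Levenshtein distance.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))  # distances from a[:0] to every prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```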
For example, I have a paragraph like the one below:
The first sentence (bold and italic in the original) is what I hope to identify.
The identification goals are:
1. whether the paragraph contains such a disclosure;
2. what the disclosure is.
The possible problems are:
1. the sentence may not be at the beginning of the text string; it could be anywhere in the given paragraph.
2. the sentence may vary in wording while keeping the same meaning. For example, it could also be expressed as "Sample provided for review" or "They sent me an item for evaluation" or something like this.
So how could I identify such disclosures? Any ideas would be greatly appreciated. Thanks.
The paragraph:
I was sent this Earbuds Audiophile headphones to review. I am just going to copy here the information from the site: "High Definition Stereo Earphones with microphone Equipped with two 9mm high fidelity drivers, unique sound performance, well-balanced bass, mids and trebble. Designed specially for those who enjoy classic music, rock music, pop music, or gaming with superb quality sound. Let COR3 be your in ear sports earbuds. Replaceable Back Caps, inline controller and mic
Extreme flexible tangle free flat TPE cable including inline controller with universal microphone. Play/Pause your music or Answer/Hang up a call with a touch of a button right next to your hands, feature available depending on your device capability. COR3 should be your best gaming earbuds.
Extremely Comfortable
Methods I have tried:
Up to now, my processing has been very naive:
1) I manually labeled 1000 reviews with a binary variable (1 means the review includes disclosure text, 0 otherwise).
2) I collected all the disclosure texts into a corpus, denoted DisclosureCorp.
3) From DisclosureCorp I derived some basic regular-expression rules, like "review.* evaluation|test|opinion".
4) I use these summarized rules to label new data.
5) The problem is that the rules may not be complete, since they are just my own subjective summary. Besides, the rules may match not only the disclosure text but also other parts of the review paragraphs, generating lots of noise (i.e. low precision).
6) I tried classification based on association rules to learn rules from the labeled data, but the number of keywords is huge, training takes a very long time, and it crashed often.
7) I also tried comparing the similarity of each review paragraph with DisclosureCorp, but it's difficult to find a threshold for deciding whether a review paragraph contains a disclosure.
These are all the efforts I have tried; could you please give me some hints? Thanks.
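One direction worth trying, sketched here under the assumption that you keep your 1000 labeled reviews as training data: split each review into sentences and train a standard TF-IDF + logistic-regression classifier at the sentence level instead of maintaining hand-written rules (the two-example training set below is just for illustration):

```python
# Sentence-level disclosure classifier sketch using scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# In practice: sentences split out of the 1000 labeled reviews,
# labeled 1 if the sentence is a disclosure and 0 otherwise.
sentences = ["I was sent this item to review.", "The sound quality is great."]
labels = [1, 0]

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), lowercase=True),
    LogisticRegression(max_iter=1000),
)
model.fit(sentences, labels)

# Any positive sentence means the paragraph contains a disclosure,
# and the sentence itself is the disclosure text.
new_review = "They sent me an earbud set for evaluation. It fits well."
for sentence in new_review.split(". "):
    if model.predict([sentence])[0] == 1:
        print("disclosure:", sentence)
```

Because it scores whole sentences rather than keyword patterns, this directly answers both identification goals: whether the paragraph contains a disclosure and what the disclosure is.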
My requirement is taking in news articles and determining whether they are positive or negative about a subject. I am taking the approach outlined below, but I keep reading that NLP may be of use here. All that I have read has pointed at NLP distinguishing opinion from fact, which I don't think would matter much in my case. I'm wondering two things:
1) Why wouldn't my algorithm work, and/or how can I improve it? (I know sarcasm would probably be a pitfall, but again, I don't see that occurring much in the type of news we will be getting.)
2) How would NLP help, why should I use it?
My algorithmic approach (I have dictionaries of positive, negative, and negation words):
1) Count number of positive and negative words in article
2) If a negation word is found within 2 or 3 words of the positive or negative word (e.g. "NOT the best"), negate the score.
3) Multiply the scores by weights that have been manually assigned to each word. (1.0 to start)
4) Add up the totals for positive and negative to get the sentiment score.
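For concreteness, here is a minimal Python sketch of steps 1-4 (the word sets are abbreviated stand-ins for the real dictionaries, and every weight defaults to 1.0):

```python
# Dictionary-based sentiment scorer with a simple negation window.
POSITIVE = {"good", "best", "great"}     # stand-in positive dictionary
NEGATIVE = {"bad", "worst", "terrible"}  # stand-in negative dictionary
NEGATION = {"not", "never", "no"}
WEIGHTS = {}  # manually assigned per-word weights; default is 1.0

def sentiment_score(text: str, window: int = 3) -> float:
    words = text.lower().split()
    score = 0.0
    for i, word in enumerate(words):
        if word in POSITIVE:
            polarity = 1.0
        elif word in NEGATIVE:
            polarity = -1.0
        else:
            continue
        # Negate the score if a negation word appears within the window.
        if any(w in NEGATION for w in words[max(0, i - window):i]):
            polarity = -polarity
        score += polarity * WEIGHTS.get(word, 1.0)
    return score

print(sentiment_score("not the best product"))  # -1.0
```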
I don't think there's anything particularly wrong with your algorithm; it's a fairly straightforward and practical way to go, but there are a lot of situations where it will make mistakes:
1. Ambiguous sentiment words - "This product works terribly" vs. "This product is terribly good"
2. Missed negations - "I would never in a million years say that this product is worth buying"
3. Quoted/indirect text - "My dad says this product is terrible, but I disagree"
4. Comparisons - "This product is about as useful as a hole in the head"
5. Anything subtle - "This product is ugly, slow and uninspiring, but it's the only thing on the market that does the job"
I'm using product reviews for examples instead of news stories, but you get the idea. In fact, news articles are probably harder because they will often try to show both sides of an argument and tend to use a certain style to convey a point. The final example is quite common in opinion pieces, for example.
As far as NLP helping you with any of this: word sense disambiguation (or even just part-of-speech tagging) may help with (1), syntactic parsing might help with the long-range dependencies in (2), and some kind of chunking might help with (3). It's all research-level work though; there's nothing that I know of that you can directly use. Issues (4) and (5) are a lot harder, and I throw up my hands and give up at this point.
I'd stick with the approach you have and look at the output carefully to see if it is doing what you want. Of course, that then raises the issue of what you understand the definition of "sentiment" to be in the first place...
My favorite example is "just read the book". It contains no explicit sentiment word, and it depends heavily on context. If it appears in a movie review, it means the-movie-sucks-it's-a-waste-of-your-time-but-the-book-is-good. However, if it is in a book review, it delivers a positive sentiment.
And what about "this is the smallest [mobile] phone on the market"? Back in the '90s, that was great praise. Today it may indicate that the phone is way too small.
I think this is the place to start in order to appreciate the complexity of sentiment analysis: http://www.cs.cornell.edu/home/llee/opinion-mining-sentiment-analysis-survey.html (by Lillian Lee of Cornell).
You may find the OpinionFinder system and the papers describing it useful.
It is available at http://www.cs.pitt.edu/mpqa/ with other resources for opinion analysis.
It goes beyond polarity classification at the document level and tries to find individual opinions at the sentence level.
I believe the best answer to all of the questions you mentioned is the book "Sentiment Analysis and Opinion Mining" by Professor Bing Liu. It is the best of its kind in the field of sentiment analysis. Just take a look at it and you will find the answer to all your 'why' and 'how' questions!
Machine-learning techniques are probably better.
Whitelaw, Garg, and Argamon have a technique that achieves 92% accuracy, using an approach similar to yours for dealing with negation, plus support vector machines for text classification.
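For flavor, a generic SVM text classifier of that kind looks like the sketch below (this uses plain TF-IDF features rather than their specific feature set, and the two training examples are made up):

```python
# Minimal SVM sentiment classifier sketch with scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_texts = ["great product, works really well", "terrible, it broke in a day"]
train_labels = ["positive", "negative"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(train_texts, train_labels)
print(clf.predict(["it works surprisingly well"]))
```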
Why don't you try something similar to how the SpamAssassin spam filter works? There's really not much difference between intention mining and opinion mining.
I'm working on a large project for a university assignment, we're developing an application that is used by a business to compile quotes for their various services.
I need to document the algorithms in a way that the client can sign off on, to make sure the way we calculate the prices is correct.
So far I've tried using a large flow chart with decision diamonds, as in information systems modelling, but it's proving to be overkill for even simple algorithms.
Can anybody please suggest some ways to do this? It needs to be as little like software code as possible, and enough for the client to see how we decide what prices are quoted.
Maybe you should then use pseudocode.
Create two documents.
First: The business process model (BPM) that shows the sequence of steps required to be done. This should be annotated with the details for each step.
Second: Create a spreadsheet with each input data item defined, so that the business can see that you understand the field type and the rules for entering each data point. If a calculation step uses a lookup table, that is where you define the input lookup value from the table. So for each step you know where the data is coming from and going to. Your spreadsheet can include a link to the BPM so they can walk through each data point in the BPM and see where it is coming from and going to.
You can prepare screen designs to show the users how your system actually behaves.
Well, the usual way to document algorithms is writing papers.
If your clients have studied business, I'm sure they are familiar with reading formulas.
Would data flow diagrams help? Put pseudocode or math in the bubbles. I've had some success combining data flow models and entity relationship diagrams, but it's non-standard.
What about a Nassi-Shneiderman diagram? It's a diagram style from structured programming; I think it's good for showing decision flows.
http://en.wikipedia.org/wiki/Nassi%E2%80%93Shneiderman_diagram
You could create an algorithm test screen to display and comment on the various steps through the calculations.