How could I identify a sentence disclosing some specific information in a paragraph? [closed] - nlp

For example, I have a paragraph like the one quoted below.
The first sentence of that paragraph (shown in bold and italic in the original post) is what I hope to identify.
The identification goals are:
1. whether the paragraph contains such a disclosure;
2. what the disclosure is.
The possible problems are:
1. The sentence may not be at the beginning of the text string; it could appear anywhere in the given paragraph.
2. The sentence may be worded differently while carrying the same meaning. For example, it could also be expressed as "Sample provided for review" or "They sent me an item for evaluation" or something like that.
So how can I identify such disclosures? Any ideas would be greatly appreciated. Thanks.
The paragraph:
I was sent this Earbuds Audiophile headphones to review. I am just going to copy here the information from the site: "High Definition Stereo Earphones with microphone Equipped with two 9mm high fidelity drivers, unique sound performance, well-balanced bass, mids and trebble. Designed specially for those who enjoy classic music, rock music, pop music, or gaming with superb quality sound. Let COR3 be your in ear sports earbuds. Replaceable Back Caps, inline controller and mic
Extreme flexible tangle free flat TPE cable including inline controller with universal microphone. Play/Pause your music or Answer/Hang up a call with a touch of a button right next to your hands, feature available depending on your device capability. COR3 should be your best gaming earbuds.
Extremely Comfortable
Methods I have tried:
Up to now, my processing has been very naive:
1) I manually labeled 1000 reviews with a binary variable (1 means the review includes disclosure text, 0 otherwise).
2) I collected all the disclosure texts into a corpus, denoted DisclosureCorp.
3) Based on DisclosureCorp, I wrote some basic regular expression rules, like "review.*(evaluation|test|opinion)".
4) I used these summarized rules to label new data.
5) The problem is that the rules may not be complete, since they are just my own subjective summaries. Besides, these patterns may occur not only in the disclosure sentence but also in other parts of the review, which generates a lot of noise (i.e. low precision).
6) I tried classification based on association rules to learn rules from the labeled data, but since the number of keywords is huge, training takes a very long time and often crashes.
7) I also tried comparing the similarity of each review paragraph with DisclosureCorp, but it is difficult to find a threshold for deciding whether a review contains a disclosure.
These are all the approaches I have tried (a rough sketch of the regex matching appears below); could you please give me some hints? Thanks.
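For illustration, the sentence-level matching in steps 3-5 might look roughly like this (the patterns below are illustrative placeholders, not my actual rule set):

import re

# Illustrative disclosure patterns; the real rules would be summarized from DisclosureCorp.
DISCLOSURE_PATTERNS = [
    r"\bsent\b.*\b(review|evaluation|test)\b",
    r"\b(sample|item|product)\b.*\bprovided\b",
    r"\bfor free\b.*\breview\b",
]

def find_disclosure(review):
    """Return (1, matching sentence) if any sentence matches a rule, else (0, None)."""
    # Naive sentence split; a proper tokenizer (e.g. NLTK) would be more robust.
    for sentence in re.split(r"(?<=[.!?])\s+", review):
        for pattern in DISCLOSURE_PATTERNS:
            if re.search(pattern, sentence, flags=re.IGNORECASE):
                return 1, sentence
    return 0, None

print(find_disclosure("I was sent this Earbuds Audiophile headphones to review. "
                      "I am just going to copy here the information from the site."))
# -> (1, 'I was sent this Earbuds Audiophile headphones to review.')

Matching at the sentence level, rather than over the whole paragraph, is one way to reduce the false positives mentioned in step 5.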

Related

How to get started with magnetic stripe cards? [closed]

I am a newbie in the field of magnetic-stripe cards, but without any idea of the structure of these kinds of cards, I can't develop software for them either.
A lot of searching gave me only this information:
These cards have three different tracks, named Track-1, Track-2 and Track-3, in their black bar, and the density of data on each track is different from the others.
The questions that I have:
Is there any difference between a mag-stripe card reader and a writer? Or, like a smart card reader, does the reader do the writing as well?
Can all readers[/writers] read from[/write to] all three tracks by default, with the target track chosen in the program? Or are some readers[/writers] only for Track-1, some for Track-2 and some for Track-3? In other words, does the device need three different heads (is it called a head?) to work with the different tracks, or does a single head serve all three?
Are all three tracks both readable and writable, or are some of them only readable, for example?
Do we need fresh cards to write data, or can we erase an already-used card and write new data to its tracks?
There is a device named an encoder in the list of devices for mag-stripe cards. What is this encoder for? What is the difference between an encoder and a reader or writer?
Why are the density of data and the type of data (alphabetic or numeric) different for different tracks?
Any tool, document, specification, standard, library or tutorial for getting started?
First, you'll want to read up on ISO-7811 and ISO-7812 mag card standards.
Then, you'll need to learn how to wire up a minimal working example (MWE) system. Fortunately, card readers are easy to come by, and you can just wire them up directly to something like an Arduino.
For at least one example, the format for bank cards is:
% "ASCII string on track 1" ?; "ACSII string on track 2" ?; "ASCII
string on track 3" ?
It's just a serial stream that is provided, so "packets" will be different for different types of cards. Since this is just a reader, treat all data as read-only.
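As a rough sketch, splitting such a swipe string into its tracks might look like the following; the sentinel characters used here (% and ; to start a track, ? to end it) are common defaults, but check your reader's documentation, and the card number below is fabricated:

import re

def split_tracks(raw):
    """Split a raw swipe string into whatever tracks it contains."""
    tracks = {}
    t1 = re.search(r"%([^?]*)\?", raw)          # track 1: starts with %, ends with ?
    if t1:
        tracks["track1"] = t1.group(1)
    for i, data in enumerate(re.findall(r";([^?]*)\?", raw), start=2):
        tracks["track%d" % i] = data            # tracks 2 and 3: start with ;, end with ?
    return tracks

# A fabricated example swipe, not a real card.
print(split_tracks('%B1234567890123445^DOE/JOHN^99011200000000000000?;1234567890123445=99011200000000000000?'))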
You can also find some existing code examples for pulling the data off of the card.
You can find "blank" cards on SparkFun as well, but you'll need to put in some more money for a writer setup. Also, all sorts of mag swipe cards have security features these days, including university IDs, credit cards, etc., so portions of the mag stripe are hard to read or are read-only.
If you're planning on doing something shady, these tools won't work, and rightfully so.
If you're planning on making your own security system for a lab or school, these cards are easily cloned and cracked by a clever person.
If you're just trying to have some fun learning a new topic, the above advice will be helpful.
Cheers!

Extracting relationships from an NER parse

I'm working on a problem that at the very least seems to require named entity recognition, but I'm not sure how to go farther than the NER parse. What I'm trying to do is parse information (likely from tweets) regarding scheduling of events. So, for example, I'd like to be able to automatically resolve the yes/no answer to the question of "Are The Beatles playing tomorrow?" from short messages like:
"The Beatles cancelled their show tomorrow" or
"The Beatles' show is still on tomorrow"
I know NER will get me close as it will identify the band of interest and the time (if it's indicated), but there are many ways to express the concepts I'm interested in, for example:
"The Beatles are on for tomorrow" or
"The Beatles won't be playing tomorrow."
How can I go from an NER parsed representation to extracting the information of interest? Any suggestions would be much appreciated.
I suggest you search for work on event detection (optionally, in Twitter), and maybe also on question answering systems, if your example with yes/no questions wasn't just an illustration: if you know user needs in advance, this information may increase the quality of the system.
For a start, there are some papers about event detection in Twitter: here and here.
As a baseline, you can create a list of positive verbs for your domain (to be, to schedule) and negative verbs (to cancel, to delay): start from a manual list and expand it with synonyms from a dictionary such as WordNet. Also check for negations, again by the presence of pre-specified words ('not' in its different forms) in a tweet. Then, if there is a negation, you just invert the meaning.
Since you work with Twitter and most likely there would be just one event mentioned in a tweet, it can work pretty well.
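A rough sketch of that verb-list baseline; the word lists below are illustrative placeholders, not a tuned lexicon:

import re

POSITIVE = {"is", "are", "on", "scheduled", "playing", "confirmed"}   # "event is on" cues
NEGATIVE = {"cancelled", "canceled", "delayed", "postponed"}          # "event is off" cues
NEGATIONS = {"not", "never", "won't", "isn't", "aren't"}

def event_is_on(tweet):
    """Return True/False if the tweet signals the event is on/off, None if unclear."""
    tokens = re.findall(r"[a-z']+", tweet.lower())
    polarity = None
    if any(t in POSITIVE for t in tokens):
        polarity = True
    if any(t in NEGATIVE for t in tokens):
        polarity = False
    if polarity is not None and any(t in NEGATIONS for t in tokens):
        polarity = not polarity          # a negation flips the meaning
    return polarity

print(event_is_on("The Beatles cancelled their show tomorrow"))    # False
print(event_is_on("The Beatles' show is still on tomorrow"))       # True
print(event_is_on("The Beatles won't be playing tomorrow."))       # False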

nlp: alternate spelling identification

Help by editing my question title and tags is greatly appreciated!
Sometimes one participant in my corpus of "conversations" will refer to another participant using a nickname, usually an abbreviation or misspelling, but hereafter I'll just say "nicknames". Let's say I'm willing to manually tell my software whether or not I think various possible nicknames are in fact nicknames, but I want the software to come up with a list of possible matches between the handles that identify people and the potential nicknames. How would I go about doing that?
Background on me and my corpus: I have no experience doing natural language processing, but I'm a competent data analyst with R. My data is produced by 70 teams, each forecasting the likelihood of 100 distinct events occurring some time in the future. The result is that I have 70 x 100 = 7000 text files containing the stream of forecasts participants make and the comments they include with their forecasts. I'll paste a very short snippet of one of these text files below; this one had to do with whether the Malian government would enter talks with the MNLA:
02/12/2013 20:10: past_returns answered Yes: (50%)
I hadn't done a lot of research when I put in my previous
placeholder... I'm bumping up a lot due to DougL's forecast
02/12/2013 19:31: DougL answered Yes: (60%)
Weak President Traore wants talks if MNLA drops territorial claims.
Mali's military may not want talks. France wants talks. MNLA sugggests
it just needs autonomy. But in 7 weeks?
02/12/2013 10:59: past_returns answered No: (75%)
placeholder forecast...
http://www.irinnews.org/Report/97456/What-s-the-way-forward-for-Mali
My initial thoughts: Obviously I can start by providing the names I'm looking to match things up with... in the above example they would be past_returns and DougL (though there is no use of nicknames in the above). I wouldn't think it'd be that hard to get a computer to guess at minor misspellings (though I wouldn't personally know where to start). I can imagine that other tricks could be used, like assuming that a string is more likely to be a nickname if it is used much more by one team than by other teams. A nickname is more likely to refer to someone who spoke recently than someone who spoke long ago, or not at all on this question. And nicknames should be used in sentences in a manner similar to the way the full name/screen name is typically used in the corpus. But I'm interested to hear about simple approaches, as well as ones that consider more sophisticated techniques.
This could get about as complicated as you want to make it. From the semi-linguistic side of things, research topics would include Levenshtein Distance (for detecting minor misspellings of known names/nicknames) and Named Entity Recognition (for the task of detecting names/nicknames in the first place). Actually, NER's worth reading about, but existing systems might not help you much in your domain of forum handles and nicknames.
The first rough idea that comes to mind is that you could run a tokenized version of your corpus against an English dictionary (perhaps a dataset compiled from Wiktionary or something like WordNet) to find words that are candidates for names, then filter those through some heuristics (do they start with the same letters as known full names? Do they have a low Levenshtein distance from known names? Are they used more than once?).
You could also try some clustering or supervised ML algorithms against the non-word tokens. That might reveal some non-"word" tokens that often occur in the same threads as a given username; again, heuristics could help rule out some false positives.
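For example, a rough sketch of the misspelling/abbreviation matching described above, using difflib's similarity ratio from the standard library as a stand-in for Levenshtein distance (the handles and cutoff below are just placeholders):

import difflib

KNOWN_HANDLES = ["past_returns", "DougL"]   # handles active on this question

def nickname_candidates(token, handles=KNOWN_HANDLES, cutoff=0.6):
    """Return (handle, similarity) pairs the token plausibly abbreviates or misspells."""
    token_l = token.lower()
    matches = []
    for handle in handles:
        h = handle.lower()
        prefix = h.startswith(token_l) or token_l.startswith(h)     # abbreviations like "Doug"
        ratio = difflib.SequenceMatcher(None, token_l, h).ratio()   # misspellings like "DugL"
        if prefix or ratio >= cutoff:
            matches.append((handle, round(ratio, 2)))
    return matches

print(nickname_candidates("Doug"))          # [('DougL', 0.89)]
print(nickname_candidates("past_return"))   # [('past_returns', 0.96)]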
Good luck; sounds like a fun problem - hope I mentioned at least one thing you hadn't already thought of.

Text mining - extract name of band from unstructured text [closed]

I'm aware that this is kind of a general, open-ended question. I'm essentially looking for help in deciding a way forward, and perhaps for some reading material.
I'm working on an algorithm that does unstructured text mining, trying to extract something specific: the names of bands (single artists, bands, etc.) from that text. The text itself has no predictable structure, but it is relatively small (one or two lines of text).
Some examples may be (not real events):
Concert Green Day At Wembley Stadium
Extraordinary representation - Norah Jones in Poland - at the Polish Opera
Now, I'm thinking of trying out a classifier, but the text seems too small to provide any real training information for it.
There probably are several other text mining techniques, heuristics or algorithms that may yield good results for this kind of problem (or perhaps no algorithm will).
Because of the structure of your data a pre-trained model will probably perform poorly. Besides, the general organization, location, and person categories will probably not be useful for you.
I don't think the texts themselves are too small; most NER systems work on one sentence at a time. So providing your own training set to an NER library, such as http://nlp.stanford.edu/ner/index.shtml, will probably work well.
If you don't want to create a training set you will need a dictionary with all the bands/artists. Then you obviously can't find unknown bands/artists.
There is a simple NER heuristic that could simplify the task a bit:
take the words that may (or may not) be a named entity and search for them in Google or Yahoo (via an API) twice: as separate words and as an exact phrase (i.e. with quotation marks). Divide the two result counts; a threshold (e.g. < 30) then determines whether the words form a named entity.
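A rough sketch of that hit-count heuristic; hit_count is a stub for whatever web search API you have access to, since the Google/Yahoo endpoints and quotas change over time:

def hit_count(query):
    """Number of search results for `query`; stub for a web search API of your choice."""
    raise NotImplementedError("wire this up to a search API")

def looks_like_named_entity(words, threshold=30.0):
    separate = hit_count(" ".join(words))             # the words anywhere on the page
    phrase = hit_count('"%s"' % " ".join(words))      # the exact phrase only
    if phrase == 0:
        return False
    return separate / phrase < threshold              # the phrase is common enough as a unit

# looks_like_named_entity(["Green", "Day"])  -> expected True once hit_count is implemented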

Best Algorithmic Approach to Sentiment Analysis [closed]

My requirement is to take in news articles and determine whether they are positive or negative about a subject. I am taking the approach outlined below, but I keep reading that NLP may be of use here. Everything I have read points at NLP distinguishing opinion from fact, which I don't think would matter much in my case. I'm wondering two things:
1) Why wouldn't my algorithm work and/or how can I improve it? (I know sarcasm would probably be a pitfall, but again I don't see that occurring much in the type of news we will be getting.)
2) How would NLP help, why should I use it?
My algorithmic approach (I have dictionaries of positive, negative, and negation words):
1) Count number of positive and negative words in article
2) If a negation word is found within 2 or 3 words of the positive or negative word (i.e., NOT the best), negate the score.
3) Multiply the scores by weights that have been manually assigned to each word. (1.0 to start)
4) Add up the totals for positive and negative to get the sentiment score.
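A minimal sketch of steps 1-4; the word lists and weights below are illustrative stand-ins for the hand-built dictionaries:

import re

POSITIVE = {"good": 1.0, "best": 1.0, "gain": 1.0}     # stand-in positive dictionary
NEGATIVE = {"bad": 1.0, "worst": 1.0, "loss": 1.0}     # stand-in negative dictionary
NEGATIONS = {"not", "no", "never"}
WINDOW = 3   # how many words back to look for a negation (step 2)

def sentiment_score(text):
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0.0
    for i, tok in enumerate(tokens):
        weight = POSITIVE.get(tok, 0.0) - NEGATIVE.get(tok, 0.0)    # steps 1 and 3
        if weight == 0.0:
            continue
        if any(t in NEGATIONS for t in tokens[max(0, i - WINDOW):i]):
            weight = -weight                                        # step 2: negation flips the score
        score += weight                                             # step 4: running total
    return score

print(sentiment_score("This is not the best outcome"))      # -1.0
print(sentiment_score("A good quarter with no bad news"))   # 2.0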
I don't think there's anything particularly wrong with your algorithm; it's a fairly straightforward and practical way to go, but there are a lot of situations where it will make mistakes:
1. Ambiguous sentiment words - "This product works terribly" vs. "This product is terribly good"
2. Missed negations - "I would never in a million years say that this product is worth buying"
3. Quoted/indirect text - "My dad says this product is terrible, but I disagree"
4. Comparisons - "This product is about as useful as a hole in the head"
5. Anything subtle - "This product is ugly, slow and uninspiring, but it's the only thing on the market that does the job"
I'm using product reviews for examples instead of news stories, but you get the idea. In fact, news articles are probably harder because they will often try to show both sides of an argument and tend to use a certain style to convey a point. The final example is quite common in opinion pieces, for example.
As far as NLP helping you with any of this, word sense disambiguation (or even just part-of-speech tagging) may help with (1), syntactic parsing might help with the long range dependencies in (2), some kind of chunking might help with (3). It's all research level work though, there's nothing that I know of that you can directly use. Issues (4) and (5) are a lot harder, I throw up my hands and give up at this point.
I'd stick with the approach you have and look at the output carefully to see if it is doing what you want. Of course, that then raises the issue of what you understand the definition of "sentiment" to be in the first place...
My favorite example is "just read the book". It contains no explicit sentiment word, and it depends heavily on the context. If it appears in a movie review, it means the-movie-sucks-it's-a-waste-of-your-time-but-the-book-is-good. However, if it is in a book review, it delivers a positive sentiment.
And what about "this is the smallest [mobile] phone on the market"? Back in the '90s, that was great praise. Today it may indicate that the phone is way too small.
I think this is the place to start in order to get the complexity of sentiment analysis: http://www.cs.cornell.edu/home/llee/opinion-mining-sentiment-analysis-survey.html (by Lillian Lee of Cornell).
You may find the OpinionFinder system and the papers describing it useful.
It is available at http://www.cs.pitt.edu/mpqa/ with other resources for opinion analysis.
It goes beyond polarity classification at the document level and tries to find individual opinions at the sentence level.
I believe the best answer to all of the questions you mentioned is the book "Sentiment Analysis and Opinion Mining" by Professor Bing Liu. This book is the best of its kind in the field of sentiment analysis; it is amazing. Just take a look at it and you will find the answer to all your 'why' and 'how' questions!
Machine-learning techniques are probably better.
Whitelaw, Garg, and Argamon have a technique that achieves 92% accuracy, using a technique similar to yours for dealing with negation, and support vector machines for text classification.
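For reference, a bare-bones SVM text classification baseline (assuming scikit-learn is available) might look like this; it uses plain bag-of-words features rather than Whitelaw et al.'s appraisal-group features, and the training examples are toy placeholders:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy training data standing in for a labeled news corpus.
train_texts = [
    "shares surged after strong earnings",
    "profits beat expectations this quarter",
    "the company faces a damaging lawsuit",
    "sales fell sharply amid weak demand",
]
train_labels = ["positive", "positive", "negative", "negative"]

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
model.fit(train_texts, train_labels)

print(model.predict(["earnings were surprisingly strong"]))   # expect ['positive']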
Why don't you try something similar to how the SpamAssassin spam filter works? There really isn't much difference between intention mining and opinion mining.
