Can anyone tell me what feature generators are with respect to natural language processors?
If I'm reading this correctly, I believe "feature generation" in this quote is referring to the process of extracting features from your text. Without going into too much detail this is basically getting the dimensions of your data you think would be useful for your prediction/classification task and putting it into a vector representation.
For example, suppose we were trying to create a classifier to determine if an e-mail was spam. We might extract features such as CONTAINS_WORD_NIGERIA or IS_FROM_PERSON_IN_CONTACT_LIST. Or if we were to follow the quote above we might make specialized features using the html tags such as PERCENT_OF_WORDS_IN_HREF_TAG. As you might imagine, you can go overboard when feature engineering, and the real challenge lies is in optimizing your feature set to give you good results on unseen data.
Related
Is there a package or methodology in existence for the detection of flawed logical arguments in text?
I was hoping for something that would work for text that is not written in an academic setting (such as a logic class). It might be a stretch but I would like something that can identify where logic is trying to be used and identify the logical error. A possible use for this would be marking errors in editorial articles.
I don't need anything that is polished. I wouldn't mind working to develop something either so I'm really looking for what's out there in the wild now.
That's a difficult problem, because you'll have to map natural language to some logical representation, and deal with ambiguity in the process.
Attempto Project may be interesting for you. It has several tools that you can try online. In particular, RACE may be doing something you wanted to do. It checks for consistency on the given assertions. But the bigger issue here is in transforming them to logical forms.
For an onology of logical axioms, OpenCyc and the commercial full Cyc ontologies might be worth investigating as well. CycML is used as a language to model the logical assertions, and the Cyc engine is capable of logical inference. The source for OpenCyc can be found in the OpenCyc SourceForge project. The Cyc Wikipedia page also has great information.
Yes, this is a very nasty problem. I would suggest you try to focus in on a narrow domain. For example, if you are looking for logic errors in cancer determination, you have to focus on which type of cancer as well as what are you trying to resolve eg: correct treatment plans, correct observations, correct procedures, correct stage determination, etc. Then you have to find the taxonomy or ontology for that specific cancer, eg: Medline. So for example, you will likely have to focus in on ONLY lung cancer and then only a subset of lung cancer types and only observations indicating lung cancer. Then you will have identify your corpus, knowledge trees, entity relationships and then worry about negation detection, hypotheticals and subject detection. If Healthcare doesn float your boat, I hear another challenging domain for logic errors is the legal/law industry.
i have planned to develop a tool that converts a program written in a programming language (eg: Java) to a common markup language (eg: XML) and that markup code is converted to another language (eg: C#).
in simple words, it is a programming language converter that converts program written in one language to another language.
i think it is possible but i don know where to start. i wanna know the possibilities to do so and information about some existing system.
What you are trying to do is extremely hard, but if you want to know what you are up for I've listed the steps you need to follow below:
First the hard bit:
First you obtain or derive an operational semantics for your source and target languages.
Then you enhance the semantics to capture your source and target memory models.
Then you need to unify the two enhanced-semantics within a common operational model.
Then you need to define a mapping from your source languages onto the common operational model.
Then you need to define a mapping from your operational model to your target language
Step 4, as you pointed out in your question, is trivial.
Step 1 is difficult, as most languages do not have sufficiently formal semantics specified; but I recommend checking out http://lucacardelli.name/TheoryOfObjects.html as this is the best starting point for building a traditional OO semantics.
Step 2 is almost certainly impossible in general, but may be merely obscenely difficult if you are willing to sacrifice some efficiency.
Step 3 will depend on how clean the result of step 1 turned out, but is going to be anything from delicate and tricky to impossible.
Step 5 is not going to be trivial, it is effectively writing a compiler.
Ultimately, what you propose to do is impossible in general, due to the difficulties inherited in steps 1 and 2. However it should be difficult, but doable, if you are willing to: severely restrict the source language constructs supported; pretty much forget handling threads correctly; and pick two languages with sufficiently similar semantics (ie. Java and C# are ok, but C++ and anything-else is not).
It depends on what languages you want to support, but in general this is a huge & difficult task unless you plan to only support a very small subset of each language.
The real problem is that each programming languages has different features (with some areas that overlap and others that don't) and different ways of solving the same problems -- and it's pretty tricky to detect the problem the programmer is trying to solve and convert that to a new idiom. :) And think about the differences between GUIs created in different languages....
See http://xmlvm.org/ as an example (a project aimed at converting between source code of many different languages, with an XML middle-point) -- the site covers in some depth the challenges they are tackling and the compromises they take, and (if you still have any interest in this kind of project...) ask more specific followup questions.
Notice specifically what the output source code looks like -- it's not at all readable, maintainable, efficient, etc..
It is "technically easy" to produce XML for any single langauge: build a parser, construct and abstract syntax tree, and dump out that tree as XML. (I build tools that do this off-the-shelf for many languages). By technically easy, I mean that the community knows how to do this (see any compiler textbook, e.g., Aho&Ullman Dragon book). I do not mean this is a trivial exercise in terms of effort, because real languages are complicated and messy; there have been many attempts to build C++ parsers and few successes. (I have one of the successes, and it was expensive to get right).
What is really hard (and I don't try to do) is produce XML according to a single schema in which the language semantics are exposed. And without that, it will be essentially impossible to write a translator from a generic XML to an arbitrary target language. This is known as the UNCOL problem and people have been looking since 1958 for the answer. I note that the Wikipedia article seems to indicate the problem is solved, but you can't find many references to UNCOL in the literature since 1961.
The closest attempt I've seen to this is the OMG's "ASTM" model (http://www.omg.org/spec/ASTM/1.0/Beta1/); it exports XMI which is XML. But the ASTM model has lots of escapes built into it to allow langauges that it doesn't model perfectly (AFAIK, that means every language) to extend the XMI in arbitrary ways so that the language-specific information can be encoded. Consequently each language parser produces a custom version of the XMI, and thus each reader has to pretty much know about the extensions and full generality vanishes.
I am looking for some resources pertaining to the parsing and understanding of English (or just human language in general). While this is obviously a fairly complicated and wide field of study, I was wondering if anyone had any book or internet recommendations for study of the subject. I am aware of the basics, such as searching for copulas to draw word relationships, but anything you guys recommend I will be sure to thoroughly read.
Thanks.
Check out WordNet.
You probably want a book like "Representation and Inference for Natural Language - A First Course in Computational Semantics"
http://homepages.inf.ed.ac.uk/jbos/comsem/book1.html
Another way is looking at existing tools that already do the job on the basis of research papers: http://nlp.stanford.edu/index.shtml
I've used this tool once, and it's very nice. There's even an online version that lets you parse English and draws dependency trees and so on.
So you can start taking a look at their papers or the code itself.
Anyway take in consideration that in any field, what you get from such generic tools is almost always not what you want. In the sense that the semantics attributed by such tools is not what you would expect. For most cases, given a specific constrained domain it's preferable to roll your own parser, and do your best to avoid any ambiguities beforehand.
The process that you describe is called natural language understanding. There are various algorithms and software tools that have been developed for this purpose.
Is there a research paper/book that I can read which can tell me for the problem at hand what sort of feature selection algorithm would work best.
I am trying to simply identify twitter messages as pos/neg (to begin with). I started out with Frequency based feature selection (having started with NLTK book) but soon realised that for a similar problem various individuals have choosen different algorithms
Although I can try Frequency based, mutual information, information gain and various other algorithms the list seems endless.. and was wondering if there an efficient way then trial and error.
any advice
Have you tried the book I recommended upon your last question? It's freely available online and entirely about the task you are dealing with: Sentiment Analysis and Opinion Mining by Pang and Lee. Chapter 4 ("Extraction and Classification") is just what you need!
I did an NLP course last term, and it came pretty clear that sentiment analysis is something that nobody really knows how to do well (yet). Doing this with unsupervised learning is of course even harder.
There's quite a lot of research going on regarding this, some of it commercial and thus not open to the public. I can't point you to any research papers but the book we used for the course was this (google books preview). That said, the book covers a lot of material and might not be the quickest way to find a solution to this particular problem.
The only other thing I can point you towards is to try googling around, maybe in scholar.google.com for "sentiment analysis" or "opinion mining".
Have a look at the NLTK movie_reviews corpus. The reviews are already pos/neg categorized and might help you with training your classifier. Although the language you find in Twitter is probably very different from those.
As a last note, please post any successes (or failures for that matter) here. This issue will come up later for sure at some point.
Unfortunately, there is no silver bullet for anything when dealing with machine learning. It's usually referred to as the "No Free Lunch" theorem. Basically a number of algorithms work for a problem, and some do better on some problems and worse on others. Over all, they all perform about the same. The same feature set may cause one algorithm to perform better and another to perform worse for a given data set. For a different data set, the situation could be completely reversed.
Usually what I do is pick a few feature selection algorithms that have worked for others on similar tasks and then start with those. If the performance I get using my favorite classifiers is acceptable, scrounging for another half percentage point probably isn't worth my time. But if it's not acceptable, then it's time to re-evaluate my approach, or to look for more feature selection methods.
I need your help in determining the best approach for analyzing industry-specific sentences (i.e. movie reviews) for "positive" vs "negative". I've seen libraries such as OpenNLP before, but it's too low-level - it just gives me the basic sentence composition; what I need is a higher-level structure:
- hopefully with wordlists
- hopefully trainable on my set of data
Thanks!
What you are looking for is commonly dubbed Sentiment Analysis. Typically, sentiment analysis is not able to handle delicate subtleties, like sarcasm or irony, but it fares pretty well if you throw a large set of data at it.
Sentiment analysis usually needs quite a bit of pre-processing. At least tokenization, sentence boundary detection and part-of-speech tagging. Sometimes, syntactic parsing can be important. Doing it properly is an entire branch of research in computational linguistics, and I wouldn't advise you with coming up with your own solution unless you take your time to study the field first.
OpenNLP has some tools to aid sentiment analysis, but if you want something more serious, you should look into the LingPipe toolkit. It has some built-in SA-functionality and a nice tutorial. And you can train it on your own set of data, but don't think that it is entirely trivial :-).
Googling for the term will probably also give you some resources to work with. If you have any more specific question, just ask, I'm watching the nlp-tag closely ;-)
Some approaches to sentiment analysis use strategies popular on other text classification tasks. The most common being transforming your film review into a word vector, and feeding it into a classifier algorithm as training data. Most popular data mining packages can help you here. You could have a look at this tutorial on sentiment classification illustrating how to do an experiment using the open source RapidMiner toolkit.
Incidentally, there is a good data set made available for research purposes related to detecting opinion on film reviews. It is based on IMDB user reviews, and you can check many related research work on the area and how they use the data set.
Its worth bearing in mind that the effectiveness of these methods can only be judged from a statistical viewpoint, so you can pretty much assume there will be misclassifications and cases where opinion is hard to detect. As already noticed in this thread, detecting things like irony and sarcasm can be very difficult indeed.