what are the old or current approaches for word alignment for parallel text?
what is the best approach of word alignment for under resourced language or language without a written form?
so far i found about the Levenshtein distance algorithm, and a little bit about the IBM models and some others
if there is any suggestions or links that i can refer to that would be helpful
Without more specifics, it's hard to know what to suggest, but you may want to consider Facebook's Laser or Google's BERT, both have very good sentence embedding models that can be used for alignment tasks.
Related
I am working in NLP Project and I am looking for parser to construct simple Sentences from complex one, written in C# . Since Sentences may have complex grammatical structure with multiple embedded clauses.
Any Help ?
Text summarisation and sentence simplification are very much an open research area. Wikipedia has articles about both, you can start from there. Beware: this is hard problem and chances are the state of the art system is far worse than you might expect. There isn't an off-the-shelf piece of software that you can just grab and solve all your problems. You will have some success with the more basic sentences, but performance will degrade as you complex sentence gets more complex. Have a look at the articles referenced on Wikipedia or google around to get an idea of what is possible. My impression is most readily available software packages are for academic purposes and might take a bit of work to get running.
Are there any tools that analyze the meaning of given sentences? Recommendations are greatly appreciated.
Thanks in advance!
I am also looking for similar tools. One thing I found recently was this sentiment analysis tool built by researchers at Stanford.
It provides a model of analyzing the sentiment of a given sentence. It's interesting and even this seemingly simple idea is quite involved to model in an accurate way. It utilizes machine learning to develop higher accuracy as well. There is a live demo where you can input sentences to analyze.
http://nlp.stanford.edu/sentiment/
I also saw this RelEx semantic dependency relationship extractor.
http://wiki.opencog.org/w/Sentence_algorithms
Some natural language understanding tools can analyze the meaning of sentences, including NLTK and Attempto Controlled English. There are several implementations of discourse representation structures and semantic parsers with a similar purpose.
There are also several parsers that can be used to generate a meaning representation from the text that is being parsed.
Situation:
I wish to perform a Deep-level Analysis of a given text, which would mean:
Ability to extract keywords and assign importance levels based on contextual usage.
Ability to draw conclusions on the mood expressed.
Ability to hint on the education level (word does this a little bit though, but something more automated)
Ability to mix-and match phrases and find out certain communication patterns
Ability to draw substantial meaning out of it, so that it can be quantified and can be processed for answering by a machine.
Question:
What kind of algorithms and techniques need to be employed for this?
Is there a software that can help me in doing this?
When you figure out how to do this please contact DARPA, the CIA, the FBI, and all other U.S. intelligence agencies. Contracts for projects like these are items of current research worth many millions in research grants. ;)
That being said you'll need to process it in layers and analyze at each of those layers. For items 2 and 3 you'll find training an SVM on n-tuples (try, 3) words will help. For 1 and 4 you'll want deeper analysis. Use a tool like NLTK, or one of the many other parsers and find the subject words in sentences and related words. Also use WordNet (from Princeton)
to find the most common senses used and take those as key words.
5 is extremely challenging, I think intelligent use of the data above can give you what you want, but you'll need to use all your grammatical knowledge and programming knowledge, and it will still be very rough grained.
It sounds like you might be open to some experimentation, in which case a toolkit approach might be best? If so, look at the NLTK Natural Language Toolkit for Python. Open source under the Apache license, and there are a couple of excellent books about it (including one from O'Reilly which is also released online under a creative commons license).
We are building a database of scientific papers and performing analysis on the abstracts. The goal is to be able to say "Interest in this topic has gone up 20% from last year". I've already tried key word analysis and haven't really liked the results. So now I am trying to move onto phrases and proximity of words to each other and realize I'm am in over my head. Can anyone point me to a better solution to this, or at very least give me a good term to google to learn more?
The language used is python but I don't think that really affects your answer. Thanks in advance for the help.
It is a big subject, but a good introduction to NLP like this can be found with the NLTK toolkit. This is intended for teaching and works with Python - ie. good for dabbling and experimenting. Also there's a very good open source book (also in paper form from O'Reilly) on the NLTK website.
This is just a guess; not sure if this approach will work. If you're looking at phrases and proximity of words, perhaps you can build up a Markov Chain? That way you can get an idea of the frequency of certain phrases/words in relation to others (based on the order of your Markov Chain).
So you build a Markov Chain and frequency distribution for the year 2009. Then you build another one at the end of 2010 and compare the frequencies (of certain phrases and words). You might have to normalize the text though.
Other than that, something that comes to mind is Natural-Language-Processing techniques (there is a lot of literature surrounding the topic!).
I need your help in determining the best approach for analyzing industry-specific sentences (i.e. movie reviews) for "positive" vs "negative". I've seen libraries such as OpenNLP before, but it's too low-level - it just gives me the basic sentence composition; what I need is a higher-level structure:
- hopefully with wordlists
- hopefully trainable on my set of data
Thanks!
What you are looking for is commonly dubbed Sentiment Analysis. Typically, sentiment analysis is not able to handle delicate subtleties, like sarcasm or irony, but it fares pretty well if you throw a large set of data at it.
Sentiment analysis usually needs quite a bit of pre-processing. At least tokenization, sentence boundary detection and part-of-speech tagging. Sometimes, syntactic parsing can be important. Doing it properly is an entire branch of research in computational linguistics, and I wouldn't advise you with coming up with your own solution unless you take your time to study the field first.
OpenNLP has some tools to aid sentiment analysis, but if you want something more serious, you should look into the LingPipe toolkit. It has some built-in SA-functionality and a nice tutorial. And you can train it on your own set of data, but don't think that it is entirely trivial :-).
Googling for the term will probably also give you some resources to work with. If you have any more specific question, just ask, I'm watching the nlp-tag closely ;-)
Some approaches to sentiment analysis use strategies popular on other text classification tasks. The most common being transforming your film review into a word vector, and feeding it into a classifier algorithm as training data. Most popular data mining packages can help you here. You could have a look at this tutorial on sentiment classification illustrating how to do an experiment using the open source RapidMiner toolkit.
Incidentally, there is a good data set made available for research purposes related to detecting opinion on film reviews. It is based on IMDB user reviews, and you can check many related research work on the area and how they use the data set.
Its worth bearing in mind that the effectiveness of these methods can only be judged from a statistical viewpoint, so you can pretty much assume there will be misclassifications and cases where opinion is hard to detect. As already noticed in this thread, detecting things like irony and sarcasm can be very difficult indeed.