I am trying to solve a problem, but the only approach I can think of is collecting training data and building a classifier.
Problem:
The user asks to translate a particular sentence from one language to another. I have the user's speech as text, and need to extract these 3 things from it:
Sentence to be translated.
The language into which it is supposed to be translated.
The origin language.
So, when we humans ask this, it is usually in one of these forms:
What is I love you in French from English?
Can you translate I love you from English to French?
What is French for I love you in English?
And any other possible way that a person can ask for translation.
I need to extract I love you, French (the target language), and English (the source language) from the sentence.
The first thing that came to my mind was to use regular expressions. But I found that they can only reliably detect the languages, not the part of the sentence to be translated.
The other possible solution seems to be using the various forms of the request as a training data set and training a classifier, but I still feel that this NLP problem can be solved with some algorithm, though I cannot come up with one.
This seems to be a popular problem, so is there any way it can be done?
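For reference, here is how far a purely pattern-based sketch gets with the three templates above. Named groups can capture the sentence as well as the languages, but every new phrasing needs its own pattern, which is why this does not scale (the patterns below are invented to cover these three examples only):

```python
import re

# One hand-written pattern per phrasing; far from exhaustive.
PATTERNS = [
    # "What is <text> in <target> from <source>?"
    re.compile(r"what is (?P<text>.+?) in (?P<target>\w+) from (?P<source>\w+)\??$", re.I),
    # "Can you translate <text> from <source> to <target>?"
    re.compile(r"translate (?P<text>.+?) from (?P<source>\w+) to (?P<target>\w+)\??$", re.I),
    # "What is <target> for <text> in <source>?"
    re.compile(r"what is (?P<target>\w+) for (?P<text>.+?) in (?P<source>\w+)\??$", re.I),
]

def parse_request(utterance):
    """Return (text, source_language, target_language), or None if no pattern matches."""
    for pattern in PATTERNS:
        match = pattern.search(utterance)
        if match:
            return match.group("text"), match.group("source"), match.group("target")
    return None
```

All three example requests come out as ("I love you", "English", "French"); anything outside the templates falls through to None, which is where a trained model would have to take over.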
I am looking for some way to determine if textual input takes the form of a valid sentence; I would like to provide a warning to the user if not. Examples of input I would like to warn the user about:
"dog hat can ah!"
"slkj ds dsak"
It seems like this is a difficult problem, since grammars are usually derived from treebanks, and the words in the provided input might not appear in the grammar. It also seems like parsers may assume that the textual input is composed of valid English words to begin with (just my brief takeaway from playing around with Stanford NLP's GUI tool). My questions are as follows:
Is there some tool available to scan through text input and determine if it is made up of valid English words, or at least offer a probability on that? If not, I can write this, just wondering if it already exists. I figure this would be step 1 before determining grammatical correctness.
My understanding is that determining whether a sentence is grammatically correct is done simply by attempting to parse the sentence and seeing whether the parse succeeds. Is that accurate? Are there probabilistic parsers that offer a degree of confidence when ambiguity is encountered? (e.g., a proper noun not recognized)
I hesitate to ask this last question, since I saw it was asked on SO over a decade ago, but any updates as to whether there is a basic, readily available grammar for NLTK? I know English isn't simple, but I am truly just looking to parse relatively simple, single sentence input.
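To make question 1 concrete, this is roughly the step-1 check I would write myself, with a toy vocabulary standing in for a real word list (e.g. /usr/share/dict/words or the NLTK `words` corpus). Note that my first bad example passes it, since all its tokens are real words, which is exactly why the grammar check in question 2 is still needed:

```python
# Toy vocabulary; a real check would load a full word list,
# e.g. /usr/share/dict/words or the NLTK `words` corpus.
VOCAB = {"the", "dog", "hat", "can", "ah", "went", "to", "park"}

def valid_word_ratio(text):
    """Fraction of whitespace-separated tokens found in the vocabulary."""
    tokens = [t.strip(".,!?\"'").lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return 0.0
    return sum(t in VOCAB for t in tokens) / len(tokens)
```

Here valid_word_ratio("slkj ds dsak") is 0.0, while valid_word_ratio("dog hat can ah!") is 1.0 despite not being a sentence.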
Thanks!
A starting point is the classification models trained on the Corpus of Linguistic Acceptability (CoLA) task. There are several recent blog articles on how to fine-tune the BERT models from HuggingFace (Python) for this task. Here is one such blog article. You can also find models already fine-tuned on CoLA for various BERT flavors in the HuggingFace model zoo.
I hope you can help me :).
I am working for a translation company.
As you know, every translation consists of splitting the original text into small segments and then re-joining them into the final product.
In other words, the segments are considered as "translation units".
Often, especially for large documents, translators make linguistic consistency errors; let me explain with an example.
In Spanish, you can use "tú" or "usted", depending on the context, and this choice determines the formal or informal tone of the sentence.
So, if you consider these two sentences of a document:
¿Lara, te has lavado las manos? (TÚ)
¿Lara, usted se lavó las manos? (USTED)
They are BOTH correct, but if you consider the whole document, there is a linguistic inconsistency.
I am studying NLP basics in my spare time, and I am trying to figure out how to create a tool that performs a linguistic consistency analysis on a set of sentences.
I am looking in particular at Stanford CoreNLP (I prefer Java to Python).
I guess that I need some linguistic tools to perform verb analysis first of all. Ideally, the tool should work with different languages (EN, IT, ES, FR, PT).
Can anyone help me figure out how to start?
Any help would be appreciated,
thanks in advance!
I'm not sure about Stanford CoreNLP, but if you're considering it as an option, you could build your own tagger and add modifiers at the POS-tagging stage, then use this as a translation feature.
In other words, instead of just tagging a word as a verb, you could tag it as "a verb in the second person".
There are already good pre-tagged corpora out there for Spanish that can help you do exactly that. For example, in the Universal Dependencies AnCora corpus you can find annotations for the Person of a verb.
With a little tweaking, you could make a composite PoS tag such as "Verb-1st-Person" and train a tagger on it.
I've written an article about how to do it in Python, but I bet you can do it in Java using Weka. You can read the article here.
After this, I guess the next step is to make sure the person of one "translation unit" matches that of the others, or to chain these checks in a pipeline.
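To make the Person idea concrete, here is a sketch in Python (the language of the article mentioned). The rows below are in CoNLL-U column order and merely imitate the style of UD Spanish annotations; they are invented for illustration, not copied from AnCora:

```python
# Toy rows in CoNLL-U column order (ID, FORM, LEMMA, UPOS, XPOS, FEATS,
# HEAD, DEPREL, DEPS, MISC). The analyses are illustrative only.
ROWS = [
    ["1", "Lara", "Lara", "PROPN", "_", "_", "3", "nsubj", "_", "_"],
    ["2", "te", "tú", "PRON", "_", "Person=2|PronType=Prs", "3", "obj", "_", "_"],
    ["3", "has", "haber", "AUX", "_", "Mood=Ind|Number=Sing|Person=2|Tense=Pres", "4", "aux", "_", "_"],
    ["4", "lavado", "lavar", "VERB", "_", "Tense=Past|VerbForm=Part", "0", "root", "_", "_"],
]

def composite_tags(rows):
    """Build composite tags like 'AUX-Person=2' from the FEATS column."""
    tags = []
    for cols in rows:
        form, upos, feats = cols[1], cols[3], cols[5]
        # FEATS is a |-separated list of Feature=Value pairs (or "_").
        person = dict(f.split("=", 1) for f in feats.split("|") if "=" in f).get("Person")
        tags.append((form, upos + "-Person=" + person if person else upos))
    return tags
```

A tagger trained on tags built this way would let you flag a document that mixes Person=2 (tú) and Person=3 (usted) verb forms.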
I am looking into extracting the meaning of expressions used in everyday speech. For instance, it is apparent to a human that the sentence The meal we had at restaurant A tasted like food at my granny's. means that the food was tasty.
How can I extract this meaning using a tool or a technique?
The method I've found so far is to first extract phrases using Stanford CoreNLP POS tagging, and then use a Word Sense Induction tool to derive the meaning of each phrase. However, since WSI tools are meant to disambiguate words with multiple meanings, I am not sure they are the best tool to use.
What would be the best method to extract the meanings? Or is there any tool that can both identify phrases and extract their meanings?
Any help is much appreciated. Thanks in advance.
The problem you pose is a difficult one. You could use tools from Sentiment Analysis to get the gist of a sentence's emotional message. There are more sophisticated approaches which attempt to extract what quality is assigned to what object in the sentence (this you can get from POS-tagged sentences plus some hand-crafted Information Extraction rules).
However, you may also want to explore paraphrasing the colloquial language into more formal language and looking for those phrases. For that you would need a good (exhaustive) dictionary of common expressions to start with (slang dictionaries are sometimes available, though I am not aware of one for English right now). You could then map the colloquial expressions to more formal ones, which are likely to be captured by some embedding space (frequently used in Sentiment Analysis).
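As a toy illustration of combining the two suggestions: a polarity lexicon plus a paraphrase rule for the colloquial expression. Both resources below are invented stand-ins; in practice you would plug in a real sentiment lexicon and a real dictionary of common expressions:

```python
# Invented stand-in for a real polarity lexicon (e.g. from a Sentiment
# Analysis toolkit).
LEXICON = {"tasty": 1, "delicious": 1, "great": 1,
           "awful": -1, "bland": -1, "terrible": -1}

# Invented hand-crafted rule mapping a colloquial phrase to plainer language.
PARAPHRASES = {"tasted like food at my granny's": "was tasty"}

def polarity(sentence):
    """Paraphrase colloquialisms, then sum word polarities."""
    s = sentence.lower()
    for phrase, plain in PARAPHRASES.items():
        s = s.replace(phrase, plain)
    return sum(LEXICON.get(token.strip(".,!"), 0) for token in s.split())
```

The restaurant example from the question scores positive only because the paraphrase rule fires first, which is the point: the sentiment lexicon alone never sees the word "tasty".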
I have a set of sentences which I have generated based on painting analysis. However, I need to test how natural they sound. Is there any API or application which does this?
I am using the Stanford Parser to give me a breakdown, but this doesn't exactly do the job I want!
Also, can one test how similar sentences are? I am randomly generating parts of sentences and want to check the variety of the sentences produced.
A lot of NLP works using things called "language models".
A language model is something that can take in some text and return a probability. This probability should typically be indicative of how "likely" the given text is.
You typically build a language model by taking a large chunk of text (which we call the "training corpus") and computing some statistics out of it (which represent your "model"), and then using those statistics to take in new, previously unseen sentences and returning probabilities for them.
You should probably google "language models", "unigram models", and "n-gram models" and read a few of the results to find an article or presentation that helps you understand the previous sentence. (It's hard for me to recommend a specific tutorial because I don't know your background.)
Anyway, one way to think about language models is as systems that take in new text and tell you how similar it is to the training corpus the model was built from. Suppose you build two language models, one from all the plays written by Shakespeare and another from a large number of legal documents. The second should assign a much higher probability than the first to a newly released legal document, while the first should assign a much higher probability to some other old English play (written by some other author), because that play is probably more similar to Shakespeare (in terms of the kinds of words used, sentence lengths, grammar, etc.) than it is to modern legal language.
All the things the Stanford parser gives you back for a sentence are generated using language models. One way to think about how those features are built is to pretend that the computer tried every possible combination of tags and every possible parse tree for your sentence, used some clever language model to identify the most probable tag sequence and the most probable parse tree, and returned those to you.
Getting back to your problem, you need to build a language model out of what you consider natural sounding text and then use that language model to evaluate the sentences you want to measure the naturalness of. To do this, you will have to identify a good training corpus and decide on what type of language model you want to build.
If you can't think of anything better, a collection of Wikipedia articles might serve as a good training corpus representing what natural-sounding English looks like.
As for the model type, an "n-gram model" would probably be good enough for your task. More complicated models like Hidden Markov Models and PCFGs (the machinery powering the Stanford page you linked to) would definitely do even better, but n-grams are the simplest thing you could start with.
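To make the n-gram suggestion concrete, here is a minimal add-alpha-smoothed bigram model. The toy corpus below is a stand-in; in practice you would train on something like the Wikipedia collection suggested above:

```python
import math
from collections import Counter

def train_bigram_model(sentences):
    """Count context unigrams and bigrams over whitespace-tokenized
    sentences, with <s> and </s> boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        toks = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(toks[:-1])            # counts of each token as a context
        bigrams.update(zip(toks[:-1], toks[1:]))
    return unigrams, bigrams

def sentence_logprob(sentence, unigrams, bigrams, vocab_size, alpha=1.0):
    """Add-alpha smoothed bigram log-probability: higher means the
    sentence looks more like the training corpus."""
    toks = ["<s>"] + sentence.lower().split() + ["</s>"]
    logprob = 0.0
    for prev, cur in zip(toks[:-1], toks[1:]):
        p = (bigrams[(prev, cur)] + alpha) / (unigrams[prev] + alpha * vocab_size)
        logprob += math.log(p)
    return logprob
```

With this in hand, you can score each generated sentence against a model trained on natural text and flag the low scorers; the same per-word log-probabilities also give a rough similarity signal between generated variants.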
I need the most exhaustive English word list I can find for several types of language-processing operations, but I could not find anything on the internet of good enough quality.
I have read that there are 1,000,000 words in the English language, including foreign and/or technical words.
Can you please suggest such a source (or close to 500k words) that can be downloaded from the internet that is maybe a bit categorized? What input do you use for your language processing applications?
Kevin's wordlists are the best resource I know for plain lists of words.
WordNet is better if you want to know about words being nouns, verbs, etc., along with synonyms and so on.
"The 'million word' hoax rolls along", I see ;-)
How to make your word lists longer: given a noun, add any of the following to it: non-, pseudo-, semi-, -arific, -geek, ...; mutatis mutandis for verbs etc.
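Tongue-in-cheek as that is, the recipe is mechanical enough to script (affix list taken from the line above):

```python
PREFIXES = ["non-", "pseudo-", "semi-"]
SUFFIXES = ["-arific", "-geek"]

def inflate(noun):
    """Mechanically 'grow' a word list from one noun, per the joke above."""
    return ([prefix + noun for prefix in PREFIXES] +
            [noun + suffix.lstrip("-") for suffix in SUFFIXES])
```

One noun yields five dubious new "words" (non-dog, pseudo-dog, semi-dog, dogarific, doggeek), which says something about how such million-word counts come about.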
I did research at Purdue on controlled/natural English and language-domain knowledge processing.
I would take a look at the Attempto project: http://attempto.ifi.uzh.ch/site/description/ which aims to build a controlled natural English.
You can download their entire word lexicon at http://attempto.ifi.uzh.ch/site/downloads/files/clex-6.0-080806.zip; it contains ~100,000 natural English words.
You can also supply your own lexicon for domain-specific words; this is what we did in our research. They offer web services to parse and format natural English text.
Who told you there were 1 million words? According to Wikipedia, the Oxford English Dictionary has only 600,000, and the OED tries to include all technical and slang terms in use.
Try Wikipedia's extracts directly: http://dbpedia.org
There aren't that many base words (171k, according to Oxford), which is what I remember being told in my CS program in college.
But if you include all inflected forms of the words, the count rises considerably.
That said, why not make one yourself? Get a Wikipedia dump, parse it, and create a set of all the tokens you encounter.
Expect misspellings, though; like all things crowd-sourced, there will be errors.
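A minimal sketch of that token-set idea, operating on an in-memory string; a real Wikipedia dump would be streamed file by file, and as noted, the resulting set would include every misspelling in the dump:

```python
import re

def vocabulary(text):
    """Collect the set of lowercase word tokens from a raw text dump.
    Allows a single internal apostrophe (e.g. granny's)."""
    return set(re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower()))
```

Running it over the full dump and sorting the result gives you a crowd-sourced word list as large as the corpus itself.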