Gender Detection for Nouns in Spanish - nlp

I am implementing a search engine in Spanish. To ensure gender neutrality, I need to get the gender of Spanish nouns, e.g. "pintora" (painter, female) and "pintor" (painter, male). I am currently using the FAIR library, which is really great for NER in Spanish. However, I cannot find a good implementation or library for gender detection of Spanish nouns. Could you help me?
Thank you in advance for your help

I searched with several search engines, including academic ones, for research papers on Spanish word gender detection and related terms, and found no one who has tackled the problem and implemented a solution in a modern library.
Regardless, you can still tackle the problem in three steps: run a Spanish Part of Speech (PoS) tagger (for example, RuPERTa-base (Spanish RoBERTa) + POS) to detect nouns/pronouns, combine those labels with your NER output where required, and then write your own rules for determining the gender of particular nouns/pronouns based on Spanish grammar (such as the rules detailed in A New Reference Grammar of Modern Spanish, specifically Chapter 1, Gender of nouns).
Hopefully that helps give you some direction if you don't end up finding a ready-made implementation.
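The rule-writing step could start from something like the sketch below. It assumes the PoS tagger has already isolated the nouns. The suffix list and the exception table are small illustrative samples, not a complete grammar, and `guess_gender` is a hypothetical helper name, not part of any library.

```python
# Illustrative, incomplete rule set: real Spanish has many more exceptions.
EXCEPTIONS = {
    "mano": "f", "foto": "f", "radio": "f", "moto": "f",
    "día": "m", "mapa": "m", "problema": "m", "idioma": "m", "planeta": "m",
}

# Checked in order, so longer and more specific endings come first.
SUFFIX_RULES = [
    ("ción", "f"), ("sión", "f"), ("dad", "f"), ("tad", "f"), ("tud", "f"),
    ("umbre", "f"), ("aje", "m"), ("or", "m"), ("a", "f"), ("o", "m"),
]


def guess_gender(noun):
    """Return 'm', 'f', or None when no rule applies."""
    word = noun.lower().strip()
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    for suffix, gender in SUFFIX_RULES:
        if word.endswith(suffix):
            return gender
    return None


for noun in ["pintora", "pintor", "mano", "canción", "árbol"]:
    print(noun, guess_gender(noun))
```

Suffix rules like these will mislabel some words: "flor" is feminine despite ending in "-or", and common-gender nouns such as "artista" take their gender from the article. Treat the output as a first guess and extend the exception table as you find errors.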

Related

NLP - linguistic consistency analysis

I hope you can help me :).
I am working for a translation company.
As you know, every translation consists of splitting the original text into small segments and then re-joining them into the final product.
In other words, the segments are considered as "translation units".
Often, especially for large documents, the translators make linguistic consistency errors. Let me explain with an example.
In Spanish, you can use "tú" or "usted", depending on the context, and this determines how formal or informal the sentence sounds.
So, if you consider these two sentences of a document:
Lara, ¿te has lavado las manos? (TÚ)
Lara, ¿usted se lavó las manos? (USTED)
They are BOTH correct, but if you consider the whole document, there is a linguistic inconsistency.
I am studying the basics of NLP in my spare time, and I am trying to figure out how to create a tool that performs a linguistic consistency analysis on a set of sentences.
I am looking in particular at Stanford CoreNLP (I prefer Java to Python).
I guess that I first need some linguistic tools to perform verb analysis. Naturally, the tool should be able to work with different languages (EN, IT, ES, FR, PT).
Can anyone help me figure out how to start?
Any help would be appreciated,
thanks in advance!
I'm not sure about Stanford CoreNLP, but if you're considering it as an option, you could build your own tagger that adds modifiers to the PoS tags, and then use those tags as a translation feature.
In other words, instead of just tagging a word as a verb, you could tag it as "a second-person indicative verb".
There are already good pre-tagged corpora for Spanish that can help you do exactly that. For example, in the Universal Dependencies AnCora corpus, you will find annotations for the Person of a verb.
With a little tweaking, you could build a composite PoS tag such as "Verb-1st-Person" and train a tagger on it.
I've written an article about how to do it in Python, but I bet that you can do it in Java using Weka. You can read the article here.
After this, I guess the next step is to check that the person in one "translation unit" matches the person in the others, or to build something in a pipeline fashion.
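The composite-tag idea can be sketched in a few lines of Python (the answer's own language; the same logic ports to Java). The sketch assumes the input is in the 10-column CoNLL-U format used by Universal Dependencies treebanks, where column 4 is the universal PoS tag and column 6 holds `Feature=Value` pairs. The token rows are hand-written for illustration, not copied from a corpus, and `token`, `composite_tag`, and `verb_persons` are hypothetical helper names.

```python
def token(idx, form, lemma, upos, feats):
    """Build one 10-column CoNLL-U line for the demo (unused columns are '_')."""
    return "\t".join([str(idx), form, lemma, upos, "_", feats, "_", "_", "_", "_"])


def composite_tag(conllu_line):
    """Build a tag such as 'VERB-Person2' from one CoNLL-U token line."""
    cols = conllu_line.rstrip("\n").split("\t")
    upos, feats = cols[3], cols[5]
    if feats == "_":
        return upos
    feat_map = dict(pair.split("=", 1) for pair in feats.split("|"))
    person = feat_map.get("Person")
    return f"{upos}-Person{person}" if person else upos


def verb_persons(conllu_lines):
    """Collect the Person values found on verbs and auxiliaries in one unit."""
    persons = set()
    for line in conllu_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        tag = composite_tag(line)
        if tag.startswith(("VERB-Person", "AUX-Person")):
            persons.add(tag.rsplit("Person", 1)[1])
    return persons


# Hand-written rows for "te has lavado" and "usted se lavó".
UNIT_TU = [
    token(3, "has", "haber", "AUX", "Mood=Ind|Number=Sing|Person=2|Tense=Pres|VerbForm=Fin"),
    token(4, "lavado", "lavar", "VERB", "Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part"),
]
UNIT_USTED = [
    token(4, "lavó", "lavar", "VERB", "Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin"),
]

if verb_persons(UNIT_TU) != verb_persons(UNIT_USTED):
    print("Possible tú/usted inconsistency between units")
```

A Person mismatch only flags candidates for review: third-person verbs also appear with ordinary third-person subjects, so a real checker would need to look at the subject or the addressee as well.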

How to extract meaning of colloquial phrases and expressions in English

I am looking into extracting the meaning of expressions used in everyday speech. For instance, it is apparent to a human that the sentence "The meal we had at restaurant A tasted like food at my granny's." means that the food was tasty.
How can I extract this meaning using a tool or a technique?
The method I've found so far is to first extract phrases using Stanford CoreNLP POS tagging, and use a Word Sense Induction tool to derive the meaning of the phrase. However, as WSI tools are used to get the meaning of words when they have multiple meanings, I am not sure if it would be the best tool to use.
What would be the best method to extract the meanings? Or is there any tool that can both identify phrases and extract their meanings?
Any help is much appreciated. Thanks in advance.
The problem you pose is a difficult one. A first step is to use Sentiment Analysis tools to get the gist of the sentence's emotional message. There are more sophisticated approaches which attempt to extract which quality is assigned to which object in the sentence (this you can get from POS-tagged sentences plus some hand-crafted Information Extraction rules).
However, you may also want to explore paraphrasing: map common expressions to more formal language and look for those phrases. For that you would need a good (exhaustive) dictionary of common expressions to start with (slang dictionaries are sometimes available, but I am not aware of any for English right now). You could then map the colloquial expressions to more formal ones, which are likely to be captured by an embedding space (frequently used in Sentiment Analysis).
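The paraphrasing step might look like the sketch below, assuming such a dictionary exists. The two entries are invented for illustration, and `normalize` is a hypothetical helper name. The output would then be passed to whatever sentiment tool you use.

```python
import re

# Hypothetical, hand-made entries; a real system needs a large idiom dictionary.
COLLOQUIAL_TO_FORMAL = {
    "tasted like food at my granny's": "tasted very good",
    "cost an arm and a leg": "was very expensive",
}


def normalize(sentence):
    """Replace known colloquial expressions with plainer paraphrases."""
    result = sentence
    for phrase, formal in COLLOQUIAL_TO_FORMAL.items():
        # A function replacement keeps backslashes in `formal` literal.
        result = re.sub(re.escape(phrase), lambda _m, f=formal: f,
                        result, flags=re.IGNORECASE)
    return result


print(normalize("The meal we had at restaurant A tasted like food at my granny's."))
```

Exact-match lookup only catches expressions that appear verbatim in the dictionary, which is why coverage of the dictionary matters so much here.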

How to extract the meaning of sentences for sentiment analysis using NLP

"I had safe journey": assume this is feedback about a driver, provided by a passenger. I need to extract the following information from this sentence.
"I had safe journey" ->
SUBJECT= "driving"
SENTIMENT= "positive"
I tried the NLP "Extracting Information from Text" method, but I don't know how to recognize entities in this kind of sentence. How am I supposed to do that?
To categorize the entities of a sentence, or a sentence as a whole, you first need a defined set of classes/categories/groups.
For example, to categorize "journey" as travelling/driving, you should train your system/algorithm to identify the specific patterns of sentences that fall under the category of driving/journey.
This training involves machine learning concepts; Text Categorization is what you should be searching for.
Here is a reference (to just give you an idea) and you can find many more over the web.
Good Luck!
Note: Below are some links from Coursera which offers a course on NLP
Link 1
Link 2
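To make the text-categorization idea from this answer concrete, here is a minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing in plain Python. The four training sentences and the labels "driving" and "vehicle" are invented for illustration. A real system would need far more labelled feedback and would normally use an established machine learning library.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """examples: list of (text, label). Returns the counts the classifier needs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for text, label in examples:
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        label_counts[label] += 1
        vocab.update(tokens)
    return word_counts, label_counts, vocab


def classify(text, model):
    """Return the label with the highest smoothed log-probability."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for tok in text.lower().split():
            # Add-one (Laplace) smoothing so unseen words do not zero the score.
            score += math.log((word_counts[label][tok] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented training data, for illustration only.
TRAINING = [
    ("the driver was careful and the journey was safe", "driving"),
    ("smooth ride and safe driving", "driving"),
    ("the car was dirty and smelled bad", "vehicle"),
    ("clean car with comfortable seats", "vehicle"),
]

model = train(TRAINING)
print(classify("I had safe journey", model))
```

The same two functions could be trained a second time with "positive"/"negative" labels to produce the SENTIMENT field from the question.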

Associating free text statements with pre-defined attributes

I have a list of several dozen product attributes that people are concerned with, like
Financing
Manufacturing quality
Durability
Sales experience
and several million free-text statements from customers about the product, e.g.
"The financing was easy but the housing is flimsy."
I would like to score each free text statement in terms of how strongly it relates to each of the attributes, and whether that is a positive or negative association.
In the given example, there would be a strong positive association to Financing and a strong negative association to Manufacturing quality.
It feels like this type of problem is probably the realm of Natural Language Processing (NLP). However, I spent several hours reading up on things like OpenNLP and NLTK, and there is so much domain-specific terminology that I cannot figure out where to focus to solve this specific problem.
So my three-part question:
Is NLP the correct route to solve this class of problem?
What aspect of NLP should I focus on learning for this specific problem?
Are there alternatives I have not considered?
A resource you might find handy is SentiWordNet (http://sentiwordnet.isti.cnr.it/), which is like a dictionary that has a sentiment grade for words. It will tell you to what degree it thinks a word is positive, negative, or objective.
You can then combine that with some nltk code that looks through your sentences for the words you want to associate the sentiment with. So you would write a script to extract meaningful chunks of text that surround the words you are looking at, maybe at sentence or clause level. Then a second pass can run through the surrounding words and grab all their sentiment scores from SentiWordNet.
I have some old code that did this and can place on github if you'd like, but you'd still need to make your own request for SentiWordNet.
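A minimal sketch of that approach, assuming clause-level context: the tiny `POLARITY` table stands in for SentiWordNet scores, and the trigger words in `ATTRIBUTE_TERMS` are hand-picked. Both are hypothetical, as is the `score_statement` helper, and none of this uses the actual SentiWordNet or nltk interfaces.

```python
import re

# Hypothetical stand-ins: a tiny polarity lexicon in place of SentiWordNet,
# and hand-picked trigger words for each attribute.
POLARITY = {"easy": 1.0, "great": 1.0, "flimsy": -1.0, "rude": -1.0}
ATTRIBUTE_TERMS = {
    "Financing": {"financing", "loan"},
    "Manufacturing quality": {"housing", "build"},
}


def score_statement(statement):
    """Return {attribute: summed polarity} using clause-level context."""
    scores = {}
    # Split on punctuation and a few conjunctions to approximate clauses.
    clauses = re.split(r"[,;.]|\bbut\b|\band\b", statement.lower())
    for clause in clauses:
        words = re.findall(r"[a-z']+", clause)
        clause_polarity = sum(POLARITY.get(w, 0.0) for w in words)
        for attribute, terms in ATTRIBUTE_TERMS.items():
            if terms.intersection(words):
                scores[attribute] = scores.get(attribute, 0.0) + clause_polarity
    return scores


print(score_statement("The financing was easy but the housing is flimsy."))
```

Splitting into clauses keeps "easy" attached to financing and "flimsy" attached to the housing. Scoring the whole sentence at once would cancel the two out.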
I guess your problem is more on association rather than just classification. Now moving forward with this assumption:
Is NLP the correct route to solve this class of problem?
Yes.
What aspect of NLP should I focus on learning for this specific problem?
Part of speech tagging
Sentiment analysis
Maximum entropy
Are there alternatives I have not considered?
An in-depth study of automata theory with respect to NLP will help you a lot; it helped me a lot in grasping implementations like OpenNLP.
Yes, this is an NLP problem known as sentiment analysis. Sentiment analysis is an active research area with many different approaches, and a task in which a lot of other NLP methods have to work together, so it is certainly not the easiest field in which to get started with NLP.
A more or less recent survey of the academic research in the field can be found in Pang & Lee (2008).

What are the most challenging issues in Sentiment Analysis(opinion mining)?

Opinion Mining/Sentiment Analysis is a somewhat recent subtask of Natural Language Processing. Some compare it to text classification; others take a deeper stance towards it. What do you think are the most challenging issues in Sentiment Analysis (opinion mining)? Can you name a few?
The key challenges for sentiment analysis are:-
1) Named Entity Recognition - What is the person actually talking about, e.g. is 300 Spartans a group of Greeks or a movie?
2) Anaphora Resolution - the problem of resolving what a pronoun, or a noun phrase refers to. "We watched the movie and went to dinner; it was awful." What does "It" refer to?
3) Parsing - What is the subject and object of the sentence, which one does the verb and/or adjective actually refer to?
4) Sarcasm - If you don't know the author you have no idea whether 'bad' means bad or good.
5) Twitter - abbreviations, lack of capitals, poor spelling, poor punctuation, poor grammar, ...
I agree with Hightechrider that those are areas where Sentiment Analysis accuracy can see improvement. I would also add that sentiment analysis tends to be done on closed-domain text for the most part. Attempts to do it on open-domain text usually wind up with very poor accuracy, F1, or whatever metric you use, or else they are pseudo-open-domain because they only look at certain grammatical constructions. So I would say topic-sensitive sentiment analysis that can identify context and make decisions based on it is an exciting area for research (and industry products).
I'd also expand his 5th point from Twitter to other social media sites (e.g. Facebook, Youtube), where short, ungrammatical utterances are commonplace.
I think the answer is language complexity, together with mistakes in grammar and spelling. There is a vast number of ways in which people express their opinions; e.g., sarcasm could be wrongly interpreted as extremely positive sentiment.
The question may be too generic, because there are several types of sentiment analysis (document level, sentence level, comparative sentiment analysis, etc.) and each type has some specific problems.
Generally speaking, I agree with the answer by @Ian Mercer, and I would add 3 other issues:
How to detect a more in-depth sentiment/emotion. Positive versus negative is a very simple analysis; one of the challenges is how to extract emotions, such as how much hate, happiness, or sadness there is in the opinion.
How to detect the object that the opinion is positive for and the object that the opinion is negative for. For example, if you say "She beat him!", this means a positive sentiment for her and a negative sentiment for him, at the same time.
How to analyze very subjective sentences or paragraphs. Sometimes even humans find it very hard to agree on the sentiment of such highly subjective texts. Imagine how hard it is for a computer...
Although this is a somewhat old question, let me add a note about Arabic sentiment analysis specifically. Arabic has morphological complexities and dialectal varieties which require advanced preprocessing and lexicon-building processes beyond what is needed for English.
Please refer to:
https://www.researchgate.net/publication/280042139_Survey_on_Arabic_Sentiment_Analysis_in_Twitter
https://link.springer.com/chapter/10.1007/978-3-642-35326-0_14

Resources