Phonetic support for Indian languages in English - search

I am working on developing phonetic support for search, where users typically type Indian-language text in English (Roman script) to search for a relevant result.
I am not able to find the phonetic rules for generating these phonetic variants. I can use the Norvig algorithm to generate all candidate spellings, but without the phonetic rules for the language I am not able to filter them down to the relevant ones.
For example, for the token "pyar" I want to check whether the tokens "pyaar" and "piyar" are valid phonetic variants or not.
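To make the check concrete, here is roughly what I have in mind; the normalization rules below (collapsing doubled vowels, treating "iy" like "y", and a couple of common spelling alternations) are just my own guesses, not established phonetic rules for the language:

```python
import re

def normalize(token):
    """Very rough phonetic normalization for romanized Indian-language text.
    The rules are assumptions for illustration, not a verified scheme."""
    t = token.lower()
    t = re.sub(r'([aeiou])\1+', r'\1', t)   # "pyaar" -> "pyar"
    t = t.replace('iy', 'y')                # "piyar" -> "pyar"
    t = t.replace('ph', 'f')                # "phool" / "fool"
    t = t.replace('w', 'v')                 # w/v alternation
    return t

def same_phonetic(a, b):
    return normalize(a) == normalize(b)

print(same_phonetic('pyar', 'pyaar'))   # True
print(same_phonetic('pyar', 'piyar'))   # True
print(same_phonetic('pyar', 'pagal'))   # False
```

What I am missing is a principled set of such rules for the language, rather than ad-hoc substitutions like these.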
Do let me know if anyone has any resources that can be helpful here.
Thank you

Related

Gender Detection for Nouns in Spanish

I am implementing a search engine in Spanish. In order to ensure gender neutrality, I need to get the gender of nouns in Spanish - e.g. "pintora" (painter, female) and "pintor" (painter, male). I am currently using the FAIR library, which is really great for NER in Spanish. However, I cannot find any good implementation/library for gender detection in Spanish nouns. Could you help me?
Thank you in advance for your help
After searching multiple engines, including academic ones, for research papers covering Spanish word gender detection and related topics, it seems that no one has tackled the problem and implemented a solution in a modern library.
Regardless, you can still tackle the problem yourself: run a Spanish part-of-speech (PoS) tagger (for example, RuPERTa-base (Spanish RoBERTa) + POS) to detect nouns/pronouns, combine those labels with your NER output where required, and then write your own rules for determining the gender of particular nouns/pronouns based on Spanish grammar rules (such as those detailed in A New Reference Grammar of Modern Spanish, specifically Chapter 1, Gender of nouns).
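To illustrate the rule-writing step, here is a very rough sketch; the suffix lists are a simplification of the grammar and ignore the many exceptions ("la mano", "el día", ...), so treat it as a starting point only:

```python
# Rule-of-thumb gender guesser for Spanish nouns (illustrative only).
FEMININE_SUFFIXES = ("a", "cion", "sion", "dad", "tad", "umbre")
MASCULINE_SUFFIXES = ("o", "or", "aje", "ma")

def guess_gender(noun: str) -> str:
    n = noun.lower()
    # Greek-origin -ma nouns ("problema", "idioma") are masculine despite the -a.
    if n.endswith(FEMININE_SUFFIXES) and not n.endswith("ma"):
        return "feminine"
    if n.endswith(MASCULINE_SUFFIXES):
        return "masculine"
    return "unknown"

print(guess_gender("pintora"))   # feminine
print(guess_gender("pintor"))    # masculine
print(guess_gender("ciudad"))    # feminine
```

You would run this only on the tokens your tagger marks as nouns, and keep an exception dictionary for the words the suffix rules get wrong.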
Hopefully that helps give you some direction if you don't end up finding a ready-made implementation.

NLP algorithm to extract part of sentence in language translation

I am trying to solve a problem, but I am not able to find a way other than collecting training data sets and building a classifier.
Problem:
The user asks to translate a particular sentence from one language to another. I have the user's speech as text, and I need to extract these three things from it:
The sentence to be translated.
The target language (the language into which it is to be translated).
The origin (source) language.
When we humans ask for this, it is usually in one of these forms:
What is I love you in French from English?
Can you translate I love you from English to French?
What is French for I love you in English?
And any other possible way that a person can ask for translation.
I need to extract I love you, French (the language translated into) and English (the language translated from) from the sentence.
The first thing that came to my mind was to use regular expressions, but I found that they could only be used to detect the languages, not the sentence part to be translated.
The other possible solution seems to be to collect the various forms of these sentences as a training data set and train a classifier, but I still feel that this NLP problem can be solved with some algorithm; I just cannot find one.
This seems to be a popular problem, so is there any way it can be done?
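To be concrete, template matching with capture groups would look something like the sketch below (the patterns are purely illustrative); it covers the example phrasings above but clearly won't scale to every possible way a person can ask, which is my concern:

```python
import re

# Illustrative templates only; each captures (sentence, target language, source language).
PATTERNS = [
    # "What is I love you in French from English?"
    re.compile(r"what is (?P<text>.+) in (?P<target>\w+) from (?P<source>\w+)\??", re.I),
    # "Can you translate I love you from English to French?"
    re.compile(r"can you translate (?P<text>.+) from (?P<source>\w+) to (?P<target>\w+)\??", re.I),
    # "What is French for I love you in English?"
    re.compile(r"what is (?P<target>\w+) for (?P<text>.+) in (?P<source>\w+)\??", re.I),
]

def extract(utterance):
    for pattern in PATTERNS:
        m = pattern.match(utterance.strip())
        if m:
            return m.group("text"), m.group("target"), m.group("source")
    return None

print(extract("What is I love you in French from English?"))
print(extract("Can you translate I love you from English to French?"))
print(extract("What is French for I love you in English?"))
```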

How to automatically detect acronym meaning / extension

How can you detect / find out the meaning (the extension) of an acronym using NLP / Information Extraction (IE) methods?
We want to detect in free text whether a word or its acronym is used, and map both to the same entity / token.
Most papers available online are about medical acronyms, and they do not provide a library for accomplishing this task.
Any ideas?
Reading your question and the comments, I understand that you want to create a mapping from an acronym to its expansion.
Assuming you have a collection of textual documents where both the acronym and its expansion occur, you can apply an algorithm to extract (acronym, expansion) pairs.
A Simple Algorithm for Identifying Abbreviation Definitions in Biomedical Text by A.S. Schwartz and M.A. Hearst does exactly this by looking at patterns. The Java implementation is available here.
I applied this algorithm to the English Wikipedia; you can see the results here. I also applied it to a collection of Portuguese news articles; the results are here.
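If you prefer to stay in Python, a stripped-down sketch of the core idea (find "long form (SF)" patterns and check that the short form's letters occur, in order, in the preceding words) is below. It is a simplification, not the full Schwartz and Hearst algorithm:

```python
import re

def letters_in_order(short, candidate):
    """True if every character of the short form occurs, in order, in the
    candidate expansion (a simplification of the Schwartz & Hearst test)."""
    text = candidate.lower()
    pos = -1
    for ch in short.lower():
        if not ch.isalnum():
            continue
        pos = text.find(ch, pos + 1)
        if pos == -1:
            return False
    return True

def extract_pairs(sentence):
    """Find 'long form (SF)' patterns and return (acronym, expansion) pairs."""
    pairs = []
    for m in re.finditer(r"\(([A-Za-z][A-Za-z.\-]{1,9})\)", sentence):
        short = m.group(1)
        words = sentence[:m.start()].split()
        window = words[-(len(short) + 5):]   # search window size, as in the paper
        # Take the shortest suffix of the window that still matches the acronym
        # and starts with the acronym's first letter.
        for i in range(len(window) - 1, -1, -1):
            candidate = " ".join(window[i:])
            if letters_in_order(short, candidate) and candidate[0].lower() == short[0].lower():
                pairs.append((short, candidate))
                break
    return pairs

print(extract_pairs("We use natural language processing (NLP) and "
                    "information extraction (IE) methods."))
# [('NLP', 'natural language processing'), ('IE', 'information extraction')]
```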
WordNet contains acronyms for tons of words, and you can use it from a variety of programming languages: http://wordnet.princeton.edu/wordnet/
Or get them from Freebase. See this: What is one way to find related names using the web?

List of English verbs and their tenses, various forms, etc.

Is there a huge CSV/XML or whatever file somewhere that contains a list of English verbs and their variations (e.g. sell -> sold, sale, selling, seller, sellee)?
I imagine this would be useful for NLP systems, but there doesn't seem to be a listing anywhere, or it could be my terrible googling skills. Does anybody have a clue otherwise?
Consider Catvar:
A Categorial-Variation Database (or Catvar) is a database of clusters of uninflected words (lexemes) and their categorial (i.e. part-of-speech) variants. For example, the words hunger(V), hunger(N), hungry(AJ) and hungriness(N) are different English variants of some underlying concept describing the state of being hungry. Another example is the developing cluster: (develop(V), developer(N), developed(AJ), developing(N), developing(AJ), development(N)).
I am not sure what you are looking for but I think WordNet -- a lexical database for the English language -- would be a good place to start. Read more at http://wordnet.princeton.edu/
The link I referred you to says that
WordNet's structure makes it a useful tool for computational linguistics and natural language processing.
Consider getting a dump of Wiktionary and extracting this information out of it.
http://en.wiktionary.org/wiki/sell mentions many of the forms of the word (sells, selling, sold).
If your aim is simply to normalize words to some base canonical form, consider using a lemmatizer or stemmer. Try playing with morpha, which is a really good English lemmatizer.
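For example, NLTK's WordNet-based lemmatizer (similar in spirit to morpha) maps inflected forms back to a base form; this sketch assumes you have NLTK installed and the wordnet data downloaded:

```python
# pip install nltk; then in Python: nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("selling", pos="v"))  # sell
print(lemmatizer.lemmatize("sold", pos="v"))     # sell
print(lemmatizer.lemmatize("sellers", pos="n"))  # seller
```

Note that a lemmatizer only gives you the base form; derivational variants like "sale" or "sellee" are a different relation, which is where resources like Catvar come in.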

Natural English language words

I need the most exhaustive English word list I can find for several types of language processing operations, but I could not find anything on the internet that has good enough quality.
There are said to be 1,000,000 words in the English language, including foreign and/or technical words.
Can you please suggest such a source (or one with close to 500k words) that can be downloaded from the internet and is maybe a bit categorized? What input do you use for your language processing applications?
Kevin's word lists are the best I know of just for plain lists of words.
WordNet is better if you want to know about parts of speech (nouns, verbs, etc.), synonyms, and so on.
`The "million word" hoax rolls along', I see ;-)
How to make your word lists longer: given a noun, add any of the following to it: non-, pseudo-, semi-, -arific, -geek, ...; mutatis mutandis for verbs etc.
I did research for Purdue on controlled / natural English and language domain knowledge processing.
I would take a look at the Attempto project: http://attempto.ifi.uzh.ch/site/description/ which is a project to help build a controlled natural English.
You can download their entire word lexicon at http://attempto.ifi.uzh.ch/site/downloads/files/clex-6.0-080806.zip; it has ~100,000 natural English words.
You can also supply your own lexicon for domain-specific words, which is what we did in our research. They offer web services to parse and format natural English text.
Who told you there were 1 million words? According to Wikipedia, the Oxford English Dictionary only has 600,000, and the OED tries to include all technical and slang terms that are used.
Try Wikipedia's extracts directly: http://dbpedia.org
There aren't too many base words (171k according to this - Oxford), which is what I remember being told in my CS program in college.
But if you include all forms of the words, then the count rises considerably.
That said, why not make one yourself? Get a Wikipedia dump, parse it, and create a set of all the tokens you encounter (see the sketch below).
Expect misspellings though; like all things crowd-sourced, there will be errors.
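A minimal sketch of that approach, assuming the dump has already been converted to plain text (for instance with WikiExtractor; the file path in the example is hypothetical):

```python
import re

def build_vocabulary(paths):
    """Collect the set of lowercase word tokens from plain-text files
    (e.g. the output of WikiExtractor run over a Wikipedia dump)."""
    vocab = set()
    token_re = re.compile(r"[A-Za-z]+(?:['-][A-Za-z]+)*")
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                vocab.update(t.lower() for t in token_re.findall(line))
    return vocab

# Example (hypothetical path to an extracted dump file):
# vocab = build_vocabulary(["wiki_extracted/AA/wiki_00"])
# print(len(vocab))
```

Filtering by frequency (drop tokens seen only once or twice) removes a lot of the misspellings mentioned above.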

Resources