I need the most exhaustive English word list I can find for several types of language processing operations, but I could not find anything on the internet that has good enough quality.
There are 1,000,000 words in the English language including foreign and/or technical words.
Can you please suggest such a source (or one close to 500k words) that can be downloaded from the internet and is ideally somewhat categorized? What input do you use for your language processing applications?
Kevin's wordlists are the best I know of if you just want lists of words.
WordNet is better if you also want to know parts of speech (noun, verb, etc.), synonyms, and so on.
`The "million word" hoax rolls along', I see ;-)
How to make your word lists longer: given a noun, add any of the following to it: non-, pseudo-, semi-, -arific, -geek, ...; mutatis mutandis for verbs etc.
I did research for Purdue on controlled natural English and domain-knowledge language processing.
I would take a look at the Attempto project (http://attempto.ifi.uzh.ch/site/description/), which aims to help build a controlled natural English.
You can download their entire word lexicon here: http://attempto.ifi.uzh.ch/site/downloads/files/clex-6.0-080806.zip (it has ~100,000 natural English words).
You can also supply your own lexicon for domain-specific words, which is what we did in our research. They also offer web services to parse and format natural English text.
Who told you there were 1 million words? According to Wikipedia, the Oxford English Dictionary only has 600,000, and the OED tries to include all technical and slang terms that are in use.
Try Wikipedia's extracts directly: http://dbpedia.org
There aren't that many base words (171k according to this, from Oxford), which is what I remember being told in my CS program in college.
But if you include all forms of the words, the count rises considerably.
That said, why not make one yourself? Get a Wikipedia dump and parse it and create a set of all tokens you encounter.
Expect misspellings, though: like all things crowd-sourced, there will be errors.
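As a rough sketch of the "make one yourself" idea: the snippet below builds a word list from lines of plain text. It assumes the Wikipedia dump has already been converted to plain text (that step is not shown), and the `min_count` threshold is my own addition, a crude way to drop one-off misspellings. The function names are hypothetical.

```python
import re
from collections import Counter

# Runs of letters, optionally joined by apostrophes or hyphens
# (e.g. "doesn't", "crowd-sourced"). Digits and punctuation are skipped.
TOKEN_RE = re.compile(r"[a-z]+(?:['-][a-z]+)*")


def count_tokens(lines):
    """Count every lowercased token seen in an iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(TOKEN_RE.findall(line.lower()))
    return counts


def build_word_list(lines, min_count=2):
    """Return the sorted tokens that occur at least min_count times."""
    counts = count_tokens(lines)
    return sorted(word for word, n in counts.items() if n >= min_count)


sample = [
    "The cat sat on the mat.",
    "A cat doesn't sit on teh mat twice.",
]
words = build_word_list(sample, min_count=2)
print(words)
```

On a real dump you would pass a file object (or a generator over article texts) instead of `sample`. A frequency cutoff removes rare misspellings but also rare legitimate words, so the threshold is a trade-off rather than a fix.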
I hope you can help me :).
I am working for a translation company.
As you know, every translation consists of splitting the original text into small segments and then re-joining them into the final product.
In other words, the segments are considered as "translation units".
Often, especially for large documents, translators make linguistic consistency errors. Let me explain with an example.
In Spanish, you can use "tú" or "usted", depending on the context, and this determines the formal or informal tone of the sentence.
So, if you consider these two sentences of a document:
Lara, te has lavado las manos? (TU)
Lara, usted se lavó las manos? (USTED)
They are BOTH correct, but if you consider the whole document, there is a linguistic inconsistency.
I am studying basic NLP in my spare time, and I am trying to figure out how to create a tool that performs a linguistic consistency analysis on a set of sentences.
I am looking in particular at Stanford CoreNLP (I prefer Java to Python).
I guess that I need some linguistic tools to perform verb analysis first of all. And naturally, the tool should be able to work with different languages (EN, IT, ES, FR, PT).
Can anyone help me figure out how to get started?
Any help would be appreciated,
thanks in advance!
I'm not sure about Stanford CoreNLP, but if you're considering this an option, you could build your own tagger and add modifiers to the POS tags. Then, use this as a translation feature.
In other words, instead of just tagging a word as a verb, you could tag it as "a verb in the second person".
There are already good pre-tagged corpora out there for Spanish that can help you do exactly that. For example, if you look at the Universal Dependencies AnCora corpus, you will find annotations for the Person of a verb.
With a little tweaking, you could build a composite POS tag such as "Verb-1st-Person" and train a tagger on it.
I've written an article about how to do it in Python, but I bet that you can do it in Java using Weka. You can read the article here.
After this, I guess the next step is to make sure the person used in one "translation unit" matches the others, or to handle that check as a stage in a pipeline.
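To illustrate the composite-tag idea, here is a minimal sketch that reads CoNLL-U lines (the tab-separated format Universal Dependencies treebanks use, with UPOS in column 4 and FEATS in column 6) and joins the POS tag with the Person feature. The function names are hypothetical, and the sample rows are hand-written for illustration, not copied from AnCora.

```python
def composite_tag(upos, feats, keep=("Person",)):
    """Join a UPOS tag with selected morphological features, if present."""
    if feats == "_":  # "_" means the token has no features
        return upos
    pairs = dict(item.split("=", 1) for item in feats.split("|"))
    extras = [f"{name}={pairs[name]}" for name in keep if name in pairs]
    return "-".join([upos] + extras)


def read_tagged_tokens(conllu_text):
    """Return (form, composite tag) pairs from CoNLL-U formatted text."""
    tokens = []
    for line in conllu_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # blank lines separate sentences, "#" lines are comments
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword-token ranges ("1-2") and empty nodes
        tokens.append((cols[1], composite_tag(cols[3], cols[5])))
    return tokens


# Hand-written illustrative rows: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
rows = [
    ["1", "te", "tú", "PRON", "_", "Case=Acc|Number=Sing|Person=2|PronType=Prs", "3", "obj", "_", "_"],
    ["2", "has", "haber", "AUX", "_", "Mood=Ind|Number=Sing|Person=2|Tense=Pres|VerbForm=Fin", "3", "aux", "_", "_"],
    ["3", "lavado", "lavar", "VERB", "_", "Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part", "0", "root", "_", "_"],
]
sample = "# text = te has lavado\n" + "\n".join("\t".join(r) for r in rows)
tagged = read_tagged_tokens(sample)
print(tagged)
```

The resulting (form, tag) pairs could then serve as training data for whichever tagger you choose. Passing a longer `keep` tuple yields finer-grained tags at the cost of a larger, sparser tag set.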
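The matching step could look roughly like the sketch below. It is a toy heuristic, not a tagger: instead of verb morphology it looks for a few hand-picked Spanish pronouns (the `INFORMAL` and `FORMAL` sets are my own assumptions and far from complete) and groups segment indices by the register they suggest. A real tool would rely on the person features from a trained tagger instead.

```python
import re

# Hypothetical marker sets, chosen only for illustration.
INFORMAL = {"tú", "te", "ti", "contigo"}
FORMAL = {"usted", "ustedes"}


def register_of(segment):
    """Guess the register of one segment: 'informal', 'formal', 'mixed' or None."""
    words = set(re.findall(r"\w+", segment.lower()))
    informal = bool(words & INFORMAL)
    formal = bool(words & FORMAL)
    if informal and formal:
        return "mixed"
    if informal:
        return "informal"
    if formal:
        return "formal"
    return None  # no marker found, so the segment is not judged


def group_by_register(segments):
    """Map each detected register to the indices of the segments using it."""
    groups = {}
    for index, segment in enumerate(segments):
        label = register_of(segment)
        if label is not None:
            groups.setdefault(label, []).append(index)
    return groups


def is_consistent(groups):
    """A document is consistent if at most one unmixed register was detected."""
    return len(groups) <= 1 and "mixed" not in groups


segments = [
    "Lara, te has lavado las manos?",
    "Lara, usted se lavó las manos?",
    "Gracias.",
]
groups = group_by_register(segments)
print(groups, is_consistent(groups))
```

Reporting the groups rather than a single verdict leaves it to the translator to decide which register is the intended one.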
I am looking into extracting the meaning of expressions used in everyday speech. For instance, it is apparent to a human that the sentence "The meal we had at restaurant A tasted like food at my granny's." means that the food was tasty.
How can I extract this meaning using a tool or a technique?
The method I've found so far is to first extract phrases using Stanford CoreNLP POS tagging, and use a Word Sense Induction tool to derive the meaning of the phrase. However, as WSI tools are used to get the meaning of words when they have multiple meanings, I am not sure if it would be the best tool to use.
What would be the best method to extract the meanings? Or is there any tool that can both identify phrases and extract their meanings?
Any help is much appreciated. Thanks in advance.
The problem you pose is a difficult one. You could use tools from Sentiment Analysis to get the gist of the sentence's emotional message. There are more sophisticated approaches which attempt to extract what quality is assigned to what object in the sentence (this you can get from POS-tagged sentences plus some hand-crafted Information Extraction rules).
However, you may also want to explore mapping between formal and colloquial language and looking for those colloquial phrases. For that you would need a good (exhaustive) dictionary of common expressions to start with (there are sometimes slang dictionaries available, but I am not aware of any for English right now). You could then map the colloquial expressions to more formal ones, which are more likely to be captured by an embedding space (frequently used in Sentiment Analysis).
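A minimal sketch of the mapping idea, under heavy assumptions: `PARAPHRASES` stands in for the dictionary of common expressions and `POLARITY` stands in for a real sentiment model (which would typically use embeddings). Both are tiny, hand-written and hypothetical, and only show how the two steps fit together.

```python
import re

# Hypothetical colloquial-to-plain mapping; a real one would be far larger.
PARAPHRASES = {
    "tasted like food at my granny's": "tasted good",
    "cost an arm and a leg": "was expensive",
}

# Hypothetical word polarities standing in for a trained sentiment model.
POLARITY = {"good": 1, "tasty": 1, "bad": -1, "expensive": -1}


def normalize(sentence):
    """Replace known colloquial expressions with plainer paraphrases."""
    text = sentence.lower()
    # Longest expressions first, so a short entry cannot break up a longer one.
    for expression in sorted(PARAPHRASES, key=len, reverse=True):
        text = text.replace(expression, PARAPHRASES[expression])
    return text


def polarity(sentence):
    """Sum the polarities of the words left after normalization."""
    tokens = re.findall(r"[a-z']+", normalize(sentence))
    return sum(POLARITY.get(token, 0) for token in tokens)


sentence = "The meal we had at restaurant A tasted like food at my granny's."
print(normalize(sentence))
print(polarity(sentence))
```

Without the normalization step the example sentence contains no word from the polarity list and scores 0, which is exactly the gap the paraphrase dictionary is meant to close.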
I need to implement some sort of stemmer/lemmatizer. I have some words in different forms (a few thousand). It's not a morphological dictionary, just a small part of one. Is it a good idea to learn a stemmer automatically from the file I have? Are there any open-source implementations that can be used?
Nuve is an NLP library for Turkic languages. Once the language rules and data are prepared, it can analyze and generate words for any Turkic language, if not for any agglutinative language. You can fork it and prepare new orthography and morphology files for Azeri.
https://github.com/hrzafer/nuve
Since I'm the author, I'd be glad to help you with the process.
Azerbaijani is an agglutinative language, similar to Turkish, which means words frequently have a chain of suffixes (e.g. one suffix for plural and one for accusative). It also has vowel harmony, which means each suffix has several variants and you choose the correct one based on the vowels in the root.
What I would do:
Identify a list of suffixes. I would try both unsupervised methods (maybe try Linguistica) and googling for a list of suffixes (these will often contain only a basic suffix, which changes depending on vowel harmony). Iteratively, you should arrive at a reasonable list. If in doubt whether something is a suffix or not, I would throw it in.
Use the list to strip suffixes from words.
The resulting stemmer will be noisy, but depending on what you need it for, it might not matter.
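The two steps above can be sketched as follows. The suffix list is a tiny illustrative sample (the plural suffix and one accusative suffix, each written out in its vowel-harmony variants), not a complete inventory, and the `min_stem` guard is my own addition to stop short words from being stripped to nothing. Input is assumed to be lowercased already.

```python
# Illustrative sample only: plural (-lar/-lər) and accusative (-ı/-i/-u/-ü).
SUFFIXES = ["lar", "lər", "ı", "i", "u", "ü"]


def strip_suffixes(word, suffixes=SUFFIXES, min_stem=2):
    """Repeatedly strip the longest matching suffix from the end of a word."""
    ordered = sorted(suffixes, key=len, reverse=True)
    stem = word
    changed = True
    while changed:
        changed = False
        for suffix in ordered:
            # Only strip if at least min_stem characters would remain.
            if stem.endswith(suffix) and len(stem) - len(suffix) >= min_stem:
                stem = stem[: -len(suffix)]
                changed = True
                break  # restart with the longest suffixes on the new stem
    return stem


for word in ["kitab", "kitablar", "kitabları", "evlər"]:
    print(word, "->", strip_suffixes(word))
```

Because single-letter suffixes are included, any root that happens to end in one of those vowels is also truncated; that is the kind of noise to expect from this approach.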
You should look at Linguistica, which was developed by John Goldsmith and his team (University of Chicago) for this purpose.
Are you talking about English? Then please see "English lemmatizer databases?". Considering the significant number of exceptions, a machine-learning approach without a large dictionary does not seem promising.
Is there a huge CSV/XML or whatever file somewhere that contains a list of English verbs and their variations (e.g. sell -> sold, sale, selling, seller, sellee)?
I imagine this will be useful for NLP systems, but there doesn't seem to be a listing anywhere, or it could be my terrible googling skills. Does anybody have a clue otherwise?
Consider Catvar:
A Categorial-Variation Database (or Catvar) is a database of clusters of uninflected words (lexemes) and their categorial (i.e. part-of-speech) variants. For example, the words hunger(V), hunger(N), hungry(AJ) and hungriness(N) are different English variants of some underlying concept describing the state of being hungry. Another example is the developing cluster: (develop(V), developer(N), developed(AJ), developing(N), developing(AJ), development(N)).
I am not sure what you are looking for but I think WordNet -- a lexical database for the English language -- would be a good place to start. Read more at http://wordnet.princeton.edu/
The page I linked to says that
WordNet's structure makes it a useful tool for computational linguistics and natural language processing.
Consider getting a dump of Wiktionary and extracting this information from it.
http://en.wiktionary.org/wiki/sell mentions many of the forms of the word (sells, selling, sold).
If your aim is simply to normalize words to some base canonical form, consider using a lemmatizer or stemmer. Try playing with morpha, which is a really good English lemmatizer.
[Caveat] This is not directly a programming question, but it is something that comes up so often in language processing that I'm sure it's of some use to the community.
Does anyone have a good list of uninteresting (English) words that has been tested by more than a casual look? This would include all prepositions, conjunctions, etc.: words that may have semantic meaning but are frequent in almost every sentence, regardless of the subject. I've built my own lists from time to time for personal projects, but they've been ad hoc; I continuously add words that I had forgotten as they come in.
These words are usually called stop words. The Wikipedia article contains much more information about them, including where to find some lists.
I think you mean stop words.
There are a few links to lists of stop words on Wikipedia, including this one.
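Once you have such a list, applying it is straightforward. The sketch below uses a deliberately tiny, hand-picked `STOP_WORDS` set as a placeholder (an assumption for illustration, not one of the tested lists linked above); in practice you would load a published list into the set instead.

```python
import re

# Placeholder list for illustration only; load a vetted list in real use.
STOP_WORDS = frozenset(
    {"a", "an", "and", "in", "is", "it", "of", "on", "the", "to"}
)


def content_words(text, stop_words=STOP_WORDS):
    """Return the lowercased tokens of text that are not stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [token for token in tokens if token not in stop_words]


print(content_words("The cat is on the mat and it is happy"))
```

Keeping the list in a set makes each lookup constant time, which matters once the list grows to a few hundred entries and the input to millions of tokens.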