I am trying to automatically process English sentences and detect the words that might refer to humans, e.g. he, everybody, someone, niece, I, son, ...
I am already using NER, and have implemented some simple heuristic rules as well.
But I think that, apart from tricky cases where mislabeling is acceptable, the problem can be solved with a simple dictionary look-up. Is there a list of such English words that I can use?
I found some lists on the web:
I have a list of let's say "forbidden sentences" (1000 of them, each with around 40 words). I want to create a tool that will find and mark them in a given document.
The problem is that in such a document a forbidden sentence can be expressed differently than on the list: the meaning stays the same, but it may use synonyms, a few words more or fewer, a different word order, punctuation, grammar, etc. The fact that this is all in Polish does not make things easier: each noun, pronoun, and adjective has 14 case forms in total (seven cases, singular and plural), plus modifiers and gender that change the words further. I would also like the found sentences to be ranked by the probability of their being forbidden, since some will show less resemblance than others.
I studied IT for two years but I don't have much knowledge of NLP. Do you think this can be done by an amateur? Could you give me some advice on where to start and which tools are best for putting it all together? No need to be fancy, just practical. I was hoping to find some ready-to-use code because I imagine this is something that has been done before. Any ideas where to find such resources or what keywords to use while searching? I'd really appreciate some help because I'm very new to this and need to start with the basics.
Probably the easiest first try is to use Polish spaCy, an extension of the popular production-ready NLP library spaCy that adds support for the Polish language.
You can try to do it like this:
Split the document into sentences.
Clean these sentences with spaCy (remove stopwords and punctuation, and lemmatize; lemmatization will help you with the many different forms of the same word).
Clean the "forbidden sentences" in the same way.
Prepare a vector representation of each sentence; you can use spaCy methods for this.
Calculate the similarity between sentences, e.g. cosine similarity.
Set a threshold: if a sentence of the document is at least that similar to any of the "forbidden sentences", treat it as forbidden.
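The steps above can be sketched roughly as follows. This is only a minimal illustration using the standard library: it swaps spaCy's lemmatization and sentence vectors for plain lowercase word counts, so it will not cope with Polish inflection on its own. The names `STOPWORDS`, `normalize`, `cosine` and `flag_forbidden`, the tiny English stopword set, and the 0.6 threshold are all assumptions made for the example.

```python
import math
import re
from collections import Counter

# Hypothetical, tiny stopword set; a real one would come from your NLP library.
STOPWORDS = {"the", "a", "an", "is", "of", "to"}

def normalize(sentence):
    """Lowercase, drop punctuation and stopwords (no lemmatization here)."""
    tokens = re.findall(r"\w+", sentence.lower())
    return [t for t in tokens if t not in STOPWORDS]

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def flag_forbidden(document, forbidden, threshold=0.6):
    """Return (sentence, closest forbidden sentence, score), best score first."""
    forbidden_vecs = [(s, Counter(normalize(s))) for s in forbidden]
    results = []
    # Naive sentence split on ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", document.strip()):
        vec = Counter(normalize(sentence))
        score, match = max(
            ((cosine(vec, fv), fs) for fs, fv in forbidden_vecs),
            key=lambda pair: pair[0],
            default=(0.0, None),
        )
        if score >= threshold:
            results.append((sentence, match, score))
    return sorted(results, key=lambda r: r[2], reverse=True)

hits = flag_forbidden(
    "On the mat the cat sat! Dogs bark loudly.",
    ["The cat sat on the mat."],
)
```

Sorting by score gives the ranking asked for in the question. To handle inflected forms, `normalize` would be the place to plug in a lemmatizer.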
I've recently started looking at data extraction using NLTK. While there are several examples and techniques for detecting "real" names, locations, etc., I haven't found an efficient way to detect "made up" or "imaginary" names. An example string would be:
His name is wuzzywugg and he has a dog named fizzbuzz
I would like to train NLTK to be able to detect that "wuzzywugg" and "fizzbuzz" are names of characters. I have seen some solutions that rely on the word starting with a CAPITAL letter, but this feels very "hacky" and prone to errors and false positives.
I ran into the same problem when processing Russian folktales: it turns out that most of their names don't appear in Western gazetteers. A quick approach may be to use part-of-speech tags and keep only NNP (proper nouns). Check this: http://www.nltk.org/book/ch05.html
This didn't work entirely for me. My approach involved extracting all noun phrases (NP nodes from the parse tree) and then extracting feature vectors, which I annotated myself, to build an ML classifier. You can find more information here: http://ieeexplore.ieee.org/document/7489041/
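A minimal sketch of the part-of-speech filter mentioned above. The tags here are written by hand purely for illustration (and the names are capitalized, unlike in the question). In practice they would come from a tagger such as `nltk.pos_tag(nltk.word_tokenize(text))`, which needs the NLTK tokenizer and tagger data to be downloaded, and there is no guarantee that a tagger labels made-up words as NNP. The helper name `proper_nouns` is an assumption for the example.

```python
def proper_nouns(tagged_tokens):
    """Keep tokens tagged as singular or plural proper nouns (Penn Treebank tags)."""
    return [token for token, tag in tagged_tokens if tag in {"NNP", "NNPS"}]

# Hand-written (token, tag) pairs standing in for real tagger output.
tagged = [
    ("His", "PRP$"), ("name", "NN"), ("is", "VBZ"), ("Wuzzywugg", "NNP"),
    ("and", "CC"), ("he", "PRP"), ("has", "VBZ"), ("a", "DT"),
    ("dog", "NN"), ("named", "VBN"), ("Fizzbuzz", "NNP"),
]

names = proper_nouns(tagged)
```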
I need to implement some sort of stemmer/lemmatizer. I have some words in different forms (a few thousand). It's not a morphological dictionary, just a small part of one. Is it a good idea to learn a stemmer automatically from the file I have? Are there any open-source implementations that can be used?
Nuve is an NLP library for Turkic languages. Once the language rules and data are prepared, it can analyze and generate words for any Turkic language, if not for any agglutinative language. You can fork it and prepare new orthography and morphology files for Azeri.
Since I'm the author, I'd be glad to help you with the process.
Azerbaijani is an agglutinative language, similar to Turkish, which means words frequently have a chain of suffixes (e.g. one suffix for plural and one for accusative). It also has vowel harmony, which means each suffix has several variants and you choose the correct one based on the vowels in the root.
What I would do:
Identify a list of suffixes. I would try both unsupervised methods (maybe Linguistica) and googling for a list of suffixes (such lists will often contain only a basic form of each suffix, which changes depending on vowel harmony). Iterating, you should arrive at a reasonable list. If in doubt whether something is a suffix or not, I would throw it in.
Use the list to strip suffixes from words.
The resulting stemmer will be noisy, but depending on what you need it for, it might not matter.
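A rough sketch of such a suffix stripper, just to show the idea. The suffix list is a tiny illustrative sample (plural and accusative variants as I understand them) and should be replaced by the list you collect. The names `SUFFIXES` and `strip_suffixes` and the `min_stem` cut-off are assumptions for the example.

```python
# Illustrative sample only: plural (-lar/-lər) and accusative (-ı/-i/-u/-ü) variants.
SUFFIXES = ["lar", "lər", "ı", "i", "u", "ü"]

def strip_suffixes(word, suffixes=SUFFIXES, min_stem=2):
    """Repeatedly strip the longest matching suffix, keeping at least min_stem letters."""
    ordered = sorted(suffixes, key=len, reverse=True)
    # Note: str.lower() is not locale-aware, so an uppercase "I" becomes "i", not "ı".
    stem = word.lower()
    changed = True
    while changed:
        changed = False
        for suffix in ordered:
            if stem.endswith(suffix) and len(stem) - len(suffix) >= min_stem:
                stem = stem[: -len(suffix)]
                changed = True
                break
    return stem

stems = [strip_suffixes(w) for w in ["kitablar", "kitabları", "evlər"]]
```

Because it strips anything that looks like a suffix, it will also cut words whose root merely ends in one of these letters, which is the noise mentioned above.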
You should look at Linguistica, which was developed by John Goldsmith and his team (University of Chicago) for this purpose.
Are you talking about English? Then please see "English lemmatizer databases?". Considering the significant number of exceptions, a machine-learning approach without a large dictionary does not seem promising.
How would you go about parsing a string of free-form text to detect things like locations and names based on a dictionary of locations and names? In my particular application there will be tens of thousands of entries, if not more, in my dictionaries, so I'm pretty sure just running through them all is out of the question. Also, is there any way to add "fuzzy" matching so that you can also detect substrings that are within x edits of a dictionary word? If I'm not mistaken, this falls within the field of natural language processing and more specifically named entity recognition (NER); however, my attempts to find information about the algorithms and processes behind NER have come up empty. I'd prefer to use Python for this as I'm most familiar with it, although I'm open to looking at other solutions.
You might try downloading the Stanford Named Entity Recognizer:
If you don't want to use someone else's code and you want to do it yourself, I'd suggest taking a look at the algorithm in their associated paper, because the Conditional Random Field model that they use for this has become a fairly common approach to NER.
I'm not sure exactly how to answer the second part of your question on looking for substrings without more details. You could modify the Stanford program, or you could use a part-of-speech tagger to mark proper nouns in the text. That wouldn't distinguish locations from names, but it would make it very simple to find words that are x words away from each proper noun.
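For the "within x edits" part of the question, here is a small standard-library sketch of plain dictionary matching with Levenshtein distance. This is not what the Stanford recognizer does; it is just an illustration of fuzzy dictionary lookup. The names `edit_distance` and `find_entities` and the parameters `max_edits` and `max_words` are assumptions for the example, and the length-bucket index is only a simple way to avoid comparing against every entry.

```python
from collections import defaultdict

def edit_distance(a, b):
    """Levenshtein distance via the classic two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]

def find_entities(text, dictionary, max_edits=1, max_words=2):
    """Return (candidate, dictionary entry) pairs within max_edits of each other."""
    # Bucket entries by length: an entry can only match a candidate
    # whose length differs by at most max_edits.
    by_length = defaultdict(list)
    for entry in dictionary:
        by_length[len(entry)].append(entry)

    tokens = [t.strip(".,;:!?").lower() for t in text.split()]
    hits = []
    for n in range(max_words, 0, -1):
        for i in range(len(tokens) - n + 1):
            candidate = " ".join(tokens[i:i + n])
            lengths = range(len(candidate) - max_edits, len(candidate) + max_edits + 1)
            for length in lengths:
                for entry in by_length.get(length, []):
                    if edit_distance(candidate, entry) <= max_edits:
                        hits.append((candidate, entry))
    return hits

found = find_entities(
    "Alice flew from Londn to New York.",
    {"new york", "london", "alice"},
)
```

With `max_edits=0` this reduces to exact lookup. For very large dictionaries the length buckets may still be too slow, and a dedicated index structure would be worth looking into.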
As part of teaching myself Python I've written a script which allows a user to play hangman. At the moment, the hangman word to be guessed is simply entered manually at the start of the script's code.
I want the script instead to choose randomly from a large list of English words. This I know how to do; my problem is finding that list of words to work from in the first place.
Does anyone know of a source on the net for, say, 1000 common English words that can be downloaded as a block of text or something similar that I can work with?
(My initial thought was grabbing a chunk of a novel from Project Gutenberg [this project is only for my own amusement and won't be available anywhere else, so copyright etc. doesn't matter hugely to me, btw], but anything like that is likely to contain too many names or non-standard words that wouldn't be suitable for hangman. I need text that only has words legal for use in Scrabble, basically.)
It's a slightly odd question for here I suppose, but actually I thought the answer might be of use not just to me but anyone else working on a project for a wordgame or similar that needs a large seed list of words to work from.
Would this be useful?
Have you tried /usr/share/dict/words?
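If that file exists on your system (it is common on Unix-like systems but not guaranteed), something like this could load it and pick a hangman word. The function name `load_words` and the length limits are assumptions for the example. The filter keeps only lowercase ASCII words, which drops most names, abbreviations and possessives.

```python
import random

def load_words(path="/usr/share/dict/words", min_len=4, max_len=12):
    """Read one word per line and keep plain lowercase ASCII words of suitable length."""
    with open(path, encoding="utf-8") as handle:
        candidates = (line.strip() for line in handle)
        return [
            word for word in candidates
            if word.isascii() and word.isalpha() and word.islower()
            and min_len <= len(word) <= max_len
        ]

def pick_word(words):
    """Choose the secret word for one game."""
    return random.choice(words)
```

Usage would be `secret = pick_word(load_words())`.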
Create text list manually
Grab text from Project Gutenberg, Wikipedia or some other source. Go through the text and count how many times each word is found. The words that are found most frequently will be pronouns, conjunctions, etc... Just throw them out.
Proper nouns will likely be the least frequently found words, unless of course your text is a story, in which case the character names will likely be found quite often. Probably the best way to handle proper nouns is to use many sources and count how many sources each word is found in. Essentially, words that are common to a lot of different sources will likely not be proper nouns. Words that are specific to one text source can be thrown out. This idea is related to tf-idf.
Once you have calculated these word frequencies, it's also easy to just look over the words, and tweak your list as necessary.
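A minimal sketch of that counting, assuming each source is already loaded as a string. The function name `candidate_words` and the cut-offs `min_sources` and `drop_top` are assumptions you would tune by looking over the output.

```python
import re
from collections import Counter

def candidate_words(sources, min_sources=2, drop_top=50):
    """Keep words seen in several sources, minus the overall most frequent ones."""
    total = Counter()      # how often each word occurs overall
    doc_freq = Counter()   # in how many sources each word occurs
    for text in sources:
        tokens = re.findall(r"[a-z]+", text.lower())
        total.update(tokens)
        doc_freq.update(set(tokens))
    # The most frequent words are mostly pronouns, conjunctions, articles, etc.
    too_common = {word for word, _ in total.most_common(drop_top)}
    return sorted(
        word for word, count in doc_freq.items()
        if count >= min_sources and word not in too_common
    )

sample = [
    "The cat saw the dog and the bird",
    "The dog and the cat met Frodo",
    "The bird saw the cat",
]
words = candidate_words(sample, min_sources=2, drop_top=1)
```

Here "frodo" and "met" are dropped because they appear in only one source, and "the" is dropped as the single most frequent word.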
Use WordNet
Another idea is to download words from WordNet. WordNet gives the parts of speech for a lot of words. You could just stick to nouns and verbs for your purpose.