I'm using the KeyATM package in R to characterize the relative prevalence of several topics in each document of a corpus. KeyATM treats "freedom" and "free" as different keywords and does not seem to allow the use of "free*" as a way to focus on the word root when specifying the list of keywords.
Is there a way to tell KeyATM to treat all versions of "free*" as the same keyword?
(I pre-process the texts with Quanteda and the stemming functions (which use Snowball) do not effectively reduce "freedom" and other variants to "free.")
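One language-agnostic workaround is to expand each wildcard against the corpus vocabulary before building the keyword list, so that every matching word form is listed explicitly. Below is a minimal sketch of that idea in Python, not KeyATM's own API; `expand_keywords`, the vocabulary, and the topic names are all hypothetical, and the same expansion could presumably be done in R before the keywords are handed to the model.

```python
from fnmatch import fnmatchcase


def expand_keywords(keywords, vocabulary):
    """Replace glob patterns such as 'free*' with the vocabulary words they match."""
    expanded = {}
    for topic, patterns in keywords.items():
        matches = []
        for pattern in patterns:
            for word in sorted(vocabulary):
                # fnmatchcase does case-sensitive shell-style matching
                if fnmatchcase(word, pattern) and word not in matches:
                    matches.append(word)
        expanded[topic] = matches
    return expanded


# Hypothetical vocabulary and keyword patterns, for illustration only
vocabulary = {"free", "freedom", "freely", "freeze", "tax", "taxes"}
keywords = {"liberty": ["free*"], "economy": ["tax*"]}

expanded = expand_keywords(keywords, vocabulary)
print(expanded["liberty"])  # ['free', 'freedom', 'freely', 'freeze']
```

Note that "freeze" is picked up as well, so the expanded list is worth reviewing by hand before it is used.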
Related
I need to identify difficult words from input text. I do not want to use common word lists because the level of difficulty will need to be set for children. Is there a scoring mechanism that computes the difficulty level of each word? I can use a threshold for score to separate difficult words from easy ones. The end objective is to provide word meanings for all such difficult words.
There are several ways in which overall text can be scored for complexity or difficulty level, e.g., the Dale–Chall formula, the Gunning fog formula, etc. However, these are used for defining "readability", i.e. the ease with which a reader can understand the written text. My requirement is related to the difficulty level of individual words in a text.
There are a few methods that I have come across for defining difficult words, such as words with more than two syllables or any word not appearing in the 10,000 most common words. However, none of these methods are useful for me. I am trying to build an application that can identify difficult words and provide relevant dictionary meanings for only those words. Is there a scoring mechanism that will allow me to use a threshold to separate the difficult words from the easy ones?
Let’s say I have a list of 250 words, which may consist of unique entries throughout, or a bunch of words in all their grammatical forms, or all sorts of words in a particular grammatical form (e.g. all in the past tense). I also have a corpus of text that has conveniently been split up into a database of sections, perhaps 150 words each (maybe I would like to determine these sections dynamically in the future, but I shall leave it for now).
My question is this: What is a useful way to get those sections out of the corpus that contain most of my 250 words?
I have looked at a few full text search engines like Lucene, but am not sure they are built to handle long query lists. Bloom filters seem interesting as well. I feel most comfortable in Perl, but if there is something fancy in Ruby or Python, I am happy to learn. Performance is not an issue at this point.
The use case of such a program is in language teaching, where it would be nice to have a variety of word lists that mirror the different extents of learner knowledge, and to quickly find fitting bits of text or examples from original sources. Also, I am just curious to know how to do this.
Effectively what I am looking for is document comparison. I have found a way to rank texts by similarity to a given document, in PostgreSQL.
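Since performance is not a concern, a simple baseline is to count, for each section, how many distinct words from the list it contains and sort by that count. The following is a minimal sketch in Python under that assumption; `rank_sections` and the sample data are hypothetical, and words are matched literally, so grammatical variants would need stemming or lemmatizing first.

```python
import re


def rank_sections(sections, word_list, top_n=3):
    """Rank sections by how many distinct list words they contain."""
    wanted = {w.lower() for w in word_list}
    scored = []
    for sec_id, text in sections.items():
        tokens = set(re.findall(r"\w+", text.lower()))
        scored.append((sec_id, len(wanted & tokens)))
    # Highest coverage first; ties broken by section id
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_n]


# Hypothetical data, for illustration only
sections = {
    "s1": "The cat sat on the mat.",
    "s2": "A dog chased the cat up a tree.",
    "s3": "Stocks fell sharply today.",
}
word_list = ["cat", "dog", "tree", "mat"]

ranking = rank_sections(sections, word_list, top_n=2)
print(ranking)  # [('s2', 3), ('s1', 2)]
```

Dividing the count by the section length, or by the number of distinct words in the section, would favor sections made up mostly of known words rather than simply long ones.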
I need to implement some sort of stemmer/lemmatizer. I have some words in different forms (a few thousand). It's not a morphological dictionary, just a small part of one. Is it a good idea to learn a stemmer automatically from the file I have? Are there any open-source implementations that can be used?
Nuve is an NLP library for Turkic languages. Once the language rules and data are prepared, it can analyze and generate words for any Turkic language, if not for any agglutinative language. You can fork it and prepare new orthography and morphology files for Azeri.
https://github.com/hrzafer/nuve
Since I'm the author, I'd be glad to help you with the process.
Azerbaijani is an agglutinative language, similar to Turkish, which means words frequently have a chain of suffixes (e.g. one suffix for plural and one for accusative). It also has vowel harmony, which means each suffix has several variants and you choose the correct one based on the vowels in the root.
What I would do:
Identify a list of suffixes. I would try both unsupervised methods (maybe Linguistica) and googling for a list of suffixes (these will often contain only a basic suffix, which changes depending on vowel harmony). Iteratively, you should arrive at a reasonable list. If in doubt whether something is a suffix or not, I would throw it in.
Use the list to strip suffixes from words.
The resulting stemmer will be noisy, but depending on what you need it for, it might not matter.
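The two steps above can be sketched roughly as follows in Python. The suffix list here is only an illustrative handful of plural and accusative variants, not a vetted inventory, and `strip_suffixes` and `min_stem` are hypothetical names. Input is assumed to be already lowercased, because default Unicode lowercasing does not handle the dotted and dotless I of Turkic alphabets correctly.

```python
# Illustrative suffix variants only (plural, accusative); a real list would be much longer.
SUFFIXES = ["lar", "lər", "nı", "ni", "nu", "nü", "ı", "i", "u", "ü"]


def strip_suffixes(word, suffixes=SUFFIXES, min_stem=2):
    """Repeatedly strip the longest matching suffix while the stem stays long enough."""
    ordered = sorted(suffixes, key=len, reverse=True)  # try longer suffixes first
    stem = word
    changed = True
    while changed:
        changed = False
        for suffix in ordered:
            if stem.endswith(suffix) and len(stem) - len(suffix) >= min_stem:
                stem = stem[: -len(suffix)]
                changed = True
                break  # restart with the shortened stem
    return stem


print(strip_suffixes("kitabları"))  # kitab
print(strip_suffixes("evlər"))      # ev
print(strip_suffixes("su"))         # su (too short to strip)
```

The `min_stem` guard is one cheap way to limit the noise: it keeps short roots that happen to end in a suffix-like letter from being stripped to nothing.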
You should look at Linguistica, which has been developed by John Goldsmith and his team (University of Chicago) for this purpose.
Are you talking about English? Then please see the question "English lemmatizer databases?". Considering the significant number of exceptions, a machine-learning approach without a large dictionary does not seem promising.
I'm somewhat familiar with stemming, but the stemming library I've been given to use for a project doesn't work very well when I want to find related words. For example, if I query for any of these:
"dental", "dentist", "dentistry"
I should get a match for the others. I've been looking into this and I'm learning about parts of speech I didn't even know existed, like pertainyms and troponyms so I'm wondering if there isn't a library out there that has a mapping between all of these different parts of speech that could give back the sort of match I'm looking for?
I've been searching on this and haven't found a whole lot that I can make sense of. I probably don't know the right terminology, etc and I would greatly appreciate if anyone can point me in the right direction.
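One approach common in IR is to stem all words in the index and the query itself. For example, if the stemmer reduces both 'dentistry' and 'dental' to 'dentist', then documents containing 'dentistry' are stored in the index under 'dentist', and a query for 'dental' is stemmed to 'dentist' as well, so it matches. Whether these particular words actually collapse to one stem depends on how aggressive the stemmer is.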
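Here is a minimal sketch of stemming both sides, in Python. The stemmer is a deliberately tiny toy with three hypothetical suffix rules, chosen only so that the example words collapse to one stem; a real system would plug in a proper stemming library, and its stems may differ.

```python
from collections import defaultdict

TOY_SUFFIXES = ("istry", "ist", "al")  # hypothetical rules, for illustration only


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(docs):
    """Map each stem to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[toy_stem(token.strip(".,;:!?"))].add(doc_id)
    return index


def search(index, query):
    """Stem the query with the same function used for the index."""
    return index.get(toy_stem(query), set())


docs = {
    1: "Modern dentistry relies on imaging.",
    2: "My dentist is on holiday.",
    3: "The weather is fine.",
}
index = build_index(docs)
print(search(index, "dental"))  # {1, 2}
```

The important design point is that the same stemming function is applied at index time and at query time; the stems themselves never need to be real words.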
Have a look at WordNet. WordNet is an organized ontology of words and concepts with links for various types of relations between words. I'm not sure if it will have exactly the relationships you want, but it's probably a good start. There are many interfaces in various programming languages (Java and Python that I've used; presumably many more).
How would you go about parsing a string of free-form text to detect things like locations and names based on a dictionary of locations and names? In my particular application there will be tens of thousands of entries in my dictionaries, if not more, so I'm pretty sure just running through them all is out of the question. Also, is there any way to add "fuzzy" matching so that you can also detect substrings that are within x edits of a dictionary word? If I'm not mistaken, this falls within the field of natural language processing and more specifically named entity recognition (NER); however, my attempts to find information about the algorithms and processes behind NER have come up empty. I'd prefer to use Python for this as I'm most familiar with it, although I'm open to looking at other solutions.
You might try downloading the Stanford Named Entity Recognizer:
http://nlp.stanford.edu/software/CRF-NER.shtml
If you don't want to use someone else's code and you want to do it yourself, I'd suggest taking a look at the algorithm in their associated paper, because the Conditional Random Field model that they use for this has become a fairly common approach to NER.
I'm not sure exactly how to answer the second part of your question on looking for substrings without more details. You could modify the Stanford program, or you could use a part-of-speech tagger to mark proper nouns in the text. That wouldn't distinguish locations from names, but it would make it very simple to find words that are x words away from each proper noun.
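For the dictionary-lookup reading of the question, one simple baseline is to slide a window of one to a few words over the text, look each window up in a hash-based dictionary (constant time per lookup), and fall back to an edit-distance comparison for near misses. The sketch below assumes that approach; `find_entities`, `gazetteer`, and the sample entries are hypothetical. The fuzzy fallback still scans every entry, so for tens of thousands of entries it would likely need an index such as a BK-tree instead.

```python
def edit_distance(a, b):
    """Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def find_entities(text, gazetteer, max_edits=1, max_words=3):
    """Return (matched text, dictionary entry, label) for exact and near matches."""
    tokens = [t.strip(".,;:!?").lower() for t in text.split()]
    hits = []
    for n in range(max_words, 0, -1):          # longer phrases first
        for i in range(len(tokens) - n + 1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in gazetteer:         # exact hit via hash lookup
                hits.append((candidate, candidate, gazetteer[candidate]))
                continue
            for entry, label in gazetteer.items():
                # Cheap length filter before the more expensive distance check
                if (abs(len(entry) - len(candidate)) <= max_edits
                        and edit_distance(candidate, entry) <= max_edits):
                    hits.append((candidate, entry, label))
    return hits


# Hypothetical dictionary entries, for illustration only
gazetteer = {"new york": "LOCATION", "london": "LOCATION", "alice smith": "NAME"}
hits = find_entities("Alice Smith flew from Londn to New York.", gazetteer)
print(hits)
```

With these inputs the misspelled "londn" is matched to "london" at one edit. Overlapping hits are not resolved here, so a real implementation would want to prefer the longest match at each position.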