Is there a way to identify similar noun phrases? Some suggest using pattern-based approaches, for example "X as Y" expressions:
Usain Bolt as Sprint King
Liverpool as Reds
There are many techniques for finding alternative names for a given entity. One is to use patterns such as:
X also known as Y
X also titled as Y
and to scan large collections of documents (e.g., Wikipedia or newspaper articles) for them.
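As a rough sketch (not a full solution), such patterns can be approximated with a regular expression; the non-greedy captures below are a crude stand-in for proper noun-phrase detection, and the example sentence is made up:

import re

# Crude sketch: capture "X, also known as Y" / "X, also titled as Y" pairs.
# The (.+?) captures are placeholders; a real system would restrict X and Y
# to noun phrases found by a chunker or named-entity recognizer.
pattern = re.compile(r"(.+?),?\s+also\s+(?:known|titled)\s+as\s+(.+?)[,.]")

sentence = "Indiana Jones 2, also titled as Indiana Jones and the Temple of Doom, was released in 1984."
for x, y in pattern.findall(sentence):
    print(x, "->", y)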
There are also other alternatives; one I remember is using Wikipedia's inter-link structure, for instance by exploring the redirect links between articles. You can download a file with a list of redirects from https://wiki.dbpedia.org/Downloads2015-04, and by exploring that file you can find alternative names/synonyms for entities, e.g.:
Kennedy_Centre -> John_F._Kennedy_Center_for_the_Performing_Arts
Lord_Alton_of_Liverpool -> David_Alton,_Baron_Alton_of_Liverpool
Indiana_jones_2 -> Indiana_Jones_and_the_Temple_of_Doom
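A small sketch of using those redirect pairs once extracted. It assumes you have already pulled them out of the (RDF) dump into tab-separated source/target lines; that file name and format are an assumption of the sketch, not the format of the download itself:

from collections import defaultdict

# Assumes a pre-extracted file with one "redirect_source<TAB>canonical_title"
# pair per line, e.g.:
#   Kennedy_Centre<TAB>John_F._Kennedy_Center_for_the_Performing_Arts
aliases = defaultdict(set)
with open("redirects.tsv", encoding="utf-8") as f:
    for line in f:
        source, target = line.rstrip("\n").split("\t")
        aliases[target].add(source.replace("_", " "))

# All alternative names that redirect to the canonical article title:
print(aliases["Indiana_Jones_and_the_Temple_of_Doom"])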
Another thing you can do is combine these two techniques: for instance, look for text segments where both Indiana Jones and Indiana_Jones_and_the_Temple_of_Doom occur no more than, say, 4 or 5 tokens apart. You might find patterns like "also titled as", and you can then use these patterns to find more synonyms/alternative names.
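A minimal sketch of that combination step; the entity names, the window size, and the example sentence are just illustrations:

def find_phrase(tokens, phrase):
    # Start indices where the token list `phrase` occurs inside `tokens`.
    n = len(phrase)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase]

def gaps_between(text, name_a, name_b, window=5):
    # Yield the tokens separating two alternative names whenever they occur
    # no more than `window` tokens apart; these gaps are pattern candidates.
    tokens = text.split()
    a, b = name_a.split(), name_b.split()
    for i in find_phrase(tokens, a):
        for j in find_phrase(tokens, b):
            gap = tokens[i + len(a):j] if j > i else tokens[j + len(b):i]
            if 0 < len(gap) <= window:
                yield " ".join(gap)

text = "Indiana Jones 2 , also titled as Indiana Jones and the Temple of Doom , was a box office hit"
print(list(gaps_between(text, "Indiana Jones 2", "Indiana Jones and the Temple of Doom")))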
I am trying to match a list of entries in a given text file. The list is quite huge: it is a list of organization names, where names can have more than one word. Each text file is an ordinary write-up with several paragraphs, totaling approximately 5,000 words per file. It is plain text content, and there is no clear boundary by which I can locate organization names.
I am looking for a way by which all the entries from the list are searched in the text file and whichever gets matched are recognized and tagged.
Is there any tool or framework to do this?
I tried to go through all the text mining tools listed in Wikipedia, but none seems to match this need.
Any inputs would be highly appreciated.
Approach 1: Finite State Machine
You can combine your search terms into a finite state machine (FSM). The resulting FSM can then scan a document for all the terms simultaneously in linear time. Since the FSM can be reused on each document, the expense of creating it is amortized over all the text you have to search.
A good regular expression library will make an FSM under the covers. Writing code to build your own is probably beyond the scope of a Stack Overflow answer.
The basic idea is to start with a regular expression that is an alternation of all your search terms. Suppose your organization list consists of "cat" and "dog". You'd combine those as cat|dog. If you also had to search for "pink pigs", your regular expression would be cat|dog|pink pigs.
From the regular expression, you can build a graph. The nodes of the graph are states, which keep track of what text you've just seen. The edges of the graph are transitions that tell the state machine which state to go to given the current state and the next character in the input. Some states are marked as "final" states, and if you ever get to one of those, you've just found an instance of one of your organizations.
Building the graph from all but the most trivial regular expressions is tedious and can be computationally expensive, so you probably want to find a well-tested regular expression library that already does this work.
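As a rough sketch of the alternation idea in Python (note that the built-in re module uses backtracking rather than a guaranteed linear-time automaton; engines such as RE2 provide that guarantee). The organization names below are made up:

import re

# Hypothetical organization list; sorting longest-first makes the alternation
# prefer "pink pigs international" over the shorter "pink pigs".
org_names = ["pink pigs international", "pink pigs", "cat", "dog"]
alternation = "|".join(re.escape(n) for n in sorted(org_names, key=len, reverse=True))
pattern = re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)

text = "The dog barked at a cat owned by Pink Pigs International."
for match in pattern.finditer(text):
    print(match.start(), match.group(0))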
Approach 2: Search for One Term at a Time
Depending on how many search terms you have, how many documents you have, and how fast your simple text searching tool is (possibly sub-linear), it may be best to just loop through the terms and search each document for each term as a separate command. This is certainly the simplest approach.
for doc in documents:
    for term in search_terms:
        search(term, doc)  # search() stands in for whatever text-search tool you use
Note that nesting the loops this way is probably most friendly to the disk cache.
This is the approach I would take if this were a one-time task. If you have to keep searching new documents (or with different lists of search terms), this might be too expensive.
Approach 3: Suffix Tree
Concatenate all the documents into one giant document, build a suffix tree, sort your search terms, and walk through the suffix tree looking for matches. Most of the details for building and using a suffix array are in a Jon Bentley article from Dr. Dobb's, but you can find many other resources for them as well.
This approach is memory intensive and probably not very cache-friendly, but it is very fast.
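A toy sketch of the suffix-array variant; the naive construction below is only for illustration (it copies every suffix, so it does not scale), and the corpus string is made up:

def build_suffix_array(text):
    # Naive construction: sort suffix start positions by the suffix text.
    # Real corpora need a proper O(n log n) or linear-time construction.
    return sorted(range(len(text)), key=lambda i: text[i:])

def occurs(text, sa, term):
    # Binary-search the sorted suffixes for one that starts with `term`.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + len(term)] < term:
            lo = mid + 1
        else:
            hi = mid
    return lo < len(sa) and text[sa[lo]:sa[lo] + len(term)] == term

corpus = "the cat sat on the mat. pink pigs fly. a dog barked."
sa = build_suffix_array(corpus)
for term in sorted(["dog", "pink pigs", "zebra"]):
    print(term, occurs(corpus, sa, term))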
Use a prefix tree aka Trie.
Load all your candidate names into the prefix tree.
For your documents, match them against the tree.
A prefix tree looks roughly like this:
{}
+-> a
|   +-> ap
|   |   +-> ... apple
|   +-> az
|       +-> ... azure
+-> b
    +-> ba
        +-> ... banana republic
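A minimal dict-of-dicts sketch of that trie plus a naive scan over a document; the names are the ones from the diagram. A production version would use something like Aho-Corasick (e.g., the pyahocorasick package) so the walk does not restart at every character:

END = object()  # marker meaning "a complete name ends at this node"

def build_trie(names):
    root = {}
    for name in names:
        node = root
        for ch in name.lower():
            node = node.setdefault(ch, {})
        node[END] = name
    return root

def find_matches(text, trie):
    # Start a trie walk at every position and report full-name hits.
    text_l = text.lower()
    for start in range(len(text_l)):
        node = trie
        for pos in range(start, len(text_l)):
            node = node.get(text_l[pos])
            if node is None:
                break
            if END in node:
                yield start, node[END]

trie = build_trie(["apple", "azure", "banana republic"])
print(list(find_matches("She wore Banana Republic and used Azure.", trie)))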
I'm playing around with a recommendation system that takes key descriptive words and phrases and matches them against others. Specifically, I'm focusing on flavors in beer, with an algorithm searching for things like malty or medium bitterness, pulling those out, and then comparing against other beers to come up with flavor recommendations.
Currently, I'm struggling with the extraction. What are some techniques for identifying words and standardizing them for later processing?
How do I pull out hoppy and hops and treat them as the same word, while keeping in mind that very hoppy and not enough hops have different meanings modified by the preceding word(s)? I believe I can use stemming for things like plurals and suffixed/prefixed words, but what about pairs or more complicated patterns? What techniques exist for this?
I would first ignore the finer-grained distinctions and compile a list of lexico-semantic patterns that can be used to extract some information structure, for example:
<foodstuff> has a <taste-description> taste
<foodstuff> tastes <taste-description>
very <taste-description>
not enough <taste-description>
You can use instances of such patterns in your text to infer useful concepts (such as different taste descriptions) which can then be used again in order to bootstrap the extraction of new patterns and thus new concepts.
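A rough sketch of those patterns as regular expressions over made-up review snippets; the captured terms seed a taste lexicon that later passes can use to discover new contexts and, from them, new patterns:

import re

# Each pattern captures a candidate taste description in group 1.
patterns = [
    re.compile(r"has an? ([\w-]+) taste", re.I),
    re.compile(r"tastes ([\w-]+)", re.I),
    re.compile(r"very ([\w-]+)", re.I),
    re.compile(r"not enough ([\w-]+)", re.I),
]

reviews = [
    "This stout has a roasty taste.",
    "Tastes malty with medium bitterness; very hoppy but not enough citrus.",
]

taste_terms = set()
for review in reviews:
    for pattern in patterns:
        taste_terms.update(m.lower() for m in pattern.findall(review))
print(taste_terms)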
Let’s say I have a list of 250 words, which may consist of unique entries throughout, or a bunch of words in all their grammatical forms, or all sorts of words in a particular grammatical form (e.g. all in the past tense). I also have a corpus of text that has conveniently been split up into a database of sections, perhaps 150 words each (maybe I would like to determine these sections dynamically in the future, but I shall leave it for now).
My question is this: What is a useful way to get those sections out of the corpus that contain most of my 250 words?
I have looked at a few full text search engines like Lucene, but am not sure they are built to handle long query lists. Bloom filters seem interesting as well. I feel most comfortable in Perl, but if there is something fancy in Ruby or Python, I am happy to learn. Performance is not an issue at this point.
The use case of such a program is in language teaching, where it would be nice to have a variety of word lists that mirror the different extents of learner knowledge, and to quickly find fitting bits of text or examples from original sources. Also, I am just curious to know how to do this.
Effectively what I am looking for is document comparison. I have found a way to rank texts by similarity to a given document, in PostgreSQL.
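For what it's worth, a tiny sketch of the brute-force version of this (before reaching for Lucene or Bloom filters): score every section by how much of the word list it covers. The section texts and the word list below are placeholders:

import re

def coverage(section_text, word_list):
    # Fraction of the word list that appears in this section.
    tokens = set(re.findall(r"[a-z']+", section_text.lower()))
    return len(word_list & tokens) / len(word_list)

word_list = {"walk", "walked", "walking", "run", "ran"}        # placeholder
sections = {"s1": "He walked to town and ran home.",           # placeholder
            "s2": "Nothing relevant here."}

ranked = sorted(sections, key=lambda s: coverage(sections[s], word_list), reverse=True)
print([(s, round(coverage(sections[s], word_list), 2)) for s in ranked])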
Is there any open source/free software available that gives you semantically related keywords for a given word? For example, for the word dog it should give keywords like animal, mammal, ...
Or for the word France it should give you keywords like country, Europe, ...
Basically, a set of keywords related to the given word.
Or, if there is not, does anybody have an idea of how this could be implemented and how complex it would be?
Best regards
WordNet might be what you need. WordNet groups English words into sets of synonyms, provides general definitions, and records the various semantic relations between these groups.
There are tons of projects out there using WordNet; here is a list:
http://wordnet.princeton.edu/wordnet/related-projects/
Look at this one in particular; you might find it especially useful (http://kylescholz.com).
You can see the live demo here:
http://kylescholz.com/projects/wordnet/?text=dog
I hope this helps.
Yes. A company named Saplo in Sweden specializes in this. I believe you can use their API for this, and if you ask nicely you might be able to use it for free (if it's not for commercial purposes, of course).
Saplo
Yes. What you are looking for is something similar to the vector space model for searching, and it is an efficient way of doing this. There are some open source libraries available for latent semantic indexing/searching (a special case of the vector space model). Apache Lucene is one of the most popular ones. Or try something from Google Code.
If you are looking for online resources, there are several to consider (at least in 2017; the OP is dated 2010).
Semantic Link (http://www.semantic-link.com): The creator of Semantic Link offers an interface to the results of a computation of a metric called "mutual information" on pairs of words over all of English Wikipedia. Only words occurring more than 1000 times in Wikipedia are available.
"Dog" gets you, for example: purebred, breeds, canine, pet, puppies.
It seems, however, you are really looking for an online tool that gives hyponyms and hypernyms. From the Wikipedia page for "Hyponymy and hypernymy":
In linguistics, a hyponym (from Greek hupó, "under" and ónoma, "name") is a word or phrase whose semantic field is included within that of another word, its hyperonym or hypernym (from Greek hupér, "over" and ónoma, "name"). In simpler terms, a hyponym shares a type-of relationship with its hypernym. For example, pigeon, crow, eagle and seagull are all hyponyms of bird (their hyperonym), which, in turn, is a hyponym of animal.
WordNet (https://wordnet.princeton.edu) has this information and has an online search tool. With this tool, if you enter a word, you'll get one or more entries with an "S" beside them. If you click the "S", you can browse the "Synset (semantic) relations" of the word with that meaning or usage, and this includes direct hyper- and hyponyms. It's incredibly rich!
For example: "dog" (as in "domestic dog") --> "canine" --> "carnivore" --> "placental mammal" --> "vertebrate" --> "chordate" --> etc., or "dog" --> "domestic animal" --> "animal" --> "organism" --> "living thing" --> etc.
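If you want to do this programmatically rather than through the web interface, NLTK's WordNet reader exposes the same data. A small sketch, assuming the wordnet corpus has been downloaded via nltk.download("wordnet"):

from nltk.corpus import wordnet as wn

dog = wn.synset("dog.n.01")   # the "domestic dog" sense
print(dog.lemma_names())      # synonyms in this synset

# Walk one hypernym chain upward ("dog" actually has two direct hypernyms,
# canine and domestic animal, matching the two chains described above).
node = dog
while node.hypernyms():
    node = node.hypernyms()[0]
    print("->", node.name())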
There is also WordNik, which lists hypernyms and reverse dictionary words (words with the given word in their definition). Hypernyms for "France" include "european country/nation", and the reverse dictionary includes regions and cities in France, names of certain rulers, etc. "Dog" gets the hypernym "domesticated animal" (and others).
I want to analyze answers to a web survey (Git User's Survey 2008, if anyone is interested). Some of the questions were free-form questions, like "How did you hear about Git?". With more than 3,000 replies, analyzing them entirely by hand is out of the question (especially since there are quite a few free-form questions in this survey).
How can I group those replies (probably based on the keywords used in the response) into categories at least semi-automatically (i.e., the program can ask for confirmation), and later how do I tabulate (count the number of entries in each category) those free-form replies (answers)? One answer can belong to more than one category, although for simplicity one can assume that categories are orthogonal/exclusive.
What I'd like to know is at least keyword to search for, or an algorithm (a method) to use. I would prefer solutions in Perl (or C).
Possible solution No 1. (partial): Bayesian categorization
(added 2009-05-21)
One solution I thought about would be to use something like the algorithm (and the mathematical method behind it) used for Bayesian spam filtering, only instead of one or two categories ("spam" and "ham") there would be more, and the categories themselves would be created adaptively/interactively.
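The question asks for Perl (where a module such as Algorithm::NaiveBayes would play this role), but as a language-neutral sketch, here is the idea with scikit-learn in Python: hand-label a small seed of replies, train, then review the model's guesses and fold the confirmed ones back into the training set. The replies and category names below are invented:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical hand-labeled seed replies and their categories.
seed_replies = ["read about it on a mailing list", "a friend at work told me",
                "saw it on a blog post", "a coworker recommended it"]
seed_labels = ["online", "word-of-mouth", "online", "word-of-mouth"]

vectorizer = CountVectorizer()
classifier = MultinomialNB().fit(vectorizer.fit_transform(seed_replies), seed_labels)

# Guesses for new replies, to be confirmed or corrected interactively.
new_replies = ["found a blog article about git"]
print(classifier.predict(vectorizer.transform(new_replies)))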
Text::Ngrams + Algorithm::Cluster
Generate some vector representation for each answer (e.g. word count) using Text::Ngrams.
Cluster the vectors using Algorithm::Cluster to determine the groupings and also the keywords which correspond to the groups.
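The same pipeline sketched in Python with scikit-learn instead of Text::Ngrams and Algorithm::Cluster; the replies and the cluster count are made up, and in practice you would tune the number of clusters and inspect the top terms per cluster:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

replies = ["read a blog post", "a coworker told me", "saw it on reddit",
           "my team started using it", "found it on hacker news"]

# Bag-of-words vectors for each reply, then k-means clusters over them.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(replies)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Highest-weight terms per cluster serve as candidate keywords for the group.
terms = vectorizer.get_feature_names_out()
for label in range(km.n_clusters):
    top = km.cluster_centers_[label].argsort()[::-1][:3]
    print(label, [terms[i] for i in top])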
You are not going to like this, but: if you do a survey and you include lots of free-form questions, you had better be prepared to categorize them manually. If that is out of the question, why did you have those questions in the first place?
I've brute-forced stuff like this in the past with quite large corpora: Lingua::EN::Tagger, Lingua::Stem::En. The Net::Calais API is also (unfortunately, as Thomson Reuters are not exactly open source friendly) pretty useful for extracting named entities from text. Of course, once you've cleaned up the raw data with this stuff, the actual data munging is up to you. I'd be inclined to suspect that frequency counts and a bit of Mechanical Turk cross-validation of the output would be sufficient for your needs.
Look for common words as keywords, but throw away meaningless ones like "the", "a", etc. After that you get into natural language stuff that is beyond me.
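A few lines to sketch that idea; the stop-word list here is just a tiny hand-picked one (NLTK and other libraries ship fuller lists):

import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "of", "to", "it", "on", "i", "in"}  # tiny hand-picked list

def keywords(reply, top_n=3):
    # Count the words in a reply, dropping the meaningless ones.
    words = re.findall(r"[a-z']+", reply.lower())
    return Counter(w for w in words if w not in STOPWORDS).most_common(top_n)

print(keywords("I read about it on the git mailing list and on a blog"))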
It just dawned on me that the perfect solution for this is AAI (Artificial Artificial Intelligence). Use Amazon's Mechanical Turk. The Perl bindings are Net::Amazon::MechanicalTurk. At one penny per reply with a decent overlap (say three humans per reply) that would come to about $90 USD.