What is the purpose of dbo:wikiPageDisambiguates in DBpedia? - dbpedia

I am trying to figure out the purpose of dbo:wikiPageDisambiguates in DBpedia. I cannot find a definition of what it means.

You may find this is clarified by looking at the original Wikipedia page from which the DBpedia page was derived.
Bluntly, some Wikipedia entities are identified ambiguously: the string "Gerrard" may refer to any of several entities. A disambiguation page is therefore created, so that someone searching for "Gerrard" is presented with a list of the entities they might have meant and can (for instance) view the page about Paul Gerrard (an entity of type person) rather than a page about Gerrard Street (a named roadway). Gerrard Street happens to exist in both London and Ontario, so there is another disambiguation page that lets you reach the page about the street in the city you care about.
These pages are not necessarily exhaustive, as Wikipedia and DBpedia are living resources, growing and evolving as people add and correct and adjust data.
It would probably be good if some version of the above were put into the description of dbo:wikiPageDisambiguates
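One way to see what the property does is to ask a SPARQL endpoint for everything a disambiguation page points to. The following is only a minimal sketch: it assumes the public endpoint lives at https://dbpedia.org/sparql, that it accepts a `format` parameter, and that the disambiguation page is the resource named Gerrard. The helper names are made up for illustration, and the code only builds the query and URL without sending anything.

```python
from urllib.parse import urlencode

# Assumption: address of the public DBpedia SPARQL endpoint.
ENDPOINT = "https://dbpedia.org/sparql"


def disambiguation_query(page_name: str) -> str:
    """Build a SPARQL query listing the resources a disambiguation page links to."""
    return (
        "PREFIX dbo: <http://dbpedia.org/ontology/>\n"
        "SELECT ?candidate WHERE {\n"
        f"  <http://dbpedia.org/resource/{page_name}> "
        "dbo:wikiPageDisambiguates ?candidate .\n"
        "}"
    )


def query_url(page_name: str) -> str:
    """Build a GET URL for the query; fetching it is left to the caller."""
    params = {
        "query": disambiguation_query(page_name),
        # Assumption: the endpoint understands this result-format parameter.
        "format": "application/sparql-results+json",
    }
    return f"{ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    print(disambiguation_query("Gerrard"))
    print(query_url("Gerrard"))
```

Each ?candidate returned would be one of the entities the ambiguous name might refer to.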

Related

How to determine if a piece of text mentions a product

I'm new to natural language processing, so I apologize if my question is unclear. I have read a book or two on the subject and done general research on various libraries to figure out how I should be doing this, but I'm not yet confident that I know what to do.
I'm playing with an idea for an application, and part of it is trying to find product mentions in unstructured text (e.g. tweets, Facebook posts, emails, websites, etc.) in real time. I won't go into what the products are, but it can be assumed that they are known (stored in a file or database). Some examples:
"starting tomorrow, we have 5 boxes of #hersheys snickers available for $5 each - limit 1 pp" (snickers is the product from the hershey company [mentioned as "#hersheys"])
"Big news: 12-oz. bottles of Coke and Pepsi on sale starting Fri." (coca-cola is the product [aliased as "coke"] from coca-cola company and Pepsi is the product from the PepsiCo company)
"#OMG, i just bought my dream car. a mustang!!!!" (mustang is the product from Ford)
So basically, given a piece of text, query the text to see if it mentions a product and receive some indication (boolean or confidence number) that it does mention the product.
Some concerns I have are:
Missing products because of misspellings. I thought maybe I could use a string similarity check to catch these.
Product names that are also English words or things would get caught. Like mustang the horse versus mustang the car
Needing to keep a list of alternative names for products (e.g. "coke" for "coca-cola", etc.)
I don't really know where to start with this, but any help would be appreciated. I've already looked at NLTK and SciKit and didn't really glean how to do this from there. If you know of examples or papers that explain this, links would be helpful. I'm not tied to any language at this point. Java preferably, but Python and Scala are acceptable.
The answer that you chose does not really answer your question.
The best approach you can take is to use a Named Entity Recognizer (NER) and a POS tagger (grab NNP/NNPS, i.e. proper nouns). The database there might be missing some new brands like Lyft (Uber's rival), but without developing your own prop database, the Stanford tagger will solve half of your immediate needs.
If you have time, I would build a dictionary that has every brand name and simply extract matches from the tweet strings.
http://www.namedevelopment.com/brand-names.html
If you know how to crawl, it's not a hard problem to solve.
It looks like your goal is to classify linguistic forms in a given text as references to semantic entities (which can be referred to by many different linguistic forms). You describe a number of subtasks which should be done in order to get good results, but they nevertheless are still independent tasks.
Misspellings
In order to deal with potential misspellings of words, you need to associate these possible misspellings to their canonical (i.e. correct) form.
Phonetic similarity: Many "misspellings" stem from opacity in the relationship between the word's phonetic form (i.e. how it sounds) and its orthographic form (i.e. how it's spelled). Therefore, a good way to address this is to index terms phonetically so that e.g. innovashun is associated with innovation.
Form similarity: Additionally, you could do a string similarity check, but you may introduce a lot of noise into your results which you would have to address because many distinct words are in fact very similar (e.g. chic vs. chick). You could make this a bit smarter by first morphologically analyzing the word and then using a tree kernel instead.
Hand-made mappings: You can also simply make a list of common misspelling → canonical_form mappings. This would work well for "exceptions" not handled by the above methods.
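The hand-made mapping and the string similarity check can be combined in a few lines. This is a rough sketch rather than a recommended implementation: the product list, the alias table, and the 0.8 cutoff are all made up for illustration, and the similarity measure is the one in Python's standard difflib module, not a phonetic index or a tree kernel.

```python
from difflib import get_close_matches

# Hypothetical data: known product names and a hand-made alias/misspelling table.
CANONICAL = ["snickers", "coca-cola", "pepsi", "mustang"]
ALIASES = {"coke": "coca-cola", "coco-cola": "coca-cola"}


def canonical_form(token, cutoff=0.8):
    """Map a raw token to a known product name, or None if nothing is close."""
    word = token.lower().strip("#.,!?")
    if word in CANONICAL:          # exact hit
        return word
    if word in ALIASES:            # hand-made mapping
        return ALIASES[word]
    # Fall back to fuzzy matching; the cutoff is an arbitrary assumption.
    matches = get_close_matches(word, CANONICAL, n=1, cutoff=cutoff)
    return matches[0] if matches else None


if __name__ == "__main__":
    for token in ["Coke", "snikers", "mustang!!!!", "bottles"]:
        print(token, "->", canonical_form(token))
```

String similarity on its own would also match near-identical but distinct words such as chic and chick, which is the noise mentioned above, so the cutoff needs tuning against real data.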
Word-sense disambiguation
Mustang the car and Mustang the horse are the same form but refer to entirely different entities (or rather classes of entities, if you want to be pedantic). In fact, we ourselves as humans can't tell which one is meant unless we also know the word's context. One widely-used way of modelling this context is distributional lexical semantics: Defining a word's semantic similarity to another as the similarity of their lexical contexts, i.e. the words preceding and succeeding them in text.
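To make the idea of using context concrete, here is a deliberately simplified sketch. It does not learn distributional similarity from a corpus; instead it counts how many hand-picked cue words for each sense appear in a window around the target word. The cue sets, the window size, and the function name are assumptions for illustration only.

```python
import re

# Hypothetical cue words per sense; a real system would learn these from a corpus.
SENSE_CUES = {
    "car": {"car", "drive", "bought", "ford", "engine", "dealer"},
    "horse": {"horse", "wild", "herd", "ride", "ranch", "hooves"},
}


def guess_sense(text, target="mustang", window=5):
    """Pick the sense whose cue words overlap most with the target's context."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if target not in tokens:
        return None
    i = tokens.index(target)
    # Context = up to `window` words before and after the first occurrence.
    context = set(tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window])
    scores = {sense: len(context & cues) for sense, cues in SENSE_CUES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


if __name__ == "__main__":
    print(guess_sense("#OMG, i just bought my dream car. a mustang!!!!"))
    print(guess_sense("A wild mustang ran with the herd"))
```

When no cue word appears in the window the function returns None, which is the case where a human reader could not decide either.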
Linguistic aliases (synonyms)
As stated above, any given semantic entity can be referred to in a number of different ways: bathroom, washroom, restroom, toilet, water closet, WC, loo, little boys'/girls' room, throne room etc. Simple references to generic entities like this can often be treated as variant spellings, in the same way that "common misspellings" are, and mapped to a "canonical" form with a list. For ambiguous references such as throne room, other metrics (such as lexical-distributional methods) can also be included in order to disambiguate the meaning, so that you don't relate e.g. "I'm in the throne room just now!" to "The throne room of Buckingham Palace is beautiful."
Conclusion
You have a lot of work to do in order to get where you want to go, but it's all interesting stuff and there are already good libraries available for doing most of these tasks.

Is OpenNLP able to extract keyword from content?

Is OpenNLP able to extract keyword from content?
If yes, how?
If no, which tool should I use?
I would like to tag content automatically.
For example.
Jessica Chastain has revealed that a meeting has taken place with Marvel over an undisclosed role, although the star has confirmed it is not Captain Marvel.
“We’ve talked about aligning our forces in the future,” Chastain told MTV of her relationship with the studio. “And here’s the thing with me… If you’re going to be in a superhero movie, you only get one chance.”
“You’re that character forever. So why do a superhero movie and play the boring civilian?” A possible reference to Maya Hansen there? Chastain had been attached to the Iron Man 3 character before eventually dropping out on account of scheduling difficulties…
“I don’t want to say too much,” continues the star, “but there was one thing, there was a possibility in the future of the character becoming… And I was like, ‘I understand that, but I want to do it now!’”
Just who that character might be is up for interpretation, although Chastain has moved to quash subsequent rumours that she is in line to play Captain Marvel.
It should be tagged as "superhero", "movie".
Is OpenNLP able to do this?
Thanks.
OpenNLP is able to extract Named entities for you. This means anything that is the name of a person, place, organization etc. would potentially be recognized by the system.
However, what you are looking for is keyword extraction, where you want to identify relevant keywords that explain a document in the general sense. I would recommend checking out Alchemyapi.com
They have models to extract keywords, taxonomy, named entities amongst other things. The only issue is that the free version just gives you 1000 transactions per day (which might be enough for your task)
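To illustrate how keyword extraction differs from named entity recognition, here is a very naive sketch that ranks words by frequency after dropping stopwords. It is not how OpenNLP or AlchemyAPI work; the stopword list and the function name are made up, and real keyword extractors use much richer statistics.

```python
import re
from collections import Counter

# Tiny hypothetical stopword list; a real one would be far longer.
STOPWORDS = {"the", "and", "that", "has", "with", "for", "you", "but", "not", "was"}


def top_keywords(text, k=5):
    """Return the k most frequent non-stopword words of three or more letters."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(k)]


if __name__ == "__main__":
    sample = "Superhero movie news: the superhero movie cast likes superhero roles"
    print(top_keywords(sample, k=2))
```

Unlike an entity recognizer, this picks up common nouns such as "superhero" and "movie" rather than names of people or organizations.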

DBpedia resource name standard

Do DBpedia names follow any standard or convention? For example, United Kingdom has a resource named United_Kingdom. But the pattern of underscores plus capitalizing every word doesn't always hold. For instance, take University_of_Manchester; if you type it as University_Of_Manchester with a capital ‘O’ in “of,” you won't get the resource. Is it necessary to filter in order to match the resource name in its proper case? We would like to lowercase everything, replace spaces with underscores, and query directly, because filtering in SPARQL takes some time.
Any suggestions? I've just started to learn about DBpedia, so I may be missing something.
DBpedia encodes the information available in Wikipedia, and its naming convention is based on the names of Wikipedia articles. The DBpedia wiki page, The DBpedia Data Set, says, in Section 3. Denoting or Naming “Things”:
Each thing in the DBpedia data set is denoted by a de-referenceable IRI- or URI-based reference of the form http://dbpedia.org/resource/Name, where Name is derived from the URL of the source Wikipedia article, which has the form http://en.wikipedia.org/wiki/Name. Thus, each DBpedia entity is tied directly to a Wikipedia article. Every DBpedia entity name resolves to a description-oriented Web document (or Web resource).
Until DBpedia release 3.6, we only used article names from the English Wikipedia, but since DBpedia release 3.7, we also provide Internationalized Datasets that contain IRIs like http://xx.dbpedia.org/resource/Name, where xx is a Wikipedia language code and Name is taken from the source URL, http://xx.wikipedia.org/wiki/Name.
Thus, since the Wikipedia article is University of Manchester, not University Of Manchester, the DBpedia resource is http://dbpedia.org/resource/University_of_Manchester (displayed at http://dbpedia.org/page/University_of_Manchester), and not http://dbpedia.org/resource/University_Of_Manchester.
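As a small illustration of the convention quoted above, the sketch below derives a resource IRI from a Wikipedia article title. It assumes the title is already in Wikipedia's exact casing, that only the first letter is capitalized automatically, and that no percent-encoding is needed; the function name is hypothetical.

```python
def dbpedia_resource_iri(article_title, lang=None):
    """Derive a DBpedia resource IRI from a Wikipedia article title."""
    # Spaces become underscores, as in Wikipedia URLs.
    name = article_title.strip().replace(" ", "_")
    # Assumption: only the first character is normalized to upper case;
    # every other character is case-sensitive and kept as given.
    name = name[:1].upper() + name[1:]
    host = f"{lang}.dbpedia.org" if lang else "dbpedia.org"
    return f"http://{host}/resource/{name}"


if __name__ == "__main__":
    print(dbpedia_resource_iri("University of Manchester"))
    print(dbpedia_resource_iri("University Of Manchester"))  # a different IRI
```

The function cannot repair wrong casing: a lowercased or title-cased input simply yields a different IRI, so the exact article title still has to come from somewhere, such as a label lookup.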

Accurate algorithm for normalizing taxonomy terms?

I'm developing a shopping comparison website, and the project is in a very advanced stage. We index 50 million products daily using merchant feeds from various affiliate networks. Most of the problems I had are already solved, including the majority of the performance bottlenecks.
What is my problem: first of all, we are using Apache Solr with Drupal, BUT this problem IS NOT specific to Drupal or Solr; if you do not have knowledge of them, it doesn't matter.
We receive product feeds from over 2000 different merchants, and those feeds are a mess. They have no specific pattern; each merchant sends the feeds the way they want. We have already solved many problems regarding this, but one remains: normalizing the taxonomy terms for the faceted browsing functionality.
Suppose that I have a "Narrow by Brands" browsing facet on my website. Now suppose that 100 merchants offer products from Microsoft. Now comes the problem. Some merchants put "Microsoft" in the "Brands" column of the data feed, others "Microsoft, Inc.", others "Microsoft Corporation", others "Products from Microsoft", etc. There is no specific pattern between merchants and, worse, some individual merchants are so sloppy that they have different strings for the same brand IN THE SAME DATA FEED.
We do not want all those different brands appearing in the navigation. We have a manual solution to the problem where we manually map the imported brands to the "good" brands table ("Microsoft Corporation" -> "Microsoft", "Products from Microsoft" -> "Microsoft", etc.). We have something like 10,000 brands in the database and this is doable. The problem is when it comes to bigger things like "Authors". When we import books into the system, there are over 800,000 authors and we have the same problem, and this is not doable by hand mapping. The problem is the same: "Tom Mike Apostol", "Tom M. Apostol", "Apostol, Tom M.", etc.
Does anybody know a good way to automatically solve this problem with an acceptable degree of accuracy (85%-95% accuracy)?
Thank you for the help!
One idea that comes to mind, although it's just a loose thought:
1. Convert names to initials (in your example: TMA). Treat '-' as a space, so e.g. Antoine de Saint-Exupéry would be ADSE. The problem here is how to treat ",", although its common usage is to put the surname before the forename, so just swapping positions should work (so A,TM becomes TM,A; get rid of the comma: TMA).
2. Filter authors in the database by those initials.
3. For each initial, if you have the whole name (Tom, Apostol), check whether it matches; otherwise (M.) consider it a match automatically.
4. If you want some tolerance, you can compare names with Levenshtein distance and tolerate some differences (here you have an Oracle implementation).
5. Names that match are treated as the same author. To find the whole name, for each initial (T, M, A) look through your filtered authors (after step 2) and try to find one with the whole name (Mike) rather than just the initial (M.); if you can't find one, use the initial. Each of the examples you gave would therefore be converted to the same value, the full name (Tom Mike Apostol).
Things worth thinking about:
Include mappings for name synonyms (most likely a hundred or so records at most, like Thomas <-> Tom).
With this approach it is crucial to have valid initials (no M instead of N, etc.).
Edit: I coded something like this some time ago, when I had to identify a person by their signature, ignoring scanning problems. People sometimes sign as Name S. Surname, or N.S., or just Name Surname (which is another thing you might consider in the solution: allowing the algorithm to ignore the second name, although in your situation I guess it would be rather rare to omit someone's second name).
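The initials-based matching in this answer could look roughly like the following. This is a sketch under simplifying assumptions, not the code referred to in the edit: it has no Levenshtein tolerance and no nickname table, it requires the same number of name parts on both sides, and all function names are invented for illustration.

```python
def name_parts(raw):
    """Split an author string into lowercase forename-first parts."""
    if "," in raw:
        # "Surname, Forenames" -> "Forenames Surname"
        surname, _, forenames = raw.partition(",")
        raw = f"{forenames} {surname}"
    raw = raw.replace("-", " ")  # treat hyphens as spaces
    return [p.strip(".").lower() for p in raw.split() if p.strip(".")]


def initials(raw):
    """Initials in forename-first order, e.g. 'TMA'."""
    return "".join(part[0] for part in name_parts(raw)).upper()


def same_author(a, b):
    """True if both strings have equal initials and no conflicting full names."""
    pa, pb = name_parts(a), name_parts(b)
    if len(pa) != len(pb) or initials(a) != initials(b):
        return False
    # A bare initial matches automatically; two full names must be identical.
    return all(len(x) == 1 or len(y) == 1 or x == y for x, y in zip(pa, pb))


def fullest_name(variants):
    """Merge already-matched variants, keeping the longest form of each part."""
    columns = zip(*(name_parts(v) for v in variants))
    return " ".join(max(col, key=len).capitalize() for col in columns)


if __name__ == "__main__":
    variants = ["Tom M. Apostol", "Apostol, Tom M.", "Tom Mike Apostol"]
    print(initials("Antoine de Saint-Exupéry"))
    print(all(same_author(variants[0], v) for v in variants))
    print(fullest_name(variants))
```

In a database setting, initials() would supply the key for the filtering step, so that same_author() only runs on the small group of candidates sharing those initials.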

Synonym style text lookup and parsing

We have a client who is looking for a means to import and categorize a large amount of textual data. It's been suggested that the easiest way to do this would be to look at the description field and try to match the words held there to see if a category can be derived for that particular record.
It was thought the best way to do this would be matching the words to key words held against each category and if that was unsuccessful then to use some kind of synonym look up to see if this could be used instead. So for example, if a particular record had the word "automobile" in it then a synonym look up could match that word to the word "car" which would be held against the category "vehicle".
Does anyone know of a web service or other means of looking up a dictionary to find synonyms for a particular word? The project manager has suggested buying a Google Enterprise Search license for this but from what I can make out that doesn't offer what these guys are looking for.
Any other suggestions for getting the client what they are looking for would be gratefully accepted.
Thanks! I'll look into WordNet.
Do you know of any other types of textual classification software products out there? I see there's some discussion of using Bayesian algorithms for this, but I can't see any real-world examples of it.
The first thing that comes to mind is WordNet. WordNet is a human-generated database of words and related words, including synonyms. The Wikipedia WordNet entry lists several interfaces to WordNet. I believe some of them are web services.
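The keyword-then-synonym lookup described in the question might be sketched as follows. The category keywords and the synonym table are invented for illustration; the small dictionary here merely stands in for a real thesaurus such as WordNet, whose API is not used.

```python
import re

# Hypothetical data: keywords held against each category.
CATEGORY_KEYWORDS = {
    "vehicle": {"car", "truck"},
    "furniture": {"chair", "table"},
}
# Hypothetical synonym table standing in for a thesaurus lookup.
SYNONYMS = {"automobile": {"car"}, "lorry": {"truck"}, "seat": {"chair"}}


def category_for(word):
    """Return the category whose keyword set contains the word, if any."""
    for category, keywords in CATEGORY_KEYWORDS.items():
        if word in keywords:
            return category
    return None


def categorize(description):
    """Try direct keyword matches first, then fall back to synonyms."""
    words = re.findall(r"[a-z]+", description.lower())
    for word in words:                      # first pass: direct keywords
        category = category_for(word)
        if category:
            return category
    for word in words:                      # second pass: synonym lookup
        for synonym in SYNONYMS.get(word, ()):
            category = category_for(synonym)
            if category:
                return category
    return None


if __name__ == "__main__":
    print(categorize("Second-hand automobile, low mileage"))
    print(categorize("Fresh apples"))
```

Records that fall through both passes return None and would need manual review or a statistical classifier.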
You can also roll your own. Manning and Schutze's chapter 5 (free PDF) shows ways to do this.
Having said that, are you solving the right problem? How do you build the category list?
Is it a hierarchy? a tag cloud? See Clay Shirky's Ontology is Overrated for a critique of hierarchical categories. I believe that synonyms are less important if you base your classification on sets of words (Naive Bayes, for example) rather than on single words.
You should look at using WordNet. You can visit their website http://wordnet.princeton.edu/ to get more information, but there are libraries available for integrating against them in lots of languages.
Go to their online tool to see the use of it in action here: http://wordnetweb.princeton.edu/perl/webwn. If you look up a word, then click on "S" next to each definition, you'll get a list of semantically related words to that definition.
I also think you should check out software that will allow you to perform "document clustering." Here is an example: http://glaros.dtc.umn.edu/gkhome/cluto/cluto/overview. That should help you bootstrap the category creation process.
I think this will help get you a long way toward what you want!
For text classification you can take a look at Apache Mahout.

Resources