DBpedia resource name standard - dbpedia

Do DBpedia resource names follow any standard or convention? For example, the United Kingdom has a resource named United_Kingdom. But the pattern of joining words with underscores and capitalizing each word doesn't always hold. Take University_of_Manchester: if you type it as University_Of_Manchester, with a capital "O" in "of", you won't get the resource. Is it necessary to filter in order to find the resource name in its proper case? We would prefer to lowercase all letters, replace spaces with underscores, and query directly, because filtering in SPARQL takes some time.
Any suggestions? I've just started to learn about DBpedia, so I may be missing something.

DBpedia encodes the information available in Wikipedia, and its naming convention is based on the names of Wikipedia articles. The DBpedia wiki page, The DBpedia Data Set, says, in Section 3. Denoting or Naming “Things”:
Each thing in the DBpedia data set is denoted by a de-referenceable IRI- or URI-based reference of the form http://dbpedia.org/resource/Name, where Name is derived from the URL of the source Wikipedia article, which has the form http://en.wikipedia.org/wiki/Name. Thus, each DBpedia entity is tied directly to a Wikipedia article. Every DBpedia entity name resolves to a description-oriented Web document (or Web resource).
Until DBpedia release 3.6, we only used article names from the English Wikipedia, but since DBpedia release 3.7, we also provide Internationalized Datasets that contain IRIs like http://xx.dbpedia.org/resource/Name, where xx is a Wikipedia language code and Name is taken from the source URL, http://xx.wikipedia.org/wiki/Name.
Thus, since the Wikipedia article is University of Manchester, not University Of Manchester, the DBpedia resource is http://dbpedia.org/resource/University_of_Manchester (shown in a browser at http://dbpedia.org/page/University_of_Manchester), and not http://dbpedia.org/resource/University_Of_Manchester.
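As a rough illustration of the naming rule quoted above, here is a minimal Python sketch that derives a resource IRI from a Wikipedia article title. The function name dbpedia_resource_iri is hypothetical, and the sketch assumes only the two title rules I'm sure of (spaces become underscores, and the first character is upper case while the rest is case-sensitive). It does not percent-encode special characters and does not check that the resource exists.

```python
def dbpedia_resource_iri(article_title: str, lang: str = "en") -> str:
    """Build a DBpedia resource IRI from a Wikipedia article title (sketch)."""
    # Wikipedia titles use underscores in place of spaces.
    name = article_title.strip().replace(" ", "_")
    # Only the first character is forced to upper case; every other
    # character keeps its case, which is why "of" must stay lower case.
    name = name[:1].upper() + name[1:]
    # English uses dbpedia.org; other languages use xx.dbpedia.org.
    host = "dbpedia.org" if lang == "en" else f"{lang}.dbpedia.org"
    return f"http://{host}/resource/{name}"


print(dbpedia_resource_iri("University of Manchester"))
```

The point is that the title has to be known in its exact Wikipedia form. Lowercasing everything and re-capitalizing each word cannot recover "University_of_Manchester", so a title of unknown case still needs some kind of lookup.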

Related

What is the purpose of dbo:wikiPageDisambiguates in DBpedia?

I am trying to figure out the purpose of dbo:wikiPageDisambiguates in DBpedia. I cannot find a definition of what it means.
You may find this is clarified by looking at the original Wikipedia page from which the DBpedia page was derived.
Bluntly, some Wikipedia entities are identified ambiguously, e.g., the string "Gerrard" may be meant to refer to any of the listed entities. A disambiguation page is then created, such that someone searching for "Gerrard" is presented with a list of the entities they might have meant, such that they can (for instance) view the page about Paul Gerrard (an entity of type person) and not a page about Gerrard Street (a named roadway), which happens to exist in both London and Ontario -- so there's another disambiguation page, so you can see the page about the street in the city you care about.
These pages are not necessarily exhaustive, as Wikipedia and DBpedia are living resources, growing and evolving as people add and correct and adjust data.
It would probably be good if some version of the above were put into the description of dbo:wikiPageDisambiguates.
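To see the property in use, a query along the following lines should list the entities that a disambiguation page points to. This is only a sketch: the helper name disambiguation_query is made up, and it assumes the disambiguation page for "Gerrard" has the IRI http://dbpedia.org/resource/Gerrard. The function only builds the SPARQL text; it does not contact any endpoint.

```python
DBO = "http://dbpedia.org/ontology/"
DBR = "http://dbpedia.org/resource/"


def disambiguation_query(page_name: str) -> str:
    """Return a SPARQL query listing what a disambiguation page points to."""
    return (
        f"PREFIX dbo: <{DBO}>\n"
        "SELECT ?candidate WHERE {\n"
        # The disambiguation page is the subject; each candidate entity
        # it lists is an object of dbo:wikiPageDisambiguates.
        f"  <{DBR}{page_name}> dbo:wikiPageDisambiguates ?candidate .\n"
        "}"
    )


print(disambiguation_query("Gerrard"))
```

The printed query can be pasted into a SPARQL endpoint such as the public one at https://dbpedia.org/sparql.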

Is there a way to have a reference term in addition to a label with Doccano?

Hi, I would like to know if we can have something like the following example in Doccano:
Let's say that we have a sentence like this: "MS is an IT company". I want to label some words in this sentence, for example MS (Microsoft). MS should be labelled as a Company (so imagine that I have an entity named Company), but I also want to say that MS stands for Microsoft.
Is there a way to do that with Doccano?
Thanks
Doccano supports:
Sequence Labelling, good for Named Entity Recognition (NER)
Text Classification, good e.g. for Sentiment Analysis
Sequence To Sequence, good for Machine Translation
What you're describing sounds a little like Entity Linking.
You can see from Doccano's roadmap in its docs that Entity Linking is part of the plans, but not yet available.
For now, I suggest framing this as a NER problem, with different entities for MS (Microsoft) and MS (other). If you have too many entities to choose from, the labelling could become complicated, but then you could break up the dataset into smaller entity-focussed datasets. For example, you could select only the documents with MS in them and label the mentions as one of the few synonyms.
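To make the NER framing concrete, here is a small sketch of what one labelled record could look like when the linked entity is folded into the label name. The field names "text" and "label" and the [start, end, label] span layout are my assumption about a JSONL sequence-labelling format, so check them against the Doccano docs for your version. The label name Company_Microsoft and the helper span_record are made up for illustration.

```python
import json


def span_record(text: str, mention: str, label: str) -> dict:
    """Label the first occurrence of `mention` in `text` with `label`."""
    start = text.index(mention)  # raises ValueError if the mention is absent
    end = start + len(mention)   # end offset is exclusive
    return {"text": text, "label": [[start, end, label]]}


# "Company_Microsoft" encodes both the entity type and what MS stands for.
record = span_record("MS is an IT company", "MS", "Company_Microsoft")
print(json.dumps(record))
```

A second label such as Company_Other (also hypothetical) would then cover the mentions of MS that do not mean Microsoft.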

NER: Relate extracted entity to single real world concept

I am processing plain text documents and identifying entities such as college/university names present in the document. Sometimes these names are written in different formats, but they refer to a single college/university.
Example:
Jawaharlal Nehru Technological University Hyderabad
J.N.T.U Hyderabad
JNTU Hyderabad
JNTU-H
Jawaharlal Nehru Technological University (JNTU) Hyderabad
All the above names refer to the same college.
How can we relate all these names to a single college/university name?
(I am looking for some kind of web service or something like Google search, because if I search for any of those names it returns the same college link.)
This task is named "Entity Linking". Some systems are dedicated to this, in most cases by leveraging Wikipedia (in particular redirects which give possible mentions for entities), such as Babelfy or DBpedia Spotlight.
Those services rely on data to link mentions to unique identifiers: if they have possible mentions for your entities, it should work in most cases (except for those that are too ambiguous). But in many cases their lexicons are not sufficient, and you'll probably face unknown entities or mentions. In that case, you'll have to build your own system by using an existing framework and providing it with a relevant database of entities and their mentions. Acronyms could be automatically generated from their full names.
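As a toy illustration of the "build your own lexicon" route, the sketch below generates a few aliases (including acronyms) from each full name and links a mention by normalized lookup. Everything here is hypothetical: the function names, the normalization rules, and the choice of aliases. A real system would need a much richer mention database and some way to deal with ambiguity.

```python
import re


def normalize(mention: str) -> str:
    """Lower-case, drop parenthesized asides, keep only letters and digits."""
    without_parens = re.sub(r"\([^)]*\)", " ", mention)
    return re.sub(r"[^a-z0-9]", "", without_parens.lower())


def acronym(words: list) -> str:
    """First letter of each word, e.g. ['Big', 'Old', 'Tree'] gives 'BOT'."""
    return "".join(word[0] for word in words)


def aliases_for(full_name: str) -> set:
    """Generate normalized aliases: full name, full acronym, acronym + last word."""
    words = full_name.split()
    return {
        normalize(full_name),
        normalize(acronym(words)),
        normalize(acronym(words[:-1]) + " " + words[-1]),
    }


def build_lexicon(entities: list) -> dict:
    """Map every generated alias to its canonical entity name."""
    lexicon = {}
    for name in entities:
        for alias in aliases_for(name):
            lexicon[alias] = name
    return lexicon


def link(mention: str, lexicon: dict):
    """Return the canonical name for a mention, or None if it is unknown."""
    return lexicon.get(normalize(mention))


lexicon = build_lexicon(["Jawaharlal Nehru Technological University Hyderabad"])
for mention in ["J.N.T.U Hyderabad", "JNTU-H", "MIT"]:
    print(mention, "->", link(mention, lexicon))
```

With this lexicon, all five spellings from the question resolve to the same canonical name, while an unknown mention such as "MIT" returns None.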

Wiktionary API to retrieve word forms (or other free service)

This is a question particularly for Russian/Ukrainian languages but may be useful for other languages too.
Is it possible to retrieve word forms as raw data, for example to use in a mobile application? These forms are present on the general wiki page: for example, the forms of the verb 'to be'. The same can be found for nouns, such as the noun forms for 'apple' in Russian.
I need these forms with description of the form. What I mean is for example:
to be - infinitive; am - first person singular, present time; are - first person plural, present time; etc.
So far I have found that only wiktionary.org provides such information for Russian language. It would be nice if someone could point me to some other services/dictionaries for Russian, Ukrainian and English.
If you're interested in using Wiktionary, you can consider Wikokit, which is an interface to a parsed Wiktionary database. The English and Russian database dumps are available in their download section, but they also provide a code library (Java) for creating your own database dump. They also provide (I think) a code library for interfacing with the database, so you no longer have to deal with web services, since you have it running locally.

Synonym style text lookup and parsing

We have a client who is looking for a means to import and categorize a large amount of textual data. It's been suggested that the easiest way to do this would be to look at the description field and try to match the words held there to see if a category can be derived for that particular record.
It was thought the best way to do this would be matching the words to key words held against each category and if that was unsuccessful then to use some kind of synonym look up to see if this could be used instead. So for example, if a particular record had the word "automobile" in it then a synonym look up could match that word to the word "car" which would be held against the category "vehicle".
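The two-pass idea described above (direct keyword match first, synonym lookup as a fallback) could be sketched roughly like this in Python. The keyword and synonym tables are tiny hand-made placeholders, not output from any real dictionary service; in practice the synonyms would come from a resource such as a thesaurus.

```python
import re

# Hypothetical keyword table: category -> keywords held against it.
CATEGORY_KEYWORDS = {
    "vehicle": {"car", "truck"},
    "furniture": {"table", "chair"},
}

# Hypothetical synonym table: word -> keyword it should be treated as.
SYNONYMS = {"automobile": "car", "lorry": "truck", "desk": "table"}


def find_category(word, category_keywords):
    """Return the category whose keyword set contains `word`, else None."""
    for category, keywords in category_keywords.items():
        if word in keywords:
            return category
    return None


def categorize(description, category_keywords=CATEGORY_KEYWORDS, synonyms=SYNONYMS):
    """Match description words to categories, falling back to synonyms."""
    words = re.findall(r"[a-z]+", description.lower())
    # Pass 1: direct keyword match.
    for word in words:
        category = find_category(word, category_keywords)
        if category is not None:
            return category
    # Pass 2: replace each word by its synonym keyword and try again.
    for word in words:
        keyword = synonyms.get(word)
        if keyword is not None:
            category = find_category(keyword, category_keywords)
            if category is not None:
                return category
    return None  # no category could be derived


print(categorize("Blue automobile, 2009 model"))
```

Records that return None are the ones that would need a richer synonym source or manual review.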
Does anyone know of a web service or other means of looking up a dictionary to find synonyms for a particular word? The project manager has suggested buying a Google Enterprise Search license for this but from what I can make out that doesn't offer what these guys are looking for.
Any other suggestions for getting the client what they are looking for would be gratefully accepted.
Thanks! I'll look into Wordnet.
Do you know of any other types of textual classification software products out there? I see there's some discussion of using Bayesian algorithms for this, but I can't find any real-world examples of it.
The first thing that comes to mind is Wordnet. Wordnet is a human-generated database of words and related words, including synonyms. The Wikipedia Wordnet entry lists several interfaces to Wordnet. I believe some of them are web services.
You can also roll your own. Manning and Schutze's chapter 5 (free PDF) shows ways to do this.
Having said that, are you solving the right problem? How do you build the category list?
Is it a hierarchy? a tag cloud? See Clay Shirky's Ontology is Overrated for a critique of hierarchical categories. I believe that synonyms are less important if you base your classification on sets of words (Naive Bayes, for example) rather than on single words.
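To illustrate what classifying on sets of words means, here is a minimal multinomial Naive Bayes sketch in plain Python with add-one smoothing. The training examples are invented, the function names are mine, and a real project would more likely use an existing library; this only shows the mechanics.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """examples: iterable of (text, category). Returns the counts needed."""
    doc_counts = Counter()              # documents seen per category
    word_counts = defaultdict(Counter)  # word frequencies per category
    vocabulary = set()
    for text, category in examples:
        doc_counts[category] += 1
        for word in text.lower().split():
            word_counts[category][word] += 1
            vocabulary.add(word)
    return doc_counts, word_counts, vocabulary


def classify(text, model):
    """Return the category with the highest log-probability for `text`."""
    doc_counts, word_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, float("-inf")
    for category in doc_counts:
        # Log prior: share of training documents in this category.
        score = math.log(doc_counts[category] / total_docs)
        total_words = sum(word_counts[category].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # words never seen in training are ignored
            # Add-one (Laplace) smoothing avoids zero probabilities.
            count = word_counts[category][word]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


model = train([
    ("red car with four doors", "vehicle"),
    ("used automobile low mileage", "vehicle"),
    ("oak dining table", "furniture"),
    ("leather sofa and chair", "furniture"),
])
print(classify("automobile with leather seats", model))
```

Because every word in the description contributes evidence, "automobile" counts toward the vehicle category simply because it appeared in vehicle training examples, with no synonym table involved.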
You should look at using WordNet. You can visit their website http://wordnet.princeton.edu/ to get more information, and there are libraries available for integrating with it in lots of languages.
Go to their online tool to see the use of it in action here: http://wordnetweb.princeton.edu/perl/webwn. If you look up a word, then click on "S" next to each definition, you'll get a list of semantically related words to that definition.
I also think you should check out software that will allow you to perform "document clustering." Here is an example: http://glaros.dtc.umn.edu/gkhome/cluto/cluto/overview. That should help you bootstrap the category creation process.
I think this will help get you a long way toward what you want!
For text classification you can take a look at Apache Mahout.

Resources