DBPedia: What's the meaning of '__1' (double underscores) in URIs?

On DBpedia you can find a lot of URIs that contain double underscores and a number at the end, e.g.:
http://dbpedia.org/resource/Eric_Cheney__1
http://dbpedia.org/resource/Eli_Wallach__1
http://dbpedia.org/resource/Ed_Wood__1
http://dbpedia.org/resource/Francis_Ford_Coppola__1
Most of these items are of type PersonFunction, but I can't find any documentation on why these objects exist (and why isn't a person's function an ObjectProperty?)...
So why are these created?

After reading this DBPedia discussion on storing blank nodes, it seems the purpose is to avoid clashes with Wikipedia's URIs.
These URIs, presumably, are used for nodes that don't have a corresponding Wikipedia article of their own, but instead point to a closely related article on the subject. Since DBpedia tries to create a URI for everything, these URIs are assembled according to specific rules (more on this can be found in the discussion linked above).
From the discussion:
Note that intermediate node URIs always contain double underscores,
e.g. 1. Wikipedia doesn't allow consecutive underscores in page
titles, so we can be sure that these URIs will not clash with DBpedia
URIs for Wikipedia pages. We pick a name from the arguments of the
template from which the intermediate node is extracted, append two
underscores, that name and a number to the URI of the main page and
use that as the URI for the intermediate node. If there are multiple
intermediate nodes on one page for which we pick the same name, we use
different numbers, e.g. see 1 and [2].
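For illustration, here is a minimal sketch of that naming rule in Python. The function name and arguments are my own invention; the rule itself (main page URI, two underscores, the picked name, a number) is taken from the quoted discussion, and the picked name can apparently be empty, as in the Eric_Cheney__1 example above.

def intermediate_node_uri(main_page_uri, picked_name, counter):
    # Append two underscores, the picked name (possibly empty) and a
    # running number; the counter distinguishes multiple intermediate
    # nodes on one page that would otherwise get the same name.
    return "%s__%s%d" % (main_page_uri, picked_name, counter)

# intermediate_node_uri("http://dbpedia.org/resource/Eric_Cheney", "", 1)
# -> 'http://dbpedia.org/resource/Eric_Cheney__1'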

Related

Why are major DNS systems designed to write the subdomain before the root domain and not the reverse?

We write app.example.com instead of com.example.app in major DNS-based systems, including WWW, FTP, and email. What is the reason behind this design? Why not the reverse order?
Hostnames existed before the DNS and even before TLDs.
Structure was added to the names by RFC 921, "Domain Name System Implementation Schedule - Revised" (October 1984).
This document explains the change from simple names (no dots) to hierarchical ones, which was needed because the Internet was growing at that time and a single flat list of names was no longer enough to describe every host.
Some excerpts:
The names are being changed from simple names, or globally unique
strings, to structured names, where each component name is unique
only with respect to the superior component name.
...
The elements (or components) of the structured names are separated
with periods, and the elements are written from the most
specific on the left to the most general on the right.
For example: USC-ISIF.ARPA
RFC 882 "DOMAIN NAMES - CONCEPTS and FACILITIES" (November 1983) just says it is a convention:
The domain name of a node or leaf is the path from
the root of the tree to the node or leaf. By convention, the
labels that compose a domain name are read left to right, from the
most specific (lowest) to the least specific (highest).
A clue may come from RFC 1034 "DOMAIN NAMES - CONCEPTS AND FACILITIES" (November 1987) that repeats the above with some details:
The domain name of a node is the list of the labels on the path from
the node to the root of the tree. By convention, the labels that
compose a domain name are printed or read left to right, from the most
specific (lowest, farthest from the root) to the least specific
(highest, closest to the root).
RFCs (see RFC 1166) traditionally use "MSB 0" bit numbering: when you write down a byte of 8 bits, you start with the most significant one, the bit with the highest value (the one encoding the decimal value 128).
This was then extended with the concept of network byte order, where the most significant byte comes first.
I guess the idea of starting with the most specific label of the name comes directly from this idea of the most significant bit first: you start with the label farthest from the root, so the root ends up at the top right, and a full name is read in a kind of right-to-left pattern.
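As a small concrete illustration of "most significant first", here is network byte order in Python (this example is mine, not from the RFCs):

import struct

# Network byte order is big-endian: packing the 16-bit value 0x1234
# puts the most significant byte (0x12) first on the wire.
assert struct.pack("!H", 0x1234) == b"\x12\x34"

# "MSB 0" bit numbering: bit 0 is the most significant bit of a byte,
# i.e. the one worth 128 (0b10000000).
assert 1 << 7 == 128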

How can I get all the exercises for a topic (e.g., math) and all its subtopics from the khanacademy api?

Khan Academy's API Explorer has an exercises section that mentions filtering by tags, but the URL with the math tag applied returns nothing.
The generic exercise objects don't contain the topic they're in. My guess is that there's an id to join on somewhere in the topictree/exercises json objects, but I don't know an efficient way to find it.
Here are the raw exercises json and raw topictree json (note, the second one is huge, and contains many topics other than math).
I don't think there is a nice way to return exercises from just a subtree of the topictree (e.g. just math). Tags are a different concept, and there isn't a tag common to everything in math. Probably your best bet is to load the full topictree with just Exercises (and Topics) and work from there:
http://www.khanacademy.org/api/v1/topictree?kind=Exercise
If you need to reference this structure repeatedly, it probably makes sense to download and filter it ahead of time, and maybe re-fetch it from time to time to account for changes to Khan Academy content. But it depends on your exact use case.
Generally, any content item can be referenced by content_id (sometimes just called id) or by slug, but unfortunately, the naming and usage aren't consistent everywhere.
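A minimal sketch of that download-and-filter step, assuming each topictree node carries "kind", "children" and "slug" fields (the field names are based on the public topictree JSON and may need adjusting):

import json
import urllib.request

URL = "http://www.khanacademy.org/api/v1/topictree?kind=Exercise"

def exercises_under(node, target_slug, inside=False):
    # Recursively collect Exercise nodes below the topic with the given slug.
    inside = inside or node.get("slug") == target_slug
    if inside and node.get("kind") == "Exercise":
        yield node
    for child in node.get("children") or []:
        yield from exercises_under(child, target_slug, inside)

with urllib.request.urlopen(URL) as resp:
    tree = json.load(resp)

math_exercises = list(exercises_under(tree, "math"))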
You can use the following to get all the exercises:
http://www.khanacademy.org/api/v1/exercises
http://www.khanacademy.org/api/v1/topictree?kind=Exercise
I'm not sure what the difference between these two is - I don't use them.
I prefer to fetch the data for the individual topic nodes as follows:
http://www.khanacademy.org/api/v1/topic/%s
http://www.khanacademy.org/api/v1/topic/%s/exercises
http://www.khanacademy.org/api/v1/topic/%s/videos
where %s is the "node_slug" property of each topic. The root of the tree is just "root". The first call gives you the topic details and a list of sub-items in the "child_data" array. Use the "id" property of each sub-topic in that array to look up its details in the "children" array (the entry whose "internal_id" equals the "id"). There you get the "node_slug" to use for the next API call for that sub-topic. The "child_data" array lists the sub-items in the order they appear on the website when you're working with the missions.
I cache these responses so that I don't have to download everything every time.
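A hedged sketch of that walk; the child_data/children matching is as described above, while the "title" field and the printing are my own assumptions:

import json
import urllib.request

BASE = "http://www.khanacademy.org/api/v1/topic/%s"

def fetch_topic(slug):
    with urllib.request.urlopen(BASE % slug) as resp:
        return json.load(resp)

def walk(slug, depth=0):
    topic = fetch_topic(slug)
    print("  " * depth + topic.get("title", slug))
    # Match each child_data entry to its record in "children" by id.
    children = {c["internal_id"]: c for c in topic.get("children") or []}
    for item in topic.get("child_data") or []:
        child = children.get(item["id"])
        if child and "node_slug" in child:
            walk(child["node_slug"], depth + 1)

walk("root")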

How to determine the stopping point of a loop when crawling a web-site

My program currently goes through pages of a website gathering information. How do I set my loop to end when I have visited all the websites pages?
Is there some way of knowing the amount of webpages in any site?
Or do I have to compare a block of pages I have visited, e.g. 10, and if the pages are seen again in that order I know it's repeating itself?
I'm sure there has to be a better way of knowing when to stop.
Keep track of the pages visited (perhaps by keeping the visited URLs in a set) and, before scanning a new page, check whether it has already been visited.
Breadth first search
Depth first search
Check these two algorithms. Think of the site as a graph
whose nodes are the pages and whose edges/arcs are the links
from one page to another: two pages A and B are neighbors
(A → B) if there's a link from page A to page B.
Then just implement one of these two algorithms
(whichever you find more appropriate for your case).
Both of them have their respective stop conditions.
Your search in both cases should start with the root
page(s) which is usually default.ext or index.ext or
something similar (ext = html, asp, aspx, jsp, php, whatever).
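A minimal BFS sketch combining the two answers above (a visited set plus graph traversal), using only the standard library; the same-site check and the max_pages safety cap are my own assumptions:

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    # Collect href targets from anchor tags.
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, max_pages=1000):
    visited = set()              # the stop condition: never revisit a page
    queue = deque([start_url])
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            html = urlopen(url).read().decode("utf-8", errors="replace")
        except OSError:
            continue
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            absolute, _ = urldefrag(urljoin(url, link))
            # Crude same-site check: only follow links under the start URL.
            if absolute.startswith(start_url) and absolute not in visited:
                queue.append(absolute)
    return visited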
You may want to pre-process the website with a SitemapGenerator and only visit the webpages included in the sitemap.
Is there some way of knowing the amount of webpages in any site?
No. All you can do to examine a web-site is to make HTTP GET (or HEAD) requests and examine the response. That will tell you whether the URI is a valid identifier for a resource, and get you a representation of that resource. You cannot know which requests will indicate a valid resource, nor can you practically generate all the possible URIs to perform an exhaustive search.
At best, all you can do is to start with a URI and find all the resources reachable from that URI, by examining resources that contain links to other resources, and then following those links.

Is it possible to have one (single) character top level domain name?

I'm writing a regex to validate email addresses. The one thing that confuses me is:
Is it possible to have a single character for a top-level domain name? (e.g. lockevn.c)
Background: I know a top-level domain name can be anything from 2 characters up (.uk, .us to .canon, .museum). I've read some documents but I can't figure out whether 1 character is allowed or not.
It is technically possible; however, no single-character TLDs have been accepted into the root (as of the moment), so the answer is:
Yes, it is possible to have a single character for a top-level domain name; however, there are currently no single-character TLDs in the root.
You can see the list of TLDs that are currently in the root at this URL:
http://data.iana.org/TLD/tlds-alpha-by-domain.txt
RFC-952 shows what a "name" is, which includes what is valid as a top-level domain:
A "name" (Net, Host, Gateway, or Domain name) is a text string up
to 24 characters drawn from the alphabet (A-Z), digits (0-9), minus
sign (-), and period (.).
Additionally, the grammar from RFC-952 shows:
<name> ::= <let>[*[<let-or-digit-or-hyphen>]<let-or-digit>]
RFC-1123 section 2.1 specifically allowed single-letter domains and subdomains, relaxing the initial grammar of RFC-952 (which required names to start with a letter), so a single-character top-level domain, even one that is a digit, became syntactically possible:
2.1 Host Names and Numbers
The syntax of a legal Internet host name was specified in RFC-952.
One aspect of host name syntax is hereby changed: the
restriction on the first character is relaxed to allow either a
letter or a digit. Host software MUST support this more liberal
syntax.
EDIT: As per #mr.spuratic's comment, RFC-3696 section 2 tightened the rules for top level domains, stating:
There is an additional rule that essentially requires
that top-level domain names not be all-numeric.
This means that:
a. is a valid top level domain
1. is not a valid top level domain
A very unscientific test of this: if I add "a" to my hosts file pointing to my local machine, going to http://a in my address bar does show my Apache welcome page.
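Putting the two rules together (a single-letter TLD is syntactically fine, an all-numeric one is not), a hedged sketch of the host part of such a regex might look like this; it's illustrative only, not a complete RFC 5321/5322 email validator:

import re

# Hostname whose final label (the TLD) may be a single character but
# must contain at least one letter (RFC 3696: not all-numeric).
HOST_RE = re.compile(
    r"^(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)*"  # zero or more inner labels
    r"(?=[0-9-]*[a-z])"                         # TLD must contain a letter
    r"[a-z0-9](?:[a-z0-9-]*[a-z0-9])?$",        # the TLD label itself
    re.IGNORECASE,
)

assert HOST_RE.match("lockevn.c")       # single-letter TLD: syntactically OK
assert not HOST_RE.match("lockevn.1")   # all-numeric TLD: rejected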
I'm not sure about the internet standard, but in practice, no. See:
http://www.norid.no/domenenavnbaser/domreg.html
and:
http://sqa.fyicenter.com/Online_Test_Tools/Domain_Name_Format_Validator.php
You should DEFINITELY allow 1-character domains, since some registries deliberately allow them (and I'm speaking of quite big registries like the UK, Germany, Poland and Ireland - important contributors to the Internet community, not only small exotic exceptions). Since I also plan on using such domains, which definitely work with all the e-mail services I've used, letters AND numbers, I'd really give the hint to allow this, or else your script might need correcting later.
Also, some of the biggest internet companies use such domains - one of the most famous examples is Twitter's t.co for URL shortening. Other companies I know of that have such domains are Facebook, Google, PayPal and Deutsche Telekom. The list is longer, and some bigger investors also hold them as assets.
By the way, as proof, there is a website trading this kind of domain online if you search for "1 letter domain names" :)

standard way of encoding pagination info on a restful url get?

I think my question would be better explained with a couple of examples...
GET http://myservice/myresource/?name=xxx&country=xxxx&_page=3&_page_len=10&_order=name asc
that is, on the one hand I have conditions (name=xxx&country=xxxx) and on the other hand I have parameters affecting the query (_page=3&_page_len=10&_order=name asc)
now, I thought about using some special prefix ("_" in this case) to avoid collisions between conditions and parameters (what if my resource has an "order" property?)
is there some standard way to handle these situations?
--
I found this example (just to pick one)
http://www.peej.co.uk/articles/restfully-delicious.html
GET http://del.icio.us/api/peej/bookmarks/?tag=mytag&dt=2009-05-30&start=1&end=2
but in this case the condition fields are already defined (there is no start or end property)
I'm looking for some general solution...
--
edit, a more detailed example to clarify
Each item is completely independent from the others... let's say that my resources are customers, and that (luckily) I have a couple of million of them in my db.
so the url could be something like
http://myservice/customers/?country=argentina,last_operation=2009-01-01..2010-01-01
It should give me all the customers from argentina that bought anything in the last year.
Now I'd like to use this service to build a browse page, or to fill a combo with ajax, for example, so the idea was to add some metadata to control what info I get.
to build the browse page I would add
http://...,_page=1,_page_len=10,_order=state,name
and to fill an autosuggest combo with ajax
http://...,_page=1,_page_len=100,_order=state,name,name=what_ever_type_the_user*
to fill the combo with the first 100 customers matching what the user typed...
my question was whether there is some standard (written or not) way of encoding this kind of stuff in a RESTful URL manner...
While there is no standard, Web API Design (by Apigee) is a great book of advice when creating Web APIs. I treat it as a sort of standard, and follow its recommendations whenever I can.
Under "Pagination and partial response" they suggest (page 17):
Use limit and offset
We recommend limit and offset. It is more common, well understood in leading databases, and easy for developers.
/dogs?limit=25&offset=50
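As a toy illustration of how those two parameters typically map onto a query (the table and column names here are made up):

def page_query(limit=25, offset=0):
    # Clamp client-supplied values so a caller can't request the whole table.
    limit = max(1, min(int(limit), 100))
    offset = max(0, int(offset))
    # /dogs?limit=25&offset=50 maps directly to LIMIT 25 OFFSET 50.
    return "SELECT * FROM dogs ORDER BY name LIMIT ? OFFSET ?", (limit, offset)

# sql, params = page_query(limit=25, offset=50)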
There's no standard or convention which defines a way to do this, but using underscores (one or two) to denote meta-info isn't a bad idea. This is what's used to specify member variables by convention in some languages.
Note:
I started writing this as a comment to my previous answer. Then I was going to add it as an edit, but I think it belongs as a separate answer instead, since it is a completely different approach.
The more that I have been thinking about this, I think that you really have two different resources that you have to deal with:
A page of resources
Each resource that is collected into the page
I may have missed something (could be... I've been guilty of misinterpretation). Since a page is a resource in its own right, the paging meta-information is really an attribute of the resource, so placing it in the URL isn't necessarily the wrong approach. If you consider what can be cached downstream for a page and/or referred to as a resource in the future, the resource is defined by both the paging attributes and the query parameters, so they should both be in the URL. To continue with my entirely too lengthy response, the page resource would be something like:
http://.../myresource/page-10/3?name=xxx&country=yyy&order=name&orderby=asc
I think that this gets to the core of your original question. If the page itself is a resource, then the URI should describe the page so something like page-10 is my way of saying "a page of 10 items" and the next portion of the page is the page number. The query portion contains the filter.
The other resource names each item that the page contains. How the items are identified should be controlled by what the resources are. I think that a key question is whether the result resources stand on their own or not. How you represent the item resources differs based on this concept.
If the item representations are only appropriate when in the context of the page, then it might be appropriate to include the representation inline. If you do this, then identify them individually and make sure that you can retrieve them using either URI fragment syntax or an additional path element. It seems that the following URLs should result in the fifth item on the third page of ten items:
http://.../myresource/page-10/3?...#5
http://.../myresource/page-10/3/5?...
The largest factor in deciding between these two is how strongly coupled the individual item is with the page. The fragment syntax is considerably more binding than the path element IMHO.
Now, if the item resources are free-standing and the page is simply the result of a query (which I think is likely the case here), then the page resource should be an ordered list of URLs for each item resource. The item resource should be independent of the page resource in this case. You might want to use a URI that is based on the identifying attribute of the item itself. So you might end up with something like:
http://.../myresource/item/42
http://.../myresource/item/307E8599-AD9B-4B32-8612-F8EAF754DFDB
The key deciding factor is whether the items are freestanding resources or not. If they are not, then they are derived from the page URI. If they are freestanding, then they should be defined by their own URIs and included in the page resource as links instead.
I know that the RESTful folk tend to dislike the usage of HTTP headers, but has anyone actually looked into using HTTP ranges to solve pagination? I wrote an ISAPI extension a few years back that included pagination information along with other non-property information in the URI, and I never really liked the feel of it. I was thinking about doing something like:
GET http://...?name=xxx&country=xxxx&_orderby=name&_order=asc HTTP/1.1
Range: pageditems=20-29
...
This puts the result set parameters (e.g., _orderby and _order) in the URI and the selection in a Range header. I have a feeling that most HTTP implementations would screw this up, though, especially since support for non-byte ranges is a MAY in RFC 2616. I started thinking more seriously about this after doing a bunch of work with RTSP. The Range header in RTSP is a nice example of extending ranges to handle time as well as bytes.
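For what it's worth, the client side of that request could be built like this; note that "pageditems" is the hypothetical range unit proposed above, so a real server would need custom handling for it:

import urllib.request

# "pageditems" is the hypothetical range unit from the example above;
# standard HTTP servers only understand byte ranges.
req = urllib.request.Request(
    "http://myservice/myresource/?name=xxx&country=xxxx&_orderby=name&_order=asc",
    headers={"Range": "pageditems=20-29"},
)
# response = urllib.request.urlopen(req)  # expect 206 Partial Content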
I guess another way of handling this is to make a separate request for each item on the page as an individual resource in its own right. If your representation allows for this, then you might want to consider it. It is more likely that intermediate caching would work very well with this approach. So your resources would be defined as:
myresource/name=xxx;country=xxx/orderby=name;order=asc/20/
myresource/name=xxx;country=xxx/orderby=name;order=asc/21/
myresource/name=xxx;country=xxx/orderby=name;order=asc/22/
myresource/name=xxx;country=xxx/orderby=name;order=asc/23/
myresource/name=xxx;country=xxx/orderby=name;order=asc/24/
I'm not sure if anyone has tried something like this or not. This would make URIs constructible, which is always a useful property IMHO. The bonus to this approach is that the individual responses could be cached, and the server is free to optimize the handling of collecting pages of items and what not in the most efficient way. The basic idea is to have the client specify the query in the URI and the index of the item that it wants to retrieve. No need to push the idea of a "page" into the resource or even to make it visible. The client can iteratively retrieve objects until its page is full or it receives a 404.
There is a downside of course... the HTTP server and infrastructure have to support pipelining, or the cost of creation/destruction of connections might kill the idea outright.
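A sketch of that client loop, under the constructible-URI scheme sketched above (the base URI layout is the one proposed; error handling is minimal):

import urllib.error
import urllib.request

def fetch_page(base, start_index, page_size):
    # Fetch items one by one until the page is full or the server 404s.
    items = []
    for i in range(start_index, start_index + page_size):
        url = "%s/%d/" % (base, i)  # constructible item URI, as proposed
        try:
            with urllib.request.urlopen(url) as resp:
                items.append(resp.read())
        except urllib.error.HTTPError as e:
            if e.code == 404:       # past the last item: stop
                break
            raise
    return items

# fetch_page("http://myservice/myresource/name=xxx;country=xxx/orderby=name;order=asc", 20, 5)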
