Why are there no resources represented on DBpedia-Live after 2016?

I have noticed that new resources and updates in Wikipedia after 2016 (like movies released after 2016) aren't represented in DBpedia-Live. Isn't DBpedia-Live supposed to be a more current version than DBpedia? For example, the movie "Spider-Man: Far From Home" isn't represented. Why is that?

You're mistaken, but I don't know how you came to be so, because you provided no information about how you reached your conclusion.
Wikipedia article -- https://en.wikipedia.org/wiki/Spider-Man:_Far_From_Home
DBpedia-Live (definitely exists) -- http://live.dbpedia.org/resource/Spider-Man:_Far_From_Home
DBpedia (doesn't exist) -- http://dbpedia.org/resource/Spider-Man:_Far_From_Home
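
If you want to verify this sort of thing programmatically, here is a minimal sketch in Python. It sends a SPARQL ASK query to each endpoint with the requests library; the endpoint URLs and the JSON format parameter follow the usual Virtuoso conventions, so treat them as assumptions to adjust.

import requests

ASK = "ASK { <http://dbpedia.org/resource/Spider-Man:_Far_From_Home> ?p ?o }"

def resource_exists(endpoint):
    # Virtuoso-backed endpoints generally accept a 'format' parameter for JSON results
    r = requests.get(endpoint, params={
        "query": ASK,
        "format": "application/sparql-results+json",
    })
    r.raise_for_status()
    return r.json()["boolean"]

# Both services use the same http://dbpedia.org/resource/... URIs
print(resource_exists("http://live.dbpedia.org/sparql"))  # True, per the links above
print(resource_exists("http://dbpedia.org/sparql"))       # False at the time of the answer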

Related

IMPACT World+ and ReCiPe 2016 in Brightway

Has somebody tried to implement the impact assessment methods IMPACT World+ or ReCiPe 2016 in Brightway?
The characterization factors for IMPACT World+ are available for download (beta version). The spreadsheet facilitates an implementation in SimaPro, but I guess that may cause some trouble if the biosphere flows are defined differently in Brightway2 and SimaPro (is this the case?). I have not been able to find the characterization factors for ReCiPe 2016.
I don't think anyone has done this yet. I have started with the regionalized characterization factors, but these are not complete. ReCiPe factors are here.
SimaPro makes changes to biosphere flow names and other metadata, and additionally includes a number of biosphere flows that ecoinvent doesn't, so there is a compatibility problem with the default Brightway flows.
Matching is possible - not even all that hard - but enough of a pain that it would be great if someone would do this. Ecoinvent knows it is something that people are asking for, but my understanding is that it is not currently a priority.
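
For anyone who does take this on, here is a minimal sketch of what writing a new LCIA method into Brightway2 looks like once the flow matching is done. The project name, method tuple, unit, and the single characterization-factor row are all placeholders; the hard part remains mapping the SimaPro flow names in the spreadsheet onto biosphere3 keys.

import brightway2 as bw

bw.projects.set_current("iw_plus_demo")  # hypothetical project with biosphere3 installed

# (biosphere flow key, characterization factor) pairs produced by the matching step
cf_rows = [
    (("biosphere3", "placeholder-flow-code"), 1.2e-3),  # placeholder key and CF
]

method = bw.Method(("IMPACT World+", "midpoint", "placeholder category"))
method.register(unit="placeholder unit")
method.write(cf_rows)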

What is the "Steam Method" Dijkstra refers to in "Structure of 'THE'-Multiprogramming System"?

Does anyone know what the "Steam Method" is that Dijkstra refers to in "The Structure of the 'THE'-Multiprogramming System"? The paper is very old (from 1968), but it lays down some of the groundwork for much of modern-day programming. Here is the context in which it appears:
The construction stage has been rather traditional, perhaps even old-fashioned, that is, plain machine code. Reprogramming on account of a change of specifications has been rare, a circumstance that must have contributed greatly to the feasibility of the "steam method." That the first two stages took more time than planned was somewhat compensated by a delay in the delivery of the machine.
Note: I suspect that this may be a typo and that it could refer to the "stream method". If it is the stream method, I would like to know what this method is and whether it fits the context here.
In reading through the paper, the "steam method" appears to be a self-referencing summation.
Take, for instance, page 341, where the goal is stated:
The primary goal of the system is to process smoothly a continuous flow of user programs as a service to the University.
And then in further describing the system on page 343:
Therefore we have arranged the whole system as a society of sequential processes, progressing with undefined speed ratios.
And within that same paragraph, two other items are quoted in an effort to name a particular item of the system.
Now, if we jump back to page 342, the system is described as uni-directional:
There is no common data base via which independent users can communicate with each other: they only share the configuration and a procedure library.
So, from both his manner of speaking and his description of the THE multiprogramming system being developed at the time, the "steam method" can reasonably be read as describing the approach conceived by the six mathematicians working on the project.
NOTE: Citations and page numbers are taken from the paper here

Extracting user interests from social profiles

This is my first time dabbling in NLP, so please excuse my ignorance. I'm looking for a method to extract interests/likes/hobbies from users' social profiles. Here is an example where all the interests/likes/hobbies are in bold:
"I consider myself a pretty diverse character... I'm a professional
wrestler, but I'd take a bullet for Wall•E. I train like a one-man genocide machine in the gym, but I cried at
"Armageddon." I'll head bang to AC/DC, and I'm seriously
considering getting a Legend of Zelda tattoo. I'm 420-friendly. I
like to party it up with the frat crowd one night, hang out with
my Burning Man friends the next, play Halo and World of
Warcraft the next, and jam with friends that aren't any younger than
40 the next. My youngest friend is 16, my oldest friend is 66. I'll
sing karaoke at the bars, and I'm my friends' collective
psychiatrist/shoulder."
The profiles are plain text. There are no meta tags or ids associated with any of it, it's just a paragraph of text.
My naive idea was to take each noun and match it against Freebase to see if it's an activity/artist/movie/book, etc. The problem is that although most entities mentioned will be things the user likes, she will also mention things she doesn't like, and I have no means of distinguishing the two.
I have two questions:
What sub field of NLP should I be looking at? Some googleable algorithms/techniques/authors would be greatly appreciated.
How hard is this problem?
Thanks!
First, unless using NLP to do this is a particular objective for you, check your problem domain to see if you can avoid it completely.
For instance:
do these profiles have tags (supplied either by the Site or by the user)?
what does the Site's API make available (assuming that's how you are accessing this data; if you are scraping it, then this doesn't apply)? A good example is Facebook: if you read a user's posts, you'll see words like "wrestler", "karaoke", etc., but if you look at what fields are exposed via the Graph API, you'll see that these activities nearly always have an associated FB ID.
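
To illustrate that last point, here is a hedged sketch of pulling a user's likes through the Graph API with Python's requests library. The token is hypothetical, reading likes requires the user_likes permission, and the exact shape of the endpoint has varied across API versions.

import requests

ACCESS_TOKEN = "..."  # hypothetical user token with the user_likes permission

resp = requests.get(
    "https://graph.facebook.com/me/likes",
    params={"access_token": ACCESS_TOKEN},
)
resp.raise_for_status()
for like in resp.json().get("data", []):
    # each entry carries a stable FB ID plus a human-readable name and category
    print(like["id"], like["name"], like.get("category"))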
I am not a specialist in this field, but I can recommend a couple of NLP resources that are accessible to the non-specialist or novice. The first is a text-processing API. This simple web service uses REST and JSON I/O. It is free and seems to have a fairly large rate limit.
This API appears to rely heavily on the excellent Natural Language Toolkit (NLTK), a mature, stable Python library that includes modules directed at the problem in your question, e.g., sentiment analysis, tagging, and chunk extraction.
Which particular sub-domain is most relevant to solving the question in the OP? I don't know, but I suspect there's a module somewhere in NLTK that does what you need. Finding that module is hopefully just a matter of skimming the API documentation (which is organized by module) and reading the Getting Started section, which contains an excellent survey of NLTK's modules as well as demos for each of them.
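
To make that concrete, here is a minimal sketch that chains three of those modules together: tokenize and POS-tag each sentence, chunk candidate noun phrases with a crude regular-expression grammar, and use VADER sentiment scores to separate likes from dislikes. The grammar and the use of the compound score are illustrative assumptions, not a tuned pipeline; matching the phrases against a knowledge base (the Freebase idea) would happen downstream.

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# One-time downloads: nltk.download('punkt'),
# nltk.download('averaged_perceptron_tagger'), nltk.download('vader_lexicon')

text = ("I'll head bang to AC/DC, and I'm seriously considering getting "
        "a Legend of Zelda tattoo. I cried at Armageddon.")

chunker = nltk.RegexpParser("NP: {<JJ>*<NN.*>+}")  # crude noun-phrase pattern
sia = SentimentIntensityAnalyzer()

for sent in nltk.sent_tokenize(text):
    tagged = nltk.pos_tag(nltk.word_tokenize(sent))
    tree = chunker.parse(tagged)
    phrases = [" ".join(tok for tok, _ in st.leaves())
               for st in tree.subtrees(lambda t: t.label() == "NP")]
    polarity = sia.polarity_scores(sent)["compound"]  # > 0 suggests a "like"
    print(round(polarity, 2), phrases)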

Organizing information for a software development organization

Over time, our information strategy has gone all over the place, and we are looking to set a clearer policy and a more explicit way for everyone to stay in sync on information sharing. Some things to note: the org is 300+ people spread across multiple countries, and we have people who are comfortable in SharePoint, people who are comfortable in Confluence, etc., so there is definitely a "change" factor here.
Here are our current issues and what we are thinking about doing about them. I would love to hear feedback, suggestions, etc.
The content we have today:
1. Technical design info / architecture docs
2. Meeting minutes, action items, etc.
3. Project plans and roadmaps
4. Organization business mgmt info - travel, budget info, headcount info, etc.
5. Project pages with business analysis, requirements, etc.
Here are some of our main issues:
Where should data go - Confluence wiki versus SharePoint versus intranet site: we use the Confluence wiki for #1, #2, #3, and #5, but we also use SharePoint for #1, #3, #4, and #5. We are trying to figure out if we should mandate each category to a specific place to make things consistent. We are using SharePoint more as a directory structure of documents, and we are using Confluence for more ad hoc, changeable content.
Stale data - this is maybe a cultural thing with the org, but at certain points in time data just becomes stale and is no longer relevant. What is the best way to ensure old data doesn't create a lot of noise, and that the latest correct data is kept up to date? Should there be people in the org responsible for this, or should it be an implicit part of everyone's job? This is more of an issue when people leave, join, etc.
More active usage - what is the best way to get people off of email and to get them to stop and think, "could this be useful for others? Let me put it in a centralized place instead of in email chains"?
Also, any other stories of good ways to improve an org's communication and information management would be welcome.
A fundamental root cause of information clutter is "no ownership".
People are assigned to projects. The projects end (or are cancelled), the people move on and the documents remain behind to gather "dust" and become information clutter.
This is hard to prevent. The wiki vs. SharePoint question doesn't address the clutter; it just shifts the technology base that's used to accumulate clutter.
Let's look at the clutter
Technical design info / architecture docs. Old ones don't matter. There's current and there's irrelevant. Wiki.
Last year's obsolete design information is -- well -- obsolete.
Meeting minutes, action items, etc. Action items become part of someone's backlog in a development sprint, or they're probably never going to get done. Backlogs are wiki items. Everything else is history that might be interesting but usually isn't. If it didn't create a sprint backlog item, update an architecture, or solve a development problem, the meeting was probably a waste of time.
Project plans and roadmaps. The sprint backlog matters -- this is what a "plan and roadmap" aspires to be. If you have to supplement your plans with roadmaps, you probably ought to give up on the planning, use Scrum, and just keep the backlog current.
The original plan is someone's guess at project inception time, and not really very interesting to the current project team.
Organization business mgmt info - travel, budget info, headcount info, etc. This is a weird mixture of highly structured stuff (budget, organization) and unstructured stuff ("travel"?)
How much history do you need? None? Wiki at best. A financial or HR system is where it belongs. But, in big organizations, the accounting systems can be difficult and cumbersome to use, so we create secondary sources of information like a SharePoint page with out-of-date budget numbers, because the real budget numbers are buried inside Oracle Financials.
Project pages with business analysis, requirements, etc. This is your backlog. Your project roadmap and your requirements and your analysis ought to be a single document. In the wiki.
History rarely matters. Someone's concept at project inception time of what the requirements are doesn't matter very much any more. What the requirements evolved to in their final form matters far more than any history. This is wiki material.
How old is 'too old'?
I've worked with customers that have 30-year old software. The software -- obviously -- is relevant because it's in production.
The documentation, however, is all junk. The software has been maintained. It's full of change control records. The "original" specifications would have to be meticulously rewritten with each change control folded in. Since the change control documents can be remarkably pervasive, the only way to see where the changes were applied is to read the source and -- from that -- reverse engineer the current-state specification.
If we can only understand a 30-year old app by reverse engineering the source, then, chuck the 30-year old pile of paper. It's useless.
As soon as maintenance is done, the "original" specification has been devalued.
How to clean it up?
If you create the wiki page or sharepoint site, you own it forever.
When you leave, your replacement owns it forever.
Each manager is 100% responsible for every piece of information their staff creates. They have to delete things. The weak solution is to "archive" stuff, which is just a polite way of saying "delete" without the D-word.
Cleanup must be every manager's ongoing responsibility. If they can't remember what it is, or why they own it, they should be required (or "encouraged") to delete it. Everything unaccessed in the last two years should be archived without question. Everything 10 years old is just irrelevant history.
It's painful, and it doesn't appear to be value-creating work. After all, we work in IT. Our job is to "write" software, not delete it. No one will do it unless compelled on threat of firing.
The cost of storage is relatively low. The cost of cleanup appears higher.
How to stop the email chain?
Refuse to participate. Create a "Break the Chain" campaign focused on replacing email chains with wiki updates (or sharepoint updates).
Be sure your wiki provides links and is faster to edit than an email.
You can't force people to give up a really, really convenient solution (Email). You have to make the wiki more valuable and almost as convenient as email.
Ramp up the value on the wiki. Deprecate email chains. Refuse to respond to email chains. Refuse to accept "to do" action items through email.
You can use the Confluence wiki for storing documents as attachments and have the wiki's paths serve the same role as the file paths in SharePoint.
Re: stale data: have ownership of the data (both a person and a team) and ensure that the owners' deliverables include maintenance of ALL the data.
As far as "Off email", this is hard to do as you can't force people to do this short of actively monitoring all email... but you can try some deliverables with metrics regarding content added to the Wiki. That way people would be more likely to want to re-use the work already done on the email to paste into Wiki to meet the "quota" instead of composing fresh stuff.
Our company and/or team used all three of these approaches with some degree of success in the past.
Is there a reason not to have the wiki hold the files?
Also, limiting the mail server to disallow attachments on internal emails is perhaps too draconian, but asking folks to put everything in the wiki that needs to be emailed more than once is pretty darn useful.
Efficient information management is indeed a very hard problem. We found that the "the simpler, the better" principle can work miracles in solving it.
Where should data go - we are big believers in the wiki approach. In fact, we use Confluence for sharing possibly every type of information, except really large binary files. For those, we use Dropbox. Its simplicity is an absolute killer feature. (Tip: you can integrate them with the Dropbox in Confluence plugin.)
Finding stale data - by our definition, stale data is data that has not been updated or viewed for a specific period of time. The Archiving Plugin for Confluence can quickly and automatically find it, then report it to the authors and administrators, who may update it (or remove it, see the next item). There is, of course, information that never expires, but the plugin can skip it after you mark the corresponding pages.
Removing stale data - we are fairly aggressive on this. If the data is not (highly) relevant anymore, clean it up now! We can safely follow this practice because we never actually delete data. We just move outdated data to hidden archive spaces using, again, the Archiving Plugin. If we change our mind later, it is very easy to find it in the archive, view it, or even recover it.
More active usage - our rule: if the information needs to be persistent, don't email it. Put it on a wiki page instead. The hard thing for some people is finding the best location for the information (which space? where in the page hierarchy?). Badly organized spaces with a vague scope are another big efficiency drain, unfortunately. Large companies may consider introducing a wiki gardener to cure this.

Determining what a word "is" - categorizing a token

I'm writing a bridge between the user and a search engine, not a search engine itself. Part of my value-add will be inferring the intent of a query. The intent of a tracking number, stock symbol, or address is fairly obvious. If I can categorise a query, then I can decide whether the user even needs to see search results. Of course, if I cannot, then they will see search results. I am currently designing this inference engine.
I'm writing a parser; it should take any given token and assign it a category. Here are some theoretical English examples:
"denver" is a USCITY and a PLACENAME
"aapl" is a NASDAQSYMBOL and a STOCKTICKERSYMBOL
"555 555 5555" is a USPHONENUMBER
I know that each of these cases will most likely require specific handling, however I'm not sure where to start.
Ideally I'd end up with something simple like:
queryCategory = magicCategoryFinder(query)
print(queryCategory)
# "SOMECATEGORY" or a list of categories
Natural language parsing is a complicated topic. One of the problems here is that determining what a word is depends on context and implied knowledge. Also, you're not so much interested in words as in groups of words. Consider: "New York City" is a place, but it's three words, two of which (new and city) have other meanings.
Also, you have to consider ambiguity, which is once again where context and implied knowledge come in. For example, JAVA is (or was) the stock symbol for Sun Microsystems. It's also a programming language, a place, and a word associated with coffee. How do you classify it? You'd need to know the context in which it was used.
And if you can solve that problem reliably you can make yourself very wealthy.
What's all this in aid of anyway?
To learn about "tagging" (the term of art for what you're trying to do), I suggest playing around with NLTK's tag module. More generally, NLTK, the Natural Language ToolKit, is an excellent toolkit (based on the Python programming language) for experimentation and learning in the field of Natural Language Processing (whether it's suitable for a given production application may be a different issue, esp. if said application requires very high speed processing on large volumes of data -- but, you have to walk before you can run!-).
You're bumping up against one of the hardest problems in computer science today... determining semantics from English context. This is the classic text-mining problem and gets into some very advanced topics. I think I would suggest thinking more about your problem and seeing if you can a) go without categorization, or b) utilize structural info such as document position to give you a hint (it's either a city, or a place name, or undetermined), plus some lookup tables to help. E.g., stock symbols are pretty easy to build a fairly complete lookup table for. You might consider downloading the CIA World Factbook for a lookup of cities, etc. A sketch of that lookup-table approach follows.
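
Here is a minimal sketch of that idea. The tables are toy excerpts (real ones would come from exchange listings, a gazetteer, etc.), and the function name just mirrors the magicCategoryFinder placeholder from the question:

import re

US_PHONE = re.compile(r"^\d{3}[ .-]?\d{3}[ .-]?\d{4}$")
NASDAQ_SYMBOLS = {"AAPL", "MSFT", "GOOG"}   # toy excerpt of a full listing
US_CITIES = {"denver", "boston", "austin"}  # toy excerpt of a gazetteer

def magic_category_finder(query):
    categories = []
    if US_PHONE.match(query):
        categories.append("USPHONENUMBER")
    if query.upper() in NASDAQ_SYMBOLS:
        categories += ["NASDAQSYMBOL", "STOCKTICKERSYMBOL"]
    if query.lower() in US_CITIES:
        categories += ["USCITY", "PLACENAME"]
    return categories or ["UNDETERMINED"]

print(magic_category_finder("555 555 5555"))  # ['USPHONENUMBER']
print(magic_category_finder("aapl"))          # ['NASDAQSYMBOL', 'STOCKTICKERSYMBOL']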
As others have already pointed out, this is an exceptionally difficult task. The classic test is a pair of sentences:

Time flies like an arrow.
Fruit flies like a banana.
In the first sentence, "flies" is a verb. In the second, it's part of a noun. In the first, "like" is an adverb, but in the second it's a verb. The context doesn't make this particularly easy to sort out either -- there's no obvious difference between "Time" and "Fruit" (both normally nouns). Likewise, "arrow" and "banana" are both normally nouns.
It can be done -- but it really is decidedly non-trivial.
Although it might not help you much with disambiguation, you could use Cyc. It's a huge database of what things are that's intended to be used in AI applications (though I haven't heard any success stories).
