Is there an XSD schema for the Activity Streams feed format?

I am trying to map an Activity Streams feed to Java entity beans using JAXB. If there were an XSD schema for Activity Streams, it would be so much easier.
Does anyone know whether an XSD schema for activitystrea.ms exists?

Thankfully I have some G&T to wipe away the vomit in my mouth from this being the only link on the entire web that has what you and I are looking for:
https://code.google.com/p/nter/source/browse/feeds/activity-streams-1.0.xsd?repo=xml-schemas-devel&r=741fa55f12e5f4b4e5ccab5b84fbb6abcf256427
Three years later and this is where "social" stands. Machines need specs too, not just humans. This is the only satisfactory "we can" in the activitystrea.ms world.
For more soapboxing, visit yoyodyne.net!
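If you do go with that XSD, the usual JAXB workflow is to generate Java classes from it with xjc and then unmarshal feed documents against them. A minimal sketch, assuming the bindings are generated into a hypothetical package org.example.activitystreams (the package name and file names are illustrative, not part of the schema):

    // Generate the bindings first, e.g.:
    //   xjc -p org.example.activitystreams activity-streams-1.0.xsd
    import java.io.File;
    import javax.xml.bind.JAXBContext;
    import javax.xml.bind.JAXBElement;
    import javax.xml.bind.JAXBException;
    import javax.xml.bind.Unmarshaller;

    public class ActivityStreamReader {
        public static void main(String[] args) throws JAXBException {
            // Bind the package that xjc generated from the XSD.
            JAXBContext ctx = JAXBContext.newInstance("org.example.activitystreams");
            Unmarshaller unmarshaller = ctx.createUnmarshaller();

            // Depending on how xjc generated the root class, this is either the
            // feed object itself or a JAXBElement wrapping it.
            Object result = unmarshaller.unmarshal(new File("activity-feed.xml"));
            if (result instanceof JAXBElement) {
                result = ((JAXBElement<?>) result).getValue();
            }
            System.out.println(result);
        }
    }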

Related

NLP to find relationship between entities

My current understanding is that it's possible to extract entities from a text document using toolkits such as OpenNLP and Stanford NLP.
However, is there a way to find relationships between these entities?
For example, consider the following text:
"As some of you may know, I spent last week at CERN, the European high-energy physics laboratory where the famous Higgs boson was discovered last July. Every time I go to CERN I feel a deep sense of reverence. Apart from quick visits over the years, I was there for three months in the late 1990s as a visiting scientist, doing work on early Universe physics, trying to figure out how to connect the Universe we see today with what may have happened in its infancy."
Entities: I (author), CERN, Higgs boson
Relationships:
- I "visited" CERN
- CERN "discovered" Higgs boson
Thanks.
Yes, absolutely. This is called Relation Extraction. Stanford has developed several useful tools for working on this problem.
Here is their website: http://deepdive.stanford.edu/relation_extraction
Here is the github repository: https://github.com/philipperemy/Stanford-OpenIE-Python
In general here is how the process works.
results = extract_entity_relations("Barack Obama was born in Hawaii.")
print(results)
# [['Barack Obama', 'was born in', 'Hawaii']]
Note that only triples of the form (subject, predicate, object) are extracted.
You can extract verbs with their dependents using the Stanford Parser, for example. You might get "dependency chains" like
"I :: spent :: at :: CERN".
It is a much tougher task to recognise that "I spent at CERN", "I visited CERN" and "CERN hosted my visit" (etc.) denote the same kind of event. Going into how this can be done is beyond the scope of an SO question, but you can read up on the paraphrase recognition literature (here is one overview paper). There is also a related question on SO.
Once you can cluster similar chains, you'd need to find a way to label them. You could simply choose the verb of the most common chain in a cluster.
If, however, you have a pre-defined set of relation types you want to extract and lots of texts manually annotated for these relations, then the approach could be very different, e.g., using machine learning to learn how to recognize a relation type based on annotated data.
Don't know if you're still interested, but CoreNLP has added a new annotator called OpenIE (Open Information Extraction), which should accomplish what you're looking for. Check it out: OpenIE
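For reference, here's a minimal sketch of using that annotator from Java, following CoreNLP's published OpenIE demo (you need the CoreNLP jar and models on the classpath):

    import java.util.Collection;
    import java.util.Properties;
    import edu.stanford.nlp.ie.util.RelationTriple;
    import edu.stanford.nlp.ling.CoreAnnotations;
    import edu.stanford.nlp.naturalli.NaturalLogicAnnotations;
    import edu.stanford.nlp.pipeline.Annotation;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;
    import edu.stanford.nlp.util.CoreMap;

    public class OpenIEExample {
        public static void main(String[] args) {
            // The openie annotator relies on the annotators listed before it.
            Properties props = new Properties();
            props.setProperty("annotators", "tokenize,ssplit,pos,lemma,depparse,natlog,openie");
            StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

            Annotation doc = new Annotation("Barack Obama was born in Hawaii.");
            pipeline.annotate(doc);

            for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
                Collection<RelationTriple> triples =
                        sentence.get(NaturalLogicAnnotations.RelationTriplesAnnotation.class);
                for (RelationTriple triple : triples) {
                    // Each triple is (subject, relation, object).
                    System.out.println(triple.subjectGloss() + " | "
                            + triple.relationGloss() + " | "
                            + triple.objectGloss());
                }
            }
        }
    }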
Similar to the Stanford parser, you can also use the Google Language API, where you send a string and get a dependency tree response.
You can test this API first to see if it works well with your corpus: https://cloud.google.com/natural-language/
The outcome here is a subject-predicate-object (SPO) triplet, where the predicate describes the relationship. You'll need to traverse the dependency graph and write a script to parse out the triplet.
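A rough sketch of that traversal with the Google Cloud Natural Language Java client; the triple-assembly logic is a deliberate simplification I'm assuming for illustration (real dependency graphs need more care, e.g. for passives, prepositional objects, and conjunctions):

    import com.google.cloud.language.v1.AnalyzeSyntaxRequest;
    import com.google.cloud.language.v1.AnalyzeSyntaxResponse;
    import com.google.cloud.language.v1.DependencyEdge;
    import com.google.cloud.language.v1.Document;
    import com.google.cloud.language.v1.EncodingType;
    import com.google.cloud.language.v1.LanguageServiceClient;
    import com.google.cloud.language.v1.PartOfSpeech;
    import com.google.cloud.language.v1.Token;
    import java.util.List;

    public class SpoFromSyntax {
        public static void main(String[] args) throws Exception {
            try (LanguageServiceClient language = LanguageServiceClient.create()) {
                Document doc = Document.newBuilder()
                        .setContent("CERN discovered the Higgs boson last July.")
                        .setType(Document.Type.PLAIN_TEXT)
                        .build();
                AnalyzeSyntaxResponse response = language.analyzeSyntax(
                        AnalyzeSyntaxRequest.newBuilder()
                                .setDocument(doc)
                                .setEncodingType(EncodingType.UTF8)
                                .build());
                List<Token> tokens = response.getTokensList();

                // Naive SPO assembly: for each verb, find a token attached to it
                // with an NSUBJ edge (subject) and one with a DOBJ edge (object).
                for (int i = 0; i < tokens.size(); i++) {
                    if (tokens.get(i).getPartOfSpeech().getTag() != PartOfSpeech.Tag.VERB) continue;
                    String subj = null, obj = null;
                    for (Token t : tokens) {
                        DependencyEdge edge = t.getDependencyEdge();
                        if (edge.getHeadTokenIndex() != i) continue;
                        if (edge.getLabel() == DependencyEdge.Label.NSUBJ) subj = t.getText().getContent();
                        if (edge.getLabel() == DependencyEdge.Label.DOBJ) obj = t.getText().getContent();
                    }
                    if (subj != null && obj != null) {
                        System.out.println(subj + " | " + tokens.get(i).getLemma() + " | " + obj);
                    }
                }
            }
        }
    }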
There are many ways to do relation extraction. As others have mentioned, you need to know about NER and coreference resolution. Different techniques require different approaches. Nowadays, distant supervision is the most common approach; for detecting relations between entities, Freebase has typically been used as the source of known relation instances.

List of all questions along with tags on StackOverflow (for NLP tasks)

Since StackOverflow comes with a wealth of questions and user-contributed tags, I am looking at it as an interesting, richly annotated text corpus for NLP (natural language processing) tasks.
Basically, I want to automatically predict question tags based on the question's body. I am sure this can be done to a certain extent, and there are a number of nice use cases, such as tag suggestions (e.g. to make tag usage more consistent), to name just one.
For this I would need a lot of questions - or even better, all of them - along with their body text and user tags, to train a tag predictor with machine learning algorithms.
I know there's the StackOverflow API, but the amount of data I can fetch through it seems to be very limited - for good reasons of course.
So the question is: Is there a way to fetch/download all questions along with their user-tags from StackOverflow?
You can get the data dump at http://www.clearbits.net/torrents/2076-aug-2012. It omits the meta sites, a minor oversight that has since been fixed in an alternate release, but that doesn't affect your request.
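Once you have the dump, the questions are in Posts.xml, one <row> element per post, with the question body and tags stored as attributes (attribute names below follow the public dump schema; the file path is illustrative). A streaming parser keeps memory use flat on the multi-gigabyte file - a minimal sketch:

    import java.io.FileInputStream;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamReader;

    public class PostsDumpReader {
        public static void main(String[] args) throws Exception {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            XMLStreamReader reader =
                    factory.createXMLStreamReader(new FileInputStream("Posts.xml"));
            while (reader.hasNext()) {
                if (reader.next() != XMLStreamConstants.START_ELEMENT
                        || !"row".equals(reader.getLocalName())) {
                    continue;
                }
                // PostTypeId 1 = question; answers (2) carry no Tags attribute.
                if (!"1".equals(reader.getAttributeValue(null, "PostTypeId"))) {
                    continue;
                }
                String body = reader.getAttributeValue(null, "Body");   // HTML of the question
                String tags = reader.getAttributeValue(null, "Tags");   // e.g. "<java><xml>"
                System.out.println(tags + " -> " + body.length() + " chars");
            }
            reader.close();
        }
    }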

Open source projects for email scrubbing generating structured data from unstructured source?

Don't know where to start on this one, so hopefully you guys can help me clear up my question. I have a project where emails will be searched for specific words/patterns and the results stored in a structured manner, similar to what TripIt does.
The article states that they developed a DataMapper
The DataMapper is responsible for taking inbound email messages
addressed to plans [at] tripit.com and transforming them from the
semi-structured format you see in your mail reader into a highly
structured XML document.
There is a comment that also states
If you're looking to build this yourself, reading a little bit about
Wrappers and Wrapper Induction might be helpful
I Googled and read about wrapper induction, but it was just too broad a definition and didn't help me understand how one would go about solving such a problem.
Is there some open source project out there that does similar things?
There are a couple of different ways and things you can do to accomplish this.
The first part, getting access to the email content, I won't answer here. Basically, I'll assume that you have access to the text of the emails; if you don't, there are libraries such as Apache Camel (http://camel.apache.org/mail.html) that let you connect Java to a mailbox.
So now you've got the email so then what?
A handy thing that could help is that LingPipe (http://alias-i.com/lingpipe/) has an entity recognizer that you can populate with your own terms. Specifically, look at some of their extraction tutorials and their dictionary extractor (http://alias-i.com/lingpipe/demos/tutorial/ne/read-me.html). With the LingPipe dictionary extractor (http://alias-i.com/lingpipe/docs/api/com/aliasi/dict/ExactDictionaryChunker.html) you simply import the terms you're interested in and use them to associate labels with an email.
You might also find the following question helpful: Dictionary-Based Named Entity Recognition with zero edit distance: LingPipe, Lucene or what?
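A minimal sketch of that dictionary approach, following LingPipe's tutorial API (the terms, categories, and sample text are made up for illustration):

    import com.aliasi.chunk.Chunk;
    import com.aliasi.chunk.Chunking;
    import com.aliasi.dict.DictionaryEntry;
    import com.aliasi.dict.ExactDictionaryChunker;
    import com.aliasi.dict.MapDictionary;
    import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;

    public class EmailTermTagger {
        public static void main(String[] args) {
            // Populate the dictionary with the terms you care about.
            MapDictionary<String> dictionary = new MapDictionary<String>();
            dictionary.addEntry(new DictionaryEntry<String>("flight confirmation", "BOOKING", 1.0));
            dictionary.addEntry(new DictionaryEntry<String>("JFK", "AIRPORT", 1.0));

            ExactDictionaryChunker chunker = new ExactDictionaryChunker(
                    dictionary,
                    IndoEuropeanTokenizerFactory.INSTANCE,
                    true,    // return all matches
                    false);  // case-insensitive

            String email = "Your flight confirmation: departing from JFK on Monday.";
            Chunking chunking = chunker.chunk(email);
            for (Chunk chunk : chunking.chunkSet()) {
                // Each chunk carries the category label and its character span.
                System.out.println(chunk.type() + ": "
                        + email.substring(chunk.start(), chunk.end()));
            }
        }
    }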
Really a very broad question, but I can try to give you some general ideas, which might be enough to get started. Basically, it sounds like you're describing an elaborate parsing problem: scanning through the text and trying to attach meaning to specific chunks. Depending on what exactly you're looking for, you might get good mileage out of a few regular expressions to start - things like phone numbers, email addresses, and dates have fairly standard structures that can be matched. Other data points might benefit from indicator words - the phrase "departing from" might signal that what follows is an address. The natural language processing community also has a large tool set available for text processing - check out part-of-speech taggers and semantic analyzers if they're appropriate to what you're trying to do.
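For the regular-expression route, here's a small sketch (the patterns and the "departing from" indicator are illustrative and would need tuning against real mail):

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class EmailFieldExtractor {
        // Simplistic patterns: enough to start iterating on, not production-grade.
        private static final Pattern EMAIL = Pattern.compile("[\\w.+-]+@[\\w.-]+\\.[A-Za-z]{2,}");
        private static final Pattern DATE  = Pattern.compile("\\b\\d{1,2}/\\d{1,2}/\\d{2,4}\\b");
        private static final Pattern FROM  = Pattern.compile("departing from\\s+([A-Z][\\w ]+)",
                Pattern.CASE_INSENSITIVE);

        public static void main(String[] args) {
            String text = "Hi John, your trip on 08/14/2013 is confirmed, "
                    + "departing from Boston Logan. Questions? Reply to plans@example.com.";
            print("email", EMAIL.matcher(text));
            print("date", DATE.matcher(text));
            print("origin", FROM.matcher(text));
        }

        private static void print(String label, Matcher m) {
            while (m.find()) {
                // Use the capture group when the pattern has one (the indicator-word case).
                System.out.println(label + ": " + (m.groupCount() > 0 ? m.group(1) : m.group()));
            }
        }
    }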
Armed with those techniques, you can follow a basic iterative development process: for each data point in your expected output structure, define some simple rules for how to capture it. Then, run the application over a batch of test data and see for which samples that datum wasn't captured. Look at those samples and revise your rules to catch them. Repeat until the extractor reaches an acceptable level of accuracy.
Depending on the specifics of your problem, there may be machine learning techniques that can automate much of that process for you.

about semantic search

I am a "rookie" in Semantic Web. So a lot things confuse me right now. I am going to make a semantic web search in website. But I am not sure what should be the workflow of that?
I just have basic opinion.
Please correct me
1. Use a web spider to get web resources, and put those resources in files.
2. Parse those resource files (lexical analysis) and use the RDF format to describe the resources (at this point, the RDF contains the ontologies that describe the resources).
3. Parse the RDF files (containing the resources) and use OWL (combined with an inference mechanism) to describe the ontologies in the RDF files.
4. Semantically analyze the user input (from the search text box), match it against the OWL files, then match it against the RDF resource files, and return the related results.
Please give me suggestions and correct me.
See this resource for your engine.
You should learn to search for and use existing resources (ontologies and, more generally, APIs) that let you reuse semantic annotations on data (Linked Data, see here). Also, if you fetch web resources, don't just copy them into files; reference the origin instead, because copying breaks the semantics of the links. Knowledge evolves over time...
Regarding the semantic analysis: it can be a difficult task. Before you start implementing it yourself, check whether there is an existing API out there that fits the bill.
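To make steps 2-4 above a bit more concrete: once your resources are described in RDF, a toolkit such as Apache Jena can load them and answer SPARQL queries built from the user's search input. A minimal sketch (the file name, vocabulary, and keyword are illustrative):

    import org.apache.jena.query.Query;
    import org.apache.jena.query.QueryExecution;
    import org.apache.jena.query.QueryExecutionFactory;
    import org.apache.jena.query.QueryFactory;
    import org.apache.jena.query.QuerySolution;
    import org.apache.jena.query.ResultSet;
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;

    public class SemanticSearch {
        public static void main(String[] args) {
            // Load the RDF produced in step 2 (format detected from the file extension).
            Model model = ModelFactory.createDefaultModel();
            model.read("resources.ttl");

            // A query that would be built from the user's search input in step 4.
            String sparql =
                "PREFIX dc: <http://purl.org/dc/elements/1.1/> " +
                "SELECT ?resource ?title WHERE { ?resource dc:title ?title . " +
                "  FILTER(CONTAINS(LCASE(?title), \"keyword\")) }";

            Query query = QueryFactory.create(sparql);
            try (QueryExecution qexec = QueryExecutionFactory.create(query, model)) {
                ResultSet results = qexec.execSelect();
                while (results.hasNext()) {
                    QuerySolution row = results.next();
                    System.out.println(row.get("resource") + " - " + row.get("title"));
                }
            }
        }
    }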

Displaying OWL file in web page

Currently, I'm developing an ontology related to diseases using Protege. When I save the file, it is saved as an OWL file serialized as XML. Now, I would like to know how to use the OWL file from my website. I want to build a website that allows users to ask questions related to diseases, with the answers coming from the ontology I've created. Can anybody enlighten me on this matter?
Essentially you're looking for a tool that will let your users query your ontology. SPARQL is an RDF query language that can achieve what you desire but I cannot recommend any of the available SPARQL query builders as I have no experience with them.
I am developing something similar. Right now I am looking at using Java and the OWL API (http://owlapi.sourceforge.net/index.html).
There is a PHP tool that I will look into as well (http://arc.semsol.org/).
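If you go the Java/OWL API route, loading the ontology Protege saved and walking its classes looks roughly like this (the file name is illustrative, and this only enumerates classes - to actually answer users' questions you would put a reasoner or a SPARQL engine behind it):

    import java.io.File;
    import org.semanticweb.owlapi.apibinding.OWLManager;
    import org.semanticweb.owlapi.model.OWLClass;
    import org.semanticweb.owlapi.model.OWLOntology;
    import org.semanticweb.owlapi.model.OWLOntologyManager;

    public class DiseaseOntologyLoader {
        public static void main(String[] args) throws Exception {
            // Load the RDF/XML file that Protege saved.
            OWLOntologyManager manager = OWLManager.createOWLOntologyManager();
            OWLOntology ontology =
                    manager.loadOntologyFromOntologyDocument(new File("diseases.owl"));

            // List every class (e.g. each disease) declared in the ontology.
            for (OWLClass cls : ontology.getClassesInSignature()) {
                System.out.println(cls.getIRI());
            }
        }
    }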
You could always publish your ontology on freebase.com, and use their SPARQL endpoint.
A good place to get started on testing SPARQL out is http://dbpedia.org/snorql/
