How to get DBpedia ontology classes as JSON or JSON-LD?

I would like to get class information from DBpedia in a machine-readable format (like JSON).
For example, how do I get this page as JSON or JSON-LD?
I did find this JSON file, but it doesn't contain the properties.

Why not use the ontology file?
It is published alongside each release in .nt and .owl formats. For the 2016-10 release:
http://downloads.dbpedia.org/2016-10/dbpedia_2016-10.owl
http://downloads.dbpedia.org/2016-10/dbpedia_2016-10.nt
Then just query for http://dbpedia.org/ontology/Galaxy
Something like:
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?domain ?prop
WHERE {
  # ? is the zero-or-one path operator (equivalent to the draft {0,1} quantifier)
  ?class rdfs:subClassOf? ?domain .
  ?prop rdfs:domain ?domain .
  FILTER(?class = <http://dbpedia.org/ontology/Galaxy>)
}
UPDATE
For JSON output, append &format=json or &format=json-ld to the SPARQL endpoint's query URL.
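For scripted access, here is a minimal Python sketch (assuming the public endpoint at https://dbpedia.org/sparql and the third-party requests library; error handling is kept to a raise_for_status):

import requests

# Public DBpedia SPARQL endpoint (assumed)
ENDPOINT = 'https://dbpedia.org/sparql'

QUERY = '''
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?domain ?prop
WHERE {
  ?class rdfs:subClassOf? ?domain .
  ?prop rdfs:domain ?domain .
  FILTER(?class = <http://dbpedia.org/ontology/Galaxy>)
}
'''

# format=json asks the endpoint for SPARQL JSON results
response = requests.get(ENDPOINT, params={'query': QUERY, 'format': 'json'})
response.raise_for_status()

for binding in response.json()['results']['bindings']:
    print(binding['domain']['value'], binding['prop']['value'])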

Related

Reading and writing an XML file using ADF lookup activity

We need to read a file and post an XML payload to an HTTP endpoint via Azure Data Factory (ADF). We have the XML file in our blob storage, and we are using a Lookup activity to read it. We plan to put a Web activity after that to post it to the HTTP endpoint. However, the Lookup activity does not support XML output. Is there a way to read a file and send it in XML format to the next activity in Azure Data Factory?
You can use the xml() function, one of the conversion functions supported in ADF expressions.
Check out the MS docs: xml function
Return the XML version for a string that contains a JSON object.
xml('<value>')
Parameter:
The string with the JSON object to convert. The JSON object must have only one root property, which can't be an array. Use the backslash character (\) as an escape character for the double quotation mark (").
Example:
A sample source XML file is stored in blob storage.
Solution:
@string(xml(json(string(activity('Lookup1').output.value[0]))))
Now you can either store this as a string in a variable or use it directly as a dynamic expression in a Web activity payload:
@xml(json(string(activity('Lookup1').output.value[0])))
Check out the MS docs for more: xml function examples
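If you want to sanity-check what the xml(json(...)) chain does outside ADF, a rough local equivalent in Python (using the third-party xmltodict package; the payload below is a made-up stand-in for the Lookup output) might be:

import json
import xmltodict

# Hypothetical stand-in for activity('Lookup1').output.value[0]
lookup_output = '{"value": {"name": "test", "id": "1"}}'

# JSON string -> dict -> XML string, roughly what xml(json(...)) does in ADF
as_dict = json.loads(lookup_output)
xml_string = xmltodict.unparse(as_dict)
print(xml_string)

Note that, like the ADF xml() function, xmltodict.unparse() requires the object to have exactly one root property.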

Use recursive globbing to extract XML documents as strings in pyspark

The goal is to extract XML documents, given an XPath expression, from a group of text files as strings. The difficulty is the variety of forms the text files may take. They might be:
a single zip / tar file with 100 files, each containing 1 XML document
one file with 100 XML documents (an aggregate document)
one zip / tar file, with varying levels of directories, containing single XML records as files as well as aggregate XML files
I thought I had found a solution with Databricks' Spark-XML library, as it handles recursive globbing when reading files. It was amazing. I could do things like:
# read a directory of loose files
df = sqlContext.read.format('com.databricks.spark.xml') \
    .options(rowTag='mods:mods') \
    .load('file:///tmp/combine/qs/mods/*.xml')

# recursively discover and parse
df = sqlContext.read.format('com.databricks.spark.xml') \
    .options(rowTag='mods:mods') \
    .load('file:///tmp/combine/qs/**/*.xml')

# even read archive files without additional work
df = sqlContext.read.format('com.databricks.spark.xml') \
    .options(rowTag='mods:mods') \
    .load('file:///tmp/combine/mods_archive.tar')
The problem: this library is focused on parsing the XML records into DataFrame columns, whereas my goal is to retrieve just the XML documents as strings for storage.
My Scala is not strong enough to easily hack at the Spark-XML library so that it keeps the recursive globbing and XPath grabbing of documents but skips the parsing and instead saves the entire XML record as a string.
The library comes with the ability to serialize DataFrames to XML, but the serialization is decidedly different from the input (which is to be expected to some degree). For example, element text values become element attributes. Given the following original XML:
<mods:role>
  <mods:roleTerm authority="marcrelator" type="text">creator</mods:roleTerm>
</mods:role>
reading and then serializing with Spark-XML returns:
<mods:role>
  <mods:roleTerm VALUE="creator" authority="marcrelator" type="text"></mods:roleTerm>
</mods:role>
However, even if I could get the VALUE to be serialized as an actual element value, I still wouldn't be achieving my end goal of having these XML documents, discovered and read via Spark-XML's excellent globbing and XPath selection, as just strings.
Any insight would be appreciated.
Found a solution from this Databricks Spark-XML issue:
xml_rdd = sc.newAPIHadoopFile(
    'file:///tmp/mods/*.xml',
    'com.databricks.spark.xml.XmlInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'org.apache.hadoop.io.Text',
    conf={
        'xmlinput.start': '<mods:mods>',
        'xmlinput.end': '</mods:mods>',
        'xmlinput.encoding': 'utf-8'
    }
)
Expecting 250 records, I got 250 records: a simple RDD with the entire XML record as a string:
In [8]: xml_rdd.first()
Out[8]:
(4994,
'<mods:mods xmlns:mets="http://www.loc.gov/METS/" xmlns:xl="http://www.w3.org/1999/xlink" xmlns:mods="http://www.loc.gov/mods/v3" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.openarchives.org/OAI/2.0/" version="3.0">\n\n\n <mods:titleInfo>\n\n\n <mods:title>Jessie</mods:title>\n\n\n...
...
...
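Each element is a (byte offset, document text) pair, so keeping only the strings (and optionally persisting them) is a one-map affair. A sketch, with an illustrative output path; note that multi-line records will span several lines in plain text files:

# Drop the byte-offset key, keeping only the raw XML string
xml_strings = xml_rdd.map(lambda pair: pair[1])

# Persist the documents as text part-files (path is illustrative)
xml_strings.saveAsTextFile('file:///tmp/combine/mods_strings')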
Credit to the Spark-XML maintainer(s) for a wonderful library, and attentiveness to issues.

Error in U-SQL Job on Azure Data Lake

I have lots of json files in my Azure Data Lake account. They are organized as: Archive -> Folder 1 -> JSON Files.
What I want to do is extract a particular field, timestamp, from each JSON file and then just put it in a CSV file.
My issue is:
I started with this script:
CREATE ASSEMBLY IF NOT EXISTS [Newtonsoft.Json] FROM "correct_path/Assemblies/JSON/Newtonsoft.Json.dll";
CREATE ASSEMBLY IF NOT EXISTS [Microsoft.Analytics.Samples.Formats] FROM "correct_path/Assemblies/JSON/Microsoft.Analytics.Samples.Formats.dll";

REFERENCE ASSEMBLY [Newtonsoft.Json];
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];

DECLARE @INPUT_FILE string = @"correct_path/Tracking_3e9.json";

//Extract the different properties from the Json file using a JsonExtractor
@json =
    EXTRACT Partition string, Custom string
    FROM @INPUT_FILE
    USING new Microsoft.Analytics.Samples.Formats.Json.JsonExtractor();

OUTPUT @json
TO "correct_path/Output/simple.csv"
USING Outputters.Csv(quoting : false);
I get the error:
E_STORE_USER_FILENOTFOUND: File not found or access denied
But I do have access to the file in the data explorer of the Azure Data Lake, so how can it be?
I don't want to run it for each file one by one. I just want to give it all the files in a folder (like Tracking*.json) or a bunch of folders (like Folder*), and it should go through them and put the output for each file in a single row of the output CSV.
I haven't found any tutorials on this.
Right now, I am reading the entire JSON. How do I read just one field, like timestamp, which is a field within another field, as in data: {timestamp: "xxx"}?
Thanks for your help.
1) Not sure why you're running into that error without more information - are you specifically missing the input file or is it the assemblies?
2) You can use a fileset to extract data from a set of files. Just use {} to denote the wildcard in your input string, and then save that value in a new column. For example, your input string could be @"correct_path/{day}/{hour}/{id}.json", and then your extract statement becomes:
@json =
    EXTRACT
        column1 string,
        column2 string,
        day int,
        hour int,
        id int
    FROM @input
    USING new Microsoft.Analytics.Samples.Formats.Json.JsonExtractor();
3) You'll have to read the entire JSON in your SELECT statement, but you can refine it down to only the rows you want in future rowsets. For example:
@refine =
    SELECT timestamp FROM @json;

OUTPUT @refine
...
It sounds like some of your JSON data is nested, however (like the timestamp field). You can find information on our GitHub (Using the JSON UDFs) and in this blog on how to read nested JSON data.
Hope this helps, and please let me know if you have additional questions!
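As a language-neutral illustration of the nested access described above (not U-SQL), here is how picking data.timestamp out of a set of JSON files and writing one CSV row per file looks in plain Python; the paths and field names mirror the question and are otherwise assumptions:

import csv
import glob
import json

# Gather every Tracking*.json under every Folder* directory (pattern from the question)
paths = glob.glob('Archive/Folder*/Tracking*.json')

with open('simple.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    for path in paths:
        with open(path) as f:
            obj = json.load(f)
        # timestamp is nested one level down, as in data: {timestamp: "xxx"}
        writer.writerow([path, obj['data']['timestamp']])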

Python: Universal XML parser

I'm trying to make a simple Python 3 program that reads weather information from an XML web source, converts it into a Python-readable object (maybe a dictionary), and processes it (for example, visualizes multiple observations in a graph).
The data source is the national weather service's (direct translation) XML file, at the links provided in the code.
What makes this different from the typical XML parsing question on Stack Overflow is that there are repeated tags without an in-tag identifier (the <station> tags in my example) and some with one (first line, <observations timestamp="14568.....">). Also, I would like to try parsing it straight from the website, not from a local file. Of course, I could create a temporary local file too.
What I have so far is simply a loading script that gives strings containing the XML code for both the forecast and the latest weather observations.
from urllib.request import urlopen

# Read the 4-day forecast
forecast = urlopen("http://www.ilmateenistus.ee/ilma_andmed/xml/forecast.php").read().decode("iso-8859-1")

# Get the current weather observations
observ = urlopen("http://www.ilmateenistus.ee/ilma_andmed/xml/observations.php").read().decode("iso-8859-1")
In short, I'm looking for as universal a way as possible to parse XML into a Python-readable object (such as a dictionary/JSON or a list) while preserving all of the information in the XML file.
P.S. I would prefer a standard Python 3 module such as xml, which I didn't manage to understand.
Try the xmltodict package for simple conversion of an XML structure to a Python dict: https://github.com/martinblech/xmltodict
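A minimal sketch against the observations feed above (the element names are assumed from the question's description; xmltodict turns repeated <station> tags into a list automatically and exposes attributes as '@'-prefixed keys):

import xmltodict
from urllib.request import urlopen

xml_text = urlopen("http://www.ilmateenistus.ee/ilma_andmed/xml/observations.php").read().decode("iso-8859-1")

# Parse the whole document into nested dicts/lists, preserving all attributes and text
doc = xmltodict.parse(xml_text)

# <observations timestamp="..."> becomes doc['observations']['@timestamp']
observations = doc['observations']
print(observations['@timestamp'])

# The repeated <station> tags come back as a list of dicts
for station in observations['station']:
    print(station)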

Is it possible to read a standard etherpad as a text document over the api?

I have a public etherpad containing a YAML file. Using PHP, I would like to read this YAML and convert it to a JSON string.
There are some great libraries for converting YAML to JSON, for example:
https://github.com/mustangostang/spyc/
What I'm looking for is a URL that will return the contents of a pad as plain text.
You want the text export, then. You can easily get the contents of a pad exported as raw text using the following URL:
//<yourdomain>.com/p/<yourpad>/export/txt
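The question is in PHP (spyc), but the flow is the same in any language: fetch the export URL, parse the YAML, dump JSON. A Python sketch with the third-party PyYAML package, using placeholder domain and pad names:

import json
from urllib.request import urlopen

import yaml  # PyYAML

# Placeholder URL; substitute your own domain and pad name
url = 'https://<yourdomain>.com/p/<yourpad>/export/txt'

yaml_text = urlopen(url).read().decode('utf-8')

# YAML -> Python objects -> JSON string
data = yaml.safe_load(yaml_text)
print(json.dumps(data))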
