Workflow for interpreting linked data in .ttl files with Python RDFLib - python-3.x

I am using turtle files containing biographical information for historical research. Those files are provided by a major library and most of the information in the files is not explicit. While people's professions, for instance, are sometimes stated alongside links to the library's URIs, I only have URIs in the majority of cases. This is why I will need to retrieve the information behind them at some point in my workflow, and I would appreciate some advice.
I want to use Python's RDFLib for parsing the .ttl files. What is your recommended workflow? Should I read the prefixes I am interested in first, then store the results in .txt (?) and then write a script to retrieve the actual information from the web, replacing the URIs?
I have also seen that there are ways to convert RDFs directly to CSV, but although CSV is nice to work with, I would get a lot of unwanted "background noise" by simply converting all the data.
What would you recommend?

RDFLib is all about working with RDF data. If you have RDF data, my suggestion is to do as much RDF-native work as you can, and only export to CSV when you want to do something like print tabular results or load them into pandas DataFrames. Of course there is always more than one way to do things, so you could manipulate the data as CSV, but RDF, by design, carries far more information than a CSV file can, so working on the RDF directly gives you more to get hold of.
most of the information in the files is not explicit
Better phrased: most of the information is indicated with objects identified by URIs, not given as literal values.
I want to use Python's RDFLib for parsing the .ttl files. What is your recommended workflow? Should I read the prefixes I am interested in first, then store the results in .txt (?) and then write a script to retrieve the actual information from the web, replacing the URIs?
No! Store the .ttl files you can get, and then by all means retrieve the other data referred to by URI. Presumably that data is also available as RDF, so download it into the same graph you loaded the initial .ttl files into. You then have the full graph, with both the links and the literal values, at your disposal to manipulate with SPARQL queries.
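A minimal sketch of that workflow with RDFLib, assuming the linked URIs dereference to RDF (the file name, the example-library.org namespace and the predicate below are placeholders, not taken from the question):

# Load local Turtle files into one graph, fetch the remote RDF behind the
# linked object URIs into the same graph, then query the combined graph.
# "biographies.ttl" and the example-library.org URIs are hypothetical.
from rdflib import Graph, URIRef

g = Graph()
g.parse("biographies.ttl", format="turtle")

# Dereference object URIs from a namespace you care about; many authority
# files answer with RDF via content negotiation.
for obj in set(g.objects()):
    if isinstance(obj, URIRef) and str(obj).startswith("https://example-library.org/"):
        try:
            g.parse(str(obj))
        except Exception as exc:
            print(f"Could not fetch {obj}: {exc}")

# Now links and literal labels sit in one graph and can be queried together.
for row in g.query("""
    SELECT ?person ?label WHERE {
        ?person <https://example-library.org/vocab/profession> ?prof .
        ?prof <http://www.w3.org/2000/01/rdf-schema#label> ?label .
    }
"""):
    print(row.person, row.label)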

Related

How to parse big XML in google cloud function efficiently?

I have to extract data from XML files with the size of several hundreds of MB in a Google Cloud Function and I was wondering if there are any best practices?
Since I am used to Node.js, I was looking at some popular libraries like fast-xml-parser, but it seems cumbersome if you only want specific data from a huge XML. I am also not sure whether there are performance issues when the XML is too big. Overall this does not feel like the best solution for parsing and extracting data from huge XMLs.
Then I was wondering if I could use BigQuery for this task, where I simply convert the XML to JSON and throw it into a dataset where I can then use a query to retrieve the data I want.
Another solution could be to use Python for the job, since it is good at parsing and extracting data from XML. Even though I have no experience in Python, I was wondering if this path could still be the best solution?
If anything above does not make sense or if one solution is preferable to the other or if anyone can share any insights I would highly appreciate it!
I suggest you check this article, in which they discuss how to load XML data into BigQuery using Python Dataflow. I think this approach may work in your situation.
Basically, what they suggest is:
Parse the XML into a Python dictionary using the package xmltodict (a short sketch follows below).
Specify a schema for the output table in BigQuery.
Use a Beam pipeline to take an XML file and use it to populate a BigQuery table.
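For the first of those steps, a minimal sketch with xmltodict might look like this (the file name and element names are invented for illustration, not taken from the article):

# Parse an XML document into a Python dictionary and keep only the fields
# to be loaded into BigQuery. "export.xml", "catalog", "item" and the
# field names are hypothetical.
import xmltodict

with open("export.xml", "rb") as f:
    doc = xmltodict.parse(f)

items = doc["catalog"]["item"]
rows = [{"id": item["@id"], "name": item["name"]} for item in items]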

Is there an equivalent to SAS sas7bdat table files in Python?

Is there a Python equivalent to reading and writing tabular files such as SAS sas7bdat files?
My team is moving away from SAS and we'd like to replicate the SAS process in Python with our methodology as follows:
1) Pull data from various sources i.e. Excel, CSV, DBs etc.
2) Update our Data Warehouse with the new information and export this data as a Python table file (to be used next)
3) Rather than pulling data from our warehouse (super slow) we'd like to read in those Python table files and then do some data matching on a bigger set of data.
We're trying to avoid using the sas7bdat (SASPy) files altogether, since we won't have SAS for much longer.
Any advice or insights are greatly appreciated!
Unlike SAS, Python doesn't have a native data format. However, there are modules that implement binary protocols for serializing and de-serializing Python objects. Consider using the HDF5 format to save and read files (https://www.h5py.org/). Another possibility is Pickle (https://docs.python.org/3/library/pickle.html).
Parquet is also worth considering.
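If the tables end up in pandas DataFrames anyway, a small sketch of writing and reading them in these formats could look like this (column names and file names are made up; Parquet needs pyarrow or fastparquet installed, and to_hdf needs PyTables):

# Persist a DataFrame in each of the suggested formats and read one back.
# The data and file names here are purely illustrative.
import pandas as pd

df = pd.DataFrame({"customer_id": [1, 2], "balance": [100.0, 250.5]})

df.to_parquet("warehouse_extract.parquet")        # Parquet
df.to_hdf("warehouse_extract.h5", key="extract")  # HDF5 (via PyTables)
df.to_pickle("warehouse_extract.pkl")             # Pickle

same_df = pd.read_parquet("warehouse_extract.parquet")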

How to collect RDF triples for a simple knowledge graph?

When building a knowledge graph, the first step (if I understand it correctly), is to collect structured data, mainly RDF triples written by using some ontology, for example, Schema.org. Now, what is the best way to collect these RDF triples?
It seems there are two things we can do:
Use a crawler to crawl the web content and, for a specific page, search for RDF triples on that page. If we find them, collect them. If not, move on to the next page.
For the current page, instead of looking for existing RDF triples, use some NLP tools to understand the page content (such as using NELL, see http://rtw.ml.cmu.edu/rtw/).
Now, is my understanding above (basically/almost) correct? If so, why do we use NLP at all? Why not just rely on the existing RDF triples? It seems like NLP is not as good/reliable as we are hoping… I could be completely wrong.
Here is another attempt at asking the same question.
Let us say we want to create RDF triples using the third method mentioned by #AKSW, i.e., extracting RDF triples from some web pages (text).
For example, this page. If you open it and use "view source", you can see quite a few semantic mark-ups there (using OGP and Schema.org). So my crawler can simply do this: ONLY crawl/parse these mark-ups, easily turn them into RDF triples, then declare success and move on to the next page.
So what the crawler has done on this text page is very simple: it only collects the semantic markup and creates RDF triples from it. It is simple and efficient.
The other choice is to use NLP tools to automatically extract structured semantic data from the same text (maybe we are not satisfied with the existing markup). Once we extract the structured information, we then create RDF triples from it. This is obviously a much harder thing to do, and we are not sure about its accuracy either.
What is the best practice here, and what are the pros/cons? I would prefer the easy/simple way: simply collect the existing markup and turn it into RDF content, instead of using NLP tools.
I am not sure how many people would agree with this. Is this the best practice? Or is it simply a question of how far our requirements lead us?
Your question is unclear, because you did not state your data source, and all the answers on this page assumed it to be web markup. This is not necessarily the case, because if you are interested in structured data published according to best practices (called Linked Data), you can use so-called SPARQL endpoints to query Linked Open Data (LOD) datasets and generate your knowledge graph via federated queries.
If you want to collect structured data from website markup, you have to parse the markup to find and retrieve lightweight annotations written in RDFa, HTML5 Microdata, or JSON-LD. The availability of such annotations may be limited on a large share of websites, but for structured data expressed in RDF you should not use NLP at all, because RDF statements are machine-interpretable and easier to process than unstructured data, such as textual website content.
The best way to create the triples you referred to depends on what you are trying to achieve.
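For the Linked Data route, a minimal sketch of querying a public SPARQL endpoint from Python (DBpedia is used only as an example endpoint, and the query is illustrative):

# Collect triples for a knowledge graph from a Linked Open Data SPARQL
# endpoint instead of crawling markup. Endpoint and query are example choices.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT ?person ?birthPlace WHERE {
        ?person a dbo:Person ;
                dbo:birthPlace ?birthPlace .
    } LIMIT 10
""")
sparql.setReturnFormat(JSON)

for binding in sparql.query().convert()["results"]["bindings"]:
    print(binding["person"]["value"], binding["birthPlace"]["value"])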

Generating different datasets from live dbpedia dump

I was playing around with the different datasets provided at the dbpedia download page and found that it is kind of outdated.
Then I downloaded the latest dump from the dbpedia live site. When I extracted the June 30th file, I just got one huge 37GB .nt file.
I want to get different datasets (like the different .nt files available at the download page) from the latest dump. Is there a script or process to do it?
Solution 1:
You can use the DBpedia live extraction framework: https://github.com/dbpedia/extraction-framework.
You need to configure the proper extractors (e.g. the infobox properties extractor, the abstract extractor, etc.). It will download the latest Wikipedia dumps and generate the DBpedia datasets.
You may need to make some code changes to get only the required data. One of my colleagues did this for the German datasets. You will still need a lot of disk space for this.
Solution 2 (I don't know whether it is really possible or not):
Do a grep for the required properties on the datasets. You need to know the exact URIs of the properties you want to get.
For example, to get all the home pages:
bzgrep 'http://xmlns.com/foaf/0.1/homepage' dbpedia_2013_03_04.nt.bz2 >homepages.nt
This will give you all the N-Triples with homepages. You can then load them into an RDF store.
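For example, loading the filtered file into an in-memory RDFLib graph (a sketch; for very large files a persistent triple store is a better fit):

# Load the grep-filtered N-Triples file into an rdflib graph for querying.
from rdflib import Graph

g = Graph()
g.parse("homepages.nt", format="nt")
print(len(g), "triples loaded")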

How should I load the contents of a .txt file to serve on a website?

I am trying to build excerpts for each document returned as a search result on my website. I am using the Sphinx search engine and the Apache web server on Linux CentOS. The function within the Sphinx API that I'd like to use is called BuildExcerpts. This function requires you to pass an array of strings where each string contains a document's contents.
I'm wondering what the best practice is for retrieving the document contents in real time as I serve the results on the web. Currently, these documents are in text files on my system, spread across multiple drives. There are roughly 100MM of them and they take up a few terabytes of space.
It's easy for me to call something like file_get_contents(), but that feels like the wrong way to do this. My databases are already gigantic (100GB+) and I don't particularly want to throw the document contents in there along with the document attributes that already exist. Perhaps that is the best way to do this, however.
Suggestions?
Well, the source needs to be fetched from somewhere. If you don't want to duplicate it in your database, then you will need to fetch it from the filesystem (using file_get_contents or similar).
The BuildExcerpts function does, however, give you one extra option, "load_files";
with that set, Sphinx will read the data from the filenames for you.
What problem are you experiencing with reading it from files? Is it too slow? If so, maybe put some caching in front, using memcache perhaps.
