Generating different datasets from the live DBpedia dump

I was playing around with the different datasets provided on the DBpedia download page and found that they are somewhat outdated.
I then downloaded the latest dump from the DBpedia Live site. When I extracted the June 30th file, I got one huge 37 GB .nt file.
I want to generate different datasets (like the different .nt files available on the download page) from the latest dump. Is there a script or process to do this?

Solution 1:
You can use the DBpedia live extraction framework: https://github.com/dbpedia/extraction-framework.
You need to configure the proper extractors (e.g. the infobox properties extractor, the abstract extractor, etc.). It will download the latest Wikipedia dumps and generate the DBpedia datasets.
You may need to make some code changes to get only the required data. One of my colleagues did this for the German datasets. You still need a lot of disk space for this.
Solution 2 (I don't know whether this is really feasible):
Grep the dump for the required properties. You need to know the exact URIs of the properties you want.
For example, to get all the home pages:
bzgrep 'http://xmlns.com/foaf/0.1/homepage' dbpedia_2013_03_04.nt.bz2 > homepages.nt
This gives you all the N-Triples with home pages. You can then load that file into your RDF store.
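If you need several such datasets from the same dump, a rough Python sketch of the same idea (the predicate URIs, input file name and output file names below are examples only, not the official dataset names) could stream the big .nt file once and split it by predicate:

# Split a large N-Triples dump into per-predicate files in a single pass.
# Predicate URIs and file names are examples only.
datasets = {
    "<http://xmlns.com/foaf/0.1/homepage>": "homepages.nt",
    "<http://dbpedia.org/ontology/abstract>": "long_abstracts.nt",
}
outputs = {pred: open(path, "w", encoding="utf-8") for pred, path in datasets.items()}

with open("dbpedia_live_2013_06_30.nt", encoding="utf-8") as dump:
    for line in dump:
        parts = line.split(" ", 2)  # subject, predicate, rest of the triple
        if len(parts) == 3 and parts[1] in outputs:
            outputs[parts[1]].write(line)

for handle in outputs.values():
    handle.close()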

Related

Workflow for interpreting linked data in .ttl files with Python RDFLib

I am using turtle files containing biographical information for historical research. Those files are provided by a major library and most of the information in the files is not explicit. While people's professions, for instance, are sometimes stated alongside links to the library's URIs, I only have URIs in the majority of cases. This is why I will need to retrieve the information behind them at some point in my workflow, and I would appreciate some advice.
I want to use Python's RDFLib for parsing the .ttl files. What is your recommended workflow? Should I read the prefixes I am interested in first, then store the results in .txt (?) and then write a script to retrieve the actual information from the web, replacing the URIs?
I have also seen that there are ways to convert RDFs directly to CSV, but although CSV is nice to work with, I would get a lot of unwanted "background noise" by simply converting all the data.
What would you recommend?
RDFLib is all about working with RDF data. If you have RDF data, my suggestion is to do as much RDF-native work as you can and only export to CSV if you want to do something like print tabular results or load them into Pandas DataFrames. Of course, there is always more than one way to do things, so you could manipulate the data in CSV, but RDF, by design, carries far more information than a CSV file can hold, so when you're manipulating RDF data you have more to get hold of.
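For instance, a minimal sketch of that workflow (the file name, predicate URI and query are placeholders, not the library's actual vocabulary):

from rdflib import Graph
import csv

g = Graph()
g.parse("biographies.ttl", format="turtle")  # placeholder file name

# Do the real work RDF-natively with SPARQL...
results = g.query("""
    SELECT ?person ?profession
    WHERE { ?person <http://example.org/hasProfession> ?profession }
""")

# ...and only flatten to CSV (or a Pandas DataFrame) at the very end.
with open("professions.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    writer.writerow(["person", "profession"])
    for person, profession in results:
        writer.writerow([str(person), str(profession)])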
most of the information in the files is not explicit
Better phrased: most of the information is indicated with objects identified by URIs, not given as literal values.
I want to use Python's RDFLib for parsing the .ttl files. What is your recommended workflow? Should I read the prefixes I am interested in first, then store the results in .txt (?) and then write a script to retrieve the actual information from the web, replacing the URIs?
No! You should load the .ttl files you can get, and then you may indeed retrieve all the other data referred to by URI. But, presumably, that data is also in RDF form, so you should download it into the same graph you loaded the initial .ttl files into. You then have the full graph, with both links and literal values, at your disposal to manipulate with SPARQL queries.
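A minimal sketch of that approach with RDFLib (the file name and the profession property are placeholders for whatever vocabulary the library actually uses):

from rdflib import Graph, URIRef

g = Graph()
g.parse("biographies.ttl", format="turtle")  # the library's .ttl file (placeholder name)

# Resolve the URIs behind a property of interest into the *same* graph.
PROFESSION = URIRef("http://example.org/hasProfession")  # placeholder property
for obj in set(g.objects(predicate=PROFESSION)):
    if isinstance(obj, URIRef):
        try:
            g.parse(obj)  # fetches and parses the remote RDF via content negotiation
        except Exception as err:
            print(f"could not resolve {obj}: {err}")

# The full graph, with links and literal values, is now available for
# SPARQL queries like the one in the earlier sketch.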

ETL: Transforming/cleaning Excel files

I am working for a start-up that receives Excel files with customer information from different companies. We do not have any ETL tool at present; the work of transforming the data into the required structure and loading it into the CRM system is handled manually.
My plan is to load these Excel files into a database, also replicate the CRM into a database, and do some fuzzy mapping between them.
Can you please recommend a lightweight ETL tool to apply a few rules to clean the data and compare it against the existing customer data that we have?
Thanks,
mc
Getting Excel feeds is certainly very common, and you need a good process for ingesting and validating them, especially since they are often manually created or tweaked, leading to frequent data and formatting issues. To add insult to injury, Excel has a very fuzzy concept of data types, which often throws a spanner in the works.
Where possible, switch your data sources to other formats (JSON, CSV, database extract). This requires upstream work but so does troubleshooting feed issues, so switching to a better format (and defining the feed well!) pays off for both sides fairly quickly.
Process Incoming Files Example describes a general approach for reliably handling multiple feeds of incoming files, with processing and archiving of successful and failing files. The example uses my company's actionETL cross-platform .NET ETL library, but I've also used the same approach previously with other ETL tools.
Map out all current and upcoming data sources and destinations, and see which tools are a good fit. Try before you buy with your actual ETL feeds and requirements. Expect the ETL data integration to be an ongoing project since feeds and requirements never stop changing and growing.
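As a very rough illustration of the fuzzy-mapping step the question describes, here is a Python sketch (not actionETL; pandas and rapidfuzz are assumed, and the file and column names are made up):

import pandas as pd
from rapidfuzz import process, fuzz

# File and column names below are made up for illustration.
incoming = pd.read_excel("company_feed.xlsx")  # reading .xlsx requires openpyxl
crm = pd.read_csv("crm_customers.csv")

# A couple of simple cleaning rules before matching.
incoming["customer_name"] = incoming["customer_name"].str.strip().str.upper()
crm["customer_name"] = crm["customer_name"].str.strip().str.upper()

crm_names = crm["customer_name"].tolist()

def best_match(name, threshold=90):
    """Return the closest existing CRM customer, or None if below the threshold."""
    match = process.extractOne(name, crm_names, scorer=fuzz.token_sort_ratio)
    return match[0] if match and match[1] >= threshold else None

incoming["crm_match"] = incoming["customer_name"].apply(best_match)

Rows with no match (or a low score) are then natural candidates for a manual review queue rather than going straight into the CRM.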
Cheers,
Kristian

How to use Solr for multiple data sources?

I am a newbie to Solr and am facing the challenges below.
I have two data sources: a portal and a CMS. I need to provide a Solr search solution for these two sources, so that when a user searches in the custom portlet (on the portal), they see results from both sources in the same place; in other words, Solr should fetch results from both sources. The user should also be able to access these results by clicking on them.
What should I consider for implementing this use case? Should I use multiple Solr cores or a single core? Also, how can I achieve features like faceted search, search filters, stop words, etc.?
Regards.
It should be perfectly fine to go with a single core (and it will also work faster).
To import data from multiple data sources, check out the Solr Data Import Handler configuration:
http://wiki.apache.org/solr/DataImportHandler
and set up two entities, one for each of your data sources.
You will probably need to add a field to each imported document to keep track of which data source it came from.
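As a rough sketch of that idea over Solr's plain HTTP interface (the core name "portal_cms" and the "source" field are assumptions; your schema and DIH entities may differ):

import requests

SOLR = "http://localhost:8983/solr/portal_cms"  # assumed core name

# Index documents from both sources, tagging each with where it came from.
docs = [
    {"id": "portal-1", "title": "Portal news item", "source": "portal"},
    {"id": "cms-1", "title": "CMS article", "source": "cms"},
]
requests.post(SOLR + "/update?commit=true", json=docs).raise_for_status()

# Search across both sources at once and facet on the source field.
params = {"q": "title:article", "wt": "json", "facet": "true", "facet.field": "source"}
response = requests.get(SOLR + "/select", params=params).json()
print(response["response"]["docs"])
print(response["facet_counts"]["facet_fields"]["source"])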
Your question is a little too general to answer fully. Go and experiment a little with the documentation you have; it should not be very hard to get some basic search functionality working.
You can find a lot of info about configuring Solr on the LucidWorks wiki:
http://docs.lucidworks.com/display/solr/Faceting
and on the Solr wiki: http://wiki.apache.org/solr/
You may also try some books, e.g. http://www.packtpub.com/apache-solr-4-cookbook/book
I figured out a way to do this. We can use SolrJ (http://wiki.apache.org/solr/Solrj) as a Java client for Solr. The Alfresco content can be put into XML files, and these XML files can be pushed into Solr using SolrJ.

How should I load the contents of a .txt file to serve on a website?

I am trying to build excerpts for each document returned as a search result on my website. I am using the Sphinx search engine and the Apache web server on Linux CentOS. The function within the Sphinx API that I'd like to use is called BuildExcerpts. This function requires you to pass an array of strings where each string contains a document's contents.
I'm wondering what the best practice is for retrieving the document contents in real time as I serve the results on the web. Currently, these documents are in text files on my system, spread across multiple drives. There are roughly 100MM of them and they take up a few terabytes of space.
It's easy for me to call something like file_get_contents(), but that feels like the wrong way to do this. My databases are already gigantic (100 GB+) and I don't particularly want to throw the document contents in there along with the document attributes that already exist. Perhaps that is the best way to do it, however.
Suggestions?
Well, the source needs to be fetched from somewhere. If you don't want to duplicate it in your database, then you will need to fetch it from the filesystem (using file_get_contents() or similar).
Although the BuildExcerpts function does give you one extra option, "load_files"
... then Sphinx will read the data from the filenames for you.
What problem are you experiencing with reading the content from files? Is it too slow? If so, maybe put some caching in front, using memcache perhaps.
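A sketch of that option using the Python sphinxapi client that ships with Sphinx (the index name, file paths and highlight options here are assumptions):

import sphinxapi

client = sphinxapi.SphinxClient()
client.SetServer("localhost", 9312)

# Pass file paths instead of document bodies; with "load_files" set,
# searchd reads each file from disk itself, so your application never
# has to slurp the contents just to build an excerpt.
docs = ["/mnt/vol1/docs/0001.txt", "/mnt/vol2/docs/0002.txt"]  # example paths
opts = {
    "load_files": True,
    "before_match": "<b>",
    "after_match": "</b>",
    "limit": 256,
}
excerpts = client.BuildExcerpts(docs, "documents_index", "search terms", opts)
print(excerpts)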

Storing lots of attachments in a single CouchDB document

tl;dr: Should I store directories in CouchDB as a list of attachments, or as a single tarball?
I've been using CouchDB to store project documents. I just create documents via Futon and upload them directly from there. I've also written a script to bulk-upload directories. I am using it like a basic content repository. I replicate it, so other people on my team have a copy of the repository.
I noticed that saving directories as a series of files seems to have a lot of storage overhead, so instead I upload a .tar.gz file containing the directory. This does significantly reduce the size of the document but now any change to the directory requires replicating the entire tarball.
I am looking for thoughts or perspective on the matter.
It really depends on what you want to achieve. I will try to provide some options for you to consider.
Storing one tar.gz will save you space, but it does make the contents harder to work with. If you are simply archiving, it may work for you.
Storing all the attachments on one document works well for couchapps. The workflow is that you mess around with attachments until you are ready to release the application, and then there is not a lot of replication overhead, because it usually happens once. It is nice that they are on one document, because they all move/replicate as one bundle. The downsides of using this approach for a content management system are that you can accumulate a lot of history baggage that you have to compact away on your local couch. You will also get a lot of conflicts during replication between couches, and CouchDB will keep the conflicts around for you to resolve. So if you choose this model, you should compact frequently to reduce disk size.
For a content management system, I might recommend using one document per attachment. That would give you fewer conflicts. There will be a slight overhead, as each doc has some space allocated for the doc itself, but the savings from not having to do frequent compaction and/or conflict resolution will outweigh it.
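For illustration, a minimal sketch of the one-document-per-attachment layout over CouchDB's HTTP API (the database name "projects", the local URL and the lack of authentication are assumptions):

import os
import requests

COUCH = "http://localhost:5984/projects"  # assumed database

def upload_file(doc_id, path):
    """Create one document per file and attach the file's contents to it."""
    resp = requests.put(COUCH + "/" + doc_id, json={"path": path})
    resp.raise_for_status()
    rev = resp.json()["rev"]

    # Attach the file under the document's current revision.
    with open(path, "rb") as fh:
        requests.put(
            COUCH + "/" + doc_id + "/" + os.path.basename(path),
            params={"rev": rev},
            data=fh,
            headers={"Content-Type": "application/octet-stream"},
        ).raise_for_status()

upload_file("readme-txt", "docs/readme.txt")  # example id and path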
Hope that gives you some options to weigh up.
