How should the DBpedia Live changesets be interpreted with regard to time and edits?

The changesets at http://live.dbpedia.org/liveupdates/ seem to be time-ordered, but their interpretation for replay isn't completely clear from surrounding descriptions.
Regarding the paired add and remove files, when an existing value (such as a <http://dbpedia.org/ontology/abstract>) is edited, does that result in just an 'added' entry with the new value, or a 'removed' of the old, then an 'added' with the new?
After downloading a daily summary tar -- such as http://live.dbpedia.org/liveupdates/2013/07/2013-07-07.tar.gz -- the initial untarring gives a large number of top-level added/removed file pairs (1232, to be precise), but also 24 additional hourly tarfiles (2013-07-07-[00-23].tar.gz), each with its own added/removed file pairs. Are the top-level files sequenced before, after, or redundant with the hourly files?

DBpedia Live generates two sets of files, added and removed, which contain the added and removed triples respectively.
Upon an article change, the new and old triples are written in N-Triples (.nt) format and saved to the added and removed files respectively; those files are then compressed and stored on the server.
The DBpedia sync-tool continuously downloads those files, decompresses them, and uses them to update a local mirror of the official DBpedia Live endpoint.
So, first the old triples are removed from the local mirror, and then the new triples are inserted. An edit to an existing value (such as an abstract) therefore shows up as a 'removed' entry for the old triple plus an 'added' entry for the new one, not as an 'added' entry alone.
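For replay, a minimal sketch of how one added/removed pair could be applied to a local mirror over the SPARQL 1.1 Update protocol. The endpoint URL and the decompressed file names are hypothetical, and details such as the target graph (which depends on how the mirror is configured) are ignored:

import requests

def read_ntriples(path):
    # Every non-empty line of an .nt file is a complete triple ending in " ."
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

def replay_pair(removed_path, added_path, endpoint):
    removed = read_ntriples(removed_path)
    added = read_ntriples(added_path)
    # Old triples first, then new ones, mirroring the order described above.
    if removed:
        r = requests.post(endpoint, data={"update": "DELETE DATA { %s }" % "\n".join(removed)})
        r.raise_for_status()
    if added:
        r = requests.post(endpoint, data={"update": "INSERT DATA { %s }" % "\n".join(added)})
        r.raise_for_status()

# Hypothetical local endpoint and decompressed changeset pair:
replay_pair("000001.removed.nt", "000001.added.nt", "http://localhost:8890/sparql")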

Related

How to load only changed portion of YAML file in Ruamel

I am using the ruamel.yaml library to load and process a YAML file.
The YAML file can get updated after I have called
yaml.load(yaml_file_path)
So, I need to call load() on the same YAML file multiple times.
Is there a way/optimization parameter to pass to loader to load only the new entries in the YAML file?
There is no such facility currently built into ruamel.yaml.
If a file consists of multiple YAML documents, you can optimize the loading by splitting the file on the document marker (---). This is fairly trivial, and then you can load a single document from start to finish.
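A rough sketch of that splitting, assuming the documents are separated by plain '---' lines and no such line occurs inside a block scalar:

def split_documents(text):
    # Split a multi-document YAML stream on lines consisting solely of "---".
    docs, current = [], []
    for line in text.splitlines(keepends=True):
        if line.rstrip("\r\n") == "---":
            if current:
                docs.append("".join(current))
            current = []
        else:
            current.append(line)
    if current:
        docs.append("".join(current))
    return docs

Each chunk can then be passed to load() on its own, and chunks whose text is unchanged since the previous run (compared by a hash, for example) can be skipped entirely.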
If you only want to reload parts of a document things get more difficult. If there are anchors and aliases involved, there is no easy way to do this as you may need a (non-updated) anchor definition in an updated part that needs an alias. If there are no such aliases, and you know the structure of your file, and have a way to determine what got updated, you can do partial loads and update your data structure. You would need to do some parsing of the YAML document, but if you only use a subset of YAML possibilities, this is often possible.
E.g. if you know that you only have simple scalar keys at the root level mapping of a YAML document, you can parse the document and extract non-indented strings that are followed by the value indicator. Any such string that is not in your "old" data structure is a new key and its value should be parsed (i.e. the YAML document content until the next non-indented string).
The above is far less trivial to do for any added data that is not added at the root level (whether mapping or sequence).
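For the root-level case, a rough sketch of that scan. It assumes a single top-level mapping with simple, unquoted scalar keys and no anchors, aliases or tags; the function and regex names are illustrative, not part of ruamel.yaml:

from ruamel.yaml import YAML
import re

yaml = YAML(typ='safe')

# A non-indented "key:" at the start of a line (simple scalar keys only).
KEY_RE = re.compile(r'^([^\s#][^:]*):')

def load_new_root_entries(path, old_data):
    with open(path, encoding='utf-8') as f:
        lines = f.readlines()
    # (line index, key) for every line that introduces a root-level key
    starts = []
    for i, line in enumerate(lines):
        m = KEY_RE.match(line)
        if m:
            starts.append((i, m.group(1).strip()))
    new_entries = {}
    for n, (i, key) in enumerate(starts):
        if key in old_data:
            continue
        end = starts[n + 1][0] if n + 1 < len(starts) else len(lines)
        # Parse only the chunk of text belonging to this new key.
        new_entries.update(yaml.load(''.join(lines[i:end])))
    return new_entries

Here old_data is the dict from your previous full load; the returned dict holds only the entries whose keys were not seen before, parsed with the same safe loader.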
Since there is no indication within the YAML specification of the complexity of a YAML document (i.e. whether it includes anchors, aliases, tags, etc.), any of this is less easy to build into ruamel.yaml itself.
Without specific information on the format of your YAML document and on what can get updated, specific implementation details cannot be given. I assume, however, that you will not update and write out the loaded data; if that is so, make sure to use
yaml = YAML(typ='safe')
when possible as this will get you much faster loading times than the default round-trip loader provides.
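A minimal sketch of that faster path (YAML().load() also accepts a pathlib.Path; the file path is a placeholder for whatever you are reloading):

from pathlib import Path
from ruamel.yaml import YAML

yaml_file_path = "data.yaml"            # whatever path you are reloading
yaml = YAML(typ='safe')                 # plain Python objects, no round-trip info
data = yaml.load(Path(yaml_file_path))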

When using github to store an excel file, file seems to revert back to an old version every time

I am trying to store an Excel file in GitHub. The file is a master list of "off limits entities" that my team must use and that each of us updates daily with items the full team needs to exclude from their data analyses. Here is our current workflow:
1. Pull the most recent version of "off_limits_entities.xlsx" before starting work
2. Edit "off_limits_entities.xlsx" locally (generally adding 10 - 50 entries in various columns and sheets)
3. Push "off_limits_entities.xlsx" to GitHub
The problem is that when I do this, and then one of my teammates does the same, when we pull again our files are both missing all the recent additions (even though our pushes and pulls all seemed to be successful). As a test, I created a smaller xlsx file (one sheet) and uploaded it to GitHub, then had one of my teammates pull, edit, and push it; when I pulled the same file, it WAS updated appropriately with new columns and cells. I also considered whether our "off limits entities" file was too large, but it is only 223 KB and the GitHub limit is 100 MB.
Does anyone know why GitHub might be losing/not saving/erasing new entries in the file?

yaml library which supports writing multiple documents

I have a small NodeJS app which generates two YAML-files.
I want to merge them into one file so that I have one file with two Document nodes. Something like:
---
yaml1
---
yaml2
I tried using the npm package yaml but to no avail.
Browsing through the docs of js-yaml, I cannot find how to achieve this.
Any help is appreciated.
YAML has been designed so that it is easy to merge multiple documents in a stream. Quoting the spec:
Concatenating two YAML streams requires both to use the same character encoding. In addition, it is necessary to separate the last document of the first stream and the first document of the second stream. This is easily ensured by inserting a document end marker between the two streams. Note that this is safe regardless of the content of either stream. In particular, either or both may be empty, and the first stream may or may not already contain such a marker.
The document end marker is a line consisting of three dots (...) followed by a newline. Joining the contents of both files with this marker in between will do the trick. This works because YAML allows a document to be ended by multiple document end markers. On the other hand, the directives end marker (---) you use always starts a document, so it is not safe to join the documents with it, since the second document may already start with one, which would create an empty document in between.
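Sketched below in Python for brevity (the same string concatenation works just as well in Node with fs.readFileSync / fs.writeFileSync); the file names are hypothetical:

# Join two single-document YAML files into one two-document stream by
# inserting a document end marker ("..." on its own line) between them.
with open("doc1.yaml", encoding="utf-8") as f1, open("doc2.yaml", encoding="utf-8") as f2:
    merged = f1.read().rstrip("\n") + "\n...\n" + f2.read()

with open("merged.yaml", "w", encoding="utf-8") as out:
    out.write(merged)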

azure data factory: iterate over millions of files

Previously I had a problem with merging several JSON files into one single file,
which I was able to resolve with the answer to this question.
At first I tried with just some of the files, using wildcards in the file name in the connection section of the input dataset. But when I remove the file name, in theory all of the files in all folders should be loaded recursively, since I checked the 'copy recursively' option in the source section of the copy activity.
The problem is that when I manually trigger the pipeline after removing the file name from the input dataset, only some of the files get loaded and the task ends successfully, but it only loads around 400+ files, while each folder has 1M+ files. I want to create BIG csv files by merging all the small JSON files of the source (I was already able to create a csv file by mapping the schemas in the copy activity).
It is probably stopping due to a timeout or out of memory exception.
One solution is to loop over the contents of the directory using
Directory.EnumerateFiles(searchDir)
This way you can process all the files without holding the names or contents of every file in memory at the same time.
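For illustration only, the same lazy pattern outside of Data Factory, sketched in Python: os.scandir yields one entry at a time, so the full file list is never materialized, and the small JSON files are streamed into one big CSV. The folder path and column names are hypothetical:

import csv
import json
import os

def iter_json_files(root):
    # Scan lazily: os.scandir yields one entry at a time instead of
    # building a list of every file first.
    stack = [root]
    while stack:
        current = stack.pop()
        with os.scandir(current) as it:
            for entry in it:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                elif entry.name.endswith(".json"):
                    yield entry.path

def merge_to_csv(root, out_path, fieldnames):
    # Stream every small JSON file into one big CSV, one file at a time.
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.DictWriter(out, fieldnames=fieldnames, extrasaction="ignore")
        writer.writeheader()
        for path in iter_json_files(root):
            with open(path, encoding="utf-8") as f:
                writer.writerow(json.load(f))

# Hypothetical source folder and column names:
merge_to_csv("/data/source", "merged.csv", ["id", "name", "value"])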

How to load files in a specific order

I would like to know how I can load some files in a specific order. For instance, I would like to load my files according to their timestamp, in order to make sure that subsequent data updates are replayed in the proper order.
Let's say I have 2 types of files: deal info files and risk files.
I would like to load T1_Info.csv, then T1_Risk.csv, T2_Info.csv, T2_Risk.csv...
I have tried to implement a comparator, as described on Confluence, but it seems that the loadInstructions file has priority: it orders the Info files and the Risk files independently (loading T1_Info.csv, T2_Info.csv and then T1_Risk.csv, T2_Risk.csv...).
Do I have to implement a custom file loader, or is it possible using an AP configuration?
The loading of the files based on load instructions is done in
com.quartetfs.tech.store.csv.impl.CSVDataModelFactory.load(List<FileLoadDescriptor>). The FileLoadDescriptor list you receive is created directly from the load instructions files.
What you can do is create a simple instructions file with 2 entries, one for deal info and one for risk, so that your custom implementation of CSVDataModelFactory is called with a list of two items. In your custom implementation you scan the directory where the files are, sort them in the order you want them to be parsed, and call super.load() with the list of FileLoadDescriptor you created from the directory scanning.
If you also want to load files that are placed in this folder in the future, you have to add to your load instructions a line that will match all files; that will make the super.load() implementation create a directory watcher (you should then maybe override createDirectoryWatcher() so that it does not watch the files already present in the folder when load is called).
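The ordering itself is the easy part. A sketch of the sort key, shown here in Python only to illustrate the intended order with the T<n>_Info.csv / T<n>_Risk.csv naming from the question (the real sorting would live in the custom CSVDataModelFactory, and the data directory here is hypothetical):

import re
from pathlib import Path

TYPE_ORDER = {"Info": 0, "Risk": 1}

def sort_key(path):
    # "T2_Risk.csv" -> (2, 1): order by timestamp first, then Info before Risk.
    m = re.match(r"T(\d+)_(Info|Risk)\.csv$", path.name)
    return (int(m.group(1)), TYPE_ORDER[m.group(2)])

# Hypothetical data directory; yields T1_Info.csv, T1_Risk.csv, T2_Info.csv, ...
files = sorted(Path("/data/csv").glob("T*_*.csv"), key=sort_key)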
