How to load only the changed portion of a YAML file in ruamel.yaml - python-3.x

I am using ruamel.yaml library to load and process YAML file.
The YAML file can get updated after I have called
yaml.load(yaml_file_path)
So, I need to call load() on the same YAML file multiple times.
Is there a way/optimization parameter to pass to loader to load only the new entries in the YAML file?

There is no such facility currently built into ruamel.yaml.
If a file consists of multiple YAML documents, you can optimize the loading by splitting the file on the document marker (---). This is fairly trivial, and you can then load a single document from start to finish.
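A minimal sketch of that approach, assuming the documents are separated by non-indented --- lines (the split is naive and will break if --- occurs inside a block scalar or quoted string):

from ruamel.yaml import YAML

yaml = YAML(typ='safe')

def load_new_documents(path, already_loaded=0):
    # yield only the documents that come after the first `already_loaded` ones
    with open(path) as fp:
        text = fp.read()
    documents = text.split('\n---\n')
    for document in documents[already_loaded:]:
        yield yaml.load(document)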
If you only want to reload parts of a document, things get more difficult. If there are anchors and aliases involved, there is no easy way to do this, as an updated part that needs an alias may depend on a (non-updated) anchor definition. If there are no such aliases, and you know the structure of your file and have a way to determine what got updated, you can do partial loads and update your data structure. You would need to do some parsing of the YAML document yourself, but if you only use a subset of YAML's possibilities, this is often feasible.
E.g. if you know that you only have simple scalar keys in the root-level mapping of a YAML document, you can parse the document and extract the non-indented strings that are followed by the value indicator (:). Any such string that is not in your "old" data structure is a new key, and its value should be parsed (i.e. the YAML document content up to the next non-indented string).
The above is far less trivial to do for data that is added anywhere other than at the root level (whether in a mapping or a sequence).
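As a rough, hypothetical sketch of that root-level scan (the regular expression assumes plain, unquoted scalar keys in block style):

import re

# match non-indented plain keys followed by the value indicator ':'
ROOT_KEY = re.compile(r'^([^\s#][^:\n]*):', re.MULTILINE)

def new_root_keys(yaml_text, old_data):
    # return root-level keys that are not yet in the previously loaded mapping
    return [m.group(1) for m in ROOT_KEY.finditer(yaml_text)
            if m.group(1) not in old_data]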
Since there is no indication within the YAML specification of the complexity of a YAML document (i.e. whether it includes anchors, aliases, tags, etc.), none of this is easy to build into ruamel.yaml itself.
Without specific information on the format of your YAML document and on what can get updated, specific implementation details cannot be given. I assume, however, that you will not update and write out the loaded data; if that is so, make sure to use
yaml = YAML(typ='safe')
when possible, as this will give you much faster loading times than the default round-trip loader provides.
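For example (the file name is a placeholder):

from ruamel.yaml import YAML

yaml = YAML(typ='safe')
with open('your_file.yaml') as fp:
    data = yaml.load(fp)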

Limiting DICOM tags

I am trying to limit the DICOM tags, which are retained, by using
for key in keys:
    if key.upper() not in {'0028|0010', '0028|0011'}:
        image_slice.EraseMetaData(key)
in Python 3.6 where image_slice is of type SimpleITK.SimpleITK.Image
I then use
image_slice.GetMetaDataKeys()
to see what tags remain and they are the tags I selected. I then save the image with
writer.SetFileName(outputDir+os.path.basename(sliceFileNames[i]))
writer.Execute(image_slice)
where outputDir is the output directory name and os.path.basename(sliceFileNames[i]) is the DICOM image name. However, when I open the image, with Weasis or with MIPAV, I notice that there are a lot more tags than were in image_slice. For example, there is
(0002,0001) [OB] FileMetaInformationVersion: binary data
(0002,0002) [UI] MediaStorageSOPClassUID:
(0002,0003) [UI] MediaStorageSOPInstanceUID:
(0008,0020) [DA] StudyDate: (which is given the date that the file was created)
I was wondering how, and where these additional tags were added.
The group 2 tags you are seeing are meta information tags that are always written when the dataset is written. Unlike "regular" tags, which start at group 8, these group 2 tags do not belong to the dataset itself, but contain information about the encoding/writing of the dataset, like the transfer syntax; more information can be found in the DICOM standard, part 10. They will be recreated on saving a dataset to a file, as otherwise the DICOM file would not be valid.
About the rest of the tags I can only guess, but they were probably written by the software because they are mandatory DICOM tags and were missing from the dataset. StudyDate is certainly a mandatory tag, so adding it if it is missing is correct if the data is seen as derived data (which it usually is if you are manipulating it with ITK). I guess the other tags you didn't mention are also mandatory tags.
Someone with more SimpleITK knowledge can probably add more specific information.
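One way to see exactly what ends up in the written file is to re-read it and dump all of its tags; a minimal sketch (the file name is a placeholder):

import SimpleITK as sitk

reader = sitk.ImageFileReader()
reader.SetFileName('output/slice_0001.dcm')
reader.LoadPrivateTagsOn()       # also expose private tags, if any
reader.ReadImageInformation()    # read the header without the pixel data
for key in reader.GetMetaDataKeys():
    print(key, '=', reader.GetMetaData(key))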

yaml library which supports writing multiple documents

I have a small NodeJS app which generates two YAML files.
I want to merge them into one file so that I have one file with two Document nodes. Something like:
---
yaml1
---
yaml2
I tried using the npm package yaml but to no avail.
Browsing through the docs of js-yaml, I cannot find how to achieve this.
Any help is appreciated.
YAML has been designed so that it is easy to merge multiple documents in a stream. Quoting the spec:
Concatenating two YAML streams requires both to use the same character encoding. In addition, it is necessary to separate the last document of the first stream and the first document of the second stream. This is easily ensured by inserting a document end marker between the two streams. Note that this is safe regardless of the content of either stream. In particular, either or both may be empty, and the first stream may or may not already contain such a marker.
The document end marker is ... (followed by a newline). Joining the contents of both files with this marker will do the trick. This works because YAML allows a document to be ended by multiple document end markers. On the other hand, the directives end marker (---) you use always starts a document, so it is not safe to join the documents with it, since the second document may already start with one, leading to the creation of an empty document in between.
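For example, in Python (file names are placeholders; the same concatenation is a one-liner in any language, including NodeJS):

with open('first.yaml') as f1, open('second.yaml') as f2:
    first, second = f1.read(), f2.read()

if not first.endswith('\n'):
    first += '\n'

with open('merged.yaml', 'w') as out:
    out.write(first + '...\n' + second)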

Using a list for a feature in an ML model

I want to run a machine learning algorithm on some data, so I'm exporting the data into a file first.
But one of my features for the text I'm classifying is a list of tags,
and each text can have multiple tags ex. (["mystery", "thriller"]).
Is it recommended that, when I write to my CSV file for exporting the data, I write that entire list as one of the features for my data (the "tags" feature)?
Or is it better to make a separate feature for each tag. The only problem then is that most examples will only have one tag, so the other feature columns for those will be blank.
So it seems like writing this list of tags as one feature makes the most sense, but then when parsing it for training, would I then treat every element of that list as its own feature still or no?
If you do it as a single feature, just make sure to use a delimiter to separate the tags that won't occur in any of the tags and also isn't a comma (as that would mess with the CSV format); something like | would probably do fine. When you go to build your models and read in that list of tags, you can then split it on that delimiter. In Java this would look like:
String[] tagList = inputString.split("\\|"); // split() takes a regex, so the pipe must be escaped
I'm sure most languages will have a similar method to do this.
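If you later do want to treat each tag as its own feature, you can split the column and multi-hot encode it at training time; a minimal sketch in Python (file name, column name, and delimiter are assumptions):

import csv

with open('data.csv', newline='') as fp:
    rows = list(csv.DictReader(fp))

for row in rows:
    row['tags'] = row['tags'].split('|') if row['tags'] else []

# one binary column per distinct tag
all_tags = sorted({tag for row in rows for tag in row['tags']})
features = [[1 if tag in row['tags'] else 0 for tag in all_tags] for row in rows]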

How can I store elasticsearch settings+mappings in one file (like schema.xml for Solr)

How can I store elasticsearch settings+mappings in one file (like schema.xml for Solr)? Currently, when I want to make a change to my mapping, I have to delete my index settings and start again. Am I missing something?
I don't have a large data set as of now. But in preparation for a large amount of data that will be indexed, I'd like to be able to modify the settings and some how reindex without starting completely fresh each time. Is this possible and if so, how?
These are really multiple questions disguised as one. Nevertheless:
How can I store elasticsearch settings+mappings in one file (like schema.xml for Solr)?
First, note that you don't have to specify a mapping for lots of types, such as dates, integers, or even strings (when the default analyzer is OK for you).
You can store settings and mappings in various ways, in ElasticSearch < 1.7:
In the main elasticsearch.yml file
In an index template file
In a separate file with mappings
Currently, when I want to make a change to my mapping, I have to delete my index settings and start again. Am I missing something?
You have to re-index data when you change the mapping for an existing field: once your documents are indexed, the engine needs to reindex them to use the new mapping.
Note that you can update index settings in specific cases, such as number_of_replicas, "on the fly".
I'd like to be able to modify the settings and some how reindex without starting completely fresh each time. Is this possible and if so, how?
As said: you must reindex your documents, if you want to use a completely new mapping for them.
If you are adding a new field rather than changing an existing mapping, you can update the mapping, and new documents will pick it up when being indexed.
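A minimal sketch of adding a field to an existing mapping over the REST API, using Python's requests library (index, type, and field names are placeholders):

import json
import requests

new_field = {'properties': {'author': {'type': 'string', 'index': 'not_analyzed'}}}
resp = requests.put('http://localhost:9200/my_index/_mapping/my_type',
                    data=json.dumps(new_field))
print(resp.json())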
Since Elasticsearch 2.0:
It is no longer possible to specify mappings in files in the config directory.
Find the documentation link here.
It's also no longer possible to store index templates within the config location (path.conf) under the templates directory.
The path.conf location (/etc/default/elasticsearch by default on Ubuntu) now stores only environment variables, including heap size and file descriptors.
You need to create your templates with curl.
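The same call can be made from any HTTP client, not just curl; a minimal sketch with Python's requests library (the pattern, settings, and mapping are placeholders):

import json
import requests

template = {
    'template': 'logs-*',                  # indices this template applies to
    'settings': {'number_of_shards': 1},
    'mappings': {'my_type': {'properties': {'message': {'type': 'string'}}}},
}
resp = requests.put('http://localhost:9200/_template/my_template',
                    data=json.dumps(template))
print(resp.json())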
If you are really desperate, you could create your indexes and then backup your data directory and then use this one as your "template" for new Elasticsearch clusters.

How can I perform Search&Replace on an XML file after WIX installation?

After installing my files using WiX 3.5, I would like to change some values in one of my XML files.
Currently there are multiple entries like this:
<endpoint address="net.tcp://localhost/XYZ" .../>
I would like to change the localhost to the real server name, which is available through a property. How can I perform this replacement on each entry inside this XML file? Is there a way to do this without writing my own CA?
Thanks in advance!
XmlConfig and/or XmlFile elements are your friends here.
UPDATE: Well, according to the comments below, it turns out that only part of the attribute (or element) value should be changed. This seems not to be supported by either of the two referenced elements.
I would take one of the two approaches then:
Use third-party "read XML" actions, like this one
It's better than creating your own because you can rely on deeper testing in this case
Teach your build script to control the string pattern
Let's say you put `net.tcp://localhost/XYZ` into the build file, and your code is set up to take this value as the string pattern to use at install time. For instance, keep the string pattern as a Property in your MSI package. When it changes, e.g. to `net.tcp://localhost/ABC`, you'll have to change nothing in your action. In this case, from an XmlFile perspective, you always know your FROM and TO attribute values.
If your XML configuration file is not large, you can load the file into memory and perform replace using JScript.
var s = "<endpoint address=\"net.tcp://localhost/XYZ\" .../>";
var re = /"net.tcp:\/\/localhost\//g;
var r = s.replace(re, "\"http://newhost.com/");
Here s is your complete XML file, re is the regular expression, and r would contain the result of the replace.
You can read and write public properties of Windows Installer using JScript. Yet there's still one problem: you have to read your XML file and write it back to disk. To do it, you can use the Win32_ReadFile and Win32_WriteFile custom actions from the AppSecInc. MSI Extensions library referenced by Yan in his answer.
However, it could be easier to write a complete custom action which loads your XML configuration file, does the replace, and writes the file back to disk. To do it you can use XSLT and JScript (see an example).
InstallShield has a built-in data driven custom action called Text Search. It basically allows for RegEx style replacements like what you are describing.
WiX doesn't have this functionality but you could write a custom action ( say using C#/DTF ) to do it for you.
There is nothing in WiX you can use to change part of a file without a custom action. If you don't want to use a CA, you can consider saving the setting in some other place, e.g. the user's registry, and always reading that setting from there.
