Merging GDML files - python-3.x

The GDML manual documents how to set up multiple GDML files via !ENTITY includes.
lxml has a facility to parse such multi-file documents as follows:
from lxml import etree
parser = etree.XMLParser(resolve_entities=True)
root = etree.parse(filename, parser=parser)
I assume that, at least under the covers, it must combine the multiple files before parsing.
My question: is there some way I can just save the combined file? I would like to be able to combine and save a LARGE (> 100) number of XML files that are specified as includes in a GDML file, discarding the !ENTITY definitions.
Note: I don't want to parse the files, as for that large a number this takes a long time.
I guess one option would be to code something like
gdml = etree.tostring(root)   # serialize the resolved tree to a single bytes object
with open('filename', 'wb') as f:
    f.write(gdml)
But I have a concern that I might hit a maximum string size, which I understand would be limited to the maximum addressable size:
import sys
sys.maxsize
Wondering if there is a better way.
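For what it's worth, lxml can also serialize the parsed (entity-resolved) tree straight to a file, so the combined document never has to exist as one giant Python string. A minimal sketch, assuming the includes are pulled in as external entities; the file names here are placeholders:

from lxml import etree

parser = etree.XMLParser(resolve_entities=True)
tree = etree.parse('main.gdml', parser=parser)   # placeholder input file
# write() serializes the resolved document to disk without building an
# intermediate string; the original DOCTYPE/!ENTITY block may still be emitted.
tree.write('combined.gdml', xml_declaration=True, encoding='utf-8')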

Related

Avoiding for loops when working with folders in Python

The code below is an attempt at a minimal reproducible example; it relies on the folders (folder_source and folder_target) and files (file_id1.csv, file_id2.csv). The code loads a CSV from a directory, changes the name, and saves it to another directory.
The code works fine. I would like to know if there is a way of avoiding the nested for loop.
Thank you!
import pandas as pd

list_of_file_paths = ['C:\\Users\\user\\Desktop\\folder_source\\file_id1.csv', 'C:\\Users\\user\\Desktop\\folder_source\\file_id2.csv']
list_of_variables = ['heat', 'patience', 'charmander']
target_path = 'C:\\Users\\user\\Desktop\\folder_target\\'

for filepath_load in list_of_file_paths:
    for variable in list_of_variables:
        df_loaded = pd.read_csv(filepath_load)  # grab one of the csv files in the source folder
        id_number = filepath_load.split(".")[0].split("_")[-1]  # extract the id from the csv file name
        df_loaded.to_csv(target_path + id_number + '_' + variable + '.csv', index=False)  # rename and save into the target folder
You're looking for the Cartesian product of the 2 lists, I guess?
from itertools import product

for filepath_load, variable in product(list_of_file_paths, list_of_variables):
    df_loaded = pd.read_csv(filepath_load)
    id_number = filepath_load.split(".")[0].split("_")[-1]
    df_loaded.to_csv(target_path + id_number + '_' + variable + '.csv', index=False)
But as Roland Smith says, you have some redundancy here. I'd prefer his code, which has two loops but the minimal amount of I/O and computation.
If you really want to save each file into three identical copies with a different name, there is really no alternative.
Although I would move the inner loop down, removing redundant file reads.
for filepath_load in list_of_file_paths:
    df_loaded = pd.read_csv(filepath_load)
    id_number = filepath_load.split(".")[0].split("_")[-1]
    for variable in list_of_variables:
        df_loaded.to_csv(target_path + id_number + '_' + variable + '.csv', index=False)
Additionally, consider using shutil.copy, since the source file is not modified:
import shutil

for filepath_load in list_of_file_paths:
    id_number = filepath_load.split(".")[0].split("_")[-1]
    for variable in list_of_variables:
        shutil.copy(filepath_load, target_path + id_number + '_' + variable + '.csv')
That would employ the operating system's buffer cache, at least for the second and third write.
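As a side note, the two suggestions can be combined. A minimal sketch of my own (not from either answer), reusing the two lists defined in the question and using pathlib for the path handling:

import shutil
from itertools import product
from pathlib import Path

target_path = Path(r'C:\Users\user\Desktop\folder_target')
for filepath_load, variable in product(list_of_file_paths, list_of_variables):
    id_number = Path(filepath_load).stem.split('_')[-1]   # e.g. 'id1' from 'file_id1.csv'
    shutil.copy(filepath_load, target_path / f'{id_number}_{variable}.csv')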

copying files from one folder to another folder based on the file names in python 3

In Python 3.7, I want to write a script that
creates folders based on a list
iterates through a list (elements represent different "runs")
searches for .txt files in predefined directories derived from certain operations
copies certain .txt files to the previously created folders
I managed to do that via following script:
from shutil import copy
import os
import glob

# define folders and batches
folders = ['folder_ce', 'folder_se']
runs = ['A001', 'A002', 'A003']

# make folders
for f in folders:
    os.mkdir(f)

# iterate through batches,
# extract files for every operation,
# and copy them to target folder
for r in runs:
    # operation 1
    ce = glob.glob(f'{r}/{r}/several/more/folders/{r}*.txt')
    for c in ce:
        copy(c, 'folder_ce')
    # operation 2
    se = glob.glob(f'{r}/{r}/several/other/folders/{r}*.txt')
    for s in se:
        copy(s, 'folder_se')
In the predefined directories there are several .txt files:
one file with the format A001.txt (where the "A001" part is derived from the list "runs" specified above)
plus sometimes several files with the format A001.20200624.1354.56.txt
If a file with the format A001.txt is there, I only want to copy this one to the target directory.
If the format A001.txt is not available, I want to copy all files with the longer format (e.g. A001.20200624.1354.56.txt).
After the comment of @adamkwm, I tried
if f'{b}/{b}/pcs.target/data/xmanager/CEPA_Station/Verwaltung_CEPA_44S4/{b}.txt' in cepa:
    copy(f'{b}/{b}/pcs.target/data/xmanager/CEPA_Station/Verwaltung_CEPA_44S4/{b}.txt', 'c_py_1')
else:
    for c in cepa:
        copy(c, 'c_py_1')
but that still copies both files (A001.txt and A001.20200624.1354.56.txt), which I understand. I think the trick is to first check in ce (which is a list) whether the {r}.txt format is present and, if it is, only copy that one; if not, copy all files. However, I don't seem to get the logic right, or I'm using the wrong modules or methods.
After searching for answers, I didn't find one resolving this specific case.
Can you help me with a solution for this "selective copying" of the files?
Thanks!
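A minimal sketch of the selective-copy logic described above, reusing the names from the minimal example (this is my reading of the intended rule: prefer the short {r}.txt file when the glob found it, otherwise copy all long-format files):

import os
import glob
from shutil import copy

for r in runs:
    ce = glob.glob(f'{r}/{r}/several/more/folders/{r}*.txt')
    # is the short-format file (e.g. 'A001.txt') among the matches?
    short = [p for p in ce if os.path.basename(p) == f'{r}.txt']
    if short:
        copy(short[0], 'folder_ce')   # copy only the short-format file
    else:
        for c in ce:                  # no short file: copy all long-format files
            copy(c, 'folder_ce')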

Is it possible to remove characters from a compressed file without extracting it?

I have a compressed file that's about 200 MB, in the form of a tar.gz archive. I understand that I can extract the XML files in it; it contains several small XML files and one 5 GB XML file. I'm trying to remove certain characters from the XML files.
So my very basic question is: is it even possible to accomplish this without ever extracting the content of the compressed file?
I'm trying to speed up the process of reading through xml files looking for characters to remove.
You will have to decompress, change, and then recompress the files. There's no way around that.
However, this does not necessarily include writing the file to a storage. You might be able to do the changes you like in a streaming fashion, i.e. that everything is only done in memory without ever having the complete decompressed file somewhere. Unix uses pipes for such tasks.
Here is an example on how to do it:
Create two random files:
echo "hello world" > a
echo "hello world" > b
Create a compressed archive containing both:
tar -c -z -f x.tgz a b
Pipe the contents of the decompressed archive through a changer. Unfortunately I haven't found any shell-based way to do this, but you also specified Python in the tags, and with the tarfile module you can achieve this.
Here is the file tar.py:
#!/usr/bin/env python3
import sys
import tarfile

tar_in = tarfile.open(fileobj=sys.stdin.buffer, mode='r:gz')
tar_out = tarfile.open(fileobj=sys.stdout.buffer, mode='w:gz')
for tar_info in tar_in:
    reader = tar_in.extractfile(tar_info)
    if tar_info.path == 'a':  # my example file names are "a" and "b"
        # now comes the code which makes our change:
        # we just skip the first two bytes in each file:
        reader.read(2)        # skip two bytes
        tar_info.size -= 2    # reduce size in info object as well
    # add the (maybe changed) file to the output:
    tar_out.addfile(tar_info, reader)
tar_out.close()
tar_in.close()
This can be called like this:
./tar.py < x.tgz > y.tgz
y.tgz will contain both files again, but in file a the first two bytes will have been skipped (so its contents will be llo world).
You will have noticed that you need to know the resulting size of your change beforehand. tar is designed to handle files, so it needs to write the size of each entry into the tar header that precedes every entry in the resulting archive; I see no way around this. With a compressed output it also isn't possible to seek back after writing all output and adjust the file size.
But as you phrased your question, this might be possible in your case.
All you will have to do is provide a file-like object (could be a Popen object's output stream) like reader in my simple example case.
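Applied to the actual goal of stripping certain characters from the XML members, the same pattern could look roughly like the sketch below. This is an assumption on my part: the set of bytes to remove (BAD) is hypothetical, and each decompressed member is buffered in memory, which may be a problem for the 5 GB file.

#!/usr/bin/env python3
import io
import sys
import tarfile

BAD = b'\x00\x0b'  # hypothetical set of bytes to strip from every member

tar_in = tarfile.open(fileobj=sys.stdin.buffer, mode='r:gz')
tar_out = tarfile.open(fileobj=sys.stdout.buffer, mode='w:gz')
for tar_info in tar_in:
    reader = tar_in.extractfile(tar_info)
    if reader is None:               # directories and other non-regular entries
        tar_out.addfile(tar_info)
        continue
    data = reader.read()
    filtered = bytes(b for b in data if b not in BAD)  # drop the unwanted bytes
    tar_info.size = len(filtered)    # tar must know the final size up front
    tar_out.addfile(tar_info, io.BytesIO(filtered))
tar_out.close()
tar_in.close()

It can be called the same way as tar.py above, reading the archive from stdin and writing the modified archive to stdout.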

Fastest Python way to compare two dirs and delete non-duplicate basenames

I have two directories, one called images and one called annotations. In the images directory I have images with long string names and the .jpg file extension. In the annotations directory I have .xml files with the same string names (up to the extension).
I removed a bunch of xml files (approx 20k out of 200k), but I still have all 200k images. Now I want to remove the image files that no longer have a corresponding xml file. I can do this easily by globbing each directory and comparing every pair in each file list, but this will take quite a bit of time to run. Is there any faster way to go about this?
So, in other words, what's the fastest way in Python to compare list A with sublist B and return all non-matches from A?
I would use the pathlib library and the set data structure, as follows.
from pathlib import Path

# stems of the annotation files that survived the clean-up
keep_stems = set(p.stem for p in Path('annotations').glob('*.xml'))
# images whose annotation no longer exists
delete_paths = [p for p in Path('images').glob('*.jpg') if p.stem not in keep_stems]
for p in delete_paths:
    p.unlink()
You could technically just do the unlink in the list comprehension but having side-effects from a list comprehension seems unpleasant.

Use recursive globbing to extract XML documents as strings in pyspark

The goal is to extract XML documents, given an XPath expression, from a group of text files as strings. The difficulty is the variety of forms the text files may come in. They might be:
single zip / tar file with 100 files, each 1 XML document
one file, with 100 XML documents (aggregate document)
one zip / tar file, with varying levels of directories, with single XML records as files and aggregate XML files
I thought I had found a solution with Databricks' Spark-XML library, as it handles recursive globbing when reading files. It was amazing. I could do things like:
# read directory of loose files
df = sqlContext.read.format('com.databricks.spark.xml').options(rowTag='mods:mods').load('file:///tmp/combine/qs/mods/*.xml')
# recursively discover and parse
df = sqlContext.read.format('com.databricks.spark.xml').options(rowTag='mods:mods').load('file:///tmp/combine/qs/**/*.xml')
# even read archive files without additional work
df = sqlContext.read.format('com.databricks.spark.xml').options(rowTag='mods:mods').load('file:///tmp/combine/mods_archive.tar')
The problem is that this library is focused on parsing the XML records into DataFrame columns, whereas my goal is to retrieve just the XML documents as strings for storage.
My Scala is not strong enough to easily hack on the Spark-XML library so that it keeps the recursive globbing and XPath grabbing of documents but skips the parsing and instead saves the entire XML record as a string.
The library comes with the ability to serialize DataFrames to XML, but the serialization is decidedly different from the input (which is to be expected to some degree). For example, element text values become element attributes. Given the following original XML:
<mods:role>
<mods:roleTerm authority="marcrelator" type="text">creator</mods:roleTerm>
</mods:role>
reading and then serializing with Spark-XML returns:
<mods:role>
<mods:roleTerm VALUE="creator" authority="marcrelator" type="text"></mods:roleTerm>
</mods:role>
However, even if I could get the VALUE to be serialized as an actual element value, I still wouldn't be achieving my end goal of having these XML documents, discovered and read via Spark-XML's excellent globbing and XPath selection, available as just strings.
Any insight would be appreciated.
Found a solution from this Databricks Spark-XML issue:
xml_rdd = sc.newAPIHadoopFile(
    'file:///tmp/mods/*.xml',
    'com.databricks.spark.xml.XmlInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'org.apache.hadoop.io.Text',
    conf={
        'xmlinput.start': '<mods:mods>',
        'xmlinput.end': '</mods:mods>',
        'xmlinput.encoding': 'utf-8',
    },
)
Expecting 250 records, and got 250 records. Simple RDD with entire XML record as a string:
In [8]: xml_rdd.first()
Out[8]:
(4994,
'<mods:mods xmlns:mets="http://www.loc.gov/METS/" xmlns:xl="http://www.w3.org/1999/xlink" xmlns:mods="http://www.loc.gov/mods/v3" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.openarchives.org/OAI/2.0/" version="3.0">\n\n\n <mods:titleInfo>\n\n\n <mods:title>Jessie</mods:title>\n\n\n...
...
...
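From here, if the goal is just the raw strings, one possible follow-up (my assumption, not part of the original post) is to drop the byte-offset keys and persist the values:

xml_strings = xml_rdd.map(lambda kv: kv[1])    # keep only the raw XML string
# hypothetical output path; note the records keep their embedded newlines,
# so the result is not a strict one-record-per-line text file
xml_strings.saveAsTextFile('file:///tmp/mods_strings')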
Credit to the Spark-XML maintainer(s) for a wonderful library, and attentiveness to issues.
