Split an XML file with multiple records and invalid characters into multiple text files by element - python-3.x

I have a set of 100K XML-ish (more on that later) legacy files with a consistent structure - an Archive wrapper with multiple Date and Data pair records.
I need to extract the individual records and write them to individual text files, but am having trouble parsing the data due to illegal characters and random CR/space/tab leading and trailing data.
About the XML Files
The files are inherited from a retired system and can't be regenerated. Each file is pretty small (less than 5 MB).
There is one Date record for every Data record:
vendor-1-records.xml
<Archive>
<Date>10 Jan 2019</Date>
<Data>Vendor 1 Record 1</Data>
<Date>12 Jan 2019</Date>
<Data>Vendor 1 Record 2</Data>
(etc)
</Archive>
vendor-2-records.xml
<Archive>
<Date>22 September 2019</Date>
<Data>Vendor 2 Record 1</Data>
<Date>24 September 2019</Date>
<Data>Vendor 2 Record 2</Data>
(etc)
</Archive>
...
vendor-100000-records.xml
<Archive>
<Date>12 April 2019</Date>
<Data>Vendor 100000 Record 1</Data>
<Date>24 October 2019</Date>
<Data>Vendor 100000 Record 2</Data>
(etc)
</Archive>
I would like to extract each Data record and use the Date entry to define a unique file name, then write the contents of the Data record to that file, like so:
filename: vendor-1-record-1-2019-1Jan-10.txt contains
file contents: Vendor 1 record 1
(no tags, just the record terminated by CR)
filename: vendor-1-record-2-2019-1Jan-12.txt contains
file contents: Vendor 1 record 2
filename: vendor-2-record-1-2019-9Sep-22.txt contains
file contents: Vendor 2 record 1
filename: vendor-2-record-2-2019-9Sep-24.txt contains
file contents: Vendor 2 record 2
Issue 1 : illegal characters in XML Data records
One issue is that the elements contain multiple characters that XML libraries like ElementTree terminate on, including control characters, formatting characters and various Alt+XXX-type characters.
I've searched online and found all manner of workarounds, regexes and search-and-replace scripts, but the only thing that seems to work in Python is lxml's etree with recover=True.
However, that doesn't even always work because some of the files are apparently not UTF-8, so I get the error:
lxml.etree.XMLSyntaxError: Input is not proper UTF-8, indicate encoding !
Issue 2 - Data records have random amounts of leading and trailing CRs and spaces
For the files I can parse with lxml.etree, the actual Data records are also wrapped in CRs and random spaces:
<Data>
(random numbers of CR + spaces and sometimes tabs)
*content<CR>*
(random numbers of CR + spaces and sometimes tabs)
</Data>
and therefore when I run
parser = etree.XMLParser(recover=True)
tree = etree.parse('vendor-1-records.xml', parser=parser)
tags_needed = tree.iter('Data')
for it in tags_needed:
    print(it.tag, it.attrib)
I get a collection of empty Data tags (one for each data record in the file) like
Data {}
Data {}
Questions
Is there a more efficient language/module than Python's lxml for ignoring the illegal characters? As I said, I've dug through a number of cookbook blog posts, SE articles, etc for pre-processing the XML and nothing seems to really work - there's always one more control character/etc that hangs the parser.
SE suggested a post about cleaning XML which references an old Atlassian tool (Stripping Invalid XML characters in Java). I did some basic tests and it seems like it might work, but I'm open to other suggestions.
I have not used regex with Python much - any suggestions on how to handle cleaning the leading/trailing CR/space/tab randomness in the Data tags? The actual record string I want in that Data tag also has a CR at the end and may contain tabs as well so I can't just search and replace. Maybe there is a regex way to pull that but my regex-fu is pretty weak.

For my issues 1 and 2, I kind of solved my own problem:
Issue 1 (parsing and invalid characters)
I ran the entire set of files through the Atlassian jar referenced in (Stripping Invalid XML characters in Java) with a batch script:
for %%f in (*.xml) do (
java -jar atlassian-xml-cleaner-0.1.jar %%f > clean\%%~f
)
This utility standardized all of the XML files and made them parseable by lxml.
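For anyone who can't run the Java jar, the cleanup it performs can also be sketched in pure Python with a single regex built from the XML 1.0 valid-character ranges. This is a sketch, not a drop-in replacement for the Atlassian tool: it strips illegal characters from already-decoded text, so the non-UTF-8 files still need their encoding handled first.

```python
import re

# XML 1.0 valid character ranges: #x9 | #xA | #xD | [#x20-#xD7FF]
# | [#xE000-#xFFFD] | [#x10000-#x10FFFF]; everything else is stripped.
_INVALID_XML = re.compile(
    '[^\x09\x0a\x0d\x20-\ud7ff\ue000-\ufffd\U00010000-\U0010ffff]')

def strip_invalid_xml_chars(text):
    """Remove characters that are not legal in an XML 1.0 document."""
    return _INVALID_XML.sub('', text)
```

Tabs, CRs and LFs are legal XML and survive the substitution; only the control/formatting characters that hang the parsers are removed.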
Issue 2 (CR, spaces, tabs inside the Data element)
This configuration for lxml stripped the blank text nodes and handled the invalid character issue:
from lxml import etree
parser = etree.XMLParser(encoding='utf-8', recover=True, remove_blank_text=True)
tree = etree.parse(filepath, parser=parser)
With these two steps I'm now able to start extracting records and writing them to individual files:
# for each date, finding the next item gives me the Data element and I can strip the tab/CR/whitespace:
for item in tree.findall('Date'):
    dt = parse_datestamp(item.text.strip())
    content = item.getnext().text.strip()
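Putting the pieces together, a fuller sketch of the extraction loop might look like the following. parse_datestamp is the helper from the snippet above; the implementation here is one guess at it, and the mixed date formats and exact target filename pattern are assumptions taken from the examples in the question.

```python
from datetime import datetime

def parse_datestamp(text):
    # The source files mix "10 Jan 2019" and "22 September 2019",
    # so try both abbreviated and full month names (an assumption).
    for fmt in ('%d %b %Y', '%d %B %Y'):
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    raise ValueError('unrecognised date: %r' % text)

def record_filename(vendor, record_num, dt):
    # Target pattern from the question, e.g. vendor-1-record-1-2019-1Jan-10.txt
    return 'vendor-%s-record-%d-%d-%d%s-%02d.txt' % (
        vendor, record_num, dt.year, dt.month, dt.strftime('%b'), dt.day)

def extract_records(filepath, vendor):
    from lxml import etree  # third-party: pip install lxml
    parser = etree.XMLParser(encoding='utf-8', recover=True,
                             remove_blank_text=True)
    tree = etree.parse(filepath, parser=parser)
    for n, date_el in enumerate(tree.findall('Date'), start=1):
        dt = parse_datestamp(date_el.text.strip())
        content = date_el.getnext().text.strip()
        with open(record_filename(vendor, n, dt), 'w', encoding='utf-8') as out:
            out.write(content + '\n')
```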

Related

How to use Relational Stores with a position based data file?

I have different data files that are mapped on relational stores. I have a formatter which contains the separators used by the different data files (most of them CSV). Here is an example of what one looks like:
DQKI 435741198746445 45879645422727JHUFHGLOBAL COLLATERAL SERVICES AGGREGATOR V9
The rule to read this file is as following: from index 0 to 3, it's the code name, from index 8 to 11, it's PID, from index 11 to 20, it's account number, and so on...
How do you specify such rule in ActivePivot Relational Stores?
The relational-store of ActivePivot ships with a high performance, multithreaded CSV-Source to parse files and load them into data stores. I suppose that's what you hope to use for your fixed-length field file.
But this is not supported in the current version of the Relational Store (1.5.x).
You could pre-process your file with a small script to add a separator character at the end of each of the fields. Then the entire CSV Source can be reused immediately.
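That pre-processing step can be sketched in a few lines of Python. The (start, end) offsets below are hypothetical, taken loosely from the description above; adjust them to the real record layout.

```python
import csv

# Hypothetical (start, end) offsets loosely following the description
# above (code name, PID, account number); adjust to the real layout.
FIELDS = [(0, 4), (8, 11), (11, 20)]

def fixed_width_to_csv(src_path, dst_path):
    # Rewrite a fixed-width file as CSV so the standard CSV Source
    # can load it unchanged.
    with open(src_path) as src, open(dst_path, 'w', newline='') as dst:
        writer = csv.writer(dst)
        for line in src:
            writer.writerow([line[start:end].strip() for start, end in FIELDS])
```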
You could write your own data source that defines fields as offset in the text line. If you do that you can reuse all of the fast field parsers available in the CSV Source project (they work on any char sequences):
com.quartetfs.fwk.format.impl.DoubleParser
com.quartetfs.fwk.format.impl.FloatParser
com.quartetfs.fwk.format.impl.DoubleVectorParser
com.quartetfs.fwk.format.impl.FloatVectorParser
com.quartetfs.fwk.format.impl.IntegerParser
com.quartetfs.fwk.format.impl.IntegerVectorParser
com.quartetfs.fwk.format.impl.LongParser
com.quartetfs.fwk.format.impl.ShortParser
com.quartetfs.fwk.format.impl.StringParser
com.quartetfs.fwk.format.impl.DateParser

Matching text files from a list of system numbers

I have ~ 60K bibliographic records, which can be identified by system number. These records also hold full text (individual text files named by system number).
I have lists of system numbers in bunches of 5K and I need to find a way to copy only the text files from each 5K list.
All text files are stored in a directory (/fulltext) and are named something along these lines:
014776324.txt.
The 5k lists are plain text, stored in separate directories (e.g. /5k_list_1, /5k_list_2, ...), where each system number matches a .txt file.
For example: bibliographic record 014776324 matches to 014776324.txt.
I am struggling to find a way to copy into the 5k_list_* folders only the corresponding text files.
Any idea?
Thanks indeed,
Let's assume we invoke the following script this way:
./the-script.sh fulltext 5k_list_1 5k_list_2 [...]
Or more succinctly:
./the-script.sh fulltext 5k_list_*
Then try using this (totally untested) script:
#!/usr/bin/env bash
set -eu                 # enable error checking
src_dir=$1              # first argument is where to copy files from
shift 1
for list_dir; do        # implicitly iterates over the remaining args
    while read -r sys_num; do   # each line of the list is a bare system number
        cp "$src_dir/$sys_num.txt" "$list_dir/"
    done < "$list_dir/list.txt"
done
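If a shell is not available, the same copy loop can be sketched in Python. The per-directory list file name list.txt is an assumption carried over from the script above.

```python
from pathlib import Path
import shutil

def copy_listed_files(src_dir, list_dirs, list_name='list.txt'):
    # For each list directory, read its list file (one bare system
    # number per line) and copy the matching .txt from src_dir into it.
    src = Path(src_dir)
    for list_dir in map(Path, list_dirs):
        for line in (list_dir / list_name).read_text().splitlines():
            sys_num = line.strip()
            if sys_num:
                shutil.copy(src / (sys_num + '.txt'), list_dir)
```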

Start loop at specific line of text file in groovy

I am using groovy and I am trying to have a text file be altered at specific line, without looping through all of the previous lines. Is there a way to state the line of a text file that you want to wish to alter?
For instance
Text file is:
1
2
3
4
5
6
I would like to say
Line(3) = p
and have it change the text file to:
1
2
p
4
5
6
I DO NOT want to have to do a loop to iterate through the lines to change the value, aka I do not want to use a .eachline {line ->...} method.
Thank you in advance, I really appreciate it!
I don't think you can skip lines and jump straight to one like this. You could skip ahead by using RandomAccessFile in Java, but you would be specifying a number of bytes rather than a number of lines.
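The byte-offset idea only works cleanly when every line has the same fixed length, so line N starts exactly at (N-1) * line_len bytes. Here is a sketch of the principle in Python for illustration; the same logic applies with Java's RandomAccessFile.

```python
def overwrite_line(path, line_number, new_text, line_len):
    # Overwrite a 1-based line in place. Assumes '\n' line endings,
    # ASCII content, and that EVERY line is exactly line_len bytes -
    # that is what lets us seek straight to the line without reading
    # the ones before it.
    record = new_text.ljust(line_len - 1)[:line_len - 1] + '\n'
    with open(path, 'r+b') as f:
        f.seek((line_number - 1) * line_len)
        f.write(record.encode('ascii'))
```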
Try using readLines() on the file's text. It will store all your lines in a list. To change the content at line n, change the element at index n-1 of the list and then join the list items back together.
Something like this will do
//We can call this the DefaultFileHandler
def lineNumberToModify = 3
def textToInsert = "p"
line(lineNumberToModify, textToInsert)

def line(num, text) {
    def list = file.readLines()   // 'file' is the File object to modify
    list[num - 1] = text
    file.setText(list.join("\n"))
}
EDIT: For extremely large files it is better to have a custom implementation, maybe something along the lines of what Tim Yates suggested in the comment on your question.
The above readLines() can easily process up to 100,000 lines of text in under a second. So you can do something like this:
if(file size < 10 MB)
use DefaultFileHandler()
else
use CustomFileHandler()
//CustomFileHandler
- Split the large file into buckets of acceptable size.
- Ex: Bucket 1(1-100000 lines), Bucket 2(100000-200000 lines), etc.
- if (lineNumberToModify falls in bucket range)
insert into line in the bucket
There is no hard and fast rule to define how you implement your CustomFileHandler as it completely depends on the use case scenario. If you need to do the above operation multiple times on the same file, you can choose to do the complete bucket split first, store them in memory and use the buckets for the following operations. Or if it is a one time operation, you can avoid manipulating all the buckets first but deal with only what you need and process the others later on on-demand basis.
And even within the buckets you can define your own intelligence to speed up the job. Say you want to modify line 99999 of a bucket holding lines 1-100000; you can exploit Groovy's negative list indexing to work from the end of the list instead:
list = file.readLines()
list[-2] = "some text" // second element from the end, i.e. line 99999

Wolfram Mathematica import data from multiple files

I have a lot of files, each of which contains data.
I can happily import one file into Mathematica, but there are more than 500 files.
I do it so:
Import["~/math/third_ks/mixed_matrices/1.dat", "Table"];
aaaa = %
(*OUTPUT - some data, I can access them!*)
All I want is to make a loop (I can do that), but I cannot change the file name 1.dat inside it. I want to change it.
I tried to make a workaround: I generated part of the possible names and wrote them to a separate file.
Import["~/math/third_ks/mixed_matrices/generate_name_of_files.dat", "Table"];
aaaa = %
Output: {{"~/math/third_ks/mixed_matrices/0.dat"}, \
{"~/math/third_ks/mixed_matrices/1.dat"}, ......
All that I want to do is Table[a = Import[aaaa[[i]]], {i, 1, 500}].
But the function Import accepts only String objects as file names/paths.
You can use FileNames to collect the names of the data files you want to import, with the usual wildcards.
And then just map the Import statement over the list of filenames.
data will then contain a list comprising the data from each file as a separate element.
data = Import[#, "Table"] & /@ FileNames["~/math/third_ks/mixed_matrices/*.dat"];
It's a bit hard to work out what is going on without the file of filenames. However, I think you might be able to solve your problem by using Flatten on the list of filenames to make it a vector of String objects that can be passed to Import. Currently your list is an n*1 matrix, where each row is a List containing a String, not a vector of Strings.
Incidentally you could use Map (/@) instead of Table in this instance.
Thank you for your response.
It happened so that I got two solutions in the same time.
I think it would be not fair to forget about second way.
aaaa = "~/math/third_ks/mixed_matrices/" <> ToString[#] <> ".dat" & /@ Range[0, 116];
(*This thing generates the list of names.
Output:
{"~/math/third_ks/mixed_matrices/0.dat", \
"~/math/third_ks/mixed_matrices/1.dat", \
"~/math/third_ks/mixed_matrices/2.dat", .....etc, until 116*)
Table[Import[aaaa[[i]], "Table"], {i, 1, 117}];
(*and it just imports data from file*)
bbbb = %; (*here we have all data, voila!*)
Incidentally, it's not my solution.
It was suggested by a friend of mine:
https://stackoverflow.com/users/1243244/light-keeper

Filename manipulation in cygwin

I am running cygwin on Windows 7. I am using a signal processing tool and basically performing alignments. I had about 1200 input files. Each file is of the format given below.
input_file_ format = "AC_XXXXXX.abc"
The first step required building some kind of indexes for all the input files, this was done with the tool's build-index command and now each file had 6 indexes associated with it. Therefore now I have about 1200*6 = 7200 index files. The indexes are of the form given below.
indexes_format = "AC_XXXXXX.abc.1",
"AC_XXXXXX.abc.2",
"AC_XXXXXX.abc.3",
"AC_XXXXXX.abc.4",
"AC_XXXXXX.abc.rev.1",
"AC_XXXXXX.abc.rev.2"
Now, I need to use these indexes to perform the alignment. All the 6 indexes of each file are called together and the final operation is done as follows.
signal-processing-tool ..\path-to-indexes\AC_XXXXXX.abc ..\Query file
Where AC_XXXXXX.abc is the base name associated with that particular index file. All 6 index files are called with AC_XXXXXX.abc*.
My problem is that I need to use only the first 14 characters of the index file names for the final operation.
When I use the code below, the alignment is not executed.
for file in indexes/*; do ./tool $file|cut -b1-14 Project/query_file; done
I'd appreciate help with this!
First of all, keep in mind that $file will always start with "indexes/", so taking the first 14 characters will always include that folder name at the beginning.
To use the first 14 characters of a variable, use ${file:0:14}, where 0 is the starting index and 14 is the length of the desired substring.
Alternatively, if you want to use cut, you need to run it in a command substitution: for file in indexes/*; do ./tool "$(echo "$file" | cut -c 1-14)" Project/query_file; done (I changed the arg for cut to -c for characters instead of bytes.)
