CSV_NON_TRIMABLE_CHAR_AFTER_CLOSING_QUOTE error from csv-parse for Node.js

I'm using a 3rd party application that throws this error:
Error: Invalid Closing Quote: found non trimable byte after quote at line 5
code: CSV_NON_TRIMABLE_CHAR_AFTER_CLOSING_QUOTE
column: 16
empty_lines: 0
header: false
index: 16
invalid_field_length: 0
quoting: true
lines: 5
records: 4
when loading a CSV file (I have to use this application). I think the application is using this Node library: https://csv.js.org/parse/errors/
I also think the trouble is that column 16 has fields that look like this:
\Circular Logic 3\""
But I don't quite understand the issue.
Can anyone tell me if this column is not escaping the quotes properly? I have a ton of files, with many instances like this in column 16, so it would be a pain to fix them manually.

It looks like the proper escaping for a double quote in CSV is another double quote, so the line should be:
\Circular Logic 3"""
We'll have to reformat the data.
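Since fixing the files by hand isn't practical, a batch rewrite may help. Below is a minimal Python sketch, assuming the only problem is backslash-escaped quotes (\") that should be doubled quotes (""); the *.csv glob is a placeholder for wherever your files live:
import glob

for path in glob.glob('*.csv'):  # hypothetical location/pattern
    with open(path, encoding='utf-8') as f:
        text = f.read()
    # Replace backslash-escaped quotes with doubled quotes,
    # e.g. \Circular Logic 3\"" becomes \Circular Logic 3"""
    fixed = text.replace('\\"', '""')
    with open(path, 'w', encoding='utf-8') as f:
        f.write(fixed)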

Related

Removing the first date and timestamp in each line of a log file using Python

I have a series of log files in text file format.
The document format is this:
[2021-12-11T10:21:30.370Z] Branch indexing
[2021-12-11T10:21:30.374Z] Starting the program with default pipeID
[2021-12-11T10:21:30.374Z] Running with durable level: max_survivbility will make this program crash if left running for 20 minutes
[2021-12-11T10:21:30.374Z] Starting the program with default pipeID
Each line in the document starts with: [2021-12-11T10:21:30.370Z]
I want to remove the first set of characters that represent date and timestamp and have a result something like this:
Branch indexing
Starting the program with default pipeID
Running with durable level: max_survivbility will make this program crash if left running for 20 minutes
Starting the program with default pipeID
Can anyone please explain how I can do this?
I tried to use this method but it doesn't work since I have '[]' in the date stamp.
import re
text = "[2021-12-11T10:21:30.370Z] Branch indexing"
re.sub("[.*?]", "", text)
This doesn't work for me.
If I try the same method on a string like text = "<2021-12-11T10:21:30.370Z> Branch indexing":
import re
text = "<2021-12-11T10:21:30.370Z> Branch indexing"
re.sub("<.*?>", "", text)
It removes <2021-12-11T10:21:30.370Z>. Why does this not work with [2021-12-11T10:21:30.370Z]?
I need help removing every instance of this format "[2021-12-11T10:21:30.370Z]" in all the log files.
Thank you so much.
I'd rather go with a simple solution for this case: split the string where the ] ends, then strip the second element of the resulting list to remove the extra spaces, and print it. Hope this helps!
import re
text = "[2021-12-11T10:21:30.370Z] Branch indexing"
print(re.split("]", text)[1].strip())
Your current regex pattern is off because square brackets are regex metacharacters that need to be escaped; as written, [.*?] is a character class that matches a single ., * or ? character, not a bracketed timestamp. Also, you should run the regex in multiline mode so ^ anchors at the start of each line, and the timestamp pattern should be more generic:
text = re.sub(r'^\[.*?\]\s+', '', text, flags=re.M)
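To apply this across a whole file, here's a minimal sketch, assuming plain-text logs; app.log is a hypothetical filename and you'd loop over your real files as needed:
import re

# Read the whole log file into one string.
with open("app.log") as f:
    text = f.read()

# Strip a leading [timestamp] plus the whitespace after it from every line.
cleaned = re.sub(r'^\[.*?\]\s+', '', text, flags=re.M)
print(cleaned)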

Split an XML file with multiple records and invalid characters into multiple text files by element

I have a set of 100K XML-ish (more on that later) legacy files with a consistent structure - an Archive wrapper with multiple Date and Data pair records.
I need to extract the individual records and write them to individual text files, but am having trouble parsing the data due to illegal characters and random CR/space/tab leading and trailing data.
About the XML Files
The files are inherited from a retired system and can't be regenerated. Each file is pretty small (less than 5 MB).
There is one Date record for every Data record:
vendor-1-records.xml
<Archive>
<Date>10 Jan 2019</Date>
<Data>Vendor 1 Record 1</Data>
<Date>12 Jan 2019</Date>
<Data>Vendor 1 Record 2</Data>
(etc)
</Archive>
vendor-2-records.xml
<Archive>
<Date>22 September 2019</Date>
<Data>Vendor 2 Record 1</Data>
<Date>24 September 2019</Date>
<Data>Vendor 2 Record 2</Data>
(etc)
</Archive>
...
vendor-100000-records.xml
<Archive>
<Date>12 April 2019</Date>
<Data>Vendor 100000 Record 1</Data>
<Date>24 October 2019</Date>
<Data>Vendor 100000 Record 2</Data>
(etc)
</Archive>
I would like to extract each Data record and use the Date entry to define a unique file name, then write the contents of the Data record to that file, like so:
filename: vendor-1-record-1-2019-1Jan-10.txt contains
file contents: Vendor 1 record 1
(no tags, just the record terminated by CR)
filename: vendor-1-record-2-2019-1Jan-12.txt contains
file contents: Vendor 1 record 2
filename: vendor-2-record-1-2019-9Sep-22.txt contains
file contents: Vendor 2 record 1
filename: vendor-2-record-2-2019-9Sep-24.txt contains
file contents: Vendor 2 record 2
Issue 1 : illegal characters in XML Data records
One issue is that the elements contain multiple characters that XML libraries like ElementTree terminate on, including control characters, formatting characters, and various Alt+XXX-type characters.
I've searched online and found all manner of workarounds, regexes, and search-and-replace scripts, but the only thing that seems to work in Python is lxml's etree with recover=True.
However, that doesn't even always work because some of the files are apparently not UTF-8, so I get the error:
lxml.etree.XMLSyntaxError: Input is not proper UTF-8, indicate encoding !
Issue 2 - Data records have random amounts of leading and trailing CRs and spaces
For the files I can parse with lxml.etree, the actual Data records are also wrapped in CRs and random spaces:
<Data>
(random numbers of CR + spaces and sometimes tabs)
*content<CR>*
(random numbers of CR + spaces and sometimes tabs)
</Data>
and therefore when I run
parser = etree.XMLParser(recover=True)
tree = etree.parse('vendor-1-records.xml', parser=parser)
tags_needed = tree.iter('Data')
for it in tags_needed:
    print(it.tag, it.attrib)
I get a collection of empty Data tags (one for each data record in the file) like
Data {}
Data {}
Questions
Is there a more efficient language/module than Python's lxml for ignoring the illegal characters? As I said, I've dug through a number of cookbook blog posts, SE articles, etc for pre-processing the XML and nothing seems to really work - there's always one more control character/etc that hangs the parser.
SE suggested a post about cleaning XML that references an old Atlassian tool (Stripping Invalid XML characters in Java). I did some basic tests and it seems like it might work, but I'm open to other suggestions.
I have not used regex with Python much - any suggestions on how to handle cleaning the leading/trailing CR/space/tab randomness in the Data tags? The actual record string I want in that Data tag also has a CR at the end and may contain tabs as well so I can't just search and replace. Maybe there is a regex way to pull that but my regex-fu is pretty weak.
For my issues 1 and 2, I kind of solved my own problem:
Issue 1 (parsing and invalid characters)
I ran the entire set of files through the Atlassian jar referenced in (Stripping Invalid XML characters in Java) with a batch script:
for %%f in (*.xml) do (
java -jar atlassian-xml-cleaner-0.1.jar %%f > clean\%%~f
)
This utility standardized all of the XML files and made them parseable by lxml.
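For anyone who'd rather stay in Python, a rough equivalent of that character-stripping step might look like the sketch below. It assumes the XML 1.0 valid-character ranges and is not the Atlassian tool's exact behaviour:
import re

# Characters outside these ranges are illegal in XML 1.0 documents.
ILLEGAL_XML_CHARS = re.compile(
    '[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]'
)

def strip_invalid_xml_chars(text):
    # Drop characters an XML 1.0 parser will refuse to accept.
    return ILLEGAL_XML_CHARS.sub('', text)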
Issue 2 (CR, spaces, tabs inside the Data element)
This configuration for lxml stripped all whitespace and handled the invalid character issue
from lxml import etree
parser = etree.XMLParser(encoding='utf-8', recover=True, remove_blank_text=True)
tree = etree.parse(filepath, parser=parser)
With these two steps I'm now able to start extracting records and writing them to individual files:
# for each Date, finding the next item gives me the Data element and I can strip the tab/CR/whitespace:
for item in tree.findall('Date'):
    dt = parse_datestamp(item.text.strip())
    content = item.getnext().text.strip()
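Continuing from there, the write step could look something like the sketch below. The parse_datestamp here is a hypothetical stand-in for my actual helper, and the filename scheme only approximates the naming above:
from datetime import datetime
from lxml import etree

def parse_datestamp(text):
    # Dates appear both as '10 Jan 2019' and '22 September 2019'.
    for fmt in ('%d %b %Y', '%d %B %Y'):
        try:
            return datetime.strptime(text, fmt).strftime('%Y-%m%b-%d')
        except ValueError:
            continue
    raise ValueError(f'unrecognised date: {text!r}')

parser = etree.XMLParser(encoding='utf-8', recover=True, remove_blank_text=True)
tree = etree.parse('vendor-1-records.xml', parser=parser)

for i, item in enumerate(tree.findall('Date'), start=1):
    stamp = parse_datestamp(item.text.strip())
    content = item.getnext().text.strip()
    with open(f'vendor-1-record-{i}-{stamp}.txt', 'w', encoding='utf-8') as f:
        f.write(content + '\n')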

How to remove special characters when using .splitlines()

I have been working on a script that pulls email data from my gmail account.
Sometimes the output was printed on multiple lines, which caused the CSV formatting to malfunction.
In order to solve this issue, I have tried to use the .splitlines() command. This works really well, however the output looks like this now:
['ROSE Bikes News <newsletter@news.rosebikes.com>;10 % AUF DEUTER;22 Aug 2019 11:11:30']
Ideally I do not want the following characters ['CONTENT'].
I have tried to use the .strip() function without success.
content = (email_from + ';' + subject + ';' + local_message_date).splitlines()
print(content.strip("[']"))
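For what it's worth, .splitlines() returns a list, and .strip() is a string method, which is why content.strip("[']") fails; the [' and '] come from printing the list itself. A minimal sketch, with hypothetical stand-in values for the Gmail fields:
# Hypothetical stand-in values for the Gmail fields.
email_from = "ROSE Bikes News <newsletter@news.rosebikes.com>"
subject = "10 % AUF DEUTER"
local_message_date = "22 Aug 2019 11:11:30"

# Join the list back into one flat string instead of printing the list.
content = (email_from + ';' + subject + ';' + local_message_date).splitlines()
print(' '.join(content))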

Fortran error check on formatted read

In my code I am attempting to read in output files that may or may not have a formatted integer in the first line of the file. To aid backwards compatibility I am attempting to be able to read in both examples as shown below.
head -n 3 infile_new
22
8
98677.966601475651 -35846.869655806520 3523978.2959464169
or
head -n 3 infile_old
8
98677.966601475651 -35846.869655806520 3523978.2959464169
101205.49395364164 -36765.047712555031 3614241.1159234559
The format of the top line of infile_new is '(i5)' and so I can accommodate this in my code with a standard read statement of
read(iunit, '(I5)' ) n
This works fine, but if I attempt to read in infile_old using this, I as expected get an error. I have attempted to get around this by using the following
read(iunit, '(I5)', iostat=ios, err=110) n
110 if (ios == 0) then
   print*, 'error in file, setting n'
   naBuffer = na
   !rewind(iunit) ! not sure whether to rewind or close/open to reset file position
   close(iunit)
   open(iunit, file=fname, status='unknown')
else
   print*, "Something very wrong in particle_inout"
end if
The problem here is that when reading either the old or the new file, the code ends up in the error branch. I've not been able to find much documentation on using the read statement in this way, so I cannot determine what is going wrong.
My one theory was my use of ios==0 in the if statement, but I figured that since I shouldn't get an error when reading the new file, it shouldn't matter. It would be great to know if anyone has a way to catch such errors.
From what you've shown us, after the code executes the read statement it executes the statement labelled 110. Then, if there wasn't an error and ios == 0, the true branch of the if construct is executed.
So if there is an error in the read, the code jumps to that statement, and if there isn't, it falls through to the same statement. The code doesn't magically know to skip the code starting at label 110 when the read succeeds. Personally, I've never used both iostat and err in the same read statement, and here I think it's tripping you up.
Try changing the read statement to
read(iunit, '(I5)' , iostat=ios) n
You'd then need to re-work your if construct a bit, since iostat==0 is not an error condition.
Incidentally, to read a line which is known to contain only one integer I wouldn't use an explicit format, I'd just use
read(iunit, * , iostat=ios) n
and let the run-time worry about how big the integer is and where to find it.

Filename manipulation in cygwin

I am running cygwin on Windows 7. I am using a signal processing tool and basically performing alignments. I had about 1200 input files. Each file is of the format given below.
input_file_format = "AC_XXXXXX.abc"
The first step required building some kind of indexes for all the input files, this was done with the tool's build-index command and now each file had 6 indexes associated with it. Therefore now I have about 1200*6 = 7200 index files. The indexes are of the form given below.
indexes_format = "AC_XXXXXX.abc.1",
"AC_XXXXXX.abc.2",
"AC_XXXXXX.abc.3",
"AC_XXXXXX.abc.4",
"AC_XXXXXX.abc.rev.1",
"AC_XXXXXX.abc.rev.1"
Now, I need to use these indexes to perform the alignment. All the 6 indexes of each file are called together and the final operation is done as follows.
signal-processing-tool ..\path-to-indexes\AC_XXXXXX.abc ..\Query file
Where AC_XXXXXX.abc is the index base name associated with that particular input file. All 6 index files are matched by AC_XXXXXX.abc*.
My problem is that I need to use only the first 14 characters of the index file names for the final operation.
When I use the code below, the alignment is not executed.
for file in indexes/*; do ./tool $file|cut -b1-14 Project/query_file; done
I'd appreciate help with this!
First of all, keep in mind that $file will always start with "indexes/", so taking the first 14 characters would always include that folder name at the beginning.
To use first 14 characters in a variable, use ${file:0:14}, where 0 is the starting string index, and 14 is the length of the desired substring.
Alternatively, if you want to use cut, you need to run it in a command substitution: for file in indexes/*; do ./tool $(echo $file|cut -c 1-14) Project/query_file; done. Note that I changed the argument for cut to -c, for characters instead of bytes.
