I'm using SAX JS to parse an XML file in Node. I want it to produce an object of the parsed file, but the best I seem to be able to do is console.log my parsed data.
I'm really new to streams in Node. I've googled and tried some things, but my fundamental problem seems to be that I can't get a grasp on where to begin with streams and how they relate to SAX JS.
How do I output the parsed XML file from SAX to a JS object?
Addendum
Ideally I'd like a JS object in a variable, but I'd also be happy getting JSON text out, which I could deserialize into a variable.
With SAX JS, I tried this.write(JSON.stringify(val)); from the closetag event handler, and it produces countless Error: Invalid characters in closing tag errors. I really have no idea what I'm doing here.
I've already tried xml2js (didn't do what I need), and xml4js (not maintained). The big problem I had with xml2js is that my xml file's text includes essential data in self-closing tags that ended up in a different key, completely separate from the text.
Here's an XML structure somewhat like what I need it to handle:
<p>The quick brown fox <del>jumps</del>
over the <lb n="15"/> lazy dog.</p>
I need all the text, and I need some way to insert the attribute of the lb tag into the text with a custom format.
Addendum 2
Here's a better example, along with an ideal result:
<p>The quick brown fox <del>jumps</del>
over the <lb n="15"/> lazy
<note type="marginal">325a</note> dog.</p>
Result:
The quick brown fox jumps over the [line 15] lazy [B:325a] dog.
From the sax npm package description we can see:
You can use it to build an object model out of XML, but it doesn't do that out of the box.
Perhaps you might want to rethink your choice and take a look at one of the available alternatives, unless you really need streams because the XML file is huge and doesn't fit into machine memory.
As an example, here is how we can construct an object representation of an XML file using fast-xml-parser:
const parser = require('fast-xml-parser');
const data = `<?xml version="1.0" encoding="UTF-8"?>
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend! <pb n="1"/> And have a plenty of sleep!</body>
</note>`;
const xmlObj = parser.parse(data, {
  ignoreAttributes: false,
  allowBooleanAttributes: true,
  parseNodeValue: true,
  parseAttributeValue: true
});
console.log('XML object: ', JSON.stringify(xmlObj));
The output will be:
XML object: {"note":{"to":"Tove","from":"Jani","heading":"Reminder","body":{"#text":"Don't forget me this weekend2!And have a plenty of sleep!","pb":{"#_n":1}}}}
I've prepared a working demo on Repl.it.
If the file is big but still fits into memory, you might want to spin up a child process to offload the main thread.
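For example, here is a minimal sketch of that approach (the file names and the fast-xml-parser usage are illustrative assumptions, not a fixed recipe):

// parent.js - fork a worker so parsing doesn't block the main thread
const { fork } = require("child_process");

const worker = fork("./parse-worker.js");
worker.send({ file: "./big-file.xml" }); // hypothetical input file
worker.on("message", (xmlObj) => {
  console.log("Parsed in child:", xmlObj);
});

// parse-worker.js - does the heavy parsing and posts the result back
const fs = require("fs");
const parser = require("fast-xml-parser");

process.on("message", ({ file }) => {
  const xmlObj = parser.parse(fs.readFileSync(file, "utf8"), { ignoreAttributes: false });
  process.send(xmlObj); // the result object is serialized over IPC
  process.exit(0);
});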
Nice to meet you!
I'm currently making an After Effects script that writes layer information to Excel, but no matter how much I research, I can't find a way to do it. If someone knows how to do it, could you tell me?
Actually, I'm Japanese and I don't understand English very well, so I used Google Translate to write this; I hope it comes across well.
Layer information can be obtained from the API using the Layer object which can be accessed directly like so: app.project.item(index).layer(index) or by looping through a CompItem's layers like so:
var theComp = app.project.activeItem;
for (var i = 1; i <= theComp.numLayers; i++) {
    // layers in a comp are indexed from 1, rather than 0
    var theLayer = theComp.layer(i);
    // do something with theLayer
}
You can write this to a CSV, XML, or JSON file using the File.write() or File.writeln() methods of the File object. These can easily be imported into Excel.
Because the version of JavaScript that ExtendScript uses dates back to 1995, it doesn't have native JSON.stringify() or XML.write() methods, so to create JSON or XML you will need JavaScript implementations like this one for XML and this one for JSON. If you search for core JS polyfills for these functions there are dozens around.
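For example, here is a minimal sketch of the CSV route (it assumes the active item is a comp; the file path and the chosen layer properties are just examples):

// Dump basic layer info from the active comp to a CSV file on the desktop
var theComp = app.project.activeItem; // assumes a comp is active
var csvFile = new File("~/Desktop/layers.csv");
csvFile.open("w");
csvFile.writeln("index,name,inPoint,outPoint");
for (var i = 1; i <= theComp.numLayers; i++) {
    var theLayer = theComp.layer(i);
    csvFile.writeln(i + "," + theLayer.name + "," + theLayer.inPoint + "," + theLayer.outPoint);
}
csvFile.close();

Excel can open the resulting CSV directly.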
I've downloaded some web pages with requests and saved the content in a postgres database [in a text field] using Django's ORM. For some pseudocode of what's going on, here ya go:
art = Article()
page = requests.get("http://example.com")
art.raw_html = page.content
art.save()
I verified that page.content is a bytes object, and I guess I assumed that this object would automatically be decoded upon saving, but it doesn't seem to be... it has been converted to some weird string representation of a bytes object, ostensibly by Django. It looks like this in the interpreter when I call art.raw_html:
'b\'<!DOCTYPE html>\\n<html lang="en" class="pb-page"
And if I call it with print I get this:
b'<!DOCTYPE html>\n<html lang="en" class="pb-page"
And for the life of me I can't re-encode it to a bytes object, even if I trim off the leading b' and trailing '.
I feel like there's an easy solution to this and I feel like an idiot... but after lots of experiments and googling, I'm not figuring it out.
Incidentally, if I manually copy what's returned from the print statement (like with my cursor), I can convert the clipboard contents back to a bytes object just fine and then decode it into some readably-formatted html.
Clearly there is a better way. (And yes, going forward I'll stop saving the content like this in the first place.)
You can use eval or ast.literal_eval (the latter is safer on untrusted input) as below.
data = "b'gAAAAABc1arg48DmsOwQEbeiuh-FQoNSRnCOk9OvXXOE2cbBe2A46gmP6SPyymDft1yp5HsoHEzXe0KljbsdwTgPG5jCyhMmaA=='"
eval(data)
b'gAAAAABc1arg48DmsOwQEbeiuh-FQoNSRnCOk9OvXXOE2cbBe2A46gmP6SPyymDft1yp5HsoHEzXe0KljbsdwTgPG5jCyhMmaA=='
Using ast.literal_eval
import ast
ast.literal_eval(data)
Thanks to @juanpa.arrivillaga. I've just added it to the answer.
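Applied to the question's case, that would look something like this (assuming the original page was UTF-8; adjust the codec if not):

import ast

raw = art.raw_html                            # e.g. "b'<!DOCTYPE html>\\n<html ...'"
html = ast.literal_eval(raw).decode("utf-8")  # evaluate the repr back to bytes, then decode to str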
I have a situation in which an XML document has information in varying depth (according to S1000D schemas), and I'm looking for a generic method to extract correct sentences.
I need to interpret a simple element containing text as one individual part/sentence, and when an element containing text contains other elements that in turn contain text, I need to flatten/concatenate them into one string/sentence. The nested elements shall not be visited again once this is done.
Using Python's lxml library and applying the tostring function works OK if the source XML is pretty-printed, because then I can split the concatenated string on newlines to get each sentence. If the source isn't pretty-printed, and is all on one single line, there won't be any newlines to split on.
I have tried the iter function and applying xpaths to each node, but this often gives different results in Python than what I get when applying the same xpath in XMLSpy.
I have started down some of the following paths, and my question is if you have some input on which ones to continue on, or if you have other solutions.
I think I could use XSLT to preprocess the XML file, and then use a simpler Python script to divide the content into a list of sentences for further processing. Using Saxon with Python is now doable, but here I run into problems if the XML source contains entities that I cannot redirect Saxon to resolve (such as &nbsp;). I have no problem parsing files with lxml, so I tend to lean towards a cleaner Python solution.
lxml doesn't seem to have XPath support that can give me all nodes with text that contain one or more children containing text, plus all nodes that are simple elements with no parents containing text nodes. Is there a way to preprocess the parsed tree so that I can ensure it is pretty-printed in memory, so that tostring works the same way for every XML file? Otherwise, my logic gives me one string for a document with no whitespace, and multiple sentences/strings if the source had been pretty-printed. This doesn't feel OK.
What are my options? Use XSLT 1.0 in Python, other parsers to get a better handle on where I am in the tree, ...
Just to reiterate the issue here: I am looking for a generic way to extract text, and the only rules for the XML source are that a sentence may be built from an element with child elements with text, but there won't be additional levels. The other possibility is the simple element, but this one cannot be included in a parent element with text, since that case is covered by the first rule.
Help/thoughts are appreciated.
This is downright ugly code, a hasty hack with no real thought given to form, beauty, or finesse. All I am after is one way of doing this in Python. I'll tidy things up when I find a good solution that I want to keep. This is one possible solution, so I figured I'd post it to see if someone can be kind enough to show me how to do this instead.
The problem has been to find xpath expressions that get me all elements with text content, and then to act upon them depending on their context. All my xpath expressions have given me the correct nodes, but also a root or ancestor that pulled in a more or less complete string at the beginning, so I gave up on those. My xpaths work as they should in XSLT, but not in Python - I don't know why...
I had to revert to regex to find nodes that contain strings that are not whitespace-only.
Using lxml with xpath and tostring gives different results depending on how the source XML is formatted, so I had to get around that.
The following formats have been tested:
<?xml version="1.0" encoding="UTF-8"?>
<root>
<subroot>
<a>Intro, element a: <b>Nested b to be included in a, <c>and yet another nested c-element</c> and back to b.</b></a>
<!-- Comment -->
<a>Simple element.</a>
<a>Text with<b> 1st nested b</b>, back in a, <b>and yet another b-element</b>, before ending in a.</a>
</subroot>
</root>
<?xml version="1.0" encoding="UTF-8"?>
<root>
<subroot>
<a>Intro, element a: <b>Nested b to be included in a, <c>and yet another nested c-element,
</c> and back to b.</b>
</a>
<!-- Comment -->
<a>Simple element.</a>
<a>Text with<b> 1st nested b</b>, back in a, <b>and yet another b-element</b>, before ending in a.</a>
</subroot>
</root>
<?xml version="1.0" encoding="UTF-8"?><root><subroot><a>Intro, element a: <b>Nested b to be included in a, <c>and yet another nested c-element</c> and back to b.</b></a><!-- Comment --><a>Simple element.</a><a>Text with<b> 1st nested b</b>, back in a, <b>and yet another b-element</b>, before ending in a.</a></subroot></root>
Python code:
import re
from lxml import etree as ET

dmParser = ET.XMLParser(resolve_entities=False, recover=True)
xml_doc = r'C:/Temp/xml-testdoc.xml'
parsed = ET.parse(xml_doc, dmParser)  # parse with the entity-tolerant parser

NS = {"re": "http://exslt.org/regular-expressions"}

for elem in parsed.xpath(r"//*[re:match(text(), '\S')]", namespaces=NS):
    tmp = elem.xpath(r"parent::*[re:match(text(), '\S')]", namespaces=NS)
    if tmp and tmp[0].text and tmp[0].text.strip():  # first two checks can yield None; last discards whitespace-only text
        continue  # the parent carries text, so this node is covered there
    elif elem.xpath(r"./*[re:match(text(), '\S')]", namespaces=NS):  # a child node also contains text
        line = re.sub(r'\s+', ' ', ET.tostring(elem, encoding='unicode', method='text').strip())  # collapse all unwanted whitespace
        if line:
            print(line)
    else:  # simple element
        print(elem.text.strip())
Always yields:
Intro, element a: Nested b to be included in a, and yet another nested c-element, and back to b.
Simple element.
Text with 1st nested b, back in a, and yet another b-element, before ending in a.
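For comparison, here is a simpler sketch that skips the EXSLT regex xpaths and leans on lxml's itertext() instead; it assumes the nesting rules stated in the question (at most one extra level of text-bearing children) and should print the same three sentences for all three source formats above:

import re
from lxml import etree

tree = etree.parse(r'C:/Temp/xml-testdoc.xml')

def flatten(elem):
    # Concatenate all text in the subtree and collapse whitespace runs
    return re.sub(r'\s+', ' ', ''.join(elem.itertext())).strip()

for elem in tree.iter(etree.Element):  # passing etree.Element skips comments and PIs
    if elem.text is None or not elem.text.strip():
        continue  # no text of its own, so not a sentence start
    parent = elem.getparent()
    if parent is not None and parent.text and parent.text.strip():
        continue  # already covered when the parent is flattened
    print(flatten(elem))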
I will be receiving the following XML data in a variable.
<order>
<name>xyz</name>
<city>abc</city>
<string>aGVsbG8gd29ybGQgMQ==</string>
<string>aGVsbG8gd29ybGQgMg==</string>
<string>aGVsbG8gd29ybGQgMw==</string>
</order>
Output:
<order>
<name>xyz</name>
<city>abc</city>
<string>hello world 1</string>
<string>hello world 2</string>
<string>hello world 3</string>
</order>
I know how I can decode from base64 but the problem is some of the values are decoded already and some are encoded. What is the best approach to decode this data using groovy so that I get the output as shown?
Always: the <string> tag's value will be encoded; all the other tags' values will already be decoded.
Since there's no uncertainty about which nodes come encoded and which don't, and hence no need to detect base64 encoding, the way to do it is pretty simple:
Parse it. There are two preferable ways to do that in Groovy: XmlSlurper and XmlParser. They differ in computation and memory-consumption modes, but both provide an object/structure representation in the end.
Work with that object structure: traverse all required elements, decode the content/attributes you need to decode.
Either proceed further with the data and/or serialize it back to XML text.
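Putting those steps together, here is a minimal sketch with XmlParser (the node names match the sample in the question; decodeBase64() is Groovy's built-in String extension):

import groovy.xml.XmlUtil

def xml = '''<order>
  <name>xyz</name>
  <city>abc</city>
  <string>aGVsbG8gd29ybGQgMQ==</string>
  <string>aGVsbG8gd29ybGQgMg==</string>
  <string>aGVsbG8gd29ybGQgMw==</string>
</order>'''

def order = new XmlParser().parseText(xml)
// Only the <string> nodes are known to be encoded, so decode just those
order.string.each { node ->
    node.value = new String(node.text().decodeBase64(), 'UTF-8')
}
println XmlUtil.serialize(order)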
Articles to look at:
Load, modify, and write an XML document in Groovy
https://www.baeldung.com/groovy-xml
https://groovy-lang.org/processing-xml.html
and many, many more.
Another cheat sheet always useful for Groovy noobs: http://groovy-lang.org/groovy-dev-kit.html
Check out how to traverse the structures there, for instance.
I want to copy some files using Node.js. Basically, this is quite easy, but I have two special requirements I need to fulfill:
I need to parse the file's content and replace some placeholders by actual values.
The file name may include a placeholder as well, and I need to replace this as well with an actual value.
So, while this is basically not a complex task, I guess there are various ways you could solve it. E.g., it would be nice if I could use a template engine to do the replacements, but on the other hand I'd then need to have the complete file as a string. I'd prefer a stream-based approach, but then - how should I do the replacing?
You see, lots of questions, and I am not able to decide which way to go.
Any hints, ideas, best practices, ...?
Or - is there already a module that does this task?
You can write your own solution without reading the entire file. fs.readFile() should only be used when you are 100% sure that the files are no longer than a buffer chunk (typically 8KB or 16KB).
The simplest solution is to create a readable stream, attach a data event listener, and iterate over the buffer, reading character by character. If you have a placeholder like ${label}: when you find ${, set a flag to true and begin storing the label name; when you find } while the flag is true, you've finished, so set the flag to false and reset the temporary label string to "".
You don't need any template engine or extra module.
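As a sketch of that idea, here is a small Transform stream (the ${label} syntax and the values map come from the description above, not from any library API); because the scanner state lives outside the transform callback, placeholders that straddle chunk boundaries are still handled:

const fs = require("fs");
const { Transform } = require("stream");

function placeholderReplacer(values) {
  let state = "text"; // "text" | "dollar" | "label"
  let label = "";
  return new Transform({
    decodeStrings: false,
    transform(chunk, encoding, callback) {
      let out = "";
      for (const ch of chunk.toString()) {
        if (state === "text") {
          if (ch === "$") state = "dollar";
          else out += ch;
        } else if (state === "dollar") {
          if (ch === "{") { state = "label"; label = ""; }
          else { out += "$" + ch; state = "text"; }
        } else if (ch === "}") { // end of ${label}
          out += values[label] !== undefined ? values[label] : "";
          state = "text";
        } else {
          label += ch;
        }
      }
      callback(null, out);
    },
    flush(callback) {
      // emit any dangling "$" or unterminated label verbatim
      if (state === "dollar") this.push("$");
      else if (state === "label") this.push("${" + label);
      callback();
    }
  });
}

// Usage: stream a template file into its rendered copy
fs.createReadStream("template.txt", "utf8")
  .pipe(placeholderReplacer({ name: "world" }))
  .pipe(fs.createWriteStream("output.txt"));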
If the whole file can be safely loaded into memory (isn't crazy big), then the fs-jetpack library might be a very good tool for this use case.
const jetpack = require("fs-jetpack");
const src = jetpack.cwd("path/to/source/folder");
const dst = jetpack.cwd("path/to/destination");
src.find({ matching: "*" }).forEach((path) => {
  const content = src.read(path);
  const transformedContent = transformTheFileHoweverYouWant(content);
  const transformedPath = transformThePath(path);
  dst.write(transformedPath, transformedContent);
});
The example code is synchronous, but you can easily write an async equivalent.
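For instance, an async variant might look like this (fs-jetpack exposes promise-based counterparts such as findAsync, readAsync, and writeAsync; the transform functions are the same placeholders as above):

const jetpack = require("fs-jetpack");

const src = jetpack.cwd("path/to/source/folder");
const dst = jetpack.cwd("path/to/destination");

async function copyAndTransform() {
  const paths = await src.findAsync({ matching: "*" });
  for (const path of paths) {
    const content = await src.readAsync(path);
    await dst.writeAsync(transformThePath(path), transformTheFileHoweverYouWant(content));
  }
}

copyAndTransform();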