Node.js: how to optimize writing very large XML files?

I have a huge CSV (1.5 GB) which I need to process line by line to construct two XML files. When I run the processing alone, my program takes about 4 minutes to execute; if I also generate my XML files, it takes over 2.5 hours to produce two 9 GB XML files.
My code for writing the XML files is really simple: I use fs.appendFileSync to write my opening/closing XML tags and the text inside them. To sanitize the data I run this function on the text inside the XML tags.
// called as a method on the string being escaped (note the use of `this`)
function () {
    return this.replace(/&/g, "&amp;")
        .replace(/</g, "&lt;")
        .replace(/>/g, "&gt;")
        .replace(/"/g, "&quot;")
        .replace(/'/g, "&apos;");
};
Is there something I could optimize to reduce the execution time?

fs.appendFileSync() is a relatively expensive operation: it opens the file, appends the data, then closes it again.
It'll be faster to use a writable stream:
const fs = require('node:fs');
// create the stream
const stream = fs.createWriteStream('output.xml');
// then for each chunk of XML
stream.write(yourXML);
// when done, end the stream to close the file
stream.end();
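One detail worth handling when the writes come faster than the disk can absorb them: stream.write() returns false once the stream's internal buffer is full, and you can wait for the 'drain' event before writing more. A minimal sketch of that pattern (the writeChunk helper name is just illustrative):
const fs = require('node:fs');
const stream = fs.createWriteStream('output.xml');

// Resolves once the chunk has been accepted, waiting for 'drain'
// if the stream's internal buffer is currently full.
function writeChunk(chunk) {
  return new Promise((resolve) => {
    if (stream.write(chunk)) {
      resolve();
    } else {
      stream.once('drain', resolve);
    }
  });
}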

I drastically reduced the execution time (to 30 minutes) by doing two things:
Setting the environment variable UV_THREADPOOL_SIZE=64
Buffering my writes to the XML file (I flush the buffer to the file after 20,000 closed tags)
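Roughly what that buffering could look like on top of a write stream (the 20,000-tag threshold is the one from the answer above; the helper names are illustrative):
const fs = require('node:fs');
const stream = fs.createWriteStream('output.xml');

const FLUSH_EVERY = 20000; // closed tags per flush, as described above
let buffer = [];
let closedTags = 0;

// Collect XML fragments in memory and write them out in large batches.
function append(xmlFragment, isClosingTag) {
  buffer.push(xmlFragment);
  if (isClosingTag && ++closedTags % FLUSH_EVERY === 0) {
    stream.write(buffer.join(''));
    buffer = [];
  }
}

// Flush whatever is left and close the file.
function finish() {
  stream.write(buffer.join(''));
  stream.end();
}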

Related

Node.js "readline" + "fs. createReadStream" : Specify start & end line number

The Node.js readline documentation (https://nodejs.org/api/readline.html) provides this solution for reading large files like CSVs line by line:
const { once } = require('events');
const { createReadStream } = require('fs');
const { createInterface } = require('readline');

(async function processLineByLine() {
  try {
    const rl = createInterface({
      input: createReadStream('big-file.txt'),
      crlfDelay: Infinity
    });

    rl.on('line', (line) => {
      // Process the line.
    });

    await once(rl, 'close');
    console.log('File processed.');
  } catch (err) {
    console.error(err);
  }
})();
But I don't want to read the entire file from beginning to end, only parts of it, say from line number 1 to 10000, 20000 to 30000, etc.
Basically I want to be able to set a 'start' & 'end' line for a given run of my function.
Is this doable with readline & fs.createReadStream?
If not, please suggest an alternate approach.
PS: It's a large file (around 1 GB) & loading it in memory causes memory issues.
But I don't want to read the entire file from beginning to end but parts of it say from line number 1 to 10000, 20000 to 30000, etc.
Unless your lines are of fixed, identical length, there is NO way to know where line 10,000 starts without reading from the beginning of the file and counting lines until you get to line 10,000. That's how text files with variable length lines work. Lines in the file are not physical structures that the file system knows anything about. To the file system, the file is just a gigantic blob of data. The concept of lines is something we invent at a higher level and thus the file system or OS knows nothing about lines. The only way to know where lines are is to read the data and "parse" it into lines by searching for line delimiters. So, line 10,000 is only found by searching for the 10,000th line delimiter starting from the beginning of the file and counting.
There is no way around it, unless you preprocess the file into a more efficient format (like a database) or create an index of line positions.
Basically I want to be able to set a 'start' & 'end' line for a given run of my function.
The only way to do that is to "index" the data ahead of time so you already know where each line starts/ends. Some text editors made to handle very large files do this. They read through the file (perhaps lazily) reading every line and build an in-memory index of what file offset each line starts at. Then, they can retrieve specific blocks of lines by consulting the index and reading that set of data from the file.
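A rough sketch of that indexing idea (buildLineIndex and readLineRange are made-up helper names, not an existing API; it assumes plain '\n' line endings and that the offsets array fits in memory):
const fs = require('fs');
const readline = require('readline');
const { once } = require('events');

// Scan the file once and record the byte offset where each line starts.
async function buildLineIndex(path) {
  const offsets = [0];
  let position = 0;
  const rl = readline.createInterface({
    input: fs.createReadStream(path),
    crlfDelay: Infinity
  });
  rl.on('line', (line) => {
    position += Buffer.byteLength(line) + 1; // +1 for '\n' (use +2 for '\r\n')
    offsets.push(position);
  });
  await once(rl, 'close');
  return offsets;
}

// Stream lines [startLine, endLine) by seeking straight to the recorded offsets.
function readLineRange(path, offsets, startLine, endLine) {
  return fs.createReadStream(path, {
    start: offsets[startLine],
    end: offsets[endLine] - 1 // 'end' is inclusive
  });
}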
Is this doable with readline & fs.createReadStream?
Without fixed length lines, there's no way to know where in the file line 10,000 starts without counting from the beginning.
It's a large file (around 1 GB) & loading it in memory causes memory issues.
Streaming the file a line at a time with the line-reader module, or others that do something similar, will handle the memory issue just fine: only a block of data from the file is in memory at any given time. You can handle arbitrarily large files this way, even on a small-memory system.
A newline is just a character (or two characters if you're on Windows); you have no way of knowing where those characters are without processing the file.
You are however able to read only a certain byte range in a file. If you know for a fact that every line contains 64 bytes, you can skip the first 100 lines by starting your read at byte 6400, and you can read only 100 lines by stopping your read at byte 12800.
Details on how to specify start and end points are available in the createReadStream docs.
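For the fixed-length case described above, that byte range maps directly onto createReadStream options (note that end is inclusive):
const fs = require('fs');

// Lines 101-200 of a file whose lines are exactly 64 bytes each:
// skip the first 100 lines (100 * 64 = 6400 bytes) and stop after
// another 100 lines (end is inclusive, hence 12800 - 1).
const stream = fs.createReadStream('big-file.txt', { start: 6400, end: 12799 });
stream.on('data', (chunk) => {
  // chunk only contains bytes 6400..12799
});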

Better way to read and write to a file. Async in Python?

I have a 6000-line data file which I'm going to load into a buffer, parse, and write to another JSON file. What is the better way to accomplish this task? Should I load the whole file into a buffer, then parse it, and then write it to the file? Or should I load a chunk of the file into a buffer, process it, and write it out, keeping the tasks running simultaneously? Is this close to async functions in JavaScript? Are there examples in Python for simple file loading and writing to a file?
You can use aiofiles:
import aiofiles

async def read_lines():
    async with aiofiles.open('filename', mode='r') as f:
        async for line in f:
            print(line)
They have good usage documentation in their GitHub repo.

Intermittent issues when reading CSV Data Set Config in JMeter

I have an issue when creating a performance script, related to reading data from a CSV Data Set Config.
My script has the structure below:
Setup thread
Create CSV thread: after viewing the dashboard, use a JSON Extractor to get a list of data and put it into a CSV file.
Create CSV file: after this thread, I will have a number of CSV files based on the number of centers, for example 4 files with different names.
String[] attempt = (vars.get("ListAttemptId_ALL")).split(",");
int length = attempt.length;
String dir = props.get("UserFilePath").toString();
String center = vars.get("Center");
File csvFile = new File(dir, center + ".csv");
// Write one attempt id per line, but only if the file doesn't exist yet
if (!csvFile.exists()) {
    FileWriter fstream = new FileWriter(csvFile);
    BufferedWriter out = new BufferedWriter(fstream);
    for (int i = 1; i <= length; i++) {
        out.write(attempt[i - 1]);
        out.write(System.getProperty("line.separator"));
    }
    out.close();
    fstream.close();
}
The next thread gets the name of the file and uses a CSV Data Set Config to loop over all its lines:
String center = vars.get("Center");
String fileName = center + ".csv";
props.put("path_${__threadNum}", String.valueOf(fileName));
Because I have a lot of threads that will run against the same file, I just check __threadNum to find the name of the file each thread needs to use.
I'm using a Loop Controller to go over the CSV file; reaching the end of the file will stop the thread. Here is what's inside this loop:
CSV Data Set Config:
Filename: ${__property(UserFilePath)}\\${__P(path_${__threadNum})}
where ${__property(UserFilePath)} is the path of the folder and ${__P(path_${__threadNum})} is the name of the CSV file that was extracted.
My issue is that this setup is not stable: sometimes the threads can read the file normally, and sometimes they show an error that the file does not exist (it actually does), so it's hard to chase down where the issue comes from. Can anyone suggest a solution, or a better approach than mine for reading a CSV file in a thread group?
I have an answer for this issue: I add all the data (AttemptId, Center) to one CSV file and read it from beginning to end, using an If Controller to verify the data before acting on it.
This statement can be problematic:
props.put("path_${__threadNum}", String.valueOf(fileName));
as per the JSR223 Sampler documentation:
JMeter processes function and variable references before passing the script field to the interpreter, so the references will only be resolved once. Variable and function references in script files will be passed verbatim to the interpreter, which is likely to cause a syntax error. In order to use runtime variables, please use the appropriate props
methods, e.g.
props.get("START.HMS");
props.put("PROP1","1234");
So I would recommend replacing ${__threadNum} with ctx.getThreadNum(), where ctx is shorthand for the JMeterContext class.
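For example, the property could be set in the JSR223 element as something like props.put("path_" + ctx.getThreadNum(), fileName); (an illustrative rewrite of the statement above, not taken from the original script).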
According to Execution Order chapter of JMeter Documentation:
0. Configuration elements
1. Pre-Processors
2. Timers
3. Sampler
4. Post-Processors (unless SampleResult is null)
5. Assertions (unless SampleResult is null)
6. Listeners (unless SampleResult is null)
your CSV Data Set Config is executed in the first place, before any other scripting test elements. So the times when it "works" are, IMO, a "false positive" caused by the fact that JMeter properties are global and "live" for as long as JMeter (and the underlying JVM) is running. The next time you launch JMeter, the properties will be null and your CSV Data Set Config will fail. So my expectation is that you should consider using the __CSVRead() function instead, which is evaluated at runtime exactly in the place where it's called. Check out the Apache JMeter Functions - An Introduction article to learn more about the JMeter Functions concept.
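For reference, __CSVRead reads the file row by row: something like ${__CSVRead(attempts.csv,0)} returns the first column of the current row, and ${__CSVRead(attempts.csv,next)} advances to the next row (the attempts.csv name here is just a placeholder, not a file from the original test plan).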

NodeJS - Stream a large ASCII file from S3 with Hex Characters (NUL)

I am trying to read (via streaming) a large file in a Lambda function. My goal is to just read the first few lines and look for some information. The input file in S3 seems to have hex characters (NUL), and the following code stops reading the line when it hits the NUL character and goes to the next line. I would like to know how I can read the whole line and replace/remove the NUL character before I look for the information in the line. Here is the code that does not work as expected:
var readline = require('line-reader');
var readStream = s3.getObject({Bucket: S3Bucket, Key: fileName}).createReadStream();
readline.eachLine(readStream, {separator: '\n', encoding: 'utf8'}, function(line) {
  console.log('Line ', line);
});
As mentioned by Brad, it's hard to help as this is more an issue with your line-reader lib.
I can offer an alternate solution, however, that doesn't require the lib.
I would use GetObject as you are, but I would also specify a value for the Range parameter, then work my way through the file in chunks and stop reading chunks when I am satisfied.
If the chunk I read doesn't have a \n, read another chunk; keep going until I get a \n, then read from the start of my buffered data to the \n, set the new starting position based on the position of that \n, and read a new chunk from that position if more data is needed.
Check out the Range parameter in the API docs:
http://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/S3.html#getObject-property
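A rough sketch of that approach, assuming the AWS SDK v2 style already used in the question (readFirstLine, the 64 KB chunk size, and the NUL stripping are illustrative choices, not a fixed recipe):
const AWS = require('aws-sdk');
const s3 = new AWS.S3();

// Fetch ranged chunks until a newline shows up, stripping NUL characters as we go.
async function readFirstLine(bucket, key, chunkSize = 64 * 1024) {
  let buffered = '';
  let start = 0;
  while (true) {
    const range = `bytes=${start}-${start + chunkSize - 1}`;
    const { Body } = await s3.getObject({ Bucket: bucket, Key: key, Range: range }).promise();
    buffered += Body.toString('utf8').replace(/\u0000/g, ''); // drop NUL characters
    const newline = buffered.indexOf('\n');
    if (newline !== -1) {
      return buffered.slice(0, newline);
    }
    start += chunkSize; // no newline yet, fetch the next chunk
  }
}
End-of-file handling is left out for brevity; a range that starts past the end of the object will make getObject throw.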

Node.js reading piped streams

I'm trying to read the contents of some .gz log files using streams in node.
I've started simply with fs.createReadStream('log.gz').pipe(zlib.createUnzip()).
This works, and I can pipe to process.stdout to verify it. I'd like to pipe this into a new writable stream so that I can have a data event to actually work with the contents. I guess I just don't fully understand how the streams work. I tried just creating a new writable stream, var writer = fs.createWriteStream(), but this doesn't work because it requires a path.
Any ideas how I can go about doing this (without creating any other files to write to)?
var fs = require('fs');
var zlib = require('zlib');

var unzipStream = zlib.createUnzip();
unzipStream.on('data', myDataHandler).on('end', myEndHandler);
fs.createReadStream('log.gz').pipe(unzipStream);
That will get you the data and end events. Since it's log data, you may also find the split module useful to get events line by line.
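Another option that avoids an extra dependency, if line-by-line events are the goal: feed the unzipped stream straight into Node's readline (a sketch along the same lines as the snippet above):
var fs = require('fs');
var zlib = require('zlib');
var readline = require('readline');

// readline accepts any readable stream as input, including the unzipped pipe.
var rl = readline.createInterface({
  input: fs.createReadStream('log.gz').pipe(zlib.createUnzip()),
  crlfDelay: Infinity
});

rl.on('line', function (line) {
  console.log('log line:', line);
});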
