How to read a big CSV file batch by batch in Node.js?

I have a CSV file which contains more than 500k records. The fields of the CSV are:
name
age
branch
I need to process all the records from the file without loading the whole thing into memory: read a few records, insert them into a collection, manipulate them, and then continue reading the remaining records. As I'm new to this, I can't work out how this would behave. If I try to print the batch, it prints buffered data; will the code below work for my requirement? With that buffered value, how can I get the CSV records and insert and manipulate the file data?
var stream = fs.createReadStream(csvFilePath)
  .pipe(csv())
  .on('data', (data) => {
    batch.push(data)
    counter++;
    if (counter == 100) {
      stream.pause()
      setTimeout(() => {
        console.log("batch in ", data)
        counter = 0;
        batch = []
        stream.resume()
      }, 5000)
    }
  })
  .on('error', (e) => {
    console.log("er ", e);
  })
  .on('end', () => {
    console.log("end");
  })

I have written some sample code showing how to work with streams.
You basically create a stream and process its chunks. A chunk is a Buffer object; to work with it as text, call toString().
I don't have a lot of time to explain more, but the comments should help out.
Also consider using a module, since CSV parsing has been done many times before.
Hope this helps:
import * as fs from 'fs'
// end of line delimiter, system specific.
import { EOL } from 'os'

// the delimiter used in the csv
var delimiter = ','

// add your own implementation of parsing a portion of the text here.
const parseChunk = (text, index) => {
  // first chunk, the header is included here.
  if (index === 0) {
    // The first row will be the header, so take it
    var headerLine = text.substring(0, text.indexOf(EOL))
    // remove the header from the text for further processing.
    // also replace the new line character.
    text = text.replace(headerLine + EOL, '')
    // Do something with the header here..
  }
  // Now you have a part of the file to process without headers.
  // The csv parse function you need to figure out yourself. Best
  // is to use some module for that; there are plenty of edge cases
  // when parsing csv.
  // custom csv parser here => https://stackoverflow.com/questions/1293147/example-javascript-code-to-parse-csv-data
  // if the csv is well formatted it could be enough to use this
  var lines = text.split(EOL)
  for (var line of lines) {
    var values = line.split(delimiter)
    console.log('line received', values)
    // StoreToDb(values)
  }
}

// create the stream
const stream = fs.createReadStream('file.csv')
// variable to count the chunks for knowing if the header is included..
var chunkCount = 0

// handle the data event of the stream
stream.on('data', chunk => {
  // the stream sends you a Buffer;
  // to have it as text, convert it to a string
  const text = chunk.toString()
  // Note that chunks have a fixed size and usually contain multiple lines,
  // but a line can be cut in half at a chunk boundary; a complete
  // implementation would buffer the trailing partial line and prepend it
  // to the next chunk.
  parseChunk(text, chunkCount)
  // increment the count.
  chunkCount++
})

stream.on('end', () => {
  console.log('parsing finished')
})

stream.on('error', (err) => {
  // error: handle properly here, maybe roll back changes already made to the db
  // and parse again. You can also use the chunkCount to start the parsing
  // again and omit the first x chunks, so you can restart at a given point
  console.log('parsing error ', err)
})
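Since the answer suggests reaching for a module: as a point of comparison, on Node.js 10+ a parsed CSV stream can be consumed with for await...of, which pauses the source between iterations and so handles backpressure for you. A minimal sketch, assuming the csv-parser package (the csv() the question already pipes into) and a hypothetical saveBatch() that inserts into your collection:

const fs = require('fs')
const csv = require('csv-parser')

async function processFile(csvFilePath, batchSize = 100) {
  const parser = fs.createReadStream(csvFilePath).pipe(csv())
  let batch = []
  // the loop pulls rows one at a time and pauses the parser between reads
  for await (const record of parser) {
    batch.push(record)
    if (batch.length === batchSize) {
      await saveBatch(batch) // hypothetical: insert the batch into your collection
      batch = []
    }
  }
  if (batch.length > 0) {
    await saveBatch(batch) // flush the final partial batch
  }
}

Because the loop awaits each batch insert before pulling more rows, only one batch plus the parser's internal buffer is in memory at any time.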

Related

How to sequentially read a csv file with node.js (using stream API)

I am trying to figure out how to create a stream pipe which reads entries in a CSV file on demand. To do so, I thought of the following approach using pipes (pseudocode):
const stream_pipe = input_file_stream.pipe(csv_parser)
// Then getting entries through:
let entry = stream_pipe.read()
Unfortunately, after lots of testing I figured out that the moment I set up the pipe, it is automatically consumed until the end of the CSV file. I tried to pause it on creation by appending .pause() at the end, but that seems to have no effect.
Here's my current code. I am using the csv_parse library (part of the bigger csv package):
// Read file stream
const file_stream = fs.createReadStream("filename.csv")
const parser = csvParser({
  columns: ['a', 'b'],
  on_record: (record) => {
    // A simple filter as I am interested only in numeric entries
    let a = parseInt(record.a)
    let b = parseInt(record.b)
    return (isNaN(a) || isNaN(b)) ? undefined : record
  }
})
const reader = file_stream.pipe(parser) // Adding .pause() seems to have no effect
console.log(reader.read()) // Prints `null`
// I found out I can use this strategy to read a few entries immediately, but I cannot break out of it and then resume as the stream will automatically be consumed
//for await (const record of reader) {
// console.log(record)
//}
I have been banging my head on this for a while and I could not find easy solutions in either the csv package or the official Node documentation.
Thanks in advance to anyone able to put me on the right track :)
One thing you can do while reading the stream is create a readline interface, passing it the input stream and a normal output stream, like this:
const inputStream = "reading the csv file",
  outputStream = new stream();
// now create a readline interface which will read
// line by line; you should use async/await
const res = await processRecord(readline.createInterface(inputStream, outputStream));

async function processRecord(line) {
  return new Promise((res, rej) => {
    if (line) {
      // do the processing
      res(line);
    }
    rej('Unable to process record');
  })
}
The processRecord function should get things line by line, and you can use promises to make it sequential.
Note: the above is pseudocode, just to give you an idea of how things work; I have been doing the same in my project to read a CSV file line by line and it works fine.
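For reference, a runnable sketch of the same readline idea, assuming Node.js 11.14+ (where the readline interface is async-iterable); processRecord here stands in for your own filtering and database logic:

const fs = require('fs')
const readline = require('readline')

async function readCsvSequentially(path) {
  const rl = readline.createInterface({
    input: fs.createReadStream(path),
    crlfDelay: Infinity // treat \r\n as a single line break
  })
  // each iteration yields one line; awaiting inside the loop
  // keeps the reading sequential and on demand
  for await (const line of rl) {
    const record = line.split(',') // naive split; use a csv module for quoted fields
    await processRecord(record)    // placeholder for your own filtering/DB logic
  }
}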

Replay a log file with NodeJS as if it were happening in real-time

I have a log file with about 14.000 aircraft position datapoints captured from a system called Flarm, it looks like this:
{"addr":"A","time":1531919658.578100,"dist":902.98,"alt":385,"vs":-8}
{"addr":"A","time":1531919658.987861,"dist":914.47,"alt":384,"vs":-7}
{"addr":"A","time":1531919660.217471,"dist":925.26,"alt":383,"vs":-7}
{"addr":"A","time":1531919660.623466,"dist":925.26,"alt":383,"vs":-7}
What I need to do is find a way to 'play' this file back in real-time (as if it were occurring right now, even though it's pre-recorded) and emit an event whenever a log entry 'occurs'. The file is not being added to; it's pre-recorded, and the playback would occur at a later stage.
The reason for doing this is that I don't have access to the receiving equipment when I'm developing.
The only way I can think to do it is to set a timeout for every log entry, but that doesn't seem like the right way to do it. Also, this process would have to scale to longer recordings (this one was only an hour long).
Are there other ways of doing this?
If you want to "play them back" with the actual time difference, a setTimeout is pretty much what you have to do.
const processEntry = (entry, index) => {
  index++;
  const nextEntry = getEntry(index);
  if (nextEntry == null) return;
  const timeDiff = nextEntry.time - entry.time;
  emitEntryEvent(entry);
  setTimeout(processEntry, timeDiff, nextEntry, index);
};
processEntry(getEntry(0), 0);
This emits the current entry and then sets a timeout based on the difference until the next entry.
getEntry could either fetch lines from a prefilled array or fetch lines individually based on the index. In the latter case, only two lines of data would be in memory at the same time.
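For illustration only (not part of the original answer), getEntry backed by a prefilled array could look like the sketch below; the file name is made up, and note that the Flarm timestamps are in seconds, so the difference should be multiplied by 1000 before it is passed to setTimeout, as the asker's own solution below does:

const fs = require('fs');

// load and parse the whole log up front (fine for ~14,000 entries)
const entries = fs.readFileSync('./data/flarm.log', 'utf8')
  .split('\n')
  .filter(line => line.trim() !== '')
  .map(line => JSON.parse(line));

const getEntry = (index) => index < entries.length ? entries[index] : null;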
Got it working in the end! setTimeout turned out to be the answer, and combined with the input of Lucas S. this is what I ended up with:
const EventEmitter = require('events');
const fs = require('fs');

const readable = fs.createReadStream("./data/2018-07-18_1509log.json", {
  encoding: 'utf8',
  fd: null
});

function read_next_line() {
  var chunk;
  var line = '';
  // While this is a thing we can do, assign chunk
  while ((chunk = readable.read(1)) !== null) {
    // If chunk is a newline character, return the line
    if (chunk === '\n') {
      return JSON.parse(line);
    } else {
      line += chunk;
    }
  }
  return false;
}

var lines = [];
var nextline;

const processEntry = () => {
  // If lines is empty, read a line
  if (lines.length === 0) lines.push(read_next_line());
  // Quit here if we've reached the last line
  if ((nextline = read_next_line()) == false) return true;
  // Else push the just read line into our array
  lines.push(nextline);
  // Get the time difference in milliseconds
  var delay = Number(lines[1].time - lines[0].time) * 1000;
  // Remove the first line
  lines.shift();
  module.exports.emit('data', lines[0]);
  // Repeat after the calculated delay
  setTimeout(processEntry, delay);
}

var ready_to_start = false;
// When the stream becomes readable, allow starting
readable.on('readable', function() {
  ready_to_start = true;
});

module.exports = new EventEmitter;
module.exports.start = function() {
  if (ready_to_start) processEntry();
  if (!ready_to_start) return false;
}
Assuming you want to visualize the flight logs, you can use fs.watch as below to watch the log file for changes:
fs.watch('somefile', function (event, filename) {
  console.log('event is: ' + event);
  if (filename) {
    console.log('filename provided: ' + filename);
  } else {
    console.log('filename not provided');
  }
});
The code excerpt is from here; for more information on fs.watch(), check here.
Then, for seamless updates on the frontend, you can set up a WebSocket to your server, where you watch the log file and send each newly added row over that socket to the frontend.
After you get the data on the frontend you can visualize it there. While I haven't done a flight visualization project before, I've used D3.js to visualize other things (sound, numerical data, metric analysis, etc.) a couple of times and it did the job every time.
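As a rough sketch of the push side, assuming the ws package and a broadcastLine() you would call from your fs.watch or replay logic (names are illustrative):

const WebSocket = require('ws');
const wss = new WebSocket.Server({ port: 8080 });

// call this whenever a new log row is read (e.g. from the watcher or replay loop)
function broadcastLine(row) {
  wss.clients.forEach((client) => {
    if (client.readyState === WebSocket.OPEN) {
      client.send(JSON.stringify(row));
    }
  });
}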

Using stream-combiner and Writable Streams (stream-adventure)

I'm working on nodeschool.io's stream-adventure. The challenge:
Write a module that returns a readable/writable stream using the
stream-combiner module. You can use this code to start with:
var combine = require('stream-combiner')

module.exports = function () {
  return combine(
    // read newline-separated json,
    // group books into genres,
    // then gzip the output
  )
}
Your stream will be written a newline-separated JSON list of science fiction
genres and books. All the books after a "type":"genre" row belong in that
genre until the next "type":"genre" comes along in the output.
{"type":"genre","name":"cyberpunk"}
{"type":"book","name":"Neuromancer"}
{"type":"book","name":"Snow Crash"}
{"type":"genre","name":"space opera"}
{"type":"book","name":"A Deepness in the Sky"}
{"type":"book","name":"Void"}
Your program should generate a newline-separated list of JSON lines of genres,
each with a "books" array containing all the books in that genre. The input
above would yield the output:
{"name":"cyberpunk","books":["Neuromancer","Snow Crash"]}
{"name":"space opera","books":["A Deepness in the Sky","Void"]}
Your stream should take this list of JSON lines and gzip it with
zlib.createGzip().
HINTS
The stream-combiner module creates a pipeline from a list of streams,
returning a single stream that exposes the first stream as the writable side and
the last stream as the readable side like the duplexer module, but with an
arbitrary number of streams in between. Unlike the duplexer module, each
stream is piped to the next. For example:
var combine = require('stream-combiner');
var stream = combine(a, b, c, d);
will internally do a.pipe(b).pipe(c).pipe(d) but the stream returned by
combine() has its writable side hooked into a and its readable side hooked
into d.
As in the previous LINES adventure, the split module is very handy here. You
can put a split stream directly into the stream-combiner pipeline.
Note that split can send empty lines too.
If you end up using split and stream-combiner, make sure to install them
into the directory where your solution file resides by doing:
`npm install stream-combiner split`
Note: when you test the program, the source stream is automatically inserted into the program, so it's perfectly fine to have split() as the first parameter in combine(split(), etc., etc.)
I'm trying to solve this challenge without using the 'through' package.
My code:
var combiner = require('stream-combiner');
var stream = require('stream')
var split = require('split');
var zlib = require('zlib');

module.exports = function() {
  var ws = new stream.Writable({decodeStrings: false});

  function ResultObj() {
    name: '';
    books: [];
  }

  ws._write = function(chunk, enc, next) {
    if(chunk.length === 0) {
      next();
    }
    chunk = JSON.parse(chunk);
    if(chunk.type === 'genre') {
      if(currentResult) {
        this.push(JSON.stringify(currentResult) + '\n');
      }
      var currentResult = new ResultObj();
      currentResult.name = chunk.name;
    } else {
      currentResult.books.push(chunk.name);
    }
    next();
    var wsObj = this;
    ws.end = function(d) {
      wsObj.push(JSON.stringify(currentResult) + '\n');
    }
  }
  return combiner(split(), ws, zlib.createGzip());
}
My code does not work and returns 'Cannot pipe. Not readable'. Can someone point out to me where I'm going wrong?
Any other comments on how to improve are welcome too...
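For what it's worth, the 'Cannot pipe. Not readable' error comes from placing a plain Writable in the middle of the pipeline: as the hints say, each stream is piped to the next, so the middle stages have to be readable as well as writable. A rough sketch of the same grouping logic as a stream.Transform (not the asker's code, just an illustration of the idea):

var stream = require('stream');

var grouper = new stream.Transform({
  transform: function (chunk, enc, next) {
    var line = chunk.toString().trim();
    if (line.length === 0) return next(); // split() can emit empty lines
    var row = JSON.parse(line);
    if (row.type === 'genre') {
      // flush the previous genre before starting a new one
      if (this.current) this.push(JSON.stringify(this.current) + '\n');
      this.current = { name: row.name, books: [] };
    } else {
      this.current.books.push(row.name);
    }
    next();
  },
  flush: function (done) {
    // emit the last genre when the input ends
    if (this.current) this.push(JSON.stringify(this.current) + '\n');
    done();
  }
});

Such a transform could then be combined as combiner(split(), grouper, zlib.createGzip()).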

Add a mongo request into a file and archive this file

I'm having some trouble while trying to use streams with a MongoDB request. I want to:
Get the results from a collection
Put these results into a file
Put this file into a CSV
I'm using the archiver package for the file compression. The file contains CSV-formatted values, so I have to format each row as CSV.
My function takes a res (output) parameter, which means that I can send the result to a client directly. For the moment, I can put these results into a file without streams, but I think I'll run into memory trouble with a large amount of data; that's why I want to use streams.
Here is my code (with no streams):
function getCSV(res, query) {
  <dbRequest>.toArray(function(err, docs) {
    var csv = '';
    if(docs !== null) {
      for(var i = 0; i < docs.length; i++) {
        var line = '';
        for(var index in docs[i]) {
          if(docs[i].hasOwnProperty(index) && (index !== '_id')) {
            if(line !== '') line += ',';
            line += docs[i][index];
          }
        }
        console.log("line", line);
        csv += line += '\r\n';
      }
    }
  }.bind(this));
  fileManager.addToFile(csv);
  archiver.initialize();
  archiver.addToArchive(fileManager.getName());
  fileManager.deleteFile();
  archiver.sendToClient(res);
};
Once the CSV is completed, I add it to a file with a FileManager object, which handles file creation and manipulation. The addToArchive method adds the file to the current archive, and the sendToClient method sends the archive through the output (the res parameter of the function).
I'm using Express.js, so I call this method from a server request.
Sometimes the file contains data and sometimes it is empty; could you explain why?
I'd like to understand how streams work and how I could implement this in my code.
Regards
I'm not quite sure why you're having issues with the data only sometimes showing up, but here is a way to send it with a stream. A couple of points of info before the code:
.stream({transform: someFunction})
takes a stream of documents from the database and runs whatever data manipulation you want on each document as it passes through the stream. I put this function into a closure to make it easier to keep the column headers, as well as allow you to pick and choose which keys from the document to use as columns. This will allow you to use it on different collections.
Here is the function that runs on each document as it passes through:
// this is a closure containing knowledge of the keys you want to use,
// as well as whether or not to add the headers before the current line
function createTransformFunction(keys) {
  var hasHeaders = false;
  // this is the function that is run on each document
  // as it passes through the stream
  return function(document) {
    var values = [];
    var line;
    keys.forEach(function(key) {
      // explicitly check for undefined;
      // with a truthiness check (!document[key]) the number 0 would get replaced
      if (typeof document[key] !== "undefined") {
        values.push(document[key]);
      }
      else {
        values.push("");
      }
    });
    // add the column headers only on the first document
    if (!hasHeaders) {
      line = keys.join(",") + "\r\n";
      line += values.join(",");
      hasHeaders = true;
    }
    else {
      // add the line breaks at the beginning of each line
      // to avoid having an extra line at the end
      line = "\r\n";
      line += values.join(",");
    }
    // return the document to the stream and move on to the next one
    return line;
  }
}
You pass that function into the transform option for the database stream. Now assuming you have a collection of people with the keys _id, firstName, lastName:
function (req, res) {
  // create a transform function with the keys you want to keep
  var transformPerson = createTransformFunction(["firstName", "lastName"]);
  // Create the mongo read stream that uses your transform function
  var readStream = personCollection.find({}).stream({
    transform: transformPerson
  });
  // write stream to file
  var localWriteStream = fs.createWriteStream("./localFile.csv");
  readStream.pipe(localWriteStream);
  // write stream to download
  res.setHeader("content-type", "text/csv");
  res.setHeader("content-disposition", "attachment; filename=downloadFile.csv");
  readStream.pipe(res);
}
If you hit this endpoint, you'll trigger a download in the browser and write a local file. I didn't use archiver because I think it would add a level of complexity and take away from the concept of what's actually happening. The streams are all there, you'd just need to fiddle with it a bit to work it in with archiver.
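If you do want the archiver step, a rough sketch of wiring the same read stream into a zip download, assuming the archiver package (file names are illustrative):

const archiver = require("archiver");

function streamCsvAsZip(readStream, res) {
  const archive = archiver("zip");
  res.setHeader("content-type", "application/zip");
  res.setHeader("content-disposition", "attachment; filename=download.zip");
  // pipe the archive output straight to the response
  archive.pipe(res);
  // append the transformed mongo read stream as a named file inside the archive
  archive.append(readStream, { name: "downloadFile.csv" });
  archive.finalize();
}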

Nodejs Read very large file(~10GB), Process line by line then write to other file

I have a 10 GB log file in a particular format, I want to process this file line by line and then write the output to other file after applying some transformations. I am using node for this operation.
Though this method works, it takes a lot of time. I was able to do this within 30-45 minutes in Java, but in Node it is taking more than 160 minutes to do the same job. Following is the code:
Following is the initiation code which reads each line from the input.
var path = '../10GB_input_file.txt';
var output_file = '../output.txt';

function fileopsmain() {
  fs.exists(output_file, function(exists) {
    if(exists) {
      fs.unlink(output_file, function (err) {
        if (err) throw err;
        console.log('successfully deleted ' + output_file);
      });
    }
  });
  new lazy(fs.createReadStream(path, {bufferSize: 128 * 4096}))
    .lines
    .forEach(function(line) {
      var line_arr = line.toString().split(';');
      perform_line_ops(line_arr, line_arr[6], line_arr[7], line_arr[10]);
    });
}
This is the method that performs some operations on that line and passes the result to the write method, which writes it into the output file.
function perform_line_ops(line_arr, range_start, range_end, daynums) {
  var _new_lines = '';
  for(var i = 0; i < days; i++) {
    //perform some operation to modify line pass it to print
  }
  write_line_ops(_new_lines);
}
The following method is used to write data into a new file.
function write_line_ops(line) {
  if(line != null && line != '') {
    fs.appendFileSync(output_file, line);
  }
}
I want to bring this time down to 15-20 minutes. Is it possible to do so?
Also, for the record, I'm trying this on an Intel i7 processor with 8 GB of RAM.
You can do this easily without a module. For example:
var fs = require('fs');
var inspect = require('util').inspect;

var buffer = '';
var rs = fs.createReadStream('foo.log');

rs.on('data', function(chunk) {
  var lines = (buffer + chunk).split(/\r?\n/g);
  buffer = lines.pop();
  for (var i = 0; i < lines.length; ++i) {
    // do something with `lines[i]`
    console.log('found line: ' + inspect(lines[i]));
  }
});

rs.on('end', function() {
  // optionally process `buffer` here if you want to treat leftover data without
  // a newline as a "line"
  console.log('ended on non-empty buffer: ' + inspect(buffer));
});
I can't guess where the possible bottleneck is in your code.
Can you add the library or the source code of the lazy function?
How many operations does your perform_line_ops do? (if/else, switch/case, function calls)
I've created an example based on your code. I know that this does not answer your question, but maybe it helps you understand how Node handles such a case.
const fs = require('fs')
const path = require('path')

const inputFile = path.resolve(__dirname, '../input_file.txt')
const outputFile = path.resolve(__dirname, '../output_file.txt')

function bootstrap() {
  // fs.exists is deprecated
  // check if output file exists
  // https://nodejs.org/api/fs.html#fs_fs_exists_path_callback
  fs.exists(outputFile, (exists) => {
    if (exists) {
      // output file exists, delete it
      // https://nodejs.org/api/fs.html#fs_fs_unlink_path_callback
      fs.unlink(outputFile, (err) => {
        if (err) {
          throw err
        }
        console.info(`successfully deleted: ${outputFile}`)
        checkInputFile()
      })
    } else {
      // output file doesn't exist, move on
      checkInputFile()
    }
  })
}

function checkInputFile() {
  // check if input file can be read
  // https://nodejs.org/api/fs.html#fs_fs_access_path_mode_callback
  fs.access(inputFile, fs.constants.R_OK, (err) => {
    if (err) {
      // file can't be read, throw error
      throw err
    }
    // file can be read, move on
    loadInputFile()
  })
}

function saveToOutput() {
  // create write stream
  // https://nodejs.org/api/fs.html#fs_fs_createwritestream_path_options
  const stream = fs.createWriteStream(outputFile, {
    flags: 'w'
  })
  // return wrapper function which simply writes data into the stream
  return (data) => {
    // check if the stream is writable
    if (stream.writable) {
      if (data === null) {
        stream.end()
      } else if (data instanceof Array) {
        stream.write(data.join('\n'))
      } else {
        stream.write(data)
      }
    }
  }
}

function parseLine(line, respond) {
  respond([line])
}

function loadInputFile() {
  // create write stream
  const saveOutput = saveToOutput()
  // create read stream
  // https://nodejs.org/api/fs.html#fs_fs_createreadstream_path_options
  const stream = fs.createReadStream(inputFile, {
    autoClose: true,
    encoding: 'utf8',
    flags: 'r'
  })
  let buffer = null
  stream.on('data', (chunk) => {
    // prepend the buffer to the current chunk
    const lines = (buffer !== null)
      ? (buffer + chunk).split('\n')
      : chunk.split('\n')
    const lineLength = lines.length
    let lineIndex = -1
    // save the last line for later (the last line can be incomplete)
    buffer = lines[lineLength - 1]
    // loop through all lines
    // but don't include the last line
    while (++lineIndex < lineLength - 1) {
      parseLine(lines[lineIndex], saveOutput)
    }
  })
  stream.on('end', () => {
    if (buffer !== null && buffer.length > 0) {
      // parse the last line
      parseLine(buffer, saveOutput)
    }
    // Passing null signals the end of the stream (EOF)
    saveOutput(null)
  })
}

// kick off the parsing process
bootstrap()
I know this is old but...
At a guess, appendFileSync() writes to the file system and waits for the response. Lots of small writes are generally expensive; presuming you use a BufferedWriter in Java, you might get faster results by skipping some writes.
Use one of the async writes and see if Node buffers sensibly, or write the lines to a large Node Buffer until it is full and always write a full (or nearly full) Buffer. By tuning the buffer size you could check whether the number of writes affects performance. I suspect it does.
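A minimal sketch of that buffered-write idea, using a single write stream and its backpressure signal instead of an appendFileSync per line (names are illustrative):

const fs = require('fs');
const out = fs.createWriteStream('../output.txt');

function writeLine(line, done) {
  // write() returns false once the internal buffer is full;
  // wait for 'drain' before writing more, rather than forcing a sync write per line
  if (!out.write(line)) {
    out.once('drain', done);
  } else {
    process.nextTick(done);
  }
}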
The execution is slow because you're not using Node's asynchronous operations. In essence, you're executing the code like this:
> read some lines
> transform
> write some lines
> repeat
whereas you could be doing everything at once, or at least reading and writing in parallel. Some examples in the answers here do that, but the syntax is complicated. Using scramjet you can do it in a couple of simple lines:
const {StringStream} = require('scramjet');

fs.createReadStream(path, {bufferSize: 128 * 4096})
  .pipe(new StringStream({maxParallel: 128})) // I assume this is a utf-8 file
  .split("\n")                                // split per line
  .parse((line) => line.split(';'))           // parse each line
  .map(([line_arr, range_start, range_end, daynums]) => {
    return simplyReturnYourResultForTheOtherFileHere(
      line_arr, range_start, range_end, daynums
    ); // run your code, return a promise if you're doing some async work
  })
  .stringify((result) => result.toString())
  .pipe(fs.createWriteStream(output_file))
  .on("finish", () => console.log("done"))
  .on("error", (e) => console.log("error"))
This will probably run much faster.
