How to load very large csv files in nodejs? - node.js

I'm trying to load two big CSV files into Node.js. The first one is 257,597 KB and the second 104,330 KB. I'm using the filesystem (fs) and csv modules; here's my code:
const fs = require('fs')
const csv = require('csv')

let myData

fs.readFile('path/to/my/file.csv', (err, data) => {
  if (err) console.error(err)
  else {
    csv.parse(data, (err, dataParsed) => {
      if (err) console.error(err)
      else {
        myData = dataParsed
        console.log('csv loaded')
      }
    })
  }
})
And after ages (1-2 hours) it just crashes with this error message:
<--- Last few GCs --->
[1472:0000000000466170] 4366473 ms: Mark-sweep 3935.2 (4007.3) -> 3935.2 (4007.3) MB, 5584.4 / 0.0 ms last resort GC in old space requested
[1472:0000000000466170] 4371668 ms: Mark-sweep 3935.2 (4007.3) -> 3935.2 (4007.3) MB, 5194.3 / 0.0 ms last resort GC in old space requested
<--- JS stacktrace --->
==== JS stack trace =========================================
Security context: 000002BDF12254D9 <JSObject>
    1: stringSlice(aka stringSlice) [buffer.js:590] [bytecode=000000810336DC91 offset=94](this=000003512FC822D1 <undefined>,buf=0000007C81D768B9 <Uint8Array map = 00000352A16C4D01>,encoding=000002BDF1235F21 <String[4]: utf8>,start=0,end=263778854)
    2: toString [buffer.js:664] [bytecode=000000810336D8D9 offset=148](this=0000007C81D768B9 <Uint8Array map = 00000352A16C4D01>,encoding=000002BDF1...
FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory
 1: node::DecodeWrite
 2: node_module_register
 3: v8::internal::FatalProcessOutOfMemory
 4: v8::internal::FatalProcessOutOfMemory
 5: v8::internal::Factory::NewRawTwoByteString
 6: v8::internal::Factory::NewStringFromUtf8
 7: v8::String::NewFromUtf8
 8: std::vector<v8::CpuProfileDeoptFrame,std::allocator<v8::CpuProfileDeoptFrame> >::vector<v8::CpuProfileDeoptFrame,std::allocator<v8::CpuProfileDeoptFrame> >
 9: v8::internal::wasm::SignatureMap::Find
10: v8::internal::Builtins::CallableFor
11: v8::internal::Builtins::CallableFor
12: v8::internal::Builtins::CallableFor
13: 00000081634043C1
The biggest file gets loaded, but node runs out of memory for the other one. It's probably easy to allocate more memory, but the main issue here is the loading time: it seems very long given the size of the files. So what is the correct way to do this? For comparison, Python loads these CSVs with pandas really fast (3-5 seconds).

Streaming works perfectly; it took only 3-5 seconds:
var fs = require('fs')
var csv = require('csv-parser')
var data = []

fs.createReadStream('path/to/my/data.csv')
  .pipe(csv())
  .on('data', function (row) {
    data.push(row)
  })
  .on('end', function () {
    console.log('Data loaded')
  })

fs.readFile loads the entire file into memory before your callback runs, whereas fs.createReadStream reads the file in chunks of a size you can specify (64 KB by default).
This keeps the process from running out of memory.
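For illustration, a minimal sketch of reading in explicit chunks; the highWaterMark value below is just an example (64 KB also happens to be the default chunk size for fs streams):

const fs = require('fs')

// Each 'data' event delivers one Buffer of at most highWaterMark bytes,
// so only a small window of the file is in memory at any time.
const stream = fs.createReadStream('path/to/my/file.csv', { highWaterMark: 64 * 1024 })

stream.on('data', (chunk) => {
  // process this chunk here instead of waiting for the whole file
})
stream.on('end', () => console.log('file fully read'))
stream.on('error', (err) => console.error(err))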

You may want to stream the CSV instead of reading it all at once:
csv-parse has streaming support: http://csv.adaltas.com/parse/
Or you may want to take a look at csv-stream: https://www.npmjs.com/package/csv-stream
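As a rough sketch of csv-parse's stream interface (the require shown is the older module-level export; newer versions expose { parse } instead, and the file path is a placeholder):

const fs = require('fs')
const parse = require('csv-parse')

const rows = []
fs.createReadStream('path/to/my/file.csv')
  .pipe(parse({ columns: true }))        // one object per row, keyed by the header line
  .on('data', (row) => rows.push(row))   // rows arrive incrementally, not in one giant buffer
  .on('end', () => console.log('csv loaded,', rows.length, 'rows'))
  .on('error', (err) => console.error(err))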

Related

Unable to upload large files using nodejs/axios

I am writing a Node.js client that uploads files (binary or text) from my local dev machine to my server, which is written in Java and whose configuration I cannot change. I am using the following code to upload files; it works fine for files up to 2 GB, but beyond that it throws the error mentioned below. You may think that the server does not allow files larger than 2 GB, but I have successfully uploaded files of up to 10 GB to the same instance using REST clients like Postman and Insomnia.
const fs = require("fs");
const path = require("path");
const axios = require("axios");
const FormData = require("form-data");
function uploadAxios({ filePath }) {
let formData;
try {
formData = new FormData();
formData.append("filedata", fs.createReadStream(filePath));
} catch (e) {
console.error(e)
}
axios
.post(
`https://myinstance.com`,
formData,
{
headers: {
...formData.getHeaders(),
"Content-Type": "multipart/form-data",
Authorization:
"Basic xyz==",
},
maxContentLength: Infinity,
maxBodyLength: Infinity,
// maxContentLength: 21474836480,
// maxBodyLength: 21474836480, // I have tried setting these values with both numbers and the keyword Infinity but nothing works
}
)
.then(console.log)
.catch(console.error);
}
const filePath = "C:\\Users\\phantom007\\Documents\\BigFiles\\3gb.txt";
uploadAxios({ filePath });
Error I get:
#
# Fatal error in , line 0
# API fatal error handler returned after process out of memory
#
<--- Last few GCs --->
es[7844:0000023DC49CE190] 47061 ms: Mark-sweep 33.8 (41.8) -> 33.8 (41.8) MB, 417.2 / 0.1 ms (+ 947.1 ms in 34029 steps since start of marking, biggest step 431.0 ms, walltime since start of marking 15184 ms) finalize incremental marking via stack guard[7844:0000023DC49CE190] 48358 ms: Mark-sweep 34.4 (41.8) -> 31.8 (40.5) MB, 1048.4 / 0.0 ms (+ 0.0 ms in 1 steps since start of marking, biggest step 0.0 ms, walltime since start of marking 1049 ms) finalize incremental marking via task GC in old space
<--- JS stacktrace --->
==== JS stack trace =========================================
Security context: 000002E294C255E9 <JSObject>
    0: builtin exit frame: new ArrayBuffer(aka ArrayBuffer)(this=0000022FFFF822D1 <undefined>,65536)
    1: _read [fs.js:~2078] [pc=0000004AD942D301](this=0000039E67337641 <ReadStream map = 000002F26D804989>,n=65536)
    2: read [_stream_readable.js:454] [bytecode=000002A16EB59689 offset=357](this=0000039E67337641 <ReadStream map = 000002F26D804989>,n=0)
    3: push [_stream_readable.js:~201]...
FATAL ERROR: Committing semi space failed. Allocation failed - process out of memory
It looks like the error occurs because the process has exceeded the memory limit. I know I can overcome this by passing the flag --max-old-space-size, but I want this to be scalable rather than hardcoding an upper limit.
PS: My dev machine has 12 GB of free memory.
Edit: I added the error trace.
I'm using multer to define the limit; see the following code:
app.use(multer({
  storage: storage,
  dest: path.join(pathApp),
  limits: {
    fileSize: 5000000
  },
  fileFilter: function fileFilter(req, file, cb) {
    var filetypes = /json/;
    var mimetype = filetypes.test(file.mimetype);
    var extname = filetypes.test(path.extname(file.originalname));
    if (mimetype && extname) {
      console.log("Port ".concat(app.get('port')) + " - Uploading file " + file.originalname);
      return cb(null, true, req);
    }
    cb(JSON.stringify({
      "success": false,
      "payload": {
        "app": "upload",
        "function": "upload"
      },
      "error": {
        "code": 415,
        "message": 'File type not valid'
      }
    }));
  }
}).single('file1'));
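On the client side, if the goal is to keep memory flat regardless of file size, one option worth trying (a sketch only, not verified against the asker's Java server; the host, path, file path and credentials are placeholders) is to build the request with Node's own https client and pipe the form-data stream into it, so the multipart body is streamed chunk by chunk and never assembled in memory. Note that without a Content-Length header the body is sent with chunked transfer encoding, which the server must accept.

const fs = require("fs");
const https = require("https");
const FormData = require("form-data");

const form = new FormData();
form.append("filedata", fs.createReadStream("C:\\path\\to\\bigfile.txt"));

// Pipe the multipart stream straight into the request instead of
// handing axios a fully built body.
const request = https.request({
  method: "POST",
  host: "myinstance.com",
  path: "/",
  headers: {
    ...form.getHeaders(),
    Authorization: "Basic xyz==",
  },
});

form.pipe(request);

request.on("response", (res) => {
  console.log("status:", res.statusCode);
  res.resume(); // drain the response body
});
request.on("error", console.error);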

Laravel Factory error Allowed memory size of 536870912 bytes exhausted (tried to allocate 262144 bytes)

I want to generate dummy data using a factory with a seeder, but it gives me an error.
When I run the command below:
php artisan db:seed
I get this error:
PHP Fatal error: Allowed memory size of 536870912 bytes exhausted (tried to allocate 262144 bytes) in D:\xampp\htdocs\Bootstrap\vendor\laravel\framework\src\Illuminate\Database\Query\Grammars\Grammar.php on line 1120
PHP Fatal error: Allowed memory size of 536870912 bytes exhausted (tried to allocate 262144 bytes) in Unknown on line 0
class DatabaseSeeder extends Seeder
{
    public function run()
    {
        factory(User::class, 5)->create()->each(function ($user) {
            $profile = factory(Profile::class)->make();
            $user->profile()->save($profile);
            $profile->each(function ($profiles) {
                $qualification = factory(Qualification::class, 3)->make();
                $experience = factory(Experience::class, 3)->make();
                $profiles->qualification()->saveMany($qualification);
                $profiles->experience()->saveMany($experience);
            });
        });
    }
}
Each User has one Profile.
Each Profile has many Qualifications and Experiences.
If we run the code given below instead:
class DatabaseSeeder extends Seeder
{
    public function run()
    {
        DB::table('posts')->insertOrIgnore([
            ['id'=>1,'title'=>'admission','created_at'=>now(),'updated_at'=>now()],
            ['id'=>2,'title'=>'biology','created_at'=>now(),'updated_at'=>now()],
            ['id'=>3,'title'=>'mathematics','created_at'=>now(),'updated_at'=>now()],
            ['id'=>4,'title'=>'chemistry','created_at'=>now(),'updated_at'=>now()],
            ['id'=>5,'title'=>'physics','created_at'=>now(),'updated_at'=>now()],
            ['id'=>6,'title'=>'english','created_at'=>now(),'updated_at'=>now()],
            ['id'=>7,'title'=>'urdu','created_at'=>now(),'updated_at'=>now()],
        ]);
        DB::table('provinces')->insertOrIgnore([
            ['id'=>1,'title'=>'punjab','created_at'=>now(),'updated_at'=>now()],
            ['id'=>2,'title'=>'sindh','created_at'=>now(),'updated_at'=>now()],
            ['id'=>3,'title'=>'nwfp','created_at'=>now(),'updated_at'=>now()],
            ['id'=>4,'title'=>'balochistan','created_at'=>now(),'updated_at'=>now()],
        ]);
    }
}
using the same command (php artisan db:seed), then no error is received.
Please help me with using a Laravel factory.
To fix that issue, edit your php.ini.
Increase the memory limit to 512M:
; Maximum amount of memory a script may consume (128 MB)
; http://php.net/memory-limit
memory_limit=512M
or make it unlimited (it depends on your server resources):
; Maximum amount of memory a script may consume (128 MB)
; http://php.net/memory-limit
memory_limit=-1

NodeJs - near heap limit Allocation failed - JavaScript heap out of memory

I have a Node.js application running on AWS EC2 (an m5.xlarge instance with Ubuntu 18.04). It has a main.js file in which I use node-cron to schedule multiple cron jobs; once the jobs are scheduled, the application is started from another file, app.js. Intermittently I am facing an out-of-memory error and the server stops. The logs look like this:
FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory
1: node::Abort() [node /home/ubuntu/XXXXXX/main.js]
2: 0x89371c [node /home/ubuntu/XXXXXX/main.js]
3: v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, bool) [node /home/ubuntu/XXXXXX/main.js]
4: v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, bool) [node /home/ubuntu/XXXXXX/main.js]
5: 0xe617e2 [node /home/ubuntu/XXXXXX/main.js]
6: v8::internal::Heap::PerformGarbageCollection(v8::internal::GarbageCollector, v8::GCCallbackFlags) [node /home/ubuntu/XXXXXX/main.js]
7: v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) [node /home/ubuntu/XXXXXX/main.js]
8: v8::internal::Heap::AllocateRawWithRetry(int, v8::internal::AllocationSpace, v8::internal::AllocationAlignment) [node /home/ubuntu/XXXXXX/main.js]
9: v8::internal::Factory::NewFillerObject(int, bool, v8::internal::AllocationSpace) [node /home/ubuntu/XXXXXX/main.js]
10: v8::internal::Runtime_AllocateInNewSpace(int, v8::internal::Object**, v8::internal::Isolate*) [node /home/ubuntu/XXXXXX/main.js]
11: 0x2a583f5041bd
The memory utilization and health check monitors show spikes caused by the cron jobs running on their intervals; the highest spikes are from the hourly cron job.
Now my assumption is that either one of the cron jobs in main.js fails with an out-of-memory error and hence the application in app.js also stops, or the application itself fails in app.js. The cron job scheduling looks like the following:
const cluster = require('cluster');
const numCPUs = 1 //require('os').cpus().length;
var CronJob = require('cron').CronJob;
const spawn = require('child_process').spawn;
require('dotenv').config()

if (cluster.isMaster) {
  if (process.env.kEnvironment == "dev") {
    var sampleCron = new CronJob('00 */10 * * * *', function () {
      spawn(process.execPath, ['./sampleCron.js'], {
        stdio: 'inherit'
      })
    }, null, true, null);
  } else {
    var sampleCron = new CronJob('00 15 10 * * 0', function () {
      spawn(process.execPath, ['./sampleCron.js'], {
        stdio: 'inherit'
      })
    }, null, true, null);
  }
  // There are multiple cron like the above

  for (let i = 0; i < numCPUs; i++) {
    cluster.fork();
  }

  cluster.on('exit', (worker, code, signal) => {
    console.log(`worker ${worker.process.pid} died`);
  });
} else {
  require('./app.js');
}
Here is an htop preview of this. A couple of questions here:
- Why does it show a main.js tree inside main.js? Is that kind of nesting normal?
- If it is the same process, why does the memory utilization differ between the two?
I tried to increase the memory for each cron as below -
var sampleCron = new CronJob('00 15 10 * * 0', function () {
  spawn(process.execPath, ['./sampleCron.js', '--max-old-space-size=4096'], {
    stdio: 'inherit'
  })
}, null, true, null);
But it still fails. My questions are as follows:
- How do I isolate the problem? Is it really due to the crons or due to the application?
- How do I solve the problem?
Use a command like top to find out how much memory the node process is actually using; the node script may not be using all of the available memory. You can also try allocating more memory via the NODE_OPTIONS environment variable, e.g. NODE_OPTIONS=--max-old-space-size=8192 node SomeScript.js, or by passing the flag before the script name: node --max-old-space-size=8192 SomeScript.js.
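Relatedly, regarding the attempt above of passing '--max-old-space-size=4096' after './sampleCron.js': V8 flags only take effect when they appear before the script path; placed after it, the flag is handed to the cron script as an ordinary argument. A minimal sketch of the corrected spawn call (4096 is just an example value):

var CronJob = require('cron').CronJob;
const spawn = require('child_process').spawn;

var sampleCron = new CronJob('00 15 10 * * 0', function () {
  // Flag first, script second: node <v8 flags> <script> <script args>
  spawn(process.execPath, ['--max-old-space-size=4096', './sampleCron.js'], {
    stdio: 'inherit'
  })
}, null, true, null);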

What are the effective ways to work Node js with a large JSON file of 600 MB and more?

My partner's REST API gives me very large JSON files: 600 MB, 1000 MB.
Their structure is as follows:
{ nameid1:[list id,....], nameid2:[list id,....], }
[list id,....] is an array of IDs that can hold up to hundreds of millions of records.
To work with such files I currently use the following sequence of actions:
1. I save the file to the hard drive.
2. With the sed command, I turn the single-line file into a multi-line one. Example:
exec(`sed -i 's/','/','\n/g' file.json`)
3. I work with the file directly using readline.
I tried to use JSONStream but it causes FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory
const fs = require('fs')
const JSONStream = require('JSONStream')

function getStream() {
  let jsonData = __dirname + '/jsonlarge/file.json',
    stream = fs.createReadStream(jsonData, {
      encoding: 'utf8'
    }),
    parser = JSONStream.parse('*');

  stream.pipe(parser)

  parser.on('data', (data) => {
    console.log('received:', data);
  });
}
Example structure of the JSON file:
{"Work":"12122001","name":"Regist","world":[{"name":"000000","point":"rfg","Content":["3202b9a3fba","121323","2343454","45345543","354534534"]}, {"name":"000000","point":"rfg","Content":["3202b","121323","2343454","45345543","354534534"]}, {"name":"000000","point":"rfg","Content":["320","121323","2343454","45345543","354534534"]}]}
Maybe someone knows a faster way to work with such files.
Thanks
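One thing that may be worth checking (a sketch only, based on JSONStream's documented wildcard paths, and assuming the { nameid: [list id, ....] } structure described above): parse('*') emits each top-level value as a whole, so every huge ID array is materialised in memory at once. Selecting one level deeper should emit the IDs one by one instead:

const fs = require('fs')
const JSONStream = require('JSONStream')

fs.createReadStream(__dirname + '/jsonlarge/file.json', { encoding: 'utf8' })
  .pipe(JSONStream.parse('*.*'))          // '*.*' matches every element of every top-level array
  .on('data', (id) => {
    // each id arrives individually instead of as one giant array per key
  })
  .on('end', () => console.log('done'))
  .on('error', (err) => console.error(err))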

node js upload size unit is strange

I am working on the upload module of my server and I handle file uploads with multiparty. I am currently trying to limit the upload size. Simply, I am doing something like this:
req.on("data", function(dt) {
bytes += dt.length;
if (bytes > 2048) {
req.connection.destroy();
console.log("connection destroyed due to huge file size");
}
console.log(bytes);
});
I thought this length was in bytes and tried to limit it to 2 MB,
but I noticed the unit is a bit strange. For testing I uploaded a 148 KB file, but the length of the variable I created was 421. It is neither bits nor bytes; why is it such a strange number? Where does the extra ~300k come from?
Did you try the filesystem module for checking the size of the file?
E.g.:
var fs = require("fs");
var stats = fs.statSync("myfile.txt");
var fileSizeInBytes = stats.size;
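As a side note on the numbers in the question: a 2 MB cap on the raw request means comparing against 2 * 1024 * 1024 = 2097152 bytes, not 2048 (which is 2 KB). A minimal sketch of the same counter with the corrected limit:

var MAX_BYTES = 2 * 1024 * 1024; // 2 MB = 2097152 bytes

var bytes = 0;
req.on("data", function(dt) {
  bytes += dt.length;              // dt is a Buffer, so length is its size in bytes
  if (bytes > MAX_BYTES) {
    req.connection.destroy();
    console.log("connection destroyed: request exceeded 2 MB");
  }
});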
