Pass large array of objects to RabbitMQ exchange - node.js

I receive a large array of objects from an external source (more than 10,000 objects), and then I pass it to an exchange in order to notify other microservices about the new entries to handle:
this._rmqClient.publishToExchange({
  exchange: 'my-exchange',
  exchangeOptions: {
    type: 'fanout',
    durable: true,
  },
  data: myData, // [object1, object2, object3, ...]
  pattern: 'myPattern',
})
The problem is that it's bad practice to push such a large message to an exchange, and I'd like to resolve this. I've read articles and Stack Overflow posts looking for code examples or information about streaming the data, but with no success.
The only approach I've found is to divide the large array into chunks and publish each one to the exchange in a for loop. Is that good practice? How do I determine how many objects each chunk should contain? Or is there another approach?

It really depends on the object size, and that's something you have to figure out yourself. Take your 10k objects and calculate an average size (for example, write them as JSON to a file and take fileSize / 10,000). A request body of around 50-100 kB is probably a reasonable target, but that's still up to you.
Start with a chunk size of 50 and run tests. Check the time taken, the bandwidth, and whatever else makes sense; vary the chunk size between 1 and 5,000 and keep testing. At some point you will get a feeling for which number works well.
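For a quick way to get that average size estimate without writing a file, something like this sketch works (the function name is made up for illustration; summing the JSON size of each object gives roughly the same number as the file approach):
// rough estimate of the average serialized object size, in bytes
// `myData` stands for the ~10k objects you received
function averageObjectSize(objects) {
  const totalBytes = objects.reduce(
    (sum, obj) => sum + Buffer.byteLength(JSON.stringify(obj), 'utf8'),
    0
  )
  return totalBytes / objects.length
}
// e.g. aim for roughly 100 kB per message:
// const chunkSize = Math.max(1, Math.floor((100 * 1024) / averageObjectSize(myData)))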
Here's some example code of looping through the elements:
// send function for showcasing the idea
function send(data) {
  return this._rmqClient.publishToExchange({
    exchange: 'my-exchange',
    exchangeOptions: {
      type: 'fanout',
      durable: true,
    },
    data: data,
    pattern: 'myPattern',
  })
}
// this sends the chunks one by one
async function sendLargeDataPacket(data, chunkSize) {
  // work on a copy so the caller's array is not mutated
  const mutated = [...data]
  // send full chunks as long as possible
  while (mutated.length >= chunkSize) {
    // send a packet of chunkSize length
    await send(mutated.splice(0, chunkSize))
  }
  // send the remaining elements if there are any
  if (mutated.length > 0) {
    await send(mutated)
  }
}
And you would call it like this:
// that's your 10k+ items array
var myData = [/**...**/]
// let's start with 50, but try out other numbers!
const chunkSize = 50
sendLargeDataPacket(myData, chunkSize)
  .then(() => console.log('done'))
  .catch(console.error)
This approach sends one packet after the other and may take some time, since it is not done in parallel. I don't know your requirements, but I can help you write a parallel approach if you need it.
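For example, a parallel variant could look roughly like this. It is only a sketch: it splits the data into chunks up front and sends a few chunks at a time, and the concurrency of 5 is just a starting value to tune, like the chunk size.
// sketch of a parallel variant: sends up to `concurrency` chunks at once
// `send` is the same helper as above
async function sendLargeDataPacketParallel(data, chunkSize, concurrency = 5) {
  // split into chunks first
  const chunks = []
  for (let i = 0; i < data.length; i += chunkSize) {
    chunks.push(data.slice(i, i + chunkSize))
  }
  // send `concurrency` chunks in parallel, wait for them, then take the next batch
  for (let i = 0; i < chunks.length; i += concurrency) {
    await Promise.all(chunks.slice(i, i + concurrency).map(chunk => send(chunk)))
  }
}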

Related

How to send large number of requests from axios?

I want to send about 2000 API calls to retrieve data from some API. How do I achieve this without any bottlenecks, using axios?
2000 API calls in parallel, or one after another?
Doing them all in parallel would be really resource intensive; depending on your machine it might or might not be possible.
If it is one after another, it can easily be achieved with a simple loop.
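A minimal sketch of the one-after-another version (assuming you have an array of axios request configs):
const axios = require("axios");
// sequential version: one request at a time, no parallelism to worry about
async function callOneByOne(requestConfigs) {
  const responses = [];
  for (const config of requestConfigs) {
    // each call waits for the previous one to finish
    responses.push(await axios(config));
  }
  return responses;
}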
Here is what I would do for a parallel implementation. It will take a little trial and error in the beginning to find the best parallelism factor.
Use lodash's chunk to split the complete set of 2000 request objects into chunks of size x
(reference: https://www.geeksforgeeks.org/lodash-_-chunk-method/).
Then make the API calls in parallel within each chunk: the chunks themselves are processed in sequential order, but within each chunk the calls run in parallel.
sample code:
// Requiring the lodash module
const _ = require("lodash");
async function _2000API_CALLS() {
  const reqs = [/* ... your 2000 request params */]
  // Making chunks of size 100
  const chunks = _.chunk(reqs, 100) // x = 100
  for (const c of chunks) {
    const responses = await Promise.all(c.map((params) => {
      return axios(params) // api call logic
    }))
    // accumulate the responses in some array if you need to store and use them later
  }
}

web3js subscribe logs too fast for JavaScript to handle

I am using web3js to subscribe to logs and I am listening for swap events. The problem is that .on('data') delivers data so fast that JavaScript cannot keep up. Let's say I add a variable let count = 0; and each time I get a new log I increment it with ++count; sometimes the logs come so fast that I get a duplicate number.
The real problem is that I need the logs in the exact order they come in; that's why I give each log a number, but that does not work.
How can I make sure that each data item I get from the log events is processed in order?
I tried to create a promise sequence:
let sequence = Promise.resolve();
let count = 0;
web3.eth.subscribe('logs', {
  fromBlock: block,
  topics: [
    [swapEvent]
  ]
}).on('data', (logData) => {
  sequence = sequence.then(() => {
    ++count
    processData(logData)
  })
});
function processData(logData) {
  return new Promise(resolve => {
    // do some stuff
    resolve();
  });
}
In a simple test with a loop and a random resolve time this works fine, but in the actual code with the socket it does not keep the order.
Does anyone have an idea how I can keep the socket data in order and process it one by one?
Not sure why at first, but my problem got solved with this:
sequence = sequence.then(() => processData(logData))
Before, it was:
sequence = sequence.then(() => {
  processData(logData)
})
Now it does everything in sequence. The difference is that the block-bodied arrow function does not return the promise from processData, so the chain never waited for it to finish; the concise body returns it, which makes each step wait for the previous processData call to complete.

DynamoDB PutItem using all heap memory - NodeJS

I have a CSV with over a million lines and I want to import all of them into DynamoDB. I'm able to loop through the CSV just fine; however, when I try to call DynamoDB PutItem for each line, I run out of heap memory after about 18k calls.
I don't understand why this memory is being used or how I can get around this issue. Here is my code:
let insertIntoDynamoDB = async () => {
  const file = './file.csv';
  let index = 0;
  const readLine = createInterface({
    input: createReadStream(file),
    crlfDelay: Infinity
  });
  readLine.on('line', async (line) => {
    let record = parse(`${line}`, {
      delimiter: ',',
      skip_empty_lines: true,
      skip_lines_with_empty_values: false
    });
    await dynamodb.putItem({
      Item: {
        "Id": {
          S: record[0][2]
        },
        "newId": {
          S: record[0][0]
        }
      },
      TableName: "My-Table-Name"
    }).promise();
    index++;
    if (index % 1000 === 0) {
      console.log(index);
    }
  });
  // halts the process until all lines have been processed
  await once(readLine, 'close');
  console.log('FINAL: ' + index);
}
If I comment out the DynamoDB call, I can loop through the file just fine and read every line. Where is this memory usage coming from? My DynamoDB write throughput is 500, and adjusting this value has no effect.
For anyone trudging through the internet trying to find out why DynamoDB is consuming all the heap memory, there is a GitHub bug report here: https://github.com/aws/aws-sdk-js/issues/1777#issuecomment-339398912
Basically, the AWS SDK only has 50 sockets to make HTTP requests, so if all sockets are consumed, events are queued until a socket becomes available. When processing millions of requests, these sockets get consumed immediately, and the queue builds up until it blows up the heap.
So, then how do you get around this?
1. Increase the heap size
2. Increase the number of sockets
3. Control how many "events" you are queueing
Options 1 and 2 are the easy way out but do not scale. They might work for your scenario if you are doing a one-off thing, but if you are trying to build a robust solution, then you will want to go with number 3.
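For option 2, raising the socket limit with the v2 aws-sdk looks roughly like this (500 is just an example value):
const AWS = require('aws-sdk');
const https = require('https');
// give the SDK a bigger connection pool (the default is 50 sockets)
AWS.config.update({
  httpOptions: {
    agent: new https.Agent({ keepAlive: true, maxSockets: 500 })
  }
});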
To do number 3, I determine the max heap size and divide it by how large I think an "event" will be in memory. For example, I assume an updateItem event for DynamoDB to be about 100,000 bytes. My heap size was 4 GB, so 4,000,000,000 B / 100,000 B = 40,000 events. However, I only take 50% of that many events, to leave room on the heap for anything else the Node application might be doing; this percentage can be lowered or raised depending on your preference. Once I have the number of events, I read a line from the CSV and consume an event; when the request has completed, I release the event back into the pool. If no events are available, I pause the CSV input stream until one becomes available.
Now I can upload millions of entries to DynamoDB without any worry of blowing up the heap.
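A stripped-down sketch of that idea, reusing the readline loop from the question with a simple in-flight counter (the limit of 20,000 is the 50%-of-40,000 figure from above; dynamodb, parse and readLine are the same objects as in the question):
const MAX_IN_FLIGHT = 20000; // ~50% of (heap size / estimated event size)
let inFlight = 0;
readLine.on('line', (line) => {
  const record = parse(line, { delimiter: ',', skip_empty_lines: true });
  inFlight++;
  if (inFlight >= MAX_IN_FLIGHT) {
    readLine.pause(); // stop reading new lines until some requests finish
  }
  dynamodb.putItem({
    Item: {
      "Id": { S: record[0][2] },
      "newId": { S: record[0][0] }
    },
    TableName: "My-Table-Name"
  }).promise()
    .catch(console.error)
    .finally(() => {
      inFlight--;
      if (inFlight < MAX_IN_FLIGHT) {
        readLine.resume(); // there is room again, keep reading
      }
    });
});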

Passing a stream only if digest passes

I've got a pipeline in an express.js module in which I take a file, decrypt it, pass it through a digest to ensure it is valid, and then want to return it as the response if the digest passes. The code looks something like this:
function GetFile(req, res) {
  ...
  }).then(() => {
    var p1 = new Promise(function(resolve, reject) {
      digester = digestStream("md5", "hex", function(md5, len) {
        // compare md5 and length against expected values
        // what do I do if they don't match?
        resolve()
      })
    })
    infile.pipe(decrypter).pipe(digester).pipe(res)
    return p1
  }).then(() => {
    ...
  }
The problem is that once I pipe the output to res, it gets sent whether or not the digest passes. But if I don't pipe the output of the digester to anything, nothing happens at all; I guess there isn't pressure from the reading end to move the data through.
I could simply run the decryption pipeline twice, and in fact that is what was previously done, but I'm trying to speed things up so everything happens only once. One idea I had was to pipe the digester output into a buffer and, if the digest matches, send the buffer to res. That requires memory proportional to the size of the file, which isn't horrible in most cases. However, I couldn't find much on how to .pipe() directly into a buffer. The closest thing I could find was the bl module; however, in the section where it demonstrates piping to a function that collects the data, this caveat is mentioned:
Note that when you use the callback method like this, the resulting
data parameter is a concatenation of all Buffer objects in the list.
If you want to avoid the overhead of this concatenation (in cases of
extreme performance consciousness), then avoid the callback method and
just listen to 'end' instead, like a standard Stream.
I'm not familiar enough with bl to understand what this really means with regard to efficiency. Specifically, I don't understand why it talks about concatenating Buffer objects (why is there more than one Buffer object that must be concatenated, for example?). I'm also not sure how I can follow its advice and still have a simple pipe.
The bl module collects buffers as it is piped to; how many buffers there are depends on how the input stream chunks its data. If you don't want to concatenate them together, just store them in the BufferList, and if the hash passes, pipe the BufferList to your output. If the hash does not pass, reject the promise (throwing inside the digest callback will not reject it) and handle the rejection further down the chain; since nothing has been piped to res at that point, you can still send an error response.
Something like this works for me:
function GetFile(req, res) {
  ...
  var bl
  }).then(() => {
    var p1 = new Promise(function(resolve, reject) {
      digester = digestStream("md5", "hex", function(md5, len) {
        // reject (rather than throw) so the failure propagates through the promise chain
        if (md5 !== expectedmd5) return reject(new Error("bad md5"))
        if (len !== expectedlen) return reject(new Error("bad length"))
        resolve()
      })
    })
    bl = new BufferList()
    infile.pipe(decrypter).pipe(digester).pipe(bl)
    return p1
  }).then(() => {
    bl.pipe(res)
    ...
  }
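If it helps, here is the same idea pulled out into a self-contained helper with basic error propagation; the module requires and the expectedMd5 / expectedLen parameters are assumptions based on the question.
const BufferList = require('bl')
const digestStream = require('digest-stream')
// buffers the decrypted output and resolves with the BufferList only if the digest matches
function decryptAndVerify(infile, decrypter, expectedMd5, expectedLen) {
  return new Promise((resolve, reject) => {
    const bl = new BufferList()
    const digester = digestStream('md5', 'hex', (md5, len) => {
      if (md5 !== expectedMd5) return reject(new Error('bad md5'))
      if (len !== expectedLen) return reject(new Error('bad length'))
      resolve(bl) // safe to pipe bl to res in the next .then()
    })
    infile.on('error', reject)
    decrypter.on('error', reject)
    infile.pipe(decrypter).pipe(digester).pipe(bl)
  })
}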

Best way to query all documents from a mongodb collection in a reactive way w/out flooding RAM

I want to query all the documents in a collection in a reactive way. The collection.find() method of the mongodb Node.js driver returns a cursor that fires events for each document found in the collection, so I made this:
const giant_query = (db) => {
  var req = db.collection('mycollection').find({});
  return Rx.Observable.merge(Rx.Observable.fromEvent(req, 'data'),
                             Rx.Observable.fromEvent(req, 'end'),
                             Rx.Observable.fromEvent(req, 'close'),
                             Rx.Observable.fromEvent(req, 'readable'));
}
It does what I want: it fires for each document, so I can treat them in a reactive way, like this:
Rx.Observable.of('').flatMap(giant_query).do(some_function).subscribe()
I could query the documents in packets of tens, but then I'd have to keep track of an index number each time the observable stream fires, and I'd have to build some kind of observable loop, which I don't know is possible or the right way to do it.
The problem with this cursor is that I don't think it works in packets. It will probably fire all the events in a short period of time and flood my RAM. Even if I buffer some events into packets using Observable's buffer, the events and their data (the documents) will be sitting in RAM waiting to be processed.
What's the best way to deal with this in a reactive way?
I'm not an expert on mongodb, but based on the examples I've seen, this is a pattern I would try.
I've omitted the events other than data, since throttling that one seems to be the main concern.
var cursor = db.collection('mycollection').find({});
const cursorNext = new Rx.BehaviorSubject('next'); // signal first batch then wait
const nextBatch = () => {
  if (cursor.hasNext()) {
    cursorNext.next('next');
  }
};
cursorNext
  .switchMap(() =>                             // wait for cursorNext to signal
    Rx.Observable.fromPromise(cursor.next())   // get a single doc
      .repeat()                                // get another
      .takeWhile(() => cursor.hasNext())       // stop taking if out of data
      .take(batchSize)                         // until full batch
      .toArray()                               // combine into a single emit
  )
  .map(docsBatch => {
    // do something with the batch
    // return docsBatch or modified docsBatch
  })
  ... // other operators?
  .subscribe(x => {
    ...
    nextBatch();
  });
I'm trying to put together a test of this Rx flow without mongodb; in the meantime, this might give you some ideas.
You might also want to check my solution without RxJS:
Mongoose Cursor: http bulk request from collection
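For reference, a non-Rx sketch of the same backpressure idea, relying on the cursor behaving like a Node readable stream (which is what the fromEvent(req, 'data') usage above already assumes); handleDocument is a hypothetical function that returns a promise:
const cursor = db.collection('mycollection').find({});
cursor.on('data', (doc) => {
  cursor.pause(); // stop the driver from pushing more documents
  handleDocument(doc) // hypothetical async processing of one document
    .then(() => cursor.resume()) // pull the next document only when done
    .catch((err) => {
      console.error(err);
      cursor.close();
    });
});
cursor.on('end', () => console.log('all documents processed'));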
