Threading: Division of Labor

I seek a little nudge in the right direction to understand Node workers. I currently have Node code that reads data from a file and performs a bunch of subsequent actions with network requests. All of the actions I do with the data currently take place in the callback of the read function.
What I struggle to wrap my head around is how best to take this single read function (which almost certainly is not slowing my application down -- I'm fairly certain it's the later requests I'd like to branch), and divide the manipulation into multiple child processes. Of course, I don't want to perform my battery of actions multiple times on the same row of data, but rather I want to give each worker a slice of the pie. Is my best bet to, in the read-callback, create several arrays with part of the data, and then feed one array into each worker, outside the callback? Are there other options? My end goal is to reduce the time it takes my script to run through x amount of data.
var request = require('request');
var request = request.defaults({
  jar: true
})
var yacsv = require('ya-csv');

// Post log-in form information to the appropriate URL -- occurs only once per script run -- log-in cookies saved for subsequent requests
request.post({
  url: 'xxxxx.com',
  body: "login_info",
  // On response...
}, function (error, res, body) {
  // Instantiate CSV reader
  var reader = yacsv.createCsvFileReader("somefile.csv");
  // Read data from CSV, row by row -- function fires once per CSV row
  // THIS IS WHAT I -THINK- I CAN SPLIT AMONG MULTIPLE WORKERS
  var readData = reader.addListener('data', function (data) {
    // Bind each field from a CSV row to a corresponding variable for ease of use
    //[Variables here]
    // Second request for search form -- uses information from a single row to query more information from a database
    request.post({
      url: 'xxxxx.com/form',
      body: variable_with_csv_data,
    }, function (error, res, body) {
      // Parse the resulting page, then assign page elements to variables for ease of output
    });
  });
});

The cluster module is not an alternative to threads. The cluster module allows you to balance HTTP requests to the same application logic over multiple processes, without the option of delegating responsibility.
What is it exactly that you are trying to optimize?
Is the overall process taking too long?
Is the separate processing of the data events too slow?
Are your database calls too slow?
Are the HTTP requests too slow?
Also, I would do away with the ya-csv module; it seems somewhat outdated to me.
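If the per-row requests do turn out to be the bottleneck, one way to give each worker a slice of the pie, as the question puts it, is to fork child processes up front and deal rows out round-robin as they are read. This is only a rough sketch under assumptions: worker.js is a hypothetical file, and the worker count is arbitrary. Note that forked processes do not share memory, so the parent's login cookie jar is not available to them; each worker would need to log in itself or receive the cookie value via send().

// parent.js -- fork N workers once, then deal CSV rows out round-robin
var fork = require('child_process').fork;
var yacsv = require('ya-csv');

var numWorkers = 4; // assumption: tune to CPU count / acceptable request parallelism
var workers = [];
for (var i = 0; i < numWorkers; i++) {
  workers.push(fork('./worker.js')); // worker.js is hypothetical, see below
}

var row = 0;
var reader = yacsv.createCsvFileReader('somefile.csv');
reader.addListener('data', function (data) {
  workers[row % numWorkers].send(data); // each row goes to exactly one worker
  row++;
});

// worker.js -- receives one row at a time and runs the requests for it
process.on('message', function (row) {
  // log in (or use a cookie value passed from the parent), then run the
  // second request.post for this row, as in the original callback
});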

Related

Batch requests and concurrent processing

I have a service in NodeJS which fetches user details from the DB and sends them to another application via HTTP. There can be millions of user records, so processing these one by one is very slow. I have implemented concurrent processing for this, like so:
const userIds = [1,2,3....];
const users$ = from(this.getUsersFromDB(userIds));
const concurrency = 150;

users$.pipe(
  switchMap((users) =>
    from(users).pipe(
      mergeMap((user) => from(this.publishUser(user)), concurrency),
      toArray()
    )
  )
).subscribe(
  (partialResults: any) => {
    // Do something with partial results.
  },
  (err: any) => {
    // Error
  },
  () => {
    // done.
  }
);
This works perfectly fine for thousands of user records; it processes 150 user records concurrently at a time, much faster than publishing users one by one.
But a problem occurs when processing millions of user records: getting those from the database is pretty slow, as the result set size also runs to GBs (with more memory usage as well).
I am looking for a solution to get user records from the DB in batches, while continuing to publish those records concurrently in parallel.
I am thinking of a solution like this: maintain a queue (of size N) of user records fetched from the DB; whenever the queue size is less than N, fetch the next N results from the DB and add them to this queue.
Then the current solution I have will keep getting records from this queue and keep processing them concurrently with the defined concurrency. But I am not quite able to put this in code. Is there a way we can do this using RxJS?
I think your solution is the right one, i.e. using the concurrent parameter of mergeMap.
The point that I do not understand is why you are adding toArray at the end of the pipe.
toArray buffers all the notifications coming from upstream and emits only when the upstream completes.
This means that, in your case, the subscribe does not process partial results but processes all of the results you have obtained by executing publishUser for all users.
On the contrary, if you remove toArray and leave mergeMap with its concurrent parameter, what you will see is a continuous flow of results into the subscribe, due to the concurrency of the process.
That is as far as RxJS is concerned. Then you can look at the specific DB you are using to see whether it supports batch reads. In that case you can create buffers of user ids with the bufferCount operator and query the DB with those buffers.
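A rough sketch of that combination follows; the batch size of 1000 is an arbitrary assumption, and getUsersFromDB is assumed to accept a subset of ids, as in the original snippet:

import { from } from 'rxjs';
import { bufferCount, mergeMap } from 'rxjs/operators';

const batchSize = 1000; // assumption: tune to what the DB handles comfortably
const concurrency = 150;

from(userIds).pipe(
  bufferCount(batchSize),                               // ids in batches of batchSize
  mergeMap((ids) => from(this.getUsersFromDB(ids)), 1), // fetch one batch at a time
  mergeMap((users) => from(users)),                     // flatten each batch into users
  mergeMap((user) => from(this.publishUser(user)), concurrency) // publish 150 at a time
).subscribe(
  (result: any) => { /* partial result, emitted as each publish completes */ },
  (err: any) => { /* error */ },
  () => { /* done */ }
);

With no toArray at the end, results stream into the subscribe as they complete, and the inner mergeMap with concurrency 1 keeps only one DB batch in memory at a time.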

using Kafka Consumer in Node JS app to indicate computations have been made

So my question may involve some brainstorming based on the nature of the application.
I have a Node JS app that sends messages to Kafka. For example, every single time a user clicks on a page, a Kafka app runs a computation based on the visit. I then, at the same instance, want to retrieve this computation after triggering it through my Kafka message. So far, this computation is stored in a Cassandra database. The problem is that if we try to read from Cassandra before the computation is complete, then we will query nothing from the database (the key has not been inserted yet) and won't return anything (an error), or possibly the computation is stale. This is my code so far.
router.get('/:slug', async (req, res) => {
  Producer = kafka.Producer
  KeyedMessage = kafka.KeyedMessage
  client = new kafka.KafkaClient()
  producer = new Producer(client)
  km = new KeyedMessage('key', 'message')
  kafka_message = JSON.stringify({ id: req.session.session_id.toString(), url: arbitrary_url })
  payloads = [
    { topic: 'MakeComputationTopic', messages: kafka_message }
  ];
  // (presumably producer.send(payloads, ...) happens here)

  const clientCass = new cassandra.Client({
    contactPoints: ['127.0.0.1:9042'],
    localDataCenter: 'datacenter1', // here is the change required
    keyspace: 'computation_space',
    authProvider: new auth.PlainTextAuthProvider('cassandra', 'cassandra')
  });

  const query = 'SELECT * FROM computation WHERE id = ?';
  clientCass.execute(query, [req.session.session_id], { hints: ['int'] })
    .then(result => console.log('User with email %s', result.rows[0].computations))
    .catch((message) => {
      console.log('Could not find key')
    });
});
Firstly, async and await came to mind, but that is ruled out, since it does not stop stale computations.
Secondly, I looked into letting my application sleep, but it seems that this would slow my application down.
I am considering using a Kafka consumer (in my node-js app) to consume a message that indicates it's now safe to look into the Cassandra table.
For example (using kafka-node):
consumer.on('message', function (message) {
  clientCass.execute(query, [req.session.session_id], { hints: ['int'] })
    .then(result => console.log('User with computation %s', result.rows[0].computations))
    .catch((message) => {
      console.log('Could not find key')
    });
});
This approach, while better, seems a bit off, since I would have to create a consumer every time a user clicks on a page, and I only care about it being sent one message.
I was wondering how I should deal with this challenge. Am I possibly missing a scenario, or is there a way to use kafka-node to solve this problem? I was also thinking of doing a while loop that waits for the promise to succeed and for the computations not to be stale (comparing values in the cache).
This approach, while better, seems a bit off, since I would have to create a consumer every time a user clicks on a page, and I only care about it being sent one message.
I would come to the same conclusion. Cassandra is not designed for this kind of use case. The database is eventually consistent. Your current approach may work at the moment, if you hack something together, but it will definitely result in undefined behavior once you have a Cassandra cluster, especially when you update the entry.
The id in the computation table is your partition key. This means that once you have a cluster, Cassandra distributes the data by the id. It looks like each partition only contains one row. This is a very inefficient way of modeling your Cassandra tables.
Your use case looks like one for a session storage or cache. Redis or LevelDB are well suited for this kind of use case. Any other key-value storage would do the job too.
Why don't you write your result into another topic and have another application which reads this topic and writes the result into a database? That way you don't need to keep any state: the result will be in the topic when it is done. It would look like this:
incoming data -> first kafka topic -> computational app -> second kafka topic -> another app writing it into the database <- your app regularly reading the data.
If the result is not there yet, the computation is not done yet.
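A minimal sketch of the last reader in that chain, using kafka-node as in the question (the results topic name and the JSON payload shape are assumptions):

const kafka = require('kafka-node');

const client = new kafka.KafkaClient();
const consumer = new kafka.Consumer(
  client,
  [{ topic: 'ComputationResultsTopic' }], // assumed name of the second topic
  { autoCommit: true }
);

consumer.on('message', function (message) {
  const result = JSON.parse(message.value);
  // persist the finished computation here (database, Redis, ...);
  // the web app then only reads completed results and never polls Cassandra
});

This consumer runs once, as its own long-lived process, rather than being created per page click.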

How to know how many requests to make without knowing amount of data on server

I have a NodeJS application where I need to fetch data from another server (3rd party, I have no control over it). The server requires you to specify a maximum number of entries to return, along with an offset. So, for example, if there are 100 entries on the server, I could request a pageSize of 100 and an offset of 0, or a pageSize of 10 and do 10 requests with offsets 1, 2, 3, etc., and do a Promise.all (doing multiple concurrent smaller requests is faster when timing it).
var pageSize = 100;
var offsets = [...Array(totalItems / pageSize).keys()];
await Promise.all(offsets.map(async i => {
  // make request with pageSize and offset
}));
The only problem is that the number of entries changes, and there is no property returned by the server indicating the total number of items. I could do something like this and loop until the server comes back empty:
var offset = 0;
var pageSize = 100;
var data = [];
// makeRequest is a placeholder for the actual request with pageSize and offset
var response = await makeRequest(pageSize, offset);
while (response.length > 0) { // while the response is not empty
  data.push(response);
  offset++;
  response = await makeRequest(pageSize, offset); // send another request
}
But that isn't as efficient/quick as sending multiple concurrent requests like above.
Is there any good way around this that can deal with the dynamic length of the data on the server?
Without the server giving you some hints about how many items there are, there's not a lot you can do to parallelize multiple requests, as you don't really want to send more requests than are needed, and you don't want to artificially make your requests for small numbers of items just so you can run more requests in parallel.
You could run some tests and find some practical limits. What is the maximum number of items that the server and your client seem to be OK with you requesting (100? 1,000? 10,000? 100,000?)? Just request that many to start with, and if the response indicates there are more after that, send another request of a similar size.
The main idea here is to minimize the number of separate requests and maximize the data you can get in a single call. That should be more efficient than more parallel requests, each requesting fewer items, because it's ultimately the same server on the other end and the same data store that has to provide all the data, so the fewest roundtrips in the fewest separate requests is probably best.
But some of this depends on the scale and architecture of the target host, so experiments will be required to see what practically works best.
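As a sketch of that strategy (fetchPage is a hypothetical wrapper around the actual request, and the page size is something you would find experimentally):

async function fetchAll(fetchPage) {
  const pageSize = 1000; // assumption: the largest size the server tolerates
  const data = [];
  let offset = 0;
  while (true) {
    const page = await fetchPage(pageSize, offset); // hypothetical request helper
    data.push(...page);
    if (page.length < pageSize) break; // a short page means there is no more data
    offset++;
  }
  return data;
}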

Concurrent requests overriding data in Redis

Scenario: whenever a request comes in, I need to connect to a Redis instance, open the connection, fetch the count, update the count, and close the connection (for every request this is the flow). When the requests come in sequential order, i.e. 1 user sending 100 requests one after the other, then the count in Redis is 100.
Issue: the issue is when concurrent requests come in, i.e. 10 users sending 100 requests (each user 10 requests) concurrently; then the count is not 100, it's around 50.
Example: assume the count in Redis is 0. If 10 requests come at the same time, then 10 connections will be opened and all 10 connections will fetch the count value as 0 and update it to 1.
Analysis: I found out that, as the requests come in concurrently, multiple connections fetch the same count value and update it, and because of that the count value gets overridden. Can anyone suggest a good way to avoid this problem if you have already encountered it?
Here we are using Hapi.js, Redis 3.0, and ioredis.
I would recommend queueing each task so that each request finishes before the next one starts.
Queue.js is a good library I have used before, but you can check out others if you want.
Here is an example, basically from the docs but adapted slightly for your use case:
var queue = require('queue')

var q = queue()
var results = []
var rateLimited = false

q.push(function (cb) {
  if (!rateLimited) {
    // get data and push into results
    results.push('two')
  }
  cb()
})

q.start(function (err) {
  if (err) throw err
  console.log('all done:', results)
})
This is a very loose example, as I just wrote it quickly and without seeing your code base, but I hope you get the idea.
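Applied to the counter itself, the same idea might look roughly like this, with every request pushing its read-update-write onto one shared queue so only one runs at a time (the key name and the ioredis get/set calls are assumptions about your code):

var queue = require('queue')
var Redis = require('ioredis')

var redis = new Redis()
var q = queue({ concurrency: 1, autostart: true }) // one task at a time

function incrementCount() {
  q.push(function (cb) {
    redis.get('count').then(function (count) {
      // read-modify-write is safe here because the queue serializes tasks
      return redis.set('count', Number(count || 0) + 1)
    }).then(function () { cb() }, cb)
  })
}

As an aside, Redis also has an atomic INCR command (redis.incr('count') with ioredis), which avoids the read-modify-write race for a plain counter altogether.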

Node: Check a Firebase db and execute a function when an object's time matches the current time

Background
I have a Node and React based application. I'm using Firebase for my storage and database. In my application, users can fill out a form where they upload an image and select a time for the image to be added to their website. I save each image update as an object in my Firebase database, like so. Images are arranged in ascending order of update time.
user-name: {
  images: [
    {
      src: 'image-src-url',
      updateTime: 1503953587727
    },
    {
      src: 'image-src-url',
      updateTime: 1503958424838
    }
  ]
}
Scale
My application's db could potentially get very large, with a lot of users and images. I'd like to ensure scalability.
Issue
How do I check when a specific image object's time has been met and then execute a function? (I do not need assistance with the actual function being run, just with checking the db for a specific time.)
Attempts
I've thought about doing a cron job using node-cron that checks the entire database every 60s (users can only specify the minute the image will update, not the seconds). Then, if it finds a matching updateTime, it executes my function. My concern is that at a large scale the cron job will take a while to search the db and could potentially miss a time.
I've also thought about dynamically creating a specific cron job for that time whenever the user schedules a new update. I'm unsure how to accomplish this.
Any other methods that may work? Are my concerns about node-cron not valid?
There are two approaches I can think of:
Keep track of the last timestamp you processed
Keep the "things to process" in a queue
Keep track of the last timestamp you processed
When you process items, you use the current timestamp as the cut-off point for your query. Something like:
var now = Date.now();
var query = ref.orderByChild("updateTime").endAt(now)
Now make sure to store this now somewhere (i.e. in your database) so that you can re-use it next time to retrieve the next batch of items:
var previous = ... previous value of now
var now = Date.now();
var query = ref.orderByChild("updateTime").startAt(previous).endAt(now);
With this you're only processing a single slice at a time. The only tricky bit is that somebody might insert a new node with an updateTime that you've already processed. If this is a concern for your use-case, you can prevent them from doing so with a validation rule on updateTime:
".validate": "newData.val() >= root.child('lastProcessed').val()"
As you add more items to the database, you will indeed be querying more items. So there is a scalability limit to this approach, but it should work well for anything up to a few hundred thousand nodes (I haven't tested in a while, so ymmv).
For a few previous questions on list size:
Firebase Performance: How many children per node?
Firebase Scalability Limit
How many records / rows / nodes is alot in firebase?
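Putting this first approach together in one sketch (the ref paths, the firebase-admin usage, and processItem are all assumptions for illustration):

var admin = require("firebase-admin"); // assumes the admin SDK is initialized

var itemsRef = admin.database().ref("user-name/images");
var lastRef = admin.database().ref("lastProcessed");

lastRef.once("value").then(function (snap) {
  var previous = snap.val() || 0; // last cut-off, 0 on the first run
  var now = Date.now();
  var query = itemsRef.orderByChild("updateTime").startAt(previous).endAt(now);
  return query.once("value").then(function (snapshot) {
    snapshot.forEach(function (child) {
      processItem(child.val()); // hypothetical per-item handler
    });
    return lastRef.set(now); // remember the cut-off for the next run
  });
});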
Keep the "things to process" in a queue
An alternative approach is to keep a queue of items that still need to be processed. The clients add the items that they want processed to the queue, with an updateTime of when they want them processed. Your server picks the items from the queue, performs the necessary updates, and removes each item from the queue:
var now = Date.now();
var query = ref.orderByChild("updateTime").endAt(now);

query.once("value").then(function (snapshot) {
  snapshot.forEach(function (child) {
    // TODO: process the child node

    // remove the child node from the queue
    child.ref.remove();
  });
});
The difference with the earlier approach is that a queue's stable state is going to be empty (or at least quite small), so your queries will run against a much smaller list. That's also why you won't need to keep track of the last timestamp you processed: any item in the queue up to now is eligible for processing.
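This also combines naturally with the node-cron idea from the question: since the queue stays small, draining it once a minute is cheap. A sketch (ref and the processing step are assumptions):

var cron = require("node-cron");

// run once a minute, matching the minute-level granularity users can pick
cron.schedule("* * * * *", function () {
  var now = Date.now();
  ref.orderByChild("updateTime").endAt(now).once("value")
    .then(function (snapshot) {
      snapshot.forEach(function (child) {
        // TODO: process the child node
        child.ref.remove(); // remove the processed item from the queue
      });
    });
});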
