I'm trying to build a Node application using worker threads, divided into three parts.
The primary thread that delegates tasks
A dedicated worker thread that updates shared data
A pool of worker threads that run calculations on shared data
The shared data is in the form of several SharedArrayBuffer objects operating like a pseudo-database. I would like to be able to update the data without needing to pause calculations, and I'm ok with a few tasks using slightly stale data. The flow I've come up with is:
Primary thread passes data to update thread
Update thread creates a whole new SharedArrayBuffer and populates it with updated data.
Update thread returns a pointer to the new buffer back to primary thread.
Primary thread caches the latest pointer in a variable, overwriting its previous value, and passes it to each worker thread with each task.
Worker threads don't retain these pointers at all after executing their operations.
The problem is, this seems to create a memory leak in the resident state stack when I run a prototype that frequently makes updates and swaps out the shared buffers. Garbage collection appears to make a couple of passes removing the discarded buffers, but then it climbs continuously until the application slows and eventually hangs or crashes.
How can I guarantee that a SharedArrayBuffer will get picked up by garbage collection when I'm done with it, or it it even possible? I've seen hints to the effect that as long as all references to it are removed from all threads it will eventually get picked up, but not a clear answer.
I'm using the threads.js library to abstract the worker thread operations. Here's a summary of my prototype:
app.ts:
import { ModuleThread, Pool, spawn, Worker } from "threads";
import { WriterModule } from "./workers/writer-worker";
import { CalculateModule } from "./workers/calculate-worker";
class App {
calculatePool = Pool<ModuleThread<CalculateModule>>
(() => spawn(new Worker('./workers/calculate-worker')), { size: 6 });
writerThread: ModuleThread<WriterModule>;
sharedBuffer: SharedArrayBuffer;
dataView: DataView;
constructor() {
this.sharedBuffer = new SharedArrayBuffer(1000000);
this.dataView = new DataView(this.sharedBuffer);
}
async start(): Promise<void> {
this.writerThread = await spawn<WriterModule>(new Worker('./workers/writer-worker'));
await this.writerThread.init(this.sharedBuffer);
await this.update();
// Arbitrary delay between updates
setInterval(() => this.update(), 5000);
while (true) {
// Arbitrary delay between tasks
await new Promise<void>(resolve => setTimeout(() => resolve(), 250));
this.calculate();
}
}
async update(): Promise<void> {
const updates: any[] = [];
// generates updates
this.sharedBuffer = await this.writerThread.update(updates);
this.dataView = new DataView(this.sharedBuffer);
}
async calculate(): Promise<void> {
const task = this.calculatePool.queue(async (calc) => calc.calculate(this.sharedBuffer));
const sum: number = await task;
// Use result
}
}
const app = new App();
app.start();
writer-worker.ts:
import { expose } from "threads";
let sharedBuffer: SharedArrayBuffer;
const writerModule = {
async init(startingBuffer: SharedArrayBuffer): Promise<void> {
sharedBuffer = startingBuffer;
},
async update(data: any[]): Promise<SharedArrayBuffer> {
// Arbitrary update time
await new Promise<void>(resolve => setTimeout(() => resolve(), 500));
const newSharedBuffer = new SharedArrayBuffer(1000000);
// Copy some values from the old buffer over, perform some mutations, etc.
sharedBuffer = newSharedBuffer;
return sharedBuffer;
},
}
export type WriterModule = typeof writerModule;
expose(writerModule);
calculate-worker.ts
import { expose } from "threads";
const calculateModule = {
async calculate(sharedBuffer: SharedArrayBuffer): Promise<number> {
const view = new DataView(sharedBuffer);
// Arbitrary calculation time
await new Promise<void>(resolve => setTimeout(() => resolve(), 100));
// Run arbitrary calculation
return sum;
}
}
export type CalculateModule = typeof calculateModule;
expose(calculateModule);
Related
In Node.js there is the cluster module to utilize all available cores on the machine which is pretty great, especially when used with the node module pm2. But I am pretty stoked about some features of Deno but I have wondered about how to best run it on a multi-core machine.
I understand that there is workers which works great for a specific task but for normal web requests it seems like performance of multi-core machines is wasted somewhat? What is the best strategy to get maximum availability and utilization of my hardware in Deno?
I am a bit worried that if you only have a single process going on and there is some CPU intensive task for whatever reason it will "block" all other requests coming in. In node.js the cluster module would solve this, since another process would handle the request but I am unsure on how to handle this in Deno?
I think you could run several instances in Deno on different ports and then have some kind of load balancer in front of it but that seems like quite a complex setup in comparison. I also get that you could use some kind of service like Deno Deploy or whatever, but I already have hardware that I want to run it on.
What are the alternatives for me?
Thanks in advance for you sage advice and better wisdom.
In Deno, like in a web browser, you should be able to use Web Workers to utilize 100% of a multi-core CPU.
In a cluster you need a "manager" node (which can be a worker itself too as needed/appropriate). In a similar fashion the Web Worker API can be used to create however many dedicated workers as desired. This means the main thread should never block as it can delegate all tasks that will potentially block to its workers. Tasks that won't block (e.g. simple database or other I/O bound calls) can be done directly on the main thread like normal.
Deno also supports navigator.hardwareConcurrency so you can query about available hardware and determine the number of desired workers accordingly. You might not need to define any limits though. Spawning a new dedicated worker from the same source as a previously spawned dedicated worker may be fast enough to do so on demand. Even so there may be value in reusing dedicated workers rather than spawning a new one for every request.
With Transferable Objects large data sets can be made available to/from workers without copying the data. This along with messaging makes it pretty straight forward to delegate tasks while avoiding performance bottlenecks from copying large data sets.
Depending on your use cases you might also use a library like Comlink "that removes the mental barrier of thinking about postMessage and hides the fact that you are working with workers."
e.g.
main.ts
import { serve } from "https://deno.land/std#0.133.0/http/server.ts";
import ComlinkRequestHandler from "./ComlinkRequestHandler.ts";
serve(async function handler(request) {
const worker = new Worker(new URL("./worker.ts", import.meta.url).href, {
type: "module",
});
const handler = ComlinkRequestHandler.wrap(worker);
return await handler(request);
});
worker.ts
/// <reference no-default-lib="true"/>
/// <reference lib="deno.worker" />
import ComlinkRequestHandler from "./ComlinkRequestHandler.ts";
ComlinkRequestHandler.expose(async (request) => {
const body = await request.text();
return new Response(`Hello to ${request.url}\n\nReceived:\n\n${body}\n`);
});
ComlinkRequestHandler.ts
import * as Comlink from "https://cdn.skypack.dev/comlink#4.3.1?dts";
interface RequestMessage extends Omit<RequestInit, "body" | "signal"> {
url: string;
headers: Record<string, string>;
hasBody: boolean;
}
interface ResponseMessage extends ResponseInit {
headers: Record<string, string>;
hasBody: boolean;
}
export default class ComlinkRequestHandler {
#handler: (request: Request) => Promise<Response>;
#responseBodyReader: ReadableStreamDefaultReader<Uint8Array> | undefined;
static expose(handler: (request: Request) => Promise<Response>) {
Comlink.expose(new ComlinkRequestHandler(handler));
}
static wrap(worker: Worker) {
const { handleRequest, nextResponseBodyChunk } =
Comlink.wrap<ComlinkRequestHandler>(worker);
return async (request: Request): Promise<Response> => {
const requestBodyReader = request.body?.getReader();
const requestMessage: RequestMessage = {
url: request.url,
hasBody: requestBodyReader !== undefined,
cache: request.cache,
credentials: request.credentials,
headers: Object.fromEntries(request.headers.entries()),
integrity: request.integrity,
keepalive: request.keepalive,
method: request.method,
mode: request.mode,
redirect: request.redirect,
referrer: request.referrer,
referrerPolicy: request.referrerPolicy,
};
const nextRequestBodyChunk = Comlink.proxy(async () => {
if (requestBodyReader === undefined) return undefined;
const { value } = await requestBodyReader.read();
return value;
});
const { hasBody: responseHasBody, ...responseInit } = await handleRequest(
requestMessage,
nextRequestBodyChunk
);
const responseBodyInit: BodyInit | null = responseHasBody
? new ReadableStream({
start(controller) {
async function push() {
const value = await nextResponseBodyChunk();
if (value === undefined) {
controller.close();
return;
}
controller.enqueue(value);
push();
}
push();
},
})
: null;
return new Response(responseBodyInit, responseInit);
};
}
constructor(handler: (request: Request) => Promise<Response>) {
this.#handler = handler;
}
async handleRequest(
{ url, hasBody, ...init }: RequestMessage,
nextRequestBodyChunk: () => Promise<Uint8Array | undefined>
): Promise<ResponseMessage> {
const request = new Request(
url,
hasBody
? {
...init,
body: new ReadableStream({
start(controller) {
async function push() {
const value = await nextRequestBodyChunk();
if (value === undefined) {
controller.close();
return;
}
controller.enqueue(value);
push();
}
push();
},
}),
}
: init
);
const response = await this.#handler(request);
this.#responseBodyReader = response.body?.getReader();
return {
hasBody: this.#responseBodyReader !== undefined,
headers: Object.fromEntries(response.headers.entries()),
status: response.status,
statusText: response.statusText,
};
}
async nextResponseBodyChunk(): Promise<Uint8Array | undefined> {
if (this.#responseBodyReader === undefined) return undefined;
const { value } = await this.#responseBodyReader.read();
return value;
}
}
Example usage:
% deno run --allow-net --allow-read main.ts
% curl -X POST --data '{"answer":42}' http://localhost:8000/foo/bar
Hello to http://localhost:8000/foo/bar
Received:
{"answer":42}
There's probably a better way to do this (e.g. via Comlink.transferHandlers and registering transfer handlers for Request, Response, and/or ReadableStream) but the idea is the same and will handle even large request or response payloads as the bodies are streamed via messaging.
It all depends on what workload you would like to push to the threads. If you are happy with the performance of the built in Deno HTTP server running on the main thread but you need to leverage multithreading to create the responses more efficiently then it's simple as of Deno v1.29.4.
The HTTP server will give you an async iterator server like
import { serve } from "https://deno.land/std/http/server.ts";
const server = serve({ port: 8000 });
Then you may use the built in functionality pooledMap like
import { pooledMap } from "https://deno.land/std#0.173.0/async/pool.ts";
const ress = pooledMap( window.navigator.hardwareConcurrency - 1
, server
, req => new Promise(v => v(respondWith(req))
);
for await (const res of ress) {
// respond with res
}
Where respondWith is just a function which handles the recieved request and generates the respond object. If respondWith is already an async function then you don't even need to wrap it into a promise.
However, in case you would like to run multiple Deno HTTP servers on separate therads then that's also possible but you need a load balancer like GoBetween at the head. In this case you should instantiate multiple Deno HTTP servers at separate threads and receive their requsets at the main thread as separate async iterators. To achieve this, per thread you can do like;
At the worker side i.e. ./servers/server_800X.ts;
import { serve } from "https://deno.land/std/http/server.ts";
const server = serve({ port: 800X });
console.log("Listening on http://localhost:800X/");
for await (const req of server) {
postMessage({ type: "request", req });
}
and at the main thread you can easily convert the correspodning worker http server into an async iterator like
async function* server_800X() {
worker_800X.onmessage = event => {
if (event.data.type === "request") {
yield event.data.req;
}
};
}
for await (const req of server_800X()) {
// Handle the request here in the main thread
}
You should also be able to multiplex either the HTTP (req) or the res async iterators by using the MuxAsyncIterators functionality in to a single stream and then spawn by pooledMap. So if you have 2 http servers working on server_8000.ts and server_8001.ts then you can multiplex them into a single async iterator like
const muxedServer = new MuxAsyncIterator<Request>();
muxedServer.add(server_8000);
muxedServer.add(server_8001);
for await (const req of muxedServer) {
// repond accordingly(*)
}
Obviously you should also be able to spawn new threads to process requests received from the muxedServer by utilizing pooledMap as shown above.
(*) In case you choose to use a load balancer and multiple Deno http servers then you should assign special headers to the requests at the load balancer, designating the server ID that it's been diverted to. This way, by inspecting this speical header you can decide from which server to respond for any particular request.
I am trying to implement a class that will perform an action once every second, it works as expected but I am unsure about memory leaks. This code I'm writing will have an uptime of several months.
Will the code below lead to memory leaks since it's technically recursion which never ends?
class Algorithm{
constructor(){
//there will be many more things in this constructor
//which is why this is a class
const pidTimer = (r,e=0,y=Date.now()-r) => {
this.someFunction();
const now = Date.now();
const dy = now-y;
const err = e+r-dy
const u = err*0.2;
//console.log(dy)
setTimeout(()=>{pidTimer(r,err,now)},r+u);
}
pidTimer(1000);
}
someFunction = () => {}
}
It's not the kind of recursion that has any stack accumulation since the previous pidTimer() function call returns before the setTimeout() fires and calls pidTimer() again. I wouldn't even call this recursion (it's scheduled repeated calling), but that's more a semantic issue.
So, the only place I see there could be some memory leak or excess usage would be inside of this.someFunction(); and that's only because you don't show us the code there to evaluate it and see what it does. The code you show us for pidTimer() itself has no issues on its own.
modern async primitive
There's nothing "wrong" with the current function you have, however I think it could be improved significantly. JavaScript offers a modernized asynchrony atom, Promise and new syntax support async/await. These are preferred over stone-age setTimeout and setInterval as you can easily thread data through asynchronous control flow, stop thinking in terms of "callbacks", and avoid side effects -
class Algorithm {
constructor() {
...
this.runProcess(...)
}
async runProcess(...) { // async
while (true) { // loop instead of recursion
await sleep(...) // sleep some amount of time
this.someFunction() // do work
... // adjust timer variables
}
}
}
sleep is a simple function which resolves a promise after a specified milliseconds value, ms -
function sleep(ms) {
return new Promise(r => setTimeout(r, ms)) // promise
}
async iterables
But see how this.someFunction() doesn't return anything? It would be nice if we could capture the data from someFunction and make it available to our caller. By making runProcess an async generator and implementing Symbol.asyncIterator we can easily handle asynchrony and stop side effects -
class Algorithm {
constructor() {
...
this.data = this.runProcess(...) // assign this.data
}
async *runProcess(...) { // async generator
while (true) {
await sleep(...)
yield this.someFunction() // yield
...
}
}
[Symbol.asyncIterator]() { // iterator
return this.data
}
}
Now the caller has control over what happens as the data comes in from this.someFunction. Below we write to console.log but you could easily swap this for an API call or write to file system -
const foo = new Algorithm(...)
for await (const data of foo)
console.log("process data", data) // or API call, or write to file system, etc
additional control
You can easily add control of the process to by using additional data members. Below we swap out while(true) with a conditional and allow the caller to stop the process -
class Algorithm {
constructor() {
...
}
async *runProcess(...) {
this.running = true // start
while (this.running) { // conditional loop
...
}
}
haltProcess() {
this.running = false // stop
}
...
}
demo
Here's a functioning demo including the concepts above. Note we only implement halt here because run is an infinite generator. Manual halting is not necessary for finite generators. Verify the results in your own browser by running the snippet -
class Algorithm {
async *run() {
this.running = true
while(this.running) {
await sleep(1000)
yield this.someFunction()
}
}
halt() {
this.running = false
}
someFunction() {
return Math.random()
}
[Symbol.asyncIterator] = this.run
}
function sleep(ms) {
return new Promise(r => setTimeout(r, ms))
}
async function main() {
const foo = new Algorithm // init
setTimeout(_ => foo.halt(), 10000) // stop at some point, for demo
for await (const x of foo) // iterate
console.log("data", x) // log, api call, write fs, etc
return "done" // return something when done
}
main().then(console.log, console.error) // "done"
data 0.3953947360028206
data 0.18754462176783115
data 0.23690422070864803
data 0.11237466374294014
data 0.5123244720637253
data 0.39818889343799635
data 0.08627407687877853
data 0.3861902404922477
data 0.8358471443658225
data 0.2770336562516085
done
My application receives data from external sources and groups it (matching group items and shows result as HTML table, up to 20k msg/sec).
The problem is shared memory: application works in real time, I receive messages with flags "create, update, delete" so I store all in RAM, no need for any database. But when I try clustering my application messages are missed for some clusters (I tried cluster with pm2).
So now I am trying to scale my application with WorkerThreads, but communication via parentPort.postMessage/worker.postMessage requires a lot of changes to the application. So now I am trying to share memory via SharedArrayBuffer, but I don't understand how to share an array of objects between master and workers.
const {
Worker, isMainThread, parentPort, workerData
} = require('worker_threads');
if (isMainThread) {
const sab = new SharedArrayBuffer(1024);
sab[0] = {foo: 'bar'};
const worker = new Worker(__filename, {
workerData: sab
});
setInterval(() => {
console.log( sab[0] ); // always {foo: 'bar'}
}, 500);
} else {
let sab = workerData;
sab.foo = false; // changing "foo" value at worker but not in main thread
}
setInterval(() => {}, 1000);
I'm creating a program where I constantly run and stop async code, but I need a good way to stop the code.
Currently, I have tried to methods:
Method 1:
When a method is running, and another method is called to stop the first method, I start an infinite loop to stop that code from running and then remove the method from the queue(array)
I'm 100% sure that this is the worst way to accomplish it, and it works very buggy.
Code:
class test{
async Start(){
const response = await request(options);
if(stopped){
while(true){
await timeout(10)
}
}
}
}
Code 2:
var tests = [];
Start(){
const test = new test();
tests.push(test)
tests.Start();
}
Stop(){
tests.forEach((t, i) => {t.stopped = true;};
tests = [];
}
Method 2:
I load the different methods into Workers, and when I need to stop the code, I just terminate the Worker.
It always takes a lot of time(1 sec) to create the Worker, and therefore not the best way, since I need the code to run without 1-2 sec pauses.
Code:
const Worker = require("tiny-worker");
const code = new Worker(path.resolve(__dirname, "./Code/Code.js"))
Stopping:
code.terminate()
Is there any other way that I can stop async code?
The program contains Request using nodejs Request-promise module, so program is waiting for requests, it's hard to stop the code without one of the 2 methods.
Is there any other way that I can stop async code?
Keep in mind the basic of how Nodejs works. I think there is some misunderstanding here.
It execute the actual function in the actual context, if encounters an async operation the event loop will schedule it's execetution somewhere in the future. There is no way to remove that scheduled execution.
More info on event loop here.
In general for manage this kind of situations you shuold use flags or semaphores.
The program contains Request using nodejs Request-promise module, so program is waiting for requests, it's hard to stop the code
If you need to hard "stop the code" you can do something like
func stop() {
process.exit()
}
But if i'm getting it right, you're launching requests every x time, at some point you need to stop sending the request without managing the response.
You can't de-schedule the response managemente portion, but you can add some logic in it to (when it will be runned) check if the "request loop" has been stopped.
let loop_is_stopped = false
let sending_loop = null
func sendRequest() {
const response = await request(options) // "wait here"
// following lines are scheduled after the request promise is resolved
if (loop_is_stopped) {
return
}
// do something with the response
}
func start() {
sending_loop = setInterval(sendRequest, 1000)
}
func stop() {
loop_is_stopped = true
clearInterval(sending_loop)
}
module.exports = { start, stop }
We can use Promise.all without killing whole app (process.exit()), here is my example (you can use another trigger for calling controller.abort()):
const controller = new AbortController();
class Workflow {
static async startTask() {
await new Promise((res) => setTimeout(() => {
res(console.log('RESOLVE'))
}, 3000))
}
}
class ScheduleTask {
static async start() {
return await Promise.all([
new Promise((_res, rej) => { if (controller.signal.aborted) return rej('YAY') }),
Workflow.startTask()
])
}
}
setTimeout(() => {
controller.abort()
console.log("ABORTED!!!");
}, 1500)
const run = async () => {
try {
await ScheduleTask.start()
console.log("DONE")
} catch (err) {
console.log("ERROR", err.name)
}
}
run()
// ABORTED!!!
// RESOLVE
"DONE" will never be showen.
res will be complited
Maybe would be better to run your code as script with it's own process.pid and when we need to interrupt this functionality we can kill this process by pid in another place of your code process.kill.
I was trying to create an automation script for work, it is supposed to use multiple puppeteer instances to process input strings simultaneously.
the task queue and number of puppeteer instances are controlled by the package generic-pool,
strangely, when i run the script on ubuntu or debian, it seems that it fells into an infinite loop. tries to run infinite number of puppeteer instances. while when run on windows, the output was normal.
const puppeteer = require('puppeteer');
const genericPool = require('generic-pool');
const faker = require('faker');
let options = require('./options');
let i = 0;
let proxies = [...options.proxy];
const pool = genericPool.createPool({
create: async () => {
i++;
console.log(`create instance ${i}`);
if (!proxies.length) {
proxies = [...options.proxy];
}
let {control = null, proxy} = proxies.pop();
let instance = await puppeteer.launch({
headless: true,
args: [
`--proxy-server=${proxy}`,
]
});
instance._own = {
proxy,
tor: control,
numInstance: i,
};
return instance;
},
destroy: async instance => {
console.log('destroy instance', instance._own.numInstance);
await instance.close()
},
}, {
max: 3,
min: 1,
});
async function run(emails = []) {
console.log('Processing', emails.length);
const promises = emails.map(email => {
console.log('Processing', email)
pool.acquire()
.then(browser => {
console.log(`${email} handled`)
pool.destroy(browser);})
})
await Promise.all(promises)
await pool.drain();
await pool.clear();
}
let emails = [a,b,c,d,e,];
run(emails)
Output
create instance 1
Processing 10
Processing Stacey_Haley52
Processing Polly.Block
create instance 2
Processing Shanny_Hudson59
Processing Vivianne36
Processing Jayda_Ullrich
Processing Cheyenne_Quitzon
Processing Katheryn20
Processing Jamarcus74
Processing Lenore.Osinski
Processing Hobart75
create instance 3
create instance 4
create instance 5
create instance 6
create instance 7
create instance 8
create instance 9
is it because of my async functions? How can I fix it?
Appreciate your help!
Edit 1. modified according to #James suggested
The main problem you are trying to solve,
It is supposed to use multiple puppeteer instances to process input strings simultaneously.
Promise Queue
You can use a rather simple solution that involves a simple promise queue. We can use p-queue package to limit the concurrency as we wish. I used this on multiple scraping projects to always test things out.
Here is how you can use it.
// emails to handle
let emails = [a, b, c, d, e, ];
// create a promise queue
const PQueue = require('p-queue');
// create queue with concurrency, ie: how many instances we want to run at once
const queue = new PQueue({
concurrency: 1
});
// single task processor
const createInstance = async (email) => {
let instance = await puppeteer.launch({
headless: true,
args: [
`--proxy-server=${proxy}`,
]
});
instance._own = {
proxy,
tor: control,
numInstance: i,
};
console.log('email:', email)
return instance;
}
// add tasks to queue
for (let email of emails) {
queue.add(async () => createInstance(email))
}
Generic Pool Infinite Loop Problem
I removed all kind of puppeteer related code from your sample code and saw how it was still producing the infinite output to console.
create instance 70326
create instance 70327
create instance 70328
create instance 70329
create instance 70330
create instance 70331
...
Now, if you test few times, you will see it will throw the loop only if you something on your code is crashing. The culprit is this pool.acquire() promise, which is just re queuing on error.
To find what is causing the crash, use the following events,
pool.on("factoryCreateError", function(err) {
console.log('factoryCreateError',err);
});
pool.on("factoryDestroyError", function(err) {
console.log('factoryDestroyError',err);
});
There are some issues related to this:
acquire() never resolves/rejects if factory always rejects, here.
About the acquire function in pool.js, here.
.acquire() doesn't reject when resource creation fails, here.
Good luck!
You want to return from your map rather than await, also don't await inside the destroy call, return the result and you can chain these e.g.
const promises = emails.map(e => pool.acquire().then(pool.destroy));
Or alternatively, you could just get rid of destroy completely e.g.
pool.acquire().then(b => b.close())