RxJS - check array of observables with concurrency in interval - node.js

Working on a scheduler with RxJS that checks the array of jobs every second. When a job is finished it is removed from the array. I would like to run this with the .mergeAll(concurrency) parameter so that, for example, only two jobs run at the same time.
Currently I have a workaround, which can be seen here.
What I am trying is something like
Observable
  .interval(1000)
  .timeInterval()
  .merge(...jobProcesses.map(job => Observable.fromPromise(startJob(job.id))))
  .mergeAll(config.concurrency || 10)
  .subscribe();
which obviously doesn't work. Any help would be appreciated.

From the comments, it seems you are simply trying to limit concurrency, and this interval stuff is just a detour. You should be able to get what you need with:
const Rx = require('rxjs/Rx')

// Elapsed seconds since the first call; used only to illustrate the timing.
let startTime = 0
const time = () => {
  if (!startTime)
    startTime = new Date().getTime()
  return Math.round((new Date().getTime() - startTime) / 1000)
}

const jobs = new Rx.Subject() // You may additionally rate-limit this with bufferTime(x).concatAll()
const startJob = j => Rx.Observable.of(undefined).delay(j * 1000).map(() => time())
const concurrency = 2

time() // start the clock

jobs
  .bufferCount(concurrency)                                    // group jobs into batches of `concurrency`
  .concatMap(buf => Rx.Observable.from(buf).flatMap(startJob)) // run one batch at a time
  .subscribe(x => console.log(x))

Rx.Observable.from([3, 1, 3]).subscribe(jobs)
// The last job is only processed after the first two are completed, so you see:
// 1
// 3
// 6
Note that this technically isn't squeezing out the maximum amount of concurrency possible, since it breaks the jobs up into fixed-size batches. If your jobs have significantly uneven processing times, the longest job in a batch will delay pulling work from the next batch.
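If you do need that extra concurrency, a rough alternative (assuming the same jobs Subject and startJob helper from the snippet above) is to hand the limit directly to mergeMap/flatMap, which accepts a concurrency argument in RxJS 5, so a new job starts as soon as any running one finishes:
jobs
  .mergeMap(startJob, concurrency) // at most `concurrency` jobs in flight, no fixed batches
  .subscribe(x => console.log(x))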

Related

How to make code execute at each week end?

I want to execute a piece of code every Sunday at 23:59 (11 pm), basically at the end of each week. However, it should only fire once per week.
A setInterval() function won't cut it here, as the app might be restarted in the meantime.
In case it helps, I had this basic idea:
Set an interval (with setInterval) of every 5-10 seconds and check whether it's Sunday and the hour is 23 (11 pm). However, this solution would be inconsistent and might fire more than once a week. I need a more bullet-proof solution.
You can use any cron module (like https://www.npmjs.com/package/cron) and set a job for 59 23 * * 0:
const { CronJob } = require('cron');
const job = new CronJob('59 23 * * 0', mySundayFunc);
job.start();
How about calculating the remaining time on startup, like this:
// The Unix epoch (1 Jan 1970) fell on a Thursday, so adding four days aligns
// the weekly modulo to Monday 00:00 UTC; subtracting one hour then targets
// Sunday 23:00 UTC.
const WEEK_IN_MS = 604800000;
const ONE_HOUR_IN_MS = 3600000;
const FOUR_DAYS_IN_MS = 4 * WEEK_IN_MS / 7;

function nextInterval() {
  return WEEK_IN_MS - ((Date.now() + FOUR_DAYS_IN_MS) % WEEK_IN_MS) - ONE_HOUR_IN_MS;
}

const interval = nextInterval();
console.log(`run after ${interval} ms`);
setTimeout(
  () => console.log('Do it!!!'),
  interval
);
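Note that setTimeout fires only once; here is a small sketch (reusing the nextInterval() helper above) that re-arms the timer after each run, so the job keeps firing every Sunday while the process stays up:
function scheduleWeekly(job) {
  setTimeout(() => {
    job();               // run the weekly work
    scheduleWeekly(job); // then schedule the following Sunday
  }, nextInterval());
}

scheduleWeekly(() => console.log('Do it!!!'));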

stop all async Tasks when failures exceed a threshold?

I'm using Monix Task for async control.
scenario
tasks are executed in parallel
if failures occur more than X times
stop all tasks that are not yet complete (as quickly as possible)
my solution
I came up with the idea of racing between 1. the result and 2. an error counter, and cancelling the loser.
Via Task.race, if the error counter reaches the threshold first, the remaining tasks are cancelled by Task.race.
experiment
on Ammonite REPL
{
  import $ivy.`io.monix::monix:3.1.0`
  import monix.eval.Task
  import monix.execution.atomic.Atomic
  import scala.concurrent.duration._
  import monix.execution.Scheduler
  //import monix.execution.Scheduler.Implicits.global
  implicit val s = Scheduler.fixedPool("race", 2) // pool size

  val taskSize = 100
  val errCounter = Atomic(0)
  val threshold = 3

  val tasks = (1 to taskSize).map(_ => Task.sleep(100.millis).map(_ => errCounter.increment()))
  val guard = Task(f"stop because too many error: ${errCounter.get()}")
    .restartUntil(_ => errCounter.get() >= threshold)

  val race = Task
    .race(guard, Task.gather(tasks))
    .runToFuture
    .onComplete { case x => println(x); println(f"completed task: ${errCounter.get()}") }
}
issue
The outcome depends on the thread pool size!?
For pool size 1
the outcome is almost always success, i.e. no stop.
Success(Right(.........))
completed task: 100 // all tasks succeeded!
For pool size 2
the result is very non-deterministic between success and failure, and the cancellation is not accurate.
for example:
Success(Left(stop because too many error: 1))
completed task: 98
the cancellation happens as late as after 98 tasks have completed.
the reported error count is oddly small compared to the threshold.
The default global scheduler shows the same behavior.
For pool size 200
it is more deterministic and the stopping happens earlier, and is thus more accurate in the sense that fewer tasks were completed.
Success(Left(stop because too many error: 2))
completed task: 8
The larger the pool size, the better.
If I change Task.gather to Task.sequence, all the issues disappear!
What is the cause of this dependency on the pool size?
How can I improve this, or is there a better alternative for stopping tasks once too many errors occur?
What you're seeing is likely an effect of the Monix scheduler and how it aims for fairness. It's a fairly complex topic, but the documentation and scaladocs are excellent (see: https://monix.io/docs/3x/execution/scheduler.html#execution-model).
When you have only one thread (or few) it takes a while until the "guard" Task gets another turn to check. With Task.gather you start 100 tasks at once, so the scheduler is very busy and the "guard" cannot check again until the other tasks are already done.
If you have one thread per task the scheduler cannot guarantee fairness and therefore the "guard" unfairly checks much more frequently and can finish sooner.
If you use Task.sequence those 100 tasks are executed sequentially, which is why the "guard" task gets many more opportunities to finish as soon as needed. If you want to keep your code the way it is, you could use Task.gatherN(parallelism = 4), which will limit the parallelism and therefore allow your "guard" to check more often (a middle ground between Task.sequence and Task.gather).
It looks a bit like Go code to me (using Task.race like Go's select), and you're also using unconstrained side effects, which further complicates understanding what's going on. I've tried to rewrite your program in a way that's more idiomatic, and for complicated concurrency I usually reach for streams like Observable:
import cats.effect.concurrent.Ref
import monix.eval.Task
import monix.execution.Scheduler
import monix.reactive.Observable
import scala.concurrent.duration._

object ErrorThresholdDemo extends App {
  //import monix.execution.Scheduler.Implicits.global
  implicit val s: Scheduler = Scheduler.fixedPool("race", 2) // pool size

  val taskSize = 100
  val threshold = 30

  val program = for {
    errCounter <- Ref[Task].of(0)

    tasks = (1 to taskSize).map(n => Task.sleep(100.millis).flatMap(_ => errCounter.update(_ + (n % 2))))

    tasksFinishedCount <- Observable
      .fromIterable(tasks)
      .mapParallelUnordered(parallelism = 4) { task =>
        task
      }
      .takeUntilEval(errCounter.get.restartUntil(_ >= threshold))
      .map(_ => 1)
      .sumL

    errorCount <- errCounter.get
    _ <- Task(println(f"completed tasks: $tasksFinishedCount, errors: $errorCount"))
  } yield ()

  program.runSyncUnsafe()
}
As you can see, I no longer use global mutable side effects but instead Ref, which internally also uses Atomic but provides a functional API that we can use with Task.
For demonstration purposes I also changed the threshold to 30 and only every other task will "error". So the expected output is always around completed tasks: 60, errors: 30 no matter the thread-pool size.
I'm still using polling with errCounter.get.restartUntil(_ >= threshold) which might burn a bit too much CPU for my taste but it's close to your original idea and works well.
Usually I don't create a list of tasks up front but instead throw the inputs into the Observable and create the tasks inside of .mapParallelUnordered. This code keeps your list which is why there is no real mapping involved (it already contains tasks).
You can choose your desired parallelism much like with Task.gatherN which is pretty nice imo.
Let me know if anything is still unclear :)

Change Feed Processor Lib does not honour ChangeFeedProcessorOptions FeedPollDelay / CheckPointFrequency

I am following this sample code (https://github.com/Azure/azure-documentdb-changefeedprocessor-dotnet#example) to register an observer to process change feed in cosmos db collection.
I am creating new documents in the cosmos db collection using a utility (say create 400 documents within a for loop).
I am using a FeedPollDelay of 30 seconds. But it doesn't seem to be honoured by the CFP lib. The ProcessChangesAsync method gets invoked repeatedly, even before the feed poll delay interval expires.
In the first batch, around 60 docs are retrieved and in the second batch around 20 docs are retrieved, in the third batch around 100 docs are retrieved.
DocumentCollectionInfo feedCollectionInfo = new DocumentCollectionInfo()
{
    DatabaseName = databaseName,
    CollectionName = monitoredCollectionName,
    Uri = new Uri(uri),
    MasterKey = masterKey
};

DocumentCollectionInfo leaseCollectionInfo = new DocumentCollectionInfo()
{
    DatabaseName = databaseName,
    CollectionName = leaseCollectionName,
    Uri = new Uri(uri),
    MasterKey = masterKey
};

ChangeFeedProcessorOptions feedProcessorOptions = new ChangeFeedProcessorOptions()
{
    FeedPollDelay = TimeSpan.FromSeconds(30)
    //LeasePrefix = Guid.NewGuid().ToString(),
    //MaxItemCount = 100
};

ChangeFeedProcessorBuilder builder = new ChangeFeedProcessorBuilder();
processor = await builder
    .WithHostName(hostName)
    .WithFeedCollection(feedCollectionInfo)
    .WithLeaseCollection(leaseCollectionInfo)
    .WithProcessorOptions(feedProcessorOptions)
    .WithObserver<LiveWorkItemChangeFeedObserver>()
    .BuildAsync();

await processor.StartAsync();
Receiving 60 docs in the first batch is fine. But I am expecting the second batch to be invoked with the remaining 340 docs in a single batch after the feed poll delay (30 seconds) expires.
Instead, the ProcessChangesAsync method gets triggered frequently and this option is not being honoured.
FeedPollDelay is used when the Change Feed Processor reads the Change Feed and finds no new changes, not in-between each batch.
Example flow:
CFP polls for changes, finds X.
ProcessChangesAsync is called with X
After ProcessChangesAsync finishes, CFP immediately polls for changes, finds Y.
ProcessChangesAsync is called with Y.
After ProcessChangesAsync finishes, CFP immediately polls for changes, finds nothing, waits FeedPollDelay.
CFP polls for changes, finds Z.
ProcessChangesAsync is called with Z
After ProcessChangesAsync finishes, CFP immediately polls for changes, finds nothing, waits FeedPollDelay.
Etc….

How to prevent Execution usage limit in scheduled scripts

I am using a scheduled script that creates custom records based on criteria. Every time the scheduled script runs it should create approximately 100,000 records, but the script is timing out after creating 5,000 or 10,000 records. I am using the snippet below to avoid the script execution usage limit, but even with it the script is not working. Can anyone please suggest something or provide any information? Any suggestions are welcome and highly appreciated.
In my for loop I am using the snippet below. Even with it included, the scheduled script is only able to create up to 5,000 or 10,000 records.
if (nlapiGetContext().getRemainingUsage() <= 0 && (i + 1) < results.length)
{
    var stateMain = nlapiYieldScript();
}
If you are going to reschedule using the nlapiYieldScript mechanism, then you also need to use nlapiSetRecoveryPoint at the point where you wish the script to resume. See the Help documentation for each of these methods, as well as the page titled Setting Recovery Points in Scheduled Scripts
Be aware that nlapiSetRecoveryPoint uses 100 governance units, so you will need to account for this in your getRemainingUsage check.
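A very rough sketch of how that could look inside the record-creation loop (assumptions on my part: results is the array driving the loop, and 300 remaining units is enough headroom for one more iteration plus the recovery point; check the Help topics above for the exact resume behaviour):
for (var i = 0; i < results.length; i++) {
    // ... create the custom record for results[i] here ...
    if (nlapiGetContext().getRemainingUsage() < 300 && (i + 1) < results.length) {
        var state = nlapiSetRecoveryPoint(); // ~100 governance units; the script resumes from here
        if (state.status !== 'RESUME') {     // only yield on the first pass through this point
            nlapiYieldScript();              // reschedule the script with fresh governance
        }
    }
}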
@rajesh, you are only checking the remaining usage. Also check the execution time limit, which is 1 hour for any scheduled script. Something like the snippet below:
var checkIfYieldOrContinue = function(startTime) {
    var endTime = new Date().getTime();
    var timeElapsed = (endTime * 0.001) - (startTime * 0.001); // elapsed seconds
    if (nlapiGetContext().getRemainingUsage() < 3000 ||
        timeElapsed > 3500) { // 3500 secs, just under the 1-hour limit
        nlapiLogExecution('AUDIT', 'Remaining Usage: ' + nlapiGetContext().getRemainingUsage() + '. Time elapsed: ' + timeElapsed);
        startTime = new Date().getTime();
        var yieldStatus = nlapiYieldScript();
        nlapiLogExecution('AUDIT', 'script yielded.' + yieldStatus.status);
        nlapiLogExecution('AUDIT', 'script yielded reason.' + yieldStatus.reason);
        nlapiLogExecution('AUDIT', 'script yielded information.' + yieldStatus.information);
    }
};
Capture the start time once before the loop, then inside your for loop you can call this method like:
var startTime = new Date().getTime(); // before the for loop

if ((i + 1) < results.length) {
    //do your operations here and then...
    checkIfYieldOrContinue(startTime);
}
I have a script that lets you process an array like a forEach. It checks each iteration, tracks the maximum usage an iteration has consumed, and yields when there is not enough usage left to cover that maximum.
Head over to https://github.com/BKnights/KotN-Netsuite and download simpleBatch.js

Setting a timer in Node.js

I need to run code in Node.js every 24 hours. I came across a function called setTimeout. Below is my code snippet
var et = require('elementtree');
var XML = et.XML;
var ElementTree = et.ElementTree;
var element = et.Element;
var subElement = et.SubElement;
var data='<?xml version="1.0"?><entries><entry><TenantId>12345</TenantId><ServiceName>MaaS</ServiceName><ResourceID>enAAAA</ResourceID><UsageID>550e8400-e29b-41d4-a716-446655440000</UsageID><EventType>create</EventType><category term="monitoring.entity.create"/><DataCenter>global</DataCenter><Region>global</Region><StartTime>Sun Apr 29 2012 16:37:32 GMT-0700 (PDT)</StartTime><ResourceName>entity</ResourceName></entry><entry><TenantId>44445</TenantId><ServiceName>MaaS</ServiceName><ResourceID>enAAAA</ResourceID><UsageID>550e8400-e29b-41d4-a716-fffffffff000</UsageID><EventType>update</EventType><category term="monitoring.entity.update"/><DataCenter>global</DataCenter><Region>global</Region><StartTime>Sun Apr 29 2012 16:40:32 GMT-0700 (PDT)</StartTime><ResourceName>entity</ResourceName></entry></entries>'
etree = et.parse(data);
var t = process.hrtime();
// [ 1800216, 927643717 ]
setTimeout(function () {
  t = process.hrtime(t);
  // [ 1, 6962306 ]
  console.log(etree.findall('./entry/TenantId').length); // 2
  console.log('benchmark took %d seconds and %d nanoseconds', t[0], t[1]);
  // benchmark took 1 seconds and 6962306 nanoseconds
}, 1000);
I want to run the above code once per hour and parse the data. For reference, I used one second as the timer value. Any idea on how to proceed would be very helpful.
There are basically three ways to go:
setInterval()
The setTimeout(f, n) function waits n milliseconds and calls function f.
The setInterval(f, n) function calls f every n milliseconds.
setInterval(function () {
  console.log('test');
}, 60 * 60 * 1000);
This prints test every hour. You could just throw your code (except the require statements) into a setInterval(). However, that seems kind of ugly to me. I'd rather go with:
Scheduled Tasks
Most operating systems have a way of scheduling tasks. On Windows this is called "Scheduled Tasks"; on Linux, look for cron.
Use a library
As I realized while answering, one could even see this as a duplicate of that question.
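For example, with the cron package shown in the earlier answer (assumption: the cron npm module is installed), a job that runs once a day looks like:
const { CronJob } = require('cron');

// '0 0 * * *' = every day at midnight; '0 * * * *' would run once an hour.
const job = new CronJob('0 0 * * *', () => {
  // parse the feed here, e.g. etree = et.parse(data)
  console.log('daily run at', new Date().toISOString());
});
job.start();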
