Async progress tracking of compute-heavy node process via DB - node.js

I have a compute-intensive, not-very-quick Node.js program. To provide some feedback about what it's doing, I'd like to update the status in a DB table to show what's going on.
A heavily stripped-down example is shown below.
Example PG declaration for the table:
sophia=> create table tracking(progress integer);
CREATE TABLE
sophia=> insert into tracking values(0);
INSERT 0 1
Node code:
const { Pool, Client, types } = require('pg')
// See https://node-postgres.com/api/pool
const db = new Pool({
    host: process.env.DB_ENDPOINT,
    user: process.env.DB_USERNAME,
    password: process.env.DB_PASSWORD,
    database: process.env.DB_NAME,
    port: process.env.DB_PORT
})
for (let i = 0; i < 10; i++) {
    // Do compute-intensive stuff
    for (let j = 0; j < 1000000000; j++) {}
    // Mark progress
    console.log('i = ', i)
    db.query(`update tracking set progress = ${i} where progress <= ${i} returning progress`).then(res => {
        console.log('Progress', res.rows[0].progress)
    })
}
The above is significantly simpler than my real code in the following ways:
It's just plain node - the real code runs in Lambda in AWS
This is just churning through busy-loops. The real code has serious business logic in which each step is different but will use certain elements of the previous operation.
The table is a trivial one-row, one-col example. The real one would track multiple elements based on different users, etc.
If these were really fire-and-forget, I'd expect them to update the DB and print the returned result shortly after each operation completed, with some randomness and occasional delay. What I'm actually seeing is that no updates happen at all until after the heavy computing is finished, which renders it useless.
16:17:16:~/ $ node heavy.js
i = 0
i = 1
i = 2
i = 3
i = 4
i = 5
i = 6
i = 7
i = 8
i = 9
Progress 0
Progress 1
Progress undefined
Progress 3
Progress undefined
Progress undefined
Progress 6
Progress undefined
Progress 8
Progress 9
This feels like a problem that must have been solved before. If there's a lib for doing this kind of operation, great. If there's a simple trick for getting the DB operations to trigger shortly after the result, that's fine. If there's some lambda-specific option for this, that'd also be great. Synchronously waiting for the DB at each step is obviously possible but seems really silly.
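For what it's worth, the behaviour above follows from Node's single-threaded event loop: the synchronous busy-loops never yield, so the socket writes queued by db.query() can't be flushed until the whole loop ends. A minimal sketch of one fix (no real database here; fakeDbUpdate is a stand-in assumption for db.query) showing that awaiting each update yields to the event loop and lets the write complete before the next compute step:

```javascript
// Stand-in for db.query(): resolves on a later event-loop tick,
// the way a real socket round trip would.
function fakeDbUpdate(events, i) {
  return new Promise(resolve => setImmediate(() => {
    events.push('update ' + i);
    resolve();
  }));
}

async function run() {
  const events = [];
  for (let i = 0; i < 3; i++) {
    for (let j = 0; j < 1e7; j++) {} // stand-in for heavy compute
    events.push('compute ' + i);
    await fakeDbUpdate(events, i);  // yield so the pending I/O can flush
  }
  return events;
}

run().then(events => console.log(events.join(', ')));
```

With the await in place, each update lands between compute steps; without it, every update trails the entire loop, which matches the output shown in the question.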

Related

Undefined is not an object in after effects extendscript

I have a script where I import a lot of AEPs and merge them. I am using the following code:
var aepFile = "local location of aep";
var importOpts = new ImportOptions(File(aepFile));
var aeFolder = app.project.importFile(importOpts);
for (var n = 1; n <= aeFolder.numItems; n++) {
    app.layers.add(aeFolder.item(n));
}
The problem is that at some point the error shows that undefined is not an object, referencing the aeFolder variable. I checked: it is imported, but we can't get the data. Maybe it's not synchronous? No, because it runs perfectly the next time. Please help.
Edit:
It points out that aeFolder is undefined; the indexes are 1-based, not 0-based.
Yes, I can reproduce the error in a different project.
Might this happen if the RAM is low?
There are several points where this could go wrong.
What is importOpts?
Is aeFolder.numItems 0-based or 1-based?
Can you reproduce the error in a separate project?
var aeFolder = app.project.importFile(importOpts);
for (var n = 1; n <= aeFolder.numItems; n++) {
    app.layers.add(aeFolder.item(n));
}
Can you please provide an example project for us to test, or more context to reproduce this error?

ElasticSearch Scroll API with multi threading

First of all, I want to let you guys know that I know the basic logic of how the ElasticSearch Scroll API works. To use the Scroll API, first we need to call the search method with some scroll value like 1m; it then returns a _scroll_id that is used for the next consecutive calls on scroll until all of the docs are returned within the loop. But the problem is that I want to run the same process on a multi-threaded basis, not serially. For example:
If I have 300000 documents, then I want to process/get the docs this way
The 1st thread will process the initial 100000 documents
The 2nd thread will process the next 100000 documents
The 3rd thread will process the remaining 100000 documents
So my question is: as I didn't find any way to set the from value on the Scroll API, how can I make the scrolling process faster with threading, rather than processing the documents serially?
My sample python code
if index_name is not None and doc_type is not None and body is not None:
    es = init_es()
    page = es.search(index_name, doc_type, scroll='30s', size=10, body=body)
    sid = page['_scroll_id']
    scroll_size = page['hits']['total']
    # Start scrolling
    while scroll_size > 0:
        print("Scrolling...")
        page = es.scroll(scroll_id=sid, scroll='30s')
        # Update the scroll ID
        sid = page['_scroll_id']
        print("scroll id: " + sid)
        # Get the number of results that we returned in the last scroll
        scroll_size = len(page['hits']['hits'])
        print("scroll size: " + str(scroll_size))
        print("scrolled data :")
        print(page['aggregations'])
Have you tried a sliced scroll? According to the linked docs:
For scroll queries that return a lot of documents it is possible to
split the scroll in multiple slices which can be consumed
independently.
and
Each scroll is independent and can be processed in parallel like any
scroll request.
I have not used this myself (the largest result set I need to process is ~50k documents) but this seems to be what you're looking for.
You should use a sliced scroll for that; see https://github.com/elastic/elasticsearch-dsl-py/issues/817#issuecomment-372271460 for how to do it in Python.
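For reference, the slice parameter goes into the search request body; a rough sketch of the two request bodies for max = 2 slices (the index name my-index is a placeholder):

```
POST /my-index/_search?scroll=1m
{
  "slice": { "id": 0, "max": 2 },
  "query": { "match_all": {} }
}

POST /my-index/_search?scroll=1m
{
  "slice": { "id": 1, "max": 2 },
  "query": { "match_all": {} }
}
```

Each request returns its own _scroll_id, so the two scrolls can be consumed by independent threads.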
I met the same problem as you, but with a doc size of 1.4 million. I had to use a concurrent approach and used 10 threads for data writing.
I wrote the code with a Java thread pool; you can find a similar way in Python.
public class ControllerRunnable implements Runnable {
    private String i_res;
    private String i_scroll_id;
    private int i_index;
    private JSONArray i_hits;
    private JSONObject i_result;

    ControllerRunnable(int index_copy, String _scroll_id_copy) {
        i_index = index_copy;
        i_scroll_id = _scroll_id_copy;
    }

    @Override
    public void run() {
        try {
            s_logger.debug("index:{}", i_index);
            String nexturl = m_scrollUrl.replace("--", i_scroll_id);
            s_logger.debug("nexturl:{}", nexturl);
            i_res = get(nexturl);
            s_logger.debug("i_res:{}", i_res);
            i_result = JSONObject.parseObject(i_res);
            if (i_result == null) {
                s_logger.info("controller thread parsed result object NULL, res:{}", i_res);
                s_counter++;
                return;
            }
            i_scroll_id = (String) i_result.get("_scroll_id");
            i_hits = i_result.getJSONObject("hits").getJSONArray("hits");
            s_logger.debug("hits content:{}\n", i_hits.toString());
            s_logger.info("hits_size:{}", i_hits.size());
            if (i_hits.size() > 0) {
                int per_thread_data_num = i_hits.size() / s_threadnumber;
                for (int i = 0; i < s_threadnumber; i++) {
                    Runnable worker = new DataRunnable(i * per_thread_data_num,
                            (i + 1) * per_thread_data_num);
                    m_executor.execute(worker);
                }
                // Wait until all threads are finished
                m_executor.awaitTermination(1, TimeUnit.SECONDS);
            } else {
                s_counter++;
                return;
            }
        } catch (Exception e) {
            s_logger.error(e.getMessage(), e);
        }
    }
}
Scroll must be synchronous; that is its logic.
You can use multiple threads, though: that is exactly what elasticsearch is good at, parallelism.
An elasticsearch index is composed of shards; this is the physical storage of your data. Shards can be on the same node or not (better).
On another side, the search API offers a very nice option: _preference (https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-preference.html)
So back to your app:
Get the list of index shards (and nodes)
Create a thread per shard
Do the scroll search on each thread
Et voilà!
Also, you could use the elasticsearch4hadoop plugin, which does exactly that for Spark / PIG / map-reduce / Hive.

Socket.io countdown synchronously? [duplicate]

This question already has answers here:
Sync JS time between multiple devices
(5 answers)
Closed 5 years ago.
On my server I call two emits at the same time, which looks like this.
if (songs.length > 0) {
    socket.emit('data loaded', songs);
    socket.broadcast.to(opponent).emit('data loaded', songs);
}
One is for the opponent and the other for the player himself.
Once the data is loaded, a countdown should appear for both players in my Android app. For me it is important that they see the same number at the same time on their screens; to be precise, the countdown should run synchronized. How can I do this?
As far as JS timers are concerned, there will be a small difference. We can reduce that difference by compensating for latency, measured as the difference between the request and response times to the server.
function syncTime() {
    console.log("syncing time")
    var res = new XMLHttpRequest(); // was missing in the original snippet
    var currentTime = (new Date).getTime();
    res.open('HEAD', document.location, false);
    res.onreadystatechange = function() {
        var latency = (new Date).getTime() - currentTime;
        var timestring = res.getResponseHeader("DATE");
        systemtime = new Date(timestring);
        systemtime.setMilliseconds(systemtime.getMilliseconds() + (latency / 2))
    };
    res.send(null);
}
The elapsed time between sending the request and getting back the response needs to be calculated; divide that value by 2. That gives you a rough value for the latency. If you add that to the time value from the server, you'll be closer to the true server time (the difference will be in microseconds).
Reference: http://ejohn.org/blog/accuracy-of-javascript-time/
Hope this helps.
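The compensation described above boils down to one line of arithmetic; a small sketch (the function name is mine):

```javascript
// Estimate the server clock at the moment the response arrived:
// server timestamp + half the measured round-trip time.
function estimateServerTime(requestSentAt, responseReceivedAt, serverTimestamp) {
  const latency = responseReceivedAt - requestSentAt; // full round trip
  return serverTimestamp + latency / 2;               // one-way estimate
}
```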
I have made an application and I had the same problem. In that case I solved it by leaving the time control to the server: the server sends ticks to the client, and the client increments the time. Maybe in your case you could have problems with the connection; if so, you can let the clients increment the time by themselves and occasionally send a tick with the correct time to re-sync.
I can give you something like the below, but I have not tested it.
This solution has these steps:
Synchronize timers for client and server, so all users have the same difference from the server timer.
For the desired response/request, get the client's time and find the difference from the server time.
Consider the smallest difference as the first countdown to be started.
For each response (socket), subtract the difference from the smallest and let the client's counter start after waiting that long.
The client that gets 0 in the response data will start immediately.
The main problem you may have is the broadcast method, which you can't use with this solution.
This is a post that may help you.
Add the time to the emit message.
Let's say that songs is an object like {"time": timeString, "songs": songsList}.
If we assume the devices' clocks are correct, you can calculate the time the information needs to travel and then just use the server time as the main reference.
The client would then know the time when the countdown should start:
var start = false;
var startTime = 0;
var intervalID;
var myTime = new Date().getMilliseconds();
var delay = 1000 - myTime;

setTimeout(function() {
    intervalID = setInterval(function() {
        myTime = new Date().getTime();
        //console.log(myTime); to check if there is a round number of milliseconds
        if (startTime <= myTime && start === true) { startCountdown(); }
    }, 100); //put 1000 to check every second if the second is round
             //or put 100 or 200 if the second is not round
}, delay);

socket.on('data loaded', function(data) {
    startTime = data.time;
    start = true;
});

function startCountdown() {
    //your time countdown
}
That works fine when the two clients are in the same time zone; otherwise you will need a "time converter" to check whether the time is good despite the offset, if you strictly need the same numbers.
After the countdown has ended you should call clearInterval(intervalID);

CouchDB - Filtered Replication - Can the speed be improved?

I have a single database (300MB, 42,924 documents) consisting of about 20 different kinds of documents from about 200 users. The documents range in size from a few bytes to many kilobytes (150KB or so).
When the server is unloaded, the following replication filter function takes about 2.5 minutes to complete.
When the server is loaded, it takes >10 minutes.
Can anyone comment on whether these times are expected and, if not, suggest how I might optimize things to get better performance?
function(doc, req) {
    acceptedDate = true;
    if (doc.date) {
        var docDate = new Date();
        var dateKey = doc.date;
        docDate.setFullYear(dateKey[0], dateKey[1], dateKey[2]);
        var reqYear = req.query.year;
        var reqMonth = req.query.month;
        var reqDay = req.query.day;
        var reqDate = new Date();
        reqDate.setFullYear(reqYear, reqMonth, reqDay);
        acceptedDate = docDate.getTime() >= reqDate.getTime();
    }
    return doc.user_id && doc.user_id == req.query.userid && doc._id.indexOf("_design") != 0 && acceptedDate;
}
Filtered replication works slowly because, for each fetched document, complex logic runs to decide whether to replicate it or not:
CouchDB fetches the next document;
Because the filter function has to be applied, the document gets converted to JSON;
The JSONified document passes through stdio to the query server;
The query server handles the document and decodes it from JSON;
Now the query server looks up and runs your filter function, which returns a true or false value to CouchDB;
If the result is true, the document is replicated;
Go to p.1 and loop for all documents.
For non-filtered replication, take this list, throw away p.2-5 and let p.6 always be true. This overhead is what slows down the whole replication process.
To significantly improve filtered replication speed, you may use Erlang filters via the native Erlang query server. They run inside CouchDB, don't pass through any stdio interface, and have no JSON decode/encode overhead.
NOTE that the Erlang query server does not run inside a sandbox like the JavaScript one does, so you need to really trust the code you run with it.
Another option is to optimize your filter function, e.g. reduce object creation and method calls, but you won't actually win much with this.
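As a sketch of that last point, the filter in the question can avoid building two Date objects per document by comparing the [year, month, day] arrays directly. This is untested against CouchDB and assumes doc.date is always a 3-element array, as the original filter does:

```javascript
function filter(doc, req) {
    if (doc._id.indexOf("_design") === 0) return false;
    if (!doc.user_id || doc.user_id != req.query.userid) return false;
    if (doc.date) {
        var d = doc.date;
        var r = [Number(req.query.year), Number(req.query.month), Number(req.query.day)];
        // lexicographic compare: accept docs dated on or after the requested date
        for (var i = 0; i < 3; i++) {
            if (d[i] > r[i]) return true;
            if (d[i] < r[i]) return false;
        }
    }
    return true;
}
```

The real win, per the answer above, would still come from porting this to an Erlang filter rather than micro-optimizing the JavaScript.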

How can I implement an anti-spamming technique on my IRC bot?

I run my bot in a public channel with hundreds of users. Yesterday a person came in and just abused it.
I would like to let anyone use the bot, but if they spam commands consecutively, and if they aren't a bot "owner" like me (for when I debug), then I would like to add them to an ignore list that expires in an hour or so.
One way I'm thinking of would be to save all commands by all users in a dictionary such as:
({
    'meder#freenode': [{ command: '.weather 20851', timestamp: 209323023 }],
    'jack#efnet': [{ command: '.seen john' }]
})
I would set up a cron job to flush this out every 24 hours, but I would basically determine whether a person has made X number of commands in a duration of, say, 15 seconds, and add them to an ignore list.
Actually, as I was writing this I thought of a better idea: maybe instead of storing each user's commands, just store the bot's commands in a list and keep pushing until it reaches a limit of, say, 15.
var lastCommands = [], limit = 5;

function handleCommand(timeObj, action) {
    if (lastCommands.length < limit) {
        action();
    } else {
        // enumerate through lastCommands and compare the timestamps of all 5 commands
        // if the user is the same for all 5 commands, and...
        // if the timestamps are all within the vicinity of 20 seconds
        // add the user to the ignoreList
    }
}

watch_for('command', function() {
    handleCommand({ timestamp: 2093293032, user: user }, function() { message.say('hello there!') })
});
I would appreciate any advice on the matter.
Here's a simple algorithm:
Every time a user sends a command to the bot, increment a number that's tied to that user. If this is a new user, create the number for them and set it to 1.
When a user's number is incremented to a certain value (say 15), set it to 100.
Every <period> seconds, run through the list and decrement all the numbers by 1. Zero means the user's number can be freed.
Before executing a command and after incrementing the user's counter, check to see if it exceeds your magic max value (15 above). If it does, exit before executing the command.
This lets you rate limit actions and forgive excesses after a while. Divide your desired ban length by the decrement period to find the number to set when a user exceeds your threshold (100 above). You can also add to the number if a particular user keeps sending commands after they've been banned.
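A rough JavaScript sketch of that algorithm (the class name and the threshold/penalty defaults are mine; tick() corresponds to the periodic decrement described above):

```javascript
class RateLimiter {
  constructor(threshold = 15, penalty = 100) {
    this.threshold = threshold; // commands allowed before tripping
    this.penalty = penalty;     // value set when a user trips the limit
    this.counts = new Map();    // user -> current number
  }
  // Call on every command; returns false when the command should be ignored.
  allow(user) {
    const n = (this.counts.get(user) || 0) + 1;
    if (n >= this.threshold) {
      // Trip the limit: jump to the penalty value, and keep adding
      // if the user carries on spamming past the ban.
      this.counts.set(user, Math.max(n, this.penalty));
      return false;
    }
    this.counts.set(user, n);
    return true;
  }
  // Call every <period> seconds: decrement everyone, free zeroed entries.
  tick() {
    for (const [user, n] of this.counts) {
      if (n <= 1) this.counts.delete(user);
      else this.counts.set(user, n - 1);
    }
  }
}
```

With threshold 15, penalty 100 and one tick per second, a tripped user is ignored for roughly (100 - 15) seconds of ticks, matching the divide-ban-length-by-period rule above.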
Well, Nathon has already offered a solution, but it's possible to reduce the code that's needed.
var user = {};
user.lastCommandTime = new Date().getTime(); // time the user sent his last command
user.commandCount = 0; // command limit counter
user.maxCommandsPerSecond = 1; // commands allowed per second

function handleCommand(obj, action) {
    var user = obj.user, now = new Date().getTime();
    var timeDifference = now - user.lastCommandTime;
    user.commandCount = Math.max(user.commandCount - (timeDifference / 1000 * user.maxCommandsPerSecond), 0) + 1;
    user.lastCommandTime = now;

    if (user.commandCount <= user.maxCommandsPerSecond) {
        console.log('command!');
    } else {
        console.log('flooding');
    }
}

var obj = { user: user };
var e = 0;

function foo() {
    handleCommand(obj, 'foo');
    e += 250;
    setTimeout(foo, 400 + e);
}
foo();
In this implementation, there's no need for a list or a global callback every X seconds; instead we just reduce commandCount every time there's a new message, based on the time difference since the last command. It's also possible to allow different command rates for specific users.
All we need are 3 new properties on the user object :)
Redis
I would use the insanely fast advanced key-value store redis to write something like this, because:
It is insanely fast.
There is no need for a cron job because you can set an expiry on keys.
It has atomic operations to increment a key.
You could use redis-cli for prototyping.
I myself really like node_redis as a redis client. It is a really fast redis client, which can easily be installed using npm.
Algorithm
I think my algorithm would look something like this:
For each user, create a unique key which counts the consecutively executed commands. Also set an expiry equal to the time after which you no longer flag a user as a spammer. Let's assume the spammer has nickname x and the expiry is 15 seconds.
Inside redis-cli:
incr x
expire x 15
When you do a get x after 15 seconds, the key does not exist anymore.
If the value of the key is bigger than your threshold, flag the user as a spammer.
get x
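To see the INCR/EXPIRE semantics without a running redis, here is a tiny in-memory stand-in (a simulation for illustration only, not the node_redis API; time is passed in explicitly so the behaviour is deterministic):

```javascript
class FakeRedis {
  constructor() { this.store = new Map(); } // key -> { value, expiresAt }
  // INCR: create the key at 1, or bump it if it hasn't expired yet.
  incr(key, now) {
    const entry = this.store.get(key);
    if (!entry || entry.expiresAt <= now) {
      this.store.set(key, { value: 1, expiresAt: Infinity });
      return 1;
    }
    return ++entry.value;
  }
  // EXPIRE: the key disappears `seconds` after `now`.
  expire(key, seconds, now) {
    const entry = this.store.get(key);
    if (entry) entry.expiresAt = now + seconds * 1000;
  }
  // GET: null once the key has expired, like a vanished redis key.
  get(key, now) {
    const entry = this.store.get(key);
    return entry && entry.expiresAt > now ? entry.value : null;
  }
}
```

So after incr x and expire x 15, a get x 16 seconds later returns null, while any count still readable before that can be compared against the spam threshold.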
These answers seem to be going about this the wrong way.
IRC servers will disconnect your client, regardless of whether you're "debugging" or not, if the client or bot floods a channel or the server in general.
Make a blanket flood control, using the method #nmichaels has detailed, but on the bot's network connection to the server itself.