DynamoDB scan with delay (or limited provisioning) - Node.js

I am just starting to learn DynamoDB and have run into the following problem.
I am using connect-dynamodb to implement a session database on DynamoDB, and while developing I learned that scans are expensive. However, connect-dynamodb (like any DB framework) uses a reap interval to clean up expired sessions, and scans the table every X interval.
I found a nice solution here, but it uses a Java class, and I was wondering whether there is a comparable solution for Node.js.
If not, I would be glad to hear about any other good solution for an infrequent, scheduled read burst, such as a scan with a "delay" to avoid exceeding read capacity.
Thanks.

I use Node and DynamoDB a lot and just looked inside the connect-dynamodb module.
The main problem with this module is that it uses a table of type "HASH".
It should be a "RANGE" table with expires as the range key.
Then it is possible to do a query instead of a scan, which is way cheaper.
So my advice is not to use this module ;-)
Or fork it and change it to a RANGE table!
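If the module stays as it is, the "scan with delay" from the question can be approximated by paging through the table and pausing after each page. The following is only a minimal sketch in Python (the same loop translates directly to Node.js). `fetch_page` is a hypothetical wrapper you would write around the SDK's Scan call, passing the previous `LastEvaluatedKey` as `ExclusiveStartKey` and reading the consumed capacity from the response; the budget of read units per second is an assumption you would set below the table's provisioned capacity.

```python
import time
from typing import Callable, Iterator, Optional, Tuple

# Hypothetical page fetcher: takes the previous page's last key (or None) and
# returns (items, consumed_read_units, next_key); next_key is None on the last page.
Page = Tuple[list, float, Optional[dict]]


def throttled_scan(
    fetch_page: Callable[[Optional[dict]], Page],
    max_units_per_sec: float,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[dict]:
    """Yield every item, pausing after each page to stay within a read budget."""
    last_key = None
    while True:
        items, consumed_units, last_key = fetch_page(last_key)
        yield from items
        if last_key is None:
            break
        # Wait long enough that the units just consumed average out
        # to at most max_units_per_sec.
        sleep(consumed_units / max_units_per_sec)
```

Keeping each page small (the Scan `Limit` parameter) makes the pauses short and the consumption smooth, at the price of more round trips.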

Related

Index timestamp in Google Datastore

My previous question: Errors saving data to Google Datastore
We're running into issues writing to Datastore. Based on the previous question, we think the issue is that we're indexing a "SeenTime" attribute with YYYY-MM-DDTHH:MM:SSZ (e.g. 2021-04-29T17:42:58Z) and this is creating a hotspot (see: https://cloud.google.com/datastore/docs/best-practices#indexes).
We need to index this because we're querying the data by date and need the time for each observation in the end application. Is there a way around this issue where we can still query by date?
This answer is a bit late but:
On your previous question, before even writing a query, it feels like the main issue is "running into issues writing" (DEADLINE_EXCEEDED/UNAVAILABLE) -> it's happening on "some saves" -- so, it's not completely clear if it's due to data hot-spotting or from "ingesting more data in shorter bursts", which causes contention (see "Designing for scale").
A single entity in Datastore mode should not be updated too rapidly. If you are using Datastore mode, design your application so that it will not need to update an entity more than once per second. If you update an entity too rapidly, then your Datastore mode writes will have higher latency, timeouts, and other types of error. This is known as contention.
You would need to add a prefix to the key to index monotonically increasing timestamps (as mentioned in the best-practices doc). Then you can test your queries using the GQL interface in the console. However, since you most likely want "all events", I don't think it would be possible, and so it will result in hot-spotting and read latency.
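To make the prefix idea concrete, here is a rough sketch of how a sharded "SeenTime" value could be built and queried. The shard count, the `|` separator, and both function names are assumptions made for illustration and are not part of any Datastore API. The trade-off is visible in `shard_ranges`: a date-range query turns into one range query per shard, and the results have to be merged by the application.

```python
import hashlib
from typing import List, Tuple

NUM_SHARDS = 4  # assumed shard count (keep it below 10 so the prefix stays one digit)


def sharded_seen_time(entity_id: str, seen_time: str, num_shards: int = NUM_SHARDS) -> str:
    """Prefix an ISO-8601 timestamp with a stable shard number derived from the entity id."""
    digest = hashlib.sha256(entity_id.encode("utf-8")).digest()
    shard = digest[0] % num_shards
    return f"{shard}|{seen_time}"


def shard_ranges(start: str, end: str, num_shards: int = NUM_SHARDS) -> List[Tuple[str, str]]:
    """Return one (lower, upper) bound pair per shard for a date-range query."""
    return [(f"{s}|{start}", f"{s}|{end}") for s in range(num_shards)]
```

Each pair from `shard_ranges` would be used as the lower and upper bound of an inequality filter on the indexed property.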
The impression is that the latency might be unavoidable. If so, then you would need to decide if it's acceptable, depending on the frequency of your query/number-of-elements returned, along with the amount of latency (performance impact).
Consider switching to Firestore Native Mode. It has a different architecture under the hood and is the next version of Datastore. While Firestore is not perfect, it can be more forgiving about hot-spotting and contention, so it's possible that you'll have fewer issues than in Datastore.

Elasticsearch indexing speed with Nodejs

I have an Elasticsearch index with a large number of documents. I've been using JavaScript with Node.js up until this point. I made a cron job that runs every 24 hours to update the documents individually based on any metadata changes. As some of you may know, this is probably the slowest possible way to do it: single-threaded Node.js with individual indexing on Elasticsearch. When I run the cron job, it runs at a snail's pace. I can update a document maybe every 1.5-2 seconds. This means it would take around 27 hours to update all the documents. I'm using a free-tier AWS ES instance, so I don't have access to certain features that would help me speed up the process.
Does anyone know of a way to speed up the indexing? If I were to call for a bulk update, how would that manifest in javascript? If I were to use another language to multi-thread it, what would be the fastest option?
I did not understand your question "If I were to call for a bulk update, how would that manifest in javascript?".
Bulk update should be the best solution irrespective of language or framework. Of course, you can explore other languages such as Ruby and use threads to distribute the bulk updates and make them faster.
From experience, a bulk update with a batch size between 4,000 and 7,000 documents works just fine. You may want to fine-tune the size within this range.
Ensure that refresh_interval is set to a large value, so that the index is not refreshed too often while the bulk update runs. IMO the default value will also do. Read more here.
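As for what a bulk update looks like in practice: the `_bulk` endpoint takes a newline-delimited JSON body in which each update is an action line followed by a partial document. The sketch below is in Python, but the body has the same shape from any client. The document layout (`id` and `fields` keys) and the function names are assumptions made for illustration.

```python
import json
from typing import Dict, Iterable, Iterator, List


def chunked(docs: Iterable[Dict], size: int) -> Iterator[List[Dict]]:
    """Split an iterable of documents into lists of at most `size` items."""
    batch: List[Dict] = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def bulk_update_body(index: str, batch: List[Dict]) -> str:
    """Build the newline-delimited JSON body for one _bulk request of partial updates."""
    lines = []
    for doc in batch:
        # Action line, then the partial document to merge into the stored one.
        lines.append(json.dumps({"update": {"_index": index, "_id": doc["id"]}}))
        lines.append(json.dumps({"doc": doc["fields"]}))
    return "\n".join(lines) + "\n"  # the bulk body must end with a newline
```

Each body would be sent as one POST to `/_bulk` with the `application/x-ndjson` content type, one request per batch from `chunked`.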

Does Hazelcast support bulk set or asynchronous bulk put operation?

I have a use case of inserting a lot of data during a big calculation, and the data really doesn't have to be available in the cluster immediately (so the cluster can synchronize as we go).
Currently I'm inserting batches using putAll() operation and it's blocking and taking time.
I've read a blog post about efficiency of set() operation but there is no analogous setAll(). I also saw putAsync() and didn't see matching putAllAsync() (I'm not interested in the future object).
Am I overlooking something? How can I improve insertion performance?
EDIT: Feature request: https://github.com/hazelcast/hazelcast/issues/5337
I think you're right, they're missing. Could you create a feature request, maybe you're also interested in helping to implement them using the Hazelcast Incubator?

Inserting mappings for a large amount of existing data

I am currently testing inserting a large number of mappings for existing data into a shard map using Elastic Scale. It turns out the whole process is time-consuming: it inserts around 10 mappings/second. Is there any way to speed up the insertion, e.g. by inserting batches of mappings or directly via stored procedures?
We know from our own testing that inserting mappings is time consuming. Here are a couple of options I'd suggest you try:
You can run multiple parallel threads inserting the mappings.
You can increase the service level objective for the shard map database for the time where you do the bulk load.
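The first option can be sketched generically as a thread pool that fans the inserts out over several workers. This is an illustration in Python and not Elastic Scale code: `insert_mapping` is a hypothetical callable standing in for whatever call creates a single mapping, and the worker count is an assumption to tune against the shard map database.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable


def insert_in_parallel(
    mappings: Iterable[str],
    insert_mapping: Callable[[str], None],
    workers: int = 8,
) -> int:
    """Run insert_mapping for every mapping on a pool of worker threads; return the count."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() forces all results, so any exception raised in a worker surfaces here.
        return len(list(pool.map(insert_mapping, mappings)))
```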
I understand why you would want to load mappings in bulk for test scenarios. However, I am not sure I understand the reason why you will need so many mappings that this becomes a problem. Could you explain a bit more?
Thanks,
Torsten
Coming back to this question, since we have now published the ShardManagement PowerShell module along with some sample scripts here: https://gallery.technet.microsoft.com/scriptcenter/Azure-SQL-DB-Elastic-731883db. This should help you in setting and querying the existing range/list mappings quickly.

alternative to polling database?

I have an application that works as follows: Linux machines generate 28 different types of letter to customers. The letters must be sent in .docx (Microsoft Word format). A secretary maintains MS Word templates, which are automatically used as necessary. Changing from using MS Word is not an option.
To coordinate all this, document jobs are placed into a database table, and a Python program running on each of the Windows machines polls the database frequently, locking out jobs and running them as necessary.
We use a central database table for the job information to coordinate different states ("new", "processing", "finished", "printed")... as well to give accurate status information.
Anyway, I don't like the clients polling the database frequently, seeing as they aren't working most of the time. Clients poll every 5 seconds.
To avoid polling, I kind of want a broadcast "there's some work to do" or "check your database for some work to do" message sent to all the client machines.
I think some kind of publish/subscribe message queue would be up to the job, but I don't want any massive extra complexity.
Is there a zero or near zero config/maintenance piece of software that would achieve this? What are the options?
Is there any objective evidence that any significant load is being put on the server? If it works, I'd make sure there's really a problem to solve here.
It must be nice to have everything running so smoothly that you're looking at things that might only possibly be improved!
Is there a zero or near zero config/maintenance piece of software that would achieve this? What are the options?
Possibly, but what you would save in configuration and implementation time would likely hurt performance more than your polling service ever could. SQL Server isn't made to do a push really (not easily anyway). There are things that you could use to push data out (replication service, log shipping - icky stuff), but they would be more complex and require more resources than your simple polling service. Some options would be:
some kind of trigger which runs your executable using command-line calls (xp_cmdshell)
using a COM object which SQL Server could open and run
using a SQL Agent job to run a VBScript (which would again be considered "polling")
These options are a bit ridiculous considering what you have already done is much simpler.
If you are worried about the polling service using too many cycles or something - you can always throttle it back - polling every minute, every 10 minutes, or even just once a day might be more appropriate - this would be a business decision, so go ask someone in the business how fast it needs to be.
Simple polling services are fairly common, because they are, well... simple. In addition they are also low overhead, remotely stable, and error-tolerant. The down side is that they can hammer the database into dust if not carefully controlled.
A message queue might work well, as they're usually set up to be able to block for a while without wasting resources. But with MySQL, I don't think that's an option.
If you just want to reduce load on the DB, you could create a table with a single row: the latest job ID. Then clients just need to compare that to their last ID to see if they need to run a full poll against the real table. This way the overhead should be greatly reduced, if it's an issue now.
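A minimal sketch of this single-row marker, using Python's built-in SQLite module as a stand-in for the real database. The table and function names are assumptions made for illustration; the point is only that the frequent poll touches one row, and the jobs table is queried only when the marker has moved.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for the real job database
conn.executescript("""
    CREATE TABLE jobs (id INTEGER PRIMARY KEY, state TEXT NOT NULL);
    CREATE TABLE latest_job (id INTEGER NOT NULL);  -- always exactly one row
    INSERT INTO latest_job (id) VALUES (0);
""")


def add_job(conn: sqlite3.Connection) -> int:
    """Insert a job and bump the single-row marker in the same transaction."""
    with conn:
        job_id = conn.execute("INSERT INTO jobs (state) VALUES ('new')").lastrowid
        conn.execute("UPDATE latest_job SET id = ?", (job_id,))
    return job_id


def poll(conn: sqlite3.Connection, last_seen: int) -> tuple:
    """Cheap check first; query the jobs table only when the marker has moved."""
    (latest,) = conn.execute("SELECT id FROM latest_job").fetchone()
    if latest == last_seen:
        return last_seen, []
    rows = conn.execute(
        "SELECT id FROM jobs WHERE id > ? AND state = 'new' ORDER BY id", (last_seen,)
    ).fetchall()
    return latest, [r[0] for r in rows]
```

Each client keeps the first element of the returned tuple and passes it back as `last_seen` on its next poll.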
Unlike Postgres and SQL Server (or object stores like CouchDB), MySQL does not emit database events. However, there are some coding patterns you can use to simulate this.
If you have one or more tables that you wish to monitor, you can create triggers on these tables that add a row to a "changes" table that records a queue of events to process. Your triggers filter the subset of data changes that you care about and create records in your changes table for each event you wish to perform. Because this pattern queues and persists events it works well even when the workers that process these events have outages.
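The trigger-plus-changes-table pattern can be tried out with Python's built-in SQLite module, used here only as a stand-in because MySQL's trigger syntax differs slightly. The table, trigger, and function names are assumptions made for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")  # SQLite stands in for MySQL in this sketch
db.executescript("""
    CREATE TABLE letters (id INTEGER PRIMARY KEY, state TEXT NOT NULL);
    CREATE TABLE changes (
        seq INTEGER PRIMARY KEY AUTOINCREMENT,
        letter_id INTEGER NOT NULL,
        event TEXT NOT NULL
    );
    -- Record only the changes we care about: rows arriving in state 'new'.
    CREATE TRIGGER letters_queue AFTER INSERT ON letters
    WHEN NEW.state = 'new'
    BEGIN
        INSERT INTO changes (letter_id, event) VALUES (NEW.id, 'created');
    END;
""")


def drain(db: sqlite3.Connection) -> list:
    """Read queued events in order, then delete the ones that were read."""
    with db:
        events = db.execute(
            "SELECT seq, letter_id, event FROM changes ORDER BY seq"
        ).fetchall()
        if events:
            db.execute("DELETE FROM changes WHERE seq <= ?", (events[-1][0],))
    return [(letter_id, event) for _, letter_id, event in events]
```

Because events stay in the changes table until a worker drains them, a worker that was offline simply picks up the backlog on its next run.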
You might think that MyISAM is the best choice for the changes table since it's mostly performing writes (or even MEMORY if you don't need to persist the events between database server outages). However, keep in mind that both MEMORY and MyISAM have only full-table locks, so your trigger on an InnoDB table might hit a bottleneck when performing an insert into a MEMORY or MyISAM table. You may also require InnoDB for the changes table if you're using an ON DELETE CASCADE with another InnoDB table (requires both tables to use the same engine).
You might also use SHOW TABLE STATUS to check the last update time of your changes table to see whether there's something to process. This feature won't work for InnoDB tables.
These articles describe in more depth some alternative ways to implement queues in MySQL and even avoid polling!
How to notify event listeners in MySQL
How to implement a queue in SQL
5 subtle ways you're using MySQL as a queue, and why it'll bite you

Resources