Custom in-memory cache - node.js

Imagine there's a web service:
Runs on a cluster of servers (nginx/node.js)
All data is stored remotely
Must respond within 20ms
Data that must be read for a response is split like this..
BatchA
Millions of small objects stored in AWS DynamoDB
Updated randomly at random times
Only consistent reads, can't be catched
BatchB
~2,000 records in SQL
Updated rarely, records up to 1KB
Can be catched for up to 60-90s
We can't read them all at once as we don't know which records to fetch from BatchB until we read from BatchA.
Read from DynamoDB takes up to 10ms. If we read BatchB from remote location, it would leave us with no time for calculations or we would have already been timed out.
My current idea is to load all BatchB records into memory of each node (that's only ~2MB). On startup, the system would connect to SQL server and fetch all records and then it would update them every 60 or 90 seconds. The question is what's the best way to do this?
I could simply read them all into a variable (array) in node.js and then use SetTimeout to update the array after 60-90s. But is the the best solution?

Your solution doesn't sound bad. It fits your needs. Go for it.
I suggest keeping two copies of the cache while in the process of updating it from remote location. While the 2MB are being received you've got yourself a partial copy of the data. I would hold on to the old cache until the new data is fully received.
Another approach would be to maintain only one cache set and update it as each record arrives. However, this is more difficult to implement and is error-prone. (For example, you should not forget to delete records from the cache if they are no longer found in the remote location.) This approach conserves memory, but I don't suppose that 2MB is a big deal.

Related

Parallel read and write to postgres database slows down application (backend)

I have a backend in nestjs using typeorm and postgres. This backend saves and reads data frequently from the database. In this database we are dealing with row counts of 10k + at times that needs to get updated and saved or created.
In this particular case where I need some brain juice I have a table (lets call it table a)
the backend fetches data from table a every few seconds
the content in table A needs to get updated frequently (properties and values overwritten). I am doing this updating task from a several application backend solely for this use-case.
Example case
Table A holds 100K records
update-service splits these 100K records into chunks of 5 and parallell updates 25K records each. While doing so, the main application that retrieves data from the backend slows down.
What is the best way to have performant read and write in parallel? I am assuming the slow down comes from locks (main backend retrieves data while update service tries to update) but I am not sure as I have not that much experience working with databases.
Don't assume, assert.
While you experiencing bad performance, check how the operating system's resources are doing; in this case, mostly CPU and disk. If one of them is maxed out, you know what is going on, and you either have to reduce the degree of parallelism or make the system stronger.
It is also interesting to look at wait events in PostgreSQL:
SELECT wait_event_type, wait_event, count(*)
FROM pg_stat_activity
WHERE state = 'active'
GROUP BY wait_event_type, wait_event;
That will show I/O related events if you are running out of disk bandwidth, but it will also show database-internal contention that you can potentially hit with very high degrees of parallelism.

NodeJS Issue with huge data export

I have a scenario. In DB, I have a table with a huge amount of records (2 million) and I need to export them to xlsx or csv.
So the basic approach that I used, is running a query against DB and put the data into an appropriate file to download.
Problems:
There is a DB timeout that I have set to 150 sec which sometimes isn't
enough and I am not sure if expanding timeout would be a good idea!
There is also some certain timeout with express request, So it basically timed out my HTTP req and hits for second time (for unknown reason)
So as a solution, I am thinking of using stream DB connection and with that if in any way I can provide an output stream with the file, It should work.
So basically I need help with the 2nd part, In stream, I would receive records one by one and at the same time, I am thinking of allowing user download the file progressively. (this would avoid request timeout)
I don't think it's unique problem but didn't find any appropriate pieces to put together. Thanks in advance!
If you see it in your log, do you run the query more than once?
Does your UI timeout before the server even reach the res.end()?

CouchDB 2 global_changes system table is getting insanely big

We have a system that basically writes 250MB of data into our CouchDB 2 instance, which generates ~50GB/day in the global_changes database.
This makes CouchDB2 consumes all the disk.
Once you get to this state, CouchDB2 goes and never comes back.
We would like to know if there is any way of limiting the size of the global_changes table, or if there is a way of managing this table, like a set of best practices.
tldr: just delete it
http://docs.couchdb.org/en/latest/install/setup.html#single-node-setup
States:
Note that the last of these (referring to _global_changes) is not necessary if you do not expect to be using the global changes feed. Feel free to delete this database if you have created it, it has grown in size, and you do not need the function (and do not wish to waste system resources on compacting it regularly.

How to store data in cassandra in a temporary location while sync is in progress?

We have a mysql server running which is serving application writes. To do some batch processing we have written a sync job to migrate data into cassandra cluster.
1. A daily sync job which transfers by updated timestamp for that day.
2. A complete sync job which transfers complete data, overriding existing ones.
Now there may be a possibility that the row was deleted from mysql, in that case using the above approach it will lie forever in cassandra.
To solve that problem we have given a TTL of 15 days for every row. So eventually it will get deleted, if it was not deleted then in next full sync the TTL will be over written again.
Its working fine as far as the use case is concerned but the issue is that in full sync complete data is over written and sstable is generated continuously with compactions happenning all the time, load averages shoot up with slowness and backup size increases (which could have been avoided).
Essentially we would want to replace the existing table data by new data but we dont want to truncate before starting the job but only after job completes.
Is there any way by which this can be solved other than creating a new table altogether and dropping past table when data is generated?
You can look at the double-run migration strategy I presented here: http://www.slideshare.net/doanduyhai/from-rdbms-to-cassandra-without-a-hitch
It has the advantage of allowing 100% uptime and possible rollback if things go wrong. The downside is the amount of work required in term of releases & codes

Potential issue with Couchbase paging

It may be too much turkey over the holidays, but I've been thinking about a potential problem that we could have with Couchbase.
Currently we paginate based on time, but I'm thinking a similar issue could occur with other values used for paging for example the atomic counter. I'll try to explain best I can, this would only occur in a load balanced environment.
For example say we have 4 servers load balanced and storing data to our Couchbase cluster. We sort our records based on timestamps currently. If any of the 4 servers writing the data starts to lag behind the others than our pagination would possibly be missing records when retrieving client side. A SQL DB auto-increment and timestamps for example can be created when the record is stored to the DB which will avoid similar issues. Using a NoSql DB like Couchbase you define the data you need to retrieve on before it is stored to the DB. So what I am getting at is if there is a delay in storing to the DB and you are retrieving in a pagination fashion while this delay has occurred, you run the real possibility of missing data. Since we are paging that data may never be viewed.
Interested in what other thoughts people have on this.
EDIT**
Response to Andrew:
Example a facebook or pintrest type app is storing data to a DB, they have many load balanced servers from the frontend writing to the db. If for some reason writing is delayed its a non issue with a SQL DB because a timestamp or auto increment happens when the data is actually stored to the DB. There will be no missing data when paging. asking for 1-7 will give you data that is only stored in the DB, 7-* will contain anything that is delayed because an auto-increment value has not been created for that record becuase it is not actually stored.
In Couchbase its different, you actually get your auto increment value (atomic counter) and then save it. So for example say a record is going to be stored as atomic counter number 4. For some reasons this is delayed in storing to the DB. Other servers are grabbing 5, 6, 7 and storing that data just fine. The client now asks for all data between 1 and 7, 4 is still not stored. Then the next paging request is 7 to *. 4 will never be viewed.
Is there a way around this? Can it be modelled differently in CB, or is this just a potential weakness in CB when needing to page results. As I mentioned are paging is timestamp sensitive.
Michael,
Couchbase is an eventually consistent database with respect to views. It is ACID with respect to documents. There are durability interfaces that let you manage this. This means that you can rest assured you won't lose data and that indexes will catch up eventually.
In my experience with Couchbase, you need to expect that the nodes will never be in-sync. There are many things the database is doing, such as compaction and replication. The most important thing you can do to enhance performance is to put your views on a separate spindle from the data. And you need to ensure that your main data spindles across your cluster can sustain between 3-4 times your ingestion bandwidth. Also, make sure your main document key hashes appropriately to distribute the load.
It sounds like you are discussing a situation where the data exists in your system for less time than it takes to be processed through the view system. If you are removing data that fast, you need either a bigger cluster or faster disk arrays. Of the two choices, I would expand the size of your cluster. I like to think of Couchbase as building a RAIS, Redundant Array of Independent Servers. By expanding the cluster, you reduce the coincidence of hotspots and gain disk bandwidth. My ideal node has two local drives, one each for data and views, and enough RAM for my working set.
Anon,
Andrew

Resources