Design a simple cache manager in Node.js to manage files on disk - node.js

My server receives input data and generates an output file. I need to cache this output file so that when a user requests the same input again, the server returns the output file instantly. The cache manager needs the following:
Entry: inputId -> path to processed file (cached file)
Many server processes can set and get cache entries at the same time
Limit the total size of cached files; remove old cached files if the disk is full
Cached files expire and are removed after some time. On a cache hit, reset the expiry time of the cached file.
Server processes can crash, or the computer can shut down at any time. The cache manager can discard incorrect data but must keep valid cached files.
Currently, I'm using Redis as an LRU cache:
inputId -> filePath with an expiry time
A Sorted Set: inputId -> last access time of file
Three Lua scripts: setCache(inputId, filePath), getCache(inputId), removeCache(inputId)
Periodically check disk space and remove the least recently used files.
Listen for Redis key-expiry events to remove the cached file
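The bookkeeping in the list above can be sketched without Redis. This is only an in-memory stand-in, not the asker's Lua scripts: the class name CacheIndex, its options (ttlMs, maxBytes, now) and the per-entry size argument are all assumptions made for illustration. It shows a get that resets the expiry time and a set that evicts the least recently used entries once the total size exceeds a limit.

```javascript
// In-memory stand-in for the Redis structures described above.
// All names here are hypothetical; a real version would live in Redis.
class CacheIndex {
  constructor({ ttlMs, maxBytes, now = Date.now }) {
    this.ttlMs = ttlMs;
    this.maxBytes = maxBytes;
    this.now = now;
    this.entries = new Map(); // inputId -> { filePath, size, expiresAt, lastAccess }
    this.totalBytes = 0;
  }

  // setCache: register a file, then evict until under the size limit.
  // Returns the file paths the caller should delete from disk.
  set(inputId, filePath, size) {
    const removed = [];
    const old = this.remove(inputId);
    if (old && old !== filePath) removed.push(old);
    const t = this.now();
    this.entries.set(inputId, { filePath, size, expiresAt: t + this.ttlMs, lastAccess: t });
    this.totalBytes += size;
    return removed.concat(this.evict(inputId));
  }

  // getCache: on a hit, refresh both the last access time and the expiry.
  get(inputId) {
    const e = this.entries.get(inputId);
    const t = this.now();
    if (!e || e.expiresAt <= t) return null; // expired entries are collected by evict()
    e.lastAccess = t;
    e.expiresAt = t + this.ttlMs;
    return e.filePath;
  }

  // removeCache: forget the entry and return its path (or null).
  remove(inputId) {
    const e = this.entries.get(inputId);
    if (!e) return null;
    this.entries.delete(inputId);
    this.totalBytes -= e.size;
    return e.filePath;
  }

  // Drop expired entries, then the least recently used ones while over the limit.
  // The entry that was just written (protectId) is never evicted here.
  evict(protectId) {
    const removed = [];
    const t = this.now();
    for (const [id, e] of [...this.entries]) {
      if (e.expiresAt <= t) removed.push(this.remove(id));
    }
    const byAge = [...this.entries].sort((a, b) => a[1].lastAccess - b[1].lastAccess);
    for (const [id] of byAge) {
      if (this.totalBytes <= this.maxBytes) break;
      if (id !== protectId) removed.push(this.remove(id));
    }
    return removed;
  }
}
```

Returning the paths to delete, rather than deleting inside the index, keeps the index update and the disk cleanup as separate steps, which is roughly the split the Redis design already has.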
In general, I feel my implementation is not robust enough to handle process or machine restarts and crashes. I intend to save the cache index in a database.
I would like some comments on my design. Am I reinventing the wheel? (Stack Overflow doesn't allow asking for libraries or documentation.)
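One way to approach the crash requirement, sketched under assumptions: write each output under a temporary name and rename it into place (a rename within one filesystem is atomic on POSIX systems), then reconcile the index against the directory at startup. The helper names writeAtomic and reconcile, the ".tmp" suffix, the flat cache directory, and the index being a Map of inputId to file name are all hypothetical.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical helper: write under a temporary name, then rename into place,
// so a crash mid-write never leaves a half-written file under the final name.
function writeAtomic(cacheDir, name, data) {
  const tmp = path.join(cacheDir, `${name}.${process.pid}.tmp`);
  fs.writeFileSync(tmp, data);
  fs.renameSync(tmp, path.join(cacheDir, name));
}

// Hypothetical helper: run once at startup. `index` is a Map of
// inputId -> file name inside cacheDir (assumed to be a flat directory).
function reconcile(cacheDir, index) {
  const onDisk = new Set(fs.readdirSync(cacheDir));
  const referenced = new Set();

  // Discard index entries whose file no longer exists.
  for (const [inputId, name] of [...index]) {
    if (onDisk.has(name)) referenced.add(name);
    else index.delete(inputId);
  }

  // Delete files that no entry points to, such as leftover *.tmp files.
  for (const name of onDisk) {
    if (!referenced.has(name)) {
      fs.rmSync(path.join(cacheDir, name), { force: true, recursive: true });
    }
  }
  return index;
}
```

With several processes sharing the directory, reconcile would need to run while no writer is active, or skip recent temporary files; that coordination is left out of this sketch.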

Related

Will an instance of a Google Cloud Run container overwrite temporary file?

I am using Cloud Run with the default settings (80 requests per instance) running a container with node and express.
The service needs to create a temporary file when a request is processed. I'm wondering if when multiple requests arrive at the same time, will they be processed concurrently? So if the file is named the same thing, could it be overwritten by another process before the first one is completed?
With Node, I don't think we have parallel processes but I think there could still be a conflict unless express handles the requests sequentially.
If you set max concurrency = 1, then you can use the same file name.
If you use max concurrency > 1, then multiple requests that use the same filename can conflict while processing the file. The best approach is to use a unique temporary filename for each request and to ensure the file is deleted at the end.
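A minimal sketch of the unique-name advice, assuming Node's built-in fs/promises module: fs.mkdtemp appends random characters to the given prefix, so concurrent requests get different directories. The helper name withTempDir and the "req-" prefix are made up for this example.

```javascript
const fs = require('fs/promises');
const os = require('os');
const path = require('path');

// Hypothetical helper: give each request its own scratch directory and
// remove it when the work is done, even if the work throws.
async function withTempDir(work) {
  const dir = await fs.mkdtemp(path.join(os.tmpdir(), 'req-'));
  try {
    return await work(dir);
  } finally {
    await fs.rm(dir, { recursive: true, force: true });
  }
}

// Possible use inside an express handler (not executed here):
//   app.post('/convert', (req, res, next) =>
//     withTempDir(async (dir) => {
//       const file = path.join(dir, 'output.txt'); // same name, different directory
//       await fs.writeFile(file, 'result');
//       res.send(await fs.readFile(file, 'utf8'));
//     }).catch(next));
```

Because the directory is unique per call, the file inside it can keep a fixed name without two requests overwriting each other.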

DataLake locks on read and write for the same file

I have 2 different applications that handle data from Data Lake Storage Gen1.
The first application uploads files: if there are multiple uploads on the same day, the existing file is overwritten (there is always one file per day, saved using the YYYY-MM-dd format).
The second application reads the data from the files.
Is there an option to lock these operations, so that no read takes place while a write is being performed, and a write waits until any read in progress has finished?
I did not find any option using the AdlsClient.
Thanks.
As far as I know, Data Lake Storage Gen1 is an Apache Hadoop file system that is compatible with the Hadoop Distributed File System (HDFS). I searched the HDFS documentation, and I'm afraid you can't directly enforce mutual exclusion between reads and writes. Please see the documents below:
1.link1: https://www.raviprak.com/research/hadoop/leaseManagement.html
writers must obtain an exclusive lock for a file before they’d be
allowed to write / append / truncate data in those files. Notably,
this exclusive lock does NOT prevent other clients from reading the
file, (so a client could be writing a file, and at the same time
another could be reading the same file).
2.link2: https://blog.cloudera.com/understanding-hdfs-recovery-processes-part-1/
Before a client can write an HDFS file, it must obtain a lease, which is essentially a lock. This ensures the single-writer semantics. The lease must be renewed within a predefined period of time if the client wishes to keep writing. If a lease is not explicitly renewed or the client holding it dies, then it will expire. When this happens, HDFS will close the file and release the lease on behalf of the client so that other clients can write to the file. This process is called lease recovery.
I provide a workaround here for your reference: put a Redis database in front of your reads and writes.
Before any read or write operation, first check whether a specific key exists in Redis. If it does not, write the key-value pair to Redis, then run your business logic, and finally delete the key. If the key already exists, another operation is in progress, so wait and retry.
Although this may be a little cumbersome and may affect performance, I think it can meet your needs. By the way, since the business logic may fail or crash so that the key is never released, you can set a TTL when creating the key to avoid this situation.
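The workaround can be sketched as follows. To keep the example self-contained, the Redis key is replaced by an in-memory stand-in (MemoryLockStore, a hypothetical class); against a real Redis server the "set only if absent, with a TTL" step corresponds to the SET command with its NX and PX options, which does the check and the write in one step. The names withLock, setIfAbsent and deleteIfOwner are assumptions for this sketch.

```javascript
const crypto = require('crypto');

// In-memory stand-in for the Redis key (assumption: a real deployment would
// use Redis SET with the NX and PX options instead of this class).
class MemoryLockStore {
  constructor(now = Date.now) {
    this.now = now;
    this.keys = new Map(); // key -> { token, expiresAt }
  }
  // Succeeds only if the key is absent or its TTL has run out.
  setIfAbsent(key, token, ttlMs) {
    const cur = this.keys.get(key);
    if (cur && cur.expiresAt > this.now()) return false;
    this.keys.set(key, { token, expiresAt: this.now() + ttlMs });
    return true;
  }
  // Deletes the key only if it still belongs to the caller.
  deleteIfOwner(key, token) {
    const cur = this.keys.get(key);
    if (!cur || cur.token !== token) return false;
    this.keys.delete(key);
    return true;
  }
}

// Run `work` while holding the key; always release it afterwards.
async function withLock(store, key, ttlMs, work) {
  const token = crypto.randomUUID();
  if (!store.setIfAbsent(key, token, ttlMs)) return { acquired: false };
  try {
    return { acquired: true, value: await work() };
  } finally {
    store.deleteIfOwner(key, token);
  }
}
```

The random token is there so that a slow client whose TTL has already expired does not delete a key that another client has since taken. Callers that get acquired: false would wait and retry.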

What is the lifespan of assets cached by service worker?

Some of the articles I have read suggest that items cached by a service worker (the web Cache API) are stored on the system forever.
I have come across a scenario where some of the cached resources are evicted automatically for users who revisit my website after a long time (more than about 2 months).
I know for a fact that assets cached via HTTP caching are removed by the browser after a certain time. Does the same apply to service workers too?
If that is the case, how does the browser decide which assets to remove, and is there a way to tell the browser that if it removes something from the cache, it should remove everything cached under the same cache name?
It seems it lasts forever, until it doesn't :) (ie. storage space is low)
https://developers.google.com/web/ilt/pwa/caching-files-with-service-worker
You are responsible for implementing how your script (service worker)
handles updates to the cache. All updates to items in the cache must
be explicitly requested; items will not expire and must be deleted.
However, if the amount of cached data exceeds the browser's storage
limit, the browser will begin evicting all data associated with an
origin, one origin at a time, until the storage amount goes under the
limit again. See Browser storage limits and eviction criteria for more
information.
If their storage is running low then it may be evicted: (See Storage Limits)
https://developer.mozilla.org/en-US/docs/Web/API/IndexedDB_API/Browser_storage_limits_and_eviction_criteria
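Since the quoted documentation says items must be deleted explicitly, here is a small sketch of dropping whole named caches with the CacheStorage methods keys() and delete(). It does not change how the browser itself evicts data under storage pressure. The function name deleteCachesExcept and the cache names are hypothetical; the storage object is passed in as a parameter so the function can also be exercised outside a browser.

```javascript
// `cacheStorage` is the global `caches` object when called from a service worker.
// Deletes every named cache that is not listed in `keep` and returns the
// names that were removed.
async function deleteCachesExcept(cacheStorage, keep) {
  const names = await cacheStorage.keys();
  const stale = names.filter((name) => !keep.includes(name));
  await Promise.all(stale.map((name) => cacheStorage.delete(name)));
  return stale;
}

// Possible use in a service worker (not executed here; 'static-v2' is a
// made-up cache name):
//   self.addEventListener('activate', (event) => {
//     event.waitUntil(deleteCachesExcept(caches, ['static-v2']));
//   });
```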

Update SQLite database without restarting application

I have a Node.js application that runs on a production server with forever.
That application uses a third-party SQLite database which is updated every morning by a cron-triggered script that downloads the db from an external FTP server.
I spent some time before realising that I need to restart my server every time the file is retrieved; otherwise the data used by my application does not change (I guess it's cached in memory at startup).
# sync_db.sh
wget -l 0 ftp://$REMOTE_DB_PATH --ftp-user=$USER --ftp-password=$PASSWORD \
--directory-prefix=$DIRECTORY -nH
forever restart 0  # <- Here I hope something nicer...
What can I do to refresh the database without restarting the app ?
You must not overwrite a database file that might have some open connection to it (see How To Corrupt An SQLite Database File).
The correct way to overwrite a database is to download to a temporary file, and then copy it to the actual database with the backup API, which takes care of proper transactional behaviour. The simplest way to do this is with the sqlite3 command-line shell:
sqlite3 $DIRECTORY/real.db ".restore $DOWNLOADDIRECTORY/temp.db"
(If your application manually caches data, you still have to tell it to reload it.)

Nodejs application memory usage tracking and clean up on exit

"A Node application is an instance of a Node Process Object".link
Is there a way in which local memory on the server can be cleared every time the node application exits?
[By application exit I mean when each individual user of the website closes the tab in the browser]
node.js is a single process that serves all your users. There is no specific memory associated with a given user other than any state that you yourself in your own node.js code might be storing locally in your node.js server on behalf of a given user. If you have some memory like that, then the typical ways to know when to clear out that state are as follows:
Offer a specific logout option in the web page and when the user logs out, you clear their state from memory. This doesn't catch all ways the user might disappear so this would typically be done in conjunction with other options.
Have a recurring timer (say every 10 minutes) that automatically clears any state from a user who has not made a web request within the last hour (or however long you want the time set to). This also requires you to keep a timestamp for each user each time they access something on the site which is easy to do in a middleware function.
Have all your client pages keep a webSocket connection to the server and when that webSocket connection has been closed and not re-established for a few minutes, then you can assume that the user no longer has any page open to your site and you can clear their state from memory.
Don't store user state in memory. Instead, use a persistent database with good caching. Then, when the user is no longer using your site, their state info will just age out of the database cache gracefully.
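The recurring-timer option from the list above can be sketched like this. The class name SessionStore, its options and the req.userId field in the usage comment are assumptions for illustration, not part of any library.

```javascript
// Hypothetical per-user state store with a last-seen timestamp.
class SessionStore {
  constructor({ maxIdleMs, now = Date.now }) {
    this.maxIdleMs = maxIdleMs;
    this.now = now;
    this.sessions = new Map(); // userId -> { state, lastSeen }
  }

  // Call on every request: records the access time and returns the user's state.
  touch(userId) {
    let s = this.sessions.get(userId);
    if (!s) {
      s = { state: {}, lastSeen: 0 };
      this.sessions.set(userId, s);
    }
    s.lastSeen = this.now();
    return s.state;
  }

  // Call from a timer: drops state for users idle longer than maxIdleMs.
  sweep() {
    const cutoff = this.now() - this.maxIdleMs;
    let removed = 0;
    for (const [userId, s] of this.sessions) {
      if (s.lastSeen < cutoff) {
        this.sessions.delete(userId);
        removed++;
      }
    }
    return removed;
  }
}

// Possible wiring in an express app (not executed here):
//   const sessions = new SessionStore({ maxIdleMs: 60 * 60 * 1000 });
//   app.use((req, res, next) => { req.state = sessions.touch(req.userId); next(); });
//   setInterval(() => sessions.sweep(), 10 * 60 * 1000).unref();
```

Calling unref() on the interval keeps the sweep timer from holding the process open on shutdown.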
Note: Tracking memory overall usage in node.js is not a trivial task so it's important you know exactly what you are measuring if you're tracking this. Overall process memory usage is a combination of memory that is actually being used and memory that was previously used, is currently available for reuse, but has not been given back to the OS. You obviously need to be able to track memory that is actually in use by node.js, not just memory that the process may be allocated. A heapsnapshot is one of the typical ways to track what is actually being used, not just what is allocated from the OS.
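As a rough illustration of the difference between memory in use and memory allocated from the OS, process.memoryUsage() reports both: heapUsed is what V8 is currently using, while rss is the process's resident memory. The helper name memoryReport and the rounding to megabytes are choices made for this sketch.

```javascript
const v8 = require('v8');

// Summarise process.memoryUsage() in megabytes (one decimal place).
function memoryReport() {
  const { rss, heapTotal, heapUsed, external } = process.memoryUsage();
  const mb = (bytes) => Math.round((bytes / 1024 / 1024) * 10) / 10;
  return {
    rssMb: mb(rss),             // resident memory held by the process
    heapTotalMb: mb(heapTotal), // heap reserved by V8
    heapUsedMb: mb(heapUsed),   // heap actually in use
    externalMb: mb(external),   // memory of C++ objects bound to JS objects
  };
}

// Writes a .heapsnapshot file that can be loaded in Chrome DevTools and
// returns its file name. The call is synchronous and can pause the process,
// so it is not something to run on every request.
function dumpHeap() {
  return v8.writeHeapSnapshot();
}
```

Logging memoryReport() at intervals shows trends; a heap snapshot is the tool for seeing which objects are actually retained.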

Resources