Firebase functions - database cache? - node.js

How does one cache database data in a Firebase function?
Based on this SO answer, firebase caches data for as long as there is an active listener.
Considering the following example:
exports.myTrigger = functions.database.ref("some/data/path").onWrite((data, context) => {
  var dbRootRef = data.after.ref.root;
  // Attaching a listener keeps the data cached in memory
  dbRootRef.child("another/data/path").on("value", function(){});
  return dbRootRef.child("another/data/path").once("value").then(function(snap){ /* process data */ });
});
This will cache the data, but the question is: is this a valid approach for the server side? Should I call .off() at some point so it doesn't cause issues, since this function can scale quickly and produce tons of .on() listeners? Or is it OK to keep the on() listener indefinitely?

Since active data is kept in memory, your code will keep a snapshot of the latest data at another/data/path in memory as long as the listener is active. Since you never call off in your code, that will be as long as the container that runs the function is active, not just for the duration that this function is active.
Even if you have other Cloud Functions in that container, and those other functions don't need this data, it'll still be using memory.
If that is the behavior you want, then it's a valid approach. I'd just recommend doing a cost/benefit analysis, because I expect this may lead to hard-to-understand behavior at some point.
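For comparison, if you do decide to bound the listener's lifetime, a minimal sketch (reusing the paths from the question) shows where an explicit off() call would go; note that detaching at the end of the invocation like this also gives up most of the cross-invocation caching discussed above:
const functions = require("firebase-functions");

exports.myTrigger = functions.database.ref("some/data/path").onWrite((data, context) => {
  const cacheRef = data.after.ref.root.child("another/data/path");
  const keepAlive = function() {};   // no-op listener that keeps the node cached
  cacheRef.on("value", keepAlive);

  return cacheRef.once("value").then(function(snap) {
    /* process data */
    // Detach the listener when this invocation is done, so the container
    // does not keep holding the snapshot in memory for its whole lifetime.
    cacheRef.off("value", keepAlive);
  });
});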

Related

Use Google Cloud Secrets when initializing code

I have this code to retrieve the secrets:
import {SecretManagerServiceClient} from "@google-cloud/secret-manager";

const client = new SecretManagerServiceClient();

async function getSecret(secret: string, version = "latest") {
  const projectID = process.env.GOOGLE_CLOUD_PROJECT;
  const [vs] = await client.accessSecretVersion({
    name: `projects/${projectID}/secrets/${secret}/versions/${version}`
  });
  const secretValue = JSON.parse(vs.payload.data.toString());
  return secretValue;
}

export {getSecret};
I would like to replace process.env.SENTRY_DNS with await getSecret("SENTRY_DNS"), but I can't use await outside an async function.
Sentry.init({
  dsn: process.env.SENTRY_DNS,
  environment: Config.isBeta ? "Beta" : "Main"
});

function sentryCreateError(message, contexts) {
  Sentry.captureMessage(message, {
    level: "error", // one of 'info', 'warning', or 'error'
    contexts
  });
}
What are the best practices with Google Secrets? Should I be loading the secrets once in a "config" file and then call the values from there? If so, I'm not sure how to do that, do you have an example?
Leaving aside your code example (I don't work with JS anyway), I would think about a few different questions whose answers may affect the design. For example:
Where is this code executed? Compute Engine, App Engine, Cloud Run, k8s, a cloud function, and so on. Depending on the answer, the approach to storing secrets might be different.
Suppose, for example, that it is going to be a cloud function. The next question:
Would you prefer to store the secret values in special environment variables, or in the Secret Manager? The first option is faster but less secure, as, for instance, everybody who has access to the cloud function details in the console can see those environment variable values.
Should the secret values be loaded into memory on initialization, or on every invocation? The first option is faster, but might cause some issues if the secret values are modified (old values are only gradually replaced with new ones as some instances are terminated and new instances are initialized).
The second option may need some additional discussion. It might be possible to get the values asynchronously. In what circumstances might that be useful? I think only when your code has something else to do while waiting for the secret values, which are required to do (probably) the main job of the cloud function. How much can we shave off that way? Probably a few milliseconds spent on the Secret Manager API call. Any drawbacks? Code complexity, since somebody has to maintain that code in the future. Does the performance gain still outweigh that cost? If not, we can probably return to item 2 in the list above and think about storing secrets in environment variables instead.
What about the first option? Again, if performance is the priority, go back to item 2 above; otherwise, are code simplicity and maintainability the priority, so that we don't need any asynchronous work here? Maybe the answer to that question depends on the skills, knowledge and financial budget of your company/team rather than on technical preferences.
About the "config" file for storing the secret values: while it is possible to store data in the pseudo "/tmp" directory (actually in the memory of a cloud function) during execution, we should not expect that data to be preserved between invocations. Thus, we come back either to environment variables (see item 2 above) or to some other remote place with API access. I don't know of many other services with better latency than the Secret Manager that could be used as a cache for storing secrets. Suppose we find such a service: now we have the performance vs. complexity/maintainability dilemma again...
Some concluding notes: my context, experience, budget and requirements may be completely different from yours, and my assumptions (i.e. that the code is for a cloud function) can be completely wrong as well. Thus, I would suggest reading the above critically and using only the ideas that are relevant to your specific situation.
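If you do decide to load the secrets once and reuse them, one common Node.js pattern is to cache the result in a module-scoped promise and await it inside the handler. A minimal sketch, assuming the getSecret() helper from the question lives in a local "./secrets" module and that the handler shape below fits your function (both are assumptions, not a prescribed layout):
const Sentry = require("@sentry/node");
const {getSecret} = require("./secrets");   // hypothetical path to the question's helper

// Module-scoped promise: the Secret Manager API is called once per container
// instance when the module is loaded, not once per invocation.
const sentryInit = getSecret("SENTRY_DNS").then((dsn) => {
  Sentry.init({dsn});
});

exports.handler = async (req, res) => {
  await sentryInit;   // make sure the one-time initialization has completed
  // ... the actual work of the function ...
  res.send("ok");
};
The trade-off from the answer above still applies: with this pattern, rotated secret values only show up as old instances are replaced by new ones.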

Backpressuring Snowflake using "rowStreamHighWaterMark" in snowflake-sdk?

I'm using snowflake-sdk and snowflake-promise to stream results (to avoid loading too many objects in memory).
For each streamed row, I want to process the received information (an ETL-like job that performs write-backs). My code is quite basic and similar to this simplistic snowflake-promise example.
My current problem is that .on('data', ...) is called more often than I can manage to handle. (My ETL-like job can't keep up with the received rows and my DB connection pool to perform write-backs gets exhausted).
I tried setting rowStreamHighWaterMark to various values (1, 10 [default], 100, 1000, 2000 and 4000) in an effort to slow down/backpressure stream.Readable but, unfortunately, it didn't change anything.
What did I miss? How can I better control when to consume the read data?
If this were written synchronously, you would see that "being pushed more data than you can handle writing at the same time" cannot happen, because a loop like this:
while (data) {
  data.readrow();
  doSomethingAwesome();
  writeDataViaPoolThatBacksUp();
}
simply cannot spin too fast.
Now, if you are accepting data on one async thread, pushing that data onto a queue, and draining the queue on another async thread, you will get the problem you describe (that is, your queue explodes). So you need to slow/pause the reading side when the writing side falls too far behind.
Given the reader is feeding that queue, when the queue gets too long, stop reading until it drains.
The other way you might be doing this is with no work queue at all, firing an async write each time the conditions are met. This is bad because you have no tracking of outstanding work, and you are doing many small updates to the DB, which Snowflake really dislikes. A better approach is to build a local set of data changes, call it a batch, and when the batch reaches a certain size, flush the change set in one operation (and also flush the batch when the input is completed, to catch the dregs).
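As a sketch of that batching idea in plain Node.js stream terms (rowStream stands for whatever object-mode readable stream the driver hands you; BATCH_SIZE and flushBatch() are hypothetical stand-ins for your own write-back logic):
const BATCH_SIZE = 500;

function consumeWithBackpressure(rowStream, flushBatch) {
  let batch = [];

  rowStream.on("data", (row) => {
    batch.push(row);
    if (batch.length >= BATCH_SIZE) {
      rowStream.pause();                 // backpressure: stop receiving rows
      flushBatch(batch)                  // one write-back per batch, not per row
        .then(() => {
          batch = [];
          rowStream.resume();            // start receiving rows again
        })
        .catch((err) => rowStream.destroy(err));
    }
  });

  rowStream.on("end", () => {
    // Flush the dregs: whatever is left in the final, partial batch.
    if (batch.length > 0) flushBatch(batch);
  });
}
This keeps the number of in-flight write-backs bounded and turns many small updates into one larger statement per batch.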
Snowflake support got back to me with an answer.
They told me to create the connection this way:
var connection = snowflake.createConnection({
  account: "testaccount",
  username: "testusername",
  password: "testpassword",
  rowStreamHighWaterMark: 5
});
Full disclaimer: My project has changed and I could NOT recreate the problem on my local environment. I couldn't assess the answer's validity; still, I wanted to share in case somebody could get some hints from this information.

Azure durable entity or static variables?

Question: Is it thread-safe to use static variables (as shared storage between orchestrations), or is it better to save/retrieve data via a durable entity?
There are a couple of Azure Functions in the same namespace: a hub trigger, a durable entity, 2 orchestrations (the main process and one that monitors the whole process) and an activity.
They all need some shared variables. In my case I need to know the number of main orchestration instances (start a new one or hold on). That check is done in another orchestration (the monitor).
I've tried both options and ask because I see different results.
Static variables: in my case there is a generic List<SomeMyType>, where SomeMyType holds the Id of the task, its state, number of attempts, records it processed and other info.
When I need to start a new orchestration I List.Add() an entry; when I need to retrieve and modify one I use a simple List.First(id_of_the_task). With First() I know for sure the needed task is there.
With static variables I sometimes see that tasks become duplicated for some reason. I retrieve the task with List.First(id_of_the_task), change something on the result variable and that is it. Not a lot of code.
Durable entity: the major difference is that I keep the List on a durable entity, and each time I need it I call .CallEntityAsync("getTask") and .CallEntityAsync("saveTask"), which might slow down the app.
With this approach more code and more calls are required; however, it looks more stable and I don't see any duplicates.
Please advise.
I can't say why you would see duplicates with the static variables approach without seeing the code; maybe because List is not thread safe and it may need a ConcurrentBag, but I'm not sure. One issue with static variables is if the function app is not always on, or if it can have multiple instances: when the function unloads (or crashes) the state is lost, and static variables are not shared across instances either, so during high load it won't work (if there can be many instances).
Durable entities seem better here. They can be shared across many concurrent function instances, and each entity can only execute one operation at a time, so they are for sure a better option. The performance cost is a bit higher, but they should not be slower than orchestrators, since they perform a lot of the same operations: writing to Table Storage, checking for events, etc.
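The question's code looks like C#, but purely as an illustration of the entity pattern, here is a minimal sketch using the durable-functions Node.js package; the getTask/saveTask operation names come from the question, the task shape is hypothetical, and the entity-trigger function.json binding is omitted:
const df = require("durable-functions");

// One "tasks" entity instance serializes all getTask/saveTask operations,
// so there is no shared-memory race to worry about.
module.exports = df.entity(function (context) {
  const tasks = context.df.getState(() => []);   // initial state: empty list

  switch (context.df.operationName) {
    case "saveTask": {
      const task = context.df.getInput();        // e.g. { id, state, attempts, ... }
      const idx = tasks.findIndex((t) => t.id === task.id);
      if (idx >= 0) tasks[idx] = task; else tasks.push(task);
      context.df.setState(tasks);
      break;
    }
    case "getTask": {
      const id = context.df.getInput();
      context.df.return(tasks.find((t) => t.id === id));
      break;
    }
  }
});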
I can't say if it's right for you, but instead of List.First(id_of_the_task) you should just be able to access the orchestrator's properties through the client, which can hold custom data. Another idea, depending on the usage, is that you may be able to query the Table Storage directly with the CloudTable class for information about the running orchestrators.
Although not entirely related, you can also look at some settings for parallelism in durable functions: Azure (Durable) Functions - Managing parallelism.
Please ask any questions if I should clarify anything or if I misunderstood your question.

Should I cache results of functions involving mass file I/O in a node.js server app?

I'm writing my first 'serious' Node/Express application, and I'm becoming concerned about the number of O(n) and O(n^2) operations I'm performing on every request. The application is a blog engine, which indexes and serves up articles stored in markdown format in the file system. The contents of the articles folder do not change frequently, as the app is scaled for a personal blog, but I would still like to be able to add a file to that folder whenever I want, and have the app include it without further intervention.
Operations I'm concerned about
When /index is requested, my route is iterating over all files in the directory and storing them as objects
When a "tag page" is requested (/tag/foo) I'm iterating over all the articles, and then iterating over their arrays of tags to determine which articles to present in an index format
Now, I know that this is probably premature optimisation as the performance is still satisfactory over <200 files, but definitely not lightning fast. And I also know that in production, measures like this wouldn't be considered necessary/worthwhile unless backed by significant benchmarking results. But as this is purely a learning exercise/demonstration of ability, and as I'm (perhaps excessively) concerned about learning optimal habits and patterns, I worry I'm committing some kind of sin here.
Measures I have considered
I get the impression that a database might be a more typical solution, rather than filesystem I/O. But this would mean monitoring the directory for changes and processing/adding new articles to the database, a whole separate operation/functionality. If I did this, would it make sense to be watching that folder for changes even when a request isn't coming in? Or would it be better to check the freshness of the database, then retrieve results from the database? I also don't know how much this helps ultimately, as database calls are still async/slower than internal state, aren't they? Or would a database query, e.g. articles where tags contain x be O(1) rather than O(n)? If so, that would clearly be ideal.
Also, I am beginning to learn about techniques/patterns for caching results, e.g. a property on the function containing the previous result, which could be checked for and served up without performing the operation. But I'd need to check if the folder had new files added to know if it was OK to serve up the cached version, right? But more fundamentally (and this is the essential newbie query at hand) is it considered OK to do this? Everyone talks about how node apps should be stateless, and this would amount to maintaining state, right? Once again, I'm still a fairly raw beginner, and so reading the source of mature apps isn't always as enlightening to me as I wish it was.
Also have I fundamentally misunderstood how routes work in node/express? If I store a variable in index.js, are all the variables/objects created by it destroyed when the route is done and the page is served? If so I apologise profusely for my ignorance, as that would negate basically everything discussed, and make maintaining an external database (or just continuing to redo the file I/O) the only solution.
First off, the request and response objects that are part of each request last only for the duration of a given request and are not shared by other requests. They will be garbage collected as soon as they are no longer in use.
But, module-scoped variables in any of your Express modules last for the duration of the server. So, you can load some information in one request, store it in a module-level variable and that information will still be there when the next request comes along.
Since multiple requests can be "in-flight" at the same time if you are using any async operations in your request handlers, then if you are sharing/updating information between requests you have to make sure you have atomic updates so that the data is shared safely. In node.js, this is much simpler than in a multi-threaded response handler web server, but there still can be issues if you're doing part of an update to a shared object, then doing some async operation, then doing the rest of an update to a shared object. When you do an async operation, another request could run and see the shared object.
When not doing an async operation, your Javascript code is single threaded so other requests won't interleave until you go async.
It sounds like you want to cache your parsed state into a simple in-memory Javascript structure and then intelligently update this cache of information when new articles are added.
Since you already have the code to parse your set of files and tags into in-memory Javascript variables, you can just keep that code. You will want to package that into a separate function that you can call at any time and it will return a newly updated state.
Then, you want to call it when your server starts and that will establish the initial state.
All your routes can be changed to operate on the cached state and this should speed them up tremendously.
Then, all you need is a scheme to decide when to update the cached state (e.g. when something in the file system changed). There are lots of options and which to use depends a little bit on how often things will change and how often the changes need to get reflected to the outside world. Here are some options:
You could register a file system watcher for a particular directory of your file system and when it triggers, you figure out what has changed and update your cache (see the sketch after this list). You can make the update function as dumb (just start over and parse everything from scratch) or as smart (figure out what one item changed and update only that part of the cache) as it is worth doing. I'd suggest you start simple and only invest more in it when you're sure that effort is needed.
You could just manually rebuild the cache once every hour. Updates would take an average of 30 minutes to show, but this would take 10 seconds to implement.
You could create an admin function in your server to instruct the server to update its cache now. This might be combined with option 2, so that if you added new content, it would automatically show within an hour, but if you wanted it to show immediately, you could hit the admin page to tell it to update its cache.
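To make the module-level cache plus option 1 concrete, here is a minimal sketch; the "articles" directory, the article shape, and parseArticle() are hypothetical placeholders for whatever the app already does when it reads the markdown files:
// cache.js - module-scoped article cache with a dumb file-system watcher.
const fs = require("fs");
const path = require("path");

const ARTICLES_DIR = path.join(__dirname, "articles");   // hypothetical location
let cache = { articles: [], byTag: new Map() };

// Placeholder for the app's real markdown parsing.
function parseArticle(file, text) {
  return { file, tags: [], body: text };
}

async function rebuildCache() {
  const files = await fs.promises.readdir(ARTICLES_DIR);
  const articles = [];
  for (const file of files.filter((f) => f.endsWith(".md"))) {
    const text = await fs.promises.readFile(path.join(ARTICLES_DIR, file), "utf8");
    articles.push(parseArticle(file, text));
  }
  // Pre-index by tag so a /tag/foo request is a single Map lookup, not O(n^2).
  const byTag = new Map();
  for (const article of articles) {
    for (const tag of article.tags) {
      if (!byTag.has(tag)) byTag.set(tag, []);
      byTag.get(tag).push(article);
    }
  }
  cache = { articles, byTag };
}

// Dumb invalidation: any change in the folder triggers a full rebuild.
fs.watch(ARTICLES_DIR, () => { rebuildCache().catch(console.error); });

module.exports = {
  init: rebuildCache,                              // call once at server startup
  getArticles: () => cache.articles,               // used by the /index route
  getByTag: (tag) => cache.byTag.get(tag) || []    // used by the /tag/:tag route
};
Routes then read from this module instead of hitting the file system, which matches the "operate on the cached state" advice above.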

CouchDB attachment manipulation before document update

I have a requirement to transform images attached to every document (actually, the images need to be shrunk to 400px width). What is the best way to achieve that? I was thinking of having Node.js code listen on _changes and perform the necessary manipulations on document save. However, this has a bunch of drawbacks:
a) a document change does not always mean that a new attachment was added
b) we repeatedly have to process already-shrunk images (or at least check the image width)
I think you basically have some data in a database and most of your problem is simply application logic and implementation. I could imagine a very similar requirements list for an application using Drizzle. Anyway, how can your application "cut with the grain" and use CouchDB's strengths?
A Node.js _changes listener sounds like a very good starting point. Node.js has plenty of hype and silly debates. But for receiving a "to-do list" from CouchDB and executing that list concurrently, Node.js is ideal.
Memoizing
I immediately think that image metadata in the document will help you. Fetching an image and checking if it is 400px could get expensive. If you could indicate "shrunk":true or "width":400 or something like that in the document, you would immediately know to skip the document. (This is an optimization, you could possibly skip it during the early phase of your project.)
But how do you keep the metadata in sync with the images? Maybe somebody will attach a large image later, and the metadata still says "shrunk":true. One answer is the validation function. validate_doc_update() has the privilege of examining both the old and the new (candidate) document version. If it is not satisfied, it can throw() an exception to prevent the change. So it could enforce your policy in a few ways:
Any time new images are attached, the "shrunk" key must also be deleted
Or, your external Node.js tool has a dedicated username to access CouchDB. Documents must never set "shrunk":true unless the user is your tool
Another idea worth investigating is, instead of setting "shrunk":true, you set it to the MD5 checksum of the image. (That is already in the document, in the ._attachments object.) So if your Node.js tool sees this document, it knows that it has work to do.
{ "_id": "a_doc"
, "shrunk": "md5-D2yx50i1wwF37YAtZYhy4Q=="
, "_attachments":
{ "an_image.png":
{ "content_type":"image/png"
, "revpos": 1
, "digest": "md5-55LMUZwLfzmiKDySOGNiBg=="
}
}
}
In other words:
if(doc.shrunk == doc._attachments["an_image.png"].digest)
  console.log("This doc is fine")
else
  console.log("Uh oh, I need to check %s and maybe shrink the image", doc._id)
Execution
I am biased because I wrote the following tools. However I have had success, and others have reported success using the Node.js package Follow to watch the _changes events: https://github.com/iriscouch/follow
And then use Txn for ACID transactions in the CouchDB documents: https://github.com/iriscouch/txn
The pattern is,
Run follow() on the _changes URL, perhaps with "include_docs":true in the options.
For each change, decide if it needs work. If it does, execute a function to make the necessary changes, and let txn() take care of fetching and updating, and possible retries if there is a temporary error.
For example, Txn helps you atomically resize the image and also update the metadata, pretty easily.
Finally, if your program crashes, you might fetch a lot of documents that you already processed. That might be okay (if you have your metadata working); however you might want to record a checkpoint occasionally. Remember which changes you saw.
var db = "http://localhost:5984/my_db"
var checkpoint = get_the_checkpoint_somehow() // Synchronous, for simplicity
follow({"db":db, "since":checkpoint}, function(er, change) {
if(change.seq % 100 == 0)
store_the_checkpoint_somehow(change.seq) // Another synchronous call
})
Work queue
Again, I am embarrassed to point to all my own tools. But image processing is a classic example of a work queue situation. Every document that needs work is placed in the queue. An unlimited, elastic, army of workers receives a job, fixes the document, and marks the job done (deleted).
I use this a lot myself, and that is why I made CQS, the CouchDB Queue System: https://github.com/iriscouch/cqs
It is for Node.js, and it is identical to Amazon SQS, except it uses your own CouchDB server. If you are already using CouchDB, then CQS might simplify your project.
